Data

Module of data processing.

adtk.data.expand_events(events, left_expand=0, right_expand=0, freq_as_period=True)

Expand duration of events.

Parameters
  • events (list, dict, pandas Series, or pandas DataFrame) –

    Events to be expanded.

    • If list, a list of events where an event is a pandas Timestamp if it is instantaneous or a 2-tuple of pandas Timestamps if it is a closed time interval.

    • If dict, each key-value pair represents an independent list of events.

    • If pandas Series, it is a binary series where 1 marks a time point covered by an event.

    • If pandas DataFrame, each column is treated as an independent Series.

  • left_expand (pandas Timedelta, str, or int, optional) –

    Time range to expand backward.

    • If str, it must be able to be converted into a pandas Timedelta object.

    • If int, it must be in nanoseconds.

    Default: 0.

  • right_expand (pandas Timedelta, str, or int, optional) –

    Time range to expand forward.

    • If str, it must be able to be converted into a pandas Timedelta object.

    • If int, it must be in nanoseconds.

    Default: 0.

  • freq_as_period (bool, optional) –

    Whether to treat a time index with regular frequency (i.e. its attribute freq is not None) as a sequence of time intervals. Only used when the input events is a pandas Series or DataFrame.

    For example, DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05'], dtype='datetime64[ns]', freq='D') has daily frequency. If freq_as_period=True, each time point in the index represents that whole day (24 hours). Otherwise, each time point represents the instant 00:00:00 on that day.

    Default: True.

Returns

Expanded events.

Return type

list, dict, pandas Series, or pandas DataFrame
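
To make the list form concrete, here is a minimal sketch of the expansion logic in plain Python. It is not adtk's implementation: the helper name expand_event_list is hypothetical, standard-library datetime and timedelta objects stand in for pandas Timestamp and Timedelta, and any merging of windows that overlap after expansion is left out.

```python
from datetime import datetime, timedelta


def expand_event_list(events, left_expand=timedelta(0), right_expand=timedelta(0)):
    """Widen each event by left_expand backward and right_expand forward.

    Hypothetical helper for illustration only, not part of adtk.
    """
    expanded = []
    for event in events:
        # An instantaneous event is a single timestamp; an interval is a 2-tuple.
        start, end = event if isinstance(event, tuple) else (event, event)
        start, end = start - left_expand, end + right_expand
        # A zero-length window is reported as a single timestamp.
        expanded.append(start if start == end else (start, end))
    return expanded


events = [
    datetime(2017, 1, 2, 12, 0),                    # instantaneous event
    (datetime(2017, 1, 4), datetime(2017, 1, 5)),   # closed interval
]
result = expand_event_list(
    events, left_expand=timedelta(hours=1), right_expand=timedelta(hours=2)
)
for event in result:
    print(event)
```

With a nonzero expansion, the instantaneous event becomes a window, and the existing interval grows on both sides.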

adtk.data.split_train_test(ts, mode=1, n_splits=1, train_ratio=0.7)

Split time series into training and testing sets for cross validation.

Parameters
  • ts (pandas Series or DataFrame) – Time series to process.

  • mode (int, optional) –

    The split mode to use. Choose from 1, 2, 3 and 4.

    1. Divide time series into n_splits folds of equal length, and split each fold into training and testing based on train_ratio.

    2. Create n_splits folds, where each fold starts at t_0 and ends at t_(n * len(ts) / n_splits), where n goes from 1 to n_splits, and the first train_ratio of the fold is for training.

    3. Create n_splits folds, where each fold starts at t_0. Each fold has n * len(ts)/(1 + n_splits) training points, where n ranges from 1 to n_splits, followed by len(ts)/(1 + n_splits) test points.

    4. Create n_splits folds, where each fold starts at t_0. Each fold has n * len(ts)/(1 + n_splits) training points at the beginning of the time series, where n ranges from 1 to n_splits and the remaining points are testing points.

    Default: 1.

  • n_splits (int, optional) – Number of splits. Default: 1.

  • train_ratio (float, optional) – Ratio of the training series length to the fold length. Only used by modes 1 and 2. Default: 0.7.

Returns

Split training and testing series.

Return type

list of 2-tuples (train, test)

Examples

In the following description of the four modes, 1s represent positions assigned to training, 2s represent those assigned to testing, and 0s represent those not assigned.

For a time series with length 40, if n_splits=4, train_ratio=0.7,

  • If mode=1:

    1111111222000000000000000000000000000000
    0000000000111111122200000000000000000000
    0000000000000000000011111112220000000000
    0000000000000000000000000000001111111222

  • If mode=2:

    1111111222000000000000000000000000000000
    1111111111111122222200000000000000000000
    1111111111111111111112222222220000000000
    1111111111111111111111111111222222222222

  • If mode=3:

    1111111122222222000000000000000000000000
    1111111111111111222222220000000000000000
    1111111111111111111111112222222200000000
    1111111111111111111111111111111122222222

  • If mode=4:

    1111111122222222222222222222222222222222
    1111111111111111222222222222222222222222
    1111111111111111111111112222222222222222
    1111111111111111111111111111111122222222
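
The assignment patterns above can be reproduced with a short plain-Python sketch. The helper split_pattern is hypothetical and only mimics the index arithmetic implied by the examples; it is not adtk's implementation, and its rounding behavior for lengths that do not divide evenly is an assumption.

```python
def split_pattern(length, mode, n_splits, train_ratio=0.7):
    """Return one string per fold: '1' = training, '2' = testing, '0' = unused.

    Hypothetical helper for illustration only, not part of adtk.
    """
    rows = []
    for k in range(1, n_splits + 1):
        if mode == 1:
            # Disjoint folds of equal length, each split by train_ratio.
            fold = length // n_splits
            start = (k - 1) * fold
            n_train = round(fold * train_ratio)
            n_test = fold - n_train
        elif mode == 2:
            # Growing folds anchored at the start, each split by train_ratio.
            start = 0
            end = k * length // n_splits
            n_train = round(end * train_ratio)
            n_test = end - n_train
        elif mode == 3:
            # Growing training part followed by one fixed-size test block.
            block = length // (n_splits + 1)
            start, n_train, n_test = 0, k * block, block
        elif mode == 4:
            # Growing training part; everything after it is used for testing.
            block = length // (n_splits + 1)
            start, n_train = 0, k * block
            n_test = length - n_train
        else:
            raise ValueError("mode must be 1, 2, 3 or 4")
        unused_tail = length - (start + n_train + n_test)
        rows.append("0" * start + "1" * n_train + "2" * n_test + "0" * unused_tail)
    return rows


for mode in (1, 2, 3, 4):
    print(f"mode={mode}")
    for row in split_pattern(40, mode, n_splits=4, train_ratio=0.7):
        print(row)
```

Modes 1 and 2 size the training part from train_ratio, while modes 3 and 4 derive it from n_splits alone, which is why train_ratio is ignored there.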

adtk.data.to_events(labels, freq_as_period=True, merge_consecutive=None)

Convert binary label series to event list.

Parameters
  • labels (pandas Series or DataFrame) – Binary series of anomaly labels. If a DataFrame, each column is treated independently as one type of anomaly.

  • freq_as_period (bool, optional) –

    Whether to treat a time index with regular frequency (i.e. its attribute freq is not None) as a sequence of time intervals.

    For example, DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05'], dtype='datetime64[ns]', freq='D') has daily frequency. If freq_as_period=True, each time point in the index represents that whole day (24 hours). Otherwise, each time point represents the instant 00:00:00 on that day.

    Default: True.

  • merge_consecutive (bool, optional) – Whether to merge consecutive events into a single time window. If not specified, merging is enabled when the input time index has a regular frequency and freq_as_period=True, and disabled otherwise. Default: None.

Returns

  • If input is a Series, output is a list of events where an event is a pandas Timestamp if it is instantaneous or a 2-tuple of pandas Timestamps if it is a closed time interval.

  • If input is a DataFrame, every column is treated as an independent binary series, and output is a dict where keys are column names and values are event lists.

Return type

list or dict
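
The effect of merge_consecutive can be illustrated with a small plain-Python sketch. It assumes the simpler case where every time point is an instant (as with freq_as_period=False) and uses standard-library datetime objects in place of pandas Timestamps; the helper labels_to_events is hypothetical and is not adtk's implementation.

```python
from datetime import datetime


def labels_to_events(index, labels, merge_consecutive=False):
    """Turn positive labels into events, optionally merging consecutive ones.

    Hypothetical helper for illustration only, not part of adtk.
    """
    events, run = [], []  # run holds the current streak of positive timestamps
    for stamp, label in zip(index, labels):
        if label:
            run.append(stamp)
        if run and (not label or not merge_consecutive):
            # Close the streak: a single point stays a timestamp,
            # several points become a (first, last) window.
            events.append(run[0] if len(run) == 1 else (run[0], run[-1]))
            run = []
    if run:
        events.append(run[0] if len(run) == 1 else (run[0], run[-1]))
    return events


index = [datetime(2017, 1, day) for day in range(1, 6)]
labels = [0, 1, 1, 0, 1]

separate = labels_to_events(index, labels)
merged = labels_to_events(index, labels, merge_consecutive=True)
print(separate)
print(merged)
```

Without merging, each positive point is its own instantaneous event; with merging, the two adjacent positives collapse into one window from 2017-01-02 to 2017-01-03.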

adtk.data.to_labels(lists, time_index, freq_as_period=True)

Convert event list to binary series along a time index.

Parameters
  • lists (list or dict) –

    A list of events, or a dict of lists of events.

    • If list, a list of events where an event is a pandas Timestamp if it is instantaneous or a 2-tuple of pandas Timestamps if it is a closed time interval.

    • If dict, each key-value pair represents an independent list of events.

  • time_index (pandas DatetimeIndex) – Time index to build the label series.

  • freq_as_period (bool, optional) –

    Whether to treat a time index with regular frequency (i.e. its attribute freq is not None) as a sequence of time intervals.

    For example, DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05'], dtype='datetime64[ns]', freq='D') has daily frequency. If freq_as_period=True, each time point represents that whole day, and the day is marked positive if an event in the event list overlaps with the period of that day (24 hours). Otherwise, each time point represents the instant 00:00:00 on that day, and it is marked positive only if an event in the event list covers that instant.

    Default: True.

Returns

Series of binary labels.

  • If input is a single list, the output is a Series.

  • If input is a dict of lists, the output is a DataFrame where each column corresponds to a list in the dict.

Return type

pandas Series or DataFrame
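
The difference between the two freq_as_period settings can be sketched in plain Python. The helper events_to_labels and its period argument are hypothetical stand-ins, standard-library datetime objects replace pandas Timestamps, and treating a period as the span from t up to (but excluding) t + period is an assumption of this sketch rather than a statement about adtk's internals.

```python
from datetime import datetime, timedelta


def events_to_labels(events, index, period=None):
    """Label each index point 1 if any event touches it, else 0.

    With period set, index point t stands for the span [t, t + period);
    with period=None it stands for the single instant t.
    Hypothetical helper for illustration only, not part of adtk.
    """

    def touches(event, t):
        start, end = event if isinstance(event, tuple) else (event, event)
        if period is None:
            return start <= t <= end              # event covers the instant
        return start < t + period and end >= t    # event overlaps the period

    return [int(any(touches(event, t) for event in events)) for t in index]


index = [datetime(2017, 1, day) for day in range(1, 6)]
events = [
    datetime(2017, 1, 2, 12),                              # noon on Jan 2
    (datetime(2017, 1, 3, 18), datetime(2017, 1, 4, 6)),   # spans midnight
]

as_periods = events_to_labels(events, index, period=timedelta(days=1))
as_instants = events_to_labels(events, index)
print(as_periods)
print(as_instants)
```

The noon event marks January 2 only when days are treated as periods, because it does not cover the instant 00:00:00 on any day.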

adtk.data.validate_events(event_list, point_as_interval=False)

Validate event list.

This function checks and fixes some common issues in an event list (a list of time windows), including invalid time windows, overlapping time windows, and unsorted events.

Parameters
  • event_list (list) – A list of events, where an event is a pandas Timestamp if it is instantaneous or a 2-tuple of pandas Timestamps if it is a closed time interval.

  • point_as_interval (bool, optional) – Whether to return every instantaneous event as a closed interval with identical start and end points. Default: False.

Returns

A validated list of events.

Return type

list
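
The sorting and overlap-merging fixes can be sketched in plain Python as follows. The helper tidy_events is hypothetical and is not adtk's implementation; it uses standard-library datetime objects in place of pandas Timestamps and does not attempt to handle invalid windows (for example, a start later than its end).

```python
from datetime import datetime


def tidy_events(event_list, point_as_interval=False):
    """Sort events and merge overlapping windows.

    Hypothetical helper for illustration only, not part of adtk.
    """
    # Treat an instantaneous event as a zero-length window while merging.
    windows = sorted(e if isinstance(e, tuple) else (e, e) for e in event_list)
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    if point_as_interval:
        return merged
    # Report zero-length windows as single timestamps again.
    return [start if start == end else (start, end) for start, end in merged]


events = [
    (datetime(2017, 1, 5), datetime(2017, 1, 7)),
    datetime(2017, 1, 1),
    (datetime(2017, 1, 4), datetime(2017, 1, 6)),
]
print(tidy_events(events))
print(tidy_events(events, point_as_interval=True))
```

The two overlapping windows merge into one window from 2017-01-04 to 2017-01-07, and the instantaneous event is returned either as a timestamp or as a zero-length interval, depending on point_as_interval.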

adtk.data.validate_series(ts, check_freq=True, check_categorical=False)

Validate time series.

This function checks for some common critical issues in a time series that may cause problems if anomaly detection is performed without fixing them. The function automatically fixes some of them and raises errors for the others.

Issues that are checked and automatically fixed include:

  • Time index is not monotonically increasing;

  • Time index contains duplicated time stamps (fix by keeping first values);

  • (optional) Time index attribute freq is missing although the index follows a frequency;

  • (optional) Time series includes categorical (non-binary) label columns (fixed by converting categorical labels into binary indicators).

Issues that are checked and raise an error include:

  • Wrong type of time series object (must be pandas Series or DataFrame);

  • Wrong type of time index object (must be pandas DatetimeIndex).

Parameters
  • ts (pandas Series or DataFrame) – Time series to be validated.

  • check_freq (bool, optional) – Whether to check if the time index attribute freq is missing. Default: True.

  • check_categorical (bool, optional) – Whether to check if the time series includes categorical (non-binary) label columns. Default: False.

Returns

Validated time series.

Return type

pandas Series or DataFrame
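
The first two automatic fixes (ordering the time index and dropping duplicated time stamps while keeping the first value) can be sketched in plain Python. The helper sort_and_deduplicate is hypothetical, operates on a list of (timestamp, value) pairs rather than a pandas object, and is not adtk's implementation.

```python
from datetime import datetime


def sort_and_deduplicate(records):
    """Sort (timestamp, value) pairs by time and keep the first value per timestamp.

    Hypothetical helper for illustration only, not part of adtk.
    """
    seen = set()
    cleaned = []
    # sorted() is stable, so among equal timestamps the earliest input row comes first.
    for stamp, value in sorted(records, key=lambda pair: pair[0]):
        if stamp not in seen:
            seen.add(stamp)
            cleaned.append((stamp, value))
    return cleaned


records = [
    (datetime(2017, 1, 3), 3.0),
    (datetime(2017, 1, 1), 1.0),
    (datetime(2017, 1, 3), 9.0),   # duplicated time stamp, dropped
    (datetime(2017, 1, 2), 2.0),
]
cleaned = sort_and_deduplicate(records)
for stamp, value in cleaned:
    print(stamp, value)
```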