Data

Module of data processing.

adtk.data.validate_series(ts, check_freq=True, check_categorical=False)[source]

Validate time series.

This process checks some common critical issues that may cause problems if anomaly detection is performed on a time series without fixing them. The function automatically fixes some of them and raises errors when it detects others.

Issues that will be checked and automatically fixed include:

  • Time index is not monotonically increasing;
  • Time index contains duplicated time stamps (fixed by keeping the first value of each duplicated time stamp);
  • (optional) Time index attribute freq is missing;
  • (optional) Time series includes categorical (non-binary) label columns (fixed by converting categorical labels into binary indicators).

Issues that will be checked and raise an error include:

  • Wrong type of time series object (must be pandas Series or DataFrame);
  • Wrong type of time index object (must be pandas DatetimeIndex).
Parameters:
  • ts (pandas Series or DataFrame) – Time series to be validated.
  • check_freq (bool, optional) – Whether to check if the time index attribute freq is missing. Default: True.
  • check_categorical (bool, optional) – Whether to check if the time series includes categorical (non-binary) label columns. Default: False.
Returns:

Validated time series.

Return type:

pandas Series or DataFrame
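
A minimal usage sketch with hypothetical data, relying only on the fixes documented above (sorting, de-duplication, and restoring freq):

    import pandas as pd
    from adtk.data import validate_series

    # Hypothetical series with an unsorted index and a duplicated time stamp
    s = pd.Series(
        [3.0, 1.0, 2.0, 2.5],
        index=pd.DatetimeIndex(
            ["2017-01-03", "2017-01-01", "2017-01-02", "2017-01-02"]
        ),
    )

    # Sorting and de-duplication (keeping the first value) happen automatically;
    # with check_freq=True the freq attribute is restored if it can be inferred
    s = validate_series(s)
    print(s)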

adtk.data.to_events(labels, freq_as_period=True, merge_consecutive=None)[source]

Convert binary label series to event list(s).

Parameters:
  • labels (pandas Series or DataFrame) – Binary series of anomaly labels. If a DataFrame, each column is regarded as an independent type of anomaly.
  • freq_as_period (bool, optional) – Whether to regard a time index with regular frequency (i.e. the freq attribute of the time index is not None) as time spans. For example, DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05'], dtype='datetime64[ns]', freq='D') has daily frequency. If freq_as_period=True, each element represents that day. Otherwise, each element represents the instantaneous time stamp 00:00:00 on that day. Default: True.
  • merge_consecutive (bool, optional) – Whether to merge consecutive events into a single time window. When the option is on, if the input time index has regular frequency (i.e. the freq attribute of the time index is not None) and freq_as_period=True, a merged event ends at the end of the last period; otherwise, it ends at the last instantaneous time point. If the option is not specified, it is turned on automatically if the input time index has regular frequency and freq_as_period=True, and is off otherwise. Default: None.
Returns:

  • If input is a Series, output is a list of time instants or periods.
  • If input is a DataFrame, every column is treated as an independent binary series, and output is a dict where keys are column names and values are corresponding event lists.

A time instant is a pandas Timestamp object, while a time period is a 2-tuple of Timestamp objects that is regarded as a closed interval.

Return type:

list or dict
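
A minimal sketch with hypothetical hourly labels, assuming only the merging behavior described above:

    import pandas as pd
    from adtk.data import to_events

    # Hypothetical binary label series on a regular hourly index
    labels = pd.Series(
        [False, True, True, False, True, False],
        index=pd.date_range("2017-01-01", periods=6, freq="H"),
    )

    # With a regular index and the defaults (freq_as_period=True,
    # merge_consecutive=None), consecutive positive labels are merged into
    # one event that ends at the end of the last covered period
    print(to_events(labels))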

adtk.data.to_labels(lists, time_index, freq_as_period=True)[source]

Convert event list(s) to binary series along a time line.

Parameters:
  • lists (list or dict) –

    A list of events, or a dict of lists of events.

    • If list, it represents a single type of event;
    • If dict, each key-value pair represents a type of event.

    Each event in a list can be a pandas Timestamp, or a tuple of two Timestamps that is regarded as a closed interval.

  • time_index (pandas DatetimeIndex) – Time index to build the label series.
  • freq_as_period (bool, optional) – Whether to regard a time index with regular frequency (i.e. the freq attribute of the time index is not None) as time spans. For example, DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05'], dtype='datetime64[ns]', freq='D') has daily frequency. If freq_as_period=True, each element represents that day, and that day is marked positive if an event in the event list overlaps with any part of that day. Otherwise, each element represents the instantaneous time stamp 00:00:00 on that day, and that time point is marked positive if an event in the event list covers it. Default: True.
Returns:

Series of binary labels. If the input is a single list, the output is a Series; if the input is a dict, the output is a DataFrame.

Return type:

pandas Series or DataFrame
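
A minimal sketch with a hypothetical event list and a daily time index, assuming only the overlap rule described above:

    import pandas as pd
    from adtk.data import to_labels

    # Daily time index along which to build the binary label series
    time_index = pd.date_range("2017-01-01", periods=5, freq="D")

    # Hypothetical known events: one time instant and one closed time window
    known_events = [
        pd.Timestamp("2017-01-02 12:00:00"),
        (pd.Timestamp("2017-01-03 18:00:00"), pd.Timestamp("2017-01-04 06:00:00")),
    ]

    # With freq_as_period=True each index element represents a whole day,
    # and a day is marked positive if any event overlaps with it
    print(to_labels(known_events, time_index))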

adtk.data.expand_events(lists, left_expand=0, right_expand=0)[source]

Expand time windows in an event list.

Given a list of events, expand each event backward and forward by given time ranges. This may help to process a true event list before calculating the quality of a detection result with a scoring function, if a slight offset in the result is considered acceptable.

Parameters:
  • lists (list or dict) –

    A list of events, or a dict of lists of events.

    • If list, each event can be a pandas Timestamp, or a tuple of two Timestamps that is regarded as a closed interval.
    • If dict, each key-value pair represents an independent type of event.
  • left_expand (pandas Timedelta, str, or int, optional) – Time range to expand backward. If str, it must be convertible into a pandas Timedelta object. If int, it must be in nanoseconds. Default: 0.
  • right_expand (pandas Timedelta, str, or int, optional) – Time range to expand forward. If str, it must be convertible into a pandas Timedelta object. If int, it must be in nanoseconds. Default: 0.
Returns:

Expanded events.

Return type:

list or dict
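
A minimal sketch with hypothetical events, widening each one before scoring:

    import pandas as pd
    from adtk.data import expand_events

    # Hypothetical true anomaly events
    events = [
        pd.Timestamp("2017-01-02 12:00:00"),
        (pd.Timestamp("2017-01-03 18:00:00"), pd.Timestamp("2017-01-04 06:00:00")),
    ]

    # Expand every event 30 minutes backward and 1 hour forward, so a
    # slightly offset detection still overlaps the true event when scoring
    print(expand_events(events, left_expand="30min", right_expand="1h"))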

adtk.data.validate_events(event_list, point_as_interval=False)[source]

Validate event list.

This process checks some common issues in an event list (a list of time windows), including invalid time windows, overlapping or consecutive time windows, unsorted events, etc.

Parameters:
  • event_list (list of pandas Timestamp 2-tuples) – Start and end of original (unmerged) time windows. Every window is regarded as a closed interval.
  • point_as_interval (bool, optional) – Whether to return a singular time point as a closed interval. Default: False.
Returns:

Start and end of merged events.

Return type:

list of pandas Timestamp 2-tuples
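
A minimal sketch with a hypothetical unsorted event list containing overlapping windows:

    import pandas as pd
    from adtk.data import validate_events

    # Hypothetical event list: unsorted, with overlapping windows
    raw_events = [
        (pd.Timestamp("2017-01-03"), pd.Timestamp("2017-01-05")),
        (pd.Timestamp("2017-01-01"), pd.Timestamp("2017-01-02")),
        (pd.Timestamp("2017-01-04"), pd.Timestamp("2017-01-06")),
    ]

    # Returns sorted windows, with overlapping/consecutive ones merged
    print(validate_events(raw_events))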

adtk.data.resample(ts, dT=None)[source]

Resample the time points of a time series with a given constant spacing. The values at new time points are calculated by time-weighted linear interpolation.

Parameters:
  • ts (pandas Series or DataFrame) – Time series to resample. Index of the object must be DatetimeIndex.
  • dT (pandas Timedelta, str, or int, optional) – The new constant time step. If str, it must be convertible into a pandas Timedelta object. If int, it must be in nanoseconds. If not given, the greatest common divisor of the original time steps will be used, which makes the result a minimal refinement that still includes all original time points. Please note that this may dramatically increase the size of the time series and memory usage. Default: None.
Returns:

Resampled time series.

Return type:

pandas Series or DataFrame
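
A minimal sketch with a hypothetical irregularly spaced series:

    import pandas as pd
    from adtk.data import resample

    # Hypothetical series with irregular time steps (0, 2 and 6 minutes)
    s = pd.Series(
        [0.0, 1.0, 4.0],
        index=pd.DatetimeIndex(
            ["2017-01-01 00:00", "2017-01-01 00:02", "2017-01-01 00:06"]
        ),
    )

    # Resample onto a regular 1-minute grid; values at new time points are
    # obtained by time-weighted linear interpolation
    print(resample(s, dT="1min"))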

adtk.data.split_train_test(ts, mode=1, n_splits=1, train_ratio=0.7)[source]

Split a time series into training and testing sets for cross-validation.

Parameters:
  • ts (pandas Series or DataFrame) – Time series to process.
  • mode (int, optional) –

    The split mode to use. Choose from 1, 2, 3 and 4.

    1. Divide time series into n_splits folds of equal length, split each fold into training and testing based on train_ratio.
    2. Create n_splits folds, where each fold starts at t_0 and ends at t_(n/n_splits), where n goes from 1 to n_splits and the first train_ratio of the fold is for training.
    3. Create n_splits folds, where each fold starts at t_0. Each fold has len(ts)/(1 + n_splits) test points at the end and is (n + 1) * len(ts)/(1 + n_splits) long, where n ranges from 1 to n_splits.
    4. Create n_splits folds, where each fold starts at t_0. Each fold has n * len(ts)/(1 + n_splits) training points at the beginning of the time series, where n ranges from 1 to n_splits and the remaining points are testing points.

    Default: 1.

  • n_splits (int, optional) – Number of splits. Default: 1.
  • train_ratio (float, optional) – Ratio between the length of the training series and each fold; only used by modes 1 and 2. Default: 0.7.
Returns:

  • list of pandas Series or DataFrame: Training time series
  • list of pandas Series or DataFrame: Testing time series

Return type:

tuple (list, list)

Examples

In the following illustration of the four modes, each line shows one fold: 1 marks a position assigned to training, 2 marks a position assigned to testing, and 0 marks a position not assigned.

For a time series of length 40, with n_splits=4 and train_ratio=0.7:

  • If mode=1:

    1111111222000000000000000000000000000000
    0000000000111111122200000000000000000000
    0000000000000000000011111112220000000000
    0000000000000000000000000000001111111222

  • If mode=2:

    1111111222000000000000000000000000000000
    1111111111111122222200000000000000000000
    1111111111111111111112222222220000000000
    1111111111111111111111111111222222222222

  • If mode=3:

    1111111122222222000000000000000000000000
    1111111111111111222222220000000000000000
    1111111111111111111111112222222200000000
    1111111111111111111111111111111122222222

  • If mode=4:

    1111111122222222222222222222222222222222
    1111111111111111222222222222222222222222
    1111111111111111111111112222222222222222
    1111111111111111111111111111111122222222
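
A minimal usage sketch matching the mode=1 illustration above, with hypothetical data and following the return signature documented above:

    import pandas as pd
    from adtk.data import split_train_test

    # Hypothetical daily series of length 40
    s = pd.Series(
        range(40), index=pd.date_range("2017-01-01", periods=40, freq="D")
    )

    # Mode 1 with 4 splits: 4 equal folds of 10 points each, the first 70%
    # of every fold (7 points) for training and the rest (3) for testing
    train_list, test_list = split_train_test(s, mode=1, n_splits=4, train_ratio=0.7)
    print([len(t) for t in train_list], [len(t) for t in test_list])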