paddlets.models.data_adapter

class SampleDataset(rawdataset: TSDataset, in_chunk_len: int = 1, out_chunk_len: int = 0, skip_chunk_len: int = 0, sampling_stride: int = 1, fill_last_value: Optional[Union[floating, integer]] = None, time_window: Optional[Tuple[int, int]] = None)

Bases: Dataset

An implementation of paddle Dataset.

The default time_window assumes each sample contains X (i.e. in_chunk), skip_chunk, and Y (i.e. out_chunk).

If the caller explicitly passes time_window and its upper bound is larger than the max index of the standard time series (either target or observed_cov), each built sample contains only X (i.e. in_chunk) and does not contain skip_chunk or Y (i.e. out_chunk).

Parameters

rawdataset (TSDataset) – Raw TSDataset to be converted.
in_chunk_len (int) – The size of the lookback window, i.e., the number of time steps fed to the model.
out_chunk_len (int) – The size of the forecasting horizon, i.e., the number of time steps output by the model.
skip_chunk_len (int) – Optional, the number of time steps between in_chunk and out_chunk for a single sample. The skip chunk is neither used as a feature (i.e. X) nor a label (i.e. Y) for a single sample. By default, it will NOT skip any time steps.
sampling_stride (int) – The number of time steps between the start of the i-th sample and the start of the (i+1)-th sample. More precisely, let t be the time index of the target time series, t[i] the start time of the i-th sample, and t[i+1] the start time of the (i+1)-th sample; then sampling_stride equals t[i+1] - t[i].
fill_last_value (float, optional) – The value used to fill the last sample. Set to None if no filling is needed. Any value v for which np.issubdtype(type(v), np.floating) or np.issubdtype(type(v), np.integer) is True is valid.
time_window (Tuple, optional) – A two-element tuple that bounds the samples the adapter is allowed to build. time_window[0] is the window lower bound and time_window[1] is the window upper bound. Each element in this closed interval refers to the TAIL index of a sample.
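The way the chunk lengths and sampling_stride combine can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: build_samples is a hypothetical helper, the target is a plain list, samples are assumed to start at index 0, and fill_last_value and time_window are ignored.

```python
from typing import List, Tuple

# One sample is (in_chunk, skip_chunk, out_chunk), i.e. (X, skipped steps, Y).
Sample = Tuple[List[int], List[int], List[int]]


def build_samples(
    target: List[int],
    in_chunk_len: int = 1,
    out_chunk_len: int = 0,
    skip_chunk_len: int = 0,
    sampling_stride: int = 1,
) -> List[Sample]:
    """Hypothetical helper: slice `target` into (X, skip_chunk, Y) samples."""
    sample_len = in_chunk_len + skip_chunk_len + out_chunk_len
    samples: List[Sample] = []
    # `start` plays the role of t[i]; consecutive starts differ by sampling_stride.
    for start in range(0, len(target) - sample_len + 1, sampling_stride):
        skip_start = start + in_chunk_len
        out_start = skip_start + skip_chunk_len
        samples.append((
            target[start:skip_start],                     # X (in_chunk)
            target[skip_start:out_start],                 # skip_chunk (unused)
            target[out_start:out_start + out_chunk_len],  # Y (out_chunk)
        ))
    return samples


target = [0, 1, 2, 3, 4]
for stride in (1, 2, 3):
    pairs = [(x, y) for x, _, y in build_samples(target, 1, 1, 0, stride)]
    print(stride, pairs)
# 1 [([0], [1]), ([1], [2]), ([2], [3]), ([3], [4])]
# 2 [([0], [1]), ([2], [3])]
# 3 [([0], [1]), ([3], [4])]
```

Under these assumptions, a larger sampling_stride yields fewer samples, and skip_chunk_len only shifts where Y begins relative to X.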

Examples

# 1) in_chunk_len examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
skip_chunk_len = 0
out_chunk_len = 1

# 1.1) If in_chunk_len = 1, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1)

# 1.2) If in_chunk_len = 2, sample[0]:
# X -> skip_chunk -> Y
# (0, 1) -> () -> (2)

# 1.3) If in_chunk_len = 3, sample[0]:
# X -> skip_chunk -> Y
# (0, 1, 2) -> () -> (3)

# 2) out_chunk_len examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
in_chunk_len = 1
skip_chunk_len = 0

# 2.1) If out_chunk_len = 1, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1)

# 2.2) If out_chunk_len = 2, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1, 2)

# 2.3) If out_chunk_len = 3, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1, 2, 3)

# 3) skip_chunk_len examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
in_chunk_len = 1
out_chunk_len = 1

# 3.1) If skip_chunk_len = 0, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1)

# 3.2) If skip_chunk_len = 1, sample[0]:
# X -> skip_chunk -> Y
# (0) -> (1) -> (2)

# 3.3) If skip_chunk_len = 2, sample[0]:
# X -> skip_chunk -> Y
# (0) -> (1, 2) -> (3)

# 3.4) If skip_chunk_len = 3, sample[0]:
# X -> skip_chunk -> Y
# (0) -> (1, 2, 3) -> (4)

# 4) sampling_stride examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
in_chunk_len = 1
skip_chunk_len = 0
out_chunk_len = 1

# 4.1) If sampling_stride = 1, samples:
# X -> skip_chunk -> Y
# (0) -> () -> (1)
# (1) -> () -> (2)
# (2) -> () -> (3)
# (3) -> () -> (4)

# 4.2) If sampling_stride = 2, samples:
# X -> skip_chunk -> Y
# (0) -> () -> (1)
# (2) -> () -> (3)

# 4.3) If sampling_stride = 3, samples:
# X -> skip_chunk -> Y
# (0) -> () -> (1)
# (3) -> () -> (4)

# 5) time_window examples:
# 5.1) The default time_window calculation formula is as follows:
# time_window[0] = 0 + in_chunk_len + skip_chunk_len + (out_chunk_len - 1)
# time_window[1] = max_target_idx
#
# Given:
tsdataset.target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_chunk_len = 4
skip_chunk_len = 3
out_chunk_len = 2
sampling_stride = 1

# The following equation holds:
max_target_idx = len(tsdataset.target) - 1 = 10

# The default time_window is calculated as follows:
time_window[0] = 0 + 4 + 3 + (2 - 1) = 7 + 1 = 8
time_window[1] = max_target_idx = 10
time_window = (8, 10)

# 3 samples will be built in total:
X -> Y
(0, 1, 2, 3) -> (7, 8)
(1, 2, 3, 4) -> (8, 9)
(2, 3, 4, 5) -> (9, 10)


# 5.2) Each element in time_window refers to the TAIL index of each sample, but NOT the HEAD index.
# The following two scenarios show how to pass in the expected time_window parameter to build samples.
# Given:
tsdataset.target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_chunk_len = 4
skip_chunk_len = 3
out_chunk_len = 2

# Scenario 5.2.1 - Suppose the following training samples are expected to be built:
# X -> Y
# (0, 1, 2, 3) -> (7, 8)
# (1, 2, 3, 4) -> (8, 9)
# (2, 3, 4, 5) -> (9, 10)

# The 1st sample's tail index is 8
# The 2nd sample's tail index is 9
# The 3rd sample's tail index is 10

# Thus, the time_window parameter should be as follows:
time_window = (8, 10)

# Other time_window values, such as the following, are NOT correct:
time_window = (0, 2)
time_window = (0, 10)

# Scenario 5.2.2 - Suppose the following predict sample is expected to be built:
# X -> Y
# (7, 8, 9, 10) -> (14, 15)

# The only sample's tail index is 15.

# Thus, the time_window parameter should be as follows:
time_window = (15, 15)

# 5.3) The calculation formula of the max allowed time_window upper bound is as follows:
# time_window[1] <= len(tsdataset.target) - 1 + skip_chunk_len + out_chunk_len
# The reason is that the built paddle.io.Dataset is used for a single call of :func:`model.predict`,
# which allows only a single predict sample. Any time_window upper bound larger than that sample's
# TAIL index is not allowed, because the target time series is not long enough to build the past
# target chunk.
#
# Given:
tsdataset.target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_chunk_len = 4
skip_chunk_len = 3
out_chunk_len = 2

# For a single :func:`model.predict` call:
X = in_chunk = (7, 8, 9, 10)

# max allowed time_window[1] is calculated as follows:
time_window[1] <= len(tsdataset.target) - 1 + skip_chunk_len + out_chunk_len = 11 - 1 + 3 + 2 = 15

# Note that time_window[1] (i.e. 15) is larger than max_target_idx (i.e. 10), but this time_window
# upper bound is still valid, because a predict sample does not need skip_chunk (i.e. [11, 12, 13]) or
# out_chunk (i.e. [14, 15]).

# Any value larger than 15 (e.g. 16) is invalid, because the existing target time series is NOT long
# enough to build X for the prediction sample. See the following example:
# Given:
time_window = (16, 16)

# The calculated out_chunk = (15, 16)
# The calculated skip_chunk = (12, 13, 14)

# Thus, the in_chunk should be [8, 9, 10, 11]
# However, the tail index of the calculated in_chunk (11) is beyond the max target index
# (i.e. len(tsdataset.target) - 1 = 10), so the current target time series cannot provide index 11 to build this sample.
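The arithmetic in examples 5.1 to 5.3 can be checked with a few lines of Python. This is a sketch of the formulas stated above rather than the adapter's own code; the three function names are hypothetical and the target is represented only by its length.

```python
from typing import List, Tuple


def default_time_window(
    target_len: int, in_chunk_len: int, skip_chunk_len: int, out_chunk_len: int
) -> Tuple[int, int]:
    """Default (lower, upper) TAIL-index bounds, following formula 5.1."""
    lower = 0 + in_chunk_len + skip_chunk_len + (out_chunk_len - 1)
    upper = target_len - 1  # max_target_idx
    return lower, upper


def max_time_window_upper(
    target_len: int, skip_chunk_len: int, out_chunk_len: int
) -> int:
    """Largest allowed time_window[1], following formula 5.3."""
    return target_len - 1 + skip_chunk_len + out_chunk_len


def in_chunk_indices(
    tail: int, in_chunk_len: int, skip_chunk_len: int, out_chunk_len: int
) -> List[int]:
    """Indices of X for the sample whose TAIL index is `tail`."""
    in_end = tail - out_chunk_len - skip_chunk_len  # last index of X
    return list(range(in_end - in_chunk_len + 1, in_end + 1))


target_len = 11  # tsdataset.target = [0, 1, ..., 10]
print(default_time_window(target_len, 4, 3, 2))  # (8, 10)
print(max_time_window_upper(target_len, 3, 2))   # 15
print(in_chunk_indices(8, 4, 3, 2))              # [0, 1, 2, 3]
print(in_chunk_indices(15, 4, 3, 2))             # [7, 8, 9, 10]
print(in_chunk_indices(16, 4, 3, 2))             # [8, 9, 10, 11], but index 11 does not exist
```

Working backwards from the tail index makes the upper bound visible: a tail of 15 still places X entirely inside the existing target, while a tail of 16 does not.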

class MLDataLoader(dataset: SampleDataset, batch_size: int, collate_fn: Optional[Callable[[List[Dict[str, ndarray]]], Dict[str, ndarray]]] = None)

Bases: object

Machine learning Data loader, provides an iterable over the given SampleDataset.

The MLDataLoader supports iterable-style datasets with single-process loading and optional user defined batch collation.

Parameters

dataset (SampleDataset) – The SampleDataset to iterate over.
batch_size (int) – The number of samples for each batch.
collate_fn (Callable, optional) – A user defined collate function for each batch, optional.
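Single-process batching with an optional collate function can be sketched as follows. This is not the MLDataLoader source: SimpleBatchLoader and stack_collate are hypothetical names, plain Python lists stand in for np.ndarray to keep the sketch dependency-free, and keeping the final partial batch is an assumption.

```python
from typing import Callable, Dict, Iterator, List, Optional

Sample = Dict[str, List[float]]
Batch = Dict[str, List[List[float]]]


def stack_collate(samples: List[Sample]) -> Batch:
    """Assumed default: gather each key's per-sample values into one batch entry."""
    return {key: [sample[key] for sample in samples] for key in samples[0]}


class SimpleBatchLoader:
    """Hypothetical stand-in for a single-process, iterable batch loader."""

    def __init__(
        self,
        dataset: List[Sample],
        batch_size: int,
        collate_fn: Optional[Callable[[List[Sample]], Batch]] = None,
    ) -> None:
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn or stack_collate

    def __iter__(self) -> Iterator[Batch]:
        # Walk the dataset in order, batch_size samples at a time.
        for start in range(0, len(self.dataset), self.batch_size):
            yield self.collate_fn(self.dataset[start:start + self.batch_size])


dataset = [
    {"past_target": [float(i)], "future_target": [float(i + 1)]}
    for i in range(5)
]
batches = list(SimpleBatchLoader(dataset, batch_size=2))
print(len(batches))  # 3
print(batches[0])    # {'past_target': [[0.0], [1.0]], 'future_target': [[1.0], [2.0]]}
```

Passing a different collate_fn changes only how the list of per-sample dicts is merged; the iteration order and batch boundaries stay the same.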

class DataAdapter

Bases: object

Data adapter for deep learning (DL) and machine learning (ML) models. It converts a TSDataset to a SampleDataset, and a SampleDataset to a paddle DataLoader or an MLDataLoader.

to_sample_dataset(rawdataset: TSDataset, in_chunk_len: int = 1, out_chunk_len: int = 0, skip_chunk_len: int = 0, sampling_stride: int = 1, fill_last_value: Optional[Union[floating, integer]] = None, time_window: Optional[Tuple[int, int]] = None) → SampleDataset

Convert TSDataset to SampleDataset.

Parameters

rawdataset (TSDataset) – Raw TSDataset to be converted.
in_chunk_len (int) – The size of the lookback window, i.e., the number of time steps fed to the model.
out_chunk_len (int) – The size of the forecasting horizon, i.e., the number of time steps output by the model.
skip_chunk_len (int) – Optional, the number of time steps between in_chunk and out_chunk for a single sample. The skip chunk is neither used as a feature (i.e. X) nor a label (i.e. Y) for a single sample. By default, it will NOT skip any time steps.
sampling_stride (int) – The number of time steps between the start of the i-th sample and the start of the (i+1)-th sample. More precisely, let t be the time index of the target time series, t[i] the start time of the i-th sample, and t[i+1] the start time of the (i+1)-th sample; then sampling_stride equals t[i+1] - t[i].
fill_last_value (float, optional) – The value used to fill the last sample. Set to None if no filling is needed. Any value v for which np.issubdtype(type(v), np.floating) or np.issubdtype(type(v), np.integer) is True is valid.
time_window (Tuple, optional) – A two-element tuple that bounds the samples the adapter is allowed to build. time_window[0] is the window lower bound and time_window[1] is the window upper bound. Each element in this closed interval refers to the TAIL index of a sample.

Examples

samples = [
    {
        "past_target": np.ndarray(
            shape=(in_chunk_len, target_col_num)
        ),
        "future_target": np.ndarray(
            shape=(out_chunk_len, target_col_num)
        ),
        "known_cov_numeric": np.ndarray(
            shape=(in_chunk_len + out_chunk_len, known_numeric_col_num)
        ),
        "known_cov_categorical": np.ndarray(
            shape=(in_chunk_len + out_chunk_len, known_categorical_col_num)
        ),
        "observed_cov_numeric": np.ndarray(
            shape=(in_chunk_len, observed_numeric_col_num)
        ),
        "observed_cov_categorical": np.ndarray(
            shape=(in_chunk_len, observed_categorical_col_num)
        ),
        "static_cov_numeric": np.ndarray(
            shape=(1, static_numeric_col_num)
        ),
        "static_cov_categorical": np.ndarray(
            shape=(1, static_categorical_col_num)
        )
    },
    # ...
]

to_paddle_dataloader(sample_dataset: SampleDataset, batch_size: int, collate_fn: Optional[Callable] = None, shuffle: bool = True, drop_last: bool = False) → DataLoader

Convert SampleDataset to paddle DataLoader.

Parameters

sample_dataset (SampleDataset) – SampleDataset to be converted.
batch_size (int) – The number of samples for a single batch.
collate_fn (Callable, optional) – User-defined collate function for each batch, optional.
shuffle (bool, optional) – Whether to shuffle the sample indices before generating batch indices, default True.
drop_last (bool, optional) – Whether to drop the last batch when the remaining samples do not fill a complete batch, default False.

Returns

A built paddle DataLoader.

Return type

PaddleDataLoader

Examples

dataloader = [
    # 1st batch
    {
        "past_target": paddle.Tensor(
            shape=(batch_size, in_chunk_len, target_col_num)
        ),
        "future_target": paddle.Tensor(
            shape=(batch_size, out_chunk_len, target_col_num)
        ),
        "known_cov_numeric": paddle.Tensor(
            shape=(batch_size, known_cov_chunk_len, known_cov_numeric_col_num)
        ),
        "known_cov_categorical": paddle.Tensor(
            shape=(batch_size, known_cov_chunk_len, known_cov_categorical_col_num)
        ),
        "observed_cov_numeric": paddle.Tensor(
            shape=(batch_size, observed_cov_chunk_len, observed_cov_numeric_col_num)
        ),
        "observed_cov_categorical": paddle.Tensor(
            shape=(batch_size, observed_cov_chunk_len, observed_cov_categorical_col_num)
        ),
        "static_cov_numeric": paddle.Tensor(
            shape=(batch_size, 1, static_cov_numeric_col_num)
        ),
        "static_cov_categorical": paddle.Tensor(
            shape=(batch_size, 1, static_cov_categorical_col_num)
        )
    },

    # ...
]
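The effect of shuffle and drop_last on batch composition can be illustrated without Paddle. The sketch below only mimics the documented behaviour of those two flags; batch_indices and its seed argument are hypothetical and are not part of to_paddle_dataloader.

```python
import random
from typing import List, Optional


def batch_indices(
    num_samples: int,
    batch_size: int,
    shuffle: bool = True,
    drop_last: bool = False,
    seed: Optional[int] = None,  # hypothetical, only for reproducible shuffling
) -> List[List[int]]:
    """Group sample indices into batches, mimicking shuffle and drop_last."""
    indices = list(range(num_samples))
    if shuffle:
        # Shuffle the index order before cutting it into batches.
        random.Random(seed).shuffle(indices)
    batches = [
        indices[i:i + batch_size] for i in range(0, num_samples, batch_size)
    ]
    # Discard the trailing batch if it holds fewer than batch_size samples.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(batch_indices(5, 2, shuffle=False, drop_last=False))  # [[0, 1], [2, 3], [4]]
print(batch_indices(5, 2, shuffle=False, drop_last=True))   # [[0, 1], [2, 3]]
```

With drop_last=True every batch has exactly batch_size samples, which keeps the leading tensor dimension constant at the cost of skipping up to batch_size - 1 samples per pass.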

to_ml_dataloader(sample_dataset: SampleDataset, batch_size: int, collate_fn: Optional[Callable] = None) → MLDataLoader

Convert SampleDataset to MLDataLoader.

Parameters

sample_dataset (SampleDataset) – SampleDataset to be converted.
batch_size (int) – The number of samples for a single batch.
collate_fn (Callable, optional) – User-defined collate function for each batch, optional.

Returns

A built MLDataLoader.

Return type

MLDataLoader

Examples

dataloader = [
    # 1st batch
    {
        "past_target": np.ndarray(
            shape=(batch_size, in_chunk_len, target_col_num)
        ),
        "future_target": np.ndarray(
            shape=(batch_size, out_chunk_len, target_col_num)
        ),
        "known_cov_numeric": np.ndarray(
            shape=(batch_size, known_cov_chunk_len, known_cov_numeric_col_num)
        ),
        "known_cov_categorical": np.ndarray(
            shape=(batch_size, known_cov_chunk_len, known_cov_categorical_col_num)
        ),
        "observed_cov_numeric": np.ndarray(
            shape=(batch_size, observed_cov_chunk_len, observed_cov_numeric_col_num)
        ),
        "observed_cov_categorical": np.ndarray(
            shape=(batch_size, observed_cov_chunk_len, observed_cov_categorical_col_num)
        ),
        "static_cov_numeric": np.ndarray(
            shape=(batch_size, 1, static_cov_numeric_col_num)
        ),
        "static_cov_categorical": np.ndarray(
            shape=(batch_size, 1, static_cov_categorical_col_num)
        )
    },

    # ...
]