paddlets.datasets.tsdataset

TSDataset is the fundamental data class in PaddleTS and the primary structure for representing time series data. It is used throughout the library: in many cases, a function consumes a TSDataset and produces another TSDataset. A TSDataset object comprises two kinds of time series data:

  1. Target: the key time series in a modeling task (e.g. the series to be forecast in a forecasting task).

  2. Covariate: related time series that usually help with the modeling task.

Currently, it supports the representation of:

  1. Time series of a single target, with or without covariates.

  2. Time series of multiple targets, with or without covariates.

Covariates fall into one of the following three types:

  1. Observed covariates (observed_cov):

    variables that can only be observed in historical data, e.g. measured temperatures

  2. Known covariates (known_cov):

    variables whose values for future time steps can already be determined at present, e.g. weather forecasts

  3. Static covariates (static_cov):

    variables that remain constant over time

A TSDataset object holds up to three TimeSeries objects, representing the target, known covariates (known_cov), and observed covariates (observed_cov), plus an optional dict of static covariates (static_cov).

class TimeSeries(data: Union[DataFrame, Series], freq: Union[int, str])[source]

Bases: object

TimeSeries is the atomic data structure for representing target(s), observed covariates (observed_cov), and known covariates (known_cov). Each may consist of one or more time series.

Parameters
  • data (DataFrame|Series) – A Pandas DataFrame or Series containing the time series data

  • freq (str|int) – A string or int representing the Pandas DatetimeIndex’s frequency or RangeIndex’s step size

Returns

None

classmethod load_from_dataframe(data: Union[DataFrame, Series], time_col: Optional[str] = None, value_cols: Optional[Union[str, List[str]]] = None, freq: Optional[Union[str, int]] = None, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None) TimeSeries[source]

Construct a TimeSeries object from the specified columns of a DataFrame

Parameters
  • data (DataFrame|Series) – A Pandas DataFrame or Series containing the time series data

  • time_col (str|None) – The name of the time column, whose values form a Pandas DatetimeIndex or RangeIndex. If not set, the DataFrame’s index will be used.

  • value_cols (list|str|None) – The column name or list of column names from which to extract the time series data. If set to None, all columns except the time column are used as value columns.

  • freq (str|int|None) – A string or int representing the Pandas DatetimeIndex’s frequency or RangeIndex’s step size

  • drop_tail_nan (bool) – Whether to drop trailing NaN values. If True, every NaN after the last non-NaN element of the time series is dropped, e.g. [nan, 3, 2, nan, nan] -> [nan, 3, 2], [3, 2, nan, nan] -> [3, 2], [nan, nan, nan] -> []

  • dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast the entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Returns

TimeSeries object
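
The trimming rule behind drop_tail_nan can be illustrated with a minimal pure-Python sketch. The helper name drop_trailing_nan is hypothetical and works on a plain list; it is not the PaddleTS implementation, only the rule stated in the parameter description (leading and interior NaN values are kept, trailing ones are removed).

```python
import math
from typing import List


def drop_trailing_nan(values: List[float]) -> List[float]:
    """Drop every NaN after the last non-NaN element (hypothetical helper)."""
    end = len(values)
    # Walk backwards until a non-NaN value is found.
    while end > 0 and math.isnan(values[end - 1]):
        end -= 1
    return values[:end]


nan = float("nan")
trimmed = drop_trailing_nan([nan, 3.0, 2.0, nan, nan])
print(len(trimmed))                             # 3: the leading NaN is kept
print(drop_trailing_nan([3.0, 2.0, nan, nan]))  # [3.0, 2.0]
print(drop_trailing_nan([nan, nan, nan]))       # []
```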

property time_index

the time index

property columns

the data columns

property start_time: Union[Timestamp, int]

the first value of the time index

property end_time: Union[Timestamp, int]

the last value of the time index

property data

DataFrame storing the data

property freq

Frequency of TimeSeries

property dtypes: Series

dtypes of TimeSeries

astype(dtype: Union[dtype, type, Dict[str, Union[dtype, type]]])[source]

Cast a TimeSeries object to the specified dtype

Parameters

dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast the entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Raises
  • TypeError

  • KeyError

to_dataframe(copy: bool = True) DataFrame[source]

Return a pd.DataFrame representation of the TimeSeries object

Parameters

copy (bool) – Whether to return a copy (True) or a reference (False)

Returns

pd.DataFrame

to_numpy(copy: bool = True) ndarray[source]

Return a numpy.ndarray representation of the TimeSeries object

Parameters

copy (bool) – Return a copy or reference. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html

Returns

np.ndarray

get_index_at_point(point: Union[Timestamp, str, float, int], after=True) int[source]

Convert a point along the time axis into an integer index.

Parameters
  • point (pd.Timestamp|str|float|int) –

    Time point; three kinds of value are supported:

    pd.Timestamp|str: Only takes effect when the time_index type is pd.DatetimeIndex; the corresponding index is returned. A str is first converted to a pandas datetime value.

    float: The parameter is treated as the proportion of the time series that should lie before the point.

    int: The parameter is returned as is, provided that it is in the series. Otherwise a ValueError is raised.

  • after (bool) – If the provided pandas Timestamp is not in the time series index, whether to return the index of the next timestamp (True) or of the previous one (False).

Returns

index

Return type

int

Raises
  • ValueError

  • TypeError
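
The role of the after flag for a timestamp that is missing from the index can be sketched in plain Python. This is an illustration under stated assumptions, not the PaddleTS code: index_at_point is a hypothetical helper, datetime.date values stand in for pandas timestamps, the index is assumed to be sorted, and raising ValueError for a point outside the index range is an assumption of the sketch.

```python
from bisect import bisect_left
from datetime import date
from typing import List


def index_at_point(time_index: List[date], point: date, after: bool = True) -> int:
    """Return the integer position of point in a sorted time index.

    Hypothetical helper. Raising ValueError for a point outside the index
    range is an assumption of this sketch.
    """
    if not time_index[0] <= point <= time_index[-1]:
        raise ValueError(f"{point} lies outside the time index")
    pos = bisect_left(time_index, point)
    if time_index[pos] == point:
        return pos  # exact match
    # point falls strictly between time_index[pos - 1] and time_index[pos]
    return pos if after else pos - 1


days = [date(2023, 1, 1), date(2023, 1, 3), date(2023, 1, 5)]
exact = index_at_point(days, date(2023, 1, 3))
next_pos = index_at_point(days, date(2023, 1, 4), after=True)
prev_pos = index_at_point(days, date(2023, 1, 4), after=False)
print(exact, next_pos, prev_pos)  # 1 2 1
```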

split(split_point: Union[Timestamp, str, float, int], after=True) Tuple[TimeSeries, TimeSeries][source]

Split the TimeSeries object into two TimeSeries objects according to split_point

Parameters
  • split_point (pd.Timestamp|str|float|int) –

    Where to split the TimeSeries, which could be

    pd.Timestamp|str: Only valid when the type of time_index is pd.DatetimeIndex. A str is first converted to a pandas datetime value.

    float: The proportion of the total length assigned to the first TimeSeries object

    int: Only valid when the type of time_index is pd.RangeIndex

    If the data at split_point exists, it is included in the first TimeSeries object.

  • after (bool) – If split_point (pd.Timestamp) doesn’t exist in the time index, use the next valid index (True) or the previous one (False)

Returns

Tuple[“TimeSeries”, “TimeSeries”]

Raises
  • ValueError

  • TypeError
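
The rule that an existing split point belongs to the first part can be shown with a small pure-Python sketch. The helper split_at is hypothetical, uses datetime.date values in place of pandas timestamps, and only covers a split_point that is present in the index; it is not the library's implementation.

```python
from datetime import date
from typing import List, Tuple


def split_at(time_index: List[date], values: List[float],
             split_point: date) -> Tuple[List[float], List[float]]:
    """Split values in two; the row at split_point closes the first part.

    Hypothetical helper that only covers a split_point present in the index.
    """
    pos = time_index.index(split_point)  # raises ValueError if absent
    return values[:pos + 1], values[pos + 1:]


days = [date(2023, 1, d) for d in range(1, 6)]
head, tail = split_at(days, [10.0, 11.0, 12.0, 13.0, 14.0], date(2023, 1, 3))
print(head)  # [10.0, 11.0, 12.0]
print(tail)  # [13.0, 14.0]
```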

copy() TimeSeries[source]

Make a copy of the TimeSeries object

Returns

TimeSeries

classmethod concat(tss: List[TimeSeries], axis: int = 0, drop_duplicates: bool = True, keep: str = 'first') TimeSeries[source]

Concatenate a list of TimeSeries objects along the specified axis

Parameters
  • tss (list[TimeSeries]) – A list of TimeSeries objects. All TimeSeries’ freqs are required to be consistent. When axis=1, time_col is required to be non-repetitive; when axis=0, all columns are required to be non-repetitive

  • axis (int) – The axis along which to concatenate the TimeSeries objects

  • drop_duplicates (bool) – Whether to drop duplicate indices.

  • keep (str) – Which entry to keep when dropping duplicates: ‘first’ or ‘last’.

Returns

TimeSeries

Raises

ValueError

reindex(index, fill_value=nan, *args, **kwargs) TimeSeries[source]

Reindex the TimeSeries object with optional filling logic

Parameters
  • index – array-like, the new index to conform to. Preferably an Index object to avoid duplicating data.

  • fill_value – Value to use for missing values. NaN by default, but can be any “compatible” value.

  • args – Optional arguments passed to DataFrame.reindex

  • kwargs – Optional arguments passed to DataFrame.reindex

Returns

TimeSeries

Raises

ValueError

sort_columns(ascending: bool = True)[source]

Sort the columns of the TimeSeries object by column label

Parameters

ascending (bool) – Sort ascending or descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

drop_tail_nan()[source]

Drop trailing consecutive NaN values

to_json() str[source]

Return a JSON string representation of the TimeSeries object.

Returns

str

classmethod load_from_json(json_data: str, **json_load_kwargs) TimeSeries[source]

Construct a TimeSeries object from a JSON string

Parameters
  • json_data (str) – JSON string from which to load data

  • **json_load_kwargs – Optional arguments passed to json.loads function

Returns

TimeSeries

to_categorical(col: Optional[Union[str, List[str]]] = None)[source]

Cast the specified column(s) to int so that they are treated as categorical.

Parameters

col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TimeSeries

to_numeric(col: Optional[Union[str, List[str]]] = None)[source]

Cast the specified column(s) to float so that they are treated as numeric.

Parameters

col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TimeSeries

class TSDataset(target: Optional[TimeSeries] = None, observed_cov: Optional[TimeSeries] = None, known_cov: Optional[TimeSeries] = None, static_cov: Optional[dict] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10)[source]

Bases: object

TSDataset is the fundamental data class in PaddleTS. The kinds of data it holds (target, observed_cov, known_cov, and static_cov) are described in the module overview at the top of this page.

Parameters
  • target (TimeSeries|None) – Target

  • observed_cov (TimeSeries|None) – Observed covariates

  • known_cov (TimeSeries|None) – Known covariates

  • static_cov (dict|None) – Static covariates

  • fill_missing_dates (bool) – Fill missing dates or not

  • fillna_method (str) – Method of filling missing values. Seven methods are currently supported:

    max: Use the max value in the sliding window

    min: Use the min value in the sliding window

    avg: Use the mean value in the sliding window

    median: Use the median value in the sliding window

    pre: Use the previous value

    back: Use the next value

    zero: Use 0

  • fillna_window_size (int) – Size of the sliding window
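
A rough pure-Python sketch of the seven fill methods follows. It is an illustration only, not the PaddleTS implementation: fill_missing is a hypothetical helper on a plain list, and the exact window placement is an assumption (here, the window is taken to be the up to window_size values immediately before the gap, including values filled earlier in the same pass).

```python
import math
import statistics
from typing import Callable, Dict, List

# Reducers for the window-based methods; keys match the fillna_method values.
_WINDOW_REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "max": max,
    "min": min,
    "avg": statistics.mean,
    "median": statistics.median,
}


def fill_missing(values: List[float], method: str = "pre",
                 window_size: int = 10) -> List[float]:
    """Fill NaN entries in a list (hypothetical helper, not PaddleTS code)."""
    out = list(values)
    if method == "back":
        # Walk right to left so that each gap takes the next known value.
        for i in range(len(out) - 2, -1, -1):
            if math.isnan(out[i]):
                out[i] = out[i + 1]
        return out
    for i, v in enumerate(out):
        if not math.isnan(v):
            continue
        if method == "zero":
            out[i] = 0.0
            continue
        # Assumption: the window is the (up to) window_size values before the gap.
        window = [x for x in out[max(0, i - window_size):i] if not math.isnan(x)]
        if not window:
            continue  # nothing to fill from; leave the NaN in place
        if method == "pre":
            out[i] = window[-1]
        else:
            out[i] = _WINDOW_REDUCERS[method](window)
    return out


nan = float("nan")
series = [1.0, nan, 3.0, nan, 5.0]
filled_pre = fill_missing(series, "pre")
filled_back = fill_missing(series, "back")
filled_avg = fill_missing(series, "avg", window_size=2)
print(filled_pre)   # [1.0, 1.0, 3.0, 3.0, 5.0]
print(filled_back)  # [1.0, 3.0, 3.0, 5.0, 5.0]
print(filled_avg)   # [1.0, 1.0, 3.0, 2.0, 5.0]
```

In this sketch an unknown method name raises a KeyError from the reducer lookup; how the library reports an unsupported method is not stated here.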

classmethod load_from_csv(filepath_or_buffer: str, group_id: Optional[str] = None, time_col: Optional[str] = None, target_cols: Optional[Union[str, List[str]]] = None, label_col: Optional[Union[str, List[str]]] = None, observed_cov_cols: Optional[Union[str, List[str]]] = None, feature_cols: Optional[Union[str, List[str]]] = None, known_cov_cols: Optional[Union[str, List[str]]] = None, static_cov_cols: Optional[Union[str, List[str]]] = None, freq: Optional[Union[str, int]] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None, **kwargs) Union[TSDataset, List[TSDataset]][source]

Construct a TSDataset object from a csv file

Parameters
  • filepath_or_buffer (str) – The path to the CSV file, or the file object; consistent with the argument of pandas.read_csv function

  • group_id (str|None) – The name of the column that identifies each time series; together with the time index, it identifies a sample. If the data contains only one time series, omit this parameter or set it to the name of a column whose value is constant. If group_id is provided, the function returns a list of TSDataset objects whose length equals the number of unique values in the group_id column. For example, an equipment load dataset may contain data for several pieces of equipment, with an ID column distinguishing them; in this case, group_id=’ID’.

  • time_col (str|None) – The name of time column

  • target_cols (list|str|None) – The names of columns for target

  • label_col (list|str|None) – The names of columns for label in anomaly detection

  • observed_cov_cols (list|str|None) – The names of columns for observed covariates

  • feature_cols (list|str|None) – The names of columns for feature in anomaly detection

  • known_cov_cols (list|str|None) – The names of columns for known covariates

  • static_cov_cols (list|str|None) – The names of columns for static covariates

  • freq (str|int|None) – A str or int representing the DatetimeIndex’s frequency or RangeIndex’s step size

  • fill_missing_dates (bool) – Fill missing dates or not

  • fillna_method (str) – Method of filling missing values. Seven methods are currently supported:

    max: Use the max value in the sliding window

    min: Use the min value in the sliding window

    avg: Use the mean value in the sliding window

    median: Use the median value in the sliding window

    pre: Use the previous value

    back: Use the next value

    zero: Use 0

  • fillna_window_size (int) – Size of the sliding window

  • drop_tail_nan (bool) – Whether to drop trailing NaN values of the time series

  • dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast the entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

  • kwargs – Optional arguments passed to pandas.read_csv

Returns

Union[TSDataset, List[TSDataset]]
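
The effect of group_id can be pictured as partitioning the rows of the file by that column, with each partition then becoming one TSDataset in the returned list. The sketch below shows only the partitioning step with the standard library; rows_by_group, the CSV content, and the column names ID, time, and load are hypothetical and not part of PaddleTS.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Hypothetical data: two pieces of equipment distinguished by the ID column.
CSV_TEXT = """ID,time,load
A,2023-01-01,1.0
A,2023-01-02,1.5
B,2023-01-01,7.0
B,2023-01-02,7.2
"""


def rows_by_group(text: str, group_id: str) -> Dict[str, List[dict]]:
    """Partition CSV rows by the value of the group_id column."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row[group_id]].append(row)
    return dict(groups)


groups = rows_by_group(CSV_TEXT, group_id="ID")
print(len(groups))                         # 2: one group per distinct ID
print([r["load"] for r in groups["B"]])    # ['7.0', '7.2']
```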

classmethod load_from_dataframe(df: DataFrame, group_id: Optional[str] = None, time_col: Optional[str] = None, target_cols: Optional[Union[str, List[str]]] = None, label_col: Optional[Union[str, List[str]]] = None, observed_cov_cols: Optional[Union[str, List[str]]] = None, feature_cols: Optional[Union[str, List[str]]] = None, known_cov_cols: Optional[Union[str, List[str]]] = None, static_cov_cols: Optional[Union[str, List[str]]] = None, freq: Optional[Union[str, int]] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None) Union[TSDataset, List[TSDataset]][source]

Construct a TSDataset object from a DataFrame

Parameters
  • df (pd.DataFrame) – pandas.DataFrame object from which to load data

  • group_id (str|None) – The name of the column that identifies each time series; together with the time index, it identifies a sample. If the data contains only one time series, omit this parameter or set it to the name of a column whose value is constant. If group_id is provided, the function returns a list of TSDataset objects whose length equals the number of unique values in the group_id column. For example, an equipment load dataset may contain data for several pieces of equipment, with an ID column distinguishing them; in this case, group_id=’ID’.

  • time_col (str|None) – The name of time column

  • target_cols (list|str|None) – The names of columns for target

  • label_col (list|str|None) – The names of columns for label in anomaly detection

  • observed_cov_cols (list|str|None) – The names of columns for observed covariates

  • feature_cols (list|str|None) – The names of columns for feature in anomaly detection

  • known_cov_cols (list|str|None) – The names of columns for known covariates

  • static_cov_cols (list|str|None) – The names of columns for static covariates

  • freq (str|int|None) – A str or int representing the DatetimeIndex’s frequency or RangeIndex’s step size

  • fill_missing_dates (bool) – Fill missing dates or not

  • fillna_method (str) – Method of filling missing values. Seven methods are currently supported:

    max: Use the max value in the sliding window

    min: Use the min value in the sliding window

    avg: Use the mean value in the sliding window

    median: Use the median value in the sliding window

    pre: Use the previous value

    back: Use the next value

    zero: Use 0

  • fillna_window_size (int) – Size of the sliding window

  • drop_tail_nan (bool) – Whether to drop trailing NaN values of the time series

  • dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast the entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Returns

Union[TSDataset, List[TSDataset]]

to_dataframe(copy: bool = True) DataFrame[source]

Return a pd.DataFrame representation of the TSDataset object

Parameters

copy (bool) – Return a copy of or a reference to the underlying DataFrame objects

Returns

pd.DataFrame

to_numpy(copy: bool = True) ndarray[source]

Return a np.ndarray representation of the TSDataset object

Parameters

copy (bool) – Return a copy of or a reference to the underlying DataFrame objects. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html

Returns

np.ndarray

get_target() Optional[TimeSeries][source]
Returns

target

Return type

TimeSeries|None

get_label() Optional[TimeSeries][source]
Returns

target

Return type

TimeSeries|None

get_observed_cov() Optional[TimeSeries][source]
Returns

observed_cov

Return type

TimeSeries|None

get_feature() Optional[TimeSeries][source]
Returns

observed_cov

Return type

TimeSeries|None

get_known_cov() Optional[TimeSeries][source]
Returns

known_cov

Return type

TimeSeries|None

get_static_cov() Optional[dict][source]
Returns

static_cov

Return type

dict|None

get_all_cov() Optional[TimeSeries][source]
Returns

The merged observed_cov and known_cov

Return type

pd.DataFrame|None

set_target(target: TimeSeries)[source]
Parameters

target (TimeSeries) – New target

Returns

None

Raises

ValueError

set_label(label: TimeSeries)[source]
Parameters

label (TimeSeries) – New label

Returns

None

Raises

ValueError

set_observed_cov(observed_cov: TimeSeries)[source]
Parameters

observed_cov (TimeSeries) – New observed_cov

Returns

None

Raises

ValueError

set_feature(feature: TimeSeries)[source]
Parameters

feature (TimeSeries) – New feature

Returns

None

Raises

ValueError

set_known_cov(known_cov: TimeSeries)[source]
Parameters

known_cov (TimeSeries) – New known_cov

Returns

None

Raises

ValueError

set_static_cov(static_cov: dict, append: bool = True)[source]
Parameters
  • static_cov (dict) – New static_cov

  • append (bool) – Append to the existing static_cov (True) or replace the existing static_cov (False)

Returns

None

Raises

ValueError
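
The difference between appending and replacing can be sketched with plain dicts. merge_static_cov is a hypothetical helper, not PaddleTS code, and the choice that a new value wins when a key already exists is an assumption of the sketch.

```python
from typing import Optional


def merge_static_cov(existing: Optional[dict], new: dict, append: bool = True) -> dict:
    """Return the static covariates after an update (hypothetical helper)."""
    if append and existing:
        # Assumption: on a key collision the new value wins.
        return {**existing, **new}
    return dict(new)


current = {"site": "north", "capacity": 50}
appended = merge_static_cov(current, {"capacity": 80, "region": 3}, append=True)
replaced = merge_static_cov(current, {"region": 3}, append=False)
print(appended)  # {'site': 'north', 'capacity': 80, 'region': 3}
print(replaced)  # {'region': 3}
```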

property target: Optional[TimeSeries]

Returns: TimeSeries|None: target

property label: Optional[TimeSeries]

Returns: TimeSeries|None: target

property observed_cov: Optional[TimeSeries]

Returns: TimeSeries|None: observed_cov

property feature: Optional[TimeSeries]

Returns: TimeSeries|None: observed_cov

property known_cov: Optional[TimeSeries]

Returns: TimeSeries|None: known_cov

property static_cov: Optional[dict]

Returns: dict|None: static_cov

split(split_point: Union[Timestamp, str, float, int], after=True) Tuple[TSDataset, TSDataset][source]

Split the TSDataset object into two TSDataset objects according to split_point. Only valid when the target is not None.

Parameters
  • split_point (pd.Timestamp|str|float|int) –

    Where to split the TSDataset, which could be

    pd.Timestamp|str: Only valid when the type of time_index is pd.DatetimeIndex. A str is first converted to a pandas datetime value.

    float: The proportion of the total length assigned to the first TSDataset object

    int: Only valid when the type of time_index is pd.RangeIndex

    If the data at split_point exists, it is included in the first TSDataset object.

  • after (bool) – If split_point (pd.Timestamp) doesn’t exist in the time column, use the next valid index (True) or the previous one (False)

Returns

Tuple[“TSDataset”, “TSDataset”]

Raises
  • ValueError

  • TypeError

get_item_from_column(column: Union[str, int]) Union[TimeSeries, dict][source]

Get the underlying TimeSeries object for targets, observed covariates, or known covariates, or the dict for static covariates, according to the column name

Parameters

column (str|int) – column name

Returns

Union[“TimeSeries”, dict]

Raises

ValueError

set_column(column: Union[str, int], value: Union[Series, str, int], type: str = 'known_cov')[source]

Add a new column or update the existing column

Parameters
  • column (str|int) – column name

  • value (pd.Series|str|int) – New column values. When value is a pd.Series, its index must be the same as the index of the TSDataset object. When type=’static_cov’, value can only be int or str.

  • type (str) – Only effective when adding a new column, where to put the new column. By default, the new column will be added to known_cov.

Returns

None

Raises

ValueError

drop(columns: Union[str, int, List[Union[str, int]]])[source]

Drop column or columns

Parameters

columns (str|int|List) – Column name or column names

Returns

None

Raises

ValueError

plot(columns: Union[List[str], str] = None, add_data: Union[List[TSDataset], TSDataset] = None, labels: Union[List[str], str] = None, low_quantile: float = 0.05, high_quantile: float = 0.95, central_quantile: float = 0.5, **kwargs) pyplot[source]

Plot the dataset; a wrapper around DataFrame.plot()

Parameters
  • columns (str|List) – The names of the columns to plot. When columns is None, the targets are plotted by default.

  • add_data (List|TSDataset) – Additional dataset(s) to plot jointly; the default is None

  • labels (str|List) – Custom labels; the length should equal the number of added datasets.

  • central_quantile (float) – The quantile (between 0 and 1) to plot as a “central” value. For instance, setting central_quantile=0.5 plots the median of each component. (Only used when the dataset is a probabilistic forecasting output.)

  • low_quantile (float) – The quantile to use for the lower bound of the plotted confidence interval. Similar to central_quantile, this is applied to each component separately (i.e., displaying marginal distributions). No confidence interval is shown if low_quantile is None (default 0.05). (Only used when the dataset is a probabilistic forecasting output.)

  • high_quantile (float) – The quantile to use for the upper bound of the plotted confidence interval. Similar to central_quantile, this is applied to each component separately (i.e., displaying marginal distributions). No confidence interval is shown if high_quantile is None (default 0.95). (Only used when the dataset is a probabilistic forecasting output.)

  • **kwargs – Optional arguments passed to the DataFrame.plot function

Returns

matplotlib.pyplot object

Raises

ValueError

copy() TSDataset[source]

Make a copy of the TSDataset object

Returns

TSDataset

save(file: str)[source]

Save TSDataset object to a file

Parameters

file (str) – file path

classmethod load(file: str) TSDataset[source]

Load TSDataset from the saved file

Parameters

file (str) – file path

Returns

TSDataset

to_json() str[source]

Return a JSON string representation of the TSDataset object.

Returns

str

classmethod load_from_json(json_data: str, **json_load_kwargs) TSDataset[source]

Construct a TSDataset object from a JSON string

Parameters
  • json_data (str) – JSON string from which to load data

  • **json_load_kwargs – Optional arguments passed to json.loads function

Returns

TSDataset

property columns: dict

All columns except the static covariate columns

Returns

The key is the column name, and the value is the type, including target, known_cov, and observed_cov

Return type

dict

property freq

Frequency of TSDataset

classmethod concat(tss: List[TSDataset], axis: int = 0, drop_duplicates=True, keep='first') TSDataset[source]

Concatenate a list of TSDataset objects along the specified axis

Parameters
  • tss (list[TSDataset]) – A list of TSDataset objects. All TSDatasets’ freqs are required to be consistent. When axis=1, time_col is required to be non-repetitive; when axis=0, all columns are required to be non-repetitive

  • axis (int) – The axis along which to concatenate the TSDataset objects

  • drop_duplicates (bool) – Whether to drop duplicate indices.

  • keep (str) – Which entry to keep when dropping duplicates: ‘first’ or ‘last’.

Returns

TSDataset

Raises

ValueError

astype(dtype: Union[dtype, type, Dict[str, Union[dtype, type]]])[source]

Cast a TSDataset object to the specified dtype

Parameters

dtype (Union[np.dtype, type, Dict[str, Union[np.dtype, type]]]) – Use a numpy.dtype or Python type to cast the entire TSDataset object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Raises
  • TypeError

  • KeyError

property dtypes: Series

Get dtypes of target, known_covs, observed_covs

Returns

<column name, dtype>

Return type

pd.Series

sort_columns(ascending: bool = True)[source]

Sort the columns of the TSDataset object by column label

Parameters

ascending (bool) – Ascending or descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

to_categorical(col: Optional[Union[str, List[str]]] = None)[source]

Cast the specified column(s) to int so that they are treated as categorical.

Parameters

col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TSDataset

to_numeric(col: Optional[Union[str, List[str]]] = None)[source]

Cast the specified column(s) to float so that they are treated as numeric.

Parameters

col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TSDataset