paddlets.datasets.tsdataset

TSDataset is the fundamental data class in PaddleTS, designed as the first-class citizen for representing time series data. It is used throughout PaddleTS; in many cases, a function consumes a TSDataset and produces another TSDataset. A TSDataset object comprises two kinds of time series data:

Target: the key time series data in a time series modeling task (e.g. the series to be forecast in a forecasting task).

Covariate: related time series data that are usually helpful for the time series modeling task.

Currently, it supports the representation of:

Time series with a single target, with or without covariates.

Time series with multiple targets, with or without covariates.

Covariates fall into one of the following 3 types:

Observed covariates (observed_cov):
variables that can only be observed in historical data, e.g. measured temperatures

Known covariates (known_cov):
variables whose values for future time steps can already be determined at present, e.g. weather forecasts

Static covariates (static_cov):
variables that remain constant over time

A TSDataset object includes one or more TimeSeries objects, representing the target, the known covariates (known_cov), and the observed covariates (observed_cov), together with a dict holding the static covariates (static_cov).

class TimeSeries(data: Union[DataFrame, Series], freq: Union[int, str])

Bases: object

TimeSeries is the atomic data structure for representing target(s), observed covariates (observed_cov), and known covariates (known_cov). Each may consist of a single time series or of multiple time series.

Parameters

data (DataFrame|Series) – A pandas DataFrame or Series containing the time series data
freq (str|int) – A string or int representing the pandas DatetimeIndex’s frequency or the RangeIndex’s step size

Returns

None

classmethod load_from_dataframe(data: Union[DataFrame, Series], time_col: Optional[str] = None, value_cols: Optional[Union[str, List[str]]] = None, freq: Optional[Union[str, int]] = None, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None) → TimeSeries

Construct a TimeSeries object from the specified columns of a DataFrame

Parameters

data (DataFrame|Series) – A Pandas DataFrame or Series containing the time series data
time_col (str|None) – The name of the time column, whose values form a pandas DatetimeIndex or RangeIndex. If not set, the DataFrame’s index will be used.
value_cols (list|str|None) – The name of the column or the list of columns from which to extract the time series data. If set to None, all columns except the time column will be used as value columns.
freq (str|int|None) – A string or int representing the pandas DatetimeIndex’s frequency or the RangeIndex’s step size
drop_tail_nan (bool) – Whether to drop trailing NaN values. If True, every NaN after the last non-NaN element of the time series is dropped, e.g. [nan, 3, 2, nan, nan] -> [nan, 3, 2], [3, 2, nan, nan] -> [3, 2], [nan, nan, nan] -> []
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Returns

TimeSeries object
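
The drop_tail_nan rule above can be reproduced without PaddleTS. The following is a minimal plain-Python sketch of the documented behaviour, not the library's implementation; the helper name drop_tail_nan_sketch is hypothetical.

```python
import math


def drop_tail_nan_sketch(values):
    """Drop every NaN that follows the last non-NaN element (hypothetical helper)."""
    end = len(values)
    # Walk backwards until a non-NaN value is found.
    while end > 0 and math.isnan(values[end - 1]):
        end -= 1
    return values[:end]


nan = float("nan")
# Leading NaNs are kept; only the trailing run is removed.
print(drop_tail_nan_sketch([nan, 3, 2, nan, nan]))
print(drop_tail_nan_sketch([3, 2, nan, nan]))
print(drop_tail_nan_sketch([nan, nan, nan]))
```

The three calls mirror the three examples given for the drop_tail_nan parameter.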

property time_index: the time index

property columns: the data columns

property start_time: Union[Timestamp, int]: the first value of the time index

property end_time: Union[Timestamp, int]: the last value of the time index

property data: DataFrame storing the data

property freq: Frequency of TimeSeries

property dtypes: Series: dtypes of TimeSeries

astype(dtype: Union[dtype, type, Dict[str, Union[dtype, type]]])

Cast a TimeSeries object to the specified dtype

Parameters

dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Raises

TypeError –
KeyError –
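
The two accepted forms of dtype (one type for everything, or a per-column dict) can be pictured with a small plain-Python sketch. It is only an illustration of the idea: the helper cast_columns and the column names are hypothetical, and raising KeyError for an unknown column is an assumption modelled on pandas.DataFrame.astype.

```python
def cast_columns(table, dtype):
    """Cast a {column: [values]} table with one type, or per column via a dict."""
    if isinstance(dtype, dict):
        unknown = set(dtype) - set(table)
        if unknown:
            # Assumption: unknown column labels are rejected, as pandas does.
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        return {
            col: [dtype[col](v) for v in vals] if col in dtype else list(vals)
            for col, vals in table.items()
        }
    # A single type is applied to every column.
    return {col: [dtype(v) for v in vals] for col, vals in table.items()}


table = {"load": [1, 2], "flag": [0.0, 1.0]}
all_float = cast_columns(table, float)
flag_as_int = cast_columns(table, {"flag": int})
print(all_float)
print(flag_as_int)
```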

to_dataframe(copy: bool = True) → DataFrame

Return a pd.DataFrame representation of the TimeSeries object

Parameters: copy (bool) – Whether to return a copy (True) or a reference (False)
Returns: pd.DataFrame

to_numpy(copy: bool = True) → ndarray

Return a numpy.ndarray representation of the TimeSeries object

Parameters: copy (bool) – Whether to return a copy or a reference. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary. See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
Returns: np.ndarray

get_index_at_point(point: Union[Timestamp, str, float, int], after=True) → int

Convert a point along the time axis into an integer index.

Parameters

point (pd.Timestamp|str|float|int) –
The time point. Three kinds of input are supported:

pd.Timestamp|str: Only takes effect when the time_index type is pd.DatetimeIndex; the corresponding index is returned, and a str is forcibly converted to pd.DatetimeIndex

float: The parameter is treated as the proportion of the time series that should lie before the point.

int: The parameter is returned as such, provided that it is in the series; otherwise a ValueError is raised.
after (bool) – If the provided pandas Timestamp is not in the time series index, whether to return the index of the next timestamp (True) or the index of the previous one (False).

Returns

index

Return type

int

Raises

ValueError –
TypeError –
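
The effect of after for a timestamp that is missing from the index can be sketched with the standard library alone. This is an illustration of the described rule, not PaddleTS code: the helper index_at_timestamp is hypothetical, and raising ValueError when no neighbouring timestamp exists is an assumption.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def index_at_timestamp(time_index, point, after=True):
    """Resolve a timestamp to an integer position in a sorted time index.

    An exact match returns its own position; otherwise `after` selects the
    next (True) or the previous (False) timestamp.
    """
    if after:
        pos = bisect_left(time_index, point)
    else:
        pos = bisect_right(time_index, point) - 1
    if pos < 0 or pos >= len(time_index):
        # Assumption: a point with no neighbour on the requested side is an error.
        raise ValueError("point lies outside the time index")
    return pos


days = [datetime(2023, 1, d) for d in (1, 2, 4, 5)]  # 3 January is missing
missing = datetime(2023, 1, 3)
next_pos = index_at_timestamp(days, missing, after=True)
prev_pos = index_at_timestamp(days, missing, after=False)
print(next_pos, prev_pos)
```

Here next_pos points at 4 January and prev_pos at 2 January.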

split(split_point: Union[Timestamp, str, float, int], after=True) → Tuple[TimeSeries, TimeSeries]

Split the TimeSeries object into two TimeSeries objects according to split_point

Parameters

split_point (pd.Timestamp|str|float|int) –
Where to split the TimeSeries, which could be

pd.Timestamp|str: Only valid when the type of time_index is pd.DatetimeIndex; a str is forcibly converted to pd.DatetimeIndex

float: The length of the first TimeSeries object as a proportion of the whole

int: Only valid when the type of time_index is pd.RangeIndex

If the data at split_point exists, it will be included in the first TimeSeries object.
after (bool) – If split_point (pd.Timestamp) doesn’t exist in the time index, use the next valid index (True) or the previous one (False)

Returns

Tuple[“TimeSeries”, “TimeSeries”]

Raises

ValueError –
TypeError –
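
The rule that the data at split_point lands in the first part can be shown with a plain-Python sketch. The helper split_inclusive is hypothetical, and it assumes that an int split_point is matched against the index labels of a RangeIndex-like index; the float and timestamp variants are not covered.

```python
def split_inclusive(index, values, split_point):
    """Split parallel index/value lists; split_point itself goes to the first part."""
    if split_point not in index:
        raise ValueError("split_point is not in the index")
    cut = index.index(split_point) + 1
    return (index[:cut], values[:cut]), (index[cut:], values[cut:])


index = list(range(0, 10, 2))  # stands in for a RangeIndex with step 2
values = [10.0, 11.0, 12.0, 13.0, 14.0]
(first_idx, first_vals), (second_idx, second_vals) = split_inclusive(index, values, 4)
print(first_idx, second_idx)
```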

copy() → TimeSeries

Make a copy of the TimeSeries object

Returns: TimeSeries

classmethod concat(tss: List[TimeSeries], axis: int = 0, drop_duplicates: bool = True, keep: str = 'first') → TimeSeries

Concatenate a list of TimeSeries objects along the specified axis

Parameters

tss (list[TimeSeries]) – A list of TimeSeries objects. All TimeSeries’ freqs are required to be consistent. When axis=1, time_col is required to be non-repetitive; when axis=0, all columns are required to be non-repetitive
axis (int) – The axis along which to concatenate the TimeSeries objects
drop_duplicates (bool) – Whether to drop duplicate indices.
keep (str) – Which duplicate to keep, ‘first’ or ‘last’, when dropping duplicates.

Returns

TimeSeries

Raises

ValueError –
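
How keep resolves duplicate index entries when dropping duplicates can be pictured as follows. This is a plain-Python sketch under assumptions, not the PaddleTS implementation: each series is modelled as a {timestamp: value} dict, and the helper concat_rows is hypothetical.

```python
def concat_rows(series_list, keep="first"):
    """Concatenate {timestamp: value} mappings along the time axis, dropping duplicates."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    merged = {}
    for series in series_list:
        for stamp, value in series.items():
            if stamp in merged and keep == "first":
                continue  # retain the value seen earlier
            merged[stamp] = value
    return merged


a = {1: "a1", 2: "a2"}
b = {2: "b2", 3: "b3"}  # timestamp 2 appears in both inputs
kept_first = concat_rows([a, b], keep="first")
kept_last = concat_rows([a, b], keep="last")
print(kept_first)
print(kept_last)
```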

reindex(index, fill_value=nan, *args, **kwargs) → TimeSeries

Reindex the TimeSeries object with optional filling logic

Parameters

index – Array-like new index to conform to. Preferably an Index object, to avoid duplicating data.
fill_value – Value to use for missing values. NaN by default, but can be any “compatible” value.
args – Optional arguments passed to DataFrame.reindex
kwargs – Optional arguments passed to DataFrame.reindex

Returns

TimeSeries

Raises

ValueError –
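
Since the extra arguments are passed to DataFrame.reindex, the basic effect of index and fill_value follows the pandas semantics: labels absent from the new index are dropped, and new labels receive fill_value. A minimal plain-Python sketch of that idea, with a hypothetical helper name and a dict standing in for a single column:

```python
import math


def reindex_sketch(series, new_index, fill_value=math.nan):
    """Conform a {label: value} mapping to new_index, filling labels that are missing."""
    return {label: series.get(label, fill_value) for label in new_index}


series = {0: 1.5, 1: 2.5, 2: 3.5}
reindexed = reindex_sketch(series, [1, 2, 3], fill_value=0.0)
print(reindexed)
```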

sort_columns(ascending: bool = True)

Sort the TimeSeries object by the index

Parameters: ascending (bool) – Sort ascending or descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

drop_tail_nan(): Drop trailing consecutive NaN values

to_json() → str

Return a JSON string representation of the TimeSeries object.

Returns: str

classmethod load_from_json(json_data: str, **json_load_kwargs) → TimeSeries

Construct a TimeSeries object from a JSON string

Parameters

json_data (str) – The JSON string from which to load data
**json_load_kwargs – Optional arguments passed to the json.loads function

Returns

TimeSeries

to_categorical(col: Optional[Union[str, List[str]]] = None)

Cast the given column(s) to int so that they are treated as categorical.

Parameters: col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TimeSeries

to_numeric(col: Optional[Union[str, List[str]]] = None)

Cast the given column(s) to float so that they are treated as numeric.

Parameters: col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TimeSeries

class TSDataset(target: Optional[TimeSeries] = None, observed_cov: Optional[TimeSeries] = None, known_cov: Optional[TimeSeries] = None, static_cov: Optional[dict] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10)

Bases: object

TSDataset is the fundamental data class in PaddleTS, designed as the first-class citizen for representing time series data. It holds the target together with the optional observed, known, and static covariates described in the module overview at the top of this page.

Parameters

target (TimeSeries|None) – Target
observed_cov (TimeSeries|None) – Observed covariates
known_cov (TimeSeries|None) – Known covariates
static_cov (dict|None) – Static covariates
fill_missing_dates (bool) – Fill missing dates or not
fillna_method (str) –
Method of filling missing values. Seven methods are currently supported:

max: Use the max value in the sliding window

min: Use the min value in the sliding window

avg: Use the mean value in the sliding window

median: Use the median value in the sliding window

pre: Use the previous value

back: Use the next value

zero: Use 0s
fillna_window_size (int) – Size of the sliding window
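
The seven fillna_method options can be compared side by side with a plain-Python sketch. It is an illustration only, not the PaddleTS implementation: the helper fill_missing is hypothetical, and it assumes that the sliding window consists of the up to fillna_window_size entries preceding the gap, with NaNs ignored.

```python
import math
import statistics


def fill_missing(values, method="pre", window_size=10):
    """Fill NaN entries using one of the seven documented strategies (sketch)."""
    aggregators = {
        "max": max,
        "min": min,
        "avg": statistics.fmean,
        "median": statistics.median,
    }
    if method not in aggregators and method not in ("pre", "back", "zero"):
        raise ValueError(f"unknown fillna_method: {method}")
    filled = list(values)
    for i, value in enumerate(filled):
        if not math.isnan(value):
            continue
        if method == "zero":
            filled[i] = 0.0
        elif method == "pre":
            if i > 0:
                filled[i] = filled[i - 1]  # previous (possibly filled) value
        elif method == "back":
            following = [v for v in values[i + 1:] if not math.isnan(v)]
            if following:
                filled[i] = following[0]  # next observed value
        else:
            # Assumed window: the preceding window_size entries, NaNs skipped.
            window = [v for v in filled[max(0, i - window_size):i] if not math.isnan(v)]
            if window:
                filled[i] = aggregators[method](window)
    return filled


nan = float("nan")
data = [1.0, 4.0, nan, 2.0]
for name in ("max", "min", "avg", "median", "pre", "back", "zero"):
    print(name, fill_missing(data, name))
```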

classmethod load_from_csv(filepath_or_buffer: str, group_id: Optional[str] = None, time_col: Optional[str] = None, target_cols: Optional[Union[str, List[str]]] = None, label_col: Optional[Union[str, List[str]]] = None, observed_cov_cols: Optional[Union[str, List[str]]] = None, feature_cols: Optional[Union[str, List[str]]] = None, known_cov_cols: Optional[Union[str, List[str]]] = None, static_cov_cols: Optional[Union[str, List[str]]] = None, freq: Optional[Union[str, int]] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None, **kwargs) → Union[TSDataset, List[TSDataset]]

Construct a TSDataset object from a CSV file

Parameters

filepath_or_buffer (str) – The path to the CSV file, or the file object; consistent with the argument of pandas.read_csv function
group_id (str|None) – The name of the column that identifies a time series; together with the time_index, the group_id identifies a sample. If the dataset holds only one time series, either omit this parameter or set it to the name of a column whose value is constant. If group_id is provided, the function returns a list of TSDataset objects whose length equals the number of unique values in the group_id column. E.g. an equipment load detection dataset contains data from multiple pieces of equipment, with an ID column distinguishing them; in this case, group_id=’ID’.
time_col (str|None) – The name of time column
target_cols (list|str|None) – The names of columns for target
label_col (list|str|None) – The names of columns for label in anomaly detection
observed_cov_cols (list|str|None) – The names of columns for observed covariates
feature_cols (list|str|None) – The names of columns for feature in anomaly detection
known_cov_cols (list|str|None) – The names of columns for known covariates
static_cov_cols (list|str|None) – The names of columns for static covariates
freq (str|int|None) – A str or int representing the DatetimeIndex’s frequency or the RangeIndex’s step size
fill_missing_dates (bool) – Fill missing dates or not
fillna_method (str) –
Method of filling missing values. Seven methods are currently supported:

max: Use the max value in the sliding window

min: Use the min value in the sliding window

avg: Use the mean value in the sliding window

median: Use the median value in the sliding window

pre: Use the previous value

back: Use the next value

zero: Use 0s
fillna_window_size (int) – Size of the sliding window
drop_tail_nan (bool) – Whether to drop the trailing NaN values of the time series
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
kwargs – Optional arguments passed to pandas.read_csv

Returns

Union[TSDataset, List[TSDataset]]
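
The effect of group_id, one result per distinct value of the grouping column, can be sketched in plain Python. The helper split_by_group, the column names, and the device names are hypothetical and only mirror the equipment example; the real loader returns TSDataset objects rather than lists of rows.

```python
from collections import defaultdict


def split_by_group(rows, group_id):
    """Partition row dicts by the value of the group_id column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_id]].append(row)
    return list(groups.values())


rows = [
    {"ID": "pump_a", "time": 0, "load": 0.4},
    {"ID": "pump_b", "time": 0, "load": 0.7},
    {"ID": "pump_a", "time": 1, "load": 0.5},
    {"ID": "pump_b", "time": 1, "load": 0.6},
]
per_device = split_by_group(rows, group_id="ID")
# One entry per distinct ID, each holding that device's own time series.
print(len(per_device))
```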

classmethod load_from_dataframe(df: DataFrame, group_id: Optional[str] = None, time_col: Optional[str] = None, target_cols: Optional[Union[str, List[str]]] = None, label_col: Optional[Union[str, List[str]]] = None, observed_cov_cols: Optional[Union[str, List[str]]] = None, feature_cols: Optional[Union[str, List[str]]] = None, known_cov_cols: Optional[Union[str, List[str]]] = None, static_cov_cols: Optional[Union[str, List[str]]] = None, freq: Optional[Union[str, int]] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None) → Union[TSDataset, List[TSDataset]]

Construct a TSDataset object from a DataFrame

Parameters

df (pd.DataFrame) – pandas.DataFrame object from which to load data
group_id (str|None) – The name of the column that identifies a time series; together with the time_index, the group_id identifies a sample. If the dataset holds only one time series, either omit this parameter or set it to the name of a column whose value is constant. If group_id is provided, the function returns a list of TSDataset objects whose length equals the number of unique values in the group_id column. E.g. an equipment load detection dataset contains data from multiple pieces of equipment, with an ID column distinguishing them; in this case, group_id=’ID’.
time_col (str|None) – The name of time column
target_cols (list|str|None) – The names of columns for target
label_col (list|str|None) – The names of columns for label in anomaly detection
observed_cov_cols (list|str|None) – The names of columns for observed covariates
feature_cols (list|str|None) – The names of columns for feature in anomaly detection
known_cov_cols (list|str|None) – The names of columns for known covariates
static_cov_cols (list|str|None) – The names of columns for static covariates
freq (str|int|None) – A str or int representing the DatetimeIndex’s frequency or the RangeIndex’s step size
fill_missing_dates (bool) – Fill missing dates or not
fillna_method (str) –
Method of filling missing values. Seven methods are currently supported:

max: Use the max value in the sliding window

min: Use the min value in the sliding window

avg: Use the mean value in the sliding window

median: Use the median value in the sliding window

pre: Use the previous value

back: Use the next value

zero: Use 0s
fillna_window_size (int) – Size of the sliding window
drop_tail_nan (bool) – Whether to drop the trailing NaN values of the time series
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Returns

Union[TSDataset, List[TSDataset]]

to_dataframe(copy: bool = True) → DataFrame

Return a pd.DataFrame representation of the TSDataset object

Parameters: copy (bool) – Return a copy of or a reference to the underlying DataFrame objects
Returns: pd.DataFrame

to_numpy(copy: bool = True) → ndarray

Return a np.ndarray representation of the TSDataset object

Parameters: copy (bool) – Return a copy of or a reference to the underlying DataFrame objects. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary. See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
Returns: np.ndarray

get_target() → Optional[TimeSeries]

Returns: target
Return type: TimeSeries|None

get_label() → Optional[TimeSeries]

Returns: target
Return type: TimeSeries|None

get_observed_cov() → Optional[TimeSeries]

Returns: observed_cov
Return type: TimeSeries|None

get_feature() → Optional[TimeSeries]

Returns: observed_cov
Return type: TimeSeries|None

get_known_cov() → Optional[TimeSeries]

Returns: known_cov
Return type: TimeSeries|None

get_static_cov() → Optional[dict]

Returns: static_cov
Return type: dict|None

get_all_cov() → Optional[TimeSeries]

Returns: The merged observed_cov and known_cov
Return type: TimeSeries|None

set_target(target: TimeSeries)

Parameters: target (TimeSeries) – New target
Returns: None
Raises: ValueError –

set_label(label: TimeSeries)

Parameters: label (TimeSeries) – New label
Returns: None
Raises: ValueError –

set_observed_cov(observed_cov: TimeSeries)

Parameters: observed_cov (TimeSeries) – New observed_cov
Returns: None
Raises: ValueError –

set_feature(feature: TimeSeries)

Parameters: feature (TimeSeries) – New feature
Returns: None
Raises: ValueError –

set_known_cov(known_cov: TimeSeries)

Parameters: known_cov (TimeSeries) – New known_cov
Returns: None
Raises: ValueError –

set_static_cov(static_cov: dict, append: bool = True)

Parameters

static_cov (dict) – New static_cov
append (bool) – Whether to append to the existing static_cov (True) or replace the existing static_cov (False)

Returns

None

Raises

ValueError –
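
The difference between appending and replacing can be sketched with plain dicts. This is an illustration under assumptions rather than the library's code: the helper updated_static_cov and the keys are hypothetical, and append=True is assumed to merge the new keys into the existing dict.

```python
def updated_static_cov(current, new, append=True):
    """Return the static covariates after an update.

    Assumption: append=True merges the new keys into the existing dict,
    while append=False discards the existing entries.
    """
    if append:
        return {**current, **new}
    return dict(new)


existing = {"site": "north"}
appended = updated_static_cov(existing, {"capacity": 200}, append=True)
replaced = updated_static_cov(existing, {"capacity": 200}, append=False)
print(appended)
print(replaced)
```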

property target: Optional[TimeSeries]: Returns: TimeSeries|None: target

property label: Optional[TimeSeries]: Returns: TimeSeries|None: target

property observed_cov: Optional[TimeSeries]: Returns: TimeSeries|None: observed_cov

property feature: Optional[TimeSeries]: Returns: TimeSeries|None: observed_cov

property known_cov: Optional[TimeSeries]: Returns: TimeSeries|None: known_cov

property static_cov: Optional[dict]: Returns: dict|None: static_cov

split(split_point: Union[Timestamp, str, float, int], after=True) → Tuple[TSDataset, TSDataset]

Split the TSDataset object into two TSDataset objects according to split_point; only valid when self._target is not None

Parameters

split_point (pd.Timestamp|str|float|int) –
Where to split the TSDataset, which could be

pd.Timestamp|str: Only valid when the type of time_index is pd.DatetimeIndex; a str is forcibly converted to pd.DatetimeIndex

float: The length of the first TSDataset object as a proportion of the whole

int: Only valid when the type of time_index is pd.RangeIndex

If the data at split_point exists, it will be included in the first TSDataset object.
after (bool) – If split_point (pd.Timestamp) doesn’t exist in the time column, use the next valid index (True) or the previous one (False)

Returns

Tuple[“TSDataset”, “TSDataset”]

Raises

ValueError –
TypeError –

get_item_from_column(column: Union[str, int]) → Union[TimeSeries, dict]

Get the underlying TimeSeries object for targets, observed covariates, and known covariates, or the dict for static covariates, according to the column name

Parameters: column (str|int) – column name
Returns: Union[“TimeSeries”, dict]
Raises: ValueError –

set_column(column: Union[str, int], value: Union[Series, str, int], type: str = 'known_cov')

Add a new column or update the existing column

Parameters

column (str|int) – column name
value (pd.Series|str|int) – New column values. When value is a pd.Series, its index must be the same as the index of the TSDataset object. When type=’static_cov’, value can only be an int or a str.
type (str) – Where to put the new column; only effective when adding a new column. By default, the new column is added to known_cov.

Returns

None

Raises

ValueError –

drop(columns: Union[str, int, List[Union[str, int]]])

Drop one or more columns

Parameters: columns (str|int|List) – Column name or list of column names
Returns: None
Raises: ValueError –

plot(columns: Union[List[str], str] = None, add_data: Union[List[TSDataset], TSDataset] = None, labels: Union[List[str], str] = None, low_quantile: float = 0.05, high_quantile: float = 0.95, central_quantile: float = 0.5, **kwargs) → pyplot

Plot function, a wrapper for DataFrame.plot()

Parameters

columns (str|List) – The names of the columns to plot. When columns is None, the targets are plotted by default.
add_data (List|TSDataset) – Additional dataset(s) to plot jointly; the default is None
labels (str|List) – Custom labels; the length should equal the number of added datasets.
central_quantile (float) – The quantile (between 0 and 1) to plot as a “central” value. For instance, setting central_quantile=0.5 will plot the median of each component. (Only used when the dataset is a probabilistic forecasting output.)
low_quantile (float) – The quantile to use for the lower bound of the plotted confidence interval. Similar to central_quantile, this is applied to each component separately (i.e., displaying marginal distributions). No confidence interval is shown if low_quantile is None (default 0.05). (Only used when the dataset is a probabilistic forecasting output.)
high_quantile (float) – The quantile to use for the upper bound of the plotted confidence interval. Similar to central_quantile, this is applied to each component separately (i.e., displaying marginal distributions). No confidence interval is shown if high_quantile is None (default 0.95). (Only used when the dataset is a probabilistic forecasting output.)
**kwargs – Optional arguments passed to the DataFrame.plot function

Returns

matplotlib.pyplot object

Raises

ValueError –
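
What the three quantile parameters select can be illustrated on a single time step of a single component. The sketch below assumes that a probabilistic forecast is available as a list of sampled values and uses linear interpolation between order statistics; both points are assumptions for illustration, and the helper quantile is hypothetical rather than part of PaddleTS.

```python
def quantile(samples, q):
    """Linearly interpolated quantile of a list of numbers, for 0 <= q <= 1."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie between 0 and 1")
    ordered = sorted(samples)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical forecast samples for one time step of one component.
samples = [float(v) for v in range(101)]
low, central, high = (quantile(samples, q) for q in (0.05, 0.5, 0.95))
# low and high bound the plotted interval; central is the median line.
print(low, central, high)
```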

copy() → TSDataset

Make a copy of the TSDataset object

Returns: TSDataset

save(file: str)

Save TSDataset object to a file

Parameters: file (str) – file path

classmethod load(file: str) → TSDataset

Load TSDataset from the saved file

Parameters: file (str) – file path
Returns: TSDataset

to_json() → str

Return a JSON string representation of the TSDataset object.

Returns: str

classmethod load_from_json(json_data: str, **json_load_kwargs) → TSDataset

Construct a TSDataset object from a JSON string

Parameters

json_data (str) – The JSON string from which to load data
**json_load_kwargs – Optional arguments passed to the json.loads function

Returns

TSDataset

property columns: dict

Return all columns (except static columns)

Returns: The key is the column name, and the value is the type, including target, known_cov, and observed_cov
Return type: dict
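
The shape of the returned dict, column name mapped to its role, can be sketched as follows. The helper column_roles and the column names are hypothetical; the sketch only shows the mapping described above, not how PaddleTS builds it.

```python
def column_roles(target_cols, known_cov_cols, observed_cov_cols):
    """Map every non-static column name to the kind of series it belongs to."""
    roles = {}
    groups = (
        ("target", target_cols),
        ("known_cov", known_cov_cols),
        ("observed_cov", observed_cov_cols),
    )
    for role, names in groups:
        for name in names:
            roles[name] = role
    return roles


roles = column_roles(["load"], ["forecast_temp"], ["measured_temp"])
print(roles)
```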

property freq: Frequency of TSDataset

classmethod concat(tss: List[TSDataset], axis: int = 0, drop_duplicates=True, keep='first') → TSDataset

Concatenate a list of TSDataset objects along the specified axis

Parameters

tss (list[TSDataset]) – A list of TSDataset objects. All TSDatasets’ freqs are required to be consistent. When axis=1, time_col is required to be non-repetitive; when axis=0, all columns are required to be non-repetitive
axis (int) – The axis along which to concatenate the TSDataset objects
drop_duplicates (bool) – Whether to drop duplicate indices.
keep (str) – Which duplicate to keep, ‘first’ or ‘last’, when dropping duplicates.

Returns

TSDataset

Raises

ValueError –

astype(dtype: Union[dtype, type, Dict[str, Union[dtype, type]]])

Cast a TSDataset object to the specified dtype

Parameters

dtype (Union[np.dtype, type, Dict[str, Union[np.dtype, type]]]) – Use a numpy.dtype or Python type to cast the entire TSDataset object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

Raises

TypeError –
KeyError –

property dtypes: Series

Get the dtypes of the target, known_cov, and observed_cov columns

Returns: <column name, dtype>
Return type: pd.Series

sort_columns(ascending: bool = True)

Sort the TSDataset object by the index

Parameters: ascending (bool) – Ascending or descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

to_categorical(col: Optional[Union[str, List[str]]] = None)

Cast the given column(s) to int so that they are treated as categorical.

Parameters: col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TSDataset

to_numeric(col: Optional[Union[str, List[str]]] = None)

Cast the given column(s) to float so that they are treated as numeric.

Parameters: col (Optional[Union[str, List[str]]]) – Name(s) of the column(s) in the TSDataset