paddlets.datasets.tsdataset
TSDataset is the fundamental data class in PaddleTS, designed as a first-class citizen for representing time series data. It is used throughout PaddleTS: in many cases, a function consumes a TSDataset and produces another TSDataset. A TSDataset object comprises two kinds of time series data:
Target: the key time series data in a time series modeling task (e.g. the series to be forecast in a forecasting task).
Covariate: the relevant time series data which are usually helpful for the time series modeling tasks.
Currently, it supports the representation of:
Time series of a single target, with or without covariates.
Time series of multiple targets, with or without covariates.
Covariates fall into one of the following 3 types:
- Observed covariates (observed_cov):
referring to those variables which can only be observed in the historical data, e.g. measured temperatures
- Known covariates (known_cov):
referring to those variables which can be determined at present for future time steps, e.g. weather forecasts
- Static covariates (static_cov):
referring to those variables which keep constant over time
A TSDataset object includes one or more TimeSeries objects, representing targets, known covariates (known_cov), observed covariates (observed_cov), and static covariates (static_cov), respectively.
- class TimeSeries(data: Union[DataFrame, Series], freq: Union[int, str])[source]
Bases: object
TimeSeries is the atomic data structure for representing target(s), observed covariates (observed_cov), and known covariates (known_cov). Each may comprise one or more time series.
- Parameters
data (DataFrame|Series) – A Pandas DataFrame or Series containing the time series data
freq (str|int) – A string or int representing the Pandas DatetimeIndex’s frequency or RangeIndex’s step size
- Returns
None
- classmethod load_from_dataframe(data: Union[DataFrame, Series], time_col: Optional[str] = None, value_cols: Optional[Union[List[str], str]] = None, freq: Optional[Union[str, int]] = None, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None) TimeSeries[source]
Construct a TimeSeries object from the specified columns of a DataFrame
- Parameters
data (DataFrame|Series) – A Pandas DataFrame or Series containing the time series data
time_col (str|None) – The name of time column, a Pandas DatetimeIndex or RangeIndex. If not set, the DataFrame’s index will be used.
value_cols (list|str|None) – The name of column or the list of columns from which to extract the time series data. If set to None, all columns except for the time column will be used as value columns.
freq (str|int|None) – A string or int representing the Pandas DatetimeIndex’s frequency or RangeIndex’s step size
drop_tail_nan (bool) – Whether to drop trailing NaN values. If True, all NaN values after the last non-NaN element of the time series are dropped, e.g. [nan, 3, 2, nan, nan] -> [nan, 3, 2], [3, 2, nan, nan] -> [3, 2], [nan, nan, nan] -> []
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- Returns
TimeSeries object
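The drop_tail_nan rule can be illustrated on a plain Python list. This is a minimal sketch of the trimming behaviour described for that parameter, not the PaddleTS implementation; the helper name trim_tail_nan is hypothetical.

```python
import math


def trim_tail_nan(values):
    """Return `values` without the NaN run that follows the last non-NaN element."""
    end = len(values)
    # Walk backwards over the trailing NaN values only; leading NaNs are kept.
    while end > 0 and math.isnan(values[end - 1]):
        end -= 1
    return values[:end]


nan = float("nan")
trimmed = trim_tail_nan([nan, 3, 2, nan, nan])  # leading NaN stays, tail is dropped
```

Leading and interior NaN values are left untouched, which matches the three examples listed for drop_tail_nan.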
- property time_index
the time index
- property columns
the data columns
- property start_time: Union[Timestamp, int]
the first value of the time index
- property end_time: Union[Timestamp, int]
the last value of the time index
- property data
DataFrame storing the data
- property freq
Frequency of TimeSeries
- property dtypes: Series
dtypes of TimeSeries
- astype(dtype: Union[dtype, type, Dict[str, Union[dtype, type]]])[source]
Cast a TimeSeries object to the specified dtype
- Parameters
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- Raises
TypeError –
KeyError –
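The dtype argument has the same shape as the one accepted by pandas.DataFrame.astype. The sketch below shows both forms on a plain DataFrame; that TimeSeries.astype behaves the same way on its underlying data is an assumption, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"load": [1, 2, 3], "temp": [20, 21, 22]})

# A single dtype casts every column.
all_float = frame.astype(np.float32)

# A {column: dtype} mapping casts only the listed columns.
mixed = frame.astype({"load": np.float64})

# In pandas, a label that is not a column raises KeyError.
try:
    frame.astype({"missing": np.float64})
    raised_key_error = False
except KeyError:
    raised_key_error = True
```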
- to_dataframe(copy: bool = True) DataFrame[source]
Return a pd.DataFrame representation of the TimeSeries object
- Parameters
copy (bool) – Return a copy or reference
- Returns
pd.DataFrame
- to_numpy(copy: bool = True) ndarray[source]
Return a numpy.ndarray representation of the TimeSeries object
- Parameters
copy (bool) – Return a copy or a reference. Note that copy=False does not ensure that to_numpy() is no-copy; rather, copy=True ensures that a copy is made, even if not strictly necessary. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
- Returns
np.ndarray
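The effect of copy=True can be seen directly on a pandas DataFrame, whose to_numpy method is the one referenced above. This is a small sketch on plain pandas, not on a TimeSeries object; the column name is made up.

```python
import pandas as pd

frame = pd.DataFrame({"load": [1.0, 2.0, 3.0]})

# copy=True guarantees an independent array.
values = frame.to_numpy(copy=True)
values[0, 0] = 99.0  # changes the copy only; the DataFrame is untouched
```

With copy=False the result may or may not share memory with the DataFrame, so code that mutates the array should not rely on either outcome.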
- get_index_at_point(point: Union[Timestamp, str, float, int], after=True) int[source]
Convert a point along the time axis into an integer index.
- Parameters
point (pd.Timestamp|str|float|int) –
Time point; 3 kinds of value are supported:
pd.Timestamp|str: Only takes effect when the time_index type is pd.DatetimeIndex; the corresponding index is returned. A str is first converted to a pandas datetime.
float: The parameter is treated as the proportion of the time series that should lie before the point.
int: The parameter is returned as is, provided that it is in the series; otherwise a ValueError is raised.
after (bool) – If the provided timestamp is not in the time series index, whether to return the index of the next timestamp (True) or the index of the previous one (False).
- Returns
index
- Return type
int
- Raises
ValueError –
TypeError –
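The three kinds of point can be sketched with the standard library alone. The function below is an illustration of the rules listed for get_index_at_point, not the PaddleTS code: the name index_at_point is hypothetical, the time index is modelled as a sorted list of datetime objects, and rounding the float case down and rejecting out-of-range timestamps are assumptions.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def index_at_point(time_index, point, after=True):
    """Map a point on the time axis to an integer position in `time_index`."""
    if isinstance(point, float):
        if not 0.0 <= point <= 1.0:
            raise ValueError("a float point must lie in [0, 1]")
        # Assumption: the proportion is rounded down to a position.
        return min(int(len(time_index) * point), len(time_index) - 1)
    if isinstance(point, int):
        if not 0 <= point < len(time_index):
            raise ValueError("position is outside the series")
        return point
    if isinstance(point, str):
        point = datetime.fromisoformat(point)
    if isinstance(point, datetime):
        if point < time_index[0] or point > time_index[-1]:
            raise ValueError("timestamp is outside the series")
        pos = bisect_left(time_index, point)
        if time_index[pos] == point or after:
            return pos  # exact hit, or the next timestamp
        return pos - 1  # the previous timestamp
    raise TypeError(f"unsupported point type: {type(point).__name__}")


days = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(10)]
exact = index_at_point(days, "2023-01-04")
next_one = index_at_point(days, datetime(2023, 1, 4, 12), after=True)
previous = index_at_point(days, datetime(2023, 1, 4, 12), after=False)
```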
- split(split_point: Union[Timestamp, str, float, int], after=True) Tuple[TimeSeries, TimeSeries][source]
Split the TimeSeries object into two TimeSeries objects according to split_point
- Parameters
split_point (pd.Timestamp|str|float|int) –
Where to split the TimeSeries, which could be
pd.Timestamp|str: Only valid when the type of time_index is pd.DatetimeIndex; a str is first converted to a pandas datetime
float: The proportion of the length of the first TimeSeries object
int: Only valid when the type of time_index is pd.RangeIndex
If the data at split_point exists, it will be included in the first TimeSeries object.
after (bool) – If split_point (pd.Timestamp) doesn’t exist in the time index, use the next valid index (True) or the previous one (False)
- Returns
Tuple[“TimeSeries”, “TimeSeries”]
- Raises
ValueError –
TypeError –
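Two of the split rules are easy to get wrong: the element at the split point belongs to the first part, and a float is a proportion of the total length. A minimal sketch on plain lists follows; the helper names are hypothetical and rounding the proportion down is an assumption, not documented behaviour.

```python
def split_by_position(values, position):
    """Split so that the element at `position` ends up in the first part."""
    if not 0 <= position < len(values):
        raise ValueError("position is outside the series")
    return values[:position + 1], values[position + 1:]


def split_by_ratio(values, ratio):
    """Split so that the first part holds roughly `ratio` of the elements."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    cut = int(len(values) * ratio)  # assumption: round down
    return values[:cut], values[cut:]


series = list(range(8))
head, tail = split_by_position(series, 2)
train, test = split_by_ratio(series, 0.75)
```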
- copy() TimeSeries[source]
Make a copy of the TimeSeries object
- Returns
TimeSeries
- classmethod concat(tss: List[TimeSeries], axis: int = 0, drop_duplicates: bool = True, keep: str = 'first') TimeSeries[source]
Concatenate a list of TimeSeries objects along the specified axis
- Parameters
tss (list[TimeSeries]) – A list of TimeSeries objects. All TimeSeries’ freqs are required to be consistent. When axis=1, time_col is required to be non-repetitive; when axis=0, all columns are required to be non-repetitive
axis (int) – The axis along which to concatenate the TimeSeries objects
drop_duplicates (bool) – Drop duplicate indices.
keep (str) – Which duplicate to keep, ‘first’ or ‘last’, when dropping duplicates.
- Returns
TimeSeries
- Raises
ValueError –
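The interaction of drop_duplicates and keep can be pictured with plain pandas. The sketch below concatenates two overlapping frames along axis 0 and resolves the duplicated index label both ways; it illustrates the semantics only and is not taken from the PaddleTS source. The column name is made up.

```python
import pandas as pd

first = pd.DataFrame({"load": [1.0, 2.0]}, index=pd.RangeIndex(0, 2))
second = pd.DataFrame({"load": [20.0, 30.0]}, index=pd.RangeIndex(1, 3))

# Index label 1 now appears twice.
stacked = pd.concat([first, second], axis=0)

# keep="first" retains the row from `first`; keep="last" the row from `second`.
keep_first = stacked[~stacked.index.duplicated(keep="first")]
keep_last = stacked[~stacked.index.duplicated(keep="last")]
```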
- reindex(index, fill_value=nan, *args, **kwargs) TimeSeries[source]
Reindex the TimeSeries object with optional filling logic
- Parameters
index – array-like, new index to conform. Preferably an Index object to avoid duplicating data.
fill_value – Value to use for missing values. NaN by default, but can be any “compatible” value.
args – Optional arguments passed to DataFrame.reindex
kwargs – Optional arguments passed to DataFrame.reindex
- Returns
TimeSeries
- Raises
ValueError –
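Since the extra arguments are passed to DataFrame.reindex, the fill behaviour can be previewed on a plain DataFrame. A short sketch, with a made-up column name and dates:

```python
import pandas as pd

frame = pd.DataFrame(
    {"load": [1.0, 2.0, 3.0]},
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)
longer_index = pd.date_range("2023-01-01", periods=5, freq="D")

# Default: labels missing from the original index are filled with NaN.
padded_nan = frame.reindex(longer_index)

# An explicit fill_value replaces NaN for the new labels.
padded_zero = frame.reindex(longer_index, fill_value=0.0)
```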
- sort_columns(ascending: bool = True)[source]
Sort the TimeSeries object by the index
- Parameters
ascending (bool) – Sort ascending or descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
- classmethod load_from_json(json_data: str, **json_load_kwargs) TimeSeries[source]
Construct a TimeSeries object from a str json_data
- Parameters
json_data (str) – json object from which to load data
**json_load_kwargs – Optional arguments passed to json.loads function
- Returns
TimeSeries
- class TSDataset(target: Optional[TimeSeries] = None, observed_cov: Optional[TimeSeries] = None, known_cov: Optional[TimeSeries] = None, static_cov: Optional[dict] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10)[source]
Bases: object
TSDataset is the fundamental data class in PaddleTS, designed as a first-class citizen for representing time series data. It is used throughout PaddleTS: in many cases, a function consumes a TSDataset and produces another TSDataset. A TSDataset object comprises two kinds of time series data:
Target: the key time series data in a time series modeling task (e.g. the series to be forecast in a forecasting task).
Covariate: the relevant time series data which are usually helpful for the time series modeling tasks.
Currently, it supports the representation of:
Time series of a single target, with or without covariates.
Time series of multiple targets, with or without covariates.
Covariates fall into one of the following 3 types:
- Observed covariates (observed_cov):
referring to those variables which can only be observed in the historical data, e.g. measured temperatures
- Known covariates (known_cov):
referring to those variables which can be determined at present for future time steps, e.g. weather forecasts
- Static covariates (static_cov):
referring to those variables which keep constant over time
A TSDataset object includes one or more TimeSeries objects, representing targets, known covariates (known_cov), observed covariates (observed_cov), and static covariates (static_cov), respectively.
- Parameters
target (TimeSeries|None) – Target
observed_cov (TimeSeries|None) – Observed covariates
known_cov (TimeSeries|None) – Known covariates
static_cov (dict|None) – Static covariates
fill_missing_dates (bool) – Fill missing dates or not
fillna_method (str) – Method of filling missing values. Seven methods are currently supported:
max: Use the max value in the sliding window
min: Use the min value in the sliding window
avg: Use the mean value in the sliding window
median: Use the median value in the sliding window
pre: Use the previous value
back: Use the next value
zero: Use 0s
fillna_window_size (int) – Size of the sliding window
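To make the seven fillna_method options concrete, here is a plain-Python sketch. It is an interpretation, not the PaddleTS implementation: the helper name fill_missing is hypothetical, and the reference does not say how the sliding window is positioned, so the sketch assumes it covers the window_size values immediately before the gap, including values filled earlier.

```python
import math
from statistics import mean, median


def fill_missing(values, method="pre", window_size=10):
    """Fill NaN entries of a list according to `method`."""
    reducers = {"max": max, "min": min, "avg": mean, "median": median}
    if method not in reducers and method not in ("pre", "back", "zero"):
        raise ValueError(f"unknown method: {method}")
    filled = list(values)
    for i, value in enumerate(values):
        if not math.isnan(value):
            continue
        if method == "zero":
            filled[i] = 0.0
        elif method == "pre":
            if i > 0:
                filled[i] = filled[i - 1]  # previous (possibly filled) value
        elif method == "back":
            following = [x for x in values[i + 1:] if not math.isnan(x)]
            if following:
                filled[i] = following[0]  # next observed value
        else:
            # Assumption: the window is the `window_size` values before the gap.
            window = [x for x in filled[max(0, i - window_size):i] if not math.isnan(x)]
            if window:
                filled[i] = reducers[method](window)
    return filled


nan = float("nan")
data = [1.0, nan, 3.0, nan, nan]
by_pre = fill_missing(data, "pre")
by_avg = fill_missing(data, "avg", window_size=2)
```

A gap with no usable neighbour (for example a trailing NaN under "back") is left as NaN in this sketch.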
- classmethod load_from_csv(filepath_or_buffer: str, group_id: Optional[str] = None, time_col: Optional[str] = None, target_cols: Optional[Union[List[str], str]] = None, label_col: Optional[Union[List[str], str]] = None, observed_cov_cols: Optional[Union[List[str], str]] = None, feature_cols: Optional[Union[List[str], str]] = None, known_cov_cols: Optional[Union[List[str], str]] = None, static_cov_cols: Optional[Union[List[str], str]] = None, freq: Optional[Union[str, int]] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None, **kwargs) Union[TSDataset, List[TSDataset]][source]
Construct a TSDataset object from a csv file
- Parameters
filepath_or_buffer (str) – The path to the CSV file, or the file object; consistent with the argument of pandas.read_csv function
group_id (str|None) – The name of the column that identifies a time series; together with the time_index, the group_id identifies a sample. If you have only one time series, do not pass this parameter, or set it to the name of a column whose value is constant. If group_id is provided, the function returns a list of TSDataset objects whose length equals the number of unique values in the group_id column. E.g. in an equipment load detection dataset that contains data from multiple pieces of equipment, an ID column distinguishes the equipment; in this case, group_id=’ID’.
time_col (str|None) – The name of time column
target_cols (list|str|None) – The names of columns for target
label_col (list|str|None) – The names of columns for label in anomaly detection
observed_cov_cols (list|str|None) – The names of columns for observed covariates
feature_cols (list|str|None) – The names of columns for feature in anomaly detection
known_cov_cols (list|str|None) – The names of columns for known covariates
static_cov_cols (list|str|None) – The names of columns for static covariates
freq (str|int|None) – A str or int representing the DatetimeIndex’s frequency or RangeIndex’s step size
fill_missing_dates (bool) – Fill missing dates or not
fillna_method (str) – Method of filling missing values. Seven methods are currently supported:
max: Use the max value in the sliding window
min: Use the min value in the sliding window
avg: Use the mean value in the sliding window
median: Use the median value in the sliding window
pre: Use the previous value
back: Use the next value
zero: Use 0s
fillna_window_size (int) – Size of the sliding window
drop_tail_nan (bool) – Whether to drop the trailing NaN values of the time series
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
kwargs – Optional arguments passed to pandas.read_csv
- Returns
Union[TSDataset, List[TSDataset]]
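The group_id behaviour amounts to one dataset per unique value in the grouping column. The pandas sketch below shows only that grouping step on an in-memory CSV; the column names and values are made up, and turning each group into a TSDataset is left out.

```python
import io

import pandas as pd

csv_text = """ID,time,load
A,2023-01-01,1.0
A,2023-01-02,2.0
B,2023-01-01,5.0
B,2023-01-02,6.0
"""

table = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

# One frame per piece of equipment, indexed by time, without the ID column.
per_device = [
    group.drop(columns="ID").set_index("time")
    for _, group in table.groupby("ID")
]
```

The number of frames equals the number of unique IDs, which mirrors the documented length of the returned list.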
- classmethod load_from_dataframe(df: DataFrame, group_id: Optional[str] = None, time_col: Optional[str] = None, target_cols: Optional[Union[List[str], str]] = None, label_col: Optional[Union[List[str], str]] = None, observed_cov_cols: Optional[Union[List[str], str]] = None, feature_cols: Optional[Union[List[str], str]] = None, known_cov_cols: Optional[Union[List[str], str]] = None, static_cov_cols: Optional[Union[List[str], str]] = None, freq: Optional[Union[str, int]] = None, fill_missing_dates: bool = False, fillna_method: str = 'pre', fillna_window_size: int = 10, drop_tail_nan: bool = False, dtype: Optional[Union[type, Dict[str, type]]] = None) Union[TSDataset, List[TSDataset]][source]
Construct a TSDataset object from a DataFrame
- Parameters
df (pd.DataFrame) – pandas.DataFrame object from which to load data
group_id (str|None) – The name of the column that identifies a time series; together with the time_index, the group_id identifies a sample. If you have only one time series, do not pass this parameter, or set it to the name of a column whose value is constant. If group_id is provided, the function returns a list of TSDataset objects whose length equals the number of unique values in the group_id column. E.g. in an equipment load detection dataset that contains data from multiple pieces of equipment, an ID column distinguishes the equipment; in this case, group_id=’ID’.
time_col (str|None) – The name of time column
target_cols (list|str|None) – The names of columns for target
label_col (list|str|None) – The names of columns for label in anomaly detection
observed_cov_cols (list|str|None) – The names of columns for observed covariates
feature_cols (list|str|None) – The names of columns for feature in anomaly detection
known_cov_cols (list|str|None) – The names of columns for known covariates
static_cov_cols (list|str|None) – The names of columns for static covariates
freq (str|int|None) – A str or int representing the DatetimeIndex’s frequency or RangeIndex’s step size
fill_missing_dates (bool) – Fill missing dates or not
fillna_method (str) – Method of filling missing values. Seven methods are currently supported:
max: Use the max value in the sliding window
min: Use the min value in the sliding window
avg: Use the mean value in the sliding window
median: Use the median value in the sliding window
pre: Use the previous value
back: Use the next value
zero: Use 0s
fillna_window_size (int) – Size of the sliding window
drop_tail_nan (bool) – Whether to drop the trailing NaN values of the time series
dtype (np.dtype|type|dict) – Use a numpy.dtype or Python type to cast entire TimeSeries object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- Returns
Union[TSDataset, List[TSDataset]]
- to_dataframe(copy: bool = True) DataFrame[source]
Return a pd.DataFrame representation of the TSDataset object
- Parameters
copy (bool) – Return a copy of or a reference to the underlying DataFrame objects
- Returns
pd.DataFrame
- to_numpy(copy: bool = True) ndarray[source]
Return a np.ndarray representation of the TSDataset object
- Parameters
copy (bool) – Return a copy of or a reference to the underlying DataFrame objects. Note that copy=False does not ensure that to_numpy() is no-copy; rather, copy=True ensures that a copy is made, even if not strictly necessary. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
- Returns
np.ndarray
- get_target() Optional[TimeSeries][source]
- Returns
target
- Return type
TimeSeries|None
- get_label() Optional[TimeSeries][source]
- Returns
target
- Return type
TimeSeries|None
- get_observed_cov() Optional[TimeSeries][source]
- Returns
observed_cov
- Return type
TimeSeries|None
- get_feature() Optional[TimeSeries][source]
- Returns
observed_cov
- Return type
TimeSeries|None
- get_known_cov() Optional[TimeSeries][source]
- Returns
known_cov
- Return type
TimeSeries|None
- get_all_cov() Optional[TimeSeries][source]
- Returns
The merged observed_cov and known_cov
- Return type
pd.DataFrame|None
- set_target(target: TimeSeries)[source]
- Parameters
target (TimeSeries) – New target
- Returns
None
- Raises
ValueError –
- set_label(label: TimeSeries)[source]
- Parameters
label (TimeSeries) – New label
- Returns
None
- Raises
ValueError –
- set_observed_cov(observed_cov: TimeSeries)[source]
- Parameters
observed_cov (TimeSeries) – New observed_cov
- Returns
None
- Raises
ValueError –
- set_feature(feature: TimeSeries)[source]
- Parameters
feature (TimeSeries) – New feature
- Returns
None
- Raises
ValueError –
- set_known_cov(known_cov: TimeSeries)[source]
- Parameters
known_cov (TimeSeries) – New known_cov
- Returns
None
- Raises
ValueError –
- set_static_cov(static_cov: dict, append: bool = True)[source]
- Parameters
static_cov (dict) – New static_cov
append (bool) – Append to the existing static_cov (True) or replace the existing static_cov (False)
- Returns
None
- Raises
ValueError –
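The difference between appending and replacing can be sketched with ordinary dictionaries. The helper below is hypothetical and only models the append flag; letting a new value win when a key already exists is an assumption, not documented behaviour.

```python
def merge_static_cov(existing, new, append=True):
    """Return the static covariates after applying `new`."""
    if append and existing is not None:
        # Assumption: on a key clash the new value overrides the old one.
        return {**existing, **new}
    return dict(new)


current = {"site": "north", "capacity": 100}
appended = merge_static_cov(current, {"capacity": 120, "zone": 3}, append=True)
replaced = merge_static_cov(current, {"zone": 3}, append=False)
```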
- property target: Optional[TimeSeries]
Returns: TimeSeries|None: target
- property label: Optional[TimeSeries]
Returns: TimeSeries|None: target
- property observed_cov: Optional[TimeSeries]
Returns: TimeSeries|None: observed_cov
- property feature: Optional[TimeSeries]
Returns: TimeSeries|None: observed_cov
- property known_cov: Optional[TimeSeries]
Returns: TimeSeries|None: known_cov
- property static_cov: Optional[dict]
Returns: dict|None: static_cov
- split(split_point: Union[Timestamp, str, float, int], after=True) Tuple[TSDataset, TSDataset][source]
Split the TSDataset object into two TSDataset objects according to split_point; only valid when the target is not None
- Parameters
split_point (pd.Timestamp|str|float|int) –
Where to split the TSDataset, which could be
pd.Timestamp|str: Only valid when the type of time_index is pd.DatetimeIndex; a str is first converted to a pandas datetime
float: The proportion of the length of the first TSDataset object
int: Only valid when the type of time_index is pd.RangeIndex
If the data at split_point exists, it will be included in the first TSDataset object.
after (bool) – If split_point (pd.Timestamp) doesn’t exist in the time column, use the next valid index (True) or the previous one (False)
- Returns
Tuple[“TSDataset”, “TSDataset”]
- Raises
ValueError –
TypeError –
- get_item_from_column(column: Union[str, int]) Union[TimeSeries, dict][source]
Get the underlying TimeSeries object for targets, observed covariates, and known covariates, or the dict for static_cov, according to the column name
- Parameters
column (str) – column name
- Returns
Union[“TimeSeries”, dict]
- Raises
ValueError –
- set_column(column: Union[str, int], value: Union[Series, str, int], type: str = 'known_cov')[source]
Add a new column or update the existing column
- Parameters
column (str|int) – column name
value (pd.Series|str|int) – New column values. When value=pd.Series, its index must be same as the index of the TSDataset object. When type=’static_cov’, value can only be int or str.
type (str) – Only effective when adding a new column, where to put the new column. By default, the new column will be added to known_cov.
- Returns
None
- Raises
ValueError –
- drop(columns: Union[str, int, List[Union[str, int]]])[source]
Drop column or columns
- Parameters
columns (str|int|List) – Column name or column names
- Returns
None
- Raises
ValueError –
- plot(columns: Union[List[str], str] = None, add_data: Union[List[TSDataset], TSDataset] = None, labels: Union[List[str], str] = None, low_quantile: float = 0.05, high_quantile: float = 0.95, central_quantile: float = 0.5, **kwargs) pyplot[source]
Plot function, a wrapper for DataFrame.plot()
- Parameters
columns (str|List) – The names of the columns to plot. When columns is None, the targets are plotted by default.
add_data (List|TSDataset) – Additional datasets to plot jointly; the default is None
labels (str|List) – Custom labels; the length should equal the number of added datasets.
central_quantile (float) – The quantile (between 0 and 1) to plot as a “central” value. For instance, setting central_quantile=0.5 will plot the median of each component. (Only used when the dataset is a probabilistic forecasting output.)
low_quantile (float) – The quantile to use for the lower bound of the plotted confidence interval. Similar to central_quantile, this is applied to each component separately (i.e., displaying marginal distributions). No confidence interval is shown if low_quantile is None (default 0.05). (Only used when the dataset is a probabilistic forecasting output.)
high_quantile (float) – The quantile to use for the upper bound of the plotted confidence interval. Similar to central_quantile, this is applied to each component separately (i.e., displaying marginal distributions). No confidence interval is shown if high_quantile is None (default 0.95). (Only used when the dataset is a probabilistic forecasting output.)
**kwargs – Optional arguments passed to the DataFrame.plot function
- Returns
matplotlib.pyplot object
- Raises
ValueError –
- classmethod load(file: str) TSDataset[source]
Load TSDataset from the saved file
- Parameters
file (str) – file path
- Returns
TSDataset
- classmethod load_from_json(json_data: str, **json_load_kwargs) TSDataset[source]
Construct a TSDataset object from a str json_data
- Parameters
json_data (str) – json object from which to load data
**json_load_kwargs – Optional arguments passed to json.loads function
- Returns
TSDataset
- property columns: dict
Return all columns (except static columns)
- Returns
The key is the column name, and the value is the type, including target, known_cov, and observed_cov
- Return type
dict
- property freq
Frequency of TSDataset
- classmethod concat(tss: List[TSDataset], axis: int = 0, drop_duplicates=True, keep='first') TSDataset[source]
Concatenate a list of TSDataset objects along the specified axis
- Parameters
tss (list[TSDataset]) – A list of TSDataset objects. All TSDatasets’ freqs are required to be consistent. When axis=1, time_col is required to be non-repetitive; when axis=0, all columns are required to be non-repetitive
axis (int) – The axis along which to concatenate the TSDataset objects
drop_duplicates (bool) – Drop duplicate indices.
keep (str) – Which duplicate to keep, ‘first’ or ‘last’, when dropping duplicates.
- Returns
TSDataset
- Raises
ValueError –
- astype(dtype: Union[dtype, type, Dict[str, Union[dtype, type]]])[source]
Cast a TSDataset object to the specified dtype
- Parameters
dtype (Union[np.dtype, type, Dict[str, Union[np.dtype, type]]]) – Use a numpy.dtype or Python type to cast the entire TSDataset object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- Raises
TypeError –
KeyError –
- property dtypes: Series
Get dtypes of target, known_covs, observed_covs
- Returns
<column name, dtype>
- Return type
pd.Series
- sort_columns(ascending: bool = True)[source]
Sort the TSDataset object by the index
- Parameters
ascending (bool) – Ascending or descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.