Dataset
TSDataset
TSDataset is the fundamental data class in PaddleTS, designed as a first-class citizen for representing time series data. It is used throughout PaddleTS: in many cases, a function consumes a TSDataset and produces another TSDataset. A TSDataset object comprises two kinds of time series data:
Target: the key time series in a time series modeling task (e.g. the series to be forecast in a forecasting task).
Covariate: related time series that are usually helpful for the modeling task.
Currently, it supports the representation of:
Time series of a single target, with or without covariates.
Time series of multiple targets, with or without covariates.
Covariates fall into one of the following three types:
- Observed covariates (observed_cov):
variables that can only be observed in historical data, e.g. measured temperatures
- Known covariates (known_cov):
variables whose values for future time steps can already be determined at present, e.g. weather forecasts
- Static covariates (static_cov):
variables that stay constant over time
A TSDataset object includes one or more TimeSeries objects, representing targets, known covariates (known_cov), observed covariates (observed_cov), and static covariates (static_cov), respectively.
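To make the difference between observed and known covariates concrete, the following sketch models the time range each kind of series typically covers. It uses plain Python rather than PaddleTS, and the names `hourly_index`, `covers_horizon`, and `horizon` are illustrative assumptions, not part of the PaddleTS API.

```python
from datetime import datetime, timedelta

def hourly_index(start, periods):
    """Return `periods` consecutive hourly timestamps starting at `start`."""
    return [start + timedelta(hours=i) for i in range(periods)]

start = datetime(2022, 1, 1)
target_index = hourly_index(start, 200)        # history of the target
observed_cov_index = hourly_index(start, 200)  # only measurable in the past
known_cov_index = hourly_index(start, 300)     # also available for future steps

def covers_horizon(cov_index, target_index, horizon):
    """Check that a covariate extends `horizon` hourly steps past the last target timestamp."""
    needed_end = target_index[-1] + timedelta(hours=horizon)
    return cov_index[-1] >= needed_end

horizon = 24  # hypothetical forecast length, in hours
print(covers_horizon(known_cov_index, target_index, horizon))     # True
print(covers_horizon(observed_cov_index, target_index, horizon))  # False
```

A known covariate that reaches past the end of the target can inform predictions for those future steps, whereas an observed covariate ends where the history ends.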
TimeSeries
TimeSeries is the atomic data structure for representing target(s), observed covariates (observed_cov), and known covariates (known_cov).
Each may comprise one or more time series.
A TimeSeries needs to be converted to a TSDataset before it is used in PaddleTS.
Examples
Build TSDataset
Building a TSDataset that contains only a target, from a DataFrame or CSV file
import pandas as pd
import numpy as np
from paddlets import TSDataset
x = np.linspace(-np.pi, np.pi, 200)
sinx = np.sin(x) * 4 + np.random.randn(200)
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=200, freq='1h'),
'value': sinx
}
)
target_dataset = TSDataset.load_from_dataframe(
df,  # can also be the path to a CSV file
time_col='time_col',
target_cols='value',
freq='1h'
)
target_dataset.plot()

Building a TSDataset that contains a target and covariates:
Option 1:
import pandas as pd
from paddlets import TSDataset
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=200, freq='1h'),
'value': sinx,
'known_cov_1': sinx + 4,
'known_cov_2': sinx + 5,
'observed_cov': sinx + 8,
'static_cov': [1 for i in range(200)],
}
)
target_cov_dataset = TSDataset.load_from_dataframe(
df,
time_col='time_col',
target_cols='value',
known_cov_cols=['known_cov_1', 'known_cov_2'],
observed_cov_cols='observed_cov',
static_cov_cols='static_cov',
freq='1h'
)
target_cov_dataset.plot(['value', 'known_cov_1', 'known_cov_2', 'observed_cov'])

Option 2:
import numpy as np
import pandas as pd
from paddlets import TSDataset
# Reuses sinx and target_dataset from the target-only example above
x_l = np.linspace(-np.pi, np.pi, 300)
sinx_l = np.sin(x_l) * 4 + np.random.randn(300)
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=300, freq='1h'),
'known_cov_1': sinx_l + 4,
'known_cov_2': sinx_l + 5
}
)
known_cov_dataset = TSDataset.load_from_dataframe(
df,
time_col='time_col',
known_cov_cols=['known_cov_1', 'known_cov_2'],
freq='1h'
)
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=200, freq='1h'),
'observed_cov': sinx + 8
}
)
observed_cov_dataset = TSDataset.load_from_dataframe(
df,
time_col='time_col',
observed_cov_cols='observed_cov',
freq='1h'
)
target_cov_dataset = TSDataset.concat([target_dataset, known_cov_dataset, observed_cov_dataset])
target_cov_dataset.plot(['value', 'known_cov_1', 'known_cov_2', 'observed_cov'])

Option 3:
import pandas as pd
from paddlets import TSDataset
from paddlets import TimeSeries
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=300, freq='1h'),
'known_cov_1': sinx_l + 4,
'known_cov_2': sinx_l + 5,
}
)
known_cov_dataset = TimeSeries.load_from_dataframe(
df,
time_col='time_col',
value_cols=['known_cov_1', 'known_cov_2'],
freq='1h'
)
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=200, freq='1h'),
'observed_cov': sinx + 8
}
)
observed_cov_dataset = TimeSeries.load_from_dataframe(
df,
time_col='time_col',
value_cols='observed_cov',
freq='1h'
)
target_cov_dataset = target_dataset.copy()
target_cov_dataset.known_cov = known_cov_dataset
target_cov_dataset.observed_cov = observed_cov_dataset
target_cov_dataset.plot(['value', 'known_cov_1', 'known_cov_2', 'observed_cov'])

If the original data has missing values, they can be filled during loading with load_from_dataframe.
Seven fill methods are available: max, min, avg, median, pre, back, and zero.
import pandas as pd
import numpy as np
from paddlets import TSDataset
df = pd.DataFrame(
{
'time_col': pd.date_range('2022-01-01', periods=200, freq='1h'),
'value': sinx,
'known_cov_1': sinx + 4,
'known_cov_2': sinx + 5,
'observed_cov': sinx + 8,
'static_cov': [1 for i in range(200)],
}
)
df.loc[1, 'value'] = np.nan
target_cov_dataset = TSDataset.load_from_dataframe(
df,
time_col='time_col',
target_cols='value',
known_cov_cols=['known_cov_1', 'known_cov_2'],
observed_cov_cols='observed_cov',
static_cov_cols='static_cov',
freq='1h',
fill_missing_dates=True,
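fillna_method='pre'  # one of: max, min, avg, median, pre, back, zero
)
print(target_cov_dataset['value'][1])
# Prints the filled value instead of NaN; 'zero' would give 0.0, while 'pre' is expected to reuse the preceding observation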
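The seven method names suggest the semantics sketched below on a small list. This is an assumption based on the names alone, not PaddleTS's implementation: `fill_missing` is a hypothetical helper, and it computes max, min, avg, and median over the whole series, whereas a library may well use a window instead.

```python
import math
import statistics

def fill_missing(values, method):
    """Fill NaN entries of a list according to `method` (illustrative only)."""
    present = [v for v in values if not math.isnan(v)]
    constants = {
        'max': max(present),
        'min': min(present),
        'avg': statistics.mean(present),
        'median': statistics.median(present),
        'zero': 0.0,
    }
    filled = list(values)
    if method in constants:
        # Replace every NaN with one constant derived from the observed values
        return [constants[method] if math.isnan(v) else v for v in filled]
    if method == 'pre':
        # Copy the previous observation forward
        for i in range(1, len(filled)):
            if math.isnan(filled[i]):
                filled[i] = filled[i - 1]
        return filled
    if method == 'back':
        # Copy the next observation backward
        for i in range(len(filled) - 2, -1, -1):
            if math.isnan(filled[i]):
                filled[i] = filled[i + 1]
        return filled
    raise ValueError(f'unknown method: {method}')

series = [1.0, float('nan'), 3.0, 5.0]
for name in ['max', 'min', 'avg', 'median', 'pre', 'back', 'zero']:
    print(name, fill_missing(series, name))
```

Note that in this sketch 'pre' cannot fill a gap at the very first position and 'back' cannot fill one at the very last, since there is no neighbour to copy from.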
Data Exploration
Plot TSDataset.
target_cov_dataset.plot(['value'])

To get the summary statistics of TSDataset, simply call TSDataset.summary.
target_cov_dataset.summary()

Creating the training, validation, and testing datasets
train_dataset, val_test_dataset = target_cov_dataset.split(0.8)
val_dataset, test_dataset = val_test_dataset.split(0.5)
train_dataset.plot(add_data=[val_dataset,test_dataset])
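Splitting at 0.8 and then splitting the remainder at 0.5 gives roughly an 80/10/10 partition in time order. The sketch below reproduces that arithmetic on a plain list. `split_by_ratio` is a hypothetical helper, and the way it truncates the split point is an assumption rather than PaddleTS's documented rounding rule.

```python
def split_by_ratio(items, ratio):
    """Split a sequence in order: the first `ratio` share, then the rest."""
    cut = int(len(items) * ratio)
    return items[:cut], items[cut:]

points = list(range(200))  # stands in for 200 hourly observations
train, val_test = split_by_ratio(points, 0.8)
val, test = split_by_ratio(val_test, 0.5)
print(len(train), len(val), len(test))  # 160 20 20
```

Because the split follows the original order, every training point precedes every validation point, which in turn precedes every test point.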

Add columns
train_dataset
new_line = pd.Series(
np.array(range(200)),
index=pd.date_range('2022-01-01', periods=200, freq='1h')  # same hourly index as the dataset
)
## Option 1:
## column: the name of the new column, which must not already exist in the TSDataset's columns
## value: the new data, of type pd.Series
## type: the TimeSeries in which to put the new column, known_cov by default
## The index of value must be the same as the index of the TSDataset object
target_cov_dataset.set_column(
column='new_b',
value=new_line,
type='observed_cov'
)
## Option 2:
## Equivalent to option 1 with the default type
target_cov_dataset['new_b'] = new_line
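Because the new values must share the dataset's index, it can help to compare the two indexes before assigning. The following plain-Python sketch shows why an hourly index and a daily index of the same length do not match. `make_index` is a hypothetical helper, not a PaddleTS or pandas function.

```python
from datetime import datetime, timedelta

def make_index(start, periods, step):
    """Build `periods` timestamps spaced `step` apart."""
    return [start + i * step for i in range(periods)]

start = datetime(2022, 1, 1)
dataset_index = make_index(start, 200, timedelta(hours=1))
hourly_values_index = make_index(start, 200, timedelta(hours=1))
daily_values_index = make_index(start, 200, timedelta(days=1))

print(hourly_values_index == dataset_index)  # True: same timestamps
print(daily_values_index == dataset_index)   # False: same length, different timestamps
```

Matching lengths alone are not enough; the timestamps themselves have to agree.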
Update columns
## Option 1:
## column: the name of the column to update, which must already exist in the TSDataset's columns
## value: the new data, of type pd.Series
## The index of value must be the same as the index of the TSDataset object
target_cov_dataset.set_column(
column='observed_cov',
value=new_line
)
## Option 2:
## Equivalent to option 1
target_cov_dataset['observed_cov'] = new_line
Delete columns
## Delete a single column
target_cov_dataset.drop('new_b')
## Delete multiple columns
target_cov_dataset.drop(['known_cov_1', 'new_b'])
Get columns
## Get a single column
column = target_cov_dataset['known_cov_2']  # The type of column is pd.Series
## Get multiple columns
columns = target_cov_dataset[['known_cov_2', 'observed_cov']]  # The type of columns is pd.DataFrame
Get data type
dtypes = target_cov_dataset.dtypes
print(dtypes)
value int64
known_cov_1 int64
known_cov_2 int64
observed_cov int64
dtype: object
Modify dtype
target_cov_dataset.astype('float32')
dtypes = target_cov_dataset.dtypes
print(dtypes)
value float32
known_cov_1 float32
known_cov_2 float32
observed_cov float32
dtype: object
Get column names
columns = target_cov_dataset.columns
print(columns)
#{'value': 'target', 'known_cov_1': 'known_cov', 'known_cov_2': 'known_cov', 'observed_cov': 'observed_cov'}
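The printed mapping goes from column name to series type. Inverting it gives the columns that belong to each type, as in this plain-Python sketch. The dictionary literal is copied from the output shown here, and `columns_by_type` is an illustrative name rather than a PaddleTS attribute.

```python
from collections import defaultdict

# Mapping of column name to series type, as printed by `columns`
columns = {
    'value': 'target',
    'known_cov_1': 'known_cov',
    'known_cov_2': 'known_cov',
    'observed_cov': 'observed_cov',
}

# Group the column names by the series type they belong to
columns_by_type = defaultdict(list)
for name, series_type in columns.items():
    columns_by_type[series_type].append(name)

print(dict(columns_by_type))
```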