AutoTS

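AutoTS is the component that provides automated machine learning capabilities for PaddleTS.

AutoTS supports automatic hyperparameter selection for PaddleTS models and pipelines, which reduces the cost of manual intervention and lowers the expertise required.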

1. Installation

pip install "paddlets[autots]"

2. Quick Start

2.1. Preparing the Data

from paddlets.datasets.repository import get_dataset
tsdataset = get_dataset("UNI_WTH")
train_tsdataset, valid_tsdataset = tsdataset.split(0.3)

2.2. Building and Training

With four lines of code, we initialize an AutoTS model around MLPRegressor. AutoTS automatically performs hyperparameter optimization during training.

from paddlets.models.forecasting import MLPRegressor
from paddlets.automl.autots import AutoTS
autots_model = AutoTS(MLPRegressor, 96, 24, sampling_stride=24)
autots_model.fit(train_tsdataset, valid_tsdataset)
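
To make "automatic hyperparameter optimization" more concrete, the following is a minimal, library-free sketch of the loop a tuner conceptually runs: sample a configuration, train and score it on validation data, and keep the best one. It is not how AutoTS is implemented. The names sample_config, toy_validation_loss, and random_search are hypothetical, the search space is made up, and plain random sampling stands in for whichever search algorithm is configured.

```python
import random


def sample_config(rng):
    """Draw one hyperparameter combination from a toy search space."""
    return {
        "batch_size": rng.choice([16, 32, 48, 64]),
        "learning_rate": rng.uniform(0.0001, 0.01),
    }


def toy_validation_loss(config):
    """Stand-in for 'train the model, then score it on the validation set'."""
    # Pretend the ideal settings are batch_size=32 and learning_rate=0.005.
    batch_penalty = abs(config["batch_size"] - 32) / 32
    lr_penalty = abs(config["learning_rate"] - 0.005) * 100
    return batch_penalty + lr_penalty


def random_search(n_trials=20, seed=0):
    """Run n_trials independent trials and return the best (config, loss)."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = toy_validation_loss(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search()
print(best_config, round(best_loss, 4))
```

Each pass through the loop corresponds to one trial. A real tuner replaces toy_validation_loss with actual training and evaluation, and usually samples more cleverly than uniformly at random.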

2.3. Persistence

Although AutoTS itself does not support persistence, you can save the best estimator that AutoTS finds during hyperparameter optimization.

# Method 1: fit() returns the best estimator found during the search.
best_estimator = autots_model.fit(train_tsdataset, valid_tsdataset)
best_estimator.save(path="./autots_best_estimator_m1")

# Method 2: retrieve the best estimator from an AutoTS instance that has already been fitted.
best_estimator = autots_model.best_estimator()
best_estimator.save(path="./autots_best_estimator_m2")

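3. Search Space

3.1. Running with a Specified Search Space

For hyperparameter optimization, you can define a search space. If you do not specify one, PaddleTS falls back to the recommended default search space that is built in for each PaddleTS model.

The search space controls how the values of each hyperparameter are sampled and what range those values may take.

Below is an example of an AutoTS pipeline with a user-specified search space: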

from ray.tune import uniform, qrandint, choice
from paddlets.transform import Fill
from paddlets.models.forecasting import MLPRegressor
from paddlets.automl.autots import AutoTS

sp = {
    "Fill": {
        "cols": ['WetBulbCelsius'],
        "method": choice(['max', 'min', 'mean', 'median', 'pre', 'next', 'zero']),
        "value": uniform(0.1, 0.9),
        "window_size": qrandint(20, 50, q=1)
    },
    "MLPRegressor": {
        "batch_size": qrandint(16, 64, q=16),
        "use_bn": choice([True, False]),
        "max_epochs": qrandint(10, 50, q=10)
    }
}
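# Tune the Fill transform and the MLPRegressor model together as one pipeline.
autots_model = AutoTS([Fill, MLPRegressor], 96, 24, search_space=sp, sampling_stride=24)
autots_model.fit(tsdataset)
# Read back the search space held by the AutoTS instance.
sp = autots_model.search_space()
predicted = autots_model.predict(tsdataset)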

For the search space definition API, see: https://docs.ray.io/en/latest/tune/api_docs/search_space.html
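
The sampling primitives used in the example can be pictured with the standard-library stand-ins below. This is a sketch of their behaviour as commonly documented: uniform draws a float from a range, choice picks one item from a list, and qrandint draws an integer with both bounds inclusive and rounds it to a multiple of q. It does not call Ray, the toy_* names and sample_space are hypothetical, and the rounding and clipping details are an approximation rather than Ray's exact logic.

```python
import random


def toy_uniform(rng, lower, upper):
    """Float drawn uniformly from [lower, upper]."""
    return rng.uniform(lower, upper)


def toy_choice(rng, categories):
    """One element picked uniformly from a list of options."""
    return rng.choice(categories)


def toy_qrandint(rng, lower, upper, q=1):
    """Integer from [lower, upper] (both inclusive), rounded to a multiple of q."""
    value = rng.randint(lower, upper)
    quantized = int(round(value / q)) * q
    # Clip so that rounding can never leave the requested range.
    return min(max(quantized, lower), upper)


def sample_space(rng, space):
    """Resolve a nested dict: call samplers, recurse into dicts, keep constants."""
    resolved = {}
    for key, spec in space.items():
        if callable(spec):
            resolved[key] = spec(rng)
        elif isinstance(spec, dict):
            resolved[key] = sample_space(rng, spec)
        else:
            resolved[key] = spec
    return resolved


toy_space = {
    "Fill": {
        "cols": ["WetBulbCelsius"],  # constant, never sampled
        "method": lambda rng: toy_choice(rng, ["max", "min", "mean"]),
        "value": lambda rng: toy_uniform(rng, 0.1, 0.9),
    },
    "MLPRegressor": {
        "batch_size": lambda rng: toy_qrandint(rng, 16, 64, q=16),
        "use_bn": lambda rng: toy_choice(rng, [True, False]),
    },
}

print(sample_space(random.Random(0), toy_space))
```

Each call to sample_space yields one concrete hyperparameter combination, which is roughly what a single trial receives.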

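3.2. Getting the Built-in Search Space with SearchSpaceConfiger

To make AutoTS easier to use, we provide SearchSpaceConfiger, which holds the built-in default hyperparameter search spaces for PaddleTS models.

The algorithms supported by SearchSpaceConfiger are: ["MLPRegressor", "RNNBlockRegressor", "NBEATSModel", "NHiTSModel", "LSTNetRegressor", "TransformerModel", "TCNRegressor", "InformerModel", "DeepARModel"]

Getting the hyperparameter search space as a string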

>>> from paddlets.automl.autots import SearchSpaceConfiger
>>> from paddlets.models.forecasting import MLPRegressor
>>> configer = SearchSpaceConfiger()
>>> sp_str = configer.recommend(MLPRegressor)
>>> print(sp_str)
The recommended search space are as follows:
=======================================================
from ray.tune import uniform, quniform, loguniform, qloguniform, randn, qrandn, randint, qrandint, lograndint, qlograndint, choice
recommended_sp = {
        "hidden_config": choice([[64], [64, 64], [64, 64, 64], [128], [128, 128], [128, 128, 128]]),
        "use_bn": choice([True, False]),
        "batch_size": qrandint(8, 128, q=8),
        "max_epochs": qrandint(30, 600, q=30),
        "optimizer_params": {
                "learning_rate": uniform(0.0001, 0.01)
        },
        "patience": qrandint(5, 50, q=5)
}
=====================================================
Please note that the **USER_DEFINED_SEARCH_SPACE** parameters need to be set by the user

Getting the hyperparameter search space as a dict

>>> from pprint import pprint
>>> sp_dict = configer.get_default_search_space(MLPRegressor)
>>> pprint(sp_dict)
{'batch_size': <ray.tune.sample.Integer object at 0x7f88bef520a0>,
 'hidden_config': <ray.tune.sample.Categorical object at 0x7f88bef45fd0>,
 'max_epochs': <ray.tune.sample.Integer object at 0x7f88bef52e80>,
 'optimizer_params': {'learning_rate': <ray.tune.sample.Float object at 0x7f88bef521c0>},
 'patience': <ray.tune.sample.Integer object at 0x7f88bef52070>,
 'use_bn': <ray.tune.sample.Categorical object at 0x7f88bef52250>}
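
Because the default search space is returned as an ordinary, possibly nested dict, one way to customise it is to copy it and override only the entries you care about before passing the result as search_space. The helper below is a hypothetical sketch of that idea: override_search_space is not part of PaddleTS, and plain placeholder strings are used in place of Ray sampler objects so that the example runs without any dependencies.

```python
import copy


def override_search_space(default_space, overrides):
    """Return a copy of default_space with overrides merged in recursively."""
    merged = copy.deepcopy(default_space)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections such as "optimizer_params" key by key.
            merged[key] = override_search_space(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder values stand in for sampler objects such as qrandint(...).
default_space = {
    "batch_size": "qrandint(8, 128, q=8)",
    "max_epochs": "qrandint(30, 600, q=30)",
    "optimizer_params": {"learning_rate": "uniform(0.0001, 0.01)"},
}
custom_space = override_search_space(
    default_space,
    {"max_epochs": "qrandint(10, 50, q=10)",
     "optimizer_params": {"learning_rate": "uniform(0.001, 0.01)"}},
)
print(custom_space)
```

The original dict is left untouched, so the default search space can still be reused for another run.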

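4. Search Algorithms

The search algorithms in PaddleTS are wrappers around several open-source hyperparameter optimization libraries.

The following search algorithms are built in: ["Random", "CMAES", "TPE", "CFO", "BlendSearch", "Bayes"]

For details of these hyperparameter optimization libraries, refer to their respective open-source documentation.

You can specify the search algorithm as follows: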

autots_model = AutoTS(MLPRegressor, 96, 2, search_alg="CMAES")

If no search algorithm is specified, TPE is used by default.

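5. Parallelism and Compute Resource Management

AutoTS.fit() runs n_trials (default: 20) trials during hyperparameter optimization; that is, it samples n_trials hyperparameter combinations from the search space.

The degree of parallelism is affected by the cpu_resource, gpu_resource, and max_concurrent_trials parameters.

max_concurrent_trials (default: 1) caps the number of trials that run in parallel.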

# The trial counts below are upper bounds; max_concurrent_trials must be raised to reach them.
# With 4 CPUs on the machine and 2 CPUs reserved per trial, up to 2 trials run concurrently.
autots_model.fit(train_tsdataset, valid_tsdataset, cpu_resource=2)

# With 4 CPUs on the machine and 4 CPUs reserved per trial, 1 trial runs at a time.
autots_model.fit(train_tsdataset, valid_tsdataset, cpu_resource=4)

# Fractional values are also supported: with 4 CPUs, cpu_resource=0.5 allows up to 8 concurrent trials.
autots_model.fit(train_tsdataset, valid_tsdataset, cpu_resource=0.5)
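
The arithmetic behind these comments can be sketched as follows. This is an interpretation of how the three parameters interact rather than the scheduler's actual logic: it assumes each trial reserves cpu_resource CPUs and gpu_resource GPUs, and that max_concurrent_trials acts as a hard cap. The function name max_parallel_trials and its total_cpus and total_gpus arguments are hypothetical.

```python
import math


def max_parallel_trials(total_cpus, cpu_resource, max_concurrent_trials=1,
                        total_gpus=0, gpu_resource=0):
    """Estimate how many trials can run at once under the stated assumptions."""
    if cpu_resource <= 0:
        raise ValueError("cpu_resource must be positive")
    # How many trials fit into the available CPUs.
    limit = math.floor(total_cpus / cpu_resource)
    if gpu_resource > 0:
        # When trials reserve GPUs, the GPU count can be the tighter bound.
        limit = min(limit, math.floor(total_gpus / gpu_resource))
    # max_concurrent_trials caps whatever the hardware would allow.
    return max(0, min(limit, max_concurrent_trials))


print(max_parallel_trials(4, 2, max_concurrent_trials=8))    # 4 CPUs, 2 per trial
print(max_parallel_trials(4, 0.5, max_concurrent_trials=8))  # fractional CPUs
print(max_parallel_trials(4, 2))                             # default cap of 1
```

Under these assumptions, leaving max_concurrent_trials at its default of 1 keeps execution sequential regardless of how small cpu_resource is.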

5.1. How to Use GPUs

To use GPUs, set gpu_resource (default: 0) in AutoTS.fit() and set the CUDA_VISIBLE_DEVICES environment variable.

Note that if gpu_resource is left at its default of 0, GPUs will not be used.

import os
# With 8 GPUs visible and 1 GPU reserved per trial, up to 8 trials can run at once
# (subject to the available CPUs and max_concurrent_trials).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
autots_model.fit(train_tsdataset, valid_tsdataset, cpu_resource=1, gpu_resource=1)

# With 4 CPUs and 1 GPU on the machine, the single GPU limits execution to 1 trial at a time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
autots_model.fit(train_tsdataset, valid_tsdataset, cpu_resource=2, gpu_resource=1)

For details of the search space definition API, see: https://docs.ray.io/en/latest/tune/api_docs/search_space.html

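6. Logs and Temporary Files

The local_dir initialization parameter of AutoTS() specifies a directory in which result information from training is saved (default: ./, so results are saved to ./ray_results by default).

6.1. Temporary Files

AutoTS is built on Ray, and the location of Ray's temporary directory depends on the system, for example /tmp/ray. Because of a known Ray issue, AutoTS does not currently set Ray's temporary directory; setting it causes Ray to fail at startup. We will keep tracking this issue.

Users are responsible for cleaning up the temporary directory themselves.

Depending on the system, the temporary directory (for example, /tmp/ray) may be located under /tmp or /usr/tmp, or in another location determined by the user's system root directory. Its location can be changed with the RAY_TMPDIR or TMPDIR environment variable, and the system temporary directory can be read in code with tempfile.gettempdir().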
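
Before cleaning up, it can help to find out where the temporary directory is and how much space it occupies. The sketch below is one way to do that with the standard library. It assumes Ray places a ray subdirectory under the first of RAY_TMPDIR, TMPDIR, or the system temporary directory; that lookup order is an assumption and may differ between Ray versions and platforms. The function names guess_ray_temp_dir and dir_size_bytes are hypothetical, and the script only reports; it never deletes anything.

```python
import os
import tempfile


def guess_ray_temp_dir(environ=None):
    """Best-effort guess of Ray's temporary directory (assumed lookup order)."""
    env = os.environ if environ is None else environ
    base = env.get("RAY_TMPDIR") or env.get("TMPDIR") or tempfile.gettempdir()
    return os.path.join(base, "ray")


def dir_size_bytes(path):
    """Total size of regular files under path, skipping unreadable entries."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if os.path.isfile(file_path) and not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # The file vanished or is not accessible; ignore it.
                continue
    return total


ray_tmp = guess_ray_temp_dir()
if os.path.isdir(ray_tmp):
    print(f"{ray_tmp}: {dir_size_bytes(ray_tmp)} bytes")
else:
    print(f"{ray_tmp} does not exist")
```

Once the reported path has been confirmed and no Ray processes are still running, the directory can be removed manually.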