AutoFlow: an automatic machine learning workflow modeling platform.

AutoFlow: Automatic Machine Learning Workflow Modeling Platform

Introduction

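In data mining and machine learning problems on tabular data, data scientists usually group the features, construct a directed acyclic graph (DAG), and thereby form a machine learning workflow.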

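For each directed edge of this DAG, the tail node represents a feature group before preprocessing and the head node represents the feature group after preprocessing. Each edge represents a data processing or feature engineering algorithm, and both algorithm selection and hyperparameter optimization are performed on every edge.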
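The structure described above can be sketched in a few lines of plain Python. This is only an illustration, not AutoFlow's internal representation: the group names (`nan`, `imputed`, `cat`, `num`, `target`) and the algorithm names are assumptions chosen for the example. Each key is a directed edge from a tail group to a head group, and each value lists the candidate algorithms for that edge.

```python
from collections import defaultdict, deque

# Hypothetical workflow: keys are (tail group, head group) edges,
# values are the candidate algorithms for that edge.
WORKFLOW = {
    ("nan", "imputed"): ["impute.mean", "impute.median"],
    ("imputed", "num"): ["scale.standardize", "scale.minmax"],
    ("cat", "num"): ["encode.one_hot", "encode.ordinal"],
    ("num", "target"): ["lightgbm", "random_forest"],
}


def topological_order(workflow):
    """Order feature groups so every tail precedes its head (Kahn's algorithm)."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = []
    for tail, head in workflow:
        for node in (tail, head):
            if node not in nodes:
                nodes.append(node)
        successors[tail].append(head)
        in_degree[head] += 1
    queue = deque(node for node in nodes if in_degree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


def count_pipelines(workflow):
    """Number of distinct pipelines when one algorithm is chosen per edge."""
    total = 1
    for candidates in workflow.values():
        total *= len(candidates)
    return total


order = topological_order(WORKFLOW)
n_pipelines = count_pipelines(WORKFLOW)
print(order)
print(n_pipelines)
```

Even this small graph yields many candidate pipelines before any hyperparameter is considered, which is why choosing them by hand scales poorly.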

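Unfortunately, manually selecting algorithms and hyperparameters for such a workflow is a very tedious task for a data scientist. To solve this problem, we developed AutoFlow, which automatically selects algorithms and optimizes the hyperparameters of a machine learning workflow. In other words, it implements AutoML for tabular data.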
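To make "selecting an algorithm and its hyperparameters automatically" concrete, here is a minimal sketch using plain random search. It is not AutoFlow's optimizer (the project builds on SMAC); the search space, the algorithm names, and the `toy_loss` function are all assumptions standing in for real model training and validation.

```python
import random

# Hypothetical search space: each algorithm has its own integer hyperparameter ranges.
SEARCH_SPACE = {
    "knn": {"n_neighbors": (1, 30)},
    "tree": {"max_depth": (1, 20)},
}


def sample_config(rng):
    """Pick an algorithm, then sample each of its hyperparameters uniformly."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    params = {
        name: rng.randint(low, high)
        for name, (low, high) in SEARCH_SPACE[algorithm].items()
    }
    return algorithm, params


def toy_loss(algorithm, params):
    """Stand-in for a validation loss; a real run would train and score a model."""
    if algorithm == "knn":
        return abs(params["n_neighbors"] - 7) / 30
    return 0.1 + abs(params["max_depth"] - 5) / 20


def random_search(n_runs, seed=0):
    """Evaluate n_runs random configurations and return the best (loss, algorithm, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        algorithm, params = sample_config(rng)
        loss = toy_loss(algorithm, params)
        if best is None or loss < best[0]:
            best = (loss, algorithm, params)
    return best


best_loss, best_algorithm, best_params = random_search(n_runs=50)
print(best_algorithm, best_params, round(best_loss, 3))
```

A model-based optimizer replaces the uniform sampling with proposals informed by earlier results, but the loop of "propose a configuration, evaluate it, keep the best" stays the same.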

[Figure: workflow space (docs/images/workflow_space.png)]

Documentation

The documentation can be found here.

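Installation

Requirements

This project is built and tested on Linux, so a Linux platform is required. If you are using Windows, WSL is worth considering.

In addition to the listed requirements (see requirements.txt), the random forest used in SMAC3 requires SWIG (>= 3.0, < 4.0) as a build dependency. If you are using Ubuntu or another Debian-based Linux distribution, you can run the following command: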

apt-get install swig

On Arch Linux (or any distribution with swig4 as the default implementation):

pacman -Syu swig3
ln -s /usr/bin/swig-3 /usr/bin/swig

AutoFlow requires Python 3.6 or higher.
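The requirements above can be checked before installing. The following is a sketch, not part of AutoFlow: the helper names are hypothetical, and it assumes that `swig -version` prints a line of the form `SWIG Version X.Y.Z`.

```python
import re
import shutil
import subprocess
import sys


def parse_swig_version(text):
    """Extract (major, minor, patch) from `swig -version` output; None if absent."""
    match = re.search(r"SWIG Version (\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def swig_version_ok(version):
    """True when the version satisfies the stated constraint >= 3.0 and < 4.0."""
    return version is not None and (3, 0) <= version[:2] < (4, 0)


def installed_swig_version():
    """Return the version of the `swig` found on PATH, or None if it is missing."""
    if shutil.which("swig") is None:
        return None
    result = subprocess.run(
        ["swig", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_swig_version(result.stdout)


python_ok = sys.version_info >= (3, 6)
swig_version = installed_swig_version()
print("Python >= 3.6:", python_ok)
print("SWIG version:", swig_version, "| usable:", swig_version_ok(swig_version))
```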

Installation via pip

pip install auto-flow

Manual installation

git clone https://github.com/auto-flow/autoflow.git && cd autoflow
python setup.py install

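Quick start

Titanic is perhaps the machine learning task most familiar to data scientists. For tutorial purposes, the Titanic dataset is provided in examples/data/train_classification.csv and examples/data/test_classification.csv. Instead of manually exploring all the features of the dataset, you can use AutoFlow to complete this ML task. Let's try it!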

$ cd examples/classification

import os

import joblib
import pandas as pd
from sklearn.model_selection import KFold

from autoflow import AutoFlowClassifier

# load data from csv file
train_df = pd.read_csv("../data/train_classification.csv")
test_df = pd.read_csv("../data/test_classification.csv")
# initial_runs  -- number of initial runs; these are pure random search and give the SMAC algorithm experience to start from.
# run_limit     -- the maximum number of runs.
# n_jobs        -- how many search processes are started.
# included_classifiers -- restricts the search space; here lightgbm is the only classifier considered.
# per_run_time_limit -- time limit per run in seconds; a trial that runs longer than 60 seconds is expired and killed.
trained_pipeline = AutoFlowClassifier(initial_runs=5, run_limit=10, n_jobs=1, included_classifiers=["lightgbm"],
                                    per_run_time_limit=60)
# describe the meaning of the columns; `id`, `target` and `ignore` each have a specific meaning:
# `id` is the column that uniquely identifies each row,
# `target` is the column your model will learn to predict,
# `ignore` names columns that contain irrelevant information
column_descriptions = {
    "id": "PassengerId",
    "target": "Survived",
    "ignore": "Name"
}
if not os.path.exists("autoflow_classification.bz2"):
    # pass `train_df`, `test_df` and `column_descriptions` to the classifier;
    # if `fit_ensemble_params` is set to "auto", a stacking ensemble is used (it is disabled here with False);
    # `splitter` is the train/validation splitter, set here to 3-fold cross-validation
    trained_pipeline.fit(
        X_train=train_df, X_test=test_df, column_descriptions=column_descriptions,
        fit_ensemble_params=False,
        splitter=KFold(n_splits=3, shuffle=True, random_state=42),
    )
    # finally, the best model is serialized and stored on the local file system for later use
    joblib.dump(trained_pipeline, "autoflow_classification.bz2")
    # to see the workflow space AutoFlow is searching, visualize it with `draw_workflow_space`
    hdl_constructor = trained_pipeline.hdl_constructors[0]
    hdl_constructor.draw_workflow_space()
# for prediction, first load the serialized model from the file system
predict_pipeline = joblib.load("autoflow_classification.bz2")
# then use the loaded model to predict
result = predict_pipeline.predict(test_df)
print(result)
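The role of `column_descriptions` in the example above can be illustrated with a small helper that partitions column names by role. This is a hypothetical sketch, not AutoFlow's implementation: `split_columns` is an invented name, and the column list is an assumption based on commonly used Titanic column names, so the example CSV files may differ.

```python
def split_columns(columns, column_descriptions):
    """Partition column names into id, target, ignored and feature columns."""

    def as_list(value):
        # Accept a single column name or a list of names.
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    roles = {
        role: as_list(column_descriptions.get(role))
        for role in ("id", "target", "ignore")
    }
    reserved = {name for names in roles.values() for name in names}
    missing = sorted(reserved - set(columns))
    if missing:
        raise KeyError("columns not found: {}".format(missing))
    # Everything without a special role is treated as a feature.
    roles["feature"] = [name for name in columns if name not in reserved]
    return roles


# Assumed column names, for illustration only.
columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "Fare"]
column_descriptions = {
    "id": "PassengerId",
    "target": "Survived",
    "ignore": "Name",
}
roles = split_columns(columns, column_descriptions)
print(roles["feature"])
```

Under these assumptions, only the remaining columns (here `Pclass`, `Sex`, `Age` and `Fare`) would be passed on as features.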

Download files

Source distribution

auto-flow-0.1.1.tar.gz (193.2 kB)