auto-flow - AutoFlow: an automatic machine learning workflow modeling platform.

Project description

AutoFlow: Automatic Machine Learning Workflow Modeling Platform

Introduction

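In data mining and machine learning problems on tabular data, data scientists usually group the features, construct a directed acyclic graph (DAG) over those groups, and so form a machine learning workflow.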

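In each directed edge of this DAG, the tail node represents a feature group before preprocessing and the head node represents the feature group after preprocessing. Each edge represents a data processing or feature engineering algorithm, and algorithm selection and hyperparameter optimization are performed on every edge.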
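
To make the graph structure concrete, here is a minimal sketch in plain Python. It is not AutoFlow's internal representation: the feature group names, the algorithm names, and both helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical workflow: nodes are feature groups, edges carry candidate algorithms.
# Each edge runs from the tail (before preprocessing) to the head (after preprocessing).
EDGES = [
    ("num_with_nan", "num", ["impute.mean", "impute.median"]),
    ("cat_with_nan", "cat", ["impute.most_frequent"]),
    ("cat", "num", ["encode.one_hot", "encode.ordinal"]),
    ("num", "target", ["lightgbm", "random_forest"]),
]


def topological_order(edges):
    """Return the feature groups so that every tail precedes its head (Kahn's algorithm)."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = []
    for tail, head, _ in edges:
        for node in (tail, head):
            if node not in nodes:
                nodes.append(node)
        successors[tail].append(head)
        in_degree[head] += 1
    ready = [n for n in nodes if in_degree[n] == 0]
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


def count_workflows(edges):
    """Number of distinct workflows when one algorithm is chosen per edge."""
    total = 1
    for _, _, algorithms in edges:
        total *= len(algorithms)
    return total
```

Even this small graph yields several distinct workflows before any hyperparameters are considered, which hints at why choosing them by hand does not scale.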

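Unfortunately, manually selecting algorithms and hyperparameters for such a workflow is a very tedious task. To solve this problem, we developed Hyperflow, which automatically selects algorithms and optimizes the parameters of a machine learning workflow. In other words, it implements AutoML for tabular data.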
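
The idea of searching over algorithms and their hyperparameters together can be sketched with plain random search. This is only an illustration and not the project's optimizer: the search space, the parameter ranges, and the `toy_loss` function are assumptions standing in for real model training and validation.

```python
import random

# Hypothetical search space: for each algorithm, the range of one hyperparameter.
SEARCH_SPACE = {
    "lightgbm": {"learning_rate": (0.01, 0.3)},
    "random_forest": {"max_features": (0.1, 1.0)},
}


def sample_configuration(rng):
    """Pick an algorithm, then draw its hyperparameters uniformly from their ranges."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in SEARCH_SPACE[algorithm].items()
    }
    return algorithm, params


def toy_loss(algorithm, params):
    """Stand-in for a validation loss; a real system would train and validate a model here."""
    if algorithm == "lightgbm":
        return (params["learning_rate"] - 0.1) ** 2
    return 0.05 + (params["max_features"] - 0.5) ** 2


def random_search(n_runs, seed=0):
    """Evaluate n_runs random configurations and return the best (loss, algorithm, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        algorithm, params = sample_configuration(rng)
        loss = toy_loss(algorithm, params)
        if best is None or loss < best[0]:
            best = (loss, algorithm, params)
    return best
```

Calling `random_search(10)` returns the best of ten sampled configurations. A model-based optimizer would replace the uniform sampling with proposals informed by earlier runs.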

Documentation

The documentation can be found here.

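Installation

Requirements

This project was built and tested on Linux, so a Linux platform is required. If you are using Windows, WSL is worth considering.

Besides the listed requirements (see requirements.txt), the random forest used in SMAC3 requires SWIG (>= 3.0, < 4.0) as a build dependency. If you are using Ubuntu or another Debian-based Linux, you can run the following command: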

apt-get install swig

On Arch Linux (or any distribution that ships swig4 as the default implementation):

pacman -Syu swig3
ln -s /usr/bin/swig-3 /usr/bin/swig

AutoFlow requires Python 3.6 or higher.
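
The requirements above can be checked before installing. The following is a minimal sketch, assuming that `swig -version` prints a line of the form `SWIG Version 3.0.12`; the helper names are hypothetical and not part of AutoFlow.

```python
import re
import shutil
import subprocess
import sys


def parse_swig_version(text):
    """Extract (major, minor) from the output of `swig -version`, or None if absent."""
    match = re.search(r"SWIG Version (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def swig_is_supported(version):
    """SWIG must satisfy >= 3.0 and < 4.0."""
    return version is not None and (3, 0) <= version < (4, 0)


def check_environment():
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or higher is required")
    swig_path = shutil.which("swig")
    if swig_path is None:
        problems.append("swig was not found on PATH")
    else:
        output = subprocess.run(
            [swig_path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        ).stdout
        if not swig_is_supported(parse_swig_version(output)):
            problems.append("SWIG >= 3.0, < 4.0 is required")
    return problems
```

Printing the result of `check_environment()` lists anything that still needs fixing before running `pip install`.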

Installation via pip

pip install auto-flow

Manual installation

git clone https://github.com/auto-flow/autoflow.git && cd autoflow
python setup.py install

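Quick start

The Titanic task is perhaps the machine learning task most familiar to data scientists. For tutorial purposes, you can find the Titanic dataset in examples/data/train_classification.csv and examples/data/test_classification.csv. Instead of manually exploring every feature of the dataset, you can use AutoFlow to complete this ML task. Let's give it a try!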

$ cd examples/classification

import os

import joblib
import pandas as pd
from sklearn.model_selection import KFold

from autoflow import AutoFlowClassifier

# Load the data from CSV files.
train_df = pd.read_csv("../data/train_classification.csv")
test_df = pd.read_csv("../data/test_classification.csv")
# initial_runs         -- number of initial runs; these use purely random search to give the SMAC algorithm prior experience.
# run_limit            -- the maximum number of runs.
# n_jobs               -- how many search processes are started.
# included_classifiers -- restricts the search space; here lightgbm is the only classifier that can be selected.
# per_run_time_limit   -- restricts the run time; a trial that runs longer than 60 seconds is considered expired and is killed.
trained_pipeline = AutoFlowClassifier(initial_runs=5, run_limit=10, n_jobs=1, included_classifiers=["lightgbm"],
                                      per_run_time_limit=60)
# Describe the meaning of the columns. `id`, `target` and `ignore` each have a specific meaning:
# `id` is the column that uniquely identifies each row,
# `target` is the column that the model will learn to predict,
# `ignore` marks columns that contain irrelevant information.
column_descriptions = {
    "id": "PassengerId",
    "target": "Survived",
    "ignore": "Name"
}
if not os.path.exists("autoflow_classification.bz2"):
    # Pass `train_df`, `test_df` and `column_descriptions` to the classifier.
    # If the parameter `fit_ensemble_params` is set to "auto", a stacking ensemble will be used.
    # `splitter` is the train/validation splitter; here it is set to 3-fold cross-validation.
    trained_pipeline.fit(
        X_train=train_df, X_test=test_df, column_descriptions=column_descriptions,
        fit_ensemble_params=False,
        splitter=KFold(n_splits=3, shuffle=True, random_state=42),
    )
    # Finally, the best model is serialized and stored in the local file system for later use.
    joblib.dump(trained_pipeline, "autoflow_classification.bz2")
    # To see the workflow space that AutoFlow searches, use `draw_workflow_space` to visualize it.
    hdl_constructor = trained_pipeline.hdl_constructors[0]
    hdl_constructor.draw_workflow_space()
# To predict, first load the serialized model from the file system,
predict_pipeline = joblib.load("autoflow_classification.bz2")
# then use the loaded model to make predictions.
result = predict_pipeline.predict(test_df)
print(result)
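
Two ideas from the script, column roles and 3-fold cross-validation, can be illustrated without any dependencies. The sketch below is not AutoFlow's or scikit-learn's implementation. Both helpers are hypothetical, accepting a list of columns for a role is an assumption, and the shuffled folds will not match the exact indices that `KFold(n_splits=3, shuffle=True, random_state=42)` produces.

```python
import random


def split_columns(columns, column_descriptions):
    """Return the feature columns, i.e. those not used as id, target, or ignored."""
    reserved = set()
    for role in ("id", "target", "ignore"):
        value = column_descriptions.get(role, [])
        # Assumption: a role may name one column (a string) or several (a list).
        reserved.update([value] if isinstance(value, str) else value)
    return [c for c in columns if c not in reserved]


def k_fold_indices(n_samples, n_splits=3, seed=42):
    """Yield (train, valid) index lists; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(valid)


columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age"]
column_descriptions = {"id": "PassengerId", "target": "Survived", "ignore": "Name"}
features = split_columns(columns, column_descriptions)
```

With the descriptions used in the quick start, only `Pclass`, `Sex` and `Age` remain as features in this reduced column list.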

Project details

Version 0.1.1, released April 16, 2020

Source distribution

auto-flow-0.1.1.tar.gz (193.2 kB), uploaded April 16, 2020

Hashes for auto-flow-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`3dd795cdc984b282a412c92862ef2c91a96d2670b8d1db6e51207bb64cf6f18a`
MD5	`92296da1bbd06e4de489c06833c24a79`
BLAKE2-256	`fdc10eaa7dc382458e6a29d7883394d28aae5fd2615d59747b6d4278438950ba`
