
On Deploying Machine Learning Models

Date: 2018-10-08

As machine learning sees wider and wider adoption, more and more tools now support deploying trained models to production efficiently. In this article we look at how several of these tools approach the problem.

A typical data science project goes through the following stages: data collection, data analysis, data transformation, data validation, data splitting, training, model creation, model validation, large-scale training, model publishing, serving, monitoring, and logging. Machine learning tools such as Scikit-Learn, Spark, TensorFlow, MXNet, and PyTorch give data scientists many choices, and each choice brings its own deployment challenges.

Let us first take a quick look at how machine learning models are deployed and what challenges come up along the way.

Model persistence

Model deployment usually means persisting the trained model, loading it in a server process, and exposing a REST (or other) service interface. Taking random forest classification as an example, let us see how Sklearn, Spark, and TensorFlow persist their models.

Sklearn

We use the Iris dataset and classify it with RandomForestClassifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.externals import joblib

data = load_iris()
X, y = data["data"], data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

print(clf.feature_importances_)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=data["target_names"]))

joblib.dump(clf, 'classification.pkl')

The training code is shown above; the model is exported by the final line, joblib.dump(). Sklearn's model export is essentially built on Python's pickle mechanism: the trained estimator is serialized and written to a file.

Loading the model is just as simple: call joblib.load().

from sklearn.externals import joblib
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

data = load_iris()
X, y = data["data"], data["target"]

clf = joblib.load('classification.pkl')

print(clf.feature_importances_)
print(classification_report(y, clf.predict(X),
                            target_names=data["target_names"]))

Sklearn wraps and optimizes pickle, but this does not remove pickle's inherent limitations, for example:

  • Version compatibility: serialized files produced with different versions of Python, pickle, or Sklearn are not compatible with each other
  • Security: a serialized file can have malicious code injected into it
  • Extensibility: a custom extension class you wrote may not be serializable, and pickling fails if the model calls into C functions
  • Model management: if you produce several versions of a model, how do you manage them

Spark

Both Spark Pipelines and Models support saving to a file, and the saved artifacts can easily be loaded back in another context.

The training code is as follows:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import DoubleType
from pyspark import SparkFiles
from pyspark import SparkContext

url = "https://server/iris.csv"
spark.sparkContext.addFile(url)

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.csv(SparkFiles.get("iris.csv"), header=True)
data = data.withColumn("sepal_length", data["sepal_length"].cast(DoubleType()))
data = data.withColumn("sepal_width", data["sepal_width"].cast(DoubleType()))
data = data.withColumn("petal_width", data["petal_width"].cast(DoubleType()))
data = data.withColumn("petal_length", data["petal_length"].cast(DoubleType()))
#data.show()
data.printSchema()

assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_width", "petal_length"],
    outputCol="features")
output = assembler.transform(data)

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="species", outputCol="indexedLabel").fit(output)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(output)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[assembler, labelIndexer, featureIndexer, rf, labelConverter])

# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "species", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[3]
print(rfModel)  # summary only

# Persist both the pipeline definition and the fitted model
filebase = "hdfs://server:9000/tmp"
pipeline.write().overwrite().save("{}/classification-pipeline".format(filebase))
model.write().overwrite().save("{}/classification-model".format(filebase))

The model-loading code is as follows:

%pyspark
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.types import DoubleType
from pyspark import SparkFiles

url = "https://server/iris.csv"
spark.sparkContext.addFile(url)

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.csv(SparkFiles.get("iris.csv"), header=True)
data = data.withColumn("sepal_length", data["sepal_length"].cast(DoubleType()))
data = data.withColumn("sepal_width", data["sepal_width"].cast(DoubleType()))
data = data.withColumn("petal_width", data["petal_width"].cast(DoubleType()))
data = data.withColumn("petal_length", data["petal_length"].cast(DoubleType()))

# Load the persisted pipeline definition and the fitted model
filebase = "hdfs://server:9000/tmp"
pipeline = Pipeline.read().load("{}/classification-pipeline".format(filebase))
model = PipelineModel.read().load("{}/classification-model".format(filebase))

# Make predictions.
predictions = model.transform(data)

# Select example rows to display.
predictions.select("predictedLabel", "species", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

Calling the model's toDebugString method reveals the internal details of the classifier.
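For example, using the rfModel extracted from the pipeline stages in the training code above (in PySpark, toDebugString is a property of the fitted forest model):

# Print the full structure of every tree in the trained random forest
print(rfModel.toDebugString)

The output looks like this: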

RandomForestClassificationModel (uid=rfc_225ef4968bf9) with 10 trees
Tree 0 (weight 1.0): If (feature 3 <= 1.9) Predict: 1.0 Else (feature 3 > 1.9) If (feature 3 <= 4.7) Predict: 2.0 Else (feature 3 > 4.7) If (feature 3 <= 5.1) If (feature 1 <= 2.5) Predict: 0.0 Else (feature 1 > 2.5) If (feature 1 <= 2.7) Predict: 2.0 Else (feature 1 > 2.7) Predict: 0.0 Else (feature 3 > 5.1) Predict: 0.0
Tree 1 (weight 1.0): If (feature 3 <= 1.9) Predict: 1.0 Else (feature 3 > 1.9) If (feature 3 <= 4.9) If (feature 2 <= 1.6) Predict: 2.0 Else (feature 2 > 1.6) If (feature 0 <= 4.9) Predict: 0.0 Else (feature 0 > 4.9) If (feature 0 <= 5.9) Predict: 2.0 Else (feature 0 > 5.9) Predict: 0.0 Else (feature 3 > 4.9) If (feature 1 <= 3.0) If (feature 3 <= 5.1) If (feature 2 <= 1.7) Predict: 0.0 Else (feature 2 > 1.7) Predict: 0.0 Else (feature 3 > 5.1) Predict: 0.0 Else (feature 1 > 3.0) Predict: 0.0
Tree 2 (weight 1.0): If (feature 3 <= 1.9) Predict: 1.0 Else (feature 3 > 1.9) If (feature 3 <= 5.0) If (feature 2 <= 1.6) Predict: 2.0 Else (feature 2 > 1.6) If (feature 1 <= 2.5) Predict: 0.0 Else (feature 1 > 2.5) Predict: 2.0 Else (feature 3 > 5.0) If (feature 0 <= 6.0) If (feature 1 <= 2.7) If (feature 0 <= 5.8) Predict: 0.0 Else (feature 0 > 5.8) Predict: 2.0 Else (feature 1 > 2.7) Predict: 0.0 Else (feature 0 > 6.0) Predict: 0.0
Tree 3 (weight 1.0): If (feature 3 <= 1.9) Predict: 1.0 Else (feature 3 > 1.9) If (feature 3 <= 4.9) If (feature 2 <= 1.5) Predict: 2.0 Else (feature 2 > 1.5) If (feature 2 <= 1.7) Predict: 0.0 Else (feature 2 > 1.7) Predict: 2.0 Else (feature 3 > 4.9) If (feature 3 <= 5.1) If (feature 0 <= 6.5) If (feature 0 <= 5.9) Predict: 0.0 Else (feature 0 > 5.9) Predict: 0.0 Else (feature 0 > 6.5) Predict: 2.0 Else (feature 3 > 5.1) Predict: 0.0
Tree 4 (weight 1.0): If (feature 2 <= 0.5) Predict: 1.0 Else (feature 2 > 0.5) If (feature 2 <= 1.5) If (feature 2 <= 1.4) Predict: 2.0 Else (feature 2 > 1.4) If (feature 3 <= 4.9) Predict: 2.0 Else (feature 3 > 4.9) Predict: 0.0 Else (feature 2 > 1.5) If (feature 2 <= 1.8) If (feature 3 <= 5.0) If (feature 0 <= 4.9) Predict: 0.0 Else (feature 0 > 4.9) Predict: 2.0 Else (feature 3 > 5.0) Predict: 0.0 Else (feature 2 > 1.8) Predict: 0.0
Tree 5 (weight 1.0): If (feature 2 <= 0.5) Predict: 1.0 Else (feature 2 > 0.5) If (feature 2 <= 1.6) If (feature 2 <= 1.3) Predict: 2.0 Else (feature 2 > 1.3) If (feature 3 <= 4.9) Predict: 2.0 Else (feature 3 > 4.9) Predict: 0.0 Else (feature 2 > 1.6) If (feature 3 <= 4.8) If (feature 2 <= 1.7) Predict: 0.0 Else (feature 2 > 1.7) Predict: 2.0 Else (feature 3 > 4.8) Predict: 0.0
Tree 6 (weight 1.0): If (feature 3 <= 1.9) Predict: 1.0 Else (feature 3 > 1.9) If (feature 3 <= 4.9) If (feature 2 <= 1.6) Predict: 2.0 Else (feature 2 > 1.6) If (feature 1 <= 2.8) Predict: 0.0 Else (feature 1 > 2.8) Predict: 2.0 Else (feature 3 > 4.9) If (feature 1 <= 2.7) If (feature 2 <= 1.6) If (feature 3 <= 5.0) Predict: 0.0 Else (feature 3 > 5.0) Predict: 2.0 Else (feature 2 > 1.6) Predict: 0.0 Else (feature 1 > 2.7) Predict: 0.0
Tree 7 (weight 1.0): If (feature 0 <= 5.4) If (feature 2 <= 0.5) Predict: 1.0 Else (feature 2 > 0.5) Predict: 2.0 Else (feature 0 > 5.4) If (feature 2 <= 1.7) If (feature 3 <= 1.5) Predict: 1.0 Else (feature 3 > 1.5) If (feature 0 <= 6.9) If (feature 3 <= 5.0) Predict: 2.0 Else (feature 3 > 5.0) Predict: 0.0 Else (feature 0 > 6.9) Predict: 0.0 Else (feature 2 > 1.7) If (feature 0 <= 5.9) If (feature 2 <= 1.8) Predict: 2.0 Else (feature 2 > 1.8) Predict: 0.0 Else (feature 0 > 5.9) Predict: 0.0
Tree 8 (weight 1.0): If (feature 3 <= 1.7) Predict: 1.0 Else (feature 3 > 1.7) If (feature 3 <= 5.1) If (feature 2 <= 1.6) If (feature 2 <= 1.4) Predict: 2.0 Else (feature 2 > 1.4) If (feature 1 <= 2.2) Predict: 0.0 Else (feature 1 > 2.2) Predict: 2.0 Else (feature 2 > 1.6) If (feature 1 <= 2.5) Predict: 0.0 Else (feature 1 > 2.5) If (feature 3 <= 5.0) Predict: 2.0 Else (feature 3 > 5.0) Predict: 0.0 Else (feature 3 > 5.1) Predict: 0.0
Tree 9 (weight 1.0): If (feature 2 <= 0.5) Predict: 1.0 Else (feature 2 > 0.5) If (feature 0 <= 6.1) If (feature 3 <= 4.8) If (feature 0 <= 4.9) If (feature 2 <= 1.0) Predict: 2.0 Else (feature 2 > 1.0) Predict: 0.0 Else (feature 0 > 4.9) Predict: 2.0 Else (feature 3 > 4.8) Predict: 0.0 Else (feature 0 > 6.1) If (feature 3 <= 4.9) If (feature 1 <= 2.8) If (feature 0 <= 6.2) Predict: 0.0 Else (feature 0 > 6.2) Predict: 2.0 Else (feature 1 > 2.8) Predict: 2.0 Else (feature 3 > 4.9) Predict: 0.0

The figure below shows the directory structure of the Pipeline model saved by Spark:

[Figure: directory layout of the saved Spark pipeline and model]

As you can see, it contains the metadata and the data for the five stages of the Pipeline. The files are binary, so only Spark itself can load them back.

Tensorflow

Finally, let's look at TensorFlow. TensorFlow provides tf.train.Saver to export a model as a MetaGraph.

from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib.tensor_forest.python import tensor_forest
from tensorflow.python.ops import resources
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Ignore all GPUs, tf random forest does not benefit from it.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

data = load_iris()
dX, dy = data["data"], data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    dX, dy, test_size=0.33, random_state=42)

# Parameters
num_steps = 500   # Total steps to train
batch_size = 10   # The number of samples per batch
num_classes = 3   # Three iris species
num_features = 4  # Four features per sample
num_trees = 10
max_nodes = 100

# Input and Target data
X = tf.placeholder(tf.float32, shape=[None, num_features])
# For random forest, labels must be integers (the class id)
Y = tf.placeholder(tf.int32, shape=[None])

# Random Forest Parameters
hparams = tensor_forest.ForestHParams(num_classes=num_classes,
                                      num_features=num_features,
                                      num_trees=num_trees,
                                      max_nodes=max_nodes).fill()

# Build the Random Forest
forest_graph = tensor_forest.RandomForestGraphs(hparams)

# Get training graph and loss
train_op = forest_graph.training_graph(X, Y)
loss_op = forest_graph.training_loss(X, Y)

# Measure the accuracy
infer_op, _, _ = forest_graph.inference_graph(X)
correct_prediction = tf.equal(tf.argmax(infer_op, 1), tf.cast(Y, tf.int64))
accuracy_op = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Initialize the variables (i.e. assign their default value) and forest resources
init_vars = tf.group(tf.global_variables_initializer(),
                     resources.initialize_resources(resources.shared_resources()))

def next_batch(size):
    index = range(len(X_train))
    index_batch = np.random.choice(index, size)
    return X_train[index_batch], y_train[index_batch]

# Start TensorFlow session
sess = tf.Session()

# Run the initializer
sess.run(init_vars)
saver = tf.train.Saver()

# Training
for i in range(1, num_steps + 1):
    # Get the next batch of training data
    batch_x, batch_y = next_batch(batch_size)
    _, l = sess.run([train_op, loss_op], feed_dict={X: batch_x, Y: batch_y})
    if i % 50 == 0 or i == 1:
        acc = sess.run(accuracy_op, feed_dict={X: batch_x, Y: batch_y})
        print('Step %i, Loss: %f, Acc: %f' % (i, l, acc))

# Test Model
print("Test Accuracy:", sess.run(
    accuracy_op, feed_dict={X: X_test, Y: y_test}))

# Print the tensors related to this model
print(accuracy_op)
print(infer_op)
print(X)
print(Y)

# Save the model to a checkpoint file
save_path = saver.save(sess, "/tmp/model.ckpt")

The exported model consists of the following files:

[Figure: checkpoint files produced by tf.train.Saver]

The checkpoint file is the metadata that records the paths of the other files. The directory also contains a pickle file and several other checkpoint files, so TensorFlow also relies on Python's pickle mechanism to store the model, with additional metadata layered on top.

The model-loading code is as follows:

from __future__ import print_function
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Note: this import is required so that the forest graph can be restored.
from tensorflow.contrib.tensor_forest.python import tensor_forest

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

saver = tf.train.import_meta_graph('/tmp/model.ckpt.meta')

data = load_iris()
dX, dy = data["data"], data["target"]

graph = tf.get_default_graph()

with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph('/tmp/model.ckpt.meta')
    new_saver.restore(sess, '/tmp/model.ckpt')
    #input = graph.get_operation_by_name("train")
    # print(graph.as_graph_def())
    load_infer_op = graph.get_tensor_by_name('probabilities:0')
    accuracy_op = graph.get_tensor_by_name('Mean_1:0')
    X = graph.get_tensor_by_name('Placeholder:0')
    Y = graph.get_tensor_by_name('Placeholder_1:0')
    print("Test Accuracy:", sess.run(accuracy_op, feed_dict={X: dX, Y: dy}))
    result = sess.run(load_infer_op, feed_dict={X: dX})
    prediction_result = [i.argmax() for i in result]
    print(classification_report(dy, prediction_result,
                                target_names=data["target_names"]))

Note that random forest is not part of TensorFlow's core package, so tensorflow.contrib.tensor_forest.python.tensor_forest must be imported when the model is loaded; otherwise loading fails, because some attributes defined in tensor_forest would be missing.

TensorFlow can also save the computation graph itself: calling tf.train.write_graph() writes out the graph definition, which can also be visualized in TensorBoard.

tf.train.write_graph(sess.graph_def, '/tmp', 'train.pbtxt')
%cat /tmp/train.pbtxt

So Sklearn, Spark, and TensorFlow each provide their own way to persist a model. In principle, then, all you need is a web server such as Flask, plus some model loading and management code, and you can expose a REST API that serves predictions. Sounds simple, doesn't it?
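For the Sklearn model saved above, a minimal Flask sketch might look like the following (the route, port, and request format are illustrative assumptions, not part of any particular tool):

from flask import Flask, request, jsonify
from sklearn.externals import joblib

app = Flask(__name__)
# Load the persisted model once when the server starts
clf = joblib.load('classification.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body such as {"instances": [[5.1, 3.5, 1.4, 0.2]]}
    instances = request.get_json()["instances"]
    predictions = clf.predict(instances).tolist()
    return jsonify({"predictions": predictions})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)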

In a real production environment, however, there are many more challenges to face, for example:

  • How to scale out and back in the cloud
  • How to tune performance
  • How to manage model versions
  • Security
  • How to do continuous integration and continuous deployment
  • How to support A/B testing

To address these deployment challenges, various organizations have built open-source tools, such as Clipper, Seldon, MLflow, MLeap, Oracle GraphPipe, and MXNet Model Server. Let's take a closer look at a few of them.

Clipper

Clipper, developed by the UC Berkeley RISE Lab, is a prediction-serving system that sits between user applications and machine learning models. It simplifies deployment by decoupling the application from the machine learning system.

It offers the following features:

  • A simple, standardized REST interface that simplifies integration with machine learning systems and supports the major frameworks
  • Model deployment using the same libraries and environment in which the model was developed
  • Better throughput through adaptive batching, caching, and similar techniques
  • Better prediction accuracy through intelligent model selection and model combination

Clipper's architecture is shown below:

[Figure: Clipper architecture]

Clipper is built on containers and microservices. It uses Redis to manage configuration and Prometheus for monitoring, and it can manage its containers either with Kubernetes or with local Docker.

Clipper supports the following kinds of models:

  • Pure Python functions
  • PySpark
  • PyTorch
  • Tensorflow
  • MXNet
  • Custom containers

The basic Clipper deployment workflow is as follows (you can also refer to my notebook); a sketch of the corresponding Python calls follows the list.

  1. Create a Clipper cluster (on Kubernetes or local Docker)
  2. Create an application
  3. Train the model
  4. Deploy the model with the deployer Clipper provides for the corresponding framework; each tool has its own deployment method. During deployment, the trained estimator is persisted with CloudPickle, a container image is built locally, and it is deployed to Docker or Kubernetes.
  5. Link the model to the application, which effectively publishes the model; the corresponding REST API can then be called for predictions.
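A rough sketch of these steps with clipper_admin for the Sklearn model, assuming a local Docker setup (the application and model names are made up for illustration):

from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer
from sklearn.externals import joblib

clf = joblib.load('classification.pkl')

def predict(inputs):
    # Clipper passes a batch of inputs and expects one output string per input
    return [str(p) for p in clf.predict(inputs)]

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper()                          # 1. create the cluster
clipper_conn.register_application(name="iris-app",   # 2. create an application
                                  input_type="doubles",
                                  default_output="-1.0",
                                  slo_micros=100000)
python_deployer.deploy_python_closure(clipper_conn,  # 4. deploy the trained model
                                      name="iris-model",
                                      version=1,
                                      input_type="doubles",
                                      func=predict)
clipper_conn.link_model_to_app(app_name="iris-app",  # 5. publish the model
                               model_name="iris-model")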

I tried to publish the random forest examples from the three tools above to my Kubernetes cluster with Clipper and ran into the following problems:

  • The CloudPickle version on my machine was too new, so the model could not be deserialized; see the related issue
  • Pickling the TensorFlow model failed, most likely because it calls into C code
  • My Kubernetes cluster runs on AWS; using the cluster's internal IP failed because Clipper kept connecting through the external domain name, so the PySpark model could not be deployed

In short, only the Sklearn model was deployed successfully; TensorFlow and Spark both failed. You can refer to the examples here.

Seldon

Seldon is a company founded in London that focuses on giving users control over machine learning systems built on open-source software. Seldon Core is the company's open-source tool for deploying machine learning models on Kubernetes. It offers:

  • Support for Python, Spark, H2O, and R models
  • REST and gRPC interfaces
  • Deployment of microservices as inference graphs built from Models, Routers, Combiners, and Transformers
  • Scaling, security, monitoring, and other DevOps capabilities provided through Kubernetes

[Figure: Seldon Core deployment workflow]

The workflow for using Seldon is shown in the figure above:

  1. Install Seldon Core on Kubernetes; Seldon uses ksonnet and installs Seldon Core as CRDs
  2. Use s2i (an open-source OpenShift tool that builds source code into container images) to build the runtime model container and push it to a container registry (a minimal Python wrapper for this step is sketched after this list)
  3. Write your inference graph and submit it to Kubernetes to deploy the model
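For the s2i step, the Python model is wrapped in a class that Seldon's Python wrapper can call. A minimal sketch for the Sklearn model above, assuming the standard predict(X, features_names) contract, might look like this:

# IrisClassifier.py -- the class name is assumed to match the model name used by s2i
from sklearn.externals import joblib

class IrisClassifier(object):
    def __init__(self):
        # Load the persisted Sklearn model when the container starts
        self.clf = joblib.load('classification.pkl')

    def predict(self, X, features_names):
        # Seldon calls predict() with an array of input rows
        return self.clf.predict_proba(X)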

Seldon builds inference graphs from four basic building blocks, Model, Transformer, Router, and Combiner, and creates the corresponding Kubernetes resources and instances according to the graph. This is how it provides A/B testing, model ensembles, and similar capabilities.

Here are a few examples:

A/B testing

[Figure: inference graph for A/B testing]

Model ensemble

[Figure: inference graph for a model ensemble]

A more complex graph

[Figure: a more complex inference graph]

The graph model is Seldon's biggest strength: you can train different models and then compose them into different runtime graphs, which is very convenient. More examples can be found here.

I tried deploying the models produced by all three tools on Kubernetes with Seldon, and all of them worked (the code is here). A few problems I ran into:

  • Seldon provides Java and Python wrappers, but running PySpark requires both, so I had to build my own image by manually installing Java on top of the Python image
  • Because deployment is driven by a CRD, I could not find an effective way to change the containers' liveness and readiness settings. The Spark model is initialized from Hadoop and loading it takes time, so the readiness probe kept timing out and Kubernetes kept restarting the container. I ended up changing the code to load the model lazily; the first REST call is slow, but the container and service at least start normally.

As mentioned in my earlier article introducing Kubeflow, Kubeflow itself uses Seldon to manage model deployment.

MLflow

MLflow is an open-source system developed by Databricks for managing the end-to-end machine learning lifecycle. I wrote an earlier article introducing this tool.

MLflow provides tracking, project management, and model management. Serving an Sklearn-based model with MLflow is very simple:

from __future__ import print_function
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import mlflow
import mlflow.sklearn

if __name__ == "__main__":
    data = load_iris()
    X, y = data["data"], data["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(X_train, y_train)

    print(clf.feature_importances_)
    print(classification_report(y_test, clf.predict(X_test),
                                target_names=data["target_names"]))

    mlflow.sklearn.log_model(clf, "model")
    print("Model saved in run %s" % mlflow.active_run().info.run_uuid)

When mlflow.sklearn.log_model() is called, MLflow creates the following directory structure to manage the model:

[Figure: MLflow run directory with the logged model artifacts]

Under the artifacts directory there is a Python pickle file and a metadata file, MLmodel:

artifact_path: model
flavors:
  python_function:
    data: model.pkl
    loader_module: mlflow.sklearn
    python_version: 2.7.10
  sklearn:
    pickled_model: model.pkl
    sklearn_version: 0.20.0
run_id: 44ae85c084904b4ea5bad5aa42c9ce05
utc_time_created: '2018-10-02 23:38:49.786871'

With mlflow sklearn serve -m model you can then serve the Sklearn model very easily.
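Once the server is running, predictions can be requested over HTTP. A sketch with the requests library, assuming the server listens on the default local port and accepts a JSON list of feature rows (the exact endpoint path and payload format depend on the MLflow version in use):

import requests

# Hypothetical endpoint; adjust the path and payload to match your MLflow version.
url = "http://127.0.0.1:5000/invocations"
payload = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]

response = requests.post(url, json=payload)
print(response.text)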

MLflow also claims to support Spark and TensorFlow, but those integrations are likewise Python-based. I tried to use them but did not succeed, partly because documentation and examples are still scarce; in principle they follow the same pickle-plus-metadata approach. Feel free to try them yourself.

On the deployment side, one of MLflow's highlights is its integration with SageMaker and Azure ML.

MLeap

MLeap aims to provide a model format, and a runtime engine, that is portable between Spark and Sklearn. It consists of:

  • JSON-based serialization
  • A runtime execution engine
  • Benchmarks

MLeap's architecture is shown below:

[Figure: MLeap architecture]

Here is an example of exporting an Sklearn model with MLeap:

# Initialize MLeap libraries before Scikit/Pandas
import mleap.sklearn.preprocessing.data
import mleap.sklearn.pipeline
from mleap.sklearn.ensemble import forest
from mleap.sklearn.preprocessing.data import FeatureExtractor

# Import Scikit Transformer(s)
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

input_features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output_vector_name = 'extracted_features'  # Used only for serialization purposes
output_features = [x for x in input_features]

feature_extractor_tf = FeatureExtractor(input_scalars=input_features,
                                        output_vector=output_vector_name,
                                        output_vector_items=output_features)

classification_tf = RandomForestClassifier(
    bootstrap=True, class_weight=None, criterion='gini',
    max_depth=2, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
    oob_score=False, random_state=0, verbose=0, warm_start=False)
classification_tf.mlinit(input_features="features",
                         prediction_column='species',
                         feature_names="features")

rf_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
                        (classification_tf.name, classification_tf)])
rf_pipeline.mlinit()

rf_pipeline.fit(data[input_features], data['species'])

rf_pipeline.serialize_to_bundle('./', 'mleap-scikit-rf-pipeline', init=True)

The exported model directory looks like this:

[Figure: directory layout of the exported MLeap bundle]

This is the JSON for the random forest model:

{ "attributes": { "num_features": { "long": 4 }, "trees": { "type": "list", "string": [ "tree0", "tree1", "tree2", "tree3", "tree4", "tree5", "tree6", "tree7", "tree8", "tree9" ] }, "tree_weights": { "double": [ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ], "type": "list" } }, "op": "random_forest_classifier" }

As you can see, MLeap serializes the model entirely into JSON files that are independent of the code, which is what makes the model portable between runtimes such as Spark and Sklearn.

Serving an MLeap model does not depend on any Sklearn or Spark code: just start the MLeap serving container and submit the model.

docker run -p 65327:65327 -v /tmp/models:/models combustml/mleap-serving:0.9.0-SNAPSHOT

curl -XPUT -H "content-type: application/json" \
  -d '{"path":"/models/yourmodel.zip"}' \
  http://localhost:65327/model

The following Scala code trains the same random forest classifier on Spark and persists the model with MLeap.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.types.{IntegerType, DoubleType}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.ml.mleap.SparkUtil
import ml.combust.bundle.BundleFile
import ml.combust.bundle.serializer.SerializationFormat
import ml.combust.mleap.spark.SparkSupport._
import resource._
import org.apache.spark.SparkFiles

spark.sparkContext.addFile("https://s3-us-west-2.amazonaws.com/mlapi-samples/demo/data/input/iris.csv")
val data = spark.read.format("csv").option("header", "true").load(SparkFiles.get("iris.csv"))
//data.show()
//data.printSchema()

// Transform, convert string columns to numbers
// this transform is not part of the pipeline
val featureDf = data.select(data("sepal_length").cast(DoubleType).as("sepal_length"),
  data("sepal_width").cast(DoubleType).as("sepal_width"),
  data("petal_width").cast(DoubleType).as("petal_width"),
  data("petal_length").cast(DoubleType).as("petal_length"),
  data("species"))

// Assemble the features
val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_length", "sepal_width", "petal_width", "petal_length"))
  .setOutputCol("features")
val output = assembler.transform(featureDf)

// Create label and feature indexers
val labelIndexer = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("indexedLabel")
  .fit(output)
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(output)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = featureDf.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "species", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

val rfModel = model.stages(3).asInstanceOf[RandomForestClassificationModel]
println("Learned classification forest model:\n" + rfModel.toDebugString)

// Serialize the fitted pipeline to an MLeap bundle
val pipelineModel = SparkUtil.createPipelineModel(Array(model))
for (bundle <- managed(BundleFile("file:/tmp/mleap-examples/rf"))) {
  pipelineModel.writeBundle.format(SerializationFormat.Json).save(bundle)
}

The exported model has the same format as the one exported from Sklearn.

MLeap's drawback is that every algorithm needs its own serialization support, so supporting user-defined algorithms requires considerable development work. For the list of commonly supported algorithms, see here.

Others

Besides the tools above, there are others we have not covered here; interested readers can explore them on their own.

Summary

Seldon Core integrates very well with Kubernetes, and its inference-graph approach is very powerful. It was also the only tool in my experiments that successfully deployed all three kinds of models (Sklearn, Spark, and TensorFlow). Highly recommended.

Clipper deploys models on Kubernetes or Docker and does a good job of model version management, but the code is not very stable and there are quite a few small issues. Building on CloudPickle also imposes limitations, and supporting only Python is a drawback. Recommended for data scientists who do a lot of local, interactive work.

MLflow makes it very easy to serve Python-based models, but it lacks container integration. On the other hand, it integrates with clouds such as SageMaker and Azure ML, so it is recommended for teams already using those platforms.

MLeap's distinguishing feature is model interoperability: a model trained with Sklearn can be exported and run on Spark. This is very attractive, but there is still a long way to go before it supports all algorithms. On the topic of model standardization, PMML is also worth watching; current tools offer only limited PMML support, and with the rise of deep learning it is unclear where PMML will go from here.

The table below is a brief summary of the tools discussed above, for reference.

 

| Tool | Model Persistence | ML Tools | Kubernetes Integration | Version | License | Implementation |
| --- | --- | --- | --- | --- | --- | --- |
| Seldon Core | S2i + Pickle | Tensorflow, Sklearn, Keras, R, H2O, Nodejs, PMML | Yes | 0.3.2 | Apache | Docker + K8s CRD |
| Clipper | Pickle | Python, PySpark, PyTorch, Tensorflow, MXnet, custom containers | Yes | 0.3.0 | Apache | CPP / Python |
| MLflow | Directory + Metadata | Python, H2O, Keras, MLeap, PyTorch, Sklearn, Spark, Tensorflow, R | No | Alpha | Apache | Python |
| MLeap | JSON | Spark, Sklearn, Tensorflow | No | 0.12.0 | Apache | Scala/Java |

This article comes from 开源中国 (OSChina), a partner of the Alibaba Cloud Yunqi Community.

Author: naughty


Original link: https://yq.aliyun.com/articles/649286