20% Faster Than DataX! SeaTunnel Sync Engine Performance Test Results Released

2022-11-16

Give the project a ⭐️ Star and light the way for open source: https://github.com/apache/incubator-seatunnel

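Earlier this month, SeaTunnel's synchronization engine, STE 2.3.0 beta2 (commit id 7393c47), was officially released thanks to the joint efforts of the community. Alongside the release, the community ran the performance tests that many users had been waiting for.

To make the results easier to interpret, we used a comparative test. People familiar with data integration will know that DataX is one of the better-performing open source synchronization engines available today, and it is widely used in the field. That is why we chose it as the baseline for SeaTunnel.

To keep the comparison fair, both tools ran the same scenario with the same resources: a batch synchronization from MySQL to HDFS, stored in Text format. We measured and compared the time each tool needed.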

Test environment

MySQL

Alibaba Cloud RDS MySQL, 8 cores, 32G

HDFS

CPU: Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz

Memory: 32G

Nodes: 3

NameNode -Xmx4G

DataNode -Xmx16G

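Test data

Columns: 31

Rows: 32226320 (about 30 million)

Size: 18G once written to HDFS in text format

We created a table with 31 columns in MySQL. The primary key is an auto-increment id, all other columns are filled with randomly generated values, and no indexes are defined apart from the primary key.

The table definition is: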

create table test.type_source_table
(
    id                   int auto_increment
        primary key,
    f_binary             binary(64)          null,
    f_blob               blob                null,
    f_long_varbinary     mediumblob          null,
    f_longblob           longblob            null,
    f_tinyblob           tinyblob            null,
    f_varbinary          varbinary(100)      null,
    f_smallint           smallint            null,
    f_smallint_unsigned  smallint unsigned   null,
    f_mediumint          mediumint           null,
    f_mediumint_unsigned mediumint unsigned  null,
    f_int                int                 null,
    f_int_unsigned       int unsigned        null,
    f_integer            int                 null,
    f_integer_unsigned   int unsigned        null,
    f_bigint             bigint              null,
    f_bigint_unsigned    bigint unsigned     null,
    f_numeric            decimal             null,
    f_decimal            decimal             null,
    f_float              float               null,
    f_double             double              null,
    f_double_precision   double              null,
    f_longtext           longtext            null,
    f_mediumtext         mediumtext          null,
    f_text               text                null,
    f_tinytext           tinytext            null,
    f_varchar            varchar(100)        null,
    f_date               date                null,
    f_datetime           datetime            null,
    f_time               time                null,
    f_timestamp          timestamp           null
);
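The article does not say how the random values were produced. As a rough illustration only, the sketch below generates one random row for a subset of the columns above using Python's standard library. The function name `random_row`, the chosen columns, and the value ranges are assumptions made for this example, not the authors' actual data generator.

```python
import random
import string
from datetime import datetime, timedelta


def random_row(rng: random.Random) -> dict:
    """Build one row of random values for a few type_source_table columns.

    Hypothetical helper: the column subset and value ranges are assumptions.
    """
    # Pick a random moment within the year 2022 for the date/time columns.
    ts = datetime(2022, 1, 1) + timedelta(seconds=rng.randrange(365 * 24 * 3600))
    return {
        # "id" is left out on purpose: MySQL assigns it via auto_increment.
        "f_varbinary": bytes(rng.getrandbits(8) for _ in range(rng.randint(1, 100))),
        "f_smallint": rng.randint(-32768, 32767),
        "f_int_unsigned": rng.randint(0, 2**32 - 1),
        "f_bigint": rng.randint(-(2**63), 2**63 - 1),
        "f_double": rng.uniform(-1e6, 1e6),
        "f_varchar": "".join(rng.choices(string.ascii_letters, k=rng.randint(1, 100))),
        "f_date": ts.date(),
        "f_datetime": ts,
    }


if __name__ == "__main__":
    # A fixed seed makes the generated data reproducible between runs.
    print(random_row(random.Random(42)))
```

Rows built this way could be passed to any MySQL client library as parameters of an INSERT statement. The remaining columns would follow the same pattern.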

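DataX job configuration

To make full use of what DataX offers, we enabled its splitPk feature, which splits the data of a single Job into shards and produces a number of subtasks. The configuration is as follows: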

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "f_binary",
                            "f_blob",
                            "f_long_varbinary",
                            "f_longblob",
                            "f_tinyblob",
                            "f_varbinary",
                            "f_smallint",
                            "f_smallint_unsigned",
                            "f_mediumint",
                            "f_mediumint_unsigned",
                            "f_int",
                            "f_int_unsigned",
                            "f_integer",
                            "f_integer_unsigned",
                            "f_bigint",
                            "f_bigint_unsigned",
                            "f_numeric",
                            "f_decimal",
                            "f_float",
                            "f_double",
                            "f_double_precision",
                            "f_longtext",
                            "f_mediumtext",
                            "f_text",
                            "f_tinytext",
                            "f_varchar",
                            "f_date",
                            "f_datetime",
                            "f_time",
                            "f_timestamp"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://seatunnel.rds.aliyuncs.com:3306/test"
                                ],
                                "table": [
                                    "type_source_table"
                                ]
                            }
                        ],
                        "password": "password",
                        "username": "root",
                        "splitPk": "id"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "INT"
                            },
                            {
                                "name": "f_binary",
                                "type": "STRING"
                            },
                            {
                                "name": "f_blob",
                                "type": "STRING"
                            },
                            {
                                "name": "f_long_varbinary",
                                "type": "STRING"
                            },
                            {
                                "name": "f_longblob",
                                "type": "STRING"
                            },
                            {
                                "name": "f_tinyblob",
                                "type": "STRING"
                            },
                            {
                                "name": "f_varbinary",
                                "type": "STRING"
                            },
                            {
                                "name": "f_smallint",
                                "type": "SMALLINT"
                            },
                            {
                                "name": "f_smallint_unsigned",
                                "type": "SMALLINT"
                            },
                            {
                                "name": "f_mediumint",
                                "type": "SMALLINT"
                            },
                            {
                                "name": "f_mediumint_unsigned",
                                "type": "SMALLINT"
                            },
                            {
                                "name": "f_int",
                                "type": "INT"
                            },
                            {
                                "name": "f_int_unsigned",
                                "type": "INT"
                            },
                            {
                                "name": "f_integer",
                                "type": "INT"
                            },
                            {
                                "name": "f_integer_unsigned",
                                "type": "INT"
                            },
                            {
                                "name": "f_bigint",
                                "type": "BIGINT"
                            },
                            {
                                "name": "f_bigint_unsigned",
                                "type": "BIGINT"
                            },
                            {
                                "name": "f_numeric",
                                "type": "DOUBLE"
                            },
                            {
                                "name": "f_decimal",
                                "type": "DOUBLE"
                            },
                            {
                                "name": "f_float",
                                "type": "FLOAT"
                            },
                            {
                                "name": "f_double",
                                "type": "DOUBLE"
                            },
                            {
                                "name": "f_double_precision",
                                "type": "DOUBLE"
                            },
                            {
                                "name": "f_longtext",
                                "type": "STRING"
                            },
                            {
                                "name": "f_mediumtext",
                                "type": "STRING"
                            },
                            {
                                "name": "f_text",
                                "type": "STRING"
                            },
                            {
                                "name": "f_tinytext",
                                "type": "STRING"
                            },
                            {
                                "name": "f_varchar",
                                "type": "STRING"
                            },
                            {
                                "name": "f_date",
                                "type": "DATE"
                            },
                            {
                                "name": "f_datetime",
                                "type": "TIMESTAMP"
                            },
                            {
                                "name": "f_time",
                                "type": "DATE"
                            },
                            {
                                "name": "f_timestamp",
                                "type": "TIMESTAMP"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop1:9000",
                        "fieldDelimiter": ",",
                        "fileName": "result",
                        "fileType": "text",
                        "path": "/test/result",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 8
            }
        }
    }
}

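With the JVM memory fixed at 8G, the best channel count turned out to be 8. With the channel count then fixed at that value, the best memory size was 2G, and the synchronization finished in 114 s. Based on this result, we tested how fast SeaTunnel could run with the same memory and concurrency.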
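To put the 114 s figure in perspective, the short sketch below converts it into throughput using the row count and data size given in the test data section. Reading "18G" as 18 GiB is an assumption, since the article does not specify the unit precisely, so the byte rate is only approximate.

```python
from typing import Tuple

ROWS = 32_226_320   # row count from the test data section
SIZE_GIB = 18       # "18G" on HDFS, read here as GiB (an assumption)
ELAPSED_S = 114     # DataX run time reported for 2G memory and 8 channels


def throughput(rows: int, size_gib: float, elapsed_s: float) -> Tuple[float, float]:
    """Return (rows per second, MiB per second) for one synchronization run."""
    return rows / elapsed_s, size_gib * 1024 / elapsed_s


rows_per_s, mib_per_s = throughput(ROWS, SIZE_GIB, ELAPSED_S)
print(f"{rows_per_s:,.0f} rows/s, {mib_per_s:.1f} MiB/s")
```

Under these assumptions the DataX run works out to roughly 283 thousand rows per second, or a little over 160 MiB per second.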

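SeaTunnel Engine job configuration

In SeaTunnel we used a feature similar to the one in DataX: the data is split on the id column into multiple subtasks, which are then processed in parallel.

Below is the SeaTunnel configuration file: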

env {
  # You can set engine configuration here
  job.mode = "BATCH"
  checkpoint.interval = 300000
  #execution.checkpoint.data-uri = "hdfs://localhost:9000/checkpoint"
}
 
source {
  # JDBC source: the query result is split on partition_column and read by `parallelism` subtasks
  jdbc{
    url = "jdbc:mysql://seatunnel.mysql.rds.aliyuncs.com:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "password"
    query = "select * from type_source_table"
    partition_column = "id"
    parallelism = 8
  }
}
 
transform {
}
 
sink {
  HdfsFile {
    fs.defaultFS="hdfs://hadoop1:9000"
    path="/test/result/"
    field_delimiter="\\t"
    row_delimiter="\\n"
    file_name_expression="${transactionId}_${now}"
    file_format="text"
    filename_time_format="yyyy.MM.dd"
    is_enable_transaction=true
  }
}
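Both configurations rely on splitting the table on its numeric id column. The sketch below shows the general idea of range splitting: dividing an inclusive id range into contiguous sub-ranges, one per subtask. It is a simplified illustration, not the actual splitting logic of DataX or SeaTunnel. The function name `split_id_range` and the assumption that ids run contiguously from 1 to 32226320 are hypothetical.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, parts: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into contiguous sub-ranges.

    Returns at most `parts` inclusive (start, end) pairs whose sizes differ
    by no more than one.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    parts = min(parts, total)          # never create empty sub-ranges
    base, extra = divmod(total, parts)
    ranges = []
    start = min_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # first `extra` ranges get one more id
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Assumed id range 1..32226320 (auto-increment without gaps), 8 subtasks.
for lo, hi in split_id_range(1, 32_226_320, 8):
    print(f"select * from type_source_table where id between {lo} and {hi}")
```

Each printed query covers a disjoint slice of the table, so eight readers can work in parallel without overlapping. Real engines usually query the minimum and maximum key first and may handle skewed or sparse keys differently.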

With the same 2G of memory and 8 threads, SeaTunnel Engine is 20% faster than DataX. The detailed comparison is given in the table that follows.
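A statement such as "20% faster" can be read in two ways: as a reduction in elapsed time or as a gain in throughput. The sketch below computes both from a pair of durations. The 114 s baseline is the DataX time quoted earlier. The candidate time of 95 s is a hypothetical value chosen purely for illustration, not a measured SeaTunnel result, and the function names are assumptions as well.

```python
def time_reduction_pct(baseline_s: float, candidate_s: float) -> float:
    """Percentage by which the candidate's elapsed time is shorter than the baseline's."""
    return (baseline_s - candidate_s) / baseline_s * 100


def throughput_gain_pct(baseline_s: float, candidate_s: float) -> float:
    """Percentage by which the candidate's throughput exceeds the baseline's."""
    return (baseline_s / candidate_s - 1) * 100


DATAX_S = 114          # DataX time reported in the article
HYPOTHETICAL_S = 95    # hypothetical candidate time, NOT a measured result

print(f"time reduction:  {time_reduction_pct(DATAX_S, HYPOTHETICAL_S):.1f}%")
print(f"throughput gain: {throughput_gain_pct(DATAX_S, HYPOTHETICAL_S):.1f}%")
```

The same pair of runs gives two different percentages depending on the definition, so it helps to state which one a benchmark reports.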

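Conclusion

After comparing the best configurations, we ran a more thorough comparison across different memory sizes and thread counts. Repeated tests in the same environment produced the comparison results below.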

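[Comparison chart not preserved: synchronization time of DataX and SeaTunnel Engine at different memory sizes and thread counts. Unit: seconds]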

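As the comparison results show, in the same test environment the newly released synchronization engine, SeaTunnel Engine, was faster than DataX in every case. Even when memory was tight, reducing the memory had no significant effect on SeaTunnel Engine. We attribute this to SeaTunnel's architecture and efficient code paths.

Note that this was a single-node test only. DataX runs in standalone mode, whereas SeaTunnel's new engine also supports cluster mode. With the single-node difference already this large, a SeaTunnel cluster can be expected to bring users further gains, although this test did not measure that.

Note: this comparison is based on DataX datax_v202209 and SeaTunnel commit id 7393c47. You are welcome to download both and run the test yourself.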

Original article: https://www.oschina.net/news/217925
