全方位解读SeaTunnel MySQL CDC连接器：实现数据高效同步的强大工具-低调大师

全方位解读SeaTunnel MySQL CDC连接器：实现数据高效同步的强大工具

2023-12-07 706

在当今数据快速增长的时代，实时、高效地同步和处理来自各种数据源的信息成为了企业和开发者面临的重要挑战。

MySQL作为广泛使用的数据库之一，其变更数据捕获（CDC）功能对于实现这一目标至关重要。在这篇文章中，我们将深入探讨MySQL CDC源连接器在SeaTunnel框架下的应用，涵盖从基础设置到高级配置的各个方面。

MySQL CDC源连接器

支持的引擎

SeaTunnel Zeta<br/> Flink <br/>

主要特性

描述

MySQL CDC连接器允许从MySQL数据库读取快照数据和增量数据。本文档描述了如何设置MySQL CDC连接器以对MySQL数据库运行SQL查询。

支持的数据源信息

数据源	支持的版本	驱动	Url	Maven
MySQL	<li> MySQL: 5.6, 5.7, 8.0.x </li><li> RDS MySQL: 5.6, 5.7, 8.0.x </li>	com.mysql.cj.jdbc.Driver	jdbc:mysql://localhost:3306/test	https://mvnrepository.com/artifact/mysql/mysql-connector-java/8.0.28

数据库依赖

安装Jdbc驱动程序

请将mysql驱动程序下载并放入${SEATUNNEL_HOME}/lib/目录中。例如：cp mysql-connector-java-xxx.jar $SEATNUNNEL_HOME/lib/

创建MySQL用户

您必须为Debezium MySQL连接器监视的所有数据库定义一个具有适当权限的MySQL用户。

创建MySQL用户：

mysql> CREATE USER 'user'@'localhost' IDENTIFIED BY 'password';

为用户授予所需权限：

mysql> GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'user' IDENTIFIED BY 'password';

完成用户的权限设置：

mysql> FLUSH PRIVILEGES;

启用MySQL binlog

为了实现MySQL的复制，必须启用二进制日志。二进制日志记录了用于复制工具传播更改的事务更新。

检查log-bin选项是否已经开启：

mysql> show variables where variable_name in ('log_bin', 'binlog_format', 'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
+--------------------------+----------------+
| Variable_name            | Value          |
+--------------------------+----------------+
| binlog_format            | ROW            |
| binlog_row_image         | FULL           |
| enforce_gtid_consistency | ON             |
| gtid_mode                | ON             |
| log_bin                  | ON             |
+--------------------------+----------------+
5 rows in set (0.00 sec)

如果与上述结果不一致，请使用以下属性配置您的MySQL服务器配置文件（$MYSQL_HOME/mysql.cnf），如下表所示：

# 启用二进制复制日志并设置前缀、过期时间和日志格式。
# 前缀是任意的，过期时间对于集成测试可以短一些，但在生产系统中会更长。
# 行级信息对于摄取工作是必需的。
# 服务器ID在生产系统上是必需的，但会有所不同。
server-id         = 223344
log_bin           = mysql-bin
expire_logs_days  = 10
binlog_format     = row
binlog_row_image  = FULL

# 启用gtid模式
gtid_mode = on
enforce_gtid_consistency = on

重启MySQL服务器

/etc/inint.d/mysqld restart

再次确认您的更改，通过再次检查binlog状态：

mysql> show variables where variable_name in ('log_bin', 'binlog_format', 'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
+--------------------------+----------------+
| Variable_name            | Value          |
+--------------------------+----------------+
| binlog_format            | ROW            |
| binlog_row_image         | FULL           |
| enforce_gtid_consistency | ON             |
| gtid_mode                | ON             |
| log_bin                  | ON             |
+--------------------------+----------------+
5 rows in set (0.00 sec)

注意

设置MySQL会话超时

当为大型数据库创建初始一致快照时，在读取表期间，已建立的连接可能会超时。您可以通过在MySQL配置文件中配置interactive_timeout和wait_timeout来防止这种行为。

interactive_timeout：服务器等待交互连接活动关闭之前的秒数。有关更多详细信息，请参阅MySQL文档。
wait_timeout：服务器等待非交互连接活动关闭之前的秒数。有关更多详细信息，请参阅MySQL文档。

有关更多数据库设置，请参见Debezium MySQL连接器

数据类型映射

Mysql Data type	SeaTunnel Data type
BIT(1)<br/>TINYINT(1)	BOOLEAN
TINYINT	TINYINT
TINYINT UNSIGNED<br/>SMALLINT	SMALLINT
SMALLINT UNSIGNED<br/>MEDIUMINT<br/>MEDIUMINT UNSIGNED<br/>INT<br/>INTEGER<br/>YEAR	INT
INT UNSIGNED<br/>INTEGER UNSIGNED<br/>BIGINT	BIGINT
BIGINT UNSIGNED	DECIMAL(20,0)
DECIMAL(p, s) <br/>DECIMAL(p, s) UNSIGNED <br/>NUMERIC(p, s) <br/>NUMERIC(p, s) UNSIGNED	DECIMAL(p,s)
FLOAT<br/>FLOAT UNSIGNED	FLOAT
DOUBLE<br/>DOUBLE UNSIGNED<br/>REAL<br/>REAL UNSIGNED	DOUBLE
CHAR<br/>VARCHAR<br/>TINYTEXT<br/>MEDIUMTEXT<br/>TEXT<br/>LONGTEXT<br/>ENUM<br/>JSON	STRING
DATE	DATE
TIME	TIME
DATETIME<br/>TIMESTAMP	TIMESTAMP
BINARY<br/>VARBINAR<br/>BIT(p)<br/>TINYBLOB<br/>MEDIUMBLOB<br/>BLOB<br/>LONGBLOB	BYTES

源选项

名称	类型	必需	默认值	描述
base-url	String	是	-	JDBC连接的URL。参考示例：`jdbc:mysql://localhost:3306:3306/test`。
username	String	是	-	连接到数据库服务器时使用的数据库名称。
password	String	是	-	连接到数据库服务器时使用的密码。
database-names	List	否	-	要监视的数据库的名称。
table-names	List	是	-	要监视的数据库的表名。表名需要包含数据库名称，例如：`database_name.table_name`。
startup.mode	Enum	No	INITIAL	Optional startup mode for MySQL CDC consumer, valid enumerations are `initial`, `earliest`, `latest` and `specific`. <br/> `initial`: Synchronize historical data at startup, and then synchronize incremental data.<br/> `earliest`: Startup from the earliest offset possible.<br/> `latest`: Startup from the latest offset.<br/> `specific`: Startup from user-supplied specific offsets.
startup.specific-offset.file	String	No	-	Start from the specified binlog file name. Note, This option is required when the `startup.mode` option used `specific`.
startup.specific-offset.pos	Long	No	-	Start from the specified binlog file position. Note, This option is required when the `startup.mode` option used `specific`.
stop.mode	Enum	No	NEVER	Optional stop mode for MySQL CDC consumer, valid enumerations are `never`, `latest` or `specific`. <br/> `never`: Real-time job don't stop the source.<br/> `latest`: Stop from the latest offset.<br/> `specific`: Stop from user-supplied specific offset.
stop.specific-offset.file	String	No	-	Stop from the specified binlog file name. Note, This option is required when the `stop.mode` option used `specific`.
stop.specific-offset.pos	Long	No	-	Stop from the specified binlog file position. Note, This option is required when the `stop.mode` option used `specific`.
snapshot.split.size	Integer	No	8096	The split size (number of rows) of table snapshot, captured tables are split into multiple splits when read the snapshot of table.
snapshot.fetch.size	Integer	No	1024	The maximum fetch size for per poll when read table snapshot.
server-id	String	No	-	A numeric ID or a numeric ID range of this database client, The numeric ID syntax is like `5400`, the numeric ID range syntax is like '5400-5408'. <br/> Every ID must be unique across all currently-running database processes in the MySQL cluster. This connector joins the <br/> MySQL cluster as another server (with this unique ID) so it can read the binlog. <br/> By default, a random number is generated between 5400 and 6400, though we recommend setting an explicit value.
server-time-zone	String	No	UTC	The session time zone in database server. If not set, then ZoneId.systemDefault() is used to determine the server time zone.
connect.timeout.ms	Duration	No	30000	The maximum time that the connector should wait after trying to connect to the database server before timing out.
connect.max-retries	Integer	No	3	The max retry times that the connector should retry to build database server connection.
connection.pool.size	Integer	No	20	The jdbc connection pool size.
chunk-key.even-distribution.factor.upper-bound	Double	No	100	The upper bound of the chunk key distribution factor. This factor is used to determine whether the table data is evenly distributed. If the distribution factor is calculated to be less than or equal to this upper bound (i.e., (MAX(id) - MIN(id) + 1) / row count), the table chunks would be optimized for even distribution. Otherwise, if the distribution factor is greater, the table will be considered as unevenly distributed and the sampling-based sharding strategy will be used if the estimated shard count exceeds the value specified by `sample-sharding.threshold`. The default value is 100.0.
chunk-key.even-distribution.factor.lower-bound	Double	No	0.05	The lower bound of the chunk key distribution factor. This factor is used to determine whether the table data is evenly distributed. If the distribution factor is calculated to be greater than or equal to this lower bound (i.e., (MAX(id) - MIN(id) + 1) / row count), the table chunks would be optimized for even distribution. Otherwise, if the distribution factor is less, the table will be considered as unevenly distributed and the sampling-based sharding strategy will be used if the estimated shard count exceeds the value specified by `sample-sharding.threshold`. The default value is 0.05.
sample-sharding.threshold	Integer	No	1000	This configuration specifies the threshold of estimated shard count to trigger the sample sharding strategy. When the distribution factor is outside the bounds specified by `chunk-key.even-distribution.factor.upper-bound` and `chunk-key.even-distribution.factor.lower-bound`, and the estimated shard count (calculated as approximate row count / chunk size) exceeds this threshold, the sample sharding strategy will be used. This can help to handle large datasets more efficiently. The default value is 1000 shards.
inverse-sampling.rate	Integer	No	1000	The inverse of the sampling rate used in the sample sharding strategy. For example, if this value is set to 1000, it means a 1/1000 sampling rate is applied during the sampling process. This option provides flexibility in controlling the granularity of the sampling, thus affecting the final number of shards. It's especially useful when dealing with very large datasets where a lower sampling rate is preferred. The default value is 1000.
exactly_once	Boolean	No	true	Enable exactly once semantic.
format	Enum	No	DEFAULT	Optional output format for MySQL CDC, valid enumerations are `DEFAULT`、`COMPATIBLE_DEBEZIUM_JSON`.
debezium	Config	No	-	Pass-through Debezium's properties to Debezium Embedded Engine which is used to capture data changes from MySQL server.
common-options		no	-	Source plugin common parameters, please refer to Source Common Options for details

Task Example

Simple

Support multi-table reading

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 10000
}

source {
  MySQL-CDC {
    catalog = {
      factory = MySQL
    }
    base-url = "jdbc:mysql://localhost:3306/testdb"
    username = "root"
    password = "root@123"
    table-names = ["testdb.table1", "testdb.table2"]
    
    startup.mode = "initial"
  }
}

sink {
  Console {
  }
}

Support debezium-compatible format send to kafka

Must be used with kafka connector sink, see compatible debezium format for details

Changelog

Add MySQL CDC Source Connector

next version

通过对MySQL CDC源连接器的深入了解，我们不仅能够更好地掌握数据同步的核心机制，还能有效提升数据处理的效率和精度。

无论是在数据集成、实时分析还是其他复杂的数据处理场景中，MySQL CDC源连接器都将成为SeaTunnel用户强大的助手。随着数据技术的不断进步，期待看到更多创新和优化在未来版本中的实现，为开发者带来更多便利和可能。

本文由白鲸开源科技提供发布支持！

微信关注我们

原文链接：https://my.oschina.net/dailidong/blog/10319909

转载内容版权归作者及来源网站所有！

低调大师中文资讯倾力打造互联网数据资讯、行业资源、电子商务、移动互联网、网络营销平台。持续更新报道IT业界、互联网、市场资讯、驱动更新,是最及时权威的产业资讯及硬件资讯报道平台。

使用函数计算，数禾如何实现高效的数据处理？

作者｜邱鑫鑫，王彬，牟柏旭公司背景和业务数禾科技以大数据和技术为驱动，为金融机构提供高效的智能零售金融解决方案，服务银行、信托、消费金融公司、保险、小贷公司等持牌金融机构，业务涵盖消费信贷、小微企业信贷、场景分期等多个领域，提供营销获客、风险防控、运营管理等服务。数禾科技通过自主开发的消费信贷产品，连接金融机构与普罗大众，赋能金融机构数字化转型，迎接中国消费升级的大潮。数禾当前有三款主要产品，还呗，还享花，小店邦。每款产品都有大量的受众，每天会产生大量的应用日志，数据通过压缩后归档到阿里云 OSS 存储，以达到最优的存储成本。低效的数据处理应用日志通过 SLS 收集，压缩并归档到 OSS，整个链路都非常顺滑。但日常有些业务需要查看详细的应用日志，由于日志收集会将 APP 上不同应用的日志都打到一起。因此，获取某个应用的日志，需要从 OSS 解压大量的文件，并从中过滤出特定的应用，才可以进一步分析排查。这个过程在实效性和数据处理效率上都存在很大的问题，为此，数禾运维团队计划从源头重构整个任务处理链路，以求以最低的开发成本，最高的处理效率，最优的资源费用，最好的扩展性打造高可用，...

2023-12-07

450

本文分享自华为云社区《【安全攻防】深入浅出实战系列专题-安全随机数》，作者： MDKing 。随机数的使用场景使用随机数可分类安全场景跟非安全场景。非安全场景需要生成的越快越好。安全场景使用的随机数必须足够安全，保证不能被预测到。常见的非安全场景：数据的索引号、标识；文件的名称或目录； UUID、用户ID、随机填充字节；常见安全场景包括但不限于以下场景：用于密码算法用途，如生成IV、盐值、密钥等；会话标识（sessionId）的生成；挑战算法中的随机数生成；验证码的随机数生成；密码学意义上的安全随机数安全场景下使用的随机数必须是密码学意义上的安全随机数。密码学意义上的安全随机数分为两类：真随机数产生器产生的随机数；以真随机数产生器产生的少量随机数作为种子的密码学安全的伪随机数产生器产生的大量随机数。已知的可供产品使用的密码学安全的非物理真随机数产生器有： Linux操作系统的/dev/random设备接口（存在阻塞问题） Windows操作系统的 CryptGenRandom() 接口我们可以看到，密码学意义上的安全随机数非指必须都为真随机数生成器生成...

2023-12-07

483

资源下载

更多资源

Mario

马里奥是站在游戏界顶峰的超人气多面角色。马里奥靠吃蘑菇成长，特征是大鼻子、头戴帽子、身穿背带裤，还留着胡子。与他的双胞胎兄弟路易基一起，长年担任任天堂的招牌角色。

Nacos

Nacos /nɑ:kəʊs/ 是 Dynamic Naming and Configuration Service 的首字母简称，一个易于构建 AI Agent 应用的动态服务发现、配置管理和AI智能体管理平台。Nacos 致力于帮助您发现、配置和管理微服务及AI智能体应用。Nacos 提供了一组简单易用的特性集，帮助您快速实现动态服务发现、服务配置、服务元数据、流量管理。Nacos 帮助您更敏捷和容易地构建、交付和管理微服务平台。

Rocky Linux

Rocky Linux（中文名：洛基）是由Gregory Kurtzer于2020年12月发起的企业级Linux发行版，作为CentOS稳定版停止维护后与RHEL（Red Hat Enterprise Linux）完全兼容的开源替代方案，由社区拥有并管理，支持x86_64、aarch64等架构。其通过重新编译RHEL源代码提供长期稳定性，采用模块化包装和SELinux安全架构，默认包含GNOME桌面环境及XFS文件系统，支持十年生命周期更新。

WebStorm

WebStorm 是jetbrains公司旗下一款JavaScript 开发工具。目前已经被广大中国JS开发者誉为“Web前端开发神器”、“最强大的HTML5编辑器”、“最智能的JavaScript IDE”等。与IntelliJ IDEA同源，继承了IntelliJ IDEA强大的JS部分的功能。