您现在的位置是：首页 > 文章详情

MySQL运行时的可观测性

日期：2023-08-21点击：241收藏

1.说在前面的话
2.安装employees测试库
3.观测SQL运行状态
- 3.1 观测SQL运行时的内存消耗
- 3.2 观测SQL运行时的其他开销
- 3.3 观测SQL运行进度

感知SQL运行时的状态

1. 说在前面的话

在MySQL里，一条SQL运行时产生多少磁盘I/O，占用多少内存，是否有创建临时表，这些指标如果都能观测到，有助于更快发现SQL瓶颈，扑灭潜在隐患。

从MySQL 5.7版本开始，performance_schema就默认启用了，并且还增加了sys schema，到了8.0版本又进一步得到增强提升，在SQL运行时就能观察到很多有用的信息，实现一定程度的可观测性。

下面举例说明如何进行观测，以及主要观测哪些指标。

2. 安装employees测试库

安装MySQL官方提供的employees测试数据库，戳此链接(https://dev.mysql.com/doc/index-other.html)下载，解压缩后开始安装：

$ mysql -f < employees.sql; INFO CREATING DATABASE STRUCTURE INFO storage engine: InnoDB INFO LOADING departments INFO LOADING employees INFO LOADING dept_emp INFO LOADING dept_manager INFO LOADING titles INFO LOADING salaries data_load_time_diff 00:00:37

MySQL还提供了相应的使用文档：https://dev.mysql.com/doc/employee/en/

本次测试采用GreatSQL 8.0.32-24版本，且运行在MGR环境中：

greatsql> \s ... Server version: 8.0.32-24 GreatSQL, Release 24, Revision 3714067bc8c ... greatsql> select MEMBER_ID, MEMBER_ROLE, MEMBER_VERSION from performance_schema.replication_group_members; +--------------------------------------+-------------+----------------+ | MEMBER_ID | MEMBER_ROLE | MEMBER_VERSION | +--------------------------------------+-------------+----------------+ | 2adec6d2-febb-11ed-baca-d08e7908bcb1 | SECONDARY | 8.0.32 | | 2f68fee2-febb-11ed-b51e-d08e7908bcb1 | ARBITRATOR | 8.0.32 | | 5e34a5e2-feb6-11ed-b288-d08e7908bcb1 | PRIMARY | 8.0.32 | +--------------------------------------+-------------+----------------+

3. 观测SQL运行状态

查看当前连接/会话的连接ID、内部线程ID：

greatsql> select processlist_id, thread_id from performance_schema.threads where processlist_id = connection_id(); +----------------+-----------+ | processlist_id | thread_id | +----------------+-----------+ | 110 | 207 | +----------------+-----------+

查询得到当前的连接ID=110，内部线程ID=207。

P.S，由于本文整理过程不是连续的，所以下面看到的 thread_id 值可能会有好几个，每次都不同。

3.1 观测SQL运行时的内存消耗

执行下面的SQL，查询所有员工的薪资总额，按员工号分组，并按薪资总额倒序，取前10条记录：

greatsql> explain select emp_no, sum(salary) as total_salary from salaries group by emp_no order by total_salary desc limit 10\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: salaries partitions: NULL type: index possible_keys: PRIMARY key: PRIMARY key_len: 7 ref: NULL rows: 2838426 filtered: 100.00 Extra: Using temporary; Using filesort

看到需要全索引扫描（其实也等同于全表扫描，因为是基于PRIMARY索引），并且还需要生成临时表，以及额外的filesort。

在正式运行该SQL之前，在另外的窗口中新建一个连接会话，执行下面的SQL先观察该连接/会话当前的内存分配情况：

greatsql> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 207\G *************************** 1. row *************************** thread_id: 207 user: root@localhost current_count_used: 9 current_allocated: 26266 current_avg_alloc: 2918.4444 current_max_alloc: 16464 total_allocated: 30311

等到该SQL执行完了，再一次查询内存分配情况：

greatsql> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 207\G *************************** 1. row *************************** thread_id: 207 user: root@localhost current_count_used: 13 current_allocated: 24430 current_avg_alloc: 1879.2308 current_max_alloc: 16456 total_allocated: 95719

我们注意到几个数据的变化情况，用下面表格来展示：

指标	运行前	运行后
total_allocated	30311	95719

也就是说，SQL运行时，需要分配的内存是：95719 - 30311 = 65408 字节。

3.2 观测SQL运行时的其他开销

通过观察 performance_schema.status_by_thread 表，可以知道相应连接/会话中SQL运行的一些状态指标。在SQL运行结束后，执行下面的SQL命令即可查看：

greatsql> select * from performance_schema.status_by_thread where thread_id = 207; ... | 207 | Created_tmp_disk_tables | 0 | | 207 | Created_tmp_tables | 0 | ... | 207 | Handler_read_first | 1 | | 207 | Handler_read_key | 1 | | 207 | Handler_read_last | 0 | | 207 | Handler_read_next | 2844047 | | 207 | Handler_read_prev | 0 | | 207 | Handler_read_rnd | 0 | | 207 | Handler_read_rnd_next | 0 | | 207 | Handler_rollback | 0 | | 207 | Handler_savepoint | 0 | | 207 | Handler_savepoint_rollback | 0 | | 207 | Handler_update | 0 | | 207 | Handler_write | 0 | | 207 | Last_query_cost | 286802.914893 | | 207 | Last_query_partial_plans | 1 | ... | 207 | Select_full_join | 0 | | 207 | Select_full_range_join | 0 | | 207 | Select_range | 0 | | 207 | Select_range_check | 0 | | 207 | Select_scan | 1 | | 207 | Slow_launch_threads | 0 | | 207 | Slow_queries | 1 | | 207 | Sort_merge_passes | 0 | | 207 | Sort_range | 0 | | 207 | Sort_rows | 1 | | 207 | Sort_scan | 1 | ...

上面我们只罗列了部分比较重要的状态指标。从这个结果也可以佐证slow query log中的结果，确实没创建临时表。

作为参照，查看这条SQL对应的slow query log记录：

# Query_time: 0.585593 Lock_time: 0.000002 Rows_sent: 10 Rows_examined: 2844057 Thread_id: 110 Errno: 0 Killed: 0 Bytes_received: 115 Bytes_sent: 313 Read_first: 1 Read_last: 0 Read_key: 1 Read_next: 2844047 Read_prev: 0 Read_rnd: 0 Read_rnd_next: 0 Sort_merge_passes: 0 Sort_range_count: 0 Sort_rows: 10 Sort_scan_count: 1 Created_tmp_disk_tables: 0 Created_tmp_tables: 0 Start: 2023-07-06T10:06:01.438376+08:00 End: 2023-07-06T10:06:02.023969+08:00 Schema: employees Rows_affected: 0 # Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0 # InnoDB_trx_id: 0 # Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No # Filesort: Yes Filesort_on_disk: No Merge_passes: 0 # InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000 # InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000 # InnoDB_pages_distinct: 4281 use employees; SET timestamp=1688609161; select emp_no, sum(salary) as total_salary from salaries group by emp_no order by total_salary desc limit 10;

可以看到，Created_tmp_disk_tables, Created_tmp_tables, Handler_read_next, Select_full_join, Select_scan, Sort_rows, Sort_scan, 等几个指标的数值是一样的。

还可以查看该SQL运行时的I/O latency情况，SQL运行前后两次查询对比：

greatsql> select * from sys.io_by_thread_by_latency where thread_id = 207; +----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+ | user | total | total_latency | min_latency | avg_latency | max_latency | thread_id | processlist_id | +----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+ | root@localhost | 7 | 75.39 us | 5.84 us | 10.77 us | 22.12 us | 207 | 110 | +----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+ ... greatsql> select * from sys.io_by_thread_by_latency where thread_id = 207; +----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+ | user | total | total_latency | min_latency | avg_latency | max_latency | thread_id | processlist_id | +----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+ | root@localhost | 8 | 85.29 us | 5.84 us | 10.66 us | 22.12 us | 207 | 110 | +----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+

可以看到这个SQL运行时的I/O latency是：85.29 - 75.39 = 9.9us。

3.3 观测SQL运行进度

我们知道，运行完一条SQL后，可以利用PROFLING功能查看它各个阶段的耗时，但是在运行时如果也想查看各阶段耗时该怎么办呢？

从MySQL 5.7版本开始，可以通过 performance_schema.events_stages_% 相关表查看SQL运行过程以及各阶段耗时，需要先修改相关设置：

# 确认是否对所有主机&用户都启用 greatsql> SELECT * FROM performance_schema.setup_actors; +------+------+------+---------+---------+ | HOST | USER | ROLE | ENABLED | HISTORY | +------+------+------+---------+---------+ | % | % | % | NO | NO | +------+------+------+---------+---------+ # 修改成对所有主机&用户都启用 greatsql> UPDATE performance_schema.setup_actors SET ENABLED = 'YES', HISTORY = 'YES' WHERE HOST = '%' AND USER = '%'; # 修改 setup_instruments & setup_consumers 设置 greatsql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%events_statements_%'; greatsql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%events_stages_%';

这就实时可以观测SQL运行过程中的状态了。

在SQL运行过程中，从另外的窗口查看该SQL对应的 EVENT_ID：

greatsql> SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL_TEXT FROM performance_schema.events_statements_history WHERE thread_id = 85 order by event_id desc limit 5; +----------+----------+-------------------------------------------------------------------------------------------------------------------------------+ | EVENT_ID | Duration | SQL_TEXT | +----------+----------+-------------------------------------------------------------------------------------------------------------------------------+ | 149845 | 0.6420 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 | | 149803 | 0.6316 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 | | 149782 | 0.6245 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 | | 149761 | 0.6361 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 | | 149740 | 0.6245 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 | +----------+----------+-------------------------------------------------------------------------------------------------------------------------------+ # 再根据 EVENT_ID 值去查询 events_stages_history_long greatsql> SELECT thread_id ,event_Id, event_name AS Stage, TRUNCATE(TIMER_WAIT/1000000000000,6) AS Duration FROM performance_schema.events_stages_history_long WHERE NESTING_EVENT_ID = 149845 order by event_id; +-----------+----------+------------------------------------------------+----------+ | thread_id | event_Id | Stage | Duration | +-----------+----------+------------------------------------------------+----------+ | 85 | 149846 | stage/sql/starting | 0.0000 | | 85 | 149847 | stage/sql/Executing hook on transaction begin. | 0.0000 | | 85 | 149848 | stage/sql/starting | 0.0000 | | 85 | 149849 | stage/sql/checking permissions | 0.0000 | | 85 | 149850 | stage/sql/Opening tables | 0.0000 | | 85 | 149851 | stage/sql/init | 0.0000 | | 85 | 149852 | stage/sql/System lock | 0.0000 | | 85 | 149854 | stage/sql/optimizing | 0.0000 | | 85 | 149855 | stage/sql/statistics | 0.0000 | | 85 | 149856 | stage/sql/preparing | 0.0000 | | 85 | 149857 | stage/sql/Creating tmp table | 0.0000 | | 85 | 149858 | stage/sql/executing | 0.6257 | | 85 | 149859 | stage/sql/end | 0.0000 | | 85 | 149860 | stage/sql/query end | 0.0000 | | 85 | 149861 | stage/sql/waiting for handler commit | 0.0000 | | 85 | 149862 | stage/sql/closing tables | 0.0000 | | 85 | 149863 | stage/sql/freeing items | 0.0000 | | 85 | 149864 | stage/sql/logging slow query | 0.0000 | | 85 | 149865 | stage/sql/cleaning up | 0.0000 | +-----------+----------+------------------------------------------------+----------+

上面就是这条SQL的运行进度展示，以及各个阶段的耗时，和PROFILING的输出一样，当我们了解一条SQL运行所需要经历的各个阶段时，从上面的输出结果中也就能估算出该SQL大概还要多久能跑完，决定是否要提前kill它。

如果想要观察DDL SQL的运行进度，可以参考这篇文章：不用MariaDB/Percona也能查看DDL的进度。

更多的观测指标、维度还有待继续挖掘，以后有机会再写。

另外，也可以利用MySQL Workbench工具，或MySQL Enterprise Monitor，都已集成了很多可观测性指标，相当不错的体验。

延伸阅读