主从同步UUID问题：found a zombie dump thread with the same UUID

vlambda
2020-08-15

主从同步UUID问题：found a zombie dump thread with the same UUID

问题现象：

环境情况：

MySQL:5.7.21架构1主2从M<-->M—>SLAVE

现象分析：

1、每60秒产生一次报警2、相同的uuid

问题原因分析：

原因

Uuid 重复问题

机器上的auto.conf id问题

网络问题

备份时停止同步问题

slave 超时时间slave_net_timeout

版本配置问题

排查思路：

排查思路	排查内容	排查结果
1	Uuid 重复问题	都是唯一
2	机器上的Auto.conf	都是唯一
3	网络问题	利用iptables模拟
4	备份停止时的操作	备份时间不对
5	Master 数据不更新Slave 超时时间问题	有的db组大量更新还是存在
6	Slave 重启	很像，但是凌晨没人重启

测试步骤：

根据上面的排查思路准备环境进行开始测试。

环境：

Master：10.110.10.31Slave:10.110.10.30互为主从Sysbench:用来测试插入数据配置同步（略）

测试

Uuid 与 auto.conf 相同因为做主从配置时通常拷贝数据，因此有可能存在auto相同的情况，但是一一确认后发现不同。

测试slave 重启，master日志情况

在10.110.10.30 上执行如下

Stop slave;

Start salve;

发现日志如下：

Master 日志：

2020-08-05T17:48:46.033859+08:00 63510[Note] While initializing dump thread for slave with UUID<db3e0ce7-a3eb-11ea-9068-52540019c2f3>, found a zombie dump thread withthe same UUID. Master is killing the zombie dump thread(8398).2020-08-05T17:48:46.034067+08:00 63510 [Note] Start binlog_dump to master_thread_id(63510)slave_server(9013100), pos(, 4)

解说：

发现日志很符合线上报错的日志，但是查日志发现每60秒就会产生，线上也不可能频繁重启slave。

测试网络

在10.110.10.31 上执行IP tables 限制30访问，查看日志情况

iptables-I INPUT -s 10.110.10.30 -j DROP

slave 同步状态

mysql> show slave status\G;*************************** 1. row ***************************Slave_IO_State: Reconnecting after a failed master event readMaster_Host: 10.110.10.31Master_User: replicMaster_Port: 3070Connect_Retry: 60Master_Log_File: mysql-bin.000008Read_Master_Log_Pos: 2659Relay_Log_File: relay-bin.000010Relay_Log_Pos: 2832Relay_Master_Log_File: mysql-bin.000008Slave_IO_Running: ConnectingSlave_SQL_Running: Yes

30上日志：

2020-08-05T20:08:04.931077+08:00 63498 [Warning]Storing MySQL user name or password information in the master info repositoryis not secure and is therefore not recommended. Please consider using the USERand PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax'in the MySQL Manual for more information.2020-08-05T20:08:34.869063+08:00 63510 [Note] Aborted connection 63510 to db:'unconnected' user: 'replic' host: '10.110.10.30' (failed on flush_net())2020-08-05T20:08:34.925537+08:00 63792 [Note] Start binlog_dump tomaster_thread_id(63792) slave_server(9013100), pos(, 4)

解说：

发现日志访问master 出现flush_net() 关键字，但是线上并没有，因此排除网络问题。

备份停止时的操作xtrabackup

因为备份是利用percona 的xtrabackup工具，加了slave信息保持一致性的参数--slave-info但是发现时间对应不上因此排除

思路中断

此时发现常规的想法已经实现不通，需要查下同步参数，翻阅官方文档发现同步参数slave_net_timeout，根据参数定义为slave 在没有收到master 数据时的超时时间，会再次重新连接master，感觉有点线索了。对slave_net_timeout开始进行测试。此值默认为60

缩短 slave 网络超时时间,但没数据

在10.110.10.30 上执行如下：

Stop slave;set global slave_net_timeout=5;start salve;

测试日志：

10.110.10.31 上的日志:

2020-08-06T09:59:31.226156+08:00 65500 [Note] While initializingdump thread for slave with UUID <db3e0ce7-a3eb-11ea-9068-52540019c2f3>,found a zombie dump thread with the same UUID. Master is killing the zombiedump thread(65499).2020-08-06T09:59:31.226273+08:00 65500 [Note] Start binlog_dump tomaster_thread_id(65500) slave_server(901311100), pos(mysql-bin.000008, 828)2020-08-06T09:59:32.237103+08:00 65501 [Note] While initializing dump threadfor slave with UUID <db3e0ce7-a3eb-11ea-9068-52540019c2f3>, found azombie dump thread with the same UUID. Master is killing the zombie dumpthread(65500).2020-08-06T09:59:32.237253+08:00 65501 [Note] Start binlog_dump tomaster_thread_id(65501) slave_server(901311100), pos(mysql-bin.000008, 828)2020-08-06T09:59:33.245805+08:00 65502 [Note] While initializing dump threadfor slave with UUID <db3e0ce7-a3eb-11ea-9068-52540019c2f3>, found azombie dump thread with the same UUID. Master is killing the zombie dumpthread(65501).2020-08-06T09:59:33.245910+08:00 65502 [Note] Start binlog_dump to master_thread_id(65502)slave_server(901311100), pos(mysql-bin.000008, 828)

10.110.10.30 上的日志:

2020-08-06T10:02:01.030159+08:00 18 [Warning] Storing MySQL username or password information in the master info repository is not secure and istherefore not recommended. Please consider using the USER and PASSWORDconnection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQLManual for more information.

解说：

发现结果日志已经符合预期，当slave_net_timeout时间过短，会频繁重连，发现产生日志和问题一样，那么接下来我插入数据看下情况

缩短 slave 网络超时时间,但有数据插入

在10.110.10.30 上执行如下：

Stop slave;set global slave_net_timeout=5;start salve;

插入数据：

sysbench/root/sysbench-1.0.19/tests/include/oltp_legacy/oltp.lua--mysql-host=10.110.10.30 --mysql-user=root --mysql-password='xxx'--mysql-port=3070 --mysql-db=sbtest --oltp-table-size=10000000--oltp-tables-count=1 --threads=1 --events=500000 --time=1200--report-interval=1 prepare

日志：

主从同步UUID问题：found a zombie dump thread with the same UUID

解说：

发现已经没有日志输出，说明插入数据时，说明slave_net_timeout在有master 有binlog日志产生的情况下没问题，只有在空闲时会发生。

再次中断

可以验证slave_net_timeout 是产生问题的原因，但是线上有很多组db 没有日志产生也没有报错，反而写入量大的会存在问题，此时陷入迷惑中，因为此理由不能说服领导啊。继续看官方文档。发下如下信息

主从同步UUID问题：found a zombie dump thread with the same UUID

发现有个MASTER_HEARTBEAT_PERIOD参数，意思是多长时间对master 进行探测是否存活，默认为slave_net_timeout的一半也就是30，有线索了继续测试

MASTER_HEARTBEAT_PERIOD测试

对于发生问题的db机器进行查询MASTER_HEARTBEAT_PERIOD参数的值，发现为1800

而正常机器都是30，发现其中猫腻。

mysql> select * from mysql.slave_master_info\G*************************** 1. row ***************************Number_of_lines: 25Master_log_name: mysql-bin.000651Master_log_pos: 131563718Connect_retry: 60Heartbeat: 1800

在10.110.10.30slave上执行

slave：上

mysql> stop slave;Query OK, 0 rows affected (0.00 sec)mysql> change master to master_heartbeat_period = 1800;Query OK, 0 rows affected, 1 warning (0.11 sec)mysql> start slave;Query OK, 0 rows affected (0.00 sec)

查看日志：

Slave上：

Master上：

发现问题已经复现。

。

这个确实是从5.5 升级到5.7 的版本，slave_net_timetou原来是3600，Heartbeat：1800，因此问题已经定位。

vlambda博客
学习文章列表