Analysis and Optimization of a Redis Infinite Full Resynchronization Problem

Parts of this article are drawn from the book 《Redis开发与运维》 (Redis Development and Operations); please credit the source when reposting.

I. Symptoms and Impact

A production machine's memory usage was approaching 90% of its 24 GB total; the overall deployment is shown in the figure below:

We needed to migrate Redis-2 off this machine. For particular reasons this node had no slave, so the plan was to add a slave node and then perform a failover.
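
For reference, a minimal sketch of that plan with redis-cli (host/port values are taken from the logs below; in our environment the failover itself is driven by CacheCloud, so this is illustrative only):

# On the new slave (10.10.xx.138:6392): start replicating from the master
redis-cli -h 10.10.xx.138 -p 6392 slaveof 10.10.xx.200 6390

# After the slave has caught up: promote it (manual failover sketch)
redis-cli -h 10.10.xx.138 -p 6392 slaveof no one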

Watching the logs, we found the master and slave doing full resynchronization over and over again.

1. The slave connects to the master for the first time and performs a full resync.

36207:S 23 Nov 21:39:43.086 * Connecting to MASTER 10.10.xx.200:6390
36207:S 23 Nov 21:39:43.086 * MASTER <-> SLAVE sync started
36207:S 23 Nov 21:39:43.087 * Non blocking connect for SYNC fired the event.
36207:S 23 Nov 21:39:43.087 * Master replied to PING, replication can continue...
36207:S 23 Nov 21:39:43.087 * (Non critical) Master does not understand REPLCONF capa: -ERR Unrecognized REPLCONF option: capa
36207:S 23 Nov 21:39:43.087 * Partial resynchronization not possible (no cached master)
36207:S 23 Nov 21:39:43.096 * Full resync from master: e3e319f97545661d820bc573c3f9e0436789d912:1
36207:S 23 Nov 21:39:43.302 # CONFIG REWRITE executed with success.
36207:S 23 Nov 21:40:28.268 * MASTER <-> SLAVE sync: receiving 2203990784 bytes from master
36207:S 23 Nov 21:41:12.721 * MASTER <-> SLAVE sync: Flushing old data
36207:S 23 Nov 21:41:12.721 * MASTER <-> SLAVE sync: Loading DB in memory
36207:S 23 Nov 21:42:24.117 * MASTER <-> SLAVE sync: Finished with success
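
While a full resync like this is in flight, its progress can be watched from the slave side; a quick sketch (Redis 3.0 INFO replication fields; the values shown are illustrative):

redis-cli -h 10.10.xx.138 -p 6392 info replication
# role:slave
# master_host:10.10.xx.200
# master_port:6390
# master_link_status:down          <- stays down until the RDB is received and loaded
# master_sync_in_progress:1
# master_sync_left_bytes:1203990784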

2. The connection between the slave and the master is lost.

36207:S 23 Nov 21:42:24.152 # Connection with master lost.
36207:S 23 Nov 21:42:24.177 * Caching the disconnected master state.

3. The slave reconnects to the master and a full resync starts all over again (the requested offset is no longer covered by the replication backlog, so a partial resync is impossible).

36207:S 23 Nov 21:42:24.921 * Connecting to MASTER 10.10.xx.200:6390
36207:S 23 Nov 21:42:24.921 * MASTER <-> SLAVE sync started
36207:S 23 Nov 21:42:24.921 * Non blocking connect for SYNC fired the event.
36207:S 23 Nov 21:42:24.921 * Master replied to PING, replication can continue...
36207:S 23 Nov 21:42:24.922 * (Non critical) Master does not understand REPLCONF capa: -ERR Unrecognized REPLCONF option: capa
36207:S 23 Nov 21:42:24.922 * Trying a partial resynchronization (request e3e319f97545661d820bc573c3f9e0436789d912:3851662).
36207:S 23 Nov 21:42:24.922 * Full resync from master: e3e319f97545661d820bc573c3f9e0436789d912:226383867
36207:S 23 Nov 21:42:24.922 * Discarding previously cached master state.
36207:S 23 Nov 21:43:01.148 * MASTER <-> SLAVE sync: receiving 2206702643 bytes from master
36207:S 23 Nov 21:44:20.419 * MASTER <-> SLAVE sync: Flushing old data
36207:S 23 Nov 21:44:39.331 * MASTER <-> SLAVE sync: Loading DB in memory
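
Why the partial resync fails is visible in INFO replication on the master; a rough sketch (backlog fields as in Redis 3.0; the offset comes from the logs, while the backlog values are illustrative, 1 MB being the default):

redis-cli -h 10.10.xx.200 -p 6390 info replication | grep -E 'master_repl_offset|repl_backlog'
# master_repl_offset:226383867
# repl_backlog_size:1048576                  <- default 1 MB backlog
# repl_backlog_first_byte_offset:225335292

# A partial resync is only possible if the slave's requested offset (3851662 in
# the log above) still lies inside [first_byte_offset, master_repl_offset];
# here it is far behind, so the master falls back to a full resync.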

Impact: this clearly degenerates into an endless loop of full resyncs. Since the shard itself is large, each round consumes considerable machine resources and affects the other Redis instances on the same machine.

II. Analysis

1. Network problems causing the reconnects?

(1) The added slave was on the same network segment, and redis-cli --latency showed nothing abnormal.
(2) We added yet another slave node, and the same situation still occurred.
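
For reference, the latency check looked like this (standard redis-cli --latency output format; the numbers are illustrative):

redis-cli -h 10.10.xx.200 -p 6390 --latency
# min: 0, max: 2, avg: 0.08 (4396 samples)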

2. Caused by the Redis version?

The logs contained the following line:

36207:S 23 Nov 21:42:24.922 * (Non critical) Master does not understand REPLCONF capa: -ERR Unrecognized REPLCONF option: capa

Our first suspicion was the version difference (master: 3.0.2, slave: 3.0.5), so we reproduced the same pairing in a test environment: replication worked fine, and the "(Non critical)" wording itself suggests the problem does not lie here.
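
The reproduction was straightforward; a minimal sketch (install paths and ports here are hypothetical):

# Master on 3.0.2, slave on 3.0.5 (hypothetical paths/ports)
/opt/redis-3.0.2/src/redis-server --port 6400 &
/opt/redis-3.0.5/src/redis-server --port 6401 --slaveof 127.0.0.1 6400 &

# Replication completed normally despite the REPLCONF capa warning
redis-cli -p 6401 info replication | grep master_link_status
# master_link_status:up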

3. The connection really is being dropped for some reason.

Up to this point we had not yet looked at the master's logs. They read as follows:

(1) The master receives the slave's synchronization request, runs BGSAVE to generate an RDB, and sends it to the slave.

2030:M 23 Nov 21:39:43.125 * Slave 10.10.xx.138:6392 asks for synchronization
2030:M 23 Nov 21:39:43.134 * Full resync requested by slave 10.10.xx.138:6392
2030:M 23 Nov 21:39:43.134 * Starting BGSAVE for SYNC with target: disk
2030:M 23 Nov 21:39:43.327 * Background saving started by pid 46221
2030:M 23 Nov 21:39:43.337 # CONFIG REWRITE executed with success.
46221:C 23 Nov 21:40:27.992 * DB saved on disk
46221:C 23 Nov 21:40:28.097 * RDB: 451 MB of memory used by copy-on-write
2030:M 23 Nov 21:40:28.305 * Background saving terminated with success
2030:M 23 Nov 21:41:12.736 * Synchronization with slave 10.10.xx.138:6392 succeeded

(2) The slave's client connection on the master is killed for exceeding client-output-buffer-limit slave, which CacheCloud sets to 512mb 128mb 60. Because the shard is large, the full resync takes a long time, and the master's write volume is high, every write the master accepts during the BGSAVE, RDB transfer, and RDB load (roughly 21:39:43 to 21:42:24 in the logs above) piles up in that connection's output buffer. At 21:41:42 the buffer had reached omem=266324615 (~254 MB), above the 128 MB soft limit for more than 60 seconds, so the master closed the connection (see the log below, and the config sketch after it).

2030:M 23 Nov 21:41:42.027 # Client id=28191723 addr=10.10.xx.138:38638 fd=735 name= age=119 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=10514 omem=266324615 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
2030:M 23 Nov 21:41:42.028 # Connection with slave 10.10.xx.138:6392 lost.
2030:M 23 Nov 21:42:24.950 * Slave 10.10.xx.138:6392 asks for synchronization
2030:M 23 Nov 21:42:24.950 * Unable to partial resync with slave 10.10.xx.138:6392 for lack of backlog (Slave request was: 3851662).
2030:M 23 Nov 21:42:24.950 * Starting BGSAVE for SYNC with target: disk
2030:M 23 Nov 21:42:25.158 * Background saving started by pid 47033
47033:C 23 Nov 21:42:59.788 * DB saved on disk
.....
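
For context, the limit format is client-output-buffer-limit <class> <hard> <soft> <soft-seconds>; a minimal sketch of inspecting and raising it at runtime (the raised values are illustrative, not a recommendation):

# Current limits; CacheCloud's slave class is 512mb/128mb/60s
redis-cli -h 10.10.xx.200 -p 6390 config get client-output-buffer-limit
# 1) "client-output-buffer-limit"
# 2) "normal 0 0 0 slave 536870912 134217728 60 pubsub 33554432 8388608 60"

# Raise the slave class at runtime: hard 1 GB, soft 256 MB over 120 s (illustrative)
redis-cli -h 10.10.xx.200 -p 6390 config set client-output-buffer-limit "slave 1073741824 268435456 120"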

In fact an alert had already fired, warning that client_longest_output_list was too large.
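
That metric comes from INFO clients; a quick check looks like this (the output-list length matches oll=10514 in the log above; the other values are illustrative):

redis-cli -h 10.10.xx.200 -p 6390 info clients
# connected_clients:1022
# client_longest_output_list:10514
# client_biggest_input_buf:0
# blocked_clients:0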

Putting the above together, the cause is clear: on this large, write-heavy shard the full resync takes so long that the slave's output buffer on the master overruns the soft limit, the connection is killed, the backlog cannot cover the resulting gap, and the full resync starts over, indefinitely.

III. Follow-up Optimizations

1. For write-heavy, large shards, increase client-output-buffer-limit slave appropriately.
2. Avoid oversized shards; try to keep each node under 4 GB.
3. Monitor key log messages to speed up problem diagnosis, as sketched below.
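
As a starting point, a minimal monitoring sketch as a cron-able shell script (the log path, thresholds, and alert hooks are all placeholders for whatever your environment uses):

#!/bin/bash
# Hypothetical path/threshold values; adapt to your environment.
LOG=/opt/redis/logs/redis-6390.log
HOST=10.10.xx.200
PORT=6390

# 1. Alert when the master reports killed slave clients or failed partial resyncs
if tail -n 1000 "$LOG" | grep -Eq 'overcoming of output buffer limits|Unable to partial resync'; then
    echo "redis $HOST:$PORT replication problem detected"   # replace with your alert hook
fi

# 2. Alert when the longest client output list grows too large (threshold is arbitrary)
longest=$(redis-cli -h "$HOST" -p "$PORT" info clients | awk -F: '/client_longest_output_list/{print $2}' | tr -d '\r')
if [ "${longest:-0}" -gt 10000 ]; then
    echo "redis $HOST:$PORT client_longest_output_list=$longest"   # replace with your alert hook
fi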