Redis无限全量复制问题分析与优化

本文部分内容来自《Redis开发与运维》一书，转载请声明。

一、现象和危害

线上有台机器内存接近了90%，总内存为24G，整个部署如下图：

现要将Redis-2迁移走，由于特殊原因此节点没有slave节点，需要添加一个slave节点，然后做failover操作。

通过对日志的观察，发现主从不停地做全量复制。

1.slave首次连接master做全量复制。

36207:S 23 Nov 21:39:43.086 * Connecting to MASTER 10.10.xx.200:6390
36207:S 23 Nov 21:39:43.086 * MASTER <-> SLAVE sync started
36207:S 23 Nov 21:39:43.087 * Non blocking connect for SYNC fired the event.
36207:S 23 Nov 21:39:43.087 * Master replied to PING, replication can continue...
36207:S 23 Nov 21:39:43.087 * (Non critical) Master does not understand REPLCONF capa: -ERR Unrecognized REPLCONF option: capa
36207:S 23 Nov 21:39:43.087 * Partial resynchronization not possible (no cached master)
36207:S 23 Nov 21:39:43.096 * Full resync from master: e3e319f97545661d820bc573c3f9e0436789d912:1
36207:S 23 Nov 21:39:43.302 # CONFIG REWRITE executed with success.
36207:S 23 Nov 21:40:28.268 * MASTER <-> SLAVE sync: receiving 2203990784 bytes from master
36207:S 23 Nov 21:41:12.721 * MASTER <-> SLAVE sync: Flushing old data
36207:S 23 Nov 21:41:12.721 * MASTER <-> SLAVE sync: Loading DB in memory
36207:S 23 Nov 21:42:24.117 * MASTER <-> SLAVE sync: Finished with success

2.slave与master连接断开

1 2	36207:S 23 Nov 21:42:24.152 # Connection with master lost. 36207:S 23 Nov 21:42:24.177 * Caching the disconnected master state.

3.slave与master重连，重新开始全量复制。(因为复制偏移量不够，无法达到部分复制)

36207:S 23 Nov 21:42:24.921 * Connecting to MASTER 10.10.xx.200:6390
36207:S 23 Nov 21:42:24.921 * MASTER <-> SLAVE sync started
36207:S 23 Nov 21:42:24.921 * Non blocking connect for SYNC fired the event.
36207:S 23 Nov 21:42:24.921 * Master replied to PING, replication can continue...
36207:S 23 Nov 21:42:24.922 * (Non critical) Master does not understand REPLCONF capa: -ERR Unrecognized REPLCONF option: capa
36207:S 23 Nov 21:42:24.922 * Trying a partial resynchronization (request e3e319f97545661d820bc573c3f9e0436789d912:3851662).
36207:S 23 Nov 21:42:24.922 * Full resync from master: e3e319f97545661d820bc573c3f9e0436789d912:226383867
36207:S 23 Nov 21:42:24.922 * Discarding previously cached master state.
36207:S 23 Nov 21:43:01.148 * MASTER <-> SLAVE sync: receiving 2206702643 bytes from master
36207:S 23 Nov 21:44:20.419 * MASTER <-> SLAVE sync: Flushing old data
36207:S 23 Nov 21:44:39.331 * MASTER <-> SLAVE sync: Loading DB in memory

危害：很明显这样会进入无限次全量复制的模式，本身分片比较大，对机器资源有一定开销，会对该机器上其他实例造成影响。

二、分析

1.网络原因导致重连：

(1) 添加的slave为同一个网段，使用redis-cli latency未测试出异常
(2) 添加另一个slave节点，发现依然出现上述情况。

2.Redis版本造成。

在日志中发现了如下一句话：

1	36207:S 23 Nov 21:42:24.922 * (Non critical) Master does not understand REPLCONF capa: -ERR Unrecognized REPLCONF option: capa

初步怀疑是Redis版本造成，master:3.0.2，slave:3.0.5，于是模拟实验一下，发现主从复制正常，而且从(not critical)可以看到应该问题不在这里。

3.连接确实由于某些原因断开。

到现在还有看主节点的日志，如下：

(1) master节点接收到了slave节点的复制请求，并生成RDB传给slave节点。

2030:M 23 Nov 21:39:43.125 * Slave 10.10.xx.138:6392 asks for synchronization
2030:M 23 Nov 21:39:43.134 * Full resync requested by slave 10.10.xx.138:6392
2030:M 23 Nov 21:39:43.134 * Starting BGSAVE for SYNC with target: disk
2030:M 23 Nov 21:39:43.327 * Background saving started by pid 46221
2030:M 23 Nov 21:39:43.337 # CONFIG REWRITE executed with success.
46221:C 23 Nov 21:40:27.992 * DB saved on disk
46221:C 23 Nov 21:40:28.097 * RDB: 451 MB of memory used by copy-on-write
2030:M 23 Nov 21:40:28.305 * Background saving terminated with success
2030:M 23 Nov 21:41:12.736 * Synchronization with slave 10.10.xx.138:6392 succeeded

(2) slave节点的客户端连接被杀掉，由于超过了client-output-buffer-limit slave，cachecloud的使用的是512mb 128mb 60。由于分片比较大全量复制时间较长，且master写入量较大，所以slave节点的客户端被干掉了(omem=266324615)。

2030:M 23 Nov 21:41:42.027 # Client id=28191723 addr=10.10.xx.138:38638 fd=735 name= age=119 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=10514 omem=266324615 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
2030:M 23 Nov 21:41:42.028 # Connection with slave 10.10.xx.138:6392 lost.
2030:M 23 Nov 21:42:24.950 * Slave 10.10.xx.138:6392 asks for synchronization
2030:M 23 Nov 21:42:24.950 * Unable to partial resync with slave 10.10.xx.138:6392 for lack of backlog (Slave request was: 3851662).
2030:M 23 Nov 21:42:24.950 * Starting BGSAVE for SYNC with target: disk
2030:M 23 Nov 21:42:25.158 * Background saving started by pid 47033
47033:C 23 Nov 21:42:59.788 * DB saved on disk
.....

实际上已经收到报警了：提示client_longest_output_list过大。

综上分析可以确定原因了。

三、后期优化

1.对于写入量大，且分片较大的情况，可以适当增大client-output-buffer-limit slave
2.减少大分片的使用，尽量每个节点控制在4GB以内。
3.可以对一些关键日志进行监控，帮助快速定位问题。