Troubleshooting a PXC Cluster When the Third Node Fails to Join

January 18, 2022

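A PXC 8.0.23 cluster stopped serving requests after some project operations. Clients received either:
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
or
2013 - Lost connection to MySQL server during query
Logging in to each node showed wsrep_cluster_size at 0 everywhere, and wsrep_cluster_status was not Primary on any node (something like "not connected", as far as I recall). In grastate.dat, node 3 had safe_to_bootstrap set to 1.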
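
The grastate.dat check can be sketched as a small shell helper. This is only an illustration: the function name `safe_to_bootstrap` is made up here, the path in the example comment is assumed from the `--datadir` value in the SST logs, and the parser assumes the usual `key: value` layout of the file.

```shell
#!/bin/sh
# Print the safe_to_bootstrap flag (0 or 1) stored in a Galera grastate.dat.
# Returns non-zero if the file contains no such line.
safe_to_bootstrap() {
    awk -F': *' '
        $1 == "safe_to_bootstrap" { gsub(/[ \t\r]/, "", $2); print $2; found = 1 }
        END { exit !found }
    ' "$1"
}

# Example (path is an assumption taken from --datadir in the SST log):
# safe_to_bootstrap /mysql/pxc/data/grastate.dat
```

The status variables themselves can be read on each node with `mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%'"`. On systemd installs, Percona's documentation bootstraps the first node with `systemctl start mysql@bootstrap.service`; the post does not say which command was actually used here.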
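I therefore shut down every node and bootstrapped the cluster from node 3. Node 2 then joined without trouble, but node 1 failed to join with the following error: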

2022-01-12T11:12:43.552286Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2022-01-12T11:20:32.979860Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (16448) with SIGKILL after stalling for 120 seconds
2022-01-12T11:20:33.010860Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 183: 16450 killed               socat -u openssl-listen:4444,reuseaddr,cert=/mysql/pxc/data//server-cert.pem,key=/mysql/pxc/data//server-key.pem,cafile=/mysql/pxc/data//ca.pem,verify=1,retry=30 stdio
2022-01-12T11:20:33.010931Z 0 [Note] [MY-000000] [WSREP-SST]      16451                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2022-01-12T11:20:33.011525Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2022-01-12T11:20:33.011676Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2022-01-12T11:20:33.011756Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1268
2022-01-12T11:20:33.011874Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2022-01-12T11:20:33.012861Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2022-01-12T11:20:33.210760Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.222.50.101' --datadir '/mysql/pxc/data/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '15908' --mysqld-version '8.0.23-14.1'   '' : 32 (Broken pipe)
2022-01-12T11:20:33.210898Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-01-12T11:20:33.210973Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2022-01-12T11:20:33.211182Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-01-12T11:20:33.211268Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2022-01-12T11:20:33.211352Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
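
A note on reading this log: an exit status of 137 is 128 + 9, which is how a shell reports a process that died from signal 9 (SIGKILL). So `exit codes: 137 137` most likely describes the `socat` and `xbstream` processes being killed by the SST script itself after the 120-second stall, rather than a crash inside either tool. A minimal sketch that reproduces the number (the function name `killed_status` is made up for illustration):

```shell
#!/bin/sh
# Start a child shell that sends itself SIGKILL, then print the exit status
# that the parent shell observes for it.
killed_status() {
    { sh -c 'kill -KILL $$'; } 2>/dev/null
    echo "$?"
}

killed_status   # prints 137 (128 + signal number 9)
```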

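The articles I found online were all over the place; I tried several and none of them helped. Because the error log showed --address '10.222.50.101', I suspected for a while that the wsrep_node_address parameter had to be set explicitly, since it was commented out (the default) on every node. After setting it explicitly, the join still failed: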

2022-01-13T08:03:32.978322Z 0 [Note] [MY-000000] [WSREP-SST] Proceeding with SST.........
2022-01-13T08:03:33.036563Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2022-01-13T08:12:38.715388Z 0 [Note] [MY-000000] [Galera] Created page /mysql/pxc/data/gcache.page.000000 of size 592621440 bytes
2022-01-13T08:12:51.193262Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (27632) with SIGKILL after stalling for 120 seconds
2022-01-13T08:12:51.217686Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 183: 27634 killed               socat -u openssl-listen:4444,reuseaddr,cert=/mysql/pxc/data//server-cert.pem,key=/mysql/pxc/data//server-key.pem,cafile=/mysql/pxc/data//ca.pem,verify=1,retry=30 stdio
2022-01-13T08:12:51.217754Z 0 [Note] [MY-000000] [WSREP-SST]      27635                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2022-01-13T08:12:51.218372Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR ********************** 
2022-01-13T08:12:51.218550Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2022-01-13T08:12:51.218628Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1268
2022-01-13T08:12:51.218722Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2022-01-13T08:12:51.219631Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2022-01-13T08:12:51.431617Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.230.245.214' --datadir '/mysql/pxc/data/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '27097' --mysqld-version '8.0.23-14.1'   '' : 32 (Broken pipe)
2022-01-13T08:12:51.431820Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-01-13T08:12:51.431892Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2022-01-13T08:12:51.432257Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-01-13T08:12:51.432372Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2022-01-13T08:12:51.432458Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

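I also suspected the firewall configuration, but after removing all of its settings and stopping the firewall, the error was unchanged.
To avoid affecting the business, I brought the service back up on two nodes for the time being.
Meanwhile I posted the problem on Percona's official site and received the following replies: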

【matthewb Percona】
Your log indicates that port 4444 is not open TCP/UDP to all hosts. Make sure all necessary ports (3306, 4444, 4567, 4568) are open between all nodes.

【liking】
Thanks for your reply, but I am sure I have closed firewall between all nodes. Maybe there is some other issues?

【Evgeniy_Patlan Percona】
"while getting data from donor node: exit codes: 137 137"
Such issue appeared once it is not possible to connect to the needed port. So please recheck your firewall options

【matthewb Percona】
"I am sure I have closed firewall between all nodes"
That’s your problem. You need to OPEN the firewall between nodes, not close it. Use socat or nc to test connectivity between nodes on the ports I mentioned.

【liking】
Many thanks to you all, I will do this according to your suggest

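As the thread shows, Percona was quite sure the cause was the network port configuration. Access to the network was inconvenient at the time, so I planned to retry when I got the chance.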
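
The connectivity test suggested in the thread can be sketched as below. The thread names `socat` or `nc`; this sketch swaps in bash's `/dev/tcp` redirection plus coreutils `timeout` so it needs no extra packages. The function names `port_open` and `check_peer` and the 3-second timeout are assumptions for illustration.

```shell
#!/bin/sh
# Ports named in the forum reply.
PXC_PORTS="3306 4444 4567 4568"

# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# Relies on bash's /dev/tcp redirection, so bash must be installed.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Probe every PXC port on one peer; return non-zero if any probe fails.
check_peer() {
    rc=0
    for port in $PXC_PORTS; do
        if port_open "$1" "$port"; then
            echo "$1:$port open"
        else
            echo "$1:$port closed or filtered"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check_peer <peer-ip>
```

A failed probe on 4444 or 4568 is not conclusive by itself: as far as I know, those ports only have a listener while a state transfer is in progress, so a temporary listener on the target (for example `nc -l 4444`) is needed to tell a firewall drop apart from "nothing listening".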

A few days later I found an opportunity to retry, and then replied on the official forum:
It is ok now.
According to your suggest, I modified the netfilter rules on all nodes like this:

  1. Accept all input
  2. Clear all netfilter rules

Now the cluster works fine.

The exact steps were as follows:

    [root@db-1 ~]#  iptables -P INPUT ACCEPT
    [root@db-1 ~]#  iptables -F
    [root@db-1 ~]#  iptables -X
    [root@db-1 ~]#  iptables -Z
    [root@db-1 ~]#  iptables -A INPUT -i lo -j ACCEPT
    [root@db-1 ~]#  iptables-save
    #Generated by iptables-save v1.4.21 on Mon Jan 24 11:33:23 2022
    *filter
    :INPUT ACCEPT [884:105489]
    :FORWARD ACCEPT [0:0]
    :OUTPUT ACCEPT [685:162312]
    -A INPUT -i lo -j ACCEPT
    COMMIT
    #Completed on Mon Jan 24 11:33:23 2022
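
Flushing every rule with an ACCEPT policy fixed the join, but it also leaves the hosts fully open. A narrower variant, sketched here under assumptions, would accept only the four PXC ports from the other cluster members. The peer addresses below are placeholders from the documentation range, not the real node IPs, and the script only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical peer addresses (documentation range); replace with real node IPs.
PEERS="192.0.2.11 192.0.2.12"
# Ports named in the forum reply.
PORTS="3306,4444,4567,4568"

# Print, but do not run, one ACCEPT rule per peer.
print_pxc_rules() {
    for peer in $PEERS; do
        echo "iptables -I INPUT -p tcp -s $peer -m multiport --dports $PORTS -j ACCEPT"
    done
}

print_pxc_rules
```

Note that `iptables-save` without a redirect only prints the current rules to standard output. To survive a reboot the output has to be stored wherever the distribution restores it from; on systems using the iptables-services package that is, as far as I know, `/etc/sysconfig/iptables`.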

liking
