一次意外的ADG主备切换过程分析

2021年1月21日 1609点热度 0人点赞 0条评论

■2021-01-14 15:07
存储开始出现问题,主数据库2个节点(node1、node2)均报共享存储'I/O error'

-------node1 alert
2021-01-14T15:07:24.122930+08:00
WARNING: Write Failed. group:1 disk:3 AU:81 offset:2022912 size:4096
path:/dev/asmdisks/su001_lun04
         incarnation:0xf0f0720e asynchronous result:'I/O error'
         subsys:System krq:0x7fd35c49c928 bufp:0x2060dcb800 osderr1:0x69b5 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
---此处省略若干
2021-01-14T15:23:38.253848+08:00
NOTE: updating disk modes to 0x7 from 0x5 for disk 3 (DATA3_0003) in group 3 (DATA3): lflags 0x0
NOTE: updating disk modes to 0x7 from 0x5 for disk 5 (DATA3_0005) in group 3 (DATA3): lflags 0x0
NOTE: updating disk modes to 0x7 from 0x5 for disk 7 (DATA3_0007) in group 3 (DATA3): lflags 0x0
NOTE: updating disk modes to 0x7 from 0x5 for disk 33 (DATA3_0033) in group 3 (DATA3): lflags 0x0
NOTE: updating disk modes to 0x7 from 0x5 for disk 46 (DATA3_0046) in group 3 (DATA3): lflags 0x0
NOTE: disk 3 (DATA3_0003) in group 3 (DATA3) is online for reads
NOTE: disk 5 (DATA3_0005) in group 3 (DATA3) is online for reads
NOTE: disk 7 (DATA3_0007) in group 3 (DATA3) is online for reads
NOTE: disk 33 (DATA3_0033) in group 3 (DATA3) is online for reads
NOTE: disk 46 (DATA3_0046) in group 3 (DATA3) is online for reads
---此处省略若干
2021-01-14T15:31:09.600620+08:00
NOTE: ASMB process initiating disk discovery for grp 1 (reqid:4751423187424713884)
NOTE: Found /dev/asmdisks/su001_lun02 for disk DATA1_0001
SUCCESS: disk DATA1_0001 (1.4042289684) replaced in diskgroup DATA1 path: /dev/asmdisks/su001_lun02
SUCCESS: completed reconfiguration of group 1 (DATA1)
NOTE: updating disk modes to 0x5 from 0x1 for disk 1 (DATA1_0001) in group 1 (DATA1): lflags 0x0
NOTE: disk 1 (DATA1_0001) in group 1 (DATA1) is online for writes
2021-01-14T15:31:25.524942+08:00
NOTE: updating disk modes to 0x7 from 0x5 for disk 1 (DATA1_0001) in group 1 (DATA1): lflags 0x0
NOTE: disk 1 (DATA1_0001) in group 1 (DATA1) is online for reads

-------node2 alert
2021-01-14T15:07:23.320330+08:00
PDBAPP(3):WARNING: Read Failed. group:1 disk:5 AU:138139 offset:3162112 size:8192
PDBAPP(3):path:/dev/asmdisks/su001_lun06
PDBAPP(3):       incarnation:0xf0f0510c synchronous result:'I/O error'
PDBAPP(3):       subsys:System krq:0x7f8f2bd86ec0 bufp:0x1f5a5de000 osderr1:0x69c0 osderr2:0x0
PDBAPP(3):       IO elapsed time: 0 usec Time waited on I/O: 0 usec
---此处省略若干
2021-01-14T15:37:12.581586+08:00
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
---此处省略若干
2021-01-14T15:37:45.068876+08:00 
WARNING: Read Failed. group:1 disk:20 AU:7 offset:16384 size:16384 
path:/dev/asmdisks/su003_lun01 
         incarnation:0xf0f0511e synchronous result:'I/O error' 
         subsys:System krq:0x7ff4b2ec9b10 bufp:0x7ff4b2c43000 osderr1:0x69c0 osderr2:0x0 (mirrored read) 
         IO elapsed time: 0 usec Time waited on I/O: 0 usec 

■2021-01-14 15:37
备库监控开始报错,主库长时间失联,导致主备切换

-------1 adg
2021-01-14T15:37:30.328+08:00
Fast-Start Failover cannot proceed because: "primary last contacted this standby within FastStartFailoverThreshold seconds"
primary last contacted to instance 1 17 seconds ago
2021-01-14T15:37:33.343+08:00
Fast-Start Failover cannot proceed because: "primary last contacted this standby within FastStartFailoverThreshold seconds"
primary last contacted to instance 1 20 seconds ago
2021-01-14T15:37:36.356+08:00
Fast-Start Failover cannot proceed because: "primary last contacted this standby within FastStartFailoverThreshold seconds"
primary last contacted to instance 1 23 seconds ago
2021-01-14T15:37:39.370+08:00
Fast-Start Failover cannot proceed because: "primary last contacted this standby within FastStartFailoverThreshold seconds"
primary last contacted to instance 1 26 seconds ago
2021-01-14T15:37:42.383+08:00
Fast-Start Failover cannot proceed because: "primary last contacted this standby within FastStartFailoverThreshold seconds"
primary last contacted to instance 1 29 seconds ago
2021-01-14T15:37:46.854+08:00
FAILOVER TO wydbadg
Starting FAILOVER

-------2 adg alert
2021-01-14T15:37:45.397868+08:00
Attempting Fast-Start Failover because the threshold of 30 seconds has elapsed.
2021-01-14T15:37:46.514371+08:00
Errors in file /u01/app/oracle/diag/rdbms/wydbadg/wydb1/trace/wydb1_ora_120118.trc:
2021-01-14T15:37:46.847916+08:00
Attempting Fast-Start Failover because the threshold of 30 seconds has elapsed.
2021-01-14T15:37:46.890811+08:00
Beginning Fast-Start Failover to database wydbadg.
2021-01-14T15:37:46.915192+08:00
ALTER DATABASE FAILOVER TO wydbadg

-------3 node1 alert
2021-01-14T15:38:30.185383+08:00
Starting background process FSFP
2021-01-14T15:38:30.214348+08:00
FSFP started with pid=87, OS id=205120
2021-01-14T15:38:31.184950+08:00
Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.
It is likely an automatic failover has already occurred. Primary is shutting down.

■分析结论
当时厂家人员做心跳故障模拟测试,可能有相关操作导致了存储出现问题(日志显示两个节点的共享磁盘均出现了掉线、上线的过程),从而导致了主库2个节点均出现了问题,最终触发了主备自动切换。

liking

这个人很懒,什么都没留下

文章评论