On a two-node Oracle 12c RAC cluster, node 2 went down. The alert log shows the following errors:
2021-06-16T01:26:16.516936+08:00
Thread 2 advanced to log sequence 8611 (LGWR switch)
  Current log# 4 seq# 8611 mem# 0: +DATA/CB2QDB/ONLINELOG/group_4.305.992195801
  Current log# 4 seq# 8611 mem# 1: +DATA/CB2QDB/ONLINELOG/group_4.306.992195809
2021-06-16T01:26:16.852672+08:00
Archived Log entry 16829 added for T-2.S-8610 ID 0x643dd214 LAD:1
2021-06-16T01:27:09.344853+08:00
Instance Critical Process (pid: 34, ospid: 60840, DBW1) died unexpectedly
PMON (ospid: 60735): terminating the instance due to error 471
2021-06-16T01:27:12.685143+08:00
Instance terminated by PMON, pid = 60735
So DBW1 died, and the log gives no further detail.
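When an alert log is long, the fatal lines can be pulled out mechanically. The following is a minimal sketch, assuming the usual 12c layout of a timestamp line followed by its message lines; the function name `find_fatal_events` and the event labels are made up for illustration, and the sample text is copied from the excerpt above.

```python
import re
from typing import List, Tuple

# Timestamp lines in 12c alert logs look like 2021-06-16T01:27:09.344853+08:00
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}$")
DIED_RE = re.compile(
    r"Instance Critical Process \(pid: \d+, ospid: \d+, (\w+)\) died unexpectedly"
)
TERM_RE = re.compile(r"terminating the instance due to error (\d+)")


def find_fatal_events(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (timestamp, kind, detail) for each fatal message found.

    kind is a hypothetical label: "process_died" or "terminated_error".
    """
    events = []
    current_ts = ""
    for raw in log_text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            # Remember the timestamp that applies to the following messages.
            current_ts = line
            continue
        died = DIED_RE.search(line)
        if died:
            events.append((current_ts, "process_died", died.group(1)))
        term = TERM_RE.search(line)
        if term:
            events.append((current_ts, "terminated_error", term.group(1)))
    return events


SAMPLE = """\
2021-06-16T01:26:16.852672+08:00
Archived Log entry 16829 added for T-2.S-8610 ID 0x643dd214 LAD:1
2021-06-16T01:27:09.344853+08:00
Instance Critical Process (pid: 34, ospid: 60840, DBW1) died unexpectedly
PMON (ospid: 60735): terminating the instance due to error 471
"""

if __name__ == "__main__":
    for ts, kind, detail in find_fatal_events(SAMPLE):
        print(ts, kind, detail)
```

The same function can be pointed at the grid-side alert log, since the fatal messages there use the same wording.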
The grid log for the same moment shows:
2021-06-16T01:27:06.031574+08:00
NOTE: ASM client -MGMTDB:_mgmtdb:cb2qdb disconnected unexpectedly.
NOTE: check client alert log.
2021-06-16T01:27:06.431417+08:00
Dumping diagnostic data in directory=[cdmp_20210616012706], requested by (instance=0, osid=9777), summary=[trace bucket dump request (kfnclDelete0)].
2021-06-16T01:27:07.537686+08:00
NOTE: ASMB process exiting, either shutdown is in progress or foreground connected to ASMB was killed.
NOTE: ASMB0 clearing idle groups before exit
2021-06-16T01:27:08.443825+08:00
Instance Critical Process (pid: 6, ospid: 76748, GEN0) died unexpectedly
PMON (ospid: 76720): terminating the instance due to error 495
2021-06-16T01:27:09.630299+08:00
Instance terminated by PMON, pid = 76720
So the ASM instance went down at that moment, and that is what killed DBW1.
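The ordering behind that conclusion is easier to see when the two logs are merged into one timeline. Here is a rough sketch: the timestamps are copied from the excerpts above, the messages are shortened by hand, and the source labels "db" and "grid" as well as the function name `build_timeline` are only illustrative.

```python
from datetime import datetime

# (timestamp, source, shortened message), copied from the two logs above.
EVENTS = [
    ("2021-06-16T01:27:09.344853+08:00", "db",
     "DBW1 died unexpectedly; PMON terminating the instance due to error 471"),
    ("2021-06-16T01:27:12.685143+08:00", "db",
     "Instance terminated by PMON"),
    ("2021-06-16T01:27:06.031574+08:00", "grid",
     "ASM client -MGMTDB:_mgmtdb:cb2qdb disconnected unexpectedly"),
    ("2021-06-16T01:27:07.537686+08:00", "grid",
     "ASMB process exiting"),
    ("2021-06-16T01:27:08.443825+08:00", "grid",
     "GEN0 died unexpectedly; PMON terminating the instance due to error 495"),
    ("2021-06-16T01:27:09.630299+08:00", "grid",
     "Instance terminated by PMON"),
]


def build_timeline(events):
    """Parse the ISO timestamps and return the events in chronological order."""
    parsed = [(datetime.fromisoformat(ts), src, msg) for ts, src, msg in events]
    return sorted(parsed, key=lambda e: e[0])


if __name__ == "__main__":
    for when, src, msg in build_timeline(EVENTS):
        print(when.strftime("%H:%M:%S.%f"), src.ljust(4), msg)
```

Sorted this way, GEN0 on the grid side dies at 01:27:08.443825, about 0.9 seconds before DBW1 dies at 01:27:09.344853, which matches the reading that the database instance followed ASM down.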
MOS has a clear description of error 471 (Doc ID 1622379.1), for reference only.
The root cause here is error 495. MOS states the cause and solution plainly:
PMON (ospid: nnnn): Terminating the Instance Due to Error 495 (Doc ID 2584936.1)
CAUSE: Extending FS where datafiles exist.
Sep 3 23:26:18 cmstashdbl02 kernel: [51963237.523394] EXT4-fs warning (device dm-8): ext4_resize_begin:44: There are errors in the filesystem, so online resizing is not allowed
Sep 3 23:26:18 cmstashdbl02 kernel: [51963237.523394]
SOLUTION: Check with os/storage admin.
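The question is: who was doing storage operations at that hour, in the early morning of June 16?
Just before the crash, the grid log contains one key message: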
NOTE: ASM client -MGMTDB:_mgmtdb:cb2qdb disconnected unexpectedly.
NOTE: check client alert log.
Looking at /u01/app/grid/diag/rdbms/mgmtdb/-MGMTDB/trace/alert-MGMTDB.log, we find:
2021-06-16T01:27:04.310472+08:00
License high water mark = 19
2021-06-16T01:27:04.310951+08:00
USER (ospid: 10795): terminating the instance
2021-06-16T01:27:07.315957+08:00
Instance terminated by USER, pid = 10795
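That is all the information there is. This 12c database was forced to install mgmtdb back then, and I really have no appetite for reading the mgmtdb documentation right now, so I will leave it at that. Damn you, Oracle.
With the chain of events understood, I fixed a few system settings (sshd, limits and so on) that someone had later changed incorrectly, force-stopped the half-dead CRS, and after a restart everything was back to normal.
Comments
While I was at it, I also fixed a ulimit permission problem: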
Last login: Mon Jul 5 20:25:39 2021
-bash: ulimit: open files: cannot modify limit: Operation not permitted
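The error appeared even though limits.conf was configured correctly. After some searching online, changing the sshd configuration parameter UseLogin to yes fixed it.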
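For background, bash prints "cannot modify limit: Operation not permitted" when an unprivileged shell asks for a limit above the hard limit its session actually received, whatever limits.conf says. The sketch below shows a way to inspect what a login session really got. The helper `can_raise_soft_limit` and the value 65536 are hypothetical and do not come from the system in this post; the `resource` module is available on Unix only.

```python
import resource


def can_raise_soft_limit(requested: int, hard: int) -> bool:
    """Check whether an unprivileged process may set its soft limit to `requested`.

    Without privileges, the soft limit can be set to any value up to the
    hard limit, but not beyond it.
    """
    if hard == resource.RLIM_INFINITY:
        return True  # no hard ceiling at all
    if requested == resource.RLIM_INFINITY:
        return False  # cannot ask for "unlimited" under a finite hard limit
    return requested <= hard


if __name__ == "__main__":
    # Limits of the current process, inherited from the login session.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = 65536  # hypothetical value a profile script might request
    print("open files: soft =", soft, "hard =", hard)
    print("can set soft limit to", wanted, ":", can_raise_soft_limit(wanted, hard))
```

Running this from a fresh ssh login and comparing the result with limits.conf shows whether the configured limits were actually applied to that session.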