terminating the instance due to error 471/495

July 6, 2021

On a two-node Oracle 12c RAC cluster, node 2 went down. The database alert log showed the following errors:

2021-06-16T01:26:16.516936+08:00
Thread 2 advanced to log sequence 8611 (LGWR switch)
  Current log# 4 seq# 8611 mem# 0: +DATA/CB2QDB/ONLINELOG/group_4.305.992195801
  Current log# 4 seq# 8611 mem# 1: +DATA/CB2QDB/ONLINELOG/group_4.306.992195809
2021-06-16T01:26:16.852672+08:00
Archived Log entry 16829 added for T-2.S-8610 ID 0x643dd214 LAD:1
2021-06-16T01:27:09.344853+08:00
Instance Critical Process (pid: 34, ospid: 60840, DBW1) died unexpectedly
PMON (ospid: 60735): terminating the instance due to error 471
2021-06-16T01:27:12.685143+08:00
Instance terminated by PMON, pid = 60735

So DBW1 died, and the log gives no further detail.
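When the alert log is long, a small script can pull out the two lines that matter here: the critical process that died and the error PMON cited when it terminated the instance. The following is a minimal sketch, assuming the alert log layout shown above (a timestamp line followed by message lines). The function name `find_terminations` and the regular expressions are illustrative and are not part of any Oracle tool.

```python
import re

# Patterns derived only from the message wording in the excerpt above.
DIED_RE = re.compile(
    r"Instance Critical Process \(pid: (\d+), ospid: (\d+), (\w+)\) died unexpectedly"
)
TERM_RE = re.compile(
    r"PMON \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}$")


def find_terminations(log_text):
    """Return one dict per PMON-initiated termination found in log_text."""
    events = []
    current_ts = None  # most recent timestamp line seen
    died = None        # most recent critical process reported dead
    for raw in log_text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            current_ts = line
            continue
        match = DIED_RE.search(line)
        if match:
            died = match.group(3)
            continue
        match = TERM_RE.search(line)
        if match:
            events.append(
                {"timestamp": current_ts, "process": died, "error": int(match.group(2))}
            )
            died = None
    return events


# Sample text copied from the alert log excerpt in this post.
SAMPLE = """\
2021-06-16T01:27:09.344853+08:00
Instance Critical Process (pid: 34, ospid: 60840, DBW1) died unexpectedly
PMON (ospid: 60735): terminating the instance due to error 471
2021-06-16T01:27:12.685143+08:00
Instance terminated by PMON, pid = 60735
"""

if __name__ == "__main__":
    for event in find_terminations(SAMPLE):
        print(event)
```

In practice the sample string would be replaced by the contents of the real alert log file.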
The grid log shows the following at that moment:

2021-06-16T01:27:06.031574+08:00
NOTE: ASM client -MGMTDB:_mgmtdb:cb2qdb disconnected unexpectedly.
NOTE: check client alert log.
2021-06-16T01:27:06.431417+08:00
Dumping diagnostic data in directory=[cdmp_20210616012706], requested by (instance=0, osid=9777), summary=[trace bucket dump request (kfnclDelete0)].
2021-06-16T01:27:07.537686+08:00
NOTE: ASMB process exiting, either shutdown is in progress or foreground connected to ASMB was killed.
NOTE: ASMB0 clearing idle groups before exit
2021-06-16T01:27:08.443825+08:00
Instance Critical Process (pid: 6, ospid: 76748, GEN0) died unexpectedly
PMON (ospid: 76720): terminating the instance due to error 495
2021-06-16T01:27:09.630299+08:00
Instance terminated by PMON, pid = 76720

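So the ASM instance went down at that moment, and that is what killed DBW1.
MOS has a clear description of error 471 (Doc ID 1622379.1), for reference only.
The root cause here is error 495, for which MOS gives a clear cause and solution: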
PMON (ospid: nnnn): Terminating the Instance Due to Error 495 (Doc ID 2584936.1)

CAUSE:
Extending FS where datafiles exist.

Sep 3 23:26:18 cmstashdbl02 kernel: [51963237.523394] EXT4-fs warning (device dm-8): ext4_resize_begin:44: There are errors in the filesystem, so online resizing is not allowed
Sep 3 23:26:18 cmstashdbl02 kernel: [51963237.523394]

SOLUTION:
Check with os/storage admin.

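The question is: who was doing anything to the storage in the early hours of June 16?
Just before the crash, the log contains one key piece of information: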
NOTE: ASM client -MGMTDB:_mgmtdb:cb2qdb disconnected unexpectedly.
NOTE: check client alert log.
Checking /u01/app/grid/diag/rdbms/mgmtdb/-MGMTDB/trace/alert-MGMTDB.log shows the following:

2021-06-16T01:27:04.310472+08:00
License high water mark = 19
2021-06-16T01:27:04.310951+08:00
USER (ospid: 10795): terminating the instance
2021-06-16T01:27:07.315957+08:00
Instance terminated by USER, pid = 10795
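Putting the three logs on a single timeline makes the order of events easier to see. The sketch below is only an illustration: the timestamps are copied from the excerpts in this post, the messages are abbreviated, and the source labels `db`, `asm`, and `mgmtdb` as well as the function name `build_timeline` are made up for the example.

```python
from datetime import datetime

# (source, timestamp, abbreviated message) tuples copied from the excerpts in this post.
EVENTS = [
    ("db", "2021-06-16T01:27:09.344853+08:00",
     "DBW1 died unexpectedly; PMON terminating the instance due to error 471"),
    ("asm", "2021-06-16T01:27:06.031574+08:00",
     "ASM client -MGMTDB:_mgmtdb:cb2qdb disconnected unexpectedly"),
    ("asm", "2021-06-16T01:27:08.443825+08:00",
     "GEN0 died unexpectedly; PMON terminating the instance due to error 495"),
    ("mgmtdb", "2021-06-16T01:27:04.310951+08:00",
     "USER (ospid: 10795): terminating the instance"),
    ("mgmtdb", "2021-06-16T01:27:07.315957+08:00",
     "Instance terminated by USER, pid = 10795"),
]


def build_timeline(events):
    """Sort events from all logs by their ISO 8601 timestamp."""
    return sorted(events, key=lambda event: datetime.fromisoformat(event[1]))


if __name__ == "__main__":
    for source, timestamp, message in build_timeline(EVENTS):
        print(f"{timestamp}  [{source:6}]  {message}")
```

Sorted this way, the MGMTDB termination comes first, then the ASM-side messages, and the database instance failure last.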

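That is all the information there is. This 12c database was also forced to install mgmtdb back then, and I really have no appetite for reading the mgmtdb documentation right now, so I will leave it at that. Damn Oracle.
With the sequence of events understood, I corrected some system settings (sshd, limits, and so on) that someone had later changed incorrectly, forced the half-dead CRS to shut down, and restarted it. Everything then returned to normal.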
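To check whether the limits settings actually reach a login session, you can inspect the limits the process really received. A minimal sketch, assuming Linux and Python's standard `resource` module. The value 65536 is purely illustrative and not taken from this system, and the helper names `can_raise` and `describe_nofile_limit` are hypothetical.

```python
import resource


def can_raise(wanted, hard):
    """An unprivileged process cannot raise its hard limit, so a request
    above the current hard limit is expected to fail."""
    return hard == resource.RLIM_INFINITY or wanted <= hard


def describe_nofile_limit(wanted):
    """Report the open-files limits of the current process and whether
    `wanted` fits under the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "soft": soft,
        "hard": hard,
        "wanted": wanted,
        "fits_under_hard_limit": can_raise(wanted, hard),
    }


if __name__ == "__main__":
    # 65536 is an illustrative target, not a value from the original system.
    print(describe_nofile_limit(65536))
```

If the reported hard limit is lower than what limits.conf specifies, the session was probably not given those limits at login, which is the kind of mismatch the sshd change is meant to address.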

liking

I am a snowman

Comments

  • liking

    I also fixed a ulimit permission problem at the same time:
    Last login: Mon Jul 5 20:25:39 2021
    -bash: ulimit: open files: cannot modify limit: Operation not permitted
    This error still occurred even though limits.conf was configured correctly. After a web search, setting the sshd configuration parameter UseLogin to yes fixed it.

    July 6, 2021