CRS-0184 Cannot communicate with the CRS daemon
An Oracle RAC cluster ran into a problem and reported these errors:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4534: Cannot communicate with Event Manager
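Background: because our website was moving to the cloud, one Oracle RAC cluster was pulled out of the IDC data center and brought back to the company's own premises. The database was shut down before the move, but my manager did it with only su - oracle followed by shu immediate. That stops the Oracle database instance and leaves the ASM instance running. Once the servers were on site, the network cables were plugged back into their original ports and we tried to start everything. I only brought up the instance on ora101 and then left it alone. Later, when this database was needed, I finally looked at ora102 and realized that neither its database instance nor its ASM instance was running. Trying to start them failed with the errors listed above.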
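First, a note on Oracle RAC itself.
When a server needs to be rebooted, the Oracle resources should be shut down in this order:
Method 1:
1) Stop the Oracle database instances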
[grid@ora102 ~]$ srvctl stop database -d ORCL
2) Stop the ASM instances
[grid@ora102 ~]$ srvctl stop asm -n ora102
[grid@ora102 ~]$ srvctl stop asm -n ora101
If the stop fails with an error like the one below, force it:
[root@ora101 bin]# ./srvctl stop asm
PRCR-1065 : Failed to stop resource ora.asm
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
Adding the force option (-f) does it:
[grid@ora101 ~]$ srvctl stop asm -f
[grid@ora101 ~]$ srvctl status asm
ASM is not running.
3) Finally, stop CRS as well
[root@ora101 bin]# ./crsctl stop cluster -all
Method 2:
1) Stop the Oracle database instance; run this on both nodes
su - oracle
sqlplus / as sysdba
shu immediate
2) Stop the ASM instance; run this on both nodes
su - grid
sqlplus / as sysasm
shu immediate
To force the shutdown from SQL*Plus, use abort:
[grid@ora101 ~]$ sqlplus / as sysasm
SQL> shu abort
ASM instance shutdown
3) Finally, stop CRS as well
[root@ora101 bin]# ./crsctl stop cluster -all
Check the status of the database and ASM instances, and of CRS:
[grid@ora101 ~]$ srvctl status asm
ASM is running on ora101,ora102
[grid@ora101 ~]$ srvctl status database -d ORCL
Instance orcl1 is not running on node ora101
Instance orcl2 is not running on node ora102
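The shutdown order of Method 1 can be sketched as a small script. This is only an illustration, not a tested maintenance tool: DB_NAME, GRID_BIN and DRY_RUN are names made up for this sketch, and the script prints the commands by default instead of running them. In a real run, srvctl is normally executed as the grid or oracle user and crsctl as root.

```shell
# Sketch of the Method 1 shutdown order. Assumed, illustrative variables:
#   DB_NAME  - database name as registered in the cluster (ORCL in this post)
#   GRID_BIN - Grid Infrastructure bin directory
#   DRY_RUN  - 1 (default) prints the commands, anything else executes them
DB_NAME="${DB_NAME:-ORCL}"
GRID_BIN="${GRID_BIN:-/u01/app/11.2.0/grid/bin}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rac_shutdown() {
    # 1) database instances first
    run "$GRID_BIN/srvctl" stop database -d "$DB_NAME" || return 1
    # 2) then ASM; -f because the disk groups depend on it (see CRS-2529 above)
    run "$GRID_BIN/srvctl" stop asm -f || return 1
    # 3) finally the clusterware stack on all nodes
    run "$GRID_BIN/crsctl" stop cluster -all
}

rac_shutdown
```

Stopping at the first failed step is a deliberate choice here: shutting down the clusterware while a database instance is still open is what the ordering is meant to avoid.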
With that covered, back to the problem at hand.
[root@ora102 ~]# su - grid
[grid@ora102 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Wed Nov 29 22:28:20 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> startup
The startup fails with an error ...
Checking the cluster service status on node ora102 also fails:
[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
From this error we can tell that the problem is in CRS.
Trying to start it fails too (note that this must be run as root):
[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
A normal start looks like this:
[root@ora102 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Checking the CRS services shows what is wrong:
[grid@ora102 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
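When this check is repeated many times during recovery, a small filter helps to spot which daemons are not answering. The following is a minimal sketch; crs_check_summary is a helper name invented here, and it only assumes the four-line output format shown above.

```shell
# crs_check_summary: reads `crsctl check crs` output on stdin and prints the
# message code of every line that does not end in "is online".
# Exit status is 0 only when exactly four lines are present and all are online.
crs_check_summary() {
    awk '
        /^CRS-[0-9]+:/ {
            total++
            if ($0 ~ /is online[[:space:]]*$/) {
                online++
            } else {
                code = $1
                sub(/:$/, "", code)
                print code
            }
        }
        END { exit !(total == 4 && online == 4) }
    '
}

# Demo with the output captured on ora102 above.
if ! crs_check_summary <<'EOF'
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
EOF
then
    echo "clusterware stack is not fully online"
fi
```

On a live node the same function would be fed from the real command, for example crsctl check crs | crs_check_summary.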
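Next, looking at the IP addresses on node ora102 shows that its VIP and the SCAN IP are both gone. The VIP is now on node ora101, so ora102 has dropped out of the cluster.
The IP configuration: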
[root@ora102 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.44 ora101
192.168.0.45 ora102
192.168.0.46 ora101-vip
192.168.0.47 ora102-vip
192.168.0.48 ora-cluster-scan
172.168.56.101 ora101-priv
172.168.56.102 ora102-priv
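Listing the node's addresses shows that only the physical IP (192.168.0.45) is left on the public interface; the VIP 192.168.0.47 is missing.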
[root@ora102 ~]# ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp11s0f0: mtu 1500 qdisc mq state UP qlen 1000
link/ether 5c:f3:fc:e6:63:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.45/24 brd 192.168.0.255 scope global enp11s0f0
valid_lft forever preferred_lft forever
inet6 fe80::f451:31ab:4b4a:b224/64 scope link
valid_lft forever preferred_lft forever
3: enp11s0f1: mtu 1500 qdisc mq state UP qlen 1000
link/ether 5c:f3:fc:e6:63:42 brd ff:ff:ff:ff:ff:ff
inet 172.168.56.102/24 brd 172.168.56.255 scope global enp11s0f1
valid_lft forever preferred_lft forever
inet 169.254.20.215/16 brd 169.254.255.255 scope global enp11s0f1:1
valid_lft forever preferred_lft forever
inet6 fe80::7ee2:d8da:d7fa:12d5/64 scope link
valid_lft forever preferred_lft forever
4: enp0s29f0u2: mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
link/ether 5e:f3:fc:de:63:43 brd ff:ff:ff:ff:ff:ff
5: virbr0: mtu 1500 qdisc noqueue state DOWN qlen 1000
link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
6: virbr0-nic: mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
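Checking by eye whether a VIP is plumbed gets tedious, so the test can be scripted. A minimal sketch follows; has_addr is a helper name invented for this example, and the sample text is cut from the ip a output above.

```shell
# has_addr ADDR: reads `ip a` output on stdin and succeeds if ADDR is assigned
# to some interface. The trailing "/" keeps 192.168.0.4 from matching 192.168.0.45.
has_addr() {
    grep -qF "inet $1/"
}

# Two address lines taken from the ora102 output above.
sample='    inet 192.168.0.45/24 brd 192.168.0.255 scope global enp11s0f0
    inet 172.168.56.102/24 brd 172.168.56.255 scope global enp11s0f1'

# 192.168.0.45 is the physical IP, 192.168.0.47 is ora102-vip from /etc/hosts.
for addr in 192.168.0.45 192.168.0.47; do
    if printf '%s\n' "$sample" | has_addr "$addr"; then
        echo "$addr present"
    else
        echo "$addr missing"
    fi
done
```

On the node itself the input would come from ip -4 addr show instead of the sample text.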
Fixing the problem
First, try restarting CRS on node 2.
Stop CRS:
[root@ora102 bin]# ./crsctl stop crs
or
[root@ora102 bin]# ./crsctl stop cluster
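Then start the cluster stack.
Difference between method 1 and method 2: crsctl start/stop crs manages only the Clusterware stack on the local node and cannot manage remote nodes, whereas crsctl start/stop cluster can manage the local Clusterware stack as well as the whole cluster. Also, crsctl stop cluster leaves the Oracle High Availability Services daemon (OHASD) running, while crsctl stop crs stops it too.
With -all, the Clusterware is started on every node, that is, the whole cluster is started. With -n, the Clusterware is started on the named nodes only.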
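The scope of each variant can be summarized in a small lookup. This sketch only prints the matching command line; stack_cmd and its scope labels (local-full, local, all, nodes) are invented for this illustration and are not crsctl options.

```shell
# stack_cmd ACTION SCOPE [NODE...]: print the crsctl command for a given scope.
# ACTION is start or stop. SCOPE labels are made up for this sketch.
stack_cmd() {
    action=$1
    scope=$2
    shift 2
    case "$scope" in
        local-full) echo "crsctl $action crs" ;;            # local node, including OHASD
        local)      echo "crsctl $action cluster" ;;        # local node, OHASD untouched
        all)        echo "crsctl $action cluster -all" ;;   # every node in the cluster
        nodes)      echo "crsctl $action cluster -n $*" ;;  # only the named nodes
        *)          echo "unknown scope: $scope" >&2; return 1 ;;
    esac
}

stack_cmd stop local-full
stack_cmd start all
stack_cmd start nodes ora101 ora102
```

All of these have to be run as root, and the cluster variants need OHASD to be running on the nodes they address.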
Method 1:
[root@ora102 bin]# ./crsctl start crs
or
Method 2:
[root@ora102 bin]# ./crsctl start cluster
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'ora102'
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'ora102' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'ora102'
CRS-2681: Clean of 'ora.asm' on 'ora102' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'ora102'
CRS-2676: Start of 'ora.asm' on 'ora102' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'ora102'
CRS-2676: Start of 'ora.crsd' on 'ora102' succeeded
If the problem is still there, clear the configuration on node 2 and then rerun root.sh:
[root@ora102 trace]$ /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
[root@ora102 ~]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
[root@ora102 bin]# /u01/app/11.2.0/grid/root.sh
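Then check whether the status is normal. If it is not, restart CRS once more; that was enough in my case.
Checking the status now shows everything is normal: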
[root@ora102 bin]# ./crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.DATA.dg ora....up.type ONLINE ONLINE ora101
ora.FRA.dg ora....up.type ONLINE ONLINE ora101
ora....ER.lsnr ora....er.type ONLINE ONLINE ora101
ora....N1.lsnr ora....er.type ONLINE ONLINE ora101
ora.OCR.dg ora....up.type ONLINE ONLINE ora101
ora.asm ora.asm.type ONLINE ONLINE ora101
ora.cvu ora.cvu.type ONLINE ONLINE ora101
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora....network ora....rk.type ONLINE ONLINE ora101
ora.oc4j ora.oc4j.type ONLINE ONLINE ora101
ora.ons ora.ons.type ONLINE ONLINE ora101
ora....SM1.asm application ONLINE ONLINE ora101
ora....01.lsnr application ONLINE ONLINE ora101
ora.ora101.gsd application OFFLINE OFFLINE
ora.ora101.ons application ONLINE ONLINE ora101
ora.ora101.vip ora....t1.type ONLINE ONLINE ora101
ora....SM2.asm application ONLINE ONLINE ora102
ora....02.lsnr application ONLINE ONLINE ora102
ora.ora102.gsd application OFFLINE OFFLINE
ora.ora102.ons application ONLINE ONLINE ora102
ora.ora102.vip ora....t1.type ONLINE ONLINE ora102
ora.orcl.db ora....se.type ONLINE ONLINE ora101
ora.scan1.vip ora....ip.type ONLINE ONLINE ora101
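A long resource table is easier to read when only the rows that need attention are shown. Here is a minimal sketch that filters crs_stat -t output; crs_stat_problems is a name invented for this example, and the OFFLINE database row in the demo is hypothetical, added only to show a hit. Note that crs_stat is deprecated in 11.2 in favor of crsctl status resource -t, whose output layout is different, so this filter applies to the crs_stat -t format only.

```shell
# crs_stat_problems: reads `crs_stat -t` output on stdin and prints every
# resource whose Target is ONLINE but whose State is anything else.
# Resources with Target OFFLINE (such as ora.gsd above) are treated as
# intentionally down and skipped.
crs_stat_problems() {
    awk '$3 == "ONLINE" && $4 != "ONLINE" { print $1 " (state: " $4 ")" }'
}

# Demo: the first two rows come from the table above,
# the last row is a hypothetical failure.
crs_stat_problems <<'EOF'
Name           Type           Target    State     Host
------------------------------------------------------------
ora.asm        ora.asm.type   ONLINE    ONLINE    ora101
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora.orcl.db    ora....se.type ONLINE    OFFLINE
EOF
```

An empty result means every resource that is supposed to be up is up.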
Check the OCR status:
[grid@ora101 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2948
Available space (kbytes) : 259172
ID : 87127720
Device/File Name : +OCR
Device/File integrity check succeeded
Device/File not configured
Device/File not configured
Device/File not configured
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check bypassed due to non-privileged user
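For routine checks, the two things worth pulling out of this output are the space usage and the integrity line. A minimal sketch, assuming the output format shown above; ocr_summary is a helper name invented for this example.

```shell
# ocr_summary: reads `ocrcheck` output on stdin, prints the OCR space usage
# and whether the cluster registry integrity check succeeded.
# Exit status is 0 only if the integrity line was found.
ocr_summary() {
    awk -F: '
        /Total space \(kbytes\)/ { total = $2 + 0 }
        /Used space \(kbytes\)/  { used = $2 + 0 }
        /Cluster registry integrity check succeeded/ { ok = 1 }
        END {
            if (total > 0) printf "used: %.1f%%\n", used * 100 / total
            print (ok ? "integrity: ok" : "integrity: NOT confirmed")
            exit !ok
        }
    '
}

# Demo with the figures from the ocrcheck run above.
ocr_summary <<'EOF'
Total space (kbytes) : 262120
Used space (kbytes) : 2948
Cluster registry integrity check succeeded
EOF
```

As the last line of the real output notes, the logical corruption check is skipped for non-privileged users, so a complete check means running ocrcheck as root.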
Check the CRS status; it is normal:
[grid@ora101 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Side notes
Part 1: Errors when stopping the ASM instance
[root@ora101 bin]# ./srvctl stop asm
PRCR-1065 : Failed to stop resource ora.asm
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
Adding the force option (-f) does it:
[grid@ora101 ~]$ srvctl stop asm -f
[grid@ora101 ~]$ srvctl status asm
ASM is not running.
Or force the shutdown from SQL*Plus with abort:
[grid@ora101 ~]$ sqlplus / as sysasm
SQL> shu abort
ASM instance shutdown
Now check CRS:
[grid@ora101 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Stopping CRS with crsctl stop crs also stops the ASM disk groups.
The VIP failover is visible in the stop output:
[root@ora101 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ora101'
CRS-2673: Attempting to stop 'ora.crsd' on 'ora101'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'ora101'
CRS-2673: Attempting to stop 'ora.OCR.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.DATA.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.FRA.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'ora101'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ora101.vip' on 'ora101'
CRS-2677: Stop of 'ora.FRA.dg' on 'ora101' succeeded
CRS-2677: Stop of 'ora.DATA.dg' on 'ora101' succeeded
CRS-2677: Stop of 'ora.ora101.vip' on 'ora101' succeeded
CRS-2672: Attempting to start 'ora.ora101.vip' on 'ora102'
CRS-2676: Start of 'ora.ora101.vip' on 'ora102' succeeded ----- the VIP has failed over to ora102
CRS-2677: Stop of 'ora.OCR.dg' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'ora101'
CRS-2677: Stop of 'ora.ons' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'ora101'
CRS-2677: Stop of 'ora.net1.network' on 'ora101' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'ora101' has completed
CRS-2677: Stop of 'ora.crsd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'ora101'
CRS-2673: Attempting to stop 'ora.evmd' on 'ora101'
CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ora101'
CRS-2677: Stop of 'ora.evmd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ora101'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'ora101'
CRS-2677: Stop of 'ora.cssd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'ora101'
CRS-2677: Stop of 'ora.crf' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ora101'
CRS-2677: Stop of 'ora.gipcd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ora101'
CRS-2677: Stop of 'ora.gpnpd' on 'ora101' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ora101' has completed
CRS-4133: Oracle High Availability Services has been stopped.
To start ASM, first start the CRS services:
[root@ora101 bin]# ./crsctl start crs
[root@ora101 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Start the RAC instances and the database:
[grid@ora102 ~]$ srvctl start asm
PRCC-1014 : asm was already running
[root@ora101 bin]# ./srvctl start database -d ORCL
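Part 2: A brief overview of the CRS architecture:
1) Cluster Synchronization Services (CSS): manages cluster membership, that is, who is a member, who joins and who leaves, and notifies the members.
2) Cluster Ready Services (CRS): the main program for high-availability operations in the cluster. Everything CRS manages is treated as a resource, including databases, instances, services, listeners, VIP addresses and application processes. The CRS process manages cluster resources according to the configuration stored in the OCR, which covers start, stop, monitor and failover operations. When the state of a resource changes, the CRS process generates an event. Once RAC is installed, the CRS process monitors the resources and automatically restarts one when it fails; typically it retries 5 times and gives up if the resource still does not start.
3) Event Management (EVM): a background process that publishes the events generated by CRS.
4) Oracle Notification Service (ONS): a publish and subscribe service for communicating FAN messages.
5) RACG: extends the clusterware to support Oracle-specific requirements and complex resources.
6) Process Monitor Daemon (OPROCD): locked in memory, it monitors the cluster and performs I/O fencing. It works like a hang checker: it sets a timer, sleeps, wakes up and repeats, and if it wakes up later than expected it reboots the node. (In 11.2 this role is handled by cssdagent and cssdmonitor.)
Note:
The CRS stack starts automatically with the operating system by default. To switch this off for maintenance, and to switch it back on afterwards, run the following commands as root.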
[root@rac1 bin]# ./crsctl disable crs
[root@rac1 bin]# ./crsctl enable crs
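These commands actually change the contents of the file /etc/oracle/scls_scr/raw/root/crsstart.
CRS is made up of three services, CRS, CSS and EVM, and each service consists of a set of modules. crsctl can trace each module and write the trace to a log.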
[root@rac1 bin]# ./crsctl lsmodules css
[root@rac1 bin]# ./crsctl lsmodules evm
-- Trace the CSSD module (must be run as root):
[root@rac1 bin]# ./crsctl debug log css "CSSD:1"
Configuration parameter trace is now set to 1.
Set CRSD Debug Module: CSSD Level: 1
-- View the trace log
[root@rac1 cssd]# pwd
/u01/app/oracle/product/crs/log/rac1/cssd
[root@rac1 cssd]# more ocssd.log
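Part 3: Oracle Cluster Registry (OCR):
The OCR holds the configuration of Oracle Clusterware and of the Oracle RAC databases, much like the Windows registry. Alongside it there is the Oracle Local Registry (OLR), which exists on every node and holds that node's own cluster configuration. Oracle Clusterware keeps the configuration of the whole cluster on shared storage, and that storage is the OCR disk. Only one node in the cluster, called the master node, reads and writes the OCR disk. Every node keeps a copy of the OCR in memory, and an OCR process on each node reads from that copy. When the OCR content changes, the OCR process on the master node propagates the change to the OCR processes on the other nodes.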
ocrcheck:
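The ocrcheck command checks the consistency of the OCR contents. Each run writes a log file named ocrcheck_pid.log under $CRS_HOME/log/nodename/client. The command needs no arguments.
[root@rac1 bin]# ./ocrcheck
Part 4: Finally, check the database status:
1) Check the status of the database instances: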
[root@ora102 bin]# ./srvctl status database -d ORCL
Instance orcl1 is running on node ora101
Instance orcl2 is running on node ora102
2) Check the status of the ASM instances:
[root@ora102 bin]# ./srvctl status asm
ASM is running on ora101,ora102
3) Check the CRS status; the following output is normal:
[root@ora102 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
-- Check the daemons individually
[root@rac1 bin]# ./crsctl check cssd
CSS appears healthy
[root@rac1 bin]# ./crsctl check crsd
CRS appears healthy
[root@rac1 bin]# ./crsctl check evmd
EVM appears healthy
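Summary: an Oracle RAC cluster is a single unit, and its nodes should be started and shut down together. If only one node is started, the other node's VIP fails over to it, and the voting disk vote evicts the other node from the cluster, which is how a split-brain is resolved. The basic approach to recovering from this is: first restart CRS on the evicted node (crsctl stop crs, then crsctl start crs). If that does not work, clear the configuration on node 2, rerun root.sh, and then run crsctl start crs to bring CRS up.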