Bug 608397
Summary: | rgmanager: fail to recover from clurgmgrd crash | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | yeylon <yeylon> | ||||||
Component: | rgmanager | Assignee: | Lon Hohberger <lhh> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Brandon Perkins <bperkins> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 5.5.z | CC: | bperkins, ccaulfie, cluster-maint, edamato, jkortus, jwest, srevivo, tdunnon, ykaul | ||||||
Target Milestone: | rc | Keywords: | ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | rgmanager-2.0.52-6.10.el5 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 608709 609181 (view as bug list) | Environment: | |||||||
Last Closed: | 2011-01-13 23:27:02 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 609182 | ||||||||
Attachments: |
Description
yeylon@redhat.com
2010-06-27 10:02:49 UTC
rgmanager-2.0.52-6.el5_5.7

This looks like a kernel bug:

kvm: exiting hardware virtualization
Synchronizing SCSI cache for disk sdu:
Synchronizing SCSI cache for disk sdt:
Synchronizing SCSI cache for disk sds:
Synchronizing SCSI cache for disk sdr:
Synchronizing SCSI cache for disk sdq:
Synchronizing SCSI cache for disk sdp:
Synchronizing SCSI cache for disk sdo:
Synchronizing SCSI cache for disk sdn:
Synchronizing SCSI cache for disk sdm:
Synchronizing SCSI cache for disk sdl:
Synchronizing SCSI cache for disk sdk:
Synchronizing SCSI cache for disk sdj:
Synchronizing SCSI cache for disk sdi:
Synchronizing SCSI cache for disk sdh:
Synchronizing SCSI cache for disk sdg:
Synchronizing SCSI cache for disk sdf:
Synchronizing SCSI cache for disk sde:
Synchronizing SCSI cache for disk sdd:
Synchronizing SCSI cache for disk sdc:
Synchronizing SCSI cache for disk sdb:
Restarting system.
.
machine restart

I am still logged in to this machine. This occurred after the syscall 'reboot(RB_AUTOBOOT)', which should never fail.

Jun 28 17:13:09 green-vdsa clurgmgrd[5472]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 17:13:09 green-vdsa kernel: md: stopping all md devices.

The machine was running one qemu-kvm instance.

As a crash recovery measure, rgmanager has a watchdog process which reboots the host if the main rgmanager process fails unexpectedly. This causes the node to get fenced and rgmanager to recover the service on the other host.

When we kill rgmanager proper, the watchdog process calls reboot(RB_AUTOBOOT). At this point, we log the above messages in comment #2, the kernel reported messages as per comment #2, and the machine never rebooted.

I've cloned this for the kernel component, but I believe it is possible to work around this issue in rgmanager by issuing a cman_kill_node() to the local host. I will test this.

The cman_kill_node(x, me) works as expected during preliminary testing. I will test another scenario prior to posting a patch here.
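For readers unfamiliar with the watchdog arrangement described above, the following is a minimal sketch of the general pattern, not rgmanager's actual code. It assumes the watchdog is the parent of the main daemon and learns of its death through waitpid(). The names watch_child, crash_action_fn, record_crash and crash_count are hypothetical and exist only for illustration; in a real watchdog the crash action would be something like reboot(RB_AUTOBOOT).

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type: what to do when the watched process crashes. */
typedef void (*crash_action_fn)(void);

/* Stand-in crash action so the sketch can be exercised without rebooting. */
int crash_count = 0;
void record_crash(void)
{
    crash_count++;
}

/*
 * Block until the child terminates.
 * Returns 1 (after running on_crash) if the child was killed by a signal,
 * 0 if it exited on its own, and -1 if waitpid() failed.
 */
int watch_child(pid_t child, crash_action_fn on_crash)
{
    int status = 0;
    pid_t r;

    assert(on_crash != NULL);

    /* Retry if the wait is interrupted by a signal delivered to us. */
    do {
        r = waitpid(child, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1)
        return -1;

    if (WIFSIGNALED(status)) {
        on_crash();
        return 1;
    }
    return 0;
}
```

Treating only death by signal as a crash is an assumption of this sketch; the real daemon may apply a different rule.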
Jun 28 15:16:52 frederick clurgmgrd[2936]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 15:16:52 frederick openais[2858]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading all openais components
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_confdb v0 (20/10)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cpg v0 (19/8)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cfg v0 (18/7)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_msg v0 (17/6)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_lck v0 (16/5)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evt v0 (15/4)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_ckpt v0 (14/3)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_amf v0 (13/2)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_clm v0 (12/1)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evs v0 (11/0)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cman v0 (10/9)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting).
Jun 28 15:16:53 frederick dlm_controld[2894]: cluster is down, exiting
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd_dispatch error -1 errno 0
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd connection died
Jun 28 15:16:53 frederick gfs_controld[2900]: cluster is down, exiting
Jun 28 15:16:53 frederick fenced[2888]: cluster is down, exiting
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 2
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 1
Jun 28 15:16:53 frederick qdiskd[1954]: <err> cman_dispatch: Host is down
Jun 28 15:16:53 frederick qdiskd[1954]: <err> Halting qdisk operations
Jun 28 15:16:54 frederick kernel: md: stopping all md devices.

Irrespective of whether the reboot occurs at this point, the node has kicked itself out of the cluster and will therefore be fenced by the other node.

Created attachment 427487
Make the rgmanager watchdog process kill CMAN if rgmanager (main) crashes
This patch works around the case where the reboot() system call fails for some reason. Effectively, it issues a cman_kill_node(c, my_node_id) to cause an unclean eviction of the cluster node, resulting in the host being fenced by the cluster.
Because there are cases where cman_admin_init() may also fail, we must still reboot from the watchdog process.
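The ordering described above (try to evict the local node, then reboot regardless of whether the eviction worked) can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: the struct recovery_ops hooks, the result flags, and the stub functions are all hypothetical. In the real watchdog the eviction hook would wrap cman_admin_init() and cman_kill_node() on the local node ID, and the reboot hook would wrap reboot(RB_AUTOBOOT).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks; each returns 0 on success and -1 on failure. */
struct recovery_ops {
    int (*evict_self)(void);   /* stands in for cman_admin_init + cman_kill_node */
    int (*reboot_host)(void);  /* stands in for reboot(RB_AUTOBOOT) */
};

/* Bit flags reporting which steps claimed success. */
enum { STEP_EVICTED = 1, STEP_REBOOTED = 2 };

/* Demo stubs that count calls, so the ordering can be checked without a cluster. */
int stub_calls = 0;
int stub_ok(void)   { stub_calls++; return 0; }
int stub_fail(void) { stub_calls++; return -1; }

/*
 * Run both recovery steps in order. The reboot is attempted even if the
 * eviction fails, because the eviction path may be unavailable
 * (for example, if the admin connection to CMAN cannot be opened).
 */
int recover_from_crash(const struct recovery_ops *ops)
{
    int result = 0;

    assert(ops != NULL && ops->evict_self != NULL && ops->reboot_host != NULL);

    if (ops->evict_self() == 0)
        result |= STEP_EVICTED;

    if (ops->reboot_host() == 0)
        result |= STEP_REBOOTED;

    return result;
}
```

Passing the two steps in as function pointers is purely a convenience of the sketch; it keeps the example free of any dependency on libcman headers.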
Another possibility, suggested by Mike Snitzer, is to write 'b' to /proc/sysrq-trigger; I will attach a patch for this as well.

Created attachment 427506
Patch using Mike's proposal
This makes rgmanager tickle /proc/sysrq-trigger in order to do a reboot, which has a less error-prone path than the reboot() system call.
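As a rough illustration of the sysrq approach (a sketch, not the attached patch), the function below writes a single command character to a trigger file. The function name sysrq_trigger and the choice to pass the path as a parameter are assumptions made so the code can be tried against an ordinary file. Writing 'b' to the real /proc/sysrq-trigger as root reboots the machine immediately without syncing or unmounting disks.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one sysrq command character to the given trigger file.
 * Returns 0 if exactly one byte was written, -1 otherwise.
 */
int sysrq_trigger(const char *path, char command)
{
    int fd;
    ssize_t n;

    assert(path != NULL);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    n = write(fd, &command, 1);
    close(fd);

    return (n == 1) ? 0 : -1;
}
```

A watchdog would call sysrq_trigger("/proc/sysrq-trigger", 'b') and, since that call can also fail (for instance if the file cannot be opened), would presumably still fall back to another reboot method afterwards.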
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html