Bug 222919
| Summary: | cman killed by node 2 for reason 3 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Corey Marthaler <cmarthal> | ||||
| Component: | cman | Assignee: | Christine Caulfield <ccaulfie> | ||||
| Status: | CLOSED WORKSFORME | QA Contact: | Cluster QE <mspqa-list> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 5.0 | CC: | cluster-maint, rmccabe, sdake, teigland | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2007-12-19 17:39:01 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Corey Marthaler
2007-01-16 21:14:35 UTC
The first 'killed' message "by node 2 for reason 3" is a very likely a result of the openAIS rejoin - see the first syslog line you posted: Jan 16 08:17:48 taft-01 openais[2556]: [TOTEM] The token was lost in the COMMIT state. So cman has shutdown openais, and all the subsidiary components should tidy up too (as best they can). The second 'killed' message is very different: Jan 16 08:21:42 taft-01 openais[5688]: [CMAN ] cman killed by node 1 for reason 2 This is (I think) groupd detected already running subsystems and deciding it can't continue with them in that state and forcing a leave of the cluster: Jan 16 08:21:42 taft-01 groupd[5696]: found uncontrolled kernel object TAFT_CLUSTER:taft0 in /sys/fs/gfs Jan 16 08:21:42 taft-01 groupd[5696]: local node must be reset to clear 9 uncontrolled instances of gfs and/or dlm So, I reckon the initial cause of this is probably the elevated multicast load causing the openais rejoin to occur. It might be worth opening another bug to cover the fact that GFS doesn't shut down correctly when cman forces an emergency shutdown (though how fixable that is I'm not sure!) It also bothers me that the node was not fenced after being ejected from the cluster but I'll pass on that for the moment cos I'm not sure what (if any) fencing system you're using with revolver. I'll basically repeat what Patrick explained, adding a couple points here and there. 1. openais shuts down taft-01, presumably due to elevated multicast load (whether that's a big issue or not depends I'd guess on how excessive the load was, Steven may be able to comment on that.) 2. the other daemons shut down, abandoning local instances of gfs/dlm. There's no way to umount/leave these instances because the cluster has already gone away. 3. if there are enough other nodes remaining for quorum, they should be fencing taft-01 as a result of step 1. The group_tool output from taft-04 appears to show that taft-01 has been fenced. If it wasn't then there may be a problem in fenced, we'd want to see the fenced logs from the other nodes. What kind of fencing is configured? 4. 'service cman start' was run again. openais joins the cluster, but groupd recognizes the abandoned instances of gfs/dlm from above. From this it knows what happened in steps 1 and 2, and knows that it needs to be reset/fenced. groupd kills cman on the local node (which we see above). At this stage groupd could do one of two different things, it doesn't really matter which: i) It could just sit and wait to be fenced which it knows someone will eventually do. ii) It could attempt to fence _itself_. There's one situation where this can be beneficial. That's when power fencing is being used and there are not enough other nodes to make quorum. The other nodes won't fence/reboot it because they don't have quorum. Fencing itself means it reboots itself. When it comes back up, it'll properly rejoin the cluster and may result in the cluster having quorum again. (groupd is using the fencing system in this case for its side effects of rebooting the node, not in the normal sense of doing a real fencing operation). groupd is currently doing (ii) -- it forks/execs "fence_node" with its own name. So, it appears that multiple things may have gone wrong here. First, openais was shut down. Second, the initial fencing operation didn't reboot the node. Third, when the cluster was restarted on this defunct node, it should have fenced itself, which didn't reboot the node either. Created attachment 146332 [details]
All /var/log/messages output after boot
I'm seeing the same thing on one of our clusters here. When fencing is configured properly, the node that is being killed by cman is fenced endlessly. When I disable fencing, I see (basically) the same output as above. After openais and groupd die, ccsd and clurgmgrd are left running. One of the rgmanager threads is blocked trying to connect to DLM. If I wake it up by stracing it, the thread exits, and almost always, the kernel oopses. All cluster nodes are running the x86_64 20070112.3 tree with the 2.6.18-1.3002.el5xen kernel. Here's the kernel output dlm: closing connection to node 5 dlm: closing connection to node 4 dlm: closing connection to node 3 dlm: closing connection to node 2 dlm: closing connection to node 1 Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP: [<ffffffff8032ade1>] list_del+0x8/0x71 PGD 698b9067 PUD 69f99067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:01:00.0/0000:02:08.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor CPU 1 Modules linked in: bridge netloop netbk blktap blkbk nfs lockd fscache nfs_acl lock_dlm gfs2 dlm configfs sunrpc ipv6 video sbs i2c_ec i2c_core button batter y asus_acpi ac parport_pc lp parport joydev sr_mod sg pcspkr bnx2 shpchp ide_cd serial_core cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage qla2xxx scsi_transport_fc mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 3970, comm: clurgmgrd Not tainted 2.6.18-1.3002.el5xen #1 RIP: e030:[<ffffffff8032ade1>] [<ffffffff8032ade1>] list_del+0x8/0x71 RSP: e02b:ffff880069929dd8 EFLAGS: 00010246 RAX: 0000000000000002 RBX: ffff88006a7d8000 RCX: ffff88006a468000 RDX: ffffffffff578040 RSI: ffff880075b407a0 RDI: ffff88006a7d8000 RBP: ffff88006a7d8000 R08: ffff880069928000 R09: 0000000000000001 R10: ffff880075a4cac0 R11: ffff880075b407a0 R12: 00000000fffffe00 R13: ffff88006a7d8030 R14: 0000000000000020 R15: ffff880073f660d8 FS: 00002aaaaaabba00(0000) GS:ffffffff8058d080(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process clurgmgrd (pid: 3970, threadinfo ffff880069928000, task ffff880075b407a0) Stack: ffff880069929e08 ffffffff8841c853 0000000000000071 00000000f14c7dd0 ffff880069929eb0 ffff88006a7d84c4 6567616e616d6772 0000000000000072 0000000000000000 0000000000000000 Call Trace: [<ffffffff8841c853>] :dlm:dlm_new_lockspace+0x6b7/0x76a [<ffffffff8841f8b3>] :dlm:device_write+0x3b7/0x5a7 [<ffffffff80216292>] vfs_write+0xce/0x174 [<ffffffff80216a8e>] sys_write+0x45/0x6e [<ffffffff8025c5dd>] tracesys+0xa7/0xb2 Code: 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 a8 4b 47 80 RIP [<ffffffff8032ade1>] list_del+0x8/0x71 RSP <ffff880069929dd8> CR2: 0000000000000002 <1>Unable to handle kernel paging request at 0000000100000001 RIP: [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45 PGD 699e9067 PUD 0 Oops: 0000 [2] SMP last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:01:00.0/0000:02:08.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor CPU 0 Modules linked in: bridge netloop netbk blktap blkbk nfs lockd fscache nfs_acl lock_dlm gfs2 dlm configfs sunrpc ipv6 video sbs i2c_ec i2c_core button batter y asus_acpi ac parport_pc lp parport joydev sr_mod sg pcspkr bnx2 shpchp ide_cd serial_core cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage qla2xxx scsi_transport_fc mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 3972, comm: dlm_scand Not tainted 2.6.18-1.3002.el5xen #1 RIP: e030:[<ffffffff8841ca7a>] [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45 RSP: e02b:ffff88006985fee0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000100000001 RCX: ffff88006b1a6080 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006a7d8000 RBP: ffffffff8841ca63 R08: ffff88006985e000 R09: 0000000000000000 R10: ffffffff804b9ce0 R11: ffff88006b1a6080 R12: ffff880069929c98 R13: 0000000000000000 R14: ffff880069929c88 R15: ffffffff8029442b FS: 00002aaaaaaba2a0(0000) GS:ffffffff8058d000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process dlm_scand (pid: 3972, threadinfo ffff88006985e000, task ffff88006b1a6080) Stack: 0000000000000000 ffffffff80232807 0000000000000000 0000000000000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff ffffffffffffffff ffff880001004460 Call Trace: [<ffffffff80232807>] kthread+0xfe/0x132 [<ffffffff8025ce10>] child_rip+0xa/0x12 [<ffffffff8029442b>] keventd_create_kthread+0x0/0x61 [<ffffffff80232709>] kthread+0x0/0x132 [<ffffffff8025ce06>] child_rip+0x0/0x12 Code: 48 8b 03 0f 18 08 48 81 fb 10 3c 43 88 75 e6 69 3d a9 e3 00 RIP [<ffffffff8841ca7a>] :dlm:dlm_scand+0x17/0x45 RSP <ffff88006985fee0> CR2: 0000000100000001 I'll create an attachment for everything that is logged to /var/log/messages. I can reproduce this at will and anyone who wants to take a look into it, can use the machine. "When fencing is configured properly, the node that is being killed by cman is fenced endlessly." That sounds wrong. I just checked the logs on the node that did the fencing, and it looks like you're right. The fence operation succeeded but reported failure. Looks like a bug in the new fence APC agent. Argh. Sorry, I spoke too quickly. I stopped the fencing by just removing exec perms on the fence agent script, and the failed fence operations I was seeing were a result of that. Looking back farther into the logs, the node really was being fenced as soon as it joined: Jan 23 13:16:03 marge fenced[2947]: homer.lab.boston.redhat.com not a cluster me mber after 2 sec post_fail_delay Jan 23 13:16:03 marge fenced[2947]: fencing node "homer.lab.boston.redhat.com" Jan 23 13:16:06 marge openais[2931]: [MAIN ] Killing node homer.lab.boston.redha t.com because it has rejoined the cluster without cman_tool join Jan 23 13:16:36 marge fenced[2947]: fence "homer.lab.boston.redhat.com" success ad nauseam That looks pretty much like a consequence of the first error. The cluster is unstable because a node has rejoined without going through cman. I suspect the node that gets repeatedly fenced is the victim of "CMAN: Joined a cluster with disallowed nodes. must die". The way around this is to increase the token timeout in cluster.conf eg: <totem token="15000"/> Patrick, if this error occurs, ie: cman is killed by another node, shouldn't the remaining cluster be stable? In the case of the first error, I believe we should fence the node whether or not there is quorum by running reboot -fin from within cman instead of exiting. Increasing the token timeout simply causes this case not to happen, which in the end it eventually will. It depends. If the node is killed because the cluster is already unstable without it (which I think is what's happening here) then removing a node that hasn't really joined isn't going to make any difference. The reason the 'rejoined' node isn't killed is because the cluster is not quorate and so doesn't know whether it is part of the stable portion (or even if there is one). This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. I think for sites with an "elevated multicast load" we should probably recommend upping the timeouts (I'm sure that happens with RHEL4 anyway). The message is improved in the head of CVS & the RHEL5 branch. I'll mark this as NEEDINFO - let me know if it still fails with increased timeouts. Can't currently reproduce, moving off the release criteria. This bug hasn't been seen in almost a year, closing... |