Bug 429927
| Field | Value |
|---|---|
| Summary | qdisk does not check the heuristics |
| Product | Red Hat Enterprise Linux 5 |
| Reporter | Thorsten Scherf <tscherf> |
| Component | cman |
| Assignee | Lon Hohberger <lhh> |
| Status | CLOSED ERRATA |
| QA Contact | GFS Bugs <gfs-bugs> |
| Severity | urgent |
| Priority | urgent |
| Version | 5.1 |
| CC | cluster-maint, edamato, mceci, tellis |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Fixed In Version | RHBA-2008-0347 |
| Doc Type | Bug Fix |
| Last Closed | 2008-05-21 15:58:44 UTC |
| Bug Blocks | 430574 |

Description
Thorsten Scherf
2008-01-23 20:37:03 UTC
Ping's not exiting. It gets stuck in a loop: recvmsg ... -1 / EAGAIN

This only seems to happen for me if I start qdiskd from within the init script. If I start it by hand, it works fine.

Created attachment 292708
Kills ping after 2 seconds
I placed this script (ping-wrap) in /sbin - and changed my heuristic from:
ping -c2 -t2 192.168.79.254
to:
ping-wrap -c2 -t2 192.168.79.254
... and it worked.
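
The ping-wrap attachment itself is not shown in this report. As a rough sketch of the same idea (run the heuristic command, kill it if it has not exited after 2 seconds), something like the following would do. The function name `run_with_deadline` and the choice to return 1 on timeout are assumptions made for illustration, not the contents of the attachment.

```python
import subprocess

# Assumption: 2 seconds, taken from the attachment title "Kills ping after 2 seconds".
TIMEOUT_SECONDS = 2


def run_with_deadline(argv, timeout=TIMEOUT_SECONDS):
    """Run argv and return its exit status, or 1 if it had to be killed.

    subprocess.run() kills the child with SIGKILL when the timeout expires.
    SIGKILL cannot be blocked, so this works even if the child inherited a
    signal mask that blocks SIGALRM.
    """
    try:
        return subprocess.run(argv, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # The command hung: report failure so the heuristic counts as "down".
        return 1
```

Used as a wrapper script, it would finish with something like `sys.exit(run_with_deadline(["ping"] + sys.argv[1:]))`, so that `ping-wrap -c2 -t2 192.168.79.254` passes its arguments straight through to ping.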
I'm wrong - it seems that if you do 'iptables -A OUTPUT -d <ipaddr> -j DROP', ping always hangs, qdiskd or not!

    [root@molly ~]# ping 192.168.79.254
    PING 192.168.79.254 (192.168.79.254) 56(84) bytes of data.
    ping: sendmsg: Operation not permitted
    ping: sendmsg: Operation not permitted
    ping: sendmsg: Operation not permitted
    [root@molly ~]# iptables -F
    [root@molly ~]# ping 192.168.79.254
    PING 192.168.79.254 (192.168.79.254) 56(84) bytes of data.
    64 bytes from 192.168.79.254: icmp_seq=1 ttl=255 time=13.4 ms

    --- 192.168.79.254 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 13.440/13.440/13.440/0.000 ms

However, as Thorsten figured out:

It seems that if you do 'iptables -A OUTPUT -d <ipaddr> -j REJECT', ping works fine.

... it's because ping's using SIGALRM, and qdiskd blocks it.

Created attachment 292717
Fix
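
The attached patch is not reproduced in this report. As an illustration of the mechanism only (this is not qdiskd's code, which is C), the sketch below shows that a blocked SIGALRM is inherited by a child process across fork and exec, and that resetting the signal mask in the child before exec clears it. The helper names `restore_signals` and `child_sees_sigalrm_blocked` are made up for this example.

```python
import signal
import subprocess
import sys

# Child program: print whether SIGALRM is blocked in the mask it inherited.
# pthread_sigmask(SIG_BLOCK, []) changes nothing and returns the current mask.
CHILD = (
    "import signal;"
    "print(signal.SIGALRM in signal.pthread_sigmask(signal.SIG_BLOCK, []))"
)


def restore_signals():
    """Runs in the child between fork() and exec(): unblock every signal."""
    signal.pthread_sigmask(signal.SIG_SETMASK, [])


def child_sees_sigalrm_blocked(preexec_fn=None):
    """Spawn the child program and report what it saw in its signal mask."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        preexec_fn=preexec_fn,  # fine for a single-threaded sketch like this
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    ).stdout
    return out.strip() == "True"


# The parent blocks SIGALRM, as a daemon might do for its own purposes.
signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGALRM])

inherited = child_sees_sigalrm_blocked()                # mask leaks into the child
restored = child_sees_sigalrm_blocked(restore_signals)  # mask reset before exec

print("without restore, SIGALRM blocked in child:", inherited)
print("with restore, SIGALRM blocked in child:", restored)
```

A program such as ping that relies on SIGALRM for its timeout never sees the alarm in the first case. In C, the equivalent of `restore_signals` would be a `sigprocmask(SIG_SETMASK, ...)` call with an empty set in the child, after `fork()` and before `exec`.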
> * why does no node gets fenced when the cluster isn't quorate?
My understanding is that nodes within quorum should fence failed nodes, but
outsiders shouldn't fence anything in a running or trying to form cluster.
Normally, that's the case, with some exceptions:

(a) If using a two_node="1" cluster, nodes who cannot see each other will try to fence each other
(b) if using qdisk, and your heuristics are "good" while a "majority" is "bad", you can gain quorum and then fence the majority set of nodes

But generally, yes, only the quorate partition fences.

In the case Thorsten was worried about, fencing didn't occur due to a bug in qdiskd - where it was blocking signals in child processes. Qdiskd wasn't declaring the node dead like it should have - so the node appeared "just fine".

What should have happened is the node with the iptables rule should have rebooted (or removed its qdisk vote(s) if reboot was set to 0) - allowing the good node to fence it.

What happened was that because the heuristic hung (and never exited), qdiskd didn't remove the node at all. CMAN, however, thought it was alive-and-quorate, causing a fence-race.

Patch to restore signals in CVS

cman-2.0.81 fixes this given the test case in comment #6

Marking verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html