Bug 152100
| Summary: | recovery deadlock after having shot the master | | |
| --- | --- | --- | --- |
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | gulm | Assignee: | Chris Feist <cfeist> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3 | CC: | cluster-maint, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2006-0269 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-03-27 18:10:41 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler
2005-03-24 20:23:38 UTC
```
[root@tank-03 root]# gulm_tool getstats tank-03:lt000
I_am = Master
run time = 13053
pid = 7280
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 22
locks = 35226
unlocked = 8031
exclusive = 9810
shared = 17385
deferred = 0
lvbs = 1902
expired = 6359
lock ops = 1807295
conflicts = 30
incomming_queue = 0
conflict_queue = 27
reply_queue = 0
free_locks = 54774
free_lkrqs = 49
used_lkrqs = 49
free_holders = 89266
used_holders = 40734
highwater = 1048576
```

I also grabbed a lock dump from tank-01 and tank-03 and put them in /home/msp/cmarthal/pub/bugs/152100.

One last thing, gulm was using the usedev interface:

```
[root@tank-02 root]# cat tank-cluster.cca
#CCA:tank-cluster
#nigeb=cluster.ccs mtime=1111531650 size=146
cluster{
  name="tank-cluster"
  lock_gulm{
    servers=["tank-01.lab.msp.redhat.com","tank-03.lab.msp.redhat.com","tank-05.lab.msp.redhat.com"]
  }
}
#dne=cluster.ccs hash=348D0569
#nigeb=nodes.ccs mtime=1111608446 size=864
nodes{
  tank-01.lab.msp.redhat.com{
    ip_interfaces { eth1="10.1.1.91" }
    usedev = "eth1"
    fence{ fence1{ tank-apc{ port="1" switch="1" } } }
  }
  tank-02.lab.msp.redhat.com{
    ip_interfaces { eth1="10.1.1.92" }
    usedev = "eth1"
    fence{ fence1{ tank-apc{ port="2" switch="1" } } }
  }
  tank-03.lab.msp.redhat.com{
    ip_interfaces { eth1="10.1.1.93" }
    usedev = "eth1"
    fence{ fence1{ tank-apc{ port="3" switch="1" } } }
  }
  tank-04.lab.msp.redhat.com{
    ip_interfaces { eth1="10.1.1.94" }
    usedev = "eth1"
    fence{ fence1{ tank-apc{ port="4" switch="1" } } }
  }
  tank-05.lab.msp.redhat.com{
    ip_interfaces { eth1="10.1.1.95" }
    usedev = "eth1"
    fence{ fence1{ tank-apc{ port="5" switch="1" } } }
  }
}
#dne=nodes.ccs hash=9FD1F953
#nigeb=fence.ccs mtime=1111531650 size=105
fence_devices{
  tank-apc{
    agent="fence_apc"
    ipaddr="192.168.47.32"
    login="apc"
    passwd="apc"
  }
}
#dne=fence.ccs hash=B2725FC6
```

I've been seeing this lock up in my tests as well. The lock dumps show that there are extra locks hanging around for a node after that node has been expired. I've not been able to figure out what's going on in my environment yet. My setups had a few more nodes: ~13 clients that each mounted two GFS filesystems (gfs1 and gfs2), plus three dedicated gulm server nodes. Sometimes I could get the lockup to happen right away, other times it took a while (~days). I did not have ip_interfaces{} or usedev sections in my nodes.ccs files.

Well, the "tank-04 lock_gulmd_LTPX[7281]: Errors trying to login to LT000: (tank-03.lab.msp.redhat.com:10.1.1.93) 1006:Not Allowed" message is why things aren't mounting. No locks. Just gotta figure out why....

Well, in Corey's case, I'm gonna bet this has something to do with the config saying tank-04.lab.msp.redhat.com = 10.1.1.94, but gulm_tool nodelist saying tank-04.lab.msp.redhat.com = 192.168.44.94. Looking into it.

So, Corey said that 10.1.1.* can ping 192.168.44.* and vice versa. Which now means that gulm_core can go from 192.168.44.* -> 10.1.1.* and gulm_LT can try to connect from 10.1.1.* to 10.1.1.*, which will confuse the hell out of gulm and give the above errors. So this might be the problem here. I'm not quite sure how this all works out. Need to poke about more.

What did tank-04 report as its IP? (grep the log for 'I am ')

Those machines have since been reimaged to RHEL4, so all log info has been lost :( but I'm almost positive I saw it was using 10.1.1.94.
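For context on what the usedev/ip_interfaces settings are supposed to pin down, here is a minimal sketch (assumed names, not gulm's actual code) of how a daemon can look up the IPv4 address of a configured interface with getifaddrs(), so it reports the address the cluster config names (eth1 = 10.1.1.x) rather than whatever its hostname happens to resolve to:

```c
/* Hypothetical sketch (not gulm's actual code): resolve the IPv4 address of
 * a configured "usedev" interface with getifaddrs(), so the daemon uses the
 * address the cluster config expects (e.g. eth1 = 10.1.1.94 on tank-04). */
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Return 0 and fill 'out' with the IPv4 address of interface 'dev', or -1. */
static int addr_for_usedev(const char *dev, struct in_addr *out)
{
    struct ifaddrs *ifa_list, *ifa;
    int ret = -1;

    if (getifaddrs(&ifa_list) != 0)
        return -1;

    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        if (strcmp(ifa->ifa_name, dev) == 0) {
            *out = ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr;
            ret = 0;
            break;
        }
    }
    freeifaddrs(ifa_list);
    return ret;
}

int main(void)
{
    struct in_addr addr;
    char buf[INET_ADDRSTRLEN];

    if (addr_for_usedev("eth1", &addr) == 0) {
        inet_ntop(AF_INET, &addr, buf, sizeof(buf));
        printf("usedev eth1 -> %s\n", buf);  /* expect 10.1.1.94 on tank-04 */
    } else {
        fprintf(stderr, "no IPv4 address on eth1\n");
    }
    return 0;
}
```

If a node instead shows up under its other address (as tank-04 did with 192.168.44.94 in gulm_tool nodelist), that is the config-vs-runtime mismatch described above.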
tank-04 would have had:

```
/etc/sysconfig/network-scripts/ifcfg-eth1
TYPE="Ethernet"
DEVICE="eth1"
ONBOOT="yes"
BOOTPROTO="static"
IPADDR="10.1.1.94"
NETMASK="255.255.255.0"

/etc/hosts
127.0.0.1 localhost.localdomain localhost
10.1.1.91 tank-01-pvt.lab.map.redhat.com tank-01-pvt
10.1.1.92 tank-02-pvt.lab.map.redhat.com tank-02-pvt
10.1.1.93 tank-03-pvt.lab.map.redhat.com tank-03-pvt
10.1.1.94 tank-04-pvt.lab.map.redhat.com tank-04-pvt
10.1.1.95 tank-05-pvt.lab.map.redhat.com tank-05-pvt
```

Possibly fixed: binding outgoing sockets to the address we think we should have.

It looks like this bug isn't fixed yet, re-opening.

This has been committed and should be in the next release of GFS for RHEL3.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0269.html
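For reference, a minimal sketch of the "binding outgoing sockets to the address we think we should have" approach mentioned above (generic sockets code under assumed names, not the actual gulm change): bind() the socket to the chosen local address before connect(), so the peer sees the connection coming from the configured address instead of whichever source address the routing table picks.

```c
/* Hypothetical sketch (not the actual gulm patch): bind an outgoing TCP
 * socket to a chosen local address before connect(), so the peer sees the
 * connection coming from the address the config expects. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to server_ip:port using local_ip as the source address.
 * Returns a connected fd, or -1 on error. */
static int connect_from(const char *local_ip, const char *server_ip,
                        unsigned short port)
{
    struct sockaddr_in local, remote;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = 0;                      /* any source port */
    inet_pton(AF_INET, local_ip, &local.sin_addr);

    memset(&remote, 0, sizeof(remote));
    remote.sin_family = AF_INET;
    remote.sin_port = htons(port);
    inet_pton(AF_INET, server_ip, &remote.sin_addr);

    /* The bind() is the point: without it the kernel picks the source
     * address from the routing table, which is how one daemon could end up
     * talking from 192.168.44.* while another talked from 10.1.1.*. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0 ||
        connect(fd, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```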