Bug 177467
Summary: | clustat hangs during rgmanager recovery testing w/ gulm | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Nate Straz <nstraz>
Component: | gulm | Assignee: | Chris Feist <cfeist>
Status: | CLOSED WONTFIX | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | cfeist, cluster-maint, dex.chen, sghosh
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-11-14 00:00:41 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Nate Straz
2006-01-10 21:37:09 UTC
FYI, on why I jumped from clustat to clurgmgrd

> Start with what clustat is doing

[root@morph-03 ~]# ps -C clustat
  PID TTY          TIME CMD
 3634 ?        00:00:00 clustat

[root@morph-03 ~]# strace -p 3634
Process 3634 attached - interrupt to quit
recv(5,  <unfinished ...>
Process 3634 detached

> Look at what fd 5 is

[root@morph-03 ~]# lsof -p 3634
COMMAND  PID USER  FD  TYPE DEVICE    SIZE    NODE NAME
clustat 3634 root cwd   DIR  253,0    4096       2 /
clustat 3634 root rtd   DIR  253,0    4096       2 /
clustat 3634 root txt   REG  253,0   42628 1784514 /usr/sbin/clustat
clustat 3634 root mem   REG  253,0   79488  472443 /lib/libresolv-2.3.4.so
clustat 3634 root mem   REG  253,0   11940  491004 /lib/magma/magma_gulm.so
clustat 3634 root mem   REG  253,0   59755 1775641 /usr/lib/libgulm.so.1.0.0
clustat 3634 root mem   REG  253,0   21280  472396 /lib/libnss_dns-2.3.4.so
clustat 3634 root mem   REG  253,0   45889  473911 /lib/libnss_files-2.3.4.so
clustat 3634 root mem   REG  253,0  106397  472354 /lib/ld-2.3.4.so
clustat 3634 root mem   REG  253,0 1454546  472397 /lib/tls/libc-2.3.4.so
clustat 3634 root mem   REG  253,0   15324  472399 /lib/libdl-2.3.4.so
clustat 3634 root mem   REG  253,0   93940  472400 /lib/tls/libpthread-2.3.4.so
clustat 3634 root mem   REG  253,0  879961 1782980 /usr/lib/libncurses.so.5.4
clustat 3634 root   0u IPv4   8553             TCP morph-03.lab.msp.redhat.com:32803->tin.msp.redhat.com:5013 (ESTABLISHED)
clustat 3634 root   1u IPv4   8554             TCP morph-03.lab.msp.redhat.com:32804->tin.msp.redhat.com:5014 (ESTABLISHED)
clustat 3634 root   2u IPv4   8555             TCP morph-03.lab.msp.redhat.com:32805->tin.msp.redhat.com:5015 (ESTABLISHED)
clustat 3634 root   3u IPv6   8558             TCP [::1]:32806->[::1]:40040 (ESTABLISHED)
clustat 3634 root   4u IPv6   8560             TCP [::1]:32807->[::1]:40042 (ESTABLISHED)
clustat 3634 root   5u IPv4   8592             TCP morph-03.lab.msp.redhat.com:32808->morph-03.lab.msp.redhat.com:41966 (ESTABLISHED)

> See who else is connected to the socket

[root@morph-03 ~]# lsof -i 4tcp:32808
COMMAND  PID USER  FD  TYPE DEVICE SIZE NODE NAME
clustat 3634 root   5u IPv4   8592      TCP  morph-03.lab.msp.redhat.com:32808->morph-03.lab.msp.redhat.com:41966 (ESTABLISHED)

[root@morph-03 ~]# lsof -i 4tcp:41966
COMMAND    PID USER  FD  TYPE DEVICE SIZE NODE NAME
clurgmgrd 2858 root   7u IPv4   7968      TCP  *:41966 (LISTEN)
clustat   3634 root   5u IPv4   8592      TCP  morph-03.lab.msp.redhat.com:32808->morph-03.lab.msp.redhat.com:41966 (ESTABLISHED)

> No one seems to be, but clurgmgrd is listening.  What is clurgmgrd up to?

[root@morph-03 ~]# strace -p 2858
Process 2858 attached - interrupt to quit
recv(5,  <unfinished ...>
Process 2858 detached

> Stuck in a recv too...
[root@morph-03 ~]# lsof -p 2858
COMMAND    PID USER  FD  TYPE     DEVICE    SIZE    NODE NAME
clurgmgrd 2858 root cwd   DIR      253,0    4096       2 /
clurgmgrd 2858 root rtd   DIR      253,0    4096       2 /
clurgmgrd 2858 root txt   REG      253,0  122872 1785629 /usr/sbin/clurgmgrd
clurgmgrd 2858 root mem   REG      253,0   45889  473911 /lib/libnss_files-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0   79488  472443 /lib/libresolv-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0  965536 1788673 /usr/lib/libxml2.so.2.6.16
clurgmgrd 2858 root mem   REG      253,0   31184  472484 /lib/libgcc_s-3.4.5-20051201.so.1
clurgmgrd 2858 root mem   REG      253,0   59755 1775641 /usr/lib/libgulm.so.1.0.0
clurgmgrd 2858 root mem   REG      253,0  106397  472354 /lib/ld-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0 1454546  472397 /lib/tls/libc-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0   15324  472399 /lib/libdl-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0  178019  472401 /lib/tls/libm-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0   11940  491004 /lib/magma/magma_gulm.so
clurgmgrd 2858 root mem   REG      253,0   93940  472400 /lib/tls/libpthread-2.3.4.so
clurgmgrd 2858 root mem   REG      253,0   63624 1783259 /usr/lib/libz.so.1.2.1.2
clurgmgrd 2858 root mem   REG      253,0   21280  472396 /lib/libnss_dns-2.3.4.so
clurgmgrd 2858 root   0u  CHR        1,3            1729 /dev/null
clurgmgrd 2858 root   1u  CHR        1,3            1729 /dev/null
clurgmgrd 2858 root   2u  CHR        1,3            1729 /dev/null
clurgmgrd 2858 root   3u unix 0xf5ec3d00            6670 socket
clurgmgrd 2858 root   4u IPv6       7963             TCP [::1]:32798->[::1]:40040 (ESTABLISHED)
clurgmgrd 2858 root   5u IPv6       7965             TCP [::1]:32799->[::1]:40042 (ESTABLISHED)
clurgmgrd 2858 root   6u IPv6       7967             TCP *:41967 (LISTEN)
clurgmgrd 2858 root   7u IPv4       7968             TCP *:41966 (LISTEN)
clurgmgrd 2858 root   8u IPv6       7979             TCP *:41969 (LISTEN)
clurgmgrd 2858 root   9u IPv4       7980             TCP *:41968 (LISTEN)

[root@morph-03 ~]# lsof -i 6:32799
COMMAND    PID USER  FD  TYPE DEVICE SIZE NODE NAME
lock_gulm 2339 root  10u IPv6   7966      TCP  [::1]:40042->[::1]:32799 (ESTABLISHED)
clurgmgrd 2858 root   5u IPv6   7965      TCP  [::1]:32799->[::1]:40042 (ESTABLISHED)

> ... talking with lock_gulm ...
[root@morph-03 ~]# strace -p 2339
Process 2339 attached - interrupt to quit
gettimeofday({1136930470, 970353}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}], 7, 1000) = 0
gettimeofday({1136930471, 970910}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}], 7, 1000) = 0
gettimeofday({1136930472, 971728}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}], 7, 1000) = 0
gettimeofday({1136930473, 972600}, NULL) = 0
poll( <unfinished ...>
Process 2339 detached

[root@morph-03 ~]# lsof -p 2339
COMMAND    PID USER  FD  TYPE     DEVICE    SIZE    NODE NAME
lock_gulm 2339 root cwd   DIR      253,0    4096       2 /
lock_gulm 2339 root rtd   DIR      253,0    4096       2 /
lock_gulm 2339 root txt   REG      253,0  205708  521381 /sbin/lock_gulmd
lock_gulm 2339 root mem   REG      253,0   79488  472443 /lib/libresolv-2.3.4.so
lock_gulm 2339 root mem   REG      253,0   21280  472396 /lib/libnss_dns-2.3.4.so
lock_gulm 2339 root mem   REG      253,0  106397  472354 /lib/ld-2.3.4.so
lock_gulm 2339 root mem   REG      253,0 1454546  472397 /lib/tls/libc-2.3.4.so
lock_gulm 2339 root mem   REG      253,0   45889  473911 /lib/libnss_files-2.3.4.so
lock_gulm 2339 root   0u  CHR        1,3            1729 /dev/null
lock_gulm 2339 root   1u  CHR        1,3            1729 /dev/null
lock_gulm 2339 root   2u  CHR        1,3            1729 /dev/null
lock_gulm 2339 root   3u unix 0xf64ba800            5299 socket
lock_gulm 2339 root  4wW  REG      253,0       5 1693977 /var/run/sistina/lock_gulmd_LTPX.pid
lock_gulm 2339 root   5u IPv6       5301             TCP *:40042 (LISTEN)
lock_gulm 2339 root   6u IPv6       5302             TCP [::1]:32778->[::1]:40040 (ESTABLISHED)
lock_gulm 2339 root   7u IPv6       5307             TCP [::1]:40042->[::1]:32780 (ESTABLISHED)
lock_gulm 2339 root   8u IPv6       5347             TCP morph-03.lab.msp.redhat.com:32786->morph-05.lab.msp.redhat.com:41040 (ESTABLISHED)
lock_gulm 2339 root   9u IPv6       5523             TCP [::1]:40042->[::1]:32793 (ESTABLISHED)
lock_gulm 2339 root  10u IPv6       7966             TCP [::1]:40042->[::1]:32799 (ESTABLISHED)
lock_gulm 2339 root  11u IPv6       8561             TCP [::1]:40042->[::1]:32807 (ESTABLISHED)

I hit it again and was able to dump the lock table by sending SIGUSR1 to all lock_gulmd processes. (Might have to increase verbosity with gulm_tool setverb.) I found one lock where people were waiting.
key            : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
LVBlen         : 0
LVB            :
HolderCount    : 0
Holders        :
LVBHolderCount : 0
LVBHolders     :
ExpiredCount   : 1
ExpiredHolders :
 - name  : morph-05
   subid : 2842
   state : gio_lck_st_Exclusive
   start : 0
   stop  : 18446744073709551615
   flags : 0x32
   idx   : 2
   reply_waiter :
Waiters        :
 - key   : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name  : morph-04
   subid : 2843
   state : gio_lck_st_Exclusive
   flags : 0x32
   start : 0
   stop  : 18446744073709551615
   LVB   : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx   : 5
 - key   : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name  : morph-03
   subid : 2854
   state : gio_lck_st_Exclusive
   flags : 0x32
   start : 0
   stop  : 18446744073709551615
   LVB   : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx   : 4
 - key   : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name  : morph-01
   subid : 2846
   state : gio_lck_st_Exclusive
   flags : 0x32
   start : 0
   stop  : 18446744073709551615
   LVB   : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx   : 2
 - key   : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name  : morph-02
   subid : 2901
   state : gio_lck_st_Exclusive
   flags : 0x32
   start : 0
   stop  : 18446744073709551615
   LVB   : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx   : 6
 - key   : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name  : morph-05
   subid : 2847
   state : gio_lck_st_Exclusive
   flags : 0x32
   start : 0
   stop  : 18446744073709551615
   LVB   : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx   : 8
High_Waiters   :
Action_Waiters :
State_Waiters  :

So, on morph-05, the lock had somehow expired, thereby sending everything else into a lock wait state. Of course, morph-05 was also trying to acquire the lock, it looks like.

Here is the derringer output that goes with this lock dump:

Iteration 4 started at Wed Jan 11 15:39:55 CST 2006

Gulm Status
===========
morph-01: Slave
morph-02: Client
morph-03: Slave
morph-04: Client
morph-05: Master

For this iteration we're gonna RECOVER MACHINES
Those machines facing the derringer=morph-03 morph-02 morph-01 morph-05
Whoops, can't kill every server... morph-01 has been pardoned
Those NOW facing the derringer=morph-03 morph-02 morph-05
Feeling lucky morph-03? Well do ya? Go'head make my day...
No heartbeat from remote host
Remote command exited with unknown state
Feeling lucky morph-02? Well do ya? Go'head make my day...
No heartbeat from remote host
Remote command exited with unknown state
Feeling lucky morph-05? Well do ya? Go'head make my day...
No heartbeat from remote host
Remote command exited with unknown state
morph-03... DEAD
morph-02... DEAD
morph-05... DEAD

The lock masters are inquorate (2 are dead out of 3). Fencing will not occur; the whole cluster hangs at this point.

I have a fix which will make clustat call select() before doing the recv() in msg_peek, but that's not going to solve it all ways :)
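For readers who want to see what the clustat-side change amounts to, the sketch below illustrates the select()-before-recv(MSG_PEEK) pattern described in the comment above. It is only an illustration: the actual magma/clustat sources are not quoted in this report, so the helper name msg_peek_timed, its signature, and the timeout handling are assumptions, not the shipped code.

```c
/*
 * Illustrative sketch only -- not the actual magma/clustat msg_peek().
 * Idea from the comment above: bound the wait with select() before
 * calling recv(..., MSG_PEEK), so a peer that never sends anything and
 * never closes the socket cannot hang the caller forever.
 */
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: peek at pending data, waiting at most timeout_secs.
 * Returns bytes peeked, 0 on orderly shutdown by the peer, -1 on error,
 * or -2 if the timeout expired with nothing to read. */
ssize_t msg_peek_timed(int fd, void *buf, size_t len, int timeout_secs)
{
    fd_set rfds;
    struct timeval tv;
    int ret;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;

    do {
        ret = select(fd + 1, &rfds, NULL, NULL, &tv);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -1;      /* select() itself failed */
    if (ret == 0)
        return -2;      /* nothing became readable within the timeout */

    return recv(fd, buf, len, MSG_PEEK);
}
```

As the comment notes, this only keeps clustat from wedging on a connection that has gone silent; it does not address the underlying gulm recovery behavior.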
(In reply to comment #6)
> The lock masters are inquorate (2 are dead out of 3). Fencing will not occur;
> the whole cluster hangs at this point.

Shouldn't this clear up after morph-03 and morph-05 come back up? I'm still stuck in this state after the cluster regains quorum.

Yes, it should. If we're stuck in recv() in libgulm, I'm not sure how to fix it. I'll have to look more into it.

I hit this again during regression testing.
rgmanager-1.9.46-0
gulm-1.0.6-0

Ok, looking at the libgulm code: when the gulm master cluster loses quorum, it appears that clients using libgulm are never notified, and the socket is not closed. This is why clurgmgrd hangs in that case. There's no notification of either loss of quorum or regain, and libgulm doesn't seem to retry; it just blocks in recv() forever.

We ran into the same problem. When could we get the fix?

This is actually a problem with gulm not notifying clients correctly after a loss of quorum, as far as I can tell. We are not currently planning to fix this because it would require a large amount of code changes. If it becomes a serious issue, we will re-evaluate.
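Since the bug was closed WONTFIX, clients that sit on a libgulm connection can still end up parked in recv() exactly as described above. Purely as an illustrative client-side guard (an editorial sketch, not a gulm fix and not anything shipped in the cluster suite), a receive timeout on the socket makes a stalled recv() fail with EAGAIN/EWOULDBLOCK instead of blocking forever, giving the caller a chance to re-check cluster state:

```c
/*
 * Illustrative client-side guard, not a gulm fix: give the socket a
 * receive timeout so recv() returns with EAGAIN/EWOULDBLOCK after
 * `seconds` instead of blocking indefinitely when the lock server
 * goes silent. The descriptor passed in is a hypothetical connected
 * socket; nothing here comes from the libgulm sources.
 */
#include <sys/socket.h>
#include <sys/time.h>

static int set_recv_timeout(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

A caller using this would treat a timed-out recv() (errno set to EAGAIN or EWOULDBLOCK) as a cue to re-check quorum and membership rather than as a protocol error.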