Description of problem:
[revolver] Iteration 2 started at Fri Sep 23 16:14:21 CDT 2005
[revolver] Gulm Status
[revolver] ===========
[revolver] link-01: Slave
[revolver] link-02: Slave
[revolver] link-08: Master
[revolver] Those facing the revolver=link-08 link-02
[revolver] Feeling lucky link-08? Well do ya? Go'head make my day...
[revolver] Feeling lucky link-02? Well do ya? Go'head make my day...
[revolver] checking pulses for death:
[revolver] link-08... DEAD
[revolver] link-02... DEAD
[revolver] Sleeping a minute and a half for victim(s) to come back to life...
[revolver] Read from remote host link-02: Connection reset by peer
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] Still not all alive, sleeping another 10 seconds
[revolver] All killed nodes are back up, making sure they're sshable...
[revolver] Still not all sshable, sleeping another 10 seconds
[revolver] Still not all sshable, sleeping another 10 seconds
[revolver] Still not all sshable, sleeping another 10 seconds
[revolver] Still not all sshable, sleeping another 10 seconds
[revolver] Still not all sshable, sleeping another 10 seconds
[revolver] Still not all sshable, sleeping another 10 seconds
[revolver] mounting /dev/mapper/gfs-gfs0 on /mnt/gfs0 on link-08
[HANG]

I then attempted a mount of a different filesystem on a different node, and that hung as well:

[...]
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a95576000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a95577000
mprotect(0x37be328000, 12288, PROT_READ) = 0
arch_prctl(0x1002, 0x2a95576b00)        = 0
munmap(0x2a95557000, 123393)            = 0
brk(0)                                  = 0x615000
brk(0x636000)                           = 0x636000
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=39554608, ...}) = 0
mmap(NULL, 39554608, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2a95578000
close(3)                                = 0
umask(022)                              = 022
open("/dev/null", O_RDWR)               = 3
close(3)                                = 0
getuid()                                = 0
geteuid()                               = 0
lstat("/etc/mtab", {st_mode=S_IFREG|0644, st_size=321, ...}) = 0
stat("/sbin/mount.gfs", 0x7fbffff570)   = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, ~[TRAP SEGV RTMIN RT_1], NULL, 8) = 0
mount("/dev/gfs/gfs2", "/mnt/gfs2", "gfs", 0xc0ed0000, 0

Version-Release number of selected component (if applicable):
GFS 2.6.9-42.1 (built Sep 20 2005 16:14:27) installed
Gulm 2.6.9-42.1 (built Sep 20 2005 16:14:11) installed

How reproducible:
Have only seen this once during regression testing. Will try to reproduce to
gather more info.
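For reference, the truncated mount() line at the end of that strace is the raw
mount(2) system call (0xc0ed0000 is the legacy MS_MGC_VAL mount magic; mount(8)
calls the kernel directly since /sbin/mount.gfs does not exist). A minimal
userspace sketch of the same call, using the device and mountpoint from this
report, is below; in the state described above the call simply never returns:

/* Minimal sketch of the hanging mount(2) call seen in the strace above.
 * Uses the device and mountpoint from this report; MS_MGC_VAL is the
 * 0xc0ed0000 magic that mount(8) passes on 2.6-era systems. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>

int main(void)
{
    /* In the failure state this blocks in the kernel and never returns. */
    if (mount("/dev/gfs/gfs2", "/mnt/gfs2", "gfs", MS_MGC_VAL, NULL) < 0)
        fprintf(stderr, "mount: %s\n", strerror(errno));
    return 0;
}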
GFS: Trying to join cluster "lock_gulm", "LINK_128:gfs2"
GFS: fsid=LINK_128:gfs2.2: Joined cluster. Now mounting FS...
Sep 23 11:04:13 link-08 kernel: GFS: Trying to join cluster "lock_gulm", "LINK_128:gfs2"
Sep 23 11:04:13 link-08 kernel: GFS: fsid=LINK_128:gfs2.2: Joined cluster. Now mounting FS...

More info:

[root@link-01 ~]# gulm_tool nodelist $(hostname)
 Name: link-01
  ip           = ::ffff:10.15.84.151
  state        = Logged in
  last state   = Was Logged in
  mode         = Master
  missed beats = 0
  last beat    = 1127494104165538
  delay avg    = 10001062
  max delay    = 18446744072689134512

 Name: link-08
  ip           = ::ffff:10.15.84.158
  state        = Logged in
  last state   = Logged out
  mode         = Slave
  missed beats = 0
  last beat    = 1127494104385518
  delay avg    = 9998692
  max delay    = 18446744072687992192

 Name: link-02
  ip           = ::ffff:10.15.84.152
  state        = Logged in
  last state   = Was Logged in
  mode         = Slave
  missed beats = 0
  last beat    = 1127494108379824
  delay avg    = 10000074
  max delay    = 18446744072689134512

[root@link-01 ~]# gulm_tool getstats link-01:lt000
I_am = Master
run time = 3392
pid = 2225
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 0
exclusive = 2936
shared = 1333
deferred = 0
lvbs = 7640
expired = 4980
lock ops = 4811750
conflicts = 8
incomming_queue = 0
conflict_queue = 8
reply_queue = 0
free_locks = 9
used_locks = 12886
free_lkrqs = 8
used_lkrqs = 8
free_holders = 0
used_holders = 16889
elss = 4416016
highwater = 1048576
I'm unable to reproduce... Waiting for a way to reproduce this.
I think I hit this bug. I was running revolver when two slaves were shot.

Iteration 4 started at Fri Jan 6 10:30:25 CST 2006
Gulm Status
===========
tank-01: Slave
tank-02: Client
tank-03: Slave
tank-04: Client
tank-05: Master
Those facing the revolver=tank-01 tank-03
...
checking Gulm recovery...
mounting /dev/mapper/gfs-gfs0 on /mnt/gfs0 on tank-01
mounting /dev/mapper/gfs-gfs1 on /mnt/gfs1 on tank-01
mounting /dev/mapper/gfs-gfs2 on /mnt/gfs2 on tank-01
[HANG]

/mnt/gfs0 and /mnt/gfs1 were mounted on tank-01. The mount for /mnt/gfs2 was
hung on tank-01. I was able to mount /mnt/gfs0 and /mnt/gfs1 on tank-03. I
tried mounting /mnt/gfs2 on tank-03, but that hung too. It doesn't appear that
recovery completed on tank-05.

- Syslog from tank-05:

Jan 6 10:30:55 tank-05 lock_gulmd_core[2502]: tank-01 missed a heartbeat (time:1136565055589138 mb:1)
Jan 6 10:31:10 tank-05 lock_gulmd_core[2502]: tank-03 missed a heartbeat (time:1136565070591488 mb:1)
Jan 6 10:31:10 tank-05 lock_gulmd_core[2502]: tank-01 missed a heartbeat (time:1136565070591488 mb:2)
Jan 6 10:31:25 tank-05 lock_gulmd_core[2502]: tank-03 missed a heartbeat (time:1136565085594840 mb:2)
Jan 6 10:31:25 tank-05 lock_gulmd_core[2502]: tank-01 missed a heartbeat (time:1136565085594840 mb:3)
Jan 6 10:31:25 tank-05 lock_gulmd_core[2502]: Client (tank-01) expired
Jan 6 10:31:25 tank-05 lock_gulmd_core[4109]: Gonna exec fence_node -O tank-01
Jan 6 10:31:25 tank-05 lock_gulmd_core[2502]: Forked [4109] fence_node -O tank-01 with a 0 pause.
Jan 6 10:31:28 tank-05 fence_node[4109]: Fence of "tank-01" was successful
Jan 6 10:31:28 tank-05 lock_gulmd_core[2502]: found match on pid 4109, marking node tank-01 as logged out.
Jan 6 10:31:28 tank-05 kernel: lock_gulm: Checking for journals for node "tank-01"
Jan 6 10:31:29 tank-05 clurgmgrd[3019]: <info> Magma Event: Membership Change
Jan 6 10:31:29 tank-05 clurgmgrd[3019]: <info> State change: tank-01 DOWN
Jan 6 10:31:40 tank-05 lock_gulmd_core[2502]: tank-03 missed a heartbeat (time:1136565100597186 mb:3)
Jan 6 10:31:40 tank-05 lock_gulmd_core[2502]: Client (tank-03) expired
Jan 6 10:31:40 tank-05 lock_gulmd_core[2502]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
Jan 6 10:31:40 tank-05 lock_gulmd_core[4111]: Gonna exec fence_node -O tank-03
Jan 6 10:31:40 tank-05 lock_gulmd_core[2502]: Forked [4111] fence_node -O tank-03 with a 0 pause.
Jan 6 10:31:40 tank-05 clurgmgrd[3019]: <emerg> #1: Quorum Dissolved
Jan 6 10:31:44 tank-05 fence_node[4111]: Fence of "tank-03" was successful
Jan 6 10:31:44 tank-05 lock_gulmd_core[2502]: found match on pid 4111, marking node tank-03 as logged out.
Jan 6 10:31:44 tank-05 kernel: lock_gulm: Checking for journals for node "tank-03"
Jan 6 10:31:48 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:31:48 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:31:58 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:31:58 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:32:08 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:32:08 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:32:18 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:32:18 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:32:28 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:32:28 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:32:38 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:32:38 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:32:48 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:32:48 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:32:58 tank-05 ccsd[2440]: Cluster is not quorate. Refusing connection.
Jan 6 10:32:58 tank-05 ccsd[2440]: Error while processing connect: Connection refused
Jan 6 10:33:06 tank-05 lock_gulmd_core[2502]: Now have Slave quorum, going full Master.
Jan 6 10:33:06 tank-05 lock_gulmd_core[2502]: New Client: idx:1 fd:6 from tank-01 ::ffff:10.15.84.91
Jan 6 10:33:06 tank-05 clurgmgrd[3019]: <notice> Quorum Achieved
Jan 6 10:33:06 tank-05 clurgmgrd[3019]: <info> Magma Event: Membership Change
Jan 6 10:33:06 tank-05 clurgmgrd[3019]: <info> State change: Local UP
Jan 6 10:33:06 tank-05 clurgmgrd[3019]: <info> Loading Service Data
Jan 6 10:33:06 tank-05 ccsd[2440]: Cluster is quorate. Allowing connections.
Jan 6 10:33:07 tank-05 clurgmgrd[3019]: <info> Initializing Services
Jan 6 10:33:07 tank-05 clurgmgrd[3019]: <info> Services Initialized
Jan 6 10:33:07 tank-05 clurgmgrd[3019]: <info> State change: tank-04 UP
Jan 6 10:33:07 tank-05 clurgmgrd[3019]: <info> State change: tank-02 UP
Jan 6 10:33:07 tank-05 clurgmgrd[3019]: <info> State change: tank-01 UP
Jan 6 10:33:07 tank-05 lock_gulmd_LT000[2506]: Attached slave tank-01 ::ffff:10.15.84.91 idx:3 fd:8 (soff:2 connected:0x4)
Jan 6 10:33:07 tank-05 lock_gulmd_LT000[2506]: New Client: idx 4 fd 9 from tank-01 ::ffff:10.15.84.91
Jan 6 10:33:07 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Trying to acquire journal lock...
Jan 6 10:33:07 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Trying to acquire journal lock...
Jan 6 10:33:07 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=0: Trying to acquire journal lock...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Looking at journal...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Looking at journal...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=0: Busy
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=0: Trying to acquire journal lock...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=0: Busy
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=2: Trying to acquire journal lock...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=2: Busy
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Acquiring the transaction lock...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Acquiring the transaction lock...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=2: Trying to acquire journal lock...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs2.4: jid=2: Busy
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Replaying journal...
Jan 6 10:33:08 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Replaying journal...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Replayed 435 of 503 blocks
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: replays = 435, skips = 15, sames = 53
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Journal replayed in 1s
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Done
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Trying to acquire journal lock...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=0: Busy
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=2: Trying to acquire journal lock...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=2: Busy
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=2: Trying to acquire journal lock...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: jid=2: Busy
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Replayed 579 of 668 blocks
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: replays = 579, skips = 25, sames = 64
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Journal replayed in 2s
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Done
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Trying to acquire journal lock...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=0: Busy
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=2: Trying to acquire journal lock...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=2: Looking at journal...
Jan 6 10:33:09 tank-05 kernel: GFS: fsid=tank-cluster:gfs1.4: jid=2: Done
Jan 6 10:33:56 tank-05 lock_gulmd_core[2502]: New Client: idx:9 fd:14 from tank-03 ::ffff:10.15.84.93

- Backtrace of mount /mnt/gfs2 on tank-01

PID: 3968   TASK: f5ba0630  CPU: 3   COMMAND: "mount"
 #0 [f586cb94] schedule at c02d0655
 #1 [f586cbf8] wait_for_completion at c02d077a
 #2 [f586cc4c] jid_get_lock_state_inr at f8ca0a8e
 #3 [f586cd30] gulm_mount at f8ca029a
 #4 [f586cd8c] gfs_lm_mount at f8c0e2f9
 #5 [f586cdbc] fill_super at f8c1a676
 #6 [f586cea0] gfs_get_sb at f8c1b246
 #7 [f586ced8] do_kern_mount at c01604c8
 #8 [f586cefc] do_new_mount at c017334e
 #9 [f586cf20] do_mount at c017399b
#10 [f586cf9c] sys_mount at c0173cf2
#11 [f586cfc0] system_call at c02d271c
    EAX: 00000015  EBX: 093b6aa8  ECX: 093b6ac8  EDX: 093b6ad8
    DS:  007b      ESI: c0ed0000  ES:  007b      EDI: 093b6b08
    SS:  007b      ESP: bff194fc  EBP: bff19528
    CS:  0073      EIP: 004527a2  ERR: 00000015  EFLAGS: 00000286

- Backtraces on tank-02

crash> bt 3801
PID: 3801   TASK: f62df130  CPU: 1   COMMAND: "gfs_recoverd"
 #0 [f6b3dcf8] schedule at c02d0655
 #1 [f6b3dd5c] wait_for_completion at c02d077a
 #2 [f6b3ddb0] jid_get_lock_state_inr at f8c64a8e
 #3 [f6b3dde4] check_for_stale_expires at f8c65193
 #4 [f6b3de80] gulm_recovery_done at f8c64714
 #5 [f6b3dea8] gfs_recover_journal at f8ce48a1
 #6 [f6b3dfc8] gfs_check_journals at f8ce49a6
 #7 [f6b3dfe0] gfs_recoverd at f8cbbedf
 #8 [f6b3dff0] kernel_thread_helper at c01041f3

crash> bt 3797
PID: 3797   TASK: f623b830  CPU: 1   COMMAND: "gulm_Cb_Handler"
 #0 [f6ab0e40] schedule at c02d0655
 #1 [f6ab0ea4] __sched_text_start at c02cfc40
 #2 [f6ab0ed8] __down_failed at c02cfdb7
 #3 [f6ab0ee8] .text.lock.gulm_jid (via check_for_stale_expires) at f8c653d0
 #4 [f6ab0f74] handler at f8c67e34
 #5 [f6ab0ff0] kernel_thread_helper at c01041f3

- Backtraces on tank-03

PID: 3937   TASK: ee1de730  CPU: 0   COMMAND: "mount"
 #0 [f5fd4b94] schedule at c02d0655
 #1 [f5fd4bf8] wait_for_completion at c02d077a
 #2 [f5fd4c4c] jid_get_lock_state_inr at f8c9da8e
 #3 [f5fd4d30] gulm_mount at f8c9d29a
 #4 [f5fd4d8c] gfs_lm_mount at f8bf82f9
 #5 [f5fd4dbc] fill_super at f8c04676
 #6 [f5fd4ea0] gfs_get_sb at f8c05246
 #7 [f5fd4ed8] do_kern_mount at c01604c8
 #8 [f5fd4efc] do_new_mount at c017334e
 #9 [f5fd4f20] do_mount at c017399b
#10 [f5fd4f9c] sys_mount at c0173cf2
#11 [f5fd4fc0] system_call at c02d271c
    EAX: ffffffda  EBX: 09b07bc8  ECX: 09b07be8  EDX: 09b07bf8
    DS:  007b      ESI: c0ed0000  ES:  007b      EDI: 09b07c28
    SS:  007b      ESP: bfee8a9c  EBP: bfee8ac8
    CS:  0073      EIP: 0072e7a2  ERR: 00000015  EFLAGS: 00000286

- Backtraces on tank-04

crash> bt 3559
PID: 3559   TASK: f651b1b0  CPU: 1   COMMAND: "gfs_recoverd"
 #0 [f6ceacf8] schedule at c02d0655
 #1 [f6cead5c] wait_for_completion at c02d077a
 #2 [f6ceadb0] jid_get_lock_state_inr at f8c9da8e
 #3 [f6ceade4] check_for_stale_expires at f8c9e193
 #4 [f6ceae80] gulm_recovery_done at f8c9d714
 #5 [f6ceaea8] gfs_recover_journal at f8c0b8a1
 #6 [f6ceafc8] gfs_check_journals at f8c0b9a6
 #7 [f6ceafe0] gfs_recoverd at f8be2edf
 #8 [f6ceaff0] kernel_thread_helper at c01041f3

crash> bt 3555
PID: 3555   TASK: f76845b0  CPU: 0   COMMAND: "gulm_Cb_Handler"
 #0 [f6d54e40] schedule at c02d0655
 #1 [f6d54ea4] __sched_text_start at c02cfc40
 #2 [f6d54ed8] __down_failed at c02cfdb7
 #3 [f6d54ee8] .text.lock.gulm_jid (via check_for_stale_expires) at f8c9e3d0
 #4 [f6d54f74] handler at f8ca0e34
 #5 [f6d54ff0] kernel_thread_helper at c01041f3

- The last message on tank-01 and tank-03 after the mount attempt is:
GFS: Trying to join cluster "lock_gulm", "tank-cluster:gfs2"
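Reading the gfs_recoverd and gulm_Cb_Handler traces together, this looks like a
lock-versus-completion deadlock: gfs_recoverd appears to hold the gulm_jid
semaphore (taken via check_for_stale_expires) while it sleeps in
wait_for_completion(), and the callback handler that would deliver that
completion is stuck in __down_failed trying to take the same semaphore. The
sketch below only illustrates that pattern; the names jid_sem and jid_done are
hypothetical stand-ins, not the actual lock_gulm symbols.

/* Illustrative sketch (not the lock_gulm source) of the apparent deadlock:
 * one thread holds a semaphore while waiting for a completion that can only
 * be signalled by a thread that first needs the same semaphore. */
#include <linux/completion.h>
#include <asm/semaphore.h>

static DECLARE_MUTEX(jid_sem);          /* hypothetical: the gulm_jid lock  */
static DECLARE_COMPLETION(jid_done);    /* hypothetical: the awaited reply  */

/* gfs_recoverd / mount path (frames #1-#3 above). */
static void recovery_path(void)
{
        down(&jid_sem);                 /* taken in check_for_stale_expires */
        wait_for_completion(&jid_done); /* jid_get_lock_state_inr sleeps... */
        up(&jid_sem);
}

/* gulm_Cb_Handler path (frames #2-#4 above). */
static void callback_path(void)
{
        down(&jid_sem);                 /* __down_failed: blocks forever    */
        complete(&jid_done);            /* ...so this is never reached      */
        up(&jid_sem);
}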
Waiting to see if this bug reappears with the latest gulm (1.0.5).
This appears to be a problem with name resolution. When a master and a slave are killed, the remaining slave attempts to mark tank-01.lab.msp.redhat.com as expired instead of tank-01.
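To illustrate why that matters, here is a hedged userspace sketch (the helper
same_node is made up; this is not gulm's actual comparison code): if the expired
node is recorded under its fully qualified name while the node list holds the
short name, a plain string compare never matches, so the expire/recovery never
finds the right entry. Normalizing both sides to the part before the first dot
is one way the names would line up.

/* Hypothetical illustration of the FQDN vs. short-name mismatch.
 * Not gulm code: shows why "tank-01.lab.msp.redhat.com" never
 * matches a node registered as "tank-01", and one way to normalize. */
#include <stdio.h>
#include <string.h>

/* Compare host names only up to the first dot, so a short name and its
 * fully qualified form are treated as the same node. */
static int same_node(const char *a, const char *b)
{
        size_t la = strcspn(a, ".");
        size_t lb = strcspn(b, ".");

        return la == lb && strncmp(a, b, la) == 0;
}

int main(void)
{
        const char *registered = "tank-01";                    /* name in the node list      */
        const char *resolved   = "tank-01.lab.msp.redhat.com"; /* name the slave resolved to */

        printf("strcmp match:    %s\n", strcmp(registered, resolved) == 0 ? "yes" : "no");
        printf("same_node match: %s\n", same_node(registered, resolved) ? "yes" : "no");
        return 0;
}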
Closing this bug because the initial problems have been fixed. (Further issues are addressed in bz #183383).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0238.html