Bug 444884
| Field | Value |
|---|---|
| Summary | ccsd fails to start, so cluster waits for quorum in FAIL_ALL_STOPPED |
| Product | Red Hat Enterprise Linux 5 |
| Component | cman |
| Version | 5.2 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | low |
| Priority | low |
| Target Milestone | rc |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | David Teigland <teigland> |
| QA Contact | GFS Bugs <gfs-bugs> |
| CC | ccaulfie, cluster-maint, edamato |
| Doc Type | Bug Fix |
| Last Closed | 2008-05-02 17:56:48 UTC |

Description
Corey Marthaler
2008-05-01 15:07:44 UTC
I'll add the full dlm dumps:

[root@taft-03 ~]# dlm_tool lockdump TAFT1
id 01e60001 gr PR rq NL pid 8195 master 1 " 5 19"
id 03110001 gr PR rq NL pid 8195 master 1 " 2 16"
id 007a0001 gr PR rq NL pid 8195 master 1 " 1 1"
id 03890001 gr EX rq NL pid 8195 master 0 " 4 3b5a600"
id 03220001 gr PR rq NL pid 8195 master 1 " 5 1a"
id 00dd0001 gr PR rq NL pid 8195 master 1 " 2 1a"
id 01a40001 gr PR rq NL pid 8195 master 1 " 5 17"
id 02c70001 gr PR rq NL pid 8242 master 1 " 5 1dab3f3"
id 02970004 gr PR rq NL pid 8242 master 1 " 1 2"
id 02210003 gr IV rq PR pid 8199 master 1 " 2 1dab3f3"
id 01dc0001 gr PR rq NL pid 8195 master 1 " 5 16"
id 039c0001 gr PR rq NL pid 8195 master 1 " 5 18"
id 03d80001 gr PR rq NL pid 8195 master 1 " 2 17"

[root@taft-03 ~]# dlm_tool lockdump TAFT0
id 02120001 gr PR rq NL pid 8158 master 1 " 5 19"
id 00a70001 gr PR rq NL pid 8231 master 1 " 5 48"
id 032f0001 gr PR rq NL pid 8158 master 1 " 2 16"
id 03990001 gr PR rq NL pid 8158 master 1 " 1 1"
id 0274000a gr IV rq PR pid 8164 master 1 " 2 48"
id 03030001 gr EX rq NL pid 8158 master 0 " 4 3b5a600"
id 035b0001 gr PR rq NL pid 8158 master 1 " 5 1a"
id 029a0001 gr PR rq NL pid 8158 master 1 " 2 1a"
id 02830001 gr PR rq NL pid 8158 master 1 " 5 17"
id 03ad0005 gr PR rq NL pid 8231 master 1 " 1 2"
id 016d0001 gr PR rq NL pid 8158 master 1 " 5 16"
id 03b50001 gr PR rq NL pid 8158 master 1 " 5 18"
id 027f0001 gr PR rq NL pid 8158 master 1 " 2 17"

[root@taft-03 ~]# dlm_tool lockdebug TAFT0

Resource ffff81021fa852c0 Name (len=24) " 5 19"
Local Copy, Master is node 1
Granted Queue
02120001 PR Master: 024b0007
Conversion Queue
Waiting Queue

Resource ffff8102108156c0 Name (len=24) " 5 48"
Local Copy, Master is node 1
Granted Queue
00a70001 PR Master: 02810007
Conversion Queue
Waiting Queue

Resource ffff810214b97980 Name (len=24) " 2 16"
Local Copy, Master is node 1
Granted Queue
032f0001 PR Master: 01b80003
Conversion Queue
Waiting Queue

Resource ffff81021427c980 Name (len=24) " 1 1"
Local Copy, Master is node 1
Granted Queue
03990001 PR Master: 004d0003
Conversion Queue
Waiting Queue

Resource ffff810211417980 Name (len=24) " 2 48"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
Waiting Queue
0274000a -- (PR) Master: 03bd0011

Resource ffff81021fa854c0 Name (len=24) " 4 3b5a600"
Master Copy
Granted Queue
03030001 EX
Conversion Queue
Waiting Queue

Resource ffff81021aafe1c0 Name (len=24) " 5 1a"
Local Copy, Master is node 1
Granted Queue
035b0001 PR Master: 036a0004
Conversion Queue
Waiting Queue

Resource ffff8102167510c0 Name (len=24) " 2 1a"
Local Copy, Master is node 1
Granted Queue
029a0001 PR Master: 02d30006
Conversion Queue
Waiting Queue

Resource ffff8102141ad9c0 Name (len=24) " 5 17"
Local Copy, Master is node 1
Granted Queue
02830001 PR Master: 025c0007
Conversion Queue
Waiting Queue

Resource ffff81021427c080 Name (len=24) " 1 2"
Local Copy, Master is node 1
Granted Queue
03ad0005 PR Master: 03050003
Conversion Queue
Waiting Queue

Resource ffff810214499d80 Name (len=24) " 5 16"
Local Copy, Master is node 1
Granted Queue
016d0001 PR Master: 0396000a
Conversion Queue
Waiting Queue

Resource ffff81021fa850c0 Name (len=24) " 5 18"
Local Copy, Master is node 1
Granted Queue
03b50001 PR Master: 038e0005
Conversion Queue
Waiting Queue

Resource ffff81021aafe2c0 Name (len=24) " 2 17"
Local Copy, Master is node 1
Granted Queue
027f0001 PR Master: 02160007
Conversion Queue
Waiting Queue

[root@taft-03 ~]# dlm_tool lockdebug TAFT1

Resource ffff810210815ac0 Name (len=24) " 5 19"
Local Copy, Master is node 1
Granted Queue
01e60001 PR Master: 01850001
Conversion Queue
Waiting Queue

Resource ffff810212a261c0 Name (len=24) " 2 16"
Local Copy, Master is node 1
Granted Queue
03110001 PR Master: 02070002
Conversion Queue
Waiting Queue

Resource ffff81021ecd0880 Name (len=24) " 1 1"
Local Copy, Master is node 1
Granted Queue
007a0001 PR Master: 01890003
Conversion Queue
Waiting Queue

Resource ffff810212a266c0 Name (len=24) " 4 3b5a600"
Master Copy
Granted Queue
03890001 EX
Conversion Queue
Waiting Queue

Resource ffff810211417180 Name (len=24) " 5 1a"
Local Copy, Master is node 1
Granted Queue
03220001 PR Master: 01c40004
Conversion Queue
Waiting Queue

Resource ffff810211417380 Name (len=24) " 2 1a"
Local Copy, Master is node 1
Granted Queue
00dd0001 PR Master: 01080005
Conversion Queue
Waiting Queue

Resource ffff810211417580 Name (len=24) " 5 17"
Local Copy, Master is node 1
Granted Queue
01a40001 PR Master: 01250002
Conversion Queue
Waiting Queue

Resource ffff810212a26ec0 Name (len=24) " 5 1dab3f3"
Local Copy, Master is node 1
Granted Queue
02c70001 PR Master: 03ee0001
Conversion Queue
Waiting Queue

Resource ffff8102102a7680 Name (len=24) " 1 2"
Local Copy, Master is node 1
Granted Queue
02970004 PR Master: 003d0002
Conversion Queue
Waiting Queue

Resource ffff8102133c23c0 Name (len=24) " 2 1dab3f3"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
Waiting Queue
02210003 -- (PR) Master: 028c0014

Resource ffff810212a263c0 Name (len=24) " 5 16"
Local Copy, Master is node 1
Granted Queue
01dc0001 PR Master: 03150004
Conversion Queue
Waiting Queue

Resource ffff810210815dc0 Name (len=24) " 5 18"
Local Copy, Master is node 1
Granted Queue
039c0001 PR Master: 00c70002
Conversion Queue
Waiting Queue

Resource ffff810211417780 Name (len=24) " 2 17"
Local Copy, Master is node 1
Granted Queue
03d80001 PR Master: 032d0002
Conversion Queue
Waiting Queue

This looks like it's waiting for fencing to me ... dave?

Here are the stack traces for the processes listed in comment #0.

lock_dlm2 S ffffffff80142f13 0 8164 87 8165 8161 (L-TLB)
ffff81020da11e60 0000000000000046 0000000000000000 ffff81020ce13d28
ffff81020d6402d0 0000000000000009 ffff81021bdf00c0 ffffffff802e3ae0
00000119247956c9 00000000000003cd ffff81021bdf02a8 0000000010081000
Call Trace:
[<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
[<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff885cea17>] :lock_dlm:gdlm_thread2+0x0/0x7
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003253d>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003243f>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

lock_dlm2 S ffffffff80142f13 0 8201 87 8202 8199 (L-TLB)
ffff81020d1e9e60 0000000000000046 0000000000000000 ffff81020cee9e70
0000000000000001 0000000000000009 ffff8102108887a0 ffffffff802e3ae0
0000011923f31a3f 0000000000000288 ffff810210888988 000000000cee9e60
Call Trace:
[<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
[<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff885cea17>] :lock_dlm:gdlm_thread2+0x0/0x7
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003253d>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003243f>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

lock_dlm1 S ffffffff80142f13 0 8161 87 8164 8104 (L-TLB)
ffff81020d90be60 0000000000000046 0000000000000000 ffffffff885ed647
0000000000000000 000000000000000a ffff81021139f040 ffff8101fff15100
000001192478dd8f 0000000000000c1c ffff81021139f228 0000000100000000
Call Trace:
[<ffffffff885ed647>] :gfs:gfs_iget+0x3d/0x1ce
[<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
[<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff885cea1e>] :lock_dlm:gdlm_thread1+0x0/0xa
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003253d>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003243f>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

lock_dlm1 S ffffffff80142f13 0 8199 87 8201 8181 (L-TLB)
ffff81020d1e5e60 0000000000000046 0000000000000000 ffff81020cee9d28
ffff81020d640540 0000000000000009 ffff8101ffd69040 ffff8101fff15100
0000011923f37927 0000000000000577 ffff8101ffd69228 0000000110443000
Call Trace:
[<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
[<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff885cea1e>] :lock_dlm:gdlm_thread1+0x0/0xa
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003253d>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003243f>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

Created attachment 304314
here are the rest of the traces that I was able to grab from the console
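
For anyone sifting through dumps like the ones above, a small filter can pick out the requests that have not been granted. This is only a sketch: it assumes the `dlm_tool lockdump` field layout seen in this bug (`id ... gr ... rq ... pid ... master ... "name"`) and assumes that `gr IV` marks a lock with no granted mode, which is consistent with those ids appearing under "Waiting Queue" in the lockdebug output. `parse_lockdump` and `waiting` are hypothetical helper names, not part of dlm_tool.

```python
import re

# Field layout assumed from the lockdump output pasted in this bug.
LOCK_RE = re.compile(
    r'id\s+(?P<id>[0-9a-f]+)\s+gr\s+(?P<gr>\w+)\s+rq\s+(?P<rq>\w+)\s+'
    r'pid\s+(?P<pid>\d+)\s+master\s+(?P<master>\d+)\s+"(?P<name>[^"]*)"'
)

def parse_lockdump(text):
    """Return one dict per lock entry found in lockdump text."""
    return [m.groupdict() for m in LOCK_RE.finditer(text)]

def waiting(locks):
    """Keep entries whose granted mode is IV (assumed: nothing granted yet)."""
    return [lock for lock in locks if lock["gr"] == "IV"]

# Two entries copied from the TAFT1 dump above.
sample = '''id 02c70001 gr PR rq NL pid 8242 master 1 " 5 1dab3f3"
id 02210003 gr IV rq PR pid 8199 master 1 " 2 1dab3f3"
'''

for lock in waiting(parse_lockdump(sample)):
    print(lock["id"], lock["pid"], lock["rq"], lock["name"].split())
```

Run against the full dumps in this bug, the only `gr IV` entries are 02210003 (pid 8199) in TAFT1 and 0274000a (pid 8164) in TAFT0, and those pids match the lock_dlm1 and lock_dlm2 stack traces above.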
All systems are appropriately stopped, and none have begun recovery (all systems in FAIL_ALL_STOPPED). groupd is probably waiting for the cluster to have quorum before it allows any of the systems (fence, dlm, gfs) to begin recovery. Once there's quorum, recovery will begin: first fencing will happen, next dlm recovery will happen (and the locks from the dead nodes will be cleared), third gfs recovery will happen.

Reproduced this. As Dave mentioned in comment #5, once quorum was established and fencing took place, everything was fine. So the question now is why did cman fail to start in the beginning? Why couldn't ccsd start?

May 1 13:32:32 taft-02 ccsd[7817]: Unable to bind to socket.

Possibly a dupe of bug 221528?

Yes, I think so. It's NEEDINFO, so if you can reproduce it and get the information mentioned in that BZ it would be really helpful. I'll close this and move 221528 back to 'assigned'.

*** This bug has been marked as a duplicate of 221528 ***
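
The recovery ordering described in the comments above (nothing starts without quorum; then fence, dlm, gfs in that order) can be modelled in a few lines. This is an illustrative sketch only, not groupd's code; `recovery_steps` and `RECOVERY_ORDER` are hypothetical names.

```python
# Illustrative model of the ordering described in this bug; not groupd's code.
RECOVERY_ORDER = ("fence", "dlm", "gfs")

def recovery_steps(has_quorum):
    """Return the subsystems allowed to recover, in order.

    Without quorum nothing may start, so every group stays stopped
    (the FAIL_ALL_STOPPED state seen in this bug).
    """
    if not has_quorum:
        return []
    return list(RECOVERY_ORDER)

print(recovery_steps(False))
print(recovery_steps(True))
```

The empty list in the no-quorum case corresponds to what was observed here: the locks stayed queued until quorum was established and fencing ran.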