Bug 444884

Summary: ccsd fails to start, so cluster waits for quorum in FAIL_ALL_STOPPED
Product: Red Hat Enterprise Linux 5
Reporter: Corey Marthaler <cmarthal>
Component: cman
Assignee: David Teigland <teigland>
Status: CLOSED DUPLICATE
QA Contact: GFS Bugs <gfs-bugs>
Severity: low
Priority: low
Version: 5.2
CC: ccaulfie, cluster-maint, edamato
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-05-02 17:56:48 UTC
Attachments:
Description: here are the rest of the traces that I was able to grab from the console
Flags: none

Description Corey Marthaler 2008-05-01 15:07:44 UTC
Description of problem:
I was running revolver and in this case, it shot 3 of the 4 taft nodes. The only
remaining node was taft-03 (nodeid 3) and it still thinks that taft-01 (nodeid
1) is the master for the locks it's looking for.

# Revolver output:
[...]
Mounting configfs on all nodes
Mounting configfs on taft-01...pass
Mounting configfs on taft-04...pass
Mounting configfs on taft-02...pass
Starting ccsd on cluster
Starting ccsd on taft-01...pass
Starting ccsd on taft-04...pass
Starting ccsd on taft-02...pass
nodes joining cluster...
cman joining on taft-01
cman joining on taft-04
fail, cman_tool: ccsd is not running


[root@taft-03 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   1272   2008-05-01 03:09:58  taft-01
   2   X   1264                        taft-02
   3   M   1236   2008-05-01 02:49:48  taft-03
   4   X   1256                        taft-04

[root@taft-03 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     clvmd    00020001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     TAFT0    00040001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     TAFT1    00060001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
gfs              2     TAFT0    00030001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
gfs              2     TAFT1    00050001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
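
The `group_tool -v` listing above can be checked mechanically for groups that have not left FAIL_ALL_STOPPED. The following is a minimal sketch, not part of the original report: it assumes the column layout shown above (type, level, name, id, state, then a bracketed member list on the next line), and `parse_groups` is a hypothetical helper name.

```python
# Abbreviated copy of the group_tool -v output from this report.
GROUP_TOOL_OUTPUT = '''\
type             level name     id       state node id local_done
fence            0     default  00010001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     clvmd    00020001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
gfs              2     TAFT0    00030001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
'''

def parse_groups(text):
    """Parse group rows and their member lists (layout assumed from this report)."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the header row
        if line.startswith("["):
            # Member list belongs to the group row just before it.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        fields = line.split()
        groups.append({
            "type": fields[0],
            "level": int(fields[1]),
            "name": fields[2],
            "state": fields[4],
            "members": [],
        })
    return groups

groups = parse_groups(GROUP_TOOL_OUTPUT)
stuck = [g for g in groups if g["state"] == "FAIL_ALL_STOPPED"]
for g in stuck:
    print(g["type"], g["name"], "level", g["level"], "members", g["members"])
```

Here every group is reported as stuck, which matches the listing: no group at any level has started recovery.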


# Here's the one lock with a Waiting Queue in group TAFT0:
Resource ffff810211417980 Name (len=24) "       2              48"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
Waiting Queue
0274000a -- (PR) Master:     03bd0011

id 0274000a gr IV rq PR pid 8164 master 1 "       2              48"

[root@taft-03 ~]# ps -elf | grep 8164
1 S root      8164    87  0  70  -5 -     0 gdlm_t 02:50 ?        00:00:00 [lock_dlm2]


# Here's the one lock with a Waiting Queue in group TAFT1:
Resource ffff8102133c23c0 Name (len=24) "       2         1dab3f3"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
Waiting Queue
02210003 -- (PR) Master:     028c0014

id 02210003 gr IV rq PR pid 8199 master 1 "       2         1dab3f3"

[root@taft-03 ~]# ps -elf | grep 8199
1 S root      8199    87  0  70  -5 -     0 gdlm_t 02:50 ?        00:00:00 [lock_dlm1]

Version-Release number of selected component (if applicable):
2.6.18-91.el5
cman-2.0.84-2.el5

Comment 1 Corey Marthaler 2008-05-01 15:10:18 UTC
I'll add the full dlm dumps:

[root@taft-03 ~]# dlm_tool lockdump TAFT1
id 01e60001 gr PR rq NL pid 8195 master 1 "       5              19"
id 03110001 gr PR rq NL pid 8195 master 1 "       2              16"
id 007a0001 gr PR rq NL pid 8195 master 1 "       1               1"
id 03890001 gr EX rq NL pid 8195 master 0 "       4         3b5a600"
id 03220001 gr PR rq NL pid 8195 master 1 "       5              1a"
id 00dd0001 gr PR rq NL pid 8195 master 1 "       2              1a"
id 01a40001 gr PR rq NL pid 8195 master 1 "       5              17"
id 02c70001 gr PR rq NL pid 8242 master 1 "       5         1dab3f3"
id 02970004 gr PR rq NL pid 8242 master 1 "       1               2"
id 02210003 gr IV rq PR pid 8199 master 1 "       2         1dab3f3"
id 01dc0001 gr PR rq NL pid 8195 master 1 "       5              16"
id 039c0001 gr PR rq NL pid 8195 master 1 "       5              18"
id 03d80001 gr PR rq NL pid 8195 master 1 "       2              17"


[root@taft-03 ~]# dlm_tool lockdump TAFT0
id 02120001 gr PR rq NL pid 8158 master 1 "       5              19"
id 00a70001 gr PR rq NL pid 8231 master 1 "       5              48"
id 032f0001 gr PR rq NL pid 8158 master 1 "       2              16"
id 03990001 gr PR rq NL pid 8158 master 1 "       1               1"
id 0274000a gr IV rq PR pid 8164 master 1 "       2              48"
id 03030001 gr EX rq NL pid 8158 master 0 "       4         3b5a600"
id 035b0001 gr PR rq NL pid 8158 master 1 "       5              1a"
id 029a0001 gr PR rq NL pid 8158 master 1 "       2              1a"
id 02830001 gr PR rq NL pid 8158 master 1 "       5              17"
id 03ad0005 gr PR rq NL pid 8231 master 1 "       1               2"
id 016d0001 gr PR rq NL pid 8158 master 1 "       5              16"
id 03b50001 gr PR rq NL pid 8158 master 1 "       5              18"
id 027f0001 gr PR rq NL pid 8158 master 1 "       2              17"
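
The waiting requests can be picked out of dumps like the two above without reading every line. This is a rough sketch under the assumption that the lockdump line format is exactly what appears in this report, and that `gr IV` with a non-NL `rq` means a request that has not been granted; `waiting_locks` is a hypothetical helper, not a dlm_tool feature.

```python
import re

# One line of `dlm_tool lockdump` output, as it appears in this report.
LOCK_RE = re.compile(
    r'^id (?P<id>[0-9a-f]+) gr (?P<gr>\S+) rq (?P<rq>\S+) '
    r'pid (?P<pid>\d+) master (?P<master>\d+) "(?P<name>[^"]*)"$'
)

def waiting_locks(dump_text):
    """Return entries with no granted mode (gr IV) and a pending request."""
    waiting = []
    for line in dump_text.splitlines():
        m = LOCK_RE.match(line.strip())
        if m and m.group("gr") == "IV" and m.group("rq") != "NL":
            waiting.append(m.groupdict())
    return waiting

# Three lines copied from the TAFT0 dump above.
SAMPLE = '''\
id 00a70001 gr PR rq NL pid 8231 master 1 "       5              48"
id 0274000a gr IV rq PR pid 8164 master 1 "       2              48"
id 03030001 gr EX rq NL pid 8158 master 0 "       4         3b5a600"
'''

for lock in waiting_locks(SAMPLE):
    print(lock["id"], "pid", lock["pid"], "requests", lock["rq"],
          "from master node", lock["master"])
```

On the sample this selects only lock 0274000a, held by pid 8164, which is the same entry shown with a Waiting Queue in the description.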


[root@taft-03 ~]# dlm_tool lockdebug TAFT0

Resource ffff81021fa852c0 Name (len=24) "       5              19"
Local Copy, Master is node 1
Granted Queue
02120001 PR Master:     024b0007
Conversion Queue
Waiting Queue

Resource ffff8102108156c0 Name (len=24) "       5              48"
Local Copy, Master is node 1
Granted Queue
00a70001 PR Master:     02810007
Conversion Queue
Waiting Queue

Resource ffff810214b97980 Name (len=24) "       2              16"
Local Copy, Master is node 1
Granted Queue
032f0001 PR Master:     01b80003
Conversion Queue
Waiting Queue

Resource ffff81021427c980 Name (len=24) "       1               1"
Local Copy, Master is node 1
Granted Queue
03990001 PR Master:     004d0003
Conversion Queue
Waiting Queue

Resource ffff810211417980 Name (len=24) "       2              48"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
Waiting Queue
0274000a -- (PR) Master:     03bd0011

Resource ffff81021fa854c0 Name (len=24) "       4         3b5a600"
Master Copy
Granted Queue
03030001 EX
Conversion Queue
Waiting Queue

Resource ffff81021aafe1c0 Name (len=24) "       5              1a"
Local Copy, Master is node 1
Granted Queue
035b0001 PR Master:     036a0004
Conversion Queue
Waiting Queue

Resource ffff8102167510c0 Name (len=24) "       2              1a"
Local Copy, Master is node 1
Granted Queue
029a0001 PR Master:     02d30006
Conversion Queue
Waiting Queue

Resource ffff8102141ad9c0 Name (len=24) "       5              17"
Local Copy, Master is node 1
Granted Queue
02830001 PR Master:     025c0007
Conversion Queue
Waiting Queue

Resource ffff81021427c080 Name (len=24) "       1               2"
Local Copy, Master is node 1
Granted Queue
03ad0005 PR Master:     03050003
Conversion Queue
Waiting Queue

Resource ffff810214499d80 Name (len=24) "       5              16"
Local Copy, Master is node 1
Granted Queue
016d0001 PR Master:     0396000a
Conversion Queue
Waiting Queue

Resource ffff81021fa850c0 Name (len=24) "       5              18"
Local Copy, Master is node 1
Granted Queue
03b50001 PR Master:     038e0005
Conversion Queue
Waiting Queue

Resource ffff81021aafe2c0 Name (len=24) "       2              17"
Local Copy, Master is node 1
Granted Queue
027f0001 PR Master:     02160007
Conversion Queue
Waiting Queue




[root@taft-03 ~]# dlm_tool lockdebug TAFT1

Resource ffff810210815ac0 Name (len=24) "       5              19"
Local Copy, Master is node 1
Granted Queue
01e60001 PR Master:     01850001
Conversion Queue
Waiting Queue

Resource ffff810212a261c0 Name (len=24) "       2              16"
Local Copy, Master is node 1
Granted Queue
03110001 PR Master:     02070002
Conversion Queue
Waiting Queue

Resource ffff81021ecd0880 Name (len=24) "       1               1"
Local Copy, Master is node 1
Granted Queue
007a0001 PR Master:     01890003
Conversion Queue
Waiting Queue

Resource ffff810212a266c0 Name (len=24) "       4         3b5a600"
Master Copy
Granted Queue
03890001 EX
Conversion Queue
Waiting Queue

Resource ffff810211417180 Name (len=24) "       5              1a"
Local Copy, Master is node 1
Granted Queue
03220001 PR Master:     01c40004
Conversion Queue
Waiting Queue

Resource ffff810211417380 Name (len=24) "       2              1a"
Local Copy, Master is node 1
Granted Queue
00dd0001 PR Master:     01080005
Conversion Queue
Waiting Queue

Resource ffff810211417580 Name (len=24) "       5              17"
Local Copy, Master is node 1
Granted Queue
01a40001 PR Master:     01250002
Conversion Queue
Waiting Queue

Resource ffff810212a26ec0 Name (len=24) "       5         1dab3f3"
Local Copy, Master is node 1
Granted Queue
02c70001 PR Master:     03ee0001
Conversion Queue
Waiting Queue

Resource ffff8102102a7680 Name (len=24) "       1               2"
Local Copy, Master is node 1
Granted Queue
02970004 PR Master:     003d0002
Conversion Queue
Waiting Queue

Resource ffff8102133c23c0 Name (len=24) "       2         1dab3f3"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
Waiting Queue
02210003 -- (PR) Master:     028c0014

Resource ffff810212a263c0 Name (len=24) "       5              16"
Local Copy, Master is node 1
Granted Queue
01dc0001 PR Master:     03150004
Conversion Queue
Waiting Queue

Resource ffff810210815dc0 Name (len=24) "       5              18"
Local Copy, Master is node 1
Granted Queue
039c0001 PR Master:     00c70002
Conversion Queue
Waiting Queue

Resource ffff810211417780 Name (len=24) "       2              17"
Local Copy, Master is node 1
Granted Queue
03d80001 PR Master:     032d0002
Conversion Queue
Waiting Queue

Comment 2 Christine Caulfield 2008-05-01 15:29:40 UTC
This looks like it's waiting for fencing to me... Dave?

Comment 3 Corey Marthaler 2008-05-01 15:33:55 UTC
Here are the stack traces for the processes listed in comment #0.

lock_dlm2     S ffffffff80142f13     0  8164     87          8165  8161 (L-TLB)
 ffff81020da11e60 0000000000000046 0000000000000000 ffff81020ce13d28
 ffff81020d6402d0 0000000000000009 ffff81021bdf00c0 ffffffff802e3ae0
 00000119247956c9 00000000000003cd ffff81021bdf02a8 0000000010081000
Call Trace:
 [<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
 [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885cea17>] :lock_dlm:gdlm_thread2+0x0/0x7
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003253d>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003243f>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

lock_dlm2     S ffffffff80142f13     0  8201     87          8202  8199 (L-TLB)
 ffff81020d1e9e60 0000000000000046 0000000000000000 ffff81020cee9e70
 0000000000000001 0000000000000009 ffff8102108887a0 ffffffff802e3ae0
 0000011923f31a3f 0000000000000288 ffff810210888988 000000000cee9e60
Call Trace:
 [<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
 [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885cea17>] :lock_dlm:gdlm_thread2+0x0/0x7
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003253d>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003243f>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11




lock_dlm1     S ffffffff80142f13     0  8161     87          8164  8104 (L-TLB)
 ffff81020d90be60 0000000000000046 0000000000000000 ffffffff885ed647
 0000000000000000 000000000000000a ffff81021139f040 ffff8101fff15100
 000001192478dd8f 0000000000000c1c ffff81021139f228 0000000100000000
Call Trace:
 [<ffffffff885ed647>] :gfs:gfs_iget+0x3d/0x1ce
 [<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
 [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885cea1e>] :lock_dlm:gdlm_thread1+0x0/0xa
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003253d>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003243f>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

lock_dlm1     S ffffffff80142f13     0  8199     87          8201  8181 (L-TLB)
 ffff81020d1e5e60 0000000000000046 0000000000000000 ffff81020cee9d28
 ffff81020d640540 0000000000000009 ffff8101ffd69040 ffff8101fff15100
 0000011923f37927 0000000000000577 ffff8101ffd69228 0000000110443000
Call Trace:
 [<ffffffff885ce519>] :lock_dlm:gdlm_thread+0x16a/0x668
 [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885cea1e>] :lock_dlm:gdlm_thread1+0x0/0xa
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003253d>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009dbca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003243f>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11



Comment 4 Corey Marthaler 2008-05-01 15:46:45 UTC
Created attachment 304314 [details]
here are the rest of the traces that I was able to grab from the console

Comment 5 David Teigland 2008-05-01 15:51:02 UTC
All systems are appropriately stopped, and none have begun recovery
(all systems in FAIL_ALL_STOPPED).  groupd is probably waiting for the
cluster to have quorum before it allows any of the systems (fence, dlm, gfs)
to begin recovery.  Once there's quorum, recovery will begin: first
fencing will happen, next dlm recovery will happen (and the locks from
the dead nodes will be cleared), third gfs recovery will happen.
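
The reasoning above can be sketched against the `cman_tool nodes` output from the description. This is an illustration only, not groupd's actual logic: it assumes one vote per node and a simple-majority quorum, and the helper names (`member_counts`, `has_quorum`, `recovery_plan`) are hypothetical.

```python
# cman_tool nodes output copied from the description.
NODES_OUTPUT = '''\
Node  Sts   Inc   Joined               Name
   1   M   1272   2008-05-01 03:09:58  taft-01
   2   X   1264                        taft-02
   3   M   1236   2008-05-01 02:49:48  taft-03
   4   X   1256                        taft-04
'''

def member_counts(text):
    """Return (members, total) where Sts 'M' marks a current member."""
    rows = [line.split() for line in text.splitlines()[1:] if line.strip()]
    members = sum(1 for row in rows if row[1] == "M")
    return members, len(rows)

def has_quorum(members, total):
    # Assumption: one vote per node, quorum is a simple majority.
    return members >= total // 2 + 1

# Order as described in this comment: fencing, then dlm, then gfs.
RECOVERY_ORDER = ["fence", "dlm", "gfs"]

def recovery_plan(text):
    """Return the recovery steps that may run, or [] while inquorate."""
    members, total = member_counts(text)
    if not has_quorum(members, total):
        return []  # groups remain in FAIL_ALL_STOPPED
    return list(RECOVERY_ORDER)

print(member_counts(NODES_OUTPUT))
print(recovery_plan(NODES_OUTPUT))
```

With two members out of four, the sketch reports no quorum and an empty plan, which is consistent with every group sitting in FAIL_ALL_STOPPED until more nodes rejoin.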


Comment 6 Corey Marthaler 2008-05-01 19:53:19 UTC
Reproduced this.

Comment 7 Corey Marthaler 2008-05-01 20:38:03 UTC
As Dave mentioned in comment #5, once quorum was established and fencing took
place, everything was fine. So the question now is why cman failed to start in
the beginning. Why couldn't ccsd start?

May  1 13:32:32 taft-02 ccsd[7817]: Unable to bind to socket.


Comment 8 Nate Straz 2008-05-01 20:59:44 UTC
Possibly a dupe of bug 221528?

Comment 9 Christine Caulfield 2008-05-02 07:23:19 UTC
Yes, I think so. It's in NEEDINFO, so if you can reproduce it and get the
information mentioned in that BZ, it would be really helpful.

Comment 10 Corey Marthaler 2008-05-02 17:56:48 UTC
I'll close this and move 221528 back to 'assigned'.

*** This bug has been marked as a duplicate of 221528 ***