Bug 480709
| Field | Value |
|---|---|
| Summary | cluster deadlock issues when mounting 50 gfs filesystems |
| Product | Red Hat Enterprise Linux 5 |
| Component | openais |
| Version | 5.3 |
| Status | CLOSED DUPLICATE |
| Severity | medium |
| Priority | low |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Steven Dake <sdake> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | ccaulfie, cluster-maint, edamato, fdinitto, rwheeler, sdake, swhiteho, teigland |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Last Closed | 2009-05-21 14:22:21 UTC |
Description
Corey Marthaler
2009-01-19 23:43:06 UTC
Created attachment 329410 [details]
log from taft-01
Created attachment 329411 [details]
log from taft-02
Created attachment 329412 [details]
log from taft-03
Created attachment 329413 [details]
log from taft-04
This doesn't look like a gfs issue to me; it sounds more like a cluster infrastructure problem. Comment 3 shows the likely problem:

Jan 19 16:01:01 taft-03 gfs_controld[6625]: Assertion failed on line 411 of file recover.c Assertion: "memb->opts & MEMB_OPT_RECOVER"

(It appears the groupd state in the description was collected after sysrq-t wrecked the cluster, so it's not useful.) I'll need a dump of the gfs_controld debug log from all nodes next time this happens:

group_tool dump gfs > gfs_controld.txt

It might be good to grep /var/log/messages for that error message first to verify that it's the same bug that's been reproduced.

I appear to have hit something similar to this. On all three grant nodes I tried this:

[root@grant-01 ~]# for i in $(seq 1 20); do mount /dev/B/lv$i /mnt/B$i; done
[and all are hung]

grant-01:
root 11717 10021 0 May07 pts/0 00:00:00 mount /dev/B/lv6 /mnt/B6
root 11720 11717 0 May07 pts/0 00:00:00 /sbin/mount.gfs /dev/B/lv6 /mnt/B6 -o rw

grant-02:
root 11713 10056 0 May07 pts/0 00:00:00 mount /dev/B/lv6 /mnt/B6
root 11718 11713 0 May07 pts/0 00:00:00 /sbin/mount.gfs /dev/B/lv6 /mnt/B6 -o rw

grant-03:
root 11713 10048 0 May07 pts/0 00:00:00 mount /dev/B/lv6 /mnt/B6
root 11716 11713 0 May07 pts/0 00:00:00 /sbin/mount.gfs /dev/B/lv6 /mnt/B6 -o rw

grant-02 is the only node with a stuck cman service:

[root@grant-02 ~]# cman_tool services
type   level  name     id        state
fence  0      default  00010001  none
[1 2 3]
dlm    1      clvmd    00010003  none
[1 2 3]
dlm    1      B1       00020002  none
[1 2 3]
dlm    1      B2       00040002  none
[1 2 3]
dlm    1      B3       00060002  none
[1 2 3]
dlm    1      B4       00080002  none
[1 2 3]
dlm    1      B5       000a0002  none
[1 2 3]
dlm    1      B6       00000000  JOIN_STOP_WAIT
[-1341969328 2]
gfs    2      B1       00010002  none
[1 2 3]
gfs    2      B2       00030002  none
[1 2 3]
gfs    2      B3       00050002  none
[1 2 3]
gfs    2      B4       00070002  none
[1 2 3]
gfs    2      B5       00090002  none
[1 2 3]
gfs    2      B6       000b0002  none
[1 2 3]

I'll attach a group_tool dump of gfs from each node.

Created attachment 343153 [details]
gfs_tool dump from grant-01
Created attachment 343154 [details]
gfs_tool dump from grant-02
Created attachment 343156 [details]
gfs_tool dump from grant-03
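For anyone hitting this again, a minimal sketch of the collection steps requested above. The group_tool invocation and the grep target come from the comments in this report; the node list, the ssh access, and the output file names are assumptions for illustration only.

```bash
#!/bin/bash
# Sketch only: collect the gfs_controld debug log from every node and check
# /var/log/messages for the assertion first, as requested in the comments.
# The node names, ssh access, and output file names are placeholders.
nodes="taft-01 taft-02 taft-03 taft-04"

for node in $nodes; do
    # Confirm it is the same assertion before treating it as this bug.
    ssh "$node" "grep 'Assertion failed on line 411 of file recover.c' /var/log/messages"
    # Dump the gfs_controld debug buffer (the command named in the comment above).
    ssh "$node" "group_tool dump gfs" > "gfs_controld.$node.txt"
done
```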
Here's the only interesting thing in the syslog (on grant-02):

May 7 16:17:07 grant-02 kernel: Trying to join cluster "lock_dlm", "GRANT:B5"
May 7 16:17:07 grant-02 kernel: Joined cluster. Now mounting FS...
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=0: Trying to acquire journal lock...
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=0: Looking at journal...
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=0: Done
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=1: Trying to acquire journal lock...
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=1: Looking at journal...
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=1: Done
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=2: Trying to acquire journal lock...
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=2: Looking at journal...
May 7 16:17:07 grant-02 openais[10160]: [TOTEM] Retransmit List: 313
May 7 16:17:07 grant-02 kernel: GFS: fsid=GRANT:B5.0: jid=2: Done
May 7 16:17:07 grant-02 kernel: Trying to join cluster "lock_dlm", "GRANT:B6"
May 7 16:17:09 grant-02 openais[10160]: [TOTEM] Retransmit List: 395
May 7 16:17:09 grant-02 openais[10160]: [TOTEM] Retransmit List: 3aa
May 7 16:20:20 grant-02 kernel: dlm: B6: group join failed -512 0
May 7 16:20:20 grant-02 kernel: lock_dlm: dlm_new_lockspace error -512
May 7 16:20:20 grant-02 kernel: can't mount proto=lock_dlm, table=GRANT:B6, hostdata=jid=0:id=720898:first=1
May 7 16:20:20 grant-02 kernel: Trying to join cluster "lock_dlm", "GRANT:B6"
May 7 16:20:20 grant-02 dlm_controld[10186]: process_uevent online@ error -17 errno

I snuck onto grant-02 and also grabbed the debug log from groupd, which shows this:

1241731027 1:B6 got join
1241731027 1:B6 is cpg client 19 name 1_B6 handle 79a1deaa0000000e
1241731027 1:B6 cpg_join ok
1241731027 1:B6 waiting for first cpg event
1241731027 1:B6 confchg left 0 joined 1 total 2
1241731027 1:B6 process_node_join 2
1241731027 1:B6 cpg add node 2 total 1
1241731027 1:B6 cpg add node -1341969328 total 2
1241731027 1:B6 make_event_id 200020001 nodeid 2 memb_count 2 type 1

This shows the confchg giving us two members, the first with nodeid 2 and the second with nodeid -1341969328. It's not immediately clear whether the confchg should have been for just one member, or whether there's really two members and the nodeid is bad.

Created attachment 343166 [details]
group_tool dump from grant-02
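On the bad member id noted above: one possible reading, sketched below, is that a 32-bit nodeid with its high bit set is being printed through a signed format, so a corrupted or uninitialized value shows up as a large negative number. This is only an illustration of the bit pattern under the assumption that the field is a 32-bit integer, not an analysis of where the value came from.

```bash
# Sketch only: the member id logged as -1341969328 and the unsigned 32-bit
# value 2952997968 (0xb0032c50) are the same bit pattern; a bogus nodeid with
# the high bit set prints as a large negative number through a signed format.
nodeid=-1341969328
printf 'signed   : %d\n'     "$nodeid"
printf 'unsigned : %u\n'     "$(( nodeid & 0xffffffff ))"
printf 'hex      : 0x%08x\n' "$(( nodeid & 0xffffffff ))"
```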
Chrissie, I wonder if this is related to the other recent openais regressions.

*** This bug has been marked as a duplicate of bug 499734 ***

Changed component to openais.