Bug 373711
| Summary: | recovery "stuck" on 3 remaining nodes after fourth node is shot. | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Dean Jansa <djansa> |
| Component: | cman-kernel | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED DUPLICATE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | cluster-maint |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | ia64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-11-14 14:18:05 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Attachments:
Created attachment 253351 [details]
link-13 stack
Created attachment 253361 [details]
link-14 stack
Created attachment 253371 [details]
link-15 stack
Created attachment 253381 [details]
link-16 stack
I strongly suspect this is the same as https://bugzilla.redhat.com/show_bug.cgi?id=299061#c39
Description of problem:
Four node cluster: link-{13,14,15,16}.
link-13 is shot during recovery testing; the remaining 3 nodes never complete recovery.
Dave dug about and found link-{14,15,16} are all waiting for a barrier to complete, which isn't happening for some unknown reason.

Version-Release number of selected component (if applicable):
2.6.9-55.0.12.ELlargesmp
cman-kernel-largesmp-2.6.9-50.2.0.6
dlm-kernel-largesmp-2.6.9-46.16.0.12

How reproducible:
Haven't tried at this point.

Steps to Reproduce:
1. Run revolver with single gfs fs on a 4 node cluster.
2.
3.

----------------------------
[root@link-13 ~]# cat /proc/cluster/services
Service           Name            GID LID State      Code
Fence Domain:     "default"         2   2 join       S-4,4,1   [2 3 4 1]
DLM Lock Space:   "clvmd"           4   3 join       S-4,4,1   [2 3 4 1]
User:             "usrm::manager"  11   4 run        -         [1]

---------
[root@link-14 ~]# cat /proc/cluster/services
Service           Name            GID LID State      Code
Fence Domain:     "default"         2   2 run        U-1,10,1  [4 2 3]
DLM Lock Space:   "clvmd"           4   4 run        U-1,10,1  [4 2 3]
DLM Lock Space:   "link_ia640"      5   5 run        -         [4 2 3]
DLM Lock Space:   "link_ia641"      7   7 run        -         [4 2 3]
DLM Lock Space:   "link_ia642"      9   9 recover 4  -         [4 2 3]
GFS Mount Group:  "link_ia640"      6   6 recover 0  -         [4 2 3]
GFS Mount Group:  "link_ia641"      8   8 recover 0  -         [4 2 3]
GFS Mount Group:  "link_ia642"     10  10 recover 0  -         [4 2 3]

--------------------
[root@link-15 ~]# cat /proc/cluster/services
Service           Name            GID LID State      Code
Fence Domain:     "default"         2   2 run        U-1,10,1  [2 4 3]
DLM Lock Space:   "clvmd"           4   4 run        U-1,10,1  [2 4 3]
DLM Lock Space:   "link_ia640"      5   5 run        -         [2 4 3]
DLM Lock Space:   "link_ia641"      7   7 run        -         [2 4 3]
DLM Lock Space:   "link_ia642"      9   9 recover 4  -         [2 4 3]
GFS Mount Group:  "link_ia640"      6   6 recover 0  -         [2 4 3]
GFS Mount Group:  "link_ia641"      8   8 recover 0  -         [2 4 3]
GFS Mount Group:  "link_ia642"     10  10 recover 0  -         [2 4 3]

--------------------
[root@link-16 ~]# cat /proc/cluster/services
Service           Name            GID LID State      Code
Fence Domain:     "default"         2   2 run        U-1,10,1  [2 3 4]
DLM Lock Space:   "clvmd"           4   4 run        U-1,10,1  [2 3 4]
DLM Lock Space:   "link_ia640"      5   5 run        -         [2 3 4]
DLM Lock Space:   "link_ia641"      7   7 run        -         [2 3 4]
DLM Lock Space:   "link_ia642"      9   9 recover 4  -         [2 3 4]
GFS Mount Group:  "link_ia640"      6   6 recover 0  -         [2 3 4]
GFS Mount Group:  "link_ia641"      8   8 recover 0  -         [2 3 4]
GFS Mount Group:  "link_ia642"     10  10 recover 0  -         [2 3 4]
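
For comparing the service listings across nodes, a small script can pick out the services that are not in the "run" state. The following is only a rough sketch: it assumes each service sits on a single line in the form shown in this report (type, quoted name, GID, LID, state, code, bracketed member list), and the names parse_services and stuck_services are hypothetical helpers, not part of cman or any cluster tool.

```python
import re

# Assumed one-line layout, as pasted in this report:
#   <type>: "<name>" <GID> <LID> <state> <code> [<member node ids>]
# The state may carry a trailing number (e.g. "recover 4").
SERVICE_RE = re.compile(
    r'^\s*(?P<type>[^:"]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+'
    r'(?P<state>[a-z]+(?:\s+\d+)?)\s+(?P<code>\S+)\s+'
    r'\[(?P<members>[\d\s]*)\]\s*$'
)


def parse_services(text):
    """Return one dict per service line; lines that do not match are skipped."""
    services = []
    for line in text.splitlines():
        m = SERVICE_RE.match(line)
        if m is None:
            continue  # header, separator, shell prompt, blank line
        services.append({
            "type": m.group("type").strip(),
            "name": m.group("name"),
            "gid": int(m.group("gid")),
            "lid": int(m.group("lid")),
            # Collapse column padding so "recover   4" becomes "recover 4".
            "state": " ".join(m.group("state").split()),
            "code": m.group("code"),
            "members": [int(n) for n in m.group("members").split()],
        })
    return services


def stuck_services(text):
    """Return (type, name, state) for every service whose state is not 'run'."""
    return [
        (s["type"], s["name"], s["state"])
        for s in parse_services(text)
        if s["state"] != "run"
    ]


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python check_services.py < services.txt
    for svc_type, name, state in stuck_services(sys.stdin.read()):
        print(f"{svc_type} {name}: {state}")
```

Fed the link-14 listing above, this kind of filter would single out the "link_ia642" lock space and the three GFS mount groups, which are the entries sitting in a recover state on all three surviving nodes.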