Bug 239856
| Summary: | failed server election due to suspended mirror log causes clvm deadlock | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> | ||||||||||||||
| Component: | cmirror-kernel | Assignee: | Jonathan Earl Brassow <jbrassow> | ||||||||||||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||||||||||
| Severity: | high | Docs Contact: | |||||||||||||||
| Priority: | high | ||||||||||||||||
| Version: | 4 | CC: | agk, dwysocha, jbrassow, mbroz, prockai | ||||||||||||||
| Target Milestone: | --- | ||||||||||||||||
| Target Release: | --- | ||||||||||||||||
| Hardware: | All | ||||||||||||||||
| OS: | Linux | ||||||||||||||||
| Whiteboard: | |||||||||||||||||
| Fixed In Version: | RHBA-2007-0991 | Doc Type: | Bug Fix | ||||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||||
| Clone Of: | Environment: | ||||||||||||||||
| Last Closed: | 2007-11-21 21:15:19 UTC | Type: | --- | ||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||
| Embargoed: | |||||||||||||||||
| Attachments: |
|
||||||||||||||||
Appear to have reproduced this bug of the weekend. Link-02: [...] May 13 04:02:09 link-02 kernel: dm-cmirror: Failed to receive election results from server: (yDf7i4D1,-110) May 13 04:02:09 link-02 kernel: dm-cmirror: server_id=dead, server_valid=1, yDf7i4D1 May 13 04:02:09 link-02 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT May 13 04:02:09 link-02 kernel: dm-cmirror: LRT_ELECTION(10): (yDf7i4D1) May 13 04:02:09 link-02 kernel: dm-cmirror: starter : 3 May 13 04:02:09 link-02 kernel: dm-cmirror: co-ordinator: 3 May 13 04:02:09 link-02 kernel: dm-cmirror: node_count : 0 [...] Link-04: [...] May 13 04:02:36 link-04 kernel: dm-cmirror: Failed to receive election results from server: (1P3Mw1Yx,-110) May 13 04:02:36 link-04 kernel: dm-cmirror: server_id=dead, server_valid=1, 1P3Mw1Yx May 13 04:02:36 link-04 kernel: dm-cmirror: trigger = LRT_COMPLETE_RESYNC_WORK May 13 04:02:36 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (1P3Mw1Yx) May 13 04:02:36 link-04 kernel: dm-cmirror: starter : 4 May 13 04:02:36 link-04 kernel: dm-cmirror: co-ordinator: 4 May 13 04:02:36 link-04 kernel: dm-cmirror: node_count : 0 May 13 04:02:49 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (yDf7i4D1) May 13 04:02:49 link-04 kernel: dm-cmirror: starter : 3 May 13 04:02:49 link-04 kernel: dm-cmirror: co-ordinator: 3 May 13 04:02:49 link-04 kernel: dm-cmirror: node_count : 1 [...] LINK-07: [...] May 13 04:02:57 link-07 kernel: dm-cmirror: Failed to receive election results from server: (yDf7i4D1,-110) May 13 04:02:57 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, yDf7i4D1 May 13 04:02:57 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT [...] LINK-08: [...] May 13 04:02:13 link-08 kernel: dm-cmirror: Failed to receive election results from server: (yDf7i4D1,-110) May 13 04:02:13 link-08 kernel: dm-cmirror: server_id=dead, server_valid=1, yDf7i4D1 May 13 04:02:13 link-08 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT May 13 04:02:13 link-08 kernel: dm-cmirror: LRT_ELECTION(10): (yDf7i4D1) May 13 04:02:13 link-08 kernel: dm-cmirror: starter : 1 May 13 04:02:13 link-08 kernel: dm-cmirror: co-ordinator: 1 May 13 04:02:13 link-08 kernel: dm-cmirror: node_count : 0 May 13 04:02:18 link-08 kernel: dm-cmirror: LRT_ELECTION(10): (1P3Mw1Yx) May 13 04:02:18 link-08 kernel: dm-cmirror: starter : 4 May 13 04:02:18 link-08 kernel: dm-cmirror: co-ordinator: 4 May 13 04:02:18 link-08 kernel: dm-cmirror: node_count : 1 May 13 04:02:31 link-08 kernel: dm-cmirror: LRT_ELECTION(10): (yDf7i4D1) May 13 04:02:31 link-08 kernel: dm-cmirror: starter : 3 May 13 04:02:31 link-08 kernel: dm-cmirror: co-ordinator: 3 May 13 04:02:31 link-08 kernel: dm-cmirror: node_count : 2 [...] [root@link-02 log]# cman_tool nodes Node Votes Exp Sts Name 1 1 4 M link-08 2 1 4 M link-07 3 1 4 M link-02 4 1 4 M link-04 [root@link-02 log]# cman_tool services Service Name GID LID State Code Fence Domain: "default" 3 2 run - [2 1 3 4] DLM Lock Space: "clvmd" 96 74 run - [3 4 2 1] DLM Lock Space: "clustered_log" 142 117 run - [2 1 3 4] [root@link-02 log]# dmsetup table lock_stress-link--04.7689_mimage_0: 0 1024000 linear 8:33 1032576 lock_stress-link--08.7687_mimage_2: 0 1024000 linear 8:33 384 lock_stress-link--08.7687: 0 1024000 mirror clustered_disk 4 253:2 1024 LVM-vFSXsH7SJ789IvPT9Eq2sLLq2KabOXaAOdiWIhfSW3ZTVejln0c2djc2k8y0WVxJ block_on_error 3 253:3 0 253:4 0 253:5 0 lock_stress-link--02.7691: 0 1024000 mirror clustered_disk 4 253:13 1024 LVM-vFSXsH7SJ789IvPT9Eq2sLLq2KabOXaADG9t1bHbmkh8v2eXxFKnLlURyDf7i4D1 block_on_error 5 253:14 0 253:15 0 253:16 0 253:17 0 253:18 0 lock_stress-link--08.7687_mimage_1: 0 1024000 linear 8:17 384 lock_stress-link--07.7693_mimage_3: 0 1024000 linear 8:97 8576 lock_stress-link--08.7687_mimage_0: 0 1024000 linear 8:1 384 lock_stress-link--08.7687_mlog: 0 8192 linear 8:97 384 lock_stress-link--07.7693_mimage_2: 0 1024000 linear 8:81 384 lock_stress-link--04.7689_mlog: 0 8192 linear 8:81 2048384 lock_stress-link--02.7691_mimage_4: 0 1024000 linear 8:81 1024384 lock_stress-link--07.7693_mimage_1: 0 1024000 linear 8:65 384 lock_stress-link--04.7689: 0 1024000 mirror clustered_disk 4 253:20 1024 LVM-vFSXsH7SJ789IvPT9Eq2sLLq2KabOXaAHL5efyw6EHAacfcdSpz3NLqj1P3Mw1Yx block_on_error 2 253:21 0 253:22 0 lock_stress-link--02.7691_mlog: 0 8192 linear 8:97 1032576 lock_stress-link--02.7691_mimage_3: 0 1024000 linear 8:65 1024384 lock_stress-link--07.7693_mimage_0: 0 1024000 linear 8:49 384 lock_stress-link--07.7693_mlog: 0 8192 linear 8:33 1024384 lock_stress-link--02.7691_mimage_2: 0 1024000 linear 8:49 1024384 VolGroup00-LogVol01: 0 4063232 linear 3:2 73793920 lock_stress-link--02.7691_mimage_1: 0 1024000 linear 8:17 1024384 VolGroup00-LogVol00: 0 73793536 linear 3:2 384 lock_stress-link--02.7691_mimage_0: 0 1024000 linear 8:1 1024384 lock_stress-link--07.7693: 0 1024000 mirror clustered_disk 4 253:7 1024 LVM-vFSXsH7SJ789IvPT9Eq2sLLq2KabOXaAfE57gu7FYrcC89vEF3wBfVnjUFKeuzZ2 block_on_error 4 253:8 0 253:9 0 253:10 0 253:11 0 lock_stress-link--04.7689_mimage_1: 0 1024000 linear 8:97 1040768 Bumping the priority as this appears to be easily reproduced, though a customer will most likely never do this many locking operations inorder to cause this. Created attachment 154871 [details]
Debugging information on link-02
Created attachment 154872 [details]
Debugging information on link-04
Created attachment 154873 [details]
Debugging information on link-07
Created attachment 154874 [details]
Debugging information on link-08
Created attachment 154875 [details]
Relevent messages output from link-07
Shows that there are two copies of the log in play. Could be getting
conflicting 'lc->suspend' readings.
Created attachment 154876 [details]
More from messages on link-07
I can reproduce this on the rhel4.6 codebase as well.
If I simply do:
while true; do
lvchange -an vg1/`hostname -s`
lvremove vg1/`hostname -s`
lvcreate -m3 -L 1G -n `hostname -s` vg1
for i in 2 1 0; do
lvconvert -m $i vg1/`hostname -s`
done
done
on each machine in the cluster, I can reproduce in a matter of minutes.
This bug only affects mirrors >= 3-way. It requires enough load (either by changing mirror configurations or via I/O) to force a down convert (mirror -> mirror) to choose a different server from the original. The cause of the bug is an uninitialized variable. Just a note that this bug was seen today while running on the 4.5.z packages. 2.6.9-55.0.3.ELsmp cmirror-kernel-2.6.9-32.0.0.2 I am no longer able to reproduce this with the latest rhel4.6 code. The test in comment 10 used to trigger the problem very quickly. I've now been unsuccessful for days. Please try to reproduce. I'm using 2.6.9-55.16.ELsmp and the latest rhel4.6 code from CVS. I'm sorry, I'm not sure if it has been built yet. Will check. Looks like the lvm2 package was built on July 24th, 2007 I've had a patch in my tree that I was unaware I had not checked in... Probably explains why I couldn't reproduce. assigned -> post. This bug does appear to be fixed with the latest packages. I did however eventually hit bz 245799, so I didn't get to let the tests run as long as I had hoped. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0991.html |
Description of problem: I had been running 'cmirror_lock_stress' which basically just does multiple instances of cmirror operations (create, down convert, up convert, deactivate, reactivate, delete) on all the nodes in a cluster at the same time. Eventually, two mirror logs failed to activate and got stuck in the SUSPEND state and thus caused the cmirror server election failures. The devices were only in the suspend state on 3 of the nodes in the cluster (link-0[247]) but were active on link-08. Here is what was happen on all the nodes just before this took place: LINK-02: ay 10 16:18:13 link-02 qarshd[27150]: Running cmdline: lvs lock_stress/link-02.12483 --noheadings -o mirror_log | awk {'print $1'} May 10 16:18:18 link-02 lvm[6466]: Monitoring mirror device lock_stress-link--08.12485 for events May 10 16:18:27 link-02 lvm[6466]: Monitoring mirror device lock_stress-link--04.12481 for events May 10 16:18:32 link-02 qarshd[27355]: Talking to peer 10.15.80.47:49734 May 10 16:18:32 link-02 qarshd[27355]: Running cmdline: lvconvert -m 0 lock_stress/link-02.12483 May 10 16:18:33 link-02 lvm[6466]: No longer monitoring mirror device lock_stress-link--02.12483 for events May 10 16:18:40 link-02 qarshd[27396]: Talking to peer 10.15.80.47:49738 May 10 16:18:40 link-02 qarshd[27396]: Running cmdline: lvs lock_stress/link-02.12483 --noheadings -o mirror_log | awk {'print $1'} May 10 16:18:41 link-02 lvm[6466]: No longer monitoring mirror device lock_stress-link--08.12485 for events May 10 16:19:01 link-02 kernel: dm-cmirror: Failed to receive election results from server: (3u0694BM,-110) May 10 16:19:41 link-02 last message repeated 2 times LINK-04: May 10 16:17:35 link-04 qarshd[27126]: Running cmdline: lvcreate -m 2 -n link-04.12481 -L 500M lock_stress May 10 16:17:57 link-04 [6834]: Monitoring mirror device lock_stress-link--07.12479 for events May 10 16:18:02 link-04 lvm[6834]: Monitoring mirror device lock_stress-link--02.12483 for events May 10 16:18:08 link-04 lvm[6834]: Monitoring mirror device lock_stress-link--08.12485 for events May 10 16:18:17 link-04 lvm[6834]: Monitoring mirror device lock_stress-link--04.12481 for events May 10 16:18:17 link-04 qarshd[27509]: Talking to peer 10.15.80.47:33482 May 10 16:18:17 link-04 qarshd[27509]: Running cmdline: lvs lock_stress/link-04.12481 --noheadings -o mirror_log | awk {'print $1'} May 10 16:18:22 link-04 qarshd[27519]: Talking to peer 10.15.80.47:33486 May 10 16:18:22 link-04 qarshd[27519]: Running cmdline: lvconvert -m 0 lock_stress/link-04.12481 May 10 16:18:23 link-04 lvm[6834]: No longer monitoring mirror device lock_stress-link--02.12483 for events May 10 16:18:30 link-04 lvm[6834]: No longer monitoring mirror device lock_stress-link--08.12485 for events May 10 16:18:50 link-04 kernel: dm-cmirror: Failed to receive election results from server: (3u0694BM,-110) May 10 16:19:30 link-04 last message repeated 2 times LINK-07: May 10 16:17:58 link-07 qarshd[27065]: Running cmdline: lvcreate -m 3 -n link-07.12479 -L 500M lock_stress May 10 16:18:08 link-07 [6442]: Monitoring mirror device lock_stress-link--07.12479 for events May 10 16:18:18 link-07 hald[3552]: Timed out waiting for hotplug event 3615. Rebasing to 3600 May 10 16:18:20 link-07 qarshd[27180]: Talking to peer 10.15.80.47:47611 May 10 16:18:20 link-07 qarshd[27180]: Running cmdline: lvs lock_stress/link-07.12479 --noheadings -o mirror_log | awk {'print $1'} May 10 16:18:22 link-07 hald[3552]: Timed out waiting for hotplug event 3601. Rebasing to 3615 May 10 16:18:25 link-07 lvm[6442]: Monitoring mirror device lock_stress-link--02.12483 for events May 10 16:18:31 link-07 lvm[6442]: Monitoring mirror device lock_stress-link--08.12485 for events May 10 16:18:40 link-07 lvm[6442]: Monitoring mirror device lock_stress-link--04.12481 for events May 10 16:18:44 link-07 qarshd[27458]: Talking to peer 10.15.80.47:47621 May 10 16:18:44 link-07 qarshd[27458]: Running cmdline: lvconvert -m 1 lock_stress/link-07.12479 May 10 16:18:45 link-07 lvm[6442]: No longer monitoring mirror device lock_stress-link--02.12483 for events May 10 16:18:53 link-07 lvm[6442]: No longer monitoring mirror device lock_stress-link--08.12485 for events May 10 16:19:13 link-07 kernel: dm-cmirror: Failed to receive election results from server: (3u0694BM,-110) May 10 16:19:28 link-07 kernel: dm-cmirror: Error listening for server(2) response for 3u0694BM: -110 May 10 16:19:43 link-07 kernel: dm-cmirror: Error listening for server(2) response for 3u0694BM: -110 May 10 16:19:53 link-07 lvm[6442]: No longer monitoring mirror device lock_stress-link--07.12479 for events May 10 16:19:58 link-07 kernel: dm-cmirror: Error listening for server(2) response for 3u0694BM: -110 May 10 16:20:43 link-07 last message repeated 3 times May 10 16:21:58 link-07 last message repeated 5 times May 10 16:22:09 link-07 qarshd[27458]: Sending child 27459 signal 2 May 10 16:22:13 link-07 kernel: dm-cmirror: Error listening for server(2) response for 3u0694BM: -110 May 10 16:22:58 link-07 last message repeated 3 times May 10 16:24:13 link-07 last message repeated 5 times May 10 16:25:28 link-07 last message repeated 5 times May 10 16:26:43 link-07 last message repeated 5 times May 10 16:27:13 link-07 last message repeated 2 times May 10 16:27:13 link-07 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 39442 (1%) May 10 16:27:13 link-07 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection. May 10 16:27:33 link-07 kernel: dm-cmirror: Failed to receive election results from server: (3u0694BM,-110) LINK-08: May 10 16:18:31 link-08 qarshd[31709]: Running cmdline: lvs lock_stress/link-08.12485 --noheadings -o mirror_log | awk {'print $1'} May 10 16:18:40 link-08 lvm[7254]: Monitoring mirror device lock_stress-link--04.12481 for events May 10 16:18:44 link-08 qarshd[31803]: Talking to peer 10.15.80.47:41860 May 10 16:18:44 link-08 qarshd[31803]: Running cmdline: lvconvert -m 2 lock_stress/link-08.12485 May 10 16:18:45 link-08 lvm[7254]: No longer monitoring mirror device lock_stress-link--02.12483 for events May 10 16:18:53 link-08 lvm[7254]: No longer monitoring mirror device lock_stress-link--08.12485 for events May 10 16:19:13 link-08 kernel: dm-cmirror: Failed to receive election results from server: (3u0694BM,-110) May 10 16:19:53 link-08 last message repeated 2 times Version-Release number of selected component (if applicable): 2.6.9-55.ELlargesmp cmirror-kernel-2.6.9-32.0 lvm2-cluster-2.02.21-7.el4