Description of problem:
Was running looping cmirror creation/deletions on 2 of 4 nodes in an x86_64 cluster and saw these messages over and over, along with hung lvm operations.

link-08:
dm-cmirror: Failed to receive election results from server: -110
dm-cmirror: Failed to receive election results from server: -110
dm-cmirror: Failed to receive election results from server: -110
dm-cmirror: Failed to receive election results from server: -110

link-07:
Nov 30 11:30:16 link-07 kernel: dm-cmirror: Clustered mirror retried requests :: 1824 of 12737177 (1%)
Nov 30 11:30:16 link-07 kernel: dm-cmirror: Last request:
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - my_id   :: 7
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - server  :: 2
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - log uuid:: 20W6FADr (active)
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - request :: LRT_GET_RESYNC_WORK
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - error   :: -110

link-04:
Nov 30 06:35:03 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 1952 of 13234029 (1%)
Nov 30 06:35:03 link-04 kernel: dm-cmirror: Last request:
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - my_id   :: 4
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - server  :: 2
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - log uuid:: cCXKYk21 (active)
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - error   :: -110

link-02:
Nov 30 06:08:04 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 1792 of 13969193 (1%)
Nov 30 06:08:04 link-02 kernel: dm-cmirror: Last request:
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - my_id   :: 2
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - server  :: 2
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - log uuid:: 20W6FADr (active)
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - request :: LRT_GET_RESYNC_WORK
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - error   :: -110

[root@link-08 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   2    1    4   M   link-02
   4    1    4   M   link-04
   7    1    4   M   link-07
   8    1    4   M   link-08

[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 4 8 7]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 4 8 7]

DLM Lock Space:  "clustered_log"                     3   4 run       -
[2 4 8 7]

[root@link-08 ~]# dmsetup ls
corey-mirror5_mimage_1	(253, 20)
corey-mirror5	(253, 21)
corey-mirror5_mimage_0	(253, 19)
corey-mirror4_mimage_1	(253, 16)
corey-mirror2_mlog	(253, 6)
corey-mirror3	(253, 13)
corey-mirror4_mimage_0	(253, 15)
corey-mirror3_mimage_1	(253, 12)
corey-mirror	(253, 5)
corey-mirror2	(253, 9)
corey-mirror3_mimage_0	(253, 11)
corey-mirror3_mlog	(253, 10)
corey-mirror2_mimage_1	(253, 8)
corey-mirror2_mimage_0	(253, 7)
corey-mirror4_mlog	(253, 14)
corey-mirror_mimage_1	(253, 4)
corey-mirror_mimage_0	(253, 3)
corey-mirror5_mlog	(253, 18)
VolGroup00-LogVol01	(253, 1)
VolGroup00-LogVol00	(253, 0)
corey-mirror_mlog	(253, 2)

[root@link-08 ~]# lvs
[HANG]

Here's the stuck lvs process on link-08:
Nov 30 06:29:41 link-08 kernel: lvs           S 0000000000000012     0 31490  31454                     (NOTLB)
Nov 30 06:29:41 link-08 kernel: 0000010036f87bd8 0000000000000002 000000000000020f 00000100013d1080
Nov 30 06:29:41 link-08 kernel: 0000010036f87b98 ffffffff801322ad ffffffff1f0b5a80 0000000000000080
Nov 30 06:29:41 link-08 kernel: 000001001eba7030 0000000000004fba
Nov 30 06:29:41 link-08 kernel: Call Trace:<ffffffff801322ad>{recalc_task_prio+337} <ffffffff8013233b>{activate_task+124}
Nov 30 06:29:41 link-08 kernel: <ffffffff8030b35c>{schedule_timeout+224} <ffffffff801357c8>{prepare_to_wait+21}
Nov 30 06:29:41 link-08 kernel: <ffffffff803064ca>{unix_stream_recvmsg+592} <ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:29:41 link-08 kernel: <ffffffff801358cc>{autoremove_wake_function+0} <ffffffff802a7a62>{sock_aio_read+297}
Nov 30 06:29:41 link-08 kernel: <ffffffff802a7ba8>{sock_aio_write+306} <ffffffff801796f8>{do_sync_read+173}
Nov 30 06:29:41 link-08 kernel: <ffffffff801875d0>{__user_walk+93} <ffffffff80181d3a>{vfs_stat+24}
Nov 30 06:29:41 link-08 kernel: <ffffffff8030a8c9>{thread_return+0} <ffffffff8030a921>{thread_return+88}
Nov 30 06:29:41 link-08 kernel: <ffffffff801358cc>{autoremove_wake_function+0} <ffffffff80179806>{vfs_read+226}
Nov 30 06:29:41 link-08 kernel: <ffffffff80179a4a>{sys_read+69} <ffffffff8011026a>{system_call+126}

Here's the stuck lvchange on link-08:
Nov 30 06:29:41 link-08 kernel: lvchange      S 0000000000000012     0 30847   3941                     (NOTLB)
Nov 30 06:29:41 link-08 kernel: 000001001790bbd8 0000000000000002 000001003f5aa030 0000000000000074
Nov 30 06:29:41 link-08 kernel: 000001001790bb98 00000000801322ad 0000010020013e80 0000000180132866
Nov 30 06:29:41 link-08 kernel: 000001003cae57f0 0000000000000adf
Nov 30 06:29:41 link-08 kernel: Call Trace:<ffffffff8013233b>{activate_task+124} <ffffffff8030b35c>{schedule_timeout+224}
Nov 30 06:29:41 link-08 kernel: <ffffffff801357c8>{prepare_to_wait+21} <ffffffff803064ca>{unix_stream_recvmsg+592}
Nov 30 06:29:41 link-08 kernel: <ffffffff801358cc>{autoremove_wake_function+0} <ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:29:41 link-08 kernel: <ffffffff802a7a62>{sock_aio_read+297} <ffffffff802a7ba8>{sock_aio_write+306}
Nov 30 06:29:41 link-08 kernel: <ffffffff801796f8>{do_sync_read+173} <ffffffff801875d0>{__user_walk+93}
Nov 30 06:29:41 link-08 kernel: <ffffffff80181d3a>{vfs_stat+24} <ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:29:41 link-08 kernel: <ffffffff80193610>{dnotify_parent+34} <ffffffff80179806>{vfs_read+226}
Nov 30 06:29:41 link-08 kernel: <ffffffff80179a4a>{sys_read+69} <ffffffff8011026a>{system_call+126}

Here's the cluster_log and the stuck lvchange on link-04:
Nov 30 06:37:20 link-04 kernel: cluster_log_s R  running task        0  4378      1          4400  4346 (L-TLB)
Nov 30 06:37:20 link-04 kernel: kcopyd        S ffffffffa005ec84     0  4399      4                4322 (L-TLB)
Nov 30 06:37:20 link-04 kernel: 0000010032077e68 0000000000000046 0000010032077de8 ffffffff80133fb3
Nov 30 06:37:20 link-04 kernel: 0000000000000000 000001002f329e50 0000000000000000 000000002f3298b0
Nov 30 06:37:20 link-04 kernel: 00000100320ef030 00000000000001a0
Nov 30 06:37:20 link-04 kernel: Call Trace:<ffffffff80133fb3>{__wake_up+54} <ffffffffa005ec84>{:dm_mod:do_work+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff8014791d>{worker_thread+226} <ffffffff80133f0c>{default_wake_function+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff80133f5d>{__wake_up_common+67} <ffffffff80133f0c>{default_wake_function+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff8014b67c>{keventd_create_kthread+0} <ffffffff8014783b>{worker_thread+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff8014b67c>{keventd_create_kthread+0} <ffffffff8014b653>{kthread+200}
Nov 30 06:37:20 link-04 kernel: <ffffffff80110f47>{child_rip+8} <ffffffff8014b67c>{keventd_create_kthread+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff8014b58b>{kthread+0} <ffffffff80110f3f>{child_rip+0}
Nov 30 06:37:20 link-04 kernel:
Nov 30 06:37:20 link-04 kernel: lvchange      S 0000000000000012     0 30829   3927                     (NOTLB)
Nov 30 06:37:20 link-04 kernel: 0000010030f8fbd8 0000000000000006 0000010028d0f3a0 0000010000000086
Nov 30 06:37:20 link-04 kernel: 0000010030f8fb98 00000000801322ad 0000010001013b00 000000003ec532e0
Nov 30 06:37:20 link-04 kernel: 0000010038c8b7f0 000000000000137d
Nov 30 06:37:20 link-04 kernel: Call Trace:<ffffffff8013233b>{activate_task+124} <ffffffff8030b35c>{schedule_timeout+224}
Nov 30 06:37:20 link-04 kernel: <ffffffff801357c8>{prepare_to_wait+21} <ffffffff803064ca>{unix_stream_recvmsg+592}
Nov 30 06:37:20 link-04 kernel: <ffffffff801358cc>{autoremove_wake_function+0} <ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff802a7a62>{sock_aio_read+297} <ffffffff802a7ba8>{sock_aio_write+306}
Nov 30 06:37:20 link-04 kernel: <ffffffff801796f8>{do_sync_read+173} <ffffffff801875d0>{__user_walk+93}
Nov 30 06:37:20 link-04 kernel: <ffffffff80181d3a>{vfs_stat+24} <ffffffff8030a8c9>{thread_return+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff8030a921>{thread_return+88} <ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:37:20 link-04 kernel: <ffffffff80179806>{vfs_read+226} <ffffffff80179a4a>{sys_read+69}
Nov 30 06:37:20 link-04 kernel: <ffffffff8011026a>{system_call+126}

Version-Release number of selected component (if applicable):
[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-42.17.ELsmp #1 SMP Mon Oct 9 18:42:57 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@link-08 ~]# rpm -q cmirror-kernel-smp
cmirror-kernel-smp-2.6.9-17.1
[root@link-08 ~]# rpm -qa | grep lvm2
lvm2-cluster-debuginfo-2.02.06-7.0.RHEL4
lvm2-2.02.15-3
lvm2-cluster-2.02.15-3
I can't seem to reproduce this, but I'm not convinced it is gone. Requesting help in reproducing.
Reproduced this issue along with bz 207132 last night doing looping create/convert/remove cmirror operations. Saw this on all the nodes in the cluster except the node which had the messages seen in bz 207132, so these two bugs are most likely related.

dm-cmirror: Failed to receive election results from server: -110
I fixed a race where critical variables (restart_event_id/restart_event_type) could be overwritten by the server after a client had set them. This would cause the kcl_start_done issued by the server to fail: it would lose the ability to acknowledge the start, and the cluster mirror would stall. This has all sorts of side effects, including never being able to reach the server.
I wonder if I hit one of the side effects mentioned in comment #4. I was doing different cmirror operations on different nodes in the cluster and saw that the device-mapper devices were still around for a cmirror on one of the nodes after it had been deactivated. Should this issue be opened as another bz?

Mar 8 03:31:54 taft-01 qarshd[25508]: Running cmdline: lvchange -an hang/taft-01.28863
Mar 8 03:31:54 taft-01 lvm[5026]: Monitoring mirror device hang-taft--04.28866 for events
Mar 8 03:32:05 taft-01 lvm[5026]: Monitoring mirror device hang-taft--03.28869 for events
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Clean-up required due to new server
Mar 8 03:32:07 taft-01 kernel: dm-cmirror:  - Wiping clear region list
Mar 8 03:32:07 taft-01 kernel: dm-cmirror:  - 1 clear region requests wiped
Mar 8 03:32:07 taft-01 kernel: dm-cmirror:  - Resending all mark region requests
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Clean-up complete
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Continuing request type, 7 (LRT_COMPLETE_RESYNC_WORK)
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Error while listening for server response: -110
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Clustered mirror retried requests :: 1280 of 1683741 (1%)
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Mar 8 03:32:07 taft-01 kernel: dm-cmirror: Failed to receive election results from server: (7ta2WQra,-110)
Mar 8 03:32:07 taft-01 lvm[5026]: dm_task_run failed, errno = 6, No such device or address
Mar 8 03:32:07 taft-01 lvm[5026]: hang-taft--01.28863 disappeared, detaching
Mar 8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up required due to new server
Mar 8 03:32:08 taft-01 kernel: dm-cmirror:  - Wiping clear region list
Mar 8 03:32:08 taft-01 kernel: dm-cmirror:  - 0 clear region requests wiped
Mar 8 03:32:08 taft-01 kernel: dm-cmirror:  - Resending all mark region requests
Mar 8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up complete
Mar 8 03:32:08 taft-01 kernel: dm-cmirror: Continuing request type, 6 (LRT_GET_RESYNC_WORK)
Mar 8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up required due to new server
Mar 8 03:32:08 taft-01 kernel: dm-cmirror:  - Wiping clear region list
Mar 8 03:32:08 taft-01 kernel: dm-cmirror:  - 1 clear region requests wiped
Mar 8 03:32:08 taft-01 kernel: dm-cmirror:  - Resending all mark region requests
Mar 8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up complete
Mar 8 03:32:08 taft-01 kernel: dm-cmirror: Continuing request type, 7 (LRT_COMPLETE_RESYNC_WORK)
Mar 8 03:32:08 taft-01 lvm[5026]: dm_task_run failed, errno = 6, No such device or address
Mar 8 03:32:08 taft-01 lvm[5026]: hang-taft--04.28866 disappeared, detaching

[root@taft-01 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log                Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M               taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M               taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M               taft-03.28869_mlog 100.00
  taft-04.28866 hang       mwi--- 500.00M               taft-04.28866_mlog

[root@taft-01 ~]# dmsetup ls
hang-taft--03.28869_mlog	(253, 15)
hang-taft--03.28869_mimage_3	(253, 20)
hang-taft--03.28869_mimage_2	(253, 19)
hang-taft--02.28860_mimage_1	(253, 4)
hang-taft--03.28869_mimage_1	(253, 17)
hang-taft--02.28860_mimage_0	(253, 3)
hang-taft--03.28869_mimage_0	(253, 16)
hang-taft--04.28866_mimage_3	(253, 14)
hang-taft--04.28866_mimage_2	(253, 13)
hang-taft--02.28860_mlog	(253, 2)
hang-taft--04.28866_mimage_1	(253, 11)
hang-taft--04.28866_mimage_0	(253, 10)
hang-taft--01.28863_mlog	(253, 6)
VolGroup00-LogVol01	(253, 1)
hang-taft--03.28869	(253, 22)
VolGroup00-LogVol00	(253, 0)
hang-taft--01.28863_mimage_1	(253, 8)
hang-taft--04.28866_mlog	(253, 9)
hang-taft--01.28863_mimage_0	(253, 7)

[root@taft-02 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log                Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M               taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M               taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M               taft-03.28869_mlog 100.00
  taft-04.28866 hang       mwi--- 500.00M               taft-04.28866_mlog

[root@taft-02 ~]# dmsetup ls
hang-taft--03.28869_mlog	(253, 15)
hang-taft--03.28869_mimage_3	(253, 20)
hang-taft--03.28869_mimage_2	(253, 19)
hang-taft--03.28869_mimage_1	(253, 17)
hang-taft--03.28869_mimage_0	(253, 16)
VolGroup00-LogVol01	(253, 1)
hang-taft--03.28869	(253, 22)
VolGroup00-LogVol00	(253, 0)

[root@taft-03 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log                Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M               taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M               taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M               taft-03.28869_mlog 100.00
  taft-04.28866 hang       mwi--- 500.00M               taft-04.28866_mlog

[root@taft-03 ~]# dmsetup ls
hang-taft--03.28869_mlog	(253, 15)
hang-taft--03.28869_mimage_3	(253, 20)
hang-taft--03.28869_mimage_2	(253, 19)
hang-taft--03.28869_mimage_1	(253, 17)
hang-taft--03.28869_mimage_0	(253, 16)
VolGroup00-LogVol01	(253, 1)
hang-taft--03.28869	(253, 22)
VolGroup00-LogVol00	(253, 0)

[root@taft-04 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log                Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M               taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M               taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M               taft-03.28869_mlog 100.00
  taft-04.28866 hang       mwi--- 500.00M               taft-04.28866_mlog

[root@taft-04 ~]# dmsetup ls
hang-taft--03.28869_mlog	(253, 15)
hang-taft--03.28869_mimage_3	(253, 20)
hang-taft--03.28869_mimage_2	(253, 19)
hang-taft--03.28869_mimage_1	(253, 17)
hang-taft--03.28869_mimage_0	(253, 16)
VolGroup00-LogVol01	(253, 1)
hang-taft--03.28869	(253, 22)
VolGroup00-LogVol00	(253, 0)
I was attempting to reproduce bz 231230 and ended up hitting a case similar to this problem. I had a very basic cmirror: 2 legs, 10G, no fs on top, no I/O running, and I failed the primary leg of the cmirror while the mirror was still syncing. This caused a loss of election results and ended up causing a node in the cluster to be fenced.

[...]
Mar 15 07:17:09 link-02 kernel: end_request: I/O error, dev sda, sector 10165809
Mar 15 07:17:09 link-02 kernel: SCSI error : <0 0 0 1> return code = 0x10000
Mar 15 07:17:09 link-02 kernel: end_request: I/O error, dev sda, sector 10165817
Mar 15 07:17:09 link-02 kernel: device-mapper: Unable to read from primary mirror during recovery
Mar 15 07:17:09 link-02 kernel: device-mapper: recovery failed on region 9911
Mar 15 07:17:27 link-02 kernel: dm-cmirror: Failed to receive election results from server: (TKbs5wES,-110)
Mar 15 07:17:39 link-02 fenced[4519]: link-08 not a cluster member after 30 sec post_fail_delay
Mar 15 07:17:39 link-02 fenced[4519]: fencing node "link-08"
Mar 15 07:17:47 link-02 kernel: dm-cmirror: Failed to receive election results from server: (TKbs5wES,-110)
Mar 15 07:17:59 link-02 fenced[4519]: fence "link-08" success
Mar 15 07:18:06 link-02 kernel: dm-cmirror: A cluster mirror log member has failed.
Mar 15 07:18:07 link-02 kernel: dm-cmirror: Failed to receive election results from server: (TKbs5wES,-110)

A write attempt to this mirror in order to trigger the down convert fails:

[root@link-07 ~]# dd if=/dev/zero of=/dev/helter_skelter/mirror count=10
dd: writing to `/dev/helter_skelter/mirror': Input/output error
1+0 records in
0+0 records out

I'll attach the clvmd debug logs.
Created attachment 150149 [details] clvmd -d log from link-02
Created attachment 150150 [details] clvmd -d log from link-04
Created attachment 150151 [details] clvmd -d log from link-07
Created attachment 150152 [details] clvmd -d log from link-08
We know now that the dd command used to trigger the down convert does not work because it does reads before writes. Doing real writes should trigger the convert. The issue now is, why did a node get fenced?
Note: It is not a problem to receive a few of these:

dm-cmirror: Failed to receive election results

It is a problem if it never recovers from it. Please reproduce so we can determine why the member got fenced. Is something else going on that might trigger that?

assigned -> needinfo
I think the root cause of this has been fixed since I marked it 'needinfo'.

needinfo -> modified
I have not seen any "failed election results" messages for quite a while, and there are other bugs to document a cmirror operations deadlock. Marking this bz verified.
I may have spoken too soon. I appear to have hit this while doing I/O to 4 mirrors from 4 vgs, while at the same time doing disk to core log (and back) converts.
The scenario in comment #15 reliably reproduces this bug.
This still appears to be reproducible (that, or this is a new bug). I was attempting something similar to comment #15, with I/O to two cmirrors, each on a different vg, all while converting to and from disk log and core log. After a couple successful converts, it failed with locking errors.

Error locking on node grant-03: Command timed out
Error locking on node link-08: Command timed out
Failed to lock mirror1

The devices in that failed convert mirror are now suspended:

  mirror1            vg1 Mwi-so 15.00G 100.00 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg1 iwi-so 15.00G        /dev/sdb1(0)
  [mirror1_mimage_1] vg1 iwi-so 15.00G        /dev/sdc1(0)
  mirror1_mlog       vg1 -wi---  4.00M        /dev/sdd1(0)

On the consoles where the locks failed (link-08 and grant-03), I'm seeing a steady stream of cmirror server messages...

Nov 8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests :: 5223200 of 5244922 (100%)
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests :: 5223264 of 5244986 (100%)
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests :: 5223296 of 5245018 (100%)
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests :: 5223328 of 5245050 (100%)
Nov 8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection
Hit this bug during 4.6 regression testing.
Hit this bug during 4.6.Z regression testing.
Bumping the priority and marking this as a TestBlocker, since this is stopping us from verifying other cmirror bug fixes.
Looks like I've hit this in the current 4.6.Z testing.

lvm2-cluster-2.02.27-2.el4_6.1
lvm2-cluster-debuginfo-2.02.27-2.el4_6.1
lvm2-debuginfo-2.02.27-2.el4_6.1
lvm2-2.02.27-2.el4_6.1
device-mapper-1.02.21-1.el4

I see many of these messages on the consoles while running [lvm_cluster_cmirror_lock_stress]:

dm-cmirror: Failed to receive election results from server: (k1SECe8j/1,-110)

And my tests are stuck at this point. The tests running:

<herd name="lock_stress_cmirror" desc="lock_stress some mirrors" version="1">
  <pan2_opts numactive="all" allstop="true" iterations="5"/>
  <test name="link-13_1">
    <cmd timeout="450">/usr/tests/sts-rhel4.6/lvm2/bin/lock_ops -n link-13 -v lock_stress</cmd>
  </test>
  <test name="link-14_1">
    <cmd timeout="450">/usr/tests/sts-rhel4.6/lvm2/bin/lock_ops -n link-14 -v lock_stress</cmd>
  </test>
  <test name="link-16_1">
    <cmd timeout="450">/usr/tests/sts-rhel4.6/lvm2/bin/lock_ops -n link-16 -v lock_stress</cmd>
  </test>
</herd>
This hasn't been seen in over 6 months. Marking verified. Will re-open if seen again.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0803.html