217895 – lost election results from cmirror server cause mirror activation deadlock

Bug 217895 - lost election results from cmirror server cause mirror activation deadlock

Summary: lost election results from cmirror server cause mirror activation deadlock

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	cmirror
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	227155 362691
TreeView+	depends on / blocked

Reported:	2006-11-30 17:25 UTC by Corey Marthaler
Modified:	2010-01-12 02:01 UTC (History)
CC List:	4 users (show)
Fixed In Version:	RHBA-2008-0803
Clone Of:
Environment:
Last Closed:	2008-07-25 19:27:59 UTC
Embargoed:

Attachments	(Terms of Use)
clvmd -d log from link-02 (18.25 KB, text/plain) 2007-03-15 18:35 UTC, Corey Marthaler	no flags	Details
clvmd -d log from link-04 (18.25 KB, text/plain) 2007-03-15 18:35 UTC, Corey Marthaler	no flags	Details
clvmd -d log from link-07 (61.90 KB, text/plain) 2007-03-15 18:36 UTC, Corey Marthaler	no flags	Details
clvmd -d log from link-08 (20.01 KB, text/plain) 2007-03-15 18:37 UTC, Corey Marthaler	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0803	0	normal	SHIPPED_LIVE	cmirror-kernel bug fix update	2008-07-25 19:27:56 UTC

Description Corey Marthaler 2006-11-30 17:25:57 UTC

Description of problem:
Was running looping cmirror creation/deletions on 2 of 4 nodes in a x86_64
cluster and saw these messages over and over along with hung lvm operations.

link-08:
dm-cmirror: Failed to receive election results from server: -110
dm-cmirror: Failed to receive election results from server: -110
dm-cmirror: Failed to receive election results from server: -110
dm-cmirror: Failed to receive election results from server: -110

link-07:
Nov 30 11:30:16 link-07 kernel: dm-cmirror: Clustered mirror retried requests ::
1824 of 12737177 (1%)
Nov 30 11:30:16 link-07 kernel: dm-cmirror: Last request:
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - my_id   :: 7
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - server  :: 2
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - log uuid:: 20W6FADr (active)
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - request :: LRT_GET_RESYNC_WORK
Nov 30 11:30:16 link-07 kernel: dm-cmirror:  - error   :: -110

link-04:
Nov 30 06:35:03 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
1952 of 13234029 (1%)
Nov 30 06:35:03 link-04 kernel: dm-cmirror: Last request:
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - my_id   :: 4
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - server  :: 2
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - log uuid:: cCXKYk21 (active)
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Nov 30 06:35:03 link-04 kernel: dm-cmirror:  - error   :: -110

link-02:
Nov 30 06:08:04 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
1792 of 13969193 (1%)
Nov 30 06:08:04 link-02 kernel: dm-cmirror: Last request:
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - my_id   :: 2
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - server  :: 2
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - log uuid:: 20W6FADr (active)
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - request :: LRT_GET_RESYNC_WORK
Nov 30 06:08:04 link-02 kernel: dm-cmirror:  - error   :: -110


[root@link-08 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   2    1    4   M   link-02
   4    1    4   M   link-04
   7    1    4   M   link-07
   8    1    4   M   link-08
[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 4 8 7]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 4 8 7]

DLM Lock Space:  "clustered_log"                     3   4 run       -
[2 4 8 7]

[root@link-08 ~]# dmsetup ls
corey-mirror5_mimage_1  (253, 20)
corey-mirror5   (253, 21)
corey-mirror5_mimage_0  (253, 19)
corey-mirror4_mimage_1  (253, 16)
corey-mirror2_mlog      (253, 6)
corey-mirror3   (253, 13)
corey-mirror4_mimage_0  (253, 15)
corey-mirror3_mimage_1  (253, 12)
corey-mirror    (253, 5)
corey-mirror2   (253, 9)
corey-mirror3_mimage_0  (253, 11)
corey-mirror3_mlog      (253, 10)
corey-mirror2_mimage_1  (253, 8)
corey-mirror2_mimage_0  (253, 7)
corey-mirror4_mlog      (253, 14)
corey-mirror_mimage_1   (253, 4)
corey-mirror_mimage_0   (253, 3)
corey-mirror5_mlog      (253, 18)
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)
corey-mirror_mlog       (253, 2)
[root@link-08 ~]# lvs
[HANG]

                                                                               
                          
Here's the stuck lvs process on link-08:
Nov 30 06:29:41 link-08 kernel: lvs           S 0000000000000012     0 31490 
31454                     (NOTLB)
Nov 30 06:29:41 link-08 kernel: 0000010036f87bd8 0000000000000002
000000000000020f 00000100013d1080
Nov 30 06:29:41 link-08 kernel:        0000010036f87b98 ffffffff801322ad
ffffffff1f0b5a80 0000000000000080
Nov 30 06:29:41 link-08 kernel:        000001001eba7030 0000000000004fba
Nov 30 06:29:41 link-08 kernel: Call
Trace:<ffffffff801322ad>{recalc_task_prio+337} <ffffffff8013233b>{activate_task+124}
Nov 30 06:29:41 link-08 kernel:        <ffffffff8030b35c>{schedule_timeout+224}
<ffffffff801357c8>{prepare_to_wait+21}
Nov 30 06:29:41 link-08 kernel:       
<ffffffff803064ca>{unix_stream_recvmsg+592}
<ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:29:41 link-08 kernel:       
<ffffffff801358cc>{autoremove_wake_function+0} <ffffffff802a7a62>{sock_aio_read+297}
Nov 30 06:29:41 link-08 kernel:        <ffffffff802a7ba8>{sock_aio_write+306}
<ffffffff801796f8>{do_sync_read+173}
Nov 30 06:29:41 link-08 kernel:        <ffffffff801875d0>{__user_walk+93}
<ffffffff80181d3a>{vfs_stat+24}
Nov 30 06:29:41 link-08 kernel:        <ffffffff8030a8c9>{thread_return+0}
<ffffffff8030a921>{thread_return+88}
Nov 30 06:29:41 link-08 kernel:       
<ffffffff801358cc>{autoremove_wake_function+0} <ffffffff80179806>{vfs_read+226}
Nov 30 06:29:41 link-08 kernel:        <ffffffff80179a4a>{sys_read+69}
<ffffffff8011026a>{system_call+126}


Here's the stuck lvchange on link-08:
Nov 30 06:29:41 link-08 kernel: lvchange      S 0000000000000012     0 30847  
3941                     (NOTLB)
Nov 30 06:29:41 link-08 kernel: 000001001790bbd8 0000000000000002
000001003f5aa030 0000000000000074
Nov 30 06:29:41 link-08 kernel:        000001001790bb98 00000000801322ad
0000010020013e80 0000000180132866
Nov 30 06:29:41 link-08 kernel:        000001003cae57f0 0000000000000adf
Nov 30 06:29:41 link-08 kernel: Call Trace:<ffffffff8013233b>{activate_task+124}
<ffffffff8030b35c>{schedule_tim
eout+224}
Nov 30 06:29:41 link-08 kernel:        <ffffffff801357c8>{prepare_to_wait+21}
<ffffffff803064ca>{unix_stream_rec
vmsg+592}
Nov 30 06:29:41 link-08 kernel:       
<ffffffff801358cc>{autoremove_wake_function+0} <ffffffff801358cc>{autorem
ove_wake_function+0}
Nov 30 06:29:41 link-08 kernel:        <ffffffff802a7a62>{sock_aio_read+297}
<ffffffff802a7ba8>{sock_aio_write+3
06}
Nov 30 06:29:41 link-08 kernel:        <ffffffff801796f8>{do_sync_read+173}
<ffffffff801875d0>{__user_walk+93}
Nov 30 06:29:41 link-08 kernel:        <ffffffff80181d3a>{vfs_stat+24}
<ffffffff801358cc>{autoremove_wake_functi
on+0}
Nov 30 06:29:41 link-08 kernel:        <ffffffff80193610>{dnotify_parent+34}
<ffffffff80179806>{vfs_read+226}
Nov 30 06:29:41 link-08 kernel:        <ffffffff80179a4a>{sys_read+69}
<ffffffff8011026a>{system_call+126}

Here's the cluster_log and and stuck lvchange on link-04:
Nov 30 06:37:20 link-04 kernel: cluster_log_s R  running task       0  4378    
 1          4400  4346 (L-TLB)  Nov 30 06:37:20 link-04 kernel: kcopyd        S
ffffffffa005ec84     0  4399      4                4322 (L-TLB)
Nov 30 06:37:20 link-04 kernel: 0000010032077e68 0000000000000046
0000010032077de8 ffffffff80133fb3             Nov 30 06:37:20 link-04 kernel:  
     0000000000000000 000001002f329e50 0000000000000000 000000002f3298b0
Nov 30 06:37:20 link-04 kernel:        00000100320ef030 00000000000001a0       
                                Nov 30 06:37:20 link-04 kernel: Call
Trace:<ffffffff80133fb3>{__wake_up+54} <ffffffffa005ec84>{:dm_mod:do_work+0
}                                                                              
                                Nov 30 06:37:20 link-04 kernel:       
<ffffffff8014791d>{worker_thread+226} <ffffffff80133f0c>{default_wake_fun
ction+0}                                                                       
                                Nov 30 06:37:20 link-04 kernel:       
<ffffffff80133f5d>{__wake_up_common+67} <ffffffff80133f0c>{default_wake_f
unction+0}                                                                     
                                Nov 30 06:37:20 link-04 kernel:       
<ffffffff8014b67c>{keventd_create_kthread+0} <ffffffff8014783b>{worker_th
read+0}                                                                        
                                Nov 30 06:37:20 link-04 kernel:       
<ffffffff8014b67c>{keventd_create_kthread+0} <ffffffff8014b653>{kthread+2
00}                                                                            
                                Nov 30 06:37:20 link-04 kernel:       
<ffffffff80110f47>{child_rip+8} <ffffffff8014b67c>{keventd_create_kthread
+0}                                                                            
                                Nov 30 06:37:20 link-04 kernel:       
<ffffffff8014b58b>{kthread+0} <ffffffff80110f3f>{child_rip+0}
Nov 30 06:37:20 link-04 kernel:                                                
                                Nov 30 06:37:20 link-04 kernel: lvchange      S
0000000000000012     0 30829   3927                     (NOTLB)
Nov 30 06:37:20 link-04 kernel: 0000010030f8fbd8 0000000000000006
0000010028d0f3a0 0000010000000086             Nov 30 06:37:20 link-04 kernel:  
     0000010030f8fb98 00000000801322ad 0000010001013b00 000000003ec532e0
Nov 30 06:37:20 link-04 kernel:        0000010038c8b7f0 000000000000137d
Nov 30 06:37:20 link-04 kernel: Call Trace:<ffffffff8013233b>{activate_task+124}
<ffffffff8030b35c>{schedule_timeout+224}
Nov 30 06:37:20 link-04 kernel:        <ffffffff801357c8>{prepare_to_wait+21}
<ffffffff803064ca>{unix_stream_recvmsg+592}
Nov 30 06:37:20 link-04 kernel:       
<ffffffff801358cc>{autoremove_wake_function+0}
<ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:37:20 link-04 kernel:        <ffffffff802a7a62>{sock_aio_read+297}
<ffffffff802a7ba8>{sock_aio_write+306}
Nov 30 06:37:20 link-04 kernel:        <ffffffff801796f8>{do_sync_read+173}
<ffffffff801875d0>{__user_walk+93}
Nov 30 06:37:20 link-04 kernel:        <ffffffff80181d3a>{vfs_stat+24}
<ffffffff8030a8c9>{thread_return+0}
Nov 30 06:37:20 link-04 kernel:        <ffffffff8030a921>{thread_return+88}
<ffffffff801358cc>{autoremove_wake_function+0}
Nov 30 06:37:20 link-04 kernel:        <ffffffff80179806>{vfs_read+226}
<ffffffff80179a4a>{sys_read+69}
Nov 30 06:37:20 link-04 kernel:        <ffffffff8011026a>{system_call+126}



Version-Release number of selected component (if applicable):
[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-42.17.ELsmp #1 SMP Mon Oct 9 18:42:57 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux
[root@link-08 ~]# rpm -q cmirror-kernel-smp
cmirror-kernel-smp-2.6.9-17.1
[root@link-08 ~]# rpm -qa | grep lvm2
lvm2-cluster-debuginfo-2.02.06-7.0.RHEL4
lvm2-2.02.15-3
lvm2-cluster-2.02.15-3

Comment 2 Jonathan Earl Brassow 2007-01-08 23:15:50 UTC

I can't seem to reproduce this, but I'm not convinced it is gone.  Requesting
help in reproducing.

Comment 3 Corey Marthaler 2007-02-14 19:47:01 UTC

Reproduced this issue along with bz 207132 last night doing looping
create/convert/remove cmirror operations.

Saw this on all the nodes in the cluster except the node which had the messages
seen in bz 207132. So these two bugs are most likely related.

dm-cmirror: Failed to receive election results from server: -110

Comment 4 Jonathan Earl Brassow 2007-02-20 19:44:27 UTC

I fixed a race where critical variables (restart_event_id/restart_event_type)
could be overwritten by the server after a client had set them.  This would
cause the kcl_start_done issued by the server to fail.  It would loose the
ability to acknowledge the start and the cluster mirror would stall...  This has
all sorts of side-effects, including never being able to reach the server.

Comment 5 Corey Marthaler 2007-03-08 16:27:52 UTC

I wonder if I hit one of the side effects mentioned in comment #4, I was doing
different cmirror operations on different nodes in the cluster and saw that the
device-mapper devices were still around for a cmirror on one of the nodes after
it had been deactivated. Should this issue be opened as another bz?


Mar  8 03:31:54 taft-01 qarshd[25508]: Running cmdline: lvchange -an
hang/taft-01.28863
Mar  8 03:31:54 taft-01 lvm[5026]: Monitoring mirror device hang-taft--04.28866
for events
Mar  8 03:32:05 taft-01 lvm[5026]: Monitoring mirror device hang-taft--03.28869
for events
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Clean-up required due to new server
Mar  8 03:32:07 taft-01 kernel: dm-cmirror:  - Wiping clear region list
Mar  8 03:32:07 taft-01 kernel: dm-cmirror:  - 1 clear region requests wiped
Mar  8 03:32:07 taft-01 kernel: dm-cmirror:  - Resending all mark region requests
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Clean-up complete
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Continuing request type, 7
(LRT_COMPLETE_RESYNC_WORK)
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Error while listening for server
response: -110
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Clustered mirror retried requests ::
1280 of 1683741 (1%)
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Too many retries, attempting to
re-establish server connection.
Mar  8 03:32:07 taft-01 kernel: dm-cmirror: Failed to receive election results
from server: (7ta2WQra,-110)
Mar  8 03:32:07 taft-01 lvm[5026]: dm_task_run failed, errno = 6, No such device
or address
Mar  8 03:32:07 taft-01 lvm[5026]: hang-taft--01.28863 disappeared, detaching
Mar  8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up required due to new server
Mar  8 03:32:08 taft-01 kernel: dm-cmirror:  - Wiping clear region list
Mar  8 03:32:08 taft-01 kernel: dm-cmirror:  - 0 clear region requests wiped
Mar  8 03:32:08 taft-01 kernel: dm-cmirror:  - Resending all mark region requests
Mar  8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up complete
Mar  8 03:32:08 taft-01 kernel: dm-cmirror: Continuing request type, 6
(LRT_GET_RESYNC_WORK)
Mar  8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up required due to new server
Mar  8 03:32:08 taft-01 kernel: dm-cmirror:  - Wiping clear region list
Mar  8 03:32:08 taft-01 kernel: dm-cmirror:  - 1 clear region requests wiped
Mar  8 03:32:08 taft-01 kernel: dm-cmirror:  - Resending all mark region requests
Mar  8 03:32:08 taft-01 kernel: dm-cmirror: Clean-up complete
Mar  8 03:32:08 taft-01 kernel: dm-cmirror: Continuing request type, 7
(LRT_COMPLETE_RESYNC_WORK)
Mar  8 03:32:08 taft-01 lvm[5026]: dm_task_run failed, errno = 6, No such device
or address
Mar  8 03:32:08 taft-01 lvm[5026]: hang-taft--04.28866 disappeared, detaching



[root@taft-01 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log               
Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M                    taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M                    taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M                    taft-03.28869_mlog
100.00
  taft-04.28866 hang       mwi--- 500.00M                    taft-04.28866_mlog
[root@taft-01 ~]# dmsetup ls
hang-taft--03.28869_mlog        (253, 15)
hang-taft--03.28869_mimage_3    (253, 20)
hang-taft--03.28869_mimage_2    (253, 19)
hang-taft--02.28860_mimage_1    (253, 4)
hang-taft--03.28869_mimage_1    (253, 17)
hang-taft--02.28860_mimage_0    (253, 3)
hang-taft--03.28869_mimage_0    (253, 16)
hang-taft--04.28866_mimage_3    (253, 14)
hang-taft--04.28866_mimage_2    (253, 13)
hang-taft--02.28860_mlog        (253, 2)
hang-taft--04.28866_mimage_1    (253, 11)
hang-taft--04.28866_mimage_0    (253, 10)
hang-taft--01.28863_mlog        (253, 6)
VolGroup00-LogVol01     (253, 1)
hang-taft--03.28869     (253, 22)
VolGroup00-LogVol00     (253, 0)
hang-taft--01.28863_mimage_1    (253, 8)
hang-taft--04.28866_mlog        (253, 9)
hang-taft--01.28863_mimage_0    (253, 7)





[root@taft-02 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log               
Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M                    taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M                    taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M                    taft-03.28869_mlog
100.00
  taft-04.28866 hang       mwi--- 500.00M                    taft-04.28866_mlog
[root@taft-02 ~]# dmsetup ls
hang-taft--03.28869_mlog        (253, 15)
hang-taft--03.28869_mimage_3    (253, 20)
hang-taft--03.28869_mimage_2    (253, 19)
hang-taft--03.28869_mimage_1    (253, 17)
hang-taft--03.28869_mimage_0    (253, 16)
VolGroup00-LogVol01     (253, 1)
hang-taft--03.28869     (253, 22)
VolGroup00-LogVol00     (253, 0)




[root@taft-03 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log               
Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M                    taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M                    taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M                    taft-03.28869_mlog
100.00
  taft-04.28866 hang       mwi--- 500.00M                    taft-04.28866_mlog
[root@taft-03 ~]# dmsetup ls
hang-taft--03.28869_mlog        (253, 15)
hang-taft--03.28869_mimage_3    (253, 20)
hang-taft--03.28869_mimage_2    (253, 19)
hang-taft--03.28869_mimage_1    (253, 17)
hang-taft--03.28869_mimage_0    (253, 16)
VolGroup00-LogVol01     (253, 1)
hang-taft--03.28869     (253, 22)
VolGroup00-LogVol00     (253, 0)




[root@taft-04 ~]# lvs
  LV            VG         Attr   LSize   Origin Snap%  Move Log               
Copy%
  LogVol00      VolGroup00 -wi-ao  66.16G
  LogVol01      VolGroup00 -wi-ao   1.94G
  taft-01.28863 hang       mwi--- 500.00M                    taft-01.28863_mlog
  taft-02.28860 hang       mwi--- 500.00M                    taft-02.28860_mlog
  taft-03.28869 hang       mwi-a- 500.00M                    taft-03.28869_mlog
100.00
  taft-04.28866 hang       mwi--- 500.00M                    taft-04.28866_mlog
[root@taft-04 ~]# dmsetup ls
hang-taft--03.28869_mlog        (253, 15)
hang-taft--03.28869_mimage_3    (253, 20)
hang-taft--03.28869_mimage_2    (253, 19)
hang-taft--03.28869_mimage_1    (253, 17)
hang-taft--03.28869_mimage_0    (253, 16)
VolGroup00-LogVol01     (253, 1)
hang-taft--03.28869     (253, 22)
VolGroup00-LogVol00     (253, 0)

Comment 6 Corey Marthaler 2007-03-15 18:28:33 UTC

I was attempting to reproduce bz 231230, and ended up hitting a case similar to
this problem. I had a very basic cmirror, 2 legs, 10G, no fs on top, no I/O
running, and I failed the primary leg of the cmirror while the mirror was still
syncing. This caused a lost of elections results and ended up causing a node in
the cluster to be fenced.

[...]
Mar 15 07:17:09 link-02 kernel: end_request: I/O error, dev sda, sector 10165809
Mar 15 07:17:09 link-02 kernel: SCSI error : <0 0 0 1> return code = 0x10000
Mar 15 07:17:09 link-02 kernel: end_request: I/O error, dev sda, sector 10165817
Mar 15 07:17:09 link-02 kernel: device-mapper: Unable to read from primary
mirror during recovery
Mar 15 07:17:09 link-02 kernel: device-mapper: recovery failed on region 9911
Mar 15 07:17:27 link-02 kernel: dm-cmirror: Failed to receive election results
from server: (TKbs5wES,-110)
Mar 15 07:17:39 link-02 fenced[4519]: link-08 not a cluster member after 30 sec
post_fail_delay
Mar 15 07:17:39 link-02 fenced[4519]: fencing node "link-08"
Mar 15 07:17:47 link-02 kernel: dm-cmirror: Failed to receive election results
from server: (TKbs5wES,-110)
Mar 15 07:17:59 link-02 fenced[4519]: fence "link-08" success
Mar 15 07:18:06 link-02 kernel: dm-cmirror: A cluster mirror log member has failed.
Mar 15 07:18:07 link-02 kernel: dm-cmirror: Failed to receive election results
from server: (TKbs5wES,-110)


A write attempt to this mirror inorder to trigger the down convert fails:
[root@link-07 ~]# dd if=/dev/zero of=/dev/helter_skelter/mirror count=10
dd: writing to `/dev/helter_skelter/mirror': Input/output error
1+0 records in
0+0 records out


I'll attach the clvmd debug logs

Comment 7 Corey Marthaler 2007-03-15 18:35:18 UTC

Created attachment 150149 [details]
clvmd -d log from link-02

Comment 8 Corey Marthaler 2007-03-15 18:35:45 UTC

Created attachment 150150 [details]
clvmd -d log from link-04

Comment 9 Corey Marthaler 2007-03-15 18:36:36 UTC

Created attachment 150151 [details]
clvmd -d log from link-07

Comment 10 Corey Marthaler 2007-03-15 18:37:31 UTC

Created attachment 150152 [details]
clvmd -d log from link-08

Comment 11 Jonathan Earl Brassow 2007-03-16 19:12:01 UTC

We know now that the dd command used to trigger the down convert does not work
because it does reads before writes.  Doing real writes should trigger the convert.

The issue now is, why did a node get fenced?

Comment 12 Jonathan Earl Brassow 2007-03-16 19:16:02 UTC

Note:  It is not a problem to recieve a few of these:
dm-cmirror: Failed to receive election results

It is a problem if it never recovers from it.

Please reproduce so we can determine why the member got fenced.  Is something
else going on that might trigger that?

assigned -> needinfo

Comment 13 Jonathan Earl Brassow 2007-04-16 20:19:35 UTC

I think the root cause of this has been fixed since I marked it 'needinfo'

needinfo -> modified

Comment 14 Corey Marthaler 2007-06-26 21:17:34 UTC

I have not seen any "failed election results" messages for quite awhile, and
there are other bugs to document a cmirror operations deadlock. Marking this bz
verified.

Comment 15 Corey Marthaler 2007-09-06 21:40:37 UTC

I may have spoken too soon. I appear to have hit this while doing IO to 4
mirrors from 4 vgs while at the same time doing disk to core log (and back)
converts.

Comment 16 Corey Marthaler 2007-09-07 15:24:49 UTC

The scenario in comment #15 reliably reproduces this bug.

Comment 17 Corey Marthaler 2007-11-08 22:11:08 UTC

This still appears to be reproducable (that or this is a new bug). 

I was attempting something similar to that in comment #15 with I/O to two
cmirrors each on a different vg, all while converting to and from disk log and
core log.

After a couple successful converts, it failed with locking errors. 

  Error locking on node grant-03: Command timed out
  Error locking on node link-08: Command timed out
  Failed to lock mirror1

The devices in that mirror failed convert mirror are now suspended:
  mirror1            vg1        Mwi-so 15.00G                                
100.00 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg1        iwi-so 15.00G                                  
     /dev/sdb1(0)                      
  [mirror1_mimage_1] vg1        iwi-so 15.00G                                  
     /dev/sdc1(0)                      
  mirror1_mlog       vg1        -wi---  4.00M                                  
     /dev/sdd1(0)


On the consoles where thelocks failed (link-08 and grant-03), I'm seeing a
steady stream of cmirror server messages...

Nov  8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to
re-establish server connection.
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests ::
5223200 of 5244922 (100%)
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to
re-establish server connection.
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests ::
5223264 of 5244986 (100%)
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to
re-establish server connection.
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests ::
5223296 of 5245018 (100%)
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to
re-establish server connection.
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Clustered mirror retried requests ::
5223328 of 5245050 (100%)
Nov  8 16:20:09 link-08 kernel: dm-cmirror: Too many retries, attempting to
re-establish server connection

Comment 18 Corey Marthaler 2007-11-19 16:24:48 UTC

Hit this bug during 4.6 regression tesing.

Comment 19 Corey Marthaler 2007-12-19 23:11:30 UTC

Hit this bug during 4.6.Z regression testing.

Comment 20 Corey Marthaler 2007-12-20 19:51:22 UTC

Bumping the priority and marking this as a TestBlocker, since this is stopping
us from verifying other cmirror bug fixes.

Comment 21 Dean Jansa 2008-01-24 15:27:51 UTC

Looks like I've hit this in the current 4.6.Z testing.
lvm2-cluster-2.02.27-2.el4_6.1
lvm2-cluster-debuginfo-2.02.27-2.el4_6.1
lvm2-debuginfo-2.02.27-2.el4_6.1
lvm2-2.02.27-2.el4_6.1
device-mapper-1.02.21-1.el4

I see many of these messages on the consoles while running
[lvm_cluster_cmirror_lock_stress]:
dm-cmirror: Failed to receive election results from server: (k1SECe8j/1,-110)

And my tests are stuck at this point.

The tests running:
<herd name="lock_stress_cmirror" desc="lock_stress some mirrors" version="1">
        <pan2_opts numactive="all" allstop="true" iterations="5"/>

        <test name="link-13_1">
                <cmd timeout="450">/usr/tests/sts-rhel4.6/lvm2/bin/lock_ops -n
link-13 -v lock_stress</cmd>
        </test>
        <test name="link-14_1">
                <cmd timeout="450">/usr/tests/sts-rhel4.6/lvm2/bin/lock_ops -n
link-14 -v lock_stress</cmd>
        </test>
        <test name="link-16_1">
                <cmd timeout="450">/usr/tests/sts-rhel4.6/lvm2/bin/lock_ops -n
link-16 -v lock_stress</cmd>
        </test>

</herd>

Comment 23 Corey Marthaler 2008-06-30 20:01:52 UTC

This hasn't been seen in over 6 months. Marking verified. Will re-open if seen
again.

Comment 25 errata-xmlrpc 2008-07-25 19:27:59 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0803.html

Note You need to log in before you can comment on or make changes to this bug.