Description of problem:

I'm not sure if this is specifically a clvmd or cmirror issue.

[root@link-07 ~]# lvconvert -m 1 corey/coreymir1
  Logical volume coreymir1 converted.
[root@link-07 ~]# lvconvert -m 1 corey/coreymir2
  EOF reading CLVMD
  Problem reactivating coreymir2
  Error writing data to clvmd: Broken pipe

Feb 28 14:59:59 link-07 lvm[4809]: Monitoring mirror device corey-coreymir2 for events
Feb 28 14:59:59 link-07 kernel: clvmd[14593]: segfault at 0000000000000018 rip 000000000041c411 rsp 0000000042802930 error 4

I had two GFS filesystems on top of cmirrors, both doing I/O, when I failed the primary legs of both cmirrors. The down-conversion from cmirrors to linears appeared to work just fine. I then turned the failed device back on and proceeded to up-convert those linears back to cmirrors (on link-07). The first one worked, but the second one failed.

Now it appears that clvmd is no longer running on link-07, yet it still shows up as a cluster service on all 4 nodes in that cluster.

During this failure scenario, I also had the lvm test activator running, which basically creates a bunch of different types of lvm volumes and then deactivates and reactivates them over and over.

[root@link-07 ~]# lvs
  connect() failed on local socket: Connection refused
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  Skipping clustered volume group corey
  Skipping clustered volume group activator4
  Skipping clustered volume group activator3
  Skipping clustered volume group activator2
  Skipping clustered volume group activator1
  LV       VG         Attr   LSize  Origin Snap%  Move Log Copy%
  LogVol00 VolGroup00 -wi-ao 72.38G
  LogVol01 VolGroup00 -wi-ao  1.94G

[root@link-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[1 2 4 3]

DLM Lock Space:  "clvmd"                           214 110 run       -
[1 3 2 4]

DLM Lock Space:  "gfs1"                            223 116 run       -
[1 2 3 4]

DLM Lock Space:  "gfs2"                            225 118 run       -
[1 2 3]

DLM Lock Space:  "clustered_log"                   252 129 run       -
[1 2 3 4]

GFS Mount Group: "gfs1"                            224 117 run       -
[1 2 3 4]

GFS Mount Group: "gfs2"                            226 119 run       -
[1 2 3]

[root@link-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 1 4 3]

DLM Lock Space:  "clvmd"                           214   4 run       -
[1 3 2 4]

DLM Lock Space:  "gfs1"                            223  10 run       -
[2 1 3 4]

DLM Lock Space:  "gfs2"                            225  12 run       -
[2 1 3]

DLM Lock Space:  "clustered_log"                   252  23 run       -
[2 1 3 4]

GFS Mount Group: "gfs1"                            224  11 run       -
[2 1 3 4]

GFS Mount Group: "gfs2"                            226  13 run       -
[2 1 3]

[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 4 1 3]

DLM Lock Space:  "clvmd"                           214 110 run       -
[1 2 3 4]

DLM Lock Space:  "gfs1"                            223 116 run       -
[1 2 3 4]

DLM Lock Space:  "gfs2"                            225 118 run       -
[1 2 3]

DLM Lock Space:  "clustered_log"                   252 129 run       -
[2 1 3 4]

GFS Mount Group: "gfs1"                            224 117 run       -
[1 2 3 4]

GFS Mount Group: "gfs2"                            226 119 run       -
[1 2 3]

[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[4 2 1 3]

DLM Lock Space:  "clvmd"                           214 110 run       -
[1 2 3 4]

DLM Lock Space:  "gfs1"                            223 116 run       -
[1 2 3 4]

DLM Lock Space:  "clustered_log"                   252 129 run       -
[1 2 3 4]

GFS Mount Group: "gfs1"                            224 117 run       -
[1 2 3 4]

Version-Release number of selected component (if applicable):
2.6.9-48.ELsmp
lvm2-2.02.21-3.el4
lvm2-cluster-2.02.21-3.el4
device-mapper-1.02.17-2.el4
cmirror-kernel-smp-2.6.9-23.0
Give me the core file for clvmd and I'll tell you exactly where it failed.
Reproduction would be good; a core file would be even better.
Just a note that I reproduced this today. Still couldn't find a core, and I believe that I had ulimit set to 0.

[root@link-04 ~]# lvremove -f corey
  Inconsistent metadata copies found - updating to use version 27
  EOF reading CLVMD
  Can't get exclusive access to volume "tre"
  Can't remove logical volume tre_mlog used as mirror log
  Can't remove logical volume tre_mimage_1 used by a mirror
  Can't remove logical volume tre_mimage_2 used by a mirror
  Error writing data to clvmd: Broken pipe

Mar 19 10:44:05 link-04 kernel: clvmd[13058]: segfault at 0000000000000031 rip 0000003e78870810 rsp 00000000413fded8 error 4
We've seen this segfault when killing different devices on different machines. The offending code was (lib/metadata/metadata.c, around line 1063):

        if (!str_list_match_item(pvids, pvl->pv->dev->pvid)) {
                log_debug("Cached VG %s had incorrect PV list", vgname);

where pvl->pv->dev->pvid is not populated.
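For illustration, the crash pattern above can be sketched with simplified stand-in structures (the real definitions in LVM's lib/metadata/metadata.h and lib/device/device.h differ; the struct layout and the helper name here are hypothetical). The segfault addresses (small offsets like 0x18 and 0x31) are consistent with dereferencing a field through a NULL pointer in the pvl->pv->dev chain when a device record has gone missing; a guarded accessor avoids the crash:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for LVM's internal structures. */
struct device {
    char pvid[39];              /* PV UUID plus terminating NUL */
};

struct physical_volume {
    struct device *dev;         /* may be NULL for a missing/failed PV */
};

struct pv_list {
    struct physical_volume *pv;
};

/* Reading pvl->pv->dev->pvid without checking each pointer crashes when
 * the underlying device has disappeared (e.g. a failed mirror leg).
 * This guarded lookup returns NULL instead of dereferencing NULL. */
static const char *pv_dev_pvid(const struct pv_list *pvl)
{
    if (!pvl || !pvl->pv || !pvl->pv->dev)
        return NULL;            /* no device record: nothing to match */
    return pvl->pv->dev->pvid;
}
```

With such a guard, the cached-VG check could treat a NULL return as "PV list incorrect" and fall back to rescanning, rather than segfaulting inside clvmd.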
Please note that the reproduction in comment #3 may be different from the original post. IOW, there may be two bugs here.
I am calling this a dup of 246630 under the assumption that when the device was brought back in comment #1, either:

1) the device was not brought back on all machines, or
2) the device was not brought back on all machines at the same time, allowing an LVM command to go through while the view of devices in the cluster was not consistent.

*** This bug has been marked as a duplicate of 246630 ***