Bug 230437 - clvmd segfault while attempting to up convert to a cmirror
Summary: clvmd segfault while attempting to up convert to a cmirror
Keywords:
Status: CLOSED DUPLICATE of bug 246630
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: lvm2-cluster
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: LVM and device-mapper development team
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-02-28 20:54 UTC by Corey Marthaler
Modified: 2010-01-12 04:05 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-03-26 18:30:31 UTC
Embargoed:


Attachments

Description Corey Marthaler 2007-02-28 20:54:43 UTC
Description of problem:
I'm not sure if this is specifically a clvmd or cmirror issue. 

[root@link-07 ~]# lvconvert -m 1 corey/coreymir1
  Logical volume coreymir1 converted.
[root@link-07 ~]# lvconvert -m 1 corey/coreymir2
  EOF reading CLVMD
  Problem reactivating coreymir2
  Error writing data to clvmd: Broken pipe


Feb 28 14:59:59 link-07 lvm[4809]: Monitoring mirror device corey-coreymir2 for
events
Feb 28 14:59:59 link-07 kernel: clvmd[14593]: segfault at 0000000000000018 rip
000000000041c411 rsp 0000000042802930 error 4


I had two GFS filesystems on top of cmirrors, both doing I/O, when I failed the
primary legs of both cmirrors. The conversion from cmirrors down to linears
appeared to work just fine. I then turned the failed device back on and
proceeded to up convert those linears back to cmirrors (on link-07). The first
one worked but the second one failed. Now it appears that clvmd is no longer
running on link-07, yet it still shows up as a cluster service on all 4 nodes in
that cluster.

During this failure scenario, I also had the lvm test activator running, which
basically creates a bunch of different types of lvm volumes and then deactivates
and reactivates them over and over.


[root@link-07 ~]# lvs
  connect() failed on local socket: Connection refused
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  Skipping clustered volume group corey
  Skipping clustered volume group activator4
  Skipping clustered volume group activator3
  Skipping clustered volume group activator2
  Skipping clustered volume group activator1
  LV       VG         Attr   LSize  Origin Snap%  Move Log Copy%
  LogVol00 VolGroup00 -wi-ao 72.38G
  LogVol01 VolGroup00 -wi-ao  1.94G



[root@link-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[1 2 4 3]

DLM Lock Space:  "clvmd"                           214 110 run       -
[1 3 2 4]

DLM Lock Space:  "gfs1"                            223 116 run       -
[1 2 3 4]

DLM Lock Space:  "gfs2"                            225 118 run       -
[1 2 3]

DLM Lock Space:  "clustered_log"                   252 129 run       -
[1 2 3 4]

GFS Mount Group: "gfs1"                            224 117 run       -
[1 2 3 4]

GFS Mount Group: "gfs2"                            226 119 run       -
[1 2 3]




[root@link-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 1 4 3]

DLM Lock Space:  "clvmd"                           214   4 run       -
[1 3 2 4]

DLM Lock Space:  "gfs1"                            223  10 run       -
[2 1 3 4]

DLM Lock Space:  "gfs2"                            225  12 run       -
[2 1 3]

DLM Lock Space:  "clustered_log"                   252  23 run       -
[2 1 3 4]

GFS Mount Group: "gfs1"                            224  11 run       -
[2 1 3 4]

GFS Mount Group: "gfs2"                            226  13 run       -
[2 1 3]




[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 4 1 3]

DLM Lock Space:  "clvmd"                           214 110 run       -
[1 2 3 4]

DLM Lock Space:  "gfs1"                            223 116 run       -
[1 2 3 4]

DLM Lock Space:  "gfs2"                            225 118 run       -
[1 2 3]

DLM Lock Space:  "clustered_log"                   252 129 run       -
[2 1 3 4]

GFS Mount Group: "gfs1"                            224 117 run       -
[1 2 3 4]

GFS Mount Group: "gfs2"                            226 119 run       -
[1 2 3]



[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[4 2 1 3]

DLM Lock Space:  "clvmd"                           214 110 run       -
[1 2 3 4]

DLM Lock Space:  "gfs1"                            223 116 run       -
[1 2 3 4]

DLM Lock Space:  "clustered_log"                   252 129 run       -
[1 2 3 4]

GFS Mount Group: "gfs1"                            224 117 run       -
[1 2 3 4]








Version-Release number of selected component (if applicable):
2.6.9-48.ELsmp
lvm2-2.02.21-3.el4
lvm2-cluster-2.02.21-3.el4
device-mapper-1.02.17-2.el4
cmirror-kernel-smp-2.6.9-23.0

Comment 1 Jonathan Earl Brassow 2007-03-01 15:57:30 UTC
Give me the core file for clvmd and I'll tell you exactly where it failed.


Comment 2 Jonathan Earl Brassow 2007-03-14 04:06:50 UTC
A reproduction would be good; a core file would be even better.


Comment 3 Corey Marthaler 2007-03-19 21:04:51 UTC
Just a note that I reproduced this today. I still couldn't find a core, and I
believe the core file size ulimit was set to 0, which would explain why none
was written (see the note after the log below).

 [root@link-04 ~]# lvremove -f corey
  Inconsistent metadata copies found - updating to use version 27
  EOF reading CLVMD
  Can't get exclusive access to volume "tre"
  Can't remove logical volume tre_mlog used as mirror log
  Can't remove logical volume tre_mimage_1 used by a mirror
  Can't remove logical volume tre_mimage_2 used by a mirror
  Error writing data to clvmd: Broken pipe

Mar 19 10:44:05 link-04 kernel: clvmd[13058]: segfault at 0000000000000031 rip
0000003e78870810 rsp 00000000413fded8 error 4
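
For context, the missing core in comment #3 is expected when the core file size
ulimit is 0: the kernel simply skips writing a core on the segfault. The
following is a minimal, purely illustrative C sketch of the mechanism behind
"ulimit -c" (it is not clvmd code); raising RLIMIT_CORE in the environment that
starts clvmd, e.g. "ulimit -c unlimited" in the launching shell, would let the
next crash leave a core file.

    /* Illustrative only: show and raise the core dump limit for the
     * current process so a crash can produce a core file. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
            struct rlimit rl;

            if (getrlimit(RLIMIT_CORE, &rl) != 0) {
                    perror("getrlimit");
                    return 1;
            }
            printf("core limit: soft=%llu hard=%llu\n",
                   (unsigned long long) rl.rlim_cur,
                   (unsigned long long) rl.rlim_max);

            /* A soft limit of 0 suppresses core files entirely;
             * raise it to the hard limit. */
            rl.rlim_cur = rl.rlim_max;
            if (setrlimit(RLIMIT_CORE, &rl) != 0) {
                    perror("setrlimit");
                    return 1;
            }
            return 0;
    }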


Comment 4 Jonathan Earl Brassow 2007-04-10 13:39:21 UTC
We've seen this segfault when killing different devices on different machines.

The offending code was:

lib/metadata/metadata.c:~1063

			if (!str_list_match_item(pvids, pvl->pv->dev->pvid)) {
				log_debug("Cached VG %s had incorrect PV list",
					  vgname);

Where pvl->pv->dev->pvid is not populated.
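
The crash, then, comes from following pvl->pv->dev when the device behind a
cached PV has disappeared. As a purely illustrative sketch (not the actual
patch; the struct definitions below are simplified stand-ins for LVM's
internal types, not the real headers), any comparison against dev->pvid needs
a guard like this before dereferencing:

    #include <stdio.h>

    /* Simplified stand-ins for LVM's internal structures. */
    struct device {
            char pvid[40];              /* illustrative buffer size */
    };

    struct physical_volume {
            struct device *dev;         /* NULL once the device vanishes */
    };

    struct pv_list {
            struct physical_volume *pv;
    };

    /* Return 1 only if the cached PV still has a device with a pvid we
     * can safely compare; a NULL pv or dev means the cached VG is stale. */
    static int cached_pv_has_pvid(const struct pv_list *pvl)
    {
            if (!pvl || !pvl->pv || !pvl->pv->dev)
                    return 0;
            return pvl->pv->dev->pvid[0] != '\0';
    }

    int main(void)
    {
            struct physical_volume pv = { .dev = NULL };
            struct pv_list pvl = { .pv = &pv };

            /* With dev == NULL this prints 0 instead of segfaulting. */
            printf("%d\n", cached_pv_has_pvid(&pvl));
            return 0;
    }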


Comment 5 Jonathan Earl Brassow 2007-04-16 20:32:07 UTC
Please note that the reproduction in comment #3 may be different from the
original post. In other words, there may be two bugs here.


Comment 6 Jonathan Earl Brassow 2008-03-26 18:30:31 UTC
I am calling this a dup of 246630 under the assumption that when the device was
brought back in comment #1, either:
1) the device was not brought back on all machines, or
2) the device was not brought back on all machines at the same time, allowing an
LVM command to go through while the view of devices in the cluster was not
consistent.


*** This bug has been marked as a duplicate of 246630 ***

