Bug 236432
Summary: Failures to multiple cmirrors on multiple VGs cause problems
Product: [Retired] Red Hat Cluster Suite
Component: lvm2-cluster
Version: 4
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agk, dwysocha, jbrassow, mbroz, prockai
Target Milestone: ---
Target Release: ---
Fixed In Version: 2.02.21-7
Doc Type: Bug Fix
Last Closed: 2008-02-28 13:30:10 UTC
Description
Corey Marthaler
2007-04-13 20:57:48 UTC
clvmd dying

Patrick, here is how I set this up. I created 3 VGs, each with at least 3 PVs, so that I can have a cmirror created out of each VG. I also partitioned one of the devices (in this case /dev/sdh) into three pieces, so that there is a piece in each VG. When I fail /dev/sdh, all three cmirrors, in all three VGs, experience a failure at the same time.

[root@link-08 ~]# pvscan
  PV /dev/sdh3   VG vg3          lvm2 [402.62 GB / 392.62 GB free]
  PV /dev/sde1   VG vg3          lvm2 [135.66 GB / 125.66 GB free]
  PV /dev/sdf1   VG vg3          lvm2 [135.66 GB / 135.65 GB free]
  PV /dev/sdh2   VG vg2          lvm2 [402.62 GB / 392.62 GB free]
  PV /dev/sdc1   VG vg2          lvm2 [135.66 GB / 125.66 GB free]
  PV /dev/sdd1   VG vg2          lvm2 [135.66 GB / 135.65 GB free]
  PV /dev/hda2   VG VolGroup00   lvm2 [74.41 GB / 96.00 MB free]
  PV /dev/sdh1   VG vg1          lvm2 [402.62 GB / 392.62 GB free]
  PV /dev/sda1   VG vg1          lvm2 [135.66 GB / 125.66 GB free]
  PV /dev/sdb1   VG vg1          lvm2 [135.66 GB / 135.65 GB free]
  PV /dev/sdg1                   lvm2 [135.66 GB]
  Total: 11 [2.18 TB] / in use: 10 [2.05 TB] / in no VG: 1 [135.66 GB]

[root@link-08 ~]# vgs -a -o +devices
  VG         #PV #LV #SN Attr   VSize   VFree   Devices
  VolGroup00   1   2   0 wz--n-  74.41G  96.00M /dev/hda2(0)
  VolGroup00   1   2   0 wz--n-  74.41G  96.00M /dev/hda2(2316)
  vg1          3   4   0 wz--nc 673.93G 653.93G mirror1_mimage_0(0),mirror1_mimage_1(0)
  vg1          3   4   0 wz--nc 673.93G 653.93G /dev/sdb1(0)
  vg1          3   4   0 wz--nc 673.93G 653.93G /dev/sdh1(0)
  vg1          3   4   0 wz--nc 673.93G 653.93G /dev/sda1(0)
  vg2          3   4   0 wz--nc 673.93G 653.93G mirror2_mimage_0(0),mirror2_mimage_1(0)
  vg2          3   4   0 wz--nc 673.93G 653.93G /dev/sdd1(0)
  vg2          3   4   0 wz--nc 673.93G 653.93G /dev/sdh2(0)
  vg2          3   4   0 wz--nc 673.93G 653.93G /dev/sdc1(0)
  vg3          3   4   0 wz--nc 673.93G 653.93G mirror3_mimage_0(0),mirror3_mimage_1(0)
  vg3          3   4   0 wz--nc 673.93G 653.93G /dev/sdf1(0)
  vg3          3   4   0 wz--nc 673.93G 653.93G /dev/sdh3(0)
  vg3          3   4   0 wz--nc 673.93G 653.93G /dev/sde1(0)

[root@link-08 ~]# lvs -a -o +devices
  LV                 VG         Attr   LSize  Origin Snap% Move Log          Copy%  Devices
  LogVol00           VolGroup00 -wi-ao 72.38G                                       /dev/hda2(0)
  LogVol01           VolGroup00 -wi-ao  1.94G                                       /dev/hda2(2316)
  mirror1            vg1        Mwi-ao 10.00G                   mirror1_mlog 100.00 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg1        iwi-ao 10.00G                                       /dev/sdh1(0)
  [mirror1_mimage_1] vg1        iwi-ao 10.00G                                       /dev/sda1(0)
  [mirror1_mlog]     vg1        lwi-ao  4.00M                                       /dev/sdb1(0)
  mirror2            vg2        Mwi-ao 10.00G                   mirror2_mlog 100.00 mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg2        iwi-ao 10.00G                                       /dev/sdh2(0)
  [mirror2_mimage_1] vg2        iwi-ao 10.00G                                       /dev/sdc1(0)
  [mirror2_mlog]     vg2        lwi-ao  4.00M                                       /dev/sdd1(0)
  mirror3            vg3        Mwi-ao 10.00G                   mirror3_mlog 100.00 mirror3_mimage_0(0),mirror3_mimage_1(0)
  [mirror3_mimage_0] vg3        iwi-ao 10.00G                                       /dev/sdh3(0)
  [mirror3_mimage_1] vg3        iwi-ao 10.00G                                       /dev/sde1(0)
  [mirror3_mlog]     vg3        lwi-ao  4.00M                                       /dev/sdf1(0)

I created a gfs filesystem on each mirror and mounted each on all three nodes, then started I/O to each gfs before the failure.

/dev/mapper/vg1-mirror1 on /mnt/1 type gfs (rw)
/dev/mapper/vg2-mirror2 on /mnt/2 type gfs (rw)
/dev/mapper/vg3-mirror3 on /mnt/3 type gfs (rw)

The I/O that I used was xdoio/xiogen, but I'm sure any will work. Now I fail /dev/sdh. The automatic lvconvert to linear usually works, but when I bring the device back online and convert the linears back to mirrors, it can hit either this bug or bug 236289, though not every time. It seems to hit this bug faster if you use 3-legged (-m 2) mirrors. There appears to be some discrepancy between the devices that are registered and those that are recovered (according to my log files).
Sometimes I get all three mirrors registered but only two recovered - for me this is the most common situation. Other times it seems that only two mirrors are registered but three are recovered... that can't be right (though it might explain a crash!), but I've not seen that since I put more logging in (hah). What gives you the impression there is a discrepancy (what's in the log file)?

I believe that I hit this bug this morning. I had three 3-way mirrors (2 corelog and 1 disklog), all on different VGs, and after failing a leg of each of the three mirrors, clvmd died on one node only and caused everything else to lock up.

Apr 20 11:20:42 link-07 lvm[6305]: Locking tfUW1A1zzWkxldikE6dQLZop1Se6pcdUH1T0jzc26AlGUkpq2TYiSolF8xP12loT at 0x9e
Apr 20 11:20:42 link-07 kernel: dm-cmirror: Removing 8xP12loT (1)
Apr 20 11:20:42 link-07 kernel: dm-cmirror: 2 mark requests remain at cluster log deactivation
Apr 20 11:20:42 link-07 kernel: dm-cmirror: 0 region user structures freed
Apr 20 11:20:43 link-07 lvm[6305]: dm info LVM-tfUW1A1zzWkxldikE6dQLZop1Se6pcdUH1T0jzc26AlGUkpq2TYiSolF8xP12loT OF [16384]
Apr 20 11:20:43 link-07 lvm[6305]: Monitoring mirror device V1-cm1 for events
Apr 20 11:20:43 link-07 lvm[6305]: Another thread is handling an event. Waiting...
Apr 20 11:20:43 link-07 kernel: SCSI error : <2 0 1 1> return code = 0x10000
Apr 20 11:20:43 link-07 kernel: end_request: I/O error, dev sdh, sector 855140367
Apr 20 11:20:43 link-07 kernel: dm-cmirror: LOG INFO:
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   uuid: LVM-tfUW1A1zzWkxldikE6dQLZop1Se6pcdUH1T0jzc26AlGUkpq2TYiSolF8xP12loT
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   uuid_ref    : 2
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   region_count: 20480
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   sync_count  : 20480
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   sync_search : 20481
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   in_sync     : YES
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   server_id   : 4
Apr 20 11:20:43 link-07 kernel: dm-cmirror:   server_valid: YES
Apr 20 11:20:43 link-07 kernel: dm-cmirror: Creating t2VhLaPk (2)
Apr 20 11:20:43 link-07 kernel: dm-cmirror: server_id=dead, server_valid=0, t2VhLaPk
Apr 20 11:20:43 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 11:20:43 link-07 kernel: dm-cmirror: Performing flush to work around bug 235040
Apr 20 11:20:43 link-07 kernel: dm-cmirror: Log flush complete
Apr 20 11:20:43 link-07 kernel: clvmd[2730]: segfault at 0000000000000010 rip 00000000004255f2 rsp 0000000041e01da0 error 4
Apr 20 11:20:43 link-07 lvm[6305]: EOF reading CLVMD
Apr 20 11:20:43 link-07 lvm[6305]: Problem reactivating cm1

I'm hitting this, but finding it hard to gather information. From messages:

Apr 20 14:47:35 neo-06 lvm[3515]: LV percent: 100.000000
Apr 20 14:47:35 neo-06 lvm[3515]: Reducing mirror set from 3 to 2 image(s).
...
Apr 20 14:47:36 neo-06 lvm[3515]: Pre-Committing vg2 metadata (6) to /dev/sdf2 header at 4096
Apr 20 14:47:36 neo-06 lvm[3515]: Locking pONgvBECf3UsRzsBpSrG6ihA7z7f8dL4oojjpoWuyUWnZeRffmvyyVTcXnY1uE7S at 0x9c
Apr 20 14:47:36 neo-06 lvm[3515]: Error writing data to clvmd: Broken pipe
Apr 20 14:47:36 neo-06 lvm[3515]: Failed to lock lv2

From clvmd -d:

CLVMD[b7580bb0]: Apr 20 14:47:35 do_lock_lv: resource 'iKN02g00wrkWjskQfw29eWflbN7aM9aBw68oGB4UmPwGZe394dnIj2QvFGd3sZQE', cmd = 0x1e, flags = 84
CLVMD[b7580bb0]: Apr 20 14:47:35 do_resume_lv, lock not already held
CLVMD[b7580bb0]: Apr 20 14:47:35 Command return is 0
CLVMD[b7580bb0]: Apr 20 14:47:35 LVM thread waiting for work
*** last message ***

If the timestamps are correct, clvmd went away 1 second before the broken pipe. There is no indication of *why* it went away.

On each machine, failure events from the kernel will trigger dmeventd to start vgreduce. Each machine processes the events serially, and each vgreduce on a volume group happens serially (due to VG locking). However, if you are really unlucky, you could have each machine doing a vgreduce on a different volume group at the same time. That doesn't necessarily point to any bugs... just something to be aware of.

From device-mapper/dmeventd/dmeventd.c:

/*
 * Check against filter.
 *
 * If there's current events delivered from _event_wait() AND
 * the device got registered for those events AND
 * those events haven't been processed yet, call
 * the DSO's process_event() handler.
 */

This definitely isn't happening correctly... but shouldn't hurt us (I don't think).

Created attachment 153217 [details]
Backtrace of failed clvmd process
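For context on the thread-collision theory running through these comments: the symptom pattern (a pointer that is normally valid suddenly containing garbage while clvmd segfaults) is characteristic of two threads calling into code that was written for a single caller at a time. Below is a minimal, standalone C illustration of that failure class; it is not clvmd code and every name in it is hypothetical.

/* Minimal illustration of the suspected failure class: two threads call a
 * helper that frees and replaces shared state without locking, so one
 * thread can read memory the other has just freed. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *shared_state;               /* stand-in for library-internal data */

/* Not thread-safe: mimics a lookup that rebuilds cached state on each call. */
static const char *lookup(const char *name)
{
        free(shared_state);
        shared_state = strdup(name);     /* another thread may still be reading */
        return shared_state;
}

static void *worker(void *arg)
{
        for (int i = 0; i < 100000; i++) {
                const char *s = lookup((const char *)arg);
                /* With two workers this can dereference freed memory and crash,
                 * much like clvmd does under concurrent mirror failure events. */
                if (!strlen(s))
                        fprintf(stderr, "corrupted result\n");
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, "vg1");
        pthread_create(&t2, NULL, worker, "vg2");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}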
While doing a LCK_LV_RESUME, the contents of the vg are bad:

(gdb) frame 4
#4  0x08073140 in _read_vg (fid=0x9d18a50, cft=0x9d8f3a8) at format_text/import_vsn1.c:756
756             log_error("Couldn't find all physical volumes for volume "
(gdb) p vg
$1 = (struct volume_group *) 0x9d1c238
(gdb) p *vg
$2 = {cmd = 0xdeaddead, fid = 0xdeaddead, seqno = 3735936685, id = {uuid = "�"},
  name = 0xdeaddead <Address 0xdeaddead out of bounds>,
  system_id = 0xdeaddead <Address 0xdeaddead out of bounds>, status = 3735936685,
  alloc = 3735936685, extent_size = 3735936685, extent_count = 3735936685,
  free_count = 3735936685, max_lv = 3735936685, max_pv = 3735936685,
  pv_count = 3735936685, pvs = {n = 0xdeaddead, p = 0xdeaddead},
  lv_count = 3735936685, snapshot_count = 3735936685,
  lvs = {n = 0xdeaddead, p = 0xdeaddead}, tags = {n = 0xdeaddead, p = 0xdeaddead}}

An overlapping thread poisoning our memory?

It's unlikely to be threads overlapping. clvmd runs all LVM operations in a single thread.

Created attachment 153283 [details]
clvmd dies again... here's the backtrace.
(gdb) list
687             return dev->fd;
688     }
689
690     const char *dev_name(const struct device *dev)
691     {
692             return (dev) ? list_item(dev->aliases.n, struct str_list)->str :
693                            "unknown device";
694     }
(gdb) p dev
$1 = (const struct device *) 0xdeaddead

I'm saving these core files, so anytime you want to look at them...

Able to reproduce with 2-way mirrors.

Created attachment 153299 [details]
evidence of threads colliding?
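A note on reading the gdb output above: every pointer is 0xdeaddead and every integer is 3735936685, which is 0xdeaddead in decimal. That uniform pattern strongly suggests the structure was freed and stamped by a poison-on-free debugging scheme, then dereferenced again, rather than being randomly corrupted. A sketch of the technique is below; the helper name is hypothetical and this is not LVM's actual allocator code.

/* Poison-on-free sketch: stamp freed memory with a recognizable word so a
 * later use-after-free shows up in a debugger as 0xdeaddead everywhere,
 * much like the volume_group dump above. Hypothetical helper, not LVM code. */
#include <stdint.h>
#include <stdlib.h>

#define POISON 0xdeaddeadU              /* 3735936685 in decimal */

static void poison_free(void *p, size_t size)
{
        uint32_t *w = p;
        size_t n = size / sizeof(*w);

        /* Fill the allocation (whole words only) with the poison pattern. */
        for (size_t i = 0; i < n; i++)
                w[i] = POISON;

        free(p);
}

If a stale pointer is dereferenced after its memory has been released through something like this, the debugger prints exactly the poison values seen above - good evidence of one thread releasing metadata that another thread is still using.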
Created attachment 153335 [details]
Add locking around lvm calls.
Yep. Dead right. This innocent-looking call in post_lock_lv() is the culprit:
if (!lv_info_by_lvid(cmd, resource, &lvi, 0))
Short-term I think putting a mutex around that should fix it. Longer-term I'd
like to find a non-locking fix.
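A minimal sketch of what that short-term fix could look like, assuming a pthread mutex shared with clvmd's LVM worker thread. Only the lv_info_by_lvid() call above is taken from this bug; the mutex name, wrapper function, and return handling are illustrative, not the committed patch.

#include <pthread.h>

/* struct cmd_context, struct lvinfo and lv_info_by_lvid() come from the
 * LVM2 tree; everything else in this fragment is an illustrative sketch. */
static pthread_mutex_t lvm_thread_mutex = PTHREAD_MUTEX_INITIALIZER;

static int query_lv_state(struct cmd_context *cmd, const char *resource)
{
        struct lvinfo lvi;
        int found;

        /* Serialize with the LVM worker thread so the metadata structures it
         * owns cannot be freed or rewritten underneath this call. */
        pthread_mutex_lock(&lvm_thread_mutex);
        found = lv_info_by_lvid(cmd, resource, &lvi, 0);
        pthread_mutex_unlock(&lvm_thread_mutex);

        return found;           /* non-zero if the LV information was read */
}

For this to help, the LVM worker thread would have to take the same mutex around its own operations; the fix that was eventually checked in (lvm-functions.c revision 1.29, noted at the end of this bug) took the mutex approach rather than making the underlying code re-entrant.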
This patch fixes things for me, though I'm seeing the lockup bug in 236289
rather than the segfault.
Mutex fix checked in:

Checking in daemons/clvmd/lvm-functions.c;
/cvs/lvm2/LVM2/daemons/clvmd/lvm-functions.c,v  <--  lvm-functions.c
new revision: 1.29; previous revision: 1.28
done

Marking this bug verified; it hasn't been seen in over six months since the fix was added.