Bug 1085553

Summary: When lvmetad is used, LVs do not properly report as 'p'artial
Product: Red Hat Enterprise Linux 6
Reporter: Jonathan Earl Brassow <jbrassow>
Component: lvm2
Assignee: Petr Rockai <prockai>
lvm2 sub component: LVM Metadata / lvmetad (RHEL6)
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: agk, cmarthal, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, slevine, zkabelac
Version: 6.5
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: lvm2-2.02.108-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: Information about physical volume availability can be out of date when lvmetad is in use. Consequence: The status string in the output of the 'lvs' command for a RAID volume may be different in identical situations depending on whether lvmetad is used or not (indicating 'r'efresh instead of 'p'artial in the lvmetad case). Fix: The dmeventd volume monitoring daemon now updates physical volume information in lvmetad for devices participating in a RAID array that has encountered an error. Result: If dmeventd is active (which is recommended regardless of this issue), the lvs output is the same in both the lvmetad and non-lvmetad cases. When dmeventd is disabled, it is recommended to run an 'lvscan --cache' for faulty RAID arrays, to ensure up-to-date information in lvs output.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-14 08:25:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1089170, 1089369

Description Jonathan Earl Brassow 2014-04-08 22:18:58 UTC
When using lvmetad, if you fail a device in a RAID LV, the LV does not report as 'p'artial.

[root@bp-01 ~]# off.sh sdf
Turning off sdf
[root@bp-01 ~]# devices vg
  /dev/sdf1: read failed after 0 of 512 at 898381381632: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 898381488128: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
  LV            Attr       Cpy%Sync Devices                                                    
  lv            rwi-a-r---   100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0)
  [lv_rimage_0] iwi-aor---          /dev/sdb1(1)                                               
  [lv_rimage_0] iwi-aor---          /dev/sdf1(0)                                               
  [lv_rimage_1] iwi-aor---          /dev/sdc1(1)                                               
  [lv_rimage_1] iwi-aor---          /dev/sdg1(0)                                               
  [lv_rimage_2] iwi-aor---          /dev/sdd1(1)                                               
  [lv_rimage_2] iwi-aor---          /dev/sdh1(0)                                               
  [lv_rimage_3] iwi-aor---          /dev/sde1(1)                                               
  [lv_rimage_3] iwi-aor---          /dev/sdi1(0)                                               
  [lv_rmeta_0]  ewi-aor---          /dev/sdb1(0)                                               
  [lv_rmeta_1]  ewi-aor---          /dev/sdc1(0)                                               
  [lv_rmeta_2]  ewi-aor---          /dev/sdd1(0)                                               
  [lv_rmeta_3]  ewi-aor---          /dev/sde1(0)                                


If you perform some writes to the device, the kernel notices the problem and then the LVs are reported as 'r'eplace/'r'efresh.
[root@bp-01 ~]# dd if=/dev/zero of=/dev/vg/lv bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.478818 s, 87.6 MB/s
[root@bp-01 ~]# devices vg
  /dev/sdf1: read failed after 0 of 512 at 898381381632: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 898381488128: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
  LV            Attr       Cpy%Sync Devices                                                    
  lv            rwi-a-r-r-   100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0)
  [lv_rimage_0] iwi-aor-r-          /dev/sdb1(1)                                               
  [lv_rimage_0] iwi-aor-r-          /dev/sdf1(0)                                               
  [lv_rimage_1] iwi-aor---          /dev/sdc1(1)                                               
  [lv_rimage_1] iwi-aor---          /dev/sdg1(0)                                               
  [lv_rimage_2] iwi-aor---          /dev/sdd1(1)                                               
  [lv_rimage_2] iwi-aor---          /dev/sdh1(0)                                               
  [lv_rimage_3] iwi-aor---          /dev/sde1(1)                                               
  [lv_rimage_3] iwi-aor---          /dev/sdi1(0)                                               
  [lv_rmeta_0]  ewi-aor-r-          /dev/sdb1(0)                                               
  [lv_rmeta_1]  ewi-aor---          /dev/sdc1(0)                                               
  [lv_rmeta_2]  ewi-aor---          /dev/sdd1(0)                                               
  [lv_rmeta_3]  ewi-aor---          /dev/sde1(0)                                


This is a problem because it causes repair operations to fail.  They fail because if LVM can see the device, it assumes the failed device has returned and therefore only needs a refresh.  If LVM cannot see the device (and the LV carries the 'p'artial flag), then repair will replace it.

The distinction is important.
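
To make the distinction concrete, these are the actions each flag calls for (a sketch using hypothetical vg/lv names, not output from this system):

[root@bp-01 ~]# lvs -o name,lv_attr vg
# health flag 'r' (refresh): the PV is visible again, so reloading the metadata is enough
[root@bp-01 ~]# lvchange --refresh vg/lv
# health flag 'p' (partial): the PV is gone, so the failed leg has to be replaced
[root@bp-01 ~]# lvconvert --repair vg/lv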

Without lvmetad, the 'p'artial flag shows up correctly:
[root@bp-01 ~]# nano /etc/lvm/lvm.conf 
[root@bp-01 ~]# killall -9 lvmetad
[root@bp-01 ~]# devices vg
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
  /dev/sdf1: read failed after 0 of 512 at 898381381632: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 898381488128: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid ceqxK0-V1Pe-i640-BlKJ-bYWW-7iaH-nwmCfW.
  LV            Attr       Cpy%Sync Devices                                                    
  lv            rwi-a-r-p-   100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0)
  [lv_rimage_0] iwi-aor-p-          /dev/sdb1(1)                                               
  [lv_rimage_0] iwi-aor-p-          unknown device(0)                                          
  [lv_rimage_1] iwi-aor---          /dev/sdc1(1)                                               
  [lv_rimage_1] iwi-aor---          /dev/sdg1(0)                                               
  [lv_rimage_2] iwi-aor---          /dev/sdd1(1)                                               
  [lv_rimage_2] iwi-aor---          /dev/sdh1(0)                                               
  [lv_rimage_3] iwi-aor---          /dev/sde1(1)                                               
  [lv_rimage_3] iwi-aor---          /dev/sdi1(0)                                               
  [lv_rmeta_0]  ewi-aor-r-          /dev/sdb1(0)                                               
  [lv_rmeta_1]  ewi-aor---          /dev/sdc1(0)                                               
  [lv_rmeta_2]  ewi-aor---          /dev/sdd1(0)                                               
  [lv_rmeta_3]  ewi-aor---          /dev/sde1(0)

Comment 3 Jonathan Earl Brassow 2014-04-08 22:25:47 UTC
This behavior could have something to do with the way I am killing the device.
# echo offline > /sys/block/$dev/device/state

I believe QA uses some other mechanism.  I don't know if that means this bug should be closed or if a customer would encounter a problem with a device in a similar state.  Has anyone tried pulling the plug on a device to see what happens, rather than using software to emulate failures?

Comment 4 Peter Rajnoha 2014-04-09 07:12:07 UTC
(In reply to Jonathan Earl Brassow from comment #3)
> This behavior could have something to do with the way I am killing the
> device.
> # echo offline > /sys/block/$dev/device/state

The "echo offline" does not generate an event (and the device is gone just in half since the /sysfs content is still there). It's probably better to use "echo 1 > /sys/block/$dev/device/delete" which removes the device completely from the system with the REMOVE event generated.

You can still use the "echo offline", but then you always need to call "pvscan --cache $dev".
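
Put together, the two test approaches would look something like this (a sketch; $dev stands for the SCSI disk, e.g. sdf, and the PV is assumed to sit on its first partition):

# Option A: remove the device completely; the REMOVE uevent lets the udev rules update lvmetad
# echo 1 > /sys/block/$dev/device/delete

# Option B: take the device offline only (no uevent), then update lvmetad by hand
# echo offline > /sys/block/$dev/device/state
# pvscan --cache /dev/${dev}1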

Comment 5 Peter Rajnoha 2014-04-09 07:44:02 UTC
(In reply to Peter Rajnoha from comment #4)
> You can still use the "echo offline", but then you always need to call
> "pvscan --cache $dev".

(...in which case we're not testing the whole path through the udev rules, btw.  So using "echo 1 > ...device/delete" and then rescanning the SCSI bus to bring the device back is probably the correct way to test this completely, with all events and mechanisms included.)
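
A sketch of that complete cycle (the SCSI host number here is an assumption; the right one can be found under /sys/block/$dev/device):

# fail the device with a proper REMOVE event
# echo 1 > /sys/block/$dev/device/delete
# ...run the failure/repair scenario...
# bring the device back by rescanning the SCSI host it hangs off
# echo "- - -" > /sys/class/scsi_host/host3/scan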

Comment 6 Jonathan Earl Brassow 2014-04-09 15:51:56 UTC
Ok, that makes sense.  However, is a udev event generated in all cases when power or connectivity to a drive is lost?  I'm still curious whether a real failure event can look like "echo offline".  If we are sure that real failure events are all handled, then this bug can be closed.

Comment 7 Jonathan Earl Brassow 2014-04-10 16:03:51 UTC
Is there a case where power and connectivity are still available, but the drive throws errors for I/O?  Would that trigger a REMOVE event?

We may need to document the 'pvscan --cache $dev' step for users in those cases - or augment the RAID code to print something sensible or detect the problem.  For the RAID code, this may be as simple as running another check for kernel device status...
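
That kernel device status is already visible from userspace; for example (illustrative output only, the exact dm-raid status fields depend on the kernel):

[root@bp-01 ~]# dmsetup status vg-lv
0 1884160 raid raid1 2 AD 1884160/1884160
# the per-leg health characters ('A' = alive, 'D' = dead/failed) expose the failed
# device even while lvmetad still believes the PV is present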

Comment 8 Petr Rockai 2014-04-10 16:39:59 UTC
If dmeventd is running and monitoring RAID devices and you do something to trigger a failure (echo offline and writing to the device should do), the status string in device mapper should reflect that the leg is offline. When that happens and dmeventd runs lvconvert --repair, the latter will notice the status and mark the LV as missing even if there was no REMOVE event. So in a production system, you should be covered even for "echo offline"-like events. If this doesn't happen, this might be a problem in lvconvert --repair not parsing raid1 status info correctly (this definitely used to work for old-style mirrors).
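
Whether dmeventd is actually watching the LV can be checked and enabled per LV (a sketch; vg/lv is a placeholder):

[root@bp-01 ~]# lvs -o name,seg_monitor vg/lv
  LV   Monitor  
  lv   monitored
# if it is not monitored, turn monitoring on so the repair path described above can run
[root@bp-01 ~]# lvchange --monitor y vg/lv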

Comment 9 Peter Rajnoha 2014-04-22 11:37:45 UTC
Well, actually, it doesn't work quite well with lvmetad - bug #1089170, bug #1089369.

Comment 10 Jonathan Earl Brassow 2014-04-22 19:28:46 UTC
(In reply to Jonathan Earl Brassow from comment #0)
> This is not good, because it causes repair operations to fail.  They fail
> because if the device is seen by LVM, it assumes that the failed device has
> returned - thus, it only needs a refresh.  If the device can't be seen by
> LVM (and has a 'p'artial flag), then it will be replaced by repair.
> 

Bug 1089170 and bug 1089369 are now tested examples of how this is manifested.  We need a solution to this lvmetad problem.

The solution could be to cause a rescan if '--repair' or '--refresh' are used on the command line.  The device must be reread in order to determine if the device is dead or if it was a transient failure.  It is not enough to simply check the kernel status.

Comment 13 Peter Rajnoha 2014-04-23 07:15:34 UTC
So should dmeventd/lvconvert --repair run with lvmetad disabled then?  lvconvert --repair must see the I/O error, and without touching the device directly I can't imagine how we can detect that...

Comment 15 Jonathan Earl Brassow 2014-05-02 04:41:26 UTC
(In reply to Peter Rajnoha from comment #13)
> So should dmeventd/lvconvert --repair run with lvmetad disabled then? The
> lvconvert --repair must see the IO error and without touching the device
> directly, I can't imagine how we can detect that..

Probably a good idea, but then 'lvs' would still be wrong - and it needs to be right for customers to take the appropriate action.  For example, on RAID if I saw a 'r'efresh flag, I would perform a 'lvchange --refresh vg/lv'.  OTOH, if I saw a 'p'artial flag, I would rather perform a 'lvconvert --repair vg/lv'.

'dmeventd' is notified when there is a write failure.  Perhaps there is a rescan command that could be run by dmeventd to inform lvmetad about the issue?
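
A minimal sketch of that idea, reusing the commands already mentioned in comment 4 (exactly how dmeventd would invoke them is still an open question at this point):

# on the write-failure event for vg/lv, re-read the affected devices so lvmetad learns the PV is gone
# pvscan --cache /dev/sdf1
# or, scoped to just the devices backing the LV:
# lvscan --cache vg/lv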

Comment 16 Petr Rockai 2014-05-05 11:05:55 UTC
I have checked the code, and lvconvert --repair for RAID does not take device status into account at all -- a big chunk of code is entirely missing. For old-style mirrors, we check the device-mapper status of mirror LVs and feed that into --repair. When the code was adapted to RAID, this was left out and needs to be added. In fact, this is not related to lvmetad: if a device goes away, you write to the mirror, and the device then comes back, lvconvert --repair will not work either. The only change with lvmetad is that this also happens if the device is still inaccessible at the time of lvconvert --repair.

The problem here is that a lot of the code that existed for mirrors has been duplicated for RAID, and a straightforward fix would make this even worse. So we are in for some refactoring of the status-parsing code so that it works with both old-style mirrors and RAID. Most of the issues should go away then.

Comment 18 Petr Rockai 2014-07-28 15:23:46 UTC
This should be fixed upstream by commit 5dc6671bb550f4b480befee03d234373d08e188a, as long as dmeventd is in use. Non-dmeventd users need to issue 'lvscan --cache' on the affected LV to update the partial/refresh flags on RAID LVs.
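
For such non-dmeventd setups, the manual step after a failure would be (vg/lv standing for the affected RAID LV):

[root@bp-01 ~]# lvscan --cache vg/lv
[root@bp-01 ~]# lvs -a -o name,lv_attr vg
# the health flag should now read 'p' for the LV and for the sub-LVs on the failed PV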

Comment 21 Nenad Peric 2014-08-07 17:04:36 UTC
Marking this one VERIFIED. 
Although this is not the same as a device that is still present in the system but returns I/O errors, 'echo offline' now works in such a way that LVM can see that the LV is partial and, depending on the settings, either replaces the failed device or marks the LV as partial:

[root@tardis-01 ~]# echo offline >/sys/block/sdc/device/state

[root@tardis-01 ~]# lvs -a -o+devices
  /dev/sdc1: read failed after 0 of 512 at 16104947712: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 16105054208: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
  LV               VG          Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                            
  raid1            vg          rwi-a-r---   1.00g                                    100.00           raid1_rimage_0(0),raid1_rimage_1(0)
  [raid1_rimage_0] vg          iwi-aor---   1.00g                                                     /dev/sdb1(1)                       
  [raid1_rimage_1] vg          iwi-aor---   1.00g                                                     /dev/sdc1(1)                       
  [raid1_rmeta_0]  vg          ewi-aor---   4.00m                                                     /dev/sdb1(0)                       
  [raid1_rmeta_1]  vg          ewi-aor---   4.00m                                                     /dev/sdc1(0)                       
  lv_home          vg_tardis01 -wi-ao---- 224.88g                                                     /dev/sda2(12800)                   
  lv_root          vg_tardis01 -wi-ao----  50.00g                                                     /dev/sda2(0)                       
  lv_swap          vg_tardis01 -wi-ao----   4.00g                                                     /dev/sda2(70368)        

[root@tardis-01 ~]# dd if=/dev/zero of=/dev/vg/raid1 count=10
10+0 records in
10+0 records out
5120 bytes (5.1 kB) copied, 0.0295471 s, 173 kB/s

[root@tardis-01 ~]# lvs -a -o+devices
  PV TzPlnL-QIfn-5TAn-PPHH-2eDs-OGzq-HYbGtq not recognised. Is the device missing?
  PV TzPlnL-QIfn-5TAn-PPHH-2eDs-OGzq-HYbGtq not recognised. Is the device missing?
  LV               VG          Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                            
  raid1            vg          rwi-a-r-p-   1.00g                                    100.00           raid1_rimage_0(0),raid1_rimage_1(0)
  [raid1_rimage_0] vg          iwi-aor---   1.00g                                                     /dev/sdb1(1)                       
  [raid1_rimage_1] vg          iwi-aor-p-   1.00g                                                     unknown device(1)                  
  [raid1_rmeta_0]  vg          ewi-aor---   4.00m                                                     /dev/sdb1(0)                       
  [raid1_rmeta_1]  vg          ewi-aor-p-   4.00m                                                     unknown device(0)                  
  lv_home          vg_tardis01 -wi-ao---- 224.88g                                                     /dev/sda2(12800)                   
  lv_root          vg_tardis01 -wi-ao----  50.00g                                                     /dev/sda2(0)                       
  lv_swap          vg_tardis01 -wi-ao----   4.00g                                                     /dev/sda2(70368)                   
[root@tardis-01 ~]# 


with:

kernel 2.6.32-495.el6.x86_64

lvm2-2.02.109-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
lvm2-libs-2.02.109-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
lvm2-cluster-2.02.109-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
udev-147-2.57.el6    BUILT: Thu Jul 24 15:48:47 CEST 2014
device-mapper-1.02.88-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
device-mapper-libs-1.02.88-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
device-mapper-event-1.02.88-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
device-mapper-event-libs-1.02.88-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.109-1.el6    BUILT: Tue Aug  5 17:36:23 CEST 2014

Comment 22 errata-xmlrpc 2014-10-14 08:25:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html