Bug 602653

Summary:	qemu image corruption probably after power failure on all vms (iscsi)
Product:	Red Hat Enterprise Linux 5	Reporter:	Moran Goldboim <mgoldboi>
Component:	kvm	Assignee:	chellwig <chellwig>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.5	CC:	kwolf, llim, michael.hagmann, mkenneth, tburke, virt-maint, ykaul
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-11-25 14:31:33 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	580949

Description Moran Goldboim 2010-06-10 12:12:08 UTC

Description of problem:
After a power failure, the images of all VMs (200!), all created from the same template, became corrupted to different degrees and in different places.
The log indicated an unknown storage error (EIO).
The power failure hit both the hosts and the storage.
topology: 
storage-iscsi SUN,SOLARIS
vm disk: size 15G cow, sparse IDE

qemu-img check  -f qcow2 /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/f1c50823-992b-4e40-a0c3-ed6fa48f7915/66b27f07-b569-4fea-96fe-6973557a5d54
ERROR cluster 2459 refcount=1 reference=0
ERROR cluster 3590 refcount=1 reference=0
2 errors were found on the image.
[root@silver-vdsd ~]# qemu-img check -f qcow2 /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/5e152b07-acaa-46b6-a06f-e9923084c518/c255fd8e-afbd-4beb-b6a2-1d453070b539
ERROR cluster 2116 refcount=1 reference=0
1 errors were found on the image. 

Version-Release number of selected component (if applicable):
Host:
kvm-83-164.el5_5.9
kernel-2.6.18-194.3.1.el5
device-mapper-1.02.39-1.el5_5.2
Guest:
kernel-2.6.18-194.3.1.el5


How reproducible:
happened once

Additional info:
cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.23
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.06
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.06
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.06
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 16
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.16
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 18
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.27
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 20
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.33
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping        : 5
cpu MHz         : 2000.118
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 22
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 4000.17
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

Comment 1 Kevin Wolf 2010-06-11 07:26:56 UTC

(In reply to comment #0)
> ERROR cluster 2459 refcount=1 reference=0
> ERROR cluster 3590 refcount=1 reference=0
> 2 errors were found on the image.
>
> ERROR cluster 2116 refcount=1 reference=0
> 1 errors were found on the image. 

Is this BZ only about these qemu-img check messages or do you notice real breakage when running the VMs? These messages are just about leaked clusters, which are both expected and harmless (and actually unavoidable in case of power loss).
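
The distinction drawn in this comment can be checked mechanically. What follows is a minimal sketch, not part of qemu: it assumes the legacy "ERROR cluster N refcount=R reference=F" line format quoted above, and assumes that a stored refcount larger than the number of references found means a leaked cluster, while a smaller one means an under-referenced cluster that may indicate real corruption. The function name classify_refcount_errors is hypothetical.

```shell
# Classify legacy "qemu-img check" refcount lines read from stdin.
#   refcount  = value stored in the image's refcount table
#   reference = number of references the checker actually found
# Assumption: refcount > reference is a leak, refcount < reference is not.
classify_refcount_errors() {
    awk '
        /^ERROR cluster [0-9]+ refcount=[0-9]+ reference=[0-9]+/ {
            split($4, rc, "=")
            split($5, rf, "=")
            if (rc[2] + 0 > rf[2] + 0) {
                leaked++
                printf "cluster %s: leaked (space wasted, data intact)\n", $3
            } else if (rc[2] + 0 < rf[2] + 0) {
                corrupt++
                printf "cluster %s: under-referenced (possible corruption)\n", $3
            }
        }
        # Uninitialized counters print as 0.
        END { printf "leaked=%d corrupt=%d\n", leaked, corrupt }
    '
}
```

Usage would look like `qemu-img check -f qcow2 IMAGE 2>&1 | classify_refcount_errors`, with IMAGE standing in for one of the image paths from comment #0. A result with only leaked clusters says nothing about whether the guest data itself survived the power loss.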

Comment 2 Moran Goldboim 2010-06-15 15:46:00 UTC

None of the VMs boot: some fail and require running fsck (which doesn't succeed), others hit a kernel panic, and others only bring up grub.

Comment 3 chellwig@redhat.com 2010-06-16 08:34:57 UTC

What does:

dmesg | grep "Write cache"

say on the affected host system?

Comment 4 Moran Goldboim 2010-06-16 10:21:38 UTC

[root@silver-vdsd ~]#  sdparm --get WCE /dev/dm-3
    /dev/dm-3: SUN       SOLARIS           1
WCE error (try adding '-vv') in Caching (SBC) mode page
[root@silver-vdsd ~]#  sdparm --get WCE /dev/dm-3 -vv
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/dm-3
    inquiry cdb: 12 00 00 00 24 00
    /dev/dm-3: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page
[root@silver-vdsd ~]#  sdparm --get WCE /dev/dm-4 -vv
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/dm-4
    inquiry cdb: 12 00 00 00 24 00
    /dev/dm-4: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page
[root@silver-vdsd ~]#  sdparm --get WCE /dev/dm-2 -vv
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/dm-2
    inquiry cdb: 12 00 00 00 24 00
    /dev/dm-2: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page

Comment 5 chellwig@redhat.com 2010-06-16 10:31:21 UTC

The command needs to be run on the underlying /dev/sd* devices, not the device mapper devices.
Just do it on all devices showing up in lsscsi output, or use pvdisplay to figure out what devices belong to the volume group.
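
As an alternative to reading lsscsi or pvdisplay output by eye, the devices underneath a device-mapper node can usually be listed from sysfs. This is only a sketch: it assumes the running kernel exposes a slaves directory under /sys/block/dm-N, and the function name dm_slaves and its optional sysfs-root argument are hypothetical.

```shell
# Print the block devices directly underneath a device-mapper node.
#   $1: dm device name, e.g. dm-3
#   $2: sysfs root (defaults to /sys; overridable only to ease testing)
dm_slaves() {
    dev=$1
    sysfs=${2:-/sys}
    dir="$sysfs/block/$dev/slaves"
    if [ ! -d "$dir" ]; then
        echo "no slaves directory for $dev" >&2
        return 1
    fi
    for s in "$dir"/*; do
        # Skip the literal pattern when the directory is empty.
        [ -e "$s" ] || continue
        echo "/dev/$(basename "$s")"
    done
}
```

With that in place, `for d in $(dm_slaves dm-3); do sdparm --get WCE "$d"; done` would query each path device of dm-3. If an entry is itself a dm device (LVM on top of multipath), the function has to be applied once more to that entry.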

Comment 6 Moran Goldboim 2010-06-17 10:00:27 UTC

pvdisplay output:
[root@silver-vdsd new_kvm]# pvdisplay
  --- Physical volume ---
  PV Name               /dev/mapper/3600144f04b79233900003048344a6b00
  VG Name               8900978c-e842-4037-8f04-c9a740793a13
  PV Size               100.00 GB / not usable 128.00 MB
  Allocatable           yes
  PE Size (KByte)       131072
  Total PE              799
  Free PE               135
  Allocated PE          664
  PV UUID               YpjYcC-Jxc6-dvaJ-zIih-SJi4-eoJc-aqkKMf

  --- Physical volume ---
  PV Name               /dev/mapper/3600144f04b79235600003048344a6b00
  VG Name               8900978c-e842-4037-8f04-c9a740793a13
  PV Size               100.00 GB / not usable 128.00 MB
  Allocatable           yes
  PE Size (KByte)       131072
  Total PE              799
  Free PE               88
  Allocated PE          711
  PV UUID               XXgJOV-SY6i-Q9Th-Pxwb-Ywlk-q26O-KRfDbr

  --- Physical volume ---
  PV Name               /dev/mapper/3600144f04b82906100003048344a6b00
  VG Name               8900978c-e842-4037-8f04-c9a740793a13
  PV Size               300.00 GB / not usable 128.00 MB
  Allocatable           yes
  PE Size (KByte)       131072
  Total PE              2399
  Free PE               269
  Allocated PE          2130
  PV UUID               qgLD7h-cBoG-8Omt-IzK7-XRf2-dojt-KwprgF

  --- Physical volume ---
  PV Name               /dev/sda2
  VG Name               vg0
  PV Size               136.63 GB / not usable 5.83 MB
  Allocatable           yes
  PE Size (KByte)       32768
  Total PE              4372
  Free PE               3122
  Allocated PE          1250
  PV UUID               EPKn4x-ow5d-DYl7-S9BZ-z30t-FK2p-zD4qUt

Since the problematic VG was 8900978c-e842-4037-8f04-c9a740793a13, on which devices should I run the "sdparm --get WCE" command?

Comment 7 chellwig@redhat.com 2010-06-17 10:26:29 UTC

So the LVM volumes are stacked again on device mapper, I assume multipath.

Just run

for i in /dev/sd?; do sdparm --get WCE "$i"; done

please.

Comment 8 Moran Goldboim 2010-06-21 09:11:59 UTC

[root@silver-vdsd ~]# for i in /dev/sd?; do sdparm -vv --get WCE $i; done
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sda
    inquiry cdb: 12 00 00 00 24 00
    /dev/sda: IBM-ESXS  CBRCA146C3ETS0 N  C370
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 24 00
    mode sense (10) cdb: 5a 00 48 00 00 00 00 00 24 00
    mode sense (10) cdb: 5a 00 88 00 00 00 00 00 24 00
    mode sense (10) cdb: 5a 00 c8 00 00 00 00 00 24 00
WCE         0  [cha: y, def:  0, sav:  0]
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sdb
    inquiry cdb: 12 00 00 00 24 00
    /dev/sdb: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sdc
    inquiry cdb: 12 00 00 00 24 00
    /dev/sdc: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sdd
    inquiry cdb: 12 00 00 00 24 00
    /dev/sdd: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page

[root@silver-vdse ~]# for i in /dev/sd?; do sdparm -vv --get WCE $i; done
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sda
    inquiry cdb: 12 00 00 00 24 00
    /dev/sda: IBM-ESXS  ST9146803SS       B536
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 24 00
    mode sense (10) cdb: 5a 00 48 00 00 00 00 00 24 00
    mode sense (10) cdb: 5a 00 88 00 00 00 00 00 24 00
    mode sense (10) cdb: 5a 00 c8 00 00 00 00 00 24 00
WCE         0  [cha: y, def:  0, sav:  0]
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sdc
    inquiry cdb: 12 00 00 00 24 00
    /dev/sdc: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sde
    inquiry cdb: 12 00 00 00 24 00
    /dev/sde: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0  num=1
  [0x8,0x0]  pdt=0 start_byte=0x2 start_bit=2 num_bits=1  val=0  acronym: WCE
>>> about to open device name: /dev/sdf
    inquiry cdb: 12 00 00 00 24 00
    /dev/sdf: SUN       SOLARIS           1
    mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]

WCE error in Caching (SBC) mode page

Comment 11 chellwig@redhat.com 2010-09-23 14:29:26 UTC

Looks like reading WCE failed for every "SOLARIS" device.  I wonder if we treat that as a disabled write cache when it actually isn't.

What does:

for i in /sys/class/scsi_disk/*/cache_type; do
  echo "$i: $(cat "$i")"
done

say?

Btw, what layers do you have between the underlying /dev/sd* devices and the qcow2 images?  dm-multipath was mentioned, and given the pathnames a filesystem is probably used.  Does it also use LVM?

Either way, none of the dm targets in RHEL5 support barriers, and the default ext3 filesystem doesn't use them either.  Is there any way to find out what kind of caching the "SOLARIS" target claims to implement?  So far I think the most likely culprit is at the target level, be it caching related or not.
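
The cache_type check requested in this comment can also be written so that it labels each value. The following is a rough sketch under stated assumptions: that cache_type reads "write back", "write through" or "none", and that the function name report_cache_types and its optional sysfs-root argument are hypothetical. As suspected above, the value shown may be a fallback chosen by the host when the mode page could not be read, so a "write through" result for the "SOLARIS" devices would not prove that the target has no volatile cache.

```shell
# Report each SCSI disk's cache_type and label what it implies.
#   $1: sysfs root (defaults to /sys; overridable only to ease testing)
report_cache_types() {
    sysfs=${1:-/sys}
    for f in "$sysfs"/class/scsi_disk/*/cache_type; do
        [ -r "$f" ] || continue
        hctl=$(basename "$(dirname "$f")")   # host:channel:target:lun
        mode=$(cat "$f")
        case $mode in
            "write back"*)        note="volatile cache: flushes/barriers matter" ;;
            "write through"|none) note="no volatile write cache reported" ;;
            *)                    note="unrecognized value" ;;
        esac
        echo "$hctl: $mode ($note)"
    done
}
```

Running `report_cache_types` with no argument on each affected host would give one line per disk, which is easier to compare across hosts than the raw loop output.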

Comment 13 Moran Goldboim 2010-11-25 14:31:33 UTC

The setup on which the bug happened no longer exists, so there is no way to reproduce the bug.