Bug 602653
Summary: | qemu image corruption probably after power failure on all vms (iscsi) | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Moran Goldboim <mgoldboi> |
Component: | kvm | Assignee: | chellwig <chellwig> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.5 | CC: | kwolf, llim, michael.hagmann, mkenneth, tburke, virt-maint, ykaul |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2010-11-25 14:31:33 UTC | Type: | --- |
Bug Blocks: | 580949 |
Description
Moran Goldboim
2010-06-10 12:12:08 UTC
(In reply to comment #0)
> ERROR cluster 2459 refcount=1 reference=0
> ERROR cluster 3590 refcount=1 reference=0
> 2 errors were found on the image.
>
> ERROR cluster 2116 refcount=1 reference=0
> 1 errors were found on the image.

Is this BZ only about these qemu-img check messages or do you notice real breakage when running the VMs? These messages are just about leaked clusters, which are both expected and harmless (and actually unavoidable in case of power loss).

The VMs are not booting up: some fail and require running fsck (which doesn't succeed), others hit a kernel panic, and others bring up GRUB, but not one boots.

What does:

dmesg | grep "Write cache"

say on the affected host system?

[root@silver-vdsd ~]# sdparm --get WCE /dev/dm-3
/dev/dm-3: SUN SOLARIS 1
WCE error (try adding '-vv') in Caching (SBC) mode page

[root@silver-vdsd ~]# sdparm --get WCE /dev/dm-3 -vv
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/dm-3
inquiry cdb: 12 00 00 00 24 00
/dev/dm-3: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page

[root@silver-vdsd ~]# sdparm --get WCE /dev/dm-4 -vv
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/dm-4
inquiry cdb: 12 00 00 00 24 00
/dev/dm-4: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page

[root@silver-vdsd ~]# sdparm --get WCE /dev/dm-2 -vv
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/dm-2
inquiry cdb: 12 00 00 00 24 00
/dev/dm-2: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page

The command needs to be run on the underlying /dev/sd* devices, not the device mapper devices. Just do it on all devices showing up in lsscsi output, or use pvdisplay to figure out what devices belong to the volume group.

pvdisplay output:

[root@silver-vdsd new_kvm]# pvdisplay
--- Physical volume ---
PV Name /dev/mapper/3600144f04b79233900003048344a6b00
VG Name 8900978c-e842-4037-8f04-c9a740793a13
PV Size 100.00 GB / not usable 128.00 MB
Allocatable yes
PE Size (KByte) 131072
Total PE 799
Free PE 135
Allocated PE 664
PV UUID YpjYcC-Jxc6-dvaJ-zIih-SJi4-eoJc-aqkKMf

--- Physical volume ---
PV Name /dev/mapper/3600144f04b79235600003048344a6b00
VG Name 8900978c-e842-4037-8f04-c9a740793a13
PV Size 100.00 GB / not usable 128.00 MB
Allocatable yes
PE Size (KByte) 131072
Total PE 799
Free PE 88
Allocated PE 711
PV UUID XXgJOV-SY6i-Q9Th-Pxwb-Ywlk-q26O-KRfDbr

--- Physical volume ---
PV Name /dev/mapper/3600144f04b82906100003048344a6b00
VG Name 8900978c-e842-4037-8f04-c9a740793a13
PV Size 300.00 GB / not usable 128.00 MB
Allocatable yes
PE Size (KByte) 131072
Total PE 2399
Free PE 269
Allocated PE 2130
PV UUID qgLD7h-cBoG-8Omt-IzK7-XRf2-dojt-KwprgF

--- Physical volume ---
PV Name /dev/sda2
VG Name vg0
PV Size 136.63 GB / not usable 5.83 MB
Allocatable yes
PE Size (KByte) 32768
Total PE 4372
Free PE 3122
Allocated PE 1250
PV UUID EPKn4x-ow5d-DYl7-S9BZ-z30t-FK2p-zD4qUt

Since the problematic VG was 8900978c-e842-4037-8f04-c9a740793a13, on which devices should I run the "sdparm --get WCE" command?

So the LVM volumes are stacked again on device mapper, I assume multipath. Just run

for i in /dev/sd?; do sdparm --get WCE $i; done

please.
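The advice above, to query the underlying /dev/sd* devices rather than the device mapper nodes, can be sketched as a small helper that walks the sysfs `slaves` directories. This is only a sketch and not something used in this report: the function name `underlying_disks` and its optional sysfs-root argument are hypothetical, and it assumes the running kernel exposes `/sys/block/<dev>/slaves`.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): print the leaf block devices
# underneath a device mapper node by following <sysfs>/block/<dev>/slaves.
#   $1 = kernel device name, e.g. dm-3
#   $2 = sysfs root (assumption: defaults to /sys; overridable for testing)
# The body runs in a subshell "( ... )" so the variables stay local across
# the recursive calls.
underlying_disks() (
    dev=$1
    root=${2:-/sys}
    found=0
    for slave in "$root/block/$dev/slaves"/*; do
        # An unmatched glob stays literal and does not exist: skip it.
        [ -e "$slave" ] || continue
        found=1
        underlying_disks "${slave##*/}" "$root"
    done
    # A device without slaves is a leaf (for example sdb): print it.
    [ "$found" -eq 1 ] || echo "$dev"
)
```

With multipath the same disk can be reached through more than one map, so a caller would probably want to de-duplicate, for example `for d in $(underlying_disks dm-3 | sort -u); do sdparm --get WCE "/dev/$d"; done`.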
[root@silver-vdsd ~]# for i in /dev/sd?; do sdparm -vv --get WCE $i; done
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sda
inquiry cdb: 12 00 00 00 24 00
/dev/sda: IBM-ESXS CBRCA146C3ETS0 N C370
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 24 00
mode sense (10) cdb: 5a 00 48 00 00 00 00 00 24 00
mode sense (10) cdb: 5a 00 88 00 00 00 00 00 24 00
mode sense (10) cdb: 5a 00 c8 00 00 00 00 00 24 00
WCE 0 [cha: y, def: 0, sav: 0]
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sdb
inquiry cdb: 12 00 00 00 24 00
/dev/sdb: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sdc
inquiry cdb: 12 00 00 00 24 00
/dev/sdc: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sdd
inquiry cdb: 12 00 00 00 24 00
/dev/sdd: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page

[root@silver-vdse ~]# for i in /dev/sd?; do sdparm -vv --get WCE $i; done
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sda
inquiry cdb: 12 00 00 00 24 00
/dev/sda: IBM-ESXS ST9146803SS B536
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 24 00
mode sense (10) cdb: 5a 00 48 00 00 00 00 00 24 00
mode sense (10) cdb: 5a 00 88 00 00 00 00 00 24 00
mode sense (10) cdb: 5a 00 c8 00 00 00 00 00 24 00
WCE 0 [cha: y, def: 0, sav: 0]
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sdc
inquiry cdb: 12 00 00 00 24 00
/dev/sdc: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sde
inquiry cdb: 12 00 00 00 24 00
/dev/sde: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page
mp_settings: page,subpage=0x8,0x0 num=1
[0x8,0x0] pdt=0 start_byte=0x2 start_bit=2 num_bits=1 val=0 acronym: WCE
>>> about to open device name: /dev/sdf
inquiry cdb: 12 00 00 00 24 00
/dev/sdf: SUN SOLARIS 1
mode sense (10) cdb: 5a 00 08 00 00 00 00 00 08 00
mode sense (10): transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
WCE error in Caching (SBC) mode page

Looks like the WCE query failed for every "SOLARIS" device. I wonder if we take that for a disabled write cache while it's not. What does:

for i in /sys/class/scsi_disk/*/cache_type; do
echo "$i: $(cat $i)"
done

say?

Btw, what layers do you have between the underlying /dev/sd* devices and the qcow2 images? dm-multipath was mentioned, and given the pathnames a filesystem is probably used.
Does it also use LVM? Either way, none of the dm targets in RHEL5 support barriers, and the default ext3 filesystem doesn't use them either.

Is there any way to find out what kind of caching the "SOLARIS" target pretends to implement?

So far I think the most likely culprit is at the target level, whether caching-related or not.

The setup on which the bug happened no longer exists, so there is no way to reproduce the bug.
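The concern raised earlier in this thread, that a failed WCE query might be mistaken for a disabled write cache, suggests treating the result as three-valued. A minimal sketch, assuming sdparm output in the form shown in this report (`WCE 0 [cha: y, ...]` on success, `WCE error ...` on failure); the function name `classify_wce` is hypothetical and is not part of sdparm or of any tool mentioned here.

```shell
#!/bin/sh
# Hypothetical helper: read the output of "sdparm --get WCE <dev>" on stdin
# and print enabled, disabled, or unknown. A failed mode sense is reported
# as "unknown" instead of being silently treated as "write cache off".
classify_wce() {
    state=unknown
    # "|| [ -n "$name" ]" also handles a final line without a newline.
    while read -r name value rest || [ -n "$name" ]; do
        # Only lines whose first field is WCE carry the setting.
        [ "$name" = "WCE" ] || continue
        case $value in
            1) state=enabled ;;
            0) state=disabled ;;
            # Anything else (e.g. "WCE error ...") leaves the state unknown.
        esac
    done
    echo "$state"
}
```

A possible invocation would be `sdparm --get WCE /dev/sdb 2>&1 | classify_wce`. For the "SOLARIS" devices above this would print `unknown`, which is the point: the query result says nothing about whether the target caches writes.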