Bug 967821 - engine: engine reports LSM as successful, which allows us to put the src domain in maintenance while the VM keeps writing to it; after an ovirt-engine restart the LSM is reported as failed and the mapping changes back to the src domain
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Linux
Priority: unspecified  Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Nobody's working on this, feel free to take it
QA Contact: Gadi Ickowicz
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-05-28 07:36 EDT by Dafna Ron
Modified: 2016-02-10 13:58 EST
CC: 11 users

See Also:
Fixed In Version: is2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 17:18:49 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (1.41 MB, application/x-gzip)
2013-05-28 07:36 EDT, Dafna Ron


External Trackers
Tracker: oVirt gerrit | ID: 15133 | Priority: None | Status: None | Summary: None | Last Updated: Never

Description Dafna Ron 2013-05-28 07:36:19 EDT
Created attachment 753865: logs

Description of problem:

An LSM that failed at the DeleteImageGroupCommand step is reported as successful, but after I restarted the engine (to clear an ArrayOutOfBound error) the engine detects the LSM as failed and rolls back.
The VM's mapping in the DB is also rolled back; in VDSM, however, the mapping has changed and we are writing to the target domain.
I put the domain in maintenance, and the VM keeps writing to a domain which is not online.

I'm opening this bug against the engine because the mapping issue should not happen.
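
For illustration only, here is a minimal standalone Java sketch (not engine code) of the kind of consistency check this report implies: it extracts the storage-domain UUID from a qemu drive path of the form /rhev/data-center/<spUUID>/<sdUUID>/images/<imgUUID>/<volUUID> (as seen in the ps output in the additional info below) and compares it with the domain the engine DB claims the disk is on. The engineDbDomain value is an assumption taken from the LVM listing below and merely stands in for whatever the engine DB actually holds.

// Illustrative only -- not part of ovirt-engine.
// Compares the storage domain qemu is actually writing to (taken from the
// drive path on the qemu-kvm command line) with the domain the engine DB
// maps the disk to.
public class LsmMappingCheck {

    // Drive paths have the form:
    // /rhev/data-center/<spUUID>/<sdUUID>/images/<imgUUID>/<volUUID>
    static String domainFromDrivePath(String drivePath) {
        String[] parts = drivePath.split("/");
        if (parts.length < 5) {
            throw new IllegalArgumentException("unexpected drive path: " + drivePath);
        }
        return parts[4]; // the storage-domain UUID
    }

    public static void main(String[] args) {
        // Drive path from the qemu-kvm command line in this report (dst domain).
        String qemuDrivePath = "/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/"
                + "81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/"
                + "a567b2f9-9f19-4302-83e6-ec7de7d7734a/"
                + "6282217b-4ef1-4c1e-9586-19154b41fb50";

        // Assumption: the domain the engine DB points at after the rollback
        // (the VG holding the same volume in the LVM listing below).
        String engineDbDomain = "38755249-4bb3-4841-bf5b-05f4a521514d";

        String qemuDomain = domainFromDrivePath(qemuDrivePath);
        if (!qemuDomain.equals(engineDbDomain)) {
            System.out.println("MISMATCH: engine DB says " + engineDbDomain
                    + " but qemu is writing to " + qemuDomain);
        } else {
            System.out.println("engine DB and qemu agree on domain " + qemuDomain);
        }
    }
}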

Version-Release number of selected component (if applicable):

sf17.2

How reproducible:

100%

Steps to Reproduce:
1. On iSCSI storage with two domains and two hosts, create and run a VM from a template on the HSM.
2. Live storage migrate the VM disk.
3. When the engine prints DeleteImage in the log, block connectivity from the SPM to the domains.
4. After the host becomes non-operational, we can see that the VM disk is pointing to the target domain; put the src domain in maintenance.
5. Restart the ovirt-engine service.


Actual results:

Before the ovirt-engine restart, the disk is shown on the target domain and no LSM failure is reported.
After the restart, the mapping changes back to the src domain.
The VM process (see the qemu-kvm command line below) still shows the dst domain as the drive file.
You can put the domain in maintenance, which means that we are writing to a domain that is no longer available, and all data written to it will be lost if we merge.

Expected results:

If the engine rolls back, then an update should be sent to VDSM.
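
To make that expectation concrete, a hedged sketch follows (hypothetical VdsmClient and DbMapping interfaces standing in for the engine's VDS-broker and DAO layers; this is not the real LiveMigrateDiskCommand code): before the engine flips its DB mapping back to the src domain during recovery, it should ask VDSM where the VM is actually writing and either switch the running VM back as well or keep the DB consistent with the real drive location.

// Hedged sketch only -- NOT actual ovirt-engine code.
// VdsmClient and DbMapping are hypothetical stand-ins for the engine's
// VDS-broker and DAO layers.
public class LsmRollbackSketch {

    interface VdsmClient {
        // Which storage domain is the running VM's drive actually on?
        String activeDomainOfDrive(String vmId, String diskId);
        // Tell VDSM to point the running VM's drive back at the given domain.
        void switchDriveToDomain(String vmId, String diskId, String domainId);
    }

    interface DbMapping {
        void mapDiskToDomain(String diskId, String domainId);
    }

    static void rollBackLsm(VdsmClient vdsm, DbMapping db,
                            String vmId, String diskId,
                            String srcDomain, String dstDomain) {
        String actual = vdsm.activeDomainOfDrive(vmId, diskId);
        if (dstDomain.equals(actual)) {
            // The VM is still writing to the target domain: send an update to
            // VDSM so the running VM is switched back to src before the DB
            // rollback (or, alternatively, keep the DB pointing at the target
            // instead of rolling back silently).
            vdsm.switchDriveToDomain(vmId, diskId, srcDomain);
        }
        // Only now is it safe for the engine DB mapping to point at src again.
        db.mapDiskToDomain(diskId, srcDomain);
    }
}

Either branch keeps the engine DB and the running VM pointing at the same domain, which is the property this bug violates.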

Additional info:

 LV                                   VG                                   Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  1683b62f-973f-44dd-89d2-31df8d80833a 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   3.00g                                             
  6282217b-4ef1-4c1e-9586-19154b41fb50 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   1.00g                                             
  f82a0d58-0791-4137-b1e6-22a8794acd2a 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   2.00g                                             
  ids                                  38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao--- 128.00m                                             
  inbox                                38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 128.00m                                             
  leases                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-a----   2.00g                                             
  master                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   1.00g                                             
  metadata                             38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 512.00m                                             
  outbox                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 128.00m                                             
  lv_root                              vg0                                  -wi-ao--- 457.71g                                             
  lv_swap                              vg0                                  -wi-ao---   7.85g                                             
[root@cougar02 ~]# ps -elf |grep 25415
6 S qemu     25415     1  3  80   0 - 269437 poll_s 13:39 ?       00:00:32 /usr/libexec/qemu-kvm -name testtt -S -M rhel6.4.0 -cpu Opteron_G3 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 098ef05d-c346-4006-98cf-0f0371c4a82a -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=cfeccbf6-77c5-46b5-9367-7386e6a08831,uuid=098ef05d-c346-4006-98cf-0f0371c4a82a -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/testtt.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-05-28T10:39:05,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a567b2f9-9f19-4302-83e6-ec7de7d7734a/6282217b-4ef1-4c1e-9586-19154b41fb50,if=none,id=drive-ide0-0-0,format=qcow2,serial=a567b2f9-9f19-4302-83e6-ec7de7d7734a,cache=none,werror=stop,rerror=stop,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:a1:1e,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/testtt.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/testtt.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0
1 S root     25417     2  0  80   0 -     0 vhost_ 13:39 ?        00:00:00 [vhost-25415]
0 S root     29130 13837  0  80   0 - 25811 pipe_w 13:53 pts/2    00:00:00 grep 25415
[root@cougar02 ~]# 

before ovirt-engine restart: 

2013-05-28 13:47:06,165 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-48) [11ca4bc1] Ending command successfully: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand

After ovirt-engine restart: 

2013-05-28 13:49:12,237 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-20) [11ca4bc1] Command org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive image file %s could not be found
Comment 1 Dafna Ron 2013-05-28 07:52:45 EDT
To add to this bug: I had a VM running and kept writing to it while the domain was in maintenance.
After I powered the VM down and activated the domain that had been in maintenance, I merged the snapshot, and all the data I had written on the VM was gone.
So this scenario can cause loss of the user's data.
Comment 2 Allon Mureinik 2013-07-09 10:37:10 EDT
Changes to the recovery flow in 3.3 should have handled this issue too.
Nonetheless, I tried to reproduce it and was unable to.
Moving to ON_QA to verify.
Comment 3 Aharon Canan 2013-09-10 04:39:50 EDT
Can't verify due to bug #1006203.
Comment 4 Gadi Ickowicz 2013-10-03 08:37:49 EDT
Scenario ran successfully on is17.1.
Comment 5 Itamar Heim 2014-01-21 17:18:49 EST
Closing - RHEV 3.3 Released
Comment 6 Itamar Heim 2014-01-21 17:25:07 EST
Closing - RHEV 3.3 Released
