Bug 967821

Summary: engine: engine reports LSM as successful, which allows putting the src domain in maintenance while the VM keeps writing to it; after an ovirt-engine restart the LSM is re-evaluated as failed and the mapping changes back to src
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engine Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED CURRENTRELEASE QA Contact: Gadi Ickowicz <gickowic>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: acanan, acathrow, amureini, gickowic, iheim, jkt, lpeer, nlevinki, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---   
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: is2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-21 22:18:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments: logs

Description Dafna Ron 2013-05-28 11:36:19 UTC
Created attachment 753865 [details]
logs

Description of problem:

An LSM that failed at the DeleteImageGroupCommand step is reported as successful. After I restarted the engine to clear an ArrayOutOfBound error, the engine detects the LSM as failed and rolls back.
The VM's mapping in the engine database is rolled back as well; in vdsm, however, the mapping has already changed and the VM is writing to the target domain.
I put the src domain in maintenance, and the VM keeps writing to a domain which is no longer online.

I'm opening this bug against the engine because this mapping inconsistency should not happen.

Version-Release number of selected component (if applicable):

sf17.2

How reproducible:

100%

Steps to Reproduce:
1. On iSCSI storage with two domains and two hosts, create and run a VM from a template on the HSM host.
2. Live-storage-migrate the VM's disk.
3. When the engine logs DeleteImage, block connectivity to the storage domains from the SPM.
4. After the host becomes non-operational, the VM disk is shown pointing to the target domain; put the src domain in maintenance.
5. Restart the ovirt-engine service.
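A minimal sketch of step 3, assuming iptables is used on the SPM host to cut iSCSI traffic; the portal address below is a placeholder (not from this bug), and the rules are printed rather than applied — run them as root on the SPM to actually block port 3260:

```shell
# Hypothetical sketch of step 3: drop the SPM's iSCSI traffic to the storage
# domains. PORTAL is a placeholder; substitute your iSCSI portal address.
PORTAL="10.35.64.10"
for rule in \
  "iptables -A OUTPUT -d $PORTAL -p tcp --dport 3260 -j DROP" \
  "iptables -A INPUT  -s $PORTAL -p tcp --sport 3260 -j DROP"
do
  echo "$rule"   # print only; run as root on the SPM to apply
done
```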


Actual results:

Before the ovirt-engine restart, the disk is shown on the target domain and no LSM failure is reported.
After the restart, the mapping changes back to the src domain.
The VM process is still using the dst domain as its drive file.
The domain can be put in maintenance, which means the VM is writing to a domain that is no longer online, and all data written to it will be lost if we merge.

Expected results:

If the engine rolls back, an update should be sent to vdsm.

Additional info:

 LV                                   VG                                   Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  1683b62f-973f-44dd-89d2-31df8d80833a 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   3.00g                                             
  6282217b-4ef1-4c1e-9586-19154b41fb50 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   1.00g                                             
  f82a0d58-0791-4137-b1e6-22a8794acd2a 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   2.00g                                             
  ids                                  38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao--- 128.00m                                             
  inbox                                38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 128.00m                                             
  leases                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-a----   2.00g                                             
  master                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   1.00g                                             
  metadata                             38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 512.00m                                             
  outbox                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 128.00m                                             
  lv_root                              vg0                                  -wi-ao--- 457.71g                                             
  lv_swap                              vg0                                  -wi-ao---   7.85g                                             
[root@cougar02 ~]# ps -elf |grep 25415
6 S qemu     25415     1  3  80   0 - 269437 poll_s 13:39 ?       00:00:32 /usr/libexec/qemu-kvm -name testtt -S -M rhel6.4.0 -cpu Opteron_G3 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 098ef05d-c346-4006-98cf-0f0371c4a82a -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=cfeccbf6-77c5-46b5-9367-7386e6a08831,uuid=098ef05d-c346-4006-98cf-0f0371c4a82a -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/testtt.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-05-28T10:39:05,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a567b2f9-9f19-4302-83e6-ec7de7d7734a/6282217b-4ef1-4c1e-9586-19154b41fb50,if=none,id=drive-ide0-0-0,format=qcow2,serial=a567b2f9-9f19-4302-83e6-ec7de7d7734a,cache=none,werror=stop,rerror=stop,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:a1:1e,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/testtt.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/testtt.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0
1 S root     25417     2  0  80   0 -     0 vhost_ 13:39 ?        00:00:00 [vhost-25415]
0 S root     29130 13837  0  80   0 - 25811 pipe_w 13:53 pts/2    00:00:00 grep 25415
[root@cougar02 ~]# 
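To confirm which storage domain the VM is actually writing to, the domain UUID can be extracted from the qemu -drive path. The sketch below uses the -drive argument from the ps output above as a fixed string; on a live host you would read /proc/<pid>/cmdline instead. It assumes the usual vdsm path layout /rhev/data-center/<pool-uuid>/<domain-uuid>/images/<image-uuid>/<volume-uuid>:

```shell
# The -drive file= value from the ps output above (truncated after the path).
CMDLINE='file=/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a567b2f9-9f19-4302-83e6-ec7de7d7734a/6282217b-4ef1-4c1e-9586-19154b41fb50,if=none'
# Cut the path out of the comma-separated options, then take the fifth
# '/'-separated field, which is the storage-domain UUID in this layout.
echo "$CMDLINE" | grep -o '/rhev/data-center/[^,]*' | awk -F/ '{print $5}'
# → 81ef11d0-4c0c-47b4-8953-d61a6af442d8
```

Here the result is the target domain's UUID, matching the report that qemu kept writing to the dst domain even after the engine rolled its mapping back to src.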

before ovirt-engine restart: 

2013-05-28 13:47:06,165 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-48) [11ca4bc1] Ending command successfully: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand

After ovirt-engine restart: 

2013-05-28 13:49:12,237 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-20) [11ca4bc1] Command org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive image file %s could not be found

Comment 1 Dafna Ron 2013-05-28 11:52:45 UTC
To add to this bug: I kept the VM running and kept writing to it while the domain was in maintenance.
After I powered down the VM and activated the domain that had been in maintenance, I merged the snapshot, and all the data I had written on the VM was gone.
So this scenario can cause loss of user data.

Comment 2 Allon Mureinik 2013-07-09 14:37:10 UTC
Changes to the recovery flow in 3.3 should have handled this issue as well.
Nonetheless, I tried to reproduce it and was unable to.
Moving to ON_QA for verification.

Comment 3 Aharon Canan 2013-09-10 08:39:50 UTC
can't verify due to bug #1006203

Comment 4 Gadi Ickowicz 2013-10-03 12:37:49 UTC
Scenario ran successfully on is17.1

Comment 5 Itamar Heim 2014-01-21 22:18:49 UTC
Closing - RHEV 3.3 Released
