Bug 967821 - engine: engine reports LSM as successful, which allows us to put the src domain in maintenance while the VM keeps writing to it; after an ovirt-engine restart the LSM is reported as failed and the mapping changes back to the src domain
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Linux
Priority: unspecified  Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Nobody's working on this, feel free to take it
QA Contact: Gadi Ickowicz
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-05-28 07:36 EDT by Dafna Ron
Modified: 2016-02-10 13:58 EST
CC: 11 users

See Also:
Fixed In Version: is2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 17:18:49 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (1.41 MB, application/x-gzip)
2013-05-28 07:36 EDT, Dafna Ron


External Trackers
Tracker: oVirt gerrit | ID: 15133 | Priority: None | Status: None | Summary: None | Last Updated: Never

Description Dafna Ron 2013-05-28 07:36:19 EDT
Created attachment 753865: logs

Description of problem:

An LSM that failed at the DeleteImageGroupCommand step is reported as successful, but after I restarted the engine (to clear an ArrayOutOfBound error) the engine detects the LSM as failed and rolls back.
The VM's mapping in the DB is also rolled back; in VDSM, however, the mapping has changed and we are writing to the target domain.
I put the domain in maintenance, and the VM keeps writing to a domain which is not online.

I'm opening this bug against the engine because the mapping issue should not happen.
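
For illustration only, here is a minimal standalone Java sketch (not engine code) of the kind of consistency check this report implies: it extracts the storage-domain UUID from a qemu drive path of the form /rhev/data-center/<spUUID>/<sdUUID>/images/<imgUUID>/<volUUID> (as seen in the ps output in the additional info below) and compares it with the domain the engine DB claims the disk is on. The engineDbDomain value is an assumption taken from the LVM listing below and merely stands in for whatever the engine DB actually holds.

// Illustrative only -- not part of ovirt-engine.
// Compares the storage domain qemu is actually writing to (taken from the
// drive path on the qemu-kvm command line) with the domain the engine DB
// maps the disk to.
public class LsmMappingCheck {

    // Drive paths have the form:
    // /rhev/data-center/<spUUID>/<sdUUID>/images/<imgUUID>/<volUUID>
    static String domainFromDrivePath(String drivePath) {
        String[] parts = drivePath.split("/");
        if (parts.length < 5) {
            throw new IllegalArgumentException("unexpected drive path: " + drivePath);
        }
        return parts[4]; // the storage-domain UUID
    }

    public static void main(String[] args) {
        // Drive path from the qemu-kvm command line in this report (dst domain).
        String qemuDrivePath = "/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/"
                + "81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/"
                + "a567b2f9-9f19-4302-83e6-ec7de7d7734a/"
                + "6282217b-4ef1-4c1e-9586-19154b41fb50";

        // Assumption: the domain the engine DB points at after the rollback
        // (the VG holding the same volume in the LVM listing below).
        String engineDbDomain = "38755249-4bb3-4841-bf5b-05f4a521514d";

        String qemuDomain = domainFromDrivePath(qemuDrivePath);
        if (!qemuDomain.equals(engineDbDomain)) {
            System.out.println("MISMATCH: engine DB says " + engineDbDomain
                    + " but qemu is writing to " + qemuDomain);
        } else {
            System.out.println("engine DB and qemu agree on domain " + qemuDomain);
        }
    }
}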

Version-Release number of selected component (if applicable):

sf17.2

How reproducible:

100%

Steps to Reproduce:
1. On iSCSI storage with two domains and two hosts, create and run a VM from a template on the HSM.
2. Live storage migrate the VM disk.
3. When the engine prints DeleteImage in the log, block connectivity from the SPM to the domains.
4. After the host becomes non-operational, we can see that the VM disk is pointing to the target domain; put the src domain in maintenance.
5. Restart the ovirt-engine service.


Actual results:

Before the ovirt-engine restart, the disk is shown on the target domain and no LSM failure is reported.
After the restart, the mapping changes back to the src domain.
The VM process (see the qemu-kvm command line below) still shows the dst domain as the drive file.
You can put the domain in maintenance, which means that we are writing to a domain that is no longer available, and all data written to it will be lost if we merge.

Expected results:

If the engine rolls back, then an update should be sent to VDSM.
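
To make that expectation concrete, a hedged sketch follows (hypothetical VdsmClient and DbMapping interfaces standing in for the engine's VDS-broker and DAO layers; this is not the real LiveMigrateDiskCommand code): before the engine flips its DB mapping back to the src domain during recovery, it should ask VDSM where the VM is actually writing and either switch the running VM back as well or keep the DB consistent with the real drive location.

// Hedged sketch only -- NOT actual ovirt-engine code.
// VdsmClient and DbMapping are hypothetical stand-ins for the engine's
// VDS-broker and DAO layers.
public class LsmRollbackSketch {

    interface VdsmClient {
        // Which storage domain is the running VM's drive actually on?
        String activeDomainOfDrive(String vmId, String diskId);
        // Tell VDSM to point the running VM's drive back at the given domain.
        void switchDriveToDomain(String vmId, String diskId, String domainId);
    }

    interface DbMapping {
        void mapDiskToDomain(String diskId, String domainId);
    }

    static void rollBackLsm(VdsmClient vdsm, DbMapping db,
                            String vmId, String diskId,
                            String srcDomain, String dstDomain) {
        String actual = vdsm.activeDomainOfDrive(vmId, diskId);
        if (dstDomain.equals(actual)) {
            // The VM is still writing to the target domain: send an update to
            // VDSM so the running VM is switched back to src before the DB
            // rollback (or, alternatively, keep the DB pointing at the target
            // instead of rolling back silently).
            vdsm.switchDriveToDomain(vmId, diskId, srcDomain);
        }
        // Only now is it safe for the engine DB mapping to point at src again.
        db.mapDiskToDomain(diskId, srcDomain);
    }
}

Either branch keeps the engine DB and the running VM pointing at the same domain, which is the property this bug violates.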

Additional info:

 LV                                   VG                                   Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  1683b62f-973f-44dd-89d2-31df8d80833a 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   3.00g                                             
  6282217b-4ef1-4c1e-9586-19154b41fb50 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   1.00g                                             
  f82a0d58-0791-4137-b1e6-22a8794acd2a 38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   2.00g                                             
  ids                                  38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao--- 128.00m                                             
  inbox                                38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 128.00m                                             
  leases                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-a----   2.00g                                             
  master                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-ao---   1.00g                                             
  metadata                             38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 512.00m                                             
  outbox                               38755249-4bb3-4841-bf5b-05f4a521514d -wi-a---- 128.00m                                             
  lv_root                              vg0                                  -wi-ao--- 457.71g                                             
  lv_swap                              vg0                                  -wi-ao---   7.85g                                             
[root@cougar02 ~]# ps -elf |grep 25415
6 S qemu     25415     1  3  80   0 - 269437 poll_s 13:39 ?       00:00:32 /usr/libexec/qemu-kvm -name testtt -S -M rhel6.4.0 -cpu Opteron_G3 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 098ef05d-c346-4006-98cf-0f0371c4a82a -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=cfeccbf6-77c5-46b5-9367-7386e6a08831,uuid=098ef05d-c346-4006-98cf-0f0371c4a82a -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/testtt.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-05-28T10:39:05,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a567b2f9-9f19-4302-83e6-ec7de7d7734a/6282217b-4ef1-4c1e-9586-19154b41fb50,if=none,id=drive-ide0-0-0,format=qcow2,serial=a567b2f9-9f19-4302-83e6-ec7de7d7734a,cache=none,werror=stop,rerror=stop,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:a1:1e,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/testtt.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/testtt.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0
1 S root     25417     2  0  80   0 -     0 vhost_ 13:39 ?        00:00:00 [vhost-25415]
0 S root     29130 13837  0  80   0 - 25811 pipe_w 13:53 pts/2    00:00:00 grep 25415
[root@cougar02 ~]# 

before ovirt-engine restart: 

2013-05-28 13:47:06,165 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-48) [11ca4bc1] Ending command successfully: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand

After ovirt-engine restart: 

2013-05-28 13:49:12,237 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-20) [11ca4bc1] Command org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive image file %s could not be found
Comment 1 Dafna Ron 2013-05-28 07:52:45 EDT
To add to this bug: I had a VM running and kept writing to it while the domain was in maintenance.
After I powered the VM down and activated the domain that had been in maintenance, I merged the snapshot, and all the data I had written on the VM was gone.
So this scenario can cause loss of the user's data.
Comment 2 Allon Mureinik 2013-07-09 10:37:10 EDT
Changes to the recovery flow in 3.3 should have handled this issue too.
Nonetheless, I tried to reproduce it and was unable to.
Moving to ON_QA to verify.
Comment 3 Aharon Canan 2013-09-10 04:39:50 EDT
Can't verify due to bug #1006203.
Comment 4 Gadi Ickowicz 2013-10-03 08:37:49 EDT
Scenario ran successfully on is17.1.
Comment 5 Itamar Heim 2014-01-21 17:18:49 EST
Closing - RHEV 3.3 Released
Comment 6 Itamar Heim 2014-01-21 17:25:07 EST
Closing - RHEV 3.3 Released
