Bug 1190696

Summary: Second run of Windows VM fails because of access problem to sysprep payload
Product: Red Hat Enterprise Virtualization Manager
Reporter: Jiri Belka <jbelka>
Component: ovirt-engine
Assignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED ERRATA
QA Contact: Nisim Simsolo <nsimsolo>
Severity: urgent
Priority: high
Version: 3.5.0
CC: agkesos, dornelas, gklein, gscott, inetkach, istein, jbelka, lpeer, lsurette, michal.skrivanek, nashok, nsimsolo, obockows, pzhukov, rbalakri, Rhev-m-bugs, rhodain, sknauss, smelamud, yeylon, ykaul
Target Milestone: ovirt-3.6.0-rc
Keywords: ZStream
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 1274708
Last Closed: 2016-03-09 20:56:58 UTC
Type: Bug
oVirt Team: Virt
Bug Blocks: 1274708
Attachments:
- engine.log, vdsm.log, libvirtd.log, qemu vm log
- Log Collector report from the customer experiencing the problem

Description Jiri Belka 2015-02-09 13:32:07 UTC
Created attachment 989687 [details]
engine.log, vdsm.log, libvirtd.log, qemu vm log

Description of problem:
In BZ1190663 I mentioned that a sysprep payload is attached to a _normal_ Windows VM regardless of whether sysprep is enabled in its settings. Surprisingly, the second start (second Run click, after the VM was powered off) fails because of a payload file access problem.

It is not related to SELinux; I executed 'setenforce 0' (permissive) and hit the same issue.

...
Thread-353536::ERROR::2015-02-09 13:12:27,417::vm::2324::vm.Vm::(_startUnderlyingVm) vmId=`be87998e-8a1d-4817-8543-c7f11b392d43`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2264, in _startUnderlyingVm
  File "/usr/share/vdsm/virt/vm.py", line 3323, in _run
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2709, in createXML
libvirtError: internal error process exited while connecting to monitor: 2015-02-09T13:12:26.920792Z qemu-kvm: -drive file=/var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b632278828
5a.img,if=none,id=drive-fdc0-0-0,format=raw,serial=: could not open disk image /var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img: Permission denied

Thread-353536::DEBUG::2015-02-09 13:12:27,422::vm::2779::vm.Vm::(setDownStatus) vmId=`be87998e-8a1d-4817-8543-c7f11b392d43`::Changed state to Down: internal error process exited while connecting to monitor: 2015-
02-09T13:12:26.920792Z qemu-kvm: -drive file=/var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img,if=none,id=drive-fdc0-0-0,format=raw,serial=: could not open disk image
 /var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img: Permission denied
 (code=1)

I used the following loop to watch the payload image permissions and SELinux context:

# rm -f /tmp/out ; while true; do echo **start**; date "+%Y-%m-%d %H:%M:%S:%N"; su -s /bin/sh - qemu -c "ls -laZ /var/run/vdsm/payload/" ; su -s /bin/sh - qemu -c "cat /var/run/vdsm/payload/*.img 2>/dev/null | strings" ; ps axuZ | grep '[q]emu.*omg'; echo ; sleep 0.1 ; echo **end** ; done >> /tmp/out 2>&1

It revealed that the payload is created under a different context, and right before qemu uses it, the context is changed to 'svirt_image_t:s0:c739,c875':

**start**
2015-02-09 13:17:57:743047247
drwxr-xr-x. root root system_u:object_r:virt_var_run_t:s0 .
drwxr-xr-x. vdsm kvm  system_u:object_r:virt_var_run_t:s0 ..
-rw-r--r--. vdsm qemu system_u:object_r:svirt_image_t:s0:c739,c875 be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img
mkdosfs
T8           FAT12   
This is not a bootable disk.  Please insert a bootable floppy and
press any key to try again ... 
SYSPREP INF 
d<jIFIF
<jIF
<?xml version="1.0" encoding="utf-8"?>
...content of image...

system_u:system_r:svirt_t:s0:c739,c875 qemu 19714 5.0  0.1 155532 9124 ?       D    13:17   0:00 /usr/libexec/qemu-kvm -name omg -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 1024 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid be87998e-8a1d-4817-8543-c7f11b392d43 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6.6-20150128.0.el6ev,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=be87998e-8a1d-4817-8543-c7f11b392d43 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/omg.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2015-02-09T13:17:57,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x6 -drive file=/rhev/data-center/mnt/10.34.63.204:_home_iso_shared/0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71/images/11111111-1111-1111-1111-111111111111/RHEV-toolsSetup_3.5_9.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img,if=none,id=drive-fdc0-0-0,format=raw,serial= -global isa-fdc.driveA=drive-fdc0-0-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-0000000001a9/2779099b-7952-4455-ac43-7ec876b9f3a3/images/8de92838-7f7e-4f45-bdfc-d59a9f2ebc97/11d3a0d1-5e4d-43a0-9c45-1e76eeefbb2f,if=none,id=drive-virtio-disk0,format=raw,serial=8de92838-7f7e-4f45-bdfc-d59a9f2ebc97,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:d1:5d:06,bus=pci.0,addr=0x3 -chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/be87998e-8a1d-4817-8543-c7f11b392d43.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/be87998e-8a1d-4817-8543-c7f11b392d43.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0 -vnc 10.34.62.205:0,password -k en-us -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
**end**

I don't understand why qemu-kvm can't access it (SELinux is permissive!).

[root@dell-r210ii-13 ~]# grep denied /var/log/audit/audit.log
[root@dell-r210ii-13 ~]# getenforce 
Permissive

2015-02-09T13:17:57.796582Z qemu-kvm: -drive file=/var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img,if=none,id=drive-fdc0-0-0,format=raw,serial=: could not open disk image /var/run/vdsm/payload/be87998e-8a1d-4817-8543-c7f11b392d43.8704fc5b5ad0d207a173b6322788285a.img: Permission denied
2015-02-09 13:17:58.087+0000: shutting down
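The denial with SELinux permissive points at plain DAC permissions: the listing above shows the payload image as mode 0644 owned by vdsm:qemu, so the qemu process (group qemu) can read it but cannot open it read-write (comment 3 below confirms the read-only flag is lost on the second run). A minimal sketch of that mode arithmetic, using a hypothetical scratch file rather than the actual payload path:

```python
import os
import stat
import tempfile

# Create a scratch file with the same mode as the payload image: 0644,
# i.e. -rw-r--r-- (owner read/write; group and others read-only).
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o644)

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))                   # 0o644
print(bool(mode & stat.S_IRGRP))   # True: group (qemu) may read
print(bool(mode & stat.S_IWGRP))   # False: group may not write, so a
                                   # read-write open by qemu fails with EACCES
os.unlink(path)
```

So a read-only floppy attach succeeds (qemu opens the image O_RDONLY), while the same image opened read-write is denied regardless of the SELinux mode.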

Please inspect the logs to see why it sometimes doesn't start.

Version-Release number of selected component (if applicable):
libvirt-0.10.2-46.el6_6.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
vdsm-4.16.8.1-6.el6ev.x86_64
Red Hat Enterprise Virtualization Hypervisor 6.6 (20150128.0.el6ev)
rhevm-backend-3.5.0-0.31.el6ev.noarch

How reproducible:
???

Steps to Reproduce:
1. Create a new Windows VM in User Portal/Extended and add a disk.
2. Start it via the 'Run' button and power it off immediately.
3. Start it again via the 'Run' button.

Actual results:
The VM fails to start; qemu-kvm tries to access the sysprep payload but fails with Permission denied.

Expected results:
The second start should work.

Additional info:
No idea why there is a sysprep payload at all; see BZ1190663.

Comment 1 Omer Frenkel 2015-02-12 08:21:45 UTC
Can you please check, after you shut down the VM and before the second run, the result of the following DB query:
select is_initialized from vm_static where vm_name='<name>';

Comment 2 Jiri Belka 2015-02-12 09:53:43 UTC
engine=# select is_initialized from vm_static where vm_name='bugtest';
 is_initialized 
----------------
 t
(1 row)


I used the exact steps from the description (with the VmCreator role on the DC).

Comment 3 Michal Skrivanek 2015-03-10 10:11:13 UTC
The second run mistakenly drops the "readonly" flag. It is a permission problem: the image is read-only for qemu.
Seems to be the engine's fault...

Comment 4 Omer Frenkel 2015-03-10 10:23:07 UTC
Right. The reason is that the engine mistakenly picks up the floppy (unmanaged) device as a managed device and sends it back to VDSM with bad parameters.
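The device mix-up described above can be pictured as a difference in the floppy definition between the two runs. The libvirt domain XML fragments below are a hypothetical sketch of that difference (element layout and paths abbreviated, not copied from the attached logs):

```xml
<!-- First run: the payload floppy is an unmanaged, read-only device -->
<disk type='file' device='floppy'>
  <source file='/var/run/vdsm/payload/....img'/>
  <target dev='fda'/>
  <readonly/>
</disk>

<!-- Second run: the engine re-sends the device as managed, without
     <readonly/>, so qemu opens the 0644 vdsm:qemu image read-write
     and fails with "could not open disk image ... Permission denied" -->
<disk type='file' device='floppy'>
  <source file='/var/run/vdsm/payload/....img'/>
  <target dev='fda'/>
</disk>
```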

Comment 5 Michal Skrivanek 2015-03-10 14:00:09 UTC
Reducing severity; sending the payload again and again is not an interesting use case (though it is possible).

Comment 9 Greg Scott 2015-10-09 23:14:09 UTC
Update on the problem symptoms at my customer (case number 1518293 linked to external trackers) and the reason I set this BZ to NeedsInfo.

After extensive testing at my customer site today, it was determined that the pooled Windows VMs show the problem on the second boot exactly as described in this BZ. The VM applies the settings from a:\\unattend.xml on first boot. Shut down the VM, then power it up again: it fails with the error as described. Sometimes RHEV-M also declares the host in error for 30 minutes, which prohibits the host from accepting any new VM migrations.

But I have so far been unable to replicate the problem in my own test environment.  

So the info I'm looking for is (1) Do we know what conditions need to exist to trigger the problem, (2) is there a more suitable workaround than engine database updates for every single virtual machine (hundreds of new VMs every month), and (3) since a fix is already apparently in QA, is it possible to apply a hotfix to 3.5.4 to get our customer past this issue?

Customer details in the support case mentioned above.

thanks

- Greg Scott

Comment 10 Omer Frenkel 2015-10-11 13:58:34 UTC
Not sure about a hotfix, but the patch looks simple and safe to backport to 3.5.
Suggesting 3.5.z; I think we can have it in time for 3.5.5.

Comment 11 Greg Scott 2015-10-12 15:59:35 UTC
Backporting to 3.5.5 is great - thanks - but it still leaves a major customer in a world of hurt.  The customer is months into the 3.5.4 testing cycle and to start again with 3.5.5 means they have to throw out months of work and start over.  We've already pitched doing 3.5.5 since it's coming soon and the answer was a resounding no with some adjectives added. 

So would it be possible to also do this as a hotfix for 3.5.4?  If not possible to hotfix 3.5.4, at least document the conditions that trigger this bug since I'm not able to reproduce it.

thanks

- Greg

Comment 12 Greg Scott 2015-10-15 15:53:23 UTC
Omer, can we confirm for sure the fix can go into 3.5.5?  I'm pushing the customer to go to 3.5.5 instead of 3.5.4 - can I commit to the customer this fix will be in there?

thanks

- Greg

Comment 13 Omer Frenkel 2015-10-18 13:26:29 UTC
Unfortunately, I think not.
This got delayed because we could not confirm it's the exact same case and that the solution would be the same.
Could you upload relevant logs of your scenario so we can investigate further?

Comment 14 Greg Scott 2015-10-19 13:26:36 UTC
Created attachment 1084405 [details]
Log Collector report from the customer experiencing the problem

Comment 15 Greg Scott 2015-10-19 13:33:29 UTC
I should have updated the title of the attachment I just uploaded. The whole log collector report was too big to upload, so I extracted some QEMU logs, libvirtd.log, and vdsm.log from a host, plus engine.log and the database from the RHEV-M machine.

I know 3.5.5 is due out tomorrow - but if there's any way to get the fix into that 3.5.5 stream I will be in your debt.

thanks

- Greg

Comment 16 Omer Frenkel 2015-10-21 08:07:36 UTC
I'm sorry, but I think 3.5.5 is missed; as far as I know there are no more builds planned for it (it has already moved to final testing).

I could not see the errors in the vdsm/engine logs that you attached, only the final error in the qemu/libvirt logs, which doesn't help in understanding the source of the issue.
Assuming this is the same issue, I have created a patch rebased on top of 3.5.
From what I know, you can ask GSS to create a hotfix using this patch, so you can apply it if needed.

I hope it helps

Comment 17 Omer Frenkel 2015-10-21 08:12:38 UTC
3.5 patch
https://gerrit.ovirt.org/#/c/47562/

Comment 19 Nisim Simsolo 2015-10-26 12:54:56 UTC
Verification build:
rhevm-3.6.0.1-0.1.el6
vdsm-4.17.9-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Steps to Reproduce:
Scenario 1:
1. New Windows VM in User Portal/Extended, add disk.
2. Start via the 'Run' button, power off immediately.
3. Start again via the 'Run' button.
4. Repeat steps 2-3 a few times.

Scenario 2:
1. Create a pool VM.
2. Run the VM and power it off immediately.
3. Run the VM again.
4. Repeat steps 2-3 a few times.

Comment 57 errata-xmlrpc 2016-03-09 20:56:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html