Created attachment 1237496 [details]
The logs of ovirt-engine and vdsmd
Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Prepare 2 hosts in one datacenter. Create a VM with an OS installed.
2. Migrate the VM to the other host. Sometimes the VM will be shut down.
Expected results: the VM keeps running and the migration succeeds.
Since the bug is hard to reproduce manually, I wrote a script to reproduce it in a loop.
1. First of all, install ovirt-engine-sdk-python-4.0.4
2. Run the following script until it stops.
import logging
import time
import ovirtsdk4 as sdk

connection = sdk.Connection(url='https://xx.xx.xx.xx/ovirt-engine/api',
                            username='admin@internal', password='xxx',
                            insecure=True, log=logging.getLogger())
vms_service = connection.system_service().vms_service()
vm = get_vm(vms_service)  # get_vm: lookup helper, elided in this comment
vm_service = vms_service.vm_service(vm.id)
while True:  # loop body reconstructed; the paste above was truncated here
    status = get_vm(vms_service).status.value
    if status == 'up':  # '==' not 'is': compare values, not object identity
        print('vm will be migrated')
        vm_service.migrate()
    elif status == 'down':
        break  # bug reproduced: the VM powered off
    time.sleep(2)
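Note that the original script compared the status with `is`, which tests object identity rather than value equality; whether two equal strings are the same object is a CPython implementation detail, so such a check can silently misbehave. A minimal standalone illustration (no SDK needed):

```python
up = 'up'
status = ''.join(['u', 'p'])  # builds a new, non-interned string at runtime

print(status == up)   # True: value comparison, always the right check here
print(status is up)   # False in CPython: different objects with equal values
# This is why a check like `status.value is 'up'` can fail even when the
# VM really is up; always use '==' for string comparisons.
```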
The bug reproduces after running the script for several minutes. Since there is no coredump in abrt, I tend to think it is an issue in oVirt, not in libvirt or qemu.
In ovirt-engine's log, there is an error when VM is shutoff:
2017-01-05 11:25:59,008 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler5) [3d13e3e2] Rerun VM '4702cbe3-6d0f-403d-be16-37f3ddc36d95'. Called from VDS 'B'
2017-01-05 11:25:59,041 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-18) [3d13e3e2] Correlation ID: 413c80a7, Job ID: 72e91efd-564d-486c-bb8a-dc5ba83a8502, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: NFS, Source: A).
Where are the libvirt logs? Did anything happen to the VM, or did the migration just fail with the VM kept alive on the source?
Anything on vdsm log, or did you just attach it?
What migration policy did you use?
First of all, regarding severity: I think it is at least medium, since it can power off a running VM and can occasionally be reproduced from the oVirt web UI. I originally noticed the issue while migrating a VM from the web UI, so our customers can hit it too. To reproduce it more conveniently, I wrote the Python script in comment 0.
For the libvirt logs: they are too large (about 1.2 GB) to upload after multiple migrations, and I didn't find any migration-related errors in them. If you need them, I can give you access to the test environment.
For what happened to the VM: the VM is powered off when the migration fails. It is fully powered off, not kept alive on the source.
For the vdsm logs: yes, I just attached them. I am not familiar with vdsm internals, so I didn't analyze them. Maybe we need help from RHEV QE :)
For the migration policy: I used 'Select Host Automatically' both in the web UI and in the script.
Maybe I should explain more about what I did.
Originally I just did a simple VM migration in the oVirt web UI with the 'Select Host Automatically' policy, but the VM unexpectedly powered off. I tried again in the web UI and found the issue hard to reproduce, so I wrote the script that migrates the VM in a loop until it detects the VM powering off. Additionally, I deployed my RHEV host with abrt to catch coredumps, but none were caught. So I am confident it is not a crash in qemu or libvirt, and I tend to think the problem is in vdsm or oVirt.
Lastly, I have to say it is hard to reproduce. You may have to let the script run for half an hour or more to catch the bug :)
I can reproduce this bug on RHEVM 4.1.
The symptom is slightly different: the VM briefly shows a shutdown status during migration, but then immediately returns to running automatically.
Steps to Reproduce:
1. Prepare 2 hosts (yan-A, yan-B) in one datacenter. Create a VM with an OS installed (yan-vm-2).
2. Use the same python script as in comment 1 to migrate the VM to the other host. Sometimes the VM will be shut down, and the script will then stop.
There are 3 kinds of output when the script stops:
(1).ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "". HTTP response code is 400.
(3).ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM. VM is not running.]". HTTP response code is 409.
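For anyone automating this reproduction, the stop conditions above can be told apart by HTTP code and fault detail. A minimal classifier, read straight off the outputs above (the function name and the mapping labels are my own illustration; output (2) is not shown here, so anything else falls through to "other"):

```python
def classify_stop(http_code, fault_detail):
    # Map observed ovirtsdk4.Error faults to the script's stop reasons.
    # 400 with an empty detail: generic "Operation Failed" (output 1).
    # 409 "Cannot migrate VM. VM is not running.": the VM was already
    # reported down when the migrate call was issued (output 3).
    if http_code == 400 and not fault_detail:
        return 'operation-failed'
    if http_code == 409 and 'VM is not running' in fault_detail:
        return 'vm-not-running'
    return 'other'
```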
Please refer to the attachment "rhevm4.1-mig-logs"; it includes:
pyError-.txt, pyError-bugsDown.txt and pyError-[cannot].txt: the script outputs for the 3 situations above.
yan-vm-2.log: the guest log.
rhvm-event-*.txt: event info from the RHEVM page.
Tip: please ignore messages like "profiling:/mnt/coverage/BUILD/libvirt-2.5.0/src/util/.libs/libvirt_util_la-vireventpoll.gcda:Cannot open" in yan-vm-2.log. They appear because we used libvirt-2.5.0-1.virtcov.el7.x86_64; virtcov instruments the build for code coverage but does not modify the libvirt code, so they do not affect the result.
1. There seems to be no obvious migration-related error message in libvirtd.log
2. The errors in vdsm log on yan-B:
[root@hostb host-yanB]# cat vdsm-*|grep ERROR
2017-01-15 21:24:54,811 ERROR (migsrc/01593538) [virt.vm] (vmId='01593538-a124-4958-bbf1-f6c83aaea600') migration destination error: Virtual machine already exists (migration:265)
2017-01-15 21:25:02,428 ERROR (jsonrpc/4) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:25:17,450 ERROR (jsonrpc/4) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:25:32,476 ERROR (jsonrpc/5) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:32:17,983 ERROR (jsonrpc/1) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:32:48,020 ERROR (jsonrpc/2) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:11:23,585 ERROR (jsonrpc/0) [virt.api] FINISH create error=Virtual machine already exists (api:69)
2017-01-15 21:12:11,031 ERROR (periodic/3) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkSampler object at 0x3792790> operation failed (periodic:192)
2017-01-15 21:12:21,763 ERROR (vm/01593538) [virt.vm] (vmId='01593538-a124-4958-bbf1-f6c83aaea600') Error fetching vm stats (vm:1320)
2017-01-15 21:12:25,691 ERROR (periodic/1) [root] VM metrics collection failed (vmstats:264)
2017-01-15 21:12:27,973 ERROR (migsrc/01593538) [virt.vm] (vmId='01593538-a124-4958-bbf1-f6c83aaea600') migration destination error: Virtual machine already exists (migration:265)
2017-01-15 21:12:31,461 ERROR (jsonrpc/6) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:12:46,485 ERROR (jsonrpc/6) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
> The phenomenon is a little different, there is a shutdown status when
> migrating, but it will automatically be running again immediately.
Do I understand correctly that the VM is not actually down, that it is just reported as down for a short while and then correctly reported as up again? E.g. if you put the same sleep at the end of the script as here: https://bugzilla.redhat.com/show_bug.cgi?id=1409033
does it work correctly?
Please see my answers:
1. Do I understand correctly that the VM is not actually down, it is just reported as down for a short while and then again correctly reported as up?
- Yes, I think so, since the job in the guest keeps running and the guest does not reboot.
2. If you put the same sleep at the end of the script as here: https://bugzilla.redhat.com/show_bug.cgi?id=1409033
does it work correctly?
- I used the same script as https://bugzilla.redhat.com/show_bug.cgi?id=1409033#c6, with sleep(2) before migrate.
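For reference, the workaround being discussed amounts to waiting until the engine reports the VM as up before issuing the next migrate call. A hedged sketch of that polling loop (the function name and timeout values are mine; `get_status` stands in for the SDK status query):

```python
import time

def wait_until_up(get_status, timeout=60.0, poll=2.0):
    # Poll the reported VM status until it is 'up', or give up after
    # `timeout` seconds. This generalizes the sleep(2)-before-migrate
    # workaround from bug 1409033 comment 6 into an explicit wait.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status() == 'up':
            return True
        time.sleep(poll)
    return False
```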
A fix for that was done back in Dec 14 (https://gerrit.ovirt.org/#/c/67917/) but it seems that a new version of vdsm-jsonrpc-java was built only on Jan 9. So I'm not sure whether the engine you used included the fix or not. Could you please try to reproduce it with the latest 4.1 version?
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1420718; cannot start a normal VM now.
I've just seen this happening now in o-s-t.
Need to reproduce with logs.
This bug didn't get any attention for a while; we didn't have the capacity to make progress. If you deeply care about it or want to work on it, please assign/target it accordingly.
OK, closing. Please reopen if it is still relevant or you want to work on it.