Bug 1410314

Summary: VM shuts off during migration
Product: [oVirt] ovirt-engine
Reporter: Han Han <hhan>
Component: General
Assignee: Arik <ahadas>
Status: CLOSED DEFERRED
QA Contact:
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.0.6.3
CC: ahadas, bugs, chhu, dyuan, hhan, rbarry, tjelinek, xuzhang, yanqzhan
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Flags: tjelinek: ovirt-future?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-01 14:44:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: The logs of ovirt-engine and vdsmd

Description Han Han 2017-01-05 06:05:02 UTC
Created attachment 1237496 [details]
The logs of ovirt-engine and vdsmd

Description of problem:
As subject

Version-Release number of selected component (if applicable):
RHEVM: rhevm-4.0.6.3-0.1.el7ev.noarch
Host:
libvirt-2.0.0-10.virtcov.el7_3.3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
vdsm-4.19.1-1.el7ev.x86_64


How reproducible:
10%

Steps to Reproduce:
1. Prepare 2 hosts in one datacenter. Create a VM with a guest OS installed.
2. Migrate the VM to the other host. Sometimes the VM will be shut down.

Actual results:
As in step 2: the VM is sometimes shut down during migration.

Expected results:
The VM stays running and the migration succeeds.

Additional info:
Since the bug is hard to reproduce manually, I wrote a script that reproduces it in a loop.
1. First of all, install ovirt-engine-sdk-python-4.0.4.
2. Run the following script until it stops.
```
#!/bin/python
# Migrate the VM in a loop until it is reported as 'down'
# (Python 2, ovirt-engine-sdk-python 4.x).
import ovirtsdk4 as sdk
import logging
import time

connection = sdk.Connection(url='https://xx.xx.xx.xx/ovirt-engine/api',
                            username='admin@internal', password='xxx', insecure=True, log=logging.getLogger())
vms_service = connection.system_service().vms_service()

def get_vm(vms_service):
    # Re-query the VM each time so its status is always fresh.
    return vms_service.list(search='VM')[0]

vm = get_vm(vms_service)
vm_service = vms_service.vm_service(vm.id)
while True:
    print get_vm(vms_service).status
    if get_vm(vms_service).status.value == 'up':
        print 'vm will be migrated'
        time.sleep(2)
        vm_service.migrate()
        print get_vm(vms_service).status.value
    elif get_vm(vms_service).status.value == 'down':
        print 'Bugs!\n'
        print time.time()
        break
    time.sleep(2)
```
The bug is reproduced after running the script for several minutes. Since there is no coredump in abrt, I tend to think it is an issue in oVirt, not in libvirt or qemu.
In the ovirt-engine log, there is an error when the VM is shut off:
2017-01-05 11:25:59,008 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler5) [3d13e3e2] Rerun VM '4702cbe3-6d0f-403d-be16-37f3ddc36d95'. Called from VDS 'B'
2017-01-05 11:25:59,041 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-18) [3d13e3e2] Correlation ID: 413c80a7, Job ID: 72e91efd-564d-486c-bb8a-dc5ba83a8502, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: NFS, Source: A).

Comment 1 Yaniv Kaul 2017-01-06 20:17:30 UTC
Missing severity. 
Where are the libvirt logs? Did anything happen to the VM, or did it just fail the migration and stay alive on the source?

Anything in the vdsm log, or did you just attach it?
What migration policy did you use?

Comment 2 Han Han 2017-01-12 02:18:44 UTC
Hi Yaniv,
First of all, regarding severity, I think it is at least medium, since it can power off a running VM and it can occasionally be reproduced from the oVirt web UI. I originally noticed the issue when I migrated a VM from the web UI, so the issue can be hit by our customers too. To reproduce it more conveniently, I wrote the Python script in comment 0.

For the libvirt logs, they are too large (about 1.2 GB after multiple migrations) to upload, and I didn't find any migration-related errors in them. If you need them, I can give you access to the test environment.

As for what happened to the VM: it is powered off when the migration fails. It is completely powered off, not kept alive on the source.

For the vdsm logs, yes, I just attached them. I am not familiar with vdsm internals, so I didn't analyze them. Maybe we need help from RHEV QE :)

For the migration policy, I use 'Select Host Automatically', both in the web UI and in the script.
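
(For reference — a minimal sketch, not the exact code used here: with ovirt-engine-sdk-python 4.x, calling migrate() without a destination corresponds to 'Select Host Automatically', while passing a host pins the destination. The VM name 'NFS' and host name 'B' below are only illustrative placeholders taken from the attached logs.)
```
#!/bin/python
# Sketch: automatic vs. explicit destination selection with ovirtsdk4.
# 'NFS' and 'B' are placeholders; adjust the URL/credentials as needed.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(url='https://xx.xx.xx.xx/ovirt-engine/api',
                            username='admin@internal', password='xxx', insecure=True)
vms_service = connection.system_service().vms_service()
vm_service = vms_service.vm_service(vms_service.list(search='NFS')[0].id)

# 'Select Host Automatically': no destination given, the engine scheduler picks one.
vm_service.migrate()

# Alternative (instead of the call above): look a destination host up by name
# and pass it to migrate() to pin the target.
hosts_service = connection.system_service().hosts_service()
dst = hosts_service.list(search='name=B')[0]
vm_service.migrate(host=types.Host(id=dst.id))

connection.close()
```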

Maybe I should explain a bit more about what I did.
Originally I just did a simple VM migration in the oVirt web UI with the 'Select Host Automatically' policy, but the VM was powered off unexpectedly. I tried it again in the web UI and found the issue hard to reproduce, so I wrote the script that migrates the VM in a loop until it detects the VM powering off. Additionally, I deployed my RHEV host with abrt to catch coredumps, but no coredump was caught. So I am fairly sure it is not a crash in qemu or libvirt, and I tend to think the problem is in vdsm or oVirt.

Finally, I have to say it is hard to reproduce. You may have to let the script run for half an hour or more to catch the bug :)

Comment 5 Yanqiu Zhang 2017-01-16 08:05:05 UTC
Hi,
I can reproduce this bug on RHEVM 4.1.
rhevm-4.1.0-0.3.beta2.el7.noarch
Host:
libvirt-2.5.0-1.el7.x86_64
qemu-kvm-rhev-2.8.0-1.el7.x86_64
vdsm-4.19.1-1.el7ev.x86_64

The phenomenon is a little different: the VM shows a shutdown status during migration, but it automatically goes back to running immediately afterwards.

Steps to Reproduce:
1. Prepare 2 hosts (yan-A, yan-B) in one datacenter. Create a VM with a guest OS (yan-vm-2).
2. Use the same Python script as in comment#1 to migrate the VM to the other host. Sometimes the VM will be shut down, and then the script stops.

There are 3 kinds of output when the script stops (see the sketch after this list):
(1).ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.
(2).down
Bugs!
1484533548.02
(3).ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM. VM is not running.]". HTTP response code is 409.
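
(A minimal sketch — not the script actually used — of how the loop could tell these three outcomes apart instead of dying on an SDK fault; it assumes the same connection details as the script in the description:)
```
#!/bin/python
# Sketch only: same loop as the script in the description, but it classifies
# the three stop conditions above instead of crashing on an SDK fault.
import ovirtsdk4 as sdk
import time

connection = sdk.Connection(url='https://xx.xx.xx.xx/ovirt-engine/api',
                            username='admin@internal', password='xxx', insecure=True)
vms_service = connection.system_service().vms_service()
vm_service = vms_service.vm_service(vms_service.list(search='VM')[0].id)

while True:
    status = vms_service.list(search='VM')[0].status.value
    if status == 'up':
        time.sleep(2)
        try:
            vm_service.migrate()
        except sdk.Error as e:
            # Outputs (1) and (3): the engine rejected the action (HTTP 400/409);
            # log it and keep looping instead of letting the script die.
            print 'migrate() rejected: %s' % e
    elif status == 'down':
        # Output (2): the VM really reports 'down'.
        print 'Bugs!\n', time.time()
        break
    time.sleep(2)
```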

Please refer to the attachment "rhevm4.1-mig-logs". It includes:
pyError-[].txt, pyError-bugsDown.txt and pyError-[cannot].txt: the script outputs for the 3 situations above
ovirt-engine.log
yan-vm-2.log: the guest log
libvirtd-*.log
vdsm-*.log
rhvm-event-*.txt: event info from the RHEVM page
Tip: please ignore messages like "profiling:/mnt/coverage/BUILD/libvirt-2.5.0/src/util/.libs/libvirt_util_la-vireventpoll.gcda:Cannot open" in yan-vm-2.log. They appear because we used libvirt-2.5.0-1.virtcov.el7.x86_64 (virtcov is used to check code coverage); the libvirt code itself is not modified, so this does not affect the result.

Log errors:
1. There seems to be no obvious migration-related error message in libvirtd.log.
2. The errors in vdsm log on yan-B:
[root@hostb host-yanB]# cat vdsm-*|grep ERROR
> 2017-01-15 21:24:54,811 ERROR (migsrc/01593538) [virt.vm] (vmId='01593538-a124-4958-bbf1-f6c83aaea600') migration destination error: Virtual machine already exists (migration:265)
> 2017-01-15 21:25:02,428 ERROR (jsonrpc/4) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
> 2017-01-15 21:25:17,450 ERROR (jsonrpc/4) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
> 2017-01-15 21:25:32,476 ERROR (jsonrpc/5) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:32:17,983 ERROR (jsonrpc/1) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:32:48,020 ERROR (jsonrpc/2) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:11:23,585 ERROR (jsonrpc/0) [virt.api] FINISH create error=Virtual machine already exists (api:69)
2017-01-15 21:12:11,031 ERROR (periodic/3) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkSampler object at 0x3792790> operation failed (periodic:192)
2017-01-15 21:12:21,763 ERROR (vm/01593538) [virt.vm] (vmId='01593538-a124-4958-bbf1-f6c83aaea600') Error fetching vm stats (vm:1320)
2017-01-15 21:12:25,691 ERROR (periodic/1) [root] VM metrics collection failed (vmstats:264)
2017-01-15 21:12:27,973 ERROR (migsrc/01593538) [virt.vm] (vmId='01593538-a124-4958-bbf1-f6c83aaea600') migration destination error: Virtual machine already exists (migration:265)
2017-01-15 21:12:31,461 ERROR (jsonrpc/6) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
2017-01-15 21:12:46,485 ERROR (jsonrpc/6) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)

Comment 6 Tomas Jelinek 2017-01-18 08:30:06 UTC
> The phenomenon is  a little different, there is a shutdown status when
> migrating, but it will automatically be running again immediately.

Do I understand correctly that the VM is not actually down, it is just reported as down for a short while and then again correctly reported as up? E.g. if you put the same sleep at the end of the script as in https://bugzilla.redhat.com/show_bug.cgi?id=1409033, does it work correctly?

Comment 7 Yanqiu Zhang 2017-01-18 09:15:42 UTC
Hi, Tomas,
Please refer to my answers:
1. Do I understand correctly that the VM is not actually down, it is just reported as down for a short while and then again correctly reported as up?
 - Yes, I think so, since the job in the guest keeps running and the guest does not reboot.
2. If you put the same sleep at the end of the script as in https://bugzilla.redhat.com/show_bug.cgi?id=1409033, does it work correctly?
 - I used the same script as https://bugzilla.redhat.com/show_bug.cgi?id=1409033#c6, with sleep(2) before migrate().
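
(A minimal sketch of the double-check Tomas suggests: when 'down' is seen, re-read the status a few times before declaring the bug, so a transient 'down' report is filtered out. The retry count and delay below are arbitrary, and vms_service is the object from the script in the description:)
```
# Sketch only: confirm that 'down' is not just a transient report.
import time

def confirmed_down(vms_service, retries=5, delay=2):
    # Re-read the VM status a few times; if it ever comes back 'up',
    # treat the earlier 'down' as a transient monitoring glitch.
    for _ in range(retries):
        time.sleep(delay)
        if vms_service.list(search='VM')[0].status.value == 'up':
            return False
    return True

# In the migration loop, the 'down' branch would become:
#     elif status == 'down' and confirmed_down(vms_service):
#         print 'Bugs!\n', time.time()
#         break
```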

Comment 8 Arik 2017-02-08 12:08:22 UTC
A fix for that was done back in Dec 14 (https://gerrit.ovirt.org/#/c/67917/) but it seems that a new version of vdsm-jsonrpc-java was built only on Jan 9. So I'm not sure whether the engine you used included the fix or not. Could you please try to reproduce it with the latest 4.1 version?

Comment 9 Yanqiu Zhang 2017-02-09 11:41:43 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1420718, cannot start a normal vm now.

Comment 10 Yaniv Kaul 2017-10-15 17:53:32 UTC
I've just seen this happening now in o-s-t.
Need to reproduce with logs.

Comment 11 Michal Skrivanek 2020-03-18 15:43:21 UTC
This bug didn't get any attention for a while; we didn't have the capacity to make any progress. If you deeply care about it or want to work on it, please assign/target accordingly.

Comment 13 Michal Skrivanek 2020-04-01 14:44:15 UTC
OK, closing. Please reopen if this is still relevant or you want to work on it.
