Bug 1276930 - VM is down after maintenance: Lost connection with qemu process
Summary: VM is down after maintenance: Lost connection with qemu process
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michal Skrivanek
QA Contact: Israel Pinto
URL:
Whiteboard: virt
Depends On:
Blocks:
 
Reported: 2015-11-01 14:01 UTC by Israel Pinto
Modified: 2015-11-12 13:57 UTC
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-12 13:57:42 UTC
oVirt Team: ---
Embargoed:
rule-engine: planning_ack?
ipinto: devel_ack?
ipinto: testing_ack?


Attachments
engine_log (68.68 KB, application/zip), 2015-11-01 14:08 UTC, Israel Pinto
hosts_logs (594.30 KB, application/zip), 2015-11-01 14:08 UTC, Israel Pinto

Description Israel Pinto 2015-11-01 14:01:15 UTC
Description of problem:
Switching a host to maintenance with one running VM: the migration completes, but the VM goes down
with the error: Lost connection with qemu process

Version-Release number of selected component (if applicable):
RHEVM: rhevm-3.6.0.2-0.1.el6.noarch
VDSM: vdsm-4.17.10-5.el7ev.noarch
libvirt: libvirt-1.2.17-4.el7.x86_64


How reproducible:
All the time

Steps to Reproduce:
Switch a host that is running one VM to maintenance (a scripted sketch of this flow follows the actual results below)

Actual results:
1. Host switches to maintenance
2. Migration completes
3. VM is down
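
For reference, the same maintenance flow can be driven from a script. A minimal sketch against the 3.6-era oVirt REST API using python-requests; the endpoint and action body are assumptions based on that API generation, the engine URL and credentials are placeholders, and the host id is the one from the engine log under Additional info:

# Hypothetical repro script, not part of the original report.
import requests

ENGINE = 'https://engine.example.com/ovirt-engine/api'   # placeholder
AUTH = ('admin@internal', 'password')                    # placeholder
HOST_ID = 'c84dd08c-a044-4159-a2bf-32c0c615001c'         # host_mixed_1 in the logs

# POST .../hosts/<id>/deactivate moves the host to "Preparing for Maintenance";
# the engine then live-migrates any running VMs before it reaches Maintenance.
resp = requests.post('%s/hosts/%s/deactivate' % (ENGINE, HOST_ID),
                     auth=AUTH,
                     headers={'Content-Type': 'application/xml'},
                     data='<action/>',
                     verify=False)
resp.raise_for_status()
print('deactivate returned %s' % resp.status_code)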


Additional info:
----------------------------
From engine log:
## Migration completed and host switched to maintenance ##
2015-11-01 15:35:50,499 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-83) [11ddbfa5] Correlation ID: 14176bb3, Job ID: daff9288-65e0-4574-aaf0-e66ac3e34414, Call Stack: null, Custom Event ID: -1, Message: Migration completed (VM: golden_env_mixed_virtio_0, Source: host_mixed_1, Destination: host_mixed_2, Duration: 18 seconds, Total: 18 seconds, Actual downtime: 25ms)
2015-11-01 15:35:50,499 INFO  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (ForkJoinPool-1-worker-83) [11ddbfa5] Lock freed to object 'EngineLock:{exclusiveLocks='[6e7e9891-79ed-4b8c-8dce-0e0d67db9358=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName golden_env_mixed_virtio_0>]', sharedLocks='null'}'
2015-11-01 15:35:50,788 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-92) [11ddbfa5] START, DestroyVDSCommand(HostName = host_mixed_1, DestroyVmVDSCommandParameters:{runAsync='true', hostId='c84dd08c-a044-4159-a2bf-32c0c615001c', vmId='6e7e9891-79ed-4b8c-8dce-0e0d67db9358', force='false', secondsToWait='0', gracefully='false', reason=''}), log id: 34e6a40f
2015-11-01 15:35:50,804 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-92) [11ddbfa5] FINISH, DestroyVDSCommand, log id: 34e6a40f
2015-11-01 15:35:50,806 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-92) [11ddbfa5] RefreshVmList VM id '6e7e9891-79ed-4b8c-8dce-0e0d67db9358' status = 'Down' on VDS 'host_mixed_1' ignoring it in the refresh until migration is done
2015-11-01 15:35:51,276 INFO  [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-32) [4b2c3daa] Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database,  vds 'host_mixed_1'(c84dd08c-a044-4159-a2bf-32c0c615001c)

## VM down ##
2015-11-01 15:35:59,227 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-92) [11ddbfa5] FINISH, DestroyVDSCommand, log id: 15622138
2015-11-01 15:35:59,243 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-92) [11ddbfa5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM golden_env_mixed_virtio_0 is down with error. Exit message: Lost connection with qemu process.
2015-11-01 15:35:59,244 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-92) [11ddbfa5] VM '6e7e9891-79ed-4b8c-8dce-0e0d67db9358(golden_env_mixed_virtio_0) is running in db and not running in VDS 'host_mixed_2'
---------------------------------------------------------
From VDSM log:
periodic/4::ERROR::2015-11-01 15:35:59,199::sampling::538::virt.sampling.VMBulkSampler::(__call__) vm sampling failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 526, in __call__
    bulk_stats = self._conn.getAllDomainStats(self._stats_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 5104, in getAllDomainStats
    raise libvirtError("virConnectGetAllDomainStats() failed", conn=self)
libvirtError: Unable to read from monitor: Connection reset by peer
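
For context, the failing frame is VDSM's bulk-stats sampler, which is a thin wrapper around libvirt's virConnectGetAllDomainStats(). A minimal standalone sketch of the same call, assuming the libvirt-python bindings and a local qemu:///system connection; when a qemu process dies mid-call it raises libvirt.libvirtError just as in the traceback:

# Standalone version of the sampling call from the traceback above.
import libvirt

conn = libvirt.open('qemu:///system')
try:
    # virConnectGetAllDomainStats() returns a list of
    # (virDomain, stats-dict) tuples covering all domains.
    for dom, stats in conn.getAllDomainStats():
        print('%s %s' % (dom.name(), stats.get('cpu.time')))
except libvirt.libvirtError as err:
    # If a qemu process dies mid-call, libvirt reports it much like above:
    # "Unable to read from monitor: Connection reset by peer"
    print('vm sampling failed: %s' % err)
finally:
    conn.close()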

Comment 1 Israel Pinto 2015-11-01 14:08:20 UTC
Created attachment 1088359 [details]
engine_log

Comment 2 Israel Pinto 2015-11-01 14:08:49 UTC
Created attachment 1088360 [details]
hosts_logs

Comment 3 Michal Skrivanek 2015-11-02 10:37:31 UTC
whenever a qemu crash is involved, please include libvirt and qemu logs as well

Comment 4 Red Hat Bugzilla Rules Engine 2015-11-02 13:33:46 UTC
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 5 Israel Pinto 2015-11-02 14:26:37 UTC
(In reply to Michal Skrivanek from comment #3)
> whenever a qemu crash is involved, please include libvirt and qemu logs as
> well

I updated the libvirt log level to Debug and restarted libvirt and vdsm.
The problem did not reproduce. I will update if it happens again.
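
(For reference, raising libvirt's verbosity is normally done in /etc/libvirt/libvirtd.conf; a minimal sketch, with illustrative values that are not taken from this environment:)

# /etc/libvirt/libvirtd.conf -- illustrative debug-logging settings;
# check them against the defaults shipped with the installed libvirt.
log_level = 1        # 1 = DEBUG, 2 = INFO, 3 = WARNING, 4 = ERROR
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

# afterwards restart the daemons so the new level takes effect:
#   systemctl restart libvirtd vdsmd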

Comment 6 Michal Skrivanek 2015-11-03 09:44:23 UTC
let's see in a week or so; if it does not reproduce, we will unfortunately have to close this with insufficient data

Comment 7 Gil Klein 2015-11-03 15:08:03 UTC
Reducing severity for now till we check if we can reproduce it

Comment 8 Yaniv Kaul 2015-11-12 13:57:42 UTC
Closing as not reproducible. Israel - if you do manage to reproduce, please verify it's not a qemu bug.

