Bug 1151953 - VMs not migrating following move host to maintenance.
Summary: VMs not migrating following move host to maintenance.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Francesco Romani
QA Contact: Ilanit Stein
URL:
Whiteboard: virt
Depends On: 1162208 1174869
Blocks: 1154351
 
Reported: 2014-10-13 07:45 UTC by Ilanit Stein
Modified: 2015-02-17 08:29 UTC
15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1154351
Environment:
Last Closed: 2015-02-17 08:29:23 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments
test logs (1.22 MB, application/x-bzip)
2014-10-13 07:46 UTC, Ilanit Stein
vt12 migration test logs (904.13 KB, application/x-bzip)
2014-11-30 12:08 UTC, Ilanit Stein
new logs (1.41 MB, application/x-bzip)
2014-12-14 14:52 UTC, Raz Tamir

Description Ilanit Stein 2014-10-13 07:45:03 UTC
Description of problem:
The automatic migration test (run via the REST API) fails to migrate VMs after moving a host to maintenance.

Test link:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Compute/view/3.5-git/view/Dashboard/job/3.5-git-compute-virt-vm_migration-iscsi/29/

Test flow:
Have 5 VMs, installed with RHEL + guest agent, running on one host.
Move the host to maintenance (a sketch of the REST call used for this follows below).
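
For illustration only (not the actual automation code; the engine URL, credentials and host UUID below are placeholders), the host is moved to maintenance through the 'deactivate' action of the RHEV REST API, roughly like this:

import requests

ENGINE = "https://engine.example.com"                 # placeholder engine FQDN
AUTH = ("admin@internal", "password")                 # placeholder credentials
HOST_ID = "00000000-0000-0000-0000-000000000000"      # placeholder host UUID

# 'deactivate' asks the engine to move the host to maintenance,
# which triggers live migration of all VMs running on it.
resp = requests.post(
    "%s/api/hosts/%s/deactivate" % (ENGINE, HOST_ID),
    data="<action/>",
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,  # test environments often use a self-signed CA
)
resp.raise_for_status()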

Version-Release number of selected component (if applicable):
engine: vt5 
hosts: rhel-7, vdsm-4.16.6-1.el7.x86_64

How reproducible:
Failed for several consecutive runs on vt5.

Expected results:
VMs should migrate successfully, and the host should move to maintenance.

Additional info:
Same test PASSED on: 
engine: vt4
hosts: vdsm-4.16.5-2.el7.x86_64

Comment 1 Ilanit Stein 2014-10-13 07:46:29 UTC
Created attachment 946262 [details]
test logs

Comment 2 Francesco Romani 2014-10-14 12:46:45 UTC
Hi,

I scanned the logs and found these migrations failed, which most likely caused the issue:

Thread-1322::WARNING::2014-10-13 03:02:02,352::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-1309::WARNING::2014-10-13 03:02:02,352::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-1322::CRITICAL::2014-10-13 03:02:02,352::libvirtconnection::137::root::(wrapper) taking calling process down.
Thread-1309::CRITICAL::2014-10-13 03:02:02,352::libvirtconnection::137::root::(wrapper) taking calling process down.
Thread-1322::DEBUG::2014-10-13 03:02:02,353::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error: client socket is closed
Thread-1309::DEBUG::2014-10-13 03:02:02,353::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error: client socket is closed
Thread-1322::ERROR::2014-10-13 03:02:02,353::migration::259::vm.Vm::(run) vmId=`3b32c9f3-b02f-4497-b176-664a71eeab19`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 245, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 324, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 670, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
    __connections.get(id(target)).pingLibvirt()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3642, in getLibVersion
    if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self)
libvirtError: internal error: client socket is closed
Thread-1309::ERROR::2014-10-13 03:02:02,354::migration::259::vm.Vm::(run) vmId=`db2e563c-f0fe-43d7-9b35-0913d5e7e693`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 245, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 324, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 670, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
    __connections.get(id(target)).pingLibvirt()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3642, in getLibVersion
    if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self)
libvirtError: internal error: client socket is closed
VM Channels Listener::DEBUG::2014-10-13 03:02:02,425::vmchannels::133::vds::(_handle_unconnected) Trying to connect fileno 49.


which in turn seems to be caused by a libvirtd crash:

Oct 13 03:02:01 puma20 systemd-machined: Machine qemu-virt-vm-3 terminated.
Oct 13 03:02:01 puma20 libvirtd: Caught Segmentation violation dumping internal log buffer:
Oct 13 03:02:01 puma20 libvirtd: ====== start of log =====
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=8 w=9, f=16 e=1 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=9 w=10, f=18 e=1 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=10 w=15, f=22 e=1 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=11 w=16, f=23 e=1 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=12 w=25, f=28 e=25 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=13 w=26, f=27 e=25 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=14 w=36, f=25 e=1 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollMakePollFDs:391 : Prepare n=15 w=37, f=29 e=1 d=0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollCalculateTimeout:332 : Calculate expiry of 5 timers
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollCalculateTimeout:340 : Got a timeout scheduled for 1413183726235
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollCalculateTimeout:340 : Got a timeout scheduled for 1413183725078
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1526: debug : virFileMakePathHelper:2201 : path=/var/run/libvirt/qemu mode=0777
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollCalculateTimeout:353 : Schedule timeout then=1413183725078 now=1413183721486
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollCalculateTimeout:361 : Timeout at 1413183725078 due in 3592 ms
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1520: debug : virEventPollRunOnce:627 : EVENT_POLL_RUN: nhandles=15 timeout=3592
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1526: debug : virFileClose:90 : Closed fd 24
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1526: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f8f1c162680
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1526: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f8f0800cba0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.486+0000: 1526: debug : virConnectIsAlive:21373 : conn=0x7f8f08012c50
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f8f1c162680
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f8f0800cba0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : qemuDomainObjBeginJobInternal:1038 : Starting job: async nested (async=migration out)
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f8f1c162680
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f8f080129f0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : qemuMonitorGetMigrationStatus:2089 : mon=0x7f8f080129f0
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virJSONValueToString:1100 : object=0x7f8f08006120
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virJSONValueToStringOne:1031 : object=0x7f8f08006120 type=0 gen=0x7f8f0800b840
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virJSONValueToStringOne:1031 : object=0x7f8f0800c2b0 type=2 gen=0x7f8f0800b840
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virJSONValueToStringOne:1031 : object=0x7f8f08005320 type=2 gen=0x7f8f0800b840
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virJSONValueToString:1133 : result={"execute":"query-migrate","id":"libvirt-391"}
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : qemuMonitorJSONCommandWithFd:264 : Send command '{"execute":"query-migrate","id":"libvirt-391"}' for write with FD -1
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=25 events=15
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : virEventPollInterruptLocked:714 : Interrupting
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1526: debug : qemuMonitorSend:955 : QEMU_MONITOR_SEND_MSG: mon=0x7f8f080129f0 msg={"execute":"query-migrate","id":"libvirt-391"}
Oct 13 03:02:01 puma20 libvirtd: fd=-1
Oct 13 03:02:01 puma20 libvirtd: 2014-10-13 07:02:01.536+0000: 1520: debug : virEventPollRunOnce:638 : Poll got 1 event(s)
Oct 13 03:02:01 puma20 journal: vdsm root WARNING connection to libvirt broken. ecode: 1 edom: 7
Oct 13 03:02:01 puma20 journal: vdsm root WARNING connection to libvirt broken. ecode: 1 edom: 7
Oct 13 03:02:01 puma20 journal: vdsm root WARNING connection to libvirt broken. ecode: 1 edom: 7
Oct 13 03:02:01 puma20 journal: vdsm root WARNING connection to libvirt broken. ecode: 1 edom: 7
Oct 13 03:02:01 puma20 journal: vdsm root CRITICAL taking calling process down.
Oct 13 03:02:01 puma20 journal: vdsm root CRITICAL taking calling process down.
Oct 13 03:02:01 puma20 journal: vdsm root CRITICAL taking calling process down.
Oct 13 03:02:01 puma20 journal: vdsm root CRITICAL taking calling process down.
Oct 13 03:02:01 puma20 systemd: libvirtd.service: main process exited, code=killed, status=11/SEGV
Oct 13 03:02:01 puma20 systemd: Unit libvirtd.service entered failed state.

To be honest, I don't know if this is the only cause of the misbehaviour, but it is certainly one cause, and a big one, so we need to file the bug against libvirt and block this one on it.
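
For context on why a libvirtd crash takes vdsm down: the pattern visible in the traceback above is roughly the one sketched below (an illustrative sketch, not the actual vdsm libvirtconnection code). Every libvirt call goes through a wrapper; when a call fails, the wrapper pings libvirtd, and if the ping also fails the connection is considered broken and the calling process terminates itself so it can be restarted cleanly.

import os
import signal
import libvirt

def wrap_libvirt_call(conn, func):
    """Sketch of the wrapper pattern: on a libvirt error, probe the
    connection and, if it is broken, take the calling process down."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except libvirt.libvirtError:
            try:
                conn.getLibVersion()  # cheap ping; raises if the socket is dead
            except libvirt.libvirtError:
                # connection to libvirt broken: terminate ourselves, matching
                # the CRITICAL "taking calling process down" messages above
                os.kill(os.getpid(), signal.SIGTERM)
            raise
    return wrapper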

Comment 3 Francesco Romani 2014-10-15 07:12:11 UTC
Apparently the same as https://bugzilla.redhat.com/show_bug.cgi?id=1083485.

Either we manage to fetch a core dump, or we have to close this one as well, since it seems there is nothing we can do.

Comment 4 Michal Skrivanek 2014-10-20 13:59:47 UTC
OK, so can we please get a core dump? We may need to reproduce manually and/or grab it from the automation box in the middle of the test or something; I don't think it's archived.

Comment 5 Ilanit Stein 2014-10-22 14:20:54 UTC
Update: The CI team is working on generating the required core dump.

Comment 7 Ilanit Stein 2014-10-31 12:14:43 UTC
We couldn't reproduce the failure so far.

Comment 8 Ilanit Stein 2014-11-02 14:32:15 UTC
This specific problem, where the host remains in 'preparing for maintenance' and the 5 running VMs fail to migrate, did not reproduce on builds after vt5 (neither in the QE nor the CI Jenkins automatic migration tests).

There are OTHER, inconsistent migration test failures seen on vt8 runs, which are under investigation.

I suggest closing this bug for now; if this problem is encountered again, I'll reopen it.

Comment 9 Michal Skrivanek 2014-11-03 13:47:30 UTC
ok, thanks

Comment 10 Ilanit Stein 2014-11-30 12:06:57 UTC
Starting with vt8, this issue causes every automatic migration test run on jenkins-qe to fail.
However, core dumps are not enabled in the automation environment.
So far, the failure has not been reproduced locally (outside Jenkins).

While running the test (on vt12), the following behaviour was observed:

5 VMs installed with RHEL, pinned to the host, and running (named virt-vm-0..4).
After attempting to move the host (running all 5 VMs) to maintenance:

One of the VMs (virt-vm-2) went into the Down state.
The host, which was preparing for maintenance, migrated 3 VMs successfully (virt-vm-0, 3, 4), then had events about becoming not responding and about failing to find the libvirt domain for virt-vm-2. The 5th VM, virt-vm-1, failed to migrate as well.

vm virt-vm-2 event:
==================
"Migration failed: Failed to find the libvirt domain (VM: virt-vm-2, Source: 10.35.160.39, Destination: <UNKNOWN>)."

host events:
===========
2014-Nov-30, 12:53
Migration failed while Host is in 'preparing for maintenance' state. Consider manual intervention: stopping/migrating Vms as Host's state will not turn to maintenance while VMs are still running on it.(VM: virt-vm-1, Source: 10.35.160.39, Destination: <UNKNOWN>).

2014-Nov-30, 12:53
Migration failed . Trying to migrate to another Host (VM: virt-vm-1, Source: 10.35.160.39, Destination: 10.35.160.57).

2014-Nov-30, 12:53
Migration failed: Failed to find the libvirt domain (VM: virt-vm-2, Source: 10.35.160.39, Destination: <UNKNOWN>).
	
2014-Nov-30, 12:53
Host 10.35.160.39 is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.
	
2014-Nov-30, 12:52
Migration completed (VM: virt-vm-3, Source: 10.35.160.39, Destination: 10.35.160.57, Duration: 30 seconds).
	
2014-Nov-30, 12:52
Migration completed (VM: virt-vm-4, Source: 10.35.160.39, Destination: 10.35.160.57, Duration: 30 seconds).
	
2014-Nov-30, 12:52
Migration completed (VM: virt-vm-0, Source: 10.35.160.39, Destination: 10.35.160.57, Duration: 27 seconds).

** Attaching the logs of the automation run described above:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Compute/view/3.5-git/view/Virt/job/3.5-git-compute-virt-vm_migration-iscsi/48/

Comment 11 Ilanit Stein 2014-11-30 12:08:03 UTC
Created attachment 962908 [details]
vt12 migration test logs

Comment 12 Ilanit Stein 2014-11-30 12:33:08 UTC
src host vdsm.log ERRORs:
======================== 

Thread-34::ERROR::2014-11-30 12:43:34,642::sdc::143::Storage.StorageDomainCache::(_findDomain) domain 1fa7ac1c-7db6-47dd-875b-18252b13ed78 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'1fa7ac1c-7db6-47dd-875b-18252b13ed78',)

Thread-34::DEBUG::2014-11-30 12:43:34,671::lvm::288::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0

Thread-102::DEBUG::2014-11-30 12:45:22,765::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 74 edom: 10 level: 2 message: argument unsupported: QEMU driver does not support <metadata> element
Thread-102::ERROR::2014-11-30 12:45:22,766::vm::363::vm.Vm::(_sampleCpuTune) vmId=`a83f691d-20b4-4324-ab65-97bc8ce0be66`::libvirt does not support metadata

Thread-469::ERROR::2014-11-30 12:53:09,700::sampling::475::vm.Vm::(collect) vmId=`d1a710b8-7aa1-412f-944a-71d586248158`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2870bd0>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 292, in _highWrite
    self._vm.extendDrivesIfNeeded()
  File "/usr/share/vdsm/virt/vm.py", line 2586, in extendDrivesIfNeeded
    extend = [x for x in self._getExtendCandidates()
  File "/usr/share/vdsm/virt/vm.py", line 2538, in _getExtendCandidates
    capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
    __connections.get(id(target)).pingLibvirt()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3642, in getLibVersion
    if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self)
libvirtError: internal error: client socket is closed
Thread-538::ERROR::2014-11-30 12:53:09,703::migration::161::vm.Vm::(_recover) vmId=`d1a710b8-7aa1-412f-944a-71d586248158`::internal error: client socket is closed
Thread-599::WARNING::2014-11-30 12:53:09,703::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-540::ERROR::2014-11-30 12:53:09,704::migration::161::vm.Vm::(_recover) vmId=`a83f691d-20b4-4324-ab65-97bc8ce0be66`::internal error: client socket is closed
Thread-599::CRITICAL::2014-11-30 12:53:09,706::libvirtconnection::137::root::(wrapper) taking calling process down.
MainThread::INFO::2014-11-30 12:53:09,707::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::DEBUG::2014-11-30 12:53:09,707::vdsm::58::vds::(sigtermHandler) Received signal 15
Thread-599::DEBUG::2014-11-30 12:53:09,708::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error: client socket is closed
MainThread::INFO::2014-11-30 12:53:09,708::momIF::91::MOM::(stop) Shutting down MOM
MainThread::DEBUG::2014-11-30 12:53:09,709::task::595::Storage.TaskManager.Task::(_updateState) Task=`79e37050-f2ba-4ea6-9b73-6d090af1c050`::moving from state init -> state preparing
MainThread::INFO::2014-11-30 12:53:09,709::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
Thread-11::DEBUG::2014-11-30 12:53:09,709::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
Thread-540::WARNING::2014-11-30 12:53:09,844::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-540::CRITICAL::2014-11-30 12:53:09,845::libvirtconnection::137::root::(wrapper) taking calling process down.
Thread-540::DEBUG::2014-11-30 12:53:09,845::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error: client socket is closed
Thread-540::ERROR::2014-11-30 12:53:09,845::migration::260::vm.Vm::(run) vmId=`a83f691d-20b4-4324-ab65-97bc8ce0be66`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 246, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 325, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
    __connections.get(id(target)).pingLibvirt()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3642, in getLibVersion
    if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self)
libvirtError: internal error: client socket is closed

Comment 13 Francesco Romani 2014-12-01 07:52:46 UTC
In the host logs I quite frequently see these messages:

Nov 30 12:44:04 puma24 kernel: type=1400 audit(1417344244.582:19): avc:  denied  { read } for  pid=12873 comm="sanlock" name="253:3" dev="sysfs" ino=23960 scontext=system_u:system_r:sanlock_t:s0-s0:c0.c1023 tcontext=system_u:object_r:sysfs_t:s0 tclass=lnk_file
Nov 30 12:44:04 puma24 kernel: type=1400 audit(1417344244.583:20): avc:  denied  { read } for  pid=12873 comm="sanlock" name="253:3" dev="sysfs" ino=23960 scontext=system_u:system_r:sanlock_t:s0-s0:c0.c1023 tcontext=system_u:object_r:sysfs_t:s0 tclass=lnk_file
Nov 30 12:44:04 puma24 kernel: type=1400 audit(1417344244.617:21): avc:  denied  { read } for  pid=12873 comm="sanlock" name="253:3" dev="sysfs" ino=23960 scontext=system_u:system_r:sanlock_t:s0-s0:c0.c1023 tcontext=system_u:object_r:sysfs_t:s0 tclass=lnk_file
Nov 30 12:44:04 puma24 kernel: type=1400 audit(1417344244.618:22): avc:  denied  { read } for  pid=12873 comm="sanlock" name="253:3" dev="sysfs" ino=23960 scontext=system_u:system_r:sanlock_t:s0-s0:c0.c1023 tcontext=system_u:object_r:sysfs_t:s0 tclass=lnk_file

There are known cases of misbehaviour with sanlock, selinux and libvirt.

Could this be another case of https://bugzilla.redhat.com/show_bug.cgi?id=1150243?

See in particular https://bugzilla.redhat.com/show_bug.cgi?id=1150243#c46 about how to check selinux configuration.

Comment 14 Ilanit Stein 2014-12-01 09:52:22 UTC
In Bug 1150243, the vdsm.log error was about a disk permission problem.

Here we have errors like:  "argument unsupported: QEMU driver does not support <metadata> element", and "client socket is closed"

Here are the booleans read from the server mentioned in comment 13:

[root@puma24 ~]# getsebool -a | grep virt
staff_use_svirt --> off
unprivuser_use_svirt --> off
virt_sandbox_use_audit --> on
virt_sandbox_use_netlink --> off
virt_sandbox_use_nfs --> off
virt_sandbox_use_samba --> off
virt_sandbox_use_sys_admin --> off
virt_transition_userdomain --> off
virt_use_comm --> off
virt_use_execmem --> off
virt_use_fusefs --> on
virt_use_nfs --> on
virt_use_rawip --> off
virt_use_samba --> on
virt_use_sanlock --> on
virt_use_usb --> on
virt_use_xserver --> off

Comment 15 Francesco Romani 2014-12-01 10:09:48 UTC
(In reply to Ilanit Stein from comment #14)
> On Bug 1150243, the vdsm.log error was on disk permission problem.

Indeed, caused by selinux denials similar to what we found here.

> 
> Here we have errors like:  "argument unsupported: QEMU driver does not
> support <metadata> element",

Not critical for this issue, just noise here.

> and "client socket is closed"

Which is closely related to the failed migration here.

> Here is the booleans read from the server mentioned in comment 13:
> 
> [root@puma24 ~]# getsebool -a | grep virt
> staff_use_svirt --> off
> unprivuser_use_svirt --> off
> virt_sandbox_use_audit --> on
> virt_sandbox_use_netlink --> off
> virt_sandbox_use_nfs --> off
> virt_sandbox_use_samba --> off
> virt_sandbox_use_sys_admin --> off
> virt_transition_userdomain --> off
> virt_use_comm --> off
> virt_use_execmem --> off
> virt_use_fusefs --> on
> virt_use_nfs --> on
> virt_use_rawip --> off
> virt_use_samba --> on
> virt_use_sanlock --> on
> virt_use_usb --> on
> virt_use_xserver --> off

seems ok, which is quite strange.

Comment 16 Michal Skrivanek 2014-12-01 10:15:20 UTC
Seems libvirt has crashed. Please get a coredump if possible. 
Jiri, any guess?


2014-11-30 10:53:09.689+0000: 20148: debug : remoteRelayDomainEventLifecycle:160 : Relaying domain lifecycle event 5 3
2014-11-30 10:53:09.689+0000: 20151: debug : virObjectRef:293 : OBJECT_REF: obj=0x7fec30162910
2014-11-30 10:53:09.689+0000: 20155: debug : virAccessManagerCheckDomain:231 : manager=0x7fec4f37f020(name=stack) driver=QEMU domain=0x7fec14016390 perm=1
2014-11-30 10:53:09.689+0000: 20151: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7fec30162910
2014-11-30 10:53:09.689+0000: 20148: debug : virNetMessageNew:44 : msg=0x7fec4f3947f0 tracked=0
2014-11-30 10:53:09.689+0000: 20155: debug : virAccessManagerCheckDomain:231 : manager=0x7fec4f3736f0(name=none) driver=QEMU domain=0x7fec14016390 perm=1
2014-11-30 10:53:09.689+0000: 20155: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7fec4f37f020
2014-11-30 10:53:09.689+0000: 20155: debug : virDomainFree:2428 : dom=0x7fec18000de0, (VM: name=virt-vm-1, uuid=a83f691d-20b4-4324-ab65-97bc8ce0be66)


     ====== end of log =====

2014-11-30 10:53:14.596+0000: 23598: info : libvirt version: 1.1.1, package: 29.el7_0.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2014-09-23-10:47:01, x86-019.build.eng.bos.redhat.com)

Comment 17 Francesco Romani 2014-12-01 10:27:15 UTC
Indeed, the libvirt crash is most worrisome.

Regarding the SELinux state, closer inspection of the host kernel logs reveals the following
(assuming the host clock is reliable, as it should be):

Roughly a minute before the reported failure, tunables were misconfigured:

[Nov30 12:43] SELinux: 2048 avtab hash slots, 100242 rules.
[  +0.016984] SELinux: 2048 avtab hash slots, 100242 rules.
[  +0.029484] SELinux:  8 users, 86 roles, 4818 types, 285 bools, 1 sens, 1024 cats
[  +0.000002] SELinux:  83 classes, 100242 rules
[  +0.997093] type=1403 audit(1417344152.744:6): policy loaded auid=0 ses=1068
[  +0.333779] type=1107 audit(1417344153.077:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=3)
 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
[ +34.258883] type=1405 audit(1417344187.309:8): bool=sanlock_use_fusefs val=1 old_val=0 auid=0 ses=1068
[  +0.000007] type=1405 audit(1417344187.309:9): bool=virt_use_nfs val=1 old_val=0 auid=0 ses=1068
[  +0.000003] type=1405 audit(1417344187.309:10): bool=virt_use_samba val=1 old_val=0 auid=0 ses=1068
[  +0.000003] type=1405 audit(1417344187.309:11): bool=sanlock_use_samba val=1 old_val=0 auid=0 ses=1068
[  +0.000003] type=1405 audit(1417344187.309:12): bool=virt_use_fusefs val=1 old_val=0 auid=0 ses=1068
[  +0.000003] type=1405 audit(1417344187.309:13): bool=virt_use_sanlock val=1 old_val=0 auid=0 ses=1068
[  +0.000003] type=1405 audit(1417344187.309:14): bool=sanlock_use_nfs val=1 old_val=0 auid=0 ses=1068

They were changed again after:

[Nov30 13:25] type=1405 audit(1417346653.533:320): bool=sanlock_use_fusefs val=0 old_val=1 auid=0 ses=1136
[  +0.000006] type=1405 audit(1417346653.533:321): bool=virt_use_nfs val=0 old_val=1 auid=0 ses=1136
[  +0.000004] type=1405 audit(1417346653.533:322): bool=virt_use_samba val=0 old_val=1 auid=0 ses=1136
[  +0.000003] type=1405 audit(1417346653.533:323): bool=sanlock_use_samba val=0 old_val=1 auid=0 ses=1136
[  +0.000002] type=1405 audit(1417346653.533:324): bool=virt_use_fusefs val=0 old_val=1 auid=0 ses=1136
[  +0.000004] type=1405 audit(1417346653.533:325): bool=virt_use_sanlock val=0 old_val=1 auid=0 ses=1136
[  +0.000002] type=1405 audit(1417346653.533:326): bool=sanlock_use_nfs val=0 old_val=1 auid=0 ses=1136

And eventually configured as expected and as reported:

[Nov30 13:26] ip_tables: (C) 2000-2006 Netfilter Core Team
[ +16.567195] type=1405 audit(1417346737.619:332): bool=sanlock_use_fusefs val=1 old_val=0 auid=0 ses=1161
[  +0.000006] type=1405 audit(1417346737.619:333): bool=virt_use_nfs val=1 old_val=0 auid=0 ses=1161
[  +0.000003] type=1405 audit(1417346737.619:334): bool=virt_use_samba val=1 old_val=0 auid=0 ses=1161
[  +0.000003] type=1405 audit(1417346737.619:335): bool=sanlock_use_samba val=1 old_val=0 auid=0 ses=1161
[  +0.000003] type=1405 audit(1417346737.619:336): bool=virt_use_fusefs val=1 old_val=0 auid=0 ses=1161
[  +0.000003] type=1405 audit(1417346737.619:337): bool=virt_use_sanlock val=1 old_val=0 auid=0 ses=1161
[  +0.000003] type=1405 audit(1417346737.619:338): bool=sanlock_use_nfs val=1 old_val=0 auid=0 ses=1161


So it seems they were misconfigured only during the failed attempt.

Comment 19 Ilanit Stein 2014-12-07 20:07:51 UTC
Investigation of several more test runs found that there was no libvirt crash (no core dump was generated).
Also, even when SELinux was in disabled mode, the automatic migration test was still failing,
in the following flow:
1. Move host1 (with 5 running VMs) to maintenance.
2. Activate host1.
3. Move host2 to maintenance => migration fails (the host stays in 'preparing for maintenance').

Comment 20 Raz Tamir 2014-12-14 14:52:06 UTC
Created attachment 968495 [details]
new logs

Still occurs - reopening
relevant logs attached

Comment 21 Michal Skrivanek 2014-12-15 09:13:04 UTC
vdsm shows a similar libvirt crash on 14.12. at 12:11 on host 10.35.160.67.
The same appears in the system log:
Dec 14 12:11:36 puma29 systemd: libvirtd.service: main process exited, code=killed, status=11/SEGV

There is also SELinux switching on and off. Why?
It seems it was correct when the crash occurred, though it's hard to say; there are some AVC denials from sanlock at 12:11 as well.


- libvirt logs do not cover the relevant time
- please get a libvirt core dump so we can investigate a possible libvirt problem
- also capture the exact state of selinux when it happens
- it might also be worth trying the new selinux-policy from bug 1141658

Comment 22 Francesco Romani 2014-12-15 09:13:44 UTC
(In reply to ratamir from comment #20)
> Created attachment 968495 [details]
> new logs
> 
> Still occurs - reopening
> relevant logs attached

Hi,

The migration direction is 65 -> 67

2014-12-14 12:11:15,516 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-25) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:18,539 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-38) [698637f2] RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:21,558 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-7) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:24,580 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-47) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:27,613 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:30,659 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-63) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:33,678 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-73) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done
2014-12-14 12:11:36,700 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id 0e918f2f-56e7-430d-b791-67c2ec247eb3 is migrating to vds 10.35.160.65 ignoring it in the refresh until migration is done

But on destination, libvirt crashed
Dec 14 12:11:36 puma29 systemd: libvirtd.service: main process exited, code=killed, status=11/SEGV
Dec 14 12:11:36 puma29 systemd: Unit libvirtd.service entered failed state.
Dec 14 12:11:36 puma29 journal: vdsm root CRITICAL taking calling process down.


(On the bright side, VDSM seems to have reacted correctly and shut itself down.)

What we need is to reproduce this, grab a core dump from libvirt, and file a libvirt bug.

Comment 23 Ilanit Stein 2014-12-15 11:48:26 UTC
This bug was closed for insufficient data, since the core dump required to understand the failure was missing.
 
As was mentioned before, core dump generation was enabled on the hosts in the migration test, where it is constantly failing,
and no core dump was generated. There wasn't a libvirt crash.

We may consider more logging here.
 
In both the migration test and the storage test referred to in comment #20,
there was an error in vdsm.log about "libvirt socket is closed".

It seems the failure is iSCSI-specific:
All the repeated failures in the migration test were with iSCSI storage.
Recent migration test runs with NFS storage all PASS.

Comment 24 Raz Tamir 2014-12-15 11:52:17 UTC
comment #23

Comment 25 Michal Skrivanek 2014-12-15 12:53:03 UTC
(In reply to Ilanit Stein from comment #23)
> This bug was closed for insufficient data, since core dump was missing,
> required  to understand the failure.  
> As was mentioned before, the core dump generation was enabled on the hosts,
> in migration test, where it is constantly failing,
> and the core dump was not generated. There wasn't a libvirt crash.

There was a crash in this last occurrence (logs from comment #20). If the core dump is not there, then it was not enabled correctly, sorry.

> We may consider more logging here.

Logging on the vdsm side is fine. Not sure if anything can be added in libvirt; most likely not, as it would be too detailed and we don't know where it crashes yet.

> Both on migration test, and storage test, referred to in comment #20,
> there was error on vdsm.log on "libvirt socket is closed"
>
> Seems that the failure is iscsi specific:
> All the repeated failures on migration test, were with iscsi storage.
> Recent migration test runs, with nfs storage, all PASS.

Can you confirm you see the same when SELinux was in disabled mode (you said so in comment #19)?

Comment 26 Michal Skrivanek 2014-12-15 12:55:40 UTC
(In reply to ratamir from comment #24)
> comment #23

sorry, but you didn't answer my questions

Comment 27 Raz Tamir 2014-12-15 13:03:30 UTC
Hi Michal,
All the relevant logs are attached. I can't tell what the exact state of SELinux was when it happened.
Also, about the libvirt core-dump, in comment #23, Ilanit said:
"the core dump generation was enabled on the hosts, in migration test, where it is constantly failing,
and the core dump was not generated. There wasn't a libvirt crash."

Comment 28 Francesco Romani 2014-12-15 13:06:35 UTC
(In reply to Ilanit Stein from comment #23)
> As was mentioned before, the core dump generation was enabled on the hosts,
> in migration test, where it is constantly failing,
> and the core dump was not generated. There wasn't a libvirt crash.

I beg to disagree.

This is hard evidence of a libvirt crash:

Dec 14 12:11:36 puma29 systemd: libvirtd.service: main process exited, code=killed, status=11/SEGV
Dec 14 12:11:36 puma29 systemd: Unit libvirtd.service entered failed state.

The libvirt daemon received a segmentation violation and was brought down.
Or, in short, it crashed.

The lack of a core dump is a different issue.

Comment 29 Michal Skrivanek 2014-12-15 13:22:37 UTC
Current plan we agreed on with Ilanit:
Let's reserve 2 Jenkins hosts, disable SELinux and puppet, enable the core dump again, and run it on iSCSI/NFS.
Before starting, it would be good to check that core dumps are really generated (feel free to ping Francesco or me).

Comment 31 Francesco Romani 2014-12-15 13:27:33 UTC
(In reply to Michal Skrivanek from comment #29)

> Before the start it would be good to check core dumps are really generated
> (feel free to ping francesco/me)

A possible way to do that:

as root, on an idle box:

# kill -11 $( pidof libvirtd )
or
# kill -4 $( pidof libvirtd )
or
# kill -6 $( pidof libvirtd )

then look for the core dump
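
A quick way to sanity-check beforehand where a core would land (an illustrative sketch; it assumes a stock RHEL 7 host with a systemd-managed libvirtd, where abrt may intercept cores through a pipe in core_pattern):

import subprocess

# Where the kernel delivers core files; a leading '|' means a helper
# such as abrt intercepts them, so look in its spool directory instead.
with open("/proc/sys/kernel/core_pattern") as f:
    print("core_pattern: " + f.read().strip())

# Core size limit applied to the libvirtd unit; 0 means no core file is written.
print(subprocess.check_output(
    ["systemctl", "show", "libvirtd.service", "-p", "LimitCORE"]).decode().strip())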

Comment 33 Francesco Romani 2014-12-16 16:23:26 UTC
libvirt issue. Coredump found, bug filed:
https://bugzilla.redhat.com/show_bug.cgi?id=1174869

Comment 34 Michal Skrivanek 2014-12-17 09:00:37 UTC
please verify using libvirt-1.1.1-29.el7_0.4

Comment 36 Francesco Romani 2014-12-17 09:14:12 UTC
(In reply to Michal Skrivanek from comment #34)
> please verify using libvirt-1.1.1-29.el7_0.4

Surely this is needed and will help with this BZ.

I'd also add that I believe we'll need the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1162208 for additional safety, especially after migration, since migration is the most complex flow, with lots of moving parts and room for refcount mistakes.

Comment 37 Jiri Denemark 2014-12-17 09:29:10 UTC
Bug 1162208 is just a 7.1 version of the bug fixed in libvirt-1.1.1-29.el7_0.4. I think you are rather interested in bug 1141159, which is about fixing the reference counting of domain objects.

Comment 38 Jiri Denemark 2014-12-17 12:08:28 UTC
To avoid possible confusion, comment 37 was a reply to comment 36.

Comment 39 Ilanit Stein 2014-12-22 07:16:45 UTC
Verified on vt13.4:
The migration tests (iSCSI storage) that were constantly failing have passed,
using libvirt-1.1.1-29.el7_0.4.

Comment 40 Omer Frenkel 2015-02-17 08:29:23 UTC
RHEV-M 3.5.0 has been released

