Bug 1289326

Summary: instances not launching until compute node was rebooted
Product: Red Hat Enterprise Linux 7
Reporter: Kambiz Aghaiepour <kambiz>
Component: systemd
Assignee: systemd-maint
Status: CLOSED INSUFFICIENT_DATA
QA Contact: qe-baseos-daemons
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.0
CC: bblaskov, berrange, dasmith, dmsimard, eglynn, jruemker, jsynacek, kambiz, kchamart, lnykryn, pdhange, rbryant, sbauza, sferdjao, sgordon, srevivo, systemd-maint-list, vromanso, wfoster, wlehman
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-05 12:33:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1298243

Description Kambiz Aghaiepour 2015-12-07 21:14:35 UTC
Description of problem:

When attempting to launch any new instances, we would get obscure errors in both the libvirt and nova-compute logs, and the instances would not launch.  Only 2 instances were running on the compute node, so it was not heavily loaded.

systemd-machined might be related to the issue, so I am including logs and various command output below.

Version-Release number of selected component (if applicable):

libvirt-daemon-driver-secret-1.2.8-16.el7_1.5.x86_64
python-nova-2015.1.2-7.el7ost.noarch
openstack-nova-console-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-network-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-lxc-1.2.8-16.el7_1.5.x86_64
openstack-nova-cert-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-storage-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.5.x86_64
openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch
libvirt-client-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.5.x86_64
libvirt-1.2.8-16.el7_1.5.x86_64
openstack-nova-common-2015.1.2-7.el7ost.noarch
openstack-nova-api-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-network-1.2.8-16.el7_1.5.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
openstack-nova-scheduler-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-interface-1.2.8-16.el7_1.5.x86_64
python-novaclient-2.23.0-2.el7ost.noarch
openstack-nova-compute-2015.1.2-7.el7ost.noarch
libvirt-daemon-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.5.x86_64
openstack-nova-conductor-2015.1.2-7.el7ost.noarch



How reproducible:

We cannot currently reproduce this, as rebooting the compute node cleared the error.  However, here is what we observed:

[root@overcloud-compute-0 ~]# systemctl status systemd-machined
systemd-machined.service - Virtual Machine and Container Registration Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-machined.service; static)
   Active: active (running) since Mon 2015-12-07 14:46:41 EST; 55min ago
     Docs: man:systemd-machined.service(8)
           http://www.freedesktop.org/wiki/Software/systemd/machined
 Main PID: 27488 (systemd-machine)
   CGroup: /system.slice/systemd-machined.service
           └─27488 /usr/lib/systemd/systemd-machined

Dec 07 14:46:41 overcloud-compute-0.localdomain systemd[1]: Started Virtual Machine and Container Registration Service.
Dec 07 14:56:07 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 14:58:46 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 15:20:59 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 15:23:02 overcloud-compute-0.localdomain systemd-machined[27488]: New machine qemu-instance-0000437e.
Dec 07 15:26:19 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 15:29:39 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Hint: Some lines were ellipsized, use -l to show in full.

The full text of the ellipsized lines above is:

[...]
Dec  7 15:29:39 overcloud-compute-0 systemd-machined: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Dec  7 15:44:44 overcloud-compute-0 systemd-machined: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[...]
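Should the problem recur, the untruncated journal entries can also be pulled directly with, for example (illustrative invocations):

systemctl -l status systemd-machined       # -l disables the ellipsizing shown above
journalctl -u systemd-machined --no-pager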

This did not appear to be SELinux-related; running with SELinux in permissive mode instead of enforcing made no difference.
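For reference, a minimal way to switch SELinux to permissive for such a test (the exact steps used here may have differed):

getenforce       # show the current mode
setenforce 0     # permissive until the next reboot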

The libvirt logs would also show:

2015-12-07 20:44:19.707+0000: 29860: info : qemuDomainDefineXML:7006 : Creating domain 'instance-0000439e'
2015-12-07 20:44:19.714+0000: 29855: info : virNetDevProbeVnetHdr:203 : Enabling IFF_VNET_HDR
2015-12-07 20:44:44.783+0000: 29855: error : virDBusCall:1429 : error from service: CreateMachine: Input/output error
2015-12-07 20:44:55.315+0000: 29857: info : qemuDomainUndefineFlags:7110 : Undefining domain 'instance-0000439e'
2015-12-07 20:45:22.960+0000: 29854: error : qemuMonitorIO:666 : internal error: End of file from monitor

(We enabled additional INFO-level logging to troubleshoot.)
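A sketch of how that extra verbosity can be enabled in /etc/libvirt/libvirtd.conf (our exact settings may have differed; libvirtd needs a restart afterwards):

log_level = 2                                             # 2 = info
log_outputs = "2:file:/var/log/libvirt/libvirtd.log"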

To ensure Ceph was working correctly, we were able to export images using:  rbd export images/<some id> /tmp/some-file-path

The nova-compute log on the compute host would show the following:

2015-12-07 15:44:55.192 29925 ERROR nova.compute.manager [req-a6e9b323-4ca5-4f96-b6d9-b3d9cc4210e1 007e08eeb11a4849ad0078b5b592c1f8 99be85c7835141208aee5fcda2398a9a - - -] [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] Instance failed to spawn
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] Traceback (most recent call last):
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2486, in _build_resources
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     yield resources
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2358, in _build_and_run_instance
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     block_device_info=block_device_info)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2492, in spawn
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     block_device_info=block_device_info)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4571, in _create_domain_and_network
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     power_on=power_on)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4502, in _create_domain
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     LOG.error(err)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     six.reraise(self.type_, self.value, self.tb)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4492, in _create_domain
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     domain.createWithFlags(launch_flags)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 183, in doit
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 141, in proxy_call
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     rv = execute(f, *args, **kwargs)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 122, in execute
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     six.reraise(c, e, tb)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 80, in tworker
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     rv = meth(*args, **kwargs)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 996, in createWithFlags
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] libvirtError: error from service: CreateMachine: Input/output error


Ultimately we rebooted the compute node and were once again able to launch instances.

Is this a known issue?  We are not sure whether we need to upgrade from the package set listed above.

Thanks!
Kambiz

Comment 2 Daniel Berrangé 2015-12-11 15:24:18 UTC
So libvirt talks to systemd-machined over D-Bus to create the machine scope, but we can see that systemd-machined is reporting an error (presumably from when it talks to systemd):

Dec 07 14:56:07 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.

This error in turn causes the libvirt error message:

2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] libvirtError: error from service: CreateMachine: Input/output error

So I believe this is probably a systemd bug.
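If this recurs before a reboot, something like the following could help confirm whether machined is still responsive on the bus (illustrative commands, not necessarily what was run here):

machinectl list                            # does machined still answer D-Bus requests?
busctl status org.freedesktop.machine1     # show the process currently owning the machine1 bus name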

Comment 4 Lukáš Nykrýn 2015-12-14 11:52:06 UTC
What version of systemd are you using? Can you reproduce it with 219-19?
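For reference, the installed version can be checked on the compute node with:

rpm -q systemd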