DescriptionKambiz Aghaiepour
2015-12-07 21:14:35 UTC
Description of problem:
When attempting to launch any new instances, we would get obscure errors in the libvirt logs as well as the nova compute logs and instances would not launch. Only 2 instances were on the compute node, so it was not heavily loaded.
Systemd-machined might be related to the issue, and I will try and include logs and various outputs.
Version-Release number of selected component (if applicable):
libvirt-daemon-driver-secret-1.2.8-16.el7_1.5.x86_64
python-nova-2015.1.2-7.el7ost.noarch
openstack-nova-console-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-network-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-lxc-1.2.8-16.el7_1.5.x86_64
openstack-nova-cert-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-storage-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.5.x86_64
openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch
libvirt-client-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.5.x86_64
libvirt-1.2.8-16.el7_1.5.x86_64
openstack-nova-common-2015.1.2-7.el7ost.noarch
openstack-nova-api-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-network-1.2.8-16.el7_1.5.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
openstack-nova-scheduler-2015.1.2-7.el7ost.noarch
libvirt-daemon-driver-interface-1.2.8-16.el7_1.5.x86_64
python-novaclient-2.23.0-2.el7ost.noarch
openstack-nova-compute-2015.1.2-7.el7ost.noarch
libvirt-daemon-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.5.x86_64
openstack-nova-conductor-2015.1.2-7.el7ost.noarch
How reproducible:
We cannot currently reproduce this, as the compute node reboot has cleared the error. However, here are what we observed:
[root@overcloud-compute-0 ~]# systemctl status systemd-machined
systemd-machined.service - Virtual Machine and Container Registration Service
Loaded: loaded (/usr/lib/systemd/system/systemd-machined.service; static)
Active: active (running) since Mon 2015-12-07 14:46:41 EST; 55min ago
Docs: man:systemd-machined.service(8)
http://www.freedesktop.org/wiki/Software/systemd/machined
Main PID: 27488 (systemd-machine)
CGroup: /system.slice/systemd-machined.service
└─27488 /usr/lib/systemd/systemd-machined
Dec 07 14:46:41 overcloud-compute-0.localdomain systemd[1]: Started Virtual Machine and Container Registration Service.
Dec 07 14:56:07 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 14:58:46 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 15:20:59 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 15:23:02 overcloud-compute-0.localdomain systemd-machined[27488]: New machine qemu-instance-0000437e.
Dec 07 15:26:19 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Dec 07 15:29:39 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
Hint: Some lines were ellipsized, use -l to show in full.
The full line from above is:
[...]
Dec 7 15:29:39 overcloud-compute-0 systemd-machined: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Dec 7 15:44:44 overcloud-compute-0 systemd-machined: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[...]
This did not appear to be related to selinux, as we attempted to run with permissive instead of enforcing.
The libvirt logs also would show:
2015-12-07 20:44:19.707+0000: 29860: info : qemuDomainDefineXML:7006 : Creating domain 'instance-0000439e'
2015-12-07 20:44:19.714+0000: 29855: info : virNetDevProbeVnetHdr:203 : Enabling IFF_VNET_HDR
2015-12-07 20:44:44.783+0000: 29855: error : virDBusCall:1429 : error from service: CreateMachine: Input/output error
2015-12-07 20:44:55.315+0000: 29857: info : qemuDomainUndefineFlags:7110 : Undefining domain 'instance-0000439e'
2015-12-07 20:45:22.960+0000: 29854: error : qemuMonitorIO:666 : internal error: End of file from monitor
(We enabled additional INFO level logging to troubleshoot)
We were able to export images using: rbd export images/<some id> /tmp/some-file-path
in order to ensure ceph was working correctly.
The nova-compute log on the compute host would show the following:
2015-12-07 15:44:55.192 29925 ERROR nova.compute.manager [req-a6e9b323-4ca5-4f96-b6d9-b3d9cc4210e1 007e08eeb11a4849ad0078b5b592c1f8 99be85c7835141208aee5fcda2398a9a - - -] [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] Instance failed to spawn
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] Traceback (most recent call last):
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2486, in _build_resources
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] yield resources
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2358, in _build_and_run_instance
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] block_device_info=block_device_info)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2492, in spawn
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] block_device_info=block_device_info)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4571, in _create_domain_and_network
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] power_on=power_on)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4502, in _create_domain
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] LOG.error(err)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] six.reraise(self.type_, self.value, self.tb)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4492, in _create_domain
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] domain.createWithFlags(launch_flags)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 183, in doit
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 141, in proxy_call
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] rv = execute(f, *args, **kwargs)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 122, in execute
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] six.reraise(c, e, tb)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 80, in tworker
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] rv = meth(*args, **kwargs)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 996, in createWithFlags
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] libvirtError: error from service: CreateMachine: Input/output error
Ultimately we rebooted the compute node and were once again able to launch instances.
Is this a known issue? Not sure if we need to upgrade from the package set listed above.
Thanks!
Kambiz
So libvirt talks to systemd-machined over dbus to create the machine scope, but we seem systemd-machined is reporting an error (presumably from when it talks to systemd):
Dec 07 14:56:07 overcloud-compute-0.localdomain systemd-machined[27488]: Failed to start machine scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message b...n was broken.
This error in turn causes the libvirt error message
2015-12-07 15:44:55.192 29925 TRACE nova.compute.manager [instance: e28a2e4b-20f6-42f4-9a51-d8ef0fc0e1ae] libvirtError: error from service: CreateMachine: Input/output error
So I believe this is probably a systemd bug.