Bug 1858553 - [OSP16.1] Instances with DPDK fail to spawn - Failed to bind socket to /var/lib/vhost_sockets Permission denied
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Cédric Jeanneret
QA Contact: Vadim Khitrin
 
Reported: 2020-07-19 07:16 UTC by Vadim Khitrin
Modified: 2020-07-31 05:55 UTC
CC List: 14 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200616081531.396affd.el8ost
Doc Type: No Doc Update
Last Closed: 2020-07-29 07:53:30 UTC


Attachments
selinux_audit_1858553.log (3.33 MB, text/plain)
2020-07-19 17:38 UTC, Vadim Khitrin


Links
Launchpad 1888216 - last updated 2020-07-20 11:25:13 UTC
OpenStack gerrit 741921 - MERGED: Drop the relabel flag for bind-mount - last updated 2020-08-21 11:08:04 UTC
Red Hat Product Errata RHBA-2020:3148 - last updated 2020-07-29 07:54:24 UTC

Description Vadim Khitrin 2020-07-19 07:16:12 UTC
Description of problem:
When attempting to spawn an instance with DPDK interface attached, the creation fails and the following error appears in nova-compute.log on the compute node:
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [req-99508106-719a-4ca0-b23b-190a9b66c8a1 604e614dde284a959f6834a07cf15040 a6e139dcad2943a38a178635c74abd45 - default default] [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d] Instance failed to spawn: libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-07-16T16:19:32.662500Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhuca5e1e91-5e,server: Failed to bind socket to /var/lib/vhost_sockets/vhuca5e1e91-5e: Permission denied
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d] Traceback (most recent call last):
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2663, in _build_resources
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     yield resources
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2437, in _build_and_run_instance
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     block_device_info=block_device_info)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3647, in spawn
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     cleanup_instance_disks=created_disks)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6473, in _create_domain_and_network
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     cleanup_instance_disks=cleanup_instance_disks)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     self.force_reraise()
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     six.reraise(self.type_, self.value, self.tb)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     raise value
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6439, in _create_domain_and_network
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     post_xml_callback=post_xml_callback)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6368, in _create_domain
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     guest.launch(pause=pause)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/guest.py", line 143, in launch
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     self._encoded_xml, errors='ignore')
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     self.force_reraise()
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     six.reraise(self.type_, self.value, self.tb)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     raise value
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/guest.py", line 138, in launch
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     return self._domain.createWithFlags(flags)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 190, in doit
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 148, in proxy_call
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     rv = execute(f, *args, **kwargs)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 129, in execute
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     six.reraise(c, e, tb)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     raise value
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 83, in tworker
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     rv = meth(*args, **kwargs)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 1265, in createWithFlags
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d] libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-07-16T16:19:32.662500Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhuca5e1e91-5e,server: Failed to bind socket to /var/lib/vhost_sockets/vhuca5e1e91-5e: Permission denied
2020-07-16 16:19:33.700 7 ERROR nova.compute.manager [instance: a3fafcd5-99a2-4f8a-b6b7-768c4631411d]

This is a regression; it was working on previous 16.1 composes.

Version-Release number of selected component (if applicable):
Compose:  RHOS-16.1-RHEL-8-20200714.n.0
network-scripts-openvswitch2.13-2.13.0-25.el8fdp.1.x86_64
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch
openvswitch2.13-2.13.0-25.el8fdp.1.x86_64
rhosp-openvswitch-2.13-8.el8ost.noarch
dpdk-19.11-4.el8.x86_64
puppet-nova-15.5.1-0.20200608173427.c15e37c.el8ost.noarch
python3-novaclient-15.1.0-0.20200310162625.cd396b8.el8ost.noarch
puppet-neutron-15.5.1-0.20200514103419.0a45ec7.el8ost.noarch
python3-neutronclient-6.14.0-0.20200310192910.115f60f.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a cloud with DPDK capabilities
2. Attempt to spawn an instance

Actual results:
Creation of an instance with DPDK ports fails.

Expected results:
Instances are created successfully.

Additional info:
Will attach sos reports in comments.

Comment 2 Vadim Khitrin 2020-07-19 17:38:00 UTC
Created attachment 1701677 [details]
selinux_audit_1858553.log

This is an SELinux issue.
Once I set SELinux to Permissive mode, I was able to spawn an instance.

Attaching audit.log from the compute node; search for 'vhufb58f8d2-e0' to see the failing messages.

Comment 3 Vadim Khitrin 2020-07-19 17:42:22 UTC
Sorry, regarding comment #2: please grep for 'vhost' rather than 'vhufb58f8d2-e0'.
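
For reference, a minimal sketch of the commands typically used to confirm the mode and pull these denials out of the audit log (assuming root access on the compute node; the grep string follows the correction above):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Confirm the current SELinux mode on the compute node.
getenforce

# Search the raw audit log for the vhost-user socket denials.
grep vhost /var/log/audit/audit.log

# Or let ausearch pull out just the AVC records.
ausearch -m AVC | grep vhost
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~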

Comment 4 Saravanan KR 2020-07-20 06:32:19 UTC
audit.log for denial:
---------------------
type=AVC msg=audit(1595173011.584:8090): avc:  denied  { search } for  pid=8069 comm="vhost_reconn" name="vhost_sockets" dev="sda2" ino=1342177472 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:object_r:container_ro_file_t:s0 tclass=dir permissive=0


openstack-selinux commit for this specific directory:
-----------------------------------------------------
https://github.com/redhat-openstack/openstack-selinux/commit/b09ec67785aa3f3a8751b15ac695effa17151876


Expected tcontext - system_u:object_r:virt_cache_t:s0
Actual tcontext   - system_u:object_r:container_ro_file_t:s0
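
A small sketch of how this comparison can be made on the compute node (standard SELinux tooling, not taken from the original report):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Label currently on the directory (shows container_ro_file_t in this case).
ls -Zd /var/lib/vhost_sockets

# Label the loaded policy expects for that path (should be virt_cache_t).
matchpathcon /var/lib/vhost_sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~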


As per this code in THT, the directory's "setype" should be "virt_cache_t", which is the expected value. But either it is not applied, or it is changed by some other operation to "container_ro_file_t", which is causing this issue. @Vadim, do you still have the cluster available so we can take a look?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- name: create directory for vhost-user sockets with qemu ownership
  file:
    path: /var/lib/vhost_sockets
    state: directory
    owner: qemu
    group: {get_attr: [RoleParametersValue, value, vhostuser_socket_group]}
    setype: virt_cache_t
    seuser: system_u
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Comment 6 Julie Pichon 2020-07-20 08:24:03 UTC
Using 'setype' in THT only applies the label once; something like a restorecon would revert it. Ideally this kind of label should be applied with sefcontext [1] to make the changes persistent.

There have been a number of recent changes around libvirt to resolve other issues (e.g. bug 1841822). However, looking at the related patch, it doesn't look like /var/lib/vhost_sockets was directly affected, so I'm not sure why the label would be different now. Cedric, does anything ring a bell that would cause that directory to go from virt_cache_t to container_ro_file_t?


[1] https://docs.ansible.com/ansible/latest/modules/sefcontext_module.html
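
For illustration, the persistent-labeling approach described above boils down to something like the following (the sefcontext module wraps semanage fcontext; the exact file-context pattern here is an assumption, not necessarily the one shipped by openstack-selinux):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Register a persistent file-context rule so restorecon keeps the label...
semanage fcontext -a -t virt_cache_t '/var/lib/vhost_sockets(/.*)?'

# ...then re-apply it to the existing directory.
restorecon -Rv /var/lib/vhost_sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~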

Comment 10 Julie Pichon 2020-07-20 08:56:01 UTC
Note that the logs provided are from enforcing mode. If we decide to add the equivalent rules for openvswitch_t + container_ro_file_t rather than dropping ":z", the issue should first be reproduced in permissive mode to ensure we have all the denials covered. Thanks! The commit linked in comment 4 indicates there's probably at least one more...
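
A rough sketch of that workflow, in case it is needed (the module name local_vhost_sockets is only a placeholder):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Switch to permissive so every denial is logged instead of only the first one.
setenforce 0

# ... reproduce the instance spawn with a DPDK port attached ...

# Summarize the collected AVCs into a candidate local policy module.
ausearch -m AVC -ts recent | audit2allow -M local_vhost_sockets

# Return to enforcing once the denials have been collected.
setenforce 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~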

Comment 11 Saravanan KR 2020-07-20 09:06:45 UTC
(In reply to Cédric Jeanneret from comment #9)

> @Saravanan, @Vadim: would you be able to deploy with a small edit in the
> tripleo-heat-templates?
> If so, please drop the ":z" here:
> https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/
> deployment/nova/nova-libvirt-container-puppet.yaml#L722
> 

Yes, it works. I removed ":z", ran restorecon, and restarted the container, and I am able to create the VM even in enforcing mode.
Now I am re-deploying the node with ":z" removed from the templates to confirm the fix. Will update once confirmed.

Do you see a root cause for why this change causes an issue only with the recent puddles? Looking back at the history, ":z" was added more than a year ago, so I am wondering what triggered this issue with the recent puddle.
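
For context, the manual test described above amounts to roughly the following on the compute node (the container name nova_libvirt is an assumption based on the usual TripleO naming, and the template edit itself is done separately):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Re-apply the label expected by the policy after dropping ':z' from the bind mount.
restorecon -Rv /var/lib/vhost_sockets

# Restart the libvirt container so the mount is recreated without relabeling.
podman restart nova_libvirt

# The directory should now report virt_cache_t again.
ls -Zd /var/lib/vhost_sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~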

Comment 13 Cédric Jeanneret 2020-07-20 09:22:10 UTC
(In reply to Saravanan KR from comment #11)
> (In reply to Cédric Jeanneret from comment #9)
> 
> > @Saravanan, @Vadim: would you be able to deploy with a small edit in the
> > tripleo-heat-templates?
> > If so, please drop the ":z" here:
> > https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/
> > deployment/nova/nova-libvirt-container-puppet.yaml#L722
> > 
> 
> Yes, it works. I removed ":z", ran restorecon, and restarted the container, and I am able to create the VM even in enforcing mode.
> Now I am re-deploying the node with ":z" removed from the templates to confirm the fix. Will update once confirmed.

That's great news already - lemme know how it goes once you get the full deploy :). But I'm pretty sure it will be OK, seeing how this location is used. Julie is preparing the policy in parallel, "just in case".

The good news is, we probably won't need to block GA for this issue. Once I get your go-ahead, I can prepare the upstream patch and the downstream backport so we're ready once we get the accepted blocker flag.

Cheers,

C.

Comment 14 Julie Pichon 2020-07-20 09:41:24 UTC
(In reply to Julie Pichon from comment #6)
> Using 'setype' in THT only applies the label once, something like a
> restorecon would revert it. Ideally this kind of labels should be applied
> with sefcontext [1] to make the changes persistent.

Scratch this comment, we are already setting it in policy: https://github.com/redhat-openstack/openstack-selinux/blob/master/local_settings.sh.in#L80

Comment 15 Vadim Khitrin 2020-07-20 10:26:16 UTC
Saravanan has redeployed the environment with ":z" removed from the tripleo templates.
The deployment passed successfully and we were able to spawn instances with DPDK ports attached while the compute nodes have SELinux in enforcing mode.

Thanks everyone for the help!
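
For context, the verification here boils down to checking that SELinux stays enforcing on the compute nodes and that a guest attached to a DPDK (vhost-user) port boots; a rough sketch with placeholder flavor, image and network names:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# On each compute node: must report Enforcing.
getenforce

# From an authenticated overcloud environment: boot a guest on a DPDK-backed network.
openstack server create --flavor dpdk.flavor --image rhel-guest \
    --network dpdk-net test-dpdk-instance

# Expect ACTIVE instead of ERROR.
openstack server show test-dpdk-instance -f value -c status
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~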

Comment 16 Cédric Jeanneret 2020-07-20 11:39:24 UTC
Blocker requested in parallel - we should hopefully get things sorted out today.

The upstream patch is already done, and there's a Launchpad bug to make the upstream backport easier.
The downstream backport will start as soon as we get the missing flags.

Comment 25 errata-xmlrpc 2020-07-29 07:53:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

