Bug 1846020
Summary: | [OSP16.1] env constantly times out when creating an overcloud instance | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Michele Baldessari <michele>
Component: | openstack-containers | Assignee: | Michele Baldessari <michele>
Status: | CLOSED ERRATA | QA Contact: | nlevinki <nlevinki>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 16.1 (Train) | CC: | bdobreli, bfournie, gael_rehault, hjensas, jamsmith, kiyyappa, lmiccini, m.andre, nlevinki, oblaut, pkomarov, tquinlan
Target Milestone: | z2 | Keywords: | Triaged
Target Release: | 16.1 (Train on RHEL 8.2) | |
Hardware: | x86_64 | |
OS: | Linux | |
Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-1.20200914170155.29a02c1.el8ost | Doc Type: | If docs needed, set a value
Last Closed: | 2020-10-28 19:02:05 UTC | Type: | Bug
Bug Blocks: | 1186913 | |
Comment 3
Bob Fournier
2020-06-16 17:06:58 UTC
Also nova-conductor.log on the controller node:

```
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager [req-6e4c5924-9dcd-4886-b4e1-a0e8fe3a6153 95fa5d0d87844f1798056e7c1e8fed0d 50865e7763d54969a5a02ca180c974e4 - default default] Failed to schedule instances: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 0cf5fca35a6345efae8e5783b988a81b
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager Traceback (most recent call last):
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     return self._queues[msg_id].get(block=True, timeout=timeout)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     return waiter.wait()
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     return get_hub().switch()
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     return self.greenlet.switch()
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager queue.Empty
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager During handling of the above exception, another exception occurred:
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager Traceback (most recent call last):
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1360, in schedule_and_build_instances
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     instance_uuids, return_alternates=True)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 839, in _schedule_instances
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     return_alternates=return_alternates)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     instance_uuids, return_objects, return_alternates)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     return cctxt.call(ctxt, 'select_destinations', **msg_args)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     transport_options=self.transport_options)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     transport_options=transport_options)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     transport_options=transport_options)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     call_monitor_timeout)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     message = self.waiters.get(msg_id, timeout=timeout)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager     'to message ID %s' % msg_id)
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 0cf5fca35a6345efae8e5783b988a81b
2020-06-16 16:54:28.247 24 ERROR nova.conductor.manager
2020-06-16 16:54:28.266 24 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 0cf5fca35a6345efae8e5783b988a81b
2020-06-16 16:54:28.266 24 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 0cf5fca35a6345efae8e5783b988a81b
2020-06-16 16:54:28.300 24 WARNING nova.scheduler.utils [req-6e4c5924-9dcd-4886-b4e1-a0e8fe3a6153 95fa5d0d87844f1798056e7c1e8fed0d 50865e7763d54969a5a02ca180c974e4 - default default] Failed to compute_task_build_instances: Timed out waiting for a reply to message ID 0cf5fca35a6345efae8e5783b988a81b: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 0cf5fca35a6345efae8e5783b988a81b
```

First of all, I think we can confirm that the timeout issue is due to some SELinux denials:

```
type=USER_AVC msg=audit(1592337775.860:74119): pid=1284 uid=81 auid=4294967295 ses=4294967295 subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 msg='avc: denied { send_msg } for msgtype=method_call interface=org.freedesktop.DBus member=Hello dest=org.freedesktop.DBus spid=406228 scontext=system_u:system_r:container_t:s0:c162,c886 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=dbus permissive=0 exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?'UID="dbus" AUID="unset" SAUID="dbus"
type=USER_AVC msg=audit(1592337805.864:74233): pid=1284 uid=81 auid=4294967295 ses=4294967295 subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 msg='avc: denied { send_msg } for msgtype=method_call interface=org.freedesktop.DBus member=Hello dest=org.freedesktop.DBus spid=406228 scontext=system_u:system_r:container_t:s0:c162,c886 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=dbus permissive=0 exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?'UID="dbus" AUID="unset" SAUID="dbus"
type=USER_AVC msg=audit(1592337869.062:74398): pid=1284 uid=81 auid=4294967295 ses=4294967295 subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 msg='avc: denied { send_msg } for msgtype=method_call interface=org.freedesktop.DBus member=Hello dest=org.freedesktop.DBus spid=417423 scontext=system_u:system_r:container_t:s0:c162,c886 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=dbus permissive=0 exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?'UID="dbus" AUID="unset" SAUID="dbus"
type=USER_AVC msg=audit(1592337899.069:74465): pid=1284 uid=81 auid=4294967295 ses=4294967295 subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 msg='avc: denied { send_msg } for msgtype=method_call interface=org.freedesktop.DBus member=Hello dest=org.freedesktop.DBus spid=417423 scontext=system_u:system_r:container_t:s0:c162,c886 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=dbus permissive=0 exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?'UID="dbus" AUID="unset" SAUID="dbus"
```

Interestingly, only two containers seem to suffer from this 'getent passwd' timeout (the compute node has no 'slow' containers):

```
for i in $(podman ps | awk '{ print $NF }'); do echo $i; time podman exec -uroot $i sh -c 'getent passwd &> /dev/null'; done | grep -B2 0m30
swift_proxy
real	0m30.301s
--
nova_scheduler
real	0m30.316s
```

Not sure yet whether they are special, but it is certainly no coincidence that nova_scheduler is always affected.

```
[root@overcloud-controller-0 ~]# podman exec -it swift_proxy sh -c 'ps auxwfZ'
LABEL USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
system_u:system_r:container_t:s0:c208,c410 swift 3729 0.0 0.0 47628 3956 pts/4 Rs+ 20:29 0:00 ps auxwfZ
system_u:system_r:container_t:s0:c208,c410 swift 1 0.0 0.0 4204 804 ? Ss 14:32 0:00 dumb-init --single-child -- kolla_start
system_u:system_r:container_t:s0:c208,c410 swift 7 0.6 0.0 211160 60520 ? Ss 14:32 2:26 /usr/bin/python3 -s /usr/bin/swift-proxy-server /etc/swift/proxy-server.c
system_u:system_r:container_t:s0:c208,c410 swift 21 0.0 0.0 212944 53472 ? S 14:33 0:03 \_ /usr/bin/python3 -s /usr/bin/swift-proxy-server /etc/swift/proxy-serv
...
system_u:system_r:container_t:s0:c208,c410 swift 30 0.0 0.0 211700 51468 ? S 14:33 0:03 \_ /usr/bin/python3 -s /usr/bin/swift-proxy-server /etc/swift/proxy-serv

[root@overcloud-controller-0 ~]# podman exec -it nova_scheduler sh -c 'ps auxwfZ'
LABEL USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
system_u:system_r:container_t:s0:c162,c886 nova 4917 0.0 0.0 47628 3856 pts/0 Rs+ 20:30 0:00 ps auxwfZ
system_u:system_r:container_t:s0:c162,c886 nova 1 0.0 0.0 4204 784 ? Ss 14:31 0:00 dumb-init --single-child -- kolla_start
system_u:system_r:container_t:s0:c162,c886 nova 7 0.6 0.1 345364 109476 ? S 14:31 2:27 /usr/bin/python3 /usr/bin/nova-scheduler
...
system_u:system_r:container_t:s0:c162,c886 nova 31 0.0 0.1 350756 105276 ? S 14:32 0:12 \_ /usr/bin/python3 /usr/bin/nova-scheduler
```

So the denied messages we have seen are coming from nova_scheduler: the scontext is system_u:system_r:container_t:s0:c162,c886, which is the nova_scheduler container. The reason we see these denials is the healthchecks: they call sudo, which triggers an NSS lookup, which hangs for 30 seconds in the systemd NSS module because it is blocked by SELinux.

What we do not know yet is the following:

A) Why do only these two containers (swift_proxy and nova_scheduler) seem to be affected? What is special about them?

B) Why is the systemd module even triggered in these two containers? As far as I can tell it is the last NSS module:

```
passwd: sss files systemd
group: sss files systemd
```

C) Why is this environment affected? The only special thing is IPv6 on the ctlplane (I checked, and getent ahosts is not affected by the timeouts, so it really is passwd and group only).

An additional data point is the following:

1) Run as root:

```
[root@overcloud-controller-0 ~]# time podman exec -u root -it nova_scheduler sh -c 'getent passwd &> /dev/null'

real	0m30.339s
user	0m0.047s
sys	0m0.064s
```

2) Run as a normal user:

```
[root@overcloud-controller-0 ~]# time podman exec -it nova_scheduler sh -c 'getent passwd &> /dev/null'

real	0m0.336s
user	0m0.043s
sys	0m0.056s
```

I wonder if the above somehow clues us in on question A). I guess maybe only a lookup done by root can trigger the systemd NSS module, but it is certainly worth noting. Also, I do not think this is image related in any way:

```
[root@overcloud-controller-0 ~]# time podman run --net=host --rm --name=bandini1 -it -uroot hardprov-dl360-g9-01.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-nova-scheduler:16.1_20200611.1 sh -c 'getent passwd &> /dev/null'

real	0m0.502s
user	0m0.125s
sys	0m0.081s
```

So one potential theory is the following.
Both swift_proxy and nova_scheduler (but also iscsid, ovn_controller, nova_compute, openstack-cinder-volume-podman-0 and ironic_conductor) bind mount /run from the host:

```
"Source": "/run",
"Destination": "/run",
```

I wonder if we somehow always try to talk to dbus, but simply fail in most containers because /run is inside the container and so /run/dbus/system_bus_socket is not available. You will of course ask: why are iscsid, ovn_controller, nova_compute, openstack-cinder-volume-podman-0 and ironic_conductor unaffected then? While I am not 100% sure, I do notice that they run with privileged: true, so maybe that is what gives them a pass. Let's test this theory:

```
vi /var/lib/tripleo-config/container-startup-config/step_3/iscsid.json   # set privileged: false
paunch --debug apply --default-runtime podman --file /var/lib/tripleo-config/container-startup-config/step_3/iscsid.json --config-id tripleo_step3
```

And now that iscsid has no privileges, boom, it all hangs:

```
[root@overcloud-controller-0 tmp]# time podman exec -u root -it iscsid sh -c 'getent passwd &> /dev/null'

real	0m30.378s
user	0m0.054s
sys	0m0.050s
```

I'll come up with a kolla patch tomorrow.

OK, so now I have fully root-caused it. Here we go:

- Bob's env is a bit slow in general, which is why we actually spotted this issue in the first place. It has nothing to do with the env being an IPv6-only setup. In fact, if you run "time podman exec -u root -it nova_scheduler sh -c 'getent passwd &> /dev/null'" on any >= OSP15 env it will take 30 seconds.
- The root cause is that /etc/nsswitch.conf inside the containers has 'systemd' as an NSS module. This module pokes dbus via /run/dbus/system_bus_socket on every lookup.
- This is fine for most containers because /run/dbus won't exist at all, but it is a problem when /run is bind mounted from the host.
- Only containers which are *not* privileged are denied by the selinux policy. In our case this is swift_proxy and nova_scheduler.
- OSP13 is not affected.
- Going back in git history there is no explanation for why /run is even there (https://review.opendev.org/#/c/425434/ and https://review.opendev.org/#/c/420921/ had /run bind mounted from the first commit).

The way out of this, I think, should be the following:

A) Propose a kolla patch to disable systemd NSS lookups. They make little sense inside containers anyway.
B) Try to verify whether swift_proxy and nova_scheduler work without bind-mounting /run.

I think both A) and B) are desirable at this point. Option B), should it just work, is probably a bit faster to merge.

I would like this to be considered an RC blocker for 16.1, as this affects Ironic-in-overcloud functionality, not necessarily just with IPv6.

*** Bug 1847617 has been marked as a duplicate of this bug. ***

Updated title to remove "ipv6" as this happens with IPv4 also.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 containers bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4382
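Fix option A) described above can be sketched in shell. This is a minimal illustration against a sample file under /tmp (the sample path and the standalone script are mine, not from the bug, and the change that actually shipped may differ in detail); it shows the nsswitch.conf check from the comments and then drops the systemd module from passwd/group lookups the way a kolla image patch might:

```shell
#!/bin/sh
set -e

# Sample of the nsswitch.conf shipped inside the containers, as quoted in
# the comments above.
cat > /tmp/nsswitch.conf <<'EOF'
passwd: sss files systemd
group: sss files systemd
EOF

# glibc consults the sources left to right, so 'systemd' runs last; inside
# a container with /run bind-mounted from the host it reaches for
# /run/dbus/system_bus_socket, which is exactly what SELinux denies for
# unprivileged containers.
grep -E '^(passwd|group):.* systemd' /tmp/nsswitch.conf

# Option A): remove 'systemd' from the passwd: and group: lines only,
# leaving any other NSS sources untouched.
sed -i -E '/^(passwd|group):/ s/ systemd//' /tmp/nsswitch.conf

# Lookups now stop at 'files' and never touch dbus.
cat /tmp/nsswitch.conf
```

On a real node the equivalent edit would target /etc/nsswitch.conf inside the container image at build time, rather than a copy under /tmp.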