Bug 1817926 - vms fails to connect to the metadata server
Summary: vms fails to connect to the metadata server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Assignee: Brent Eagles
QA Contact: Candido Campos
URL:
Whiteboard:
Duplicates: 1818518 1820937
Depends On:
Blocks:
 
Reported: 2020-03-27 09:00 UTC by Attila Fazekas
Modified: 2020-05-22 13:47 UTC
CC: 10 users

Fixed In Version: tripleo-ansible-0.4.2-0.20200402065246.6162151.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-14 12:16:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1869384 0 None None None 2020-03-27 15:18:55 UTC
OpenStack gerrit 715488 0 None MERGED Using podman/docker ps to look for running sidecars 2020-07-27 15:18:52 UTC
Red Hat Product Errata RHBA-2020:2114 0 None None None 2020-05-14 12:16:51 UTC

Description Attila Fazekas 2020-03-27 09:00:50 UTC
Description of problem:
Director setup with IPv6 API services and a VLAN tenant network.

The issue is not seen with VXLAN tunneling setups.

Working setups have log entries like:
controller-2/var/log/containers/neutron/l3-agent.log:2020-02-27 01:31:19.726 120498 DEBUG neutron.agent.metadata.driver [-] haproxy_cfg 

2020-02-27 01:36:14.173 120498 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 950082c6-7003-44bf-aa48-b2c3fdf5c926 _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:210

We stop seeing the above kind of log entries.

I did not find relevant config changes in the neutron config files,
or a relevant error message.


The VMs appear able to receive an IP address (DHCP),
and able to connect to 2 of the 3 nameservers.


RHOS_TRUNK-16.0-RHEL-8-20200226.n.1 (python3-neutron-15.0.2-0.20200206145602.ce3352a.el8ost.noarch) Known working version.
RHOS_TRUNK-16.0-RHEL-8-20200324.n.0 (python3-neutron-15.0.3-0.20200321092338.651eb12.el8ost.noarch) not working version.

An ovs/ml2/gre setup also has the issue.
ovn/geneve looks OK.

Comment 2 Bernard Cafarelli 2020-03-27 11:19:18 UTC
l3-agent.log on all 3 controllers has a number of "should not have died" errors:
2020-03-26 19:36:32.719 124685 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid 966e8865-fa48-4216-90d3-79d0472189be not found. The process should not have died

and subsequent keepalived router respawns.

I wonder if this is another issue like bug 1812630

Comment 3 Bernard Cafarelli 2020-03-27 12:29:05 UTC
For metadata (which is the probable root cause here), it looks like haproxy was not started for (probably) every router:
2020-03-26 19:42:46.090 124685 DEBUG neutron.agent.l3.ha [-] Closing metadata proxy for router 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:215
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.external_process [-] No haproxy process started for 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:124
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.l3.router_info [-] Terminating radvd daemon in router device: 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable_radvd /usr/lib/python3.6/site-packages/neutron/agent/l3/router_info.py:586
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.radvd get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.radvd get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.external_process [-] No radvd process started for 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:124

The logs about radvd are worrisome too.

Comment 4 Bernard Cafarelli 2020-03-27 12:32:30 UTC
May be unrelated, but 2 controllers out of 3 have a non-empty privsep-helper.log with:
2020-03-26 20:22:33.885 530230 CRITICAL privsep [-] Unhandled error: FileNotFoundError: [Errno 2] No such file or directory
2020-03-26 20:22:33.885 530230 ERROR privsep Traceback (most recent call last):
2020-03-26 20:22:33.885 530230 ERROR privsep   File "/bin/privsep-helper", line 10, in <module>
2020-03-26 20:22:33.885 530230 ERROR privsep     sys.exit(helper_main())
2020-03-26 20:22:33.885 530230 ERROR privsep   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 536, in helper_main
2020-03-26 20:22:33.885 530230 ERROR privsep     sock.connect(cfg.CONF.privsep_sock_path)
2020-03-26 20:22:33.885 530230 ERROR privsep FileNotFoundError: [Errno 2] No such file or directory
2020-03-26 20:22:33.885 530230 ERROR privsep

Comment 5 Bernard Cafarelli 2020-03-27 15:41:50 UTC
OK, after some investigation, the root cause is how the new tripleo-systemd-wrapper service sync script checks for running processes:
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_systemd_wrapper/templates/service_sync.j2#L43

[root@controller-0 heat-admin]# ps -e -o pid,command | grep "$(echo $NETNS | sed 's|^[^-]*\-||')" |egrep -v "grep | netns exec"                                         
 479796 neutron-keepalived-state-change (/usr/bin/python3 /usr/bin/neutron-keepalived-state-change --router_id=a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --namespace=qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --conf_dir=/var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --log-file=/var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/neutron-keepalived-state-change.log --monitor_interface=ha-e77d6c7e-6d --monitor_cidr=169.254.0.190/24 --pid_file=/var/lib/neutron/external/pids/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.monitor.pid.neutron-keepalived-state-change-monitor --state_path=/var/lib/neutron --user=42435 --group=42435)                               
 698911 /usr/bin/conmon --api-version 1 -s -c 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb -u 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata -p /var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/pidfile -l k8s-file:/var/log/containers/stdouts/l3_keepalived-qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/oci-log --conmon-pidfile /var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb                 
 698948 bash /var/lib/neutron/l3_keepalived/command -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp                                  
 698951 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp                                                
 698952 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp 

(for keepalived here)

The match picks up processes other than keepalived itself.

A quick fix would be to also filter on the launched command, "/usr/sbin/keepalived -n -l -D" in this case:
[root@controller-0 heat-admin]# ip netns exec $NETNS ps -e -o pid,command | grep "$(echo $NETNS | sed 's|^[^-]*\-||')" |egrep -v "grep | netns exec"|grep "/usr/sbin/keepalived -n -l -D"
 698951 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp                                                
 698952 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp 

As mentioned in the Launchpad bug, the proper fix would be the TODO already noted in the code: use podman ps to look for running sidecars.
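The merged change (gerrit 715488, "Using podman/docker ps to look for running sidecars") replaces the host-wide ps|grep with a query to the container runtime, which reports only the sidecar container itself and cannot match conmon or the wrapper shell. A minimal shell sketch of that idea follows; the container name pattern "${SERVICE_NAME}-${NETNS}" and the CONTAINER_CLI override are illustrative assumptions, not the exact template code:

```shell
# Sketch of a container-aware liveness check (assumption: the actual
# tripleo-ansible template differs in naming and detail).
CLI="${CONTAINER_CLI:-podman}"   # podman on RHEL 8, docker elsewhere
SERVICE_NAME="${SERVICE_NAME:-l3_keepalived}"
NETNS="${NETNS:-qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5}"

# Ask the container runtime whether the sidecar container is up, instead
# of grepping the host process table for the netns suffix (which also
# matches conmon and the wrapper bash process).
container_running() {
    "$CLI" ps --format '{{.Names}}' | grep -qx "${SERVICE_NAME}-${NETNS}"
}

if container_running; then
    echo "sidecar ${SERVICE_NAME}-${NETNS} is running"
else
    echo "sidecar ${SERVICE_NAME}-${NETNS} is not running"
fi
```

Because the check keys on the container name rather than on process command lines, it stays correct regardless of how many helper processes share the router's namespace.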

Comment 6 Bernard Cafarelli 2020-03-30 13:48:10 UTC
*** Bug 1818518 has been marked as a duplicate of this bug. ***

Comment 7 Candido Campos 2020-04-06 16:13:15 UTC
*** Bug 1820937 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2020-05-14 12:16:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2114

