Bug 1817926
| Summary: | VMs fail to connect to the metadata server | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Attila Fazekas <afazekas> |
| Component: | tripleo-ansible | Assignee: | Brent Eagles <beagles> |
| Status: | CLOSED ERRATA | QA Contact: | Candido Campos <ccamposr> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 16.0 (Train) | CC: | amuller, bcafarel, beagles, ccamposr, cgoncalves, chrisw, ekuris, jschluet, scohen, shrjoshi |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tripleo-ansible-0.4.2-0.20200402065246.6162151.el8ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-14 12:16:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
**Description** (Attila Fazekas, 2020-03-27 09:00:50 UTC)
l3-agent.log on all 3 controllers has a bunch of "should not have died" errors:

```
2020-03-26 19:36:32.719 124685 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid 966e8865-fa48-4216-90d3-79d0472189be not found. The process should not have died
```

and subsequent keepalived router respawning. I wonder if this is another issue like bug 1812630.

For metadata (which is the probable root cause here), it looks like haproxy was not started for (probably) every router:

```
2020-03-26 19:42:46.090 124685 DEBUG neutron.agent.l3.ha [-] Closing metadata proxy for router 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:215
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.external_process [-] No haproxy process started for 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:124
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.l3.router_info [-] Terminating radvd daemon in router device: 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable_radvd /usr/lib/python3.6/site-packages/neutron/agent/l3/router_info.py:586
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.radvd get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.radvd get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.external_process [-] No radvd process started for 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:124
```

The logs about radvd are worrisome too.

Maybe unrelated, but 2 controllers out of 3 have a non-empty privsep-helper.log with:

```
2020-03-26 20:22:33.885 530230 CRITICAL privsep [-] Unhandled error: FileNotFoundError: [Errno 2] No such file or directory
2020-03-26 20:22:33.885 530230 ERROR privsep Traceback (most recent call last):
2020-03-26 20:22:33.885 530230 ERROR privsep   File "/bin/privsep-helper", line 10, in <module>
2020-03-26 20:22:33.885 530230 ERROR privsep     sys.exit(helper_main())
2020-03-26 20:22:33.885 530230 ERROR privsep   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 536, in helper_main
2020-03-26 20:22:33.885 530230 ERROR privsep     sock.connect(cfg.CONF.privsep_sock_path)
2020-03-26 20:22:33.885 530230 ERROR privsep FileNotFoundError: [Errno 2] No such file or directory
2020-03-26 20:22:33.885 530230 ERROR privsep
```

OK, after some investigation, the root cause is how the new tripleo-systemd-wrapper system checks for processes in sync: https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_systemd_wrapper/templates/service_sync.j2#L43
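For reference, the liveness check in that template boils down to roughly the following (a simplified sketch of the template logic, not the verbatim code; `$NETNS` is the router namespace, e.g. `qrouter-<uuid>`, and `ROUTER_ID` is a name introduced here for illustration):

```bash
# Simplified sketch of the service_sync check -- illustrative, not verbatim.
# Strip everything up to the first hyphen of the namespace name, which
# leaves the router UUID.
ROUTER_ID="$(echo "$NETNS" | sed 's|^[^-]*\-||')"

# Any process whose command line contains the UUID matches: conmon, the
# bash wrapper, neutron-keepalived-state-change -- not only keepalived
# itself. That is why the check can report a dead service as alive.
if ps -e -o pid,command | grep "$ROUTER_ID" | egrep -v "grep | netns exec" > /dev/null; then
    echo "service for $NETNS considered alive (possibly a false positive)"
else
    echo "service for $NETNS considered dead"
fi
```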
"$(echo $NETNS | sed 's|^[^-]*\-||')" |egrep -v "grep | netns exec" 479796 neutron-keepalived-state-change (/usr/bin/python3 /usr/bin/neutron-keepalived-state-change --router_id=a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --namespace=qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --conf_dir=/var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --log-file=/var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/neutron-keepalived-state-change.log --monitor_interface=ha-e77d6c7e-6d --monitor_cidr=169.254.0.190/24 --pid_file=/var/lib/neutron/external/pids/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.monitor.pid.neutron-keepalived-state-change-monitor --state_path=/var/lib/neutron --user=42435 --group=42435) 698911 /usr/bin/conmon --api-version 1 -s -c 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb -u 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata -p /var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/pidfile -l k8s-file:/var/log/containers/stdouts/l3_keepalived-qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/oci-log --conmon-pidfile /var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb 698948 bash /var/lib/neutron/l3_keepalived/command -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp 698951 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp 698952 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp (for keepalived here) We have processes other than keepalived itself here A quick fix would be to also filter on the command launched "/usr/sbin/keepalived -n -l -D" in this case [root@controller-0 heat-admin]# ip netns exec $NETNS ps -e -o pid,command | grep "$(echo $NETNS | sed 
*** Bug 1818518 has been marked as a duplicate of this bug. ***

*** Bug 1820937 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2114