Description of problem: director setup with IPv6 API services and a VLAN tenant network. The issue is not seen with VXLAN tunneling setups. The working setups have log entries like:

controller-2/var/log/containers/neutron/l3-agent.log:
2020-02-27 01:31:19.726 120498 DEBUG neutron.agent.metadata.driver [-] haproxy_cfg
2020-02-27 01:36:14.173 120498 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 950082c6-7003-44bf-aa48-b2c3fdf5c926 _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:210

We stop getting the above kind of log entries. I did not find relevant config changes in the neutron config files, nor a relevant error message. The VMs appear to receive an IP address via DHCP and can connect to 2 of the 3 nameservers.

Known working version: RHOS_TRUNK-16.0-RHEL-8-20200226.n.1 (python3-neutron-15.0.2-0.20200206145602.ce3352a.el8ost.noarch)
Not working version: RHOS_TRUNK-16.0-RHEL-8-20200324.n.0 (python3-neutron-15.0.3-0.20200321092338.651eb12.el8ost.noarch)

An ovs/ml2/gre setup also shows the issue; ovn/geneve looks OK.
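A quick way to check whether the l3 agent is still spawning metadata proxies on a given controller (log path and message taken from this report) is:

[root@controller-0 heat-admin]# grep "Spawning metadata proxy" /var/log/containers/neutron/l3-agent.log

On the broken builds this returns nothing after the routers are created.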
l3-agent.log on all 3 controllers has a bunch of "should not have died" errors:

2020-03-26 19:36:32.719 124685 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid 966e8865-fa48-4216-90d3-79d0472189be not found. The process should not have died

followed by keepalived router respawning. I wonder if this is another issue like bug 1812630.
For metadata (which is the probable root cause here), it looks like haproxy was not started for (probably) any of the routers:

2020-03-26 19:42:46.090 124685 DEBUG neutron.agent.l3.ha [-] Closing metadata proxy for router 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:215
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.091 124685 DEBUG neutron.agent.linux.external_process [-] No haproxy process started for 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:124
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.l3.router_info [-] Terminating radvd daemon in router device: 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable_radvd /usr/lib/python3.6/site-packages/neutron/agent/l3/router_info.py:586
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.radvd get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca.pid.radvd get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-26 19:42:46.092 124685 DEBUG neutron.agent.linux.external_process [-] No radvd process started for 02568ba7-c16f-4cd9-a769-c7b7c7e3a9ca disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:124

The logs about radvd are worrisome too.
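The missing pid files can also be confirmed directly on an affected controller (the paths are the ones from the log lines above):

[root@controller-0 heat-admin]# ls -l /var/lib/neutron/external/pids/*.pid.haproxy /var/lib/neutron/external/pids/*.pid.radvd

If the agent never managed to start the proxies, the globs match nothing.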
May be unrelated, but 2 controllers out of 3 have a non-empty privsep-helper.log with:

2020-03-26 20:22:33.885 530230 CRITICAL privsep [-] Unhandled error: FileNotFoundError: [Errno 2] No such file or directory
2020-03-26 20:22:33.885 530230 ERROR privsep Traceback (most recent call last):
2020-03-26 20:22:33.885 530230 ERROR privsep   File "/bin/privsep-helper", line 10, in <module>
2020-03-26 20:22:33.885 530230 ERROR privsep     sys.exit(helper_main())
2020-03-26 20:22:33.885 530230 ERROR privsep   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 536, in helper_main
2020-03-26 20:22:33.885 530230 ERROR privsep     sock.connect(cfg.CONF.privsep_sock_path)
2020-03-26 20:22:33.885 530230 ERROR privsep FileNotFoundError: [Errno 2] No such file or directory
2020-03-26 20:22:33.885 530230 ERROR privsep
OK, after some investigation: the root cause is how the new tripleo-systemd-wrapper sync script checks for running processes: https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_systemd_wrapper/templates/service_sync.j2#L43

[root@controller-0 heat-admin]# ps -e -o pid,command | grep "$(echo $NETNS | sed 's|^[^-]*\-||')" | egrep -v "grep | netns exec"
 479796 neutron-keepalived-state-change (/usr/bin/python3 /usr/bin/neutron-keepalived-state-change --router_id=a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --namespace=qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --conf_dir=/var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5 --log-file=/var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/neutron-keepalived-state-change.log --monitor_interface=ha-e77d6c7e-6d --monitor_cidr=169.254.0.190/24 --pid_file=/var/lib/neutron/external/pids/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.monitor.pid.neutron-keepalived-state-change-monitor --state_path=/var/lib/neutron --user=42435 --group=42435)
 698911 /usr/bin/conmon --api-version 1 -s -c 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb -u 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata -p /var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/pidfile -l k8s-file:/var/log/containers/stdouts/l3_keepalived-qrouter-a732dc94-c070-4dd8-8b0a-8fe43bec25a5.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/oci-log --conmon-pidfile /var/run/containers/storage/overlay-containers/0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 0f8d5c20bfbd74c600abce4768df8402a8e4f0a526c3a793c86df39ea4be80eb
 698948 bash /var/lib/neutron/l3_keepalived/command -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp
 698951 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp
 698952 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp

So for keepalived the check matches processes other than keepalived itself (the state-change monitor, conmon, and the bash wrapper). A quick fix would be to also filter on the launched command, "/usr/sbin/keepalived -n -l -D" in this case:

[root@controller-0 heat-admin]# ip netns exec $NETNS ps -e -o pid,command | grep "$(echo $NETNS | sed 's|^[^-]*\-||')" | egrep -v "grep | netns exec" | grep "/usr/sbin/keepalived -n -l -D"
 698951 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp
 698952 /usr/sbin/keepalived -n -l -D -P -f /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5/keepalived.conf -p /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived -r /var/lib/neutron/ha_confs/a732dc94-c070-4dd8-8b0a-8fe43bec25a5.pid.keepalived-vrrp

As mentioned in the LP bug, the proper fix would be the TODO already noted in the code: use podman ps instead.
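For illustration, a minimal sketch of that podman ps approach. The l3_keepalived-<netns> container name pattern is an assumption inferred from the conmon log path above, not taken from the code:

# Sketch only: assumes the sidecar container is named l3_keepalived-$NETNS,
# matching the l3_keepalived-qrouter-<uuid> log path in the conmon output.
if podman ps --format '{{.Names}}' | grep -q "l3_keepalived-${NETNS}"; then
    echo "keepalived sidecar for ${NETNS} is running"
fi

Unlike the host-wide ps|grep, this only matches the sidecar container itself, so conmon and the wrapper processes cannot produce false positives.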
*** Bug 1818518 has been marked as a duplicate of this bug. ***
*** Bug 1820937 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2114