Created attachment 1573850 [details] Controller hung L3 agent log Only seen so far on single-controller deployments (reproduced on 1 controller 1 compute vxlan deployment) At some point in tests, the L3 agent hangs, last log is: 2019-05-24 17:25:00.249 118696 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=241697 2019-05-24 17:25:00.308 118696 DEBUG oslo_rootwrap.client [-] Popen for ['sudo', 'neutron-rootwrap-daemon', '/etc/neutron/rootwrap.conf'] command has been instantiated _initialize /usr/lib/python3.6/site-packages/oslo_rootwrap/client.py:73 2019-05-24 17:25:00.565 118696 INFO oslo_rootwrap.client [-] Spawned new rootwrap daemon process with pid=242113 2019-05-24 17:25:00.606 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-6', 'route', 'list', 'dev', 'qg-fb818926-22', 'scope', 'link'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 2019-05-24 17:25:00.629 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-4', 'route', 'list', 'dev', 'qg-fb818926-22'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 2019-05-24 17:25:00.649 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-6', 'route', 'list', 'dev', 'qg-fb818926-22'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 2019-05-24 17:25:00.670 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-4', 'route', 'replace', 'default', 'via', '10.0.0.1', 'dev', 'qg-fb818926-22'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 2019-05-24 17:25:00.690 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'sysctl', '-w', 'net.ipv6.conf.qg-fb818926-22.accept_ra=2'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 2019-05-24 17:25:00.719 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'sysctl', '-w', 'net.ipv6.conf.qg-fb818926-22.forwarding=1'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 2019-05-24 17:25:00.759 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'sysctl', '-w', 'net.ipv6.conf.all.forwarding=1'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103 (full l3-agent.log attached) neutron-server reports: 2019-05-25 08:35:13.856 32 WARNING neutron.db.agents_db [req-a7be97df-e555-4e77-be3a-168893459bfe - - - - -] Agent healthcheck: found 1 dead agents out of 5: Type Last heartbeat host L3 agent 2019-05-24 17:25:00 controller-0.localdomain "podman ps" still shows the container around # ps fuaxw|grep [l]3-agent 42435 118696 0.0 0.3 322900 106668 ? S May24 0:03 | \_ /usr/bin/python3 /usr/bin/neutron-l3-agent --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/l3-agent.log root 119163 0.0 0.2 832836 71748 ? Sl May24 0:00 \_ /usr/libexec/platform-python /bin/privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/neutron-l3-agent --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcd2568kk/privsep.sock
This may happend because of bug #1713321 What we see in abnormal behaviours is the rootwrap daemon being restarted from time to time: 2019-05-27 09:40:31.717 386982 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=422757 2019-05-27 09:40:31.744 386982 DEBUG oslo_rootwrap.client [-] Popen for ['sudo', 'neutron-rootwrap-daemon', '/etc/neutron/rootwrap.conf'] command has been instantiated _initialize /usr/lib/python3.6/site-packages/oslo_rootwrap/client.py:73 2019-05-27 09:40:32.004 386982 INFO oslo_rootwrap.client [-] Spawned new rootwrap daemon process with pid=1007898 This should not happen and the daemon should stay up for the duration of the agent lifecycle. Such a restart was in the L3 agent log shortly before the agent went in locked state. Also killing the privsep-helper does not unlock the L3 agent, so this is not the direct cause. But it may be possible that, because of missing arping binary, the rootwrap daemon ends (bug #1713321), and during the restart some race condition happens in rootwrap or privsep IP commands. Currently running tempest loops on a setup where the behaviour was reproduced, but after installing iputils. If we do not see additional rootwrap-daemon restarts, we will know that missing arping command was causing the restart. And it very well could be the reason for this bug too
With added iputils in neutron containers, we do not see additionals rootwrap daemons restarting as in previous logs, and no L3 agent hang in matching test runs. So the root cause was indeed rootwrap daemon restarting because of bug #1713321 marking as duplicate *** This bug has been marked as a duplicate of bug 1713321 ***