Bug 1714146 - [OSP15] L3 agent hangs during a tempest run
Summary: [OSP15] L3 agent hangs during a tempest run
Keywords:
Status: CLOSED DUPLICATE of bug 1713321
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Assignee: Bernard Cafarelli
QA Contact: Roee Agiman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-27 08:48 UTC by Bernard Cafarelli
Modified: 2019-06-03 10:01 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-03 10:01:12 UTC
Target Upstream Version:
Embargoed:


Attachments
Controller hung L3 agent log (345.24 KB, text/plain)
2019-05-27 08:48 UTC, Bernard Cafarelli

Description Bernard Cafarelli 2019-05-27 08:48:58 UTC
Created attachment 1573850 [details]
Controller hung L3 agent log

Only seen so far on single-controller deployments (reproduced on a 1 controller / 1 compute vxlan deployment).

At some point during the test run, the L3 agent hangs; the last log entries are:
2019-05-24 17:25:00.249 118696 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=241697
2019-05-24 17:25:00.308 118696 DEBUG oslo_rootwrap.client [-] Popen for ['sudo', 'neutron-rootwrap-daemon', '/etc/neutron/rootwrap.conf'] command has been instantiated _initialize /usr/lib/python3.6/site-packages/oslo_rootwrap/client.py:73
2019-05-24 17:25:00.565 118696 INFO oslo_rootwrap.client [-] Spawned new rootwrap daemon process with pid=242113
2019-05-24 17:25:00.606 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-6', 'route', 'list', 'dev', 'qg-fb818926-22', 'scope', 'link'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-05-24 17:25:00.629 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-4', 'route', 'list', 'dev', 'qg-fb818926-22'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-05-24 17:25:00.649 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-6', 'route', 'list', 'dev', 'qg-fb818926-22'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-05-24 17:25:00.670 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'ip', '-4', 'route', 'replace', 'default', 'via', '10.0.0.1', 'dev', 'qg-fb818926-22'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-05-24 17:25:00.690 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'sysctl', '-w', 'net.ipv6.conf.qg-fb818926-22.accept_ra=2'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-05-24 17:25:00.719 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'sysctl', '-w', 'net.ipv6.conf.qg-fb818926-22.forwarding=1'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-05-24 17:25:00.759 118696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-76c2dcbc-3a88-4870-836d-20f587a0fe34', 'sysctl', '-w', 'net.ipv6.conf.all.forwarding=1'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
(full l3-agent.log attached)

neutron-server reports:
2019-05-25 08:35:13.856 32 WARNING neutron.db.agents_db [req-a7be97df-e555-4e77-be3a-168893459bfe - - - - -] Agent healthcheck: found 1 dead agents out of 5:
                Type       Last heartbeat host
            L3 agent  2019-05-24 17:25:00 controller-0.localdomain

"podman ps" still shows the container around

# ps fuaxw|grep [l]3-agent
42435     118696  0.0  0.3 322900 106668 ?       S    May24   0:03  |   \_ /usr/bin/python3 /usr/bin/neutron-l3-agent --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/l3-agent.log
root      119163  0.0  0.2 832836 71748 ?        Sl   May24   0:00  \_ /usr/libexec/platform-python /bin/privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/neutron-l3-agent --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcd2568kk/privsep.sock
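
For completeness, a few quick host-side checks confirm the agent is actually hung rather than just quiet (the container name and the host-side log path below are assumptions for a standard TripleO deployment, adjust as needed):

# Last heartbeat as seen by neutron-server (the L3 agent normally reports every ~30s)
openstack network agent list --agent-type l3

# The agent log stops being written at the hang time
stat -c '%y %n' /var/log/containers/neutron/l3-agent.log

# The container itself is still reported as running
podman ps --filter name=neutron_l3_agent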

Comment 3 Bernard Cafarelli 2019-05-27 11:23:38 UTC
This may happen because of bug #1713321.

What we see in the abnormal runs is the rootwrap daemon being restarted from time to time:
2019-05-27 09:40:31.717 386982 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=422757
2019-05-27 09:40:31.744 386982 DEBUG oslo_rootwrap.client [-] Popen for ['sudo', 'neutron-rootwrap-daemon', '/etc/neutron/rootwrap.conf'] command has been instantiated _initialize /usr/lib/python3.6/site-packages/oslo_rootwrap/client.py:73
2019-05-27 09:40:32.004 386982 INFO oslo_rootwrap.client [-] Spawned new rootwrap daemon process with pid=1007898

This should not happen; the daemon should stay up for the whole lifetime of the agent.
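
Such restarts are easy to spot directly in the agent log, for example (host-side log path assumed):

# On a healthy agent there should be a single "Spawned new rootwrap daemon" line, from startup
grep -c 'Spawned new rootwrap daemon' /var/log/containers/neutron/l3-agent.log

# List the stop/respawn events with timestamps, to correlate with the hang
grep -E 'Stopping rootwrap daemon|Spawned new rootwrap daemon' /var/log/containers/neutron/l3-agent.log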

Such a restart appears in the L3 agent log shortly before the agent went into its locked state. Also, killing the privsep-helper does not unlock the L3 agent, so that is not the direct cause.

But it may be possible that, because of the missing arping binary, the rootwrap daemon exits (bug #1713321), and that during the restart some race condition occurs in the rootwrap or privsep IP commands. Currently running tempest loops on a setup where the behaviour was reproduced, this time with iputils installed. If we do not see additional rootwrap daemon restarts, we will know that the missing arping command was causing the restarts, and it could very well be the reason for this bug too.
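
As a sketch, this kind of check can be done with the following (container name neutron_l3_agent and the host-side log path are assumptions, adjust to the deployment):

# arping is shipped by iputils; bug #1713321 is about it missing from the neutron containers
podman exec neutron_l3_agent sh -c 'command -v arping' || echo "arping missing"

# While the tempest loop runs, watch for rootwrap daemon respawns: after agent startup
# no new "Spawned new rootwrap daemon" lines should appear
tail -f /var/log/containers/neutron/l3-agent.log | grep 'rootwrap daemon'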

Comment 4 Bernard Cafarelli 2019-06-03 10:01:12 UTC
With iputils added to the neutron containers, we do not see additional rootwrap daemon restarts as in the previous logs, and no L3 agent hang in the matching test runs.

So the root cause was indeed the rootwrap daemon restarting because of bug #1713321; marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 1713321 ***

