Description of problem:

After applying the RHEL 7.6 minor update procedure to an OpenStack Platform 10 cloud (via the Director minor update procedure) to bring packaging up to date, one of our customers has been seeing a large number of errors in the Neutron logs on the 3 controller/network nodes, around the L3 and DHCP agents. Specifically, the keepalived-based Neutron L3 HA routers are trapped in a loop of starting up, crashing and restarting, across all 3 controller/network nodes. The keepalived processes start up in their network namespaces, immediately fail on a configuration error, and are then restarted. On the controller nodes hosting Neutron DHCP agents, the logs also show the DHCP agents unable to allocate memory for the DHCP processes of Neutron networks.

Version-Release number of selected component (if applicable):

During the minor update to RHEL 7.6, the packages relevant to Neutron were updated to these latest versions:

keepalived-1.3.5-8.el7_6.x86_64
kernel-3.10.0-957.10.1.el7.x86_64
kernel-devel-3.10.0-957.10.1.el7.x86_64
kernel-headers-3.10.0-957.10.1.el7.x86_64
kernel-tools-3.10.0-957.10.1.el7.x86_64
kernel-tools-libs-3.10.0-957.10.1.el7.x86_64
openstack-neutron-9.4.1-32.el7ost.noarch
openstack-neutron-common-9.4.1-32.el7ost.noarch
openstack-neutron-metering-agent-9.4.1-32.el7ost.noarch
openstack-neutron-ml2-9.4.1-32.el7ost.noarch
openstack-neutron-openvswitch-9.4.1-32.el7ost.noarch
openvswitch-2.9.0-83.el7fdp.1.x86_64
python-neutron-9.4.1-32.el7ost.noarch
python-neutron-tests-9.4.1-32.el7ost.noarch
python-openvswitch-2.9.0-83.el7fdp.1.x86_64
selinux-policy-3.13.1-229.el7_6.9.noarch
selinux-policy-targeted-3.13.1-229.el7_6.9.noarch

Actual results:

Neutron agents are unable to allocate memory for normal operations after performing the minor update.

Expected results:

Neutron agents should work as expected for normal operations after performing the minor update.

Additional info:

We see this behaviour repeatedly in the Neutron and system logs on all 3 controller/network nodes present in the OSP10 cloud. This issue has the highest level of business impact, as L3 routing and DHCP functionality on Neutron networks across the OSP10 cloud is directly affected.
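For anyone reproducing the L3 HA restart loop, this is roughly how it can be confirmed from one of the controller/network nodes. This is a minimal sketch: the router UUID is a placeholder, /var/lib/neutron/ha_confs is the default ha_confs_path (adjust if overridden), and the exact log locations may differ per deployment.

# List the HA router namespaces and check keepalived inside one of them
ip netns | grep qrouter
ip netns exec qrouter-<router-uuid> ps -o pid,etime,cmd -C keepalived

# Keepalived configuration rendered by the L3 agent for that router
cat /var/lib/neutron/ha_confs/<router-uuid>/keepalived.conf

# Look for the respawn / configuration-error messages in the agent and system logs
grep -iE 'keepalived|respawn' /var/log/neutron/l3-agent.log | tail -n 50
grep -i 'keepalived' /var/log/messages | tail -n 50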
Sorry to intrude on the BZ (I'm the original poster of the linked support case). Our issue, which may not be immediately clear, is that on a fresh boot of a combined OSP10 controller/network node (any of the 3 in total; the behaviour is seen on all of them), the Neutron-related services start out consuming ~600MB RSS between them, but over a couple of days balloon more or less linearly until they consume all available memory on the node:

# Freshly booted node
top -n1 -b -o %MEM | head -100 | egrep neutron
16597 neutron  20  0  509912 201876  6828 S  0.0  0.2    0:26.42 neutron-dhcp-agent
16601 neutron  20  0  506912 200768  6756 S  0.0  0.2    0:25.25 neutron-l3-agent
36018 neutron  20  0  496984 190912  6760 S  0.0  0.1    0:26.56 neutron-openvswitch

# Node with 2 days uptime
top -n1 -b -o %MEM | head -100 | egrep neutron
17872 neutron  20  0   14.7g  14.4g  6824 S  0.0 11.4   48:43.08 neutron-dhcp-agent
17881 neutron  20  0   14.7g  14.4g  6816 S  0.0 11.4   21:54.27 neutron-l3-agent
17992 neutron  20  0   14.7g  14.4g  6760 S  0.0 11.4  445:50.22 neutron-openvswitch

Left unchecked, these processes accumulate all available memory on the node, and normal Neutron operations such as creating new networks or L3 routers can no longer proceed, as no memory can be allocated for them. The only fix so far has been to reboot the controller/network node once memory usage reaches these levels. I don't think our cloud sees a rate or total of Neutron resource creation that could explain this memory usage purely in terms of the number of resources Neutron is managing (roughly 130 routers and 250 tenant private networks in total).
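In case it helps whoever picks this up, this is roughly how we have been tracking the RSS growth between reboots. A minimal sketch only: the 10-minute interval and the /var/tmp/neutron-rss.log path are arbitrary choices, not anything Neutron-specific.

# Append a timestamped RSS (kB) snapshot of the three Neutron agents every 10 minutes
while true; do
    {
        date '+%F %T'
        ps -eo pid,rss,etime,args | grep -E 'neutron-(dhcp|l3|openvswitch)-agent' | grep -v grep
    } >> /var/tmp/neutron-rss.log
    sleep 600
done

Plotting the RSS column from that log is what shows the roughly linear growth described above.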
Team, Any update?
Marking this as a duplicate of bug 1693430, which tracks our memory leak fix for python-openvswitch.

*** This bug has been marked as a duplicate of bug 1693430 ***