Description of problem:

After applying the RHEL 7.6 minor update procedure to an OpenStack Platform 10 cloud (via the Director minor update procedure) to bring packaging up to date, one of our customers has been seeing a large number of errors in the Neutron logs on the 3 controller/network nodes, around the L3 and DHCP agents. Specifically, the keepalived-based Neutron L3 HA routers are trapped in a loop of starting up, crashing and restarting, across all 3 controller/network nodes. The keepalived processes start up in their network namespaces, immediately fail on a configuration error, and are then restarted. On the controller nodes hosting Neutron DHCP agents, the logs also show the DHCP agents unable to allocate memory for the DHCP processes of Neutron networks.

Version-Release number of selected component (if applicable):

During the minor update to RHEL 7.6, the packages relevant to Neutron were updated to these latest versions:

keepalived-1.3.5-8.el7_6.x86_64
kernel-3.10.0-957.10.1.el7.x86_64
kernel-devel-3.10.0-957.10.1.el7.x86_64
kernel-headers-3.10.0-957.10.1.el7.x86_64
kernel-tools-3.10.0-957.10.1.el7.x86_64
kernel-tools-libs-3.10.0-957.10.1.el7.x86_64
openstack-neutron-9.4.1-32.el7ost.noarch
openstack-neutron-common-9.4.1-32.el7ost.noarch
openstack-neutron-metering-agent-9.4.1-32.el7ost.noarch
openstack-neutron-ml2-9.4.1-32.el7ost.noarch
openstack-neutron-openvswitch-9.4.1-32.el7ost.noarch
openvswitch-2.9.0-83.el7fdp.1.x86_64
python-neutron-9.4.1-32.el7ost.noarch
python-neutron-tests-9.4.1-32.el7ost.noarch
python-openvswitch-2.9.0-83.el7fdp.1.x86_64
selinux-policy-3.13.1-229.el7_6.9.noarch
selinux-policy-targeted-3.13.1-229.el7_6.9.noarch

Actual results:

Neutron agents are unable to allocate memory for normal operations after performing the minor update.

Expected results:

Neutron agents should work as expected for normal operations after performing the minor update.

Additional info:

We see this behaviour repeatedly in the Neutron and system logs on all 3 controller/network nodes present in the OSP10 cloud. This issue has the highest level of business impact, as L3 routing and DHCP functionality on Neutron networks across the OSP10 cloud is directly affected.
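For anyone reproducing the L3 HA restart loop, this is roughly how it can be confirmed from one of the controller/network nodes. This is a minimal sketch: the router UUID is a placeholder, /var/lib/neutron/ha_confs is the default ha_confs_path (adjust if overridden), and the exact log locations may differ per deployment.

# List the HA router namespaces and check keepalived inside one of them
ip netns | grep qrouter
ip netns exec qrouter-<router-uuid> ps -o pid,etime,cmd -C keepalived

# Keepalived configuration rendered by the L3 agent for that router
cat /var/lib/neutron/ha_confs/<router-uuid>/keepalived.conf

# Look for the respawn / configuration-error messages in the agent and system logs
grep -iE 'keepalived|respawn' /var/log/neutron/l3-agent.log | tail -n 50
grep -i 'keepalived' /var/log/messages | tail -n 50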
Sorry to intrude on the BZ (I'm the original poster of the linked support case). Our issue, which may not be immediately clear, is that on a fresh boot of a combined OSP10 controller/network node (any of the 3 in total; the behaviour is seen on all of them), the Neutron-related services start out consuming ~600MB RSS between them, but over a couple of days balloon more or less linearly until they consume all available memory on the node:

# Freshly booted node
top -n1 -b -o %MEM | head -100 | egrep neutron
16597 neutron  20  0  509912 201876  6828 S  0.0  0.2    0:26.42 neutron-dhcp-agent
16601 neutron  20  0  506912 200768  6756 S  0.0  0.2    0:25.25 neutron-l3-agent
36018 neutron  20  0  496984 190912  6760 S  0.0  0.1    0:26.56 neutron-openvswitch

# Node with 2 days uptime
top -n1 -b -o %MEM | head -100 | egrep neutron
17872 neutron  20  0   14.7g  14.4g  6824 S  0.0 11.4   48:43.08 neutron-dhcp-agent
17881 neutron  20  0   14.7g  14.4g  6816 S  0.0 11.4   21:54.27 neutron-l3-agent
17992 neutron  20  0   14.7g  14.4g  6760 S  0.0 11.4  445:50.22 neutron-openvswitch

Left unchecked, these processes accumulate all available memory on the node, and normal Neutron operations such as creating new networks or L3 routers can no longer proceed, as no memory can be allocated for them. The only fix so far has been to reboot the controller/network node once memory usage reaches these levels. I don't think our cloud sees a rate or total of Neutron resource creation that could explain this memory usage purely in terms of the number of resources Neutron is managing (roughly 130 routers and 250 tenant private networks in total).
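In case it helps whoever picks this up, this is roughly how we have been tracking the RSS growth between reboots. A minimal sketch only: the 10-minute interval and the /var/tmp/neutron-rss.log path are arbitrary choices, not anything Neutron-specific.

# Append a timestamped RSS (kB) snapshot of the three Neutron agents every 10 minutes
while true; do
    {
        date '+%F %T'
        ps -eo pid,rss,etime,args | grep -E 'neutron-(dhcp|l3|openvswitch)-agent' | grep -v grep
    } >> /var/tmp/neutron-rss.log
    sleep 600
done

Plotting the RSS column from that log is what shows the roughly linear growth described above.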
Team, Any update?
Marking this as a duplicate of bug 1693430, which tracks our memory leak fix for python-openvswitch.

*** This bug has been marked as a duplicate of bug 1693430 ***