Bug 1483962 - Network on controllers is down due to neutron-ovs-cleanup failure
Summary: Network on controllers is down due to neutron-ovs-cleanup failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: zstream
Target Release: 7.0 (Kilo)
Assignee: Terry Wilson
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks: 1531141 1531143 1531144 1531180
 
Reported: 2017-08-22 11:39 UTC by Pradipta Kumar Sahoo
Modified: 2020-12-14 09:38 UTC
CC List: 7 users

Fixed In Version: openstack-neutron-2015.1.4-31.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1531141 1531180
Environment:
Last Closed: 2018-03-07 15:22:24 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
RDO 9921 0 None None None 2017-10-03 20:20:07 UTC
Red Hat Product Errata RHBA-2018:0463 0 normal SHIPPED_LIVE openstack-neutron bug fix advisory 2018-03-07 20:21:16 UTC

Description Pradipta Kumar Sahoo 2017-08-22 11:39:12 UTC
Description of problem:
Due to a Dell firmware upgrade, all the nodes in the OSP7 environment were rebooted.
We noticed from the logs that "systemctl restart neutron-ovs-cleanup" timed out before the reboot, when the cluster was put in standby.
Once the servers rebooted, we noticed that the VLAN networks were not up, but a number of OVS interfaces were still present (from before the reboot).

Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.4-13.el7ost.noarch                 Sat Apr  8 02:07:32 2017
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch  Wed Dec 16 19:42:04 2015
openstack-neutron-common-2015.1.4-13.el7ost.noarch          Sat Apr  8 02:07:23 2017
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch            Wed Dec 14 02:13:59 2016
openstack-neutron-metering-agent-2015.1.4-13.el7ost.noarch  Sat Apr  8 02:07:32 2017
openstack-neutron-ml2-2015.1.4-13.el7ost.noarch             Sat Apr  8 02:07:32 2017
openstack-neutron-openvswitch-2015.1.4-13.el7ost.noarch     Sat Apr  8 02:07:32 2017
python-neutron-2015.1.4-13.el7ost.noarch                    Sat Apr  8 02:07:23 2017
python-neutron-lbaas-2015.1.4-1.el7ost.noarch               Wed Dec 14 02:13:59 2016
python-neutronclient-2.4.0-2.el7ost.noarch                  Wed Dec 16 19:41:47 2015


How reproducible: Reproduced in a customer environment


Steps to Reproduce:
1- Controllers require a reboot (in this case to apply a required Dell firmware update; sometimes to update the kernel, etc.)

2- We put the controllers on standby (pcs cluster stop)

3- All the services are stopped, and the neutron-ovs-cleanup service is in a running state.

4- The "neutron-ovs-cleanup" service times out before all the cleanup is completed. Many namespace ports were not deleted/cleaned:
$ sudo systemctl restart neutron-ovs-cleanup
Job for neutron-ovs-cleanup.service failed because a configured resource limit was exceeded. 

5- Controllers are rebooted

6- Once the controllers boot up, the "network" service times out and the required VLANs are not created:
	# systemctl status network
	● network.service - LSB: Bring up/down networking
	   Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: disabled)
	   Active: failed (Result: timeout) since Sat 2017-08-12 04:51:44 UTC; 11min ago
	     Docs: man:systemd-sysv-generator(8)
	  Process: 5616 ExecStart=/etc/rc.d/init.d/network start (code=killed, signal=TERM)

Actual results:
Once the servers rebooted, we noticed that the VLAN networks were not up and the network service was DOWN.

Expected results:
The cleanup service should delete all the virtual network interfaces when the controller nodes are rebooted.
After rebooting a controller, the network service should come UP with its configured VLAN-based networks.

Additional info:
WORKAROUND:
- Run "systemctl restart neutron-ovs-cleanup" multiple times until no leftover namespaces/interfaces remain in the "ifconfig" list.
- Once "neutron-ovs-cleanup" completes successfully (without the error from step 4), reboot. The last time we hit this issue, I had to run it several times until everything was clear.
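
A rough sketch of that retry loop (assuming it is acceptable to simply keep restarting the unit until systemd reports success; the loop and message are illustrative, not an official procedure):

# Keep restarting neutron-ovs-cleanup until it exits successfully, then reboot.
$ until sudo systemctl restart neutron-ovs-cleanup; do echo "cleanup timed out, retrying"; done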

Please let us know if this is a known bug in OSP7 and confirm whether it is fixed in an errata patch release.
Please let us know whether you need any further information on it.

Thank you,

Comment 1 Assaf Muller 2017-08-22 18:36:56 UTC
Can we have an sosreport from a node that includes the ovs-cleanup failures?

Comment 2 Terry Wilson 2017-08-22 19:52:08 UTC
Yeah, we'll need the logs. ovs_cleanup could definitely be sped up by not using ovs_lib's delete_port for each port and instead creating a single transaction that deletes all of the ports that need deleting. But w/o the log, I'm not sure exactly why things are timing out.
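
For illustration only (a sketch of the batching idea above, not necessarily the change that shipped), deleting every port on br-int in a single ovs-vsctl invocation, and therefore a single OVSDB transaction, could look roughly like this:

# Build one "-- del-port br-int <port>" clause per port and run them all
# in one ovs-vsctl call instead of one call (and one transaction) per port.
$ sudo ovs-vsctl list-ports br-int | xargs -n1 -I{} echo "-- del-port br-int {}" | xargs sudo ovs-vsctl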

Comment 4 Terry Wilson 2017-08-23 14:20:52 UTC
The SOS reports don't seem to include the /var/log/neutron/ovs-cleanup.log files. Do these files exist somewhere?

Comment 11 Terry Wilson 2017-10-02 22:22:34 UTC
Testing with creating dummy neutron ports via:

$ sudo rmmod dummy; sudo modprobe dummy numdummies=1000
$ for ((i=0;i<1000;i++));do echo "-- add-port br-int dummy$i -- set Interface dummy$i external-ids:attached-mac=00:01:02:03:04:05 external-ids:iface-id=1";done|xargs sudo ovs-vsctl
$ time sudo systemctl stop neutron-ovs-cleanup.service

results in a timeout at 90s and leftover ports. re-running cleans up the ports.

Doing the same with:
$ time sudo systemctl start neutron-ovs-cleanup.service

results in all ports being cleaned up after ~130s. So the stop/start timeouts seem to be different at least on my RHEL 7.4/kilo install.
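
One way to confirm which limits are in play (an aside, not part of the original test) is to query the unit's effective timeout properties:

# Show the start/stop timeouts systemd is actually applying to the unit.
$ systemctl show neutron-ovs-cleanup.service -p TimeoutStartUSec -p TimeoutStopUSec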

Modifying /usr/lib/systemd/system/neutron-ovs-cleanup.service to have:

[Service]
...
TimeoutSec=0

allows systemctl stop neutron-ovs-cleanup to run until completion. I think this would be a sufficient solution (several other openstack services have stop/start timeout=0).
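
For completeness, the same setting could also be applied as a systemd drop-in override instead of editing the packaged unit file directly (a generic systemd mechanism, sketched here; the drop-in file name is arbitrary):

# Create a drop-in override so the change survives package updates.
$ sudo mkdir -p /etc/systemd/system/neutron-ovs-cleanup.service.d
$ printf '[Service]\nTimeoutSec=0\n' | sudo tee /etc/systemd/system/neutron-ovs-cleanup.service.d/timeout.conf
$ sudo systemctl daemon-reload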

I should note, however, that on one occasion I did not reload the dummy module and recreated the ports in OVS. There were errors from vswitchd, though the ports existed in the OVSDB. Running neutron-ovs-cleanup start ran *very* slowly in this case (after 10 minutes, only about 100 ports had been deleted). strace showed that it was waiting on the rootwrap daemon to send responses. So setting an infinite timeout could theoretically cause a server to reboot extremely slowly.

Comment 18 errata-xmlrpc 2018-03-07 15:22:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0463

