Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1483962

Summary: Network on controllers is down due to neutron-ovs-cleanup failure
Product: Red Hat OpenStack Reporter: Pradipta Kumar Sahoo <psahoo>
Component: openstack-neutron    Assignee: Terry Wilson <twilson>
Status: CLOSED ERRATA QA Contact: Toni Freger <tfreger>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo)    CC: amuller, ccollett, chrisw, nyechiel, psahoo, srevivo, twilson
Target Milestone: zstream    Keywords: Triaged, ZStream
Target Release: 7.0 (Kilo)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-neutron-2015.1.4-31.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1531141 1531180    Environment:
Last Closed: 2018-03-07 15:22:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1531141, 1531143, 1531144, 1531180    

Description Pradipta Kumar Sahoo 2017-08-22 11:39:12 UTC
Description of problem:
Due to a Dell firmware upgrade, all of the nodes in the OSP7 environment were rebooted.
We noticed from the logs that "systemctl restart neutron-ovs-cleanup" timed out before the reboot, when the cluster was put on standby.
Once the servers rebooted, the VLAN networks were not up, but a number of stale OVS interfaces (from before the reboot) were still present.

Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.4-13.el7ost.noarch                 Sat Apr  8 02:07:32 2017
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch  Wed Dec 16 19:42:04 2015
openstack-neutron-common-2015.1.4-13.el7ost.noarch          Sat Apr  8 02:07:23 2017
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch            Wed Dec 14 02:13:59 2016
openstack-neutron-metering-agent-2015.1.4-13.el7ost.noarch  Sat Apr  8 02:07:32 2017
openstack-neutron-ml2-2015.1.4-13.el7ost.noarch             Sat Apr  8 02:07:32 2017
openstack-neutron-openvswitch-2015.1.4-13.el7ost.noarch     Sat Apr  8 02:07:32 2017
python-neutron-2015.1.4-13.el7ost.noarch                    Sat Apr  8 02:07:23 2017
python-neutron-lbaas-2015.1.4-1.el7ost.noarch               Wed Dec 14 02:13:59 2016
python-neutronclient-2.4.0-2.el7ost.noarch                  Wed Dec 16 19:41:47 2015


How reproducible: Reproduced in a customer environment


Steps to Reproduce:
1- The controllers require a reboot (in this case to apply a required Dell firmware update; in other cases to update the kernel, etc.)

2- We put the controllers on standby (pcs cluster stop)

3- All the cluster-managed services are stopped, and the neutron-ovs-cleanup service is in the running state.

4- The "neutron-ovs-cleanup" service times out before all of the cleanup is completed, so many namespace ports are not deleted/cleaned:
$ sudo systemctl restart neutron-ovs-cleanup
Job for neutron-ovs-cleanup.service failed because a configured resource limit was exceeded. 

5- Controllers are rebooted

6- Once the controllers boot up, the "network" service times out and the required VLAN interfaces are not created:
	# systemctl status network
	● network.service - LSB: Bring up/down networking
	   Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: disabled)
	   Active: failed (Result: timeout) since Sat 2017-08-12 04:51:44 UTC; 11min ago
	     Docs: man:systemd-sysv-generator(8)
	  Process: 5616 ExecStart=/etc/rc.d/init.d/network start (code=killed, signal=TERM)

Actual results:
Once the servers rebooted, the VLAN networks were not up and the network service was DOWN.

Expected results:
The cleanup service should delete all of the stale virtual ports/interfaces when the controller nodes are rebooted.
After rebooting the controllers, the network service should come up with its configured VLAN-based networks.

Additional info:
WORKAROUND:
- Run "systemctl restart neutron-ovs-cleanup"  multiple times until no namespaces are left from "ifconfig" list
- Once the "neutron-ovs-cleanup" completes successfully (without the errors on step 4), reboot. In the last instance, we had this issue, I had to run this several times until it was all clear.

Please let us know if this is a known bug in OSP7 and whether it is fixed in an errata patch release.
Please let us know whether you need any further information on it.

Thank you,

Comment 1 Assaf Muller 2017-08-22 18:36:56 UTC
Can we have an sosreport from a node that includes the ovs-cleanup failures?

Comment 2 Terry Wilson 2017-08-22 19:52:08 UTC
Yeah, we'll need the logs. ovs_cleanup could definitely be sped up by not using ovs_lib's delete_port for each port and instead creating a single transaction that deletes all of the ports that need deleting. But w/o the log, I'm not sure exactly why things are timing out.
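
For illustration (not from the original comment), the batching idea can be approximated with a single ovs-vsctl invocation that chains del-port commands with "--", so only one process and one OVSDB transaction are needed. This simplified sketch drops every port on br-int, whereas the real cleanup only removes Neutron-owned ports:

# Sketch only: batch all deletions into one ovs-vsctl call by chaining
# "-- del-port" commands, instead of running one delete per port.
ports=$(sudo ovs-vsctl list-ports br-int)
cmd=""
for p in $ports; do cmd="$cmd -- del-port br-int $p"; done
[ -n "$cmd" ] && sudo ovs-vsctl $cmd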

Comment 4 Terry Wilson 2017-08-23 14:20:52 UTC
The SOS reports don't seem to include the /var/log/neutron/ovs-cleanup.log files. Do these files exist somewhere?

Comment 11 Terry Wilson 2017-10-02 22:22:34 UTC
Testing with creating dummy neutron ports via:

$ sudo rmmod dummy; sudo modprobe dummy numdummies=1000
$ for ((i=0;i<1000;i++));do echo "-- add-port br-int dummy$i -- set Interface dummy$i external-ids:attached-mac=00:01:02:03:04:05 external-ids:iface-id=1";done|xargs sudo ovs-vsctl
$ time sudo systemctl stop neutron-ovs-cleanup.service

results in a timeout at 90s and leftover ports. Re-running cleans up the ports.

Doing the same with:
$ time sudo systemctl start neutron-ovs-cleanup.service

results in all ports being cleaned up after ~130s. So the stop/start timeouts seem to be different at least on my RHEL 7.4/kilo install.
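
For reference (not part of the original comment), the start/stop timeouts that systemd actually applies to the unit can be checked with systemctl show:

# Show the effective start/stop timeouts for the unit; the values depend on
# the local systemd defaults and any TimeoutSec overrides.
systemctl show neutron-ovs-cleanup.service -p TimeoutStartUSec -p TimeoutStopUSec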

Modifying /usr/lib/systemd/system/neutron-ovs-cleanup.service to have:

[Service]
...
TimeoutSec=0

allows systemctl stop neutron-ovs-cleanup to run until completion. I think this would be a sufficient solution (several other openstack services have stop/start timeout=0).
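
As a side note (not from the original comment), the same override could be applied without editing the packaged unit file by using a systemd drop-in, for example:

# Hypothetical drop-in override; avoids modifying the packaged unit file.
sudo mkdir -p /etc/systemd/system/neutron-ovs-cleanup.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/neutron-ovs-cleanup.service.d/timeout.conf
[Service]
TimeoutSec=0
EOF
sudo systemctl daemon-reload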

I should note, however, that on one occasion I did not reload the dummy module and recreated the ports in OVS. There were errors from vswitchd, though the ports existed in the OVSDB. Running neutron-ovs-cleanup start ran *very* slowly in this case (after 10 minutes, there were only about 100 ports deleted). strace showed that it was waiting on the rootwrap daemon to send responses. So setting an infinite timeout could theoretically cause a server to reboot extremely slowly.

Comment 18 errata-xmlrpc 2018-03-07 15:22:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0463