Description of problem:

Due to a Dell firmware upgrade, all the nodes in the OSP7 environment were rebooted. We noticed from the logs that "systemctl restart neutron-ovs-cleanup" timed out before the reboot, when the cluster was put on standby. Once the servers rebooted, the VLAN networks were not up, but a number of OVS interfaces from before the reboot were still present.

Version-Release number of selected component (if applicable):

openstack-neutron-2015.1.4-13.el7ost.noarch                 Sat Apr  8 02:07:32 2017
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch  Wed Dec 16 19:42:04 2015
openstack-neutron-common-2015.1.4-13.el7ost.noarch          Sat Apr  8 02:07:23 2017
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch            Wed Dec 14 02:13:59 2016
openstack-neutron-metering-agent-2015.1.4-13.el7ost.noarch  Sat Apr  8 02:07:32 2017
openstack-neutron-ml2-2015.1.4-13.el7ost.noarch             Sat Apr  8 02:07:32 2017
openstack-neutron-openvswitch-2015.1.4-13.el7ost.noarch     Sat Apr  8 02:07:32 2017
python-neutron-2015.1.4-13.el7ost.noarch                    Sat Apr  8 02:07:23 2017
python-neutron-lbaas-2015.1.4-1.el7ost.noarch               Wed Dec 14 02:13:59 2016
python-neutronclient-2.4.0-2.el7ost.noarch                  Wed Dec 16 19:41:47 2015

How reproducible:
Reproduced in customer environment

Steps to Reproduce:
1- The controllers require a reboot (in this case to apply required Dell firmware; sometimes to update the kernel, etc.)
2- We put the controllers on standby (pcs cluster stop)
3- All the services are stopped, and the neutron-ovs-cleanup service is left in a running state.
4- The "neutron-ovs-cleanup" service times out before all the cleanup is completed. Many namespace ports were not deleted/cleaned:

$ sudo systemctl restart neutron-ovs-cleanup
Job for neutron-ovs-cleanup.service failed because a configured resource limit was exceeded.
5- The controllers are rebooted.
6- Once the controllers boot up, the "network" service times out and the required VLANs are not created:

# systemctl status network
● network.service - LSB: Bring up/down networking
   Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: disabled)
   Active: failed (Result: timeout) since Sat 2017-08-12 04:51:44 UTC; 11min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 5616 ExecStart=/etc/rc.d/init.d/network start (code=killed, signal=TERM)

Actual results:
Once the servers rebooted, the VLAN networks were not up and the network service was DOWN.

Expected results:
The cleanup service should delete all the virtual networks when the controller nodes are rebooted. After a controller reboot, the network service should come up with its configured VLAN-based networks.

Additional info:

WORKAROUND:
- Run "systemctl restart neutron-ovs-cleanup" multiple times until no namespaces are left in the "ifconfig" list.
- Once "neutron-ovs-cleanup" completes successfully (without the errors in step 4), reboot.

The last time we had this issue, I had to run it several times until everything was clear.

Please let us know if this is a known bug in OSP7 and confirm whether it is fixed in an errata patch release. Please let us know whether you need any further information on it.

Thank you,
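The workaround above amounts to a retry loop. A minimal sketch, under stated assumptions: the `cleanup_until_clear` helper and the `LIST_NS`/`RESTART` hooks are hypothetical, the qdhcp-/qrouter- pattern is illustrative of typical neutron namespace names, and `ip netns list` stands in for `ifconfig` as the namespace lister:

```shell
# Sketch of the manual workaround: restart neutron-ovs-cleanup until no
# neutron (qdhcp-/qrouter-) namespaces remain, after which it should be
# safe to reboot. LIST_NS and RESTART are hypothetical hooks standing in
# for the real commands on a controller:
#   LIST_NS = "ip netns list"
#   RESTART = "systemctl restart neutron-ovs-cleanup"
LIST_NS=${LIST_NS:-"ip netns list"}
RESTART=${RESTART:-"systemctl restart neutron-ovs-cleanup"}

cleanup_until_clear() {
    while $LIST_NS 2>/dev/null | grep -qE 'qdhcp-|qrouter-'; do
        $RESTART || true   # the service may time out; retry anyway
        sleep 1
    done
    echo "no neutron namespaces left - safe to reboot"
}
```

This only automates the manual "run it again until clear" step from the report; it does not address why the cleanup times out in the first place.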
Can we have an sosreport from a node that includes the ovs-cleanup failures?
Yeah, we'll need the logs. ovs_cleanup could definitely be sped up by not calling ovs_lib's delete_port for each port and instead issuing a single transaction that deletes all of the ports that need deleting. But without the logs, I'm not sure exactly why things are timing out.
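The batching idea can be illustrated in shell terms: ovs-vsctl accepts multiple commands separated by `--`, so all deletions can go through one invocation (one OVSDB transaction) instead of one process per port. This is only a sketch of the approach, not the actual ovs_lib change; `build_batch_delete` is a hypothetical helper:

```shell
# Build one ovs-vsctl command line that deletes every named port in a
# single invocation, instead of spawning ovs-vsctl once per port.
build_batch_delete() {
    bridge=$1; shift
    args=""
    for port in "$@"; do
        args="$args -- del-port $bridge $port"
    done
    echo "ovs-vsctl$args"
}

# Example: build_batch_delete br-int tap1 tap2 prints
#   ovs-vsctl -- del-port br-int tap1 -- del-port br-int tap2
```

With hundreds of stale ports, collapsing N process spawns and N database transactions into one is where the speedup would come from.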
The SOS reports don't seem to include the /var/log/neutron/ovs-cleanup.log files. Do these files exist somewhere?
Testing by creating dummy neutron ports via:

$ sudo rmmod dummy; sudo modprobe dummy numdummies=1000
$ for ((i=0;i<1000;i++));do echo "-- add-port br-int dummy$i -- set Interface dummy$i external-ids:attached-mac=00:01:02:03:04:05 external-ids:iface-id=1";done|xargs sudo ovs-vsctl
$ time sudo systemctl stop neutron-ovs-cleanup.service

results in a timeout at 90s and leftover ports. Re-running cleans up the ports. Doing the same with:

$ time sudo systemctl start neutron-ovs-cleanup.service

results in all ports being cleaned up after ~130s. So the stop/start timeouts seem to be different, at least on my RHEL 7.4/kilo install.

Modifying /usr/lib/systemd/system/neutron-ovs-cleanup.service to have:

[Service]
...
TimeoutSec=0

allows "systemctl stop neutron-ovs-cleanup" to run until completion. I think this would be a sufficient solution (several other OpenStack services have a stop/start timeout of 0).

I should note, however, that on one occasion I did not reload the dummy module and recreated the ports in OVS. There were errors from vswitchd, though the ports existed in the OVSDB. Running neutron-ovs-cleanup start ran *very* slowly in this case (after 10 minutes, only about 100 ports had been deleted). strace showed that it was waiting on the rootwrap daemon to send responses. So setting an infinite timeout could theoretically cause a server to reboot extremely slowly.
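If the TimeoutSec=0 change above is applied as a local fix, it is cleaner to do it via a systemd drop-in than to edit the packaged unit file, so the override survives package updates. A hedged sketch; the helper name, drop-in file name, and directory argument are illustrative:

```shell
# Write TimeoutSec=0 as a systemd drop-in so the packaged unit file in
# /usr/lib/systemd/system stays untouched. The function takes the
# drop-in directory as an argument so it can be exercised safely; on a
# real controller the default path applies.
install_timeout_dropin() {
    dir=${1:-/etc/systemd/system/neutron-ovs-cleanup.service.d}
    mkdir -p "$dir"
    printf '[Service]\nTimeoutSec=0\n' > "$dir/timeout.conf"
}

# On a real system, follow up with:
#   install_timeout_dropin && systemctl daemon-reload
```

Note the caveat above still applies: with no timeout, a wedged rootwrap/vswitchd can make shutdown arbitrarily slow.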
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0463