Bug 1722578

Summary:	Loss of network connectivity of a compute node after reboot due to wrong network services startup sequence
Product:	Red Hat OpenStack	Reporter:	Aviv Guetta <aguetta>
Component:	openstack-neutron	Assignee:	Slawek Kaplonski <skaplons>
Status:	CLOSED ERRATA	QA Contact:	Candido Campos <ccamposr>
Severity:	high	Docs Contact:
Priority:	high
Version:	13.0 (Queens)	CC:	amoralej, amuller, astupnik, bcafarel, ccamposr, chrisw, ealcaniz, mburns, pmorey, rhos-maint, scohen, skaplons, tfreger
Target Milestone:	z9	Keywords:	Triaged, ZStream
Target Release:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-neutron-12.0.6-11.el7ost	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-07 14:00:05 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Aviv Guetta 2019-06-20 16:35:21 UTC

Description of problem:
Sometimes, after a compute node is rebooted, it stays unreachable over the network.
Restarting network.service restores connectivity.

From `messages.log` we suspect that the network isn't brought up correctly: the OVS bond is brought up before its member network interfaces:
~~~
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device em1 entered promiscuous mode
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device p1p1 entered promiscuous mode
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready  
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): p1p1: link is not ready   
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.notice ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex bond0 -- add-bond br-ex bond0 p1p1 em1 bond_mode=active-backup
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.info network: Bringing up interface bond0:  [  OK  ] <<== Bond is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: igb 0000:20:00.0 p1p1: igb: p1p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX    <<<===NIC is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): p1p1: link becomes ready
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: ixgbe 0000:08:00.0 em1: NIC Link is Up 1 Gbps, Flow Control: None
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
2019-05-31 10:00:36 +02:00 d100siul0555 daemon.info network: Bringing up interface em1:  [  OK  ]  <<<===NIC is brought up
2019-05-31 10:00:38 +02:00 d100siul0555 daemon.info network: Bringing up interface p1p1:  [  OK  ] <<<===NIC is brought up
~~~
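The ordering suspected above can be checked mechanically. The following is a minimal sketch, assuming the syslog line layout shown in the excerpt (timestamp with UTC offset in the first 26 characters, Python 3.7 or later for the `+02:00` offset parsing). The function name `late_links` and the regular expressions are illustrative and not part of any tool mentioned in this bug.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S %z"  # e.g. "2019-05-31 10:00:24 +02:00"
# Assumed patterns, derived only from the excerpt above.
BRINGUP = re.compile(r"network: Bringing up interface (\S+):")
LINK_UP = re.compile(r"kernel: \S+ \S+ (\S+): .*NIC Link is Up")


def late_links(lines, bond="bond0"):
    """Return {nic: seconds} for NICs whose link came up after `bond` was brought up."""
    bond_time = None
    link_times = {}
    for line in lines:
        try:
            stamp = datetime.strptime(line[:26], TS_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        match = BRINGUP.search(line)
        if match and match.group(1) == bond and bond_time is None:
            bond_time = stamp
        match = LINK_UP.search(line)
        if match:
            # Keep only the first link-up event per NIC.
            link_times.setdefault(match.group(1), stamp)
    if bond_time is None:
        return {}
    return {
        nic: (stamp - bond_time).total_seconds()
        for nic, stamp in link_times.items()
        if stamp > bond_time
    }
```

Fed the lines from the excerpt, this reports both p1p1 and em1 as coming up after bond0. A non-empty result only indicates the ordering; it does not by itself prove the cause of the outage.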

Version-Release number of selected component (if applicable):
RHOSP 13.0
RHEL 7.6

How reproducible:

The issue is occasional. The customer says it happens on 1 out of 10 reboots.
 

Additional info:

Comment 3 Alex Stupnikov 2019-08-02 17:22:38 UTC

After troubleshooting, it looks like this issue is caused by neutron-openvswitch-agent. After the compute node reboots, the br-ex interface comes online for a short interval, during which other hosts in the same network can ping the affected compute node. It goes back offline once openvswitch establishes a connection to neutron-openvswitch-agent [1].

We have performed the following troubleshooting steps:

- we sent ICMP echo requests from a second compute node to the affected one and collected timestamps to isolate the time intervals of this issue. Results:
  - Aug  2 13:24:46 --> outage (compute was rebooted, boot process started at 2019-08-02 13:31:08)
  - Aug  2 13:31:12 --> successful ping
  - Aug  2 13:31:26 --> last successful ping
  - Aug  2 13:31:27 --> outage
- we can see that the outage occurred when OVS connected to neutron-openvswitch-agent [1]


Next steps:

- it would be great to have an update from the neutron developers: sosreports are available, and the data we have collected is in the collect-data.tar.gz archive;
- support will enable debug logging for neutron services and provide detailed neutron OVS agent logs from the time of the outage

[1]
    2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
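
The correlation between the ping results and [1] depends on comparing timestamps in different zones. A minimal sketch follows, assuming the ping timestamps are local time at UTC+02:00 (an assumption, based on the `+02:00` offset in the syslog excerpt from the description) and that the `Z` suffix in the OVS log means UTC. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the ping timestamps were recorded in local time at UTC+02:00.
LOCAL_TZ = timezone(timedelta(hours=2))


def parse_ovs_timestamp(line):
    """Parse the leading UTC timestamp of an OVS log line."""
    stamp = line.split("|", 1)[0].strip()
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_ping_timestamp(text, year=2019):
    """Parse a syslog-style 'Aug  2 13:31:26' timestamp; the year is supplied."""
    naive = datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")
    return naive.replace(tzinfo=LOCAL_TZ)


connected = parse_ovs_timestamp(
    "2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected"
)
last_ok = parse_ping_timestamp("Aug  2 13:31:26")
first_outage = parse_ping_timestamp("Aug  2 13:31:27")

# Seconds between the last successful ping and the br-ex controller connection.
gap_seconds = (connected - last_ok).total_seconds()
```

Under these assumptions, the br-ex connection falls within about one second of the last successful ping, which matches the sequence described above.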

Comment 4 Alex Stupnikov 2019-08-02 17:27:02 UTC

Sorry, I didn't include the root cause of the networking outage in my previous comment: there are no flows on the br-ex bridge. This issue occurs only if the customer reboots a compute node with some of the bonds in a down state.

Short summary:

- before the reboot, the customer shuts down one of the bond interfaces
- after the reboot, the IP address on the br-ex interface becomes reachable for a short period of time
- after [1], this IP address becomes unreachable because there are no flows in the br-ex table.

[1]
    2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
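
The "no flows on br-ex" condition can be checked on the affected node with `ovs-ofctl dump-flows br-ex`. Below is a minimal sketch that wraps that command and counts flow entries. It assumes `ovs-ofctl` is on the PATH, is run with sufficient privileges, and can talk to the bridge with its default OpenFlow version. The function names are hypothetical, and counting lines that contain `actions=` is a heuristic based on the usual `dump-flows` text output rather than a documented interface.

```python
import subprocess


def count_flows(dump_output):
    """Count flow entries in `ovs-ofctl dump-flows` text output.

    Heuristic: every flow entry line contains an `actions=` field,
    while the reply header line does not.
    """
    return sum(1 for line in dump_output.splitlines() if "actions=" in line)


def bridge_has_flows(bridge="br-ex"):
    """Run `ovs-ofctl dump-flows <bridge>` and report whether any flow exists."""
    result = subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge],
        capture_output=True,
        text=True,
        check=True,  # raise if ovs-ofctl fails, e.g. the bridge is missing
    )
    return count_flows(result.stdout) > 0
```

If `bridge_has_flows("br-ex")` returns False after the controller connection in [1], that is consistent with the empty flow table described in this comment.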

Comment 14 Slawek Kaplonski 2019-08-16 12:19:29 UTC

I just reported a related bug upstream (u/s): https://bugs.launchpad.net/neutron/+bug/1840443
I think it will be easy to fix this u/s.

Comment 39 Alex McLeod 2019-10-31 11:32:59 UTC

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 41 errata-xmlrpc 2019-11-07 14:00:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3803