Bug 1722578 - Loss of network connectivity of a compute node after reboot due to wrong network services startup sequence
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z9
Target Release: 13.0 (Queens)
Assignee: Slawek Kaplonski
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-20 16:35 UTC by Aviv Guetta
Modified: 2023-12-15 16:36 UTC (History)
CC: 13 users

Fixed In Version: openstack-neutron-12.0.6-11.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-07 14:00:05 UTC
Target Upstream Version:
Embargoed:




Links:
Launchpad 1840443: last updated 2019-08-16 12:19:29 UTC
OpenStack gerrit 676949: MERGED, "Initialize phys bridges before setup_rpc", last updated 2020-07-15 12:51:47 UTC
OpenStack gerrit 677056: MERGED, "Initialize phys bridges before setup_rpc", last updated 2020-07-15 12:51:47 UTC
Red Hat Issue Tracker OSP-30786: last updated 2023-12-15 16:36:34 UTC
Red Hat Product Errata RHBA-2019:3803: last updated 2019-11-07 14:00:34 UTC

Description Aviv Guetta 2019-06-20 16:35:21 UTC
Description of problem:
Occasionally, after a reboot of a compute node, the node stays unreachable over the network.
Restarting network.service restores connectivity.

From `messages.log` we suspect the network is not brought up correctly: the OVS bond is brought up before its member network interfaces have link:
~~~
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device em1 entered promiscuous mode
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device p1p1 entered promiscuous mode
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready  
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): p1p1: link is not ready   
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.notice ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex bond0 -- add-bond br-ex bond0 p1p1 em1 bond_mode=active-backup
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.info network: Bringing up interface bond0:  [  OK  ] <<== Bond is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: igb 0000:20:00.0 p1p1: igb: p1p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX    <<<===NIC is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): p1p1: link becomes ready
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: ixgbe 0000:08:00.0 em1: NIC Link is Up 1 Gbps, Flow Control: None
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
2019-05-31 10:00:36 +02:00 d100siul0555 daemon.info network: Bringing up interface em1:  [  OK  ]  <<<===NIC is brought up
2019-05-31 10:00:38 +02:00 d100siul0555 daemon.info network: Bringing up interface p1p1:  [  OK  ] <<<===NIC is brought up
~~~
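The ordering above can be checked mechanically by comparing the timestamps in `messages.log`. A minimal sketch (a hypothetical helper over a trimmed log excerpt, not part of any customer tooling):

```python
from datetime import datetime

# Minimal sketch: verify from a messages.log excerpt that the OVS bond
# was brought up before any member NIC reported link-up.
LOG = """\
2019-05-31 10:00:24 network: Bringing up interface bond0:  [  OK  ]
2019-05-31 10:00:26 kernel: igb p1p1: NIC Link is Up 1000 Mbps
2019-05-31 10:00:27 kernel: ixgbe em1: NIC Link is Up 1 Gbps
"""

def event_time(line):
    # Timestamps are "YYYY-MM-DD HH:MM:SS" in the first 19 characters.
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def first_match(substring):
    return next(event_time(l) for l in LOG.splitlines() if substring in l)

bond_up = first_match("Bringing up interface bond0")
nic_up = min(first_match("p1p1: NIC Link is Up"),
             first_match("em1: NIC Link is Up"))
print(bond_up < nic_up)  # True: bond came up before any member NIC had link
```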

Version-Release number of selected component (if applicable):
RHOSP 13.0
RHEL 7.6

How reproducible:

The issue is occasional. The customer says it happens on 1 out of 10 reboots.
 

Additional info:

Comment 3 Alex Stupnikov 2019-08-02 17:22:38 UTC
After troubleshooting, this issue appears to be caused by neutron-openvswitch-agent: after a compute reboot, the br-ex interface comes online for a short interval (other hosts in the same network can ping the affected compute), but it goes offline again once Open vSwitch establishes a connection to neutron-openvswitch-agent [1].

We have performed the following troubleshooting steps:

- we sent ICMP echo requests from a second compute to the affected one and collected timestamps to isolate the outage intervals. Results:
  - Aug  2 13:24:46 --> outage (compute was rebooted, boot process started at 2019-08-02 13:31:08)
  - Aug  2 13:31:12 --> successful ping
  - Aug  2 13:31:26 --> last successful ping
  - Aug  2 13:31:27 --> outage
- the outage occurred at the moment OVS connected to neutron-openvswitch-agent [1]


Next steps:

- it would be great to have an update from the Neutron developers; sosreports are available, and the data we collected is in the collect-data.tar.gz archive
- support will enable debug logging for the Neutron services and provide detailed Neutron OVS agent logs covering the time of the outage

[1]
    2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected

Comment 4 Alex Stupnikov 2019-08-02 17:27:02 UTC
Sorry, I did not include the root cause of the networking outage in my previous comment: there are no flows on the br-ex bridge. The issue occurs only when the customer reboots a compute with one of the bond member interfaces in the down state.

Short summary:

- before the reboot, the customer shuts down one of the bond member interfaces
- after the reboot, the IP address on the br-ex interface is reachable for a short period of time
- after [1], the IP address becomes unreachable because there are no flows in the br-ex flow table

[1]
    2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
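The "reachable briefly, then dead" pattern matches how an OVS bridge's forwarding changes when an OpenFlow controller attaches. A toy model of that mechanism (an assumption-labelled sketch, not real OVS code; it assumes the bridge is in the default "standalone" fail mode before the controller connects):

```python
# Toy model of the outage: before the OpenFlow controller connects, a
# bridge in "standalone" fail mode forwards like a normal learning
# switch; once the controller connects, forwarding is governed solely by
# the installed flow table, so an empty table means traffic is dropped.

class Bridge:
    def __init__(self, fail_mode="standalone"):
        self.fail_mode = fail_mode
        self.controller_connected = False
        self.flows = []

    def forwards_traffic(self):
        if not self.controller_connected:
            # No controller yet: standalone bridges fall back to L2 learning.
            return self.fail_mode == "standalone"
        # Controller connected: only installed flows forward traffic.
        return len(self.flows) > 0

br_ex = Bridge()
print(br_ex.forwards_traffic())    # True: pings succeed right after boot
br_ex.controller_connected = True  # agent connects but installs no flows
print(br_ex.forwards_traffic())    # False: the outage begins
```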

Comment 14 Slawek Kaplonski 2019-08-16 12:19:29 UTC
I have just reported a related bug upstream: https://bugs.launchpad.net/neutron/+bug/1840443
I think it will be easy to fix upstream.
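The merged gerrit changes linked above are titled "Initialize phys bridges before setup_rpc". A hedged sketch of the ordering issue they address (hypothetical function names, not the actual neutron agent code): if RPC is set up first, there is a window where RPC-driven events can reach the agent while the physical bridges are still uninitialized.

```python
# Sketch of the startup-ordering fix (hypothetical names). We model each
# ordering and check whether an RPC message could arrive while the
# physical bridges are still uninitialized.

def boot(order):
    bridges_ready = False
    race_window = False
    for step in order:
        if step == "setup_physical_bridges":
            bridges_ready = True
        elif step == "setup_rpc":
            # From this point the agent can receive RPC-driven events.
            if not bridges_ready:
                race_window = True
    return race_window

print(boot(["setup_rpc", "setup_physical_bridges"]))  # True: race window
print(boot(["setup_physical_bridges", "setup_rpc"]))  # False: fixed order
```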

Comment 39 Alex McLeod 2019-10-31 11:32:59 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 41 errata-xmlrpc 2019-11-07 14:00:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3803

