Bug 1722578 - Loss of network connectivity of a compute node after reboot due to wrong network services startup sequence
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z9
Target Release: 13.0 (Queens)
Assignee: Slawek Kaplonski
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-20 16:35 UTC by Aviv Guetta
Modified: 2023-12-15 16:36 UTC (History)
CC: 13 users

Fixed In Version: openstack-neutron-12.0.6-11.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-07 14:00:05 UTC
Target Upstream Version:
Embargoed:




Links:
Launchpad 1840443: last updated 2019-08-16 12:19:29 UTC
OpenStack gerrit 676949: MERGED, "Initialize phys bridges before setup_rpc", last updated 2020-07-15 12:51:47 UTC
OpenStack gerrit 677056: MERGED, "Initialize phys bridges before setup_rpc", last updated 2020-07-15 12:51:47 UTC
Red Hat Issue Tracker OSP-30786: last updated 2023-12-15 16:36:34 UTC
Red Hat Product Errata RHBA-2019:3803: last updated 2019-11-07 14:00:34 UTC

Description Aviv Guetta 2019-06-20 16:35:21 UTC
Description of problem:
Occasionally, after a reboot of a compute node, the node stays unreachable over the network.
Restarting network.service restores connectivity.

From `messages.log` we suspect the network is not brought up correctly: the OVS bond is brought up before its member network interfaces have link:
~~~
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device em1 entered promiscuous mode
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device p1p1 entered promiscuous mode
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready  
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): p1p1: link is not ready   
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.notice ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex bond0 -- add-bond br-ex bond0 p1p1 em1 bond_mode=active-backup
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.info network: Bringing up interface bond0:  [  OK  ] <<== Bond is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: igb 0000:20:00.0 p1p1: igb: p1p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX    <<<===NIC is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): p1p1: link becomes ready
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: ixgbe 0000:08:00.0 em1: NIC Link is Up 1 Gbps, Flow Control: None
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
2019-05-31 10:00:36 +02:00 d100siul0555 daemon.info network: Bringing up interface em1:  [  OK  ]  <<<===NIC is brought up
2019-05-31 10:00:38 +02:00 d100siul0555 daemon.info network: Bringing up interface p1p1:  [  OK  ] <<<===NIC is brought up
~~~
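The ordering above can be checked mechanically by comparing the timestamps in `messages.log`. A minimal sketch (a hypothetical helper over a trimmed log excerpt, not part of any customer tooling):

```python
from datetime import datetime

# Minimal sketch: verify from a messages.log excerpt that the OVS bond
# was brought up before any member NIC reported link-up.
LOG = """\
2019-05-31 10:00:24 network: Bringing up interface bond0:  [  OK  ]
2019-05-31 10:00:26 kernel: igb p1p1: NIC Link is Up 1000 Mbps
2019-05-31 10:00:27 kernel: ixgbe em1: NIC Link is Up 1 Gbps
"""

def event_time(line):
    # Timestamps are "YYYY-MM-DD HH:MM:SS" in the first 19 characters.
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def first_match(substring):
    return next(event_time(l) for l in LOG.splitlines() if substring in l)

bond_up = first_match("Bringing up interface bond0")
nic_up = min(first_match("p1p1: NIC Link is Up"),
             first_match("em1: NIC Link is Up"))
print(bond_up < nic_up)  # True: bond came up before any member NIC had link
```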

Version-Release number of selected component (if applicable):
RHOSP 13.0
RHEL 7.6

How reproducible:

The issue is occasional. The customer says it happens on 1 out of 10 reboots.
 

Additional info:

Comment 3 Alex Stupnikov 2019-08-02 17:22:38 UTC
After troubleshooting, this issue appears to be caused by neutron-openvswitch-agent: after a compute reboot, the br-ex interface comes online for a short interval (other hosts in the same network can ping the affected compute), but it goes offline again once Open vSwitch establishes a connection to neutron-openvswitch-agent [1].

We have performed the following troubleshooting steps:

- we sent ICMP echo requests from a second compute to the affected one and collected timestamps to isolate the outage intervals. Results:
  - Aug  2 13:24:46 --> outage (compute was rebooted, boot process started at 2019-08-02 13:31:08)
  - Aug  2 13:31:12 --> successful ping
  - Aug  2 13:31:26 --> last successful ping
  - Aug  2 13:31:27 --> outage
- the outage occurred at the moment OVS connected to neutron-openvswitch-agent [1]


Next steps:

- it would be great to have an update from the Neutron developers; sosreports are available, and the data we collected is in the collect-data.tar.gz archive
- support will enable debug logging for the Neutron services and provide detailed Neutron OVS agent logs covering the time of the outage

[1]
    2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected

Comment 4 Alex Stupnikov 2019-08-02 17:27:02 UTC
Sorry, I did not include the root cause of the networking outage in my previous comment: there are no flows on the br-ex bridge. The issue occurs only when the customer reboots a compute with one of the bond member interfaces in the down state.

Short summary:

- before the reboot, the customer shuts down one of the bond member interfaces
- after the reboot, the IP address on the br-ex interface is reachable for a short period of time
- after [1], the IP address becomes unreachable because there are no flows in the br-ex flow table

[1]
    2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
    2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
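The "reachable briefly, then dead" pattern matches how an OVS bridge's forwarding changes when an OpenFlow controller attaches. A toy model of that mechanism (an assumption-labelled sketch, not real OVS code; it assumes the bridge is in the default "standalone" fail mode before the controller connects):

```python
# Toy model of the outage: before the OpenFlow controller connects, a
# bridge in "standalone" fail mode forwards like a normal learning
# switch; once the controller connects, forwarding is governed solely by
# the installed flow table, so an empty table means traffic is dropped.

class Bridge:
    def __init__(self, fail_mode="standalone"):
        self.fail_mode = fail_mode
        self.controller_connected = False
        self.flows = []

    def forwards_traffic(self):
        if not self.controller_connected:
            # No controller yet: standalone bridges fall back to L2 learning.
            return self.fail_mode == "standalone"
        # Controller connected: only installed flows forward traffic.
        return len(self.flows) > 0

br_ex = Bridge()
print(br_ex.forwards_traffic())    # True: pings succeed right after boot
br_ex.controller_connected = True  # agent connects but installs no flows
print(br_ex.forwards_traffic())    # False: the outage begins
```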

Comment 14 Slawek Kaplonski 2019-08-16 12:19:29 UTC
I have just reported a related bug upstream: https://bugs.launchpad.net/neutron/+bug/1840443
I think it will be easy to fix upstream.
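The merged gerrit changes linked above are titled "Initialize phys bridges before setup_rpc". A hedged sketch of the ordering issue they address (hypothetical function names, not the actual neutron agent code): if RPC is set up first, there is a window where RPC-driven events can reach the agent while the physical bridges are still uninitialized.

```python
# Sketch of the startup-ordering fix (hypothetical names). We model each
# ordering and check whether an RPC message could arrive while the
# physical bridges are still uninitialized.

def boot(order):
    bridges_ready = False
    race_window = False
    for step in order:
        if step == "setup_physical_bridges":
            bridges_ready = True
        elif step == "setup_rpc":
            # From this point the agent can receive RPC-driven events.
            if not bridges_ready:
                race_window = True
    return race_window

print(boot(["setup_rpc", "setup_physical_bridges"]))  # True: race window
print(boot(["setup_physical_bridges", "setup_rpc"]))  # False: fixed order
```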

Comment 39 Alex McLeod 2019-10-31 11:32:59 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 41 errata-xmlrpc 2019-11-07 14:00:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3803

