Description of problem:

Sometimes after a reboot of a compute node, the node stays unreachable until network.service is restarted. From `messages.log` we suspect the network is not brought up correctly: the OVS bond is brought up before its member interfaces come up:

~~~
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device em1 entered promiscuous mode
2019-05-31 10:00:20 +02:00 d100siul0555 kern.info kernel: device p1p1 entered promiscuous mode
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
2019-05-31 10:00:23 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_UP): p1p1: link is not ready
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.notice ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex bond0 -- add-bond br-ex bond0 p1p1 em1 bond_mode=active-backup
2019-05-31 10:00:24 +02:00 d100siul0555 daemon.info network: Bringing up interface bond0:  [  OK  ]  <<== bond is brought up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: igb 0000:20:00.0 p1p1: igb: p1p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX  <<== NIC link comes up
2019-05-31 10:00:26 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): p1p1: link becomes ready
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: ixgbe 0000:08:00.0 em1: NIC Link is Up 1 Gbps, Flow Control: None
2019-05-31 10:00:27 +02:00 d100siul0555 kern.info kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
2019-05-31 10:00:36 +02:00 d100siul0555 daemon.info network: Bringing up interface em1:  [  OK  ]  <<== NIC is brought up
2019-05-31 10:00:38 +02:00 d100siul0555 daemon.info network: Bringing up interface p1p1:  [  OK  ]  <<== NIC is brought up
~~~

Version-Release number of selected component (if applicable):
RHOSP 13.0
RHEL 7.6

How reproducible:
Occasional; the customer reports it happens on roughly 1 out of 10 reboots.
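The ordering above can be verified directly from the log. A minimal, self-contained sketch, using sample lines copied from the excerpt (on a real host you would grep `/var/log/messages` itself):

```shell
# Sample lines condensed from the messages.log excerpt above; on a real host,
# grep /var/log/messages directly instead of a temp file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2019-05-31 10:00:24 network: Bringing up interface bond0:  [  OK  ]
2019-05-31 10:00:26 kernel: p1p1 NIC Link is Up 1000 Mbps Full Duplex
2019-05-31 10:00:27 kernel: em1 NIC Link is Up 1 Gbps
2019-05-31 10:00:36 network: Bringing up interface em1:  [  OK  ]
2019-05-31 10:00:38 network: Bringing up interface p1p1:  [  OK  ]
EOF
# If the first match is bond0, the bond was brought up before its member NICs.
out=$(grep -nE 'Bringing up interface (bond0|em1|p1p1)|NIC Link is Up' "$LOG")
printf '%s\n' "$out"
rm -f "$LOG"
```

With the log above, the first matching line is the `bond0` bring-up, confirming the suspected ordering.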
Additional info:
After troubleshooting, this issue appears to be caused by neutron-openvswitch-agent: after a compute reboot, the br-ex interface comes online for a short interval (other hosts in the same network can ping the affected compute), but it goes back offline once Open vSwitch establishes its connection to neutron-openvswitch-agent [1].

We have performed the following troubleshooting steps:
- We sent ICMP echo requests from a second compute to the affected one and collected timestamps to isolate the time intervals of the issue. Results:
  - Aug 2 13:24:46 --> outage (compute was rebooted, boot process started at 2019-08-02 13:31:08)
  - Aug 2 13:31:12 --> successful ping
  - Aug 2 13:31:26 --> last successful ping
  - Aug 2 13:31:27 --> outage
- The outage occurred at the moment OVS connected to neutron-openvswitch-agent [1].

Next steps:
- It would be great to have an update from the neutron developers; sosreports are available, and the data we collected is in the collect-data.tar.gz archive.
- Support will enable debug for the neutron services and provide detailed logs for the neutron OVS agent at the time of the outage.

[1]
2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
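The timestamp-collection step can be sketched as a small loop that prefixes each ping reply with a wall-clock timestamp, so the outage window lines up with the OVS and agent logs. The target address here is a placeholder, not the customer's actual IP:

```shell
# Prefix every ping reply (or error) with a timestamp for later correlation.
# TARGET is a placeholder; on site this was the affected compute's IP address.
TARGET=127.0.0.1
ping -c 3 -i 1 "$TARGET" 2>&1 | while read -r line; do
    printf '%s %s\n' "$(date '+%b %e %T')" "$line"
done
```

Where iputils is available, `ping -D` prints an epoch timestamp before each reply and avoids the loop entirely.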
Sorry, I did not include the root cause of the networking outage in my previous comment: there are no flows on the br-ex bridge. The issue occurs only when the customer reboots a compute while one of the bond interfaces is in the down state.

Short summary:
- Before the reboot, the customer shuts down one of the bond's interfaces.
- After the reboot, the IP address on the br-ex interface is reachable for a short period of time.
- After [1], this IP address becomes unreachable because there are no flows in the br-ex flow table.

[1]
2019-08-02T11:31:27.013Z|00396|rconn|INFO|br-uplink1<->tcp:127.0.0.1:6633: connected
2019-08-02T11:31:27.014Z|00397|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2019-08-02T11:31:27.014Z|00398|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
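The failure state can be confirmed on the affected compute by counting the OpenFlow entries on br-ex. A sketch, assuming `ovs-ofctl` is on the PATH (the block degrades to "no flows" if it is not):

```shell
# Count flow entries on br-ex; zero matches the broken state described above.
# '|| true' keeps the pipeline safe under 'set -e' when grep finds no match.
FLOWS=$(ovs-ofctl dump-flows br-ex 2>/dev/null | grep -c 'cookie=' || true)
if [ "${FLOWS:-0}" -eq 0 ]; then
    echo "br-ex has no OpenFlow entries -- dataplane traffic will be dropped"
else
    echo "br-ex has $FLOWS flow entries"
fi
```

In the healthy state, neutron-openvswitch-agent repopulates these flows shortly after the rconn connections in [1] are established; in the broken state the dump stays empty.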
I just reported a related bug upstream: https://bugs.launchpad.net/neutron/+bug/1840443 I think it will be easy to fix this upstream.
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3803