Bug 1475764

Summary: OSP12 | HA | Controller network ovs interfaces are non-functional after hard reboot of the node (OVS)
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 12.0 (Pike)
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Keywords: AutomationBlocker
Reporter: Udi Shkalim <ushkalim>
Assignee: Assaf Muller <amuller>
QA Contact: Toni Freger <tfreger>
CC: agurenko, amuller, chrisw, dciabrin, jlibosva, m.andre, mcornea, michele, mkrcmari, nyechiel, oblaut, ohochman, rhallise, sasha, srevivo, ushkalim
Target Milestone: ---
Target Release: 12.0 (Pike)   
Hardware: x86_64   
OS: Linux   
Last Closed: 2017-08-17 14:00:48 UTC
Type: Bug
Attachments:
  firewall settings when network is working (flags: none)
  broken firewall settings after systemctl restart docker (flags: none)
  correct firewall settings just after systemctl stop docker (flags: none)

Description Udi Shkalim 2017-07-27 09:36:41 UTC
Description of problem:
In an HA OSP12 setup I hard-rebooted a controller node (echo b > /proc/sysrq-trigger) and networking on the isolated VLANs was no longer available (no ping to the other controllers).
When I stopped the docker service the network came back up; when I started docker again, the networking failed again.

Restarting the network service (systemctl restart network) also seems to fix the issue.
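
For reference, the manual recovery described above boils down to the following commands on the affected controller (just a sketch of what was observed to help or hurt, not a supported fix):

  # networking on the isolated VLANs comes back while docker is stopped
  systemctl stop docker

  # starting docker again breaks it
  systemctl start docker

  # alternatively, restarting the network service restores connectivity
  systemctl restart network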

Version-Release number of selected component (if applicable):


How reproducible:
Reproducible on both bare-metal and virtual hosts.

Steps to Reproduce:
1. Deploy OSP12 containers in an HA setup
2. Hard reboot the controller (e.g. via IPMI)
3. Check connectivity to the other controllers over the isolated VLANs

Actual results:
Host networking is not functional 

Expected results:
Networking is fully restored

Additional info:

Comment 2 Damien Ciabrini 2017-07-27 11:12:39 UTC
Created attachment 1305282 [details]
firewall settings when network is working

Comment 3 Damien Ciabrini 2017-07-27 11:14:35 UTC
Created attachment 1305296 [details]
broken firewall settings after systemctl restart docker

Comment 4 Damien Ciabrini 2017-07-27 11:15:50 UTC
Created attachment 1305298 [details]
correct firewall settings just after systemctl stop docker

Comment 5 Damien Ciabrini 2017-07-27 11:22:29 UTC
So it looks like when docker is (re)started, it forces its firewall rules to the top of the FORWARD chain:

diff -Nru fw-working.txt fw-broken-after-docker-restart.txt
--- fw-working.txt	2017-07-27 13:06:12.725081704 +0200
+++ fw-broken-after-docker-restart.txt	2017-07-27 13:06:11.890084604 +0200
@@ -77,13 +77,13 @@
 
 Chain FORWARD (policy ACCEPT)
 target     prot opt source               destination         
-nova-filter-top  all  --  0.0.0.0/0            0.0.0.0/0           
-nova-api-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0           
 DOCKER-ISOLATION  all  --  0.0.0.0/0            0.0.0.0/0           
 DOCKER     all  --  0.0.0.0/0            0.0.0.0/0           
 ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
 ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
 ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
+nova-filter-top  all  --  0.0.0.0/0            0.0.0.0/0           
+nova-api-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0           
 
 Chain OUTPUT (policy ACCEPT)
 target     prot opt source               destination         
 
And this breaks the firewall rules that have been set up by nova to bring network connectivity to the controller nodes. Consequently the controller cannot reach other machines on the control plane.

I'm not sure why, but stopping the docker service seems to revert that firewall behaviour:

diff -Nru fw-broken-after-docker-restart.txt fw-working-again-after-docker-stop.txt
--- fw-broken-after-docker-restart.txt	2017-07-27 13:06:11.890084604 +0200
+++ fw-working-again-after-docker-stop.txt	2017-07-27 13:06:12.520082416 +0200
@@ -77,13 +77,13 @@
 
 Chain FORWARD (policy ACCEPT)
 target     prot opt source               destination         
+nova-filter-top  all  --  0.0.0.0/0            0.0.0.0/0           
+nova-api-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0           
 DOCKER-ISOLATION  all  --  0.0.0.0/0            0.0.0.0/0           
 DOCKER     all  --  0.0.0.0/0            0.0.0.0/0           
 ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
 ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
 ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
-nova-filter-top  all  --  0.0.0.0/0            0.0.0.0/0           
-nova-api-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0           
 
 Chain OUTPUT (policy ACCEPT)
 target     prot opt source               destination         

...and once docker is stopped, the network works again (the controller node can reach its peers).
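
For what it's worth, the ordering can be inspected and, as a temporary stopgap, put back by hand with something like the following (a sketch only; docker will likely reinsert its rules ahead of the nova chains the next time it restarts):

  # show the FORWARD chain with rule positions
  iptables -L FORWARD -n --line-numbers

  # move the nova chains back to the top, matching the working output above
  iptables -D FORWARD -j nova-filter-top
  iptables -I FORWARD 1 -j nova-filter-top
  iptables -D FORWARD -j nova-api-FORWARD
  iptables -I FORWARD 2 -j nova-api-FORWARD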

Comment 7 Dan Prince 2017-08-02 20:44:48 UTC
By default dockerd uses the following:

       --iptables=true|false
         Enable Docker's addition of iptables rules. Default is true.

----

I think we can resolve this by adjusting our default dockerd options in TripleO.


I will push an upstream patch to do this shortly.

Until then, as a downstream workaround, please consider setting the following Heat parameters:

parameter_defaults:
  ControllerExtraConfig:
    tripleo::profile::base::docker::docker_options: '--log-driver=journald --signature-verification=false --iptables=false'
  ComputeExtraConfig:
    tripleo::profile::base::docker::docker_options: '--log-driver=journald --signature-verification=false --iptables=false'
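
The parameters above would typically go into an environment file passed with -e to the overcloud deploy command. After redeploying, something like the following should confirm the workaround took effect on a node (a sketch, assuming the chain layout shown in comment 5):

  # dockerd should now run with its iptables integration disabled
  ps -o args= -C dockerd | grep -- '--iptables=false'

  # and the nova chains should again sit at the top of the FORWARD chain
  iptables -L FORWARD -n --line-numbers | head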

Comment 8 Dan Prince 2017-08-02 21:04:45 UTC
See upstream: https://bugs.launchpad.net/tripleo/+bug/1708279

Also, this fix: https://review.openstack.org/#/c/490201/

Comment 12 Michele Baldessari 2017-08-16 09:09:38 UTC
So Damien, Marian and I spent some time looking at this issue, and it seems to be an OVS vs. neutron_ovs_agent issue (not sure if it is a Pike regression or whether it is due to the containerization).

Problem:
When rebooting a controller node we lose all connectivity over interfaces managed by OVS

Explanation:
After a reboot the following sequence happens:
A) openvswitch tries to talk to the neutron_ovs_agent container on port 6633
B) The neutron_ovs_agent is unable to talk to rabbit (because rabbit is reachable only over an OVS network, br-ex in our case) and hence logs the following:
6cdb-3cdb-4370-b88a-b5fcfc261835] AMQP server on overcloud-controller-2.internalapi.localdomain:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 2 seconds. Client port: None: error: [Errno 113] EHOSTUNREACH
2017-08-16 08:46:58.115 9792 ERROR oslo.messaging._drivers.impl_rabbit [-] [506dd6cdb-3cdb-4370-b88a-b5fcfc261835] AMQP server on overcloud-controller-1.internalapi.localdomain:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds. Client port: None: error: [Errno 113] EHOSTUNREACH
C) Since the neutron_ovs_agent can't talk to rabbit, OVS just keeps retrying on port 6633 and never gives up, so the networking never becomes functional (see the checks sketched below)
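
To confirm steps B and C on an affected node, checks along these lines should do (a sketch; the container name is the one used elsewhere in this report):

  # step B: the agent cannot reach rabbit over the internal API network
  docker logs neutron_ovs_agent 2>&1 | grep -i EHOSTUNREACH

  # step C: the agent's OpenFlow listener that OVS keeps retrying against
  ss -tlnp | grep 6633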


By doing a 'docker stop neutron_ovs_agent' things are dandy again and we can ping on the networks managed by OVS. OVS seems to default to some open policy when neutron_ovs_agent is fully down, but it seems to get stuck while the neutron_ovs_agent container is running yet cannot talk to rabbit.
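
The 'open policy' is presumably the bridge fail mode; the state, and the observed workaround, can be checked roughly as follows on an affected controller (assuming br-ex as above):

  # 'secure' means OVS installs no flows on its own while its OpenFlow
  # controller is unreachable; 'standalone' (or empty) means it falls back
  # to behaving like a normal switch
  ovs-vsctl get-fail-mode br-ex

  # configured OpenFlow controller(s) and whether OVS is connected to them
  ovs-vsctl get-controller br-ex
  ovs-vsctl show | grep -A1 Controller

  # observed workaround: stop the agent container so OVS forwards on its own
  docker stop neutron_ovs_agent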


I am moving this into DFG Networking's lap and back to NEW for triage.
I think the patch to disable iptables in docker on the overcloud is still a sensible change in general for Pike, so there is no need to drop it.

Comment 17 Jakub Libosvar 2017-08-17 14:00:48 UTC

*** This bug has been marked as a duplicate of bug 1473763 ***