Bug 1475764
Summary: OSP12 | HA | Controller network ovs interfaces are non-functional after hard reboot of the node (OVS)

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Udi Shkalim <ushkalim> |
| Component: | openstack-neutron | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED DUPLICATE | QA Contact: | Toni Freger <tfreger> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 12.0 (Pike) | CC: | agurenko, amuller, chrisw, dciabrin, jlibosva, m.andre, mcornea, michele, mkrcmari, nyechiel, oblaut, ohochman, rhallise, sasha, srevivo, ushkalim |
| Target Milestone: | --- | Keywords: | AutomationBlocker |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-17 14:00:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Udi Shkalim
2017-07-27 09:36:41 UTC
Created attachment 1305282 [details]
firewall settings when network is working
Created attachment 1305296 [details]
broken firewall settings after systemctl restart docker
Created attachment 1305298 [details]
correct firewall settings just after systemctl stop docker
So it looks like when docker is (re)started, it forces its firewall rules to come first in the FORWARD chain:

```diff
diff -Nru fw-working.txt fw-broken-after-docker-restart.txt
--- fw-working.txt	2017-07-27 13:06:12.725081704 +0200
+++ fw-broken-after-docker-restart.txt	2017-07-27 13:06:11.890084604 +0200
@@ -77,13 +77,13 @@
 Chain FORWARD (policy ACCEPT)
 target            prot opt source     destination
-nova-filter-top   all  --  0.0.0.0/0  0.0.0.0/0
-nova-api-FORWARD  all  --  0.0.0.0/0  0.0.0.0/0
 DOCKER-ISOLATION  all  --  0.0.0.0/0  0.0.0.0/0
 DOCKER            all  --  0.0.0.0/0  0.0.0.0/0
 ACCEPT            all  --  0.0.0.0/0  0.0.0.0/0  ctstate RELATED,ESTABLISHED
 ACCEPT            all  --  0.0.0.0/0  0.0.0.0/0
 ACCEPT            all  --  0.0.0.0/0  0.0.0.0/0
+nova-filter-top   all  --  0.0.0.0/0  0.0.0.0/0
+nova-api-FORWARD  all  --  0.0.0.0/0  0.0.0.0/0
 
 Chain OUTPUT (policy ACCEPT)
 target            prot opt source     destination
```

And this breaks the firewall rules that have been set up by nova to bring network connectivity to the controller nodes. Consequently the controller cannot reach other machines on the control plane.

I'm not sure why, but stopping the docker service seems to revert that firewall behaviour:

```diff
diff -Nru fw-broken-after-docker-restart.txt fw-working-again-after-docker-stop.txt
--- fw-broken-after-docker-restart.txt	2017-07-27 13:06:11.890084604 +0200
+++ fw-working-again-after-docker-stop.txt	2017-07-27 13:06:12.520082416 +0200
@@ -77,13 +77,13 @@
 Chain FORWARD (policy ACCEPT)
 target            prot opt source     destination
+nova-filter-top   all  --  0.0.0.0/0  0.0.0.0/0
+nova-api-FORWARD  all  --  0.0.0.0/0  0.0.0.0/0
 DOCKER-ISOLATION  all  --  0.0.0.0/0  0.0.0.0/0
 DOCKER            all  --  0.0.0.0/0  0.0.0.0/0
 ACCEPT            all  --  0.0.0.0/0  0.0.0.0/0  ctstate RELATED,ESTABLISHED
 ACCEPT            all  --  0.0.0.0/0  0.0.0.0/0
 ACCEPT            all  --  0.0.0.0/0  0.0.0.0/0
-nova-filter-top   all  --  0.0.0.0/0  0.0.0.0/0
-nova-api-FORWARD  all  --  0.0.0.0/0  0.0.0.0/0
 
 Chain OUTPUT (policy ACCEPT)
 target            prot opt source     destination
```

...and once docker is stopped, the network works again (the controller node can reach its peers).
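Why the reordering matters: iptables evaluates a chain top to bottom and the first matching terminal rule decides the packet's fate, so moving the nova chains below Docker's changes which rules ever see the traffic. A minimal, hypothetical simulation of that first-match-wins behaviour (the rule predicates and verdicts below are simplified assumptions for illustration, not the exact chains above):

```python
# Minimal simulation of iptables first-match-wins chain evaluation.
def evaluate(chain, packet):
    """Return the verdict of the first matching rule, or the chain policy."""
    for match, verdict in chain:
        if match(packet):
            return verdict
    return "ACCEPT"  # chain policy (FORWARD is ACCEPT in the diffs above)

# Hypothetical rules: nova accepts control-plane traffic; a Docker-managed
# rule is assumed to drop forwarded traffic it does not recognise.
nova_rule   = (lambda p: p["dst"] == "control-plane", "ACCEPT")
docker_rule = (lambda p: True, "DROP")

pkt = {"dst": "control-plane"}
working = [nova_rule, docker_rule]  # nova chains first: traffic accepted
broken  = [docker_rule, nova_rule]  # docker first after restart: traffic dropped

print(evaluate(working, pkt))  # ACCEPT
print(evaluate(broken, pkt))   # DROP
```

The packet is identical in both cases; only the rule order differs, which is exactly what the docker restart changes.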
By default dockerd uses the following:

```
--iptables=true|false    Enable Docker's addition of iptables rules. Default is true.
```

I think we can resolve this by adjusting our default dockerd options in TripleO. I will push an upstream patch to do this shortly. Until then, as a downstream workaround for this issue please consider setting the following Heat parameters:

```yaml
parameter_defaults:
  ControllerExtraConfig:
    tripleo::profile::base::docker::docker_options: '--log-driver=journald --signature-verification=false --iptables=false'
  ComputeExtraConfig:
    tripleo::profile::base::docker::docker_options: '--log-driver=journald --signature-verification=false --iptables=false'
```

See upstream: https://bugs.launchpad.net/tripleo/+bug/1708279

Also, this fix: https://review.openstack.org/#/c/490201/

So Damien, Marian and I spent some time looking at this issue, and it seems to be an OVS vs neutron_ovs_agent issue (not sure whether it is a Pike regression or is due to the containerization).

Problem: when rebooting a controller node we lose all connectivity over the interfaces managed by OVS.

Explanation: after a reboot the following sequence happens:

A) openvswitch tries to talk to the neutron_ovs_agent container on port 6633.

B) The neutron_ovs_agent is unable to talk to rabbit (because rabbit is reachable only over an OVS network, br-ex in our case) and hence logs the following:

```
6cdb-3cdb-4370-b88a-b5fcfc261835] AMQP server on overcloud-controller-2.internalapi.localdomain:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 2 seconds. Client port: None: error: [Errno 113] EHOSTUNREACH
2017-08-16 08:46:58.115 9792 ERROR oslo.messaging._drivers.impl_rabbit [-] [506dd6cdb-3cdb-4370-b88a-b5fcfc261835] AMQP server on overcloud-controller-1.internalapi.localdomain:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds.
```

C) Since the neutron_ovs_agent can't talk to rabbit, it seems that OVS just keeps retrying on port 6633 and never gives up, so the networking is never functional.

By doing a 'docker stop neutron_ovs_agent' things are dandy again and we can ping on the networks managed by OVS. OVS seems to default to some open policy when neutron_ovs_agent is fully down, but it seems to get stuck when the neutron_ovs_agent container cannot talk to rabbit.

I am moving this into DFG Networking's lap and back to NEW for triage. I think the patch to disable iptables in docker on the overcloud is still a sensible change in general for Pike, so there is no need to drop it.

*** This bug has been marked as a duplicate of bug 1473763 ***
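The "open policy" mentioned above is consistent with OVS bridge fail-mode behaviour: in "standalone" fail mode a bridge falls back to acting as an ordinary learning switch when its OpenFlow controller goes away, while a bridge with a live but idle controller connection only forwards what its installed flows allow. The sketch below is an assumed, heavily simplified model of that behaviour (it is not OVS code, and treating "agent container up but stuck on rabbit" as "controller connected, no flows programmed" is an assumption based on the symptoms reported here):

```python
# Assumed, simplified model of OVS bridge fail-mode behaviour.
def forwards_traffic(fail_mode, controller_connected, flows_programmed):
    """Return True if the bridge can forward traffic in this state."""
    if controller_connected:
        # Bridge defers to the controller: only programmed flows pass traffic.
        return flows_programmed
    # Controller gone: "standalone" falls back to a plain learning switch;
    # "secure" keeps forwarding only via flows that were already installed.
    return fail_mode == "standalone" or flows_programmed

# After the hard reboot: the agent container is up, so the port-6633 session
# looks alive, but the agent is stuck retrying rabbit and programs no flows.
print(forwards_traffic("standalone", controller_connected=True, flows_programmed=False))

# After 'docker stop neutron_ovs_agent': the controller is fully gone and the
# bridge falls back to learning-switch behaviour, so pings work again.
print(forwards_traffic("standalone", controller_connected=False, flows_programmed=False))
```

Under this model the observed behaviour (broken while the agent container is up, working once it is stopped) falls out directly; verifying the actual fail_mode on the affected bridges would confirm or refute it.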