Bug 1527130

Summary: Dataplane gets *broken* when containers are stopped
Product: Red Hat OpenStack Reporter: Daniel Alvarez Sanchez <dalvarez>
Component: openstack-tripleo-heat-templates Assignee: Brent Eagles <beagles>
Status: CLOSED ERRATA QA Contact: Roee Agiman <ragiman>
Severity: urgent Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: amuller, augol, bcafarel, beagles, bhaley, chrisw, dalvarez, fdinitto, jlibosva, jschluet, mburns, mcornea, nyechiel, pkomarov, ragiman, rhel-osp-director-maint, rscarazz, srevivo, tfreger, ushkalim
Target Milestone: rc Keywords: Regression, TestBlocker, Triaged
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-8.6.1-10.el7ost puppet-tripleo-8.3.2-6.el7ost openstack-tripleo-heat-templates-8.0.2-18.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:40:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1544150    
Bug Blocks: 1503794, 1545154    

Description Daniel Alvarez Sanchez 2017-12-18 15:43:28 UTC
Description of problem:

In an upstream TripleO deployment with ML2/OVS, dataplane traffic is disrupted when the neutron-l3 container is restarted or stopped.

Version-Release number of selected component (if applicable):

openstack-tripleo-common-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20171127180031.cc2c715.el7.centos.noarch
openstack-tripleo-ui-8.0.1-0.20171129193834.1e42711.el7.centos.noarch
openstack-tripleo-common-containers-8.1.1-0.20171130034833.0e92cba.el7.centos.noarch
openstack-tripleo-validations-8.0.1-0.20171129140336.c1f2069.el7.centos.noarch
openstack-tripleo-heat-templates-8.0.0-0.20171130031741.4df242c.el7.centos.noarch
openstack-tripleo-image-elements-8.0.0-0.20171118092222.90b9a25.el7.centos.noarch 
openstack-kolla-5.0.0-0.20171107075441.61495b1.el7.centos.noarch


()[root@overcloud-controller-2 /]# rpm -qa | grep neutron
python-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
openstack-neutron-lbaas-12.0.0-0.20171206032035.0c76484.el7.centos.noarch
python2-neutronclient-6.5.0-0.20171023215239.355983d.el7.centos.noarch
openstack-neutron-common-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-fwaas-12.0.0-0.20171206094459.b5b4491.el7.centos.noarch
openstack-neutron-ml2-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch
python2-neutron-lib-1.11.0-0.20171129185804.ff5ee17.el7.centos.noarch
openstack-neutron-12.0.0-0.20171206144209.1ca38a1.el7.centos.noarch 

How reproducible:

100%

Steps to Reproduce:
1. Create a network, subnet, router, VM and attach a FIP to it.
2. Ping the FIP from the undercloud
3. Stop the L3 agent on all nodes (a hedged sketch of these steps as commands follows below)
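
A hedged sketch of these steps as commands; the resource names, image/flavor, external network name ("public") and the l3 agent container name ("neutron_l3_agent") are illustrative assumptions, not values taken from this environment:

(overcloud) [stack@undercloud ~]$ openstack network create net1
(overcloud) [stack@undercloud ~]$ openstack subnet create --network net1 --subnet-range 192.168.100.0/24 subnet1
(overcloud) [stack@undercloud ~]$ openstack router create router1
(overcloud) [stack@undercloud ~]$ openstack router set --external-gateway public router1
(overcloud) [stack@undercloud ~]$ openstack router add subnet router1 subnet1
(overcloud) [stack@undercloud ~]$ openstack server create --image cirros --flavor m1.tiny --network net1 vm1
(overcloud) [stack@undercloud ~]$ openstack floating ip create public
(overcloud) [stack@undercloud ~]$ openstack server add floating ip vm1 10.0.0.131
(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
# then, on every controller node:
[heat-admin@overcloud-controller-0 ~]$ sudo docker stop neutron_l3_agent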

Actual results:

Ping stops working:

(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
From 10.0.0.1 icmp_seq=1 Destination Host Unreachable 

Expected results:

Ping keeps working normally as we're not expecting dataplane downtime.

Additional info:

This happens because the containers mount the host /run into their own /run, and the namespaces are left behind after stopping/restarting the containers, as these bugs show [0][1]. I applied [2], and now stopping the container still causes dataplane downtime, but in addition restarting the container simply won't work (we may need an additional BZ for this).

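For reference, one way to look at this from the host (a hedged diagnostic, assuming the standard /run/netns path used by iproute2): each namespace file under /run/netns should be backed by its own mount; when only the plain files are left behind and the mounts are gone, setns() on them fails with EINVAL, which is the "Invalid argument" error shown below.

[heat-admin@overcloud-controller-2 ~]$ sudo ls -l /run/netns
[heat-admin@overcloud-controller-2 ~]$ mount | grep /run/netns
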
The namespaces can no longer be seen from outside the containers:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
[heat-admin@overcloud-controller-2 ~]$  

But from inside the container, they can:

[heat-admin@overcloud-controller-2 ~]$ sudo docker exec --user root -it 9f8a322c4a3c bash
()[root@overcloud-controller-2 /]# ip netns | grep qrouter 
RTNETLINK answers: Invalid argument 
RTNETLINK answers: Invalid argument 
qrouter-5244e91c-f533-4128-9289-f37c9656792c


However, the l3 agent fails to initialize after the restart because it can't access them:

()[root@overcloud-controller-2 /]# ip netns exec qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a 
RTNETLINK answers: Invalid argument 
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument

If I manually delete the namespace from inside the container and then restart the container, it works again:

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
RTNETLINK answers: Invalid argument  

()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c 
Cannot remove namespace file "/var/run/netns/qrouter-5244e91c-f533-4128-9289-f37c9656792c": No such file or directory 


[heat-admin@overcloud-controller-2 ~]$ sudo docker restart 9f8a322c4a3c 

And now pinging the FIP works again:

(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=38.5 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=6.58 ms
64 bytes from 10.0.0.131: icmp_seq=3 ttl=63 time=5.28 ms
64 bytes from 10.0.0.131: icmp_seq=4 ttl=63 time=2.71 ms
64 bytes from 10.0.0.131: icmp_seq=5 ttl=63 time=0.980 ms 


[0] https://bugs.launchpad.net/kolla/+bug/1616268
[1] https://bugs.launchpad.net/tripleo/+bug/1734333
[2] https://github.com/openstack/tripleo-heat-templates/commit/2e3a91f58bb48d4e7ab88258fbd704975cf1c79c

Comment 1 Assaf Muller 2017-12-18 21:21:41 UTC
Can we untangle the various issues to somehow determine if the downtime is between the container going down and:

1) It going back up again, or:
2) The L3 agent finishing a sync?

It can take the L3 agent more than an hour to finish syncing a large number of routers, so the time delta between (1) and (2) is critical, and directly impacts the severity of this bug. That being said, even if the answer is (1), this RHBZ is still a severe regression when compared to OSP 12 / non-containerized environment.

Comment 2 Daniel Alvarez Sanchez 2017-12-18 21:32:34 UTC
(In reply to Assaf Muller from comment #1)
> Can we untangle the various issues to somehow determine if the downtime is
> between the container going down and:
> 
> 1) It going back up again, or:
> 2) The L3 agent finishing a sync?
> 
> It can take the L3 agent more than an hour to finish syncing a large
> number of routers, so the time delta between (1) and (2) is critical, and
> directly impacts the severity of this bug. That being said, even if the
> answer is (1), this RHBZ is still a severe regression when compared to OSP
> 12 / non-containerized environment.

In order to answer that question we would need to fix whatever issue prevents us from somehow reusing the namespace, which doesn't seem obvious to me.
If we mount /run in the container it doesn't work, but if we don't, the namespaces are not visible from the root namespace and are still not usable when the container is restarted (at least in my tests). I wonder if this has already been tested by the Upgrades DFG upstream (since neutron is not yet containerized in OSP 12).

Anyways (just guessing) my bet is that whenever the L3 container is restarted, if the namespaces reside inside the container it would have to do a fullsync so we'll need to wait that time for the dataplane to be completely back. If we were able to let the namespaces live outside the container, then what's the point of having L3 agent containerized?

Comment 3 Assaf Muller 2017-12-18 21:50:22 UTC
> Anyways (just guessing) my bet is that whenever the L3 container is restarted, if the namespaces reside inside the container it would have to do a fullsync so we'll need to wait that time for the dataplane to be completely back.

That's not entirely correct. If the L3 agent in the container could properly access the namespaces, and nothing was torn down when the container was shut down, then we'd be back to being equivalent to a non-containerized L3 agent, which is what we're after. When you shut down a non-containerized L3 agent nothing is torn down, so connectivity is retained. When it starts back up again, it starts a sync, and no dataplane connectivity is interrupted while the sync is going on. The sync merely sets up new (and updated) routers that were created when the agent was down, as well as newly created/deleted floating IPs. In short, we need to get back to a state where shutting down the L3 agent container doesn't disrupt anything or tear anything down, and starting it back up again should allow it to resync properly.

Comment 4 Daniel Alvarez Sanchez 2017-12-18 22:15:06 UTC
(In reply to Assaf Muller from comment #3)
> > 
> Anyways (just guessing) my bet is that whenever the L3 container is
> restarted, if the namespaces reside inside the container it would have to do
> a fullsync so we'll need to wait that time for the dataplane to be
> completely back.
> 
> That's not entirely correct. If the L3 agent in the container could properly
> access the namespaces, and nothing was torn down when the container was shut
> down, then we'd be back to being equivalent to a non-containerized L3 agent,
> which is what we're after. When you shut down a non-containerized L3 agent
> nothing is torn down, so connectivity is retained. When it starts back up
> again, it starts a sync, and no dataplane connectivity is interrupted while
> the sync is going on. The sync merely sets up new (and updated) routers that
> were created when the agent was down, as well as newly created/deleted
> floating IPs. In short, we need to get back to a state where shutting down
> the L3 agent container doesn't disrupt anything or tear anything down, and
> starting it back up again should allow it to resync properly.

I think this is a general problem now in containerized setups, where restarting a container makes the namespaces created by it unusable. I agree that we need to get back to a state where shutting down containers doesn't tear the namespaces down. That would fix both of the issues I commented on in the report.

In my tests, not mounting /run inside the container, as this patch [0] does, "just" solves the case where the container is deleted (not stopped, not restarted), which makes the namespaces usable again. So if I spawn the container again (new ID), the initial sync will basically be a full sync (the namespaces are gone), so the dataplane won't be restored until that sync is finished. This is what I meant in my previous comment, but I should've elaborated further :)

What the solution should be is clear; how to implement it, I don't know. I guess we need docker experts here; maybe the OpenShift/Kubernetes folks can give a hand, as they may be facing similar issues.

[0] https://github.com/openstack/tripleo-heat-templates/commit/2e3a91f58bb48d4e7ab88258fbd704975cf1c79c

Comment 5 Daniel Alvarez Sanchez 2018-02-09 08:44:23 UTC
I tested the latest state with a TripleO OVN deployment and the result is the same.

[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker exec -u root -it 8559f5a7fa45 /bin/bash
()[root@overcloud-novacompute-0 /]# ip net e ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f ip a                                                                                               
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: tapb114a9f1-91@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether fa:16:3e:46:91:d9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapb114a9f1-91
       valid_lft forever preferred_lft forever
    inet 192.168.99.2/24 brd 192.168.99.255 scope global tapb114a9f1-91
       valid_lft forever preferred_lft forever


The namespace is only visible from inside the container.


* Restart the container:

[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker restart 8559f5a7fa45
8559f5a7fa45



[heat-admin@overcloud-novacompute-0 neutron]$ tail -f /var/log/containers/neutron/networking-ovn-metadata-agent.log 
2018-02-09 08:34:41.059 5 CRITICAL neutron [-] Unhandled error: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron Traceback (most recent call last):
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/bin/networking-ovn-metadata-agent", line 10, in <module>
2018-02-09 08:34:41.059 5 ERROR neutron     sys.exit(main())
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/cmd/eventlet/agents/metadata.py", line 17, in main
2018-02-09 08:34:41.059 5 ERROR neutron     metadata_agent.main()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata_agent.py", line 38, in main
2018-02-09 08:34:41.059 5 ERROR neutron     agt.start()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 147, in start
2018-02-09 08:34:41.059 5 ERROR neutron     self.sync()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 56, in wrapped
2018-02-09 08:34:41.059 5 ERROR neutron     return f(*args, **kwargs)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 169, in sync
2018-02-09 08:34:41.059 5 ERROR neutron     metadata_namespaces = self.ensure_all_networks_provisioned()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 350, in ensure_all_networks_provisioned
2018-02-09 08:34:41.059 5 ERROR neutron     netns = self.provision_datapath(datapath)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 294, in provision_datapath
2018-02-09 08:34:41.059 5 ERROR neutron     veth_name[0], veth_name[1], namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 182, in add_veth
2018-02-09 08:34:41.059 5 ERROR neutron     self._as_root([], 'link', tuple(args))
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 94, in _as_root
2018-02-09 08:34:41.059 5 ERROR neutron     namespace=namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 102, in _execute
2018-02-09 08:34:41.059 5 ERROR neutron     log_fail_as_error=self.log_fail_as_error)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2018-02-09 08:34:41.059 5 ERROR neutron     raise ProcessExecutionError(msg, returncode=returncode)
2018-02-09 08:34:41.059 5 ERROR neutron ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron 
2018-02-09 08:34:41.059 5 ERROR neutron 
2018-02-09 08:34:41.177 21 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2018-02-09 08:34:41.178 21 INFO eventlet.wsgi.server [-] (21) wsgi exited, is_accepting=True


[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker ps
CONTAINER ID        IMAGE                                                                                   COMMAND             CREATED             STATUS                          PORTS               NAMES
8559f5a7fa45        192.168.24.1:8787/master/centos-binary-neutron-metadata-agent-ovn:current-tripleo-rdo   "kolla_start"       17 hours ago        Restarting (1) 18 seconds ago                       ovn_metadata_agent


* It only worked after I deleted the namespace manually:
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker exec -u root -it 8559f5a7fa45 ip netns del ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f
RTNETLINK answers: Invalid argument
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker exec -u root -it 8559f5a7fa45 ip netns del ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f
Cannot remove namespace file "/var/run/netns/ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f": No such file or directory

As already discussed in this BZ, reusing namespaces is the preferred option as it won't have dataplane impact and won't require a full sync when containers are restarted.

Comment 6 Daniel Alvarez Sanchez 2018-02-09 10:22:37 UTC
Thinking again about reusing namespaces: since we spawn processes (i.e. keepalived, haproxy, etc.) to listen in those containers, when we restart the container those processes will go away, even if we managed to reuse the namespaces from a different container or from the host.

Comment 7 Brent Eagles 2018-02-09 18:14:43 UTC
Marking to post for the connectivity issue.

Comment 11 Assaf Muller 2018-03-07 14:49:26 UTC
*** Bug 1545154 has been marked as a duplicate of this bug. ***

Comment 12 Brent Eagles 2018-03-08 13:09:46 UTC
For informational purposes, the related upstream patches in progress are:

https://review.openstack.org/549851 kolla - Fix neutron dhcp agent dockerfile for non-dep/ubuntu

https://review.openstack.org/549855 tripleo-common - Add docker packages to dhcp and l3 agent

https://review.openstack.org/550224 puppet-tripleo - WIP: Adding wrapper scripts for neutron agent subprocesses


https://review.openstack.org/#/c/550823/ tripleo-heat-templates (currently titled WIP sorting out wrapper injection to container)

I'm focusing on dnsmasq and keepalived, as these appear to be the minimum needed to make HA work properly and to avoid failovers when the containers are restarted. They are also more or less known to be solvable using this approach. Follow-ups for haproxy and radvd will come later.
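
For illustration only (this is not the content of the reviews above, just the general shape of the wrapper idea): the binary the agent invokes, e.g. dnsmasq, is replaced inside the agent container by a small script that hands the real process to its own detached container, so it survives agent container restarts. The image reference, mounts and naming convention below are assumptions:

#!/bin/bash
# Hypothetical wrapper bind-mounted over /usr/sbin/dnsmasq in the dhcp agent
# container. The agent calls it as "ip netns exec qdhcp-<uuid> dnsmasq <args>",
# so first recover the namespace we were started in, then run the real dnsmasq
# in a separate container that is not tied to the agent container's lifecycle.
NETNS=$(ip netns identify $$)
IMAGE=192.168.24.1:8787/master/centos-binary-neutron-dhcp-agent:current-tripleo-rdo  # assumed
exec docker run --detach --privileged --net host \
     --name "neutron-dnsmasq-${NETNS}" \
     -v /run/netns:/run/netns:shared \
     -v /var/lib/neutron:/var/lib/neutron \
     "$IMAGE" \
     ip netns exec "$NETNS" dnsmasq -k "$@"

Note that the wrapper itself exits as soon as "docker run --detach" returns, which is exactly the process-tracking concern raised in the next comment.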

Comment 13 Daniel Alvarez Sanchez 2018-03-08 13:49:11 UTC
Just as a comment: right now neutron monitors all of the spawned processes [0], and if any of them dies it will be respawned. With the wrappers, the processes the agent spawns (the wrappers themselves) terminate right away, so I guess the tracking will fail and neutron will keep trying to bring them up every minute, therefore starting new containers?
We need to verify this and come up with a solution.


[0] https://github.com/openstack/neutron/blob/1077184d8d05113544212fd05b455bf5b6a10ad0/neutron/agent/linux/external_process.py#L50
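
For context, the check in [0] boils down to roughly this (a sketch, not the actual neutron code): the agent reads the PID file the spawned process wrote and verifies that the PID is still alive and still belongs to the resource; otherwise the process is considered dead and gets respawned.

# illustrative only; the pid file path and uuid below are just examples
uuid=5244e91c-f533-4128-9289-f37c9656792c
pid=$(cat /var/lib/neutron/external/pids/$uuid.pid 2>/dev/null)
if [ -n "$pid" ] && grep -q "$uuid" /proc/$pid/cmdline 2>/dev/null; then
    echo "active"      # the agent leaves the process alone
else
    echo "not active"  # the agent will try to respawn it
fi

Presumably whatever the wrappers launch has to keep writing a PID file that the agent can read, pointing at a process whose cmdline still contains the resource uuid; otherwise the agent will keep respawning it as described above.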

Comment 14 Brent Eagles 2018-03-21 20:37:37 UTC
It does not look like we will be able to containerize radvd at the moment due to https://bugzilla.redhat.com/show_bug.cgi?id=1559160. The wrapper technique spawns radvd fine, but the '-p' option doesn't work when it is not running as a daemon.

Comment 15 Brian Haley 2018-04-18 14:19:53 UTC
*** Bug 1568424 has been marked as a duplicate of this bug. ***

Comment 25 errata-xmlrpc 2018-06-27 13:40:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086