Bug 1527130
| Summary: | Dataplane gets broken when containers are stopped | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Daniel Alvarez Sanchez <dalvarez> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Brent Eagles <beagles> |
| Status: | CLOSED ERRATA | QA Contact: | Roee Agiman <ragiman> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | amuller, augol, bcafarel, beagles, bhaley, chrisw, dalvarez, fdinitto, jlibosva, jschluet, mburns, mcornea, nyechiel, pkomarov, ragiman, rhel-osp-director-maint, rscarazz, srevivo, tfreger, ushkalim |
| Target Milestone: | rc | Keywords: | Regression, TestBlocker, Triaged |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-common-8.6.1-10.el7ost, puppet-tripleo-8.3.2-6.el7ost, openstack-tripleo-heat-templates-8.0.2-18.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-06-27 13:40:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1544150 | | |
| Bug Blocks: | 1503794, 1545154 | | |
Description
Daniel Alvarez Sanchez, 2017-12-18 15:43:28 UTC
Comment 1 (Assaf Muller):

Can we untangle the various issues to somehow determine if the downtime is between the container going down and:

1) It going back up again, or
2) The L3 agent finishing a sync?

It can take the L3 agent more than an hour to finish syncing a large number of routers, so the time delta between (1) and (2) is critical and directly impacts the severity of this bug. That being said, even if the answer is (1), this RHBZ is still a severe regression compared to an OSP 12 / non-containerized environment.

Comment 2:

(In reply to Assaf Muller from comment #1)

> Can we untangle the various issues to somehow determine if the downtime is
> between the container going down and:
>
> 1) It going back up again, or
> 2) The L3 agent finishing a sync?
>
> It can take the L3 agent more than an hour to finish syncing a large number
> of routers, so the time delta between (1) and (2) is critical and directly
> impacts the severity of this bug. That being said, even if the answer is
> (1), this RHBZ is still a severe regression compared to an OSP 12 /
> non-containerized environment.

In order to answer that question we would need to fix whatever issue prevents us from reusing the namespaces, which doesn't seem obvious to me. If we mount /run in the container it doesn't work, but if we don't, namespaces are not visible from the root namespace and still not usable when the container is restarted (at least in my tests). I wonder if this has already been tested upstream by the Upgrades DFG (since neutron is not yet containerized in OSP 12).

Anyway (just guessing), my bet is that whenever the L3 container is restarted, if the namespaces reside inside the container it will have to do a full sync, so we'll need to wait that long for the dataplane to be completely back. And if we were able to let the namespaces live outside the container, then what's the point of having the L3 agent containerized?

Comment 3 (Assaf Muller):

> Anyway (just guessing), my bet is that whenever the L3 container is
> restarted, if the namespaces reside inside the container it will have to do
> a full sync, so we'll need to wait that long for the dataplane to be
> completely back.

That's not entirely correct. If the L3 agent in the container could properly access the namespaces, and nothing was torn down when the container was shut down, then we'd be back to being equivalent to a non-containerized L3 agent, which is what we're after. When you shut down a non-containerized L3 agent, nothing is torn down, so connectivity is retained. When it starts back up again, it starts a sync, and no dataplane connectivity is interrupted while the sync is going on. The sync merely sets up new (and updated) routers that were created while the agent was down, as well as newly created/deleted floating IPs. In short, we need to get back to a state where shutting down the L3 agent container doesn't disrupt anything or tear anything down, and starting it back up again allows it to resync properly.
Comment 4:

(In reply to Assaf Muller from comment #3)

> That's not entirely correct. If the L3 agent in the container could properly
> access the namespaces, and nothing was torn down when the container was shut
> down, then we'd be back to being equivalent to a non-containerized L3 agent,
> which is what we're after. When you shut down a non-containerized L3 agent
> nothing is torn down, so connectivity is retained. When it starts back up
> again, it starts a sync, and no dataplane connectivity is interrupted while
> the sync is going on. The sync merely sets up new (and updated) routers that
> were created when the agent was down, as well as newly created/deleted
> floating IPs. In short, we need to get back to a state where shutting down
> the L3 agent container doesn't disrupt anything or tear anything down, and
> starting it back up again should allow it to resync properly.

I think this is now a general problem in containerized setups: restarting a container makes the namespaces it created unusable. I agree that we need to get back to a state where shutting down containers doesn't tear the namespaces down; that would fix both issues I described in the report.

In my tests, not mounting /run inside the container, as this patch [0] does, "just" solves the case where deleting the container (not stopping, not restarting it) makes the namespaces usable again. So if I spawn the container again (with a new ID), the initial sync will basically be a full sync (the namespaces are gone), so the dataplane won't be restored until the sync is finished. This is what I meant in my previous comment, but I should've elaborated further :) The solution is clear; how to get there, I don't know.

I guess we need docker experts here; maybe the OpenShift/Kubernetes folks can give a hand, as they may be facing similar issues.

[0] https://github.com/openstack/tripleo-heat-templates/commit/2e3a91f58bb48d4e7ab88258fbd704975cf1c79c

I tested the latest state with a TripleO OVN deployment and the result is the same.

```
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker exec -u root -it 8559f5a7fa45 /bin/bash
()[root@overcloud-novacompute-0 /]# ip net e ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tapb114a9f1-91@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether fa:16:3e:46:91:d9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapb114a9f1-91
       valid_lft forever preferred_lft forever
    inet 192.168.99.2/24 brd 192.168.99.255 scope global tapb114a9f1-91
       valid_lft forever preferred_lft forever
```

The namespace is only visible from inside the container.
* Restart the container:

```
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker restart 8559f5a7fa45
8559f5a7fa45
[heat-admin@overcloud-novacompute-0 neutron]$ tail -f /var/log/containers/neutron/networking-ovn-metadata-agent.log
2018-02-09 08:34:41.059 5 CRITICAL neutron [-] Unhandled error: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron Traceback (most recent call last):
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/bin/networking-ovn-metadata-agent", line 10, in <module>
2018-02-09 08:34:41.059 5 ERROR neutron     sys.exit(main())
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/cmd/eventlet/agents/metadata.py", line 17, in main
2018-02-09 08:34:41.059 5 ERROR neutron     metadata_agent.main()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata_agent.py", line 38, in main
2018-02-09 08:34:41.059 5 ERROR neutron     agt.start()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 147, in start
2018-02-09 08:34:41.059 5 ERROR neutron     self.sync()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 56, in wrapped
2018-02-09 08:34:41.059 5 ERROR neutron     return f(*args, **kwargs)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 169, in sync
2018-02-09 08:34:41.059 5 ERROR neutron     metadata_namespaces = self.ensure_all_networks_provisioned()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 350, in ensure_all_networks_provisioned
2018-02-09 08:34:41.059 5 ERROR neutron     netns = self.provision_datapath(datapath)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 294, in provision_datapath
2018-02-09 08:34:41.059 5 ERROR neutron     veth_name[0], veth_name[1], namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 182, in add_veth
2018-02-09 08:34:41.059 5 ERROR neutron     self._as_root([], 'link', tuple(args))
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 94, in _as_root
2018-02-09 08:34:41.059 5 ERROR neutron     namespace=namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 102, in _execute
2018-02-09 08:34:41.059 5 ERROR neutron     log_fail_as_error=self.log_fail_as_error)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2018-02-09 08:34:41.059 5 ERROR neutron     raise ProcessExecutionError(msg, returncode=returncode)
2018-02-09 08:34:41.059 5 ERROR neutron ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron
2018-02-09 08:34:41.177 21 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2018-02-09 08:34:41.178 21 INFO eventlet.wsgi.server [-] (21) wsgi exited, is_accepting=True
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker ps
CONTAINER ID  IMAGE                                                                                  COMMAND        CREATED       STATUS                         PORTS  NAMES
8559f5a7fa45  192.168.24.1:8787/master/centos-binary-neutron-metadata-agent-ovn:current-tripleo-rdo  "kolla_start"  17 hours ago  Restarting (1) 18 seconds ago         ovn_metadata_agent
```

* It only worked after I deleted the namespace manually:

```
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker exec -u root -it 8559f5a7fa45 ip netns del ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f
RTNETLINK answers: Invalid argument
```
```
[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker exec -u root -it 8559f5a7fa45 ip netns del ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f
Cannot remove namespace file "/var/run/netns/ovnmeta-b114a9f1-92da-4e4d-bb2d-23c0c8d6e82f": No such file or directory
```

As already discussed in this BZ, reusing namespaces is the preferred option, as it has no dataplane impact and doesn't require a full sync when containers are restarted.

Thinking again about reusing namespaces: since we spawn processes (i.e. keepalived, haproxy, etc.) that listen inside those namespaces, restarting the container will take those processes down even if we managed to reuse the namespaces from a different container or from the host.

Marking to POST for the connectivity issue.

*** Bug 1545154 has been marked as a duplicate of this bug. ***

For informational purposes, the related upstream patches in progress are:

- https://review.openstack.org/549851 kolla - Fix neutron dhcp agent dockerfile for non-dep/ubuntu
- https://review.openstack.org/549855 tripleo-common - Add docker packages to dhcp and l3 agent
- https://review.openstack.org/550224 puppet-tripleo - WIP: Adding wrapper scripts for neutron agent subprocesses
- https://review.openstack.org/#/c/550823/ tripleo-heat-templates (currently titled WIP, sorting out wrapper injection to container)

I'm focusing on dnsmasq and keepalived, as these appear to be the minimum for making HA work properly and avoiding failovers when the containers are restarted. They are also more or less known to be solvable using this approach. Follow-ups for haproxy and radvd will come later.

Just as a comment: right now neutron monitors all the spawned processes [0]. If any of them dies, it will be respawned. With the wrappers, the wrapper processes terminate right away, so I guess the tracking will fail and neutron will keep trying to bring them up every minute, therefore starting new containers? We need to verify this and come up with a solution.
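On the tracking question: neutron's process monitor (see the external_process.py link at [0]) decides liveness from a pidfile. The sketch below illustrates that style of check with made-up function names, not neutron's actual API: if a wrapper script launches a container and then exits, the pid it recorded dies immediately, the check fails, and the monitor respawns in a loop.

```python
import os

def pid_from_file(pidfile):
    """Read a pid from a pidfile; None if absent or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def is_active(pidfile):
    """True if the pid recorded in pidfile belongs to a live process."""
    pid = pid_from_file(pidfile)
    if pid is None:
        return False
    try:
        os.kill(pid, 0)          # signal 0: existence check, no signal sent
    except ProcessLookupError:   # no such process: the wrapper already exited
        return False
    except PermissionError:      # process exists but is owned by another user
        return True
    return True
```

Under this assumption, a wrapper would need to leave behind a pid that stays alive for the lifetime of the wrapped process (or of its container) so the existing tracking keeps working; this is what needs to be verified.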
[0] https://github.com/openstack/neutron/blob/1077184d8d05113544212fd05b455bf5b6a10ad0/neutron/agent/linux/external_process.py#L50

It does not look like we will be able to containerize radvd at the moment due to https://bugzilla.redhat.com/show_bug.cgi?id=1559160. The wrapper technique spawns radvd fine, but the '-p' option doesn't work when not running as a daemon.

*** Bug 1568424 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086