Bug 1589684
| Field | Value |
| --- | --- |
| Summary | OSP13-HA minor update: router created before the update doesn't have external connectivity |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Version | 13.0 (Queens) |
| Target Release | 13.0 (Queens) |
| Target Milestone | ga |
| Status | CLOSED WORKSFORME |
| Severity | urgent |
| Priority | urgent |
| Keywords | Triaged |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Eran Kuris <ekuris> |
| Assignee | Brent Eagles <beagles> |
| QA Contact | Toni Freger <tfreger> |
| CC | amuller, asimonel, bcafarel, beagles, bhaley, chjones, chrisw, ekuris, jschluet, jstransk, kgiusti, majopela, mburns, mcornea, nyechiel, sasha, sclewis, srevivo |
| Doc Type | Known Issue |
| Doc Text | update / upgrade unsupported for the Release Candidate |
| Type | Bug |
| Last Closed | 2018-06-22 17:49:34 UTC |
| Cloned to | 1594367 |
| Bug Blocks | 1594367 |
Description (Eran Kuris, 2018-06-11 07:57:30 UTC)
New objects work well: I created a new router with external access and it works. The problem is with the router that was created before the update process.

I think we may be looking at two issues:

1. Sidecars were restarted at some point during the update! The output of the "docker ps" command provided in the sosreports indicates that the l3-agent's supporting containers for keepalived, haproxy, etc. are running with the new container image (the runtimes also align) and with approximately the same runtime as the agent. This indicates that the pre-update sidecars were killed for some reason. The killing of keepalived containers could result in qrouter namespaces containing duplicate data on multiple hosts, as was reported. This is very serious and we need to find the root cause for the restart.

2. keepalived should have righted itself! Even with the keepalived restart, I would expect that once keepalived was running on all of the controllers again it would have resolved the incorrect IP/routing configurations in the qrouter namespace. This sometimes indicates that the keepalived instances cannot communicate with each other. This could be happening because of some side effect of having invalid network configuration in the router's namespace (e.g. duplicate IPs) or some other issue that is breaking the required networking, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1590651. While we should identify why this isn't working properly, it is arguably less critical than the sidecar restart.

Unfortunately the data in /var/log/messages in the sosreports starts long after the problem occurred and would have contained key data.

I may have found the root cause. Apparently we stop all containers if we predict that it is necessary to restart the docker daemon. We do this if the docker package is updated or its configuration has changed during the update. This used to be necessary to deal with a docker bug but should no longer be needed. I recommend that we either a.) remove this from the update process, or b.), if that is felt to be too risky, limit the 'docker stop'. It would be good if we can rerun this or any other minor update test job where a router is meant to persist over the update, so we can confirm that this is actually what is happening.

Option b.) from comment 6 should have read: have "docker stop" only run on deployed services and skip the containers neutron started.

After further discussion and investigation: while not stopping the neutron sidecar containers during minor updates may side-step this issue, it is unlikely to be the root cause. On closer inspection, the timing around the apparent failover is suspect:

- 2018-06-10 05:27:48.949 router initially instantiated on controller 0
- 2018-06-10 10:23:44.453 l3 agent on controller 1 is started
- 2018-06-10 11:17:10.459 router transitions to master on controller 1
- 2018-06-10 11:18:00.049 l3 agent on controller 0 is started
- 2018-06-10 11:18:11.165 router transitions to master on controller 0

The router does not transition to backup on controller 1 after it transitions to master on controller 0. It appears that some issue with the system is preventing the keepalived processes from reconciling the system changes.

On controller 1, the openvswitch-agent.log is reporting:

"2018-06-10 11:18:22.575 233154 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-062966ee-311d-47f3-8408-f01dbe5d4f21 - - - - -] Port 'qg-38e39d71-54' has lost its vlan tag '45'!"

This may indicate that something has gone wrong with the br-ext-int bridge on that node. It might be related to the recently fixed os-net-config issue rhbz 1590651 or the neutron issue rhbz 1576286. I have had no luck reproducing this bug; would it be possible for the reporter or someone else to recreate a system upgraded using a similar process, to see if this recurs and we can log in to get a better idea of what is going on?
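(Not part of the original thread.) A minimal diagnostic sketch for the symptoms described in the comment above, assuming the standard neutron L3 HA layout on the controllers: the `<router-id>` placeholder is hypothetical and must be replaced with the affected router's UUID, while the `qg-38e39d71-54` port name is taken from the quoted log line.

```bash
# Hypothetical diagnostic sketch (not from the bug report); assumes the
# default neutron state_path and L3 HA layout on an OSP 13 controller.
ROUTER_ID="<router-id>"   # placeholder: replace with the router UUID

# 1. keepalived state as recorded by neutron on each controller: exactly one
#    controller should report "master", the others "backup".
cat /var/lib/neutron/ha_confs/"${ROUTER_ID}"/state

# 2. If the qrouter namespace on more than one controller holds the gateway
#    and floating IPs, the router is split-brained, matching the
#    duplicate-data symptom described above.
ip netns exec qrouter-"${ROUTER_ID}" ip addr show

# 3. Tag on the external gateway port the OVS agent complained about; an
#    empty tag or 4095 (the agent's "dead VLAN") matches the log message.
ovs-vsctl get Port qg-38e39d71-54 tag
```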
(In reply to Brent Eagles from comment #8)
<.. snip ..>
> Where I have had no luck in reproducing this bug - would it be possible for
> the reporter or someone else to recreate a system upgraded using a similar
> process to see if this recurs and we can log in to get a better idea of what
> is going on?

I can't reproduce it at the moment because the Nova & NFV DEV teams are working on my setup. I will try to get to it when I get my setup back.

It looks like someone else saw this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1592528

It occurred to me that neutron sidecar containers were not in the beta puddle, so this would be an update from agents-without-sidecars to agents-with-sidecars. There are some implications:

- the docker stop on update may or may not have run
- processes like keepalived would definitely have been killed when paunch updated and restarted the agent containers with the new images
- the keepalived process that would have been running under the l3 agent before it was updated would not have been cleanly shut down unless the l3 agent passes SIGTERM to its children (which I do not think it does)

Do we need to support minor updates from releases that have neutron agents-without-sidecars to agents-with-sidecars?

(In reply to Brent Eagles from comment #11)
<.. snip ..>
> Do we need to support minor updates from releases that have neutron
> agents-without-sidecars to agents-with-sidecars?

In short: no. The only release outside of RH that didn't have sidecar containers for neutron agents was the beta release, and we don't "support" (in the full sense of the word) an update from beta to any other build. This massive delta between beta and later builds also means that there's no point in testing updates from beta to current puddles, because there's nothing we can learn from it, and any issues we find may not be relevant.
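(Not part of the original thread.) A small sketch of how one might confirm whether a controller is already running agents-with-sidecars rather than the older beta layout; the `neutron-keepalived-qrouter-<router-id>` container name is an assumption about the OSP 13 naming convention, and `<router-id>` is a placeholder.

```bash
# Hypothetical check (not from the bug report): is keepalived for this router
# a paunch/docker sidecar container, or a child process of the L3 agent?
ROUTER_ID="<router-id>"   # placeholder: replace with the router UUID

# Sidecar mode: a dedicated container per HA router (assumed naming scheme).
docker ps --filter "name=neutron-keepalived-qrouter-${ROUTER_ID}"

# Pre-sidecar (beta) mode: keepalived spawned directly by the L3 agent.
pgrep -af "keepalived.*${ROUTER_ID}"
```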
OK, so if there's nothing we can learn from updating the system from beta to the latest puddle, I am stopping the test that I started to run.

Today I tried to reproduce the issue from RC to the latest puddle and it did not reproduce. I don't know whether that means the issue was solved; I hope this bug is not hiding somewhere.

A fine group of folks met today on the 'OSP 13 GA Update/Upgrade/Reboot Network Issues' daily to discuss two OSP 13 GA blockers, one of which is the subject of this thread: https://bugzilla.redhat.com/show_bug.cgi?id=1589684 - "minor update: router created before the update doesn't have external connectivity".

The conclusion on the call was that the bug must have been caused by a network interruption during the update, which was resolved by the os-net-config fix [1]. No one can reproduce this issue when updating from more recent puddles (newer than beta), including Eran, who reported this bug originally. There were two remaining loose ends that we finalized on the call:

1) What if a network interruption happens during the update for some other reason, and we're back to a place where routers break again? As you can see from Damien's earlier reply on this thread, Damien and others simulated all sorts of failures and couldn't spot any issues related to HA routers. This gives me sufficient confidence that there are no hidden underlying issues.

2) We know that Director minor updates restart the Docker service and all containers when Docker is updated or needs to be reconfigured. This is done one controller at a time, and because we run keepalived (which drives each HA router) in a container, the operation incurs a needless HA router failover, causing a few seconds of downtime to your VMs' floating IPs. This is a regression.

Putting (1) and (2) together, the decision was to close this RHBZ and open bug 1594367 to more narrowly capture the issue discussed in (2). We also discussed whether this specific issue is a release blocker on its own, and the decision was to explicitly target it to z-stream. Note that when we fix it in z-stream, the update will be driven by the TripleO code shipped in that z-stream, not by the code we shipped in GA, which is what allows us to fix this issue in z-stream in the first place.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1590651
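(Not part of the original thread.) A hedged sketch of one way the needless HA router failover described in (2) could be observed while a controller's Docker service is restarted during the minor update; `<router-id>` and `<floating-ip>` are placeholders, and the neutron CLI shown is the deprecated-but-still-shipped Queens client.

```bash
# Hypothetical observation sketch (not from the bug report).
ROUTER_ID="<router-id>"   # placeholder: replace with the router UUID
FIP="<floating-ip>"       # placeholder: a floating IP behind the HA router

# Watch which L3 agent reports the router as active; the ha_state column
# should flip when the keepalived container on the active controller stops.
neutron l3-agent-list-hosting-router "${ROUTER_ID}"

# From outside the overcloud, a timestamped ping against the floating IP makes
# the few seconds of failover downtime visible (-D is iputils' timestamp flag).
ping -D "${FIP}"
```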