Bug 1759974
Summary: | Updating a containerized HA setup of ovn on rhel 8 using OSP15 fails to restart two out of three ovn bundles. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Sofer Athlan-Guyot <sathlang> |
Component: | ovn2.11 | Assignee: | Numan Siddique <nusiddiq> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Jianlin Shi <jishi> |
Severity: | low | Docs Contact: | |
Priority: | unspecified | ||
Version: | RHEL 8.0 | CC: | ctrautma, dalvarez, jishi, liali, nusiddiq, ovs-team, qding |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-04-15 16:21:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1760763, 1766586, 1771854, 1775795 | ||
Bug Blocks: | 1760405 |
Description
Sofer Athlan-Guyot
2019-10-09 14:26:18 UTC
Note, before the update the master is on ctl0:

  podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
    ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
    ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
    ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2

and the ovn version is ovn2.11-2.11.0-26.el8fdp.x86_64.

Hi, from controller-0.tar.gz, in ./log/extra/containers/containers_allinfo.log we have:

  f71669eb2213 192.168.24.1:8787/rhosp15/openstack-ovn-northd:20190926.1 dumb-init --singl... 3 hours ago Up 3 hours ago ovn-dbs-bundle-podman-0 27B (virtual 646MB)

so even though pacemaker reports the service as stopped, the container itself seems to have started just fine. Maybe an issue in the agent?

  podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
    ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped controller-0
    ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped controller-1
    ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Master controller-2

log/extra/containers/containers/ovn-dbs-bundle-podman-0/stdout.log doesn't show any errors either.

Hi, with Damien and Daniel we got to the bottom of it (thanks numans as well for the help).

During the update, ovn went from version ovn2.11-2.11.0-26.el8fdp.x86_64 to version ovn2.11-2.11.1-2.el8fdp, which includes a schema change for the southbound database.

In the HA setup we start by updating the master ovndb-server (on ctl0):
- we stop the resource, update it and restart it;
- the ovndb-server becomes master on ctl1.

When the resource agent (/usr/lib/ocf/resource.d/ovn/ovndb-servers) then tries to restart the resource on ctl0, it fails because the southbound ovn service is in "running/active" state:

  ()[root@controller-0 /]# /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb
  running/active

while the northbound one is running/backup:

  ()[root@controller-0 /]# /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnnb
  running/backup

The agent has that check [1], so it returns "Not running - rc=7". After a certain time (200000 ms) pacemaker gives up and puts the resource in a failed state.

ovnsb reports a running/active state because it can't replicate the master db due to the schema change:

  2019-10-09T11:35:21.375Z|00005|replication|INFO|Schema version mismatch, OVN_Southbound not replicated
  2019-10-09T11:35:21.375Z|00006|replication|WARN|Nothing to replicate.

Since it cannot act as a backup it reports active status, breaking the ocf agent startup sequence. So this is the root cause.
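To illustrate the check described above, here is a minimal, hedged sketch (shell) of how mixed NB/SB states end up reported to pacemaker as "not running". This is not the actual ovndb-servers.ocf code from [1]; only the ovn-ctl path, the observed statuses and the rc=7 value come from this report, the rest is illustrative.

```sh
#!/bin/sh
# Illustrative sketch only; the real agent is /usr/lib/ocf/resource.d/ovn/ovndb-servers.
OVN_CTL=/usr/share/openvswitch/scripts/ovn-ctl

check_ovsdb_status() {
    sb_status=$(/bin/sh "$OVN_CTL" status_ovnsb)   # e.g. "running/active"
    nb_status=$(/bin/sh "$OVN_CTL" status_ovnnb)   # e.g. "running/backup"

    if [ "$sb_status" = "running/active" ] && [ "$nb_status" = "running/active" ]; then
        return 8   # OCF_RUNNING_MASTER: both databases are active
    elif [ "$sb_status" = "running/backup" ] && [ "$nb_status" = "running/backup" ]; then
        return 0   # OCF_SUCCESS: running as a backup/slave
    fi
    # Mixed state: here SB stays "active" because it cannot replicate the new
    # schema while NB is "backup", so pacemaker is told the resource is down.
    return 7       # OCF_NOT_RUNNING (the "rc=7" mentioned above)
}

check_ovsdb_status
echo "monitor would report rc=$?"
```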
Now for the implications of that in a rolling update of the controllers:
- we lose HA: ctl0 fails, then ctl1 fails, and then ctl2 is master with the two other nodes in a failed status.
- we have a small control plane outage:
  - after the update of ctl0 and ctl1, ctl2 becomes master (still not updated);
  - then we update ctl2, so we stop the container; as it is alone (ctl0 and ctl1 are failing) there is no other master to fall back to -> control plane outage, we can't add new flows; this lasts until we pull the new image and restart the container.
- it doesn't converge back to master/slave/slave even after that last step, and we are still in a non-HA setup with ctl2 as the only master.

From here, a "simple" pcs resource cleanup will put the ovndb servers back online.

In the OSP15 update workflow, we can't really automate the pcs resource cleanup (as it could hide other issues) *and* we have some control plane cut which is not expected to happen during an update.

Ideally we would have schema compatibility between minor version changes, i.e. the new ovndb southbound service should put itself in running/backup mode even if the master doesn't have the new schema. This would be the preferred option. A revert of the schema change would solve it as well...

All the rest would be hacks in the OSP15 update procedure, and hacks get messy quickly, so ideally this either wouldn't be required (the above feature happens fast, or equivalent) or it wouldn't last (the above feature, or equivalent, eventually happens in the not too distant future).

For this to work we would still need assistance; specifically, we would need an *easy* way to detect that the slave has a schema change relative to the master, without running the service and parsing logs.

Then we could force the updated node to become the master. This would prevent the last cut in the control plane, and while we would lose HA for some time, it would eventually converge towards a stable state without any errors in pacemaker.

Any other solution?

[1] https://github.com/openvswitch/ovs/blob/branch-2.12/ovn/utilities/ovndb-servers.ocf#L337-L341

If it's unclear I'm chem on #irc.

Some update after our debugging session today (please correct me if I'm wrong):
- The minor update doesn't really fail.
- The first controller that gets updated makes ovn-dbs connect to a master instance that is still running an old schema version** of the SB database.
- Due to this, the SB ovsdb-server cannot go into 'replicating' mode, so it goes into "active".
- The NB database schema didn't change, so it goes into 'replicating' mode and its state is "backup".
- One being 'active' and the other being 'backup' leads to a return value of "OCF_NOT_RUNNING" [0].
- At this point, this is still fine: the old master node is still master and all the clients (ovn-controllers on all overcloud nodes, neutron server, metadata agent, ...) are connected to the OVSDB servers with the old schema. So far, so good, no dataplane or controlplane downtime has been observed.
- Eventually, when all controller nodes get updated, everything will be uniform and all instances will offer the updated schema.
- However, the validations currently done in the update framework could lead to 'false negatives' of failures. I don't know if this is the case, but this is what I understood.

I want to highlight that during the process described above *there's no dataplane disruption*; only controlplane disruption can be observed, and I don't see this being any different from when there is no schema change. Please correct me on this point if I'm wrong by showing a scenario where these schema changes make things worse.

I understand that the main problem here could be the validations performed. So either we enhance/tweak/work around these, or we can try to look into it from the core OVN resource agent script and return a different value instead of "OCF_NOT_RUNNING".

[0] https://github.com/openvswitch/ovs/blob/branch-2.11/ovn/utilities/ovndb-servers.ocf#L366

** schema versions are handled automatically in OVS/OVN. Even though we want to minimize schema changes as much as we can, we cannot guarantee it 100% of the time due to pressing customer issues.
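Regarding the request above for an *easy* way to detect a schema change relative to the master without starting the service or parsing logs, here is a hedged sketch using ovsdb-tool and ovsdb-client. The master address, SB port and schema path are assumptions for illustration and may differ in an actual OSP15 deployment.

```sh
#!/bin/sh
# Compare the SB schema version shipped by the local package with the one
# currently served by the master, without starting the local ovsdb-server.
MASTER=controller-2.internalapi.localdomain           # hypothetical master address
LOCAL_SCHEMA=/usr/share/openvswitch/ovn-sb.ovsschema  # path may vary per package

local_ver=$(ovsdb-tool schema-version "$LOCAL_SCHEMA")
master_ver=$(ovsdb-client get-schema-version "tcp:$MASTER:6642" OVN_Southbound)

if [ "$local_ver" != "$master_ver" ]; then
    echo "OVN_Southbound schema mismatch: local=$local_ver master=$master_ver"
    # An update workflow could use this result to promote the updated node first.
fi
```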
(In reply to Sofer Athlan-Guyot from comment #4)
> Now for the implications of that in a rolling update of the controllers:
> - we lose HA: ctl0 fails, then ctl1 fails, and then ctl2 is master with the two other nodes in a failed status.
> - we have a small control plane outage:
>   - after the update of ctl0 and ctl1, ctl2 becomes master (still not updated);
>   - then we update ctl2, so we stop the container; as it is alone (ctl0 and ctl1 are failing) there is no other master to fall back to -> control plane outage, we can't add new flows; this lasts until we pull the new image and restart the container.
> - it doesn't converge back to master/slave/slave even after that last step, and we are still in a non-HA setup with ctl2 as the only master.

Thanks a lot Sofer for all the explanations. I've thought it through more carefully and I think there's no 'regression' here in terms of HA. Let me use an example, but please correct me if I'm wrong as I may be missing something. Imagine this starting situation: ctl1 M, ctl2 S, ctl3 S (M=master, S=slave).

1. Slave gets promoted first
  a) ctl2 gets updated; no controlplane outage, and we still have HA if ctl1 fails during the minor update.
  b) At this point only ctl2 is updated. ctl3 gets updated next. If ctl1 (master) fails, then ctl2 / ctl3 will be started as master, so we have HA and they run on newer db schemas.
  c) If nothing failed, then ctl1 gets updated and ctl2 / ctl3 will take the new master role.

Or:

2. Master gets promoted first
  a) ctl1 gets updated, so we have HA as the master will be either ctl2 or ctl3. Let's assume ctl2 becomes master.
  b) Then ctl2 gets updated; ctl3 will be promoted as master with the old schema.
  c) Then ctl3 gets updated; either ctl1 or ctl2 will be promoted as master with the new schema, as they were already good to start against the new schema offered by ctl3.

I see the same controlplane 'outage' possibilities as if we had no schema changes. Can you please explain in which case this has made things worse?

> From here, a "simple" pcs resource cleanup will put the ovndb servers back online.
>
> In the OSP15 update workflow, we can't really automate the pcs resource cleanup (as it could hide other issues) *and* we have some control plane cut which is not expected to happen during an update.
>
> Ideally we would have schema compatibility between minor version changes, i.e. the new ovndb southbound service should put itself in running/backup mode even if the master doesn't have the new schema. This would be the preferred option. A revert of the schema change would solve it as well...
>
> All the rest would be hacks in the OSP15 update procedure, and hacks get messy quickly, so ideally this either wouldn't be required (the above feature happens fast, or equivalent) or it wouldn't last (the above feature, or equivalent, eventually happens in the not too distant future).
>
> For this to work we would still need assistance; specifically, we would need an *easy* way to detect that the slave has a schema change relative to the master, without running the service and parsing logs.
>
> Then we could force the updated node to become the master. This would prevent the last cut in the control plane, and while we would lose HA for some time, it would eventually converge towards a stable state without any errors in pacemaker.
>
> Any other solution?
>
> [1] https://github.com/openvswitch/ovs/blob/branch-2.12/ovn/utilities/ovndb-servers.ocf#L337-L341
>
> If it's unclear I'm chem on #irc.

Sorry, when I said:
1. Slave gets promoted first
2. Master gets promoted first
I meant s/promoted/updated.

Hi Daniel,

> I see the same controlplane 'outage' possibilities as if we had no schema changes.
> Can you please explain in which case this has made things worse?

Well, I agree with your story, but it doesn't happen like this in reality, unfortunately. I'm taking the second story as a reference point (master updated first).

The point is that the failed ovndb pacemaker resources are not going back into the HA cluster. They just stay in a Stopped state... even after a reboot of the overcloud[1].

In my local testing I can see as well that at the end of the update, the cluster *doesn't* converge back into HA. Meaning that when ctl3 gets updated, it is the master and the two other nodes are "Stopped". As soon as we stop the resource on ctl3, we have a control plane outage that wouldn't happen if pacemaker had the two other nodes as slaves.

> I understand that the main problem here could be the validations performed. So either we enhance/tweak/work around these, or we can try to look into it from the core OVN resource agent script and return a different value instead of "OCF_NOT_RUNNING".

Or even a different behavior: maybe we could try to get master when this happens. This would "solve" the issue, because if the node with the new schema is the master, we are certain not to have the final unexpected cut in the control plane. Furthermore, maybe it solves it all, as we don't yet know what happens if an "old" ovndb server tries to connect to a "newer" one. Maybe that just works (as the code is not aware of the schema change). Worst case scenario, they go into Stopped because of the timeout. But as we go and update them, we restart them and they will happily rejoin the cluster. WDYT, maybe a bit bold :) ?

[1] See this log: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-updates-15-trunk-from-RHOS_TRUNK-15.0-RHEL-8-20190913.n.3-HA_no_ceph-ipv4/29/console, look for the "Stopped controller-0" string. This check happens *after* the nodes reboot.

(In reply to Sofer Athlan-Guyot from comment #8)
> They just stay in a Stopped state... even after a reboot of the overcloud[1]

Ok! Now I got it. I got confused because even in 'Stopped' state we saw pacemaker trying, every few seconds, to bring the process back to life. I assumed that eventually, when the process could either run as master or connect to an updated master, it would again be part of the cluster. If it's not like this, then we do lose HA. Not trying to lay the blame on pacemaker here, but is this expected? i.e. shouldn't the instances in "Stopped" state get back into the cluster once they are in a healthy state?

Also, the "workaround" of promoting the first updated node to master as we discussed seems to work here.

I briefly talked to core OVN folks and they don't think that ovsdb-server can go into 'backup' mode replicating from a different schema. While it could be possible to have some sort of compatibility with the previous schema, we may always face a situation (i.e. OSP13 2.9 -> 2.11) where a big change happens even on a minor update.
Hi,

So let me recap what we have here:
- running/backup won't happen when there is a schema change because it's not supported and will never be supported;
- in an HA configuration, when we have such an update the cluster never converges back to a stable state:
- we lose HA during the update;
- we need a manual operation on the cluster to have it working again (pcs resource cleanup; a minimal sketch follows this list).
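For reference, the manual recovery mentioned above is the standard pcs cleanup; a minimal sketch, with the resource name taken from the pacemaker output earlier in this report:

```sh
# Clear the failed actions so pacemaker retries the stopped ovn-dbs replicas.
pcs resource cleanup ovn-dbs-bundle

# Then confirm the bundle converges back to one Master and two Slaves.
pcs status | grep -A4 ovn-dbs-bundle
```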
> Also, the "workaround" of promoting the first updated node to master as we discussed seem to work here.
I would be very interested in the exact sequence of commands used there, as I've spent some time trying to do it but couldn't get anything working.
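One *possible* sequence is sketched below as an assumption only; it relies on pcs accepting --master on resource move (which varies across pcs releases) and is not necessarily the sequence referred to above.

```sh
# Hypothetical: force the master role onto the freshly updated node
# (controller-0 here), then drop the temporary constraint again.
pcs resource move ovn-dbs-bundle controller-0 --master
pcs resource clear ovn-dbs-bundle
```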
In any case, I think the agent needs some work so that it at least converges back to a stable state.
We would still lose HA during an update with a schema change, but at least we would have a working ovndb cluster after the update, and this is as true for OSP15 as it is for any ovndb HA deployment.
By the way, do you have formal documentation about ovndb HA deployments and how to handle a schema update in such a case? Our constraint in OSP15 is that we cannot force the update to start on a particular member of the cluster, so it has to work whether it starts with the current master or with any slave, and then in any order.
Thanks for your feedback.
I'm raising the priority and severity as this is going to be an issue for OSP15z1 very soon and it currently breaks our jenkins jobs for the OSP15 update.
Let's close it; the problem is not happening anymore and I don't think any more work will be done here.