Description of problem:
Doing an update of OSP15 from GA to RHOS_TRUNK-15.0-RHEL-8-20190926.n.0. Everything goes fine and everything is working, but the OVN DB servers are no longer in HA; two of them are stopped:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.4-0eb7991564) - partition with quorum
Last updated: Wed Oct 9 12:05:36 2019
Last change: Wed Oct 9 10:06:25 2019 by root via crm_resource on controller-2

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0     (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1     (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2     (ocf::heartbeat:galera):        Master controller-2
 podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0      (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1      (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2      (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.15      (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.110         (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.72        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.108       (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.3.110       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.102       (ocf::heartbeat:IPaddr2):       Started controller-1
 podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0     (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1     (ocf::heartbeat:podman):        Started controller-1
   haproxy-bundle-podman-2     (ocf::heartbeat:podman):        Started controller-2
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):       Stopped controller-0
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):       Stopped controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):       Master controller-2
 podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0    (ocf::heartbeat:podman):        Started controller-1

Failed Resource Actions:
* ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='', last-rc-change='Wed Oct 9 09:12:56 2019', queued=0ms, exec=200002ms
* ovndb_servers_start_0 on ovn-dbs-bundle-1 'unknown error' (1): call=8, status=Timed Out, exitreason='', last-rc-change='Wed Oct 9 09:42:35 2019', queued=0ms, exec=200002ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Timeline of events:

- 09:12:56: pcs failed to restart ovn-dbs-bundle-0 after taking a backup of the db:

Oct 09 09:02:16 controller-0 pacemaker-controld [68411] (process_lrm_event) error: Result of start operation for ovndb_servers on ovn-dbs-bundle-0: Timed Out | call=8 key=ovndb_servers_start_0 timeout=200000ms
09:02:16 controller-0 pacemaker-controld [68411] (process_lrm_event) notice: ovn-dbs-bundle-0-ovndb_servers_start_0:8 [ Backing up database to /etc/openvswitch/ovnsb_db.db.backup2.4.0-1795697952 [ OK ]\r\nCompacting database [ OK ]\r\nConverting database schema [ OK ]\r\n ]

- 2019-10-09T09:23:24.877Z (/var/log/containers/openvswitch/ovn-controller.log on ctl0):

2019-10-09T09:23:24.877Z|00008|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:23:24.878Z|00011|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:23:24.900Z|00017|ovsdb_idl|INFO|tcp:172.17.1.108:6642: received unexpected error response in MONITORING state: {"error":{"details":"no table named IGMP_Group","error":"syntax error"},"id":11,"result":null}
2019-10-09T09:23:25.912Z|00022|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:23:25.914Z|00025|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:28:36.080Z|00033|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:28:36.083Z|00036|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)

- 09:42:35: pcs fails to restart ovn-dbs-bundle-1:

Oct 09 09:34:19 controller-1 pacemaker-controld [803850] (process_lrm_event) notice: ovn-dbs-bundle-1-ovndb_servers_start_0:8 [ Backing up database to /etc/openvswitch/ovnsb_db.db.backup2.4.0-1795697952 [ OK ]\r\nCompacting database [ OK ]\r\nConverting database schema [ OK ]\r\n ]
Oct 09 09:45:55 controller-1 pacemaker-controld [803850] (process_lrm_event) error: Result of start operation for ovndb_servers on ovn-dbs-bundle-1: Timed Out | call=8 key=ovndb_servers_start_0 timeout=200000ms

- 09:45:57: the stop on ovn-dbs-bundle-1 itself succeeds:

Oct 09 09:45:57 controller-1 pacemaker-controld [803850] (process_lrm_event) notice: Result of stop operation for ovndb_servers on ovn-dbs-bundle-1: 0 (ok) | call=17 key=ovndb_servers_stop_0 confirmed=true cib-update=265

- 2019-10-09T09:48:12.963Z:

2019-10-09T09:48:12.963Z|00008|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:48:12.963Z|00011|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:48:12.976Z|00017|ovsdb_idl|INFO|tcp:172.17.1.108:6642: received unexpected error response in MONITORING state: {"error":{"details":"no table named IGMP_Group","error":"syntax error"},"id":11,"result":null}
2019-10-09T09:48:13.980Z|00022|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)
2019-10-09T09:48:13.981Z|00025|ovsdb_idl|WARN|OVN_Southbound database lacks IGMP_Group table (database needs upgrade?)

- 2019-10-09T10:08:07.867Z: ctl2 gets its bundle restarted fine:

2019-10-09T10:08:07.867Z|00036|fatal_signal|WARN|terminating with signal 15 (Terminated)
2019-10-09T10:12:17.337Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-controller.log

A note about the missing IGMP_Group table: those messages are only at WARN/INFO level, so I'm not sure they are enough to make the restart of the resource fail:

- Oct 09 09:51:34 controller-0 pacemaker-schedulerd[68410] (unpack_rsc_op_failure) warning: Processing failed start of ovndb_servers:0 on ovn-dbs-bundle-0: unknown error | rc=1

Version-Release number of selected component (if applicable):
ovn2.11-2.11.1-2.el8fdp.x86_64 inside the container
container updated from 20190913.1 to 20190926.1

I'm not sure how we could get the IGMP_Group table errors in the container log if we failed to restart the container, as pacemaker seems to say. And those errors are only "INFO"/"WARN" level in /var/log/containers/openvswitch/ovn-controller.log.

My /theory/ for it would be:
- stop ctl0, ctl1 becomes master for ovn;
- update ctl0, then restart it; it attaches to the db on ctl1, where the schema isn't updated yet;
- update ctl1, ctl2 becomes master;
- restart ovn on ctl1, which attaches to ctl2 for the db, which still doesn't have the new schema;
- update ctl2, everything goes fine, besides the fact that we must have a cut in the connection at that moment, as there is no master.

How reproducible:
We hit it twice.

Steps to Reproduce:
1. deploy OSP15 GA
2. update to RHOS_TRUNK-15.0-RHEL-8-20190926.n.0
3. after the update completes, check pcs

Additional info:
Doing a "pcs resource cleanup" solves the issue right away.
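For reference, this is the manual recovery we used (a minimal sketch; the resource name ovn-dbs-bundle is taken from the pcs status output above):

    # Clear the failed start actions so pacemaker retries the start now that all
    # nodes carry the updated schema, then confirm the bundle converges back to
    # one Master and two Slaves.
    pcs resource cleanup ovn-dbs-bundle
    pcs status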
Note: before the update the master is on ctl0:

 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):       Master controller-0
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):       Slave controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):       Slave controller-2

and the OVN version is ovn2.11-2.11.0-26.el8fdp.x86_64.
Hi,

well, from controller-0.tar.gz, in ./log/extra/containers/containers_allinfo.log we have:

f71669eb2213 192.168.24.1:8787/rhosp15/openstack-ovn-northd:20190926.1 dumb-init --singl... 3 hours ago Up 3 hours ago ovn-dbs-bundle-podman-0 27B (virtual 646MB)

So even though pacemaker reports the service as Stopped, it seems the container itself started just fine. Maybe an issue in the agent?

 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):       Stopped controller-0
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):       Stopped controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):       Master controller-2

log/extra/containers/containers/ovn-dbs-bundle-podman-0/stdout.log doesn't show any error either.
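A quick way to cross-check this on a live node rather than from the collected logs (a sketch; the container name comes from the listing above):

    # Pacemaker reports the ovndb_servers resource as Stopped, but the podman
    # container wrapping it can still be up; compare both views side by side.
    podman ps --filter name=ovn-dbs-bundle-podman-0
    pcs status | grep -A3 ovn-dbs-bundle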
Hi,

so with Damien and Daniel we got to the bottom of it (thanks numans as well for the help).

During the update, OVN went from version ovn2.11-2.11.0-26.el8fdp.x86_64 to version ovn2.11-2.11.1-2.el8fdp, which includes a schema change for the southbound database.

In the HA setup we start by updating the master ovndb-server (on ctl0):
- we stop the resource, update it and restart it;
- the ovndb-server on ctl1 becomes master.

When the resource agent (/usr/lib/ocf/resource.d/ovn/ovndb-servers) tries to restart the resource on ctl0, it fails because the southbound OVN service is in "running/active" state:

()[root@controller-0 /]# /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb
running/active

while the northbound one is running/backup:

()[root@controller-0 /]# /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnnb
running/backup

The agent has that check [1], so it returns "Not running - rc=7". After a certain time (200000ms) pacemaker gives up and puts the resource in a failed state.

ovnsb reports a running/active state because it cannot replicate the master db due to the schema change:

2019-10-09T11:35:21.375Z|00005|replication|INFO|Schema version mismatch, OVN_Southbound not replicated
2019-10-09T11:35:21.375Z|00006|replication|WARN|Nothing to replicate.

Since it cannot act as a backup, it reports an active status, which breaks the OCF agent's startup sequence. So this is the root cause.

Now for the implications of that in a rolling update of the controllers:
- we lose HA: ctl0 fails, then ctl1 fails, and then ctl2 is master with the two other nodes in a failed state;
- we have a small control plane outage:
  - after the update of ctl0 and ctl1, ctl2 becomes master (still not updated);
  - then we update ctl2, so we stop the container; as it's alone (ctl0 and ctl1 are failing) there is no other master to fall back to -> control plane outage, we can't add new flows; this lasts until we pull the new image and restart the container;
- it doesn't converge back to master/slave/slave even after that last step, and we are still in a non-HA setup with only one master, ctl2.

From here, a "simple" pcs resource cleanup will put the ovndb servers back online.

In the OSP15 update workflow, we can't really automate the pcs resource cleanup (as it could hide other issues) *and* we have /some/ control plane cut which is not expected to happen during an update.

Ideally we would have schema compatibility between minor version changes, i.e. the new ovndb southbound service should put itself in running/backup mode even if the master doesn't have the new schema. This would be the preferred option. A revert of the schema change would solve it as well...

Everything else would be a hack in the OSP15 update procedure, and hacks get messy quickly, so ideally this either wouldn't be required (the above feature happens fast, or equivalent) or it wouldn't last (the above feature or equivalent eventually happens in the not too distant future).

For this to work we would still need assistance; specifically, we would need an *easy* way to detect whether the slave has a schema change relative to the master, without running the service and parsing logs. Then we could force the updated node to become the master. This would prevent the last cut in the control plane, and while we would lose HA for some time, it would eventually converge towards a stable state without any errors in pacemaker.

Any other solution?
[1] https://github.com/openvswitch/ovs/blob/branch-2.12/ovn/utilities/ovndb-servers.ocf#L337-L341

If anything is unclear, I'm chem on IRC.
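For reference, here is roughly what the check at [1] boils down to (a paraphrased sketch of the behaviour described above, using the same commands, not the exact agent code):

    # Query the NB and SB ovsdb-servers the same way the ovndb-servers OCF agent does.
    sb_status=$(/bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb)   # -> "running/active"
    nb_status=$(/bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnnb)   # -> "running/backup"

    # When the two databases disagree (one active, one backup), the agent treats the
    # whole resource as not running and returns OCF_NOT_RUNNING (rc=7), so pacemaker
    # keeps retrying the start until the 200000 ms timeout expires and marks it failed.
    if [ "$sb_status" != "$nb_status" ]; then
        exit 7   # OCF_NOT_RUNNING
    fi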
Some update after our debugging session today (please correct me if I'm wrong):

- The minor update doesn't really fail.
- The first controller that gets updated makes its ovn-dbs instance connect to a master instance that is still running an old schema version** of the SB database.
- Because of this, the SB ovsdb-server cannot go into 'replicating' mode, so it goes into "active".
- The NB database didn't see any schema changes, so it goes into 'replicating' mode and its state is "backup".
- One being 'active' and the other being 'backup' leads to a return value of "OCF_NOT_RUNNING" [0].
- At this point, this is still fine: the old master node is still master and all the clients (ovn-controllers on all overcloud nodes, neutron server, metadata agent, ...) are connected to the OVSDB servers with the old schema. So far, so good, no dataplane or controlplane downtime has been observed.
- Eventually, when all controller nodes get updated, everything will be uniform and all instances will be offering the updated schema.
- However, the validations currently being done in the update framework could lead to 'false negatives', i.e. reported failures. I don't know if this is the case, but this is what I understood.

I want to highlight that during the process described above *there's no dataplane disruption*; only controlplane disruption can be observed, and I don't see this being any different from whenever there's no schema change. Please correct me on this point if I'm wrong, by showing a scenario where these schema changes make things worse.

I understand that the main problem here could be the validations performed. So either we enhance/tweak/work around these, or we try to look into it from the core OVN resource agent script and return a different value instead of "OCF_NOT_RUNNING".

[0] https://github.com/openvswitch/ovs/blob/branch-2.11/ovn/utilities/ovndb-servers.ocf#L366

** Schema versions are handled automatically in OVS/OVN. Even though we want to minimize schema changes as much as we can, we cannot guarantee it 100% of the time, due to pressing customer issues.
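A way to see this "active instead of backup" state directly on an updated node (a sketch; the unixctl socket path is an assumption based on a default ovn-ctl setup, adjust for the container layout):

    # Inside the ovn-dbs container on the updated controller: ask the SB ovsdb-server
    # for its replication state. A healthy slave reports "state: backup"; here it
    # falls back to "state: active" because it cannot replicate the older schema.
    ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status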
(In reply to Sofer Athlan-Guyot from comment #4)
> Now for the implications of that in a rolling update of the controllers:
> - we lose HA: ctl0 fails, then ctl1 fails, and then ctl2 is master with
>   the two other nodes in a failed state;
> - we have a small control plane outage:
>   - after the update of ctl0 and ctl1, ctl2 becomes master (still not updated);
>   - then we update ctl2, so we stop the container; as it's alone (ctl0 and
>     ctl1 are failing) there is no other master to fall back to
>     -> control plane outage, we can't add new flows;
>     this lasts until we pull the new image and restart the container;
> - it doesn't converge back to master/slave/slave even after that last step,
>   and we are still in a non-HA setup with only one master, ctl2.

Thanks a lot Sofer for all the explanations. I've been thinking it through more carefully and I think there's no 'regression' here in terms of HA. Let me use an example, but please correct me if I'm wrong as I may be missing something.

Imagine this starting situation: ctl1 M, ctl2 S, ctl3 S (M=master, S=slave).

1. Slave gets promoted first
  a) ctl2 gets updated; no controlplane outage, and we still have HA if ctl1 fails during the minor update.
  b) At this point only ctl2 is updated. ctl3 gets updated next. If ctl1 (master) fails, then ctl2/ctl3 will be started as master, so we have HA and they run on the newer db schema.
  c) If nothing failed, then ctl1 gets updated and ctl2/ctl3 take over the master role.

Or:

2. Master gets promoted first
  a) ctl1 gets updated, so we have HA as the master will be either ctl2 or ctl3. Let's assume ctl2 becomes master.
  b) Then ctl2 gets updated; ctl3 will be promoted to master with the old schema.
  c) Then ctl3 gets updated; either ctl1 or ctl2 will be promoted to master with the new schema, as they were already good to start against the new schema offered by ctl3.

I see the same controlplane 'outage' possibilities as if we didn't have any schema changes. Can you please explain in which case this has made things worse?

> From here, a "simple" pcs resource cleanup will put the ovndb servers back
> online.
>
> In the OSP15 update workflow, we can't really automate the pcs resource
> cleanup (as it could hide other issues) *and* we have /some/ control plane
> cut which is not expected to happen during an update.
>
> Ideally we would have schema compatibility between minor version changes,
> i.e. the new ovndb southbound service should put itself in running/backup
> mode even if the master doesn't have the new schema. This would be the
> preferred option. A revert of the schema change would solve it as well ...
>
> Everything else would be a hack in the OSP15 update procedure, and hacks get
> messy quickly, so ideally this either wouldn't be required (the above
> feature happens fast, or equivalent) or it wouldn't last (the above feature
> or equivalent eventually happens in the not too distant future).
>
> For this to work we would still need assistance; specifically, we would need
> an *easy* way to detect whether the slave has a schema change relative to
> the master, without running the service and parsing logs.
>
> Then we could force the updated node to become the master. This would
> prevent the last cut in the control plane, and while we would lose HA for
> some time, it would eventually converge towards a stable state without any
> errors in pacemaker.
>
> Any other solution?
>
> [1] https://github.com/openvswitch/ovs/blob/branch-2.12/ovn/utilities/ovndb-servers.ocf#L337-L341
>
> If anything is unclear, I'm chem on IRC.
Sorry, when I said:

1. Slave gets promoted first
2. Master gets promoted first

I meant s/promoted/updated.
Hi Daniel,

> I see the same controlplane 'outage' possibilities as if we didn't have any
> schema changes. Can you please explain in which case this has made things
> worse?

Well, I agree with your story, but unfortunately it doesn't happen like this in reality. I'm taking the second story as the reference point (master updated first).

The point is that the failed ovndb pacemaker resources do not go back into the HA cluster. They just stay in a Stopped state... even after a reboot of the overcloud[1].

In my local testing I can see as well that at the end of the update the cluster *doesn't* converge back into HA. Meaning that when ctl3 gets updated, it is the master and the two other nodes are "Stopped". As soon as we stop the resource on ctl3, we have a control plane outage that wouldn't happen if pacemaker still had the two other nodes as slaves.

> I understand that the main problem here could be the validations performed.
> So either we enhance/tweak/work around these, or we try to look into it
> from the core OVN resource agent script and return a different value
> instead of "OCF_NOT_RUNNING".

Or even a different behavior: maybe we could try to grab the master role when this happens. This would "solve" the issue, since with the node carrying the new schema as master we are certain not to have the final unexpected cut in the control plane.

Furthermore, maybe it solves everything, as we don't yet know what happens when an "old" ovndb server tries to connect to a "newer" one. Maybe that just works (as the old code is not aware of the schema change). Worst case, they go into Stopped because of the timeout, but as we go and update them we restart them and they will happily rejoin the cluster.

WDYT, maybe a bit bold :) ?

[1] See this log https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-updates-15-trunk-from-RHOS_TRUNK-15.0-RHEL-8-20190913.n.3-HA_no_ceph-ipv4/29/console, look for the "Stopped controller-0" string. This check happens *after* the nodes reboot.
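To make the "force the updated node to become master" idea a bit more concrete, a very rough illustration (not an actual agent change; the attribute name follows the usual master-<resource> convention used by crm_master, so treat the exact name as an assumption):

    # On the freshly updated node, raise the promotion score that pacemaker uses for
    # the ovndb_servers master role, so the next promotion lands on the node that
    # already carries the new schema. The agent normally manages this score itself.
    crm_attribute --node controller-0 --name master-ovndb_servers \
                  --lifetime reboot --update 100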
(In reply to Sofer Athlan-Guyot from comment #8)
>
> They just stay in a Stopped state ... even after a reboot of the overcloud[1]
>

Ok! Now I get it. I was confused because even in the 'Stopped' state we saw pacemaker trying, every few seconds, to bring the process back to life. I assumed that eventually, once the process could either run as master or connect to an updated master, it would become part of the cluster again. If that's not the case, then we do lose HA.

Not trying to lay the blame on pacemaker here, but is this expected? i.e. shouldn't the instances in "Stopped" state rejoin the cluster once they're in a healthy state?

Also, the "workaround" of promoting the first updated node to master, as we discussed, seems to work here.

I briefly talked to the core OVN folks and they don't think that ovsdb-server can go into 'backup' mode while replicating from a different schema. While it might be possible to have some sort of compatibility with the previous schema, we may always face situations (e.g. OSP13, 2.9 -> 2.11) where a big change happens even on a minor update.
Hi,

So let me recap what we have here:
- running/backup won't happen when there is a schema change, because it's not supported and will never be supported;
- in an HA configuration, when we do such an update the cluster never converges back to a stable state:
  - we lose HA during the update;
  - we need a manual operation on the cluster to get it working again (pcs resource cleanup).

> Also, the "workaround" of promoting the first updated node to master, as we
> discussed, seems to work here.

I would be very interested in the exact sequence of commands used there, as I've spent some time trying to do it but couldn't get anything working.

In any case, I think the agent needs some work so that it at least converges back to a stable state. We would still lose HA during an update that changes the schema, but at least we would have a working ovndb cluster after the update, and this is as true for OSP15 as it is for any ovndb HA deployment.

By the way, do you have formal documentation about ovndb HA deployments and how to handle a schema update in such a case? Our constraint in OSP15 is that we cannot force the update to start on any particular member of the cluster, so it has to work whether we start with the current master or with any slave, and in any order.

Thanks for your feedback. I'm raising the priority and severity as it's going to be an issue for OSP15z1 very soon and it currently breaks our jenkins jobs for OSP15 update.
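On the "easy way to detect a schema mismatch without running the service" point, something along these lines might be enough (an untested sketch; the .ovsschema path is an assumption based on a default ovn2.11 install, and the SB endpoint is the one seen earlier in this report):

    # Compare the SB schema version shipped with the freshly updated package against
    # the schema version currently served by the remote master, without starting the
    # local ovsdb-server in backup mode.
    local_ver=$(ovsdb-tool schema-version /usr/share/openvswitch/ovn-sb.ovsschema)
    remote_ver=$(ovsdb-client get-schema-version tcp:172.17.1.108:6642 OVN_Southbound)
    if [ "$local_ver" != "$remote_ver" ]; then
        echo "SB schema mismatch: local=$local_ver master=$remote_ver"
    fi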
Let's close this: the problem is no longer happening and I don't think any more work will be done here.