Bug 1837953
Summary: Replacing masters doesn't work for ovn-kubernetes 4.4

Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.4
Keywords: Reopened
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:12:13 UTC

Reporter: Eduardo Minguez <eminguez>
Assignee: Aniket Bhat <anbhat>
QA Contact: Ross Brattain <rbrattai>
CC: anusaxen, bbennett, ChetRHosey, ctrautma, dcbw, dceara, fpaoline, fsimonce, jhsiao, ngirard, ralongi, rbrattai, ricarril, rkhan, sscheink, trozet, wking, yjoseph

Clones: 1848066 1857455 1882569 1885675 (view as bug list)
Bug Depends On: 1848066
Bug Blocks: 1854072, 1857455, 1857462, 1858712, 1882569, 1885675, 1887462
Description
Eduardo Minguez
2020-05-20 09:17:28 UTC
oc logs -> https://pastebin.com/Jn5xbywP

Setting the target release to the development branch so we can identify the issue and fix it. We can work out where we backport to after the fix has been identified.

I think there was a job or test somewhere in which we killed a master in order to test that the DB could be rebuilt from scratch. I'm asking around about it.

Replacing a master with a completely new master will surely need changes to CNO and possibly to the master containers. It's not supported at this time, but it's a must-have for GA/4.6. Opened JIRA https://issues.redhat.com/browse/SDN-1024, closing this BZ.

I don't have an estimate yet because I'm not able to reproduce this on master. After killing the machine hosting the OVN master leader and then creating a new machine, the CNO eventually creates the new pods fine and the network converges. The only thing I had to do was remove the old etcd member, and that was it. I should be able to tell whether I can reproduce on 4.4 tomorrow and will get back to you.

I'm afraid I haven't been able to reproduce on 4.4 either. Moreover, chatting with other peers on the team, we don't think there should be issues recovering masters in OVN, though some raft fixes may be needed if the cluster size was reduced. Edu: can you please try to reproduce on your side again? Let me know so I can jump into tmate or something with you to debug.
I've been able to reproduce it again with 4.4.6 on bare metal:

```
$ oc version
Client Version: 4.4.6
Server Version: 4.4.6
Kubernetes Version: v1.17.1+f63db30

$ oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7fkbb   4/4     Running   0          5m6s
ovnkube-master-h566j   3/4     Error     5          3m23s
ovnkube-master-zzczg   4/4     Running   0          4m24s
ovnkube-node-5vcsc     2/2     Running   0          4m23s
ovnkube-node-8vg4z     2/2     Running   0          4m42s
ovnkube-node-d8gjq     2/2     Running   0          5m16s
ovnkube-node-vhs9c     2/2     Running   0          4m54s
ovs-node-6kgxx         1/1     Running   0          3d17h
ovs-node-l6b2v         1/1     Running   0          3d17h
ovs-node-vc6vv         1/1     Running   0          8m2s
ovs-node-vfm6l         1/1     Running   0          3d17h

$ oc logs ovnkube-master-h566j --all-containers -n openshift-ovn-kubernetes > see-attached.txt
```

Created attachment 1697352 [details]
ovnkube-master-h566j logs (ocp 4.4.6)
```
[ricky@ricky-laptop ~]$ oc -n openshift-ovn-kubernetes logs -c nbdb ovnkube-master-h566j
+ [[ -f /env/_master ]]
+ [[ -f /usr/bin/ovn-appctl ]]
+ OVNCTL_PATH=/usr/share/ovn/scripts/ovn-ctl
+ MASTER_IP=10.19.138.33
+ [[ 10.19.138.38 == \1\0\.\1\9\.\1\3\8\.\3\3 ]]
++ bracketify 10.19.138.38
++ case "$1" in
++ echo 10.19.138.38
++ bracketify 10.19.138.33
++ case "$1" in
++ echo 10.19.138.33
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.19.138.38 --db-nb-cluster-remote-addr=10.19.138.33 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-06-15T14:53:33Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: server does not belong to cluster
```

```
[ricky@ricky-laptop ~]$ oc get nodes -owide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
kni1-vmaster-1   Ready   master,virtual   3d23h   v1.17.1   10.19.138.33   <none>   Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
kni1-vmaster-2   Ready   master,virtual   3d23h   v1.17.1   10.19.138.37   <none>   Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
kni1-vmaster-3   Ready   master   6h10m   v1.17.1   10.19.138.38   <none>   Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
kni1-worker-0.cloud.lab.eng.bos.redhat.com   Ready   worker   3d23h   v1.17.1   10.19.138.9   <none>   Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
```

From a working ovn-k pod:

```
sh-4.2# ovs-appctl -t /run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
9f9f
Name: OVN_Southbound
Cluster ID: 2219 (2219724f-9183-4702-9d19-a0846a98a417)
Server ID: 9f9f (9f9f6675-a7d3-44b8-a908-82cb587f8702)
Address: ssl:10.19.138.37:9644
Status: cluster member
Role: follower
Term: 71
Leader: 9cf8
Vote: 9cf8

Election timer: 1000
Log: [37186, 37538]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->f641) ->9cf8 <-9cf8 ->752c <-752c
Servers:
    f641 (f641 at ssl:10.19.138.32:9644)
    752c (752c at ssl:10.19.138.38:9644)
    9f9f (9f9f at ssl:10.19.138.37:9644) (self)
    9cf8 (9cf8 at ssl:10.19.138.33:9644)

sh-4.2# ovs-appctl -t /run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
33b4
Name: OVN_Northbound
Cluster ID: d93c (d93cad67-7d9b-4197-83b8-59e9aaeb9871)
Server ID: 33b4 (33b4c127-1a24-4074-838c-03140538e324)
Address: ssl:10.19.138.37:9643
Status: cluster member
Adding server d5db (d5db at ssl:10.19.138.38:9643) (adding: catchup)
Role: leader
Term: 43
Leader: self
Vote: self

Election timer: 1000
Log: [35807, 38503]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->720d (->7e4f) <-720d
Servers:
    720d (720d at ssl:10.19.138.33:9643) next_index=38503 match_index=38502
    33b4 (33b4 at ssl:10.19.138.37:9643) (self) next_index=30503 match_index=38502
    7e4f (7e4f at ssl:10.19.138.32:9643) next_index=38503 match_index=0
```

For reference, .32 is the old master that got killed and .38 is the replacement. What I did was run the cluster/kick command with ovs-appctl to remove the old entries for .32. However, .38 never joins; it's stuck on 'catchup', which looks like an OVS/OVN bug. I'm trying to reach out to that team, otherwise I'll move this bug to them for further investigation.
Just in case, the ovn version in 4.4.6 is 2.13:

```
$ oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-7fkbb   4/4     Running            0          23h
ovnkube-master-kqj4h   3/4     CrashLoopBackOff   203        16h
ovnkube-master-zzczg   4/4     Running            0          23h
ovnkube-node-5vcsc     2/2     Running            0          23h
ovnkube-node-8vg4z     2/2     Running            0          23h
ovnkube-node-d8gjq     2/2     Running            0          23h
ovnkube-node-vhs9c     2/2     Running            0          23h
ovs-node-6kgxx         1/1     Running            0          4d17h
ovs-node-l6b2v         1/1     Running            0          4d17h
ovs-node-vc6vv         1/1     Running            0          23h
ovs-node-vfm6l         1/1     Running            0          4d16h

$ oc exec -it ovnkube-master-7fkbb -- /bin/bash -c "rpm -qa | grep -i ovn"
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-7fkbb -n openshift-ovn-kubernetes' to see all of the containers in this pod.
ovn2.13-2.13.0-31.el7fdp.x86_64
ovn2.13-central-2.13.0-31.el7fdp.x86_64
ovn2.13-host-2.13.0-31.el7fdp.x86_64
ovn2.13-vtep-2.13.0-31.el7fdp.x86_64
```

Created attachment 1697595 [details]
ovn-dbs from all masters
The pod that is crash looping is failing to start the NB ovsdb because of:

```
ovsdb-server: ovsdb error: server does not belong to cluster
```

This happens when an ovsdb-server is started in clustered mode, the remote_addresses field hasn't been set in the RAFT header of the DB file, and the server_id is not present in the prev_servers field of the RAFT header. The remote_addresses field is set by "ovsdb-tool join-cluster ..." here:

https://github.com/openvswitch/ovs/blob/master/utilities/ovs-lib.in#L501

"ovsdb-tool join-cluster ..." is called only if the DB file didn't already exist, i.e., the first time the DB server is brought up. When the failing server instance is restarted (that is, not the first time it is brought up), the DB file already exists, so remote_addresses doesn't get set, triggering:

```
ovsdb-server: ovsdb error: server does not belong to cluster
```

It's still not clear why the pod failed the first time, or whether the DB files already existed before the first NB ovsdb-server instance on the new master was brought up.
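The first-start vs. restart ordering described above can be summarized with a small sketch (this is not the actual ovs-ctl code; `maybe_join_cluster` and the `join_cluster_cmd` parameter are hypothetical stand-ins for illustration):

```shell
# Sketch of the two startup paths described above.
# join_cluster_cmd is a stand-in for "ovsdb-tool join-cluster ...".
maybe_join_cluster() {
    local db="$1" join_cluster_cmd="$2"
    if [ ! -f "$db" ]; then
        # First start: DB file absent, join-cluster runs and writes
        # remote_addresses into the RAFT header.
        "$join_cluster_cmd" "$db"
        echo "joined"
    else
        # Restart: DB file exists, join-cluster is skipped. If the DB's
        # server_id never made it into prev_servers either, ovsdb-server
        # fails with "server does not belong to cluster".
        echo "skipped"
    fi
}
```

The key point the sketch captures is that the join step is only ever attempted once, so a half-initialized DB file left behind by a crash is never repaired on restart.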
In case we identify this as an OVS/OVN bug:

```
# rpm -q openvswitch2.13
openvswitch2.13-2.13.0-10.el7fdp.x86_64
# rpm -q ovn2.13
ovn2.13-2.13.0-31.el7fdp.x86_64
```

I've removed the ovn*_db.db files on the new master host and restarted the pod that was crash-looping:

```
oc debug node/kni1-vmaster-3 -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-kqj4h
```

And now the ovnkube-master pod seems to be working:

```
$ oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7dm82   4/4     Running   0          29m
ovnkube-master-7fkbb   4/4     Running   0          28h
ovnkube-master-zzczg   4/4     Running   2          28h
ovnkube-node-5vcsc     2/2     Running   0          28h
ovnkube-node-8vg4z     2/2     Running   0          28h
ovnkube-node-d8gjq     2/2     Running   0          28h
ovnkube-node-vhs9c     2/2     Running   0          28h
ovs-node-6kgxx         1/1     Running   0          4d21h
ovs-node-l6b2v         1/1     Running   0          4d21h
ovs-node-vc6vv         1/1     Running   0          28h
ovs-node-vfm6l         1/1     Running   0          4d21h
```

FWIW, I still can't seem to reproduce on 4.4:

```
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-0                  Running   m4.xlarge   us-west-1   us-west-1a   27m
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running   m4.xlarge   us-west-1   us-west-1b   27m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running   m4.xlarge   us-west-1   us-west-1a   27m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running   m4.xlarge   us-west-1   us-west-1a   15m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running   m4.xlarge   us-west-1   us-west-1a   15m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running   m4.xlarge   us-west-1   us-west-1b   15m
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api delete machine ci-ln-9273fp2-d5d6b-95cnn-master-0
machine.machine.openshift.io "ci-ln-9273fp2-d5d6b-95cnn-master-0" deleted
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running   m4.xlarge   us-west-1   us-west-1b   33m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running   m4.xlarge   us-west-1   us-west-1a   33m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running   m4.xlarge   us-west-1   us-west-1a   21m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running   m4.xlarge   us-west-1   us-west-1a   21m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running   m4.xlarge   us-west-1   us-west-1b   21m
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines ci-ln-9273fp2-d5d6b-95cnn-master-1 -oyaml > /tmp/machine.yaml
[ricky@ricky-laptop ~]$ vi /tmp/machine.yaml
[ricky@ricky-laptop ~]$ oc apply -f /tmp/machine.yaml
machine.machine.openshift.io/ci-ln-9273fp2-d5d6b-95cnn-master-3 created
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running       m4.xlarge   us-west-1   us-west-1b   36m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running       m4.xlarge   us-west-1   us-west-1a   36m
ci-ln-9273fp2-d5d6b-95cnn-master-3                  Provisioned   m4.xlarge   us-west-1   us-west-1b   2m31s
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running       m4.xlarge   us-west-1   us-west-1a   25m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running       m4.xlarge   us-west-1   us-west-1a   25m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running       m4.xlarge   us-west-1   us-west-1b   25m
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running   m4.xlarge   us-west-1   us-west-1b   43m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running   m4.xlarge   us-west-1   us-west-1a   43m
ci-ln-9273fp2-d5d6b-95cnn-master-3                  Running   m4.xlarge   us-west-1   us-west-1b   8m56s
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running   m4.xlarge   us-west-1   us-west-1a   31m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running   m4.xlarge   us-west-1   us-west-1a   31m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running   m4.xlarge   us-west-1   us-west-1b   31m
[ricky@ricky-laptop ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-141-28.us-west-1.compute.internal    Ready    worker   27m     v1.17.1+3f6f40d
ip-10-0-163-77.us-west-1.compute.internal    Ready    worker   27m     v1.17.1+3f6f40d
ip-10-0-184-211.us-west-1.compute.internal   Ready    master   39m     v1.17.1+3f6f40d
ip-10-0-196-9.us-west-1.compute.internal     Ready    worker   27m     v1.17.1+3f6f40d
ip-10-0-246-69.us-west-1.compute.internal    Ready    master   5m28s   v1.17.1+3f6f40d
ip-10-0-254-36.us-west-1.compute.internal    Ready    master   39m     v1.17.1+3f6f40d
[ricky@ricky-laptop ~]$ oc -n openshift-ovn-kubernetes get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7847j   4/4     Running   0          106s
ovnkube-master-m2mxt   2/4     Running   0          20s
ovnkube-master-xwmxw   4/4     Running   0          67s
ovnkube-node-24vrk     2/2     Running   0          50s
ovnkube-node-b9lv4     2/2     Running   0          75s
ovnkube-node-gt6hg     2/2     Running   0          98s
ovnkube-node-l5vx4     2/2     Running   0          19s
ovnkube-node-slbl6     2/2     Running   0          60s
ovnkube-node-wlbzm     2/2     Running   0          32s
ovs-node-65wvp         1/1     Running   0          38m
ovs-node-cwk5t         1/1     Running   0          38m
ovs-node-h9knm         1/1     Running   0          5m37s
ovs-node-pmkbn         1/1     Running   0          28m
ovs-node-rk67j         1/1     Running   0          28m
ovs-node-s2mz7         1/1     Running   0          28m
[ricky@ricky-laptop ~]$ oc get version
error: the server doesn't have a resource type "version"
[ricky@ricky-laptop ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.8     True        False         20m     Cluster version is 4.4.8
```

Note the version I pulled is from CI (4.4.8) while the reporter uses 4.4.6; that may or may not be related. I'm also on an AWS cluster, whereas the reporter is on bare metal.

Created attachment 1697803 [details]
core files generated in master-3
We have reproduced the procedure twice. The first time there was no issue, but the second time the pod failed with a segfault:

```
2020-06-17T11:43:44Z|00032|fatal_signal|WARN|terminating with signal 15 (Terminated)
2020-06-17T11:43:44Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 11 (Segmentation fault)
2020-06-17T11:43:44Z|00054|fatal_signal(log_fsync3)|WARN|terminating with signal 11 (Segmentation fault)
```

So it seems that the pod crashes and leaves the DB in an inconsistent state. I've attached the core files found in /var/lib/systemd/coredump/ on master-3 (the 'new' one).

There are actually two issues here, I guess:

* ovn database corruption/inconsistency
* the old member is not deleted from the ovn database

The first one needs more investigation, but for the second one I think we just need to include instructions on how to perform the kick procedure, like we do for etcd (https://docs.openshift.com/container-platform/4.4/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member) but for ovn.

@dumitru is there a way to check if the DB file on-disk has not been successfully initialized for RAFT? If so, then we can delete the file when the container exits so that it's fresh and can retry when the container restarts.

(In reply to Dan Williams from comment #22)
> @dumitru is there a way to check if the DB file on-disk has not been
> successfully initialized for RAFT? If so, then we can delete the file when
> the container exits so that it's fresh and can retry when the container
> restarts.

Yes, one option would be to check the output of ovsdb-tool show-log, e.g.:

```
$ ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep "server_id\|prev_servers"
server_id: d5db
prev_servers: 33b4("ssl:10.19.138.37:9643"), 720d("ssl:10.19.138.33:9643"), 7e4f("ssl:10.19.138.32:9643")
```

The above DB is inconsistent because d5db is not part of prev_servers.
A consistent DB looks like this:

```
$ ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep "server_id\|prev_servers"
server_id: 33b4
prev_servers: 33b4("ssl:10.19.138.37:9643"), 720d("ssl:10.19.138.33:9643")
```

As discussed offline, one option is to run a check on the DBs when an nbdb/sbdb container is started and, if the server_id is not part of prev_servers (see https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23), remove the inconsistent DB file before starting ovsdb-server, because otherwise ovsdb-server will refuse to join the cluster.

I'm going to move this BZ back to OCP and clone it to openvswitch to continue investigation in the RAFT implementation, to see if this situation can be avoided in any way without external intervention.

Thanks,
Dumitru

Hey, I've been off on vacation most of last week and was back looking at this late yesterday. But if you want to take it, feel free; I don't have a PR pushed yet. Let me know.

@Dumitru, re https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23, which was implemented here: would it make sense to have a more integrated way in ovsdb-tool to detect such corruption? A new command or something like that? If so, I can take a look at the code to see if I am able to help there.

(In reply to Federico Paolinelli from comment #29)
> @Dumitru, re https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23 which
> was implemented here, would it make sense to have a more integrated way in
> ovsdb-tool to detect such corruption?
> A new command or something like that? If so, I can take a look at the code
> to see if I am able to help there.

Hi Federico,

Yes, that's a good idea. There's already "ovsdb-tool check-cluster", which does some sanity checks on DB files [1]. We could probably enhance it to check for this type of corruption too. AFAIK it's currently used only by the OVS unit tests, but I don't really see a reason why other users can't use it.
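The header check suggested above could be scripted roughly like this (a sketch only; `check_raft_header` is a hypothetical name, and as later comments in this bug point out, a header-only check is incomplete because prev_servers may not list every member before the first local snapshot):

```shell
# Decide whether a clustered DB looks consistent from the output of
# "ovsdb-tool show-log <db>", read on stdin. Prints "healthy" when the
# DB's own server_id appears in prev_servers, "inconsistent" otherwise.
check_raft_header() {
    local log sid
    log=$(cat)
    # Extract the short server ID from the "server_id:" header line.
    sid=$(printf '%s\n' "$log" | sed -n 's/.*server_id: *\([[:xdigit:]]*\).*/\1/p' | head -n1)
    # The DB is healthy only if that SID shows up as an entry in prev_servers.
    if [ -n "$sid" ] && printf '%s\n' "$log" | grep -q "prev_servers:.*${sid}("; then
        echo healthy
    else
        echo inconsistent
    fi
}
```

Fed the two show-log excerpts above, the first (server_id d5db, absent from prev_servers) comes out inconsistent and the second (server_id 33b4) healthy.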
Thanks,
Dumitru

[1] https://github.com/openvswitch/ovs/blob/6adc879b6369fd009e2bdc1db6f5d0aea6c50f89/ovsdb/ovsdb-tool.c#L1210

Verified on 4.6.0-0.nightly-2020-07-07-0837186

Current code seems to work, but it won't match when there are multiple prev_servers. See https://github.com/openshift/cluster-network-operator/pull/694

```
log-ovnkube-master-kgfxr-sbdb:+ check_db_health /etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-+ db=/etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-+ [[ ! -f /etc/ovn/ovnsb_db.db ]]
log-ovnkube-master-kgfxr-sbdb-+ echo 'Checking /etc/ovn/ovnsb_db.db health'
log-ovnkube-master-kgfxr-sbdb-Checking /etc/ovn/ovnsb_db.db health
log-ovnkube-master-kgfxr-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-kgfxr-sbdb-+ serverid=' 11e3'
log-ovnkube-master-kgfxr-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-++ grep 'prev_servers: * 11e3('
log-ovnkube-master-kgfxr-sbdb-+ match=' prev_servers: 11e3("ssl:10.0.139.146:9644")'
log-ovnkube-master-kgfxr-sbdb-+ [[ -z prev_servers: 11e3("ssl:10.0.139.146:9644") ]]
log-ovnkube-master-kgfxr-sbdb-+ echo '/etc/ovn/ovnsb_db.db is healthy'
log-ovnkube-master-kgfxr-sbdb-/etc/ovn/ovnsb_db.db is healthy
log-ovnkube-master-kgfxr-sbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-kgfxr-sbdb-++ date -Iseconds
log-ovnkube-master-kgfxr-sbdb-+ echo '2020-07-08T02:04:34+0000 - starting sbdb MASTER_IP=10.0.139.146'
--
log-ovnkube-master-kgfxr-nbdb:+ check_db_health /etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-+ db=/etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
log-ovnkube-master-kgfxr-nbdb-+ echo 'Checking /etc/ovn/ovnnb_db.db health'
log-ovnkube-master-kgfxr-nbdb-Checking /etc/ovn/ovnnb_db.db health
log-ovnkube-master-kgfxr-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-kgfxr-nbdb-+ serverid=' cce6'
log-ovnkube-master-kgfxr-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-++ grep 'prev_servers: * cce6('
log-ovnkube-master-kgfxr-nbdb-+ match=' prev_servers: cce6("ssl:10.0.139.146:9643")'
log-ovnkube-master-kgfxr-nbdb-+ [[ -z prev_servers: cce6("ssl:10.0.139.146:9643") ]]
log-ovnkube-master-kgfxr-nbdb-/etc/ovn/ovnnb_db.db is healthy
log-ovnkube-master-kgfxr-nbdb-+ echo '/etc/ovn/ovnnb_db.db is healthy'
log-ovnkube-master-kgfxr-nbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-kgfxr-nbdb-++ date -Iseconds
log-ovnkube-master-kgfxr-nbdb-+ echo '2020-07-08T02:04:27+0000 - starting nbdb MASTER_IP=10.0.139.146'
--
log-ovnkube-master-fr5z6-sbdb:+ check_db_health /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ db=/etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ [[ ! -f /etc/ovn/ovnsb_db.db ]]
log-ovnkube-master-fr5z6-sbdb-+ echo 'Checking /etc/ovn/ovnsb_db.db health'
log-ovnkube-master-fr5z6-sbdb-Checking /etc/ovn/ovnsb_db.db health
log-ovnkube-master-fr5z6-sbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-fr5z6-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ serverid=' 6579'
log-ovnkube-master-fr5z6-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-++ grep 'prev_servers: * 6579('
log-ovnkube-master-fr5z6-sbdb-++ true
log-ovnkube-master-fr5z6-sbdb-+ match=
log-ovnkube-master-fr5z6-sbdb-+ [[ -z '' ]]
log-ovnkube-master-fr5z6-sbdb-+ echo 'Current server_id 6579 not found in /etc/ovn/ovnsb_db.db, cleaning up'
log-ovnkube-master-fr5z6-sbdb-Current server_id 6579 not found in /etc/ovn/ovnsb_db.db, cleaning up
log-ovnkube-master-fr5z6-sbdb-+ rm -- /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ MASTER_IP=10.0.139.146
--
log-ovnkube-master-fr5z6-nbdb:+ check_db_health /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-+ db=/etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
log-ovnkube-master-fr5z6-nbdb-Checking /etc/ovn/ovnnb_db.db health
log-ovnkube-master-fr5z6-nbdb-+ echo 'Checking /etc/ovn/ovnnb_db.db health'
log-ovnkube-master-fr5z6-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-fr5z6-nbdb-+ serverid=' cc0b'
log-ovnkube-master-fr5z6-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-++ grep 'prev_servers: * cc0b('
log-ovnkube-master-fr5z6-nbdb-++ true
log-ovnkube-master-fr5z6-nbdb-+ match=
log-ovnkube-master-fr5z6-nbdb-+ [[ -z '' ]]
log-ovnkube-master-fr5z6-nbdb-+ echo 'Current server_id cc0b not found in /etc/ovn/ovnnb_db.db, cleaning up'
log-ovnkube-master-fr5z6-nbdb-Current server_id cc0b not found in /etc/ovn/ovnnb_db.db, cleaning up
log-ovnkube-master-fr5z6-nbdb-+ rm -- /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-+ MASTER_IP=10.0.139.146
--
log-ovnkube-master-dq5sd-sbdb:+ check_db_health /etc/ovn/ovnsb_db.db
log-ovnkube-master-dq5sd-sbdb-+ db=/etc/ovn/ovnsb_db.db
log-ovnkube-master-dq5sd-sbdb-+ [[ ! -f /etc/ovn/ovnsb_db.db ]]
log-ovnkube-master-dq5sd-sbdb-+ echo '/etc/ovn/ovnsb_db.db does not exist, skipping health check'
log-ovnkube-master-dq5sd-sbdb-/etc/ovn/ovnsb_db.db does not exist, skipping health check
log-ovnkube-master-dq5sd-sbdb-+ return
log-ovnkube-master-dq5sd-sbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-sbdb-++ date -Iseconds
log-ovnkube-master-dq5sd-sbdb-+ echo '2020-07-08T02:03:41+0000 - starting sbdb MASTER_IP=10.0.139.146'
log-ovnkube-master-dq5sd-sbdb-2020-07-08T02:03:41+0000 - starting sbdb MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-sbdb-+ [[ 10.0.216.169 == \1\0\.\0\.\1\3\9\.\1\4\6 ]]
log-ovnkube-master-dq5sd-sbdb-++ date -Iseconds
log-ovnkube-master-dq5sd-sbdb-+ echo '2020-07-08T02:03:41+0000 - joining cluster at 10.0.139.146'
log-ovnkube-master-dq5sd-sbdb-2020-07-08T02:03:41+0000 - joining cluster at 10.0.139.146
log-ovnkube-master-dq5sd-sbdb-++ bracketify 10.0.216.169
log-ovnkube-master-dq5sd-sbdb-++ case "$1" in
log-ovnkube-master-dq5sd-sbdb-++ echo 10.0.216.169
--
log-ovnkube-master-dq5sd-nbdb:+ check_db_health /etc/ovn/ovnnb_db.db
log-ovnkube-master-dq5sd-nbdb-+ db=/etc/ovn/ovnnb_db.db
log-ovnkube-master-dq5sd-nbdb-+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
log-ovnkube-master-dq5sd-nbdb-+ echo '/etc/ovn/ovnnb_db.db does not exist, skipping health check'
log-ovnkube-master-dq5sd-nbdb-/etc/ovn/ovnnb_db.db does not exist, skipping health check
log-ovnkube-master-dq5sd-nbdb-+ return
log-ovnkube-master-dq5sd-nbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-nbdb-++ date -Iseconds
log-ovnkube-master-dq5sd-nbdb-+ echo '2020-07-08T02:03:41+0000 - starting nbdb MASTER_IP=10.0.139.146'
log-ovnkube-master-dq5sd-nbdb-2020-07-08T02:03:41+0000 - starting nbdb MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-nbdb-+ [[ 10.0.216.169 == \1\0\.\0\.\1\3\9\.\1\4\6 ]]
log-ovnkube-master-dq5sd-nbdb-++ bracketify 10.0.216.169
log-ovnkube-master-dq5sd-nbdb-++ case "$1" in
log-ovnkube-master-dq5sd-nbdb-++ echo 10.0.216.169
log-ovnkube-master-dq5sd-nbdb-++ bracketify 10.0.139.146
log-ovnkube-master-dq5sd-nbdb-++ case "$1" in
log-ovnkube-master-dq5sd-nbdb-++ echo 10.0.139.146
```

*** Bug 1858834 has been marked as a duplicate of this bug. ***

*** Bug 1857455 has been marked as a duplicate of this bug. ***

*** Bug 1857462 has been marked as a duplicate of this bug. ***

Following up on this, I am not sure it's going to work. I saw the second PR merged [1] and wanted to validate it.
The thing is, on a newly created cluster (using the cluster bot), in all three master pods only the leader is listed in prev_servers:

```
[root@ci-ln-nqtg9mk-f76d1-fz9b9-master-0 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
prev_servers: f680("ssl:10.0.0.3:9643")
[root@ci-ln-nqtg9mk-f76d1-fz9b9-master-1 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
prev_servers: f680("ssl:10.0.0.3:9643")
[root@ci-ln-nqtg9mk-f76d1-fz9b9-master-2 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
prev_servers: f680("ssl:10.0.0.3:9643")
```

This is different from what was being discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23, and it's going to cause all the masters but the leader to delete their DB file (which may lead to unexpected results).

Moreover, after rebooting one non-leader master, the pod is running:

```
openshift-ovn-kubernetes   ovnkube-master-8ssx5   4/4   Running   0   3m52s
openshift-ovn-kubernetes   ovnkube-master-f27br   4/4   Running   1   119m
openshift-ovn-kubernetes   ovnkube-master-zp5dp   4/4   Running   0   119m
```

BUT from the nbdb logs I am seeing:

```
44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:53Z|00108|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:55Z|00109|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:57Z|00110|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
```

So it looks like the rebooted pod renamed itself, but the peers are referring to it with the wrong name. Not sure what the side effects of this are.

[1] https://github.com/openshift/cluster-network-operator/pull/694

Adding also that I remember trying this when opening the original PR and *I think* it was working. Also, when running the log command against the DB linked in https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c16, it returns the three servers.

Moving it back to Assigned state until dependent BZ 1848066 has a resolution.

Another data point: I just tried the same on 4.4, and the output is the one we expected:

```
[root@master-1 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
prev_servers: 149e("ssl:10.1.190.31:9643"), 7f13("ssl:10.1.190.32:9643"), fb57("ssl:10.1.190.33:9643")
```

On 4.5, though, it still contains only the leader:

```
[root@ci-ln-vz8t882-f76d1-zm24n-master-0 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
prev_servers: 4b50("ssl:10.0.0.3:9643")
```

Hi Federico,

The initial workaround was incomplete because the correct way to determine if the DB is inconsistent is to inspect the RAFT header and RAFT records (in "ovsdb-tool show-log").
If the local server ID is not present in any of them, then we can conclude that the DB is inconsistent. In https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c39, in the output from 4.5, the first local snapshot hadn't happened yet, so the prev_servers field contains only the leader, but the local server ID is probably present in the RAFT log records.

I mentioned the same point in the upstream discussion on the patch to improve "ovsdb-tool check-cluster":
https://patchwork.ozlabs.org/project/openvswitch/patch/CAAFK5zwuJ1GjfNgNObmsqHF3bQ9tTe2phVwbCHjHpkeyf0pmVA@mail.gmail.com/#2494442

Thanks,
Dumitru

Will do the change. At the same time, as we discussed offline, a fix is needed for the 'Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)' part reported earlier, because it won't work even if we identify that the DB file is corrupted. This also means that we can't do a "bash only" temporary workaround, right? At least this part needs to be addressed in OVN.

Yes, you're right, we need an additional fix in ovsdb-server to deal with duplicate servers.

I sent a new version of the patch:
https://patchwork.ozlabs.org/project/openvswitch/patch/CAAFK5zxZ2PPKbq9ythtRrGyWhXA47RyHktyGGC20UvVCLT9WRw@mail.gmail.com/

@dumitru, would it make sense to reassign this bug to you? My understanding is that we need both my change and the duplicate-servers change (assuming you are working on it)? As we need both, I think (hope) that once they are in we'll be able to use ovsdb-tool directly to do the validation in CNO.

(In reply to Federico Paolinelli from comment #43)
> I sent a new version of the patch
> https://patchwork.ozlabs.org/project/openvswitch/patch/
> CAAFK5zxZ2PPKbq9ythtRrGyWhXA47RyHktyGGC20UvVCLT9WRw@mail.gmail.com/

Thanks, I'll have another look.

> @dumitru, would it make sense to reassign this bug to you? My understanding
> is that we need both my change and the duplicate-servers change (assuming
> you are working on it)?
> As we need both, I think (hope) that once they are in we'll be able to use
> ovsdb-tool directly to do the validation in CNO.

I spent some time thinking about this and trying different things in the ovsdb-server RAFT implementation, but I came to the conclusion that it wouldn't be exactly OK for ovsdb-server to automatically remove the "stale" server if a new one joins with the same <address:port> tuple.
The main reason is that ovsdb-server can't differentiate between the following two cases:

1. The DB was removed on a node and a server is rejoining the cluster using a new SID. In this case the old server entry is indeed "stale".

2. There's a misconfiguration in the cluster (e.g., ovn-kube misconfigures two nodes to use the same IP address). In this case the old entry isn't stale, and the "syntax error" logs are valid and point to the misconfiguration.

I think the best approach right now (until we root cause and fix bug 1848066) is to get the "ovsdb-tool check-cluster" enhancement merged in OVS; afterwards we can use it to do the validation in CNO.

If the validation fails, the old DB file should be removed, and as soon as the DB restarts on the node we could have an additional check (I guess in "postStart") to see if there is a stale <address:port> tuple matching the "self" entry. If so, then CNO should run "ovs-appctl -t /var/run/ovn/ovnsb_db.ctl <DB> cluster/kick <stale-server>" to get rid of the stale entry.

What do you think?

> I think the best approach right now (until we root cause and fix bug
> 1848066) is to get the ovsdb-tool check-cluster enhancement merged in OVS
> and afterwards we can use it to do the validation in CNO.

From what we are saying / what we fixed, this could also be implemented as a short-term solution in bash, right? Checking not only if the serverid is in the prev_servers list but also in servers.

> If the validation fails, the old DB file should be removed and as soon as
> the DB restarts on the node we could have an additional check (I guess in
> "postStart") to see if there is a stale <address:port> tuple matching the
> "self" entry. If so then CNO should run "ovs-appctl -t
> /var/run/ovn/ovnsb_db.ctl <DB> cluster/kick <stale-server>" to get rid of
> the stale entry.

The stale entry would not match the "self" entry, as the serverid would be a new one.
One thing we could do is to have the removal code (which happens in the command section) communicate the id to the postStart part (where we can kick the old server off). This could be done by leaving a file with the old id in the filesystem. It would also have the nice side effect that the kick would be intentionally triggered only when a delete happens, making the logic more straightforward.

Not sure if leaving a file around is too dirty, but I don't think it brings side effects. If the file is there, we read the id, kick the server and delete the file.

> What do you think?

(In reply to Federico Paolinelli from comment #45)
> One thing we could do is to have the removal code (which happens in
> the command section) communicate the id to the postStart part (where we can
> kick the old server off). This could be done by leaving a file with the old
> id in the filesystem. It would also have the nice side effect that the kick
> would be intentionally triggered only when a delete happens, making the
> logic more straightforward.
>
> Not sure if leaving a file around is too dirty, but I don't think it brings
> side effects. If the file is there, we read the id, kick the server and
> delete the file.

I'm not knowledgeable enough with ovn-kubernetes, but FWIW this sounds OK to me.

Thanks,
Dumitru

Following up on this, it took me a bit to validate it. I have another PR (https://github.com/openshift/cluster-network-operator/pull/746) where I implemented what was discussed here. While trying to see how it behaves, I found that although deleting a non-leader's db works, if I nuke the leader's db, it chooses a different clusterid and the other members complain about it referring to the old one:

2020-08-07T15:54:01Z|00450|raft|INFO|ssl:10.0.0.5:44624: syntax "{"cluster":"a8fcb0ad-c219-4f33-b5b3-56cc62fff86e","comment":"heartbeat","from":"35fb413c-1792-47ac-8ac2-87d8b1755052","leader_commit":2901,"log":[],"prev_log_index":2901,"prev_log_term":12,"term":12,"to":"a0f7e381-a94c-4034-83ca-edf90f04e42a"}": syntax error: Parsing raft append_request RPC failed: wrong cluster a8fc (expected 9052)

(the same "wrong cluster" error repeats for every heartbeat)

Not sure this is acceptable; I am a bit afraid of this being fragile and having other side effects around the corner.

After some internal discussion with the OVN/OVN-Kubernetes team, we can break down this bug into 3 components with solutions:

1. Ensuring correct raft membership. There are two parts to this problem. A member is composed of a random UUID + IP address. Consider the following scenarios with control-plane nodes A, B, C and an initial raft membership of A, B, C.

Scenario 1: node A goes down, it loses its DB, comes back up.
Now, after rejoining the raft cluster, the new membership is oldA, B, C, newA. oldA and newA have the same address but different UUIDs. The fix for this is for *each* ovn-kubernetes master node to be responsible for kicking the old version of itself out of the cluster.

Scenario 2: node A goes down and is replaced by node D. Now the raft membership is A, B, C, D. A and D have different IP addresses and UUIDs. The fix for this is that the ovn-kubernetes master node that is also leader will be refreshed with the new IP addresses for the DB cluster (B, C, D). It will then check the current raft membership and kick the stale entry (as long as the minimum number of required entries + 1 exist).

2. Ensuring we do not corrupt the database on a stop. We currently just kill the ovsdb-servers for NB/SB. The fix here is to add a pre-stop hook on the pod so that it stops ovsdb gracefully.

3. Ensuring the DB on each node is not corrupt and ovsdb-server is not in an endless cycle failing to start. The fix here is for each ovn-kubernetes master node to determine if its local database is healthy. This may be via an ovsdb-tool CLI check, an ovsdb query, or examining the database file itself. If the database is determined to be unhealthy, ovn-kube master will back up and destroy the database and then exit its process. This ensures that ovn-kube master restarts, which handles the case where, if all 3 DBs became corrupt and restarted fresh, we would need to resync kapi events and rebuild the database using ovn-kube master.

Fix posted ovn-kubernetes upstream for 1. For 2, tracking that with this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1872750

Per chat with Aniket, assigning to him since he had a WIP PR already for 3).
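To make the breakdown above concrete, the per-node responsibilities from points 1 and 3 can be sketched as a single "ensure" pass. This is only an illustration: the helper names (find_stale_sids, db_unhealthy, ensure_pass) and the "sid address" input format are hypothetical, and the real fix is implemented in Go inside ovn-kubernetes, not as a shell script.

```shell
#!/bin/bash
# Sketch of the per-node "ensure" pass for points 1 and 3 above
# (hypothetical helper names and input format; the actual fix is the
# Go-based checker in ovn-kubernetes, not this script).

# Point 1, scenario 1: given membership as "sid address" pairs (as could
# be derived from cluster status output), any member that has our address
# but a different SID is a stale old version of ourselves.
find_stale_sids() {
  local members=$1 my_sid=$2 my_addr=$3
  awk -v sid="${my_sid}" -v addr="${my_addr}" \
      '$2 == addr && $1 != sid { print $1 }' <<<"${members}"
}

# Point 3: decide whether the local DB is healthy. Here the captured
# output of a consistency check stands in for the real CLI invocation;
# empty output is treated as "no errors found".
db_unhealthy() {
  local check_output=$1
  [ -n "${check_output}" ]
}

ensure_pass() {
  local members=$1 my_sid=$2 my_addr=$3 check_output=$4
  for stale in $(find_stale_sids "${members}" "${my_sid}" "${my_addr}"); do
    # Real hook would run, as quoted earlier in this bug:
    #   ovs-appctl -t /var/run/ovn/ovnsb_db.ctl <DB> cluster/kick <stale-server>
    echo "kick ${stale}"
  done
  if db_unhealthy "${check_output}"; then
    # Real fix: back up the DB file, remove it, and exit so the master
    # process restarts and the node rejoins the cluster from scratch.
    echo "backup-and-destroy"
  fi
}

# Example: oldA (f680) and newA (022d) both claim ssl:10.0.0.3:9643,
# and the consistency check reported an error on the local DB.
members='f680 ssl:10.0.0.3:9643
bb42 ssl:10.0.0.4:9643
022d ssl:10.0.0.3:9643'
ensure_pass "$members" 022d ssl:10.0.0.3:9643 "raft log inconsistency"
```

Running this pass periodically on every master (rather than only on membership changes) is what makes the design self-healing: a stale entry or corrupt DB is cleaned up on the next tick regardless of which event caused it.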
Covering the corresponding CNO change in https://bugzilla.redhat.com/show_bug.cgi?id=1882569

Verified that replacing masters works on GCP and AWS with 4.7.0-0.nightly-2020-10-21-001511. ovn-dbchecker is running:

2020-10-21T21:29:05.726178849Z + [[ -f /env/_master ]]
2020-10-21T21:29:05.726845277Z ++ date '+%m%d %H:%M:%S.%N'
2020-10-21T21:29:05.732488300Z + echo 'I1021 21:29:05.731565255 - ovn-dbchecker - start ovn-dbchecker'
2020-10-21T21:29:05.732530594Z I1021 21:29:05.731565255 - ovn-dbchecker - start ovn-dbchecker
2020-10-21T21:29:05.732705990Z + exec /usr/bin/ovndbchecker --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --sb-address ssl:10.0.0.4:9642,ssl:10.0.0.5:9642,ssl:10.0.0.7:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common->
2020-10-21T21:29:05.749346541Z I1021 21:29:05.749216 1 config.go:1282] Parsed config file /run/ovnkube-config/ovnkube.conf
2020-10-21T21:29:05.749409581Z I1021 21:29:05.749294 1 config.go:1283] Parsed config: {Default:{MTU:1360 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 >
2020-10-21T21:29:05.753247811Z I1021 21:29:05.753165 1 ovndbmanager.go:24] Starting DB Checker to ensure cluster membership and DB consistency
2020-10-21T21:29:05.753247811Z I1021 21:29:05.753230 1 ovndbmanager.go:45] Starting ensure routine for Raft db: /etc/ovn/ovnsb_db.db
2020-10-21T21:29:05.753296454Z I1021 21:29:05.753273 1 ovndbmanager.go:45] Starting ensure routine for Raft db: /etc/ovn/ovnnb_db.db
2020-10-21T21:30:06.033818125Z I1021 21:30:06.033739 1 ovndbmanager.go:250] check-cluster returned out: "", stderr: ""
2020-10-21T21:30:06.041169895Z I1021 21:30:06.041069 1 ovndbmanager.go:250] check-cluster returned out: "", stderr: ""
2020-10-21T21:31:06.055653415Z I1021 21:31:06.055565 1 ovndbmanager.go:250] check-cluster returned out: "", stderr: ""

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633