Bug 1837953 - Replacing masters doesn't work for ovn-kubernetes 4.4
Summary: Replacing masters doesn't work for ovn-kubernetes 4.4
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Aniket Bhat
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 1857455 1857462
Depends On: 1848066
Blocks: 1854072 1882569 1885675 1887462 1857455 1857462 1858712
Reported: 2020-05-20 09:17 UTC by Eduardo Minguez
Modified: 2020-10-22 02:19 UTC (History)
CC: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1848066 1857455 1882569 1885675
Environment:
Last Closed: 2020-06-04 17:43:53 UTC
Target Upstream Version:


Attachments (Terms of Use)
ovnkube-master-h566j logs (ocp 4.4.6) (24.93 KB, text/plain)
2020-06-15 09:12 UTC, Eduardo Minguez
ovn-dbs from all masters (3.97 MB, application/x-xz)
2020-06-16 11:04 UTC, Eduardo Minguez
core files generated in master-3 (8.31 MB, application/gzip)
2020-06-17 11:55 UTC, Eduardo Minguez


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 806 None closed Bug 1882569: Add support for OVN DB Management 2020-10-26 19:20:54 UTC
Github openshift ovn-kubernetes pull 288 None closed Bug 1837953: 9-24-2020 merge 2020-10-26 19:20:40 UTC
Github openshift ovn-kubernetes pull 306 None closed 10-8-2020 merge 2020-10-26 19:20:55 UTC

Description Eduardo Minguez 2020-05-20 09:17:28 UTC
Description of problem:
Replacing a master host in OCP 4.4 leaves the ovnkube-master pod running on the new master in 'CrashLoopBackOff'.


Version-Release number of selected component (if applicable):
4.4.3


How reproducible:
Deploy OCP 4.4 with 3 masters + n workers.
Remove a master.
Add a new master.
Observe the ovnkube-master pods.


Steps to Reproduce:
1. Install OCP 4.4 (in my case IPI bare metal) with ovn-kubernetes
2. Remove a master (in my case, due to a hardware replacement)
3. Add a new master
4. Perform the etcd fixes required (https://docs.openshift.com/container-platform/4.4/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member)

Actual results:
The ovnkube-master pod running in the new master is failing.
Logs mention "server does not belong to cluster"

Expected results:
All the pods are running successfully

Additional info:
The environment is OCP 4 IPI on bare metal, but I guess it is not environment-dependent.

Comment 1 Eduardo Minguez 2020-05-20 09:18:31 UTC
oc logs -> https://pastebin.com/Jn5xbywP

Comment 4 Ben Bennett 2020-05-20 13:29:16 UTC
Setting the target release to the development branch so we can identify the issue and fix it.  We can work out where we backport to after the fix has been identified.

Comment 5 Ricardo Carrillo Cruz 2020-06-03 17:41:51 UTC
I think there was a job or test somewhere in which we killed a master in order to test
that the DB could be rebuilt from scratch. I'm asking around about it.

Comment 6 Dan Williams 2020-06-03 20:04:07 UTC
Replacing a master with a completely new master will surely need changes to the CNO and possibly the master containers. It's not supported at this time, but it's a must-have for GA/4.6.

Comment 8 Ricardo Carrillo Cruz 2020-06-04 17:43:53 UTC
Opened JIRA https://issues.redhat.com/browse/SDN-1024, closing this BZ.

Comment 10 Ricardo Carrillo Cruz 2020-06-08 19:20:03 UTC
I don't have an estimate yet because I'm not able to repro this on master.
After killing the machine hosting the OVN master leader and then creating a new machine, the CNO eventually
creates the new pods fine and the network converges.
The only thing I had to do was remove the deleted etcd member, and that was it.
I should be able to tell whether I can repro on 4.4 tomorrow and will get back to you.

Comment 11 Ricardo Carrillo Cruz 2020-06-09 18:28:31 UTC
I'm afraid I haven't been able to reproduce on 4.4 either.
Moreover, after chatting with other peers on the team, we don't think there should be issues recovering
masters in OVN, though there may be some raft fixes needed if the cluster size was reduced.

Edu: Can you please try to reproduce on your side again? Let me know, so I can jump into a tmate session or something with you to debug.

Comment 12 Eduardo Minguez 2020-06-15 09:10:47 UTC
I've been able to reproduce it again with 4.4.6 on bare metal:

$ oc version
Client Version: 4.4.6
Server Version: 4.4.6
Kubernetes Version: v1.17.1+f63db30

$ oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7fkbb   4/4     Running   0          5m6s
ovnkube-master-h566j   3/4     Error     5          3m23s
ovnkube-master-zzczg   4/4     Running   0          4m24s
ovnkube-node-5vcsc     2/2     Running   0          4m23s
ovnkube-node-8vg4z     2/2     Running   0          4m42s
ovnkube-node-d8gjq     2/2     Running   0          5m16s
ovnkube-node-vhs9c     2/2     Running   0          4m54s
ovs-node-6kgxx         1/1     Running   0          3d17h
ovs-node-l6b2v         1/1     Running   0          3d17h
ovs-node-vc6vv         1/1     Running   0          8m2s
ovs-node-vfm6l         1/1     Running   0          3d17h

$ oc logs ovnkube-master-h566j --all-containers -n openshift-ovn-kubernetes > see-attached.txt

Comment 13 Eduardo Minguez 2020-06-15 09:12:16 UTC
Created attachment 1697352 [details]
ovnkube-master-h566j logs (ocp 4.4.6)

Comment 14 Ricardo Carrillo Cruz 2020-06-15 16:06:18 UTC
[ricky@ricky-laptop ~]$ oc -n openshift-ovn-kubernetes logs -c nbdb ovnkube-master-h566j
+ [[ -f /env/_master ]]
+ [[ -f /usr/bin/ovn-appctl ]]
+ OVNCTL_PATH=/usr/share/ovn/scripts/ovn-ctl
+ MASTER_IP=10.19.138.33
+ [[ 10.19.138.38 == \1\0\.\1\9\.\1\3\8\.\3\3 ]]
++ bracketify 10.19.138.38
++ case "$1" in
++ echo 10.19.138.38
++ bracketify 10.19.138.33
++ case "$1" in
++ echo 10.19.138.33
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.19.138.38 --db-nb-cluster-remote-addr=10.19.138.33 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-06-15T14:53:33Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: server does not belong to cluster

-----------------------------------------------------------------------

[ricky@ricky-laptop ~]$ oc get nodes -owide
NAME                                         STATUS   ROLES            AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
kni1-vmaster-1                               Ready    master,virtual   3d23h   v1.17.1   10.19.138.33   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
kni1-vmaster-2                               Ready    master,virtual   3d23h   v1.17.1   10.19.138.37   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
kni1-vmaster-3                               Ready    master           6h10m   v1.17.1   10.19.138.38   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
kni1-worker-0.cloud.lab.eng.bos.redhat.com   Ready    worker           3d23h   v1.17.1   10.19.138.9    <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8


---------- from a working ovn-k pod -----------------------------------

sh-4.2# ovs-appctl -t /run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
9f9f
Name: OVN_Southbound
Cluster ID: 2219 (2219724f-9183-4702-9d19-a0846a98a417)
Server ID: 9f9f (9f9f6675-a7d3-44b8-a908-82cb587f8702)
Address: ssl:10.19.138.37:9644
Status: cluster member
Role: follower
Term: 71
Leader: 9cf8
Vote: 9cf8

Election timer: 1000
Log: [37186, 37538]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->f641) ->9cf8 <-9cf8 ->752c <-752c
Servers:
    f641 (f641 at ssl:10.19.138.32:9644)
    752c (752c at ssl:10.19.138.38:9644)
    9f9f (9f9f at ssl:10.19.138.37:9644) (self)
    9cf8 (9cf8 at ssl:10.19.138.33:9644)
sh-4.2# ovs-appctl -t /run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
33b4
Name: OVN_Northbound
Cluster ID: d93c (d93cad67-7d9b-4197-83b8-59e9aaeb9871)
Server ID: 33b4 (33b4c127-1a24-4074-838c-03140538e324)
Address: ssl:10.19.138.37:9643
Status: cluster member
Adding server d5db (d5db at ssl:10.19.138.38:9643) (adding: catchup)
Role: leader
Term: 43
Leader: self
Vote: self

Election timer: 1000
Log: [35807, 38503]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->720d (->7e4f) <-720d
Servers:
    720d (720d at ssl:10.19.138.33:9643) next_index=38503 match_index=38502
    33b4 (33b4 at ssl:10.19.138.37:9643) (self) next_index=30503 match_index=38502
    7e4f (7e4f at ssl:10.19.138.32:9643) next_index=38503 match_index=0

-----------------------------------------------------------------------------------------

For reference, .32 is the old master that got killed and .38 is the replacement.
What I did was run the cluster/kick command with ovs-appctl to remove the old entries for .32.

However, .38 never joins; it's stuck in 'catchup', which looks like an OVS/OVN bug.
I'm trying to reach out to that team; otherwise, I'll move this bug to them for further investigation.
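
The kick step above can be sketched as follows; the member IDs (f641 for SB, 7e4f for NB, both on .32) and control-socket paths are taken from the cluster/status output earlier in this comment, and the helper only prints the commands so the sketch is side-effect free. On a real cluster you would run the printed commands inside an ovnkube-master pod.

```shell
# Illustrative sketch of removing stale RAFT members with ovs-appctl
# cluster/kick. The helper echoes each command instead of executing it,
# since this is only a demonstration of the invocation shape.
kick() {
    echo "ovs-appctl -t $1 cluster/kick $2 $3"
}

kick /run/ovn/ovnsb_db.ctl OVN_Southbound f641   # stale SB member (.32)
kick /run/ovn/ovnnb_db.ctl OVN_Northbound 7e4f   # stale NB member (.32)
```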

Comment 15 Eduardo Minguez 2020-06-16 08:20:29 UTC
Just in case, the ovn version in 4.4.6 is 2.13:

$ oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-7fkbb   4/4     Running            0          23h
ovnkube-master-kqj4h   3/4     CrashLoopBackOff   203        16h
ovnkube-master-zzczg   4/4     Running            0          23h
ovnkube-node-5vcsc     2/2     Running            0          23h
ovnkube-node-8vg4z     2/2     Running            0          23h
ovnkube-node-d8gjq     2/2     Running            0          23h
ovnkube-node-vhs9c     2/2     Running            0          23h
ovs-node-6kgxx         1/1     Running            0          4d17h
ovs-node-l6b2v         1/1     Running            0          4d17h
ovs-node-vc6vv         1/1     Running            0          23h
ovs-node-vfm6l         1/1     Running            0          4d16h
$ oc exec -it ovnkube-master-7fkbb -- /bin/bash -c "rpm -qa | grep -i ovn"
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-7fkbb -n openshift-ovn-kubernetes' to see all of the containers in this pod.
ovn2.13-2.13.0-31.el7fdp.x86_64
ovn2.13-central-2.13.0-31.el7fdp.x86_64
ovn2.13-host-2.13.0-31.el7fdp.x86_64
ovn2.13-vtep-2.13.0-31.el7fdp.x86_64

Comment 16 Eduardo Minguez 2020-06-16 11:04:30 UTC
Created attachment 1697595 [details]
ovn-dbs from all masters

Comment 17 Dumitru Ceara 2020-06-16 12:31:57 UTC
The pod that is crash looping is failing to start the NB ovsdb because:
ovsdb-server: ovsdb error: server does not belong to cluster 

This happens when an ovsdb-server is started in clustered mode if the remote_addresses field hasn't been set in the RAFT header of the DB file and the server_id is not present in the prev_servers field of the RAFT header. The remote_addresses field is set by "ovsdb-tool join-cluster ..." here:

https://github.com/openvswitch/ovs/blob/master/utilities/ovs-lib.in#L501

"ovsdb-tool join-cluster ..." is called only if the DB file didn't already exist, i.e., the first time the DB server is brought up. At least when the failing server instance is restarted (that is, not the first time the server instance is brought up) the DB already exists so remote_addresses doesn't get set, triggering:

ovsdb-server: ovsdb error: server does not belong to cluster

It's still not clear why the pod failed the first time, or whether the DB files already existed before the first NB ovsdb-server instance on the new master was brought up.

In case we identify this as an OVS/OVN bug:

# rpm -q openvswitch2.13
openvswitch2.13-2.13.0-10.el7fdp.x86_64
# rpm -q ovn2.13
ovn2.13-2.13.0-31.el7fdp.x86_64

Comment 18 Eduardo Minguez 2020-06-16 13:13:06 UTC
I've removed the ovn*_db.db files on the new master host and restarted the pod that was crash looping:

```
oc debug node/kni1-vmaster-3 -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-kqj4h
```

And now it seems the ovnkube-master pod is working:

```
$ oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7dm82   4/4     Running   0          29m
ovnkube-master-7fkbb   4/4     Running   0          28h
ovnkube-master-zzczg   4/4     Running   2          28h
ovnkube-node-5vcsc     2/2     Running   0          28h
ovnkube-node-8vg4z     2/2     Running   0          28h
ovnkube-node-d8gjq     2/2     Running   0          28h
ovnkube-node-vhs9c     2/2     Running   0          28h
ovs-node-6kgxx         1/1     Running   0          4d21h
ovs-node-l6b2v         1/1     Running   0          4d21h
ovs-node-vc6vv         1/1     Running   0          28h
ovs-node-vfm6l         1/1     Running   0          4d21h
```

Comment 19 Ricardo Carrillo Cruz 2020-06-16 14:40:29 UTC
FWIW, I still can't seem to reproduce on 4.4:

<snip>

[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-0                  Running   m4.xlarge   us-west-1   us-west-1a   27m
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running   m4.xlarge   us-west-1   us-west-1b   27m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running   m4.xlarge   us-west-1   us-west-1a   27m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running   m4.xlarge   us-west-1   us-west-1a   15m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running   m4.xlarge   us-west-1   us-west-1a   15m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running   m4.xlarge   us-west-1   us-west-1b   15m
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api delete machine ci-ln-9273fp2-d5d6b-95cnn-master-0
machine.machine.openshift.io "ci-ln-9273fp2-d5d6b-95cnn-master-0" deleted
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running   m4.xlarge   us-west-1   us-west-1b   33m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running   m4.xlarge   us-west-1   us-west-1a   33m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running   m4.xlarge   us-west-1   us-west-1a   21m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running   m4.xlarge   us-west-1   us-west-1a   21m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running   m4.xlarge   us-west-1   us-west-1b   21m
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines ci-ln-9273fp2-d5d6b-95cnn-master-1 -oyaml > /tmp/machine.yaml
[ricky@ricky-laptop ~]$ vi /tmp/machine.yaml 
[ricky@ricky-laptop ~]$ oc apply -f /tmp/machine.yaml 
machine.machine.openshift.io/ci-ln-9273fp2-d5d6b-95cnn-master-3 created
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running       m4.xlarge   us-west-1   us-west-1b   36m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running       m4.xlarge   us-west-1   us-west-1a   36m
ci-ln-9273fp2-d5d6b-95cnn-master-3                  Provisioned   m4.xlarge   us-west-1   us-west-1b   2m31s
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running       m4.xlarge   us-west-1   us-west-1a   25m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running       m4.xlarge   us-west-1   us-west-1a   25m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running       m4.xlarge   us-west-1   us-west-1b   25m
[ricky@ricky-laptop ~]$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE        REGION      ZONE         AGE
ci-ln-9273fp2-d5d6b-95cnn-master-1                  Running   m4.xlarge   us-west-1   us-west-1b   43m
ci-ln-9273fp2-d5d6b-95cnn-master-2                  Running   m4.xlarge   us-west-1   us-west-1a   43m
ci-ln-9273fp2-d5d6b-95cnn-master-3                  Running   m4.xlarge   us-west-1   us-west-1b   8m56s
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-gt9nb   Running   m4.xlarge   us-west-1   us-west-1a   31m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1a-v54ns   Running   m4.xlarge   us-west-1   us-west-1a   31m
ci-ln-9273fp2-d5d6b-95cnn-worker-us-west-1b-hx9nv   Running   m4.xlarge   us-west-1   us-west-1b   31m
[ricky@ricky-laptop ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-141-28.us-west-1.compute.internal    Ready    worker   27m     v1.17.1+3f6f40d
ip-10-0-163-77.us-west-1.compute.internal    Ready    worker   27m     v1.17.1+3f6f40d
ip-10-0-184-211.us-west-1.compute.internal   Ready    master   39m     v1.17.1+3f6f40d
ip-10-0-196-9.us-west-1.compute.internal     Ready    worker   27m     v1.17.1+3f6f40d
ip-10-0-246-69.us-west-1.compute.internal    Ready    master   5m28s   v1.17.1+3f6f40d
ip-10-0-254-36.us-west-1.compute.internal    Ready    master   39m     v1.17.1+3f6f40d
[ricky@ricky-laptop ~]$ oc -n openshift-ovn-kubernetes get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7847j   4/4     Running   0          106s
ovnkube-master-m2mxt   2/4     Running   0          20s
ovnkube-master-xwmxw   4/4     Running   0          67s
ovnkube-node-24vrk     2/2     Running   0          50s
ovnkube-node-b9lv4     2/2     Running   0          75s
ovnkube-node-gt6hg     2/2     Running   0          98s
ovnkube-node-l5vx4     2/2     Running   0          19s
ovnkube-node-slbl6     2/2     Running   0          60s
ovnkube-node-wlbzm     2/2     Running   0          32s
ovs-node-65wvp         1/1     Running   0          38m
ovs-node-cwk5t         1/1     Running   0          38m
ovs-node-h9knm         1/1     Running   0          5m37s
ovs-node-pmkbn         1/1     Running   0          28m
ovs-node-rk67j         1/1     Running   0          28m
ovs-node-s2mz7         1/1     Running   0          28m
[ricky@ricky-laptop ~]$ oc get version
error: the server doesn't have a resource type "version"
[ricky@ricky-laptop ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.8     True        False         20m     Cluster version is 4.4.8

</snip>

Note the version I pulled is from CI, 4.4.8, while the reporter uses 4.4.6. It may or may not be related; also, I used an AWS cluster whereas the reporter is on bare metal.

Comment 20 Eduardo Minguez 2020-06-17 11:55:39 UTC
Created attachment 1697803 [details]
core files generated in master-3

Comment 21 Eduardo Minguez 2020-06-17 12:01:30 UTC
We have reproduced the procedure twice.

The first time there were no issues, but the second time the pod failed with a segfault:

2020-06-17T11:43:44Z|00032|fatal_signal|WARN|terminating with signal 15 (Terminated)
2020-06-17T11:43:44Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 11 (Segmentation fault)
2020-06-17T11:43:44Z|00054|fatal_signal(log_fsync3)|WARN|terminating with signal 11 (Segmentation fault)

So, it seems that the pod crashes and leaves the DB in an inconsistent state.

I've attached the core files found in /var/lib/systemd/coredump/ on master-3 (the 'new' one).


There are actually two issues here, I guess:

* ovn database corruption/inconsistency
* the old member is not deleted from the ovn database

For the first one, I guess more investigation is needed, but for the second one, I think we just need to include instructions on how to perform the kick procedure for OVN, like we do for etcd: https://docs.openshift.com/container-platform/4.4/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

Comment 22 Dan Williams 2020-06-17 13:54:44 UTC
@dumitru is there a way to check if the DB file on-disk has not successfully been initialized for RAFT? If so, then we can delete the file when the container exits so that it's fresh and can retry when the container restarts.

Comment 23 Dumitru Ceara 2020-06-17 14:11:13 UTC
(In reply to Dan Williams from comment #22)
> @dumitru is there a way to check if the DB file on-disk has not successfully
> been initialized for RAFT? If so, then we can delete the file when the
> container exits so that it's fresh and can retry when the container restarts.

Yes, one option would be to check the output of ovsdb-tool show-log, e.g.:

$ ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep "server_id\|prev_servers"
 server_id: d5db
 prev_servers: 33b4("ssl:10.19.138.37:9643"), 720d("ssl:10.19.138.33:9643"), 7e4f("ssl:10.19.138.32:9643")

The above DB is inconsistent because d5db is not part of prev_servers.

A consistent DB looks like this:
$ ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep "server_id\|prev_servers"
 server_id: 33b4
 prev_servers: 33b4("ssl:10.19.138.37:9643"), 720d("ssl:10.19.138.33:9643")

Comment 24 Dumitru Ceara 2020-06-17 15:22:14 UTC
As discussed offline, one option would be to run a check on the DBs when an nbdb/sbdb container is started and, if the server_id is not part of prev_servers (see https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23), remove the inconsistent DB file before starting ovsdb-server, because otherwise ovsdb-server will refuse to join the cluster.
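
A minimal sketch of such a check, parsing the `ovsdb-tool show-log` headers exactly as shown in comment 23. The function name and the inline sample headers are illustrative, not the actual CNO code; the function takes the show-log output as an argument so the logic can be demonstrated without a real DB file.

```shell
# Returns 0 (consistent) when the server_id from the RAFT header appears in
# prev_servers, non-zero otherwise. Argument: output of `ovsdb-tool show-log <db>`.
db_is_consistent() {
    local log=$1 server_id
    # Extract the hex server_id from a line like " server_id: d5db"
    server_id=$(sed -n 's/^ *server_id: *\([[:xdigit:]]*\).*/\1/p' <<<"$log")
    # Consistent only if that id shows up anywhere in the prev_servers list
    grep -q "prev_servers:.*${server_id}(" <<<"$log"
}

# Sample headers modeled on comment 23: d5db is missing from prev_servers
# (inconsistent), while 33b4 is present (consistent).
bad=' server_id: d5db
 prev_servers: 33b4("ssl:10.19.138.37:9643"), 720d("ssl:10.19.138.33:9643")'
good=' server_id: 33b4
 prev_servers: 33b4("ssl:10.19.138.37:9643"), 720d("ssl:10.19.138.33:9643")'

db_is_consistent "$bad"  || echo "inconsistent: remove DB before starting ovsdb-server"
db_is_consistent "$good" && echo "consistent: safe to start"
```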

I'm going to move this BZ back to OCP and clone it to openvswitch to continue investigation in the RAFT implementation to see if this situation can be avoided in any way without external intervention.

Thanks,
Dumitru

Comment 26 Ricardo Carrillo Cruz 2020-06-30 14:53:27 UTC
Hey

I've been off on vacation most of last week, and I got back to looking at this late yesterday.
But if you want to take it, feel free; I don't have a PR pushed yet.

Let me know.

Comment 29 Federico Paolinelli 2020-07-06 09:00:55 UTC
@Dumitru, re https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23, which was implemented here, would it make sense to have a more integrated way in ovsdb-tool to detect such corruption?
A new command or something like that? If so, I can take a look at the code to see if I'm able to help there.

Comment 30 Dumitru Ceara 2020-07-06 09:06:44 UTC
(In reply to Federico Paolinelli from comment #29)
> @Dumitru, re https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23 which
> was implemented here, would it make sense to have a more integrated way in
> ovsdb-tool to detect such corruption?
> A new command or something like that? In case, I can give a look at the code
> to see if I am able to help there.

Hi Federico,

Yes, that's a good idea. There's already "ovsdb-tool check-cluster", which does some sanity checks on DB files [1]. We could probably enhance it to check for this type of corruption too. AFAIK, it's currently used only by the OVS unit tests, but I don't really see a reason why other users can't use it.

Thanks,
Dumitru

[1] https://github.com/openvswitch/ovs/blob/6adc879b6369fd009e2bdc1db6f5d0aea6c50f89/ovsdb/ovsdb-tool.c#L1210

Comment 31 Ross Brattain 2020-07-08 03:40:00 UTC
Verified on 4.6.0-0.nightly-2020-07-07-0837186

Current code seems to work, but it won't match when there are multiple prev_servers.  See https://github.com/openshift/cluster-network-operator/pull/694

log-ovnkube-master-kgfxr-sbdb:+ check_db_health /etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-+ db=/etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-+ [[ ! -f /etc/ovn/ovnsb_db.db ]]
log-ovnkube-master-kgfxr-sbdb-+ echo 'Checking /etc/ovn/ovnsb_db.db health'
log-ovnkube-master-kgfxr-sbdb-Checking /etc/ovn/ovnsb_db.db health
log-ovnkube-master-kgfxr-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-kgfxr-sbdb-+ serverid=' 11e3'
log-ovnkube-master-kgfxr-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-kgfxr-sbdb-++ grep 'prev_servers: * 11e3('
log-ovnkube-master-kgfxr-sbdb-+ match=' prev_servers: 11e3("ssl:10.0.139.146:9644")'
log-ovnkube-master-kgfxr-sbdb-+ [[ -z  prev_servers: 11e3("ssl:10.0.139.146:9644") ]]
log-ovnkube-master-kgfxr-sbdb-+ echo '/etc/ovn/ovnsb_db.db is healthy'
log-ovnkube-master-kgfxr-sbdb-/etc/ovn/ovnsb_db.db is healthy
log-ovnkube-master-kgfxr-sbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-kgfxr-sbdb-++ date -Iseconds
log-ovnkube-master-kgfxr-sbdb-+ echo '2020-07-08T02:04:34+0000 - starting sbdb  MASTER_IP=10.0.139.146'
--
log-ovnkube-master-kgfxr-nbdb:+ check_db_health /etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-+ db=/etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
log-ovnkube-master-kgfxr-nbdb-+ echo 'Checking /etc/ovn/ovnnb_db.db health'
log-ovnkube-master-kgfxr-nbdb-Checking /etc/ovn/ovnnb_db.db health
log-ovnkube-master-kgfxr-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-kgfxr-nbdb-+ serverid=' cce6'
log-ovnkube-master-kgfxr-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-kgfxr-nbdb-++ grep 'prev_servers: * cce6('
log-ovnkube-master-kgfxr-nbdb-+ match=' prev_servers: cce6("ssl:10.0.139.146:9643")'
log-ovnkube-master-kgfxr-nbdb-+ [[ -z  prev_servers: cce6("ssl:10.0.139.146:9643") ]]
log-ovnkube-master-kgfxr-nbdb-/etc/ovn/ovnnb_db.db is healthy
log-ovnkube-master-kgfxr-nbdb-+ echo '/etc/ovn/ovnnb_db.db is healthy'
log-ovnkube-master-kgfxr-nbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-kgfxr-nbdb-++ date -Iseconds
log-ovnkube-master-kgfxr-nbdb-+ echo '2020-07-08T02:04:27+0000 - starting nbdb  MASTER_IP=10.0.139.146'
--
log-ovnkube-master-fr5z6-sbdb:+ check_db_health /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ db=/etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ [[ ! -f /etc/ovn/ovnsb_db.db ]]
log-ovnkube-master-fr5z6-sbdb-+ echo 'Checking /etc/ovn/ovnsb_db.db health'
log-ovnkube-master-fr5z6-sbdb-Checking /etc/ovn/ovnsb_db.db health
log-ovnkube-master-fr5z6-sbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-fr5z6-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ serverid=' 6579'
log-ovnkube-master-fr5z6-sbdb-++ ovsdb-tool show-log /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-++ grep 'prev_servers: * 6579('
log-ovnkube-master-fr5z6-sbdb-++ true
log-ovnkube-master-fr5z6-sbdb-+ match=
log-ovnkube-master-fr5z6-sbdb-+ [[ -z '' ]]
log-ovnkube-master-fr5z6-sbdb-+ echo 'Current server_id  6579 not found in /etc/ovn/ovnsb_db.db, cleaning up'
log-ovnkube-master-fr5z6-sbdb-Current server_id  6579 not found in /etc/ovn/ovnsb_db.db, cleaning up
log-ovnkube-master-fr5z6-sbdb-+ rm -- /etc/ovn/ovnsb_db.db
log-ovnkube-master-fr5z6-sbdb-+ MASTER_IP=10.0.139.146
--
log-ovnkube-master-fr5z6-nbdb:+ check_db_health /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-+ db=/etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
log-ovnkube-master-fr5z6-nbdb-Checking /etc/ovn/ovnnb_db.db health
log-ovnkube-master-fr5z6-nbdb-+ echo 'Checking /etc/ovn/ovnnb_db.db health'
log-ovnkube-master-fr5z6-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
log-ovnkube-master-fr5z6-nbdb-+ serverid=' cc0b'
log-ovnkube-master-fr5z6-nbdb-++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-++ grep 'prev_servers: * cc0b('
log-ovnkube-master-fr5z6-nbdb-++ true
log-ovnkube-master-fr5z6-nbdb-+ match=
log-ovnkube-master-fr5z6-nbdb-+ [[ -z '' ]]
log-ovnkube-master-fr5z6-nbdb-+ echo 'Current server_id  cc0b not found in /etc/ovn/ovnnb_db.db, cleaning up'
log-ovnkube-master-fr5z6-nbdb-Current server_id  cc0b not found in /etc/ovn/ovnnb_db.db, cleaning up
log-ovnkube-master-fr5z6-nbdb-+ rm -- /etc/ovn/ovnnb_db.db
log-ovnkube-master-fr5z6-nbdb-+ MASTER_IP=10.0.139.146
--
log-ovnkube-master-dq5sd-sbdb:+ check_db_health /etc/ovn/ovnsb_db.db
log-ovnkube-master-dq5sd-sbdb-+ db=/etc/ovn/ovnsb_db.db
log-ovnkube-master-dq5sd-sbdb-+ [[ ! -f /etc/ovn/ovnsb_db.db ]]
log-ovnkube-master-dq5sd-sbdb-+ echo '/etc/ovn/ovnsb_db.db does not exist, skipping health check'
log-ovnkube-master-dq5sd-sbdb-/etc/ovn/ovnsb_db.db does not exist, skipping health check
log-ovnkube-master-dq5sd-sbdb-+ return
log-ovnkube-master-dq5sd-sbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-sbdb-++ date -Iseconds
log-ovnkube-master-dq5sd-sbdb-+ echo '2020-07-08T02:03:41+0000 - starting sbdb  MASTER_IP=10.0.139.146'
log-ovnkube-master-dq5sd-sbdb-2020-07-08T02:03:41+0000 - starting sbdb  MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-sbdb-+ [[ 10.0.216.169 == \1\0\.\0\.\1\3\9\.\1\4\6 ]]
log-ovnkube-master-dq5sd-sbdb-++ date -Iseconds
log-ovnkube-master-dq5sd-sbdb-+ echo '2020-07-08T02:03:41+0000 - joining cluster at 10.0.139.146'
log-ovnkube-master-dq5sd-sbdb-2020-07-08T02:03:41+0000 - joining cluster at 10.0.139.146
log-ovnkube-master-dq5sd-sbdb-++ bracketify 10.0.216.169
log-ovnkube-master-dq5sd-sbdb-++ case "$1" in
log-ovnkube-master-dq5sd-sbdb-++ echo 10.0.216.169
--
log-ovnkube-master-dq5sd-nbdb:+ check_db_health /etc/ovn/ovnnb_db.db
log-ovnkube-master-dq5sd-nbdb-+ db=/etc/ovn/ovnnb_db.db
log-ovnkube-master-dq5sd-nbdb-+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
log-ovnkube-master-dq5sd-nbdb-+ echo '/etc/ovn/ovnnb_db.db does not exist, skipping health check'
log-ovnkube-master-dq5sd-nbdb-/etc/ovn/ovnnb_db.db does not exist, skipping health check
log-ovnkube-master-dq5sd-nbdb-+ return
log-ovnkube-master-dq5sd-nbdb-+ MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-nbdb-++ date -Iseconds
log-ovnkube-master-dq5sd-nbdb-+ echo '2020-07-08T02:03:41+0000 - starting nbdb  MASTER_IP=10.0.139.146'
log-ovnkube-master-dq5sd-nbdb-2020-07-08T02:03:41+0000 - starting nbdb  MASTER_IP=10.0.139.146
log-ovnkube-master-dq5sd-nbdb-+ [[ 10.0.216.169 == \1\0\.\0\.\1\3\9\.\1\4\6 ]]
log-ovnkube-master-dq5sd-nbdb-++ bracketify 10.0.216.169
log-ovnkube-master-dq5sd-nbdb-++ case "$1" in
log-ovnkube-master-dq5sd-nbdb-++ echo 10.0.216.169
log-ovnkube-master-dq5sd-nbdb-++ bracketify 10.0.139.146
log-ovnkube-master-dq5sd-nbdb-++ case "$1" in
log-ovnkube-master-dq5sd-nbdb-++ echo 10.0.139.146

Comment 32 W. Trevor King 2020-07-20 14:18:19 UTC
*** Bug 1858834 has been marked as a duplicate of this bug. ***

Comment 33 Scott Dodson 2020-07-20 14:31:21 UTC
*** Bug 1857455 has been marked as a duplicate of this bug. ***

Comment 34 Scott Dodson 2020-07-20 14:33:18 UTC
*** Bug 1857462 has been marked as a duplicate of this bug. ***

Comment 36 Federico Paolinelli 2020-07-21 13:56:18 UTC
Following up on this, I am not sure it's going to work.

I saw that the second PR was merged [1] and wanted to validate it.

The thing is, in all three master pods on a newly created cluster (created with the cluster bot), only the leader is listed in prev_servers:

[root@ci-ln-nqtg9mk-f76d1-fz9b9-master-0 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
 prev_servers: f680("ssl:10.0.0.3:9643")

[root@ci-ln-nqtg9mk-f76d1-fz9b9-master-1 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
 prev_servers: f680("ssl:10.0.0.3:9643")

[root@ci-ln-nqtg9mk-f76d1-fz9b9-master-2 ~]#  ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
 prev_servers: f680("ssl:10.0.0.3:9643")

This is different from what was being discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23, and it is going to cause all masters but the leader to delete their DB file (which may lead to unexpected results).
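
The prev_servers check being discussed can be sketched roughly as follows (a simplified reconstruction from the trace earlier in this bug, not the actual CNO script; the function name is illustrative):

```shell
# Simplified reconstruction of the health check seen in the trace
# above (illustrative only).  Reads `ovsdb-tool show-log <db>` output
# on stdin and succeeds only if the local server_id appears in the
# prev_servers list.  On a freshly created cluster only the leader may
# be listed there, so healthy followers can fail this check and
# wrongly wipe their DB file.
server_in_prev_servers() {
    local log serverid
    log=$(cat)
    # Extract the local server id, e.g. "cc0b" from "server_id: cc0b".
    serverid=$(printf '%s\n' "$log" |
        sed -ne 's/.*server_id: *\([[:xdigit:]]\{1,\}\).*/\1/p' | head -n1)
    [ -n "$serverid" ] || return 1
    # Succeed only if that id is listed among prev_servers.
    printf '%s\n' "$log" | grep -q "prev_servers:.*${serverid}("
}
```

In the trace above, this is the check that came back empty on the followers and triggered the `rm -- /etc/ovn/ovnnb_db.db`.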

Moreover, after rebooting one non-leader master, the pod is running:

openshift-ovn-kubernetes                           ovnkube-master-8ssx5                                          4/4     Running     0          3m52s
openshift-ovn-kubernetes                           ovnkube-master-f27br                                          4/4     Running     1          119m
openshift-ovn-kubernetes                           ovnkube-master-zp5dp                                          4/4     Running     0          119m

BUT in the nbdb logs I am seeing:

44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:53Z|00108|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:55Z|00109|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:57Z|00110|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)


So it looks like the rebooted pod renamed itself, but the peers are referring to it by the wrong name. I am not sure what the side effects of this are.


[1] https://github.com/openshift/cluster-network-operator/pull/694

Comment 37 Federico Paolinelli 2020-07-21 13:59:01 UTC
Adding also that I remember trying this when opening the original PR, and *I think* it was working.
Also, when running the log command against the DB linked in https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c16, it returns all three servers.

Comment 38 Anurag saxena 2020-07-21 20:20:10 UTC
Moving this back to the Assigned state until the dependent BZ 1848066 is resolved.

Comment 39 Federico Paolinelli 2020-07-22 10:39:57 UTC
Another data point:

I just tried doing the same on 4.4, and the output is the one we expected:

[root@master-1 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
 prev_servers: 149e("ssl:10.1.190.31:9643"), 7f13("ssl:10.1.190.32:9643"), fb57("ssl:10.1.190.33:9643")

On 4.5, though, it still contains only the leader:

[root@ci-ln-vz8t882-f76d1-zm24n-master-0 ~]# ovsdb-tool show-log /etc/ovn/ovnnb_db.db | grep prev_servers
 prev_servers: 4b50("ssl:10.0.0.3:9643")

Comment 40 Dumitru Ceara 2020-07-29 07:39:45 UTC
Hi Federico,

The initial workaround was incomplete because the correct way to determine whether the DB is inconsistent is to inspect the RAFT header and RAFT records (in the "ovsdb-tool show-log" output). If the local server ID is not present in any of them, then we can conclude that the DB is inconsistent. In https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c39, in the 4.5 output the first local snapshot hadn't happened yet, so the prev_servers field contains only the leader, but the local server ID is probably present in the RAFT log records.

I mentioned the same point in the upstream discussion on the patch to improve "ovsdb-tool check-cluster":

https://patchwork.ozlabs.org/project/openvswitch/patch/CAAFK5zwuJ1GjfNgNObmsqHF3bQ9tTe2phVwbCHjHpkeyf0pmVA@mail.gmail.com/#2494442

Thanks,
Dumitru

Comment 41 Federico Paolinelli 2020-07-29 07:58:06 UTC
I will make the change. At the same time, as we discussed offline, a fix is needed for the


4094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:53Z|00108|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:55Z|00109|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)
2020-07-21T13:50:57Z|00110|raft|INFO|ssl:10.0.0.3:55768: syntax "{"cluster":"eeaaec50-a349-4fd4-874a-58f672cac5c9","comment":"heartbeat","from":"f680cfa2-0c0d-48a8-9cf2-3bfb44094085","leader_commit":2222,"log":[],"prev_log_index":2222,"prev_log_term":1,"term":1,"to":"bb423d57-5460-4354-9608-006ee7696cf5"}": syntax error: Parsing raft append_request RPC failed: misrouted message (addressed to bb42 but we're 022d)


part, because it won't work even if we identify that the DB file is corrupted. This also means that we can't do a bash-only temporary workaround, right?
At least this part needs to be addressed in OVN.

Comment 42 Dumitru Ceara 2020-07-29 08:15:20 UTC
Yes, you're right we need an additional fix in ovsdb-server to deal with duplicate servers.

Comment 43 Federico Paolinelli 2020-07-30 11:17:11 UTC
I sent a new version of the patch https://patchwork.ozlabs.org/project/openvswitch/patch/CAAFK5zxZ2PPKbq9ythtRrGyWhXA47RyHktyGGC20UvVCLT9WRw@mail.gmail.com/

@dumitru, would it make sense to reassign this bug to you? My understanding is that we need both my change and the duplicate-servers change (assuming you are working on it).
Since we need both, I think (hope) that once they are in, we'll be able to use ovsdb-tool directly to do the validation in CNO.

Comment 44 Dumitru Ceara 2020-07-31 14:27:57 UTC
(In reply to Federico Paolinelli from comment #43)
> I sent a new version of the patch
> https://patchwork.ozlabs.org/project/openvswitch/patch/
> CAAFK5zxZ2PPKbq9ythtRrGyWhXA47RyHktyGGC20UvVCLT9WRw@mail.gmail.com/

Thanks, I'll have another look.

> 
> @dumitru, would it make sense to reassign this bug to you? My understanding
> is that we need both my change and the duplicate servers change (assuming
> you are working on it)?
> As we need both, I think (hope) that once they are in we'll be able to use
> directly ovsdb-tool to do the validation in CNO.

I spent some time thinking about this and trying different things in the ovsdb-server RAFT implementation but I came to the conclusion that it wouldn't be exactly OK for ovsdb-server to automatically remove the "stale" server if a new one joins with the same <address:port> tuple. The main reason is that ovsdb-server can't differentiate between the following two cases:
1. The DB was removed on a node and a server is rejoining a cluster using a new SID. In this case the old server entry is indeed "stale".
2. There's a misconfig in the cluster (e.g., ovn-kube misconfigures two nodes to use the same IP address). In this case the old entry isn't stale and the "syntax error" logs are valid and point to the misconfiguration.

I think the best approach right now (until we root cause and fix bug 1848066) is to get the ovsdb-tool check-cluster enhancement merged in OVS and afterwards we can use it to do the validation in CNO.

If the validation fails, the old DB file should be removed and as soon as the DB restarts on the node we could have an additional check (I guess in "postStart") to see if there is a stale <address:port> tuple matching the "self" entry. If so then CNO should run "ovs-appctl -t /var/run/ovn/ovnsb_db.ctl <DB> cluster/kick <stale-server>" to get rid of the stale entry.

What do you think?
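
The two steps described above can be sketched roughly like this (illustrative only; the DB paths are the standard OVN locations, `check-cluster` and `cluster/kick` are real ovsdb tool commands, but discovering the stale SID is left open here, and the function names and the CHECK/APPCTL overrides are invented for this example):

```shell
# Step 1 (before starting ovsdb-server): validate the RAFT log with
# the enhanced check-cluster.  If the check fails, the local copy is
# inconsistent and must be removed.  CHECK is overridable purely to
# make this sketch testable.
validate_db() {
    local db=$1
    if ! ${CHECK:-ovsdb-tool} check-cluster "$db"; then
        echo "$db failed check-cluster, removing local copy"
        rm -f -- "$db"
        return 1
    fi
}

# Step 2 (in postStart, after the server has rejoined with a fresh
# SID): kick the stale member that still holds our old address:port.
# How the stale SID is discovered is the open question in this
# discussion; it is passed in as an argument here.
kick_stale() {
    local ctl=$1 dbname=$2 stale_sid=$3
    [ -n "$stale_sid" ] || return 0
    ${APPCTL:-ovs-appctl} -t "$ctl" cluster/kick "$dbname" "$stale_sid"
}
```

For example, `kick_stale /var/run/ovn/ovnsb_db.ctl OVN_Southbound <stale-sid>` would issue the cluster/kick mentioned above.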

Comment 45 Federico Paolinelli 2020-08-03 07:47:02 UTC
> I think the best approach right now (until we root cause and fix bug
> 1848066) is to get the ovsdb-tool check-cluster enhancement merged in OVS
> and afterwards we can use it to do the validation in CNO.
> 

From what we are saying / what we fixed, this could also be implemented as a short-term solution in bash, right?
Checking not only whether the server ID is in the prev_servers list but also whether it is in servers.


> If the validation fails, the old DB file should be removed and as soon as
> the DB restarts on the node we could have an additional check (I guess in
> "postStart") to see if there is a stale <address:port> tuple matching the
> "self" entry. If so then CNO should run "ovs-appctl -t
> /var/run/ovn/ovnsb_db.ctl <DB> cluster/kick <stale-server>" to get rid of
> the stale entry.
> 

The stale entry would not match the "self" entry, as the server ID would be a new one. One thing we could do is have the removal code (which happens in the command section) pass the old ID to the postStart part (where we can kick the old server off). This could be done by leaving a file with the old ID on the filesystem. It would also have the nice side effect that the kick would be intentionally triggered only when a delete happens, making the logic more straightforward.

I am not sure if leaving a file around is too dirty, but I don't think it brings side effects. If the file is there, we read the ID, kick the server, and delete the file.


> What do you think?
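
The marker-file handoff proposed above could look roughly like this (a sketch only; the `.stale-id` path, the function names, and the SHOWLOG/APPCTL overrides are invented for illustration):

```shell
# Sketch of the marker-file handoff (illustrative; the .stale-id path
# and function names are assumptions, not the actual CNO code).
stale_id_file() { echo "$1.stale-id"; }

# Runs where the corrupted DB is deleted: remember the old server ID
# so postStart knows exactly which member to kick.
remove_db_remember_id() {
    local db=$1 serverid
    serverid=$(${SHOWLOG:-ovsdb-tool} show-log "$db" |
        sed -ne 's/.*server_id: *\([[:xdigit:]]\{1,\}\).*/\1/p' | head -n1)
    [ -n "$serverid" ] && echo "$serverid" > "$(stale_id_file "$db")"
    rm -f -- "$db"
}

# Runs in postStart: if a marker is present, kick that ID and clean
# up, so the kick only ever happens right after a deletion.
kick_remembered_id() {
    local db=$1 ctl=$2 dbname=$3 marker serverid
    marker=$(stale_id_file "$db")
    [ -f "$marker" ] || return 0
    serverid=$(cat "$marker")
    ${APPCTL:-ovs-appctl} -t "$ctl" cluster/kick "$dbname" "$serverid"
    rm -f -- "$marker"
}
```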

Comment 46 Dumitru Ceara 2020-08-03 07:51:38 UTC
(In reply to Federico Paolinelli from comment #45)
> > I think the best approach right now (until we root cause and fix bug
> > 1848066) is to get the ovsdb-tool check-cluster enhancement merged in OVS
> > and afterwards we can use it to do the validation in CNO.
> > 
> 
> From what we are saying / what we fixed, this could also be implemented as a
> short term solution in bash right?
> Checking not only if the serverid is in the prev_servers list but also in
> servers
> 
> 
> > If the validation fails, the old DB file should be removed and as soon as
> > the DB restarts on the node we could have an additional check (I guess in
> > "postStart") to see if there is a stale <address:port> tuple matching the
> > "self" entry. If so then CNO should run "ovs-appctl -t
> > /var/run/ovn/ovnsb_db.ctl <DB> cluster/kick <stale-server>" to get rid of
> > the stale entry.
> > 
> 
> The stale entry would not match the "self" entry, as the serverid would be a
> new one. One thing we could do is to have the removal code (which happens in
> the command section) let know the id ti the postStart part (where we can
> kick the
> old server off). This could be done by leaving a file with the old id in the
> filesystem. It would also have the nice side effect that the kick would be
> intentionally triggered only when a delete happens, making the logic more
> straightforward.
> 
> Not sure if leaving a file around is too dirty, but I don't think it brings
> side effects. If the file is there, we read the id, kick the server and
> delete the file.
> 

I'm not knowledgeable enough with ovn-kubernetes but fwiw this sounds ok to me.

Thanks,
Dumitru

Comment 47 Federico Paolinelli 2020-08-07 16:05:59 UTC
Following up on this, it took me a bit to validate it.
I have another PR (https://github.com/openshift/cluster-network-operator/pull/746) where I implemented what was discussed here.

While trying to see how it behaves, I found that although deleting a non-leader's DB works, if I nuke the leader's DB, it chooses a different cluster ID, and the others complain about referring to the old one:

2020-08-07T15:53:59Z|00449|raft|INFO|ssl:10.0.0.5:44624: syntax "{"cluster":"a8fcb0ad-c219-4f33-b5b3-56cc62fff86e","comment":"heartbeat","from":"35fb413c-1792-47ac-8ac2-87d8b1755052","leader_commit":2901,"log":[],"prev_log_index":2901,"prev_log_term":12,"term":12,"to":"a0f7e381-a94c-4034-83ca-edf90f04e42a"}": syntax error: Parsing raft append_request RPC failed: wrong cluster a8fc (expected 9052)
2020-08-07T15:54:01Z|00450|raft|INFO|ssl:10.0.0.5:44624: syntax "{"cluster":"a8fcb0ad-c219-4f33-b5b3-56cc62fff86e","comment":"heartbeat","from":"35fb413c-1792-47ac-8ac2-87d8b1755052","leader_commit":2901,"log":[],"prev_log_index":2901,"prev_log_term":12,"term":12,"to":"a0f7e381-a94c-4034-83ca-edf90f04e42a"}": syntax error: Parsing raft append_request RPC failed: wrong cluster a8fc (expected 9052)


I am not sure this is acceptable; I am a bit afraid of this being fragile and having other side effects around the corner.

Comment 48 Tim Rozet 2020-08-25 15:55:23 UTC
After some internal discussion with the OVN/OVN-Kubernetes team, we can break this bug down into 3 components with solutions:

1. Ensuring correct raft membership. There are two parts to this problem. A member is composed of a random UUID + IP address. Consider the following scenarios with control-plane nodes A, B, C, with initial raft membership of A, B, C.

Scenario 1: node A goes down, loses its DB, and comes back up. After rejoining the raft cluster, the new membership is oldA, B, C, newA. oldA and newA have the same address but different UUIDs.

The fix for this is for *each* ovn-kubernetes master node to be responsible for kicking the old version of itself out of the cluster.

Scenario 2: node A goes down and is replaced by node D. Now the raft membership is A, B, C, D. A and D have different IP addresses and UUIDs.

The fix for this is that the ovn-kubernetes master node that is also the raft leader will be refreshed with the new IP addresses for the DB cluster (B, C, D). It will then check the current raft membership and kick the stale entry (as long as the minimum number of required entries + 1 exist).

2. Ensuring we do not corrupt the database on a stop. We currently just kill the ovsdb-servers for NB/SB.

The fix here is to add a pre-stop hook on the pod so that it stops ovsdb gracefully.

3. Ensuring the DB on each node is not corrupt and ovsdb-server is not stuck in an endless cycle failing to start.

The fix here is for each ovn-kubernetes master node to determine whether its local database is healthy. This may be via an ovsdb-tool CLI check, an ovsdb query, or examining the database file itself. If the database is determined to be unhealthy, ovn-kube master will back up and destroy it and then exit its process. This ensures that ovn-kube master restarts, which handles the case where, if all 3 DBs became corrupt and restarted fresh, we would need to resync kapi events and rebuild the database using ovn-kube master.
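
Point 3 could be sketched like this (illustrative only; the real fix landed in Go in ovn-kube's ovndbchecker, and the backup suffix and the CHECK override here are assumptions for the example):

```shell
# Sketch of point 3 (illustrative; the actual implementation is the
# Go ovndbchecker, and the backup naming here is an assumption).  If
# the local DB fails validation, back it up, remove it, and exit so
# the master process restarts and rebuilds the DB from kapi events.
ensure_local_db() {
    local db=$1
    [ -f "$db" ] || return 0          # nothing to validate yet
    if ${CHECK:-ovsdb-tool} check-cluster "$db"; then
        return 0                      # DB is healthy
    fi
    echo "$db is unhealthy, backing up and removing"
    mv -- "$db" "$db.backup.$(date +%s)"
    # Exiting forces a restart; even if all three replicas were lost,
    # the restarted master resyncs kapi events and rebuilds the DB.
    exit 1
}
```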

Comment 49 Tim Rozet 2020-08-26 19:08:40 UTC
Fix posted upstream in ovn-kubernetes for 1. For 2, tracking that with this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1872750

Comment 50 Ricardo Carrillo Cruz 2020-09-11 11:03:01 UTC
Per a chat with Aniket, assigning to him since he already had a WIP PR for 3).

Comment 51 Tim Rozet 2020-09-25 02:33:36 UTC
Covering the corresponding CNO change in https://bugzilla.redhat.com/show_bug.cgi?id=1882569

Comment 57 Ross Brattain 2020-10-22 01:58:01 UTC
Verified that replacing masters works on GCP and AWS with 4.7.0-0.nightly-2020-10-21-001511

ovn-dbchecker is running

2020-10-21T21:29:05.726178849Z + [[ -f /env/_master ]]
2020-10-21T21:29:05.726845277Z ++ date '+%m%d %H:%M:%S.%N'
2020-10-21T21:29:05.732488300Z + echo 'I1021 21:29:05.731565255 - ovn-dbchecker - start ovn-dbchecker'
2020-10-21T21:29:05.732530594Z I1021 21:29:05.731565255 - ovn-dbchecker - start ovn-dbchecker
2020-10-21T21:29:05.732705990Z + exec /usr/bin/ovndbchecker --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --sb-address ssl:10.0.0.4:9642,ssl:10.0.0.5:9642,ssl:10.0.0.7:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common->
2020-10-21T21:29:05.749346541Z I1021 21:29:05.749216       1 config.go:1282] Parsed config file /run/ovnkube-config/ovnkube.conf
2020-10-21T21:29:05.749409581Z I1021 21:29:05.749294       1 config.go:1283] Parsed config: {Default:{MTU:1360 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 >
2020-10-21T21:29:05.753247811Z I1021 21:29:05.753165       1 ovndbmanager.go:24] Starting DB Checker to ensure cluster membership and DB consistency
2020-10-21T21:29:05.753247811Z I1021 21:29:05.753230       1 ovndbmanager.go:45] Starting ensure routine for Raft db: /etc/ovn/ovnsb_db.db
2020-10-21T21:29:05.753296454Z I1021 21:29:05.753273       1 ovndbmanager.go:45] Starting ensure routine for Raft db: /etc/ovn/ovnnb_db.db
2020-10-21T21:30:06.033818125Z I1021 21:30:06.033739       1 ovndbmanager.go:250] check-cluster returned out: "", stderr: ""
2020-10-21T21:30:06.041169895Z I1021 21:30:06.041069       1 ovndbmanager.go:250] check-cluster returned out: "", stderr: ""
2020-10-21T21:31:06.055653415Z I1021 21:31:06.055565       1 ovndbmanager.go:250] check-cluster returned out: "", stderr: ""

