Bug 1858712 - [OVN] 4.5.3 upgrade failure---some ovnkube-master and ovnkube-node pods are in CrashLoopBackOff [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Aniket Bhat
QA Contact: Anurag saxena
URL:
Whiteboard:
: 1858767 (view as bug list)
Depends On: 1837953 1858834 1882569
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-07-20 07:58 UTC by huirwang
Modified: 2020-09-25 02:31 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1858834 (view as bug list)
Environment:
Last Closed: 2020-07-30 18:56:59 UTC
Target Upstream Version:
lmohanty: needinfo? (bbennett)


Attachments (Terms of Use)
The crash pods logs (43.37 KB, application/gzip)
2020-07-20 08:15 UTC, huirwang
no flags Details


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 718 None closed Bug 1858712: Revert "[release-4.5] Bug 1854072: Check if the db file has been correctly initialized with raft." 2020-08-31 08:58:39 UTC
Red Hat Product Errata RHBA-2020:3028 None None None 2020-07-30 18:57:26 UTC

Description huirwang 2020-07-20 07:58:19 UTC
Version-Release number of selected component (if applicable):

Base version:4.5.2-x86_64
Target version:4.5.0-0.nightly-2020-07-18-024505


How reproducible: 
always

Steps to Reproduce:
Use the upgrade ci to trigger upgrade from 4.5.2-x86_64 to 4.5.0-0.nightly-2020-07-18-024505
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/3787/console

The upgrade ultimately failed.


Actual Result:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.2     True        True          3h19m   Unable to apply 4.5.0-0.nightly-2020-07-18-024505: an unknown error has occurred: MultipleErrors

oc get co network -o yaml
status:
  conditions:
  - lastTransitionTime: "2020-07-20T04:43:27Z"
    message: |-
      DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2020-07-20T04:33:13Z
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2020-07-20T04:32:51Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-07-20T03:27:52Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2020-07-20T04:31:05Z"
    message: |-
      DaemonSet "openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" update is rolling out (4 out of 6 updated)


One multus pod is stuck in ContainerCreating with this error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-2wxwn_openshift-multus_7dc31947-f76a-4207-9288-38778b17eafe_0(be44e539169c537a7867a026c71109fb91d3b7086580e86471269665b7548578): Multus: [openshift-multus/multus-admission-controller-2wxwn]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-multus/multus-admission-controller-2wxwn] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int be44e539169c537 -- set interface be44e539169c537 external_ids:attached_mac=2e:a8:2b:82:00:04 external_ids:iface-id=openshift-multus_multus-admission-controller-2wxwn external_ids:ip_addresses=10.130.0.3/23 external_ids:sandbox=be44e539169c537a7867a026c71109fb91d3b7086580e86471269665b7548578': exit status 1
 "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n"
 
 oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-gjs4n   2/4     CrashLoopBackOff   39         3h      10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovnkube-master-gx75c   4/4     Running            0          3h2m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-master-mn5vc   4/4     Running            0          3h1m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-2bwg2     2/2     Running            0          3h2m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-7vszf     1/2     CrashLoopBackOff   33         3h1m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-node-8clcn     2/2     Running            0          3h49m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovnkube-node-srwtf     2/2     Running            0          3h2m    10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>
ovnkube-node-v8j5p     2/2     Running            0          3h2m    10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovnkube-node-zslrj     2/2     Running            0          4h5m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-cwfz2         1/1     Running            0          3h      10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovs-node-fvblz         1/1     Running            0          3h1m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovs-node-t8vl6         1/1     Running            0          179m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovs-node-thbn7         1/1     Running            0          3h2m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-vp2rp         1/1     Running            0          3h2m    10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovs-node-vwrps         1/1     Running            0          179m    10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none> 

oc logs -c ovnkube-master ovnkube-master-gjs4n -n openshift-ovn-kubernetes
+ [[ -f /env/_master ]]
+ hybrid_overlay_flags=
+ [[ -n '' ]]
++ ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.136.16:9641,ssl:10.0.178.208:9641,ssl:10.0.204.230:9641
2020-07-20T05:39:13Z|00184|stream_ssl|WARN|SSL_connect: unexpected SSL connection close


oc logs -c ovnkube-node ovnkube-node-7vszf -n openshift-ovn-kubernetes
.....
I0720 06:48:08.321931  404343 ovs.go:249] exec(122): stdout: "not connected\n"
I0720 06:48:08.321965  404343 ovs.go:250] exec(122): stderr: ""
I0720 06:48:08.321981  404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
I0720 06:48:08.792527  404343 ovs.go:246] exec(123): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.93522.ctl connection-status
I0720 06:48:08.820573  404343 ovs.go:249] exec(123): stdout: "not connected\n"
I0720 06:48:08.820724  404343 ovs.go:250] exec(123): stderr: ""
I0720 06:48:08.820748  404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
I0720 06:48:08.820767  404343 ovs.go:246] exec(124): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.93522.ctl connection-status
I0720 06:48:08.847272  404343 ovs.go:249] exec(124): stdout: "not connected\n"
I0720 06:48:08.847306  404343 ovs.go:250] exec(124): stderr: ""
I0720 06:48:08.847321  404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
F0720 06:48:08.847355  404343 ovnkube.go:129] timed out waiting sbdb for node ip-10-0-204-230.us-east-2.compute.internal: timed out waiting for the condition

Comment 2 huirwang 2020-07-20 08:15:47 UTC
Created attachment 1701717 [details]
The crash pods logs

Comment 3 Alexander Constantinescu 2020-07-20 09:35:31 UTC
FYI, for tracking purposes:

oc get pod -owide
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-gjs4n   2/4     CrashLoopBackOff   63         4h58m   10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovnkube-master-gx75c   4/4     Running            0          5h      10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-master-mn5vc   4/4     Running            0          4h59m   10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-2bwg2     2/2     Running            0          5h      10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-7vszf     1/2     CrashLoopBackOff   52         4h59m   10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-node-8clcn     2/2     Running            0          5h47m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovnkube-node-srwtf     2/2     Running            0          4h59m   10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>
ovnkube-node-v8j5p     2/2     Running            0          5h      10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovnkube-node-zslrj     2/2     Running            0          6h3m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-cwfz2         1/1     Running            0          4h58m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovs-node-fvblz         1/1     Running            0          4h59m   10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovs-node-t8vl6         1/1     Running            0          4h57m   10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovs-node-thbn7         1/1     Running            0          5h      10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-vp2rp         1/1     Running            0          5h      10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovs-node-vwrps         1/1     Running            0          4h57m   10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>
aconstan@linux-3 ~/Downloads/tmp $ oc exec -ti -c nbdb ovnkube-master-gx75c -- bash 
[root@ip-10-0-204-230 ~]# /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
3eb7
Name: OVN_Northbound
Cluster ID: 3fe2 (3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2)
Server ID: 3eb7 (3eb71f71-3761-4903-b721-05e532588025)
Address: ssl:10.0.204.230:9643
Status: cluster member
Role: candidate
Term: 3288
Leader: unknown
Vote: self

Election timer: 5000
Log: [2, 4742]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: (->0000) ->c763 <-1e5e
Servers:
    3eb7 (3eb7 at ssl:10.0.204.230:9643) (self) (voted for 3eb7)
    6932 (6932 at ssl:10.0.204.230:9643)
    9ece (9ece at ssl:10.0.136.16:9643)
    c763 (c763 at ssl:10.0.178.208:9643)

oc logs -c nbdb ovnkube-master-gx75c | less

+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
+ serverid=' 6932'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * 6932('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id  6932 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id  6932 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.204.230 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.204.230
++ case "$1" in
++ echo 10.0.204.230
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.204.230 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:31:45Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:31:45Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:31:45Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:31:45Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:31:45Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:31:45Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:31:45Z|00007|raft|INFO|server 9ece is leader for term 3
2020-07-20T04:31:45Z|00008|raft|INFO|rejecting append_request because previous entry 3,4455 not in local log (mismatch past end of log)
2020-07-20T04:31:45Z|00009|raft|INFO|server 9ece added to configuration
2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
2020-07-20T04:31:45Z|00011|raft|INFO|server 6932 added to configuration
2020-07-20T04:31:45Z|00012|raft|INFO|Election timer changed from 5000 to 2000
2020-07-20T04:31:45Z|00013|raft|INFO|Election timer changed from 2000 to 4000
2020-07-20T04:31:45Z|00014|raft|INFO|Election timer changed from 4000 to 5000
2020-07-20T04:31:45Z|00015|reconnect|INFO|ssl:10.0.178.208:9643: connecting...
2020-07-20T04:31:45Z|00016|reconnect|INFO|ssl:10.0.178.208:9643: connected
2020-07-20T04:31:45Z|00017|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
2020-07-20T04:31:45Z|00018|raft|INFO|server 3eb7 added to configuration
2020-07-20T04:31:48Z|00019|raft|INFO|ssl:10.0.178.208:44378: syntax "{"address":"ssl:10.0.178.208:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"c7632632-4935-43a9-858b-7b6b8af1bec9","to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:31:48Z|00020|raft|INFO|ssl:10.0.136.16:53316: syntax "{"address":"ssl:10.0.136.16:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)


So somehow 10.0.204.230 was added twice to the cluster, and the cluster state is now inconsistent:

> 2020-07-20T04:31:45Z|00017|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
> 2020-07-20T04:31:45Z|00018|raft|INFO|server 3eb7 added to configuration
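The duplicate membership is visible directly in the cluster/status output above: ssl:10.0.204.230:9643 appears under two server IDs (3eb7 and 6932). A quick, hypothetical way to spot this kind of duplication from a saved status dump (plain awk, nothing OVN-specific; `servers.txt` is just the "Servers:" entries copied from this report):

```shell
# Entries copied verbatim from the cluster/status output above.
cat > servers.txt <<'EOF'
    3eb7 (3eb7 at ssl:10.0.204.230:9643) (self) (voted for 3eb7)
    6932 (6932 at ssl:10.0.204.230:9643)
    9ece (9ece at ssl:10.0.136.16:9643)
    c763 (c763 at ssl:10.0.178.208:9643)
EOF

# Split on spaces/parens so $5 is the address; count server IDs per address
# and flag any address that shows up more than once.
awk -F'[ ()]+' 'NF { c[$5]++ } END { for (a in c) if (c[a] > 1) print a, "listed", c[a], "times" }' servers.txt
# → ssl:10.0.204.230:9643 listed 2 times
```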

Comment 4 Alexander Constantinescu 2020-07-20 09:56:30 UTC
oc logs -c nbdb ovnkube-master-mn5vc | head -n 100
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
+ serverid=' c763'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * c763('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id  c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id  c763 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.178.208 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.178.208
++ case "$1" in
++ echo 10.0.178.208
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.178.208 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:32:38Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:32:38Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:32:38Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:32:38Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:32:38Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:32:38Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:32:38Z|00007|raft|INFO|ssl:10.0.178.208:9643: learned new server address for joining cluster
2020-07-20T04:32:38Z|00008|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
2020-07-20T04:32:38Z|00009|reconnect|INFO|ssl:10.0.204.230:9643: connecting...
2020-07-20T04:32:38Z|00010|reconnect|INFO|ssl:10.0.204.230:9643: connected
2020-07-20T04:32:38Z|00011|raft|INFO|ssl:10.0.204.230:9643: learned server ID 3eb7
2020-07-20T04:32:39Z|00012|raft|INFO|ssl:10.0.204.230:60136: syntax "{"address":"ssl:10.0.204.230:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:40Z|00013|raft|INFO|ssl:10.0.136.16:34778: syntax "{"address":"ssl:10.0.136.16:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:45Z|00014|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":7,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:45Z|00015|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":7,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:48Z|00016|memory|INFO|18020 kB peak resident set size after 10.0 seconds
2020-07-20T04:32:48Z|00017|memory|INFO|cells:20 monitors:0
2020-07-20T04:32:50Z|00018|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":8,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:56Z|00019|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":9,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:33:01Z|00020|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":10,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:33:06Z|00021|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":11,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)


oc exec -ti -c nbdb ovnkube-master-mn5vc -- bash 
[root@ip-10-0-178-208 ~]# ovsdb-tool -m show-log -v
record 0:
 name: "OVN_Northbound"
 local address: "ssl:10.0.178.208:9643"
 server_id: 1e5e
 remote_addresses: ssl:10.0.136.16:9643

[root@ip-10-0-178-208 ~]# /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
1e5e
Name: OVN_Northbound
Cluster ID: 3fe2 (3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2)
Server ID: 1e5e (1e5ed63a-1695-4965-9b88-be6d3bef021e)
Address: ssl:10.0.178.208:9643
Status: joining cluster
Remotes for joining: ssl:10.0.178.208:9643 ssl:10.0.204.230:9643 ssl:10.0.136.16:9643
Role: follower
Term: 0
Leader: unknown
Vote: unknown

Election timer: 1000
Log: [1, 1]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->0000) ->0000 <-0000
Servers:


So it seems something royally weird happened to this cluster: ovnkube-master-mn5vc has ID 1e5e in cluster 3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2, while ovnkube-master-gx75c still believes that member has ID c763 in the same cluster. Essentially, every node has its own representation of what the cluster is and who its members are.

Neither pod has ever been restarted, so it seems the state was wrong from the start.

Comment 5 Alexander Constantinescu 2020-07-20 10:00:27 UTC
From #comment 4:

> + echo 'Current server_id  c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
> Current server_id  c763 not found in /etc/ovn/ovnnb_db.db, cleaning up

And 

> syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)

It seems we deleted the old db file when starting this pod, but the other cluster members still have our old ID and keep trying to communicate with it.
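For reference, the check_db_health logic from the traces above can be reproduced offline against a saved `ovsdb-tool show-log` dump (the dump below is reconstructed from this report; `showlog.txt` is a hypothetical capture, not a live db). With no `prev_servers:` record matching the server ID, the check takes the cleanup branch and would remove a db file that is in fact valid:

```shell
# Reconstructed show-log dump for the mn5vc member (see comment 4).
cat > showlog.txt <<'EOF'
record 0:
 name: "OVN_Northbound"
 local address: "ssl:10.0.178.208:9643"
 server_id: c763
 remote_addresses: ssl:10.0.136.16:9643
EOF

# Same extraction the script performs (note the leading space it keeps).
serverid=$(sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p' showlog.txt)
# Same membership check: look for our ID in a prev_servers record.
match=$(grep "prev_servers: *${serverid}(" showlog.txt || true)

if [ -z "$match" ]; then
  # This is the branch the trace hits: the script concludes the db is
  # uninitialized and deletes it, forcing a rejoin under a fresh server ID.
  echo "server_id${serverid} not found in prev_servers: db would be deleted"
fi
# prints: server_id c763 not found in prev_servers: db would be deleted
```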

Comment 6 Alexander Constantinescu 2020-07-20 10:12:03 UTC
Also

oc logs -c nbdb ovnkube-master-gx75c | head -n 100
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
+ serverid=' 6932'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * 6932('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id  6932 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id  6932 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.204.230 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.204.230
++ case "$1" in
++ echo 10.0.204.230
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.204.230 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:31:45Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:31:45Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:31:45Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:31:45Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:31:45Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:31:45Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:31:45Z|00007|raft|INFO|server 9ece is leader for term 3
2020-07-20T04:31:45Z|00008|raft|INFO|rejecting append_request because previous entry 3,4455 not in local log (mismatch past end of log)
2020-07-20T04:31:45Z|00009|raft|INFO|server 9ece added to configuration
2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
2020-07-20T04:31:45Z|00011|raft|INFO|server 6932 added to configuration

And further down in the log from the same pod

8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:38Z|00526|raft|INFO|ssl:10.0.178.208:45270: learned server ID 1e5e
2020-07-20T04:32:38Z|00527|raft|INFO|ssl:10.0.178.208:45270: learned remote address ssl:10.0.178.208:9643
2020-07-20T04:32:38Z|00528|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:39Z|00529|reconnect|INFO|ssl:10.0.178.208:9643: connected
2020-07-20T04:32:39Z|00530|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:39Z|00531|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:40Z|00532|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":6,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:40Z|00533|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:40Z|00534|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:41Z|00535|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:41Z|00536|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:42Z|00537|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:42Z|00538|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:43Z|00539|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:43Z|00540|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:44Z|00541|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:45Z|00542|raft|INFO|term 7: 5662 ms timeout expired, starting election
2020-07-20T04:32:45Z|00543|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":7,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:45Z|00544|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:46Z|00545|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:47Z|00546|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:48Z|00547|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:50Z|00548|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":8,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:51Z|00549|raft|INFO|Dropped 2 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:32:51Z|00550|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:56Z|00551|raft|INFO|term 9: 5502 ms timeout expired, starting election
2020-07-20T04:32:57Z|00552|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:32:57Z|00553|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:33:01Z|00554|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":10,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:33:03Z|00555|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:33:03Z|00556|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:33:06Z|00557|raft|INFO|term 11: 5106 ms timeout expired, starting election
2020-07-20T04:33:09Z|00558|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:33:09Z|00559|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)

So 

> 2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration

And 

> 2020-07-20T04:32:38Z|00528|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)

This tells me there's a race caused by the deletion of the DB file.

However, that does not explain this:

oc get pod -o yaml ovnkube-master-mn5vc | less
  - containerID: cri-o://f40e4f7ecf9d27a339e61844a6e19ea6ed8f1eafe5035362c2305ef89646e81b
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    lastState: {}
    name: nbdb
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-20T04:32:36Z"

I don't understand how

> 2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration

could have happened when that pod was only started at "2020-07-20T04:32:36Z"?

Neither of these pods has been restarted since.

Comment 7 Alexander Constantinescu 2020-07-20 10:22:15 UTC
Yes, more output from the pod ovnkube-master-mn5vc

Actually, no container had even been started yet at 04:31:45:

containerStatuses:
  - containerID: cri-o://f40e4f7ecf9d27a339e61844a6e19ea6ed8f1eafe5035362c2305ef89646e81b
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    lastState: {}
    name: nbdb
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-20T04:32:36Z"
  - containerID: cri-o://87a547943bd6830da031f56a1c0f0dda0e4f55699e679a66141a7959712a2901
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    lastState: {}
    name: northd
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-20T04:32:36Z"
  - containerID: cri-o://b2461eb0b82478b0fd7182d449e62564b905e6e5d8904150604c4b6dc3c6a48d
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    lastState: {}
    name: ovnkube-master
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-20T04:32:39Z"
  - containerID: cri-o://d1e473efaadab2964311147deb59c001fb3d2dcdc323c641ace76b05c305fe5d
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    lastState: {}
    name: sbdb
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-20T04:32:37Z"
  hostIP: 10.0.178.208
  phase: Running
  podIP: 10.0.178.208
  podIPs:
  - ip: 10.0.178.208
  qosClass: Burstable
  startTime: "2020-07-20T04:32:35Z"

So, it remains a mystery to me how leader ovnkube-master-gx75c ever received c763 (c763 at ssl:10.0.178.208:9643) as a cluster member at 04:31:45

Comment 8 Alexander Constantinescu 2020-07-20 11:10:13 UTC
Ohhh, wait a minute....

From #comment 0

> Use the upgrade ci to trigger upgrade from 4.5.2-x86_64 to 4.5.0-0.nightly-2020-07-18-024505

That could explain #comment 7. I.e., the 4.5.2-x86_64 cluster had member "c763 at ssl:10.0.178.208:9643". When that node is started in the new cluster 4.5.0-0.nightly-2020-07-18-024505, it deletes its database, as we can see in #comment 4:

> + echo 'Current server_id  c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
> Current server_id  c763 not found in /etc/ovn/ovnnb_db.db, cleaning up

This then leads to the cluster inconsistencies seen later on.
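The cleanup gate quoted above can be sketched roughly as follows. Everything here is illustrative: the paths, the stand-in file contents, and the grep-based lookup are assumptions; the real logic lives in the CNO-rendered ovnkube-master startup script.

```shell
#!/bin/sh
# Sketch: wipe the local NB db when its raft metadata no longer contains
# this server's id. Paths and the lookup are illustrative, not the real script.
db_file=/tmp/demo_ovnnb_db.db
server_id=c763

# Stand-in db file that does NOT contain the server id.
printf 'raft header with no matching id\n' > "$db_file"

if [ -e "$db_file" ] && ! grep -q "$server_id" "$db_file"; then
    echo "Current server_id  ${server_id} not found in ${db_file}, cleaning up"
    rm -f "$db_file"
fi
```

The key property is that the decision is purely local: each member decides to wipe its own db based on what it finds (or fails to find) in the on-disk file, with no coordination with the current raft leader.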

Comment 9 Alexander Constantinescu 2020-07-20 11:20:04 UTC
When the cluster is upgraded, do the nodes reboot? 

If that is the case and /var/lib/ovn/etc/ is wiped on the node (which mounts to /etc in the pod), then we have a problem: our OVN cluster state will be wiped, leading to, well, all of this.
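For reference, the mount in question is a hostPath volume along these lines. This is a sketch reconstructed from the paths mentioned in this thread, not the exact daemonset manifest; the volume name and precise mountPath are assumptions.

```yaml
# Sketch: ovnkube-master persists the raft dbs on the host via hostPath.
volumes:
- name: etc-ovn
  hostPath:
    path: /var/lib/ovn/etc    # survives pod restarts, but only if the node keeps it
containers:
- name: nbdb
  volumeMounts:
  - name: etc-ovn
    mountPath: /etc/ovn       # where ovnnb_db.db / ovnsb_db.db live in the pod
```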

The silver lining might be that I am not sure we support cluster upgrades for OVN in 4.4/4.5, meaning this might not be a release blocker...?

Comment 10 Federico Paolinelli 2020-07-20 11:33:49 UTC
Following my offline conversation with Alexander, I think this was caused by having this https://github.com/openshift/cluster-network-operator/pull/688 in but not https://github.com/openshift/cluster-network-operator/pull/694 

Because of that, all but one master will wipe their dbs (due to the overly strict whitespace check fixed by the second PR), leading to the current state. The one that did not delete its db has different ids than the others.

I think this can also be reproduced on a 4.6 cluster by restarting all the masters at the same time.
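The whitespace-sensitivity Federico describes can be demonstrated in isolation: a grep pattern that is strict about spacing misses an otherwise-valid entry, so the db gets wiped even though the id was present. Everything below is a simulation; the real ids, file format, and the exact patterns from PRs 688/694 differ.

```shell
#!/bin/sh
# Simulated db file whose raft log records the server id with a single space.
db=/tmp/demo_raft.db
printf 'server_id c763\n' > "$db"

id=c763
# Over-strict pattern: two spaces between "server_id" and the id -> no match.
if ! grep -q "server_id  ${id}" "$db"; then
    echo "id not matched, wiping db (even though it was present)"
    rm -f "$db"
fi

# A whitespace-tolerant pattern would have matched and kept the db:
printf 'server_id c763\n' > "$db"
grep -Eq "server_id[[:space:]]+${id}" "$db" && echo "tolerant match: db kept"
```

Run on every master at once, a miss like this wipes all dbs except any whose file happens to match, which is exactly the mixed-id state seen in this bug.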

Comment 11 Alexander Constantinescu 2020-07-20 11:46:18 UTC
*** Bug 1858767 has been marked as a duplicate of this bug. ***

Comment 12 Lalatendu Mohanty 2020-07-20 12:18:17 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 14 W. Trevor King 2020-07-20 14:07:28 UTC
Cloned forward to 4.6 as bug 1858834.  That will let me reset the target here without Eric's bot clearing the target.

Comment 15 W. Trevor King 2020-07-20 14:14:05 UTC
> Is this a regression...

To this impact-statement-request question: the fix in flight reverts net-operator#693, which is new in 4.5 since 4.5.2. 4.5.3 hasn't been created yet, so it's currently a regression in the 4.5 branch which impacts no releases.

Comment 16 W. Trevor King 2020-07-20 14:15:51 UTC
Closing as a dup of bug 1854072, which has been moved back to ASSIGNED.

*** This bug has been marked as a duplicate of bug 1854072 ***

Comment 19 Anurag saxena 2020-07-20 19:56:50 UTC
Verifying based on above observations.

Comment 22 errata-xmlrpc 2020-07-30 18:56:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3028
