Version-Release number of selected component (if applicable):
Base version: 4.5.2-x86_64
Target version: 4.5.0-0.nightly-2020-07-18-024505

How reproducible:
always

Steps to Reproduce:
Use the upgrade CI to trigger an upgrade from 4.5.2-x86_64 to 4.5.0-0.nightly-2020-07-18-024505:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/3787/console
The upgrade eventually failed.

Actual Result:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.2     True        True          3h19m   Unable to apply 4.5.0-0.nightly-2020-07-18-024505: an unknown error has occurred: MultipleErrors

oc get co network -o yaml
status:
  conditions:
  - lastTransitionTime: "2020-07-20T04:43:27Z"
    message: |-
      DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2020-07-20T04:33:13Z
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2020-07-20T04:32:51Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-07-20T03:27:52Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2020-07-20T04:31:05Z"
    message: |-
      DaemonSet "openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" update is rolling out (4 out of 6 updated)

One multus pod is stuck in ContainerCreating with the error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-2wxwn_openshift-multus_7dc31947-f76a-4207-9288-38778b17eafe_0(be44e539169c537a7867a026c71109fb91d3b7086580e86471269665b7548578): Multus: [openshift-multus/multus-admission-controller-2wxwn]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-multus/multus-admission-controller-2wxwn] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int be44e539169c537 -- set interface be44e539169c537 external_ids:attached_mac=2e:a8:2b:82:00:04 external_ids:iface-id=openshift-multus_multus-admission-controller-2wxwn external_ids:ip_addresses=10.130.0.3/23 external_ids:sandbox=be44e539169c537a7867a026c71109fb91d3b7086580e86471269665b7548578': exit status 1 "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n"

oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-gjs4n   2/4     CrashLoopBackOff   39         3h      10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovnkube-master-gx75c   4/4     Running            0          3h2m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-master-mn5vc   4/4     Running            0          3h1m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-2bwg2     2/2     Running            0          3h2m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-7vszf     1/2     CrashLoopBackOff   33         3h1m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-node-8clcn     2/2     Running            0          3h49m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovnkube-node-srwtf     2/2     Running            0          3h2m    10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>
ovnkube-node-v8j5p     2/2     Running            0          3h2m    10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovnkube-node-zslrj     2/2     Running            0          4h5m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-cwfz2         1/1     Running            0          3h      10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovs-node-fvblz         1/1     Running            0          3h1m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovs-node-t8vl6         1/1     Running            0          179m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovs-node-thbn7         1/1     Running            0          3h2m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-vp2rp         1/1     Running            0          3h2m    10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovs-node-vwrps         1/1     Running            0          179m    10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>

oc logs -c ovnkube-master ovnkube-master-gjs4n -n openshift-ovn-kubernetes
+ [[ -f /env/_master ]]
+ hybrid_overlay_flags=
+ [[ -n '' ]]
++ ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.136.16:9641,ssl:10.0.178.208:9641,ssl:10.0.204.230:9641
2020-07-20T05:39:13Z|00184|stream_ssl|WARN|SSL_connect: unexpected SSL connection close

oc logs -c ovnkube-node ovnkube-node-7vszf -n openshift-ovn-kubernetes
.....
I0720 06:48:08.321931 404343 ovs.go:249] exec(122): stdout: "not connected\n"
I0720 06:48:08.321965 404343 ovs.go:250] exec(122): stderr: ""
I0720 06:48:08.321981 404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
I0720 06:48:08.792527 404343 ovs.go:246] exec(123): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.93522.ctl connection-status
I0720 06:48:08.820573 404343 ovs.go:249] exec(123): stdout: "not connected\n"
I0720 06:48:08.820724 404343 ovs.go:250] exec(123): stderr: ""
I0720 06:48:08.820748 404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
I0720 06:48:08.820767 404343 ovs.go:246] exec(124): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.93522.ctl connection-status
I0720 06:48:08.847272 404343 ovs.go:249] exec(124): stdout: "not connected\n"
I0720 06:48:08.847306 404343 ovs.go:250] exec(124): stderr: ""
I0720 06:48:08.847321 404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
F0720 06:48:08.847355 404343 ovnkube.go:129] timed out waiting sbdb for node ip-10-0-204-230.us-east-2.compute.internal: timed out waiting for the condition
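The ovn-controller "not connected" loop and the missing /var/run/openvswitch/db.sock point at the node losing both its OVS database and its southbound DB connection. A minimal diagnostic sketch (not part of the original report; pod and container names are taken from the listings above, and the ovnsb_db.ctl socket path is an assumption by analogy with the ovnnb_db.ctl path used later in this bug) to check the SB raft cluster from a running master and the OVS socket on the affected node's ovs-node pod:

# assumed SB equivalent of the NB cluster/status call used further down
oc -n openshift-ovn-kubernetes exec -c sbdb ovnkube-master-gx75c -- \
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound

# check whether the OVS db socket reported missing by multus actually exists on that node
oc -n openshift-ovn-kubernetes exec ovs-node-t8vl6 -- ls -l /var/run/openvswitch/db.sock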
Created attachment 1701717 [details]
Logs from the crashing pods
FYI, for tracking purposes:

oc get pod -owide
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-gjs4n   2/4     CrashLoopBackOff   63         4h58m   10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovnkube-master-gx75c   4/4     Running            0          5h      10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-master-mn5vc   4/4     Running            0          4h59m   10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-2bwg2     2/2     Running            0          5h      10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-7vszf     1/2     CrashLoopBackOff   52         4h59m   10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-node-8clcn     2/2     Running            0          5h47m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovnkube-node-srwtf     2/2     Running            0          4h59m   10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>
ovnkube-node-v8j5p     2/2     Running            0          5h      10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovnkube-node-zslrj     2/2     Running            0          6h3m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-cwfz2         1/1     Running            0          4h58m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovs-node-fvblz         1/1     Running            0          4h59m   10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovs-node-t8vl6         1/1     Running            0          4h57m   10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovs-node-thbn7         1/1     Running            0          5h      10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-vp2rp         1/1     Running            0          5h      10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovs-node-vwrps         1/1     Running            0          4h57m   10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>

aconstan@linux-3 ~/Downloads/tmp $ oc exec -ti -c nbdb ovnkube-master-gx75c -- bash
[root@ip-10-0-204-230 ~]# /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
3eb7
Name: OVN_Northbound
Cluster ID: 3fe2 (3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2)
Server ID: 3eb7 (3eb71f71-3761-4903-b721-05e532588025)
Address: ssl:10.0.204.230:9643
Status: cluster member
Role: candidate
Term: 3288
Leader: unknown
Vote: self
Election timer: 5000
Log: [2, 4742]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: (->0000) ->c763 <-1e5e
Servers:
    3eb7 (3eb7 at ssl:10.0.204.230:9643) (self) (voted for 3eb7)
    6932 (6932 at ssl:10.0.204.230:9643)
    9ece (9ece at ssl:10.0.136.16:9643)
    c763 (c763 at ssl:10.0.178.208:9643)

oc logs -c nbdb ovnkube-master-gx75c | less
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
+ serverid=' 6932'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * 6932('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.204.230 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.204.230
++ case "$1" in
++ echo 10.0.204.230
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.204.230 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:31:45Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:31:45Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:31:45Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:31:45Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:31:45Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:31:45Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:31:45Z|00007|raft|INFO|server 9ece is leader for term 3
2020-07-20T04:31:45Z|00008|raft|INFO|rejecting append_request because previous entry 3,4455 not in local log (mismatch past end of log)
2020-07-20T04:31:45Z|00009|raft|INFO|server 9ece added to configuration
2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
2020-07-20T04:31:45Z|00011|raft|INFO|server 6932 added to configuration
2020-07-20T04:31:45Z|00012|raft|INFO|Election timer changed from 5000 to 2000
2020-07-20T04:31:45Z|00013|raft|INFO|Election timer changed from 2000 to 4000
2020-07-20T04:31:45Z|00014|raft|INFO|Election timer changed from 4000 to 5000
2020-07-20T04:31:45Z|00015|reconnect|INFO|ssl:10.0.178.208:9643: connecting...
2020-07-20T04:31:45Z|00016|reconnect|INFO|ssl:10.0.178.208:9643: connected
2020-07-20T04:31:45Z|00017|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
2020-07-20T04:31:45Z|00018|raft|INFO|server 3eb7 added to configuration
2020-07-20T04:31:48Z|00019|raft|INFO|ssl:10.0.178.208:44378: syntax "{"address":"ssl:10.0.178.208:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"c7632632-4935-43a9-858b-7b6b8af1bec9","to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:31:48Z|00020|raft|INFO|ssl:10.0.136.16:53316: syntax "{"address":"ssl:10.0.136.16:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)

So somehow 10.0.204.230 was added twice to the cluster and the state is now incorrect:

> 2020-07-20T04:31:45Z|00017|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
> 2020-07-20T04:31:45Z|00018|raft|INFO|server 3eb7 added to configuration
oc logs -c nbdb ovnkube-master-mn5vc | head -n 100
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
+ serverid=' c763'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * c763('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.178.208 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.178.208
++ case "$1" in
++ echo 10.0.178.208
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.178.208 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:32:38Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:32:38Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:32:38Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:32:38Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:32:38Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:32:38Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:32:38Z|00007|raft|INFO|ssl:10.0.178.208:9643: learned new server address for joining cluster
2020-07-20T04:32:38Z|00008|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
2020-07-20T04:32:38Z|00009|reconnect|INFO|ssl:10.0.204.230:9643: connecting...
2020-07-20T04:32:38Z|00010|reconnect|INFO|ssl:10.0.204.230:9643: connected
2020-07-20T04:32:38Z|00011|raft|INFO|ssl:10.0.204.230:9643: learned server ID 3eb7
2020-07-20T04:32:39Z|00012|raft|INFO|ssl:10.0.204.230:60136: syntax "{"address":"ssl:10.0.204.230:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:40Z|00013|raft|INFO|ssl:10.0.136.16:34778: syntax "{"address":"ssl:10.0.136.16:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:45Z|00014|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":7,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:45Z|00015|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":7,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:48Z|00016|memory|INFO|18020 kB peak resident set size after 10.0 seconds
2020-07-20T04:32:48Z|00017|memory|INFO|cells:20 monitors:0
2020-07-20T04:32:50Z|00018|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":8,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:56Z|00019|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":9,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:33:01Z|00020|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":10,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:33:06Z|00021|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":11,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)

oc exec -ti -c nbdb ovnkube-master-mn5vc -- bash
[root@ip-10-0-178-208 ~]# ovsdb-tool -m show-log -v
record 0:
 name: "OVN_Northbound"
 local address: "ssl:10.0.178.208:9643"
 server_id: 1e5e
 remote_addresses: ssl:10.0.136.16:9643

[root@ip-10-0-178-208 ~]# /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
1e5e
Name: OVN_Northbound
Cluster ID: 3fe2 (3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2)
Server ID: 1e5e (1e5ed63a-1695-4965-9b88-be6d3bef021e)
Address: ssl:10.0.178.208:9643
Status: joining cluster
Remotes for joining: ssl:10.0.178.208:9643 ssl:10.0.204.230:9643 ssl:10.0.136.16:9643
Role: follower
Term: 0
Leader: unknown
Vote: unknown
Election timer: 1000
Log: [1, 1]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->0000) ->0000 <-0000
Servers:

So it seems something royally weird happened to this cluster: ovnkube-master-mn5vc has id 1e5e in cluster 3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2, while ovnkube-master-gx75c thinks that member has id c763 in the same cluster. Essentially every node has its own representation of what that cluster is and who its members are. Neither pod has ever been restarted, so it seems like the state was wrong from the start.
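A quick way to make that split visible (a sketch, not something run in the original report; it assumes the master pod names from the listings above, reuses the same ovn-appctl call, and only works for members whose nbdb container is currently up) is to dump each member's view of the NB cluster side by side; a healthy cluster would show the same Cluster ID and the same Servers list everywhere:

for pod in ovnkube-master-gjs4n ovnkube-master-gx75c ovnkube-master-mn5vc; do
  echo "== $pod =="
  oc -n openshift-ovn-kubernetes exec -c nbdb "$pod" -- \
    ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
done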
From #comment 4:

> + echo 'Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
> Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up

And:

> syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)

It seems like we deleted the old db file when starting this pod, but the other cluster members still have our old ID and are trying to communicate with it.
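If the old ID really is just a leftover in the other members' raft configuration, one possible manual cleanup (a hedged sketch, not something attempted in this bug; it needs a member that can still act as leader, and the stale ID shown here is the c763 from the logs above) would be to kick the stale server from the cluster so the rejoining member can be re-added cleanly:

# run inside the nbdb container of the member that is still leader
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound c763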
Also:

oc logs -c nbdb ovnkube-master-gx75c | head -n 100
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
+ serverid=' 6932'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * 6932('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.204.230 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.204.230
++ case "$1" in
++ echo 10.0.204.230
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.204.230 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:31:45Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:31:45Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:31:45Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:31:45Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:31:45Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:31:45Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:31:45Z|00007|raft|INFO|server 9ece is leader for term 3
2020-07-20T04:31:45Z|00008|raft|INFO|rejecting append_request because previous entry 3,4455 not in local log (mismatch past end of log)
2020-07-20T04:31:45Z|00009|raft|INFO|server 9ece added to configuration
2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
2020-07-20T04:31:45Z|00011|raft|INFO|server 6932 added to configuration

And further down in the log from the same pod:

8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:38Z|00526|raft|INFO|ssl:10.0.178.208:45270: learned server ID 1e5e
2020-07-20T04:32:38Z|00527|raft|INFO|ssl:10.0.178.208:45270: learned remote address ssl:10.0.178.208:9643
2020-07-20T04:32:38Z|00528|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:39Z|00529|reconnect|INFO|ssl:10.0.178.208:9643: connected
2020-07-20T04:32:39Z|00530|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:39Z|00531|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:40Z|00532|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":6,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:40Z|00533|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:40Z|00534|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:41Z|00535|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:41Z|00536|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:42Z|00537|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:42Z|00538|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:43Z|00539|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:43Z|00540|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:44Z|00541|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:45Z|00542|raft|INFO|term 7: 5662 ms timeout expired, starting election
2020-07-20T04:32:45Z|00543|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":7,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:45Z|00544|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:46Z|00545|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:47Z|00546|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:48Z|00547|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:50Z|00548|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":8,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:51Z|00549|raft|INFO|Dropped 2 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:32:51Z|00550|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:56Z|00551|raft|INFO|term 9: 5502 ms timeout expired, starting election
2020-07-20T04:32:57Z|00552|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:32:57Z|00553|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:33:01Z|00554|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":10,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:33:03Z|00555|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:33:03Z|00556|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:33:06Z|00557|raft|INFO|term 11: 5106 ms timeout expired, starting election
2020-07-20T04:33:09Z|00558|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:33:09Z|00559|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)

So

> 2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration

and

> 2020-07-20T04:32:38Z|00528|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)

tell me there's a race caused by the deletion of the DB file.

However, that does not explain this:

oc get pod -o yaml ovnkube-master-mn5vc | less
  - containerID: cri-o://f40e4f7ecf9d27a339e61844a6e19ea6ed8f1eafe5035362c2305ef89646e81b
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
    lastState: {}
    name: nbdb
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-07-20T04:32:36Z"

I don't understand how

> 2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration

could have happened when that pod was started at "2020-07-20T04:32:36Z"?? Neither of these pods has been restarted since.
Yes, more output from the pod ovnkube-master-mn5vc. Actually, no container had ever been started at 04:31:45:

containerStatuses:
- containerID: cri-o://f40e4f7ecf9d27a339e61844a6e19ea6ed8f1eafe5035362c2305ef89646e81b
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: nbdb
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:36Z"
- containerID: cri-o://87a547943bd6830da031f56a1c0f0dda0e4f55699e679a66141a7959712a2901
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: northd
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:36Z"
- containerID: cri-o://b2461eb0b82478b0fd7182d449e62564b905e6e5d8904150604c4b6dc3c6a48d
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: ovnkube-master
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:39Z"
- containerID: cri-o://d1e473efaadab2964311147deb59c001fb3d2dcdc323c641ace76b05c305fe5d
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: sbdb
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:37Z"
hostIP: 10.0.178.208
phase: Running
podIP: 10.0.178.208
podIPs:
- ip: 10.0.178.208
qosClass: Burstable
startTime: "2020-07-20T04:32:35Z"

So, it remains a mystery to me how leader ovnkube-master-gx75c ever received c763 (c763 at ssl:10.0.178.208:9643) as a cluster member at 04:31:45.
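For completeness, a small sketch (not from the original comment; it just uses standard oc jsonpath syntax with the pod names listed earlier) that prints the restart count and start time of every container in the three master pods in one shot, which is the information the timeline argument above relies on:

for pod in ovnkube-master-gjs4n ovnkube-master-gx75c ovnkube-master-mn5vc; do
  echo "== $pod =="
  oc -n openshift-ovn-kubernetes get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.state.running.startedAt}{"\n"}{end}'
done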
Ohhh, wait a minute.... From #comment 0:

> Use the upgrade ci to trigger upgrade from 4.5.2-x86_64 to 4.5.0-0.nightly-2020-07-18-024505

That could explain #comment 7. I.e., the 4.5.2-x86_64 cluster had member "c763 at ssl:10.0.178.208:9643". When that node in the new cluster 4.5.0-0.nightly-2020-07-18-024505 is started, it deletes its database, as we can see in #comment 4:

> + echo 'Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
> Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up

This then leads to the cluster inconsistencies seen later on.
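To make the wipe path explicit, here is a rough reconstruction of the startup check as it appears in the bash -x traces quoted above (an approximation reassembled from that output, not copied from the actual cluster-network-operator script): if the server_id extracted from the db log is not found in a prev_servers entry, the database file is simply removed, which is exactly the "cleaning up" line quoted here.

check_db_health() {
  local db=$1
  [[ ! -f "$db" ]] && return
  echo "Checking $db health"
  # the sed keeps the line's leading whitespace, so serverid comes out as e.g. ' c763'
  serverid=$(ovsdb-tool show-log "$db" | sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p')
  match=$(ovsdb-tool show-log "$db" | grep "prev_servers: *${serverid}(" || true)
  if [[ -z "$match" ]]; then
    echo "Current server_id${serverid} not found in ${db}, cleaning up"
    rm -- "$db"
  fi
}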
When the cluster is upgraded, do the nodes reboot? If that is the case and /var/lib/ovn/etc/ is wiped on the node (which mounts to /etc in the pod), then we have a problem, because our OVN cluster state will be wiped, leading to, well, all of this. The upside might be that I am not sure we support cluster upgrades for OVN in 4.4/4.5, meaning this might not be a release blocker...?
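One way to answer the "what backs the db directory" part of that question (a sketch; the daemonset name comes from the conditions quoted in the description, and the grep is just a convenience for eyeballing the mapping) is to look at the hostPath volumes of the ovnkube-master daemonset and check whether the pod's /etc/ovn maps to a path that survives a node reboot:

oc -n openshift-ovn-kubernetes get daemonset ovnkube-master -o yaml | grep -B2 -A3 'hostPath:'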
Following my offline conversation with Alexander, I think this was caused by having https://github.com/openshift/cluster-network-operator/pull/688 in but not https://github.com/openshift/cluster-network-operator/pull/694.

Because of that, all but one master will wipe their dbs (because of the strict check on the space, fixed by the second PR), leading to the current state. The one that did not delete its db has different IDs than the other ones.

I think this can also be reproduced on a 4.6 cluster by restarting all the masters at the same time.
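A hedged sketch of that reproduction attempt (assumption: the master pods carry an app=ovnkube-master label; if they don't, deleting the three pods by name as listed earlier achieves the same thing):

# restart all ovnkube-master pods at once, forcing every nbdb/sbdb member through the same startup check
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-master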
*** Bug 1858767 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
Cloned forward to 4.6 as bug 1858834. That will let me reset the target here without Eric's bot clearing the target.
> Is this a regression...

To this impact-statement-request question: the fix in flight reverts net-operator#693, which is new in 4.5 since 4.5.2. 4.5.3 hasn't been created yet, so it's currently a regression in the 4.5 branch which impacts no releases.
Closing as a dup of bug 1854072, which has been moved back to ASSIGNED.

*** This bug has been marked as a duplicate of bug 1854072 ***
Verifying based on above observations.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3028
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.