Bug 1858712
Summary: [OVN] 4.5.3 upgrade failure---some ovnkube-master and ovnkube-node is in CrashLoopBackOff

Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.5
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: Regression, Reopened, Upgrades
Target Milestone: ---
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified

Reporter: huirwang
Assignee: Aniket Bhat <anbhat>
QA Contact: Anurag saxena <anusaxen>
CC: aconstan, anbhat, bbennett, fpaoline, jiazha, lmohanty, mifiedle, vrutkovs, wking, xtian, zzhao

Doc Type: If docs needed, set a value
Clones: 1858834 (view as bug list)
Last Closed: 2020-07-30 18:56:59 UTC
Type: Bug
Bug Depends On: 1837953, 1858834, 1882569, 1885675, 1887462
Description
huirwang
2020-07-20 07:58:19 UTC
Created attachment 1701717 [details]
The crash pods logs
FYI, for tracking purposes:
oc get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ovnkube-master-gjs4n 2/4 CrashLoopBackOff 63 4h58m 10.0.136.16 ip-10-0-136-16.us-east-2.compute.internal <none> <none>
ovnkube-master-gx75c 4/4 Running 0 5h 10.0.204.230 ip-10-0-204-230.us-east-2.compute.internal <none> <none>
ovnkube-master-mn5vc 4/4 Running 0 4h59m 10.0.178.208 ip-10-0-178-208.us-east-2.compute.internal <none> <none>
ovnkube-node-2bwg2 2/2 Running 0 5h 10.0.178.208 ip-10-0-178-208.us-east-2.compute.internal <none> <none>
ovnkube-node-7vszf 1/2 CrashLoopBackOff 52 4h59m 10.0.204.230 ip-10-0-204-230.us-east-2.compute.internal <none> <none>
ovnkube-node-8clcn 2/2 Running 0 5h47m 10.0.135.242 ip-10-0-135-242.us-east-2.compute.internal <none> <none>
ovnkube-node-srwtf 2/2 Running 0 4h59m 10.0.165.9 ip-10-0-165-9.us-east-2.compute.internal <none> <none>
ovnkube-node-v8j5p 2/2 Running 0 5h 10.0.214.190 ip-10-0-214-190.us-east-2.compute.internal <none> <none>
ovnkube-node-zslrj 2/2 Running 0 6h3m 10.0.136.16 ip-10-0-136-16.us-east-2.compute.internal <none> <none>
ovs-node-cwfz2 1/1 Running 0 4h58m 10.0.135.242 ip-10-0-135-242.us-east-2.compute.internal <none> <none>
ovs-node-fvblz 1/1 Running 0 4h59m 10.0.178.208 ip-10-0-178-208.us-east-2.compute.internal <none> <none>
ovs-node-t8vl6 1/1 Running 0 4h57m 10.0.204.230 ip-10-0-204-230.us-east-2.compute.internal <none> <none>
ovs-node-thbn7 1/1 Running 0 5h 10.0.136.16 ip-10-0-136-16.us-east-2.compute.internal <none> <none>
ovs-node-vp2rp 1/1 Running 0 5h 10.0.214.190 ip-10-0-214-190.us-east-2.compute.internal <none> <none>
ovs-node-vwrps 1/1 Running 0 4h57m 10.0.165.9 ip-10-0-165-9.us-east-2.compute.internal <none> <none>
aconstan@linux-3 ~/Downloads/tmp $ oc exec -ti -c nbdb ovnkube-master-gx75c -- bash
[root@ip-10-0-204-230 ~]# /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
3eb7
Name: OVN_Northbound
Cluster ID: 3fe2 (3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2)
Server ID: 3eb7 (3eb71f71-3761-4903-b721-05e532588025)
Address: ssl:10.0.204.230:9643
Status: cluster member
Role: candidate
Term: 3288
Leader: unknown
Vote: self
Election timer: 5000
Log: [2, 4742]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: (->0000) ->c763 <-1e5e
Servers:
3eb7 (3eb7 at ssl:10.0.204.230:9643) (self) (voted for 3eb7)
6932 (6932 at ssl:10.0.204.230:9643)
9ece (9ece at ssl:10.0.136.16:9643)
c763 (c763 at ssl:10.0.178.208:9643)
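Note that in the Servers list above the same address, ssl:10.0.204.230:9643, appears under two different server IDs (3eb7 and 6932). A quick way to spot that kind of duplication (a sketch: `dup_addrs` is a hypothetical helper, fed here from a heredoc copy of the Servers section; a real check would pipe in the `cluster/status` output instead):

```shell
#!/bin/sh
# Print any address that appears under more than one server ID in the
# Servers section of `ovn-appctl ... cluster/status` output.
dup_addrs() {
  sed -n 's/.* at \(ssl:[0-9.]*:[0-9]*\)).*/\1/p' | sort | uniq -d
}

dup_addrs <<'EOF'
Servers:
    3eb7 (3eb7 at ssl:10.0.204.230:9643) (self) (voted for 3eb7)
    6932 (6932 at ssl:10.0.204.230:9643)
    9ece (9ece at ssl:10.0.136.16:9643)
    c763 (c763 at ssl:10.0.178.208:9643)
EOF
```

Against the status above this prints ssl:10.0.204.230:9643, the address that ended up in the membership twice.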
oc logs -c nbdb ovnkube-master-gx75c | less
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
+ serverid=' 6932'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * 6932('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.204.230 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.204.230
++ case "$1" in
++ echo 10.0.204.230
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.204.230 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
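The `+` lines above are bash xtrace output from the nbdb container's startup script: `check_db_health` extracts this server's ID from `ovsdb-tool show-log` and, when that ID does not appear in any `prev_servers` (cluster membership) record, deletes the database file so the server rejoins the cluster from scratch; `bracketify` prepares addresses for the `ovn-ctl` command line. A rough standalone sketch of that logic; `show_log` is a stand-in for `ovsdb-tool show-log` with made-up output, and the IPv6 behaviour of `bracketify` is an assumption (the trace only shows IPv4 passing through unchanged):

```shell
#!/bin/sh
# Stand-in for `ovsdb-tool show-log $db`; fabricated output in which this
# server's ID (6932) is absent from the prev_servers membership record.
show_log() {
  cat <<'EOF'
record 0:
 server_id: 6932
record 1:
 prev_servers: 9ece(ssl:10.0.136.16:9643), c763(ssl:10.0.178.208:9643)
EOF
}

check_db_health() {
  # Extract this server's ID (the capture keeps a leading space,
  # exactly as in the trace above, hence serverid=' 6932').
  serverid=$(show_log | sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p')
  # Is that ID listed as a member in any prev_servers record?
  match=$(show_log | grep "prev_servers:.*${serverid# }(" || true)
  if [ -z "$match" ]; then
    echo "Current server_id${serverid} not found, would clean up (rm db)"
  else
    echo "Current server_id${serverid} found, db looks healthy"
  fi
}

# Presumably wraps IPv6 literals in brackets for ssl:<addr>:<port> URLs;
# IPv4 addresses pass through unchanged, as the trace shows.
bracketify() {
  case "$1" in
    *:*) echo "[$1]" ;;
    *)   echo "$1" ;;
  esac
}

check_db_health
bracketify 10.0.136.16
```

The deletion branch is the interesting one for this bug: once `rm` runs, the server comes back with a brand-new identity while its peers still remember the old one.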
2020-07-20T04:31:45Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:31:45Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:31:45Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:31:45Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:31:45Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:31:45Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:31:45Z|00007|raft|INFO|server 9ece is leader for term 3
2020-07-20T04:31:45Z|00008|raft|INFO|rejecting append_request because previous entry 3,4455 not in local log (mismatch past end of log)
2020-07-20T04:31:45Z|00009|raft|INFO|server 9ece added to configuration
2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
2020-07-20T04:31:45Z|00011|raft|INFO|server 6932 added to configuration
2020-07-20T04:31:45Z|00012|raft|INFO|Election timer changed from 5000 to 2000
2020-07-20T04:31:45Z|00013|raft|INFO|Election timer changed from 2000 to 4000
2020-07-20T04:31:45Z|00014|raft|INFO|Election timer changed from 4000 to 5000
2020-07-20T04:31:45Z|00015|reconnect|INFO|ssl:10.0.178.208:9643: connecting...
2020-07-20T04:31:45Z|00016|reconnect|INFO|ssl:10.0.178.208:9643: connected
2020-07-20T04:31:45Z|00017|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
2020-07-20T04:31:45Z|00018|raft|INFO|server 3eb7 added to configuration
2020-07-20T04:31:48Z|00019|raft|INFO|ssl:10.0.178.208:44378: syntax "{"address":"ssl:10.0.178.208:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"c7632632-4935-43a9-858b-7b6b8af1bec9","to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:31:48Z|00020|raft|INFO|ssl:10.0.136.16:53316: syntax "{"address":"ssl:10.0.136.16:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
So somehow 10.0.204.230 was added twice to the cluster, and the state is now incorrect:
> 2020-07-20T04:31:45Z|00017|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
> 2020-07-20T04:31:45Z|00018|raft|INFO|server 3eb7 added to configuration
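The "misrouted message" errors in the logs above come from a raft server comparing an RPC's `to` field against its own server ID. A toy illustration (not OVN code) of why wiping a member's database breaks this: the restarted server picks a fresh ID, while its peers keep addressing it by the old one:

```shell
#!/bin/sh
# Toy model: a raft server rejects any RPC whose "to" field names a server
# ID other than its own, mirroring the log messages above.
handle_rpc() {
  my_id=$1
  to=$2
  if [ "$to" != "$my_id" ]; then
    echo "misrouted message (addressed to $to but we're $my_id)"
  else
    echo "ok"
  fi
}

# Peers still remember ssl:10.0.178.208:9643 as server c763, but after its
# db file was deleted the server there restarted with the fresh ID 1e5e,
# so every RPC addressed to c763 is now rejected:
handle_rpc 1e5e c763
```

Until the stale member is removed from the cluster or the new identity is admitted, the peers keep retrying against the old ID and the errors repeat indefinitely.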
oc logs -c nbdb ovnkube-master-mn5vc | head -n 100
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
+ serverid=' c763'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * c763('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.178.208 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.178.208
++ case "$1" in
++ echo 10.0.178.208
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.178.208 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:32:38Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:32:38Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:32:38Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:32:38Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:32:38Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:32:38Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:32:38Z|00007|raft|INFO|ssl:10.0.178.208:9643: learned new server address for joining cluster
2020-07-20T04:32:38Z|00008|raft|INFO|ssl:10.0.204.230:9643: learned new server address for joining cluster
2020-07-20T04:32:38Z|00009|reconnect|INFO|ssl:10.0.204.230:9643: connecting...
2020-07-20T04:32:38Z|00010|reconnect|INFO|ssl:10.0.204.230:9643: connected
2020-07-20T04:32:38Z|00011|raft|INFO|ssl:10.0.204.230:9643: learned server ID 3eb7
2020-07-20T04:32:39Z|00012|raft|INFO|ssl:10.0.204.230:60136: syntax "{"address":"ssl:10.0.204.230:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:40Z|00013|raft|INFO|ssl:10.0.136.16:34778: syntax "{"address":"ssl:10.0.136.16:9643","cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:45Z|00014|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":7,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:45Z|00015|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":7,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:48Z|00016|memory|INFO|18020 kB peak resident set size after 10.0 seconds
2020-07-20T04:32:48Z|00017|memory|INFO|cells:20 monitors:0
2020-07-20T04:32:50Z|00018|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":8,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:32:56Z|00019|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":9,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:33:01Z|00020|raft|INFO|ssl:10.0.136.16:34778: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":10,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)
2020-07-20T04:33:06Z|00021|raft|INFO|ssl:10.0.204.230:60136: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"3eb71f71-3761-4903-b721-05e532588025","last_log_index":4741,"last_log_term":3,"term":11,"to":"c7632632-4935-43a9-858b-7b6b8af1bec9"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)

oc exec -ti -c nbdb ovnkube-master-mn5vc -- bash
[root@ip-10-0-178-208 ~]# ovsdb-tool -m show-log -v
record 0:
 name: "OVN_Northbound"
 local address: "ssl:10.0.178.208:9643"
 server_id: 1e5e
 remote_addresses: ssl:10.0.136.16:9643
[root@ip-10-0-178-208 ~]# /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
1e5e
Name: OVN_Northbound
Cluster ID: 3fe2 (3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2)
Server ID: 1e5e (1e5ed63a-1695-4965-9b88-be6d3bef021e)
Address: ssl:10.0.178.208:9643
Status: joining cluster
Remotes for joining: ssl:10.0.178.208:9643 ssl:10.0.204.230:9643 ssl:10.0.136.16:9643
Role: follower
Term: 0
Leader: unknown
Vote: unknown
Election timer: 1000
Log: [1, 1]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->0000) ->0000 <-0000
Servers:

So it seems something royally weird happened to this cluster: ovnkube-master-mn5vc has id 1e5e in cluster-id 3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2, and ovnkube-master-gx75c thinks it has id c763 in the same cluster. Essentially every node has its own representation of what that cluster is and who its members are... Both pods have never been restarted, so it seems like the state was wrong from the start.

From #comment 4:
> + echo 'Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
> Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up

And:
> syntax error: Parsing raft hello_request RPC failed: misrouted message (addressed to c763 but we're 1e5e)

It seems like we deleted the old db file when starting this pod, but the other cluster members still have our old ID and are trying to communicate with it.

Also:

oc logs -c nbdb ovnkube-master-gx75c | head -n 100
+ [[ -f /env/_master ]]
+ check_db_health /etc/ovn/ovnnb_db.db
+ db=/etc/ovn/ovnnb_db.db
+ [[ ! -f /etc/ovn/ovnnb_db.db ]]
+ echo 'Checking /etc/ovn/ovnnb_db.db health'
Checking /etc/ovn/ovnnb_db.db health
++ sed -ne '/server_id:/s/server_id: *\([[:xdigit:]]\+\)/\1/p'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
+ serverid=' 6932'
++ ovsdb-tool show-log /etc/ovn/ovnnb_db.db
++ grep 'prev_servers: * 6932('
++ true
+ match=
+ [[ -z '' ]]
+ echo 'Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up'
Current server_id 6932 not found in /etc/ovn/ovnnb_db.db, cleaning up
+ rm -- /etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.136.16
+ [[ 10.0.204.230 == \1\0\.\0\.\1\3\6\.\1\6 ]]
++ bracketify 10.0.204.230
++ case "$1" in
++ echo 10.0.204.230
++ bracketify 10.0.136.16
++ case "$1" in
++ echo 10.0.136.16
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-remote-port=9643 --db-nb-cluster-local-addr=10.0.204.230 --db-nb-cluster-remote-addr=10.0.136.16 --no-monitor --db-nb-cluster-local-proto=ssl --db-nb-cluster-remote-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-07-20T04:31:45Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2020-07-20T04:31:45Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.0
2020-07-20T04:31:45Z|00003|reconnect|INFO|ssl:10.0.136.16:9643: connecting...
2020-07-20T04:31:45Z|00004|reconnect|INFO|ssl:10.0.136.16:9643: connected
2020-07-20T04:31:45Z|00005|raft_rpc|INFO|learned cluster ID 3fe2
2020-07-20T04:31:45Z|00006|raft|INFO|ssl:10.0.136.16:9643: learned server ID 9ece
2020-07-20T04:31:45Z|00007|raft|INFO|server 9ece is leader for term 3
2020-07-20T04:31:45Z|00008|raft|INFO|rejecting append_request because previous entry 3,4455 not in local log (mismatch past end of log)
2020-07-20T04:31:45Z|00009|raft|INFO|server 9ece added to configuration
2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
2020-07-20T04:31:45Z|00011|raft|INFO|server 6932 added to configuration

And further down in the log from the same pod:

8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:38Z|00526|raft|INFO|ssl:10.0.178.208:45270: learned server ID 1e5e
2020-07-20T04:32:38Z|00527|raft|INFO|ssl:10.0.178.208:45270: learned remote address ssl:10.0.178.208:9643
2020-07-20T04:32:38Z|00528|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:39Z|00529|reconnect|INFO|ssl:10.0.178.208:9643: connected
2020-07-20T04:32:39Z|00530|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:39Z|00531|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:40Z|00532|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":6,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:40Z|00533|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:40Z|00534|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:41Z|00535|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:41Z|00536|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:42Z|00537|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:42Z|00538|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:43Z|00539|raft|WARN|ssl:10.0.178.208:9643: ignoring message with unexpected server ID 1e5e (expected c763)
2020-07-20T04:32:43Z|00540|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:44Z|00541|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:45Z|00542|raft|INFO|term 7: 5662 ms timeout expired, starting election
2020-07-20T04:32:45Z|00543|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":7,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:45Z|00544|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:46Z|00545|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:47Z|00546|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:48Z|00547|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:50Z|00548|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":8,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:32:51Z|00549|raft|INFO|Dropped 2 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:32:51Z|00550|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:32:56Z|00551|raft|INFO|term 9: 5502 ms timeout expired, starting election
2020-07-20T04:32:57Z|00552|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:32:57Z|00553|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:33:01Z|00554|raft|INFO|ssl:10.0.136.16:53316: syntax "{"cluster":"3fe2eb09-d6d5-4f67-8dad-f7291de8c0e2","from":"9ece5c56-4b4d-40ae-88fb-653dc89e4342","last_log_index":4741,"last_log_term":3,"term":10,"to":"6932e2fa-9816-46a8-8dbd-b16b76d3e51b"}": syntax error: Parsing raft vote_request RPC failed: misrouted message (addressed to 6932 but we're 3eb7)
2020-07-20T04:33:03Z|00555|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:33:03Z|00556|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)
2020-07-20T04:33:06Z|00557|raft|INFO|term 11: 5106 ms timeout expired, starting election
2020-07-20T04:33:09Z|00558|raft|INFO|Dropped 5 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-20T04:33:09Z|00559|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to cluster 3fe2 failed (not leader)

So:
> 2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration

And:
> 2020-07-20T04:32:38Z|00528|raft|INFO|adding 1e5e (1e5e at ssl:10.0.178.208:9643) to
> cluster 3fe2 failed (not leader)

Tells me there's a race caused by the deletion of the DB file. However, that does not explain this:

oc get pod -o yaml ovnkube-master-mn5vc | less
- containerID: cri-o://f40e4f7ecf9d27a339e61844a6e19ea6ed8f1eafe5035362c2305ef89646e81b
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: nbdb
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:36Z"

I don't understand how
> 2020-07-20T04:31:45Z|00010|raft|INFO|server c763 added to configuration
could have happened, when that pod was started at "2020-07-20T04:32:36Z"?? Both of these pods have never been restarted since.

Yes, more output from the pod ovnkube-master-mn5vc. Actually, no container had ever been started at 04:31:45:

containerStatuses:
- containerID: cri-o://f40e4f7ecf9d27a339e61844a6e19ea6ed8f1eafe5035362c2305ef89646e81b
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: nbdb
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:36Z"
- containerID: cri-o://87a547943bd6830da031f56a1c0f0dda0e4f55699e679a66141a7959712a2901
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: northd
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:36Z"
- containerID: cri-o://b2461eb0b82478b0fd7182d449e62564b905e6e5d8904150604c4b6dc3c6a48d
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: ovnkube-master
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:39Z"
- containerID: cri-o://d1e473efaadab2964311147deb59c001fb3d2dcdc323c641ace76b05c305fe5d
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd9a322ccdd080b274bcf08f5742b72f6fdaa84a37a07da4e4e9a118135fc057
  lastState: {}
  name: sbdb
  ready: true
  restartCount: 0
  started: true
  state:
    running:
      startedAt: "2020-07-20T04:32:37Z"
hostIP: 10.0.178.208
phase: Running
podIP: 10.0.178.208
podIPs:
- ip: 10.0.178.208
qosClass: Burstable
startTime: "2020-07-20T04:32:35Z"

So, it remains a mystery to me how leader ovnkube-master-gx75c ever received c763 (c763 at ssl:10.0.178.208:9643) as a cluster member at 04:31:45.

Ohhh, wait a minute....

From #comment 0:
> Use the upgrade ci to trigger upgrade from 4.5.2-x86_64 to 4.5.0-0.nightly-2020-07-18-024505

That could explain #comment 7. I.e. the cluster at 4.5.2-x86_64 had member "c763 at ssl:10.0.178.208:9643". When that node in the new cluster 4.5.0-0.nightly-2020-07-18-024505 is started, it deletes its database, as we can see in #comment 4:
> + echo 'Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up'
> Current server_id c763 not found in /etc/ovn/ovnnb_db.db, cleaning up

This then leads to the cluster inconsistencies seen later on.

When the cluster is upgraded, do the nodes reboot? If that is the case and /var/lib/ovn/etc/ is wiped on the node (which mounts to /etc in the pod)...then we have a problem, because our OVN cluster state will be wiped, leading to, well: all of this.
The good point might be that I am not sure we support cluster upgrades for OVN in 4.4/4.5, meaning this might not be a release-blocker...?

Following my offline conversation with Alexander, I think this was caused by having https://github.com/openshift/cluster-network-operator/pull/688 in but not https://github.com/openshift/cluster-network-operator/pull/694. Because of that, all but one master will wipe their dbs (because of the strict check on the space fixed by the second PR), leading to the current state. The one that did not delete its db has different ids than the other ones. I think this can also be reproduced on a 4.6 cluster by restarting all the masters together.

*** Bug 1858767 has been marked as a duplicate of this bug. ***

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
example: Up to 2 minute disruption in edge routing
example: Up to 90 seconds of API downtime
example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
example: Issue resolves itself after five minutes
example: Admin uses oc to fix things
example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
example: No, it's always been like this, we just never noticed
example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Cloned forward to 4.6 as bug 1858834. That will let me reset the target here without Eric's bot clearing the target.

> Is this a regression...
To this impact-statement-request question: the fix in flight reverts net-operator#693, which is new in 4.5 since 4.5.2. 4.5.3 hasn't been created yet, so it's currently a regression in the 4.5 branch which impacts no releases.
Closing as a dup of bug 1854072, which has been moved back to ASSIGNED.

*** This bug has been marked as a duplicate of bug 1854072 ***

Verifying based on above observations.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3028

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.