I0327 02:40:09.927720 1 leaderelection.go:254] successfully acquired lease [openshift-ovn-kubernetes/ovn-kubernetes-master openshift-ovn-kubernetes/ovn-kubernetes-master]
I0327 02:40:09.927728 1 leaderelection.go:255] lock *resourcelock.ConfigMapLock &{{ovn-kubernetes-master openshift-ovn-kubernetes 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []} 0xc0000f5820 {ci-op-wyxgdrk7-a52b2-kh69k-master-0 <nil>} 0xc00245e500}
I0327 02:40:09.927917 1 master.go:81] Won leader election; in active mode
I0327 02:40:09.929200 1 master.go:243] Starting cluster master ...
I0327 02:40:10.230873 1 reflector.go:530] github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117: Watch close - *v1.EgressIP total 0 items received
I0327 02:40:10.445443 1 reflector.go:255] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:134
I0327 02:40:10.925366 1 reflector.go:255] Listing and watching *v1.NetworkPolicy from k8s.io/client-go/informers/factory.go:134
2021-03-27T02:40:11.594Z|00007|ovsdb_idl|INFO|ssl:10.0.0.5:9641: clustered database server is not cluster leader; trying another server
E0327 02:40:24.963872 1 master.go:278] Failed to upgrade OVN topology to version 4: failed in retrieving ovn_cluster_router to determine the current version of OVN logical topology: stderr: "2021-03-27T02:40:24Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name find logical_router name=ovn_cluster_router' failed: signal: alarm clock
panic: failed in retrieving ovn_cluster_router to determine the current version of OVN logical topology: stderr: "2021-03-27T02:40:24Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name find logical_router name=ovn_cluster_router' failed: signal: alarm clock

goroutine 700 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Start.func1(0x1e68ce0, 0xc00297d440)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/master.go:92 +0x26d
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113

Seen in upgrade job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/479/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1375609461648920576/

in ovnkube-master logs here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/479/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1375609461648920576/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-vl6lg_ovnkube-master_previous.log
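For context, the failing path is the topology-version check in master.go, which shells out to ovn-nbctl and panics if the lookup errors. A minimal Go sketch of that pattern follows; the helper name is illustrative rather than the actual ovn-kubernetes function, but the command line is copied verbatim from the error above:

package main

import (
	"fmt"
	"os/exec"
)

// getClusterRouterName approximates the failing path: look up
// ovn_cluster_router with a 15-second timeout. If the NB leader is
// unreachable for the whole window, ovn-nbctl's alarm fires
// (signal 14) and the command exits with an error.
func getClusterRouterName() (string, error) {
	out, err := exec.Command("/usr/bin/ovn-nbctl", "--timeout=15",
		"--data=bare", "--no-headings", "--columns=name",
		"find", "logical_router", "name=ovn_cluster_router").CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("failed in retrieving ovn_cluster_router: %v", err)
	}
	return string(out), nil
}

func main() {
	name, err := getClusterRouterName()
	if err != nil {
		// master.go:92 turns this error into a panic, crashing ovnkube-master.
		panic(err)
	}
	fmt.Println(name)
}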
On the NB side:

1) 02:40:08 - leader 10.0.0.3 is told to exit by CRIO
2) 02:40:11 - ovnkube master tries to talk to the database
   * presumably hangs trying to contact 10.0.0.3?
   * does ovn-nbctl handle leader elections in the middle of a DB access attempt?
3) 02:40:15 - 10.0.0.5 is elected leader
4) 02:40:24 - ovnkube-master exits with timeout error

(A timeline sketch correlating these events follows the logs below.)

ovnkube-master-4gldd_nbdb (10.0.0.3)

2021-03-27T02:40:08Z|00001|fatal_signal(log_fsync3)|WARN|terminating with signal 15 (Terminated)
...
2021-03-27T02:44:41Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-03-27T02:44:41Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.0
2021-03-27T02:44:41Z|00003|reconnect|INFO|ssl:10.0.0.5:9643: connecting...

ovnkube-master-4rxnb_nbdb (10.0.0.4)

2021-03-27T02:37:01Z|00015|memory|INFO|cells:6893 monitors:0 raft-connections:4 raft-log:368
2021-03-27T02:40:08Z|00016|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00017|jsonrpc|WARN|ssl:10.0.0.3:9643: receive error: Protocol error
2021-03-27T02:40:08Z|00018|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00019|jsonrpc|WARN|ssl:10.0.0.3:34174: receive error: Protocol error
2021-03-27T02:40:08Z|00020|reconnect|WARN|ssl:10.0.0.3:9643: connection dropped (Protocol error)
2021-03-27T02:40:08Z|00021|reconnect|WARN|ssl:10.0.0.3:34174: connection dropped (Protocol error)
2021-03-27T02:40:09Z|00022|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:10Z|00023|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:10Z|00024|reconnect|INFO|ssl:10.0.0.3:9643: waiting 2 seconds before reconnect
2021-03-27T02:40:12Z|00025|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:14Z|00026|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:14Z|00027|reconnect|INFO|ssl:10.0.0.3:9643: waiting 4 seconds before reconnect
2021-03-27T02:40:15Z|00028|raft|INFO|server 6059 is leader for term 5
2021-03-27T02:40:18Z|00029|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:22Z|00030|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:22Z|00031|reconnect|INFO|ssl:10.0.0.3:9643: continuing to reconnect in the background but suppressing further logging
2021-03-27T02:40:24Z|00032|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:24Z|00033|jsonrpc|WARN|ssl:10.0.0.5:52656: receive error: Protocol error
2021-03-27T02:40:24Z|00034|reconnect|WARN|ssl:10.0.0.5:52656: connection dropped (Protocol error)
2021-03-27T02:44:41Z|00035|raft|INFO|ssl:10.0.0.3:41030: learned server ID ddd0
2021-03-27T02:44:41Z|00036|raft|INFO|ssl:10.0.0.3:41030: learned remote address ssl:10.0.0.3:9643

ovnkube-master-vl6lg_nbdb (10.0.0.5)

2021-03-27T02:40:08Z|00035|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00036|jsonrpc|WARN|ssl:10.0.0.3:9643: receive error: Protocol error
2021-03-27T02:40:08Z|00037|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00038|jsonrpc|WARN|ssl:10.0.0.3:38212: receive error: Protocol error
2021-03-27T02:40:08Z|00039|reconnect|WARN|ssl:10.0.0.3:9643: connection dropped (Protocol error)
2021-03-27T02:40:08Z|00040|reconnect|WARN|ssl:10.0.0.3:38212: connection dropped (Protocol error)
2021-03-27T02:40:09Z|00041|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:10Z|00042|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:10Z|00043|reconnect|INFO|ssl:10.0.0.3:9643: waiting 2 seconds before reconnect
2021-03-27T02:40:12Z|00044|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:14Z|00045|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:14Z|00046|reconnect|INFO|ssl:10.0.0.3:9643: waiting 4 seconds before reconnect
2021-03-27T02:40:15Z|00047|raft|INFO|term 5: 10265 ms timeout expired, starting election
2021-03-27T02:40:15Z|00048|raft|INFO|term 5: elected leader by 2+ of 3 servers
2021-03-27T02:40:18Z|00049|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:22Z|00050|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:22Z|00051|reconnect|INFO|ssl:10.0.0.3:9643: continuing to reconnect in the background but suppressing further logging
2021-03-27T02:43:05Z|00052|raft|INFO|current entry eid bbd3c766-e9a1-43b6-84e1-1a4375d91d36 does not match prerequisite bafcbc55-00bb-428e-98eb-2c89636ccb92 in execute_command_request
2021-03-27T02:43:06Z|00053|raft|INFO|current entry eid 6e66bf68-f491-4320-83a2-52b3fd9cf79b does not match prerequisite fb2d21bf-9a2e-4741-8111-2cf061611beb in execute_command_request
2021-03-27T02:43:06Z|00054|raft|INFO|current entry eid be8b3a4b-f0eb-49f9-838c-5ad1a6c72fcd does not match prerequisite 739f4919-dea1-44fc-81a1-7ca398b67622 in execute_command_request
2021-03-27T02:43:06Z|00055|raft|INFO|current entry eid 323f5c6e-24e0-40fa-9680-4570cbe0313e does not match prerequisite 00a065eb-0db9-48f7-b6f2-d1e449582aac in execute_command_request
2021-03-27T02:43:06Z|00056|raft|INFO|current entry eid 60cf1f38-75eb-4f35-b7a9-d7f1d6b01669 does not match prerequisite 141f33ab-0266-4749-8109-9cec4b549fcd in execute_command_request
2021-03-27T02:44:41Z|00057|raft|INFO|ssl:10.0.0.3:44080: learned server ID ddd0
2021-03-27T02:44:41Z|00058|raft|INFO|ssl:10.0.0.3:44080: learned remote address ssl:10.0.0.3:9643
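To make the timeline in the list above concrete, here is a small Go sketch correlating the three key timestamps. The times are copied from the logs above; the interpretation (client waited ~16 seconds total, just past the 15-second nbctl timeout) is an assumption based on the traceback:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps taken from the nbdb and ovnkube-master logs above.
	leaderKilled, _ := time.Parse(time.RFC3339, "2021-03-27T02:40:08Z") // 10.0.0.3 terminated by CRIO
	newLeader, _ := time.Parse(time.RFC3339, "2021-03-27T02:40:15Z")    // 10.0.0.5 wins term 5
	clientGaveUp, _ := time.Parse(time.RFC3339, "2021-03-27T02:40:24Z") // nbctl alarm clock fires

	fmt.Println("raft re-election took:  ", newLeader.Sub(leaderKilled))   // 7s
	fmt.Println("client waited in total: ", clientGaveUp.Sub(leaderKilled)) // 16s, > 15s nbctl timeout
}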
(In reply to Dan Williams from comment #1)
> On the NB side:
>
> 1) 02:40:08 - leader 10.0.0.3 is told to exit by CRIO
> 2) 02:40:11 - ovnkube master tries to talk to the database
>    * presumably hangs trying to contact 10.0.0.3?
>    * does ovn-nbctl handle leader elections in the middle of a DB access
>      attempt?

ovn-nbctl should handle leader changes, reconnect, and retry the transaction. However, it's unclear whether the ovn-nbctl daemon was actually connected to the database or not. We have the following message:

02:40:11|00007|ovsdb_idl|INFO|ssl:10.0.0.5:9641: clustered database server is not cluster leader; trying another server

So, at least, the daemon was not connected at 02:40:11. Assuming there were two backoff intervals, i.e. 4+8 seconds or even 8+8 seconds, the connection would be established at 02:40:23-24 or even 02:40:27, respectively. Both might miss the 15-second timeout.

Unfortunately, ovn-nbctl configures the "reconnect:warn" log level by default. It would be good to replicate the issue with '-vreconnect:file:info' or dbg for the nbctl daemon. This way we can actually tell whether it was connected or not. @dcbw, is it possible?

I also created a BZ to enable 'info'-level logging for the ovn-nbctl daemon in OCP, as this is necessary for debugging purposes: https://bugzilla.redhat.com/show_bug.cgi?id=1944160
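The backoff arithmetic above can be checked with a short Go sketch. It reproduces the reconnect pattern visible in the logs (attempt timeouts of 1s, 2s, 4s with waits of 2s, 4s between them), doubling both until they cap at 8 seconds. The 8-second cap is an assumption based on common OVS reconnect defaults, not something confirmed by these logs:

package main

import "fmt"

func main() {
	t := 9 // seconds past 02:40:00; first reconnect attempt at 02:40:09 per the logs
	timeout, wait := 1, 2
	for i := 0; i < 4; i++ {
		fmt.Printf("attempt at 02:40:%02d, times out at 02:40:%02d, then wait %ds\n",
			t, t+timeout, wait)
		t += timeout + wait
		if timeout < 8 {
			timeout *= 2 // per-attempt timeout doubles: 1s, 2s, 4s, 8s
		}
		if wait < 8 {
			wait *= 2 // backoff wait doubles: 2s, 4s, 8s (assumed cap)
		}
	}
	// The earliest successful reconnect lands at ~02:40:23 or later, while the
	// 15-second ovn-nbctl timer that started around 02:40:09 fires at ~02:40:24,
	// consistent with the alarm-clock failure above.
}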
Posted PR https://github.com/openshift/cluster-network-operator/pull/1040 to enable info-level reconnect logging for the nbctl daemon.
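For anyone trying to reproduce this before that PR merges, the higher log level can also be set when launching the daemon by hand. A minimal Go sketch; only the -vreconnect:file:info flag comes from the comment above, and the rest of the invocation is a plausible daemon-mode startup rather than the exact CNO change:

package main

import (
	"log"
	"os/exec"
)

func main() {
	// Start the ovn-nbctl daemon with reconnect logging raised to info so
	// connection state changes are recorded in its log file.
	cmd := exec.Command("/usr/bin/ovn-nbctl",
		"--detach", "--pidfile", "--log-file",
		"-vreconnect:file:info")
	if err := cmd.Run(); err != nil {
		log.Fatalf("failed to start ovn-nbctl daemon: %v", err)
	}
}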
Created one more BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1944264

It covers incorrect process termination, which likely contributes to this kind of issue.
Not a blocker because the pod restarts and things recover.
We think this may have been fixed by the other PRs and we have not seen it lately.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.