Bug 1943800

Summary: [upgrade] Failed to upgrade OVN topology to version 4: failed in retrieving ovn_cluster_router
Product: OpenShift Container Platform
Reporter: Dan Williams <dcbw>
Component: Networking
Networking sub component: ovn-kubernetes
Assignee: Mohamed Mahmoud <mmahmoud>
QA Contact: Anurag Saxena <anusaxen>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
CC: anbhat, astoycos, bbennett, i.maximets, mmahmoud, mmichels, trozet
Version: 4.8
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-06-09 15:00:17 UTC
Type: Bug
Bug Depends On: 1944264

Description Dan Williams 2021-03-27 14:38:54 UTC
I0327 02:40:09.927720       1 leaderelection.go:254] successfully acquired lease [openshift-ovn-kubernetes/ovn-kubernetes-master openshift-ovn-kubernetes/ovn-kubernetes-master]
I0327 02:40:09.927728       1 leaderelection.go:255] lock *resourcelock.ConfigMapLock &{{ovn-kubernetes-master  openshift-ovn-kubernetes    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} 0xc0000f5820 {ci-op-wyxgdrk7-a52b2-kh69k-master-0 <nil>} 0xc00245e500}
I0327 02:40:09.927917       1 master.go:81] Won leader election; in active mode
I0327 02:40:09.929200       1 master.go:243] Starting cluster master
...
I0327 02:40:10.230873       1 reflector.go:530] github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117: Watch close - *v1.EgressIP total 0 items received
I0327 02:40:10.445443       1 reflector.go:255] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:134
I0327 02:40:10.925366       1 reflector.go:255] Listing and watching *v1.NetworkPolicy from k8s.io/client-go/informers/factory.go:134
2021-03-27T02:40:11.594Z|00007|ovsdb_idl|INFO|ssl:10.0.0.5:9641: clustered database server is not cluster leader; trying another server
E0327 02:40:24.963872       1 master.go:278] Failed to upgrade OVN topology to version 4: failed in retrieving ovn_cluster_router to determine the current version of OVN logical topology: stderr: "2021-03-27T02:40:24Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name find logical_router name=ovn_cluster_router' failed: signal: alarm clock
panic: failed in retrieving ovn_cluster_router to determine the current version of OVN logical topology: stderr: "2021-03-27T02:40:24Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name find logical_router name=ovn_cluster_router' failed: signal: alarm clock

goroutine 700 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Start.func1(0x1e68ce0, 0xc00297d440)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/master.go:92 +0x26d
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113


Seen in upgrade job:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/479/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1375609461648920576/

in ovnkube-master logs here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/479/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1375609461648920576/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-vl6lg_ovnkube-master_previous.log
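
For context, the failing call is the exact ovn-nbctl query shown in the panic above. The snippet below is a minimal, hypothetical sketch (not the actual ovn-kubernetes code; the retry count and sleep interval are assumptions) of how such a lookup could be retried instead of panicking while a raft leader election is still in progress:

```go
// Hypothetical sketch (not the ovn-kubernetes code): retry the nbctl lookup a few
// times so a transient raft leader election does not immediately panic the master.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// findClusterRouter runs the same ovn-nbctl query seen in the failure above and
// retries on error, since a 15-second client timeout can expire while the NB
// cluster is still electing a new leader.
func findClusterRouter(attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("/usr/bin/ovn-nbctl",
			"--timeout=15", "--data=bare", "--no-headings", "--columns=name",
			"find", "logical_router", "name=ovn_cluster_router").CombinedOutput()
		if err == nil {
			return strings.TrimSpace(string(out)), nil
		}
		lastErr = fmt.Errorf("attempt %d: %v: %s", i+1, err, out)
		time.Sleep(5 * time.Second) // give the raft election time to settle
	}
	return "", lastErr
}

func main() {
	name, err := findClusterRouter(3)
	if err != nil {
		fmt.Println("still failing after retries:", err)
		return
	}
	fmt.Println("found logical router:", name)
}
```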

Comment 1 Dan Williams 2021-03-27 14:53:16 UTC
On the NB side:

1) 02:40:08 - leader 10.0.0.3 is told to exit by CRIO
2) 02:40:11 - ovnkube master tries to talk to the database
    * presumably hangs trying to contact 10.0.0.3?
    * does ovn-nbctl handle leader elections in the middle of a DB access attempt?
3) 02:40:15 - 10.0.0.5 is elected leader
4) 02:40:24 - ovnkube-master exits with timeout error
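
One way to confirm which nbdb member holds raft leadership at a given point in this timeline is to query the raft status from inside an nbdb container. The following is a minimal illustrative sketch; the ovn-appctl control-socket path and database name are assumptions and may differ per deployment:

```go
// Hypothetical helper: print raft cluster status for the NB database so the
// current leader, term, and followers can be compared against the timeline above.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// cluster/status reports the raft role, term, and leader; the control-socket
	// path below is an assumption and may differ between deployments.
	out, err := exec.Command("ovn-appctl", "-t", "/var/run/ovn/ovnnb_db.ctl",
		"cluster/status", "OVN_Northbound").CombinedOutput()
	if err != nil {
		fmt.Println("cluster/status failed:", err)
	}
	fmt.Print(string(out))
}
```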

ovnkube-master-4gldd_nbdb (10.0.0.3)
2021-03-27T02:40:08Z|00001|fatal_signal(log_fsync3)|WARN|terminating with signal 15 (Terminated)
...
2021-03-27T02:44:41Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-03-27T02:44:41Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.0
2021-03-27T02:44:41Z|00003|reconnect|INFO|ssl:10.0.0.5:9643: connecting...


ovnkube-master-4rxnb_nbdb (10.0.0.4)
2021-03-27T02:37:01Z|00015|memory|INFO|cells:6893 monitors:0 raft-connections:4 raft-log:368
2021-03-27T02:40:08Z|00016|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00017|jsonrpc|WARN|ssl:10.0.0.3:9643: receive error: Protocol error
2021-03-27T02:40:08Z|00018|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00019|jsonrpc|WARN|ssl:10.0.0.3:34174: receive error: Protocol error
2021-03-27T02:40:08Z|00020|reconnect|WARN|ssl:10.0.0.3:9643: connection dropped (Protocol error)
2021-03-27T02:40:08Z|00021|reconnect|WARN|ssl:10.0.0.3:34174: connection dropped (Protocol error)
2021-03-27T02:40:09Z|00022|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:10Z|00023|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:10Z|00024|reconnect|INFO|ssl:10.0.0.3:9643: waiting 2 seconds before reconnect
2021-03-27T02:40:12Z|00025|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:14Z|00026|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:14Z|00027|reconnect|INFO|ssl:10.0.0.3:9643: waiting 4 seconds before reconnect
2021-03-27T02:40:15Z|00028|raft|INFO|server 6059 is leader for term 5
2021-03-27T02:40:18Z|00029|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:22Z|00030|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:22Z|00031|reconnect|INFO|ssl:10.0.0.3:9643: continuing to reconnect in the background but suppressing further logging
2021-03-27T02:40:24Z|00032|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:24Z|00033|jsonrpc|WARN|ssl:10.0.0.5:52656: receive error: Protocol error
2021-03-27T02:40:24Z|00034|reconnect|WARN|ssl:10.0.0.5:52656: connection dropped (Protocol error)
2021-03-27T02:44:41Z|00035|raft|INFO|ssl:10.0.0.3:41030: learned server ID ddd0
2021-03-27T02:44:41Z|00036|raft|INFO|ssl:10.0.0.3:41030: learned remote address ssl:10.0.0.3:9643

ovnkube-master-vl6lg_nbdb (10.0.0.5)
2021-03-27T02:40:08Z|00035|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00036|jsonrpc|WARN|ssl:10.0.0.3:9643: receive error: Protocol error
2021-03-27T02:40:08Z|00037|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00038|jsonrpc|WARN|ssl:10.0.0.3:38212: receive error: Protocol error
2021-03-27T02:40:08Z|00039|reconnect|WARN|ssl:10.0.0.3:9643: connection dropped (Protocol error)
2021-03-27T02:40:08Z|00040|reconnect|WARN|ssl:10.0.0.3:38212: connection dropped (Protocol error)
2021-03-27T02:40:09Z|00041|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:10Z|00042|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:10Z|00043|reconnect|INFO|ssl:10.0.0.3:9643: waiting 2 seconds before reconnect
2021-03-27T02:40:12Z|00044|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:14Z|00045|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:14Z|00046|reconnect|INFO|ssl:10.0.0.3:9643: waiting 4 seconds before reconnect
2021-03-27T02:40:15Z|00047|raft|INFO|term 5: 10265 ms timeout expired, starting election
2021-03-27T02:40:15Z|00048|raft|INFO|term 5: elected leader by 2+ of 3 servers
2021-03-27T02:40:18Z|00049|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:22Z|00050|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:22Z|00051|reconnect|INFO|ssl:10.0.0.3:9643: continuing to reconnect in the background but suppressing further logging
2021-03-27T02:43:05Z|00052|raft|INFO|current entry eid bbd3c766-e9a1-43b6-84e1-1a4375d91d36 does not match prerequisite bafcbc55-00bb-428e-98eb-2c89636ccb92 in execute_command_request
2021-03-27T02:43:06Z|00053|raft|INFO|current entry eid 6e66bf68-f491-4320-83a2-52b3fd9cf79b does not match prerequisite fb2d21bf-9a2e-4741-8111-2cf061611beb in execute_command_request
2021-03-27T02:43:06Z|00054|raft|INFO|current entry eid be8b3a4b-f0eb-49f9-838c-5ad1a6c72fcd does not match prerequisite 739f4919-dea1-44fc-81a1-7ca398b67622 in execute_command_request
2021-03-27T02:43:06Z|00055|raft|INFO|current entry eid 323f5c6e-24e0-40fa-9680-4570cbe0313e does not match prerequisite 00a065eb-0db9-48f7-b6f2-d1e449582aac in execute_command_request
2021-03-27T02:43:06Z|00056|raft|INFO|current entry eid 60cf1f38-75eb-4f35-b7a9-d7f1d6b01669 does not match prerequisite 141f33ab-0266-4749-8109-9cec4b549fcd in execute_command_request
2021-03-27T02:44:41Z|00057|raft|INFO|ssl:10.0.0.3:44080: learned server ID ddd0
2021-03-27T02:44:41Z|00058|raft|INFO|ssl:10.0.0.3:44080: learned remote address ssl:10.0.0.3:9643

Comment 2 Ilya Maximets 2021-03-29 12:42:24 UTC
(In reply to Dan Williams from comment #1)
> On the NB side:
> 
> 1) 02:40:08 - leader 10.0.0.3 is told to exit by CRIO
> 2) 02:40:11 - ovnkube master tries to talk to the database
>     * presumably hangs trying to contact 10.0.0.3?
>     * does ovn-nbctl handle leader elections in the middle of a DB access
> attempt?

ovn-nbctl should handle leader changes, reconnect, and retry the transaction.
However, it's unclear whether the ovn-nbctl daemon was actually connected to
the database or not.  We have the following message:

    02:40:11|00007|ovsdb_idl|INFO|ssl:10.0.0.5:9641: clustered database
                    server is not cluster leader; trying another server

So, at the very least, the daemon was not connected at 02:40:11.  Assuming two
more backoff intervals, i.e. 4+8 seconds or even 8+8 seconds, the connection
would be established at 02:40:23-24 or 02:40:27, respectively.  Either way it
could miss the 15-second timeout.
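
To make the timing concrete, here is a rough sketch assuming the backoff doubles per failed attempt and is capped at 8 seconds (matching the 2 s and 4 s waits visible in the nbdb logs above); the exact values are assumptions, not measurements:

```go
// Rough arithmetic sketch (assumes doubling backoff capped at 8s, matching the
// 2s/4s waits in the nbdb logs above): cumulative wait before the client can
// even retry a connection after the old leader disappears.
package main

import "fmt"

func main() {
	total := 0
	backoff := 2 // seconds, first wait seen in the logs
	for i := 0; i < 3; i++ {
		total += backoff
		fmt.Printf("after attempt %d: waited %ds total\n", i+1, total)
		backoff *= 2
		if backoff > 8 {
			backoff = 8
		}
	}
	// 2+4+8 = 14s of backoff alone, leaving ~1s for connection setup and the
	// actual transaction, so a 15-second --timeout can easily expire.
	fmt.Println("margin left within the 15s ovn-nbctl timeout:", 15-total, "seconds")
}
```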

Unfortunately, ovn-nbctl configures the "reconnect:warn" log level by default.
It would be good to replicate the issue with '-vreconnect:file:info' or dbg
for the nbctl daemon.  That way we can actually tell whether it was connected
or not.

@dcbw , is it possible?

I also created a BZ to enable 'info' level logging in OCP for the
ovn-nbctl daemon, as this is necessary for debugging:
  https://bugzilla.redhat.com/show_bug.cgi?id=1944160

Comment 3 Dan Williams 2021-03-29 13:51:55 UTC
Posted PR https://github.com/openshift/cluster-network-operator/pull/1040 to enable the reconnect info log level for the nbctl daemon.

Comment 4 Ilya Maximets 2021-03-29 15:43:37 UTC
Created one more BZ:
  https://bugzilla.redhat.com/show_bug.cgi?id=1944264

It is about incorrect process termination, which likely contributes to this
kind of issue.

Comment 6 Ben Bennett 2021-04-30 13:55:13 UTC
Not a blocker because the pod restarts and things recover.

Comment 8 Ben Bennett 2021-06-09 15:00:17 UTC
We think this may have been fixed by the other PRs and we have not seen it lately.

Comment 9 Red Hat Bugzilla 2023-09-15 01:04:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days