Bug 1943800 - [upgrade] Failed to upgrade OVN topology to version 4: failed in retrieving ovn_cluster_router
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mohamed Mahmoud
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1944264
Blocks:
 
Reported: 2021-03-27 14:38 UTC by Dan Williams
Modified: 2023-09-15 01:04 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-09 15:00:17 UTC
Target Upstream Version:
Embargoed:




Links
System                    ID       Priority     Status  Summary                                               Last Updated
Red Hat Bugzilla          1943363  unspecified  CLOSED  [ovn] CNO should gracefully terminate ovn-northd     2022-03-12 04:35:19 UTC
Red Hat Bugzilla          1944160  high         CLOSED  CNO: nbctl daemon should log reconnection info       2021-07-27 22:56:50 UTC
Red Hat Bugzilla          1944264  urgent       CLOSED  [ovn] CNO should gracefully terminate OVN databases  2022-08-10 10:36:37 UTC

Description Dan Williams 2021-03-27 14:38:54 UTC
I0327 02:40:09.927720       1 leaderelection.go:254] successfully acquired lease [openshift-ovn-kubernetes/ovn-kubernetes-master openshift-ovn-kubernetes/ovn-kubernetes-master]
I0327 02:40:09.927728       1 leaderelection.go:255] lock *resourcelock.ConfigMapLock &{{ovn-kubernetes-master  openshift-ovn-kubernetes    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} 0xc0000f5820 {ci-op-wyxgdrk7-a52b2-kh69k-master-0 <nil>} 0xc00245e500}
I0327 02:40:09.927917       1 master.go:81] Won leader election; in active mode
I0327 02:40:09.929200       1 master.go:243] Starting cluster master
...
I0327 02:40:10.230873       1 reflector.go:530] github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117: Watch close - *v1.EgressIP total 0 items received
I0327 02:40:10.445443       1 reflector.go:255] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:134
I0327 02:40:10.925366       1 reflector.go:255] Listing and watching *v1.NetworkPolicy from k8s.io/client-go/informers/factory.go:134
2021-03-27T02:40:11.594Z|00007|ovsdb_idl|INFO|ssl:10.0.0.5:9641: clustered database server is not cluster leader; trying another server
E0327 02:40:24.963872       1 master.go:278] Failed to upgrade OVN topology to version 4: failed in retrieving ovn_cluster_router to determine the current version of OVN logical topology: stderr: "2021-03-27T02:40:24Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name find logical_router name=ovn_cluster_router' failed: signal: alarm clock
panic: failed in retrieving ovn_cluster_router to determine the current version of OVN logical topology: stderr: "2021-03-27T02:40:24Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name find logical_router name=ovn_cluster_router' failed: signal: alarm clock

goroutine 700 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Start.func1(0x1e68ce0, 0xc00297d440)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/master.go:92 +0x26d
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113


Seen in upgrade job:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/479/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1375609461648920576/

in ovnkube-master logs here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/479/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade/1375609461648920576/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-vl6lg_ovnkube-master_previous.log
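For context, a minimal Go sketch (not the actual ovn-kubernetes code; getClusterRouterName is a hypothetical helper) of the kind of single-shot query that fails above: the topology-version check runs one ovn-nbctl query with a 15-second timeout, so if the NB database is mid leader election the query can still be waiting when the alarm fires and the whole master aborts instead of retrying.

    // Sketch only; mirrors the failing command from the panic above.
    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // getClusterRouterName (hypothetical) runs:
    //   ovn-nbctl --timeout=15 --data=bare --no-headings --columns=name \
    //             find logical_router name=ovn_cluster_router
    func getClusterRouterName() (string, error) {
        out, err := exec.Command("/usr/bin/ovn-nbctl",
            "--timeout=15", "--data=bare", "--no-headings", "--columns=name",
            "find", "logical_router", "name=ovn_cluster_router").CombinedOutput()
        if err != nil {
            // With --timeout=15, ovn-nbctl arms a 15 s alarm; if the NB DB is
            // unavailable (e.g. mid leader election) the process dies with
            // SIGALRM, which surfaces as "signal: alarm clock".
            return "", fmt.Errorf("failed in retrieving ovn_cluster_router: %s: %v",
                strings.TrimSpace(string(out)), err)
        }
        return strings.TrimSpace(string(out)), nil
    }

    func main() {
        name, err := getClusterRouterName()
        if err != nil {
            // In the trace above, the equivalent error propagates out of
            // Start() and the leader-election callback panics, restarting
            // ovnkube-master.
            panic(err)
        }
        fmt.Println("current cluster router:", name)
    }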

Comment 1 Dan Williams 2021-03-27 14:53:16 UTC
On the NB side:

1) 02:40:08 - leader 10.0.0.3 is told to exit by CRIO
2) 02:40:11 - ovnkube master tries to talk to the database
    * presumably hangs trying to contact 10.0.0.3?
    * does ovn-nbctl handle leader elections in the middle of a DB access attempt?
3) 02:40:15 - 10.0.0.5 is elected leader
4) 02:40:24 - ovnkube-master exits with timeout error

ovnkube-master-4gldd_nbdb (10.0.0.3)
2021-03-27T02:40:08Z|00001|fatal_signal(log_fsync3)|WARN|terminating with signal 15 (Terminated)
...
2021-03-27T02:44:41Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-03-27T02:44:41Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.0
2021-03-27T02:44:41Z|00003|reconnect|INFO|ssl:10.0.0.5:9643: connecting...


ovnkube-master-4rxnb_nbdb (10.0.0.4)
2021-03-27T02:37:01Z|00015|memory|INFO|cells:6893 monitors:0 raft-connections:4 raft-log:368
2021-03-27T02:40:08Z|00016|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00017|jsonrpc|WARN|ssl:10.0.0.3:9643: receive error: Protocol error
2021-03-27T02:40:08Z|00018|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00019|jsonrpc|WARN|ssl:10.0.0.3:34174: receive error: Protocol error
2021-03-27T02:40:08Z|00020|reconnect|WARN|ssl:10.0.0.3:9643: connection dropped (Protocol error)
2021-03-27T02:40:08Z|00021|reconnect|WARN|ssl:10.0.0.3:34174: connection dropped (Protocol error)
2021-03-27T02:40:09Z|00022|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:10Z|00023|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:10Z|00024|reconnect|INFO|ssl:10.0.0.3:9643: waiting 2 seconds before reconnect
2021-03-27T02:40:12Z|00025|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:14Z|00026|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:14Z|00027|reconnect|INFO|ssl:10.0.0.3:9643: waiting 4 seconds before reconnect
2021-03-27T02:40:15Z|00028|raft|INFO|server 6059 is leader for term 5
2021-03-27T02:40:18Z|00029|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:22Z|00030|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:22Z|00031|reconnect|INFO|ssl:10.0.0.3:9643: continuing to reconnect in the background but suppressing further logging
2021-03-27T02:40:24Z|00032|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:24Z|00033|jsonrpc|WARN|ssl:10.0.0.5:52656: receive error: Protocol error
2021-03-27T02:40:24Z|00034|reconnect|WARN|ssl:10.0.0.5:52656: connection dropped (Protocol error)
2021-03-27T02:44:41Z|00035|raft|INFO|ssl:10.0.0.3:41030: learned server ID ddd0
2021-03-27T02:44:41Z|00036|raft|INFO|ssl:10.0.0.3:41030: learned remote address ssl:10.0.0.3:9643

ovnkube-master-vl6lg_nbdb (10.0.0.5)
2021-03-27T02:40:08Z|00035|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00036|jsonrpc|WARN|ssl:10.0.0.3:9643: receive error: Protocol error
2021-03-27T02:40:08Z|00037|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-03-27T02:40:08Z|00038|jsonrpc|WARN|ssl:10.0.0.3:38212: receive error: Protocol error
2021-03-27T02:40:08Z|00039|reconnect|WARN|ssl:10.0.0.3:9643: connection dropped (Protocol error)
2021-03-27T02:40:08Z|00040|reconnect|WARN|ssl:10.0.0.3:38212: connection dropped (Protocol error)
2021-03-27T02:40:09Z|00041|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:10Z|00042|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:10Z|00043|reconnect|INFO|ssl:10.0.0.3:9643: waiting 2 seconds before reconnect
2021-03-27T02:40:12Z|00044|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:14Z|00045|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:14Z|00046|reconnect|INFO|ssl:10.0.0.3:9643: waiting 4 seconds before reconnect
2021-03-27T02:40:15Z|00047|raft|INFO|term 5: 10265 ms timeout expired, starting election
2021-03-27T02:40:15Z|00048|raft|INFO|term 5: elected leader by 2+ of 3 servers
2021-03-27T02:40:18Z|00049|reconnect|INFO|ssl:10.0.0.3:9643: connecting...
2021-03-27T02:40:22Z|00050|reconnect|INFO|ssl:10.0.0.3:9643: connection attempt timed out
2021-03-27T02:40:22Z|00051|reconnect|INFO|ssl:10.0.0.3:9643: continuing to reconnect in the background but suppressing further logging
2021-03-27T02:43:05Z|00052|raft|INFO|current entry eid bbd3c766-e9a1-43b6-84e1-1a4375d91d36 does not match prerequisite bafcbc55-00bb-428e-98eb-2c89636ccb92 in execute_command_request
2021-03-27T02:43:06Z|00053|raft|INFO|current entry eid 6e66bf68-f491-4320-83a2-52b3fd9cf79b does not match prerequisite fb2d21bf-9a2e-4741-8111-2cf061611beb in execute_command_request
2021-03-27T02:43:06Z|00054|raft|INFO|current entry eid be8b3a4b-f0eb-49f9-838c-5ad1a6c72fcd does not match prerequisite 739f4919-dea1-44fc-81a1-7ca398b67622 in execute_command_request
2021-03-27T02:43:06Z|00055|raft|INFO|current entry eid 323f5c6e-24e0-40fa-9680-4570cbe0313e does not match prerequisite 00a065eb-0db9-48f7-b6f2-d1e449582aac in execute_command_request
2021-03-27T02:43:06Z|00056|raft|INFO|current entry eid 60cf1f38-75eb-4f35-b7a9-d7f1d6b01669 does not match prerequisite 141f33ab-0266-4749-8109-9cec4b549fcd in execute_command_request
2021-03-27T02:44:41Z|00057|raft|INFO|ssl:10.0.0.3:44080: learned server ID ddd0
2021-03-27T02:44:41Z|00058|raft|INFO|ssl:10.0.0.3:44080: learned remote address ssl:10.0.0.3:9643

Comment 2 Ilya Maximets 2021-03-29 12:42:24 UTC
(In reply to Dan Williams from comment #1)
> On the NB side:
> 
> 1) 02:40:08 - leader 10.0.0.3 is told to exit by CRIO
> 2) 02:40:11 - ovnkube master tries to talk to the database
>     * presumably hangs trying to contact 10.0.0.3?
>     * does ovn-nbctl handle leader elections in the middle of a DB access
> attempt?

ovn-nbctl should handle leader changes, reconnect and retry the transaction.
However, it's unclear whether the ovn-nbctl daemon was actually connected to
the database or not.  We have the following message:

    02:40:11|00007|ovsdb_idl|INFO|ssl:10.0.0.5:9641: clustered database
                    server is not cluster leader; trying another server

So, at the very least, the daemon was not connected at 02:40:11.  Assuming
two more backoff intervals remained, i.e. 4+8 seconds or even 8+8 seconds,
the connection would only have been re-established at around 02:40:23-24 or
even 02:40:27, respectively.  Either way that can miss the 15 second timeout.
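A rough Go sketch of that arithmetic, assuming the remaining backoff intervals really were 4+8 or 8+8 seconds (the actual schedule is unknown without info-level reconnect logs):

    package main

    import "fmt"

    func main() {
        // Two possible remaining backoff sequences at 02:40:11, per the
        // estimate above, compared against the 15 s nbctl alarm that fired
        // at 02:40:24.
        for _, waits := range [][]int{{4, 8}, {8, 8}} {
            reconnectAt := 11
            for _, w := range waits {
                reconnectAt += w
            }
            fmt.Printf("remaining backoff %v s -> reconnected around 02:40:%d, "+
                "vs. the 15 s alarm firing at 02:40:24\n", waits, reconnectAt)
        }
    }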

Unfortunately ovn-nbctl configures the "reconnect:warn" log level by default.
It would be good to reproduce the issue with '-vreconnect:file:info' or dbg
for the nbctl daemon.  That way we could actually tell whether it was
connected or not.

@dcbw , is it possible?

I also filed a BZ to enable 'info'-level logging for the ovn-nbctl daemon in
OCP, since it is needed for debugging:
  https://bugzilla.redhat.com/show_bug.cgi?id=1944160

Comment 3 Dan Williams 2021-03-29 13:51:55 UTC
Posted PR https://github.com/openshift/cluster-network-operator/pull/1040 to enable the reconnect info log level for the nbctl daemon.

Comment 4 Ilya Maximets 2021-03-29 15:43:37 UTC
Created one more BZ:
  https://bugzilla.redhat.com/show_bug.cgi?id=1944264

It covers the incorrect process termination that likely contributes to this
kind of issue.

Comment 6 Ben Bennett 2021-04-30 13:55:13 UTC
Not a blocker because the pod restarts and things recover.

Comment 8 Ben Bennett 2021-06-09 15:00:17 UTC
We think this may have been fixed by the other PRs and we have not seen it lately.

Comment 9 Red Hat Bugzilla 2023-09-15 01:04:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

