Bug 1943363 - [ovn] CNO should gracefully terminate ovn-northd
Summary: [ovn] CNO should gracefully terminate ovn-northd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: ffernand
QA Contact: Arti Sood
URL:
Whiteboard:
Depends On:
Blocks: 1943566 2005818 2040530
 
Reported: 2021-03-25 21:41 UTC by Tim Rozet
Modified: 2022-05-12 00:41 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2040530
Environment:
Last Closed: 2022-03-12 04:34:58 UTC
Target Upstream Version:
Embargoed:


Attachments
logs, dbs (1.60 MB, application/gzip)
2021-03-26 14:43 UTC, Tim Rozet


Links
Github openshift cluster-network-operator pull 1221: Bug 1943363: ovn: try to gracefully terminate ovn-northd (open, last updated 2021-11-09 21:22:42 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:35:19 UTC)

Internal Links: 1943800

Description Tim Rozet 2021-03-25 21:41:47 UTC
Description of problem:
In OVN-Kubernetes OCP we see that, during this upgrade run, a pod CNI request times out waiting for the OVS port to come up:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-releas[…]de-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320

Relevant pod:
authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

We see in ovn-kube that the pod's logical port was added to the northbound DB at 11:30:14:
I0325 11:30:14.881154       1 pods.go:261] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5] addLogicalPort took 133.201809ms

and in southbound (CNI):
I0325 11:30:15.170172    6142 cni.go:187] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] ADD starting CNI request [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf]
I0325 11:30:37.391372    6142 cni.go:197] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] ADD finished CNI request [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf], result "", err failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
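
For reference, the ovn-installed check that CNI times out on can be reproduced by hand on the node. A rough sketch (the iface-id below comes from the pod above; the interface name is a placeholder):

  # find the OVS interface the CNI plugin created for this pod (iface-id is set by ovn-kubernetes)
  ovs-vsctl find Interface external_ids:iface-id=openshift-authentication_oauth-openshift-bdcf659df-tjbc5
  # ovn-controller adds this key to the same Interface record once the port binding is installed
  ovs-vsctl get Interface <iface-name> external_ids:ovn-installed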

However, in ovn-controller we see that the port binding happened 4 minutes later:
2021-03-25T11:34:26Z|00102|binding|INFO|Claiming lport openshift-authentication_oauth-openshift-bdcf659df-tjbc5 for this chassis.
2021-03-25T11:34:26Z|00103|binding|INFO|openshift-authentication_oauth-openshift-bdcf659df-tjbc5: Claiming 0a:58:0a:80:00:19 10.128.0.25

If we look at the OVN SBDBs, we can see some problems from 11:30 to 11:34:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-595x7_sbdb.log

2021-03-25T11:28:50Z|00021|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:28:50Z|00022|jsonrpc|WARN|ssl:10.0.0.3:9644: receive error: Connection reset by peer
2021-03-25T11:28:50Z|00023|reconnect|WARN|ssl:10.0.0.3:9644: connection dropped (Connection reset by peer)
2021-03-25T11:28:51Z|00024|reconnect|INFO|ssl:10.0.0.3:9644: connecting...
2021-03-25T11:28:51Z|00025|reconnect|INFO|ssl:10.0.0.3:9644: connected
2021-03-25T11:29:07Z|00026|raft|INFO|server 57c1 is leader for term 6
2021-03-25T11:30:02Z|00027|reconnect|ERR|ssl:10.0.0.5:35406: no response to inactivity probe after 60 seconds, disconnecting
2021-03-25T11:32:17Z|00028|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:17Z|00029|jsonrpc|WARN|ssl:10.0.0.5:9644: receive error: Connection reset by peer
2021-03-25T11:32:17Z|00030|reconnect|WARN|ssl:10.0.0.5:9644: connection dropped (Connection reset by peer)
2021-03-25T11:32:18Z|00031|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:18Z|00032|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:18Z|00033|reconnect|INFO|ssl:10.0.0.5:9644: waiting 2 seconds before reconnect
2021-03-25T11:32:20Z|00034|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:20Z|00035|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:20Z|00036|reconnect|INFO|ssl:10.0.0.5:9644: waiting 4 seconds before reconnect
2021-03-25T11:32:24Z|00037|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:24Z|00038|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:24Z|00039|reconnect|INFO|ssl:10.0.0.5:9644: continuing to reconnect in the background but suppressing further logging
2021-03-25T11:34:31Z|00040|raft|INFO|ssl:10.0.0.5:34652: learned server ID a74e
2021-03-25T11:34:31Z|00041|raft|INFO|ssl:10.0.0.5:34652: learned remote address ssl:10.0.0.5:9644
2021-03-25T11:34:32Z|00042|reconnect|INFO|ssl:10.0.0.5:9644: connected
2021-03-25T11:38:41Z|00043|ovsdb_server|INFO|compacting OVN_Southbound database by user request
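
As an aside, the 60-second figure in the inactivity-probe line above is configurable; assuming the SB DB remotes are set up through the OVSDB Connection table (an assumption about this deployment, not confirmed from the logs), the configured value can be inspected with:

  # show configured client connections; inactivity_probe is in milliseconds
  ovn-sbctl list Connection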

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-8vjlz_sbdb.log

2021-03-25T11:29:07Z|00032|raft|INFO|term 6: 16370 ms timeout expired, starting election
2021-03-25T11:29:07Z|00033|raft|INFO|term 6: elected leader by 2+ of 3 servers
2021-03-25T11:32:23Z|00034|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:23Z|00035|jsonrpc|WARN|ssl:10.0.0.5:9644: receive error: Connection reset by peer
2021-03-25T11:32:23Z|00036|reconnect|WARN|ssl:10.0.0.5:9644: connection dropped (Connection reset by peer)
2021-03-25T11:32:24Z|00037|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:24Z|00038|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:24Z|00039|reconnect|INFO|ssl:10.0.0.5:9644: waiting 2 seconds before reconnect
2021-03-25T11:32:26Z|00040|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:26Z|00041|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:26Z|00042|reconnect|INFO|ssl:10.0.0.5:9644: waiting 4 seconds before reconnect
2021-03-25T11:32:28Z|00043|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:28Z|00044|jsonrpc|WARN|ssl:10.0.0.5:40696: receive error: Connection reset by peer
2021-03-25T11:32:28Z|00045|reconnect|WARN|ssl:10.0.0.5:40696: connection dropped (Connection reset by peer)
2021-03-25T11:32:30Z|00046|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:30Z|00047|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:30Z|00048|reconnect|INFO|ssl:10.0.0.5:9644: continuing to reconnect in the background but suppressing further logging
2021-03-25T11:34:31Z|00049|raft|INFO|ssl:10.0.0.5:37584: learned server ID a74e
2021-03-25T11:34:31Z|00050|raft|INFO|ssl:10.0.0.5:37584: learned remote address ssl:10.0.0.5:9644
2021-03-25T11:34:38Z|00051|reconnect|INFO|ssl:10.0.0.5:9644: connected


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wvwwr_sbdb.log

2021-03-25T11:28:47Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
/usr/share/ovn/scripts/ovn-ctl: line 322:   104 Terminated              "$@" $file
2021-03-25T11:34:30+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=10.0.0.3
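
For context on the bug summary, here is a hedged sketch of asking ovn-northd to shut down cleanly (so its connection closes and the Sb DB lock is freed for a standby), assuming the default control socket under the OVN run directory; this is illustrative only, not necessarily what the CNO fix implements:

  # ask northd to exit cleanly so it releases the Sb DB lock before the pod goes away
  ovn-appctl -t ovn-northd exit
  # or fall back to the bundled helper script, which stops the daemon via its pidfile
  /usr/share/ovn/scripts/ovn-ctl stop_northd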

Comment 1 Tim Rozet 2021-03-25 21:43:50 UTC
Ilya can you please look into this?

Comment 2 Ilya Maximets 2021-03-26 13:24:46 UTC
From the logs it looks like a problem with database locking
in ovn-northd.  The following events occur:

10:59:47 sbdb on 10.0.0.5 is a new leader
10:59:48 northd on 10.0.0.3 acquired the lock --> active
10:59:48 northd on 10.0.0.3 lost the lock     --> standby
10:59:49 sbdb on 10.0.0.3 is a new leader
10:59:51 sbdb on 10.0.0.5 is a new leader
11:00:26 northd on 10.0.0.4 acquired the lock --> active
11:00:26 northd on 10.0.0.4 lost the lock     --> standby
11:12:51 northd on 10.0.0.4 terminated
11:18:33 northd on 10.0.0.4 started
11:19:09 sbdb on 10.0.0.5 is a new leader
11:21:17 northd on 10.0.0.3 terminated
11:23:06 northd on 10.0.0.5 acquired the lock --> active
<some leadership transfers and sbdb restarts>
11:26:16 northd on 10.0.0.3 started
11:26:16 northd on 10.0.0.3 acquired the lock --> active
11:26:16 northd on 10.0.0.3 lost the lock     --> standby
11:26:43 sbdb on 10.0.0.5 is a new leader
11:28:47 northd on 10.0.0.5 terminated  <-- this northd held the lock
11:29:07 sbdb on 10.0.0.3 is a new leader
11:30:14 ovn-kube addLogicalPort to Nb DB
11:34:25 northd on 10.0.0.5 started
11:34:25 northd on 10.0.0.5 acquired the lock --> active
11:34:26 ovn-controller claims the port

After the termination of northd on 10.0.0.5, no other northd acquired the
lock.  Only after that northd was brought back up about 6 minutes later
did it acquire the lock again by itself.  So it looks like no northd was
working for those 6 minutes, and the Nb DB update happened in exactly that
interval.  Once northd on 10.0.0.5 started and acquired the lock,
ovn-controller immediately claimed the port because the Sb DB got updated.

We need to investigate what happened to the northd lock and why the other
instances of northd didn't take it over.

Also, sequences like this are suspicious:
11:26:16 northd on 10.0.0.3 acquired the lock --> active
11:26:16 northd on 10.0.0.3 lost the lock     --> standby
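
For anyone retracing this timeline, each northd instance reports its own active/standby state over its control socket; a quick check, assuming the default socket location:

  # prints "Status: active" on the instance holding the Sb DB lock, "Status: standby" otherwise
  ovn-appctl -t ovn-northd status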

Comment 4 Tim Rozet 2021-03-26 14:43:58 UTC
Created attachment 1766664 [details]
logs, dbs

Comment 5 Ilya Maximets 2021-03-29 18:45:07 UTC
(In reply to Tim Rozet from comment #3)
> Ah thanks Ilya, so the problem is actually northd and not sbdb?

Yes, it's most likely an issue with northd, but it's hard to tell.

The active northd terminated at 11:28:47 and the others stayed in standby
mode for almost 6 minutes.  The one that had been active got restarted
at 11:34:25 and acquired the lock.  Right after that, ovn-controller
claimed the port because the Sb DB got updated.

There is actually an interesting thing:

11:28:47 -- Sb DB cluster leader on 10.0.0.5 terminated
            and active northd on 10.0.0.5 terminated too

But the northd logs from 10.0.0.3 and 10.0.0.4 end at 11:26 with a
successful connection to the Sb DB on 10.0.0.5 and no further entries:
10.0.0.3 last log from northd:
  2021-03-25T11:26:43Z|00037|reconnect|INFO|ssl:10.0.0.5:9641: connected
10.0.0.4 last log from northd:
  2021-03-25T11:26:50Z|00041|reconnect|INFO|ssl:10.0.0.5:9641: connected

So these northd instances never noticed that the active Sb DB went down.
That is probably why they didn't try to reconnect and didn't acquire
the lock.  OTOH, the abrupt end of the logs from these northd instances
makes me think that they just stopped working or disappeared.  They should
have inactivity probes enabled, so they should detect a broken
connection within a short period of time in any case.  So either we lost
some logs or the processes died silently, which is weird.

The Sb DB instances on 10.0.0.3 and 10.0.0.4 noticed the missed heartbeats
and initiated an election.  They didn't notice the broken connection,
though, since they don't have inactivity probes.

Some more investigation into what actually happened to northd is needed.

I'll update the title of this bug.

It would be great if we could reproduce the issue with the '-vjsonrpc:dbg'
log level on northd or figure out what happened to these processes.
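
If a live reproducer turns up, the extra jsonrpc logging can also be switched on for a running northd without restarting it (assuming the default control socket; roughly equivalent to starting with '-vjsonrpc:dbg'):

  # raise the jsonrpc log module to debug on a running northd
  ovn-appctl -t ovn-northd vlog/set jsonrpc:dbg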

Comment 6 Tim Rozet 2021-09-22 20:40:55 UTC
Moving this back into the unassigned bucket.

Comment 20 errata-xmlrpc 2022-03-12 04:34:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

