Bug 1943363

Summary: [ovn] CNO should gracefully terminate ovn-northd
Product: OpenShift Container Platform
Component: Networking
Sub component: ovn-kubernetes
Reporter: Tim Rozet <trozet>
Assignee: ffernand <ffernand>
QA Contact: Arti Sood <asood>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
CC: anusaxen, asood, astoycos, ctrautma, dcbw, ffernand, i.maximets
Version: 4.8
Keywords: Triaged
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Clones: 2040530
Bug Blocks: 1943566, 2005818, 2040530
Last Closed: 2022-03-12 04:34:58 UTC
Attachments: logs, dbs

Description Tim Rozet 2021-03-25 21:41:47 UTC
Description of problem:
In OVN-Kubernetes OCP we see that, during this upgrade run, a pod CNI request fails while waiting for the OVS port to be marked as up:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-releas[…]de-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320

Relevant pod:
authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
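
For context, the condition the CNI is waiting on is the ovn-installed marker that ovn-controller writes onto the pod's OVS interface, and it can be checked by hand on the node. A minimal sketch (the interface name below is a placeholder, not taken from this run):

  # The pod's OVS interface name is a placeholder here; look it up with
  # `ovs-vsctl list-ports br-int` or via external_ids:iface-id.
  IFACE=placeholder_pod_iface
  ovs-vsctl get Interface "$IFACE" external_ids:ovn-installed || true
  # Roughly what the CNI does: wait (with a timeout) for the marker to appear.
  ovs-vsctl --timeout=30 wait-until Interface "$IFACE" external_ids:ovn-installed=true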

In ovn-kube we see that the pod's logical port was added to the northbound DB at 11:30:14:
I0325 11:30:14.881154       1 pods.go:261] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5] addLogicalPort took 133.201809ms

and in the CNI logs on the node:
I0325 11:30:15.170172    6142 cni.go:187] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] ADD starting CNI request [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf]
I0325 11:30:37.391372    6142 cni.go:197] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] ADD finished CNI request [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf], result "", err failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

However, in ovn-controller we see that the port binding happened 4 minutes later:
2021-03-25T11:34:26Z|00102|binding|INFO|Claiming lport openshift-authentication_oauth-openshift-bdcf659df-tjbc5 for this chassis.
2021-03-25T11:34:26Z|00103|binding|INFO|openshift-authentication_oauth-openshift-bdcf659df-tjbc5: Claiming 0a:58:0a:80:00:19 10.128.0.25
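
To confirm what the southbound DB thinks about the binding, the Port_Binding row can be inspected directly; a sketch, run from a container where ovn-sbctl is already pointed at the SB DB:

  # Show which chassis, if any, has claimed the pod's logical port.
  ovn-sbctl --columns=logical_port,chassis find Port_Binding \
      logical_port=openshift-authentication_oauth-openshift-bdcf659df-tjbc5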

If we look at the OVN SB DB logs, we can see some problems from 11:30 to 11:34:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-595x7_sbdb.log

2021-03-25T11:28:50Z|00021|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:28:50Z|00022|jsonrpc|WARN|ssl:10.0.0.3:9644: receive error: Connection reset by peer
2021-03-25T11:28:50Z|00023|reconnect|WARN|ssl:10.0.0.3:9644: connection dropped (Connection reset by peer)
2021-03-25T11:28:51Z|00024|reconnect|INFO|ssl:10.0.0.3:9644: connecting...
2021-03-25T11:28:51Z|00025|reconnect|INFO|ssl:10.0.0.3:9644: connected
2021-03-25T11:29:07Z|00026|raft|INFO|server 57c1 is leader for term 6
2021-03-25T11:30:02Z|00027|reconnect|ERR|ssl:10.0.0.5:35406: no response to inactivity probe after 60 seconds, disconnecting
2021-03-25T11:32:17Z|00028|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:17Z|00029|jsonrpc|WARN|ssl:10.0.0.5:9644: receive error: Connection reset by peer
2021-03-25T11:32:17Z|00030|reconnect|WARN|ssl:10.0.0.5:9644: connection dropped (Connection reset by peer)
2021-03-25T11:32:18Z|00031|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:18Z|00032|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:18Z|00033|reconnect|INFO|ssl:10.0.0.5:9644: waiting 2 seconds before reconnect
2021-03-25T11:32:20Z|00034|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:20Z|00035|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:20Z|00036|reconnect|INFO|ssl:10.0.0.5:9644: waiting 4 seconds before reconnect
2021-03-25T11:32:24Z|00037|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:24Z|00038|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:24Z|00039|reconnect|INFO|ssl:10.0.0.5:9644: continuing to reconnect in the background but suppressing further logging
2021-03-25T11:34:31Z|00040|raft|INFO|ssl:10.0.0.5:34652: learned server ID a74e
2021-03-25T11:34:31Z|00041|raft|INFO|ssl:10.0.0.5:34652: learned remote address ssl:10.0.0.5:9644
2021-03-25T11:34:32Z|00042|reconnect|INFO|ssl:10.0.0.5:9644: connected
2021-03-25T11:38:41Z|00043|ovsdb_server|INFO|compacting OVN_Southbound database by user request
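
As an aside, the RAFT view of each SB DB member during windows like this can be dumped from its control socket; a sketch (the socket path is an assumption and may differ between deployments):

  # Dump this member's RAFT role, term, leader and connection state.
  # /var/run/ovn/ovnsb_db.ctl is an assumed path; adjust for the deployment.
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound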

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-8vjlz_sbdb.log

2021-03-25T11:29:07Z|00032|raft|INFO|term 6: 16370 ms timeout expired, starting election
2021-03-25T11:29:07Z|00033|raft|INFO|term 6: elected leader by 2+ of 3 servers
2021-03-25T11:32:23Z|00034|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:23Z|00035|jsonrpc|WARN|ssl:10.0.0.5:9644: receive error: Connection reset by peer
2021-03-25T11:32:23Z|00036|reconnect|WARN|ssl:10.0.0.5:9644: connection dropped (Connection reset by peer)
2021-03-25T11:32:24Z|00037|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:24Z|00038|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:24Z|00039|reconnect|INFO|ssl:10.0.0.5:9644: waiting 2 seconds before reconnect
2021-03-25T11:32:26Z|00040|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:26Z|00041|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:26Z|00042|reconnect|INFO|ssl:10.0.0.5:9644: waiting 4 seconds before reconnect
2021-03-25T11:32:28Z|00043|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:28Z|00044|jsonrpc|WARN|ssl:10.0.0.5:40696: receive error: Connection reset by peer
2021-03-25T11:32:28Z|00045|reconnect|WARN|ssl:10.0.0.5:40696: connection dropped (Connection reset by peer)
2021-03-25T11:32:30Z|00046|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:30Z|00047|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:30Z|00048|reconnect|INFO|ssl:10.0.0.5:9644: continuing to reconnect in the background but suppressing further logging
2021-03-25T11:34:31Z|00049|raft|INFO|ssl:10.0.0.5:37584: learned server ID a74e
2021-03-25T11:34:31Z|00050|raft|INFO|ssl:10.0.0.5:37584: learned remote address ssl:10.0.0.5:9644
2021-03-25T11:34:38Z|00051|reconnect|INFO|ssl:10.0.0.5:9644: connected


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wvwwr_sbdb.log

2021-03-25T11:28:47Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
/usr/share/ovn/scripts/ovn-ctl: line 322:   104 Terminated              "$@" $file
2021-03-25T11:34:30+00:00 - starting sbdb  CLUSTER_INITIATOR_IP=10.0.0.3
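
Signal 15 here is a plain SIGTERM from the container runtime. The fix tracked by this bug's title is to terminate ovn-northd gracefully; a minimal sketch of the kind of command such a shutdown path could run, assuming ovn-appctl and the packaged ovn-ctl script are available in the container (the actual CNO wiring is not shown here):

  # Hypothetical graceful-shutdown step for the northd container: ask the daemon
  # to exit cleanly so its Sb DB connection is closed (and its lock released)
  # right away, instead of standby instances finding out via probe timeouts.
  ovn-appctl -t ovn-northd exit

  # Equivalent via the packaged wrapper script shipped in the ovn rpm:
  /usr/share/ovn/scripts/ovn-ctl stop_northd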

Comment 1 Tim Rozet 2021-03-25 21:43:50 UTC
Ilya can you please look into this?

Comment 2 Ilya Maximets 2021-03-26 13:24:46 UTC
From the logs it seems like a problem with database locking
in ovn-northd.  The sequence of events is as follows:

10:59:47 sbdb on 10.0.0.5 is a new leader
10:59:48 northd on 10.0.0.3 acquired the lock --> active
10:59:48 northd on 10.0.0.3 lost the lock     --> standby
10:59:49 sbdb on 10.0.0.3 is a new leader
10:59:51 sbdb on 10.0.0.5 is a new leader
11:00:26 northd on 10.0.0.4 acquired the lock --> active
11:00:26 northd on 10.0.0.4 lost the lock     --> standby
11:12:51 northd on 10.0.0.4 terminated
11:18:33 northd on 10.0.0.4 started
11:19:09 sbdb on 10.0.0.5 is a new leader
11:21:17 northd on 10.0.0.3 terminated
11:23:06 northd on 10.0.0.5 acquired the lock --> active
<some leadership transfers and sbdb restarts>
11:26:16 northd on 10.0.0.3 started
11:26:16 northd on 10.0.0.3 acquired the lock --> active
11:26:16 northd on 10.0.0.3 lost the lock     --> standby
11:26:43 sbdb on 10.0.0.5 is a new leader
11:28:47 northd on 10.0.0.5 terminated  <-- this northd held the lock
11:29:07 sbdb on 10.0.0.3 is a new leader
11:30:14 ovn-kube addLogicalPort to Nb DB
11:34:25 northd on 10.0.0.5 started
11:34:25 northd on 10.0.0.5 acquired the lock --> active
11:34:26 ovn-controller claims the port

After the termination of northd on 10.0.0.5, no other northd acquired
the lock.  Only when that northd was brought back up 6 minutes later
did it acquire the lock again, by itself.  So it looks like no northd
was working for 6 minutes, and the Nb DB update happened within this
interval.  Once northd on 10.0.0.5 started and acquired the lock,
ovn-controller immediately claimed the port because the Sb DB got
updated.

We need to investigate what happened to the northd lock and why the
other instances of northd didn't take it over.

Also, sequences like this are suspicious:
11:26:16 northd on 10.0.0.3 acquired the lock --> active
11:26:16 northd on 10.0.0.3 lost the lock     --> standby
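
For reference while debugging: whether a given northd instance currently holds the Sb DB lock (i.e. is active) can be queried at runtime; a sketch, assuming the installed ovn-northd exposes the 'status' unixctl command:

  # Expected to print "Status: active" on the instance holding the Sb DB lock
  # and "Status: standby" on the others (output wording assumed).
  ovn-appctl -t ovn-northd status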

Comment 4 Tim Rozet 2021-03-26 14:43:58 UTC
Created attachment 1766664 [details]
logs, dbs

Comment 5 Ilya Maximets 2021-03-29 18:45:07 UTC
(In reply to Tim Rozet from comment #3)
> Ah thanks Ilya, so the problem is actually northd and not sbdb?

Yes, it's most likely an issue with northd, but it's hard to tell.

The active northd terminated at 11:28:47 and the others stayed in
standby mode for almost 6 minutes.  The one that had been active got
restarted at 11:34:25 and acquired the lock.  Right after that,
ovn-controller claimed the port because the Sb DB got updated.

There is actually an interesting thing:

11:28:47 -- Sb DB cluster leader on 10.0.0.5 terminated
            and active northd on 10.0.0.5 terminated too

But the northd logs from 10.0.0.3 and 10.0.0.4 end at 11:26 with a
successful connection to the Sb DB on 10.0.0.5 and no further entries:
10.0.0.3 last log from northd:
  2021-03-25T11:26:43Z|00037|reconnect|INFO|ssl:10.0.0.5:9641: connected
10.0.0.4 last log from northd:
  2021-03-25T11:26:50Z|00041|reconnect|INFO|ssl:10.0.0.5:9641: connected

So these northd instances never noticed that the active Sb DB went
down.  That is probably why they didn't try to reconnect and didn't
acquire the lock.  On the other hand, the abrupt end of the logs from
these northd instances makes me think that they just stopped working
or disappeared.  They should have inactivity probes enabled, so in any
case they should have detected the broken connection within a short
period of time.  So either we lost some logs or the processes died
silently, which is weird.

The Sb DB instances on 10.0.0.3 and 10.0.0.4 noticed the missed
heartbeats and initiated an election.  They didn't notice the broken
connection itself, though, but they don't have inactivity probes.
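
For reference, the inactivity probe interval that northd uses towards the NB/SB databases is configurable from the NB DB; a sketch, assuming an OVN version that reads northd_probe_interval from NB_Global options (value in milliseconds):

  # Make northd probe its NB/SB DB connections every 30 seconds.
  # northd_probe_interval is assumed to be read from NB_Global:options (ms).
  ovn-nbctl set NB_Global . options:northd_probe_interval=30000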

Some more investigation into what actually happened to northd is needed.

I'll update the title of this bug.

It would be great if we could reproduce the issue with the
'-vjsonrpc:dbg' log level on northd, or figure out what happened to
these processes.
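
For a northd that is already running, the same verbosity can be turned on without a restart; a sketch:

  # Equivalent of starting northd with -vjsonrpc:dbg, applied at runtime;
  # revert later with: ovn-appctl -t ovn-northd vlog/set jsonrpc:info
  ovn-appctl -t ovn-northd vlog/set jsonrpc:dbg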

Comment 6 Tim Rozet 2021-09-22 20:40:55 UTC
Moving this back into the unassigned bucket.

Comment 20 errata-xmlrpc 2022-03-12 04:34:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056