Description of problem:

In OVN-Kubernetes OCP we see that during this upgrade run a pod CNI request fails because the OVS port is never reported as up:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-releas[…]de-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320

Relevant pod: openshift-authentication/oauth-openshift-bdcf659df-tjbc5

[openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

We see in ovn-kube that the pod was wired at 11:30:14 in our northbound:

I0325 11:30:14.881154 1 pods.go:261] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5] addLogicalPort took 133.201809ms

and in southbound (CNI):

I0325 11:30:15.170172 6142 cni.go:187] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] ADD starting CNI request [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf]
I0325 11:30:37.391372 6142 cni.go:197] [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf] ADD finished CNI request [openshift-authentication/oauth-openshift-bdcf659df-tjbc5 07a320ce7387e01c3f26459243927623035daf9d36f3b5063ee0dc9602bd5ecf], result "", err failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

However in ovn-controller, we see the port bind happened 4 minutes later:

2021-03-25T11:34:26Z|00102|binding|INFO|Claiming lport openshift-authentication_oauth-openshift-bdcf659df-tjbc5 for this chassis.
2021-03-25T11:34:26Z|00103|binding|INFO|openshift-authentication_oauth-openshift-bdcf659df-tjbc5: Claiming 0a:58:0a:80:00:19 10.128.0.25

If we look at the OVN SBDBs, we can see some problems from 11:30->11:34:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-595x7_sbdb.log

2021-03-25T11:28:50Z|00021|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:28:50Z|00022|jsonrpc|WARN|ssl:10.0.0.3:9644: receive error: Connection reset by peer
2021-03-25T11:28:50Z|00023|reconnect|WARN|ssl:10.0.0.3:9644: connection dropped (Connection reset by peer)
2021-03-25T11:28:51Z|00024|reconnect|INFO|ssl:10.0.0.3:9644: connecting...
2021-03-25T11:28:51Z|00025|reconnect|INFO|ssl:10.0.0.3:9644: connected
2021-03-25T11:29:07Z|00026|raft|INFO|server 57c1 is leader for term 6
2021-03-25T11:30:02Z|00027|reconnect|ERR|ssl:10.0.0.5:35406: no response to inactivity probe after 60 seconds, disconnecting
2021-03-25T11:32:17Z|00028|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:17Z|00029|jsonrpc|WARN|ssl:10.0.0.5:9644: receive error: Connection reset by peer
2021-03-25T11:32:17Z|00030|reconnect|WARN|ssl:10.0.0.5:9644: connection dropped (Connection reset by peer)
2021-03-25T11:32:18Z|00031|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:18Z|00032|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:18Z|00033|reconnect|INFO|ssl:10.0.0.5:9644: waiting 2 seconds before reconnect
2021-03-25T11:32:20Z|00034|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:20Z|00035|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:20Z|00036|reconnect|INFO|ssl:10.0.0.5:9644: waiting 4 seconds before reconnect
2021-03-25T11:32:24Z|00037|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:24Z|00038|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:24Z|00039|reconnect|INFO|ssl:10.0.0.5:9644: continuing to reconnect in the background but suppressing further logging
2021-03-25T11:34:31Z|00040|raft|INFO|ssl:10.0.0.5:34652: learned server ID a74e
2021-03-25T11:34:31Z|00041|raft|INFO|ssl:10.0.0.5:34652: learned remote address ssl:10.0.0.5:9644
2021-03-25T11:34:32Z|00042|reconnect|INFO|ssl:10.0.0.5:9644: connected
2021-03-25T11:38:41Z|00043|ovsdb_server|INFO|compacting OVN_Southbound database by user request

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-8vjlz_sbdb.log

2021-03-25T11:29:07Z|00032|raft|INFO|term 6: 16370 ms timeout expired, starting election
2021-03-25T11:29:07Z|00033|raft|INFO|term 6: elected leader by 2+ of 3 servers
2021-03-25T11:32:23Z|00034|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:23Z|00035|jsonrpc|WARN|ssl:10.0.0.5:9644: receive error: Connection reset by peer
2021-03-25T11:32:23Z|00036|reconnect|WARN|ssl:10.0.0.5:9644: connection dropped (Connection reset by peer)
2021-03-25T11:32:24Z|00037|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:24Z|00038|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:24Z|00039|reconnect|INFO|ssl:10.0.0.5:9644: waiting 2 seconds before reconnect
2021-03-25T11:32:26Z|00040|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:26Z|00041|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:26Z|00042|reconnect|INFO|ssl:10.0.0.5:9644: waiting 4 seconds before reconnect
2021-03-25T11:32:28Z|00043|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2021-03-25T11:32:28Z|00044|jsonrpc|WARN|ssl:10.0.0.5:40696: receive error: Connection reset by peer
2021-03-25T11:32:28Z|00045|reconnect|WARN|ssl:10.0.0.5:40696: connection dropped (Connection reset by peer)
2021-03-25T11:32:30Z|00046|reconnect|INFO|ssl:10.0.0.5:9644: connecting...
2021-03-25T11:32:30Z|00047|reconnect|INFO|ssl:10.0.0.5:9644: connection attempt failed (Connection refused)
2021-03-25T11:32:30Z|00048|reconnect|INFO|ssl:10.0.0.5:9644: continuing to reconnect in the background but suppressing further logging
2021-03-25T11:34:31Z|00049|raft|INFO|ssl:10.0.0.5:37584: learned server ID a74e
2021-03-25T11:34:31Z|00050|raft|INFO|ssl:10.0.0.5:37584: learned remote address ssl:10.0.0.5:9644
2021-03-25T11:34:38Z|00051|reconnect|INFO|ssl:10.0.0.5:9644: connected

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wvwwr_sbdb.log

2021-03-25T11:28:47Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
/usr/share/ovn/scripts/ovn-ctl: line 322: 104 Terminated "$@" $file
2021-03-25T11:34:30+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.0.3
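For context, the "ovn-installed" key the CNI request times out on is an external-ids entry that ovn-controller writes on the pod's OVS interface once it has claimed the lport and programmed its flows. Below is a minimal sketch of that kind of wait loop, assuming polling via ovs-vsctl and using a made-up interface name and timeout; this is not the actual ovn-kubernetes code path, just an illustration of the check that timed out:

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForOVNInstalled polls the OVS interface until ovn-controller marks it
// with external_ids:ovn-installed, i.e. the condition the CNI ADD above was
// waiting for when it timed out.
func waitForOVNInstalled(ifaceName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("ovs-vsctl", "--timeout=5", "--if-exists",
			"get", "Interface", ifaceName, "external_ids:ovn-installed").CombinedOutput()
		// The key is only present once ovn-controller has bound the port.
		if err == nil && strings.Contains(string(out), "true") {
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("timed out while waiting for OVS port binding on %s", ifaceName)
}

func main() {
	// "pod-iface-0" is a placeholder interface name, not taken from this run.
	if err := waitForOVNInstalled("pod-iface-0", 20*time.Second); err != nil {
		fmt.Println(err)
	}
}

In this run the key could only appear once ovn-controller claimed the port at 11:34:26, so any such wait started around 11:30 was bound to time out.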
Ilya, can you please look into this?
From the logs it looks like a problem with database locking in ovn-northd. The sequence of events is as follows:

10:59:47 sbdb on 10.0.0.5 is a new leader
10:59:48 northd on 10.0.0.3 acquired the lock --> active
10:59:48 northd on 10.0.0.3 lost the lock --> standby
10:59:49 sbdb on 10.0.0.3 is a new leader
10:59:51 sbdb on 10.0.0.5 is a new leader
11:00:26 northd on 10.0.0.4 acquired the lock --> active
11:00:26 northd on 10.0.0.4 lost the lock --> standby
11:12:51 northd on 10.0.0.4 terminated
11:18:33 northd on 10.0.0.4 started
11:19:09 sbdb on 10.0.0.5 is a new leader
11:21:17 northd on 10.0.0.3 terminated
11:23:06 northd on 10.0.0.5 acquired the lock --> active
<some leadership transfers and sbdb restarts>
11:26:16 northd on 10.0.0.3 started
11:26:16 northd on 10.0.0.3 acquired the lock --> active
11:26:16 northd on 10.0.0.3 lost the lock --> standby
11:26:43 sbdb on 10.0.0.5 is a new leader
11:28:47 northd on 10.0.0.5 terminated   <-- this northd held the lock
11:29:07 sbdb on 10.0.0.3 is a new leader
11:30:14 ovn-kube addLogicalPort to Nb DB
11:34:25 northd on 10.0.0.5 started
11:34:25 northd on 10.0.0.5 acquired the lock --> active
11:34:26 ovn-controller claims the port

After the northd on 10.0.0.5 terminated, no other northd acquired the lock; only when that same northd came back up about 6 minutes later did it acquire the lock itself. So it looks like no northd was doing any work for 6 minutes, and the Nb DB update happened in exactly that interval. Once northd on 10.0.0.5 started and acquired the lock, ovn-controller immediately claimed the port because the Sb DB got updated.

We need to investigate what happened to the northd lock and why the other northd instances didn't take it over. Also, sequences like this are suspicious:

11:26:16 northd on 10.0.0.3 acquired the lock --> active
11:26:16 northd on 10.0.0.3 lost the lock --> standby
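For reference on what "the lock" is here: northd instances arbitrate active/standby through an OVSDB database lock, using the JSON-RPC "lock" method and the "locked"/"stolen" notifications defined in RFC 7047. The sketch below shows that exchange in isolation; the unix socket path and the lock name "ovn_northd" are assumptions for illustration, not taken from these logs:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net"
)

// jsonrpcMsg is a minimal OVSDB JSON-RPC message (RFC 7047).
type jsonrpcMsg struct {
	Method string        `json:"method,omitempty"`
	Params []interface{} `json:"params,omitempty"`
	ID     interface{}   `json:"id"`
	Result interface{}   `json:"result,omitempty"`
	Error  interface{}   `json:"error,omitempty"`
}

func main() {
	// Assumption: a locally reachable OVSDB unix socket. In this cluster the
	// DBs are only reachable over SSL, which would need TLS setup instead.
	conn, err := net.Dial("unix", "/var/run/ovn/ovnsb_db.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The lock is advisory: the server grants it to one client at a time.
	// If another client already holds it, the result is {"locked": false}
	// and a "locked" notification arrives later when ownership is granted;
	// an owner that loses it to a "steal" gets a "stolen" notification.
	req := jsonrpcMsg{Method: "lock", Params: []interface{}{"ovn_northd"}, ID: 1}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		log.Fatal(err)
	}

	var resp jsonrpcMsg
	if err := json.NewDecoder(conn).Decode(&resp); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("lock reply: %+v\n", resp.Result)
}

A standby northd only becomes active when the database grants it this lock, so the open question above is effectively why none of the surviving instances was granted (or even asked for) the lock during the 6-minute gap.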
Ah thanks Ilya, so the problem is actually northd and not sbdb? Feel free to update the title of this bug to be more accurate if so. I'm guessing you were able to find all the other logs, like northd, on your own; I'll just add the links for those logs as well as the DBs here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-595x7_northd.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-8vjlz_northd.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wvwwr_northd.log

Will attach DBs, OVN logs, show outputs, etc.
Created attachment 1766664 [details] logs, dbs
(In reply to Tim Rozet from comment #3)
> Ah thanks Ilya, so the problem is actually northd and not sbdb?

Yes, it's most likely an issue with northd, but it's hard to tell. The active northd terminated at 11:28:47 and the others stayed in standby mode for almost 6 minutes. The one that had been active was restarted at 11:34:25 and acquired the lock; right after that, ovn-controller claimed the port because the Sb DB got updated.

There is actually an interesting thing:

11:28:47 -- the Sb DB cluster leader on 10.0.0.5 terminated, and the active northd on 10.0.0.5 terminated too.

But the northd logs from 10.0.0.3 and 10.0.0.4 end at 11:26 with a successful connection to the Sb DB on 10.0.0.5 and nothing further:

10.0.0.3 last log from northd:
2021-03-25T11:26:43Z|00037|reconnect|INFO|ssl:10.0.0.5:9641: connected

10.0.0.4 last log from northd:
2021-03-25T11:26:50Z|00041|reconnect|INFO|ssl:10.0.0.5:9641: connected

So these northd instances never noticed that the active Sb DB went down. That is probably why they didn't try to reconnect and didn't acquire the lock. OTOH, the way the logs from these northd instances simply stop makes me think they themselves stopped working/disappeared. They should have inactivity probes enabled, so in any case they should have detected the broken connection within a short period of time. So either we lost some logs, or the processes died silently, which is weird.

The Sb DB instances on 10.0.0.3 and 10.0.0.4 noticed the missed heartbeats and initiated an election. They didn't notice the broken connection either, but they don't have inactivity probes.

Some more investigation is needed into what actually happened to northd. I'll update the title of this bug. It would be great if we could reproduce the issue with '-vjsonrpc:dbg' log level on northd or figure out what happened to these processes.
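On the inactivity-probe point: at the OVSDB JSON-RPC level a probe is just a periodic "echo" request (RFC 7047), and a client that gets no reply within its probe interval declares the session dead and reconnects. Below is a rough standalone illustration of that logic, not northd's actual implementation; the socket path and 5-second interval are arbitrary, and a real client would demultiplex replies from the server's own requests and notifications rather than decode blindly:

package main

import (
	"encoding/json"
	"log"
	"net"
	"time"
)

// probe sends one OVSDB JSON-RPC "echo" request and waits up to `interval`
// for a reply. No reply in time means the connection is considered broken
// and should be torn down and re-established.
func probe(conn net.Conn, interval time.Duration) error {
	if err := conn.SetDeadline(time.Now().Add(interval)); err != nil {
		return err
	}
	req := map[string]interface{}{"method": "echo", "params": []interface{}{}, "id": "probe"}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return err
	}
	var reply map[string]interface{}
	return json.NewDecoder(conn).Decode(&reply)
}

func main() {
	// Assumption: a locally reachable OVSDB unix socket; in this cluster the
	// DBs are only reachable over SSL, which would need TLS configuration.
	conn, err := net.Dial("unix", "/var/run/ovn/ovnnb_db.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for {
		if err := probe(conn, 5*time.Second); err != nil {
			log.Printf("probe failed (%v): the connection would be dropped and re-established here", err)
			return
		}
		time.Sleep(5 * time.Second)
	}
}

If logic like this had been running in the standby northds, they should have dropped the stale connection to 10.0.0.5 within seconds, which is why the silent logs after 11:26 are suspicious.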
Moving this back into the unassigned bucket.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056