Bug 1943363
| Field | Value |
| --- | --- |
| Summary | [ovn] CNO should gracefully terminate ovn-northd |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Version | 4.8 |
| Target Release | 4.10.0 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Tim Rozet <trozet> |
| Assignee | ffernand <ffernand> |
| QA Contact | Arti Sood <asood> |
| CC | anusaxen, asood, astoycos, ctrautma, dcbw, ffernand, i.maximets |
| Keywords | Triaged |
| Type | Bug |
| Doc Type | No Doc Update |
| Cloned to | 2040530 (view as bug list) |
| Bug Blocks | 1943566, 2005818, 2040530 |
| Last Closed | 2022-03-12 04:34:58 UTC |
Description (Tim Rozet, 2021-03-25 21:41:47 UTC)
Ilya, can you please look into this?

From the logs it seems like a problem with database locking in ovn-northd. The following events occur:

    10:59:47  sbdb on 10.0.0.5 is a new leader
    10:59:48  northd on 10.0.0.3 acquired the lock --> active
    10:59:48  northd on 10.0.0.3 lost the lock --> standby
    10:59:49  sbdb on 10.0.0.3 is a new leader
    10:59:51  sbdb on 10.0.0.5 is a new leader
    11:00:26  northd on 10.0.0.4 acquired the lock --> active
    11:00:26  northd on 10.0.0.4 lost the lock --> standby
    11:12:51  northd on 10.0.0.4 terminated
    11:18:33  northd on 10.0.0.4 started
    11:19:09  sbdb on 10.0.0.5 is a new leader
    11:21:17  northd on 10.0.0.3 terminated
    11:23:06  northd on 10.0.0.5 acquired the lock --> active
    <some leadership transfers and sbdb restarts>
    11:26:16  northd on 10.0.0.3 started
    11:26:16  northd on 10.0.0.3 acquired the lock --> active
    11:26:16  northd on 10.0.0.3 lost the lock --> standby
    11:26:43  sbdb on 10.0.0.5 is a new leader
    11:28:47  northd on 10.0.0.5 terminated   <-- this northd held the lock
    11:29:07  sbdb on 10.0.0.3 is a new leader
    11:30:14  ovn-kube addLogicalPort to Nb DB
    11:34:25  northd on 10.0.0.5 started
    11:34:25  northd on 10.0.0.5 acquired the lock --> active
    11:34:26  ovn-controller claims the port

After the northd on 10.0.0.5 terminated, no other northd acquired the lock. Only when that same northd came back up about six minutes later did it acquire the lock again. So it looks like no northd was doing any work for those six minutes, and the Nb DB update happened within that interval. Once the northd on 10.0.0.5 started and acquired the lock, ovn-controller immediately claimed the port because the Sb DB got updated.

We need to investigate what happened to the northd lock and why the other northd instances didn't take it over. Also, sequences like this are suspicious:

    11:26:16  northd on 10.0.0.3 acquired the lock --> active
    11:26:16  northd on 10.0.0.3 lost the lock --> standby

Ah, thanks Ilya, so the problem is actually northd and not sbdb? Feel free to update the title of this bug to be more accurate if so. I'm guessing you were able to find all the other logs, like the northd ones, on your own. I'll just add the links for those logs, as well as the dbs, here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-595x7_northd.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-8vjlz_northd.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1375019219732664320/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wvwwr_northd.log

Will attach DBs, OVN logs, show outputs, etc.

Created attachment 1766664 [details]
logs, dbs
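A side note on tooling (not part of the original report): the active/standby transitions that the timeline above reconstructs from the logs can also be observed directly, because ovn-northd exposes a `status` unixctl command that reports whether the instance currently holds the Sb DB lock. Below is a minimal, illustrative Go sketch; it assumes `ovn-appctl` is on PATH and the default ovn-northd control socket is reachable, for example when run inside the northd container of an ovnkube-master pod.

```go
// Hypothetical helper, not part of CNO or ovn-kubernetes: poll the local
// ovn-northd unixctl interface and print active/standby transitions, so the
// lock hand-offs inferred from the logs can be correlated in real time.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// northdStatus shells out to `ovn-appctl -t ovn-northd status`, which
// reports "Status: active" or "Status: standby" for this instance.
func northdStatus() (string, error) {
	out, err := exec.Command("ovn-appctl", "-t", "ovn-northd", "status").CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("ovn-appctl status: %v (%s)", err, strings.TrimSpace(string(out)))
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	last := ""
	for {
		status, err := northdStatus()
		if err != nil {
			fmt.Printf("%s error: %v\n", time.Now().UTC().Format(time.RFC3339), err)
		} else if status != last {
			// Log only changes, e.g. "Status: active" -> "Status: standby".
			fmt.Printf("%s %s\n", time.Now().UTC().Format(time.RFC3339), status)
			last = status
		}
		time.Sleep(time.Second)
	}
}
```

Running something like this next to each of the three northd instances during an upgrade would show whether any standby ever attempts to take the lock after the active instance disappears, which is exactly the gap between 11:28:47 and 11:34:25 in the timeline above.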
(In reply to Tim Rozet from comment #3)
> Ah thanks Ilya, so the problem is actually northd and not sbdb?

Yes, it's most likely an issue with northd, but it's hard to tell. The active northd terminated at 11:28:47 and the others stayed in standby mode for almost 6 minutes. The one that had been active was restarted at 11:34:25 and acquired the lock; right after that, ovn-controller claimed the port because the Sb DB got updated.

There is actually an interesting detail: at 11:28:47 the Sb DB cluster leader on 10.0.0.5 terminated, and the active northd on 10.0.0.5 terminated too. But the logs of the northd instances on 10.0.0.3 and 10.0.0.4 end at 11:26 with a successful connection to the Sb DB on 10.0.0.5 and nothing further:

    10.0.0.3, last log from northd:
    2021-03-25T11:26:43Z|00037|reconnect|INFO|ssl:10.0.0.5:9641: connected

    10.0.0.4, last log from northd:
    2021-03-25T11:26:50Z|00041|reconnect|INFO|ssl:10.0.0.5:9641: connected

So these northd instances never noticed that the active Sb DB went down, which is probably why they didn't try to reconnect and didn't acquire the lock. On the other hand, the way the logs from these instances simply stop makes me think they just stopped working or disappeared. They should have inactivity probes enabled, so in any case they should have detected the broken connection within a short period of time. So either we lost some logs or the processes died silently, which is weird.

The Sb DB instances on 10.0.0.3 and 10.0.0.4 did notice the missed heartbeats and initiated an election. They didn't notice the broken connection either, but they don't have inactivity probes.

More investigation into what actually happened to northd is needed. I'll update the title of this bug. It would be great if we could reproduce the issue with the '-vjsonrpc:dbg' log level on northd, or otherwise figure out what happened to these processes.

Moving this back into the unassigned bucket.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
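For context on the direction the bug summary points at ("CNO should gracefully terminate ovn-northd"), here is an illustrative sketch only, not the change that actually shipped in CNO or ovn-kubernetes: a minimal Go wrapper that runs ovn-northd as a child process and forwards SIGTERM to it, so that on pod deletion northd can shut down cleanly and its Sb DB connection (and therefore the northd lock) is released promptly instead of lingering until the peers detect the dead connection. The ovn-northd command-line flags shown are assumptions for the sketch, not the arguments used by the real manifests.

```go
// Illustrative only: start ovn-northd as a child and relay termination
// signals to it, rather than letting the container runtime SIGKILL it while
// it still holds the Sb DB lock.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Assumed flags for the sketch; the real deployment passes its own.
	cmd := exec.Command("ovn-northd", "--no-chdir", "--pidfile")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting ovn-northd: %v", err)
	}

	// Forward SIGTERM/SIGINT to the child so it can exit gracefully and
	// close its Sb DB connection, releasing the northd lock.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		sig := <-sigs
		log.Printf("forwarding %v to ovn-northd (pid %d)", sig, cmd.Process.Pid)
		_ = cmd.Process.Signal(sig)
	}()

	// Exit with a status that reflects how northd stopped.
	if err := cmd.Wait(); err != nil {
		log.Printf("ovn-northd exited: %v", err)
		os.Exit(1)
	}
}
```

A container entrypoint whose shell script ends in `exec ovn-northd ...` gives the same signal delivery without a wrapper; the sketch just makes the forwarding explicit.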