Bug 1944264
| Summary: | [ovn] CNO should gracefully terminate OVN databases | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Ilya Maximets <i.maximets> |
| Component: | Networking | Assignee: | jamo luhrsen <jluhrsen> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | anbhat, bbennett, bzvonar, cgaynor, dcbw, ffernand, jluhrsen, kurathod, openshift-bugs-escalate |
| Version: | 4.7 | Flags: | jluhrsen: needinfo- (×4) |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:36:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1943566, 1943800 | | |
Description: Ilya Maximets, 2021-03-29 15:35:30 UTC
Comment 1, Dan Williams:

Ilya, does this look right for northd? At least the lock is bouncing around, but it looks like it's bouncing around a *lot*. Is that just logging problems, or is there actually a problem?

Pod logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-l6sjf_northd.log

22:24:03 - lock acquired
22:35:27 - terminated
22:40:11 - lock acquired

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-252ls_northd.log

22:35:27 - connection close from 10.0.0.3
22:35:28 - lock acquired
22:35:30 - lock lost
22:35:30 - lock acquired
22:35:46 - lock lost
22:35:46 - lock acquired
22:35:54 - lock lost
22:42:26 - lock acquired

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-l9xx6_northd.log

22:35:27 - connection close from 10.0.0.3
22:35:28 - lock acquired
22:35:30 - lock lost
22:35:30 - lock acquired
22:35:46 - lock lost
22:35:46 - lock acquired
22:35:54 - lock lost
22:35:54 - lock acquired
22:42:25 - terminated

---

(In reply to Dan Williams from comment #1)
> Ilya, does this look right for northd?
>
> At least the lock is bouncing around, but it looks like it's bouncing around
> a *lot*. Is that just logging problems, or is there actually a problem?
>
> 22:35:30 - lock lost
> 22:35:30 - lock acquired
> 22:35:46 - lock lost
> 22:35:46 - lock acquired

This looks fine. This is a period of time where the Sb DB cluster has no leader for 16 seconds (the election timeout), so the northd processes re-connect to a new server every couple of seconds (reconnect + "not a cluster leader" + reconnect again + ...). On reconnection, the active northd loses the lock and logs that. The lock request is sent before the monitor request, so northd acquires the lock before discovering that the server is not a cluster leader, and it logs the acquisition anyway. So lost + acquired within a single second is normal. This also happens very fast, so two northd processes can establish a new connection, lose the lock, acquire the lock, discover that the server is not the leader, and disconnect, all within a single second.

BTW, the northd processes are still terminated with SIGTERM (signal 15) in these logs. I think I know why the preStop hook is not executed in this scenario.
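As an aside on the active/standby behaviour discussed above, the following hedged sketch shows one way to check which role the local ovn-northd currently has; the control-socket glob is an assumption about the northd container layout and may differ between deployments:

```bash
#!/bin/bash
# Hedged sketch: report whether the local ovn-northd holds the SB lock (active)
# or is a standby. The socket location below is an assumption, not the exact
# path used by the ovnkube-master northd container.
ctl=$(ls /var/run/ovn/ovn-northd.*.ctl 2>/dev/null | head -n 1)
if [ -n "${ctl}" ]; then
    # ovn-northd's "status" unixctl command prints "Status: active" or "Status: standby".
    ovn-appctl -t "${ctl}" status
else
    echo "no ovn-northd control socket found" >&2
    exit 1
fi
```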
Looking at the journal, I see the following logs:

Mar 29 22:22:25.758786 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 crio[1690]: time="2021-03-29 22:22:25.758712709Z" level=info msg="Started container 6bfde199d94de8b52ca1649fdee2e83cc28484e7a26f482bd4c185c44d031689: openshift-ovn-kubernetes/ovnkube-master-l6sjf/northd" id=d82760ac-192c-471f-8ae7-56cd01de4068 name=/runtime.v1alpha2.RuntimeService/StartContainer
...
Mar 29 22:35:25.637339 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: Created new deployment /ostree/deploy/rhcos/deploy/27e53d6717a7378165d326c8fe69133ed2621d73dbc78d67c17208663ce92d84.1
Mar 29 22:35:25.702233 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: sanitycheck(/usr/bin/true) successful
...
Mar 29 22:35:26.395406 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
...
Mar 29 22:35:27.167777 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 logger[182444]: rendered-master-12a522dad98c2c798f8d84bd60469f45
Mar 29 22:35:27.172725 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 root[182445]: machine-config-daemon[167106]: Rebooting node
Mar 29 22:35:27.176007 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 root[182446]: machine-config-daemon[167106]: initiating reboot: Node will reboot into config rendered-master-12a522dad98c2c798f8d84bd60469f45
Mar 29 22:35:27.190155 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-12a522dad98c2c798f8d84bd60469f45.
Mar 29 22:35:27.203331 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopping Kubernetes Kubelet...
Mar 29 22:35:27.220520 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: kubelet.service: Succeeded.
Mar 29 22:35:27.221599 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopped Kubernetes Kubelet.
Mar 29 22:35:27.221756 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: kubelet.service: Consumed 12min 48.278s CPU time
...
Mar 29 22:35:27.295951 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopping libcontainer container 6bfde199d94de8b52ca1649fdee2e83cc28484e7a26f482bd4c185c44d031689.

So, the reboot to upgrade the base image is triggered, and it stops the kubelet service first and then stops the actual containers. It's not the kubelet that stops the container, so the preStop hook is not executed. Should some higher-level management software that triggers the base image upgrade request stopping all containers before the node reboot?

---

Clayton suggests the solution for 4.8 (where there isn't graceful termination on restart) is trapping SIGTERM and cleanly exiting. It should be better in 4.9.

Retargeting this bug to handle graceful termination of the OVN databases in the CNO container scripts. Graceful termination of northd will be handled by https://bugzilla.redhat.com/show_bug.cgi?id=1943363

---

Most of this actually landed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1943363 via a SIGTERM hook, but we should also use the same mechanism in preStop as we do in the TERM hook (see the sketch below).

---

Hi Colum - sorry for the delayed response - there has been no power in my home 'office' since last Saturday, and I have been working partial days and only just noticed this now (I got an email saying I was overdue to respond but never saw the notification in the first place). Your math looks right to me, other than the fact that the June 15 release is 4.8.43.

Hi Colum - the 4.8.z BZ was created (see comment above by Jamo) - I've linked it to the case as well.
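For illustration only, here is a minimal, hedged sketch of what the trap-SIGTERM-and-exit-cleanly approach described above could look like for the NB database container script. The control-socket path, the ovn-ctl invocation, and the overall script shape are assumptions made for this example, not the exact code that landed in CNO:

```bash
#!/bin/bash
# Hedged sketch of graceful NB database shutdown for a CNO-style container script.
# Assumptions: ovsdb-server's control socket lives at /var/run/ovn/ovnnb_db.ctl
# and the database is started via ovn-ctl's run_nb_ovsdb sub-command.

quit() {
    echo "$(date -Iseconds) - received TERM, stopping nbdb gracefully"
    # "exit" is a standard ovsdb-server unixctl command: it closes client
    # connections and shuts the server down cleanly instead of dying mid-write.
    ovn-appctl -t /var/run/ovn/ovnnb_db.ctl exit || true
    wait "${db_pid}"
    exit 0
}
trap quit TERM

# Run the database in the background so this shell stays alive to catch TERM
# (an exec'd ovsdb-server would never see the shell trap).
/usr/share/ovn/scripts/ovn-ctl run_nb_ovsdb &
db_pid=$!
wait "${db_pid}"
```

A preStop exec hook can reuse the same shutdown path, for example by running the same ovn-appctl exit command, so that kubelet-driven pod deletion and a node-level SIGTERM both terminate the database cleanly.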
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069