Even though CNO uses 'appctl -t <service>.ctl exit' as a preStop hook for most of the services, all of them are usually terminated with:

  |00001|fatal_signal(urcu2)|WARN|terminating with signal 15 (Terminated)

This is likely happening because CNO doesn't actually wait for the processes to stop. 'appctl -t <service>.ctl exit' only notifies the application that it needs to exit. A typical ovsdb-server will then finalize all storage operations, close all remaining network connections, free its resources, and only exit after that. My guess is that CNO sends SIGTERM right after executing the preStop hook, so the process simply has no time to exit gracefully.

Termination by SIGTERM is not a good thing for OVN processes: it leaves half-open network/unix socket connections and might cause unfinished transactions to be lost. The cluster should still survive this, but it's much better to close everything correctly than to stress the fail-over mechanism on every normal container stop. Correct termination should also contribute to faster recovery of the raft cluster, because the leader will properly transfer leadership and followers will not need to wait for the election timeout or otherwise detect that the connection is broken.

For ovn-controller it's already possible to use 'ovn-ctl stop_controller'. Northd has no preStop hook in CNO, but it could be started and stopped by ovn-ctl (beware the OVN_MANAGE_OVSDB option, which is enabled by default). For the databases, unfortunately, ovn-ctl just uses the same 'appctl -t <service>.ctl exit', but I created a BZ to fix that and actually wait for termination: https://bugzilla.redhat.com/show_bug.cgi?id=1944239

Alternatively, waiting for termination could be implemented by CNO itself, e.g. with a preStop script along the lines of the sketch below.
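For illustration only, a preStop script that waits for the daemon to actually exit could look roughly like this. The control socket and pidfile paths, the timeout, and the choice of the Sb DB are assumptions, not the actual CNO configuration:

  #!/bin/bash
  # Hypothetical preStop sketch: ask the Sb ovsdb-server to exit, then wait for
  # it to really terminate before the kubelet proceeds to SIGTERM.
  ctl=/var/run/ovn/ovnsb_db.ctl
  pidfile=/var/run/ovn/ovnsb_db.pid

  pid=$(cat "$pidfile" 2>/dev/null)
  ovs-appctl -t "$ctl" exit 2>/dev/null || exit 0   # already gone, nothing to do

  # Poll for up to ~30 seconds until the recorded process has exited.
  for i in $(seq 1 30); do
      [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null || exit 0
      sleep 1
  done
  echo "timed out waiting for ovnsb_db to exit" >&2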
Ilya, does this look right for northd?

At least the lock is bouncing around, but it looks like it's bouncing around a *lot*. Is that just logging problems, or is there actually a problem?

Pod logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-l6sjf_northd.log
22:24:03 - lock acquired
22:35:27 - terminated
22:40:11 - lock acquired

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-252ls_northd.log
22:35:27 - connection close from 10.0.0.3
22:35:28 - lock acquired
22:35:30 - lock lost
22:35:30 - lock acquired
22:35:46 - lock lost
22:35:46 - lock acquired
22:35:54 - lock lost
22:42:26 - lock acquired

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-l9xx6_northd.log
22:35:27 - connection close from 10.0.0.3
22:35:28 - lock acquired
22:35:30 - lock lost
22:35:30 - lock acquired
22:35:46 - lock lost
22:35:46 - lock acquired
22:35:54 - lock lost
22:35:54 - lock acquired
22:42:25 - terminated
(In reply to Dan Williams from comment #1)
> Ilya, does this look right for northd?
>
> At least the lock is bouncing around, but it looks like it's bouncing around
> a *lot*. Is that just logging problems, or is there actually a problem?
>
> 22:35:30 - lock lost
> 22:35:30 - lock acquired
> 22:35:46 - lock lost
> 22:35:46 - lock acquired

This looks fine. This is a period of time where the Sb DB cluster has no leader for 16 seconds (the election timeout), so the northd processes reconnect to a new server every couple of seconds (reconnect + not a cluster leader + reconnect again + ...). On reconnection, the active northd loses the lock and logs that. The lock request is sent before the monitor request, so northd acquires the lock before discovering that the server is not the cluster leader, and it logs the lock acquisition anyway. So a lost+acquired pair within a single second is normal. This also happens very fast, so two northds could establish a new connection, lose the lock, acquire the lock, discover that the server is not a leader, and disconnect within a single second.

BTW, the northd processes are still terminated with SIGTERM (signal 15) in these logs.
I guess I know why the preStop hook is not executed in this scenario. Looking at the journal, I see the following logs:

Mar 29 22:22:25.758786 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 crio[1690]: time="2021-03-29 22:22:25.758712709Z" level=info msg="Started container 6bfde199d94de8b52ca1649fdee2e83cc28484e7a26f482bd4c185c44d031689: openshift-ovn-kubernetes/ovnkube-master-l6sjf/northd" id=d82760ac-192c-471f-8ae7-56cd01de4068 name=/runtime.v1alpha2.RuntimeService/StartContainer
...
Mar 29 22:35:25.637339 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: Created new deployment /ostree/deploy/rhcos/deploy/27e53d6717a7378165d326c8fe69133ed2621d73dbc78d67c17208663ce92d84.1
Mar 29 22:35:25.702233 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: sanitycheck(/usr/bin/true) successful
...
Mar 29 22:35:26.395406 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
...
Mar 29 22:35:27.167777 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 logger[182444]: rendered-master-12a522dad98c2c798f8d84bd60469f45
Mar 29 22:35:27.172725 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 root[182445]: machine-config-daemon[167106]: Rebooting node
Mar 29 22:35:27.176007 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 root[182446]: machine-config-daemon[167106]: initiating reboot: Node will reboot into config rendered-master-12a522dad98c2c798f8d84bd60469f45
Mar 29 22:35:27.190155 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-12a522dad98c2c798f8d84bd60469f45.
Mar 29 22:35:27.203331 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopping Kubernetes Kubelet...
Mar 29 22:35:27.220520 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: kubelet.service: Succeeded.
Mar 29 22:35:27.221599 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopped Kubernetes Kubelet.
Mar 29 22:35:27.221756 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: kubelet.service: Consumed 12min 48.278s CPU time
...
Mar 29 22:35:27.295951 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopping libcontainer container 6bfde199d94de8b52ca1649fdee2e83cc28484e7a26f482bd4c185c44d031689.
---

So, the reboot to upgrade the base image is triggered, and it stops the kubelet service first and then stops the actual containers. Since it's not the kubelet that stops the container, the preStop hook is not executed. Should the higher-level management software that triggers the base image upgrade request that all containers be stopped before the node reboots?
Clayton suggests that the solution for 4.8 (where there is no graceful termination on restart) is to trap SIGTERM and exit cleanly. This should be better in 4.9.
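Roughly, such a trap in the container's run script could look like the following sketch; the database invocation, paths, and structure are illustrative assumptions, not the real CNO script:

  #!/bin/bash
  # Hypothetical run script: handle SIGTERM by asking ovsdb-server to exit
  # cleanly instead of letting the signal kill it mid-transaction.
  quit() {
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl exit 2>/dev/null || true
      wait "$db_pid"
      exit 0
  }
  trap quit TERM

  # Run the database in the background so the shell keeps handling signals;
  # 'wait' returns when the trapped signal arrives and the handler runs.
  ovsdb-server --remote=punix:/var/run/ovn/ovnnb_db.sock \
               --unixctl=/var/run/ovn/ovnnb_db.ctl \
               /etc/ovn/ovnnb_db.db &
  db_pid=$!
  wait "$db_pid"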
Retargeting this bug to handle graceful termination of the OVN databases in the CNO container scripts. Graceful termination of northd will be handled by https://bugzilla.redhat.com/show_bug.cgi?id=1943363
Most of this actually landed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1943363 via a SIGTERM hook. But we should also use the same mechanism in preStop as we do in the TERM hook.
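For example (hypothetical file name, function name, and paths, not the merged change), both the run script's TERM trap and the preStop hook could source a single helper so the shutdown sequence lives in one place:

  #!/bin/bash
  # Hypothetical shared helper (e.g. /usr/local/bin/ovndb-stop.sh) that both the
  # TERM trap in the run script and the preStop hook can source, so both stop
  # paths execute exactly the same graceful shutdown.
  graceful_stop() {
      local ctl=$1 pidfile=$2 pid
      pid=$(cat "$pidfile" 2>/dev/null) || return 0

      # Ask the daemon to exit, then wait until the recorded PID is really gone.
      ovs-appctl -t "$ctl" exit 2>/dev/null || return 0
      while kill -0 "$pid" 2>/dev/null; do
          sleep 1
      done
  }

The preStop hook would then just be 'source /usr/local/bin/ovndb-stop.sh && graceful_stop /var/run/ovn/ovnnb_db.ctl /var/run/ovn/ovnnb_db.pid', and the TERM trap would call the same function, which keeps the two code paths from drifting apart.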
Hi Colum - sorry for the delayed response. There has been no power in my home 'office' since last Saturday, so I have been working partial days and only just noticed this now (I got an email saying I was overdue to respond, but never saw the original notification). Your math looks right to me, other than the fact that the June 15 release is 4.8.43.
Hi Colum - the 4.8.z BZ was created (see comment above by Jamo) - I've linked it to the case as well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069