Bug 1944264
| Summary: | [ovn] CNO should gracefully terminate OVN databases | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Ilya Maximets <i.maximets> |
| Component: | Networking | Assignee: | jamo luhrsen <jluhrsen> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | anbhat, bbennett, bzvonar, cgaynor, dcbw, ffernand, jluhrsen, kurathod, openshift-bugs-escalate |
| Version: | 4.7 | Flags: | jluhrsen: needinfo- (×4) |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:36:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1943566, 1943800 | | |
Description: Ilya Maximets, 2021-03-29 15:35:30 UTC
Comment 1, Dan Williams:

Ilya, does this look right for northd? At least the lock is bouncing around, but it looks like it's bouncing around a *lot*. Is that just logging problems, or is there actually a problem?

Pod logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-l6sjf_northd.log

22:24:03 - lock acquired
22:35:27 - terminated
22:40:11 - lock acquired

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-252ls_northd.log

22:35:27 - connection close from 10.0.0.3
22:35:28 - lock acquired
22:35:30 - lock lost
22:35:30 - lock acquired
22:35:46 - lock lost
22:35:46 - lock acquired
22:35:54 - lock lost
22:42:26 - lock acquired

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1042/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1376632693281787904/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-l9xx6_northd.log

22:35:27 - connection close from 10.0.0.3
22:35:28 - lock acquired
22:35:30 - lock lost
22:35:30 - lock acquired
22:35:46 - lock lost
22:35:46 - lock acquired
22:35:54 - lock lost
22:35:54 - lock acquired
22:42:25 - terminated

---

(In reply to Dan Williams from comment #1)
> Ilya, does this look right for northd?
>
> At least the lock is bouncing around, but it looks like it's bouncing around
> a *lot*. Is that just logging problems, or is there actually a problem?
>
> 22:35:30 - lock lost
> 22:35:30 - lock acquired
> 22:35:46 - lock lost
> 22:35:46 - lock acquired

This looks fine. This is a period of time where the Sb DB cluster has no leader for 16 seconds (the election timeout), so the northd processes re-connect to a new server every couple of seconds (reconnect + "not a cluster leader" + reconnect again + ...). On reconnection, the active northd loses the lock and logs that. The lock request is sent before the monitor request, so northd acquires the lock before discovering that the server is not a cluster leader, and it logs the acquisition anyway. So lost + acquired within a single second is normal. This also happens very fast, so two northd processes can establish a new connection, lose the lock, acquire the lock, discover that the server is not the leader, and disconnect, all within a single second.

BTW, the northd processes are still terminated with SIGTERM (signal 15) in these logs. I think I know why the preStop hook is not executed in this scenario.
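As an aside on the active/standby behaviour discussed above, the following hedged sketch shows one way to check which role the local ovn-northd currently has; the control-socket glob is an assumption about the northd container layout and may differ between deployments:

```bash
#!/bin/bash
# Hedged sketch: report whether the local ovn-northd holds the SB lock (active)
# or is a standby. The socket location below is an assumption, not the exact
# path used by the ovnkube-master northd container.
ctl=$(ls /var/run/ovn/ovn-northd.*.ctl 2>/dev/null | head -n 1)
if [ -n "${ctl}" ]; then
    # ovn-northd's "status" unixctl command prints "Status: active" or "Status: standby".
    ovn-appctl -t "${ctl}" status
else
    echo "no ovn-northd control socket found" >&2
    exit 1
fi
```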
Looking at the journal, I see the following logs:

Mar 29 22:22:25.758786 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 crio[1690]: time="2021-03-29 22:22:25.758712709Z" level=info msg="Started container 6bfde199d94de8b52ca1649fdee2e83cc28484e7a26f482bd4c185c44d031689: openshift-ovn-kubernetes/ovnkube-master-l6sjf/northd" id=d82760ac-192c-471f-8ae7-56cd01de4068 name=/runtime.v1alpha2.RuntimeService/StartContainer
...
Mar 29 22:35:25.637339 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: Created new deployment /ostree/deploy/rhcos/deploy/27e53d6717a7378165d326c8fe69133ed2621d73dbc78d67c17208663ce92d84.1
Mar 29 22:35:25.702233 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: sanitycheck(/usr/bin/true) successful
...
Mar 29 22:35:26.395406 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 rpm-ostree[182235]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
...
Mar 29 22:35:27.167777 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 logger[182444]: rendered-master-12a522dad98c2c798f8d84bd60469f45
Mar 29 22:35:27.172725 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 root[182445]: machine-config-daemon[167106]: Rebooting node
Mar 29 22:35:27.176007 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 root[182446]: machine-config-daemon[167106]: initiating reboot: Node will reboot into config rendered-master-12a522dad98c2c798f8d84bd60469f45
Mar 29 22:35:27.190155 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-12a522dad98c2c798f8d84bd60469f45.
Mar 29 22:35:27.203331 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopping Kubernetes Kubelet...
Mar 29 22:35:27.220520 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: kubelet.service: Succeeded.
Mar 29 22:35:27.221599 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopped Kubernetes Kubelet.
Mar 29 22:35:27.221756 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: kubelet.service: Consumed 12min 48.278s CPU time
...
Mar 29 22:35:27.295951 ci-op-ly1z9j01-8aaeb-nr4jx-master-2 systemd[1]: Stopping libcontainer container 6bfde199d94de8b52ca1649fdee2e83cc28484e7a26f482bd4c185c44d031689.

So, the reboot to upgrade the base image is triggered, and it stops the kubelet service first and then stops the actual containers. It's not the kubelet that stops the container, so the preStop hook is not executed. Should some higher-level management software that triggers the base image upgrade request stopping all containers before the node reboot?

---

Clayton suggests the solution for 4.8 (where there isn't graceful termination on restart) is trapping SIGTERM and cleanly exiting. It should be better in 4.9.

Retargeting this bug to handle graceful termination of the OVN databases in the CNO container scripts. Graceful termination of northd will be handled by https://bugzilla.redhat.com/show_bug.cgi?id=1943363

---

Most of this actually landed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1943363 via a SIGTERM hook, but we should also use the same mechanism in preStop as we do in the TERM hook (see the sketch below).

---

Hi Colum - sorry for the delayed response - there has been no power in my home 'office' since last Saturday, and I have been working partial days and only just noticed this now (I got an email saying I was overdue to respond but never saw the notification in the first place). Your math looks right to me, other than the fact that the June 15 release is 4.8.43.

Hi Colum - the 4.8.z BZ was created (see comment above by Jamo) - I've linked it to the case as well.
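For illustration only, here is a minimal, hedged sketch of what the trap-SIGTERM-and-exit-cleanly approach described above could look like for the NB database container script. The control-socket path, the ovn-ctl invocation, and the overall script shape are assumptions made for this example, not the exact code that landed in CNO:

```bash
#!/bin/bash
# Hedged sketch of graceful NB database shutdown for a CNO-style container script.
# Assumptions: ovsdb-server's control socket lives at /var/run/ovn/ovnnb_db.ctl
# and the database is started via ovn-ctl's run_nb_ovsdb sub-command.

quit() {
    echo "$(date -Iseconds) - received TERM, stopping nbdb gracefully"
    # "exit" is a standard ovsdb-server unixctl command: it closes client
    # connections and shuts the server down cleanly instead of dying mid-write.
    ovn-appctl -t /var/run/ovn/ovnnb_db.ctl exit || true
    wait "${db_pid}"
    exit 0
}
trap quit TERM

# Run the database in the background so this shell stays alive to catch TERM
# (an exec'd ovsdb-server would never see the shell trap).
/usr/share/ovn/scripts/ovn-ctl run_nb_ovsdb &
db_pid=$!
wait "${db_pid}"
```

A preStop exec hook can reuse the same shutdown path, for example by running the same ovn-appctl exit command, so that kubelet-driven pod deletion and a node-level SIGTERM both terminate the database cleanly.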
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069