Bug 1702414 - [upgrade] Pods aren't terminated gracefully on reboot, leading to availability failures for control plane
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Importance: high urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Mrunal Patel
QA Contact: Peter Ruan
URL:
Whiteboard:
Duplicates: 1701291 (view as bug list)
Depends On: 1703485 1705379
Blocks: 1703879
 
Reported: 2019-04-23 17:35 UTC by Mrunal Patel
Modified: 2019-06-04 10:47 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:50 UTC
Target Upstream Version:


Attachments
upgrade log from output `oc get clusterversion` (134.06 KB, text/plain)
2019-04-26 03:23 UTC, Peter Ruan


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:47:58 UTC

Description Mrunal Patel 2019-04-23 17:35:57 UTC
Description of problem:
Static pods aren't given their termination grace period on node shutdown during upgrades, leading to failures.

Components such as the apiserver need their termination grace period to be able to report status correctly.
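For context, the grace period a pod requests is the `terminationGracePeriodSeconds` field in its spec (the Kubernetes default is 30 seconds); a minimal illustrative spec, not taken from this bug:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # On shutdown the kubelet sends SIGTERM, waits up to this long,
  # then sends SIGKILL. 30 seconds is the Kubernetes default.
  terminationGracePeriodSeconds: 30
  containers:
  - name: example
    image: docker.io/library/busybox:latest
    command: ["sleep", "3600"]
```

The bug here is that on node reboot, systemd (not the kubelet) tears down the container scopes, so this grace period was not being honored.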

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start a cluster.
2. Perform an upgrade.
3. There will be failures during node reboot.

Comment 1 Mrunal Patel 2019-04-23 17:41:08 UTC
We looked into adding the grace period in runc, but systemd is not allowing us to set TimeoutStopSec. Here are the logs from dbus-monitor:

Comment 2 Mrunal Patel 2019-04-23 18:33:52 UTC
We are able to bump up the timeout using DefaultTimeoutStopSec in /etc/systemd/system.conf.
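For reference, the knob in question lives in the [Manager] section of systemd's global config (the value shown is illustrative; pick a timeout at least as long as the pods' grace periods):

```ini
# /etc/systemd/system.conf
[Manager]
# Default TimeoutStopSec for units/scopes that don't set their own.
DefaultTimeoutStopSec=90s
```

A `systemctl daemon-reexec` (or reboot) is needed for system.conf changes to take effect.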

Transferring to master team to see if bumping up the timeout helps with their issues.

Comment 3 Michal Fojtik 2019-04-24 12:28:45 UTC
We also need https://github.com/openshift/origin/pull/22648 to ensure the kube apiserver will shutdown with zero exit code.

Comment 4 W. Trevor King 2019-04-25 19:27:05 UTC
Cross-ref bug 1701291.  Why are we POST here?

Comment 5 Mrunal Patel 2019-04-25 19:47:10 UTC
Moving this back to containers. We have determined that we need to flip a couple of systemd properties for this to work, and we are now waiting for the systemd team to ack whether setting Delegate=no is okay in runc.

Comment 6 Peter Ruan 2019-04-25 21:13:04 UTC
Adding some output from the user's perspective
################
pruan@dhcp-91-104 ~ $ oc get clusterversion -o yaml                                                                                                                
The connection to the server api.qe-pruan-ap-south-1.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

---
Unable to connect to the server: unexpected EOF



#### simple script to monitor the oc output
while true; do
    PROGRESS_OUTPUT=$(oc get clusterversion version -o json | jq -r '.status.conditions[] | select(.type == "Progressing").message')
    SPEC_UPSTREAM_OUTPUT=$(oc get clusterversion version -o json | jq -r '.spec.upstream')
    echo "$PROGRESS_OUTPUT"
    echo "$SPEC_UPSTREAM_OUTPUT"
    sleep 5   # avoid hammering the apiserver
done

Comment 7 Peter Ruan 2019-04-26 03:23:48 UTC
Created attachment 1558900 [details]
upgrade log from output `oc get clusterversion`

I've added the output from `oc get clusterversion` during the upgrade process to track the point of failure.

Comment 8 Mrunal Patel 2019-04-29 01:27:06 UTC
systemd bug - https://bugzilla.redhat.com/show_bug.cgi?id=1703485

Comment 9 Michal Fojtik 2019-04-29 08:24:07 UTC
*** Bug 1701291 has been marked as a duplicate of this bug. ***

Comment 10 Mrunal Patel 2019-04-29 19:22:52 UTC
systemd team has provided a scratch build that is being integrated into RHCOS to test the fix - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21356365

Comment 11 Mrunal Patel 2019-04-29 19:24:25 UTC
Note: This particular fix may lead to cases where we wait for the full timeout period even though there may be no pids remaining in a cgroup scope. During the sync-up meeting, this was discussed and agreed to as an acceptable tradeoff for now.

Comment 12 Mrunal Patel 2019-05-01 04:38:01 UTC
RHCOS has been built with the systemd fix and we are now waiting for this to show up in CI.

Comment 13 Mrunal Patel 2019-05-01 18:33:22 UTC
The new systemd is now available.

Comment 14 Mrunal Patel 2019-05-01 18:54:13 UTC
I tested it working with a custom test static pod.

Comment 17 Peter Ruan 2019-05-02 21:42:22 UTC
Moving it back to ASSIGNED until a nightly build is available.

Comment 18 Peter Ruan 2019-05-02 22:50:25 UTC
changing it to POST, just waiting for a nightly build that has the patch.

Comment 23 Peter Ruan 2019-05-03 21:12:15 UTC
Verified with image 4.1.0-0.nightly-2019-05-03-121956 that systemd has a 90s sleep to accommodate the reboot. Here are the steps I used to verify, following what Mrunal suggested:
0. From a worker node 
[root@ip-172-31-137-189 core]# rpm -qa systemd
systemd-239-13.20190429.el8_0.nofastkill.0.x86_64

1. ssh into a worker node (follow the directions at https://github.com/eparis/ssh-bastion); do `oc get nodes` to get a worker node's IP
  $> ./ssh.sh <worker_node_ip>
2. create a test dir 
  $> mkdir -p /etc/test
3. grant permission by setting the SELinux context so containers can write to the directory
   $> chcon system_u:object_r:container_file_t:s0 /etc/test
4. create a static pod using the following yaml, which should be placed in the directory /etc/kubernetes/manifests/
 
[root@ip-172-31-137-189 core]# cat /etc/kubernetes/manifests/testsig.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test-sig
spec:
  containers:
  - image: docker.io/mrunalp/testsig:latest
    name: testsig
    volumeMounts:
    - mountPath: /etc/test
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /etc/test
      type: Directory
5. check the pod is created by running `crictl pods`
6. check the pod is functioning correctly; its purpose is to write a timestamp every second to a file
   $> tail -f /etc/test/test_sig.log
7. reboot the node by calling `reboot` from the worker node in question
8. wait a couple of minutes and log back into the worker node
9. vi /etc/test/test_sig.log and look for the line `Caught TERM` and note the timestamp
  $> 2019-05-03_20:40:05
     Caught TERM
     2019-05-03_20:40:06
     END
     2019-05-03_20:40:06
scroll down and look for a ~90-second gap from the time of `Caught TERM`

2019-05-03_20:41:32
2019-05-03_20:41:33
2019-05-03_20:41:34
2019-05-03_20:41:35
2019-05-03_20:41:36    <--- rebooted here
2019-05-03_20:42:49    <--- a sleep of ~90 seconds caused the gap in the log
2019-05-03_20:42:50
2019-05-03_20:42:51
2019-05-03_20:42:52
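For anyone reproducing this without the image, here is a rough sketch (an assumption, not the actual source of docker.io/mrunalp/testsig) of what the entrypoint does, with the TERM delivery simulated locally instead of via a reboot:

```shell
#!/bin/sh
# Hypothetical sketch of the testsig entrypoint: append a timestamp
# every second, and record when SIGTERM arrives so the shutdown gap
# is visible in the log.
LOG=${LOG:-./test_sig.log}
: > "$LOG"

(
  # On SIGTERM, note it in the log and exit cleanly.
  trap 'echo "Caught TERM" >> "$LOG"; echo "END" >> "$LOG"; exit 0' TERM
  while true; do
    date +%Y-%m-%d_%H:%M:%S >> "$LOG"
    sleep 1
  done
) &
writer=$!

# Simulate the kubelet/systemd stopping the container: let it log a
# few entries, then send SIGTERM and wait for the trap to run.
sleep 3
kill -TERM "$writer"
wait "$writer" 2>/dev/null
```

In the real pod the SIGTERM comes from the container runtime during node shutdown, and the ~90s systemd stop timeout is what gives the handler time to run.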

Comment 25 errata-xmlrpc 2019-06-04 10:47:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

