Bug 1702414

Summary: [upgrade] Pods aren't terminated gracefully on reboot, leading to availability failures for control plane
Product: OpenShift Container Platform
Reporter: Mrunal Patel <mpatel>
Component: Containers
Assignee: Mrunal Patel <mpatel>
Status: CLOSED ERRATA
QA Contact: Peter Ruan <pruan>
Severity: urgent
Priority: high
Version: 4.1.0
CC: aos-bugs, ccoleman, dwalsh, gblomqui, gklein, jokerman, mifiedle, mmccomas, nstielau, pruan, smilner, vlaad, wking
Target Milestone: ---
Keywords: BetaBlocker, TestBlocker, Upgrades
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:47:50 UTC
Type: Bug
Bug Depends On: 1703485, 1705379    
Bug Blocks: 1703879    
Attachments: upgrade log from output `oc get clusterversion`

Description Mrunal Patel 2019-04-23 17:35:57 UTC
Description of problem:
Static pods aren't getting a termination grace period on node shutdown during upgrades, leading to failures.

Components such as the apiserver need a termination grace period to be able to report their status correctly.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start a cluster.
2. Perform an upgrade.
3. There will be failures during node reboot.

Comment 1 Mrunal Patel 2019-04-23 17:41:08 UTC
We looked into adding the grace period in runc, but systemd is not allowing us to set TimeoutStopSec. Here are the logs from dbus-monitor:

Comment 2 Mrunal Patel 2019-04-23 18:33:52 UTC
We are able to bump up the timeout using DefaultTimeoutStopSec in /etc/systemd/system.conf.
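
For reference, a minimal sketch of that workaround (the 90s value below is an assumption, not necessarily the exact figure we used):

# raise the default stop timeout in /etc/systemd/system.conf, then re-execute
# systemd so the manager picks up the new setting
sed -i 's/^#\?DefaultTimeoutStopSec=.*/DefaultTimeoutStopSec=90s/' /etc/systemd/system.conf
systemctl daemon-reexec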

Transferring to the master team to see if bumping up the timeout helps with their issues.

Comment 3 Michal Fojtik 2019-04-24 12:28:45 UTC
We also need https://github.com/openshift/origin/pull/22648 to ensure the kube apiserver will shut down with a zero exit code.
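
A rough way to double-check that exit code on a node once the PR lands (a sketch only; the container ID is a placeholder to fill in from the first command):

# list the kube-apiserver container(s), then inspect the recorded exit code
crictl ps -a --name kube-apiserver
crictl inspect <container-id> | jq '.status.exitCode'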

Comment 4 W. Trevor King 2019-04-25 19:27:05 UTC
Cross-ref bug 1701291.  Why are we POST here?

Comment 5 Mrunal Patel 2019-04-25 19:47:10 UTC
Moving this back to containers. We have determined that we need to flip a couple of systemd properties for this to work, and we are now waiting
for the systemd team to ack whether setting Delegate=no is okay in runc.
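
For anyone following along, a sketch of how to inspect those properties on a container scope (the scope name below is a placeholder; list the real ones first):

# container scopes created by CRI-O/runc under the systemd cgroup driver
systemctl list-units 'crio-*.scope' --no-legend
# check whether the scope is delegated and what stop timeout it carries
systemctl show -p Delegate -p TimeoutStopUSec crio-<container-id>.scope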

Comment 6 Peter Ruan 2019-04-25 21:13:04 UTC
Adding some output from the user's perspective
################
pruan@dhcp-91-104 ~ $ oc get clusterversion -o yaml                                                                                                                
The connection to the server api.qe-pruan-ap-south-1.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

---
Unable to connect to the server: unexpected EOF



#### simple script to monitor the oc output
while true; do
    PROGRESS_OUTPUT=$(oc get clusterversion version -o json | jq '.status.conditions[] | select(.type == "Progressing").message')
    SPEC_UPSTREAM_OUTPUT=$(oc get clusterversion version -o json | jq '.spec.upstream')
    echo "$PROGRESS_OUTPUT"
    echo "$SPEC_UPSTREAM_OUTPUT"
    sleep 5    # pause between polls so we don't hammer the apiserver
done

Comment 7 Peter Ruan 2019-04-26 03:23:48 UTC
Created attachment 1558900 [details]
upgrade log from output `oc get clusterversion`

I've added the output from `oc get clusterversion` during the upgrade process to track the point of failure.

Comment 8 Mrunal Patel 2019-04-29 01:27:06 UTC
systemd bug - https://bugzilla.redhat.com/show_bug.cgi?id=1703485

Comment 9 Michal Fojtik 2019-04-29 08:24:07 UTC
*** Bug 1701291 has been marked as a duplicate of this bug. ***

Comment 10 Mrunal Patel 2019-04-29 19:22:52 UTC
The systemd team has provided a scratch build that is being integrated into RHCOS to test the fix: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21356365

Comment 11 Mrunal Patel 2019-04-29 19:24:25 UTC
Note: This particular fix may lead to cases where we wait for the full timeout period even though there may be no PIDs remaining in a cgroup scope. During the sync-up meeting, this
was discussed and agreed to be an acceptable tradeoff for now.
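
A quick sketch for checking whether anything is actually left in a scope when we hit that wait (the scope name is a placeholder):

# "Tasks: 0" / an empty cgroup listing means the wait is purely the stop
# timeout, with no remaining processes to kill
systemctl status 'crio-<container-id>.scope'
systemd-cgls --unit 'crio-<container-id>.scope'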

Comment 12 Mrunal Patel 2019-05-01 04:38:01 UTC
RHCOS has been built with the systemd fix, and we are now waiting for it to show up in CI.

Comment 13 Mrunal Patel 2019-05-01 18:33:22 UTC
The new systemd is now available.

Comment 14 Mrunal Patel 2019-05-01 18:54:13 UTC
I tested that it works with a custom test static pod.

Comment 17 Peter Ruan 2019-05-02 21:42:22 UTC
Moving it back to ASSIGNED until a nightly build is available.

Comment 18 Peter Ruan 2019-05-02 22:50:25 UTC
Changing it to POST; just waiting for a nightly build that has the patch.

Comment 23 Peter Ruan 2019-05-03 21:12:15 UTC
Verified with image 4.1.0-0.nightly-2019-05-03-121956 that systemd now waits ~90 seconds on reboot to accommodate pod shutdown.  Here are the steps I used to verify, following what Mrunal suggested
0. From a worker node, check the installed systemd build
[root@ip-172-31-137-189 core]# rpm -qa systemd
systemd-239-13.20190429.el8_0.nofastkill.0.x86_64

1. ssh into a worker node (follow the directions at https://github.com/eparis/ssh-bastion); do `oc get nodes` to get a worker node's IP
  $> ./ssh.ssh <worker_node_ip>
2. create a test dir
  $> mkdir -p /etc/test
3. set the SELinux context so containers can write to the directory
   chcon system_u:object_r:container_file_t:s0 /etc/test
4. create a static pod using the following yaml, which should be placed in the directory /etc/kubernetes/manifests/
 
[root@ip-172-31-137-189 core]# cat /etc/kubernetes/manifests/testsig.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test-sig
spec:
  containers:
  - image: docker.io/mrunalp/testsig:latest
    name: testsig
    volumeMounts:
    - mountPath: /etc/test
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /etc/test
      type: Directory
5. check that the pod is created by running `crictl pods`
6. check that the pod is functioning correctly; its purpose is to write a timestamp every second to a file
   $> tail -f /etc/test/test_sig.log
7. reboot the node by calling `reboot` from the worker node in question
8. wait a couple of minutes and log back into the worker node
9. vi /etc/test/test_sig.log and look for the line `Caught TERM` and note the timestamp
  $> 2019-05-03_20:40:05
     Caught TERM
     2019-05-03_20:40:06
     END
     2019-05-03_20:40:06
scroll down and watch for a ~90 second time gap from the time of `Caught TERM` (a gap-check sketch follows the excerpt below)

2019-05-03_20:41:32
2019-05-03_20:41:33
2019-05-03_20:41:34
2019-05-03_20:41:35
2019-05-03_20:41:36    <--- rebooted here
2019-05-03_20:42:49    <--- the ~90 second wait causes the gap in the log
2019-05-03_20:42:50
2019-05-03_20:42:51
2019-05-03_20:42:52
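
Rather than eyeballing the log, a rough sketch for spotting the gap automatically (assumes GNU date and the timestamp format shown above):

# flag any adjacent timestamps in the log that are more than 10s apart
prev=0
grep '^20' /etc/test/test_sig.log | while read -r line; do
    t=$(date -d "${line/_/ }" +%s)
    if [ "$prev" -gt 0 ] && [ $((t - prev)) -gt 10 ]; then
        echo "gap of $((t - prev))s before $line"
    fi
    prev=$t
done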

Comment 25 errata-xmlrpc 2019-06-04 10:47:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758