Description of problem:
Static pods do not get their termination grace period on node shutdown during upgrades, which leads to failures. Components such as the apiserver need the termination grace period to be able to report status correctly.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Start a cluster.
2. Perform an upgrade.
3. There will be failures during node reboot.
We looked into adding the grace period in runc, but systemd does not allow us to set TimeoutStopSec. Here are the logs from dbus-monitor:
We are able to bump up the timeout using DefaultTimeoutStopSec in /etc/systemd/system.conf. Transferring to master team to see if bumping up the timeout helps with their issues.
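For reference, a minimal sketch of the DefaultTimeoutStopSec bump mentioned above. The helper name and the 120s value are illustrative assumptions, not taken from this bug; the only things assumed about systemd are that DefaultTimeoutStopSec= is a [Manager] option in system.conf and that [Manager] is the last (or only) section in the file.

```shell
# Hypothetical helper: set DefaultTimeoutStopSec in a systemd manager config file.
# Usage: set_default_timeout_stop_sec <conf-file> <value>
set_default_timeout_stop_sec() {
    conf=$1
    value=$2
    if grep -Eq '^#?DefaultTimeoutStopSec=' "$conf"; then
        # Replace the existing (possibly commented-out) setting.
        tmp=$(mktemp)
        sed -E "s|^#?DefaultTimeoutStopSec=.*|DefaultTimeoutStopSec=${value}|" "$conf" > "$tmp" \
            && cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        # No such line yet: append it (assumes [Manager] is the last section).
        printf 'DefaultTimeoutStopSec=%s\n' "$value" >> "$conf"
    fi
}

# Example (value is illustrative only):
# set_default_timeout_stop_sec /etc/systemd/system.conf 120s
```

The new default only applies once the systemd manager has re-read its configuration, for example after a reboot.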
We also need https://github.com/openshift/origin/pull/22648 to ensure the kube apiserver shuts down with a zero exit code.
Cross-ref bug 1701291. Why are we POST here?
Moving this back to containers. We have determined that we need to flip a couple of systemd properties for this to work and now are waiting for systemd team to ack whether setting Delegate=no is okay in runc.
Adding some output from the end user's perspective:

################
pruan@dhcp-91-104 ~ $ oc get clusterversion -o yaml
The connection to the server api.qe-pruan-ap-south-1.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
---
Unable to connect to the server: unexpected EOF

#### simple script to monitor the oc output
while true; do
    PROGRESS_OUTPUT=$(oc get clusterversion version -o json | jq '.status.conditions[] | select(.type == "Progressing").message')
    SPEC_UPSTREAM_OUTPUT=$(oc get clusterversion version -o json | jq '.spec.upstream')
    echo "$PROGRESS_OUTPUT"
    echo "$SPEC_UPSTREAM_OUTPUT"
done
Created attachment 1558900: upgrade log from `oc get clusterversion` output

I've added the output from `oc get clusterversion` during the upgrade process to track the point of failure.
systemd bug - https://bugzilla.redhat.com/show_bug.cgi?id=1703485
*** Bug 1701291 has been marked as a duplicate of this bug. ***
systemd team has provided a scratch build that is being integrated into RHCOS to test the fix - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21356365
Note: This particular fix may lead to cases where we wait for the full timeout period even though there may be no pids remaining in a cgroups scope. During the sync-up meeting, this was discussed and agreed as an acceptable tradeoff for now.
RHCOS has built with the systemd fix and we are now waiting for this to show up in CI.
The new systemd is now available.
I tested it with a custom test static pod and it works.
moving it back to ASSIGNED until nightly build is available
changing it to POST, just waiting for a nightly build that has the patch.
Verified with image 4.1.0-0.nightly-2019-05-03-121956 that systemd has a 90s wait to accommodate reboot. Here are the steps I used to verify, following what Mrunal suggested:

0. From a worker node, check the systemd version:
   [root@ip-172-31-137-189 core]# rpm -qa systemd
   systemd-239-13.20190429.el8_0.nofastkill.0.x86_64

1. ssh into a worker node (follow the directions at https://github.com/eparis/ssh-bastion); run `oc get nodes` to get a worker node's IP
   $> ./ssh.sh <worker_node_ip>

2. Create a test dir
   $> mkdir -p /etc/test

3. Grant permission
   $> chcon system_u:object_r:container_file_t:s0 /etc/test

4. Create a static pod using the following yaml, which should be placed in the directory /etc/kubernetes/manifests/
   [root@ip-172-31-137-189 core]# cat /etc/kubernetes/manifests/testsig.yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: test-sig
   spec:
     containers:
     - image: docker.io/mrunalp/testsig:latest
       name: testsig
       volumeMounts:
       - mountPath: /etc/test
         name: test-volume
     volumes:
     - name: test-volume
       hostPath:
         # directory location on host
         path: /etc/test
         type: Directory

5. Check that the pod is created by running `crictl pods`

6. Check that the pod is functioning correctly; its purpose is to write a timestamp every second to a file
   $> tail -f /etc/test/test_sig.log

7. Reboot the node by calling `reboot` from the worker node in question

8. Wait a couple of minutes and log back into the worker node

9. vi /etc/test/test_sig.log, look for the line `Caught TERM` and note the timestamp
   $>
   2019-05-03_20:40:05 Caught TERM 2019-05-03_20:40:06 END 2019-05-03_20:40:06

   Scroll down and watch for a ~90 seconds time gap from the time of Caught TERM
   2019-05-03_20:41:32
   2019-05-03_20:41:33
   2019-05-03_20:41:34
   2019-05-03_20:41:35
   2019-05-03_20:41:36 <--- rebooted here
   2019-05-03_20:42:49 <--- sleep time of ~90 seconds cause the gap in the log
   2019-05-03_20:42:50
   2019-05-03_20:42:51
   2019-05-03_20:42:52
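As a sanity check on step 9, the interval between the timestamp next to `Caught TERM` and the last line written before the reboot can be computed from the log values. This is a minimal sketch: the helper names are hypothetical, and it assumes the timestamp format shown above and that both timestamps fall on the same day.

```shell
# Hypothetical helper: seconds since midnight for a timestamp
# like 2019-05-03_20:40:05 (the date part is ignored).
tod_seconds() {
    printf '%s\n' "$1" | awk -F'[_:]' '{ print $2 * 3600 + $3 * 60 + $4 }'
}

# Hypothetical helper: seconds elapsed from the first timestamp to the second.
elapsed() {
    echo $(( $(tod_seconds "$2") - $(tod_seconds "$1") ))
}

# Caught TERM -> last line logged before the node went down
elapsed 2019-05-03_20:40:05 2019-05-03_20:41:36   # prints 91
```

The result of 91 seconds is in line with the ~90 second window the test pod kept running after receiving TERM.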
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758