Description of problem:
Init containers restart when the exited init container is removed from node by kubelet garbage collector.

Version-Release number of selected component (if applicable):
3.11

How reproducible:
100%

Steps to Reproduce:
1. Create pod

# oc create -f [1]
pod/init-test created
# oc exec init-test cat /mnt/data/count
1

2. Set node garbage collection low to trigger cleanup faster

```
kubeletArguments:
  minimum-container-ttl-duration:
  - "10s"
  maximum-dead-containers-per-container:
  - "0"
  maximum-dead-containers:
  - "0"
```

or remove the exited init container manually with `docker rm`

# docker ps -a | grep init
e4978a8bb28a registry.redhat.io/rhel7/rhel@sha256:7bca0d15377805f6b3bbb79d28325da8dd90f4eeef6fb463c0973d05257e5d03 "/bin/sh -ec 'cat ..." 2 minutes ago Up 2 minutes k8s_my-container_init-test_openshift-sdn_3deb5d3e-01ab-11ea-ae27-fa163e05678f_0
94fb06cc88b0 registry.redhat.io/rhel7/rhel@sha256:7bca0d15377805f6b3bbb79d28325da8dd90f4eeef6fb463c0973d05257e5d03 "/bin/bash -c '#!/..." 2 minutes ago Exited (0) 2 minutes ago k8s_inittest_init-test_openshift-sdn_3deb5d3e-01ab-11ea-ae27-fa163e05678f_0
d5f1053ffc23 registry.redhat.io/openshift3/ose-pod:v3.11.153 "/usr/bin/pod" 2 minutes ago Up 2 minutes
# docker rm 94fb06cc88b0

Actual results:
- init container continues to restart due to being cleaned up by kubelet.

# oc exec init-test cat /mnt/data/count
2
# oc exec init-test cat /mnt/data/count
3
...
...

- Main container never gets restarted:

# docker ps -a | grep init
9d4605b3d6cf registry.redhat.io/rhel7/rhel@sha256:7bca0d15377805f6b3bbb79d28325da8dd90f4eeef6fb463c0973d05257e5d03 "/bin/bash -c '#!/..." 3 minutes ago Exited (0) 3 minutes ago k8s_inittest_init-test_openshift-sdn_3deb5d3e-01ab-11ea-ae27-fa163e05678f_0
e4978a8bb28a registry.redhat.io/rhel7/rhel@sha256:7bca0d15377805f6b3bbb79d28325da8dd90f4eeef6fb463c0973d05257e5d03 "/bin/sh -ec 'cat ..." 6 minutes ago Up 6 minutes k8s_my-container_init-test_openshift-sdn_3deb5d3e-01ab-11ea-ae27-fa163e05678f_0
d5f1053ffc23 registry.redhat.io/openshift3/ose-pod:v3.11.153 "/usr/bin/pod" 6 minutes ago Up 6 minutes k8s_POD_init-test_openshift-sdn_3deb5d3e-01ab-11ea-ae27-fa163e05678f_0

Expected results:
init container to only run when pod is restarted or first scheduled to node.

Additional info:
[1]
oc create -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: inittest
    image: "registry.redhat.io/rhel7/rhel"
    command: ["bin/sh", "-ec", "echo running >> /mnt/data/test"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "registry.redhat.io/rhel7/rhel"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
  volumes:
  - name: my-volume
    emptyDir: {}
EOF
Further, if the init container fails when it is rerun because it has already executed once for this pod, the pod can appear to be in a bad status, since the init container restarts in a loop.

1. Create pod

apiVersion: v1
kind: Pod
metadata:
  name: init-fail-test
spec:
  initContainers:
  - name: inittest
    image: "registry.redhat.io/rhel7/rhel"
    command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -euo pipefail
      file=/mnt/data/count
      if [[ -f "${file}" ]]; then
        count=$(<"${file}")
        expr $count + 1 > $file
        echo "Init Container has run ${count} times"
        exit 1
      fi
      echo 1 > ${file}
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "registry.redhat.io/rhel7/rhel"
    command: ["/bin/sh", "-ec", "while true; do sleep 30; cat /mnt/data/count; done;"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
  volumes:
  - name: my-volume
    emptyDir: {}

2. Wait for kubelet to clean up init container.

$ oc get pods
NAME             READY   STATUS                  RESTARTS   AGE
init-fail-test   0/1     Init:CrashLoopBackOff   4          6m

$ oc logs init-fail-test
1
1
1
1
2
4
5
5
6
6
6
7

3. We see that the pod and main container never get restarted, but the init container is looping and increasing the restart count on the pod.

# docker ps -a | grep init
cdaeb00eef23 registry.redhat.io/rhel7/rhel@sha256:7bca0d15377805f6b3bbb79d28325da8dd90f4eeef6fb463c0973d05257e5d03 "/bin/bash -c '#!/..." 12 seconds ago Exited (1) 4 seconds ago k8s_inittest_init-fail-test_openshift-sdn_57eee915-01ad-11ea-ae27-fa163e05678f_5
918fabe7c867 registry.redhat.io/rhel7/rhel@sha256:7bca0d15377805f6b3bbb79d28325da8dd90f4eeef6fb463c0973d05257e5d03 "/bin/sh -ec 'whil..." 7 minutes ago Up 7 minutes k8s_my-container_init-fail-test_openshift-sdn_57eee915-01ad-11ea-ae27-fa163e05678f_0
b768406fb266 registry.redhat.io/openshift3/ose-pod:v3.11.153 "/usr/bin/pod" 7 minutes ago Up 7 minutes k8s_POD_init-fail-test_openshift-sdn_57eee915-01ad-11ea-ae27-fa163e05678f_0
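Until a fix is available, one possible mitigation on the pod side is to make the init script tolerate being rerun. The following is only a sketch of that idea, not a fix for the kubelet behavior: it assumes the init work can be guarded by a marker file on the same emptyDir volume, which survives container restarts within the pod. The function name `run_init_once` and the marker file `.init-done` are hypothetical, and the appended line stands in for whatever one-time work the real init container does.

```shell
#!/bin/sh
set -eu

# run_init_once DIR
# Hypothetical helper: performs the one-time work in DIR unless a marker
# file from an earlier successful run is already present.
run_init_once() {
  init_dir="$1"
  init_marker="${init_dir}/.init-done"   # hypothetical marker file name

  if [ -f "${init_marker}" ]; then
    # A previous run already finished; exit cleanly so a spurious rerun
    # does not put the pod into Init:CrashLoopBackOff.
    echo "init already completed, skipping"
    return 0
  fi

  # Placeholder for the real one-time work (mirrors reproducer [1]).
  echo running >> "${init_dir}/test"

  # Create the marker last, so an interrupted run is retried next time.
  touch "${init_marker}"
}
```

In the init container this would be called as `run_init_once /mnt/data`. A second invocation then returns 0 without repeating the work, so the rerun still happens but is harmless.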
Hi Tony, Thanks for checking in on this. We would like to fix it but the fix may be complicated and difficult to backport to 3.11. I'm planning to spend some time working on a fix for 4.6, then assessing feasibility of a backport to 3.11.z.
This issue is hit with Kubelet GC, but can be reproduced with any dead container cleanup.
Can you please let me know if this bug is fixed? If yes, in which OC release?
An interesting side effect of this is that kubelet-controlled volume mounts like /etc/hosts get remounted, so any changes to this file get reverted. Not sure what the effect of this is with other volumes.

==================================
# oc create -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: inittest
    image: "registry.redhat.io/rhel7/rhel"
    command: ["bin/sh", "-ec", "echo running >> /mnt/data/test"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "registry.redhat.io/rhel7/rhel"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
  volumes:
  - name: my-volume
    emptyDir: {}
EOF

[root@node-2 quicklab]# docker ps -a | grep my-pod
faa9efd5dae8 registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "/bin/sh -ec 'ls /..." 46 seconds ago Up 45 seconds k8s_my-container_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
8430f48640f9 registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "bin/sh -ec 'echo ..." 49 seconds ago Exited (0) 48 seconds ago k8s_inittest_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
c6a2f318e010 registry.redhat.io/openshift3/ose-pod:v3.11.248 "/usr/bin/pod" About a minute ago Up About a minute k8s_POD_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0

[root@node-2 quicklab]# docker exec -u 0 -it faa9efd5dae8 /bin/sh
sh-4.2# echo "1.1.1.1 test" >> /etc/hosts
sh-4.2# exit

[root@node-2 quicklab]# docker exec -u 0 -it faa9efd5dae8 /bin/cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.129.2.4 my-pod
1.1.1.1 test

[root@node-2 quicklab]# docker rm 8430f48640f9
8430f48640f9

[root@node-2 quicklab]# docker ps -a | grep my-pod
faa9efd5dae8 registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "/bin/sh -ec 'ls /..." 3 minutes ago Up 3 minutes k8s_my-container_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
c6a2f318e010 registry.redhat.io/openshift3/ose-pod:v3.11.248 "/usr/bin/pod" 3 minutes ago Up 3 minutes k8s_POD_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0

[root@node-2 quicklab]# docker ps -a | grep my-pod
faa9efd5dae8 registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "/bin/sh -ec 'ls /..." 3 minutes ago Up 3 minutes k8s_my-container_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
c6a2f318e010 registry.redhat.io/openshift3/ose-pod:v3.11.248 "/usr/bin/pod" 4 minutes ago Up 4 minutes k8s_POD_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0

[root@node-2 quicklab]# docker ps -a | grep my-pod
f7e5a172329d registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "bin/sh -ec 'echo ..." 2 seconds ago Exited (0) Less than a second ago k8s_inittest_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
faa9efd5dae8 registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "/bin/sh -ec 'ls /..." 3 minutes ago Up 3 minutes k8s_my-container_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
c6a2f318e010 registry.redhat.io/openshift3/ose-pod:v3.11.248 "/usr/bin/pod" 4 minutes ago Up 4 minutes k8s_POD_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0

[root@node-2 quicklab]# docker ps -a | grep my-pod
f7e5a172329d registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "bin/sh -ec 'echo ..." 7 seconds ago Exited (0) 6 seconds ago k8s_inittest_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
faa9efd5dae8 registry.redhat.io/rhel7/rhel@sha256:6dabf4a152c6f209b1fbbfc59dc684fb327e3df7d24410c696ec6c5388f4f33a "/bin/sh -ec 'ls /..." 3 minutes ago Up 3 minutes k8s_my-container_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0
c6a2f318e010 registry.redhat.io/openshift3/ose-pod:v3.11.248 "/usr/bin/pod" 4 minutes ago Up 4 minutes k8s_POD_my-pod_default_e0e2a1e6-d82e-11ea-8c23-fa163e318e7f_0

[root@node-2 quicklab]# docker exec -u 0 -it faa9efd5dae8 /bin/cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.129.2.4 my-pod
=============================
To Ruchi,

I'm sorry, this bug is not fixed yet.

To Ryan,

/etc/hosts is managed by the Kubelet and it is rewritten every time a container in the pod is started. That's what this comment in the file means:

# Kubernetes-managed hosts file.

A pod shouldn't have any expectation that any changes to that file will remain. If a pod needs to add entries to that file, it should use hostAliases. More info about that in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1860201#c12
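For reference, a minimal sketch of the hostAliases approach, reusing the "1.1.1.1 test" entry from the experiment above. The pod name `hostalias-test` and the manifest path are assumptions made for illustration; `hostAliases` with `ip` and `hostnames` is the standard pod spec field. The manifest is written to a file first so it can be reviewed before it is created.

```shell
# Hypothetical location for the manifest; adjust as needed.
manifest="${TMPDIR:-/tmp}/hostalias-test.yaml"

# The quoted 'EOF' keeps the shell from expanding anything in the YAML.
cat > "${manifest}" << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hostalias-test
spec:
  hostAliases:
  - ip: "1.1.1.1"
    hostnames:
    - "test"
  containers:
  - name: my-container
    image: "registry.redhat.io/rhel7/rhel"
    command: ["/bin/sh", "-ec", "cat /etc/hosts; sleep 999999"]
EOF

# To create the pod on a cluster:
#   oc create -f "${manifest}"
```

Because the entry is part of the pod spec, the Kubelet should include it each time it regenerates /etc/hosts, rather than dropping it as it does with manual edits.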
Verified on version 4.7.0-0.nightly-2021-01-06-055910: after removing the exited init container, the init container did not restart any more.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days