Bug 1779916
| Summary: | dns operator exits with code 255 on sigterm, should be 0 on safe exit | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Node | Assignee: | Mrunal Patel <mpatel> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aos-bugs, dcbw, dmace, jokerman, mpatel, pehunt, weinliu, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:17:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2019-12-05 01:22:18 UTC
Still researching, but here's what I've found so far.
Using an example[1], the problems start during a node reboot. According to the node journal, the reboot begins:
Dec 05 00:02:51 ip-10-0-148-144 systemd-logind[1877]: System is rebooting.
Dec 05 00:02:51 ip-10-0-148-144 systemd[1]: Stopping Kubernetes Kubelet...
Dec 05 00:04:22 ip-10-0-148-144 systemd[1]: Starting Reboot...
Dec 05 00:04:22 ip-10-0-148-144 systemd[1]: Shutting down.
Dec 05 00:04:22 ip-10-0-148-144 systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 05 00:04:22 ip-10-0-148-144 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 05 00:04:22 ip-10-0-148-144 systemd-journald[1188]: Journal stopped
Dec 05 00:05:04 localhost NetworkManager[1848]: <info> [1575504304.6470] policy: set-hostname: set hostname to 'ip-10-0-148-144' (from DHCPv4)
Dec 05 00:05:07 ip-10-0-148-144 systemd[1]: Starting Kubernetes Kubelet...
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: I1205 00:05:09.313116 2353 kubelet_pods.go:1328] Generating status for "dns-default-4wmcw_openshift-dns(fa587656-1002-4986-bf72-62b90f9aa5a4)"
During shutdown, crio reports the 'dns' container stopping:
Dec 05 00:02:52 ip-10-0-148-144 systemd[1]: Stopped libcontainer container 7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c.
Dec 05 00:02:52 ip-10-0-148-144 systemd[1]: crio-7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c.scope: Consumed 642ms CPU time
After the reboot, PLEG (via Kubelet) syncs the last known status for that container, and reports that the 'dns' container exited 255 (which is suspicious, because the coredns process logs graceful shutdown in response to a TERM):
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: I1205 00:05:09.352840 2353 kubelet.go:1953] SyncLoop (PLEG): "dns-default-4wmcw_openshift-dns(fa587656-1002-4986-bf72-62b90f9aa5a4)", event: &pleg.PodLifecycleEvent{ID:"fa587656-1002-4986-bf72-62b90f9aa5a4", Type:"ContainerDied", Data:"7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c"}
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: I1205 00:05:09.517234 2353 status_manager.go:561] Patch status for pod "dns-default-4wmcw_openshift-dns(fa587656-1002-4986-bf72-62b90f9aa5a4)" with "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"message\":\"containers with unready status: [dns dns-node-resolver]\",\"reason\":\"ContainersNotReady\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2019-12-05T00:05:09Z\",\"message\":\"containers with unready status: [dns dns-node-resolver]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c\",\"image\":\"registry.svc.ci.openshift.org/ocp/4.3-2019-12-04-224546@sha256:6ae4883e50c3b307c7b6dc301057e5edd94e18318f608da3a8090985f5e53f9c\",\"imageID\":\"registry.svc.ci.openshift.org/ocp/4.3-2019-12-04-224546@sha256:6ae4883e50c3b307c7b6dc301057e5edd94e18318f608da3a8090985f5e53f9c\",\"lastState\":{},\"name\":\"dns\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c\",\"exitCode\":255,\"finishedAt\":\"2019-12-05T00:05:06Z\",\"message\":\".:5353\\n2019-12-04T23:59:26.072Z [INFO] plugin/reload: Running configuration MD5 = acefebac40c697acb35b9e96ca3c7ec9\\n2019-12-04T23:59:26.072Z [INFO] CoreDNS-1.5.2\\n2019-12-04T23:59:26.072Z [INFO] linux/amd64, go1.12.9, \\nCoreDNS-1.5.2\\nlinux/amd64, go1.12.9, \\n[INFO] SIGTERM: Shutting down servers then terminating\\n\",\"reason\":\"Error\",\"startedAt\":\"2019-12-04T23:59:25Z\"}}},{\"containerID\":\"cri-o://0bf66b6910fc3082754ec9e0924bf4abb351fadcb7137dcd3d3e691010542444\",\"image\":\"registry.svc.ci.openshift.org/ocp/4.3-2019-12-04-224546@sha256:3d491d61aaeb329980c631ada69aacad350bfa2999cd31da92cb373e877ffcfb\",\"imageID\":\"registry.svc.ci.openshift.org/ocp/4.3-2019-12-04-224546@sha256:3d491d61aaeb329980c631ada69aacad350bfa2999cd31da92cb373e877ffcfb\",\"lastState\":{},\"name\":\"dns-node-resolver\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://0bf66b6910fc3082754ec9e0924bf4abb351fadcb7137dcd3d3e691010542444\",\"exitCode\":255,\"finishedAt\":\"2019-12-05T00:05:07Z\",\"message\":\"/bin/bash: line 1: kill: (40) - No such process\\n\",\"reason\":\"Error\",\"startedAt\":\"2019-12-04T23:59:30Z\"}}}],\"podIP\":null,\"podIPs\":null}}"
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: I1205 00:05:09.517523 2353 status_manager.go:568] Status for pod "dns-default-4wmcw_openshift-dns(fa587656-1002-4986-bf72-62b90f9aa5a4)" updated successfully: (1, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-12-04 23:59:13 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-12-05 00:04:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [dns dns-node-resolver]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-12-05 00:05:09 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [dns dns-node-resolver]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-12-04 23:59:13 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.148.144 PodIP: PodIPs:[] StartTime:2019-12-04 23:59:13 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:dns State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:.:5353
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: 2019-12-04T23:59:26.072Z [INFO] plugin/reload: Running configuration MD5 = acefebac40c697acb35b9e96ca3c7ec9
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: 2019-12-04T23:59:26.072Z [INFO] CoreDNS-1.5.2
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: 2019-12-04T23:59:26.072Z [INFO] linux/amd64, go1.12.9,
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: CoreDNS-1.5.2
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: linux/amd64, go1.12.9,
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: [INFO] SIGTERM: Shutting down servers then terminating
Dec 05 00:05:09 ip-10-0-148-144 hyperkube[2353]: ,StartedAt:2019-12-04 23:59:25 +0000 UTC,FinishedAt:2019-12-05 00:05:06 +0000 UTC,ContainerID:cri-o://7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.svc.ci.openshift.org/ocp/4.3-2019-12-04-224546@sha256:6ae4883e50c3b307c7b6dc301057e5edd94e18318f608da3a8090985f5e53f9c ImageID:registry.svc.ci.openshift.org/ocp/4.3-2019-12-04-224546@sha256:6ae4883e50c3b307c7b6dc301057e5edd94e18318f608da3a8090985f5e53f9c ContainerID:cri-o://7acdba1ed8da67ca6e63337edd566ee4667a08295cc96b49a92390babed7ae5c Started:0xc00100fec0} {Name:dns-node-resolver State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:/bin/bash: line 1: kill: (40) - No such process
Unfortunately, because of the log level of crio, I don't have crio's log of the container status to check against PLEG's logs.
If I issue a `crictl stop` to the same kind of dns container on a new cluster with crio logging set to 'debug', I can see the last container status (in the CONTAINER_EXITED state) shows exit code 0.
Still working on a reproducer.
[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12064
> If I issue a `crictl stop` to the same kind of dns container on a new cluster with crio logging set to 'debug', I can see the last container status (in the CONTAINER_EXITED state) shows exit code 0.
Also in this case, I can see the Kubelet records ContainerDied with an exit code of 0.
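For the record, the manual check on the node looks roughly like this (a sketch; the label filter and container name are illustrative):

# find the running coredns container in the openshift-dns namespace
cid=$(crictl ps -q --name dns --label io.kubernetes.pod.namespace=openshift-dns | head -1)
# stop it with the normal grace period, then read back the recorded state
crictl stop "$cid"
crictl inspect "$cid" | grep -E 'CONTAINER_EXITED|exitCode'
# expected: state CONTAINER_EXITED with exitCode 0 when the TERM is handled gracefully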
I can reproduce this by issuing a `reboot` on the node. Containers are torn down, e.g.:
Dec 06 00:55:03 ip-10-0-130-123 systemd[1]: Stopped libcontainer container d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef.
Dec 06 00:55:03 ip-10-0-130-123 systemd[1]: crio-d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef.scope: Consumed 21.833s CPU time
Dec 06 00:55:04 ip-10-0-130-123 systemd[1]: crio-conmon-d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef.scope: Consumed 49ms CPU time
crio observes the containers exiting:
Dec 06 00:55:04 ip-10-0-130-123 crio[2414276]: time="2019-12-06 00:55:04.563771362Z" level=debug msg="event: \"/var/run/crio/exits/d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef.PEE2B0\": RENAME" file="server/server.go:687"
Dec 06 00:55:04 ip-10-0-130-123 crio[2414276]: time="2019-12-06 00:55:04.563871375Z" level=debug msg="event: \"/var/run/crio/exits/d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef\": CREATE" file="server/server.go:687"
Dec 06 00:55:04 ip-10-0-130-123 crio[2414276]: time="2019-12-06 00:55:04.563905509Z" level=debug msg="container or sandbox exited: d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef" file="server/server.go:690"
Dec 06 00:55:04 ip-10-0-130-123 crio[2414276]: time="2019-12-06 00:55:04.563935623Z" level=debug msg="container exited and found: d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef" file="server/server.go:693"
I didn't catch the exit files, which I suspect will have the state I want. Interesting that there's no logging about crio itself stopping the container. My understanding is that systemd is signaling the container PIDs directly.
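If anyone wants to catch those exit files, they should be visible under /var/run/crio/exits while the containers are going down; each file is named after the container ID and, as I understand it, contains the numeric exit status. A rough sketch (the timing during a reboot is the hard part):

# watch the exits directory while the containers are being torn down
ls -l /var/run/crio/exits/
cat /var/run/crio/exits/d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef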
Upon reboot, crio reports a 255 status for the container's last state (but no message...):
Dec 06 00:55:55 ip-10-0-130-123 crio[1527]: time="2019-12-06 00:55:55.394577936Z" level=debug msg="request: &ContainerStatusRequest{ContainerId:d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef,Verbose:false,}" file="v1alpha2/api.pb.go:7798" id=fa272cc6-2dca-4c9c-9283-88a669d9c817
Dec 06 00:55:55 ip-10-0-130-123 crio[1527]: time="2019-12-06 00:55:55.394710442Z" level=debug msg="response: &ContainerStatusResponse{Status:&ContainerStatus{Id:d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef,Metadata:&ContainerMetadata{Name:dns,Attempt:1,},State:CONTAINER_EXITED,CreatedAt:1575585108224674534,StartedAt:1575585108278984847,FinishedAt:1575593752845524914,ExitCode:255,Image:&ImageSpec{Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:234cb579f88f53781e547f202152eb273cf569073818fdeb4ba7cd7086071065,},ImageRef:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:234cb579f88f53781e547f202152eb273cf569073818fdeb4ba7cd7086071065,Reason:Error,Message:,Labels:map[string]string{io.kubernetes.container.name: dns,io.kubernetes.pod.name: dns-default-d8dnh,io.kubernetes.pod.namespace: openshift-dns,io.kubernetes.pod.uid: 9ea08485-9593-40a6-90a7-c5d37ccdda0d,},Annotations:map[string]string{io.kubernetes.container.hash: 4d61f4f6,io.kubernetes.container.ports: [{\"name\":\"dns\",\"containerPort\":5353,\"protocol\":\"UDP\"},{\"name\":\"dns-tcp\",\"containerPort\":5353,\"protocol\":\"TCP\"},{\"name\":\"metrics\",\"containerPort\":9153,\"protocol\":\"TCP\"}],io.kubernetes.container.restartCount: 1,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: FallbackToLogsOnError,io.kubernetes.pod.terminationGracePeriod: 30,},Mounts:[]*Mount{},LogPath:/var/log/pods/openshift-dns_dns-default-d8dnh_9ea08485-9593-40a6-90a7-c5d37ccdda0d/dns/1.log,},Info:map[string]string{},}" file="v1alpha2/api.pb.go:7798" id=fa272cc6-2dca-4c9c-9283-88a669d9c817
The kubelet posts a status update for the last container state, propagating the 255 (and now the message absent from the response log is back):
Dec 06 00:55:56 ip-10-0-130-123 hyperkube[1824]: I1206 00:55:56.175479 1824 status_manager.go:561] Patch status for pod "dns-default-d8dnh_openshift-dns(9ea08485-9593-40a6-90a7-c5d37ccdda0d)" with "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"message\":\"containers with unready status: [dns dns-node-resolver]\",\"reason\":\"ContainersNotReady\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2019-12-06T00:55:55Z\",\"message\":\"containers with unready status: [dns dns-node-resolver]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef\",\"image\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:234cb579f88f53781e547f202152eb273cf569073818fdeb4ba7cd7086071065\",\"imageID\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:234cb579f88f53781e547f202152eb273cf569073818fdeb4ba7cd7086071065\",\"lastState\":{},\"name\":\"dns\",\"ready\":false,\"restartCount\":1,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://d5368d03283b3a7fb2acc7a1b7ef01ca75da4c0beaeaadeb91a6145942ec37ef\",\"exitCode\":255,\"finishedAt\":\"2019-12-06T00:55:52Z\",\"message\":\".:5353\\n2019-12-05T22:31:48.396Z [INFO] plugin/reload: Running configuration MD5 = acefebac40c697acb35b9e96ca3c7ec9\\n2019-12-05T22:31:48.396Z [INFO] CoreDNS-1.5.2\\n2019-12-05T22:31:48.396Z [INFO] linux/amd64, go1.12.12, \\nCoreDNS-1.5.2\\nlinux/amd64, go1.12.12, \\n[INFO] SIGTERM: Shutting down servers then terminating\\n\",\"reason\":\"Error\",\"startedAt\":\"2019-12-05T22:31:48Z\"}}},{\"containerID\":\"cri-o://6c8bd0b7b2a37c4dc2aea2d1dbfa5ce6164920b6fd729d5a0dceadb90c1d0a3f\",\"image\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c51252d457f4975303dee5edcb55564cd5d9304b98b965288cb0ed94eb0ee8a\",\"imageID\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c51252d457f4975303dee5edcb55564cd5d9304b98b965288cb0ed94eb0ee8a\",\"lastState\":{},\"name\":\"dns-node-resolver\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://6c8bd0b7b2a37c4dc2aea2d1dbfa5ce6164920b6fd729d5a0dceadb90c1d0a3f\",\"exitCode\":255,\"finishedAt\":\"2019-12-06T00:55:52Z\",\"message\":\"kill: sending signal to 2454 failed: No such process\\n\",\"reason\":\"Error\",\"startedAt\":\"2019-12-05T19:49:54Z\"}}}],\"podIP\":null,\"podIPs\":null}}"
If I TERM the container pid directly, I get an exit code 0 from crictl inspect immediately after, and kubelet also receives the 0 when updating status upon noticing container death. If I KILL the pid, I don't get the coredns graceful shutdown log messages and the return code is 137 (so it's not clear whether Systemd is issuing a KILL on shutdown to these processes).
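For completeness, the signal comparison above looks roughly like this (a sketch; the pid comes from crictl's verbose info section, and jq is assumed to be available on the node):

cid=$(crictl ps -q --name dns --label io.kubernetes.pod.namespace=openshift-dns | head -1)
pid=$(crictl inspect "$cid" | jq -r .info.pid)
kill -TERM "$pid"    # coredns logs a graceful shutdown; recorded exit code is 0
# versus, against a fresh container:
kill -KILL "$pid"    # no graceful shutdown logs; recorded exit code is 137 (128+9)
crictl inspect "$cid" | grep exitCode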
> I didn't catch the exit files, which I suspect will have the state I want. Interesting that there's no logging about crio itself stopping the container. My understanding is that systemd is signaling the container PIDs directly.
I bet this is:
* systemd TERMs everyone.
* conmon and the container processes race each other to the grave.
* When conmon dies first, or maybe in all cases, since the TERM is likely to land before conmon gets much done, conmon doesn't fully reap and report the container process's exit code.
We should have conmon catch TERM and wait an appropriate amount of time to see if it can gracefully reap the container process. Or maybe, because everyone's about to die anyway, we don't care? If we do care, we should tell the kubelet / CRI-O to gracefully terminate containers before we tell systemd to stop everything (which was, I thought, what the machine-config daemon did when rebooting?).
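A sketch of that "let CRI-O stop things first" ordering, run before the reboot (illustrative only, not a claim about what the machine-config daemon actually does):

# ask CRI-O to stop every running container with a grace period, so conmon
# can reap them and write exit files before systemd starts killing processes
crictl stop --timeout 30 $(crictl ps -q)
# then stop the pod sandboxes and reboot
crictl stopp $(crictl pods -q)
systemctl reboot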
conmon sets on_sig_exit for TERM [1]. I'm not familiar enough with C/conmon to understand what on_sig_exit is doing [2] ;).

[1]: https://github.com/containers/conmon/blob/df8c6aa516b22ca8fd28903024a90d9093d383e5/src/conmon.c#L1572
[2]: https://github.com/containers/conmon/blob/df8c6aa516b22ca8fd28903024a90d9093d383e5/src/conmon.c#L388-L404

Maybe it's just forwarding the signal to possible children. Then the raise(SIGUSR1) pops us over to on_sigusr1_cb [1], which runs check_child_processes [2]. Which... checks some stuff [3]? Probably loops as long as non-blocking waitpid calls reap stuff. Then on_sigusr1_cb returns G_SOURCE_CONTINUE. Not sure what that's about yet. But still, I don't see anything under on_sig_exit that looks like "and now conmon gracefully shuts down". So now I'm less convinced about my theory than previously ;).

[1]: https://github.com/containers/conmon/blob/df8c6aa516b22ca8fd28903024a90d9093d383e5/src/conmon.c#L1588
[2]: https://github.com/containers/conmon/blob/df8c6aa516b22ca8fd28903024a90d9093d383e5/src/conmon.c#L446
[3]: https://github.com/containers/conmon/blob/df8c6aa516b22ca8fd28903024a90d9093d383e5/src/conmon.c#L408-L441

With Mrunal and Trevor's help, I've confirmed the issue affects DaemonSets generally. Some example DaemonSets that are scheduled across nodes by default and are easy to test:

* openshift/dns
* openshift-monitoring/node-exporter
* openshift-image-registry/node-ca

Find the container ID for any of them, and issue a `reboot` on the node. Looking at the tail of the previous boot's journal, systemd reports signaling these containers for shutdown, but nothing in the journal indicates that an exit file is created before the shutdown finishes. Looking at the head of the current boot's journal (following the reboot), crio reports the status code for those containers as 255 (which Mrunal has suggested could indicate the absence of an exit file with specific exit details). Oddly, the stdout from the container shutdowns is somehow preserved. The final outcome is that crio tells the Kubelet the last container state was exit=255 and message=<stdout indicating graceful shutdown in response to TERM>.

So far, the absence of exit state files for the DaemonSet containers is the significant mystery. DaemonSet container handling generally seems to differ from that of other containers, so it's not yet clear to me what the lifecycle expectations are, or how the shutdown sequence is deferred pending conmon writing exit files so the Kubelet can pick up status, etc.

Spoke with Mrunal and we agreed this should move over to the Node component. One impact to note: owners of these DaemonSets are implicated as having broken signal handling, when it seems that their signal handling is correct and that the reporting or lifecycling of the containers is wrong. If there are any other bug reports along these lines, we can mark them as dupes of this bug.

Here's a self-contained reproducer.
1. Create a DaemonSet that gets scheduled everywhere:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: reboot-test
  namespace: default
spec:
  selector:
    matchLabels:
      app: reboot-test
  template:
    metadata:
      labels:
        app: reboot-test
    spec:
      containers:
      - name: test
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c51252d457f4975303dee5edcb55564cd5d9304b98b965288cb0ed94eb0ee8a
        imagePullPolicy: IfNotPresent
        command:
        - /bin/bash
        - -c
        - |
          #!/bin/bash
          set -uo pipefail
          trap 'echo gracefully shutting down; exit 0' TERM
          while true; do
            sleep 1
          done
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
2. On a node, `kill -TERM` the container pid or `crictl stop` the container to convince yourself that the process exits 0 on TERM and that the exit code propagates correctly through the Kubelet (commands sketched after this list).
3. Issue `reboot`.
4. See the same problematic behavior (systemd kills the container on shutdown, no exit file produced, crio reports exit=255 as the container status after reboot).
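A rough command sequence for steps 2-4 (assuming the reboot-test pod landed on the node being debugged; the filters are illustrative):

# step 2: confirm clean TERM handling while the node is healthy
cid=$(crictl ps -q --name test --label io.kubernetes.pod.namespace=default | head -1)
crictl stop "$cid"
crictl inspect "$cid" | grep exitCode   # 0, and the container log shows "gracefully shutting down"
# step 3: reboot the node
systemctl reboot
# step 4: after the node comes back, the kubelet's status patch for the previous
# container reports exit=255 even though the workload shut down cleanly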
Opened PR with fixes https://github.com/cri-o/cri-o/pull/3041
Verified the fix on:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.0-0.nightly-2019-12-30-021042 True False 90m Cluster version is 4.3.0-0.nightly-2019-12-30-021042
1. Create the same reboot-test DaemonSet as in the reproducer above.
2. On a node (via `oc debug node/<node name>`), `crictl stop` the container.
3. Issue `reboot` on the node.
4. Check the journal after the reboot; the container status no longer reports exit code 255 (see the sketch below).
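For the journal check in step 4, something along these lines works (a sketch; grepping the kubelet unit is just one way to spot the propagated exit codes):

# on the rebooted node, look at the kubelet status patches for the test pod
journalctl -u kubelet --no-pager | grep reboot-test | grep exitCode
# with the fix, the terminated states carry exit code 0 rather than 255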
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062