According to the documentation [0] (see: "Container Execution Checks"), the timeout should not have any impact on the probes:

>> The timeoutSeconds parameter has no effect on the readiness and liveness probes for Container Execution Checks.

Using the DeploymentConfig [1], we can see that under a Docker-based cluster the probes run as expected. Under CRI-O, however, the timeout does have an impact.

Running [1] in a CRI-O based cluster, we see the following behaviors:

- Liveness timeouts increase every second.
- From inside the running container, you can watch the 60-second sleep command start and get killed 1 second later.
- The container is not being restarted and is accepting traffic, even though the liveness probe is reporting as failing.
- Setting timeoutSeconds to above 60 seconds stops the probe failures.

This behavior has been confirmed on 3.11 and 4.2 clusters.

Expected: the container running under CRI-O should behave the same as under Docker (ignore the timeoutSeconds setting). Otherwise, the documentation and examples/templates should be updated to match the behavior.
See previous bz: https://bugzilla.redhat.com/show_bug.cgi?id=1549683
and RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1668047

[0] https://docs.openshift.com/container-platform/3.11/dev_guide/application_health.html#container-health-checks-using-probes

[1]
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  labels: {}
  name: test-timeout
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    deploymentConfig: test-timeout
  strategy:
    activeDeadlineSeconds: 21600
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        application: test-timeout
        deploymentConfig: test-timeout
      name: test-timeout
    spec:
      containers:
        - command:
            - /bin/sh
            - '-c'
            - sleep 86400
          image: busybox
          imagePullPolicy: Always
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - '-c'
                - sleep 60
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: test-timeout
          ports: {}
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 75
  test: false
  triggers:
    - type: ConfigChange
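For readers who want to reproduce the "started, then killed 1 second later" behavior outside a cluster, here is a minimal Python sketch of an exec-style check with an enforced timeout. It is only an illustration and not conmon's or CRI-O's implementation; the function name run_exec_probe and the result labels "success", "failure" and "timeout" are hypothetical.

```python
import subprocess


def run_exec_probe(command, timeout_seconds):
    """Run `command` and classify the outcome of an exec-style check.

    Hypothetical helper for illustration only:
      "success" - the command exited with status 0
      "failure" - the command exited with a non-zero status
      "timeout" - the command was still running after timeout_seconds
    """
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_seconds,      # the child is killed when this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "success" if completed.returncode == 0 else "failure"
```

With the values from [1], run_exec_probe(["sleep", "60"], 1) returns "timeout" after about one second, while a timeout above 60 seconds lets the sleep finish and returns "success". That matches the observation that raising timeoutSeconds above 60 stops the probe failures.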
Peter, another one for the 4.3 deadline.
The issue is in conmon. I've opened a PR here: https://github.com/containers/conmon/pull/95
conmon-2.0.6 containing this fix is now built for rhaos-4.3-rhel-8.
Checked with version: 4.3.0-0.nightly-2019-12-26-101933

$ oc describe pod test-timeout-1-nq2qb
...
Events:
  Type     Reason     Age                 From                                                     Message
  ----     ------     ----                ----                                                     -------
  Normal   Scheduled  <unknown>           default-scheduler                                        Successfully assigned default/test-timeout-1-nq2qb to ip-10-0-157-73.ap-northeast-1.compute.internal
  Normal   Pulling    2m29s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Pulling image "busybox"
  Normal   Pulled     2m21s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Successfully pulled image "busybox"
  Normal   Created    2m21s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Created container test-timeout
  Normal   Started    2m21s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Started container test-timeout
  Warning  Unhealthy  4s (x13 over 2m4s)  kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Liveness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1

The container is not being restarted even though the liveness probe is reporting as failing. If timeoutSeconds is set to above 60 seconds, the probe does not report failures.
Opened https://github.com/cri-o/cri-o/pull/3065 for fixing the issue of timeouts not resulting in restarts.
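The events above show the probe being reported as "errored" 13 times without a restart. The Python sketch below is one simplified way to model that, under the assumption that results surfaced as errors are discarded and do not count toward failureThreshold. It is not the kubelet's or CRI-O's actual code, and the function name needs_restart and the result labels are hypothetical.

```python
def needs_restart(results, failure_threshold):
    """Return True once `failure_threshold` consecutive failures are seen.

    Simplified, hypothetical model:
      "success" resets the consecutive-failure counter
      "failure" increments it
      "error"   is discarded (assumption: errored probes are not counted)
    """
    consecutive_failures = 0
    for result in results:
        if result == "error":
            continue  # thrown away: neither counted nor resetting the counter
        if result == "failure":
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
        else:
            consecutive_failures = 0
    return False
```

Under this model, needs_restart(["error"] * 13, 3) is False, which matches the pod not being restarted, while needs_restart(["failure"] * 3, 3) is True. A timeout therefore only leads to a restart if it is reported as a failure rather than as an error.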
I'm setting the target release and current issue as 3.11.z, as we have fixed it in the 4.x series, and the only remaining open portions of this bug are in 3.11.
Fixed on the version below.

$ oc version
oc v3.11.219
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://juzhao-311master-etcd-nfs-1:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0

Using manifest [1] from the description, there is no such issue after waiting 6 minutes.

$ oc describe po test-timeout-1-ss8c5
...
Events:
  Type    Reason     Age   From                       Message
  ----    ------     ----  ----                       -------
  Normal  Scheduled  6m    default-scheduler          Successfully assigned default/test-timeout-1-ss8c5 to juzhao-311node-2
  Normal  Pulling    6m    kubelet, juzhao-311node-2  pulling image "busybox"
  Normal  Pulled     6m    kubelet, juzhao-311node-2  Successfully pulled image "busybox"
  Normal  Created    6m    kubelet, juzhao-311node-2  Created container
  Normal  Started    6m    kubelet, juzhao-311node-2  Started container
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2215