Bug 1774184 - [conmon] Liveness probes timeout unexpectedly

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Jindrich Novy
QA Contact:	Weinan Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1186913

Reported:	2019-11-19 19:03 UTC by Brian Jarvis
Modified:	2023-09-07 21:02 UTC
CC List:	13 users
Fixed In Version:	conmon-2.0.8-1.el7.x86_64 cri-o-1.11.16-0.9.dev.rhaos3.11.git6d43aae.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-28 05:44:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2215	0	None	None	None	2020-05-28 05:44:30 UTC

Description Brian Jarvis 2019-11-19 19:03:11 UTC

According to the documentation [0] (see "Container Execution Checks"), the timeout should not have any impact on the probes:
>> The timeoutSeconds parameter has no effect on the readiness and liveness probes for Container Execution Checks.

Using the DeploymentConfig [1], we can see that on a Docker-based cluster the probes run as expected.
On CRI-O, however, the timeout does have an impact. Running [1] in a CRI-O-based cluster, we see the following behaviors:
  - Liveness timeouts increase every second.
  - From inside the running container, you can watch the 60-second sleep command start and get killed 1 second later.
  - The container is not restarted and keeps accepting traffic, even though the liveness probe is reported as failing.
  - Setting timeoutSeconds above 60 seconds stops the probe failures.
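
The last observation can be sketched as a change to the livenessProbe stanza of manifest [1]. This is only an illustration of the reported behavior on affected CRI-O nodes, not a recommended fix. The value 61 is an assumption picked for the example; the report only states that a value above 60 seconds stops the failures.

```yaml
# Fragment of spec.template.spec.containers[0] from manifest [1].
# Only timeoutSeconds differs from the original reproducer.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - '-c'
      - sleep 60
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  # Assumed value: anything above the 60 s that the probe command sleeps.
  timeoutSeconds: 61
```

With the original timeoutSeconds: 1, the same probe command is killed after about one second under CRI-O, which matches the second observation in the list.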

This behavior has been confirmed on 3.11 and 4.2 clusters.

Expected: a container running under CRI-O behaves the same as under Docker (the timeoutSeconds setting is ignored). Otherwise, the documentation and examples/templates should be updated to match the behavior.

See previous bz: https://bugzilla.redhat.com/show_bug.cgi?id=1549683
and RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1668047


[0] https://docs.openshift.com/container-platform/3.11/dev_guide/application_health.html#container-health-checks-using-probes
[1] 
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  labels: {}
  name: test-timeout
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    deploymentConfig: test-timeout
  strategy:
    activeDeadlineSeconds: 21600
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        application: test-timeout
        deploymentConfig: test-timeout
      name: test-timeout
    spec:
      containers:
        - command:
            - /bin/sh
            - '-c'
            - sleep 86400
          image: busybox
          imagePullPolicy: Always
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - '-c'
                - sleep 60
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: test-timeout
          ports: {}
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 75
  test: false
  triggers:
    - type: ConfigChange

Comment 1 Tom Sweeney 2019-11-22 20:15:23 UTC

Peter, another one for the 4.3 deadline.

Comment 2 Giuseppe Scrivano 2019-12-11 13:15:01 UTC

The issue is in conmon. I've opened a PR here: https://github.com/containers/conmon/pull/95

Comment 3 Jindrich Novy 2019-12-12 13:17:11 UTC

conmon-2.0.6 containing this fix is now built for rhaos-4.3-rhel-8.

Comment 8 MinLi 2019-12-27 09:06:53 UTC

Checked with version: 4.3.0-0.nightly-2019-12-26-101933

$ oc describe pod test-timeout-1-nq2qb
...
Events:
  Type     Reason     Age                 From                                                     Message
  ----     ------     ----                ----                                                     -------
  Normal   Scheduled  <unknown>           default-scheduler                                        Successfully assigned default/test-timeout-1-nq2qb to ip-10-0-157-73.ap-northeast-1.compute.internal
  Normal   Pulling    2m29s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Pulling image "busybox"
  Normal   Pulled     2m21s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Successfully pulled image "busybox"
  Normal   Created    2m21s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Created container test-timeout
  Normal   Started    2m21s               kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Started container test-timeout
  Warning  Unhealthy  4s (x13 over 2m4s)  kubelet, ip-10-0-157-73.ap-northeast-1.compute.internal  Liveness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1

The container is not being restarted even though the liveness probe is reporting as failing. If timeoutSeconds is set above 60 seconds, the probe does not report failures.

Comment 9 Mrunal Patel 2020-01-03 23:23:31 UTC

Opened https://github.com/cri-o/cri-o/pull/3065 for fixing the issue of timeouts not resulting in restarts.

Comment 18 Peter Hunt 2020-04-06 19:45:40 UTC

I'm setting the target release and current issue as 3.11.z, as we have fixed it in the 4.x series, and the only remaining open portions of this bug are in 3.11.

Comment 21 Weinan Liu 2020-05-25 15:33:29 UTC

Fixed in the version below.
$ oc version
oc v3.11.219
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://juzhao-311master-etcd-nfs-1:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0


Using manifest [1] from the description, the issue did not occur after waiting 6 minutes:

$ oc describe po test-timeout-1-ss8c5

...
Events:
  Type    Reason     Age   From                       Message
  ----    ------     ----  ----                       -------
  Normal  Scheduled  6m    default-scheduler          Successfully assigned default/test-timeout-1-ss8c5 to juzhao-311node-2
  Normal  Pulling    6m    kubelet, juzhao-311node-2  pulling image "busybox"
  Normal  Pulled     6m    kubelet, juzhao-311node-2  Successfully pulled image "busybox"
  Normal  Created    6m    kubelet, juzhao-311node-2  Created container
  Normal  Started    6m    kubelet, juzhao-311node-2  Started container

...

Comment 23 errata-xmlrpc 2020-05-28 05:44:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2215
