Bug 1455056
| Summary: | failureThreshold does not reset after new container creation | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jaspreet Kaur <jkaur> |
| Component: | Node | Assignee: | Seth Jennings <sjenning> |
| Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.0 | CC: | aos-bugs, decarr, dma, jokerman, mark.vinkx, mmccomas, pdwyer, rromerom, smunilla |
| Target Milestone: | --- | Keywords: | TestCaseNeeded |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: If the pod restarts due to exceeding failureThreshold on a probe, the restarted pod is allowed only a single probe failure before being restarted again, regardless of the failureThreshold value. Consequence: Restarted pods do not get the expected number of probe attempts before being restarted. Fix: Reset the failure counter when the pod is restarted. Result: The restarted pod gets failureThreshold attempts for the probe to succeed. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-10 05:25:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
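The Doc Text above summarizes the fix: reset the probe failure counter whenever the container is restarted. The Go sketch below only illustrates that idea; the names (probeWorker, resultRun, observe) are invented for the example and do not reproduce the actual kubelet prober code.

package main

import "fmt"

// probeWorker is an illustrative stand-in for a kubelet-style prober worker.
// It tracks consecutive probe failures for the container it is watching.
type probeWorker struct {
    containerID      string // ID of the container currently being probed
    failureThreshold int    // failures tolerated before the container is killed
    resultRun        int    // consecutive failures seen for containerID
}

// observe records one probe result and reports whether the container should be
// killed. The essential part of the fix: when the container ID changes (i.e.
// the container was restarted), the failure counter is reset so the new
// instance gets a full failureThreshold of attempts.
func (w *probeWorker) observe(containerID string, success bool) (kill bool) {
    if containerID != w.containerID {
        // New container instance: start counting failures from zero again.
        w.containerID = containerID
        w.resultRun = 0
    }
    if success {
        w.resultRun = 0
        return false
    }
    w.resultRun++
    return w.resultRun >= w.failureThreshold
}

func main() {
    w := &probeWorker{failureThreshold: 3}
    // First container: killed only after the third consecutive failure.
    for i := 0; i < 3; i++ {
        fmt.Println("c1 kill?", w.observe("c1", false))
    }
    // Restarted container: without the reset, its first failure would already
    // report kill=true; with the reset it also gets three attempts.
    fmt.Println("c2 kill?", w.observe("c2", false))
}

Keying the reset off a change in container ID is what distinguishes "same container still failing" from "freshly restarted container", so each new instance starts with a clean failure count.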
Attempting to reproduce in kubernetes/master.

*** Bug 1457399 has been marked as a duplicate of this bug. ***

Verified on openshift v3.6.126.
Fixed.
pod-probe-fail.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command:
    - sleep
    - "3600"
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      timeoutSeconds: 1
      periodSeconds: 3
      successThreshold: 1
      failureThreshold: 10
  terminationGracePeriodSeconds: 0
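A rough timing expectation for this reproducer, derived from the values above: the container only runs sleep 3600, so every probe of port 8080 fails with connection refused (see the events further down). With the failure counter resetting correctly, each container instance should be killed only after roughly initialDelaySeconds + failureThreshold × periodSeconds = 3 + 10 × 3 ≈ 33 seconds of running, rather than after a single failed probe. The watch output below is consistent with that: the first Error appears at about 39 seconds of pod age, and the later restarts are spread over roughly nine minutes once crash-loop back-off is added.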
# oc create -f pod-probe-fail.yaml
pod "busybox" created
# oc get pods -w
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 0 8s
busybox 0/1 Error 0 39s
busybox 1/1 Running 1 43s
busybox 0/1 Error 1 1m
busybox 0/1 CrashLoopBackOff 1 1m
busybox 1/1 Running 2 1m
busybox 0/1 Error 2 1m
busybox 0/1 CrashLoopBackOff 2 2m
busybox 1/1 Running 3 2m
busybox 0/1 Error 3 2m
busybox 0/1 CrashLoopBackOff 3 3m
busybox 1/1 Running 4 3m
busybox 0/1 Error 4 4m
busybox 0/1 CrashLoopBackOff 4 4m
busybox 1/1 Running 5 5m
busybox 0/1 Error 5 6m
busybox 0/1 CrashLoopBackOff 5 6m
busybox 1/1 Running 6 9m
busybox 0/1 Error 6 9m
busybox 0/1 CrashLoopBackOff 6 9m
# oc describe pod
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4m 4m 1 default-scheduler Normal Scheduled Successfully assigned busybox to jialiu-node-zone1-primary-1
4m 57s 5 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Normal Pulling pulling image "busybox"
4m 55s 5 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Normal Pulled Successfully pulled image "busybox"
4m 54s 5 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Normal Created Created container
4m 54s 5 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Normal Started Started container
4m 26s 46 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Warning Unhealthy Liveness probe failed: Get http://10.2.6.83:8080/healthz: dial tcp 10.2.6.83:8080: getsockopt: connection refused
4m 24s 5 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Normal Killing Killing container with id docker://busybox:pod "busybox_wmeng1(3165a7dc-5be3-11e7-8638-42010af00004)" container "busybox" is unhealthy, it will be killed and re-created.
4m 12s 26 kubelet, jialiu-node-zone1-primary-1 Warning DNSSearchForming Found and omitted duplicated dns domain in host search line: 'cluster.local' during merging with cluster dns domains
4m 12s 14 kubelet, jialiu-node-zone1-primary-1 Warning FailedSync Error syncing pod
3m 12s 9 kubelet, jialiu-node-zone1-primary-1 spec.containers{busybox} Warning BackOff Back-off restarting failed container
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716
Description of problem:
When using the configuration below, the first time the pod is killed only after 5 unhealthy probe attempts, as expected, but after being killed, the successive pods are killed after the very first unhealthy attempt.

The liveness check is as follows:

livenessProbe:
  httpGet:
    path: /dancertestt
    port: 3000
    scheme: HTTP
  initialDelaySeconds: 30
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 5

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
failureThreshold is not being respected.

Expected results:
failureThreshold should be considered every time before killing an existing pod.

Additional info:
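For context, following from the values in the probe above: each container instance should get initialDelaySeconds = 30 seconds of grace and then up to failureThreshold × periodSeconds = 5 × 10 = 50 seconds of consecutive failed probes before being killed. The reported behavior means restarted containers were instead killed after a single failed probe, i.e. roughly one periodSeconds (about 10 seconds) after the initial delay.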