Bug 1382855 - HPA fails to collect accurate information when there are failed pods in certain situations
Summary: HPA fails to collect accurate information when there are failed pods in certain situations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Solly Ross
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-07 21:51 UTC by Eric Jones
Modified: 2019-12-16 07:02 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Horizontal Pod Autoscalers would fail to scale when they could not retrieve metrics for pods matching their target selector. As a result, dead pods and newly created pods would cause Horizontal Pod Autoscalers to skip scaling. The Horizontal Pod Autoscaler controller now assumes conservative metric values (depending on the state of the pod and the direction of the scale) when metrics are missing or pods are marked as unready or not active, so newly created or dead pods no longer block scaling.
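
For illustration, below is a minimal Go sketch of the conservative fallback described above. It is not the actual Kubernetes controller code; the function name, parameters, and the numbers in main are assumptions used only to show the idea: pods with missing metrics are assumed to use 0% of their request when the computation points toward a scale-up, and to sit at the target utilization when it points toward a scale-down, so they dampen the result instead of aborting the whole calculation.

package main

import "fmt"

// desiredReplicas is a simplified, illustrative version of the conservative
// fallback: utilization holds per-pod CPU usage as a percent of the pod's
// request, missing is the number of matched pods with no metrics (e.g. dead
// or just-created pods), and target is the HPA target utilization percent.
func desiredReplicas(current int, target float64, utilization []float64, missing int) int {
    sum := 0.0
    for _, u := range utilization {
        sum += u
    }
    avg := sum / float64(len(utilization))

    switch {
    case avg > target:
        // Looks like a scale-up: assume the missing pods use 0% of their
        // request, which dampens the scale-up instead of aborting it.
        avg = sum / float64(len(utilization)+missing)
    case avg < target:
        // Looks like a scale-down: assume the missing pods sit exactly at
        // the target, which dampens the scale-down instead of aborting it.
        avg = (sum + float64(missing)*target) / float64(len(utilization)+missing)
    }

    // Apply the usage ratio to the current replica count and round.
    return int(float64(current)*avg/target + 0.5)
}

func main() {
    // Hypothetical numbers: 4 pods with metrics at ~150% of request plus
    // 3 dead pods with no metrics, against an 80% target. The missing
    // metrics dampen the result (prints 4) instead of failing the update.
    fmt.Println(desiredReplicas(4, 80, []float64{150, 150, 150, 150}, 3))
}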
Clone Of:
Environment:
Last Closed: 2017-04-12 19:07:26 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Eric Jones 2016-10-07 21:51:02 UTC
Description of problem:
The HPA reports "failed to get CPU consumption and request: failed to get metrics...", but the customer provided the following information:

"The root of the problem was that the ab-service project had 4 dead pods in there that failed when the hosts had the cpu resource exhaustion.  HPA was attempting to pull metrics on these dead pods.  It's not very resilient or forgiving, any sort of error and it won't complete the update.

I resolved this by deleting the 4 dead pods.  Now hpa gets good status and completed the minimal scale out of 2"

Additional details included in attachment (coming soon)

Comment 2 DeShuai Ma 2016-10-27 02:35:58 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/33593

Comment 4 Seth Jennings 2017-01-27 03:29:50 UTC
This has been merged upstream and picked up in the Origin 1.5 rebase

Comment 5 DeShuai Ma 2017-02-07 01:56:58 UTC
Waiting for https://bugzilla.redhat.com/show_bug.cgi?id=1419481 to be fixed before verifying this bug.

Comment 6 DeShuai Ma 2017-02-14 09:03:18 UTC
Verified on v3.5.0.19+199197c

Version-Release number of selected component (if applicable):
openshift v3.5.0.19+199197c
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Steps:
1. Create a scalable resource
oc run resource-consumer --image=docker.io/ocpqe/resource_consumer:v1 --replicas=1 --expose --port 8080 --requests='cpu=100m,memory=256Mi' -n dma

2. Create an HPA for it
oc autoscale dc/resource-consumer --min=1 --max=30 --cpu-percent=80 -n dma

3. Create some pods (with labels matching the dc's selector) that consume CPU and then complete
for i in `seq 1 1 3`; do oc create -f pod.yaml -n dma; done

4. Check pod status and watch how the HPA scales pods up and down


Actual results:
4. While the pods created in step 3 consume CPU, the HPA scales up and creates new pods. After those pods complete, the HPA scales back down to 1 pod within about 5 minutes.

Expected results:
4. While the pods created in step 3 consume CPU, the HPA should scale up and create new pods. After those pods complete, the HPA should scale back down to 1 pod within about 5 minutes.

Additional info:
Detailed verification output:
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS    RESTARTS   AGE
resource-consumer-1-0223q   1/1       Running   0          5h
[root@host-8-174-253 dma]# oc get hpa -n dma
NAME                REFERENCE                            TARGET    CURRENT   MINPODS   MAXPODS   AGE
resource-consumer   DeploymentConfig/resource-consumer   80%       0%        1         30        5h
[root@host-8-174-253 dma]# for i in `seq 1 1 3`; do oc create -f pod.yaml -n dma; done
pod "hpa-fake-lzpjm" created
pod "hpa-fake-xbw06" created
pod "hpa-fake-trkht" created
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS    RESTARTS   AGE
hpa-fake-lzpjm              1/1       Running   0          9s
hpa-fake-trkht              1/1       Running   0          8s
hpa-fake-xbw06              1/1       Running   0          9s
resource-consumer-1-0223q   1/1       Running   0          5h
[root@host-8-174-253 dma]# date
Tue Feb 14 03:46:00 EST 2017
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS    RESTARTS   AGE
hpa-fake-lzpjm              1/1       Running   0          17s
hpa-fake-trkht              1/1       Running   0          16s
hpa-fake-xbw06              1/1       Running   0          17s
resource-consumer-1-0223q   1/1       Running   0          5h
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS    RESTARTS   AGE
hpa-fake-lzpjm              1/1       Running   0          31s
hpa-fake-trkht              1/1       Running   0          30s
hpa-fake-xbw06              1/1       Running   0          31s
resource-consumer-1-0223q   1/1       Running   0          5h
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          1m
hpa-fake-trkht              0/1       Completed   0          1m
hpa-fake-xbw06              0/1       Completed   0          1m
resource-consumer-1-0223q   1/1       Running     0          5h
resource-consumer-1-6t1kb   1/1       Running     0          28s
resource-consumer-1-fpbv0   1/1       Running     0          28s
resource-consumer-1-sn2mb   1/1       Running     0          28s
[root@host-8-174-253 dma]# oc get hpa -n dma
NAME                REFERENCE                            TARGET    CURRENT   MINPODS   MAXPODS   AGE
resource-consumer   DeploymentConfig/resource-consumer   80%       0%        1         30        5h
[root@host-8-174-253 dma]# cat pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: resource-consumer
  generateName: hpa-fake-
spec:
  containers:
    - image: docker.io/ocpqe/resource_consumer:v1
      command:
       - /consume-cpu/consume-cpu
      args:
       - -duration-sec=60
       - -millicores=200
      imagePullPolicy: IfNotPresent
      name: hpa-fake
      ports:
        - containerPort: 8080
          protocol: TCP
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
      securityContext:
        capabilities: {}
        privileged: false
      terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  restartPolicy: OnFailure
  serviceAccount: ""
[root@host-8-174-253 dma]# oc get hpa -n dma
NAME                REFERENCE                            TARGET    CURRENT   MINPODS   MAXPODS   AGE
resource-consumer   DeploymentConfig/resource-consumer   80%       0%        1         30        5h
[root@host-8-174-253 dma]# oc describe hpa resource-consumer -n dma
Name:				resource-consumer
Namespace:			dma
Labels:				<none>
Annotations:			<none>
CreationTimestamp:		Mon, 13 Feb 2017 22:10:56 -0500
Reference:			DeploymentConfig/resource-consumer
Target CPU utilization:		80%
Current CPU utilization:	0%
Min replicas:			1
Max replicas:			30
Events:
  FirstSeen	LastSeen	Count	From				SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----				-------------	--------	------			-------
  1h		1h		1	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 6 (avgCPUutil: 138, current replicas: 1)
  57m		57m		1	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 8 (avgCPUutil: 154, current replicas: 1)
  57m		57m		2	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 8 (avgCPUutil: 154, current replicas: 4)
  54m		54m		1	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 3 (avgCPUutil: 0, current replicas: 4)
  5h		52m		14	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 2 (avgCPUutil: 0, current replicas: 4)
  3h		52m		5	{horizontal-pod-autoscaler }			Normal		SuccessfulRescale	New size: 1; reason: All metrics below target
  51m		51m		1	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 8 (avgCPUutil: 149, current replicas: 1)
  49m		49m		1	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 8 (avgCPUutil: 151, current replicas: 1)
  5h		49m		107	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	(events with common reason combined)
  3h		4m		54	{horizontal-pod-autoscaler }			Warning		FailedGetMetrics	unable to get metrics for resource cpu: no metrics returned from heapster
  3h		2m		373	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 0 (avgCPUutil: 0, current replicas: 1)
  5h		1m		8	{horizontal-pod-autoscaler }			Normal		SuccessfulRescale	New size: 4; reason: CPU utilization above target
  1h		1m		3	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 8 (avgCPUutil: 152, current replicas: 1)
  1h		3s		38	{horizontal-pod-autoscaler }			Normal		DesiredReplicasComputed	Computed the desired num of replicas: 0 (avgCPUutil: 0, current replicas: 4)
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          2m
hpa-fake-trkht              0/1       Completed   0          2m
hpa-fake-xbw06              0/1       Completed   0          2m
resource-consumer-1-0223q   1/1       Running     0          5h
resource-consumer-1-6t1kb   1/1       Running     0          2m
resource-consumer-1-fpbv0   1/1       Running     0          2m
resource-consumer-1-sn2mb   1/1       Running     0          2m
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          3m
hpa-fake-trkht              0/1       Completed   0          3m
hpa-fake-xbw06              0/1       Completed   0          3m
resource-consumer-1-0223q   1/1       Running     0          5h
resource-consumer-1-6t1kb   1/1       Running     0          2m
resource-consumer-1-fpbv0   1/1       Running     0          2m
resource-consumer-1-sn2mb   1/1       Running     0          2m
[root@host-8-174-253 dma]# date
Tue Feb 14 03:49:25 EST 2017
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          3m
hpa-fake-trkht              0/1       Completed   0          3m
hpa-fake-xbw06              0/1       Completed   0          3m
resource-consumer-1-0223q   1/1       Running     0          5h
resource-consumer-1-6t1kb   1/1       Running     0          3m
resource-consumer-1-fpbv0   1/1       Running     0          3m
resource-consumer-1-sn2mb   1/1       Running     0          3m
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          4m
hpa-fake-trkht              0/1       Completed   0          4m
hpa-fake-xbw06              0/1       Completed   0          4m
resource-consumer-1-0223q   1/1       Running     0          6h
resource-consumer-1-6t1kb   1/1       Running     0          3m
resource-consumer-1-fpbv0   1/1       Running     0          3m
resource-consumer-1-sn2mb   1/1       Running     0          3m
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          4m
hpa-fake-trkht              0/1       Completed   0          4m
hpa-fake-xbw06              0/1       Completed   0          4m
resource-consumer-1-0223q   1/1       Running     0          6h
resource-consumer-1-6t1kb   1/1       Running     0          4m
resource-consumer-1-fpbv0   1/1       Running     0          4m
resource-consumer-1-sn2mb   1/1       Running     0          4m
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# 
[root@host-8-174-253 dma]# oc get pods -n dma
NAME                        READY     STATUS      RESTARTS   AGE
hpa-fake-lzpjm              0/1       Completed   0          7m
hpa-fake-trkht              0/1       Completed   0          7m
hpa-fake-xbw06              0/1       Completed   0          7m
resource-consumer-1-0223q   1/1       Running     0          6h

Comment 8 errata-xmlrpc 2017-04-12 19:07:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

