Created attachment 1186991 [details]
The logs for pod heapster in openshift-infra project

Description of problem:
Cannot decrease the current CPU rate via HorizontalPodAutoscaler.

Hit this bug in Full Functional Testing - OCP-3.3 - RPM - RHEL-7.2.6 - Docker-1.10 - Multitenant - LDAP - Cinder - Swift

Version-Release number of selected component (if applicable):
Metrics image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0

[root@openshift-136 ~]# openshift version
openshift v3.3.0.13
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

Server https://openshift-136.lab.sjc.redhat.com:8443
OpenShift v3.3.0.13
Kubernetes v1.3.0+57fb9ac

How reproducible:
Always

Steps to Reproduce:
1. Set up Metrics in the openshift-infra project
# oc get pod -n openshift-infra
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-hr0up   1/1       Running     0          1h
hawkular-metrics-yixq7       1/1       Running     0          1h
heapster-emkt8               1/1       Running     2          1h
metrics-deployer-rftsk       0/1       Completed   0          1h

2. Add the project service account to the privileged SCC
# oadm policy add-scc-to-user privileged system:serviceaccount:chezhang:default

3. oc create -f dc.yaml -n chezhang
# cat dc.yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  labels:
    run: php-apache
  name: php-apache
spec:
  replicas: 1
  selector:
    run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: gcr.io/google_containers/hpa-example
        imagePullPolicy: IfNotPresent
        name: php-apache
        resources:
          requests:
            cpu: 200m
        securityContext:
          privileged: true
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
  triggers:
  - type: ConfigChange
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  special.how: very
  special.type: charm

4. Expose the rc:
# oc expose rc php-apache-1 --port=80 -n chezhang
# oc get svc -n chezhang
NAME           CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
php-apache-1   172.30.155.13   <none>        80/TCP    3m

5.
oc create -f hpa.yaml -n chezhang
# cat hpa.yaml
apiVersion: extensions/v1beta1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleRef:
    kind: DeploymentConfig
    name: php-apache
    subresource: scale
  minReplicas: 1
  maxReplicas: 10
  cpuUtilization:
    targetPercentage: 50

6. Check hpa status:
# oc get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   DeploymentConfig/php-apache   50%       0%        1         10        3m

7. Increase the load and check the rc after a while
while true; do curl http://146.148.6.244; done

8. Check hpa and pod info again
# oc get pod
# oc get hpa

Actual results:
The HPA scaled up to the MAX pod count (10), but the current CPU rate never decreases.

hpa info after scaling up to the MAX pod count (10):
$ oc get dc
NAME         REVISION   DESIRED   CURRENT   TRIGGERED BY
php-apache   1          10        10        config

[chezhang@chezhang ~]$ oc get rc
NAME           DESIRED   CURRENT   AGE
php-apache-1   10        10        20m

[chezhang@chezhang ~]$ oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
php-apache-1-0dpq2   1/1       Running   0          16m
php-apache-1-3kp4x   1/1       Running   0          16m
php-apache-1-3zj5b   1/1       Running   0          16m
php-apache-1-9xirn   1/1       Running   0          16m
php-apache-1-armep   1/1       Running   0          16m
php-apache-1-f0oom   1/1       Running   0          16m
php-apache-1-ob22q   1/1       Running   0          16m
php-apache-1-rwn5b   0/1       Pending   0          16m
php-apache-1-w7bwl   1/1       Running   0          16m
php-apache-1-wp3vs   1/1       Running   0          20m

[chezhang@chezhang ~]$ oc get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   DeploymentConfig/php-apache   50%       4705%     1         10        19m

[chezhang@chezhang ~]$ oc describe hpa php-apache
Name:                    php-apache
Namespace:               chezhang
Labels:                  <none>
Annotations:             <none>
CreationTimestamp:       Wed, 03 Aug 2016 15:17:40 +0800
Reference:               DeploymentConfig/php-apache
Target CPU utilization:  50%
Current CPU utilization: 4705%
Min replicas:            1
Max replicas:            10
Events:
  FirstSeen  LastSeen  Count  From                          Type     Reason                 Message
  ---------  --------  -----  ----                          -------- ------                 -------
  20m        19m       4      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: no running pods
  20m        19m       4      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: no running pods
  16m        16m       1      {horizontal-pod-autoscaler }  Normal   SuccessfulRescale      New size: 10; reason: CPU utilization above target
  18m        16m       5      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 0/1 of pods
  18m        16m       5      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/1 of pods
  15m        15m       1      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 5/9 of pods
  15m        15m       1      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 5/9 of pods
  15m        7m        3      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 6/9 of pods
  15m        7m        3      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 6/9 of pods
  14m        3m        7      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 2/9 of pods
  14m        3m        7      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 2/9 of pods
  11m        2m        4      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 1/9 of pods
  11m        2m        4      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 1/9 of pods
  10m        2m        7      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 4/9 of pods
  10m        2m        7      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 4/9 of pods
  3m         1m        2      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 0/9 of pods
  3m         1m        2      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/9 of pods
  13m        12s       8      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 3/9 of pods
  13m        12s       8      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 3/9 of pods

Expected results:
After scaling up to the MAX pod count, the current CPU rate is decreased down to the TARGET (50%).

Additional info:
The root cause is that getting metrics fails; below are the related metrics events and logs for your reference:

# oc get pod -n openshift-infra
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-hr0up   1/1       Running     0          45m
hawkular-metrics-yixq7       1/1       Running     0          45m
heapster-emkt8               1/1       Running     2          45m
metrics-deployer-rftsk       0/1       Completed   0          46m

# oc describe pod heapster-emkt8
Name:            heapster-emkt8
Namespace:       openshift-infra
Security Policy: restricted
Node:            openshift-134.lab.sjc.redhat.com/192.168.1.154
Start Time:      Wed, 03 Aug 2016 02:53:54 -0400
Labels:          metrics-infra=heapster
                 name=heapster
Status:          Running
IP:              10.1.0.7
Controllers:     ReplicationController/heapster
Containers:
  heapster:
    Container ID:  docker://6b7e728f9964973c9a94218f4274760c52b7316a56c0a90eee2f318c5d44bb33
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0
    Image ID:      docker://sha256:179e3ed5c3b2a8a47ef6da49d520f99846be82a9bb225b0fb9631b59de8347c9
    Port:          8082/TCP
    Command:
      heapster-wrapper.sh
      --wrapper.username_file=/hawkular-account/hawkular-metrics.username
      --wrapper.password_file=/hawkular-account/hawkular-metrics.password
      --wrapper.allowed_users_file=/secrets/heapster.allowed-users
      --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
      --source=kubernetes:https://openshift-136.lab.sjc.redhat.com:8443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
      --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
      --tls_cert=/secrets/heapster.cert
      --tls_key=/secrets/heapster.key
      --tls_client_ca=/secrets/heapster.client-ca
      --allowed_users=%allowed_users%
      --metric_resolution=10s
    State:          Running
      Started:      Wed, 03 Aug 2016 02:58:46 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 03 Aug 2016 02:56:28 -0400
      Finished:     Wed, 03 Aug 2016 02:58:30 -0400
    Ready:          True
    Restart Count:  2
    Readiness:      exec [/opt/heapster-readiness.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /hawkular-account from hawkular-metrics-account (rw)
      /hawkular-cert from hawkular-metrics-certificate (rw)
      /secrets from heapster-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heapster-token-8cnxe (ro)
    Environment Variables: <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  heapster-secrets:
    Type:       Secret (a volume populated by a Secret)
    SecretName: heapster-secrets
  hawkular-metrics-certificate:
    Type:       Secret (a volume populated by a Secret)
    SecretName: hawkular-metrics-certificate
  hawkular-metrics-account:
    Type:       Secret (a volume populated by a Secret)
    SecretName: hawkular-metrics-account
  heapster-token-8cnxe:
    Type:       Secret (a volume populated by a Secret)
    SecretName: heapster-token-8cnxe
QoS Tier: BestEffort
Events:
  FirstSeen  LastSeen  Count  From                                        SubobjectPath              Type     Reason      Message
  ---------  --------  -----  ----                                        -------------              -------- ------      -------
  1h         1h        1      {default-scheduler }                                                   Normal   Scheduled   Successfully assigned heapster-emkt8 to openshift-134.lab.sjc.redhat.com
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Pulling     pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0"
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Pulled      Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0"
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Created     Created container with docker id 73d1d67538d0
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Started     Started container with docker id 73d1d67538d0
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Created     Created container with docker id 1ea8754a186a
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Started     Started container with docker id 1ea8754a186a
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}                             Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "heapster" with CrashLoopBackOff: "Back-off 10s restarting failed container=heapster pod=heapster-emkt8_openshift-infra(0b2c4994-5947-11e6-9e88-fa163e0ea5c4)"
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Warning  BackOff     Back-off restarting failed docker container
  1h         1h        2      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Pulled      Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0" already present on machine
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Created     Created container with docker id 6b7e728f9964
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Started     Started container with docker id 6b7e728f9964
  1h         1h        33     {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Warning  Unhealthy   Readiness probe failed:

# oc logs heapster-emkt8 --> detailed info in attachment
...
W0803 04:49:45.114965 1 node_aggregator.go:44] Failed to find node: node:openshift-134.lab.sjc.redhat.com
W0803 04:49:45.114972 1 node_aggregator.go:44] Failed to find node: node:openshift-134.lab.sjc.redhat.com
I0803 04:49:55.000213 1 manager.go:79] Scraping metrics start: 2016-08-03 04:49:40 -0400 EDT, end: 2016-08-03 04:49:50 -0400 EDT
I0803 04:49:55.117002 1 manager.go:152] ScrapeMetrics: time: 115.275446ms size: 58
I0803 04:49:55.631946 1 handlers.go:178] No metrics for pod chezhang/php-apache-1-0dpq2
I0803 04:49:55.631995 1 handlers.go:178] No metrics for pod chezhang/php-apache-1-3kp4x
...
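For context on the numbers above (0%, 4705%, pinning at 10 replicas): the HPA compares average pod CPU usage against the per-pod request and scales proportionally toward the target. A minimal Python sketch of that calculation (illustrative only, not the actual controller code):

```python
import math

def current_utilization(pod_usage_millicores, request_millicores):
    """Average CPU usage across pods as a percentage of the per-pod request."""
    total_usage = sum(pod_usage_millicores)
    total_request = request_millicores * len(pod_usage_millicores)
    return 100.0 * total_usage / total_request

def desired_replicas(current, utilization_pct, target_pct, min_r, max_r):
    """Scale proportionally toward the target, clamped to [min_r, max_r]."""
    desired = math.ceil(current * utilization_pct / target_pct)
    return max(min_r, min(max_r, desired))

# With the observed 4705% utilization against a 50% target, the HPA
# wants far more than maxReplicas, so it pins at 10:
print(desired_replicas(10, 4705.0, 50.0, 1, 10))  # -> 10
```

When metrics retrieval fails (the FailedGetMetrics / FailedComputeReplicas events above), the controller cannot compute the utilization input at all and skips the rescale decision, which is why the replica count and the reported CPU rate stay stuck.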
Can you attach the OpenShift node logs? Is there anything odd in there?
I also collected the master and node logs for your debugging. They cannot be attached here since the size exceeds the limit, so I shared them with you via Google Drive; you can find them at https://drive.google.com/drive/shared-with-me
How many nodes were you running? Which node were the pods with no metrics available running on? Were they all on the same node, and were the pods that did have metrics available running on a different node? Also, what log level were you logging at?
This bug is easy to recreate by following the steps at the beginning of this page. You can try to reproduce it in your local env; please let me know if you need any support.
OK, so, a bit more digging. Can you try upping your collection interval in the Heapster RC, setting `--metric_resolution` to `20s` instead of `10s`, and then killing the Heapster pod (so it gets restarted with the new collection interval)? It's possible what we're seeing is that since the collection interval for cAdvisor is 10s, and the Heapster collection interval is 10s, Heapster is managing to collect two samples from the same interval. This would cause the rate calculator not to fire, which would cause the "no metrics" error (which actually means "couldn't find memory/usage or cpu/usage_rate").
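To illustrate the suspected failure mode: a rate metric like cpu/usage_rate needs two samples from distinct cAdvisor measurement periods. A hypothetical Python sketch (names and structure are illustrative, not Heapster's actual code):

```python
def cpu_usage_rate(samples):
    """samples: list of (timestamp_seconds, cumulative_cpu_usage).
    Returns a usage rate, or None when no rate can be derived."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        # Both scrapes observed the same cAdvisor measurement, so there
        # is no time delta -- this surfaces as the "no metrics" error.
        return None
    return (v1 - v0) / (t1 - t0)

# 10s scrape window vs. 10s cAdvisor interval: both scrapes can return
# the same underlying sample, so no rate is produced:
print(cpu_usage_rate([(100.0, 5000), (100.0, 5000)]))  # -> None
# A longer window guarantees samples from two different cAdvisor periods:
print(cpu_usage_rate([(100.0, 5000), (115.0, 6500)]))  # -> 100.0
```

This is why raising the collection interval above cAdvisor's 10s period would make the rate calculator fire reliably.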
Hi Solly,

The HPA can get metrics info successfully after changing `--metric_resolution` to `20s` and restarting the heapster pod manually.

I used all defaults while setting up metrics; below are my steps:

oc project openshift-infra
oc create -f https://raw.githubusercontent.com/openshift/origin-metrics/master/metrics-deployer-setup.yaml
oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:openshift-infra:heapster
oc policy add-role-to-user edit system:serviceaccount:openshift-infra:metrics-deployer
oc secrets new metrics-deployer nothing=/dev/null
oc process openshift//metrics-deployer-template -v HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.0816-lfr.qe.rhcloud.com,IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=false,MASTER_URL=https://host-8-173-175.host.centralci.eng.rdu2.redhat.com:8443,CASSANDRA_PV_SIZE=10Gi | oc create -f -

In this case, can you change the default value of `--metric_resolution` in the metrics-deployer-template?
mwringe@: I think we should update the default collection interval to be higher than 10s. I'm not sure it needs to be a full 20s, but it at least needs to be more than 10s to make sure we always get a sample from a different cAdvisor sample period.
Reassigning to mwringe@ to deal with the origin-metrics template change.
The metric resolution has now been updated to 15s.
Test passed and verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933