Bug 1363641 - Cannot decrease current CPU rate via HorizontalPodAutoscaler.
Summary: Cannot decrease current CPU rate via HorizontalPodAutoscaler.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Matt Wringe
QA Contact: chunchen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-03 09:20 UTC by Zhang Cheng
Modified: 2016-09-30 02:15 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-27 09:42:19 UTC
Target Upstream Version:
Embargoed:


Attachments
The logs for pod heapster in openshift-infra project (602.95 KB, text/plain)
2016-08-03 09:20 UTC, Zhang Cheng


Links
System: Red Hat Product Errata | ID: RHBA-2016:1933 | Priority: normal | Status: SHIPPED_LIVE | Summary: Red Hat OpenShift Container Platform 3.3 Release Advisory | Last Updated: 2016-09-27 13:24:36 UTC

Description Zhang Cheng 2016-08-03 09:20:50 UTC
Created attachment 1186991 [details]
The logs for pod heapster in openshift-infra project

Description of problem: The HorizontalPodAutoscaler cannot bring the current CPU utilization back down to the target. Hit this bug during Full Functional Testing - OCP-3.3 - RPM - RHEL-7.2.6 - Docker-1.10 - Multitenant - LDAP - Cinder - Swift


Version-Release number of selected component (if applicable):
Metrics image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0
[root@openshift-136 ~]# openshift version
openshift v3.3.0.13
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

Server https://openshift-136.lab.sjc.redhat.com:8443
OpenShift v3.3.0.13
Kubernetes v1.3.0+57fb9ac


How reproducible:
Always


Steps to Reproduce:
1. Setup Metrics in openshift-infra project
# oc get pod -n openshift-infra
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-hr0up   1/1       Running     0          1h
hawkular-metrics-yixq7       1/1       Running     0          1h
heapster-emkt8               1/1       Running     2          1h
metrics-deployer-rftsk       0/1       Completed   0          1h

2. Add the project's default service account to the privileged SCC's users list
# oadm policy add-scc-to-user privileged system:serviceaccount:chezhang:default
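To confirm the change took effect, the SCC's user list can be inspected (a quick check; `privileged` is the SCC targeted above):
# oc get scc privileged -o yaml | grep -A 10 'users:'
The output should include system:serviceaccount:chezhang:default.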

3. oc create -f dc.yaml -n chezhang
# cat dc.yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  labels:
    run: php-apache
  name: php-apache
spec:
  replicas: 1
  selector:
    run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: gcr.io/google_containers/hpa-example
        imagePullPolicy: IfNotPresent
        name: php-apache
        resources:
          requests:
            cpu: 200m
        securityContext:
          privileged: true
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
  triggers:
  - type: ConfigChange
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  special.how: very
  special.type: charm
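Before moving on, a quick sanity check (names and namespace as above) that the DeploymentConfig rolled out and its single replica is Running:
# oc get dc php-apache -n chezhang
# oc get pod -l run=php-apache -n chezhang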

4. Expose the rc as a service:
# oc expose rc php-apache-1 --port=80 -n chezhang
# oc get svc -n chezhang
NAME           CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
php-apache-1   172.30.155.13   <none>        80/TCP    3m
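Optionally, if external access is needed for the curl loop in step 7, a route can be created for the service (not part of the original setup):
# oc expose svc php-apache-1 -n chezhang
# oc get route -n chezhang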

5. oc create -f hpa.yaml -n chezhang
# cat hpa.yaml
apiVersion: extensions/v1beta1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleRef:
    kind: DeploymentConfig
    name: php-apache
    subresource: scale
  minReplicas: 1
  maxReplicas: 10
  cpuUtilization:
    targetPercentage: 50
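For reference, the same autoscaler can also be created without the YAML file; this one-liner should be equivalent to the spec above:
# oc autoscale dc/php-apache --min=1 --max=10 --cpu-percent=50 -n chezhang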

6. Check hpa status:
# oc get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   DeploymentConfig/php-apache   50%       0%        1         10        3m

7. Generate load against the service and check the rc after a while
while true; do curl http://146.148.6.244; done
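If the service is not reachable from outside the cluster, a sketch of an in-cluster alternative (the `load-generator` name is arbitrary; the service name comes from step 4):
# oc run -i --tty load-generator --image=busybox -n chezhang -- /bin/sh
/ # while true; do wget -q -O- http://php-apache-1; done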

8. Check hpa and pod info again
# oc get pod
# oc get hpa


Actual results: The HPA scaled up to the MAX pod count (10), but the current CPU utilization is never decreased.
8. Check hpa info again after the MAX pod count (10) was started
$ oc get dc
NAME         REVISION   DESIRED   CURRENT   TRIGGERED BY
php-apache   1          10        10        config
[chezhang@chezhang ~]$ oc get rc
NAME           DESIRED   CURRENT   AGE
php-apache-1   10        10        20m
[chezhang@chezhang ~]$ oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
php-apache-1-0dpq2   1/1       Running   0          16m
php-apache-1-3kp4x   1/1       Running   0          16m
php-apache-1-3zj5b   1/1       Running   0          16m
php-apache-1-9xirn   1/1       Running   0          16m
php-apache-1-armep   1/1       Running   0          16m
php-apache-1-f0oom   1/1       Running   0          16m
php-apache-1-ob22q   1/1       Running   0          16m
php-apache-1-rwn5b   0/1       Pending   0          16m
php-apache-1-w7bwl   1/1       Running   0          16m
php-apache-1-wp3vs   1/1       Running   0          20m
[chezhang@chezhang ~]$ oc get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   DeploymentConfig/php-apache   50%       4705%     1         10        19m
[chezhang@chezhang ~]$ oc describe hpa php-apache
Name:                           php-apache
Namespace:                      chezhang
Labels:                         <none>
Annotations:                    <none>
CreationTimestamp:              Wed, 03 Aug 2016 15:17:40 +0800
Reference:                      DeploymentConfig/php-apache
Target CPU utilization:         50%
Current CPU utilization:        4705%
Min replicas:                   1
Max replicas:                   10
Events:
  FirstSeen     LastSeen        Count   From                            SubobjectPath   Type            Reason                  Message
  ---------     --------        -----   ----                            -------------   --------        ------                  -------
  20m           19m             4       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: no running pods
  20m           19m             4       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: no running pods
  16m           16m             1       {horizontal-pod-autoscaler }                    Normal          SuccessfulRescale       New size: 10; reason: CPU utilization above target
  18m           16m             5       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 0/1 of pods
  18m           16m             5       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/1 of pods
  15m           15m             1       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 5/9 of pods
  15m           15m             1       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 5/9 of pods
  15m           7m              3       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 6/9 of pods
  15m           7m              3       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 6/9 of pods
  14m           3m              7       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 2/9 of pods
  14m           3m              7       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 2/9 of pods
  11m           2m              4       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 1/9 of pods
  11m           2m              4       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 1/9 of pods
  10m           2m              7       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 4/9 of pods
  10m           2m              7       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 4/9 of pods
  3m            1m              2       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 0/9 of pods
  3m            1m              2       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/9 of pods
  13m           12s             8       {horizontal-pod-autoscaler }                    Warning         FailedGetMetrics        failed to get CPU consumption and request: metrics obtained for 3/9 of pods
  13m           12s             8       {horizontal-pod-autoscaler }                    Warning         FailedComputeReplicas   failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 3/9 of pods


Expected results: 
After scaling up to the MAX pod count, the current CPU utilization should decrease toward the TARGET (50%).


Additional info: 
The root cause is that fetching metrics fails; below are the related metrics events and logs for reference:
# oc get pod -n openshift-infra
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-hr0up   1/1       Running     0          45m
hawkular-metrics-yixq7       1/1       Running     0          45m
heapster-emkt8               1/1       Running     2          45m
metrics-deployer-rftsk       0/1       Completed   0          46m

# oc describe pod heapster-emkt8
Name:			heapster-emkt8
Namespace:		openshift-infra
Security Policy:	restricted
Node:			openshift-134.lab.sjc.redhat.com/192.168.1.154
Start Time:		Wed, 03 Aug 2016 02:53:54 -0400
Labels:			metrics-infra=heapster
			name=heapster
Status:			Running
IP:			10.1.0.7
Controllers:		ReplicationController/heapster
Containers:
  heapster:
    Container ID:	docker://6b7e728f9964973c9a94218f4274760c52b7316a56c0a90eee2f318c5d44bb33
    Image:		brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0
    Image ID:		docker://sha256:179e3ed5c3b2a8a47ef6da49d520f99846be82a9bb225b0fb9631b59de8347c9
    Port:		8082/TCP
    Command:
      heapster-wrapper.sh
      --wrapper.username_file=/hawkular-account/hawkular-metrics.username
      --wrapper.password_file=/hawkular-account/hawkular-metrics.password
      --wrapper.allowed_users_file=/secrets/heapster.allowed-users
      --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
      --source=kubernetes:https://openshift-136.lab.sjc.redhat.com:8443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
      --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
      --tls_cert=/secrets/heapster.cert
      --tls_key=/secrets/heapster.key
      --tls_client_ca=/secrets/heapster.client-ca
      --allowed_users=%allowed_users%
      --metric_resolution=10s
    State:		Running
      Started:		Wed, 03 Aug 2016 02:58:46 -0400
    Last State:		Terminated
      Reason:		Error
      Exit Code:	1
      Started:		Wed, 03 Aug 2016 02:56:28 -0400
      Finished:		Wed, 03 Aug 2016 02:58:30 -0400
    Ready:		True
    Restart Count:	2
    Readiness:		exec [/opt/heapster-readiness.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /hawkular-account from hawkular-metrics-account (rw)
      /hawkular-cert from hawkular-metrics-certificate (rw)
      /secrets from heapster-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heapster-token-8cnxe (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  heapster-secrets:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	heapster-secrets
  hawkular-metrics-certificate:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-certificate
  hawkular-metrics-account:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-account
  heapster-token-8cnxe:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	heapster-token-8cnxe
QoS Tier:	BestEffort
Events:
  FirstSeen	LastSeen	Count	From						SubobjectPath			Type		Reason		Message
  ---------	--------	-----	----						-------------			--------	------		-------
  1h		1h		1	{default-scheduler }								Normal		Scheduled	Successfully assigned heapster-emkt8 to openshift-134.lab.sjc.redhat.com
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal		Pulling		pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0"
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal		Pulled		Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0"
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal		Created		Created container with docker id 73d1d67538d0
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal		Started		Started container with docker id 73d1d67538d0
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal		Created		Created container with docker id 1ea8754a186a
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal		Started		Started container with docker id 1ea8754a186a
  1h		1h		1	{kubelet openshift-134.lab.sjc.redhat.com}					Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "heapster" with CrashLoopBackOff: "Back-off 10s restarting failed container=heapster pod=heapster-emkt8_openshift-infra(0b2c4994-5947-11e6-9e88-fa163e0ea5c4)"

  1h	1h	1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Warning	BackOff		Back-off restarting failed docker container
  1h	1h	2	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal	Pulled		Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0" already present on machine
  1h	1h	1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal	Created		Created container with docker id 6b7e728f9964
  1h	1h	1	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Normal	Started		Started container with docker id 6b7e728f9964
  1h	1h	33	{kubelet openshift-134.lab.sjc.redhat.com}	spec.containers{heapster}	Warning	Unhealthy	Readiness probe failed: 

# oc logs heapster-emkt8    (full log is in the attachment)
... 
W0803 04:49:45.114965       1 node_aggregator.go:44] Failed to find node: node:openshift-134.lab.sjc.redhat.com
W0803 04:49:45.114972       1 node_aggregator.go:44] Failed to find node: node:openshift-134.lab.sjc.redhat.com
I0803 04:49:55.000213       1 manager.go:79] Scraping metrics start: 2016-08-03 04:49:40 -0400 EDT, end: 2016-08-03 04:49:50 -0400 EDT
I0803 04:49:55.117002       1 manager.go:152] ScrapeMetrics: time: 115.275446ms size: 58
I0803 04:49:55.631946       1 handlers.go:178] No metrics for pod chezhang/php-apache-1-0dpq2
I0803 04:49:55.631995       1 handlers.go:178] No metrics for pod chezhang/php-apache-1-3kp4x
...

Comment 1 Solly Ross 2016-08-08 20:06:29 UTC
Can you attach the OpenShift node logs? Is there anything odd in there?

Comment 3 Zhang Cheng 2016-08-09 10:22:38 UTC
I also collected the master and node logs for your debugging. They cannot be attached here because they exceed the size limit, so I shared them with you via Google Drive; you can find them at https://drive.google.com/drive/shared-with-me

Comment 4 Solly Ross 2016-08-12 15:08:00 UTC
How many nodes were you running? Which node were the pods that had no metrics available running on? Were they all on the same node, and were the ones that did have metrics available running on a different node?

Also, what log level were you logging at?

Comment 6 Zhang Cheng 2016-08-15 08:20:01 UTC
This bug is easy to reproduce by following the steps at the beginning of this page. You can try to reproduce it in your local environment; please let me know if you need any support.

Comment 7 Solly Ross 2016-08-16 20:56:12 UTC
OK, so, a bit more digging. Can you try upping your collection interval in the Heapster RC? Set `--metric_resolution` to `20s` instead of `10s`, and then kill the Heapster pod (so it gets restarted with the new collection interval)?

It's possible that what we're seeing is that, since the cAdvisor collection interval is 10s and the Heapster collection interval is also 10s, Heapster is managing to collect two samples from the same interval. This would cause the rate calculator not to fire, which would cause the "no metrics" error (which actually means "couldn't find memory/usage or cpu/usage_rate").
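For reference, one way to apply that change (the rc and pod names are the ones from this report):
# oc edit rc heapster -n openshift-infra   (change --metric_resolution=10s to --metric_resolution=20s in the container command)
# oc delete pod heapster-emkt8 -n openshift-infra
The RC will recreate the pod with the new collection interval.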

Comment 8 Zhang Cheng 2016-08-17 03:30:55 UTC
Hi, Solly

The HPA can get metrics successfully after changing `--metric_resolution` to `20s` and restarting the heapster pod manually.

I used all defaults while setting up metrics; below are my steps:
oc project openshift-infra
oc create -f https://raw.githubusercontent.com/openshift/origin-metrics/master/metrics-deployer-setup.yaml
oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:openshift-infra:heapster
oc policy add-role-to-user edit system:serviceaccount:openshift-infra:metrics-deployer
oc secrets new metrics-deployer nothing=/dev/null
oc process openshift//metrics-deployer-template -v HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.0816-lfr.qe.rhcloud.com,IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=false,MASTER_URL=https://host-8-173-175.host.centralci.eng.rdu2.redhat.com:8443,CASSANDRA_PV_SIZE=10Gi |oc create -f -

In this case, can you change the default value of `--metric_resolution` in the metrics-deployer-template?

Comment 9 Solly Ross 2016-08-17 14:25:02 UTC
mwringe@ I think we should update the default collection interval to be higher than 10s (not sure it needs to be a full 20s, but I think it at least needs to be more than 10s to make sure we always get a sample from a different cAdvisor sample period).

Comment 10 Solly Ross 2016-08-18 15:12:01 UTC
Reassigning to mwringe@ to deal with the origin-metrics template change.

Comment 11 Matt Wringe 2016-08-19 20:18:37 UTC
The metric resolution has now been updated to 15s.
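After redeploying metrics with the updated template, the new default can be confirmed on the running Heapster RC, for example:
# oc get rc heapster -n openshift-infra -o yaml | grep metric_resolution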

Comment 12 Zhang Cheng 2016-08-22 08:33:57 UTC
Test passed and verified.

Comment 14 errata-xmlrpc 2016-09-27 09:42:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

