Created attachment 1186991 [details]
The logs for pod heapster in openshift-infra project

Description of problem:
Cannot decrease the current CPU rate via HorizontalPodAutoscaler.

Hit this bug in Full Functional Testing - OCP-3.3 - RPM - RHEL-7.2.6 - Docker-1.10 - Multitenant - LDAP - Cinder - Swift

Version-Release number of selected component (if applicable):
Metrics image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0

[root@openshift-136 ~]# openshift version
openshift v3.3.0.13
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

Server https://openshift-136.lab.sjc.redhat.com:8443
OpenShift v3.3.0.13
Kubernetes v1.3.0+57fb9ac

How reproducible:
Always

Steps to Reproduce:
1. Set up Metrics in the openshift-infra project
# oc get pod -n openshift-infra
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-hr0up   1/1       Running     0          1h
hawkular-metrics-yixq7       1/1       Running     0          1h
heapster-emkt8               1/1       Running     2          1h
metrics-deployer-rftsk       0/1       Completed   0          1h

2. Add the project service account to the privileged SCC
# oadm policy add-scc-to-user privileged system:serviceaccount:chezhang:default

3. oc create -f dc.yaml -n chezhang
# cat dc.yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  labels:
    run: php-apache
  name: php-apache
spec:
  replicas: 1
  selector:
    run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: gcr.io/google_containers/hpa-example
        imagePullPolicy: IfNotPresent
        name: php-apache
        resources:
          requests:
            cpu: 200m
        securityContext:
          privileged: true
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
  triggers:
  - type: ConfigChange
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  special.how: very
  special.type: charm

4. Expose the rc:
# oc expose rc php-apache-1 --port=80 -n chezhang
# oc get svc -n chezhang
NAME           CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
php-apache-1   172.30.155.13   <none>        80/TCP    3m

5.
oc create -f hpa.yaml -n chezhang
# cat hpa.yaml
apiVersion: extensions/v1beta1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleRef:
    kind: DeploymentConfig
    name: php-apache
    subresource: scale
  minReplicas: 1
  maxReplicas: 10
  cpuUtilization:
    targetPercentage: 50

6. Check hpa status:
# oc get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   DeploymentConfig/php-apache   50%       0%        1         10        3m

7. Increase the load and check the rc after a while
while true; do curl http://146.148.6.244; done

8. Check hpa and pod info again
# oc get pod
# oc get hpa

Actual results:
The HPA scaled up to the MAX pod count (10), but the current CPU rate never decreases.

hpa info after scaling up to the MAX pod count (10):
$ oc get dc
NAME         REVISION   DESIRED   CURRENT   TRIGGERED BY
php-apache   1          10        10        config

[chezhang@chezhang ~]$ oc get rc
NAME           DESIRED   CURRENT   AGE
php-apache-1   10        10        20m

[chezhang@chezhang ~]$ oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
php-apache-1-0dpq2   1/1       Running   0          16m
php-apache-1-3kp4x   1/1       Running   0          16m
php-apache-1-3zj5b   1/1       Running   0          16m
php-apache-1-9xirn   1/1       Running   0          16m
php-apache-1-armep   1/1       Running   0          16m
php-apache-1-f0oom   1/1       Running   0          16m
php-apache-1-ob22q   1/1       Running   0          16m
php-apache-1-rwn5b   0/1       Pending   0          16m
php-apache-1-w7bwl   1/1       Running   0          16m
php-apache-1-wp3vs   1/1       Running   0          20m

[chezhang@chezhang ~]$ oc get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   DeploymentConfig/php-apache   50%       4705%     1         10        19m

[chezhang@chezhang ~]$ oc describe hpa php-apache
Name:                    php-apache
Namespace:               chezhang
Labels:                  <none>
Annotations:             <none>
CreationTimestamp:       Wed, 03 Aug 2016 15:17:40 +0800
Reference:               DeploymentConfig/php-apache
Target CPU utilization:  50%
Current CPU utilization: 4705%
Min replicas:            1
Max replicas:            10
Events:
  FirstSeen  LastSeen  Count  From                          Type     Reason                 Message
  ---------  --------  -----  ----                          -------- ------                 -------
  20m        19m       4      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: no running pods
  20m        19m       4      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: no running pods
  16m        16m       1      {horizontal-pod-autoscaler }  Normal   SuccessfulRescale      New size: 10; reason: CPU utilization above target
  18m        16m       5      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 0/1 of pods
  18m        16m       5      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/1 of pods
  15m        15m       1      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 5/9 of pods
  15m        15m       1      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 5/9 of pods
  15m        7m        3      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 6/9 of pods
  15m        7m        3      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 6/9 of pods
  14m        3m        7      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 2/9 of pods
  14m        3m        7      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 2/9 of pods
  11m        2m        4      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 1/9 of pods
  11m        2m        4      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 1/9 of pods
  10m        2m        7      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 4/9 of pods
  10m        2m        7      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 4/9 of pods
  3m         1m        2      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 0/9 of pods
  3m         1m        2      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/9 of pods
  13m        12s       8      {horizontal-pod-autoscaler }  Warning  FailedGetMetrics       failed to get CPU consumption and request: metrics obtained for 3/9 of pods
  13m        12s       8      {horizontal-pod-autoscaler }  Warning  FailedComputeReplicas  failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 3/9 of pods

Expected results:
After scaling up to the MAX pod count, the current CPU rate is decreased down to the TARGET (50%).

Additional info:
The root cause is that getting metrics fails; below are the related metrics events and logs for your reference:

# oc get pod -n openshift-infra
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-hr0up   1/1       Running     0          45m
hawkular-metrics-yixq7       1/1       Running     0          45m
heapster-emkt8               1/1       Running     2          45m
metrics-deployer-rftsk       0/1       Completed   0          46m

# oc describe pod heapster-emkt8
Name:            heapster-emkt8
Namespace:       openshift-infra
Security Policy: restricted
Node:            openshift-134.lab.sjc.redhat.com/192.168.1.154
Start Time:      Wed, 03 Aug 2016 02:53:54 -0400
Labels:          metrics-infra=heapster
                 name=heapster
Status:          Running
IP:              10.1.0.7
Controllers:     ReplicationController/heapster
Containers:
  heapster:
    Container ID:  docker://6b7e728f9964973c9a94218f4274760c52b7316a56c0a90eee2f318c5d44bb33
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0
    Image ID:      docker://sha256:179e3ed5c3b2a8a47ef6da49d520f99846be82a9bb225b0fb9631b59de8347c9
    Port:          8082/TCP
    Command:
      heapster-wrapper.sh
      --wrapper.username_file=/hawkular-account/hawkular-metrics.username
      --wrapper.password_file=/hawkular-account/hawkular-metrics.password
      --wrapper.allowed_users_file=/secrets/heapster.allowed-users
      --wrapper.endpoint_check=https://hawkular-metrics:443/hawkular/metrics/status
      --source=kubernetes:https://openshift-136.lab.sjc.redhat.com:8443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
      --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&labelNodeId=nodename&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^system.slice.*|^user.slice)
      --tls_cert=/secrets/heapster.cert
      --tls_key=/secrets/heapster.key
      --tls_client_ca=/secrets/heapster.client-ca
      --allowed_users=%allowed_users%
      --metric_resolution=10s
    State:          Running
      Started:      Wed, 03 Aug 2016 02:58:46 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 03 Aug 2016 02:56:28 -0400
      Finished:     Wed, 03 Aug 2016 02:58:30 -0400
    Ready:          True
    Restart Count:  2
    Readiness:      exec [/opt/heapster-readiness.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /hawkular-account from hawkular-metrics-account (rw)
      /hawkular-cert from hawkular-metrics-certificate (rw)
      /secrets from heapster-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heapster-token-8cnxe (ro)
    Environment Variables: <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  heapster-secrets:
    Type:       Secret (a volume populated by a Secret)
    SecretName: heapster-secrets
  hawkular-metrics-certificate:
    Type:       Secret (a volume populated by a Secret)
    SecretName: hawkular-metrics-certificate
  hawkular-metrics-account:
    Type:       Secret (a volume populated by a Secret)
    SecretName: hawkular-metrics-account
  heapster-token-8cnxe:
    Type:       Secret (a volume populated by a Secret)
    SecretName: heapster-token-8cnxe
QoS Tier: BestEffort
Events:
  FirstSeen  LastSeen  Count  From                                        SubobjectPath              Type     Reason      Message
  ---------  --------  -----  ----                                        -------------              -------- ------      -------
  1h         1h        1      {default-scheduler }                                                   Normal   Scheduled   Successfully assigned heapster-emkt8 to openshift-134.lab.sjc.redhat.com
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Pulling     pulling image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0"
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Pulled      Successfully pulled image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0"
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Created     Created container with docker id 73d1d67538d0
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Started     Started container with docker id 73d1d67538d0
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Created     Created container with docker id 1ea8754a186a
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Started     Started container with docker id 1ea8754a186a
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}                             Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "heapster" with CrashLoopBackOff: "Back-off 10s restarting failed container=heapster pod=heapster-emkt8_openshift-infra(0b2c4994-5947-11e6-9e88-fa163e0ea5c4)"
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Warning  BackOff     Back-off restarting failed docker container
  1h         1h        2      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Pulled      Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster:3.3.0" already present on machine
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Created     Created container with docker id 6b7e728f9964
  1h         1h        1      {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Normal   Started     Started container with docker id 6b7e728f9964
  1h         1h        33     {kubelet openshift-134.lab.sjc.redhat.com}  spec.containers{heapster}  Warning  Unhealthy   Readiness probe failed:

# oc logs heapster-emkt8 --> detailed info in attachment
...
W0803 04:49:45.114965 1 node_aggregator.go:44] Failed to find node: node:openshift-134.lab.sjc.redhat.com
W0803 04:49:45.114972 1 node_aggregator.go:44] Failed to find node: node:openshift-134.lab.sjc.redhat.com
I0803 04:49:55.000213 1 manager.go:79] Scraping metrics start: 2016-08-03 04:49:40 -0400 EDT, end: 2016-08-03 04:49:50 -0400 EDT
I0803 04:49:55.117002 1 manager.go:152] ScrapeMetrics: time: 115.275446ms size: 58
I0803 04:49:55.631946 1 handlers.go:178] No metrics for pod chezhang/php-apache-1-0dpq2
I0803 04:49:55.631995 1 handlers.go:178] No metrics for pod chezhang/php-apache-1-3kp4x
...
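For context on the numbers above (0%, 4705%, pinning at 10 replicas): the HPA compares average pod CPU usage against the per-pod request and scales proportionally toward the target. A minimal Python sketch of that calculation (illustrative only, not the actual controller code):

```python
import math

def current_utilization(pod_usage_millicores, request_millicores):
    """Average CPU usage across pods as a percentage of the per-pod request."""
    total_usage = sum(pod_usage_millicores)
    total_request = request_millicores * len(pod_usage_millicores)
    return 100.0 * total_usage / total_request

def desired_replicas(current, utilization_pct, target_pct, min_r, max_r):
    """Scale proportionally toward the target, clamped to [min_r, max_r]."""
    desired = math.ceil(current * utilization_pct / target_pct)
    return max(min_r, min(max_r, desired))

# With the observed 4705% utilization against a 50% target, the HPA
# wants far more than maxReplicas, so it pins at 10:
print(desired_replicas(10, 4705.0, 50.0, 1, 10))  # -> 10
```

When metrics retrieval fails (the FailedGetMetrics / FailedComputeReplicas events above), the controller cannot compute the utilization input at all and skips the rescale decision, which is why the replica count and the reported CPU rate stay stuck.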
Can you attach the OpenShift node logs? Is there anything odd in there?
I also collected the master and node logs for your debugging. They cannot be attached here since the size exceeds the limit, so I shared them with you via Google Drive; you can find them at https://drive.google.com/drive/shared-with-me
How many nodes were you running? Which node were the pods with no metrics available running on? Were they all on the same node, and were the pods that did have metrics available running on a different node? Also, what log level were you logging at?
This bug is easy to recreate by following the steps at the beginning of this page. You can try to reproduce it in your local env; please let me know if you need any support.
OK, so, a bit more digging. Can you try upping your collection interval in the Heapster RC, setting `--metric_resolution` to `20s` instead of `10s`, and then killing the Heapster pod (so it gets restarted with the new collection interval)? It's possible what we're seeing is that since the collection interval for cAdvisor is 10s, and the Heapster collection interval is 10s, Heapster is managing to collect two samples from the same interval. This would cause the rate calculator not to fire, which would cause the "no metrics" error (which actually means "couldn't find memory/usage or cpu/usage_rate").
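To illustrate the suspected failure mode: a rate metric like cpu/usage_rate needs two samples from distinct cAdvisor measurement periods. A hypothetical Python sketch (names and structure are illustrative, not Heapster's actual code):

```python
def cpu_usage_rate(samples):
    """samples: list of (timestamp_seconds, cumulative_cpu_usage).
    Returns a usage rate, or None when no rate can be derived."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        # Both scrapes observed the same cAdvisor measurement, so there
        # is no time delta -- this surfaces as the "no metrics" error.
        return None
    return (v1 - v0) / (t1 - t0)

# 10s scrape window vs. 10s cAdvisor interval: both scrapes can return
# the same underlying sample, so no rate is produced:
print(cpu_usage_rate([(100.0, 5000), (100.0, 5000)]))  # -> None
# A longer window guarantees samples from two different cAdvisor periods:
print(cpu_usage_rate([(100.0, 5000), (115.0, 6500)]))  # -> 100.0
```

This is why raising the collection interval above cAdvisor's 10s period would make the rate calculator fire reliably.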
Hi Solly,

The HPA can get metrics info successfully after changing `--metric_resolution` to `20s` and restarting the heapster pod manually.

I used all defaults while setting up metrics; below are my steps:

oc project openshift-infra
oc create -f https://raw.githubusercontent.com/openshift/origin-metrics/master/metrics-deployer-setup.yaml
oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:openshift-infra:heapster
oc policy add-role-to-user edit system:serviceaccount:openshift-infra:metrics-deployer
oc secrets new metrics-deployer nothing=/dev/null
oc process openshift//metrics-deployer-template -v HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.0816-lfr.qe.rhcloud.com,IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=false,MASTER_URL=https://host-8-173-175.host.centralci.eng.rdu2.redhat.com:8443,CASSANDRA_PV_SIZE=10Gi | oc create -f -

In this case, can you change the default value of `--metric_resolution` in the metrics-deployer-template?
mwringe@: I think we should update the default collection interval to be higher than 10s. I'm not sure it needs to be a full 20s, but it at least needs to be more than 10s to make sure we always get a sample from a different cAdvisor sample period.
Reassigning to mwringe@ to deal with the origin-metrics template change.
The metric resolution has now been updated to 15s.
Test passed and verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933