Bug 1726328
| Summary: | Test Failure: ReplicationController light Should scale from 2 pods to 1 pod [Suite:openshift/conformance/parallel] [Suite:k8s] | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Samuel Padgett <spadgett> |
| Component: | Node | Assignee: | Joel Smith <joelsmith> |
| Status: | CLOSED NOTABUG | QA Contact: | Jianwei Hou <jhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.2.0 | CC: | aos-bugs, ccoleman, gblomqui, joelsmith, jokerman, mmccomas, pliu, rkrawitz, rteague, sdodson, sjenning, wking |
| Target Milestone: | --- | ||
| Target Release: | 4.3.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-11-04 13:56:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Samuel Padgett
2019-07-02 15:40:59 UTC
I also experienced the same failure here https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/7/pull-ci-openshift-sriov-network-operator-master-e2e-aws/14 I'm trying to track this down. I'm not sure if the test is sending some bad metrics or what. On a good run, the HPA controller logs this at V4: curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23348/pull-ci-openshift-origin-master-e2e-aws/10867/artifacts/e2e-aws/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-138-40.ec2.internal_kube-controller-manager-5.log | sed -rn '/proposing.*9763/ s/.*] (.*)UTC.*/\1/p' proposing 0 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:38:50.659783807 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:05.679280594 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:20.698529839 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:35.723805289 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:50.750030513 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:05.763069981 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:20.78080316 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:35.922116818 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:50.948820274 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:05.972832426 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:21.004629642 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:36.03979533 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:51.062884403 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:06.255909507 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:21.277581663 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:36.291088518 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:51.319519876 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:43:06.438235514 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:43:21.457157967 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:43:36.470952614 +0000 On a bad run, it logs this: curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23348/pull-ci-openshift-origin-master-e2e-aws/10859/artifacts/e2e-aws/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-158-214.ec2.internal_kube-controller-manager-5.log | sed -rn '/proposing.*5761/ s/.*] (.*)UTC.*/\1/p' proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:41:21.303495177 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:41:36.324387129 +0000 proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:41:51.50368587 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:06.524034259 +0000 proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:21.540504546 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:36.56936478 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:51.68773958 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:06.720575179 +0000 proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:21.746420597 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:36.769546521 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:52.000804575 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:07.021929823 +0000 proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:22.039830497 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:37.063148574 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:52.085068185 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:07.10957028 +0000 proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:22.244067706 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:37.289819405 +0000 proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:52.462762369 +0000 proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:46:07.493771073 +0000 .... etc .... I added some additional logging in the scale logic computation to see why the scaledown failed, and one of my suspicions turned out to be the issue (although I still don't know what the root cause is). The problem is that one of the pods managed by the rc often seems to be missing from the point of view of the HPA. When deciding to scale down, the HPA will treat any missing pod as using its request if it is missing. The logs from replica_calculator.go are ones that were added to my debug PR. I0711 23:42:51.093669 1 replica_calculator.go:111] GetResourceReplicas 3: missing pod: rc-light-jsl4n assumed to be using all its request I0711 23:42:51.093699 1 replica_calculator.go:139] GetResourceReplicas 5: no missing or unready pods: 2, 7, 15 (true, true, false) Note: the above log message that I added is not correct. There potentially are missing pods when the log message #5 is reached. I0711 23:42:51.093723 1 horizontal.go:594] proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-11 23:42:51.079018011 +0000 UTC m=+2151.611016267) for ReplicationController/horizontal-pod-autoscaling-7957/rc-light I0711 23:42:51.093745 1 horizontal.go:629] decided not to scale ReplicationController/horizontal-pod-autoscaling-7957/rc-light to 2 (last scale time was <nil>) So now the question to figure out is, why is one of the pods from the rc missing from the POV of the HPA, and frequently enough to prevent a scale down? I didn't find anything obvious in the node logs that would explain it, but I'm going to continue digging. *** Bug 1713137 has been marked as a duplicate of this bug. *** This is still happening on about 1 out of 10 runs in testgrid Seeing this happen quite frequently: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=failed%3A.*ReplicationController+light+Should+scale&maxAge=336h&context=2&type=all Tracking down flakes for the rhel7 scaleup job and I'm finding that this test is accounting for 48% of all job failures over the last ~24 hours. https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=e2e-aws-scaleup-rhel7&search=failed:.*ReplicationController+light+Should+scale This is still hitting the rhel scaleup job in 7 out of 8 failures. Failed % of 8 Test (started between 2019-08-21T05:32:31 and 2019-08-21T07:22:40 UTC) 7 87 [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light Should scale from 2 pods to 1 pod [Suite:openshift/conformance/parallel] [Suite:k8s] When looking at the job templates, I found that the installer template has CPU requests of '3', whereas the scaleup template has CPU requests of '1'. Would this contribute to the test failing more often on scaleup? https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L153 https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/openshift-ansible/cluster-scaleup-e2e-40.yaml#L157 I'm not familiar with what the test actually does so I'm just looking for differences in the test environment. > When looking at the job templates, I found that the installer template has CPU requests of '3', whereas the scaleup template has CPU requests of '1'. Would this contribute to the test failing more often on scaleup? I don't see how that would impact flake rate, but it probably is making it harder for the CI cluster to intelligently schedule resources. See [1] for why the installer template was bumped to 3 CPUs. I expect the Ansible template should be bumped to match, since it's likely to need the same resources. On the other hand, that PR only bumped one of many similar templates, so maybe I'm missing something. [1]: https://github.com/openshift/release/pull/4508/files Looks like this is fixed in recent CI jobs. I suspect that the root cause was spotty metrics availability which has probably been addressed by other fixes. https://search.svc.ci.openshift.org/?search=2+pods+to+1+pod&maxAge=336h&context=2&type=all |