Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1726328

Summary:	Test Failure: ReplicationController light Should scale from 2 pods to 1 pod [Suite:openshift/conformance/parallel] [Suite:k8s]
Product:	OpenShift Container Platform	Reporter:	Samuel Padgett <spadgett>
Component:	Node	Assignee:	Joel Smith <joelsmith>
Status:	CLOSED NOTABUG	QA Contact:	Jianwei Hou <jhou>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.2.0	CC:	aos-bugs, ccoleman, gblomqui, joelsmith, jokerman, mmccomas, pliu, rkrawitz, rteague, sdodson, sjenning, wking
Target Milestone:	---
Target Release:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-04 13:56:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Samuel Padgett 2019-07-02 15:40:59 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/1878/pull-ci-openshift-console-master-e2e-aws/3174

> [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light Should scale from 2 pods to 1 pod [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 1 Seth Jennings 2019-07-02 15:51:42 UTC

https://search.svc.ci.openshift.org/?search=ReplicationController+light+Should+scale&maxAge=168h&context=2&type=all

Comment 2 Peng Liu 2019-07-09 13:42:50 UTC

I also experienced the same failure here

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/7/pull-ci-openshift-sriov-network-operator-master-e2e-aws/14

Comment 3 Joel Smith 2019-07-10 17:43:14 UTC

I'm trying to track this down. I'm not sure if the test is sending some bad metrics or what. On a good run, the HPA controller logs this at V4:

curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23348/pull-ci-openshift-origin-master-e2e-aws/10867/artifacts/e2e-aws/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-138-40.ec2.internal_kube-controller-manager-5.log | sed -rn '/proposing.*9763/ s/.*] (.*)UTC.*/\1/p'

proposing 0 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:38:50.659783807 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:05.679280594 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:20.698529839 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:35.723805289 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:39:50.750030513 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:05.763069981 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:20.78080316 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:35.922116818 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:40:50.948820274 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:05.972832426 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:21.004629642 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:36.03979533 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:41:51.062884403 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:06.255909507 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:21.277581663 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:36.291088518 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:42:51.319519876 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:43:06.438235514 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:43:21.457157967 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 12:43:36.470952614 +0000

On a bad run, it logs this:

curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23348/pull-ci-openshift-origin-master-e2e-aws/10859/artifacts/e2e-aws/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-158-214.ec2.internal_kube-controller-manager-5.log | sed -rn '/proposing.*5761/ s/.*] (.*)UTC.*/\1/p'

proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:41:21.303495177 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:41:36.324387129 +0000 
proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:41:51.50368587 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:06.524034259 +0000 
proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:21.540504546 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:36.56936478 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:42:51.68773958 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:06.720575179 +0000 
proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:21.746420597 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:36.769546521 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:43:52.000804575 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:07.021929823 +0000 
proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:22.039830497 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:37.063148574 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:44:52.085068185 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:07.10957028 +0000 
proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:22.244067706 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:37.289819405 +0000 
proposing 1 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:45:52.462762369 +0000 
proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-10 07:46:07.493771073 +0000 
.... etc ....

Comment 4 Joel Smith 2019-07-12 01:22:18 UTC

I added some additional logging in the scale logic computation to see why the scaledown failed, and one of my suspicions turned out to be the issue (although I still don't know what the root cause is).  The problem is that one of the pods managed by the rc often seems to be missing from the point of view of the HPA. When deciding to scale down, the HPA will treat any missing pod as using its request if it is missing. The logs from replica_calculator.go are ones that were added to my debug PR.

I0711 23:42:51.093669       1 replica_calculator.go:111] GetResourceReplicas 3: missing pod: rc-light-jsl4n assumed to be using all its request
I0711 23:42:51.093699       1 replica_calculator.go:139] GetResourceReplicas 5: no missing or unready pods: 2, 7, 15 (true, true, false)
Note: the above log message that I added is not correct. There potentially are missing pods when the log message #5 is reached.
I0711 23:42:51.093723       1 horizontal.go:594] proposing 2 desired replicas (based on cpu resource utilization (percentage of request) from 2019-07-11 23:42:51.079018011 +0000 UTC m=+2151.611016267) for ReplicationController/horizontal-pod-autoscaling-7957/rc-light
I0711 23:42:51.093745       1 horizontal.go:629] decided not to scale ReplicationController/horizontal-pod-autoscaling-7957/rc-light to 2 (last scale time was <nil>)

So now the question to figure out is, why is one of the pods from the rc missing from the POV of the HPA, and frequently enough to prevent a scale down? I didn't find anything obvious in the node logs that would explain it, but I'm going to continue digging.

Comment 5 Seth Jennings 2019-07-31 15:45:41 UTC

*** Bug 1713137 has been marked as a duplicate of this bug. ***

Comment 6 Seth Jennings 2019-07-31 15:54:37 UTC

This is still happening on about 1 out of 10 runs in testgrid

Comment 8 Russell Teague 2019-08-14 18:51:23 UTC

Seeing this happen quite frequently:
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=failed%3A.*ReplicationController+light+Should+scale&maxAge=336h&context=2&type=all

Comment 9 Russell Teague 2019-08-16 13:29:14 UTC

Tracking down flakes for the rhel7 scaleup job and I'm finding that this test is accounting for 48% of all job failures over the last ~24 hours.

https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=e2e-aws-scaleup-rhel7&search=failed:.*ReplicationController+light+Should+scale

Comment 10 Russell Teague 2019-08-21 13:04:02 UTC

This is still hitting the rhel scaleup job in 7 out of 8 failures.

Failed	% of 8	Test (started between 2019-08-21T05:32:31 and 2019-08-21T07:22:40 UTC)
7	87	[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light Should scale from 2 pods to 1 pod [Suite:openshift/conformance/parallel] [Suite:k8s]

When looking at the job templates, I found that the installer template has CPU requests of '3', whereas the scaleup template has CPU requests of '1'.  Would this contribute to the test failing more often on scaleup?

https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L153
https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/openshift-ansible/cluster-scaleup-e2e-40.yaml#L157

I'm not familiar with what the test actually does so I'm just looking for differences in the test environment.

Comment 11 W. Trevor King 2019-08-21 19:23:44 UTC

> When looking at the job templates, I found that the installer template has CPU requests of '3', whereas the scaleup template has CPU requests of '1'.  Would this contribute to the test failing more often on scaleup?

I don't see how that would impact flake rate, but it probably is making it harder for the CI cluster to intelligently schedule resources.  See [1] for why the installer template was bumped to 3 CPUs.  I expect the Ansible template should be bumped to match, since it's likely to need the same resources.  On the other hand, that PR only bumped one of many similar templates, so maybe I'm missing something.

[1]: https://github.com/openshift/release/pull/4508/files

Comment 12 Joel Smith 2019-11-04 13:56:44 UTC

Looks like this is fixed in recent CI jobs.  I suspect that the root cause was spotty metrics availability which has probably been addressed by other fixes.

https://search.svc.ci.openshift.org/?search=2+pods+to+1+pod&maxAge=336h&context=2&type=all