Bug 1750851 - [gcp][serial][sig-autoscaling] [HPA] test failures
Summary: [gcp][serial][sig-autoscaling] [HPA] test failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Joel Smith
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1744069
Depends On:
Blocks:
 
Reported: 2019-09-10 15:22 UTC by Brenton Leanhardt
Modified: 2019-10-16 06:41 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:40:52 UTC
Target Upstream Version:
Embargoed:


Links:
- GitHub: openshift/release pull 4983/files (last updated 2020-05-19 22:36:04 UTC)
- Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:41:04 UTC)

Description Brenton Leanhardt 2019-09-10 15:22:35 UTC
Description of problem:

From:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/114

[Suite:openshift/conformance/serial] [Suite:k8s] [Serial]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale down with External Metric with target value from Stackdriver [Feature:CustomMetricsAutoscaling] [Suite:openshift/conformance/serial] [Suite:k8s] [Serial]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale up with two External metrics from Stackdriver [Feature:CustomMetricsAutoscaling] [Suite:openshift/conformance/serial] [Suite:k8s] [Serial]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale up with two metrics of type Pod from Stackdriver [Feature:CustomMetricsAutoscaling] [Suite:openshift/conformance/serial] [Suite:k8s] [Serial]

Relevant error message:

stderr:
Error from server (AlreadyExists): clusterrolebindings.rbac.authorization.k8s.io "e2e-test-cluster-admin-binding" already exists

Comment 1 Ryan Phillips 2019-09-10 15:50:23 UTC
This looks like an installer or GCP issue. There are numerous errors attempting to contact the GCP metadata server.

```
csi-gce-pd-node-wxh42/gce-pd-driver: E0909 17:31:03.638894       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-wxh42/gce-pd-driver: E0909 17:31:09.654892       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-wxh42/gce-pd-driver: E0909 17:31:14.646909       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
```
and

```
Sep  9 15:51:28.316: INFO: lookupDiskImageSources: gcloud error with [[]string{"instance-groups", "list-instances", "", "--format=get(instance)"}]; err:exit status 1
Sep  9 15:51:28.316: INFO:  > ERROR: (gcloud.compute.instance-groups.list-instances) could not parse resource []
```

Comment 2 Ryan Phillips 2019-09-10 16:01:10 UTC
Pods needing GCP metadata services need to have hostNetwork set.
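
As a concrete illustration of that change, here is a minimal sketch, assuming a client-go based test pod, of a spec that sets hostNetwork so the pod can reach the metadata endpoint at 169.254.169.254. The pod name, image, and command below are placeholders, not taken from the failing tests:

```
// Sketch only: a pod that joins the host network namespace so it can reach
// the GCP metadata endpoint directly. Name, image, and command are placeholders.
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func metadataClientPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "metadata-client"},
		Spec: corev1.PodSpec{
			HostNetwork: true, // share the node's network namespace
			Containers: []corev1.Container{{
				Name:  "client",
				Image: "example.invalid/curl:latest", // placeholder image
				Command: []string{
					"curl", "-H", "Metadata-Flavor: Google",
					"http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token",
				},
			}},
		},
	}
}
```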

Comment 3 Ryan Phillips 2019-09-10 16:09:13 UTC
Reassigning to the test team. It looks like some test pods may need to be run under hostNetwork.

Comment 9 Brenton Leanhardt 2019-09-10 20:58:17 UTC
The fastest way to get this working may be to turn on hostNetwork for the entire serial GCP suite. If you look here, you can see the definition of the serial job and which template it currently uses:

https://github.com/openshift/release/blob/69ffbdb41b4efdb435c97b1512fb671fe74e2246/ci-operator/jobs/openshift/release/openshift-release-release-4.2-periodics.yaml#L1051

You wouldn't want to edit that template because it is used by all the test jobs.  I'm assuming you'd create a new template like this one but add the hostNetwork configuration to the spec.  Here's an example:

https://github.com/openshift/release/blob/69ffbdb41b4efdb435c97b1512fb671fe74e2246/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L70

If that fails, another option would be to create a separate test suite for running the serial tests that require hostNetwork.

Comment 10 Eric Paris 2019-09-10 21:01:33 UTC
So I understand how hostNetwork would impact the csi-gce-pd job, which is explicitly not this bug. THAT problem is being dealt with in: https://github.com/openshift/origin/pull/23760/files


I do not understand how hostNetwork affects an HPA job.

Comment 11 Brenton Leanhardt 2019-09-10 21:18:59 UTC
In the event we determine that using hostNetwork is the right thing to do here, Trevor has created a PR.

Comment 12 Ryan Phillips 2019-09-10 22:41:46 UTC
It looks like the replicas are not converging, and then the 15-minute timeout within the test is triggered:


Sep  9 15:51:31.020: INFO: waiting for 1 replicas (current: 0)
Sep  9 15:51:51.025: INFO: waiting for 1 replicas (current: 1)
Sep  9 15:51:51.036: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:52:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:52:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:52:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:53:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:53:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:53:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:54:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:54:31.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:54:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:55:11.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:55:31.043: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:55:51.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:56:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:56:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:56:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:57:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:57:31.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:57:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:58:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:58:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:58:51.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:59:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:59:31.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 15:59:51.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:00:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:00:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:00:51.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:01:11.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:01:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:01:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:02:11.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:02:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:02:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:03:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:03:31.040: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:03:51.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:04:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:04:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:04:51.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:05:11.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:05:31.041: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:05:51.042: INFO: waiting for 3 replicas (current: 1)
Sep  9 16:06:11.041: INFO: waiting for 3 replicas (current: 1)
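
For context, the repeated lines above come from a polling wait inside the test. A rough sketch of that kind of loop, assuming a 20-second poll interval and a 15-minute timeout (illustrative only, not the upstream e2e framework code):

```
// Rough sketch of the polling loop that produces output like the log above:
// check the deployment's ready replicas every 20s and give up after 15 minutes.
// Function and parameter names are illustrative, not the e2e framework's.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForReplicas(ctx context.Context, c kubernetes.Interface, ns, name string, want int32) error {
	return wait.PollImmediate(20*time.Second, 15*time.Minute, func() (bool, error) {
		d, err := c.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		fmt.Printf("waiting for %d replicas (current: %d)\n", want, d.Status.ReadyReplicas)
		return d.Status.ReadyReplicas == want, nil
	})
}
```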

Comment 13 Joel Smith 2019-09-12 13:35:14 UTC
There are a bunch of e2e tests for HPA that are GCP- or GKE-only and rely upon external metrics from Stackdriver to make pod scaling decisions. All of these tests are failing, and not failing gracefully: the first test gets killed after a timeout and leaves its service accounts and other prerequisites in place. This causes subsequent tests to complain about resources already existing, but I don't believe that is the root issue.

Is this a test that used to work and broke recently, or is this a test that we are now running for the first time because it's the first time we've run the e2e tests on GCP? If it's the latter, it seems likely that this is a case of us missing infrastructure on GCP that is required for this test. I'm not sure if we haven't enabled a Stackdriver service on the cluster or if we just need the pods (both the test pods and the HPA controller) to be able to reach the metadata service or some other GCP resource (like others have surmised above).

Just to be clear, the only HPA tests that are failing here are the ones that rely upon Stackdriver metrics.

Comment 14 Brenton Leanhardt 2019-09-12 14:27:25 UTC
If Stackdriver isn't something we support as part of OCP, these tests can certainly be disabled.  This test has likely been failing ever since this GCP job was created and has simply been buried in the noise.

Comment 15 Ryan Phillips 2019-09-12 14:55:21 UTC
*** Bug 1744069 has been marked as a duplicate of this bug. ***

Comment 16 Ryan Phillips 2019-09-12 16:45:22 UTC
We can disable the tests by adding them here: https://github.com/openshift/origin/blob/master/test/extended/util/test.go#L440
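
For reference, the mechanism there is a list of test-name substrings that get skipped. A simplified sketch of that pattern (the variable and function names here are illustrative; see the linked file for the real list):

```
// Simplified sketch of a substring-based skip list; the names here are
// illustrative and not the actual code in the linked file.
package main

import "strings"

// skippedTests holds substrings of test names that should not be run.
var skippedTests = []string{
	"[Feature:CustomMetricsAutoscaling]", // Stackdriver-backed HPA tests from this bug
}

func isDisabled(testName string) bool {
	for _, s := range skippedTests {
		if strings.Contains(testName, s) {
			return true
		}
	}
	return false
}
```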

Comment 17 Joel Smith 2019-09-13 14:56:28 UTC
We're just waiting to confirm that the PR disables the tests. https://github.com/openshift/origin/pull/23774

Comment 18 Ryan Phillips 2019-09-16 13:58:20 UTC
PR merged.

Comment 20 errata-xmlrpc 2019-10-16 06:40:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

