Description of problem:

From: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/114

Failing tests:

```
[Suite:openshift/conformance/serial] [Suite:k8s] [Serial] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale down with External Metric with target value from Stackdriver [Feature:CustomMetricsAutoscaling]
[Suite:openshift/conformance/serial] [Suite:k8s] [Serial] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale up with two External metrics from Stackdriver [Feature:CustomMetricsAutoscaling]
[Suite:openshift/conformance/serial] [Suite:k8s] [Serial] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale up with two metrics of type Pod from Stackdriver [Feature:CustomMetricsAutoscaling]
[Suite:openshift/conformance/serial] [Suite:k8s] [Serial]
```

Relevant error message:

```
stderr: Error from server (AlreadyExists): clusterrolebindings.rbac.authorization.k8s.io "e2e-test-cluster-admin-binding" already exists
```
This looks like an installer or GCP issue. There are numerous errors attempting to contact the GCP metadata server:

```
csi-gce-pd-node-wxh42/gce-pd-driver: E0909 17:31:03.638894 1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-wxh42/gce-pd-driver: E0909 17:31:09.654892 1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-wxh42/gce-pd-driver: E0909 17:31:14.646909 1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
```

and

```
Sep 9 15:51:28.316: INFO: lookupDiskImageSources: gcloud error with [[]string{"instance-groups", "list-instances", "", "--format=get(instance)"}]; err:exit status 1
Sep 9 15:51:28.316: INFO: > ERROR: (gcloud.compute.instance-groups.list-instances) could not parse resource []
```
Pods that need to reach the GCP metadata service have to run with hostNetwork set.
Reassigning to the test team. It looks like some test pods may need to be run under hostNetwork.
The fastest way to get this working may be to turn on hostNetwork for the entire serial GCP suite. The definition of the serial job, and which template it currently uses, is here: https://github.com/openshift/release/blob/69ffbdb41b4efdb435c97b1512fb671fe74e2246/ci-operator/jobs/openshift/release/openshift-release-release-4.2-periodics.yaml#L1051 You wouldn't want to edit that template directly because it is used by all the test jobs. I'm assuming you'd instead create a new template like this one, but add the hostNetwork configuration to the spec; here's an example: https://github.com/openshift/release/blob/69ffbdb41b4efdb435c97b1512fb671fe74e2246/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L70 If that fails, another option would be to create a separate test suite for running the serial tests that require hostNetwork.
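For illustration, the suggested change amounts to setting `hostNetwork: true` in the pod spec of the new template. A minimal, hypothetical sketch (the pod name and image are placeholders, not the actual template contents):

```yaml
# Hypothetical pod spec excerpt: hostNetwork makes the pod share the
# node's network namespace, so requests to the GCP metadata service
# at 169.254.169.254 go out over the node's own network interface.
apiVersion: v1
kind: Pod
metadata:
  name: e2e-test-pod          # placeholder name
spec:
  hostNetwork: true           # the field the new template would add
  containers:
  - name: test
    image: test-image:latest  # placeholder image
```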
So I understand how hostNetwork would impact the csi-gce-pd job, but that is explicitly not this bug; THAT problem is being dealt with in: https://github.com/openshift/origin/pull/23760/files What I do not understand is how hostNetwork affects an HPA job.
In the event we determine that using hostNetwork is the right thing to do here, Trevor has created a PR.
Looks like the replicas are not converging, and then the 15-minute timeout within the test is triggered:

```
Sep 9 15:51:31.020: INFO: waiting for 1 replicas (current: 0)
Sep 9 15:51:51.025: INFO: waiting for 1 replicas (current: 1)
Sep 9 15:51:51.036: INFO: waiting for 3 replicas (current: 1)
Sep 9 15:52:11.041: INFO: waiting for 3 replicas (current: 1)
Sep 9 15:52:31.041: INFO: waiting for 3 replicas (current: 1)
[... same message repeated every 20 seconds ...]
Sep 9 16:05:51.042: INFO: waiting for 3 replicas (current: 1)
Sep 9 16:06:11.041: INFO: waiting for 3 replicas (current: 1)
```
There are a bunch of e2e tests for HPA that are GCP or GKE only tests that rely upon external metrics from Stackdriver to make pod scaling decisions. All of these tests are failing, and not failing gracefully, meaning that the first test gets killed after a timeout and leaves its service accounts and other prerequisites in place. This causes subsequent tests to complain about resources already existing, but I don't believe this is the root issue.

Is this a test that used to work and broke recently, or is this a test that we are now running for the first time because it's the first time we've run the e2e tests on GCP? If it's the latter, it seems likely that this is a case of us missing infrastructure on GCP that is required for this test. I'm not sure if we haven't enabled a Stackdriver service on the cluster or if we just need the pods (both the test pods and the HPA controller) to be able to reach the metadata service or some other GCP resource (like others have surmised above).

Just to be clear, the only HPA tests that are failing here are the ones that rely upon Stackdriver metrics.
If Stackdriver isn't something we support as part of OCP, these tests can certainly be disabled. This test has likely been failing ever since this GCP job was created and has simply been buried in the noise.
*** Bug 1744069 has been marked as a duplicate of this bug. ***
We can disable the tests by adding them here: https://github.com/openshift/origin/blob/master/test/extended/util/test.go#L440
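For context, the mechanism at that link is a list of patterns that the test runner matches against full test names in order to skip them. A simplified sketch of the idea, assuming plain substring matching (the real list in origin's test/extended/util/test.go is more involved; `excludedTests` and `isExcluded` here are illustrative names, not the actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// excludedTests is a simplified stand-in for the exclusion list in
// origin's test/extended/util/test.go. Any test whose full name
// contains one of these substrings would be skipped by the runner.
var excludedTests = []string{
	// the Stackdriver-backed HPA tests from this bug
	"[Feature:CustomMetricsAutoscaling]",
}

// isExcluded reports whether a test name matches the exclusion list.
func isExcluded(name string) bool {
	for _, pattern := range excludedTests {
		if strings.Contains(name, pattern) {
			return true
		}
	}
	return false
}

func main() {
	name := "[sig-autoscaling] [HPA] Horizontal pod autoscaling " +
		"(scale resource: Custom Metrics from Stackdriver) " +
		"should scale down with External Metric with target value " +
		"from Stackdriver [Feature:CustomMetricsAutoscaling]"
	fmt.Println(isExcluded(name)) // prints "true"
}
```

Matching on the `[Feature:CustomMetricsAutoscaling]` tag is enough to catch all three failing tests at once, since they share it.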
We're just waiting to confirm that the PR disables the tests. https://github.com/openshift/origin/pull/23774
PR merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922