Bug 1975283 - gcp-realtime: e2e test failing [sig-storage] Multi-AZ Cluster Volumes should only be allowed to provision PDs in zones where nodes exist [Suite:openshift/conformance/parallel] [Suite:k8s]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: aos-storage-staff@redhat.com
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks: 1975938
 
Reported: 2021-06-23 11:42 UTC by Jan Chaloupka
Modified: 2021-10-18 17:36 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:36:21 UTC
Target Upstream Version:


Links
| System | ID | Private | Priority | Status | Summary | Last Updated |
|---|---|---|---|---|---|---|
| Github | openshift kubernetes pull 825 | 0 | None | open | Bug 1975283: update Multi-AZ Cluster Volumes test name | 2021-06-23 20:56:11 UTC |
| Github | openshift origin pull 26264 | 0 | None | open | Bug 1975283: Fix skipping of Multi-AZ Cluster Volumes test on GCP with k8s 1.21 | 2021-06-23 21:02:39 UTC |
| Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:36:40 UTC |

Description Jan Chaloupka 2021-06-23 11:42:26 UTC
The test is failing across multiple gcp jobs: https://search.ci.openshift.org/?search=Multi-AZ+Cluster+Volumes+should+only+be+allowed+to+provision+PDs+in+zones+where+nodes+exist&maxAge=12h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job:


```
Jun 23 09:45:26.513: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:28.551: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:30.584: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:32.619: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:34.651: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:36.689: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:38.793: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:40.837: INFO: PersistentVolumeClaim pvc-1 found but phase is Pending instead of Bound.
Jun 23 09:45:42.838: INFO: deleting claim "e2e-multi-az-4704"/"pvc-4"
Jun 23 09:45:42.895: INFO: deleting claim "e2e-multi-az-4704"/"pvc-3"
Jun 23 09:45:42.948: INFO: deleting claim "e2e-multi-az-4704"/"pvc-2"
Jun 23 09:45:42.992: INFO: deleting claim "e2e-multi-az-4704"/"pvc-1"
Jun 23 09:45:43.037: INFO: Deleting compute resource: compute-f60cb1e9-b839-4293-af52-bbbb44595846
[AfterEach] [sig-storage] Multi-AZ Cluster Volumes
  k8s.io/kubernetes@v1.21.1/test/e2e/framework/framework.go:186
STEP: Collecting events from namespace "e2e-multi-az-4704".
STEP: Found 5 events.
Jun 23 09:46:11.235: INFO: At 2021-06-23 09:40:30 +0000 UTC - event for e2e-multi-az-4704: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Jun 23 09:46:11.235: INFO: At 2021-06-23 09:40:41 +0000 UTC - event for pvc-1: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Jun 23 09:46:11.235: INFO: At 2021-06-23 09:40:41 +0000 UTC - event for pvc-2: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Jun 23 09:46:11.235: INFO: At 2021-06-23 09:40:41 +0000 UTC - event for pvc-3: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Jun 23 09:46:11.235: INFO: At 2021-06-23 09:40:41 +0000 UTC - event for pvc-4: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Jun 23 09:46:11.272: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Jun 23 09:46:11.272: INFO: 
Jun 23 09:46:11.387: INFO: skipping dumping cluster info - cluster too large
STEP: Destroying namespace "e2e-multi-az-4704" for this suite.
fail [k8s.io/kubernetes@v1.21.1/test/e2e/storage/ubernetes_lite_volumes.go:163]: Unexpected error:
    <*errors.errorString | 0xc001cef930>: {
        s: "PersistentVolumeClaims [pvc-1] not all in phase Bound within 5m0s",
    }
    PersistentVolumeClaims [pvc-1] not all in phase Bound within 5m0s
occurred
```

jobs affected:
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-fips
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-ocp-installer-e2e-gcp-ovn-4.8
- ...

Reported previously against 4.2: https://bugzilla.redhat.com/show_bug.cgi?id=1738691
From https://bugzilla.redhat.com/show_bug.cgi?id=1738691#c1:
```
The failure from the description happened because the default StorageClass in the cluster has the volumeBindingMode option set to use WaitForFirstConsumer.

However, this test has another problem: it needs to create an extra compute instance in a different zone [1], and we can't do that at the moment. I'll create a PR to disable it.
```

The test has been failing since Jun 21.

Comment 1 Jan Chaloupka 2021-06-23 11:58:11 UTC
The test was reintroduced by https://github.com/openshift/origin/pull/26054/commits/c5fbd2f74d7959e93db5de6f5f640e2a5cf76735, which merged on May 9. Yet it only started to run 2 days ago. Not sure why.

Comment 3 Yaakov Selkowitz 2021-06-23 20:37:27 UTC
The test was renamed in 1.21: https://github.com/kubernetes/kubernetes/commit/006dc7477f15e42ae70adc02421a5bacd068ba05
And therefore no longer matches the skip pattern previously used: https://github.com/openshift/kubernetes/blob/master/openshift-hack/e2e/annotate/rules.go#L148
So I guess the latter needs to be fixed then imported into origin to fix this?
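
To illustrate why a rename defeats a name-based skip rule, here is a minimal Python sketch (the real rules live in Go in rules.go). The old test name and both patterns are assumptions made up for illustration; only the new name is taken from this bug. `is_skipped` is a hypothetical helper, not part of any OpenShift tooling.

```python
import re

# ASSUMPTION: hypothetical pre-1.21 name. The bug only says the test was
# renamed; the exact old tag layout is a guess for illustration.
OLD_NAME = ("[sig-scheduling] Multi-AZ Cluster Volumes [sig-storage] should only "
            "be allowed to provision PDs in zones where nodes exist")
# Name as reported in this bug (k8s 1.21).
NEW_NAME = ("[sig-storage] Multi-AZ Cluster Volumes should only "
            "be allowed to provision PDs in zones where nodes exist")

# Hypothetical skip rule written against the full old name.
STALE_RULE = re.compile(re.escape(OLD_NAME))
# Hypothetical rule keyed on the stable parts of the name, tolerant of tag moves.
ROBUST_RULE = re.compile(
    r"Multi-AZ Cluster Volumes.*should only be allowed to provision PDs "
    r"in zones where nodes exist"
)


def is_skipped(test_name, rules):
    """Return True if any skip rule matches somewhere in the test name."""
    return any(rule.search(test_name) for rule in rules)


if __name__ == "__main__":
    print("stale rule skips old name:", is_skipped(OLD_NAME, [STALE_RULE]))
    print("stale rule skips new name:", is_skipped(NEW_NAME, [STALE_RULE]))
    print("robust rule skips new name:", is_skipped(NEW_NAME, [ROBUST_RULE]))
```

Under these assumptions the stale rule stops matching once the name changes, so the test silently starts running again. Updating the pattern to the new name, as the linked pull requests do, restores the skip.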

Comment 4 Oleg Bulatov 2021-06-24 13:27:23 UTC
Please also backport it to 4.8.

Example of an affected job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt

Comment 9 errata-xmlrpc 2021-10-18 17:36:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

