Bug 2066865
| Summary: | Flaky test: In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (delayed binding)] topology should provision a volume and schedule a pod with AllowedTopologies | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Safranek <jsafrane> |
| Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
| Storage sub component: | Kubernetes External Components | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, dgoodwin, miabbott |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (delayed binding)] topology should provision a volume and schedule a pod with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s] |
| Last Closed: | 2022-08-10 10:55:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
**Description** (Jan Safranek, 2022-03-22 16:18:42 UTC)
This may be unrelated: I can see that nodes in that failed run have weird topology labels:

```
"topology.disk.csi.azure.com/zone": "",
"topology.kubernetes.io/region": "westus",
"topology.kubernetes.io/zone": "0"
```

In a successful run they look different:

```
"topology.disk.csi.azure.com/zone": "centralus-1",
"topology.kubernetes.io/region": "centralus",
"topology.kubernetes.io/zone": "centralus-1"
```

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1502150912049680384

pull-ci-openshift-installer-master-e2e-azure-upi has been failing almost constantly since 03/02/2022.

The first failing run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5674/pull-ci-openshift-installer-master-e2e-azure-upi/1499128361434222592

The last successful run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5665/pull-ci-openshift-installer-master-e2e-azure-upi/1498979872859492352

(There were a couple of failed installs in between.)

[The e2e-azure-upi job linked in the previous comment may be unrelated to the flake in comment #0; it uses a different install workflow.]

The "westus" region is special: it does not have availability zones. The test creates a StorageClass requesting explicit topology discovered from (in-tree) Node labels:

```yaml
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - "0"
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: e2e-topology-84vk7sp
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

The PVC created by the test:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com
    volume.kubernetes.io/selected-node: ci-op-9j1nby77-e54d5-9zlwv-worker-westus-ljw4g
    volume.kubernetes.io/storage-provisioner: disk.csi.azure.com
  creationTimestamp: "2022-03-24T12:36:08Z"
  finalizers:
  - kubernetes.io/pvc-protection
  generateName: pvc-
  name: pvc-ph9tb
  namespace: e2e-topology-84
  resourceVersion: "135683"
  uid: 7596de95-b733-46d3-93f2-f1309a244d20
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: e2e-topology-84vk7sp
  volumeMode: Filesystem
```

And the external-provisioner cannot compute topology requirements from the StorageClass + PVC.Annotations["volume.kubernetes.io/selected-node"]:

```
I0324 12:36:09.194043 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"e2e-topology-84", Name:"pvc-ph9tb", UID:"7596de95-b733-46d3-93f2-f1309a244d20", APIVersion:"v1", ResourceVersion:"135683", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "e2e-topology-84vk7sp": error generating accessibility requirements: topology map[topology.disk.csi.azure.com/zone:] from selected node "ci-op-9j1nby77-e54d5-9zlwv-worker-westus-ljw4g" is not in requisite: [map[topology.disk.csi.azure.com/zone:0]]
```

Upstream issue: https://github.com/kubernetes/kubernetes/issues/108980

Upstream PR to fix the CSI translation: https://github.com/kubernetes/kubernetes/pull/109154

It will not fix the issue completely. In Azure regions without availability zones, the in-tree volume plugin (and cloud provider) uses the machine's failure domain instead of availability zones. I don't understand how nodes are distributed among failure domains; to me it seems completely random.
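To make the "is not in requisite" error from the provisioner log above concrete, here is a minimal, self-contained sketch of that kind of accessibility check. The helper name `topologyInRequisite` is hypothetical; this is not the actual external-provisioner code, only an illustration of the comparison it performs between the selected node's CSI zone label and the requisite list derived from StorageClass.allowedTopologies.

```go
package main

import "fmt"

// topologyInRequisite reports whether the selected node's topology segment
// appears in the requisite list derived from StorageClass.allowedTopologies.
// Illustrative only; not the real external-provisioner implementation.
func topologyInRequisite(selected map[string]string, requisite []map[string]string) bool {
	for _, req := range requisite {
		match := true
		for key, value := range req {
			if selected[key] != value {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	// Node in westus (no availability zones): the Azure Disk CSI driver labels
	// it with an empty zone, as seen in the failed run above.
	selectedNode := map[string]string{"topology.disk.csi.azure.com/zone": ""}

	// Requisite topology derived from the test's StorageClass, whose in-tree
	// failure-domain value "0" ends up as CSI zone "0".
	requisite := []map[string]string{
		{"topology.disk.csi.azure.com/zone": "0"},
	}

	// Prints "in requisite: false" -- the same mismatch that produces the
	// ProvisioningFailed event quoted above.
	fmt.Println("in requisite:", topologyInRequisite(selectedNode, requisite))
}
```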
The CSI driver does not use failure domains and reports `zone: ""` in these regions, which expresses the volume topology better: a PV in such a region can be used by any node. From the perspective of the failing test, the in-tree labels look like real availability zones whose PVs cannot be used from another zone, but that is not true in reality. The test expects that if it provisions a volume in failure domain "1" (via StorageClass.allowedTopologies), a PV is provisioned there and only nodes from that failure domain can use it. However, such a topology requirement is translated to CSI as "", so the provisioned PV can be used in any failure domain, and sometimes a node in domain "0" is chosen. The test then fails, even though there is no error on the Kubernetes / OCP side (see the sketch at the end of this report).

I am going to skip this test for the in-tree volume plugin on Azure and keep it for the CSI driver. I need to update openshift/origin.

Two cases are not runnable in the current CI: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-azure&show-stale-tests=

Verified pass.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
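As referenced above, a simplified sketch of the translation behavior this report describes for regions without availability zones. The functions `isAvailabilityZone` and `translateAllowedTopology` are illustrative stand-ins, not the actual k8s.io/csi-translation-lib code changed by the upstream PR, and the heuristic of treating "region-number" values (such as "centralus-1") as real availability zones is an assumption based on the labels quoted in comment #0.

```go
package main

import (
	"fmt"
	"strings"
)

const csiZoneLabel = "topology.disk.csi.azure.com/zone"

// isAvailabilityZone treats "<region>-<number>" style values (e.g. "centralus-1")
// as real availability zones and bare failure-domain values (e.g. "0", "1") as not.
// Simplified stand-in, not the real csi-translation-lib heuristic.
func isAvailabilityZone(zone, region string) bool {
	return strings.HasPrefix(zone, region+"-")
}

// translateAllowedTopology mirrors the behavior described above: in a region
// without availability zones, an in-tree failure-domain requirement collapses
// to an empty CSI zone, so the provisioned PV matches nodes in any failure domain.
func translateAllowedTopology(inTreeZone, region string) map[string]string {
	if isAvailabilityZone(inTreeZone, region) {
		return map[string]string{csiZoneLabel: inTreeZone}
	}
	return map[string]string{csiZoneLabel: ""}
}

func main() {
	// westus has no availability zones: failure domain "1" no longer pins the PV,
	// so a node in failure domain "0" may legitimately mount it and the test fails.
	fmt.Println(translateAllowedTopology("1", "westus")) // map[topology.disk.csi.azure.com/zone:]

	// centralus has availability zones: "centralus-1" survives translation intact.
	fmt.Println(translateAllowedTopology("centralus-1", "centralus")) // map[topology.disk.csi.azure.com/zone:centralus-1]
}
```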