OpenStack CI cannot install OCP because PVCs for Prometheus are not provisioned. Prometheus is degraded:

```json
{
  "lastTransitionTime": "2022-03-08T18:12:42Z",
  "message": "Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas",
  "reason": "UpdatingPrometheusK8SFailed",
  "status": "True",
  "type": "Degraded"
}
```

The external-provisioner for the Cinder CSI driver says:

```
I0308 18:02:34.736625       1 controller.go:858] successfully created PV pvc-ae3bf088-b269-4f89-8475-22743a072984 for PVC prometheus-data-prometheus-k8s-0 and csi volume name 34a1c6c5-66eb-44e7-96c0-95b9079ad715
I0308 18:02:34.736735       1 controller.go:1442] provision "openshift-monitoring/prometheus-data-prometheus-k8s-0" class "standard": volume "pvc-ae3bf088-b269-4f89-8475-22743a072984" provisioned
I0308 18:02:34.736751       1 controller.go:1455] provision "openshift-monitoring/prometheus-data-prometheus-k8s-0" class "standard": succeeded
I0308 18:02:34.913996       1 controller.go:858] successfully created PV pvc-25cac051-4a03-4a41-b046-7955a8dcf44b for PVC prometheus-data-prometheus-k8s-1 and csi volume name 220a7511-1c7e-4bb0-a445-fb800cd45b5d
I0308 18:02:34.914047       1 controller.go:1442] provision "openshift-monitoring/prometheus-data-prometheus-k8s-1" class "standard": volume "pvc-25cac051-4a03-4a41-b046-7955a8dcf44b" provisioned
I0308 18:02:34.914056       1 controller.go:1455] provision "openshift-monitoring/prometheus-data-prometheus-k8s-1" class "standard": succeeded
E0308 18:02:35.164799       1 volume_store.go:90] Failed to save volume pvc-ae3bf088-b269-4f89-8475-22743a072984: error saving volume pvc-ae3bf088-b269-4f89-8475-22743a072984: persistentvolumes "pvc-ae3bf088-b269-4f89-8475-22743a072984" is forbidden: error querying Cinder volume 34a1c6c5-66eb-44e7-96c0-95b9079ad715: failed to find object
E0308 18:02:35.178605       1 volume_store.go:144] error saving volume pvc-ae3bf088-b269-4f89-8475-22743a072984: persistentvolumes "pvc-ae3bf088-b269-4f89-8475-22743a072984" is forbidden: error querying Cinder volume 34a1c6c5-66eb-44e7-96c0-95b9079ad715: failed to find object
```

CI job run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/76/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack/1501242355557076992

Full provisioner logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/76/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack/1501242355557076992/artifacts/e2e-openstack/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_openstack-cinder-csi-driver-controller-575f99fdb5-6fdck_csi-provisioner.log

This seems to be the first nightly that failed with the same error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-parallel/1501243062129528832
> error querying Cinder volume 34a1c6c5-66eb-44e7-96c0-95b9079ad715: failed to find object

This most likely comes from the API server's volume label admission plugin, which tries to look up the provisioned volume in the cloud and add topology labels to it.
Setting priority to urgent, as this completely breaks OpenStack CI. Additional context: the cloud running OpenStack CI was recently upgraded to Wallaby and, among other things, the Cinder v2 API is no longer available.
The external-provisioner tries to save this PV (captured from the external-provisioner at TraceAll log level):

```json
{
  "kind": "PersistentVolume",
  "apiVersion": "v1",
  "metadata": {
    "name": "pvc-9b8a0722-345e-4901-81d5-973d485842a0",
    "creationTimestamp": null,
    "labels": {
      "topology.kubernetes.io/zone": "nova"
    },
    "annotations": {
      "pv.kubernetes.io/provisioned-by": "kubernetes.io/cinder"
    }
  },
  "spec": {
    "capacity": {
      "storage": "1Gi"
    },
    "cinder": {
      "volumeID": "ae149e9f-7e98-46b2-b716-04658c91bd8f",
      "fsType": "ext4"
    },
    "accessModes": [
      "ReadWriteOnce"
    ],
    "claimRef": {
      "kind": "PersistentVolumeClaim",
      "namespace": "default",
      "name": "myclaim",
      "uid": "9b8a0722-345e-4901-81d5-973d485842a0",
      "apiVersion": "v1",
      "resourceVersion": "81938"
    },
    "persistentVolumeReclaimPolicy": "Delete",
    "storageClassName": "standard",
    "volumeMode": "Filesystem",
    "nodeAffinity": {
      "required": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "topology.kubernetes.io/zone",
                "operator": "In",
                "values": [
                  "nova"
                ]
              }
            ]
          }
        ]
      }
    }
  },
  "status": {}
}
```

Since it is missing the label `topology.kubernetes.io/region`, the admission plugin tries to fill it in from Cinder and for some reason uses the v2 API, which fails:

```
I0309 17:08:22.336192      16 queueset.go:387] QS(workload-low): Dispatching request &request.RequestInfo{IsResourceRequest:true, Path:"/api/v1/persistentvolumes", Verb:"create", APIPrefix:"api", APIGroup:"", APIVersion:"v1", Namespace:"", Resource:"persistentvolumes", Subresource:"", Name:"", Parts:[]string{"persistentvolumes"}} &user.DefaultInfo{Name:"system:serviceaccount:openshift-cluster-csi-drivers:openstack-cinder-csi-driver-controller-sa", UID:"deab3ab6-fc92-497c-a790-21e1d8147757", Groups:[]string{"system:serviceaccounts", "system:serviceaccounts:openshift-cluster-csi-drivers", "system:authenticated"}, Extra:map[string][]string{"authentication.kubernetes.io/pod-name":[]string{"openstack-cinder-csi-driver-controller-d4b544bbf-5j2h8"}, "authentication.kubernetes.io/pod-uid":[]string{"04b35d92-e469-4570-916e-5a0306d1cf3b"}}} from its queue
I0309 17:08:22.336472      16 handler.go:153] kube-aggregator: POST "/api/v1/persistentvolumes" satisfied by nonGoRestful
I0309 17:08:22.336494      16 pathrecorder.go:247] kube-aggregator: "/api/v1/persistentvolumes" satisfied by prefix /api/
I0309 17:08:22.336505      16 handler.go:143] kube-apiserver: POST "/api/v1/persistentvolumes" satisfied by gorestful with webservice /api/v1
I0309 17:08:22.338705      16 openstack.go:942] Using Blockstorage API V2
```

I don't know why the v3 API is not used; the API server must have tried it and got an error from `NewBlockStorageV3()`: https://github.com/openshift/kubernetes/blob/7478cf2c86c567e1bbe71cefa7267600ed64cfd6/staging/src/k8s.io/legacy-cloud-providers/openstack/openstack.go#L936
The volume ae149e9f-7e98-46b2-b716-04658c91bd8f is available in OpenStack Cinder, and I can see it was freshly provisioned. I don't see anything wrong on the Cinder side.
Here's what we found so far:

- the cloud returns multiple endpoints for cinderv3, as shown in https://github.com/openshift/kubernetes/pull/1208#issuecomment-1063785752
- gophercloud doesn't know what to do with that and errors out; this was fixed in gophercloud v0.7 with https://github.com/gophercloud/gophercloud/pull/1766
- the gophercloud vendored in the in-tree cloud provider is prehistoric (v0.1) and doesn't have the above patch from 2019

Now we need to figure out the best way to bump gophercloud to a more recent release.

Additionally, it appears we do not respect the `volume_api_version` from `clouds.yaml` when generating the `cloud.conf`. It's not a huge issue, as it only affects the in-tree cloud provider, which should go away in the near future. Cinder CSI consumes the `clouds.yaml` file directly and relies on gophercloud to respect `volume_api_version`.
Moved to OpenStack CSI Drivers sub-component so that it falls on the shiftstack team. The issue is really with the in-tree cinder provisioner.
Setting no-docs update. It's highly unlikely a customer would end up in the same situation we're in now, because:

- they shouldn't have multiple endpoints for the same type to start with
- they should have cinderv2 functioning, no matter what
- we'll hopefully have switched to external CCM by the time we start supporting envs without cinder v2
Everything seems to be back in order after our cloud provider removed the duplicate volumev3 endpoint. I'll keep this BZ open for a little while, and will close once we've confirmed everything is good with our periodic jobs.
This particular issue appears to be fixed. The cloud still has some trouble provisioning VMs sometimes, but that's an unrelated issue.