Bug 2062148 - Cinder volumes cannot be provisioned
Summary: Cinder volumes cannot be provisioned
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Martin André
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-09 09:59 UTC by Jan Safranek
Modified: 2022-03-11 12:22 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-11 12:22:00 UTC
Target Upstream Version:
Embargoed:



Description Jan Safranek 2022-03-09 09:59:26 UTC
OpenStack CI cannot install OCP because PVCs for Prometheus are not provisioned.

Prometheus is degraded:

{
    "lastTransitionTime": "2022-03-08T18:12:42Z",
    "message": "Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
},
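
(For reference, this condition is reported on the monitoring ClusterOperator; something like `oc get clusteroperator monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'` dumps it.)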

The external-provisioner for the Cinder CSI driver says:

I0308 18:02:34.736625       1 controller.go:858] successfully created PV pvc-ae3bf088-b269-4f89-8475-22743a072984 for PVC prometheus-data-prometheus-k8s-0 and csi volume name 34a1c6c5-66eb-44e7-96c0-95b9079ad715
I0308 18:02:34.736735       1 controller.go:1442] provision "openshift-monitoring/prometheus-data-prometheus-k8s-0" class "standard": volume "pvc-ae3bf088-b269-4f89-8475-22743a072984" provisioned
I0308 18:02:34.736751       1 controller.go:1455] provision "openshift-monitoring/prometheus-data-prometheus-k8s-0" class "standard": succeeded
I0308 18:02:34.913996       1 controller.go:858] successfully created PV pvc-25cac051-4a03-4a41-b046-7955a8dcf44b for PVC prometheus-data-prometheus-k8s-1 and csi volume name 220a7511-1c7e-4bb0-a445-fb800cd45b5d
I0308 18:02:34.914047       1 controller.go:1442] provision "openshift-monitoring/prometheus-data-prometheus-k8s-1" class "standard": volume "pvc-25cac051-4a03-4a41-b046-7955a8dcf44b" provisioned
I0308 18:02:34.914056       1 controller.go:1455] provision "openshift-monitoring/prometheus-data-prometheus-k8s-1" class "standard": succeeded
E0308 18:02:35.164799       1 volume_store.go:90] Failed to save volume pvc-ae3bf088-b269-4f89-8475-22743a072984: error saving volume pvc-ae3bf088-b269-4f89-8475-22743a072984: persistentvolumes "pvc-ae3bf088-b269-4f89-8475-22743a072984" is forbidden: error querying Cinder volume 34a1c6c5-66eb-44e7-96c0-95b9079ad715: failed to find object
E0308 18:02:35.178605       1 volume_store.go:144] error saving volume pvc-ae3bf088-b269-4f89-8475-22743a072984: persistentvolumes "pvc-ae3bf088-b269-4f89-8475-22743a072984" is forbidden: error querying Cinder volume 34a1c6c5-66eb-44e7-96c0-95b9079ad715: failed to find object

CI job run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/76/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack/1501242355557076992
Full provisioner logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/76/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack/1501242355557076992/artifacts/e2e-openstack/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_openstack-cinder-csi-driver-controller-575f99fdb5-6fdck_csi-provisioner.log

This seems to be the first nightly that failed with the same error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-parallel/1501243062129528832

Comment 1 Jan Safranek 2022-03-09 10:00:23 UTC
> error querying Cinder volume 34a1c6c5-66eb-44e7-96c0-95b9079ad715: failed to find object

This most likely comes from the API server's volume label admission plugin, which tries to find the provisioned volume and add topology labels to it.
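
For illustration, a minimal sketch of what that plugin effectively does for a Cinder PV (names, imports and error strings are approximate, not the exact upstream code):

package sketch

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// labelsForCinderVolume sketches the admission plugin's behavior: on PV
// create, look the volume up in Cinder and copy its availability zone
// into a topology label. If the Cinder lookup fails for any reason, the
// PV create is rejected, which is what surfaces above as
// 'persistentvolumes "..." is forbidden'.
func labelsForCinderVolume(client *gophercloud.ServiceClient, volumeID string) (map[string]string, error) {
	vol, err := volumes.Get(client, volumeID).Extract()
	if err != nil {
		return nil, fmt.Errorf("error querying Cinder volume %s: %v", volumeID, err)
	}
	return map[string]string{
		"topology.kubernetes.io/zone": vol.AvailabilityZone,
	}, nil
}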

Comment 2 Martin André 2022-03-09 13:41:55 UTC
Setting priority to urgent as this completely breaks OpenStack CI.

Additional context: the cloud running OpenStack CI was recently upgraded to Wallaby, and among other things the Cinder v2 API is no longer available.

Comment 3 Jan Safranek 2022-03-09 18:07:15 UTC
The external-provisioner tries to save this PV (captured at the TraceAll log level):

{
  "kind": "PersistentVolume",
  "apiVersion": "v1",
  "metadata": {
    "name": "pvc-9b8a0722-345e-4901-81d5-973d485842a0",
    "creationTimestamp": null,
    "labels": {
      "topology.kubernetes.io/zone": "nova"
    },
    "annotations": {
      "pv.kubernetes.io/provisioned-by": "kubernetes.io/cinder"
    }
  },
  "spec": {
    "capacity": {
      "storage": "1Gi"
    },
    "cinder": {
      "volumeID": "ae149e9f-7e98-46b2-b716-04658c91bd8f",
      "fsType": "ext4"
    },
    "accessModes": [
      "ReadWriteOnce"
    ],
    "claimRef": {
      "kind": "PersistentVolumeClaim",
      "namespace": "default",
      "name": "myclaim",
      "uid": "9b8a0722-345e-4901-81d5-973d485842a0",
      "apiVersion": "v1",
      "resourceVersion": "81938"
    },
    "persistentVolumeReclaimPolicy": "Delete",
    "storageClassName": "standard",
    "volumeMode": "Filesystem",
    "nodeAffinity": {
      "required": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "topology.kubernetes.io/zone",
                "operator": "In",
                "values": [
                  "nova"
                ]
              }
            ]
          }
        ]
      }
    }
  },
  "status": {}
}

Since it is missing the topology.kubernetes.io/region label, the admission plugin tries to fill that in from Cinder and for some reason uses the v2 API, which fails:

I0309 17:08:22.336192      16 queueset.go:387] QS(workload-low): Dispatching request &request.RequestInfo{IsResourceRequest:true, Path:"/api/v1/persistentvolumes", Verb:"create", APIPrefix:"api", APIGroup:"", APIVersion:"v1", Namespace:"", Resource:"persistentvolumes", Subresource:"", Name:"", Parts:[]string{"persistentvolumes"}} &user.DefaultInfo{Name:"system:serviceaccount:openshift-cluster-csi-drivers:openstack-cinder-csi-driver-controller-sa", UID:"deab3ab6-fc92-497c-a790-21e1d8147757", Groups:[]string{"system:serviceaccounts", "system:serviceaccounts:openshift-cluster-csi-drivers", "system:authenticated"}, Extra:map[string][]string{"authentication.kubernetes.io/pod-name":[]string{"openstack-cinder-csi-driver-controller-d4b544bbf-5j2h8"}, "authentication.kubernetes.io/pod-uid":[]string{"04b35d92-e469-4570-916e-5a0306d1cf3b"}}} from its queue
I0309 17:08:22.336472      16 handler.go:153] kube-aggregator: POST "/api/v1/persistentvolumes" satisfied by nonGoRestful
I0309 17:08:22.336494      16 pathrecorder.go:247] kube-aggregator: "/api/v1/persistentvolumes" satisfied by prefix /api/
I0309 17:08:22.336505      16 handler.go:143] kube-apiserver: POST "/api/v1/persistentvolumes" satisfied by gorestful with webservice /api/v1
I0309 17:08:22.338705      16 openstack.go:942] Using Blockstorage API V2

I don't know why the v3 API is not used; the API server tried it and got an error from NewBlockStorageV3(): https://github.com/openshift/kubernetes/blob/7478cf2c86c567e1bbe71cefa7267600ed64cfd6/staging/src/k8s.io/legacy-cloud-providers/openstack/openstack.go#L936
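
To make the fallback concrete, a rough paraphrase of the auto-detection around the linked line (a sketch assuming bs-version is left at its "auto" default; names approximate, not the exact upstream code):

package sketch

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
)

// pickBlockStorageClient mimics the "auto" detection: prefer v3, fall
// back to v2. In this bug NewBlockStorageV3 fails because the old
// vendored gophercloud cannot choose between the duplicate cinderv3
// endpoints, so the provider silently lands on v2 -- which the Wallaby
// cloud no longer serves.
func pickBlockStorageClient(provider *gophercloud.ProviderClient, eo gophercloud.EndpointOpts) (*gophercloud.ServiceClient, error) {
	if c, err := openstack.NewBlockStorageV3(provider, eo); err == nil {
		return c, nil // "Using Blockstorage API V3"
	}
	return openstack.NewBlockStorageV2(provider, eo) // "Using Blockstorage API V2"
}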

Comment 4 Jan Safranek 2022-03-09 18:11:11 UTC
The volume ae149e9f-7e98-46b2-b716-04658c91bd8f is available in OpenStack Cinder and I can see it was freshly provisioned. I don't see anything wrong on the Cinder side.

Comment 5 Martin André 2022-03-10 08:44:19 UTC
Here's what we found so far:
- the cloud returns multiple endpoints for cinderv3, as shown in https://github.com/openshift/kubernetes/pull/1208#issuecomment-1063785752
- gophercloud doesn't know what to do with that and errors out (see the sketch after this list); this was fixed in gophercloud v0.7 with https://github.com/gophercloud/gophercloud/pull/1766
- the gophercloud vendored in the in-tree cloud provider is prehistoric (v0.1) and doesn't have the above patch from 2019
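
A self-contained sketch of the pre-fix behavior (types and messages hypothetical, paraphrasing what old gophercloud's endpoint resolution does):

package sketch

import (
	"errors"
	"fmt"
)

// Endpoint is a hypothetical stand-in for gophercloud's catalog entry.
type Endpoint struct {
	Type, Interface, URL string
}

// endpointURL paraphrases pre-v0.7 gophercloud: filter the catalog by
// service type and interface, and treat anything other than exactly one
// match as an error. Two cinderv3 endpoints therefore make the v3
// client unbuildable, and the in-tree provider falls back to v2.
func endpointURL(catalog []Endpoint, serviceType, iface string) (string, error) {
	var matches []Endpoint
	for _, e := range catalog {
		if e.Type == serviceType && e.Interface == iface {
			matches = append(matches, e)
		}
	}
	switch len(matches) {
	case 0:
		return "", errors.New("no suitable endpoint could be found")
	case 1:
		return matches[0].URL, nil
	default:
		// gophercloud v0.7 (the PR above) relaxed this case instead of erroring.
		return "", fmt.Errorf("discovered %d matching endpoints, expected 1", len(matches))
	}
}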

Now we need to figure out the best way to bump gophercloud to a more recent release.

Additionally, it appears we do not respect the `volume_api_version` from `clouds.yaml` when generating the `cloud.conf`. It's not a huge issue, as it only affects the in-tree cloud provider, which should go away in the near future.
Cinder CSI consumes the `clouds.yaml` file directly and relies on gophercloud to respect the `volume_api_version`.
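
To make that concrete (values illustrative, not taken from this cluster): honoring the setting would mean translating this `clouds.yaml` snippet

clouds:
  openstack:
    volume_api_version: "3"

into something like the following in the generated `cloud.conf`

[BlockStorage]
bs-version = v3

but today the `[BlockStorage]` section is written without `bs-version`, so the in-tree provider keeps its "auto" detection (and hits the fallback shown in comment 3).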

Comment 6 Martin André 2022-03-10 08:45:17 UTC
Moved to the OpenStack CSI Drivers sub-component so that it falls to the shiftstack team. The issue is really with the in-tree Cinder provisioner.

Comment 8 Martin André 2022-03-10 11:11:35 UTC
Setting the doc type to "No Doc Update". It's highly unlikely a customer would end up in the same situation we're in now, because:
- they shouldn't have multiple endpoints for the same service type to start with
- they should have a functioning cinderv2 endpoint, no matter what
- we'll hopefully have switched to the external CCM by the time we start supporting environments without Cinder v2

Comment 9 Martin André 2022-03-10 13:47:35 UTC
Everything seems to be back in order after our cloud provider removed the duplicate volumev3 endpoint. I'll keep this BZ open for a little while and close it once we've confirmed our periodic jobs are healthy.
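
For the record, the catalog can be sanity-checked with something like:

  openstack endpoint list --service volumev3

which should now show a single cinderv3 endpoint per interface.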

Comment 10 Martin André 2022-03-11 12:22:00 UTC
This particular issue appears to be fixed. The cloud still sometimes has trouble provisioning VMs, but that's an unrelated issue.

