Description of problem (please be as detailed as possible and provide log snippets):

RBD PVC remains in Pending state even when a thick-provision-enabled storage class is used. The ocs-ci test case below is failing because the PVC does not reach Bound state.

tests/manage/pv_services/test_rbd_thick_provisioning.py::TestRbdThickProvisioning::test_rbd_thick_provisioning

Test case error:

>       raise ResourceWrongStatusException(resource.name, resource.describe())
E       ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource pvc-test-a639c8c48fec4bc0ac45a38a9516947 describe output: Name:          pvc-test-a639c8c48fec4bc0ac45a38a9516947
E       Namespace:     namespace-test-42a5043c7e2345ae8670eec6b
E       StorageClass:  storageclass-test-rbd-d7307646d1684f939c
E       Status:        Pending
E       Volume:
E       Labels:        <none>
E       Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
E       Finalizers:    [kubernetes.io/pvc-protection]
E       Capacity:
E       Access Modes:
E       VolumeMode:    Filesystem
E       Used By:       <none>
E       Events:
E         Type     Reason                Age                  From                                                                                                                      Message
E         ----     ------                ----                 ----                                                                                                                      -------
E         Warning  ProvisioningFailed    2m33s                openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65464976df-qjrmt_3e20aa8c-e266-4fc3-bb42-a396f877d17f       failed to provision volume with StorageClass "storageclass-test-rbd-d7307646d1684f939c": rpc error: code = DeadlineExceeded desc = context deadline exceeded
E         Normal   Provisioning          26s (x9 over 5m3s)   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65464976df-qjrmt_3e20aa8c-e266-4fc3-bb42-a396f877d17f       External provisioner is provisioning volume for claim "namespace-test-42a5043c7e2345ae8670eec6b/pvc-test-a639c8c48fec4bc0ac45a38a9516947"
E         Warning  ProvisioningFailed    26s (x8 over 2m33s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65464976df-qjrmt_3e20aa8c-e266-4fc3-bb42-a396f877d17f       failed to provision volume with StorageClass "storageclass-test-rbd-d7307646d1684f939c": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-70f0efe9-1e97-4017-982c-6214fec2bf01 already exists
E         Normal   ExternalProvisioning  8s (x23 over 5m3s)   persistentvolume-controller                                                                                              waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator

ocs_ci/helpers/helpers.py:121: ResourceWrongStatusException

must-gather logs - https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/ocs-ci/470/artifact/logs/failed_testcase_ocs_logs_1627307344/test_rbd_thick_provisioning_ocs_logs/

Version of all relevant components (if applicable):
OCS operator v4.9.0-30.ci
OCP 4.9.0-0.nightly-2021-07-26-071921
Ceph Version 16.2.0-81.el8cp (8908ce967004ed706acb5055c01030e6ecd06036) pacific (stable)
rook-csi-provisioner ose-csi-external-provisioner@sha256:b3e32d68f1652902fd7e2e3eebf67345754fde66b96a1417aab8efea174a452e

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. The PVC does not reach Bound state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, 5/10

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Yes. The test case passes in OCS 4.8.

Steps to Reproduce:
1. Create an RBD pool and a storage class with thick provisioning enabled.
2. Create an RBD PVC using the storage class.
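For a manual reproduction outside ocs-ci, something like the following should work (a sketch based on the manifests later in this report; the pool, storage class, and PVC names are placeholders, and copying the default ocs-storagecluster-ceph-rbd storage class is just a convenient way to pick up the correct CSI secret parameters):

$ cat <<EOF | oc create -f -
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: thick-test-pool
  namespace: openshift-storage
spec:
  failureDomain: rack
  replicated:
    size: 3
EOF

$ # Copy the default RBD storage class, rename it, point it at the new pool,
$ # and add the thick-provisioning parameter.
$ oc get sc ocs-storagecluster-ceph-rbd -o yaml > thick-sc.yaml
$ # edit thick-sc.yaml: set metadata.name to thick-test-sc,
$ #   parameters.pool to thick-test-pool, and parameters.thickProvision: "true"
$ oc create -f thick-sc.yaml

$ cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thick-test-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: thick-test-sc
EOF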
Or run the ocs-ci test case tests/manage/pv_services/test_rbd_thick_provisioning.py::TestRbdThickProvisioning::test_rbd_thick_provisioning. The test case creates 3 PVCs using the RBD thick storage class.

Actual results:
PVC does not reach Bound state.

Expected results:
PVC should be Bound.

Additional info:
One of the very old PVC requests which is still in Pending state: pvc-test-cda30beb2566450baa7f1ae588582f2

I0728 08:35:57.225211 1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
I0728 08:35:57.225225 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-897c9a08-69ad-4c21-8a6f-19d1cf9976ed","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-897c9a08-69ad-4c21-8a6f-19d1cf9976ed","csi.storage.k8s.io/pvc/name":"pvc-test-62ff7ddf6f284aa59d6ec3b3c3bde04","csi.storage.k8s.io/pvc/namespace":"namespace-test-c11e17dc2e1d43988c0ebbdca","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-a78dccba1be145eeb9c9881b512900a","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0728 08:35:58.180941 1 controller.go:1332] provision "namespace-test-c11e17dc2e1d43988c0ebbdca/pvc-test-cda30beb2566450baa7f1ae588582f2" class "storageclass-test-rbd-dbfeb4c91a324d91a4": started
I0728 08:35:58.181075 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-c11e17dc2e1d43988c0ebbdca", Name:"pvc-test-cda30beb2566450baa7f1ae588582f2", UID:"06d80ea7-23b9-48a3-a282-89cd02e881f1", APIVersion:"v1", ResourceVersion:"101912", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "namespace-test-c11e17dc2e1d43988c0ebbdca/pvc-test-cda30beb2566450baa7f1ae588582f2"
....
I0728 09:51:58.931892 1 controller.go:1106] Temporary error received, adding PVC 06d80ea7-23b9-48a3-a282-89cd02e881f1 to claims in progress
W0728 09:51:58.931903 1 controller.go:958] Retrying syncing claim "06d80ea7-23b9-48a3-a282-89cd02e881f1", failure 28
E0728 09:51:58.931918 1 controller.go:981] error syncing claim "06d80ea7-23b9-48a3-a282-89cd02e881f1": failed to provision volume with StorageClass "storageclass-test-rbd-dbfeb4c91a324d91a4": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-06d80ea7-23b9-48a3-a282-89cd02e881f1 already exists
I0728 09:51:58.931937 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-c11e17dc2e1d43988c0ebbdca", Name:"pvc-test-cda30beb2566450baa7f1ae588582f2", UID:"06d80ea7-23b9-48a3-a282-89cd02e881f1", APIVersion:"v1", ResourceVersion:"101912", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "storageclass-test-rbd-dbfeb4c91a324d91a4": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-06d80ea7-23b9-48a3-a282-89cd02e881f1 already exists
.....

The subjected pool does not report any issues:

pool 13 'cbp-test-a78dccba1be145eeb9c9881b512900a' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 208 flags hashpspool stripe_width 0 compression_mode none application rbd

No RBD images exist in the pool yet:

bash-4.4$ rbd ls cbp-test-a78dccba1be145eeb9c9881b512900a

Ceph reports a healthy state:

bash-4.4$ ceph status
  cluster:
    id:     86f30a45-9d27-4712-bbbe-60f5c58fb683
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 225 pgs
    objects: 691 objects, 1.9 GiB
    usage:   5.4 GiB used, 295 GiB / 300 GiB avail
    pgs:     225 active+clean

  io:
    client: 853 B/s rd, 228 KiB/s wr, 1 op/s rd, 3 op/s wr

----------------
In the same cluster, thick-provisioned PVC creation has gone through previously. For example:

I0728 07:48:36.126984 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-d3420719e8854ae98ad6b9895", Name:"pvc-test-40671b461dbe4f04942bce7d8d636d9", UID:"ebfa8adf-6209-4b94-9e72-37cad72f28d8", APIVersion:"v1", ResourceVersion:"69660", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-ebfa8adf-6209-4b94-9e72-37cad72f28d8
I0728 07:48:36.286685 1 connection.go:185] GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-8c12437a-bb9b-439a-adea-a67f886afb5c","csi.storage.k8s.io/pvc/name":"pvc-test-6d39d7312549439ebca74c5d740eee3","csi.storage.k8s.io/pvc/namespace":"namespace-test-d3420719e8854ae98ad6b9895","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-29dfbe0f-ef78-11eb-abdf-0a580a810222","journalPool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","pool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-0000000000000005-29dfbe0f-ef78-11eb-abdf-0a580a810222"}}
I0728 07:48:36.286766 1 connection.go:186] GRPC error: <nil>
I0728 07:48:36.286778 1 controller.go:763] create volume rep: {CapacityBytes:1073741824 VolumeId:0001-0011-openshift-storage-0000000000000005-29dfbe0f-ef78-11eb-abdf-0a580a810222 VolumeContext:map[clusterID:openshift-storage csi.storage.k8s.io/pv/name:pvc-8c12437a-bb9b-439a-adea-a67f886afb5c csi.storage.k8s.io/pvc/name:pvc-test-6d39d7312549439ebca74c5d740eee3 csi.storage.k8s.io/pvc/namespace:namespace-test-d3420719e8854ae98ad6b9895 imageFeatures:layering imageFormat:2 imageName:csi-vol-29dfbe0f-ef78-11eb-abdf-0a580a810222 journalPool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 pool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 thickProvision:true] ContentSource:<nil> AccessibleTopology:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0728 07:48:36.286806 1 controller.go:839] successfully created PV pvc-8c12437a-bb9b-439a-adea-a67f886afb5c for PVC pvc-test-6d39d7312549439ebca74c5d740eee3 and csi volume name 0001-0011-openshift-storage-0000000000000005-29dfbe0f-ef78-11eb-abdf-0a580a810222
...
I0728 07:48:36.288686 1 connection.go:185] GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-a67108b5-9b97-427d-8788-e4ca9cc81710","csi.storage.k8s.io/pvc/name":"pvc-test-3e3d109da7d644b9a55aec6ab56e7d3","csi.storage.k8s.io/pvc/namespace":"namespace-test-d3420719e8854ae98ad6b9895","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-28949613-ef78-11eb-abdf-0a580a810222","journalPool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","pool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-0000000000000005-28949613-ef78-11eb-abdf-0a580a810222"}}
I0728 07:48:36.288751 1 connection.go:186] GRPC error: <nil>
I0728 07:48:36.288761 1 controller.go:763] create volume rep: {CapacityBytes:1073741824 VolumeId:0001-0011-openshift-storage-0000000000000005-28949613-ef78-11eb-abdf-0a580a810222 VolumeContext:map[clusterID:openshift-storage csi.storage.k8s.io/pv/name:pvc-a67108b5-9b97-427d-8788-e4ca9cc81710 csi.storage.k8s.io/pvc/name:pvc-test-3e3d109da7d644b9a55aec6ab56e7d3 csi.storage.k8s.io/pvc/namespace:namespace-test-d3420719e8854ae98ad6b9895 imageFeatures:layering imageFormat:2 imageName:csi-vol-28949613-ef78-11eb-abdf-0a580a810222 journalPool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 pool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 thickProvision:true] ContentSource:<nil> AccessibleTopology:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0728 07:48:36.288787 1 controller.go:839] successfully created PV pvc-a67108b5-9b97-427d-8788-e4ca9cc81710 for PVC pvc-test-3e3d109da7d644b9a55aec6ab56e7d3 and csi volume name 0001-0011-openshift-storage-0000000000000005-28949613-ef78-11eb-abdf-0a580a810222
...

The issue started to pop up from this timestamp: I0728 08:35:06.615209
Manual creation of a new thick-provisioned image goes through in the subjected pool:

bash-4.4$ rbd create -s 1G --image-format 2 --image-feature "deep-flatten,layering" --thick-provision cbp-test-b0366283ecf0454c9c4c920b2b37436/teamtestimage
Thick provisioning: 100% complete...done.
bash-4.4$
bash-4.4$ rbd ls cbp-test-b0366283ecf0454c9c4c920b2b37436
teamtestimage
tmp
bash-4.4$ rbd info cbp-test-b0366283ecf0454c9c4c920b2b37436/teamtestimage
rbd image 'teamtestimage':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 39c79cf64abb
        block_name_prefix: rbd_data.39c79cf64abb
        format: 2
        features: layering, deep-flatten
        op_features:
        flags:
        create_timestamp: Wed Jul 28 11:22:22 2021
        access_timestamp: Wed Jul 28 11:22:22 2021
        modify_timestamp: Wed Jul 28 11:22:22 2021
bash-4.4$
Unfortunately the above cluster got destroyed. jilju++ reproduced this issue again today; here are the cluster details:

kubeconfig - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jul29/jijoy-jul29_20210729T060639/openshift-cluster-dir/auth/kubeconfig

The Pending PVCs are in namespace namespace-test-46d3052a065548338ede08b4d.
Pending PVCs:

$ oc -n namespace-test-46d3052a065548338ede08b4d get pvc
NAME                                       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                               AGE
pvc-test-637b2a14a5b743039066b3467129773   Pending                                      storageclass-test-rbd-9a5e8645b188468598   111m
pvc-test-869e94f6f24440cca596621f3b51899   Pending                                      storageclass-test-rbd-9a5e8645b188468598   111m
pvc-test-e6019fe9e3ee4882b409c389c9577e2   Pending                                      storageclass-test-rbd-9a5e8645b188468598   111m

RBD pool that is used:

$ oc get sc/storageclass-test-rbd-9a5e8645b188468598 -oyaml | grep -w pool
      f:pool: {}
    pool: cbp-test-4a64a7ef23734d318db07b2683dfedc

RBD images inside the pool:

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rbd ls cbp-test-4a64a7ef23734d318db07b2683dfedc
csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
csi-vol-f00f8881-f042-11eb-9707-0a580a800210
csi-vol-f1c13330-f042-11eb-9707-0a580a800210

Pick an image in the pool and find it in the csi-rbd-provisioner logs:

$ oc -n openshift-storage get leases/openshift-storage-rbd-csi-ceph-com
NAME                                 HOLDER                                       AGE
openshift-storage-rbd-csi-ceph-com   csi-rbdplugin-provisioner-6bd49cbf55-77brq   4h39m

$ oc -n openshift-storage logs -c csi-rbdplugin csi-rbdplugin-provisioner-6bd49cbf55-77brq | grep csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
I0729 07:59:44.146001 1 omap.go:148] ID: 88 Req-ID: pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.ef7cdde4-f042-11eb-9707-0a580a800210"): map[csi.imagename:csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210 csi.volname:pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf csi.volume.owner:namespace-test-46d3052a065548338ede08b4d])
I0729 07:59:44.146042 1 rbd_journal.go:472] ID: 88 Req-ID: pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf generated Volume ID (0001-0011-openshift-storage-000000000000000d-ef7cdde4-f042-11eb-9707-0a580a800210) and image name (csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210) for request name (pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf)
I0729 07:59:44.146129 1 rbd_util.go:239] ID: 88 Req-ID: pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf rbd: create cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210 size 1024M (features: [layering]) using mon 172.30.204.155:6789,172.30.41.4:6789,172.30.2.40:6789

The image seems to have been created, but the logs do not contain a gRPC response for 'ID: 88'. The creation of the image is somehow stuck.

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rbd status cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
Watchers:
        watcher=10.128.2.16:0/2179620419 client.25634 cookie=140434827411680

There is a watcher with IP address 10.128.2.16; this could be the provisioner, in case it is still busy writing zeros to the image.

$ oc -n openshift-storage get pods -l app=csi-rbdplugin-provisioner -owide | grep 10.128.2.16
csi-rbdplugin-provisioner-6bd49cbf55-77brq   6/6   Running   2   4h46m   10.128.2.16   compute-0   <none>   <none>

csi-rbdplugin-provisioner-6bd49cbf55-77brq is the active provisioner (the one that holds the lease). This suggests the provisioning has not finished yet.
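To check whether the other stuck images are held open by the same client, the watcher list can be dumped for every image in the pool (a quick sketch using the toolbox pod and pool from above):

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ for img in $(rbd ls cbp-test-4a64a7ef23734d318db07b2683dfedc); do
          echo "== ${img}"
          rbd status "cbp-test-4a64a7ef23734d318db07b2683dfedc/${img}"
        done

If all three images show a watcher from 10.128.2.16, they are all being held by the same active provisioner pod.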
sh-4.4$ rbd info cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
rbd image 'csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 64221962d97c
        block_name_prefix: rbd_data.64221962d97c
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Thu Jul 29 08:00:02 2021
        access_timestamp: Thu Jul 29 08:00:02 2021
        modify_timestamp: Thu Jul 29 08:00:02 2021

sh-4.4$ rbd du cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
warning: fast-diff map is not enabled for csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210. operation may be slow.
NAME                                           PROVISIONED  USED
csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210         1 GiB  1 GiB

It looks like this image is allocated completely, so provisioning should be done.

Next steps:
- find out where the cephcsi executable is hanging (what tools are available in the running container?)
- is this a problem in the thick-provisioning loop, or is it stuck in a librbd call?
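Related to the first item above: one way to check whether the zero-fill loop is still making progress (a sketch on my side, not something that was run here) is to count the RADOS data objects written for the image, using the block_name_prefix from the rbd info output above. For a fully written 1 GiB image with 4 MiB objects the count should approach 256:

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rados -p cbp-test-4a64a7ef23734d318db07b2683dfedc ls | grep -c rbd_data.64221962d97c

Running this a few times a minute apart would show whether the number is still growing or whether the allocation has stalled.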
Created attachment 1807494 [details]
golang profiling, call stacks

Enabled golang profiling in the provisioner deployment by adding --enableprofiling=true to the arguments of the csi-rbdplugin container:

$ oc -n openshift-storage edit deployment/csi-rbdplugin-provisioner
...
      - args:
        - --nodeid=$(NODE_ID)
...
        - --enableprofiling=true
...

The provisioner pods get restarted automatically. A create request comes in again, but hangs as well. The image "csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210" already exists on the backend, and provisioning was as good as completed:

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rbd image-meta ls cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-f00f8881-f042-11eb-9707-0a580a800210
There is 1 metadatum on this image:

Key                                 Value
rbd.csi.ceph.com/thick-provisioned  true

sh-4.4$ rbd du cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-f00f8881-f042-11eb-9707-0a580a800210
warning: fast-diff map is not enabled for csi-vol-f00f8881-f042-11eb-9707-0a580a800210. operation may be slow.
NAME                                           PROVISIONED  USED
csi-vol-f00f8881-f042-11eb-9707-0a580a800210         1 GiB  1 GiB
sh-4.4$ exit

Kubernetes never got the acknowledgement that the volume was created, so a retry is issued:

$ oc -n openshift-storage logs -c csi-rbdplugin csi-rbdplugin-provisioner-65b745d84f-5w6zf
...
I0729 13:16:56.279978 1 utils.go:171] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad GRPC call: /csi.v1.Controller/CreateVolume
I0729 13:16:56.280152 1 utils.go:175] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad","csi.storage.k8s.io/pvc/name":"pvc-test-869e94f6f24440cca596621f3b51899","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0729 13:16:56.280325 1 rbd_util.go:1067] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad setting disableInUseChecks: false image features: [layering] mounter: rbd
I0729 13:16:56.305685 1 omap.go:84] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad:f00f8881-f042-11eb-9707-0a580a800210]
I0729 13:16:56.306785 1 omap.go:84] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.f00f8881-f042-11eb-9707-0a580a800210"): map[csi.imagename:csi-vol-f00f8881-f042-11eb-9707-0a580a800210 csi.volname:pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad csi.volume.owner:namespace-test-46d3052a065548338ede08b4d]

gRPC "ID: 19" is the re-issued CreateVolume call that would create csi.imagename:csi-vol-f00f8881-f042-11eb-9707-0a580a800210. The image already exists and is fully provisioned; even the rbd.csi.ceph.com/thick-provisioned metadata is set (which is only done after allocating the image).
Connecting the Golang profiling tools to the provisioner:

$ oc -n openshift-storage port-forward pod/csi-rbdplugin-provisioner-65b745d84f-5w6zf 9090

And in another terminal on the same system:

$ go tool pprof http://localhost:9090/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:9090/debug/pprof/goroutine
Saved profile in /home/ndevos/pprof/pprof.cephcsi.goroutine.002.pb.gz
File: cephcsi
Build ID: 94169ca92a09fc707a9a11c31ffe1be03d5470cc
Type: goroutine
Time: Jul 29, 2021 at 3:40pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 27, 100% of 27 total
Showing top 10 nodes out of 83
      flat  flat%   sum%        cum   cum%
        20 74.07% 74.07%         20 74.07%  runtime.gopark
         6 22.22% 96.30%          6 22.22%  runtime.cgocall
         1  3.70%   100%          1  3.70%  runtime/pprof.runtime_goroutineProfileWithLabels
         0     0%   100%          5 18.52%  bufio.(*Reader).Read
         0     0%   100%          1  3.70%  github.com/ceph/ceph-csi/internal/csi-common.(*nonBlockingGRPCServer).Wait
         0     0%   100%          1  3.70%  github.com/ceph/ceph-csi/internal/csi-common.(*nonBlockingGRPCServer).serve
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/csi-common.contextIDInjector
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/csi-common.logGRPC
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/csi-common.panicHandler
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/rbd.(*ControllerServer).CreateVolume
(pprof) web

The last 'web' command was used to create the attached .svg file.

Golang profiling shows that there are several goroutines executing rbd.rbdImage.open() and rbd.CreateImage(). These calls end up in runtime.cgocall, as that takes care of the native librbd functions.
Created attachment 1809335 [details]
goroutines call traces in text form

As done in comment #8, profiling was (still) enabled, and the text-formatted goroutine stack traces have been captured.

Enable port forwarding:

$ oc -n openshift-storage port-forward pod/csi-rbdplugin-provisioner-65b745d84f-tntxq 9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090

Download the stacktraces:

$ curl 'http://localhost:9090/debug/pprof/goroutine?debug=2' > cephcsi.goroutines.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50564    0 50564    0     0  55321      0 --:--:-- --:--:-- --:--:-- 55261

This shows that after the most recent provisioner restart, the call to librbd/rbd_open() has been stuck for 49 minutes:

goroutine 93 [syscall, 49 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_open(0x7f48f4000ba0, 0x7f48f4000b60, 0xc00041a310, 0x0, 0x0)
        _cgo_gotypes.go:1711 +0x48
github.com/ceph/go-ceph/rbd.OpenImage.func3(0xc00041a2f8, 0x7f48f4000b60, 0xc00041a310, 0x0, 0xc00041a2f8)
        /remote-source/app/vendor/github.com/ceph/go-ceph/rbd/rbd.go:1043 +0xb3
github.com/ceph/go-ceph/rbd.OpenImage(0xc00041a2f8, 0xc000a066f0, 0x2c, 0x0, 0x0, 0x0, 0x0, 0x0)
        /remote-source/app/vendor/github.com/ceph/go-ceph/rbd/rbd.go:1043 +0x105
github.com/ceph/ceph-csi/internal/rbd.(*rbdImage).open(0xc000a3c000, 0x19462df, 0x1d40fa0, 0xc0003f4c00)
        /remote-source/app/internal/rbd/rbd_util.go:340 +0x79
github.com/ceph/ceph-csi/internal/rbd.(*rbdVolume).getImageInfo(0xc000a3c000, 0x0, 0x0)
        /remote-source/app/internal/rbd/rbd_util.go:1258 +0x46
github.com/ceph/ceph-csi/internal/rbd.(*rbdVolume).Exists(0xc000a3c000, 0x21d8278, 0xc0003f46c0, 0x0, 0x0, 0x0, 0x0)
        /remote-source/app/internal/rbd/rbd_journal.go:287 +0x30e
github.com/ceph/ceph-csi/internal/rbd.(*ControllerServer).CreateVolume(0xc0003a00e0, 0x21d8278, 0xc0003f46c0, 0xc00034a700, 0x0, 0x0, 0x0)
        /remote-source/app/internal/rbd/controllerserver.go:299 +0x626
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0x0, 0x0, 0x1, 0xc000a00c10)
        /remote-source/app/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5676 +0x89
github.com/ceph/ceph-csi/internal/csi-common.panicHandler(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc000a12168, 0x0, 0x0, 0x0, 0x0)
        /remote-source/app/internal/csi-common/utils.go:199 +0x96
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0xc0007f5980, 0x1, 0x1, 0x30)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/ceph/ceph-csi/internal/csi-common.logGRPC(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc0001320e0, 0x1d75bc0, 0xc0003f46c0, 0x1b69ba0, 0xc000a00720)
        /remote-source/app/internal/csi-common/utils.go:178 +0x206
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0x1b70320, 0xc000a00730, 0x21d8278, 0xc0003f46c0)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/ceph/ceph-csi/internal/csi-common.contextIDInjector(0x21d8278, 0xc0003f4690, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc000132100, 0x18b517a, 0x1d28aa0, 0xc000132120, 0xc0001320c0)
        /remote-source/app/internal/csi-common/utils.go:163 +0x131
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x21d8278, 0xc0003f45a0, 0x1dda8c0, 0xc00034a700, 0xc0003ac400, 0x0, 0xc0007f5b30, 0x41b258)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x21d8278, 0xc0003f45a0, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc000a12168, 0xc000858ba0, 0x520246, 0x1d75bc0, 0xc0003f45a0)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xd7
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler(0x1e22000, 0xc0003a00e0, 0x21d8278, 0xc0003f45a0, 0xc0000af6e0, 0xc00035bb60, 0x21d8278, 0xc0003f45a0, 0xc000a1a400, 0x1f6)
        /remote-source/app/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5678 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000141dc0, 0x21ec318, 0xc000385e00, 0xc000a2c000, 0xc00035bd10, 0x2fff6a0, 0x0, 0x0, 0x0)
        /remote-source/app/vendor/google.golang.org/grpc/server.go:1217 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000141dc0, 0x21ec318, 0xc000385e00, 0xc000a2c000, 0x0)
        /remote-source/app/vendor/google.golang.org/grpc/server.go:1540 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00043a3e0, 0xc000141dc0, 0x21ec318, 0xc000385e00, 0xc000a2c000)
        /remote-source/app/vendor/google.golang.org/grpc/server.go:878 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /remote-source/app/vendor/google.golang.org/grpc/server.go:876 +0x1fd

Other goroutines are stuck in a similar fashion too.
Hi Ilya, do you have any suggestions for adding additional logging from an RBD perspective?

It seems ceph-csi is stuck in librbd:rbd_open(), and as things run inside a container, capturing a coredump or C-level stacktraces is difficult.

It would help if there is a ceph commandline method that can enable debug logging in the MONs or OSDs, or even on the client side (without having to run anything inside the client container), that can point to the cause of these hangs. Thanks!
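One client-side option that might help here (a suggestion on my part, not something that has been tried in this report) is raising the librbd/librados debug levels through the centralized Ceph config from the toolbox, so that newly established client connections such as the provisioner's become much more verbose:

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ ceph config set client debug_rbd 20
sh-4.4$ ceph config set client debug_rados 20
sh-4.4$ ceph config set client debug_ms 1
sh-4.4$ # revert when done, the extra logging is very noisy
sh-4.4$ ceph config rm client debug_rbd
sh-4.4$ ceph config rm client debug_rados
sh-4.4$ ceph config rm client debug_ms

Two caveats: librbd only picks these settings up when it (re)connects, so the provisioner pod would likely need a restart, and where the client-side messages end up depends on the pod's log settings (log_to_stderr and friends), so additional tweaking may be needed.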
Reverting to an older release of ceph-csi (upstream v3.3.0 with librbd1-15.2.10-0.el8.x86_64) makes the provisioning succeed... The originally deployed image uses a newer version of Ceph (librbd1-16.2.0-81.el8cp.x86_64). Ceph-CSI v3.4.0 hangs in the same way.

Next step:

Going to build the upstream Ceph-CSI devel branch into a Ceph v15 container image.
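For reference, swapping the cephcsi image in the provisioner deployment can be done with something like the following (a sketch; the image reference is the test image mentioned in the next comment, and the rook/OCS operators may reconcile the deployment back to the original image, so they might need to be scaled down first):

$ oc -n openshift-storage scale deployment/rook-ceph-operator --replicas=0
$ oc -n openshift-storage set image deployment/csi-rbdplugin-provisioner csi-rbdplugin=quay.io/nixpanic/ceph-csi:bz1986794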
(In reply to Niels de Vos from comment #11)
> Next step:
>
> Going to build upstream Ceph-CSI devel branch into a Ceph v15 container
> image.

Ceph-CSI devel (close to v3.4.0) image based on latest ceph/ceph:v15: quay.io/nixpanic/ceph-csi:bz1986794

Using this image, the RBD image creation succeeds immediately:

I0730 13:20:37.528631 1 utils.go:176] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 GRPC call: /csi.v1.Controller/CreateVolume
I0730 13:20:37.529198 1 utils.go:180] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0730 13:20:37.529624 1 rbd_util.go:1109] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0730 13:20:37.554109 1 omap.go:86] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[]
I0730 13:20:37.564327 1 omap.go:154] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004:edd4d0a7-f138-11eb-9966-0a580a8103ef])
I0730 13:20:37.567945 1 omap.go:154] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.edd4d0a7-f138-11eb-9966-0a580a8103ef"): map[csi.imagename:csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef csi.volname:pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 csi.volume.owner:namespace-test-46d3052a065548338ede08b4d])
I0730 13:20:37.567991 1 rbd_journal.go:483] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 generated Volume ID (0001-0011-openshift-storage-000000000000000d-edd4d0a7-f138-11eb-9966-0a580a8103ef) and image name (csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef) for request name (pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004)
I0730 13:20:37.568117 1 rbd_util.go:242] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 rbd: create cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef size 1024M (features: [layering]) using mon 172.30.204.155:6789,172.30.41.4:6789,172.30.2.40:6789
I0730 13:20:56.742435 1 controllerserver.go:662] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 created volume pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 backed by image csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef
I0730 13:20:56.776063 1 omap.go:154] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.edd4d0a7-f138-11eb-9966-0a580a8103ef"): map[csi.imageid:18ef2e3989d66])
I0730 13:20:56.776723 1 utils.go:187] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef","journalPool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-000000000000000d-edd4d0a7-f138-11eb-9966-0a580a8103ef"}}

Reverted back to the original image used for the deployment (quay.io/rhceph-dev/cephcsi@sha256:4cf1e012336d55a0f240e1206181984accd93c03b9a212431e21c25a9a2b3afb), and things still work?! Even tried with multiple parallel PVCs, not able to reproduce again.

I0730 13:24:18.403671 1 utils.go:171] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 GRPC call: /csi.v1.Controller/CreateVolume
I0730 13:24:18.403973 1 utils.go:175] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0730 13:24:18.404224 1 rbd_util.go:1067] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0730 13:24:18.423590 1 omap.go:84] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[]
I0730 13:24:18.431642 1 omap.go:148] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05:717ad462-f139-11eb-afe7-0a580a8003b9])
I0730 13:24:18.434827 1 omap.go:148] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.717ad462-f139-11eb-afe7-0a580a8003b9"): map[csi.imagename:csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9 csi.volname:pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 csi.volume.owner:namespace-test-46d3052a065548338ede08b4d])
I0730 13:24:18.434876 1 rbd_journal.go:472] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 generated Volume ID (0001-0011-openshift-storage-000000000000000d-717ad462-f139-11eb-afe7-0a580a8003b9) and image name (csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9) for request name (pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05)
I0730 13:24:18.434987 1 rbd_util.go:239] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 rbd: create cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9 size 1024M (features: [layering]) using mon 172.30.204.155:6789,172.30.41.4:6789,172.30.2.40:6789
I0730 13:24:38.160759 1 utils.go:171] ID: 18 GRPC call: /csi.v1.Identity/Probe
I0730 13:24:38.161007 1 utils.go:175] ID: 18 GRPC request: {}
I0730 13:24:38.161068 1 utils.go:182] ID: 18 GRPC response: {}
I0730 13:24:39.179882 1 controllerserver.go:636] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 created volume pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 backed by image csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9
I0730 13:24:39.216951 1 omap.go:148] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.717ad462-f139-11eb-afe7-0a580a8003b9"): map[csi.imageid:aa81a9046ff9])
I0730 13:24:39.217381 1 utils.go:182] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9","journalPool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-000000000000000d-717ad462-f139-11eb-afe7-0a580a8003b9"}}

Next step:

Identify the exact procedure for reproducing this manually. Jilju, can you help with this?
(In reply to Niels de Vos from comment #12)
> (In reply to Niels de Vos from comment #11)
> Next step:
>
> Identify exact procedure for reproducing this manually. Jilju, can you help
> with this?

Copying the relevant logs from the ocs-ci test case tests/manage/pv_services/test_rbd_thick_provisioning.py. The test case creates a Secret, a Ceph block pool, a storage class with thickProvision: 'true', and 3 PVCs. The YAMLs are given below along with the logs.

13:19:04 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding Secret with name secret-test-rbd-a269c039d62c433b907fed3a
13:19:04 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: Secret
metadata:
  name: secret-test-rbd-a269c039d62c433b907fed3a
  namespace: openshift-storage
stringData:
  userID: admin
  userKey: '*****'

13:19:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage create -f /tmp/Secretqv2__don -o yaml

13:19:14 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding CephBlockPool with name cbp-test-dd4a8e779c7a405aad42612820ac93c
13:19:14 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: cbp-test-dd4a8e779c7a405aad42612820ac93c
  namespace: openshift-storage
spec:
  annotations: null
  failureDomain: rack
  parameters:
    compression_mode: none
  replicated:
    size: 3

13:19:14 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage create -f /tmp/CephBlockPooln6dxeqqy -o yaml
13:19:23 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z ceph osd lspools --format json-pretty
13:19:26 - MainThread - ocs_ci.helpers.helpers - INFO - POOLS are [{'poolnum': 1, 'poolname': 'ocs-storagecluster-cephblockpool'}, {'poolnum': 2, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.meta'}, {'poolnum': 3, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.control'}, {'poolnum': 4, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.log'}, {'poolnum': 5, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.buckets.index'}, {'poolnum': 6, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec'}, {'poolnum': 7, 'poolname': '.rgw.root'}, {'poolnum': 8, 'poolname': 'ocs-storagecluster-cephfilesystem-metadata'}, {'poolnum': 9, 'poolname': 'ocs-storagecluster-cephfilesystem-data0'}, {'poolnum': 10, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.buckets.data'}, {'poolnum': 11, 'poolname': 'device_health_metrics'}, {'poolnum': 13, 'poolname': 'cbp-test-4a64a7ef23734d318db07b2683dfedc'}, {'poolnum': 14, 'poolname': 'cbp-test-5ce92edcf0f74570a564ba6e1395472'}, {'poolnum': 16, 'poolname': 'cbp-test-dd4a8e779c7a405aad42612820ac93c'}]

13:19:26 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding StorageClass with name storageclass-test-rbd-20a57aa726de4d2d9a
13:19:26 - MainThread - ocs_ci.utility.templating - INFO - allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: storageclass-test-rbd-20a57aa726de4d2d9a
  namespace: openshift-storage
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: '*****'
  csi.storage.k8s.io/controller-expand-secret-namespace: '*****'
  csi.storage.k8s.io/node-stage-secret-name: '*****'
  csi.storage.k8s.io/node-stage-secret-namespace: '*****'
  csi.storage.k8s.io/provisioner-secret-name: '*****'
  csi.storage.k8s.io/provisioner-secret-namespace: '*****'
  imageFeatures: layering
  imageFormat: '2'
  pool: cbp-test-dd4a8e779c7a405aad42612820ac93c
  thickProvision: 'true'
provisioner: openshift-storage.rbd.csi.ceph.com
reclaimPolicy: Delete

13:19:26 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage create -f /tmp/StorageClassd8kamsxc -o yaml
13:19:27 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get StorageClass storageclass-test-rbd-20a57aa726de4d2d9a -n openshift-storage -o yaml

13:19:30 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name pvc-test-acdb94965ac44443983a799c9c7124f
13:19:30 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-acdb94965ac44443983a799c9c7124f
  namespace: namespace-test-7eefd165af0e41bb8f03f0e99
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: storageclass-test-rbd-20a57aa726de4d2d9a

13:19:30 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7eefd165af0e41bb8f03f0e99 create -f /tmp/PersistentVolumeClaimiiimwi0x -o yaml

13:19:31 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name pvc-test-b5537e46e5534d8584c94b9d4320588
13:19:31 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-b5537e46e5534d8584c94b9d4320588
  namespace: namespace-test-7eefd165af0e41bb8f03f0e99
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: storageclass-test-rbd-20a57aa726de4d2d9a
  volumeMode: Block

13:19:31 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7eefd165af0e41bb8f03f0e99 create -f /tmp/PersistentVolumeClaimvqes2_ww -o yaml

13:19:32 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name pvc-test-11ef6cc72cd849008dfa4bbe8aae8af
13:19:32 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-11ef6cc72cd849008dfa4bbe8aae8af
  namespace: namespace-test-7eefd165af0e41bb8f03f0e99
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: storageclass-test-rbd-20a57aa726de4d2d9a
  volumeMode: Block

13:19:32 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7eefd165af0e41bb8f03f0e99 create -f /tmp/PersistentVolumeClaimabdko5sq -o yaml

Hi Niels,

Executed the above test case on the same cluster shared for debugging. The PVCs given below did not reach Bound state.

$ oc get pvc -n namespace-test-7eefd165af0e41bb8f03f0e99
NAME                                       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                               AGE
pvc-test-11ef6cc72cd849008dfa4bbe8aae8af   Pending                                      storageclass-test-rbd-20a57aa726de4d2d9a   14m
pvc-test-acdb94965ac44443983a799c9c7124f   Pending                                      storageclass-test-rbd-20a57aa726de4d2d9a   14m
pvc-test-b5537e46e5534d8584c94b9d4320588   Pending                                      storageclass-test-rbd-20a57aa726de4d2d9a   14m
Thanks Jilju! The latest state of the reproducer is:

Fetch the goroutine call traces as in comment #9:

$ curl 'http://localhost:9090/debug/pprof/goroutine?debug=2' > csi-rbdplugin-provisioner-65b745d84f-hr45x.goroutines.txt

All three PVC creations are stuck in calling librbd:rbd_create4():

$ grep -B1 github.com/ceph/go-ceph/rbd._Cfunc csi-rbdplugin-provisioner-65b745d84f-hr45x.goroutines.txt
goroutine 1401 [syscall, 3218 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_create4(0x7fcc18000ba0, 0x7fcc18002840, 0x40000000, 0x7fcc18001040, 0x0)
--
goroutine 1400 [syscall, 3218 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_create4(0x3925af0, 0x391c280, 0x40000000, 0x391c320, 0x0)
--
goroutine 1445 [syscall, 3218 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_create4(0x7fcbc0000da0, 0x7fcbc0002aa0, 0x40000000, 0x7fcbc0002b40, 0x0)

It seems that this is not restricted to thick-provisioning. Creating the image is needed for normal (thin) provisioning too. Thick-provisioning only comes into play after the image has been created; it will then get opened and filled with zeros.

Still trying to find out what the best way is to capture more details.
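If a debugger can be made available on the node or in a debug container, one way to get the C-level stacks for the threads stuck in rbd_create4() would be something like the following (a sketch only; gdb is normally not present on RHCOS nodes or in the CSI image, so this likely needs a toolbox/debug container with matching librbd debuginfo, and the node name should be re-checked against the current active provisioner pod):

$ oc debug node/compute-0
sh-4.4# chroot /host
sh-4.4# pid=$(pgrep -f cephcsi | head -n 1)   # pick the PID belonging to the provisioner pod, not the nodeplugin
sh-4.4# gdb -p "$pid" -batch -ex 'thread apply all bt'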
If this is an OCS tracker for a ceph BZ and has no fix in ceph-csi then it should be on the ceph component and not ceph-csi.
https://bugzilla.redhat.com/show_bug.cgi?id=2000434 is ON_QA which means we have a fix in Ceph. This can be moved to ON_QA once we have an OCS build with the Ceph fix.
*** Bug 2006322 has been marked as a duplicate of this bug. ***
*** Bug 2007442 has been marked as a duplicate of this bug. ***
Executed the test case tests/manage/pv_services/test_rbd_thick_provisioning.py::TestRbdThickProvisioning::test_rbd_thick_provisioning 20 times. No issues observed.

Manually created an RBD thick storage class and PVCs of different sizes. All PVCs reached Bound state.

Verified in version:
ODF full version 4.9.0-189.ci
OCP 4.9.0-0.nightly-2021-10-13-170616
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086