Bug 1986794 - [Tracker for Ceph BZ #2000434] [RBD][Thick] PVC is not reaching Bound state
Summary: [Tracker for Ceph BZ #2000434] [RBD][Thick] PVC is not reaching Bound state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Rakshith
QA Contact: Jilju Joy
URL:
Whiteboard:
Duplicates: 2006322 2007442
Depends On: 2000434
Blocks: 1997738
 
Reported: 2021-07-28 10:55 UTC by Jilju Joy
Modified: 2023-08-09 16:37 UTC
CC List: 13 users

Fixed In Version: v4.9.0-164.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2000434
Environment:
Last Closed: 2021-12-13 17:44:55 UTC
Embargoed:


Attachments
golang profiling, call stacks (110.67 KB, image/svg+xml), 2021-07-29 14:59 UTC, Niels de Vos
goroutines call traces in text form (49.38 KB, text/plain), 2021-07-30 10:25 UTC, Niels de Vos


Links
Red Hat Product Errata RHSA-2021:5086, last updated 2021-12-13 17:45:18 UTC

Description Jilju Joy 2021-07-28 10:55:42 UTC
Description of problem (please be as detailed as possible and provide log snippets):
RBD PVC remains in Pending state when a thick provision enabled storage class is used.
The ocs-ci test case below is failing because the PVC does not reach Bound state.
tests/manage/pv_services/test_rbd_thick_provisioning.py::TestRbdThickProvisioning::test_rbd_thick_provisioning

Test case error:

>           raise ResourceWrongStatusException(resource.name, resource.describe())
E           ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource pvc-test-a639c8c48fec4bc0ac45a38a9516947 describe output: Name:          pvc-test-a639c8c48fec4bc0ac45a38a9516947
E           Namespace:     namespace-test-42a5043c7e2345ae8670eec6b
E           StorageClass:  storageclass-test-rbd-d7307646d1684f939c
E           Status:        Pending
E           Volume:        
E           Labels:        <none>
E           Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
E           Finalizers:    [kubernetes.io/pvc-protection]
E           Capacity:      
E           Access Modes:  
E           VolumeMode:    Filesystem
E           Used By:       <none>
E           Events:
E             Type     Reason                Age                  From                                                                                                                Message
E             ----     ------                ----                 ----                                                                                                                -------
E             Warning  ProvisioningFailed    2m33s                openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65464976df-qjrmt_3e20aa8c-e266-4fc3-bb42-a396f877d17f  failed to provision volume with StorageClass "storageclass-test-rbd-d7307646d1684f939c": rpc error: code = DeadlineExceeded desc = context deadline exceeded
E             Normal   Provisioning          26s (x9 over 5m3s)   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65464976df-qjrmt_3e20aa8c-e266-4fc3-bb42-a396f877d17f  External provisioner is provisioning volume for claim "namespace-test-42a5043c7e2345ae8670eec6b/pvc-test-a639c8c48fec4bc0ac45a38a9516947"
E             Warning  ProvisioningFailed    26s (x8 over 2m33s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65464976df-qjrmt_3e20aa8c-e266-4fc3-bb42-a396f877d17f  failed to provision volume with StorageClass "storageclass-test-rbd-d7307646d1684f939c": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-70f0efe9-1e97-4017-982c-6214fec2bf01 already exists
E             Normal   ExternalProvisioning  8s (x23 over 5m3s)   persistentvolume-controller                                                                                         waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator

ocs_ci/helpers/helpers.py:121: ResourceWrongStatusException
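
For context on the two error types in the events above: the first failure is the gRPC deadline expiring on the initial CreateVolume call, and every later "an operation with the given Volume ID ... already exists" failure is the provisioner's idempotency guard rejecting retries while that first call is still in flight. A minimal sketch of that guard pattern (not the literal ceph-csi code):

package main

import (
	"fmt"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// inFlight tracks request names that have an unfinished CreateVolume call.
type inFlight struct {
	mu  sync.Mutex
	ops map[string]struct{}
}

func (f *inFlight) tryAcquire(name string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, busy := f.ops[name]; busy {
		return false
	}
	f.ops[name] = struct{}{}
	return true
}

func (f *inFlight) release(name string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.ops, name)
}

func createVolume(locks *inFlight, reqName string) error {
	if !locks.tryAcquire(reqName) {
		// Path behind the repeated ProvisioningFailed events: the first call
		// never returned, so each external-provisioner retry is rejected.
		return status.Errorf(codes.Aborted,
			"an operation with the given Volume ID %s already exists", reqName)
	}
	defer locks.release(reqName)
	// ... create and (for thick PVCs) zero-fill the RBD image; if this hangs,
	// the lock is never released and all retries keep failing with Aborted.
	return nil
}

func main() {
	locks := &inFlight{ops: map[string]struct{}{}}
	fmt.Println(createVolume(locks, "pvc-70f0efe9-1e97-4017-982c-6214fec2bf01"))
}

Because the first call never finishes and never releases its lock, every retry from the external-provisioner fails with Aborted, which matches the event stream above.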



must-gather logs - https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/ocs-ci/470/artifact/logs/failed_testcase_ocs_logs_1627307344/test_rbd_thick_provisioning_ocs_logs/

Version of all relevant components (if applicable):
OCS operator	v4.9.0-30.ci
OCP 4.9.0-0.nightly-2021-07-26-071921
Ceph Version	16.2.0-81.el8cp (8908ce967004ed706acb5055c01030e6ecd06036) pacific (stable)

rook-csi-provisioner	ose-csi-external-provisioner@sha256:b3e32d68f1652902fd7e2e3eebf67345754fde66b96a1417aab8efea174a452e


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. The PVC is not reaching Bound state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, 5/10

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Yes. The test case passes in OCS 4.8.

Steps to Reproduce:
1. Create an RBD pool and a storage class with thick provision enabled.
2. Create an RBD PVC using the storage class.

Or run the ocs-ci test case
tests/manage/pv_services/test_rbd_thick_provisioning.py::TestRbdThickProvisioning::test_rbd_thick_provisioning
The test case will create 3 PVCs using the RBD thick storage class.

Actual results:
PVC is not reaching Bound state.

Expected results:
PVC should be Bound


Additional info:

Comment 3 Humble Chirammal 2021-07-28 11:12:41 UTC
One of the oldest pending PVC requests, which is still in Pending state: pvc-test-cda30beb2566450baa7f1ae588582f2



I0728 08:35:57.225211       1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
I0728 08:35:57.225225       1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-897c9a08-69ad-4c21-8a6f-19d1cf9976ed","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-897c9a08-69ad-4c21-8a6f-19d1cf9976ed","csi.storage.k8s.io/pvc/name":"pvc-test-62ff7ddf6f284aa59d6ec3b3c3bde04","csi.storage.k8s.io/pvc/namespace":"namespace-test-c11e17dc2e1d43988c0ebbdca","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-a78dccba1be145eeb9c9881b512900a","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0728 08:35:58.180941       1 controller.go:1332] provision "namespace-test-c11e17dc2e1d43988c0ebbdca/pvc-test-cda30beb2566450baa7f1ae588582f2" class "storageclass-test-rbd-dbfeb4c91a324d91a4": started
I0728 08:35:58.181075       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-c11e17dc2e1d43988c0ebbdca", Name:"pvc-test-cda30beb2566450baa7f1ae588582f2", UID:"06d80ea7-23b9-48a3-a282-89cd02e881f1", APIVersion:"v1", ResourceVersion:"101912", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "namespace-test-c11e17dc2e1d43988c0ebbdca/pvc-test-cda30beb2566450baa7f1ae588582f2"

....

I0728 09:51:58.931892       1 controller.go:1106] Temporary error received, adding PVC 06d80ea7-23b9-48a3-a282-89cd02e881f1 to claims in progress
W0728 09:51:58.931903       1 controller.go:958] Retrying syncing claim "06d80ea7-23b9-48a3-a282-89cd02e881f1", failure 28
E0728 09:51:58.931918       1 controller.go:981] error syncing claim "06d80ea7-23b9-48a3-a282-89cd02e881f1": failed to provision volume with StorageClass "storageclass-test-rbd-dbfeb4c91a324d91a4": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-06d80ea7-23b9-48a3-a282-89cd02e881f1 already exists
I0728 09:51:58.931937       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-c11e17dc2e1d43988c0ebbdca", Name:"pvc-test-cda30beb2566450baa7f1ae588582f2", UID:"06d80ea7-23b9-48a3-a282-89cd02e881f1", APIVersion:"v1", ResourceVersion:"101912", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "storageclass-test-rbd-dbfeb4c91a324d91a4": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-06d80ea7-23b9-48a3-a282-89cd02e881f1 already exists
.....

The state of the pool in question does not report any issues:


pool 13 'cbp-test-a78dccba1be145eeb9c9881b512900a' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 208 flags hashpspool stripe_width 0 compression_mode none application rbd

No RBD images exist in the pool yet:

bash-4.4$ rbd ls cbp-test-a78dccba1be145eeb9c9881b512900a


Ceph state reports healthy:

bash-4.4$ ceph status
  cluster:
    id:     86f30a45-9d27-4712-bbbe-60f5c58fb683
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 225 pgs
    objects: 691 objects, 1.9 GiB
    usage:   5.4 GiB used, 295 GiB / 300 GiB avail
    pgs:     225 active+clean
 
  io:
    client:   853 B/s rd, 228 KiB/s wr, 1 op/s rd, 3 op/s wr
 
----------------

Comment 4 Humble Chirammal 2021-07-28 11:18:34 UTC
In the same cluster, thick provisioned PVC creation has gone through previously:

For example:

I0728 07:48:36.126984       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-d3420719e8854ae98ad6b9895", Name:"pvc-test-40671b461dbe4f04942bce7d8d636d9", UID:"ebfa8adf-6209-4b94-9e72-37cad72f28d8", APIVersion:"v1", ResourceVersion:"69660", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-ebfa8adf-6209-4b94-9e72-37cad72f28d8
I0728 07:48:36.286685       1 connection.go:185] GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-8c12437a-bb9b-439a-adea-a67f886afb5c","csi.storage.k8s.io/pvc/name":"pvc-test-6d39d7312549439ebca74c5d740eee3","csi.storage.k8s.io/pvc/namespace":"namespace-test-d3420719e8854ae98ad6b9895","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-29dfbe0f-ef78-11eb-abdf-0a580a810222","journalPool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","pool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-0000000000000005-29dfbe0f-ef78-11eb-abdf-0a580a810222"}}
I0728 07:48:36.286766       1 connection.go:186] GRPC error: <nil>
I0728 07:48:36.286778       1 controller.go:763] create volume rep: {CapacityBytes:1073741824 VolumeId:0001-0011-openshift-storage-0000000000000005-29dfbe0f-ef78-11eb-abdf-0a580a810222 VolumeContext:map[clusterID:openshift-storage csi.storage.k8s.io/pv/name:pvc-8c12437a-bb9b-439a-adea-a67f886afb5c csi.storage.k8s.io/pvc/name:pvc-test-6d39d7312549439ebca74c5d740eee3 csi.storage.k8s.io/pvc/namespace:namespace-test-d3420719e8854ae98ad6b9895 imageFeatures:layering imageFormat:2 imageName:csi-vol-29dfbe0f-ef78-11eb-abdf-0a580a810222 journalPool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 pool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 thickProvision:true] ContentSource:<nil> AccessibleTopology:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0728 07:48:36.286806       1 controller.go:839] successfully created PV pvc-8c12437a-bb9b-439a-adea-a67f886afb5c for PVC pvc-test-6d39d7312549439ebca74c5d740eee3 and csi volume name 0001-0011-openshift-storage-0000000000000005-29dfbe0f-ef78-11eb-abdf-0a580a810222
...


I0728 07:48:36.288686       1 connection.go:185] GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-a67108b5-9b97-427d-8788-e4ca9cc81710","csi.storage.k8s.io/pvc/name":"pvc-test-3e3d109da7d644b9a55aec6ab56e7d3","csi.storage.k8s.io/pvc/namespace":"namespace-test-d3420719e8854ae98ad6b9895","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-28949613-ef78-11eb-abdf-0a580a810222","journalPool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","pool":"cbp-test-e9b7b90dc78348caa8eb279cffa6403","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-0000000000000005-28949613-ef78-11eb-abdf-0a580a810222"}}
I0728 07:48:36.288751       1 connection.go:186] GRPC error: <nil>
I0728 07:48:36.288761       1 controller.go:763] create volume rep: {CapacityBytes:1073741824 VolumeId:0001-0011-openshift-storage-0000000000000005-28949613-ef78-11eb-abdf-0a580a810222 VolumeContext:map[clusterID:openshift-storage csi.storage.k8s.io/pv/name:pvc-a67108b5-9b97-427d-8788-e4ca9cc81710 csi.storage.k8s.io/pvc/name:pvc-test-3e3d109da7d644b9a55aec6ab56e7d3 csi.storage.k8s.io/pvc/namespace:namespace-test-d3420719e8854ae98ad6b9895 imageFeatures:layering imageFormat:2 imageName:csi-vol-28949613-ef78-11eb-abdf-0a580a810222 journalPool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 pool:cbp-test-e9b7b90dc78348caa8eb279cffa6403 thickProvision:true] ContentSource:<nil> AccessibleTopology:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0728 07:48:36.288787       1 controller.go:839] successfully created PV pvc-a67108b5-9b97-427d-8788-e4ca9cc81710 for PVC pvc-test-3e3d109da7d644b9a55aec6ab56e7d3 and csi volume name 0001-0011-openshift-storage-0000000000000005-28949613-ef78-11eb-abdf-0a580a810222

...

The issue started to show up from this timestamp: I0728 08:35:06.615209

Comment 5 Humble Chirammal 2021-07-28 11:25:20 UTC
Manual creation of a new thick-provisioned image goes through in the pool in question:

bash-4.4$ rbd create -s 1G --image-format 2 --image-feature "deep-flatten,layering" --thick-provision cbp-test-b0366283ecf0454c9c4c920b2b37436/teamtestimage
Thick provisioning: 100% complete...done.
bash-4.4$ 
bash-4.4$ rbd ls cbp-test-b0366283ecf0454c9c4c920b2b37436              
teamtestimage
tmp
bash-4.4$ rbd info cbp-test-b0366283ecf0454c9c4c920b2b37436/teamtestimage
rbd image 'teamtestimage':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 39c79cf64abb
	block_name_prefix: rbd_data.39c79cf64abb
	format: 2
	features: layering, deep-flatten
	op_features: 
	flags: 
	create_timestamp: Wed Jul 28 11:22:22 2021
	access_timestamp: Wed Jul 28 11:22:22 2021
	modify_timestamp: Wed Jul 28 11:22:22 2021
bash-4.4$

Comment 6 Humble Chirammal 2021-07-29 08:13:20 UTC
Unfortunately the above cluster got destroyed, and jilju++ reproduced this issue again today. Here are the cluster details:

kubeconfig - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jul29/jijoy-jul29_20210729T060639/openshift-cluster-dir/auth/kubeconfig
Pending PVCs in namespace namespace-test-46d3052a065548338ede08b4d

Comment 7 Niels de Vos 2021-07-29 11:56:58 UTC
Pending PVCs:

$ oc -n namespace-test-46d3052a065548338ede08b4d get pvc 
NAME                                       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                               AGE
pvc-test-637b2a14a5b743039066b3467129773   Pending                                      storageclass-test-rbd-9a5e8645b188468598   111m
pvc-test-869e94f6f24440cca596621f3b51899   Pending                                      storageclass-test-rbd-9a5e8645b188468598   111m
pvc-test-e6019fe9e3ee4882b409c389c9577e2   Pending                                      storageclass-test-rbd-9a5e8645b188468598   111m


RBD Pool that is used:

$ oc get sc/storageclass-test-rbd-9a5e8645b188468598 -oyaml | grep -w pool
        f:pool: {}
  pool: cbp-test-4a64a7ef23734d318db07b2683dfedc


RBD images inside the pool:

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rbd ls cbp-test-4a64a7ef23734d318db07b2683dfedc
csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
csi-vol-f00f8881-f042-11eb-9707-0a580a800210
csi-vol-f1c13330-f042-11eb-9707-0a580a800210


Pick an image in the pool and find it in the csi-rbd-provisioner logs:

$ oc -n openshift-storage get leases/openshift-storage-rbd-csi-ceph-com
NAME                                 HOLDER                                       AGE
openshift-storage-rbd-csi-ceph-com   csi-rbdplugin-provisioner-6bd49cbf55-77brq   4h39m

$ oc -n openshift-storage logs -c csi-rbdplugin csi-rbdplugin-provisioner-6bd49cbf55-77brq | grep csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
I0729 07:59:44.146001       1 omap.go:148] ID: 88 Req-ID: pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.ef7cdde4-f042-11eb-9707-0a580a800210"): map[csi.imagename:csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210 csi.volname:pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf csi.volume.owner:namespace-test-46d3052a065548338ede08b4d])
I0729 07:59:44.146042       1 rbd_journal.go:472] ID: 88 Req-ID: pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf generated Volume ID (0001-0011-openshift-storage-000000000000000d-ef7cdde4-f042-11eb-9707-0a580a800210) and image name (csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210) for request name (pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf)
I0729 07:59:44.146129       1 rbd_util.go:239] ID: 88 Req-ID: pvc-6b521c92-4923-4f27-9164-53f35dc6a4bf rbd: create cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210 size 1024M (features: [layering]) using mon 172.30.204.155:6789,172.30.41.4:6789,172.30.2.40:6789


The image seems to have been created, but the logs do not contain a gRPC response for 'ID: 88'. The creation of the image is somehow stuck.

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rbd status cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
Watchers:
	watcher=10.128.2.16:0/2179620419 client.25634 cookie=140434827411680


There is a watcher with IP address 10.128.2.16; this could be the provisioner if it is still busy writing zeros to the image.

$ oc -n openshift-storage get pods -l app=csi-rbdplugin-provisioner -owide | grep 10.128.2.16
csi-rbdplugin-provisioner-6bd49cbf55-77brq   6/6     Running   2          4h46m   10.128.2.16   compute-0   <none>           <none>

csi-rbdplugin-provisioner-6bd49cbf55-77brq is the active provisioner (the one that holds the lease). This suggests the provisioning has not finished yet.
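
The same watcher check could be scripted with go-ceph (the library ceph-csi uses); a rough sketch, assuming an already-connected rados.IOContext for the pool and that ListWatchers is available in the go-ceph version in use:

package main

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// printWatchers lists the clients watching an RBD image, the same information
// `rbd status <pool>/<image>` prints above. Comparing the watcher addresses
// against pod IPs is how the active provisioner was identified here.
func printWatchers(ioctx *rados.IOContext, imageName string) error {
	img, err := rbd.OpenImage(ioctx, imageName, rbd.NoSnapshot)
	if err != nil {
		return err
	}
	defer img.Close()

	watchers, err := img.ListWatchers()
	if err != nil {
		return err
	}
	for _, w := range watchers {
		// e.g. "10.128.2.16:0/2179620419", matching the provisioner pod IP
		fmt.Printf("watcher=%s cookie=%d\n", w.Addr, w.Cookie)
	}
	return nil
}

func main() {} // connecting to the cluster is omitted in this sketch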


sh-4.4$ rbd info cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
rbd image 'csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 64221962d97c
	block_name_prefix: rbd_data.64221962d97c
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Thu Jul 29 08:00:02 2021
	access_timestamp: Thu Jul 29 08:00:02 2021
	modify_timestamp: Thu Jul 29 08:00:02 2021
sh-4.4$ rbd du cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210
warning: fast-diff map is not enabled for csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210. operation may be slow.
NAME                                          PROVISIONED  USED 
csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210        1 GiB  1 GiB


It looks like this image is completely allocated, so provisioning should be done.


Next steps:
- find out where the cephcsi executable is hanging (what tools are available in the running container?)
- is this a problem in the thick-provisioning loop, or is it stuck in a librbd call?

Comment 8 Niels de Vos 2021-07-29 14:59:34 UTC
Created attachment 1807494 [details]
golang profiling, call stacks

Enabled golang profiling in the provisioner deployment by adding --enableprofiling=true to the arguments of the csi-rbdplugin container.

$ oc -n openshift-storage edit deployment/csi-rbdplugin-provisioner
...
      - args:
        - --nodeid=$(NODE_ID)
        ...
        - --enableprofiling=true
...


The provisioner pods get restarted automatically.
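
For reference, --enableprofiling=true makes the provisioner serve the standard Go pprof endpoints over HTTP. A minimal sketch of how such an endpoint is typically exposed in a Go service (cephcsi's actual wiring may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With an endpoint like this, `oc port-forward <pod> 9090` followed by
	// `go tool pprof http://localhost:9090/debug/pprof/goroutine` (as used
	// below) can inspect the live goroutine stacks of the running process.
	log.Fatal(http.ListenAndServe("localhost:9090", nil))
}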

A create request comes in again, but hangs as well. The image "csi-vol-ef7cdde4-f042-11eb-9707-0a580a800210" already exists on the backend, and provisioning was as good as completed:

$ oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z
sh-4.4$ rbd image-meta ls cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-f00f8881-f042-11eb-9707-0a580a800210
There is 1 metadatum on this image:

Key                                 Value
rbd.csi.ceph.com/thick-provisioned  true 
sh-4.4$ rbd du cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-f00f8881-f042-11eb-9707-0a580a800210
warning: fast-diff map is not enabled for csi-vol-f00f8881-f042-11eb-9707-0a580a800210. operation may be slow.
NAME                                          PROVISIONED  USED 
csi-vol-f00f8881-f042-11eb-9707-0a580a800210        1 GiB  1 GiB
sh-4.4$ exit


Kubernetes never got the acknowledgement that the volume was created, so a retry is issued:

$ oc -n openshift-storage logs -c csi-rbdplugin csi-rbdplugin-provisioner-65b745d84f-5w6zf
...
I0729 13:16:56.279978       1 utils.go:171] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad GRPC call: /csi.v1.Controller/CreateVolume
I0729 13:16:56.280152       1 utils.go:175] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad","csi.storage.k8s.io/pvc/name":"pvc-test-869e94f6f24440cca596621f3b51899","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0729 13:16:56.280325       1 rbd_util.go:1067] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad setting disableInUseChecks: false image features: [layering] mounter: rbd
I0729 13:16:56.305685       1 omap.go:84] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad:f00f8881-f042-11eb-9707-0a580a800210]
I0729 13:16:56.306785       1 omap.go:84] ID: 19 Req-ID: pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.f00f8881-f042-11eb-9707-0a580a800210"): map[csi.imagename:csi-vol-f00f8881-f042-11eb-9707-0a580a800210 csi.volname:pvc-1286982c-cf9b-4669-b12a-6ee77e0d55ad csi.volume.owner:namespace-test-46d3052a065548338ede08b4d]


gRPC "ID: 19" is the re-issued CreateVolume call that would create csi.imagename:csi-vol-f00f8881-f042-11eb-9707-0a580a800210. The image already exists, and is fully provisioned, even the rbd.csi.ceph.com/thick-provisioned metadata is set (only done after allocating the image).
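
What the retried CreateVolume does here, roughly: it looks up the request name in the csi.volumes.default omap, finds the reserved image UUID, and then inspects the existing image. A sketch of that lookup with go-ceph, using the object and key names visible in the log lines above (the real ceph-csi journal code is more involved):

package main

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

func checkExistingReservation(ioctx *rados.IOContext, reqName string) error {
	// csi.volumes.default maps "csi.volume.<request name>" -> image UUID,
	// as seen in the "got omap values" log lines.
	key := "csi.volume." + reqName
	vals, err := ioctx.GetOmapValues("csi.volumes.default", "", key, 1)
	if err != nil {
		return err
	}
	uuid, ok := vals[key]
	if !ok {
		return fmt.Errorf("no reservation for %s, a new image has to be created", reqName)
	}

	// The image name is derived from the reserved UUID ("csi-vol-<uuid>").
	img, err := rbd.OpenImage(ioctx, "csi-vol-"+string(uuid), rbd.NoSnapshot)
	if err != nil {
		return err
	}
	defer img.Close()

	// For thick PVCs this metadata key is only set once zero-filling finished,
	// which is what `rbd image-meta ls` confirmed above.
	if v, err := img.GetMetadata("rbd.csi.ceph.com/thick-provisioned"); err == nil && v == "true" {
		fmt.Println("image exists and is fully provisioned; CreateVolume can return success")
	}
	return nil
}

func main() {} // connecting to the cluster is omitted in this sketch

In the goroutine traces captured later, it is exactly this kind of Exists()/open() path that never returns from librbd.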


Connecting the Golang profiling tools to the provisioner:

$ oc -n openshift-storage port-forward pod/csi-rbdplugin-provisioner-65b745d84f-5w6zf 9090


And in another terminal on the same system:

$ go tool pprof http://localhost:9090/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:9090/debug/pprof/goroutine
Saved profile in /home/ndevos/pprof/pprof.cephcsi.goroutine.002.pb.gz
File: cephcsi
Build ID: 94169ca92a09fc707a9a11c31ffe1be03d5470cc
Type: goroutine
Time: Jul 29, 2021 at 3:40pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 27, 100% of 27 total
Showing top 10 nodes out of 83
      flat  flat%   sum%        cum   cum%
        20 74.07% 74.07%         20 74.07%  runtime.gopark
         6 22.22% 96.30%          6 22.22%  runtime.cgocall
         1  3.70%   100%          1  3.70%  runtime/pprof.runtime_goroutineProfileWithLabels
         0     0%   100%          5 18.52%  bufio.(*Reader).Read
         0     0%   100%          1  3.70%  github.com/ceph/ceph-csi/internal/csi-common.(*nonBlockingGRPCServer).Wait
         0     0%   100%          1  3.70%  github.com/ceph/ceph-csi/internal/csi-common.(*nonBlockingGRPCServer).serve
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/csi-common.contextIDInjector
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/csi-common.logGRPC
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/csi-common.panicHandler
         0     0%   100%          6 22.22%  github.com/ceph/ceph-csi/internal/rbd.(*ControllerServer).CreateVolume
(pprof) web 

The last 'web' command was used to create the attached .svg file.

Golang profiling shows that there are several go-routines executing rbd.rbdImage.open() and rbd.CreateImage(). These calls end up in runtime.cgocall as that takes care of the native librbd functions.
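
Why this matters: a goroutine that has entered a C function via cgo is reported in "syscall" state and cannot be preempted, cancelled, or given a deadline by the Go runtime; it only comes back when the C code returns. A tiny stand-alone example (nothing Ceph-specific) that produces the same picture in a goroutine dump:

package main

/*
#include <unistd.h>
static void block_forever(void) { for (;;) pause(); }
*/
import "C"

import "time"

func main() {
	// This goroutine enters C and never returns, analogous to the provisioner
	// goroutines stuck in librbd's rbd_open()/rbd_create4().
	go func() { C.block_forever() }()

	// A goroutine dump (SIGQUIT, or /debug/pprof/goroutine?debug=2 as in the
	// next comment) reports it as "goroutine N [syscall, M minutes]", with
	// runtime.cgocall at the bottom of the stack.
	time.Sleep(time.Hour)
}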

Comment 9 Niels de Vos 2021-07-30 10:25:40 UTC
Created attachment 1809335 [details]
goroutines call traces in text form

As done in comment #8, profiling was (still) enabled, and the text-formatted goroutine stack traces have been captured.


Enable port forwarding:

$ oc -n openshift-storage port-forward pod/csi-rbdplugin-provisioner-65b745d84f-tntxq 9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090


Download the stacktraces:

$ curl 'http://localhost:9090/debug/pprof/goroutine?debug=2' > cephcsi.goroutines.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50564    0 50564    0     0  55321      0 --:--:-- --:--:-- --:--:-- 55261


This shows that after the most recent provisioner restart, the call to librbd/rbd_open() is stuck for 49 minutes:

goroutine 93 [syscall, 49 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_open(0x7f48f4000ba0, 0x7f48f4000b60, 0xc00041a310, 0x0, 0x0)
        _cgo_gotypes.go:1711 +0x48
github.com/ceph/go-ceph/rbd.OpenImage.func3(0xc00041a2f8, 0x7f48f4000b60, 0xc00041a310, 0x0, 0xc00041a2f8)
        /remote-source/app/vendor/github.com/ceph/go-ceph/rbd/rbd.go:1043 +0xb3
github.com/ceph/go-ceph/rbd.OpenImage(0xc00041a2f8, 0xc000a066f0, 0x2c, 0x0, 0x0, 0x0, 0x0, 0x0)
        /remote-source/app/vendor/github.com/ceph/go-ceph/rbd/rbd.go:1043 +0x105
github.com/ceph/ceph-csi/internal/rbd.(*rbdImage).open(0xc000a3c000, 0x19462df, 0x1d40fa0, 0xc0003f4c00)
        /remote-source/app/internal/rbd/rbd_util.go:340 +0x79
github.com/ceph/ceph-csi/internal/rbd.(*rbdVolume).getImageInfo(0xc000a3c000, 0x0, 0x0)
        /remote-source/app/internal/rbd/rbd_util.go:1258 +0x46
github.com/ceph/ceph-csi/internal/rbd.(*rbdVolume).Exists(0xc000a3c000, 0x21d8278, 0xc0003f46c0, 0x0, 0x0, 0x0, 0x0)
        /remote-source/app/internal/rbd/rbd_journal.go:287 +0x30e
github.com/ceph/ceph-csi/internal/rbd.(*ControllerServer).CreateVolume(0xc0003a00e0, 0x21d8278, 0xc0003f46c0, 0xc00034a700, 0x0, 0x0, 0x0)
        /remote-source/app/internal/rbd/controllerserver.go:299 +0x626
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0x0, 0x0, 0x1, 0xc000a00c10)
        /remote-source/app/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5676 +0x89
github.com/ceph/ceph-csi/internal/csi-common.panicHandler(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc000a12168, 0x0, 0x0, 0x0, 0x0)
        /remote-source/app/internal/csi-common/utils.go:199 +0x96
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0xc0007f5980, 0x1, 0x1, 0x30)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/ceph/ceph-csi/internal/csi-common.logGRPC(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc0001320e0, 0x1d75bc0, 0xc0003f46c0, 0x1b69ba0, 0xc000a00720)
        /remote-source/app/internal/csi-common/utils.go:178 +0x206
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x21d8278, 0xc0003f46c0, 0x1dda8c0, 0xc00034a700, 0x1b70320, 0xc000a00730, 0x21d8278, 0xc0003f46c0)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/ceph/ceph-csi/internal/csi-common.contextIDInjector(0x21d8278, 0xc0003f4690, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc000132100, 0x18b517a, 0x1d28aa0, 0xc000132120, 0xc0001320c0)
        /remote-source/app/internal/csi-common/utils.go:163 +0x131
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x21d8278, 0xc0003f45a0, 0x1dda8c0, 0xc00034a700, 0xc0003ac400, 0x0, 0xc0007f5b30, 0x41b258)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x21d8278, 0xc0003f45a0, 0x1dda8c0, 0xc00034a700, 0xc0001320c0, 0xc000a12168, 0xc000858ba0, 0x520246, 0x1d75bc0, 0xc0003f45a0)
        /remote-source/app/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xd7
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler(0x1e22000, 0xc0003a00e0, 0x21d8278, 0xc0003f45a0, 0xc0000af6e0, 0xc00035bb60, 0x21d8278, 0xc0003f45a0, 0xc000a1a400, 0x1f6)
        /remote-source/app/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5678 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000141dc0, 0x21ec318, 0xc000385e00, 0xc000a2c000, 0xc00035bd10, 0x2fff6a0, 0x0, 0x0, 0x0)
        /remote-source/app/vendor/google.golang.org/grpc/server.go:1217 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000141dc0, 0x21ec318, 0xc000385e00, 0xc000a2c000, 0x0)
        /remote-source/app/vendor/google.golang.org/grpc/server.go:1540 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00043a3e0, 0xc000141dc0, 0x21ec318, 0xc000385e00, 0xc000a2c000)
        /remote-source/app/vendor/google.golang.org/grpc/server.go:878 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /remote-source/app/vendor/google.golang.org/grpc/server.go:876 +0x1fd


Other goroutines are stuck in a similar fashion too.

Comment 10 Niels de Vos 2021-07-30 12:02:51 UTC
Hi Ilya,

do you have any suggestions for adding additional logs from an RBD perspective? It seems ceph-csi is stuck in librbd:rbd_open(), and since things run inside a container, capturing a coredump or C-level stack traces is difficult. It would help if there is a ceph command-line method that can enable debug logging in the MONs or OSDs, or even on the client side (without having to run inside the client container), that can point to the cause of these hangs.

Thanks!

Comment 11 Niels de Vos 2021-07-30 13:03:08 UTC
Reverting to an older release of ceph-csi (upstream v3.3.0 with librbd1-15.2.10-0.el8.x86_64) makes the provisioning succeed...

The original deployed image uses a newer version of Ceph (librbd1-16.2.0-81.el8cp.x86_64).
Ceph-CSI v3.4.0 hangs in the same way.


Next step:

Going to build upstream Ceph-CSI devel branch into a Ceph v15 container image.

Comment 12 Niels de Vos 2021-07-30 13:34:01 UTC
(In reply to Niels de Vos from comment #11)
> Next step:
> 
> Going to build upstream Ceph-CSI devel branch into a Ceph v15 container
> image.


Ceph-CSI devel (close to v3.4.0) image based on latest ceph/ceph:v15: quay.io/nixpanic/ceph-csi:bz1986794

Using this image, the RBD image creation succeeds immediately:

I0730 13:20:37.528631       1 utils.go:176] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 GRPC call: /csi.v1.Controller/CreateVolume
I0730 13:20:37.529198       1 utils.go:180] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0730 13:20:37.529624       1 rbd_util.go:1109] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0730 13:20:37.554109       1 omap.go:86] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[]
I0730 13:20:37.564327       1 omap.go:154] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004:edd4d0a7-f138-11eb-9966-0a580a8103ef])
I0730 13:20:37.567945       1 omap.go:154] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.edd4d0a7-f138-11eb-9966-0a580a8103ef"): map[csi.imagename:csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef csi.volname:pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 csi.volume.owner:namespace-test-46d3052a065548338ede08b4d])
I0730 13:20:37.567991       1 rbd_journal.go:483] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 generated Volume ID (0001-0011-openshift-storage-000000000000000d-edd4d0a7-f138-11eb-9966-0a580a8103ef) and image name (csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef) for request name (pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004)
I0730 13:20:37.568117       1 rbd_util.go:242] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 rbd: create cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef size 1024M (features: [layering]) using mon 172.30.204.155:6789,172.30.41.4:6789,172.30.2.40:6789
I0730 13:20:56.742435       1 controllerserver.go:662] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 created volume pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 backed by image csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef
I0730 13:20:56.776063       1 omap.go:154] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.edd4d0a7-f138-11eb-9966-0a580a8103ef"): map[csi.imageid:18ef2e3989d66])
I0730 13:20:56.776723       1 utils.go:187] ID: 18 Req-ID: pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004 GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-9c7e0536-b79f-43cc-95c1-2a23d87b1004","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-edd4d0a7-f138-11eb-9966-0a580a8103ef","journalPool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-000000000000000d-edd4d0a7-f138-11eb-9966-0a580a8103ef"}}


Reverted back to the original image used for the deployment (quay.io/rhceph-dev/cephcsi@sha256:4cf1e012336d55a0f240e1206181984accd93c03b9a212431e21c25a9a2b3afb), and things still work?! Even tried with multiple parallel PVCs, not able to reproduce again.

I0730 13:24:18.403671       1 utils.go:171] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 GRPC call: /csi.v1.Controller/CreateVolume
I0730 13:24:18.403973       1 utils.go:175] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0730 13:24:18.404224       1 rbd_util.go:1067] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0730 13:24:18.423590       1 omap.go:84] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 got omap values: (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[]
I0730 13:24:18.431642       1 omap.go:148] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05:717ad462-f139-11eb-afe7-0a580a8003b9])
I0730 13:24:18.434827       1 omap.go:148] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.717ad462-f139-11eb-afe7-0a580a8003b9"): map[csi.imagename:csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9 csi.volname:pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 csi.volume.owner:namespace-test-46d3052a065548338ede08b4d])
I0730 13:24:18.434876       1 rbd_journal.go:472] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 generated Volume ID (0001-0011-openshift-storage-000000000000000d-717ad462-f139-11eb-afe7-0a580a8003b9) and image name (csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9) for request name (pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05)
I0730 13:24:18.434987       1 rbd_util.go:239] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 rbd: create cbp-test-4a64a7ef23734d318db07b2683dfedc/csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9 size 1024M (features: [layering]) using mon 172.30.204.155:6789,172.30.41.4:6789,172.30.2.40:6789
I0730 13:24:38.160759       1 utils.go:171] ID: 18 GRPC call: /csi.v1.Identity/Probe
I0730 13:24:38.161007       1 utils.go:175] ID: 18 GRPC request: {}
I0730 13:24:38.161068       1 utils.go:182] ID: 18 GRPC response: {}
I0730 13:24:39.179882       1 controllerserver.go:636] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 created volume pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 backed by image csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9
I0730 13:24:39.216951       1 omap.go:148] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 set omap keys (pool="cbp-test-4a64a7ef23734d318db07b2683dfedc", namespace="", name="csi.volume.717ad462-f139-11eb-afe7-0a580a8003b9"): map[csi.imageid:aa81a9046ff9])
I0730 13:24:39.217381       1 utils.go:182] ID: 17 Req-ID: pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05 GRPC response: {"volume":{"capacity_bytes":1073741824,"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-b3c4fe92-28db-4169-861b-3a6aeb6adf05","csi.storage.k8s.io/pvc/name":"ndevos-pvc-test","csi.storage.k8s.io/pvc/namespace":"namespace-test-46d3052a065548338ede08b4d","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-717ad462-f139-11eb-afe7-0a580a8003b9","journalPool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","pool":"cbp-test-4a64a7ef23734d318db07b2683dfedc","thickProvision":"true"},"volume_id":"0001-0011-openshift-storage-000000000000000d-717ad462-f139-11eb-afe7-0a580a8003b9"}}


Next step:

Identify exact procedure for reproducing this manually.  Jilju, can you help with this?

Comment 13 Jilju Joy 2021-07-31 08:04:30 UTC
(In reply to Niels de Vos from comment #12)
> (In reply to Niels de Vos from comment #11)
> Next step:
> 
> Identify exact procedure for reproducing this manually.  Jilju, can you help
> with this?

Copying the relevant logs from the ocs-ci test case tests/manage/pv_services/test_rbd_thick_provisioning.py
The test case creates a Secret, a Ceph block pool, and a storage class with thickProvision: 'true', then creates 3 PVCs.
The YAMLs are given below along with the logs.


13:19:04 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding Secret with name secret-test-rbd-a269c039d62c433b907fed3a
13:19:04 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: Secret
metadata:
  name: secret-test-rbd-a269c039d62c433b907fed3a
  namespace: openshift-storage
stringData:
  userID: admin
  userKey: '*****'

13:19:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage create -f /tmp/Secretqv2__don -o yaml



13:19:14 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding CephBlockPool with name cbp-test-dd4a8e779c7a405aad42612820ac93c
13:19:14 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: cbp-test-dd4a8e779c7a405aad42612820ac93c
  namespace: openshift-storage
spec:
  annotations: null
  failureDomain: rack
  parameters:
    compression_mode: none
  replicated:
    size: 3

13:19:14 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage create -f /tmp/CephBlockPooln6dxeqqy -o yaml


13:19:23 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-784566878-vxv6z ceph osd lspools --format json-pretty
13:19:26 - MainThread - ocs_ci.helpers.helpers - INFO - POOLS are [{'poolnum': 1, 'poolname': 'ocs-storagecluster-cephblockpool'}, {'poolnum': 2, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.meta'}, {'poolnum': 3, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.control'}, {'poolnum': 4, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.log'}, {'poolnum': 5, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.buckets.index'}, {'poolnum': 6, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec'}, {'poolnum': 7, 'poolname': '.rgw.root'}, {'poolnum': 8, 'poolname': 'ocs-storagecluster-cephfilesystem-metadata'}, {'poolnum': 9, 'poolname': 'ocs-storagecluster-cephfilesystem-data0'}, {'poolnum': 10, 'poolname': 'ocs-storagecluster-cephobjectstore.rgw.buckets.data'}, {'poolnum': 11, 'poolname': 'device_health_metrics'}, {'poolnum': 13, 'poolname': 'cbp-test-4a64a7ef23734d318db07b2683dfedc'}, {'poolnum': 14, 'poolname': 'cbp-test-5ce92edcf0f74570a564ba6e1395472'}, {'poolnum': 16, 'poolname': 'cbp-test-dd4a8e779c7a405aad42612820ac93c'}]
13:19:26 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding StorageClass with name storageclass-test-rbd-20a57aa726de4d2d9a
13:19:26 - MainThread - ocs_ci.utility.templating - INFO - allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: storageclass-test-rbd-20a57aa726de4d2d9a
  namespace: openshift-storage
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: '*****'
  csi.storage.k8s.io/controller-expand-secret-namespace: '*****'
  csi.storage.k8s.io/node-stage-secret-name: '*****'
  csi.storage.k8s.io/node-stage-secret-namespace: '*****'
  csi.storage.k8s.io/provisioner-secret-name: '*****'
  csi.storage.k8s.io/provisioner-secret-namespace: '*****'
  imageFeatures: layering
  imageFormat: '2'
  pool: cbp-test-dd4a8e779c7a405aad42612820ac93c
  thickProvision: 'true'
provisioner: openshift-storage.rbd.csi.ceph.com
reclaimPolicy: Delete

13:19:26 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage create -f /tmp/StorageClassd8kamsxc -o yaml
13:19:27 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get StorageClass storageclass-test-rbd-20a57aa726de4d2d9a -n openshift-storage -o yaml


13:19:30 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name pvc-test-acdb94965ac44443983a799c9c7124f
13:19:30 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-acdb94965ac44443983a799c9c7124f
  namespace: namespace-test-7eefd165af0e41bb8f03f0e99
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: storageclass-test-rbd-20a57aa726de4d2d9a

13:19:30 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7eefd165af0e41bb8f03f0e99 create -f /tmp/PersistentVolumeClaimiiimwi0x -o yaml
13:19:31 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name pvc-test-b5537e46e5534d8584c94b9d4320588
13:19:31 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-b5537e46e5534d8584c94b9d4320588
  namespace: namespace-test-7eefd165af0e41bb8f03f0e99
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: storageclass-test-rbd-20a57aa726de4d2d9a
  volumeMode: Block

13:19:31 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7eefd165af0e41bb8f03f0e99 create -f /tmp/PersistentVolumeClaimvqes2_ww -o yaml
13:19:32 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name pvc-test-11ef6cc72cd849008dfa4bbe8aae8af
13:19:32 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-11ef6cc72cd849008dfa4bbe8aae8af
  namespace: namespace-test-7eefd165af0e41bb8f03f0e99
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: storageclass-test-rbd-20a57aa726de4d2d9a
  volumeMode: Block

13:19:32 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7eefd165af0e41bb8f03f0e99 create -f /tmp/PersistentVolumeClaimabdko5sq -o yaml



Hi Niels,

Executed the above test case on the same cluster shared for debugging. The PVCs given below did not reach Bound state.
$ oc get pvc -n namespace-test-7eefd165af0e41bb8f03f0e99
NAME                                       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                               AGE
pvc-test-11ef6cc72cd849008dfa4bbe8aae8af   Pending                                      storageclass-test-rbd-20a57aa726de4d2d9a   14m
pvc-test-acdb94965ac44443983a799c9c7124f   Pending                                      storageclass-test-rbd-20a57aa726de4d2d9a   14m
pvc-test-b5537e46e5534d8584c94b9d4320588   Pending                                      storageclass-test-rbd-20a57aa726de4d2d9a   14m

Comment 15 Niels de Vos 2021-08-02 15:07:14 UTC
Thanks Jilju!

The latest state of the reproducer is:

Fetch the goroutine call traces as in comment #9:

$ curl 'http://localhost:9090/debug/pprof/goroutine?debug=2' > csi-rbdplugin-provisioner-65b745d84f-hr45x.goroutines.txt


All three PVC creations are stuck in calling librbd:rbd_create4():

$ grep -B1 github.com/ceph/go-ceph/rbd._Cfunc csi-rbdplugin-provisioner-65b745d84f-hr45x.goroutines.txt
goroutine 1401 [syscall, 3218 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_create4(0x7fcc18000ba0, 0x7fcc18002840, 0x40000000, 0x7fcc18001040, 0x0)
--
goroutine 1400 [syscall, 3218 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_create4(0x3925af0, 0x391c280, 0x40000000, 0x391c320, 0x0)
--
goroutine 1445 [syscall, 3218 minutes]:
github.com/ceph/go-ceph/rbd._Cfunc_rbd_create4(0x7fcbc0000da0, 0x7fcbc0002aa0, 0x40000000, 0x7fcbc0002b40, 0x0)


It seems that this is not restricted to thick-provisioning. Creating the image is needed for normal (thin) provisioning too. Thick-provisioning only comes into play after the image has been created; it will get opened and filled with zeros.
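
For reference, that allocation step is essentially "open the image, then write zeros across its whole size". A rough go-ceph sketch of it (the real ceph-csi code differs in details such as chunk handling; shown only to illustrate where the time goes):

package main

import (
	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// allocateFull opens an image and writes zeros over its entire size, which is
// roughly what the thick-provisioning step does after the image is created.
func allocateFull(ioctx *rados.IOContext, imageName string) error {
	img, err := rbd.OpenImage(ioctx, imageName, rbd.NoSnapshot)
	if err != nil {
		return err
	}
	defer img.Close()

	size, err := img.GetSize()
	if err != nil {
		return err
	}

	const chunk = 4 * 1024 * 1024 // one 4 MiB object at a time
	zeros := make([]byte, chunk)
	for off := uint64(0); off < size; off += chunk {
		buf := zeros
		if off+chunk > size {
			buf = zeros[:size-off]
		}
		if _, err := img.WriteAt(buf, int64(off)); err != nil {
			return err
		}
	}
	// ceph-csi then records the "rbd.csi.ceph.com/thick-provisioned" metadata
	// key seen earlier, marking the allocation as complete.
	return img.Flush()
}

func main() {} // connecting to the cluster is omitted in this sketch

As the traces show, though, the hang is already in rbd_create4(), before any zero-filling starts, so thin PVCs against the same pool would likely be affected as well.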

Still trying to find out the best way to capture more details.

Comment 32 Michael Adam 2021-09-07 15:49:29 UTC
If this is an OCS tracker for a Ceph BZ and has no fix in ceph-csi, then it should be on the ceph component and not ceph-csi.

Comment 34 Mudit Agarwal 2021-09-20 07:52:29 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2000434 is ON_QA, which means we have a fix in Ceph.

This can be moved to ON_QA once we have an OCS build with the Ceph fix.

Comment 37 Rakshith 2021-09-23 06:32:52 UTC
*** Bug 2006322 has been marked as a duplicate of this bug. ***

Comment 41 Yug Gupta 2021-09-29 09:37:10 UTC
*** Bug 2007442 has been marked as a duplicate of this bug. ***

Comment 43 Jilju Joy 2021-10-14 17:23:41 UTC
Executed the test case tests/manage/pv_services/test_rbd_thick_provisioning.py::TestRbdThickProvisioning::test_rbd_thick_provisioning 20 times. No issues observed.
Manually created RBD thick storage class and PVCs of different size. All PVCs reached Bound state.

Verified in version:
ODF full version 4.9.0-189.ci
OCP 4.9.0-0.nightly-2021-10-13-170616

Comment 45 errata-xmlrpc 2021-12-13 17:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

