Bug 2007442

Summary: [IBM Z] PVC in pending state due to missing provisioner
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Abdul Kandathil (IBM) <akandath>
Component: csi-driver
Assignee: Yug Gupta <ygupta>
Status: CLOSED DUPLICATE
QA Contact: Elad <ebenahar>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.9
CC: madam, ocs-bugs, odf-bz-bot, ygupta
Target Milestone: ---
Target Release: ---
Hardware: s390x
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-29 09:37:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
test logs none

Description Abdul Kandathil (IBM) 2021-09-23 21:25:15 UTC
Created attachment 1825761 [details]
test logs

Description of problem (please be as detailed as possible and provide log
snippets):
The following ocs-ci tier2 tests fail because the provisioner "openshift-storage.rbd.csi.ceph.com" appears to be missing.

tests: 

tests/manage/storageclass/test_create_multiple_sc_with_same_pool_name.py::TestCreateMultipleScWithSamePoolName::test_create_multiple_sc_with_same_pool_name[CephBlockPool]

tests/manage/storageclass/test_create_sc_reclaim_policy_rep2_comp.py::TestScReclaimPolicyRetainRep2Comp::test_sc_reclaim_policy_retain_rep2_comp

Error:

E           ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource pvc-test-4a62eff8cc9249b19ad6677b09ae68e describe output: Name:          pvc-test-4a62eff8cc9249b19ad6677b09ae68e
E           Namespace:     namespace-test-576e92034cc14d85a164dfd5a
E           StorageClass:  storageclass-test-rbd-d1e8e64b417048149b
E           Status:        Pending
E           Volume:
E           Labels:        <none>
E           Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
E           Finalizers:    [kubernetes.io/pvc-protection]
E           Capacity:
E           Access Modes:
E           VolumeMode:    Filesystem
E           Used By:       <none>
E           Events:
E             Type    Reason                Age                From                                                                                                               Message
E             ----    ------                ----               ----                                                                                                               -------
E             Normal  Provisioning          62s                openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-c649865cb-h7bcc_8962ebeb-4840-4e3a-879e-3d280449cae0  External provisioner is provisioning volume for claim "namespace-test-576e92034cc14d85a164dfd5a/pvc-test-4a62eff8cc9249b19ad6677b09ae68e"
E             Normal  ExternalProvisioning  15s (x5 over 62s)  persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
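
For triage, the describe output above can be checked programmatically. The following is a minimal sketch, assuming the input is plain `oc describe pvc` text without the pytest "E" prefix; the helper names `parse_describe` and `pending_on_provisioner` are hypothetical and are not part of ocs-ci.

```python
def parse_describe(text):
    """Collect the top-level 'Key: value' fields that precede the Events table."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Events:"):
            break  # the event table is not key/value data
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def pending_on_provisioner(text):
    """Return the provisioner a Pending PVC is waiting on, or None otherwise."""
    fields = parse_describe(text)
    if fields.get("Status") != "Pending":
        return None
    # The annotation value has the form '<annotation key>: <provisioner name>'.
    _, _, provisioner = fields.get("Annotations", "").partition("storage-provisioner:")
    return provisioner.strip() or None


# Excerpt of the describe output quoted in this report.
SAMPLE = """\
Name:          pvc-test-4a62eff8cc9249b19ad6677b09ae68e
Namespace:     namespace-test-576e92034cc14d85a164dfd5a
StorageClass:  storageclass-test-rbd-d1e8e64b417048149b
Status:        Pending
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Events:
"""

if __name__ == "__main__":
    print(pending_on_provisioner(SAMPLE))
```

A non-None result only shows which provisioner the claim is waiting on; it does not by itself show whether that provisioner is absent or merely unresponsive.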


Version of all relevant components (if applicable): OCS 4.9 (tested on 4.9.0-156.ci)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Provisioning fails for any PVC that uses this provisioner.

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an OCP cluster.
2. Deploy ODF along with LSO.
3. Run the ocs-ci tests listed above.


Actual results:
The PVCs stay in Pending state.


Expected results:
The PVCs are provisioned successfully and the tests pass.

Additional info:

Comment 2 Yug Gupta 2021-09-24 02:37:06 UTC
Hey Abdul,

Can you also attach the must-gather for the same? Logs will definitely help to get a deeper understanding of the issue.

Thanks,
Yug

Comment 3 Abdul Kandathil (IBM) 2021-09-24 08:07:50 UTC
The must-gather logs are on Google Drive: https://drive.google.com/file/d/1TYPa3cXPGIU1VB1ymBQvIoBprzsECCeg/view?usp=sharing

Comment 5 Yug Gupta 2021-09-27 03:25:48 UTC
Hey Abdul,

From the logs, it looks like PVCs were created in parallel, which caused the rbd command to hang, so the CreateVolume call never returned a response.
Because the first CreateVolume call never returned, all subsequent calls returned "operation already exists".
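
The behaviour described above can be modelled with a small sketch. This is a simplified illustration under stated assumptions, not the ceph-csi implementation: the `VolumeLocks` class, the `create_volume` handler, and the `demo` function are hypothetical names, and the hung rbd command is simulated with a `threading.Event` that stays unset.

```python
import threading


class VolumeLocks:
    """Tracks request names that currently have an operation in flight."""

    def __init__(self):
        self._guard = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, name):
        with self._guard:
            if name in self._in_flight:
                return False
            self._in_flight.add(name)
            return True

    def release(self, name):
        with self._guard:
            self._in_flight.discard(name)


def create_volume(locks, name, backend_call):
    """Hypothetical CreateVolume handler: at most one operation per name."""
    if not locks.try_acquire(name):
        return "operation already exists"
    try:
        backend_call()  # stands in for the rbd command
        return "OK"
    finally:
        locks.release(name)


def demo():
    locks = VolumeLocks()
    hung = threading.Event()     # stays unset while the backend is "hung"
    started = threading.Event()  # signals that the first call holds the lock

    def stuck_backend():
        started.set()
        hung.wait()

    first = threading.Thread(
        target=create_volume, args=(locks, "pvc-a", stuck_backend), daemon=True
    )
    first.start()
    started.wait(timeout=5)

    # A retry for the same name is rejected while the first call is stuck.
    retry = create_volume(locks, "pvc-a", lambda: None)

    # Once the backend returns, the lock is released and a retry succeeds.
    hung.set()
    first.join(timeout=5)
    after = create_volume(locks, "pvc-a", lambda: None)
    return retry, after


if __name__ == "__main__":
    print(demo())
```

In this model the rejection lasts exactly as long as the first call is stuck, which matches the observation that retries kept failing while the rbd command did not return.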

It is a known librbd issue that is tracked here: https://tracker.ceph.com/issues/52537
PR to fix the same: https://github.com/ceph/ceph/pull/43113
Tracker issue in ceph-csi: https://github.com/ceph/ceph-csi/issues/2521

As a workaround, you can either:

1. Roll back to the Ceph Octopus release (the issue is harder to hit there)
2. See https://github.com/ceph/ceph-csi/issues/2521#issuecomment-924638203

Regards,
Yug Gupta

Comment 6 Yug Gupta 2021-09-29 09:37:10 UTC
We have a similar BZ open for the same issue, and it is already ON_QA. Closing this one as a duplicate. Please feel free to reopen it if you find otherwise.

*** This bug has been marked as a duplicate of bug 1986794 ***