Bug 1965016
| Summary: | [Tracker for BZ #1969301] [RBD][Thick] If storage cluster utilization reaches 85% while a thick PVC is being provisioned, deleting the pending PVC does not stop the provisioning operation and the image is retained | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Jilju Joy <jijoy> |
| Component: | ceph | Assignee: | Madhu Rajanna <mrajanna> |
| Status: | CLOSED WONTFIX | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | assingh, bniver, kramdoss, madam, mhackett, mrajanna, muagarwa, ndevos, ocs-bugs, odf-bz-bot, owasserm, rcyriac, tdesala |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.9.0-182.ci | Doc Type: | Bug Fix |
| Doc Text: | Provisioning attempts did not stop once the storage cluster utilization reached 85%, even after the PVC was deleted. If the storage cluster utilization reaches 85% while an RBD thick PVC is being provisioned, deleting the pending PVC does not automatically stop the provisioning attempt, and the RBD image is not deleted even after the pending PVC is removed. The best approach is not to start provisioning if the requested size exceeds the available storage. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1969301 (view as bug list) | Environment: | |
| Last Closed: | 2021-10-20 11:24:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1969301 | | |
| Bug Blocks: | 1966894 | | |
Description
Jilju Joy
2021-05-26 14:37:04 UTC
We got a VolumeCreate request at 11:55:

>> 2021-05-26T11:55:30.029505495Z I0526 11:55:30.029431 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"rbdthick120", UID:"187bb8db-8f9b-4cfe-bdb1-797cdbe00d02", APIVersion:"v1", ResourceVersion:"1701250", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/rbdthick120"
>> 2021-05-26T11:55:30.035659473Z I0526 11:55:30.034289 1 controller.go:731] CreateVolumeRequest name:"pvc-187bb8db-8f9b-4cfe-bdb1-797cdbe00d02" capacity_range:<required_bytes:128849018880 > volume_capabilities:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > parameters:<key:"clusterID" value:"openshift-storage" > parameters:<key:"csi.storage.k8s.io/pv/name" value:"pvc-187bb8db-8f9b-4cfe-bdb1-797cdbe00d02" > parameters:<key:"csi.storage.k8s.io/pvc/name" value:"rbdthick120" > parameters:<key:"csi.storage.k8s.io/pvc/namespace" value:"default" > parameters:<key:"imageFeatures" value:"layering" > parameters:<key:"imageFormat" value:"2" > parameters:<key:"pool" value:"ocs-storagecluster-cephblockpool" > parameters:<key:"thickProvision" value:"true" > secrets:<key:"userID" value:"csi-rbd-provisioner" > secrets:<key:"userKey" value:"AQBEfKtguieTDBAABY/VGkM5TB+JAJvQ6Er1KQ==" >
>> 2021-05-26T11:55:30.035659473Z I0526 11:55:30.034969 1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
>> 2021-05-26T11:55:30.035659473Z I0526 11:55:30.034978 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":128849018880},"name":"pvc-187bb8db-8f9b-4cfe-bdb1-797cdbe00d02","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-187bb8db-8f9b-4cfe-bdb1-797cdbe00d02","csi.storage.k8s.io/pvc/name":"rbdthick120","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","imageFormat":"2","pool":"ocs-

This request was not completed in the next 3 minutes, maybe because we did not have space, and at 11:58 we hit a DeadlineExceeded error:

>> 2021-05-26T11:58:00.036327187Z I0526 11:58:00.034855 1 connection.go:186] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
>> 2021-05-26T11:58:00.036327187Z I0526 11:58:00.035237 1 controller.go:752] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded
>> 2021-05-26T11:58:00.036327187Z I0526 11:58:00.035398 1 controller.go:1109] Temporary error received, adding PVC 187bb8db-8f9b-4cfe-bdb1-797cdbe00d02 to claims in progress
>> 2021-05-26T11:58:00.036327187Z W0526 11:58:00.035431 1 controller.go:961] Retrying syncing claim "187bb8db-8f9b-4cfe-bdb1-797cdbe00d02", failure 0
>> 2021-05-26T11:58:00.036327187Z E0526 11:58:00.035464 1 controller.go:984] error syncing claim "187bb8db-8f9b-4cfe-bdb1-797cdbe00d02": failed to provision volume with StorageClass "rbdthick": rpc error: code = DeadlineExceeded desc = context deadline exceeded

But it looks like Kubernetes kept sending the CreateVolume request, which kept failing with the error:

>> Operation already in progress with the given volume.

This looks like a very common case with thick-provisioned volumes, but it is expected with the current design unless I am missing something here. We need a thick-provisioning-specific fix here. Proposing as a blocker because the cluster remains in the same state of storage utilization even after deleting the PVC.
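For context, the StorageClass and PVC below are an illustrative reconstruction of the objects behind the CreateVolume request quoted above, based only on the logged parameters (claim rbdthick120, 120Gi, pool ocs-storagecluster-cephblockpool, thickProvision=true). The provisioner string and the omitted secret references are assumed ODF defaults and may differ on a real cluster; this is a sketch, not the exact manifests used in this test.

```yaml
# Illustrative sketch reconstructed from the logged CreateVolume parameters.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbdthick
provisioner: openshift-storage.rbd.csi.ceph.com   # assumed ODF default provisioner name
parameters:
  clusterID: openshift-storage
  pool: ocs-storagecluster-cephblockpool
  imageFormat: "2"
  imageFeatures: layering
  thickProvision: "true"            # triggers the thick (zero-fill) provisioning path
  csi.storage.k8s.io/fstype: ext4
  # provisioner/node-stage secret references omitted; in ODF they normally
  # point at the rook-csi-rbd-* secrets in the openshift-storage namespace
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbdthick120
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce                  # SINGLE_NODE_WRITER in the CSI request
  resources:
    requests:
      storage: 120Gi                 # required_bytes: 128849018880
  storageClassName: rbdthick
```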
Recovery is not possible because the manual attempt to delete the image failed, as described in comment #0.

Madhu is helping with this one.

What we currently do for thick provisioning from Ceph is this:
1. Get a PVC creation request (thick provisioned).
2. Create a thin-provisioned image.
3. Fill it with zeroes for the requested capacity.
4. Once it is filled, return the response back to the user.

Now in step 3, if there is not enough storage there are two options:
Option 1. Keep waiting for storage to become available (which is what is happening in this case).
Option 2. Return ENOSPC if not enough space is available; this can be achieved by using a flag in Ceph.

Even if we implement option 2, we still need to clean up, as there would be a stale image and other leftovers. The admin has to wait for space to be freed up after the cleanup.

One alternative to this is to check the space upfront, before even creating the image, and return from there if enough space is not available. There are some issues/corner cases with this approach given that Ceph is a distributed system, but I think we can live with those for now given that we don't have any other option. Moreover, for those cases we will implement option 2 as well (the cleanup option).

So, for now we are doing two things:
1. Short term: from the ceph-csi side, check the available storage and based on that either create or don't create the image.
2. Long term: implement option 2 as well.

(In reply to Mudit Agarwal from comment #10)
> What we currently do for thick provisioning from ceph is this:
> 1. Get a PVC creation request (thick provisioned)
> 2. Create a thin provisioned image
> 3. Fill it with zeroes for the requested capacity
> 4. Once it is filled return the response back to the user.
>
> Now in step 3, if there is not enough storage there are two options:
> Option 1. Keep waiting for storage to be available (which is happening in
> this case)
> Option 2. Return ENOSPC if not enough space is available, this can be
> achieved by using a flag in Ceph
>
> Even if we implement option 2, we still need to cleanup as there would be a
> stale image and other leftovers.
> Admin has to wait for space to be freed up after the clean up.
>
> One alternative to this is to check the space upfront before even creating
> the image and return from there if enough space is not there.
> There are some issues/corner cases with this approach given that ceph is a
> distributed system but I think we can live with those for now
> given that we don't have any other option.

There are two important scenarios:
1. An attempt to create a single RBD thick PVC that is capable of increasing the total storage cluster utilization (consider the replica size also) up to or beyond the limit which makes the storage cluster read-only. A code change to accept or deny the volume create request based on the available storage will fix the issue with scenario 1.
2. Creating an RBD thick PVC which is small in size, but while it is provisioning the storage cluster becomes read-only because other PVCs are also utilizing the storage. The probability of occurrence of this scenario is lower than that of scenario 1, but it is also important. We need to ensure that no stale image is present.

> Moreover, for those options we will implement option 2 as well (cleanup
> option)
>
> So, for now we are doing two things:
> 1. short term: From ceph-csi side check the available storage and based on
> that either create or don't create the image.
What is the maximum achievable storage utilization (if the thick PVC is created), in percentage, that we plan to set? We need to set a safe value for this because at 85% the storage cluster will become read-only.

> 2. long term: implement option 2 as well

Yes, for most of the cases we will deny the PVC creation if space is not there. This will avoid any stale image.
>> What is the maximum achievable storage utilization (if the thick PVC is created), in percentage, that we plan to set?

The plan is to check it up to 80%, is that ok?
Madhu, FYI
Administrators are encouraged to configure resource quotas for their users. With a storage quota, users cannot request more storage than they have been assigned (see the example ResourceQuota sketch at the end of this section):
- https://access.redhat.com/documentation/en-us/openshift_container_platform/4.5/html/applications/quotas
- https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota

When provisioning fails, the image should get deleted. Provisioning does not fail at the moment, but hangs. This is expected to be addressed with https://github.com/ceph/ceph-csi/pull/2115

Removing the acks as this won't/can't be fixed in OCS in 4.8. Keeping it open and as a proposed blocker till we have the recommended recovery steps as well as other findings.

We need one more BZ as a tracker for BZ #1969301.

As discussed in offline mail as well as in separate meetings, there is nothing we can do about this from the csi side. Marking it as a tracker for the Ceph BZ, which will improve the situation here.

Already flagged this as a known issue for 4.8, moving it out to 4.9.

Hi Madhu,
What is the expected behavior now? The fix mentioned in comment #c13 is not merged.
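Regarding the resource-quota recommendation above, here is a minimal, hypothetical sketch of a namespace storage quota; the name, limits, and the per-StorageClass key for the "rbdthick" class are arbitrary examples chosen for illustration, not values taken from this bug.

```yaml
# Example namespace quota: caps total requested storage and the number of
# PVCs, plus storage requested through the rbdthick StorageClass specifically.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: default
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "20"
    rbdthick.storageclass.storage.k8s.io/requests.storage: 200Gi
```

With a quota like this in place, a PVC that would exceed the remaining namespace allowance is rejected at admission time and never reaches the provisioner, which avoids the hung thick-provisioning attempt described in this bug for over-sized requests.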