Bug 2060385 - Creating 100 PVCs at once takes 3+ minutes with error `context deadline exceeded` in the topolvm-controller
Summary: Creating 100 PVCs at once takes 3+ minutes with error `context deadline exceeded` in the topolvm-controller
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: lvm-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: N Balachandran
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-03 12:22 UTC by Shay Rozen
Modified: 2023-08-09 16:46 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-04 10:07:11 UTC
Embargoed:


Attachments (Terms of Use)
File to create 100 pvc for storageclass in immediate volume binding mode. (41.27 KB, text/plain)
2022-03-03 12:22 UTC, Shay Rozen
topolvm controller logs. (7.42 MB, text/plain)
2022-03-03 12:25 UTC, Shay Rozen

Description Shay Rozen 2022-03-03 12:22:52 UTC
Created attachment 1863971 [details]
File to create 100 pvc for storageclass in immediate volume binding mode.

Description of problem (please be as detailed as possible and provide log
snippets):
Creating 100 PVCs in a batch with Immediate volume binding mode (also applicable when using WaitForFirstConsumer and creating the pods in a batch) takes more than 3 minutes for all of them to become Bound, and the topolvm-controller log shows "context deadline exceeded" errors with failures and retries.


Version of all relevant components (if applicable):
all

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Wait. My belief is that with more than 100 PVCs the process will be stuck for a long time.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Create an OCP SNO cluster.
2. Install the LVMO operator and create an LVMCluster.
3. Define a StorageClass with Immediate volume binding mode.
4. Create 100 PVCs from one file (a sketch of steps 3 and 4 follows below).
5. Delete the PVCs with "oc delete -f <file>".
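
For illustration, a minimal sketch of steps 3 and 4 (the StorageClass name "odf-lvm-imm-vg1", the PVC names lvmpvc-1..lvmpvc-100 and the "storage-test" namespace are taken from the logs below; the provisioner name, device-class parameter and 1Gi request size are assumptions, since the attached file is not reproduced here):

# Step 3: StorageClass with Immediate volume binding (sketch)
cat <<'EOF' | oc apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: odf-lvm-imm-vg1
provisioner: topolvm.cybozu.com          # assumed provisioner name
parameters:
  topolvm.cybozu.com/device-class: vg1   # assumed device class
volumeBindingMode: Immediate
EOF

# Step 4: generate one file containing 100 PVCs, then apply it
for i in $(seq 1 100); do
cat <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lvmpvc-$i
  namespace: storage-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: odf-lvm-imm-vg1
  resources:
    requests:
      storage: 1Gi                       # assumed size
EOF
done > pvcs.yaml
oc apply -f pvcs.yaml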


Actual results:
The creation of 100 PVCs takes 3+ minutes, with errors in the log:
E0303 09:45:49.221743       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:45:49.221757       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded


Expected results:
PVCs should be created immediately, with no errors in the log.

Additional info:
Creation and deletion of one of the PVs, as captured in the log below. As you can see, there are failures and retries for both creation and deletion, with retry intervals that back off roughly exponentially (≈11s, 12s, 14s, 18s, 26s, 42s).

[srozen@localhost 4.10]$ oc logs topolvm-controller-98994987c-pw2xk --all-containers | grep fail | grep b5dc9fac-5ffa-455b-b6aa-ceddd95a9283
W0303 09:45:49.221697       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 0
E0303 09:45:49.221743       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:45:49.221757       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:00.223997       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 1
E0303 09:46:00.224042       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:00.224040       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:12.225518       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 2
E0303 09:46:12.225557       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:12.225606       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:26.226929       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 3
E0303 09:46:26.226976       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:26.227127       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:44.228599       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 4
E0303 09:46:44.228650       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:44.228680       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:47:10.230599       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 5
E0303 09:47:10.230745       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:47:10.230745       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:47:52.232527       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 6
E0303 09:47:52.232574       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:47:52.232604       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0303 10:32:04.716309       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:04.716377       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 0
E0303 10:32:15.718173       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:15.718233       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 1
E0303 10:32:27.719535       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:27.719609       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 2
E0303 10:32:41.720740       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:41.720798       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 3
E0303 10:32:59.722164       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:59.722340       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 4
E0303 10:33:25.724684       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:33:25.724761       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 5

Comment 2 Shay Rozen 2022-03-03 12:25:06 UTC
Created attachment 1863974 [details]
topolvm controller logs.

Comment 3 Leela Venkaiah Gangavarapu 2022-03-07 05:03:35 UTC
Hi Shay,

- I can see some contradictory results in the logs; at least at some point the same setup was reused for creating PVCs, either in batch mode or linearly, and as a result logs from various tests are mixed together.
- Per the provisioner logs, volumes went through started, provisioned, and succeeded (in that order):

> # from storage class "odf-lvm-vg1"
# grep -P 'lvmpvc-\d+.*odf-lvm-vg1.*(started|provisioned|succeeded)$' /tmp/lvm.log | awk '{$2=""; print $0}' | uniq | tail -n10
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-96" class "odf-lvm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-97" class "odf-lvm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-97" class "odf-lvm-vg1": volume "pvc-5b71d24d-cba6-4d76-83b6-9e5ea5f073ac" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-97" class "odf-lvm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-98" class "odf-lvm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-98" class "odf-lvm-vg1": volume "pvc-8493a5ba-8769-4483-b997-e311c7b7f926" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-98" class "odf-lvm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-99" class "odf-lvm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-99" class "odf-lvm-vg1": volume "pvc-80b5d75e-e718-4154-b39d-bbcec1261331" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-99" class "odf-lvm-vg1": succeeded

> # from storage class "odf-lvm-imm-vg1"
# grep -P 'lvmpvc-\d+.*odf-lvm-imm-vg1.*(started|provisioned|succeeded)$' /tmp/lvm.log | awk '{$2=""; print $0}' | uniq | tail -n10
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-97" class "odf-lvm-imm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-98" class "odf-lvm-imm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-98" class "odf-lvm-imm-vg1": volume "pvc-6102e518-51e8-46c4-ab34-ad6da9acf54d" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-98" class "odf-lvm-imm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-99" class "odf-lvm-imm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-99" class "odf-lvm-imm-vg1": volume "pvc-8f7391ea-8054-401b-bd9e-1a24d6f3a5d4" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-99" class "odf-lvm-imm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-100" class "odf-lvm-imm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-100" class "odf-lvm-imm-vg1": volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-100" class "odf-lvm-imm-vg1": succeeded


Ask:
1. Please restart the topolvm-controller and topolvm-node pods when performing different types of tests (example commands below); this will not affect functionality and gives us consistent logs. Could you please re-run the test for this bug after doing so?
2. Let's isolate the issue from other operations, as I can see some delete operations in the logs. Although distinct operations (delete/expand/create) are queued sequentially, batch processing may cause them to interfere while syncing the operations to the underlying LVM infrastructure, so please run this test in isolation.
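
As a hedged example for item 1 (the controller pod name is copied from the "oc logs" command in the description; actual pod names will differ per deployment, and the owning Deployment/DaemonSet should recreate the pods automatically):

# Restart the TopoLVM pods so the next test starts with fresh logs
oc delete pod topolvm-controller-98994987c-pw2xk
oc get pods -o name | grep topolvm-node | xargs oc delete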

General:
1. Your deductions on bulk operations are partly accurate in terms of LVM syncing multiple operations. However, at the LVMO/TopoLVM layer this shouldn't be an issue, as all the commands are funneled through lvmd sequentially. I'd like to confirm this after an isolated re-test.


Please let me know if additional info is required.

Thanks,
Leela.

Comment 5 Shay Rozen 2022-03-21 15:27:35 UTC
It's ok from my side.

Comment 6 Mudit Agarwal 2022-06-22 15:03:54 UTC
Not a 4.11 blocker and can't be fixed before the 4.11 dev freeze. Moving it to 4.12.

Comment 9 N Balachandran 2022-11-02 07:24:31 UTC
Moving this to 4.13 as this is low priority.

