Bug 2060385

Summary: Creating 100 PVCs at once takes 3+ minutes with `context deadline exceeded` errors in the topolvm-controller
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Shay Rozen <srozen>
Component: lvm-operator
Assignee: N Balachandran <nibalach>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: medium
Priority: unspecified
Version: 4.10
CC: jolmomar, lgangava, mmuench, muagarwa, nibalach, ocs-bugs, odf-bz-bot, sapillai
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-11-04 10:07:11 UTC
Type: Bug

Attachments:
- File to create 100 PVCs for a StorageClass in Immediate volume binding mode.
- topolvm controller logs.

Description Shay Rozen 2022-03-03 12:22:52 UTC
Created attachment 1863971 [details]
File to create 100 PVCs for a StorageClass in Immediate volume binding mode.

Description of problem (please be as detailed as possible and provide log
snippets):
Creating 100 PVCs in a batch with Immediate volume binding mode (also applicable with WaitForFirstConsumer when the pods are created in a batch) takes more than 3 minutes for all of them to become Bound, and the topolvm-controller logs "context deadline exceeded" errors with failures and retries.


Version of all relevant components (if applicable):
all

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Wait. I believe that with more than 100 PVCs the process will be stuck for a long time.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Create an OCP SNO cluster.
2. Install the LVMO operator and create an LVMCluster.
3. Define a StorageClass with Immediate volume binding mode.
4. Create 100 PVCs from one file.
5. Delete the PVCs with `oc delete -f <file>`.
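The steps above can be sketched as follows. This is a hypothetical reproducer, not the attached file; the StorageClass name and namespace are borrowed from the log snippets below, and the 1Gi PVC size is an assumption:

```shell
#!/usr/bin/env bash
# Generate one file containing 100 PVC manifests for a StorageClass with
# Immediate volume binding, then create/delete them in a single batch.
# "odf-lvm-imm-vg1" and "storage-test" come from the logs in this bug.
OUT=pvc-batch.yaml
: > "$OUT"
for i in $(seq 1 100); do
  cat >> "$OUT" <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lvmpvc-$i
  namespace: storage-test
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: odf-lvm-imm-vg1
EOF
done
# oc apply -f "$OUT"    # step 4: create all 100 PVCs at once
# oc delete -f "$OUT"   # step 5: delete them the same way
```

On a real cluster, `oc get pvc -n storage-test -w` then shows how long the claims take to reach Bound.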


Actual results:
Creating the 100 PVCs takes 3+ minutes, with errors in the log:
E0303 09:45:49.221743       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:45:49.221757       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded


Expected results:
PVCs should be created immediately, with no errors in the log.

Additional info:
Creation and deletion of one of the PVs, from the log. As you can see, there are failures and retries for both creation and deletion.

[srozen@localhost 4.10]$ oc logs topolvm-controller-98994987c-pw2xk --all-containers | grep fail | grep b5dc9fac-5ffa-455b-b6aa-ceddd95a9283
W0303 09:45:49.221697       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 0
E0303 09:45:49.221743       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:45:49.221757       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:00.223997       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 1
E0303 09:46:00.224042       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:00.224040       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:12.225518       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 2
E0303 09:46:12.225557       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:12.225606       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:26.226929       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 3
E0303 09:46:26.226976       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:26.227127       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:46:44.228599       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 4
E0303 09:46:44.228650       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:46:44.228680       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:47:10.230599       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 5
E0303 09:47:10.230745       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:47:10.230745       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 09:47:52.232527       1 controller.go:934] Retrying syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 6
E0303 09:47:52.232574       1 controller.go:957] error syncing claim "b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0303 09:47:52.232604       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"storage-test", Name:"lvmpvc-100", UID:"b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", APIVersion:"v1", ResourceVersion:"4105735", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "odf-lvm-imm-vg1": rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0303 10:32:04.716309       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:04.716377       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 0
E0303 10:32:15.718173       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:15.718233       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 1
E0303 10:32:27.719535       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:27.719609       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 2
E0303 10:32:41.720740       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:41.720798       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 3
E0303 10:32:59.722164       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:32:59.722340       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 4
E0303 10:33:25.724684       1 controller.go:1481] delete "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W0303 10:33:25.724761       1 controller.go:989] Retrying syncing volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283", failure 5

Comment 2 Shay Rozen 2022-03-03 12:25:06 UTC
Created attachment 1863974 [details]
topolvm controller logs.

Comment 3 Leela Venkaiah Gangavarapu 2022-03-07 05:03:35 UTC
Hi Shay,

- I can see some contradictory results in the logs; at least at some point the same setup was reused for creating PVCs either in batch mode or linearly, so the logs are mixed up from various tests.
- Volumes show succeeded, started, and provisioned entries (in that order) in the provisioner logs:

# from storage class "odf-lvm-vg1"
# grep -P 'lvmpvc-\d+.*odf-lvm-vg1.*(started|provisioned|succeeded)$' /tmp/lvm.log | awk '{$2=""; print $0}' | uniq | tail -n10
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-96" class "odf-lvm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-97" class "odf-lvm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-97" class "odf-lvm-vg1": volume "pvc-5b71d24d-cba6-4d76-83b6-9e5ea5f073ac" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-97" class "odf-lvm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-98" class "odf-lvm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-98" class "odf-lvm-vg1": volume "pvc-8493a5ba-8769-4483-b997-e311c7b7f926" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-98" class "odf-lvm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-99" class "odf-lvm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-99" class "odf-lvm-vg1": volume "pvc-80b5d75e-e718-4154-b39d-bbcec1261331" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-99" class "odf-lvm-vg1": succeeded

# from storage class "odf-lvm-imm-vg1"
# grep -P 'lvmpvc-\d+.*odf-lvm-imm-vg1.*(started|provisioned|succeeded)$' /tmp/lvm.log | awk '{$2=""; print $0}' | uniq | tail -n10
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-97" class "odf-lvm-imm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-98" class "odf-lvm-imm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-98" class "odf-lvm-imm-vg1": volume "pvc-6102e518-51e8-46c4-ab34-ad6da9acf54d" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-98" class "odf-lvm-imm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-99" class "odf-lvm-imm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-99" class "odf-lvm-imm-vg1": volume "pvc-8f7391ea-8054-401b-bd9e-1a24d6f3a5d4" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-99" class "odf-lvm-imm-vg1": succeeded
I0303  1 controller.go:1337] provision "storage-test/lvmpvc-100" class "odf-lvm-imm-vg1": started
I0303  1 controller.go:1442] provision "storage-test/lvmpvc-100" class "odf-lvm-imm-vg1": volume "pvc-b5dc9fac-5ffa-455b-b6aa-ceddd95a9283" provisioned
I0303  1 controller.go:1455] provision "storage-test/lvmpvc-100" class "odf-lvm-imm-vg1": succeeded


Ask:
1. Please restart the topolvm-controller and topolvm-node pods when performing different types of tests; this will not affect functionality and gives consistent logs. Could you please rerun the test for this bug?
2. Let's isolate the issue from other operations, as I can see some delete operations in the logs. Although distinct operations (delete/expand/create) are queued sequentially, during batch processing they may interfere with syncing the operations to the underlying LVM infrastructure, so please isolate this test.
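For ask 1, a restart sketch; the namespace and label selectors here are guesses about how the LVMO deploys TopoLVM and should be verified on the cluster first (e.g. with `oc get pods -A | grep topolvm`):

```shell
#!/usr/bin/env bash
# Print the pod-restart commands for review; the pods' owning controllers
# recreate them, which gives fresh logs for the next test run. Namespace
# and label selectors are assumptions. Pipe the output to `sh` to run.
NS=openshift-storage
for app in topolvm-controller topolvm-node; do
  cmd="oc -n $NS delete pod -l app.kubernetes.io/name=$app"
  echo "$cmd"
done
```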

General:
1. Your deductions on bulk operations are partly accurate in terms of LVM syncing multiple operations; however, at the LVMO/TopoLVM layer this shouldn't be an issue, as all commands are funneled through lvmd sequentially. I'd like to confirm this after an isolated re-test.


Please let me know if additional info is required.

Thanks,
Leela.

Comment 5 Shay Rozen 2022-03-21 15:27:35 UTC
It's ok from my side.

Comment 6 Mudit Agarwal 2022-06-22 15:03:54 UTC
Not a 4.11 blocker and can't be fixed before the 4.11 dev freeze. Moving it to 4.12.

Comment 9 N Balachandran 2022-11-02 07:24:31 UTC
Moving this to 4.13 as this is low priority.