Bug 2219918 - [OCS 4.8] openshift-local-storage loop stuck in Looking for released PVs to clean up and does not create a new pv
Summary: [OCS 4.8] openshift-local-storage loop stuck in Looking for released PVs to c...
Keywords:
Status: CLOSED DUPLICATE of bug 2032924
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: unclassified
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Mudit Agarwal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-05 18:45 UTC by jpeyrard
Modified: 2023-09-25 10:25 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-19 12:49:08 UTC
Embargoed:



Description jpeyrard 2023-07-05 18:45:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

This issue happens on OCP 4.8 / OCS 4.8 using the "local-storage-operator" to provide drives for Ceph/OCS.
So this bug is related to "openshift-local-storage", not to OCS 4.8 itself.
The customer has a support exception to continue using this software.

The "diskmaker-manager-kqrd4" pod seems to be stuck in a loop and does not ingest any new drives.
Deleting this "diskmaker-manager" pod and the "local-storage-operator" pod does not make it move forward.
Shortly after startup, "diskmaker-manager-kqrd4" quickly gets stuck in the following loop:

~~~
2023-06-13T13:48:20.672103650Z {"level":"info","ts":1686664100.6719604,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:48:20.672249550Z I0613 13:48:20.672222 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-13T13:48:50.673440909Z {"level":"info","ts":1686664130.673301,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:48:50.673568923Z I0613 13:48:50.673531 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-13T13:49:20.674220779Z {"level":"info","ts":1686664160.6741552,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:49:20.674438831Z I0613 13:49:20.674405 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
~~~

We are particularly interested in solving the issue on this w8 node in order to recreate an OSD.
However, it is not the only pod with this issue; multiple "diskmaker-manager" pods appear to be affected and no longer loop over the drives:

$ grep -c "Device" 0050-inspect.local.7853900946867902518.tar.gz/inspect.local.7853900946867902518/namespaces/openshift-local-storage/pods/diskmaker-manager-*/diskmaker-manager/diskmaker-manager/logs/current.log | cut -b 81-500
namespaces/openshift-local-storage/pods/diskmaker-manager-22qrq/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-68qpt/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-7p7jv/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-7ptl2/diskmaker-manager/diskmaker-manager/logs/current.log:12480
namespaces/openshift-local-storage/pods/diskmaker-manager-96vxc/diskmaker-manager/diskmaker-manager/logs/current.log:63344
namespaces/openshift-local-storage/pods/diskmaker-manager-9vjtf/diskmaker-manager/diskmaker-manager/logs/current.log:19085
namespaces/openshift-local-storage/pods/diskmaker-manager-fnqxw/diskmaker-manager/diskmaker-manager/logs/current.log:111879
namespaces/openshift-local-storage/pods/diskmaker-manager-gl4wf/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-kqrd4/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-pdnct/diskmaker-manager/diskmaker-manager/logs/current.log:129458
namespaces/openshift-local-storage/pods/diskmaker-manager-q6s52/diskmaker-manager/diskmaker-manager/logs/current.log:59920
namespaces/openshift-local-storage/pods/diskmaker-manager-zdstc/diskmaker-manager/diskmaker-manager/logs/current.log:0
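
For reference, the same check can be run against a live cluster instead of a must-gather. This is only a minimal sketch; the container name "diskmaker-manager" is taken from the must-gather paths above, the rest is standard oc usage:

~~~
# Count device-matching log lines per diskmaker-manager pod on the live cluster;
# healthy pods should show a non-zero count, stuck pods show 0.
for pod in $(oc -n openshift-local-storage get pods -o name | grep diskmaker-manager); do
  echo -n "$pod: "
  oc -n openshift-local-storage logs "$pod" -c diskmaker-manager | grep -c "Device"
done
~~~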


In general, on every healthy diskmaker pod, a loop runs every 60 seconds that browses all block devices and creates a symlink in /mnt/local-storage/localblock when the matching conditions are met. Like this:
 
~~~
2023-06-20T16:32:54.391660888Z {"level":"info","ts":1687278774.3914862,"logger":"localvolumeset-symlink-controller","msg":"Reconciling LocalVolumeSet","Request.Namespace":"openshift-local-storage","Request.Name":"local-block"}
2023-06-20T16:32:54.391872431Z I0620 16:32:54.391823   15549 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-20T16:32:54.569914265Z {"level":"info","ts":1687278774.5697858,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop0","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.593580217Z {"level":"info","ts":1687278774.5934741,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop1","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.614147565Z {"level":"info","ts":1687278774.6140366,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop2","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.632720733Z {"level":"info","ts":1687278774.6326215,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop3","matcher.Name":"inSizeRange"}
2023-06-20T16:32:54.637415307Z {"level":"info","ts":1687278774.6373258,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdb","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.640344701Z {"level":"info","ts":1687278774.6402955,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdc","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.642941844Z {"level":"info","ts":1687278774.6428287,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdd","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645547,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde","filter.Name":"noChildren"}
2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645573,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde1","filter.Name":"noBiosBootInPartLabel"}
2023-06-20T16:32:54.645784322Z {"level":"info","ts":1687278774.6456826,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde2","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.645829280Z {"level":"info","ts":1687278774.645764,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde3","filter.Name":"noBiosBootInPartLabel"}
2023-06-20T16:32:54.645867197Z {"level":"info","ts":1687278774.6458266,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde4","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.646096199Z {"level":"info","ts":1687278774.6460469,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0","filter.Name":"noChildren"}
2023-06-20T16:32:54.650369603Z {"level":"info","ts":1687278774.6503327,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p1","matcher.Name":"inSizeRange"}
2023-06-20T16:32:54.653112455Z {"level":"info","ts":1687278774.6530833,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p2","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.653160997Z {"level":"info","ts":1687278774.653139,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p3","filter.Name":"noFilesystemSignature"}
2023-06-20T16:33:17.781643365Z {"level":"info","ts":1687278797.781562,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
~~~
 
On 6 nodes this is no longer the case (see the grep "Device" output above).
Affected nodes are: w2, w3, w4, w6, w7, w8.
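
On an affected node, one way to see what the diskmaker has already ingested is to compare the symlinks it maintains under the storage class HostDir (/mnt/local-storage/localblock, per the logs above) with the local PVs known to the cluster. A minimal sketch, assuming "w8" is the node object name as seen in `oc get nodes`:

~~~
# Symlinks created by the diskmaker for the "localblock" storage class on node w8
# (node name is an assumption, substitute the affected node):
oc debug node/w8 -- chroot /host ls -l /mnt/local-storage/localblock

# Corresponding local PVs as seen by the cluster:
oc get pv | grep localblock
~~~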
 
The LocalVolumeSet looks quite standard:

   {
     "apiVersion": "local.storage.openshift.io/v1alpha1",
     "kind": "LocalVolumeSet",
     "metadata": {
       "annotations": {},
       "name": "local-block",
       "namespace": "openshift-local-storage"
     },
     "spec": {
       "deviceInclusionSpec": {
         "deviceMechanicalProperties": ["NonRotational"],
         "deviceTypes": ["disk"],
         "maxSize": "5Ti",
         "minSize": "2Ti"
       },
       "fstype": "ext4",
       "maxDeviceCount": 3,
       "nodeSelector": {
         "nodeSelectorTerms": [
           {
             "matchExpressions": [
               {
                 "key": "cluster.ocs.openshift.io/openshift-storage",
                 "operator": "In",
                 "values": [""]
               }
             ]
           }
         ]
       },
       "storageClassName": "localblock",
       "volumeMode": "Block"
     }
   }
 
And the drive to be ingested on w8 is 3.5T.
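
Given that spec, the new w8 drive should only be picked up if it is a non-rotational whole disk between 2Ti and 5Ti, with no children and no filesystem signature (matching the noChildren / noFilesystemSignature filters seen in the logs). A hedged sketch of how those conditions could be verified on the node; /dev/sdX is a placeholder for the real drive:

~~~
# TYPE should be "disk", SIZE between 2Ti and 5Ti, ROTA=0, and no child devices listed:
oc debug node/w8 -- chroot /host lsblk -o NAME,TYPE,SIZE,ROTA /dev/sdX

# blkid should report no filesystem signature for a drive the diskmaker is expected to consume:
oc debug node/w8 -- chroot /host blkid -p /dev/sdX
~~~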


We straced two diskmaker pods for 1800s.
We can see that they still talk on the network, but they do not make any syscalls related to /dev or /mnt/local-storage/localblock.
Restarting the pod does not help, so some state must be saved somewhere.
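
For reference, a sketch of the kind of trace that was taken; the exact invocation, the pgrep pattern, and strace being available on the node are all assumptions:

~~~
# Attach to the running diskmaker process on the node for 1800s and record
# file-related syscalls, to confirm it never touches /dev or /mnt/local-storage:
oc debug node/w8 -- chroot /host sh -c \
  'timeout 1800 strace -f -p "$(pgrep -f diskmaker | head -n1)" -e trace=openat,readlinkat -o /tmp/diskmaker.strace'
~~~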
 
 
Version of all relevant components (if applicable):
 
OCP Version    : 4.8.39
OCS Version    : 4.8.14
local-storage  : local-storage-operator.4.8.0-202208020324
 
 
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
 
Right now Ceph is fully healthy.
But the customer is unable to get a new OSD created on node worker8 because it is waiting for a PV to be created.



Is there any workaround available to the best of your knowledge?
 
We have not tried rebooting w8.
It will need to be rebooted anyway, because there is still one loopback device (loop1) associated with a drive that has been replaced.
 
~~~
$ grep ^ sosreport-w8/sys/block/loop*/loop/backing_file
sosreport-w8/sys/block/loop0/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-f5306836/4da40238-0786-473e-bfb5-cf23c6ce1b5d
sosreport-w8/sys/block/loop1/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-7bfb9892/09797f5e-b321-47de-ac84-de4fd47ec603 (deleted)
~~~
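
Before rebooting, it may be possible to confirm and release the stale loop device manually; this is only a sketch, not a validated workaround, since detaching a device that kubelet still references could have side effects:

~~~
# List loop devices and their backing files on w8; loop1 should show the deleted file:
oc debug node/w8 -- chroot /host losetup -l

# Detach it only if nothing still holds it open; otherwise a reboot remains the safer option:
oc debug node/w8 -- chroot /host losetup -d /dev/loop1
~~~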
 
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
 
2) This bug seems to have happened by itself over time.
We have not been able to reproduce it, but it looks like the diskmaker pod believes it needs to delete something and is not able to do so.
Unfortunately, I don't know what it wants to delete or which criteria its algorithm is based on.
 
Is this issue reproducible?
 
This issue is reproducing on multiple nodes right now.
 
 
Can this issue be reproduced from the UI?
 
Not tried.
 
If this is a regression, please provide more details to justify this:
 
Nothing suggests that this is a regression.
 
Steps to Reproduce:
N/A
 
Actual results:
 
N/A
 
Expected results:
 
A PV is created by the diskmaker and then an OSD is created.
 
Additional info:

Comment 2 jpeyrard 2023-07-07 10:14:36 UTC
BZ => Jira
The Jira issue tracking this BZ: https://issues.redhat.com/browse/OCPBUGS-15929

Comment 6 jpeyrard 2023-09-19 12:49:08 UTC
Hi Bill Zvonar and Eran!

Indeed, it is a duplicate of this one: https://bugzilla.redhat.com/show_bug.cgi?id=2032924
Good to see that it has been fixed.
We should not have this system in production for much longer, so they will have the fix when they reinstall it. Great!

Closing this case as a DUP.

*** This bug has been marked as a duplicate of bug 2032924 ***

