Description of problem (please be as detailed as possible and provide log snippets):

This issue happens on OCP 4.8 / OCS 4.8 using the "local-storage-operator" to provide drives for Ceph/OCS, so this bug relates to "openshift-local-storage", not OCS 4.8. The customer has a support exception to continue using this software.

The "diskmaker-manager-kqrd4" pod seems to be stuck in a loop and does not ingest any new drives. Deleting this "diskmaker-manager" pod and the "local-storage-operator" pod does not make it move forward. Shortly after startup, "diskmaker-manager-kqrd4" gets stuck in the following loop:

~~~
2023-06-13T13:48:20.672103650Z {"level":"info","ts":1686664100.6719604,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:48:20.672249550Z I0613 13:48:20.672222 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-13T13:48:50.673440909Z {"level":"info","ts":1686664130.673301,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:48:50.673568923Z I0613 13:48:50.673531 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-13T13:49:20.674220779Z {"level":"info","ts":1686664160.6741552,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:49:20.674438831Z I0613 13:49:20.674405 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
~~~

We are particularly interested in solving the issue on this w8 node in order to recreate an OSD.
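For reference, a quick way to check whether a given diskmaker-manager pod is stuck in this state (only deleter messages, no device matching) is to count the device-matching entries in its live log. A minimal sketch, assuming the pod and container names shown in this report:

~~~
# A healthy diskmaker-manager logs a "Device.Name" entry for every block device
# roughly every 60s; a stuck pod such as diskmaker-manager-kqrd4 logs none.
$ oc -n openshift-local-storage logs diskmaker-manager-kqrd4 -c diskmaker-manager | grep -c '"Device.Name"'
0
~~~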
It is not the only pod with this issue: it looks like multiple "diskmaker-manager" pods are affected and no longer loop over the drives at all:

~~~
$ grep -c "Device" 0050-inspect.local.7853900946867902518.tar.gz/inspect.local.7853900946867902518/namespaces/openshift-local-storage/pods/diskmaker-manager-*/diskmaker-manager/diskmaker-manager/logs/current.log | cut -b 81-500
namespaces/openshift-local-storage/pods/diskmaker-manager-22qrq/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-68qpt/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-7p7jv/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-7ptl2/diskmaker-manager/diskmaker-manager/logs/current.log:12480
namespaces/openshift-local-storage/pods/diskmaker-manager-96vxc/diskmaker-manager/diskmaker-manager/logs/current.log:63344
namespaces/openshift-local-storage/pods/diskmaker-manager-9vjtf/diskmaker-manager/diskmaker-manager/logs/current.log:19085
namespaces/openshift-local-storage/pods/diskmaker-manager-fnqxw/diskmaker-manager/diskmaker-manager/logs/current.log:111879
namespaces/openshift-local-storage/pods/diskmaker-manager-gl4wf/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-kqrd4/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-pdnct/diskmaker-manager/diskmaker-manager/logs/current.log:129458
namespaces/openshift-local-storage/pods/diskmaker-manager-q6s52/diskmaker-manager/diskmaker-manager/logs/current.log:59920
namespaces/openshift-local-storage/pods/diskmaker-manager-zdstc/diskmaker-manager/diskmaker-manager/logs/current.log:0
~~~

In general, on every healthy diskmaker pod, every 60s there is a loop that walks all the block devices and creates a symlink in /mnt/local-storage/localblock if the conditions are met.
Like this:

~~~
2023-06-20T16:32:54.391660888Z {"level":"info","ts":1687278774.3914862,"logger":"localvolumeset-symlink-controller","msg":"Reconciling LocalVolumeSet","Request.Namespace":"openshift-local-storage","Request.Name":"local-block"}
2023-06-20T16:32:54.391872431Z I0620 16:32:54.391823 15549 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-20T16:32:54.569914265Z {"level":"info","ts":1687278774.5697858,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop0","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.593580217Z {"level":"info","ts":1687278774.5934741,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop1","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.614147565Z {"level":"info","ts":1687278774.6140366,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop2","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.632720733Z {"level":"info","ts":1687278774.6326215,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop3","matcher.Name":"inSizeRange"}
2023-06-20T16:32:54.637415307Z {"level":"info","ts":1687278774.6373258,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdb","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.640344701Z {"level":"info","ts":1687278774.6402955,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdc","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.642941844Z {"level":"info","ts":1687278774.6428287,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdd","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645547,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde","filter.Name":"noChildren"}
2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645573,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde1","filter.Name":"noBiosBootInPartLabel"}
2023-06-20T16:32:54.645784322Z {"level":"info","ts":1687278774.6456826,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde2","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.645829280Z {"level":"info","ts":1687278774.645764,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde3","filter.Name":"noBiosBootInPartLabel"}
2023-06-20T16:32:54.645867197Z {"level":"info","ts":1687278774.6458266,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde4","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.646096199Z {"level":"info","ts":1687278774.6460469,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0","filter.Name":"noChildren"}
2023-06-20T16:32:54.650369603Z {"level":"info","ts":1687278774.6503327,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p1","matcher.Name":"inSizeRange"}
2023-06-20T16:32:54.653112455Z {"level":"info","ts":1687278774.6530833,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p2","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.653160997Z {"level":"info","ts":1687278774.653139,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p3","filter.Name":"noFilesystemSignature"}
2023-06-20T16:33:17.781643365Z {"level":"info","ts":1687278797.781562,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
~~~

On 6 nodes this is no longer the case (see the previous grep "Device" command above). The affected nodes are: w2, w3, w4, w6, w7, w8.

The LocalVolumeSet looks quite standard:

~~~
{
  "apiVersion": "local.storage.openshift.io/v1alpha1",
  "kind": "LocalVolumeSet",
  "metadata": {
    "annotations": {},
    "name": "local-block",
    "namespace": "openshift-local-storage"
  },
  "spec": {
    "deviceInclusionSpec": {
      "deviceMechanicalProperties": ["NonRotational"],
      "deviceTypes": ["disk"],
      "maxSize": "5Ti",
      "minSize": "2Ti"
    },
    "fstype": "ext4",
    "maxDeviceCount": 3,
    "nodeSelector": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "cluster.ocs.openshift.io/openshift-storage",
              "operator": "In",
              "values": [""]
            }
          ]
        }
      ]
    },
    "storageClassName": "localblock",
    "volumeMode": "Block"
  }
}
~~~

The drive to be ingested on w8 is 3.5T, which falls within the 2Ti-5Ti inclusion range.

We straced two diskmaker pods for 1800s. We can see that the process still talks on the network, but it does not make any syscall related to /dev or /mnt/local-storage/localblock. Restarting the pod does not help, so some "state" is saved somewhere.

Version of all relevant components (if applicable):
OCP Version: 4.8.39
OCS Version: 4.8.14
local-storage: local-storage-operator.4.8.0-202208020324

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Right now Ceph is fully healthy, but the customer is unable to get a new OSD created on node worker8 because it is waiting for a PV to be created.

Is there any workaround available to the best of your knowledge?
We have not tried to reboot w8. It will need to be rebooted anyway, because there is still one loopback device (loop1) associated with a drive that has already been replaced; see the sosreport snippet below.
~~~
$ grep ^ sosreport-w8/sys/block/loop*/loop/backing_file
sosreport-w8/sys/block/loop0/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-f5306836/4da40238-0786-473e-bfb5-cf23c6ce1b5d
sosreport-w8/sys/block/loop1/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-7bfb9892/09797f5e-b321-47de-ac84-de4fd47ec603 (deleted)
~~~

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2) This bug seems to have happened by itself over time. We have not been able to reproduce it, but it looks like the diskmaker pod believes it needs to delete something and is unable to do so. Unfortunately, I don't know what it wants to delete or on which criteria it bases its algorithm.

Is this issue reproducible?
The issue is currently reproduced on multiple nodes.

Can this issue be reproduced from the UI?
Not tried.

If this is a regression, please provide more details to justify this:
Nothing indicates that this is a regression.

Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:
The PV is created by the diskmaker and an OSD is then created.

Additional info:
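Regarding the stale loopback device: a minimal sketch (hypothetical, assuming host access through oc debug and the node name w8 used in this report) of how the stale loop device could be inspected, and if necessary detached, before or instead of a full reboot:

~~~
# List loop devices and their backing files on the node; the entry marked
# "(deleted)" corresponds to the drive that has already been replaced.
$ oc debug node/w8 -- chroot /host losetup -l

# Detaching the stale device would be done with losetup -d, e.g.:
# $ oc debug node/w8 -- chroot /host losetup -d /dev/loop1
~~~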
BZ => Jira: the Jira issue which tracks this BZ: https://issues.redhat.com/browse/OCPBUGS-15929
Hi Bill Zvonar and Eran! Indeed, it is a duplicate of this one: https://bugzilla.redhat.com/show_bug.cgi?id=2032924. Good to see that it has been fixed. We should not have this system in production much longer, so they will have the fix when they reinstall it. Great!

Closing this case as a duplicate.

*** This bug has been marked as a duplicate of bug 2032924 ***