Bug 2050335

Summary: azure-disk failed to mount with error special device does not exist
Product: OpenShift Container Platform
Reporter: Rajnikant <rkant>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Storage sub component: Storage
QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: antonio.quintavalle, aos-bugs, fgrosjea, hekumar, jdelft, jsafrane, sbiradar, sdodson, sreber, wlewis
Version: 4.8
Keywords: Rebase
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2083785 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:47:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2083785

Description Rajnikant 2022-02-03 17:59:03 UTC
Description of problem:
azure-disk failed to mount with error special device does not exist

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.8

How reproducible:

Reported issue: With a StatefulSet, the pod intermittently fails to mount the volume on the node with the following error.

special device /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/some-id does not exist

The volume was provisioned using azure-disk.

After some restarts the error goes away and the volume is mounted in the pod.
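
A minimal sketch of checks that can be run on the affected node when the error appears (the plugin mount path is taken from the error above; the device name is an assumption for illustration):

# Run from a debug shell on the affected node; device names are assumed.
lsblk                                                             # is the azure disk still attached to the node?
ls -ld /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/  # do the global mount directories exist?
mount | grep azure-disk                                           # is anything from the plugin directory still mounted?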

Will share other logs.
Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 7 Jan Safranek 2022-02-04 15:36:00 UTC
This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=2044718.

Comment 12 antonio.quintavalle 2022-02-07 14:28:46 UTC
Reporting a mitigation procedure, in case it might be helpful to anyone.

For some big disks, kubelet does not wait long enough for the device to be unmounted.
Once the PV is no longer used by the pod, lsblk shows that the mount point is effectively removed but not the device.
Different behavior has been observed with smaller disks: after unmount the device is effectively removed from the list and reappears only when the PV is bound to a pod again.

On the Azure side, the following procedure seems to be effective:
1 - Force a detach of the impacted volume.
2 - Wait a bit (30 seconds seemed to be enough) for the kubelet to detect that the disk is no longer in use / no longer exists.
3 - Re-attach the disk to the same instance.

At the very next restart of the pod, the kubelet will be able to reattach the volume to it.

The detach is an effective way to finalize the kubelet's cleanup tasks, which were likely skipped right after the unmount action while the disk was still considered in use.

It is definitely a crude approach for dynamically provisioned disks, but it can save your day if a cluster is affected by this problem.
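
A hedged sketch of this mitigation using the Azure CLI; the resource group, VM and disk names are placeholders, and the exact az flags should be verified against your CLI version:

# RG, NODE_VM and DISK_NAME are placeholders, not values from this bug.
az vm disk detach --resource-group RG --vm-name NODE_VM --name DISK_NAME   # 1 - detach the impacted volume
sleep 30                                                                   # 2 - give kubelet time to see the disk as gone
az vm disk attach --resource-group RG --vm-name NODE_VM --name DISK_NAME   # 3 - re-attach the disk to the same instance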

Comment 13 Jan Safranek 2022-02-07 15:58:33 UTC
> For some big disks, kubelet does not wait long enough for the device to be unmounted.
> Once the PV is no longer used by the pod, lsblk shows that the mount point is effectively removed but not the device.

Is this something you can reproduce? How big is a "big disk"? How long does it take for the device to disappear / to be detected as "not in use"? We could try to fix kubelet.

BTW, bugfix for #2044718 could actually help a lot here. Kubelet will mount the device again when a new pod that uses the same volume is about to be started.

Comment 14 antonio.quintavalle 2022-02-07 18:23:13 UTC
Hello Jan,
the issue is hitting a StatefulSet that uses dynamically provisioned volumes of 6 TB.

It is something we can reproduce rather easily; given the number of replicas we have, it happens rather systematically when we perform a rolling update.

To answer your questions:
> How long does it take for the device to disappear => the device does not disappear at all; the only thing that disappears, according to lsblk, is its mount point
> Be detected as "not in use" => this is unknown; the unmount process does not get logged and the only symptom we get is an error code 32 at mount time. As far as I can see the device is listed in lsblk, and "scsi-rescan" shows the same, stating that no device was added/removed.
Besides that, I don't know what else we can use to inspect whether the device is still currently in use.

In my opinion we are hitting the issue because we rely on an optimistic unmount action, while the controller assumes that it has always completed correctly.

Even the PR made here:
https://github.com/kubernetes/kubernetes/pull/107789/commits/ff765210d5808df2392c0500f4061e84b3a94ec3#diff-3e8f8a6e5eca57db8e84bcc6c896f0a3b10289fe55d8e999b4bdf1cfdc0de394
marks the disk as being in an "uncertain" status, which makes the controller take into account the possibility that it was not unmounted correctly previously.

In the tests I have made so far, I have always been able to mount the disk manually, therefore I have to assume there is a bug in the unmount process, which is the real root cause.
I acknowledge that the PR will fix the issue, but it won't shed any light on why or when the unmount might be lengthy or unsuccessful.

Should you be interested in more details or in replicating this live on our clusters, please do not hesitate to get back to me,
Antonio

Comment 15 Hemant Kumar 2022-02-08 18:00:26 UTC
> Once the PV is no longer used by the pod, lsblk shows that the mount point is effectively removed but not the device.

The strange thing is that the unmount operation gets retried and kubelet is still unable to get an exclusive lock on the device, so I am not sure that kubelet not waiting long enough during unmounting is the issue. In the absence of lsof on those nodes, can we run something like:

~> sudo ls -l /proc/*/fd | grep -C 20 <device>

which would help us identify the PID of the process that is holding the device lock.
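
If that grep comes back empty, a small loop like the one below (a sketch; /dev/sdc is an assumed device name) walks /proc directly and prints any process holding the block device open:

# Sketch: list processes that have /dev/sdc (assumed name) open via /proc/<pid>/fd
for pid in /proc/[0-9]*; do
    if ls -l "$pid"/fd 2>/dev/null | grep -q "/dev/sdc"; then
        echo "held open by PID ${pid#/proc/} ($(cat "$pid"/comm 2>/dev/null))"
    fi
done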



> Should you be interested in more details or in replicating this live on our clusters, please do not hesitate to get back to me,

Yes, please. You can reach out to us via CEE to set up a call.

Comment 17 antonio.quintavalle 2022-02-09 14:17:05 UTC
Hello Hemant,
the pod is up and running today. Already at this stage I cannot see which process is using the device (it is correctly mounted and used by a pod):

sh-4.4# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   128G  0 disk
|-sda1                         8:1    0   384M  0 part /boot
|-sda2                         8:2    0   127M  0 part
|-sda3                         8:3    0     1M  0 part
`-sda4                         8:4    0 127.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 127.5G  0 dm   /sysroot
sdb                            8:16   0   256G  0 disk
`-sdb1                         8:17   0   256G  0 part
sdc                            8:32   0   5.9T  0 disk /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~azure-disk/pvc-xxxx

sh-4.4# mount  | grep "/dev/sdc"
/dev/sdc on /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m3014679793 type ext4 (rw,relatime,seclabel)
/dev/sdc on /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~azure-disk/pvc-xxx type ext4 (rw,relatime,seclabel)

sh-4.4# ls -l /proc/*/fd|grep -i -C 20 "sdc"
sh-4.4# ls -l /proc/*/fd|grep -i -C 20 "lun0"
sh-4.4# ls -l /proc/*/fd|grep -i -C 20 "m3014679793"
sh-4.4# ls -l /proc/*/fd|grep -i -C 20 "pvc-xxx"
sh-4.4#

Before replicating the issue I would like to understand how to effectively check which PID is using the device when everything is running fine.

Can you please suggest?

Best Regards,
Antonio

Comment 18 Hemant Kumar 2022-02-09 14:50:59 UTC
What about when you run 

# grep "sdc" /proc/*/mounts

while the pod is running, and also when the pod is deleted and this issue happens. So basically, get the output of the above command now and again after replicating the problem.
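
A small sketch of how those two snapshots could be captured and compared (the file names are arbitrary):

grep "sdc" /proc/*/mounts > /tmp/sdc-mounts-before.txt   # while the pod is running
# ...delete the pod, wait for the mount error to appear, then:
grep "sdc" /proc/*/mounts > /tmp/sdc-mounts-after.txt
diff /tmp/sdc-mounts-before.txt /tmp/sdc-mounts-after.txt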

Comment 19 antonio.quintavalle 2022-02-10 08:37:12 UTC
Hello Hemant,
this is becoming interesting.

Before I kill the pod I have this:

sh-4.4# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   128G  0 disk
|-sda1                         8:1    0   384M  0 part /boot
|-sda2                         8:2    0   127M  0 part
|-sda3                         8:3    0     1M  0 part
`-sda4                         8:4    0 127.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 127.5G  0 dm   /sysroot
sdb                            8:16   0   256G  0 disk
`-sdb1                         8:17   0   256G  0 part
sdc                            8:32   0   5.9T  0 disk /var/lib/kubelet/pods/xxxx/volumes/kubernetes.io~azure-disk/pvc-xxxx

sh-4.4# grep "sdc" /proc/*/mounts | wc -l 
765

Once the pod is killed, the init container is affected by the same issue and cannot start again.

sh-4.4# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   128G  0 disk
|-sda1                         8:1    0   384M  0 part /boot
|-sda2                         8:2    0   127M  0 part
|-sda3                         8:3    0     1M  0 part
`-sda4                         8:4    0 127.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 127.5G  0 dm   /sysroot
sdb                            8:16   0   256G  0 disk
`-sdb1                         8:17   0   256G  0 part
sdc                            8:32   0   5.9T  0 disk

sh-4.4# mount | grep sdc
sh-4.4#

sh-4.4# grep "sdc" /proc/*/mounts | wc -l
0

Apparently, as I was saying at the very beginning, there is nothing that prevents the disk from being unmounted; as a matter of fact, it has been unmounted correctly according to the lsblk and mount commands.

Can anyone explain to me how to reach CEE so that I can schedule a call to show you the cluster live?

Have a good day,
Antonio

Comment 21 Hemant Kumar 2022-02-10 17:22:47 UTC
Can you also post the YAML of the twistlock pod? That seems to be a common element in all the bug reports we have seen around this issue (even from outside Red Hat and in other k8s deployments).

Comment 24 Hemant Kumar 2022-02-11 15:52:51 UTC
Can you reproduce this issue with twistlock disabled btw? 

There are several things going on:

1. First - https://github.com/kubernetes/kubernetes/pull/107789 will fix the issue. I have already backported it to 1.21, 1.22 and 1.23, and hence it will be available in OCP 4.8, 4.9 and 4.10.

2. I had a call with a kernel engineer and the node team, and while we are running some experiments to make sure kubelet/container-runtime/kernel isn't the one behaving a little weird, we can't rule out twistlock issues. At this point, tbh, we are just trying to determine who is holding the device exclusively open. If the kernel keeps the device open until cached data is flushed, the PR I opened will fix the issue. If the container runtime holds the device open because of some race between when containers are "presumed" dead and when they actually die, the PR I opened will fix the issue.

3. My PR will not, however, fix the issue if some other pod (like twistlock) mounts the same volume (via a recursive bind mount, because it mounted "/" or another host root directory) using PRIVATE mount propagation. I know you previously confirmed that you could not find the device in a different mount point via `/proc/*/mounts` - but this is highly sensitive to when the command was run.
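
A hedged sketch of how point 3 could be checked on the node (the azure-disk mount path pattern comes from this bug; the availability of findmnt and its PROPAGATION column on the node is an assumption):

# Which mount namespaces still see the azure-disk global mount?
for pid in /proc/[0-9]*; do
    if grep -q "kubernetes.io/azure-disk/mounts" "$pid"/mountinfo 2>/dev/null; then
        echo "mount still visible in PID ${pid#/proc/} ($(cat "$pid"/comm 2>/dev/null))"
    fi
done

# If findmnt is available, its PROPAGATION column distinguishes private from shared/slave mounts:
findmnt -o TARGET,PROPAGATION | grep -i azure-disk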

Comment 26 antonio.quintavalle 2022-02-23 07:53:14 UTC
Hello Red Hat,
at this stage we will wait for the PR to be delivered in a 4.8 release and we will test the scenario again to validate that the error no longer happens.

In the meantime we have engaged Palo Alto support to better understand how dynamic volumes are detected and handled by twistlock, and whether any fix can be provided on the software side.

I will keep you updated,
Best Regards,
Antonio

Comment 29 Jan Safranek 2022-03-14 12:00:05 UTC
Still waiting for 1.21.10 to be merged into 4.8

Comment 37 Wei Duan 2022-05-26 12:01:02 UTC
@Rajnikant Any good news on this?

Comment 42 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 44 Hemant Kumar 2022-08-19 14:24:06 UTC
Yes it should be - https://bugzilla.redhat.com/show_bug.cgi?id=2083787