Bug 1877812

Summary: [ROKS] IBM cloud failed to terminate OSDs when upgraded between internal builds of OCS 4.5
Product: OpenShift Container Platform Reporter: Petr Balogh <pbalogh>
Component: Storage    Assignee: Jan Safranek <jsafrane>
Storage sub component: Storage QA Contact: Qin Ping <piqin>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: akgunjal, aos-bugs, cewong, jritter, jsafrane, madam, mumutyal, ocs-bugs, owasserm, rgeorge, rphillips, sabose, shan, tnielsen
Version: 4.5   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Duplicates: 1898811 (view as bug list)    Environment:
Last Closed: 2021-02-24 15:17:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1892623, 1894836    
Attachments:
  output of lsof command (flags: none)
  Block PVC detach testcases (flags: none)

Description Petr Balogh 2020-09-10 13:54:19 UTC
Description of problem (please be as detailed as possible and provide log snippets):

After an upgrade between two internal builds on IBM Cloud (ROKS), we see that the OSD pods got stuck in the Terminating state.


$ oc get pod -n openshift-storage |grep osd
rook-ceph-osd-0-5648b79f69-76b8p                                  0/1     Terminating   0          35d
rook-ceph-osd-1-5fb88fbfdc-q9csg                                  0/1     Terminating   0          35d
rook-ceph-osd-2-7cfb859c79-fp26l                                  0/1     Terminating   0          35d
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-cmw5v-rndqt          0/1     Completed     0          35d
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-jbq5j-5rwcl          0/1     Completed     0          35d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-xwndp-zfkz4          0/1     Completed     0          35d


CSV seems upgraded:
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.5.0-549.ci   OpenShift Container Storage   4.5.0-549.ci   ocs-operator.v4.5.0-507.ci   Succeeded

Event from pod:
Events:
  Type    Reason   Age   From                   Message
  ----    ------   ----  ----                   -------
  Normal  Killing  89m   kubelet, 10.183.63.18  Stopping container osd

The OSD logs show:
Error from server (BadRequest): container "osd" in pod "rook-ceph-osd-0-5648b79f69-76b8p" is waiting to start: PodInitializing

osd prepare log: http://pastebin.test.redhat.com/901085
Rook operator logs: http://pastebin.test.redhat.com/901087

I will add a link to the whole must-gather later on.

Version of all relevant components (if applicable):
Server Version: 4.4.17


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
As the OSDs didn't come up, I think there is a huge impact.


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
haven't tried


Can this issue be reproduced from the UI?
haven't tried


If this is a regression, please provide more details to justify this:
We didn't upgrade a previous ROKS cluster, so I'm not sure.

Steps to Reproduce:
1. Install a cluster on IBM Cloud with build v4.5.0-507.ci
2. Upgrade to build 4.5.0-549.ci


Actual results:
OSD pods stuck in the Terminating state

Expected results:
Have OSDs up and running/upgraded

Additional info:

The conversation where we started discussing this issue is here:
https://chat.google.com/room/AAAAREGEba8/7005kwvAv7w

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-ind2/pbalogh-ind2_20200618T133943/logs-bz-osd-upgrade/upgrade-osd-issue.tar.gz

Comment 2 Michael Adam 2020-09-14 15:14:37 UTC
moving to 4.5.z as discussed with Petr.

Comment 3 Travis Nielsen 2020-09-14 15:24:44 UTC
The operator log shows it isn't able to start the OSDs, which is aligned with what has already been observed:

2020-09-10T04:49:13.565882563-05:00 2020-09-10 09:49:13.565762 E | op-osd: failed to update osd deployment 1. gave up waiting for deployment "rook-ceph-osd-1" to update
2020-09-10T04:49:13.565882563-05:00 2020-09-10 09:49:13.565827 I | op-osd: started deployment for osd 1 on pvc
2020-09-10T04:49:13.772234815-05:00 2020-09-10 09:49:13.772174 I | op-osd: osd orchestration status for node ocs-deviceset-2-data-0-xwndp is completed
2020-09-10T04:49:13.772234815-05:00 2020-09-10 09:49:13.772190 I | op-osd: starting 1 osd daemons on pvc ocs-deviceset-2-data-0-xwndp
2020-09-10T04:49:13.772234815-05:00 2020-09-10 09:49:13.772217 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-2-data-0-xwndp"
2020-09-10T04:49:14.10411389-05:00 2020-09-10 09:49:14.104057 I | cephclient: getting or creating ceph auth key "osd.2"
2020-09-10T04:49:15.912533229-05:00 2020-09-10 09:49:15.912490 I | op-osd: deployment for osd 2 already exists. updating if needed
2020-09-10T04:49:17.024112085-05:00 2020-09-10 09:49:17.024065 I | op-mon: checking if we can stop the deployment rook-ceph-osd-2
2020-09-10T04:49:18.907612229-05:00 2020-09-10 09:49:18.907557 I | op-mon: The osd daemon 2 is not ok-to-stop but 'continueUpgradeAfterChecksEvenIfNotHealthy' is true, so proceeding to stop...
...
2020-09-10T05:14:10.642797258-05:00 2020-09-10 10:14:10.642743 E | op-osd: failed to update osd deployment 2. gave up waiting for deployment "rook-ceph-osd-2" to update
2020-09-10T05:14:10.642797258-05:00 2020-09-10 10:14:10.642774 I | op-osd: started deployment for osd 2 on pvc
2020-09-10T05:14:10.88199413-05:00 2020-09-10 10:14:10.881907 I | op-osd: start provisioning the osds on nodes, if needed
2020-09-10T05:14:10.88199413-05:00 2020-09-10 10:14:10.881931 W | op-osd: skipping osd provisioning where no dataDirHostPath is set
2020-09-10T05:14:11.857362864-05:00 2020-09-10 10:14:11.857275 I | op-osd: completed running osds in namespace openshift-storage

The must-gather doesn't show anything interesting in the OSD pod logs. In fact, there are no logs at all for the terminating OSD pods. The must-gather only shows logs for the OSD pods in the Pending state that are trying to start after the upgrade but are stuck waiting for the previous OSDs to terminate. We really need to see what is blocking the OSDs from terminating.

@Petr
1. Do you have a connection to the cluster that is in this state? Or can you get any logs for the OSDs that are in terminating state since the must-gather didn't seem to collect them? 
2. If you force delete an OSD pod, does it finally terminate and start the new pod? If that's the case, the volumes underneath the OSDs must not be releasing. IBM would need to look into that.
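For reference, the force delete and follow-up check from question 2 might look roughly like this (pod name taken from the listing above; the label selector assumes Rook's usual app=rook-ceph-osd label):

```
# Force delete one stuck OSD pod (name from the listing above) and watch
# whether its replacement starts or stays Pending on the same volume.
oc delete pod rook-ceph-osd-0-5648b79f69-76b8p -n openshift-storage --grace-period=0 --force
oc get pod -n openshift-storage -l app=rook-ceph-osd -w
```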

Comment 5 Travis Nielsen 2020-09-21 21:16:40 UTC
The OSD deployments have been updated, but the OSD pods are stuck terminating, so the upgrade is blocked. They are refusing to terminate, even with an "oc delete <pod>" to force delete them.

These are the OSDs that are stuck terminating: 
rook-ceph-osd-0-5648b79f69-76b8p                                  0/1     Terminating   0          46d
rook-ceph-osd-1-5fb88fbfdc-q9csg                                  0/1     Terminating   0          46d
rook-ceph-osd-2-7cfb859c79-fp26l                                  0/1     Terminating   0          46d

These are the PVs on which the OSDs are running:
ocs-deviceset-0-data-0-cmw5v   Bound    local-pv-b5d93336                          1787Gi     RWO            localblock                    46d
ocs-deviceset-1-data-0-jbq5j   Bound    local-pv-f91eefa8                          1787Gi     RWO            localblock                    46d
ocs-deviceset-2-data-0-xwndp   Bound    local-pv-188f0347                          1787Gi     RWO            localblock                    46d

"oc describe" on an osd pod shows that the container has exited, although there are no events on the pod.
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 06 Aug 2020 10:02:08 +0000
      Finished:     Thu, 06 Aug 2020 10:02:08 +0000

I would expect this means that the volumes underneath the OSDs are not being unmounted, and are holding the pod hostage. 

Rook has deployed the OSDs in "raw" mode, meaning that LVM would not need to free the devices.

@Petr, can someone from IBM investigate why their devices are not being freed?

Comment 6 Petr Balogh 2020-09-29 11:34:19 UTC
Hey Travis,

I've asked Murali from IBM Cloud team to take a look.

I tried to add his email (mumutyal.com) to the needinfo field, but it's failing, so I pinged Murali on Slack again.

Murali, can you please take a look?

Thanks

Comment 7 Murali Krishna 2020-09-30 06:32:48 UTC
We did not use the volumes and we don't have any data on them. The occurrence of this case is not frequent. Sometimes the OSD pods also get stuck during OCS uninstallation. We still have the cluster in the same state. Please let me know if you need any logs.

Comment 8 akgunjal@in.ibm.com 2020-10-27 11:38:22 UTC
Hey @Travis,

I have been seeing this issue consistently now in all IBM Cloud clusters, and it's present on both Classic and VPC Gen2. I see that the detach of the volumes used by the OSD pods is failing. I am pasting a snippet from the kubelet log below for reference.

```
Oct 22 06:34:33 kube-brc9edsd0jj7h6bgvdqg-ocsbm-default-00000378 kubelet.service: I1022 06:34:33.074051   15601 mount_linux.go:178] Detected OS with systemd
Oct 22 06:34:33 kube-brc9edsd0jj7h6bgvdqg-ocsbm-default-00000378 kubelet.service: E1022 06:34:33.078811   15601 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/local-volume/local-pv-fbb41034 podName:439cf04f-4144-411e-a398-bef87c93202c nodeName:}" failed. No retries permitted until 2020-10-22 06:34:33.578774714 -0500 CDT m=+51435.194625517 (durationBeforeRetry 500ms). Error: "UnmapVolume.UnmapBlockVolume failed for volume \"ocs-deviceset-2-data-0-4bt75\" (UniqueName: \"kubernetes.io/local-volume/local-pv-fbb41034\") pod \"439cf04f-4144-411e-a398-bef87c93202c\" (UID: \"439cf04f-4144-411e-a398-bef87c93202c\") : blkUtil.DetachFileDevice failed. globalUnmapPath:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034, podUID: 439cf04f-4144-411e-a398-bef87c93202c, bindMount: true: failed to unmount linkPath /var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c: unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c\nOutput: umount: /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c: target is busy.\n        (In some cases useful info about processes that use\n         the device is found by lsof(8) or fuser(1))\n"
```
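A minimal diagnostic sketch along the lines the error message itself suggests, using the node-specific paths copied from the log above:

```
# Paths are specific to this node and pod UID; copied from the log above.
BUSY=/var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c

# Processes (if any) still using the busy mount target, as umount's hint suggests:
fuser -vm "$BUSY"

# Loop devices, if any, backed by this publish file:
losetup -j "$BUSY"
```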

Comment 9 Travis Nielsen 2020-10-27 17:28:12 UTC
What storage class is being used for the local PVs, and what is the driver? It seems the driver is failing to unmount the volume because it is busy. We haven't seen this issue in other environments such as AWS/EBS, VMware, etc., so an understanding of the underlying device driver is needed.

Comment 10 akgunjal@in.ibm.com 2020-10-28 08:13:29 UTC
@Travis,

I have seen this same issue on both Classic and VPC clusters. The Classic cluster uses local volumes, as it is a bare metal server, and the VPC cluster uses remote block storage.

Storage class used for local volumes in Classic cluster:
```
Name:                  localblock
IsDefaultClass:        No
Annotations:           <none>
Provisioner:           kubernetes.io/no-provisioner
Parameters:            <none>
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>
```

Storage class used for remote block volumes in VPC cluster:
```
Name:            custom-10iops-tier
IsDefaultClass:  No
Annotations:     armada-service=addon-vpc-block-csi-driver,kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"armada-service":"addon-vpc-block-csi-driver","version":"2.0.3_385"},"creationTimestamp":"2020-10-20T08:33:36Z","labels":{"addonmanager.kubernetes.io/mode":"Reconcile","app":"ibm-vpc-block-csi-driver","razee/force-apply":"true"},"name":"custom-10iops-tier","resourceVersion":"6412","selfLink":"/apis/storage.k8s.io/v1/storageclasses/ibmc-vpc-block-10iops-tier","uid":"b2ed23fb-86fc-44fa-a64e-8d135462df3c"},"parameters":{"billingType":"hourly","classVersion":"1","csi.storage.k8s.io/fstype":"ext4","encrypted":"false","encryptionKey":"","profile":"10iops-tier","region":"","resourceGroup":"","sizeRange":"[10-2000]GiB","tags":"","zone":""},"provisioner":"vpc.block.csi.ibm.io","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
,version=2.0.3_385
Provisioner:           vpc.block.csi.ibm.io
Parameters:            billingType=hourly,classVersion=1,csi.storage.k8s.io/fstype=ext4,encrypted=false,encryptionKey=,profile=10iops-tier,region=,resourceGroup=,sizeRange=[10-2000]GiB,tags=,zone=
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>
```

The same issue is present with different storage classes, and there is no issue unmounting other volumes created with those classes.

Comment 11 Travis Nielsen 2020-10-28 18:41:06 UTC
Is this issue seen in clusters that are fresh-installed from OCS 4.5 or newer, or only when upgrading from OCS 4.4? There was an important change in OCS 4.5 where we stopped using LVM to create OSDs; LVM was causing a number of issues similar to this one by holding on to volumes. If the cluster is upgraded from 4.4, the OSDs continue to use LVM. So if you don't hit this issue on clusters installed fresh from OCS 4.5, it is very likely an issue with LVM.

Comment 12 akgunjal@in.ibm.com 2020-10-29 06:39:58 UTC
@Travis,

We have tested the below scenarios and all of them have this issue.

Scenario 1: Install fresh OCS 4.4 on OCP 4.5 cluster. Then upgrade OCS to 4.5 version by updating the subscription.

Scenario 2: Install fresh OCS 4.4 on OCP 4.4 cluster. Then upgrade OCS to 4.5 version by updating the subscription.

So the OCS upgrade fails with same issue on OCP 4.4 and 4.5 clusters for us.

Comment 13 akgunjal@in.ibm.com 2020-10-29 08:13:08 UTC
Scenario 3: Install fresh OCS 4.5 on OCP 4.5 cluster. Uninstall OCS 4.5 and OSDs remain in terminating state.

So basically OCS uninstall has issues across OCS 4.4 and OCS 4.5 versions.

Comment 14 Sébastien Han 2020-10-30 08:53:23 UTC
This could be similar to something we have experienced already.
When the Ceph OSD stops, it releases its lock on the block device; as a result, systemd-udevd intervenes and probes the block device by taking a lock on it.
Since this seems intermittent, what you observed could be a race between systemd-udevd finishing its probing and kubelet trying to unmount the block device.

However, the lock is only held for a few milliseconds, so eventually, when kubelet retries, the lock should be gone and the device should unmount.
Unfortunately, this doesn't really explain why the pod gets stuck in Terminating. A bug in OCP?

That's the only clue I have.
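One way to check for that race on the affected node is to watch udev block events while the OSD pod is deleted; a rough sketch (the grep patterns come from the kubelet errors quoted earlier in this bug):

```
# Terminal 1: watch systemd-udevd activity on block devices during OSD shutdown.
udevadm monitor --udev --subsystem-match=block

# Terminal 2: follow kubelet's unmap/unmount attempts for the stuck volume.
journalctl -u kubelet.service -f | grep -Ei 'UnmapBlockVolume|target is busy'
```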

Comment 15 Sahina Bose 2020-11-02 06:21:55 UTC
(In reply to leseb from comment #14)

The issue seems to be consistent on every upgrade and drain. And IBM ROKS does not see this with volumes from other storage classes. Could this still be an OCP issue?

How do we proceed?

Comment 16 Petr Balogh 2020-11-02 08:26:23 UTC
Is it possible that this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1892623 is related to this one?

Comment 17 Sébastien Han 2020-11-02 10:45:35 UTC
Sahina, I don't know how to proceed but we must understand what resource is holding that block. Also, we need to understand why the retry is not working.

Comment 18 Sahina Bose 2020-11-02 11:21:24 UTC
Akash, can we check what resource is using the device?

Comment 19 akgunjal@in.ibm.com 2020-11-03 12:36:38 UTC
Created attachment 1726191 [details]
output of lsof command

Comment 20 akgunjal@in.ibm.com 2020-11-03 12:37:38 UTC
@Sahina, Please find the lsblk output in addition to lsof.

sh-4.2# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda               8:0    0   1.8T  0 disk  
|-sda1            8:1    0   256M  0 part  /boot
|-sda2            8:2    0     1G  0 part  
`-sda3            8:3    0   1.8T  0 part  /
sdb               8:16   0 893.8G  0 disk  
`-sdb1            8:17   0 893.8G  0 part  
  `-docker_data 253:0    0 893.8G  0 crypt /var/data
sdc               8:32   0   1.8T  0 disk  
loop0             7:0    0   1.8T  0 loop  
loop1             7:1    0   1.8T  0 loop  
sh-4.2#
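A quick way to see what those loop0/loop1 entries are backed by (the same approach is used in the later comments):

```
# List all loop devices with their backing files; kubelet block-volume publish
# files show up here if a loop device is still attached.
losetup -a

# Or query by backing file (illustrative path under the kubelet volumeDevices dir):
losetup -j /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/<pv-name>/<pod-uid>
```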

Comment 23 Sahina Bose 2020-11-06 07:17:37 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1894836#c8.

Comment 24 Sébastien Han 2020-11-12 08:38:31 UTC
OK, so far I still don't have an answer on what is accessing the block device periodically. I can see that it happens, I just haven't found which process it is.
Something strange, though: the periodic task keeps complaining about a "busy" target even though the block device is not mounted anymore.

I still need to investigate the filesystem namespace route; I just didn't have the time. Yesterday was a national holiday in France and I'm busy with the Rook 1.5 release until tomorrow.
So I can resume next week, "maybe" a bit before the end of the week; it depends on how the release goes...

I have found similar issues:

* https://github.com/kubernetes/kubernetes/issues/85585
* https://github.com/longhorn/longhorn/issues/1719

This needs more investigation.

Thanks!

Comment 25 Sahina Bose 2020-11-12 09:06:22 UTC
Jan, could you provide any inputs here? This is similar to issue you commented on https://bugzilla.redhat.com/show_bug.cgi?id=1894836#c8

Comment 26 Sébastien Han 2020-11-12 11:21:15 UTC
OK, so something runs "lsblk --all --noheadings --pairs --output KNAME,PKNAME,TYPE,MOUNTPOINT" every 10 seconds. I assume it's LSO.
That's the only thing accessing the block device on that machine, with an OPEN and CLOSE_NOWRITE,CLOSE.

I'm having a hard time believing that this could be the reason for the "busy" target on unmount, but it's all we have so far.

We can repro the error by exec'ing inside the mount namespace of the kubelet PID:

sh-4.2# nsenter -t 15619 --m

-sh-4.2# mount|grep /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8
devtmpfs on /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8 type devtmpfs (rw,seclabel,size=32826904k,nr_inodes=8206726,mode=755)

-sh-4.2# umount /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8
umount: /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))


The problem is that the loop device is still holding the mount:

-sh-4.2# losetup -a
/dev/loop3: [0005]:12315 (/var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8)
-sh-4.2#
-sh-4.2#
-sh-4.2# losetup -d /dev/loop3
-sh-4.2# losetup -a
-sh-4.2#
-sh-4.2#
-sh-4.2#
-sh-4.2# umount --verbose /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8
umount: /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-b3698db3/f78ff720-d582-44d3-abc8-b18a46726ef8 (devtmpfs) unmounted

Et voilà!

So before attempting to umount Kubernetes should release the loop device.
Should we move this to OCP? Sahina?

Comment 27 Jan Safranek 2020-11-12 14:50:40 UTC
"lsblk --all" can't hold a device for too long (unless something is really broken).

> We can repro the error by exec'ing inside the mount namespace of the kubelet PID:
> sh-4.2# nsenter -t 15619 --m

Wait, do you run kubelet in a different namespace than the host? That's quite unusual and I am not quite sure it's supported. It was very problematic in OCP 3.x and we run kubelet without any namespaces in 4.x.

> So before attempting to umount Kubernetes should release the loop device.

Kubernetes releases the loop device under normal conditions. Can you reproduce it without OCS, i.e. just with LSO + a pod?

Comment 28 Sébastien Han 2020-11-12 15:40:52 UTC
(In reply to Jan Safranek from comment #27)
> "lsblk --all" can't hold a device for too long (unless something is really
> broken).

Agreed; as pointed out earlier, it's not an issue, but it is still something accessing the block device.

> 
> > We can repro the error by exec'ing inside the mount namespace of the kubelet PID:
> > sh-4.2# nsenter -t 15619 --m
> 
> Wait, do you run kubelet in a different namespace than the host? That's
> quite unusual and I am not quite sure it's supported. It was very
> problematic in OCP 3.x and we run kubelet without any namespaces in 4.x.

To be honest, I have no idea; I just debugged the setup and reported my findings.
At least, the namespace does not seem to be the one from the host.
The block path was mounted as devtmpfs inside my "oc debug" container, but unmounting it there did nothing and the issue was not reproducible.
Only with nsenter could I reproduce what kubelet was printing in the logs.

> 
> > So before attempting to umount Kubernetes should release the loop device.
> 
> Kubernetes releases the loop device under normal conditions. Can you
> reproduce it without OCS, i.e. just with LSO + a pod?

Petr? Someone else?

Comment 29 Petr Balogh 2020-11-12 17:41:22 UTC
I think Akash tried with another application and didn't see the issue, IIRC. I haven't done any further testing on that outside of OCS.
I am going to be on PTO till Tuesday, so I can take a look on or after Wednesday.

Akash, maybe you can assist here to provide feedback?

Thanks

Comment 30 akgunjal@in.ibm.com 2020-11-13 11:17:09 UTC
@leseb, @Petr,

We have tried it in an IBM Cloud VPC cluster where we don't use LSO. The volumes used by the OSDs are remote block volumes created by our VPC plugin, so this does not seem LSO-related to me. I had already indicated earlier that when I run lsblk on the worker I see loop0-style devices, which are holding the lock.

I have already tried creating a different PVC from VPC volumes and was able to unmount it easily. It's getting stuck only for the OSDs, and it's not using LSO.

Comment 31 Cesar Wong 2020-11-17 15:23:29 UTC
@rphillips this looks like a kubelet issue, is it something you can help with?

Comment 32 Sahina Bose 2020-11-18 06:40:20 UTC
Created OCP bug 1898811. 

Ryan, please take a look

Comment 33 Jan Safranek 2020-11-24 15:08:00 UTC
> It's getting stuck only for the OSDs, and it's not using LSO.

I'm glad to hear that. 

Trying to sum up the previous findings and bug #1894836: we have a block volume that's used by something and we still don't know what it is?

> Oct 22 06:34:33 kube-brc9edsd0jj7h6bgvdqg-ocsbm-default-00000378 kubelet.service: E1022 06:34:33.078811   15601 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/local-volume/local-pv-fbb41034 podName:439cf04f-4144-411e-a398-bef87c93202c nodeName:}" failed. No retries permitted until 2020-10-22 06:34:33.578774714 -0500 CDT m=+51435.194625517 (durationBeforeRetry 500ms). Error: "UnmapVolume.UnmapBlockVolume failed for volume \"ocs-deviceset-2-data-0-4bt75\" (UniqueName: \"kubernetes.io/local-volume/local-pv-fbb41034\") pod \"439cf04f-4144-411e-a398-bef87c93202c\" (UID: \"439cf04f-4144-411e-a398-bef87c93202c\") : blkUtil.DetachFileDevice failed. globalUnmapPath:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034, podUID: 439cf04f-4144-411e-a398-bef87c93202c, bindMount: true: failed to unmount linkPath /var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c: unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c\nOutput: umount: /var/data/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-fbb41034/439cf04f-4144-411e-a398-bef87c93202c: target is busy.\n        (In some cases useful info about processes that use\n         the device is found by lsof(8) or fuser(1))\n"

Once this volume is unmounted, does kubelet remove /dev/loop0?

The best would probably be if I could ssh into such a problematic node when it happens again and look at it from OCP perspective.

Comment 34 Jan Safranek 2020-11-24 15:08:42 UTC
*** Bug 1898811 has been marked as a duplicate of this bug. ***

Comment 35 akgunjal@in.ibm.com 2020-11-24 16:01:43 UTC
The Red Hat team has the cluster where we hit the issue, and we have kept it for debugging.

Comment 36 Jan Safranek 2020-11-26 11:47:52 UTC
Good news, there is a bug in OCP!

I debugged the cluster with the API server at https://c115-e.us-south.containers.cloud.ibm.com

On node 10.240.129.31 I found three volumes that are reported as "busy" in recent kubelet logs:
$ journalctl -r | grep busy | head -n 100 | grep -o "pvc-[0-9a-f-]*" | sort | uniq
pvc-171db671-0f81-4279-8eba-03e09ef51b6d
pvc-32704cd6-a242-494c-8a47-7e6c213f413f
pvc-783f4fd2-27bb-4d75-a539-855cbaf59636

Picking pvc-32704cd6-a242-494c-8a47-7e6c213f413f and its usage by pod 2ffdeda3-8aed-4bee-9793-9719878c5677: the pod was started at Nov 25 10:15:44 and its volumes were set up at Nov 25 10:16:24. The pod was crashlooping (I did not dig into the details) and it was deleted at Nov 25 10:17:34:

Nov 25 10:17:34 kube-bufu962d064ogc6i6s8g-ocsnoobaa-default-00000459 kubelet.service[100500]: I1125 10:17:34.317948  100500 kubelet.go:1942] SyncLoop (DELETE, "api"): "rook-ceph-osd-prepare-ocs-deviceset-1-0-947zv-jtwjt_openshift-storage(2ffdeda3-8aed-4bee-9793-9719878c5677)"

Tearing down of the PVC mounts/loop devices failed:

Nov 25 10:17:34 kube-bufu962d064ogc6i6s8g-ocsnoobaa-default-00000459 kubelet.service[100500]: W1125 10:17:34.467669  100500 volume_path_handler_linux.go:61] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-32704cd6-a242-494c-8a47-7e6c213f413f/dev/2ffdeda3-8aed-4bee-9793-9719878c5677"

This means kubelet can't remove the loop device and therefore various bind-mounts that kubelet does can't be unmounted.

Looking into code, "couldn't find loopback device" most probably means that kubelet can't parse output of "losetup -j /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-32704cd6-a242-494c-8a47-7e6c213f413f/dev/2ffdeda3-8aed-4bee-9793-9719878c5677" (if losetup failed, we would see some "exit code 1" in the log, but we don't see it...)

"losetup -j <the device>" shows:
/dev/loop5: [0005]:193908341 (/var/data/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-32704cd6-a242-494c-8a47-7e6c213f413f/dev/2ffdeda3-8aed-4bee-9793-9719878c5677 (deleted))

And here is the bug: kubelet looks for "/var/lib/kubelet/plugins/...", whereas losetup shows "/var/data/kubelet/...", because /var/lib/kubelet is a symlink to /var/data/kubelet.

There have been numerous issues with /var/lib/kubelet being a symlink (or with a symlink somewhere in the path), and we gradually fix them all. As seen in this example, new bugs keep popping up. It is *much* safer not to have symlinks anywhere in /var/lib/kubelet; use mounts if you really want it on a different device.

I can fix OCP, but it will take some time. As a workaround, please try to use (bind?) mounts instead of symlinks. As cleanup, remove the loopback devices of deleted pods.
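A rough sketch of that cleanup on an affected node, under the assumptions above (symlinked /var/lib/kubelet, stale loop devices); this is illustrative, not a supported procedure:

```
# The symlink that confuses kubelet's parsing of the losetup output:
readlink -f /var/lib/kubelet      # prints /var/data/kubelet on these nodes

# List loop devices backed by kubelet block-volume publish files; for pods that
# have already been deleted, detaching the loop device lets the pending unmount
# finally succeed:
losetup -a | grep 'kubelet.*volumeDevices'
# after confirming the pod UID in the backing-file path belongs to a deleted pod:
# losetup -d /dev/loopN
```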


As a side note, using a block PVC plus the simplest possible pod with "sleep 9999" that uses the PVC, I was able to reproduce it without OCS.
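A sketch of that kind of reproducer, with placeholder storage class and image (any RWO storage class that supports volumeMode: Block should do):

```
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc-repro
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  storageClassName: localblock        # placeholder storage class
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-pvc-sleeper
spec:
  containers:
  - name: sleeper
    image: registry.access.redhat.com/ubi8/ubi   # placeholder image that provides sleep
    command: ["sleep", "9999"]
    volumeDevices:
    - name: data
      devicePath: /dev/xvda                      # arbitrary device path inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc-repro
EOF

# Delete the pod and watch whether it gets stuck in Terminating:
oc delete pod block-pvc-sleeper --wait=false
oc get pod block-pvc-sleeper -w
```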

Comment 37 Jan Safranek 2020-11-26 14:06:20 UTC
Upstream issue: https://github.com/kubernetes/kubernetes/issues/96888
Upstream PR: https://github.com/kubernetes/kubernetes/pull/96889

(upstream is frozen, it will take some time to merge)

Comment 38 Jan Safranek 2020-12-04 15:07:45 UTC
Waiting for upstream to un-freeze...

Comment 39 Sahina Bose 2020-12-07 08:28:59 UTC
*** Bug 1894836 has been marked as a duplicate of this bug. ***

Comment 41 Juerg Ritter 2020-12-17 17:48:56 UTC
Hi,

I'm currently troubleshooting an extremely similar issue, not on IBM Cloud but on top of the vSphere CSI driver (version 2.0.1). The symptoms are:
- Ceph OSD Pods remain in Terminating state
- Reproducible with a dummy pod and an attached block PVC (filesystem PVCs seem to work without issues)
- Pod terminates after unmounting the devtmpfs mount in /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/ on the host
- Kubelet tells me "couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-2dc62ba2-96ed-4bab-a412-b04c34f031c5/dev/3a981daa-d7a7-4369-a0ae-e223d074b3d9"
- vsphere-csi-node pod tells me "/csi.v1.Node/NodeUnpublishVolume: REP 0477: rpc error: code = Internal desc = error getting block device for volume: 351c47f9-6b5d-42fb-ac77-50dea17a1333, err: lstat devtmpfs: 
no such file or directory"

I only see this when using block PVCs on the vSphere CSI storage class; when using the ocs-storagecluster-ceph-rbd or thin (vSphere in-tree) storage classes, block PVCs work. /var/lib/kubelet is NOT a symlink in my setup (standard OCP).

Now to the crux of my question: above you mention that this is a bug in OCP. Is it possible that I am also hitting this bug with the vSphere CSI driver? I am currently running OCP 4.5.19, OCS 4.5.1, and CSI driver 2.0.1.

I'll attach a file where I documented my steps during troubleshooting.

--Juerg

Comment 42 Juerg Ritter 2020-12-17 17:51:04 UTC
Created attachment 1740041 [details]
Block PVC detach testcases

This file includes the procedure I followed for reproducing the OCS problem I encountered while using block PVCs on the new CSI storage class.

Comment 43 Jan Safranek 2021-01-04 12:32:37 UTC
Juerg, comment #41 looks like a different bug, unless there is a symlink somewhere in the /var/lib/kubelet path. My gut feeling is that the vSphere driver does not implement idempotency correctly: NodeUnpublish on an unmounted/missing volume should return success, and this is a common error in CSI drivers. But it could be a bug in OCP too. I found this upstream: https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/338

Comment 44 Juerg Ritter 2021-01-04 13:16:25 UTC
Jan, thanks for your feedback and the GitHub issue. I added a comment there, describing what I am observing.

Comment 45 Juerg Ritter 2021-01-06 14:40:26 UTC
Jan, comment #41 turned out to be related to an RHCOS incompatibility in NodeUnpublishVolume of the vSphere CSI driver. I came up with a simple fix and created a PR: https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/572

Comment 47 Qin Ping 2021-01-18 07:40:46 UTC
Verified with: 4.7.0-0.nightly-2021-01-17-211555

Comment 50 errata-xmlrpc 2021-02-24 15:17:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633