Bug 1952211

Summary: cascading mounts happening exponentially when deleting openstack-cinder-csi-driver-node pods
Product: OpenShift Container Platform Reporter: Anshul Verma <ansverma>
Component: Storage Assignee: Mike Fedosin <mfedosin>
Storage sub component: OpenStack CSI Drivers QA Contact: rlobillo
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: adduarte, aos-bugs, mbagga, mfedosin, palonsor, pprinett, tkimura, wking
Version: 4.7 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The folder /var/lib/kubelet was mounted twice in the Cinder CSI Node Controller container. Consequence: When running the Cinder CSI Node Controller, it does not start and throws an error about being unable to mount /var/lib/kubelet/pods because no space is left on the device. Fix: Remove the duplicate mounts of /var/lib/kubelet and /var/lib/kubelet/pods that caused the error. Result: The driver always runs successfully.
Story Points: ---
Clone Of:
: 2025444 2026197 (view as bug list) Environment:
Last Closed: 2021-07-27 23:02:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2016286    

Description Anshul Verma 2021-04-21 18:46:04 UTC
Description of problem:

This is about the Upstream issue -
https://github.com/kubernetes/cloud-provider-openstack/issues/772

When the CSI driver pod is created through its DaemonSet, it contains the following mounts -
~~~
        volumeMounts:
        - mountPath: /var/lib/kubelet/pods
          mountPropagation: Bidirectional
          name: pods-mount-dir
        - mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional
          name: kubelet-dir
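          # note: pods-mount-dir above targets a path nested inside kubelet-dir, so the
          # same host directory tree ends up mounted twice, both with Bidirectional propagation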

      volumes:
      - hostPath:
          path: /var/lib/kubelet
          type: Directory
        name: kubelet-dir
      - hostPath:
          path: /var/lib/kubelet/pods
          type: Directory
        name: pods-mount-dir
~~~

When an `openstack-cinder-csi-driver-node` pod created through this DaemonSet is deleted multiple times, the number of mount entries for `/var/lib/kubelet/pods` keeps increasing exponentially with every restart of the pod.
~~~
[root@vm ~]# findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
127
~~~
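
For reference, the growth can be reproduced with a simple delete loop (a sketch; the namespace and pod label match the driver's DaemonSet, `<node-name>` is a placeholder for the node under test):
~~~
# delete the node pod repeatedly; the DaemonSet recreates it each time
for i in {1..10}; do
  oc delete pod -n openshift-cluster-csi-drivers \
    -l app=openstack-cinder-csi-driver-node \
    --field-selector spec.nodeName=<node-name>
  sleep 30   # give the DaemonSet time to recreate the pod
done
# then, on the node itself, watch the count grow:
#   findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
~~~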

When this number exceeds 255, the following error is seen -
~~~
  Warning  Failed  9s  kubelet, master2  Error: container create failed: time="2021-04-07T09:41:40Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-04-07T09:41:41Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods\" to rootfs at \"/var/lib/containers/storage/overlay/bed6a1f4bd7769f025ce7179358d7ad61cf1af681e4b9ba65ca07e0048584e45/merged/var/lib/kubelet/pods\" caused: no space left on device"
~~~

When this hostPath volume and mount block is removed from the DaemonSet, the number of mount entries is much lower -
~~~
[root@vm ~]# findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
3
~~~

There is a pull request that removes this volume and mount block from the DaemonSet -
https://github.com/kubernetes/cloud-provider-openstack/pull/773
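
For reference, once a build containing that change is deployed, the removal can be checked on a cluster with something like the following (a sketch; the DaemonSet name is assumed from the pod label, and the jq filter is only illustrative):
~~~
# expect no output: no volume in the DaemonSet should point at /var/lib/kubelet/pods anymore
oc get daemonset openstack-cinder-csi-driver-node -n openshift-cluster-csi-drivers -o json \
  | jq '.spec.template.spec.volumes[] | select(.hostPath.path == "/var/lib/kubelet/pods")'
~~~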

Although this is an upstream Kubernetes fix, it should also be applied on the OpenShift side in -
https://github.com/openshift/cloud-provider-openstack

Let me know if anything else is required.

Comment 1 Mike Fedosin 2021-04-22 06:40:52 UTC
As I see it, we just need to backport the upstream fix, right? Let's do it then.

Comment 2 Anshul Verma 2021-04-23 13:55:48 UTC
(In reply to Mike Fedosin from comment #1)
> As I see we just need to backport the upstream fix, right? Let's do it then.

Yes, it seems so. Please keep me apprised on the progress.

Along with that, please also check why a few mounts were still present after restarting the pod even with the hostPath volume and mount block removed -

> When this hostPath volume and mount block is removed from the DaemonSet, the number of mount entries is much lower -
> ~~~
> [root@vm ~]# findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
> 3
> ~~~

Are these expected?
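
For reference, the remaining entries can be inspected in more detail on the node with findmnt's source and propagation columns (standard util-linux options):
~~~
findmnt -o TARGET,SOURCE,PROPAGATION | grep '/var/lib/kubelet/pods'
~~~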

Comment 5 rlobillo 2021-06-02 11:48:33 UTC
Verified on 4.8.0-0.nightly-2021-05-29-114625 over OSP16.1 (RHOS-16.1-RHEL-8-20210323.n.0).

clusteroperator storage is fully functional after IPI installation: 

$ oc get clusteroperators storage 
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-29-114625   True        False         False      46h


and the manifest changes are present (there is no longer any volume targeting /var/lib/kubelet/pods):

$ oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o json | jq '.items[0].spec.volumes[]'
{
  "hostPath": {
    "path": "/var/lib/kubelet/plugins/cinder.csi.openstack.org",
    "type": "DirectoryOrCreate"
  },
  "name": "socket-dir"
}
{
  "hostPath": {
    "path": "/var/lib/kubelet/plugins_registry/",
    "type": "Directory"
  },
  "name": "registration-dir"
}
{
  "hostPath": {
    "path": "/var/lib/kubelet",
    "type": "Directory"
  },
  "name": "kubelet-dir"
}
{
  "hostPath": {
    "path": "/dev",
    "type": "Directory"
  },
  "name": "pods-probe-dir"
}
{
  "name": "secret-cinderplugin",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "clouds.yaml",
        "path": "clouds.yaml"
      }
    ],
    "secretName": "openstack-cloud-credentials"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "cloud.conf",
        "path": "cloud.conf"
      }
    ],
    "name": "openstack-cinder-config"
  },
  "name": "config-cinderplugin"
}
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "ca-bundle.pem",
        "path": "ca-bundle.pem"
      }
    ],
    "name": "cloud-provider-config",
    "optional": true
  },
  "name": "cacert"
}
{
  "name": "openstack-cinder-csi-driver-node-sa-token-2pn5x",
  "secret": {
    "defaultMode": 420,
    "secretName": "openstack-cinder-csi-driver-node-sa-token-2pn5x"
  }
}

Inside the csi-driver pod, as expected, there is no separate mount on /var/lib/kubelet/pods; only /var/lib/kubelet is mounted:

$ oc rsh -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o NAME)
Defaulted container "csi-driver" out of: csi-driver, node-driver-registrar
sh-4.4# findmnt -D | grep '/var/lib/kubelet/pods$'
sh-4.4# findmnt -D | grep '/var/lib/kubelet$' 
/dev/vda4[/ostree/deploy/rhcos/var/lib/kubelet]                                                                                               xfs      39.5G  8.9G 30.6G  23% /var/lib/kubelet
sh-4.4# 

After restarting the pod 100 times, the system remains stable:

$ for i in {1..100}; do echo $i; oc delete -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o name); done
...

[stack@undercloud-0 ~]$ oc rsh -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o NAME)
sh-4.4# findmnt -D | grep '/var/lib/kubelet$'
/dev/vda4[/ostree/deploy/rhcos/var/lib/kubelet]                                                                                               xfs      39.5G  8.8G 30.7G  22% /var/lib/kubelet
[stack@undercloud-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-29-114625   True        False         47h     Cluster version is 4.8.0-0.nightly-2021-05-29-114625
[stack@undercloud-0 ~]$ oc get clusteroperator storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-29-114625   True        False         False      2d

Comment 9 errata-xmlrpc 2021-07-27 23:02:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 10 Pierre Prinetti 2021-11-22 09:08:30 UTC
I see that there is a NEEDINFO open on this bug; however, it seems that in the meantime this bug has been closed as fixed.

If the solution did not work or if additional information is required, please ask again below.

Otherwise, the team considers this to be fixed and closed.