Bug 1952211 - cascading mounts happening exponentially when deleting openstack-cinder-csi-driver-node pods
Summary: cascading mounts happening exponentially on when deleting openstack-cinder-cs...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mike Fedosin
QA Contact: rlobillo
URL:
Whiteboard:
Depends On:
Blocks: 2016286
Reported: 2021-04-21 18:46 UTC by Anshul Verma
Modified: 2021-11-24 05:08 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The /var/lib/kubelet directory was mounted twice in the Cinder CSI Node Controller container. Consequence: The Cinder CSI Node Controller failed to start and reported that it could not mount /var/lib/kubelet/pods because no space was left on the device. Fix: The duplicate mount of /var/lib/kubelet and /var/lib/kubelet/pods, which caused the error, was removed. Result: The driver runs successfully.
Clone Of:
: 2025444 2026197 (view as bug list)
Environment:
Last Closed: 2021-07-27 23:02:36 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cloud-provider-openstack pull 52 | 0 | None | closed | Bug 1952211: Merge tag 'v1.21.0' into openshift-master | 2021-05-25 11:23:49 UTC
Github | openshift openstack-cinder-csi-driver-operator pull 41 | 0 | None | closed | Bug 1952211: Fix error when mounting /var/lib/kubelet/pods | 2021-05-25 11:23:51 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:02:54 UTC

Description Anshul Verma 2021-04-21 18:46:04 UTC
Description of problem:

This bug concerns the upstream issue:
https://github.com/kubernetes/cloud-provider-openstack/issues/772

When the CSI driver pod is created through its DaemonSet, it contains the following mounts:
~~~
        volumeMounts:
        - mountPath: /var/lib/kubelet/pods
          mountPropagation: Bidirectional
          name: pods-mount-dir
        - mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional
          name: kubelet-dir

      volumes:
      - hostPath:
          path: /var/lib/kubelet
          type: Directory
        name: kubelet-dir
      - hostPath:
          path: /var/lib/kubelet/pods
          type: Directory
        name: pods-mount-dir
~~~

When an `openstack-cinder-csi-driver-node` pod created through this DaemonSet is deleted repeatedly, the number of mount entries for `/var/lib/kubelet/pods` grows exponentially with every restart of the pod:
~~~
[root@vm ~]# findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
127
~~~
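
The count above can be reproduced with a small helper. This is a minimal sketch, not part of the original report: `count_target_mounts` is a hypothetical function name, and it assumes that the mount point is the last column of `findmnt -D` output and that paths contain no spaces.

```shell
#!/bin/sh
# count_target_mounts TARGET  (hypothetical helper, not from the report)
# Reads `findmnt -D`-style rows on stdin and prints how many of them have
# TARGET as their last column. Assumes mount points contain no spaces.
count_target_mounts() {
  awk -v target="$1" '$NF == target { n++ } END { print n + 0 }'
}

# Example use on a node (not executed here):
#   findmnt -D | count_target_mounts /var/lib/kubelet/pods
```

Comparing the last column exactly avoids depending on a regular expression anchored at the end of the line, and the helper prints 0 rather than nothing when no row matches.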

When the number of mount entries exceeds 255, the following error is seen:
~~~
  Warning  Failed  9s  kubelet, master2  Error: container create failed: time="2021-04-07T09:41:40Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-04-07T09:41:41Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods\" to rootfs at \"/var/lib/containers/storage/overlay/bed6a1f4bd7769f025ce7179358d7ad61cf1af681e4b9ba65ca07e0048584e45/merged/var/lib/kubelet/pods\" caused: no space left on device"
~~~
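
A rough guard against hitting this failure might compare the current count with the limit. The sketch below is illustrative only: `check_mount_count` is a hypothetical helper, the default limit of 255 is simply the figure quoted in this report, and the halfway warning threshold is an arbitrary choice.

```shell
#!/bin/sh
# check_mount_count COUNT [LIMIT]  (hypothetical helper, not from the report)
# LIMIT defaults to 255, the figure quoted in this bug report.
# Exit status: 0 = ok, 1 = at least half of the limit, 2 = over the limit.
check_mount_count() {
  count=$1
  limit=${2:-255}
  if [ "$count" -gt "$limit" ]; then
    echo "over limit: $count mounts (limit $limit)"
    return 2
  elif [ "$count" -ge $((limit / 2)) ]; then
    echo "warning: $count mounts, at least half of $limit"
    return 1
  fi
  echo "ok: $count mounts"
}
```

The halfway warning is chosen because the report describes the count as growing exponentially, so a node at half the limit may be close to failing on a subsequent restart.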

When the `pods-mount-dir` hostPath volume and its mount are removed from the DaemonSet, the number of mount entries is much lower:
~~~
[root@vm ~]# findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
3
~~~

An upstream pull request has been created that simply removes this volume and mount block from the DaemonSet:
https://github.com/kubernetes/cloud-provider-openstack/pull/773
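
To spot the same pattern in other manifests, one could look for containers that mount a directory and one of its subdirectories with `Bidirectional` propagation. The following is a heuristic sketch under assumptions: `find_nested_bidirectional` is a hypothetical helper, its three-column input format (container name, mountPath, mountPropagation) is invented for this example, and the `jq` filter and DaemonSet name in the comment are assumed rather than taken from the report.

```shell
#!/bin/sh
# find_nested_bidirectional  (hypothetical helper, not from the report)
# Reads lines of "CONTAINER MOUNTPATH PROPAGATION" on stdin and reports,
# per container, Bidirectional mount paths nested under another one.
find_nested_bidirectional() {
  awk '
    $3 == "Bidirectional" { c[++n] = $1; p[n] = $2 }
    END {
      for (i = 1; i <= n; i++) {
        for (j = 1; j <= n; j++) {
          if (i == j || c[i] != c[j]) continue
          parent = p[i]
          sub(/\/+$/, "", parent)          # drop trailing slashes
          if (index(p[j], parent "/") == 1)
            printf "%s: %s is nested under %s\n", c[i], p[j], p[i]
        }
      }
    }'
}

# Assumed way to produce the input (DaemonSet name is a guess):
#   oc get daemonset openstack-cinder-csi-driver-node \
#     -n openshift-cluster-csi-drivers -o json |
#   jq -r '.spec.template.spec.containers[] | .name as $c
#          | .volumeMounts[]?
#          | "\($c) \(.mountPath) \(.mountPropagation // "None")"' |
#   find_nested_bidirectional
```

For the mounts listed in this report, the helper would flag `/var/lib/kubelet/pods` as nested under `/var/lib/kubelet`. It only checks path prefixes, so a match is a hint to review the manifest, not proof of a problem.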

Although this is an upstream Kubernetes issue, it should also be fixed on the OpenShift side, in:
https://github.com/openshift/cloud-provider-openstack

Let me know if anything else is required.

Comment 1 Mike Fedosin 2021-04-22 06:40:52 UTC
As I see it, we just need to backport the upstream fix, right? Let's do it then.

Comment 2 Anshul Verma 2021-04-23 13:55:48 UTC
(In reply to Mike Fedosin from comment #1)
> As I see it, we just need to backport the upstream fix, right? Let's do it then.

Yes, it seems so. Please keep me apprised of the progress.

Along with that, please also look into why a few mounts were still present after restarting the pod with the hostPath volume and mount block removed:
~~
When the `pods-mount-dir` hostPath volume and its mount are removed from the DaemonSet, the number of mount entries is much lower:
~~~
[root@vm ~]# findmnt -D | grep '/var/lib/kubelet/pods$' | wc -l
3
~~~
~~
Are these expected?

Comment 5 rlobillo 2021-06-02 11:48:33 UTC
Verified on 4.8.0-0.nightly-2021-05-29-114625 over OSP16.1 (RHOS-16.1-RHEL-8-20210323.n.0).

The storage clusteroperator is fully functional after IPI installation:

$ oc get clusteroperators storage 
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-29-114625   True        False         False      46h


The manifest changes are present (no volume targets /var/lib/kubelet/pods):

$ oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o json | jq '.items[0].spec.volumes[]'
{
  "hostPath": {
    "path": "/var/lib/kubelet/plugins/cinder.csi.openstack.org",
    "type": "DirectoryOrCreate"
  },
  "name": "socket-dir"
}
{
  "hostPath": {
    "path": "/var/lib/kubelet/plugins_registry/",
    "type": "Directory"
  },
  "name": "registration-dir"
}
{
  "hostPath": {
    "path": "/var/lib/kubelet",
    "type": "Directory"
  },
  "name": "kubelet-dir"
}
{
  "hostPath": {
    "path": "/dev",
    "type": "Directory"
  },
  "name": "pods-probe-dir"
}
{
  "name": "secret-cinderplugin",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "clouds.yaml",
        "path": "clouds.yaml"
      }
    ],
    "secretName": "openstack-cloud-credentials"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "cloud.conf",
        "path": "cloud.conf"
      }
    ],
    "name": "openstack-cinder-config"
  },
  "name": "config-cinderplugin"
}
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "ca-bundle.pem",
        "path": "ca-bundle.pem"
      }
    ],
    "name": "cloud-provider-config",
    "optional": true
  },
  "name": "cacert"
}
{
  "name": "openstack-cinder-csi-driver-node-sa-token-2pn5x",
  "secret": {
    "defaultMode": 420,
    "secretName": "openstack-cinder-csi-driver-node-sa-token-2pn5x"
  }
}

Inside the csi-driver container, as expected, nothing is mounted at /var/lib/kubelet/pods; the only mount is at /var/lib/kubelet:

$ oc rsh -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o NAME)
Defaulted container "csi-driver" out of: csi-driver, node-driver-registrar
sh-4.4# findmnt -D | grep '/var/lib/kubelet/pods$'
sh-4.4# findmnt -D | grep '/var/lib/kubelet$' 
/dev/vda4[/ostree/deploy/rhcos/var/lib/kubelet]                                                                                               xfs      39.5G  8.9G 30.6G  23% /var/lib/kubelet
sh-4.4# 

After restarting the pod 100 times, the system remains stable:

$ for i in {1..100}; do echo $i; oc delete -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o name); done
...

[stack@undercloud-0 ~]$ oc rsh -n openshift-cluster-csi-drivers $(oc get pods -n openshift-cluster-csi-drivers -l app=openstack-cinder-csi-driver-node --field-selector spec.nodeName=ostest-snz9z-worker-0-jgqhd -o NAME)
sh-4.4# findmnt -D | grep '/var/lib/kubelet$'
/dev/vda4[/ostree/deploy/rhcos/var/lib/kubelet]                                                                                               xfs      39.5G  8.8G 30.7G  22% /var/lib/kubelet
[stack@undercloud-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-29-114625   True        False         47h     Cluster version is 4.8.0-0.nightly-2021-05-29-114625
[stack@undercloud-0 ~]$ oc get clusteroperator storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-29-114625   True        False         False      2d

Comment 9 errata-xmlrpc 2021-07-27 23:02:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 10 Pierre Prinetti 2021-11-22 09:08:30 UTC
I see that there is a NEEDINFO open on this bug, however it seems that in the meantime this bug has been closed as fixed.

If the solution did not work or if additional information is required, please ask again down below here.

Otherwise, the team considers this to be fixed and closed.

