Bug 2032924 - PVs are not being cleaned up after PVC deletion
Summary: PVs are not being cleaned up after PVC deletion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jonathan Dobson
QA Contact: Chao Yang
URL:
Whiteboard:
Duplicates: 2219918
Depends On:
Blocks: 2052756
 
Reported: 2021-12-15 13:42 UTC by Pablo Rodriguez Guillamon
Modified: 2023-09-19 12:49 UTC
CC List: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:40:43 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
- GitHub openshift/local-storage-operator pull 320 (open): Bug 2032924: PVs are not being cleaned up after PVC deletion (last updated 2022-02-08 22:58:36 UTC)
- GitHub openshift/sig-storage-local-static-provisioner pull 38 (open): Bug 2032924: PVs are not being cleaned up after PVC deletion (last updated 2022-02-08 19:40:37 UTC)
- Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:41:00 UTC)

Comment 7 Pablo Rodriguez Guillamon 2022-01-24 10:01:44 UTC
Hi @jdobson, do we have any findings on this issue?

The customer just got back to us letting us know they were able to trigger this same issue on 4.8.27.

They were running 4.8.20 without any problems, and the issue started happening as soon as they upgraded to 4.8.27.

We do have must-gathers and LSO logs. Please let me know if you want me to attach them to the bug.

Best regards,

Comment 9 Jonathan Dobson 2022-01-26 01:07:59 UTC
Alright, I was able to reproduce this twice today with the following steps.
How long it takes to reproduce varies, but both times it took less than 10 minutes for 2032924-delete-recreate-loop.sh to trigger the issue.

1) Create LV

[jon@dagobah deploy]$ cat lv-loop0.yaml
apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "lv-loop0"
  namespace: "openshift-local-storage"
spec:
  storageClassDevices:
    - storageClassName: "lv-loop0"
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/loop0

[jon@dagobah deploy]$ oc apply -f lv-loop0.yaml
localvolume.local.storage.openshift.io/lv-loop0 created
[jon@dagobah deploy]$ oc get localvolumes -n openshift-local-storage
NAME       AGE
lv-loop0   18s
[jon@dagobah deploy]$ oc get pv | grep lv-loop0
local-pv-e739fe1b   1Gi        RWO            Delete           Available                                                         lv-loop0                22s

2) Create PVC and Pod

[jon@dagobah deploy]$ cat pvc-pod-lv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pv-claim1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  volumeName: local-pv-e739fe1b
  storageClassName: lv-loop0
  volumeMode: Filesystem
---
apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    app: example
spec:
  volumes:
  - name: claim1-vol
    persistentVolumeClaim:
      claimName: local-pv-claim1
  containers:
  - image: busybox
    command:
      - "/bin/sh"
      - "-ec"
      - "touch \"/mnt/claim1/$(date -Iseconds).txt\"; sleep 604800"
    imagePullPolicy: IfNotPresent
    name: busybox
    volumeMounts:
      - mountPath: "/mnt/claim1"
        name: claim1-vol
  restartPolicy: Always

[jon@dagobah deploy]$ oc apply -f pvc-pod-lv.yaml
persistentvolumeclaim/local-pv-claim1 created
pod/example created
[jon@dagobah deploy]$ oc get pods
NAME      READY   STATUS    RESTARTS   AGE
example   1/1     Running   0          17s

3) Delete and recreate PVC and Pod in a loop

[jon@dagobah deploy]$ cat 2032924-delete-recreate-loop.sh 
#!/bin/bash
set -x

year=$(date +%Y)
claimdir=/mnt/claim1
podname=pod/example
oc apply -f pvc-pod-lv.yaml
sleep 30

while true; do
        oc delete -f pvc-pod-lv.yaml
        sleep 10
        oc apply -f pvc-pod-lv.yaml
        sleep 30
        oc debug $podname -- ls -l $claimdir 2>&1| grep $year
        numfiles=$(oc debug $podname -- ls -l $claimdir 2>&1| grep $year | wc -l)
        if [ $numfiles -gt 1 ]; then
                echo "found $numfiles instead of 1 file in $claimdir"
                exit 1
        fi
done

[jon@dagobah deploy]$ ./2032924-delete-recreate-loop.sh
...
+ oc delete -f pvc-pod-lv.yaml
persistentvolumeclaim "local-pv-claim1" deleted
pod "example" deleted
+ sleep 10
+ oc apply -f pvc-pod-lv.yaml
persistentvolumeclaim/local-pv-claim1 created
pod/example created
+ sleep 30
+ oc debug pod/example -- ls -l /mnt/claim1
+ grep 2022
-rw-r--r--    1 root     root             0 Jan 26 00:31 2022-01-26T00:31:47+00:00.txt
++ oc debug pod/example -- ls -l /mnt/claim1
++ grep 2022
++ wc -l
+ numfiles=1
+ '[' 1 -gt 1 ']'
+ true
+ oc delete -f pvc-pod-lv.yaml
persistentvolumeclaim "local-pv-claim1" deleted
pod "example" deleted
+ sleep 10
+ oc apply -f pvc-pod-lv.yaml
persistentvolumeclaim/local-pv-claim1 created
pod/example created
+ sleep 30
+ oc debug pod/example -- ls -l /mnt/claim1
+ grep 2022
-rw-r--r--    1 root     root             0 Jan 26 00:31 2022-01-26T00:31:47+00:00.txt
-rw-r--r--    1 root     root             0 Jan 26 00:33 2022-01-26T00:33:08+00:00.txt
++ oc debug pod/example -- ls -l /mnt/claim1
++ grep 2022
++ wc -l
+ numfiles=2
+ '[' 2 -gt 1 ']'
+ echo 'found 2 instead of 1 file in /mnt/claim1'
found 2 instead of 1 file in /mnt/claim1
+ exit 1

Notes: The logs in my case were slightly different from the logs in the original bug report. There was no duplicate cleanup job preceding the missed cleanup, but the sig storage deleter was never triggered to remove the previously successful cleanup job before the next deletion came in and found an existing cleanup job with a successful status. Still, I think this points to the sleep in handlePVChange() being insufficient. To fix this, we need some way to guarantee that the job status in the sig storage deleter always gets cleaned up (without starting a second cleanup job) before the next deletion request is processed.
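
To make the ordering problem concrete, here is a minimal, self-contained Go sketch of the failure mode described above. Every name in it (processReleasedPV, jobStatus, the jobState constants) is a hypothetical stand-in rather than the real deleter code; the only point it demonstrates is that cleanup job status is tracked per PV name, and a leftover succeeded entry makes the next deletion of a same-named PV skip the wipe.

package main

import "fmt"

// jobState is a stand-in for the deleter's per-PV cleanup job status.
type jobState int

const (
	jobNone jobState = iota
	jobRunning
	jobSucceeded
)

// jobStatus is keyed by PV name, which is what allows a status left over
// from an earlier cycle to be matched against a later PV with the same name.
var jobStatus = map[string]jobState{}

// processReleasedPV mimics one pass over a released PV.
func processReleasedPV(name string) {
	switch jobStatus[name] {
	case jobSucceeded:
		// A leftover succeeded status means "already cleaned", so the PV is
		// deleted immediately and whatever the last pod wrote survives.
		fmt.Println("deleting pv", name, "after successful cleanup (wipe skipped)")
		delete(jobStatus, name)
	case jobRunning:
		// Cleanup still in flight; nothing to do until a later pass.
	default:
		fmt.Println("start cleanup for pv", name)
		jobStatus[name] = jobRunning
	}
}

func main() {
	// Racing pass: an extra cleanup job is started for the PV that was just
	// finalized, and its status is recorded under the same name.
	processReleasedPV("local-pv-e739fe1b")
	jobStatus["local-pv-e739fe1b"] = jobSucceeded // that extra job completes

	// Next real deletion: the stale succeeded status short-circuits cleanup.
	processReleasedPV("local-pv-e739fe1b")
}

Running this prints "start cleanup" once and then a "deleting pv ... after successful cleanup" line with no second wipe in between.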

Next steps:
- Work on implementing a fix and verify with the process above.

Comment 10 Jonathan Dobson 2022-01-28 01:20:51 UTC
I discussed this with the team today to clarify a few points, but it wasn't quite clear how best to address it. So I ran through the scenario again with some extra log statements to see exactly what happens, step by step.

Here are the two files found by ./2032924-delete-recreate-loop.sh:

+ oc debug pod/example -- ls -l /mnt/claim1
+ grep 2022
-rw-r--r--    1 root     root             0 Jan 27 23:25 2022-01-27T23:25:20+00:00.txt
-rw-r--r--    1 root     root             0 Jan 27 23:26 2022-01-27T23:26:48+00:00.txt


23:25:06: The previous cleanup finishes, and the PV is deleted.

I0127 23:25:06.513134  171818 deleter.go:162] jdd: cleanup status for local-pv-e739fe1b: 5
I0127 23:25:06.516267  171818 deleter.go:167] Deleting pv local-pv-e739fe1b after successful cleanup

handlePVChange() gets an UpdateEvent, and the PV is still in the released state:

I0127 23:25:06.529823  171818 cache.go:64] Updated pv "local-pv-e739fe1b" to cache
1.6433259065298407e+09  INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:176        found PV with state released    {"pvName": "local-pv-e739fe1b"}
1.6433259065298662e+09  INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:176        jdd: handlePVChange stack trace

This triggers Reconcile() to call DeletePVs() again:

1.6433259065300732e+09  INFO    diskmaker-controllers.Deleter   controller/controller.go:114    Deleting Pvs through sig storage deleter        {"Request.Namespace": "openshift-local-storage", "Request.Name": ""}

Then handlePVChange() gets a DeleteEvent to remove the cache entry:

I0127 23:25:06.531551  171818 cache.go:73] Deleted pv "local-pv-e739fe1b" from cache
1.6433259065315902e+09  INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:182        found PV with state released    {"pvName": "local-pv-e739fe1b"}
1.6433259065316565e+09  INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:182        jdd: handlePVChange stack trace

But it's too late: DeletePVs() already listed the PVs in the cache before the cache entry was removed, and the new cleanup job started:

I0127 23:25:06.536126  171818 deleter.go:162] jdd: cleanup status for local-pv-e739fe1b: 2
I0127 23:25:06.536136  171818 deleter.go:197] Start cleanup for pv local-pv-e739fe1b
I0127 23:25:06.536139  171818 deleter.go:215] jdd: calling runProcess for local-pv-e739fe1b
...
I0127 23:25:06.541492  171818 deleter.go:309] Cleanup pv "local-pv-e739fe1b": StdoutBuf - "Calling mkfs"

A new cache entry gets created for the PV:

I0127 23:25:06.727279  171818 cache.go:55] Added pv "local-pv-e739fe1b" to cache
1.643325906727301e+09   INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:170        jdd: handlePVChange stack trace

Cleanup completes:

I0127 23:25:06.739524  171818 deleter.go:309] Cleanup pv "local-pv-e739fe1b": StdoutBuf - "Quick reset completed"
I0127 23:25:06.739898  171818 deleter.go:287] Completed cleanup of pv "local-pv-e739fe1b"

All of that happened in about 0.2 seconds.
But now we have a new cache entry that is NOT in the released state, so Reconcile() does not call DeletePVs() to clean up the successful job status yet.
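
As a loose paraphrase of why that succeeded status lingers, here is a small Go sketch of the behavior the logs show. The names and signature are hypothetical (this is not the actual reconcile.go, and the real check may sit elsewhere in the diskmaker controller); it only illustrates that a PV which reappears in the Available phase does not trigger the DeletePVs pass, so the previous job's status is never cleared.

package main

import "fmt"

// pv is a minimal stand-in for the cached PersistentVolume.
type pv struct {
	name  string
	phase string // "Available", "Bound", "Released", ...
}

// handlePVChange mirrors the filter implied by the logs: only a PV seen in
// the Released phase pokes the reconcile queue that leads to DeletePVs().
func handlePVChange(p pv, queue chan<- string) {
	fmt.Printf("cached pv %q in phase %s\n", p.name, p.phase)
	if p.phase == "Released" {
		queue <- p.name
	}
}

func main() {
	queue := make(chan string, 1)

	// The PV re-created after the racing cleanup job comes back Available,
	// so no reconcile is requested and the stale job status stays put.
	handlePVChange(pv{name: "local-pv-e739fe1b", phase: "Available"}, queue)

	select {
	case n := <-queue:
		fmt.Println("reconcile requested for", n)
	default:
		fmt.Println("no reconcile requested; stale job status remains")
	}
}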


23:25:20: The new Pod gets created, the volume gets mounted, and this file is created:
 
2022-01-27T23:25:20+00:00.txt

Some time passes and we request deletion of the new pod and PVC.


23:26:24: The PV state is set to released.

I0127 23:26:24.859352  171818 cache.go:64] Updated pv "local-pv-e739fe1b" to cache
1.643325984859374e+09   INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:176        found PV with state released    {"pvName": "local-pv-e739fe1b"}
1.643325984859413e+09   INFO    diskmaker-controllers.Deleter   deleter/reconcile.go:176        jdd: handlePVChange stack trace

Reconcile() calls DeletePVs():

1.6433259848596299e+09  INFO    diskmaker-controllers.Deleter   controller/controller.go:114    Deleting Pvs through sig storage deleter        {"Request.Namespace": "openshift-local-storage", "Request.Name": ""}

It finds the cleanup job that started at 23:25:06, whose status had not been checked until now.
The status is successful, so it deletes the PV without attempting cleanup.

I0127 23:26:24.865132  171818 deleter.go:162] jdd: cleanup status for local-pv-e739fe1b: 5
I0127 23:26:24.865141  171818 deleter.go:167] Deleting pv local-pv-e739fe1b after successful cleanup


23:26:48: The new Pod gets created, the volume gets mounted, and the new file is created:

-rw-r--r--    1 root     root             0 Jan 27 23:25 2022-01-27T23:25:20+00:00.txt
-rw-r--r--    1 root     root             0 Jan 27 23:26 2022-01-27T23:26:48+00:00.txt


What do we actually expect to happen above?
- PVC / Pod gets deleted
- pv.Status.Phase is set to VolumeReleased
  - this triggers handlePVChange as an UpdateEvent
    Updated pv "local-pv-e739fe1b" to cache
    found PV with state released
- handlePVChange pokes Reconcile (q.Add)
- Reconcile calls r.deleter.DeletePVs()
- Cleanup job is started
- At some point in the future, Reconcile() should run again to check for completed job
  - in progress jobs are ignored, need to check again later
  - failed jobs will try cleanup again, need to check again later
  - succeeded jobs will remove cleanup job status and call DeletePV
- DeletePV (in sig storage deleter) should trigger DeleteEvent and handlePVChange
  - handlePVChange should remove the cache entry

Part of what makes this difficult to address is that diskmaker's Reconcile() cannot tell whether DeletePVs() in the sig storage deleter is going to clean up a completed job or start a new one. I'm still evaluating potential fixes. This might require changes upstream in the sig storage deleter, which would take some time to deliver to customers.
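
To spell out that coupling problem, here is a simplified Go sketch with hypothetical stand-ins for the diskmaker and deleter types (not the real upstream API): DeletePVs() reports nothing back, so the caller cannot tell whether it just finalized a completed job or kicked off a new one, and therefore cannot know whether another reconcile pass is still required.

package main

import "fmt"

// deleter stands in for the sig storage deleter: DeletePVs walks its own
// bookkeeping and either starts cleanup jobs or finalizes completed ones.
type deleter struct {
	pendingJobs map[string]bool // true = job completed, false = not started
}

func (d *deleter) DeletePVs() {
	for pv, done := range d.pendingJobs {
		if done {
			fmt.Println("removing completed job status and deleting PV", pv)
			delete(d.pendingJobs, pv)
		} else {
			fmt.Println("starting cleanup job for", pv)
			d.pendingJobs[pv] = true // pretend the job finishes instantly
		}
	}
}

// reconcile stands in for diskmaker's Reconcile(): it can only invoke
// DeletePVs; no information about what happened crosses the boundary, so it
// cannot requeue "until PV X's completed job status has been cleared".
func reconcile(d *deleter) {
	d.DeletePVs()
}

func main() {
	d := &deleter{pendingJobs: map[string]bool{"local-pv-e739fe1b": false}}
	reconcile(d) // starts a job
	reconcile(d) // finalizes it, but the caller cannot observe the difference
}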

Comment 11 Jonathan Dobson 2022-01-28 22:13:53 UTC
Submitted a potential fix upstream for review that resolves the issue in my test environment:
https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/pull/287

Comment 21 Jonathan Dobson 2022-02-11 17:01:15 UTC
Instructions for testing the 4.9 dev build are below. I'll provide a subscription manifest for 4.8 as well once that image is ready.
Note: this build should only be used in test environments to validate the fix for this bug, NOT on any production systems.

1) Remove the existing local storage operator from the cluster

Follow the instructions here:
https://docs.openshift.com/container-platform/4.9/operators/admin/olm-deleting-operators-from-cluster.html

Make sure there are no resources remaining in the openshift-local-storage namespace. The namespace should still exist but be empty.

$ oc get all -n openshift-local-storage
No resources found in openshift-local-storage namespace.

Manually delete the remaining CRDs and operator resources.

$ oc delete crd localvolumediscoveries.local.storage.openshift.io localvolumediscoveryresults.local.storage.openshift.io localvolumes.local.storage.openshift.io localvolumesets.local.storage.openshift.io

$ oc delete operators local-storage-operator.openshift-local-storage

And remove the operatorgroup, if one exists.

$ oc get operatorgroup -n openshift-local-storage
NAME                            AGE
openshift-local-storage-hxlwq   85m
$ oc delete operatorgroup -n openshift-local-storage openshift-local-storage-hxlwq

2) Create subscription manifest for the 4.9 dev image

$ cat subscribe-bug2032924-4.9.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: openshift-local-storage
spec:
  targetNamespaces:
  - openshift-local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: localstorage-operator-manifests
  namespace: openshift-local-storage
spec:
  sourceType: grpc
  image: quay.io/jdobson/index:bug2032924-4.9
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-subscription
  namespace: openshift-local-storage
spec:
  channel: preview
  name: local-storage-operator
  source: localstorage-operator-manifests
  sourceNamespace: openshift-local-storage

3) Apply the subscription manifest for the desired release

$ oc apply -f subscribe-bug2032924-4.9.yaml

4) Wait for installation to complete

Monitor the "Installed Operators" page in the web console, or:
$ oc describe operators local-storage-operator.openshift-local-storage

Once it's ready, you can begin testing.

To uninstall the dev build once testing is complete:
$ oc delete -f subscribe-bug2032924-4.9.yaml
Then follow the instructions from Step 1 above to clean up any remaining resources.

Comment 22 Jonathan Dobson 2022-02-11 18:02:22 UTC
(In reply to Jonathan Dobson from comment #21)
> I'll provide a subscription manifest for 4.8 as well once that image is ready.

The only difference for 4.8 is that the image pulled from quay is built on top of release-4.8.
Otherwise, the instructions are the same as described in https://bugzilla.redhat.com/show_bug.cgi?id=2032924#c21

$ cat subscribe-bug2032924-4.8.yaml 
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: openshift-local-storage
spec:
  targetNamespaces:
  - openshift-local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: localstorage-operator-manifests
  namespace: openshift-local-storage
spec:
  sourceType: grpc
  image: quay.io/jdobson/index:bug2032924-4.8
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-subscription
  namespace: openshift-local-storage
spec:
  channel: preview
  name: local-storage-operator
  source: localstorage-operator-manifests
  sourceNamespace: openshift-local-storage

Comment 23 Chao Yang 2022-02-22 03:10:56 UTC
Tested following https://bugzilla.redhat.com/show_bug.cgi?id=2032924#c16 and passed with local-storage-operator.4.11.0-202202172124.

Comment 25 errata-xmlrpc 2022-08-10 10:40:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 26 jpeyrard 2023-09-19 12:49:08 UTC
*** Bug 2219918 has been marked as a duplicate of this bug. ***

