Bug 1623426 - [3.9] Recycler refuses to recycle PVs when a pod has a different PV using the same PVC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.9.z
Assignee: Tomas Smetana
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On: 1663145 1663260
Blocks:
 
Reported: 2018-08-29 10:39 UTC by Priyanka Kanthale
Modified: 2019-02-20 08:47 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1663145
Environment:
Last Closed: 2019-02-20 08:46:56 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2019:0331 (last updated 2019-02-20 08:47:02 UTC)

Description Priyanka Kanthale 2018-08-29 10:39:50 UTC
Description of problem:
When a PVC is deleted and quickly recreated with the same name, the previously bound PV goes Released but is never recycled: the recycler still considers it in use because a running pod uses a PVC with the same name.

How reproducible:

Steps to Reproduce:

The persistent storage is NFS with persistentVolumeReclaimPolicy "Recycle".

DC-1 -> pod1 -> PVC1 -> PV1

PVC1 is deleted and recreated *with the same name* within a short time, so that the recycler does not get invoked. Now we have:

DC-2 -> pod2 -> PVC1 -> PV2

At this point, PV1 is no longer in use, but it is not eligible for recycling: its claimRef still names PVC1, and a PVC with that name is in use by a running pod (pod2).

So we see: 

atomic-openshift-master-controllers[]: I0821 09:24:20.195138       1 pv_controller.go:1205] isVolumeReleased[PV1]: volume is released

but if you describe the PV, you will see:

  Normal  VolumeFailedRecycle  1m (x10289 over 1d)  persistentvolume-controller  Volume is used by pods: pod2


Actual results:

 PV1 is not recycled.  

Expected results:
  
 PV1 should be recycled and become Available again once no pod is actually using it.
 
Additional info:
Workaround: scaling down the pods that use the PVCs makes the PVs eligible for recycling, and they are recycled (though in some cases only after a wait of an hour plus some random number of minutes).
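
To make the symptom concrete, below is a minimal Go sketch of a naive "is this volume in use" check, the kind of logic that produces the VolumeFailedRecycle event: it matches pods only against the namespace/name recorded in the PV's claimRef. The PV, PVC, and pod names come from this report; the namespace and the function itself are illustrative, not the actual OpenShift source.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// usedByPods mimics the naive check: it returns the names of pods that mount a
// PVC with the same namespace/name as pv.Spec.ClaimRef, without verifying that
// this PVC is actually the claim the PV is bound to.
func usedByPods(pv *v1.PersistentVolume, pods []v1.Pod) []string {
    claimRef := pv.Spec.ClaimRef
    if claimRef == nil {
        return nil
    }
    var users []string
    for _, pod := range pods {
        if pod.Namespace != claimRef.Namespace {
            continue
        }
        for _, vol := range pod.Spec.Volumes {
            if vol.PersistentVolumeClaim != nil &&
                vol.PersistentVolumeClaim.ClaimName == claimRef.Name {
                users = append(users, pod.Name)
            }
        }
    }
    return users
}

func main() {
    // PV1 is Released but still carries a claimRef to the deleted PVC1.
    pv1 := &v1.PersistentVolume{}
    pv1.Name = "PV1"
    pv1.Spec.ClaimRef = &v1.ObjectReference{Namespace: "default", Name: "PVC1"}

    // pod2 mounts the *recreated* PVC1, which is bound to PV2, not PV1.
    pod2 := v1.Pod{}
    pod2.Name = "pod2"
    pod2.Namespace = "default"
    pod2.Spec.Volumes = []v1.Volume{{
        Name: "data",
        VolumeSource: v1.VolumeSource{
            PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: "PVC1"},
        },
    }}

    // Prints "PV1 used by: [pod2]", matching the VolumeFailedRecycle event.
    fmt.Println("PV1 used by:", usedByPods(pv1, []v1.Pod{pod2}))
}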

Comment 1 Tomas Smetana 2018-08-30 13:36:11 UTC
I will try to reproduce the problem. The code that invokes the recycler is the same in the newest openshift/kubernetes versions, so it's possible this is not only a 3.7 bug...

Comment 2 Tomas Smetana 2018-08-31 10:51:27 UTC
I think I know what is causing the problem -- this method is not checking the PVC-to-PV binding and may incorrectly consider a PV to be still in use if the PVC was bound to another PV in the meantime:

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go#L1270

The fix is quite simple. I still have trouble reproducing the issue, though: it's quite timing-sensitive.
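
For illustration, a rough Go sketch of the kind of additional check described above (the helper name is made up; this is not the actual patch): before counting a pod against the PV, verify that the live PVC named in the claimRef is still the same object (same UID) and is still bound back to this PV via spec.volumeName.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
)

// claimStillBoundToPV: the PV only counts as "in use" if the live PVC named in
// its claimRef is the same object (same UID) and is bound back to this PV.
// A PVC that was deleted and recreated with the same name has a new UID and
// points at another PV, so pods using it should not block recycling.
func claimStillBoundToPV(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) bool {
    ref := pv.Spec.ClaimRef
    if ref == nil || pvc == nil {
        return false
    }
    if pvc.Namespace != ref.Namespace || pvc.Name != ref.Name {
        return false
    }
    if ref.UID != "" && pvc.UID != ref.UID {
        return false // same name, different object: the original claim is gone
    }
    return pvc.Spec.VolumeName == pv.Name
}

func main() {
    // PV1 still references the deleted PVC1 by its old UID.
    pv1 := &v1.PersistentVolume{}
    pv1.Name = "PV1"
    pv1.Spec.ClaimRef = &v1.ObjectReference{
        Namespace: "default", Name: "PVC1", UID: types.UID("old-uid"),
    }

    // The recreated PVC1 has a new UID and is bound to PV2.
    newPVC := &v1.PersistentVolumeClaim{}
    newPVC.Namespace = "default"
    newPVC.Name = "PVC1"
    newPVC.UID = types.UID("new-uid")
    newPVC.Spec.VolumeName = "PV2"

    // Prints "false": pods using the new PVC1 must not block recycling of PV1.
    fmt.Println("PVC1 still bound to PV1:", claimStillBoundToPV(pv1, newPVC))
}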

Comment 12 Liang Xia 2019-01-08 10:06:25 UTC
QE tried to verify the fix with the version and steps below, but the volume failed to mount, which blocks bug verification.

@Tomas, could you check whether there is anything QE is doing incorrectly?

==================================================================================================================================================
# oc version
oc v3.9.64
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://preserve-qe-lxia-39-master-etcd-nfs-1:8443
openshift v3.9.64
kubernetes v1.9.1+a0ce1bc657
==================================================================================================================================================
# oc get pv,pvc,pod
NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                 STORAGECLASS   REASON    AGE
pv/pv1            10G        RWX            Recycle          Bound       p1/mypvc                                       13m
pv/pv2            10G        RWX            Recycle          Available                                                  12m
pv/regpv-volume   17G        RWX            Retain           Bound       default/regpv-claim                            1d

NAME        STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/mypvc   Bound     pv1       10G        RWX                           13m

NAME         READY     STATUS              RESTARTS   AGE
po/dynamic   0/1       ContainerCreating   0          12m
==================================================================================================================================================
Node logs:
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.182726  124020 volume_host.go:218] using default mounter/exec for kubernetes.io/nfs
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.282939  124020 operation_executor.go:895] Starting operationExecutor.MountVolume for volume "pv1" (UniqueName: "kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1") pod "dynamic" (UID: "e0477a21-1326-11e9-aff2-fa163ed08e39")
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.282995  124020 volume_host.go:218] using default mounter/exec for kubernetes.io/nfs
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.283022  124020 reconciler.go:262] operationExecutor.MountVolume started for volume "pv1" (UniqueName: "kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1") pod "dynamic" (UID: "e0477a21-1326-11e9-aff2-fa163ed08e39")
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.283111  124020 nsenter_mount.go:167] findmnt: directory /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1 does not exist
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.283142  124020 nsenter.go:107] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- realpath -e /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1]
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.286812  124020 nsenter.go:151] failed to resolve symbolic links on /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1: exit status 1
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.286830  124020 nfs.go:237] NFS mount set up: /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1 false exit status 1
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: E0108 04:27:20.286937  124020 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1\" (\"e0477a21-1326-11e9-aff2-fa163ed08e39\")" failed. No retries permitted until 2019-01-08 04:27:52.286909929 -0500 EST m=+79.248905481 (durationBeforeRetry 32s). Error: "MountVolume.SetUp failed for volume \"pv1\" (UniqueName: \"kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1\") pod \"dynamic\" (UID: \"e0477a21-1326-11e9-aff2-fa163ed08e39\") : exit status 1"
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.286959  124020 server.go:290] Event(v1.ObjectReference{Kind:"Pod", Namespace:"p1", Name:"dynamic", UID:"e0477a21-1326-11e9-aff2-fa163ed08e39", APIVersion:"v1", ResourceVersion:"136061", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "pv1" : exit status 1
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.287161  124020 request.go:881] Request Body: {"count":7,"lastTimestamp":"2019-01-08T09:27:20Z","message":"MountVolume.SetUp failed for volume \"pv1\" : exit status 1"}

Comment 13 Tomas Smetana 2019-01-08 12:39:51 UTC
This looks unrelated to the fix in question. It might be another instance of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1663260

Comment 14 Liang Xia 2019-01-23 09:04:52 UTC
# oc version
oc v3.9.65
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-lxia-39-nfs-master-etcd-nfs-1:8443
openshift v3.9.65
kubernetes v1.9.1+a0ce1bc657



# oc get pv pv1 pv2 ; oc get pvc mypvc
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM           STORAGECLASS   REASON    AGE
pv1       5G         RWX            Recycle          Bound       default/mypvc                            26m
pv2       5G         RWX            Recycle          Available                                            53s
NAME      STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mypvc     Bound     pv1       5G         RWX                           1m


# oc delete pvc mypvc ; oc create -f pvc.yaml 
persistentvolumeclaim "mypvc" deleted
persistentvolumeclaim "mypvc" created


# oc get pv pv1 pv2 ; oc get pvc mypvc
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM           STORAGECLASS   REASON    AGE
pv1       5G         RWX            Recycle          Released   default/mypvc                            26m
pv2       5G         RWX            Recycle          Bound      default/mypvc                            1m
NAME      STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mypvc     Bound     pv2       5G         RWX                           3s


# oc get pv pv1 pv2 ; oc get pvc mypvc
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM           STORAGECLASS   REASON    AGE
pv1       5G         RWX            Recycle          Available                                            26m
pv2       5G         RWX            Recycle          Bound       default/mypvc                            1m
NAME      STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mypvc     Bound     pv2       5G         RWX                           10s

Comment 16 errata-xmlrpc 2019-02-20 08:46:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0331

