Description of problem:
A released NFS PV is never recycled if its PVC is deleted and quickly recreated under the same name.

How reproducible:
Timing-sensitive; the PVC has to be recreated before the recycler is invoked.

Steps to Reproduce:
The persistent storage is NFS with persistentVolumeReclaimPolicy "Recycle".

DC-1 -> pod1 -> PVC1 -> PV1

PVC1 is deleted and recreated *with the same name* in a short time, so that the recycler doesn't get invoked. Now we have:

DC-2 -> pod2 -> PVC1 -> PV2

At this point, PV1 is not in use, but it is not eligible for recycling, because it is associated with PVC1, which is associated with a running pod. So we see:

atomic-openshift-master-controllers[]: I0821 09:24:20.195138    1 pv_controller.go:1205] isVolumeReleased[PV1]: volume is released

but if you describe the PV you will get:

  Normal  VolumeFailedRecycle  1m (x10289 over 1d)  persistentvolume-controller  Volume is used by pods: pod2

Actual results:
PV1 is not recycled.

Expected results:
PV1 should be recycled once it is released, so it can be bound again.

Additional info:
Workaround: on scaling down the pods that are using the PVCs, the PVs become eligible for recycling and are recycled (though in some cases only after a wait of an hour plus some random minutes).
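For reference, the delete-and-recreate race can be scripted. Below is a minimal client-go sketch (illustrative only: the namespace, claim name, and kubeconfig location are assumptions, and the recreate still has to beat the recycler):

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	ns, name := "default", "mypvc" // illustrative names
	pvcs := client.CoreV1().PersistentVolumeClaims(ns)

	// Snapshot the bound claim so it can be recreated under the same name.
	old, err := pvcs.Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Delete the claim; its PV (PV1) goes Released and is queued for recycling.
	if err := pvcs.Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}

	// Immediately recreate the claim. Strip the old binding and binding
	// annotations so the controller binds it to a fresh PV (PV2).
	fresh := old.DeepCopy()
	fresh.ResourceVersion = ""
	fresh.UID = ""
	fresh.Annotations = nil // drops pv.kubernetes.io/bind-completed etc.
	fresh.Spec.VolumeName = ""
	if _, err := pvcs.Create(ctx, fresh, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	// With the bug present, PV1 now stays Released: the recycler sees a pod
	// using a claim named "mypvc" and refuses to recycle, even though that
	// claim is bound to PV2.
}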
I will try to reproduce the problem. The code that invokes the recycler is the same in the newest openshift/kubernetes versions, so it's possible this is not only a 3.7 bug...
I think I know what is causing the problem -- this method is not checking the PVC-to-PV binding and may incorrectly consider a PV to be still in use if the PVC was bound to another PV in the meantime: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go#L1270

The fix is quite simple. I still have trouble reproducing the issue, though: it's quite timing-sensitive.
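To illustrate the missing check (a minimal sketch of the idea described above, not the actual patch; the function name is made up):

package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// claimStillUsesVolume: a running pod's claim should only block recycling of a
// released PV if that claim is still bound to this very PV. If the claim was
// deleted and recreated under the same name, it is now bound to a different
// volume, and the old PV is safe to recycle.
func claimStillUsesVolume(pv *v1.PersistentVolume, claim *v1.PersistentVolumeClaim) bool {
	return claim.Spec.VolumeName == pv.Name
}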
QE tried to verify the fix with the version and steps below, but the volume failed to mount, which blocks bug verification. @Tomas, could you check whether there is anything QE is not doing correctly?
==================================================================================================================================================
# oc version
oc v3.9.64
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://preserve-qe-lxia-39-master-etcd-nfs-1:8443
openshift v3.9.64
kubernetes v1.9.1+a0ce1bc657
==================================================================================================================================================
# oc get pv,pvc,pod
NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                 STORAGECLASS   REASON    AGE
pv/pv1            10G        RWX            Recycle          Bound       p1/mypvc                                       13m
pv/pv2            10G        RWX            Recycle          Available                                                  12m
pv/regpv-volume   17G        RWX            Retain           Bound       default/regpv-claim                            1d

NAME        STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/mypvc   Bound     pv1       10G        RWX                           13m

NAME         READY     STATUS              RESTARTS   AGE
po/dynamic   0/1       ContainerCreating   0          12m
==================================================================================================================================================
Node logs:

Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.182726  124020 volume_host.go:218] using default mounter/exec for kubernetes.io/nfs
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.282939  124020 operation_executor.go:895] Starting operationExecutor.MountVolume for volume "pv1" (UniqueName: "kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1") pod "dynamic" (UID: "e0477a21-1326-11e9-aff2-fa163ed08e39")
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.282995  124020 volume_host.go:218] using default mounter/exec for kubernetes.io/nfs
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.283022  124020 reconciler.go:262] operationExecutor.MountVolume started for volume "pv1" (UniqueName: "kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1") pod "dynamic" (UID: "e0477a21-1326-11e9-aff2-fa163ed08e39")
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.283111  124020 nsenter_mount.go:167] findmnt: directory /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1 does not exist
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.283142  124020 nsenter.go:107] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- realpath -e /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1]
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.286812  124020 nsenter.go:151] failed to resolve symbolic links on /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1: exit status 1
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.286830  124020 nfs.go:237] NFS mount set up: /var/lib/origin/openshift.local.volumes/pods/e0477a21-1326-11e9-aff2-fa163ed08e39/volumes/kubernetes.io~nfs/pv1 false exit status 1
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: E0108 04:27:20.286937  124020 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1\" (\"e0477a21-1326-11e9-aff2-fa163ed08e39\")" failed. No retries permitted until 2019-01-08 04:27:52.286909929 -0500 EST m=+79.248905481 (durationBeforeRetry 32s). Error: "MountVolume.SetUp failed for volume \"pv1\" (UniqueName: \"kubernetes.io/nfs/e0477a21-1326-11e9-aff2-fa163ed08e39-pv1\") pod \"dynamic\" (UID: \"e0477a21-1326-11e9-aff2-fa163ed08e39\") : exit status 1"
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.286959  124020 server.go:290] Event(v1.ObjectReference{Kind:"Pod", Namespace:"p1", Name:"dynamic", UID:"e0477a21-1326-11e9-aff2-fa163ed08e39", APIVersion:"v1", ResourceVersion:"136061", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "pv1" : exit status 1
Jan 08 04:27:20 preserve-qe-lxia-39-nrr-1 atomic-openshift-node[123994]: I0108 04:27:20.287161  124020 request.go:881] Request Body: {"count":7,"lastTimestamp":"2019-01-08T09:27:20Z","message":"MountVolume.SetUp failed for volume \"pv1\" : exit status 1"}
This looks to be unrelated to the fix in question. It might be another instance of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1663260
Verified on:

# oc version
oc v3.9.65
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-lxia-39-nfs-master-etcd-nfs-1:8443
openshift v3.9.65
kubernetes v1.9.1+a0ce1bc657

# oc get pv pv1 pv2 ; oc get pvc mypvc
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM           STORAGECLASS   REASON    AGE
pv1       5G         RWX            Recycle          Bound       default/mypvc                            26m
pv2       5G         RWX            Recycle          Available                                            53s

NAME      STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mypvc     Bound     pv1       5G         RWX                           1m

# oc delete pvc mypvc ; oc create -f pvc.yaml
persistentvolumeclaim "mypvc" deleted
persistentvolumeclaim "mypvc" created

# oc get pv pv1 pv2 ; oc get pvc mypvc
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM           STORAGECLASS   REASON    AGE
pv1       5G         RWX            Recycle          Released   default/mypvc                            26m
pv2       5G         RWX            Recycle          Bound      default/mypvc                            1m

NAME      STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mypvc     Bound     pv2       5G         RWX                           3s

# oc get pv pv1 pv2 ; oc get pvc mypvc
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM           STORAGECLASS   REASON    AGE
pv1       5G         RWX            Recycle          Available                                            26m
pv2       5G         RWX            Recycle          Bound       default/mypvc                            1m

NAME      STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mypvc     Bound     pv2       5G         RWX                           10s

pv1 is recycled and returns to Available even though a claim named mypvc still exists (bound to pv2), so the fix works as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0331