Bug 1247067 - Sometimes a Persistent Volume fails to release because the pv-scrubber-nfs pod exceeds its deadline
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.0.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Assigned To: Mark Turansky
QA Contact: DeShuai Ma
Depends On:
Blocks:
Reported: 2015-07-27 05:21 EDT by Jianwei Hou
Modified: 2015-11-23 09:25 EST
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-23 09:25:31 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jianwei Hou 2015-07-27 05:21:02 EDT
Description of problem:
Please see the reproduction steps below.

Version-Release number of selected component (if applicable):
openshift v3.0.1.0-503-g7cc6deb
kubernetes v1.0.0

How reproducible:
Sometimes; it is easier to reproduce when there is a large amount of data on the volume.

Steps to Reproduce:
1. Create a PV with ReadWriteMany access mode and the Recycle reclaim policy (a sketch of such a definition follows these steps)
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/nfs/rwx/nfs-recycle.json
2. Create dc/rc/pod/service with the mysql replication template
oc process -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/nfs/mysql.json | oc create -f -
3. After the pods are created, delete the pods and delete the PVC
4. Verify that all data on the NFS export is deleted and the PV can be bound again
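
For reference, a minimal sketch of the kind of PV definition step 1 refers to (not the contents of the linked test file; the name, capacity, server, and path below are placeholders):

{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": { "name": "nfs" },
  "spec": {
    "capacity": { "storage": "5Gi" },
    "accessModes": [ "ReadWriteMany" ],
    "persistentVolumeReclaimPolicy": "Recycle",
    "nfs": {
      "server": "nfs.example.com",
      "path": "/exports/pv"
    }
  }
}

The fields relevant to this bug are persistentVolumeReclaimPolicy: Recycle, which triggers the scrubber pod when the claim is released, and the capacity, since that is how much data the scrubber may have to wipe.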

Actual results:
Several minutes after step 4, the status of the PV is 'Failed':
oc get pv:
NAME          LABELS    CAPACITY     ACCESSMODES   STATUS    CLAIM                                REASON
nfs           <none>    5368709120   RWX           Failed    jhou1/mysql-master                   
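
The reason for the Failed phase can be inspected with something like:

oc describe pv nfs

which prints the PV's status message and related events.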

The events from the 'default' namespace show that the 'pv-scrubber-nfs' pod exceeded its deadline:

Mon, 27 Jul 2015 16:42:12 +0800   Mon, 27 Jul 2015 16:42:12 +0800   1         pv-scrubber-nfs-6su3s     Pod       spec.containers{scrubber}           created            {kubelet minion2.cluster.local}   Created with docker id f1799a710040
Mon, 27 Jul 2015 16:42:17 +0800   Mon, 27 Jul 2015 16:42:17 +0800   1         pv-scrubber-nfs-6su3s     Pod       spec.containers{scrubber}           started            {kubelet minion2.cluster.local}   Started with docker id f1799a710040
Mon, 27 Jul 2015 16:41:47 +0800   Mon, 27 Jul 2015 16:42:17 +0800   2         pv-scrubber-nfs-6su3s     Pod                                           DeadlineExceeded   {kubelet minion2.cluster.local}   Pod was active on the node longer than specified deadline

On the NFS server, only part of the data is scrubbed; the rest is still there.
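
For context, the scrubbing is done by a short-lived pod that the recycler launches against the export, and the deadline in the events above comes from activeDeadlineSeconds in that pod's spec. A minimal sketch of such a pod (the image, command, and 60-second deadline are illustrative assumptions, not the exact spec OpenShift generates):

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": { "generateName": "pv-scrubber-nfs-" },
  "spec": {
    "restartPolicy": "Never",
    "activeDeadlineSeconds": 60,
    "volumes": [
      { "name": "vol", "nfs": { "server": "nfs.example.com", "path": "/exports/pv" } }
    ],
    "containers": [
      {
        "name": "scrubber",
        "image": "busybox",
        "command": [ "/bin/sh", "-c", "rm -rf /scrub/..?* /scrub/.[!.]* /scrub/*" ],
        "volumeMounts": [ { "name": "vol", "mountPath": "/scrub" } ]
      }
    ]
  }
}

With a fixed deadline like this, a 5 GiB volume full of mysql data can easily outlive the deadline, which is exactly the failure mode reported here.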

Expected results:
All data on the NFS volume should be deleted, and the PV's status should become 'Available'.

Additional info:
Comment 1 Mark Turansky 2015-07-30 09:43:21 EDT
I have an upstream PR to resolve this issue.  It allows for configurable PV recyclers and timeouts based on volume size.

https://github.com/GoogleCloudPlatform/kubernetes/pull/9870
Comment 2 Mark Turansky 2015-09-14 11:18:34 EDT
https://github.com/openshift/origin/blob/master/pkg/cmd/server/kubernetes/master.go#L70

The bits linked above have been merged to Origin HEAD and lengthen the timeout based on volume size.

This issue should be resolved by that improvement.
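
Roughly, the idea of the change (names and constants below are illustrative, not the merged code): derive the scrubber pod's activeDeadlineSeconds from the PV's capacity, with a floor for small volumes.

package main

import "fmt"

// scrubberDeadlineSeconds sketches the merged behavior: scale the scrubber
// pod's activeDeadlineSeconds with the volume's capacity instead of using a
// fixed deadline, keeping a minimum floor for small volumes.
func scrubberDeadlineSeconds(minimumSeconds, secondsPerGiB, capacityBytes int64) int64 {
	const giB int64 = 1 << 30
	deadline := (capacityBytes / giB) * secondsPerGiB
	if deadline < minimumSeconds {
		return minimumSeconds
	}
	return deadline
}

func main() {
	// The PV in this report is 5368709120 bytes (5 GiB); with a 30s floor
	// and 60s per GiB, the deadline scales to 300s rather than staying fixed.
	fmt.Println(scrubberDeadlineSeconds(30, 60, 5368709120)) // prints 300
}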
Comment 3 DeShuai Ma 2015-09-15 05:53:11 EDT
Version:
[root@openshift-137 tmp]# openshift version
openshift v3.0.2.0
kubernetes v1.1.0-alpha.0-1605-g44c91b1

Steps:
1. Create the PV and the mysql pods
[root@openshift-137 tmp]# oc get pv
nfs               <none>       5368709120    RWX           Bound      dma/mysql-master                   12s
[root@dhcp-128-7 origin]# oc get pod
NAME                   READY     STATUS    RESTARTS   AGE
mysql-master-1-7o1ov   1/1       Running   0          1m
mysql-slave-1-ir9ea    0/1       Running   4          1m
[root@dhcp-128-7 origin]# oc get pvc
NAME           LABELS    STATUS    VOLUME    AGE
mysql-master   map[]     Bound     nfs       1m

2. On the NFS server, check the shared dir
[root@openshift-squid ~]# ls /deshuai
ibdata1  ib_logfile0  ib_logfile1  mysql  mysql-bin.000001  mysql-bin.000002  mysql-bin.index  mysql-master-1-7o1ov.pid  performance_schema  replication  userdb

3. Delete the project
# oc delete project dma

4. After deleting the project, check that the PV is available again
[root@openshift-137 tmp]# oc get pv
nfs               <none>       5368709120    RWX           Available                                      2m

5. On the NFS server, check that the shared dir is empty
[root@openshift-squid ~]# ls /deshuai
[root@openshift-squid ~]#
Comment 4 Brenton Leanhardt 2015-11-23 09:25:31 EST
This fix is available in OpenShift Enterprise 3.1.
