Bug 1392338 - Persistent Volumes on NFS aren't all getting recycled
Summary: Persistent Volumes on NFS aren't all getting recycled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Jan Safranek
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On: 1395271 1395276
Blocks: 1415624
 
Reported: 2016-11-07 09:11 UTC by yannick
Modified: 2020-04-15 14:48 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: OpenShift contained a race condition in its NFS recycler handling. Consequence: When recycler pods for multiple NFS shares started at the same time, some of these pods were not started and the corresponding NFS shares were not recycled. Fix: The race condition was fixed. Result: All scheduled NFS recycler pods are started and the NFS shares are recycled.
Clone Of:
: 1415624 (view as bug list)
Environment:
Last Closed: 2017-04-12 19:16:05 UTC
Target Upstream Version:
Embargoed:


Attachments
- master logs deleting project with 4 mysql persistent (35.20 KB, text/plain), 2016-11-07 16:51 UTC, yannick
- volume failing to recycle long logs (2.07 MB, text/plain), 2016-11-08 10:02 UTC, yannick
- oc get events (10.57 KB, text/plain), 2016-11-08 10:03 UTC, yannick


Links
Red Hat Product Errata RHBA-2017:0884 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory, last updated 2017-04-12 22:50:07 UTC

Description yannick 2016-11-07 09:11:15 UTC
Description of problem:


Version-Release number of selected component (if applicable):
openshift v3.3.1.3
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

How reproducible:
1. Create PV in OS namespace.
2. Create new project 
3. Create new MYSQL persistent instances 
4. Delete project.

Steps to Reproduce:
1. Create some NFS persistent volumes on the NFS server and in OpenShift, using the Recycle reclaim policy (at least 8; a PV sketch follows these steps).
2. Create a new project and add new MySQL persistent instances (8 in my case).
3. The MySQL pods get deployed and each is assigned to a persistent volume through a PVC.
4. Once all pods are up and ready, delete the project, either via the web interface or with oc delete.
5. Watch the PV status => oc get pv -w
6. Some PVs get status "Available" before the recycler completes the data scrubbing; sometimes the recycler pod does not start at all, yet the volume is flagged as Available while data are still present on the NFS volume.
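
For reference, step 1 corresponds to PV definitions roughly like the following. This is a minimal sketch: the NFS server and export path are taken from the pod description later in this bug, and the PV name is illustrative.

# Sketch: one NFS PV with the Recycle reclaim policy (name, server and path illustrative)
oc create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tv200-volume
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: paas-nfs.qa-ch7.dev
    path: /exports/tv200
EOF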


Actual results:
Some volumes get correctly scrubbed.
Some volumes are not scrubbed at all, and their status changes from Bound through Released to Available.

oc get po -w in the openshift-infra namespace shows a strange pod sequence when recycling fails:
recycler-for-volume1 0/1 Completed
recycler-for-volume1 0/1 Terminating
recycler-for-volume1 0/1 Terminating

in oc get pv -w:
volume1 10Gi RWO,RWX Released projecta/dbmysql 
volume1 10Gi RWO,RWX Released 
volume1 10Gi RWO,RWX Available

=> NFS share still has folder and data.

Expected results:
A volume should never reach status Available before the recycler has successfully deleted all data on the NFS volume.

Additional info:
There is no permission issue here, since for some volumes the data scrubbing completes correctly. Based on my observations, the problem is related to the recycler pod creation and the PV status update. It feels like the recycler pod creation does not always go through the full sequence "Pending" -> "ContainerCreating" -> "Running" -> "Completed" -> "Terminating".

For some reason the creation is sometimes not done, and we see only Completed + Terminating in the logs, or simply no recycler creation at all.

Comment 1 yannick 2016-11-07 14:06:21 UTC
I also suspect that the persistent volumes are still mounted while the pods are being killed. I have some .nfsXXXXXX files on the share, which means: "The .nfs files are created when an open file is removed but is still being accessed."

So I guess that when deleting a project, first all the pods should be stopped (so the PVs are correctly unmounted), then the PVCs deleted, and only then should the recycler start recycling the NFS volumes.

EDIT: Even when doing these steps manually:

1. oc delete dc --all
2. oc delete pvc --all
3. oc delete project

the problem persists.

Comment 2 yannick 2016-11-07 16:51:58 UTC
Created attachment 1218130 [details]
master logs deleting project with 4 mysql persistent

The volume tv203-volume is NOT cleaned correctly.
The volume tv202-volume IS cleaned correctly.

The only difference I see is that for tv202 the recycler pod is reported as created:
Pod recycler-for-tv202-volume created: kind:""

whereas for tv203 the recycler pod is only reported as:
Pod recycler-for-tv203-volume updated

I don't see the creation of the recycler pod...

Comment 3 Jan Safranek 2016-11-07 19:24:18 UTC
Can you please attach:

- node logs from a longer period of time (say, several minutes) *and* not shortened (e.g. the lines at 11:13:11.033138 have an ellipsis "..." and I can't see what really happens to the pod)

- output of oc get event

- output of oc describe pod (in the namespace where the recycler runs) 

- please also wait at least 5 minutes for the recycler - it may fail the first time, but it should retry in a couple of minutes
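
(Roughly, commands like the following would collect that data; the node service name is an assumption for an OCP 3.x node, and the output file names are illustrative.)

# Sketch of commands to gather the requested data (service name and file names assumed)
journalctl -u atomic-openshift-node --since "10 minutes ago" > node.log
oc get events -n openshift-infra > events.txt
oc describe pod -n openshift-infra > recycler-pods.txt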

Comment 4 yannick 2016-11-08 10:02:22 UTC
Created attachment 1218470 [details]
volume failing to recycle long logs

Comment 5 yannick 2016-11-08 10:03:54 UTC
Created attachment 1218471 [details]
oc get events

result of oc get events

Comment 6 yannick 2016-11-08 10:05:20 UTC
Hello,

I created a new project in OpenShift named "ya-test-to" and deployed 5 MySQL instances (rhscl/mysql-56-rhel7) using persistent volumes. Once they were all deployed, I deleted the project.

The volume tv202 did not get cleaned; tv200, tv201, tv203 and tv204 were correctly cleaned.

Every 1.0s: du -d 1 -ah ./tv200 ./tv201 ./tv202 ./tv203 ./tv204                                                   
0       ./tv200
0       ./tv201
1.7M    ./tv202/mysql
12M     ./tv202/ibdata1
76M     ./tv202/ib_logfile1
76M     ./tv202/ib_logfile0
636K    ./tv202/performance_schema
4.0K    ./tv202/auto.cnf
4.0K    ./tv202/sampledb
167M    ./tv202
0       ./tv203
0       ./tv204

The only thing I noticed is that in the openshift-infra namespace, where the recycler pods run, there was already a pod that was not ready, with status Completed, before the project was deleted:
recycler-for-tv202-volume    0/1       Completed   0          4m


- node logs from a longer period of time
See new attachment

- output of oc get event
See attachment => events.txt

- output of oc describe pod (in the namespace where the recycler runs) 

oc get po -w
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-gp133   1/1       Running     0          6d
hawkular-metrics-onkqd       1/1       Running     4          28d
heapster-yddsn               1/1       Running     0          6d
recycler-for-tv202-volume    0/1       Completed   0          4m
NAME                        READY     STATUS    RESTARTS   AGE
recycler-for-tv200-volume   0/1       Pending   0          0s
recycler-for-tv203-volume   0/1       Pending   0         0s
recycler-for-tv200-volume   0/1       Pending   0         0s
recycler-for-tv203-volume   0/1       Pending   0         0s
recycler-for-tv200-volume   0/1       ContainerCreating   0         0s
recycler-for-tv202-volume   0/1       Terminating   0         7m
recycler-for-tv204-volume   0/1       Pending   0         1s
recycler-for-tv202-volume   0/1       Terminating   0         7m
recycler-for-tv201-volume   0/1       Pending   0         0s
recycler-for-tv204-volume   0/1       Pending   0         1s
recycler-for-tv203-volume   0/1       ContainerCreating   0         1s
recycler-for-tv201-volume   0/1       Pending   0         0s
recycler-for-tv204-volume   0/1       ContainerCreating   0         2s
recycler-for-tv201-volume   0/1       ContainerCreating   0         1s
recycler-for-tv204-volume   1/1       Running   0         23s
recycler-for-tv200-volume   0/1       Completed   0         26s
recycler-for-tv201-volume   0/1       Completed   0         25s
recycler-for-tv201-volume   0/1       Terminating   0         25s
recycler-for-tv201-volume   0/1       Terminating   0         26s
recycler-for-tv204-volume   0/1       Completed   0         28s
recycler-for-tv204-volume   0/1       Terminating   0         28s
recycler-for-tv204-volume   0/1       Terminating   0         28s
recycler-for-tv203-volume   1/1       Running   0         28s
recycler-for-tv203-volume   0/1       Completed   0         32s
recycler-for-tv203-volume   0/1       Terminating   0         32s
recycler-for-tv203-volume   0/1       Terminating   0         32s

Comment 7 Jan Safranek 2016-11-08 14:40:31 UTC
Thanks for the logs! After a quick check, everything looks like it _should_ work. The only difference for volume tv202 is that its recycler got scheduled to node qa-paas-master.qa-ch7.dev instead of 10.2.64.151, where the rest of the recyclers were running successfully. It seems to me that "qa-paas-master.qa-ch7.dev" actually *is* 10.2.64.151, isn't it? Why is it reported as "qa-paas-master" just that once?


In the logs I can see the recycler pod for tv202 got "Succeeded" without any error; however, it seems it did not run at all.

Your node logs start at Mon 2016-11-07 20:50:40, when the weird pod for tv202 is already created; do you have earlier logs? A couple of minutes earlier should be enough... I want to see "recycleVolumeOperation [tv202-volume] started" and the subsequent "About to try and schedule pod recycler-for-tv202-volume".
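
(Those messages come from the master/controller logs; something like the following would pull them out. The master service name is an assumption for a non-HA OCP 3.x master.)

# Sketch: searching the master logs for the tv202 recycler messages (service name assumed)
journalctl -u atomic-openshift-master | grep -E 'recycleVolumeOperation \[tv202-volume\]|recycler-for-tv202-volume'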

Comment 8 yannick 2016-11-08 20:08:11 UTC
Well, what I noticed is that the volumes that aren't cleaned are the ones that already have a pod in the openshift-infra namespace with status Completed. The volume had been cleaned earlier, but for some unknown reason the recycler pod is still there:

Ex tv203:
oc get po -w
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-gp133   1/1       Running     0          6d
hawkular-metrics-onkqd       1/1       Running     4          29d
heapster-yddsn               1/1       Running     0          6d
recycler-for-tv203-volume    0/1       Completed   0          6m

If I delete these pods, deploy a new project with volume claims, and delete the project, all the volumes get cleaned.

Now the issue is that if I deploy a new project and delete it again, look at the recycler pod for tv203:

oc get po -w
NAME                        READY     STATUS    RESTARTS   AGE
recycler-for-tv201-volume   0/1       Pending   0          0s
recycler-for-tv204-volume   0/1       Pending   0         0s
recycler-for-tv201-volume   0/1       Pending   0         0s
recycler-for-tv204-volume   0/1       Pending   0         0s
recycler-for-tv200-volume   0/1       Pending   0         0s
recycler-for-tv201-volume   0/1       ContainerCreating   0         0s
recycler-for-tv200-volume   0/1       Pending   0         0s
recycler-for-tv204-volume   0/1       ContainerCreating   0         0s
recycler-for-tv202-volume   0/1       Pending   0         0s
recycler-for-tv200-volume   0/1       ContainerCreating   0         0s
recycler-for-tv202-volume   0/1       Pending   0         0s
recycler-for-tv202-volume   0/1       ContainerCreating   0         1s
recycler-for-tv201-volume   1/1       Running   0         11s
recycler-for-tv200-volume   1/1       Running   0         12s
recycler-for-tv200-volume   0/1       Completed   0         17s
recycler-for-tv201-volume   0/1       Completed   0         17s
recycler-for-tv200-volume   0/1       Terminating   0         17s
recycler-for-tv200-volume   0/1       Terminating   0         17s
recycler-for-tv202-volume   1/1       Running   0         21s
recycler-for-tv204-volume   1/1       Running   0         23s
recycler-for-tv202-volume   0/1       Completed   0         25s
recycler-for-tv202-volume   0/1       Terminating   0         25s
recycler-for-tv202-volume   0/1       Terminating   0         25s
recycler-for-tv204-volume   0/1       Completed   0         27s
recycler-for-tv204-volume   0/1       Terminating   0         27s
recycler-for-tv204-volume   0/1       Terminating   0         27s

Nothing for tv203...

Running oc get po again, the recycler is still Completed and still not ready.

oc get po
NAME                         READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-gp133   1/1       Running     0          6d
hawkular-metrics-onkqd       1/1       Running     4          29d
heapster-yddsn               1/1       Running     0          6d
recycler-for-tv201-volume    0/1       Completed   0          48s
recycler-for-tv203-volume    0/1       Completed   0          7m

Looking at the PV...

oc get pv -w
NAME           CAPACITY   ACCESSMODES   STATUS    CLAIM             REASON    AGE
tv200-volume   10Gi       RWO,RWX       Bound     ya-test2/mysql4             6d
tv201-volume   10Gi       RWO,RWX       Bound     ya-test2/mysql1             6d
tv202-volume   10Gi       RWO,RWX       Bound     ya-test2/mysql5             5d
tv203-volume   10Gi       RWO,RWX       Bound     ya-test2/mysql2             6d
tv204-volume   10Gi       RWO,RWX       Bound     ya-test2/mysql3             6d

Now delete the project 

NAME           CAPACITY   ACCESSMODES   STATUS     CLAIM             REASON    AGE
tv201-volume   10Gi       RWO,RWX       Released   ya-test2/mysql1             6d
tv203-volume   10Gi      RWO,RWX   Released   ya-test2/mysql2             6d
tv204-volume   10Gi      RWO,RWX   Released   ya-test2/mysql3             6d
tv200-volume   10Gi      RWO,RWX   Released   ya-test2/mysql4             6d
tv202-volume   10Gi      RWO,RWX   Released   ya-test2/mysql5             5d
tv200-volume   10Gi      RWO,RWX   Released                       6d
tv200-volume   10Gi      RWO,RWX   Available                       6d
tv202-volume   10Gi      RWO,RWX   Released                       5d
tv202-volume   10Gi      RWO,RWX   Available                       5d
tv204-volume   10Gi      RWO,RWX   Released                       6d
tv201-volume   10Gi      RWO,RWX   Released                       6d
tv204-volume   10Gi      RWO,RWX   Available                       6d
tv201-volume   10Gi      RWO,RWX   Available                       6d
tv203-volume   10Gi      RWO,RWX   Released                       6d
tv203-volume   10Gi      RWO,RWX   Available                       6d

The volume tv203 gets status Available, but the data are still present on the NFS share.

If I do an oc logs on the tv203 recycler pod:

oc logs recycler-for-tv203-volume
Scrub ok



The output of oc describe po recycler-for-tv203-volume:

 oc describe po recycler-for-tv203-volume
Name:           recycler-for-tv203-volume
Namespace:      openshift-infra
Node:           10.2.64.151/10.2.64.151
Start Time:     Tue, 08 Nov 2016 20:44:00 +0100
Labels:         <none>
Status:         Succeeded
IP:             10.201.0.2
Controllers:    <none>
Containers:
  pv-recycler:
    Container ID:       docker://97d05dc4e6517b4981ab2436315f531bd0761409d456d072fc903c1ba0ca7c69
    Image:              openshift3/ose-recycler:v3.3.0.32
    Image ID:           docker://sha256:08aa1eb54fafb89c3ec673e931f9f172fea2880fba5133576faa554649dff3fb
    Port:
    Command:
      /usr/bin/openshift-recycle
    Args:
      /scrub
    QoS Tier:
      memory:           BestEffort
      cpu:              BestEffort
    State:              Terminated
      Reason:           Completed
      Exit Code:        0
      Started:          Tue, 08 Nov 2016 20:44:15 +0100
      Finished:         Tue, 08 Nov 2016 20:44:25 +0100
    Ready:              False
    Restart Count:      0
    Environment Variables:
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  vol:
    Type:       NFS (an NFS mount that lasts the lifetime of a pod)
    Server:     paas-nfs.qa-ch7.dev
    Path:       /exports/tv203
    ReadOnly:   false
  pv-recycler-controller-token-gp1i3:
    Type:       Secret (a volume populated by a Secret)
    SecretName: pv-recycler-controller-token-gp1i3
Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                    -------------                   --------        ------          -------
  16m           16m             1       {default-scheduler }                                    Normal          Scheduled       Successfully assigned recycler-for-tv203-volume to 10.2.64.151
  16m           16m             1       {kubelet 10.2.64.151}   spec.containers{pv-recycler}    Normal          Pulled          Container image "openshift3/ose-recycler:v3.3.0.32" already present on machine
  16m           16m             1       {kubelet 10.2.64.151}   spec.containers{pv-recycler}    Normal          Created         Created container with docker id 97d05dc4e651
  16m           16m             1       {kubelet 10.2.64.151}   spec.containers{pv-recycler}    Normal          Started         Started container with docker id 97d05dc4e651


Seems OK...

I'll keep testing, but so far the issue appears when there is already a previous recycler pod present in the openshift-infra namespace. The question is why one randomly remains there: it has the log "Scrub ok", yet the next time it is invoked to clean the volume it answers that the scrub is done (which is not the case), and then the volume goes back to status Available.

Comment 9 yannick 2016-11-10 09:16:06 UTC
Hello,

So I got more information...

I set up a simple cron job running every minute on the master, looking for pods named "recycler-for*" that are not ready (0/1) and have status Completed, and simply deleting these pods:

for i in $( oc get po -n openshift-infra | grep -o '\(recycler\S*\)\(.*0\/1.*Completed\)' | awk '{ print $1}' ); do
   echo "Deleting recycler pod : $i"
   echo ""
   oc describe po $i  -n openshift-infra
   echo ""
   oc delete po $i -n openshift-infra
done
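
Run from cron every minute on the master, for example (the script path here is hypothetical):

# m h dom mon dow command
* * * * * /usr/local/bin/clean-recycler-pods.sh >> /var/log/clean-recycler-pods.log 2>&1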

Every time a volume claim gets released, OpenShift creates a NEW recycler pod instead of updating the one that is in the not-ready/Completed state.

Since I put this trick in place, there seem to be no more issues with data scrubbing.

Comment 11 Jan Safranek 2016-11-15 14:30:51 UTC
So, there are several things here:

1) When you delete a namespace with a PVC and a mysql pod, the PVC is deleted earlier than the pod, so for some time the mysql pod and the recycler run in parallel and the mysql pod may prevent some files from being deleted.

2) The recycler pod in OpenShift Enterprise does not check that there are no files left on the volume when it finishes. With such a check, it would detect that there are still some mysql files somewhere and retry recycling after a while.

3) For some reason, some recycler pods are not deleted by Kubernetes when they complete. It's totally unpredictable; I've reproduced it a few times:

  1. create 10 NFS PVs, each pointing to a separate directory
  2. create 10 claims
  3. create 10 pods, each using one claim, creating a bunch of files and directories in an endless loop (a sketch of such a pod follows this comment)
  4. delete the pods and claims at the same time. Pods are slowly being deleted while they're still creating some files and directories and while recycler pods are running.
  5. some recycler pods hang in the "0/1 Completed" state and are not deleted -> bug!

I'm investigating 3); I'll file separate bugs for 1) and 2).
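
(For step 3, a minimal sketch of such a writer pod could look like this; the image, pod name, claim name and mount path are illustrative, not the exact reproducer used.)

# Sketch: a pod that keeps creating files and directories on its claim (all names/image illustrative)
oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nfs-writer-1
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "i=0; while true; do mkdir -p /data/dir$i; date > /data/dir$i/file; i=$((i+1)); sleep 1; done"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: nfs-claim-1
EOF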

Comment 13 Jan Safranek 2016-11-15 17:37:31 UTC
Patch for 3): https://github.com/kubernetes/kubernetes/pull/36827

Comment 18 Jan Safranek 2016-11-28 12:37:39 UTC
Origin pull request: https://github.com/openshift/origin/pull/12041

Comment 23 Jan Safranek 2017-01-13 09:35:47 UTC
The code went to Origin 1.5 with the last Kubernetes rebase.

Comment 25 Troy Dawson 2017-01-20 23:16:25 UTC
This has been merged into OCP and is in OCP v3.5.0.7 or newer.
If you are going to backport this, please clone this bugzilla and attach the pull request for the backport to it.

Comment 26 Jan Safranek 2017-01-23 11:15:31 UTC
Cloned for 3.4.x as #1415624

Comment 27 Liang Xia 2017-01-24 02:26:04 UTC
Tried on the version below with 10 PVs several times; the PVs become Available and the data are cleaned.

openshift v3.5.0.7+390ef18
kubernetes v1.5.2+43a9be4
etcd 3.1.0-rc.0

Comment 29 errata-xmlrpc 2017-04-12 19:16:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

