Bug 1279683 - NFS Recycling error: Pod failed, pod.Status.Message unknown
Summary: NFS Recycling error: Pod failed, pod.Status.Message unknown
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Storage
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Assignee: Sami Wagiaalla
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-11-10 03:20 UTC by Liang Xia
Modified: 2019-08-15 05:49 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 17:15:01 UTC
Target Upstream Version:
Embargoed:



Description Liang Xia 2015-11-10 03:20:53 UTC
Description of problem:
After a PVC is bound to a PV, but while no pod is using the PVC, deleting the PVC causes PV recycling to fail with the error:
Pod failed, pod.Status.Message unknown

Version-Release number of selected component (if applicable):
openshift v1.0.8-4-gabfc3c4
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Prepare an NFS server on an All-In-One environment.
bash -x nfs-provisioning-localhost.sh
2. Create 5 PVs.
oc new-app -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/nfs/template-pv.json
3. Create 10 PVCs.
oc new-app -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/nfs/template-pvc.json
4. Delete a PVC that is bound to a PV.
oc get pvc
oc delete pvc template-pvc-1
5. Check the PV status.
oc get pv
oc describe pv template-pv-5
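
For reference, the PVs created by the template in step 2 look roughly like the following (a minimal sketch reconstructed from the oc describe output below, assuming the paths from the provisioning script; the authoritative definition is the linked template-pv.json):

oc create -f - <<EOF
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": {
    "name": "template-pv-5",
    "labels": { "template": "PVs" }
  },
  "spec": {
    "capacity": { "storage": "5Gi" },
    "accessModes": [ "ReadWriteOnce" ],
    "persistentVolumeReclaimPolicy": "Recycle",
    "nfs": {
      "server": "localhost",
      "path": "/home/data/pv05"
    }
  }
}
EOF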

Actual results:
# oc describe pv template-pv-5
Name:        template-pv-5
Labels:        template=PVs
Status:        Failed
Claim:        default/template-pvc-1
Reclaim Policy:    Recycle
Access Modes:    RWO
Capacity:    5Gi
Message:    Recycling error: Pod failed, pod.Status.Message unknown.
Source:
    Type:    NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    localhost
    Path:    /home/data/pv05
    ReadOnly:    false

Expected results:
The PV should be recycled and become Available again.

Additional info:
cat nfs-provisioning-localhost.sh
#!/bin/bash

if ! ( rpm -qa | grep -q nfs-utils )
then
  yum install -y nfs-utils
fi

mkdir -p /home/data/pv{01..10}

# odd-numbered exports get mode 700, even-numbered exports get mode 770
chmod -R 700 /home/data/pv{01..09..2}
chmod -R 770 /home/data/pv{02..10..2}

for PV in pv{01..10}
do
  if ! ( grep -q "/home/data/$PV" /etc/exports )
  then
    echo "/home/data/$PV *(rw,sync)" >> /etc/exports
  fi
done

systemctl start rpcbind
systemctl start nfs-server
exportfs -a

if ( getsebool virt_use_nfs | grep -q off )
then
  setsebool -P virt_use_nfs 1
fi
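
A quick sanity check that the exports are actually visible from the node (not part of the original script, just a verification step):

showmount -e localhost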

Comment 1 Mark Turansky 2015-11-10 17:08:46 UTC
The new security functionality caused errors in the recycler, which was used to recycle NFS volumes without any security.

The recycler needs to work with UID:GID.  This BZ should be fixed by https://github.com/openshift/origin/pull/5792

Comment 2 Mark Turansky 2015-11-10 18:27:35 UTC

*** This bug has been marked as a duplicate of bug 1279335 ***

Comment 3 Jordan Liggitt 2015-11-12 06:21:02 UTC
https://github.com/openshift/origin/pull/5792 is superseded by https://github.com/openshift/origin/pull/5847 

ForkAMI available at https://ci.openshift.redhat.com/jenkins/job/fork_ami/132/

Reopening, since this was related to NFS permissions. The other bug was related to hostmount SCC.

Comment 4 Jordan Liggitt 2015-11-12 21:43:07 UTC
https://github.com/openshift/origin/pull/5847 is in the merge queue

Comment 5 Liang Xia 2015-11-13 09:36:08 UTC
Checked again on devenv-rhel7_2695, following exactly the same steps as in comment 0; the PV still cannot be recycled.

# oc describe pv template-pv-5
Name:		template-pv-5
Labels:		template=PVs
Status:		Failed
Claim:		lxiap001/template-pvc-1
Reclaim Policy:	Recycle
Access Modes:	RWO
Capacity:	5Gi
Message:	Recycling error: Pod was active on the node longer than specified deadline
Source:
    Type:	NFS (an NFS mount that lasts the lifetime of a pod)
    Server:	localhost
    Path:	/home/data/pv05
    ReadOnly:	false

The error is a little confusing, since it says "Pod was active on the node longer than specified deadline", but there are actually no pods in this environment.

# openshift version
openshift v1.0.8-40-g42ad235
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Comment 6 Mark Turansky 2015-11-13 18:17:45 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1281726 contains the same error ("Pod was active on the node longer than specified").

Are these two dupes?

Comment 7 Mark Turansky 2015-11-13 20:27:11 UTC
I attempted a MySQL pod with NFS using 700 and 770 (as indicated above). Only 777 worked: 700 gave an error when mounting, the others when writing.

Try again with 777, please.
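
(For the re-test, something like the following on the NFS host should be enough, assuming the directories from the provisioning script in comment 0:)

# relax the permissions on all exported directories and refresh the exports
chmod -R 777 /home/data/pv{01..10}
exportfs -ra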

Comment 8 Liang Xia 2015-11-16 02:58:26 UTC
Tried again on devenv-rhel7_2712 with openshift version
openshift v1.1-25-g0c0e452
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

and the PV can be recycled when the NFS export has mode 777,
but the PV fails to recycle when the NFS export has mode 700/770.

Comment 9 Liang Xia 2015-11-16 03:03:30 UTC
Hi Mark,

Since exporting NFS with mode 777 is not good,
could you confirm that mode 777 is required on the NFS export?

Thanks,
Liang

Comment 10 Liang Xia 2015-11-17 02:11:08 UTC
Assigning back to get the confirmation.

Comment 11 Mark Turansky 2016-01-11 15:49:18 UTC
There is a feature request for automatically adding a GID to pods running with shared storage volumes (NFS, Gluster).

The recycler would run a pod using the same GID that is stored on the PV.  This allows permissions less than 777.

Reassigning to Sami who I believe is handling that feature.  Otherwise, Sami, please reassign to the feature owner.

Comment 12 Sami Wagiaalla 2016-02-04 14:46:38 UTC
PR opened upstream to support a GID annotation which indicates the GID with which to access the volume. The recycler pod will use the same feature.

https://github.com/kubernetes/kubernetes/pull/20490
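
If that lands, attaching the GID to an existing PV would look roughly like this (a sketch only; the annotation key pv.beta.kubernetes.io/gid and the GID value 5555 are taken from the upstream PR and may change before merge):

oc annotate pv template-pv-5 pv.beta.kubernetes.io/gid=5555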

Comment 16 Liang Xia 2016-02-22 07:52:19 UTC
Checked on version,
openshift v3.1.1.904
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

The PV (persistent volume) can be recycled now.

Once the bug is moved to ON_QA, we can move it to VERIFIED.

Comment 17 Sami Wagiaalla 2016-02-22 15:24:39 UTC
On a closer look at this bug, it seems the UID/GID setting is not the issue.
The recycler script merged here: https://github.com/openshift/origin/pull/5847 and referenced above has a 'becomeUser' method which switches the UID to that of the file that requires deletion.

Liang,

Is this working for you now, then? I think what happened is that only your most recent test contained the patch referenced above. Moving to ON_QA. Please reopen if you encounter this issue again.
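
(For anyone following along, the rough idea behind 'becomeUser' is to perform the delete as the owner of the volume contents instead of as root. A sketch of the concept only, not the actual recycler script; the /scrub mount path is an assumption:)

# determine the owner of the volume root, then remove its contents as that UID
OWNER_UID=$(stat -c %u /scrub)
setpriv --reuid "$OWNER_UID" --clear-groups rm -rf /scrub/*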

Comment 18 Liang Xia 2016-02-23 01:55:53 UTC
Moving to VERIFIED.

