Bug 1318472
Summary: | Registry pod doesn't mount persistent volume (NFS) after restarting system | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Kenjiro Nakayama <knakayam>
Component: | Storage | Assignee: | Paul Morie <pmorie>
Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 3.1.0 | CC: | agoldste, aos-bugs, bleanhar, erjones, knakayam, mbarrett, pmorie
Target Milestone: | --- | Keywords: | NeedsTestCase
Target Release: | 3.1.1 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | atomic-openshift-3.1.1.6-4.git.32.adf8ec9.el7aos | Doc Type: | Bug Fix
Doc Text: |
Cause: Persistent Volume Claims were added to the list of volumes to preserve rather than the actual name of the Persistent Volume associated with the Persistent Volume Claim.
Consequence: The periodic cleanup process would unmount the volume if a pod utilizing the Persistent Volume Claim had not yet entered running state.
Fix: The actual name of the Persistent Volume associated with a Persistent Volume Claim is used when determining which volumes can be cleaned up, preventing the cleanup process from considering them orphaned.
Result: Persistent Volumes are no longer unmounted while the pod requiring the volume is starting.
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2016-03-24 15:54:02 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Kenjiro Nakayama
2016-03-17 01:08:52 UTC
Exit code 32 means the mountpoint is busy or already in use -- what does the mount table look like _after_ you restart the system, but _before_ you restart openshift?

This sure looks like https://github.com/kubernetes/kubernetes/issues/20734

I'm working on a backport for the fix of this to 3.1.1

Verified with:

    [root@openshift-143 ~]# openshift version
    openshift v3.1.1.6-33-g81eabcc
    kubernetes v1.1.0-origin-1107-g4c8e6f4
    etcd 2.1.2

    atomic-openshift-sdn-ovs-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
    atomic-openshift-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
    tuned-profiles-atomic-openshift-node-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
    atomic-openshift-node-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
    atomic-openshift-clients-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
    atomic-openshift-master-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64

Steps:

1. Stop atomic-openshift-node.
2. On the node where docker-registry is hosted, run mount | grep nfs:

       openshift-143.lab.sjc.redhat.com:/var/lib/exports/regpv on /var/lib/origin/openshift.local.volumes/pods/7ff08b81-eff0-11e5-9c0a-fa163e88dd03/volumes/kubernetes.io~nfs/regpv-volume type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.10,local_lock=none,addr=10.14.6.143)

3. Reboot the node.
4. atomic-openshift-node starts after the node reboots.
5. On the node where docker-registry is hosted, run mount | grep nfs again; the NFS volume is still mounted:

       openshift-143.lab.sjc.redhat.com:/var/lib/exports/regpv on /var/lib/origin/openshift.local.volumes/pods/7ff08b81-eff0-11e5-9c0a-fa163e88dd03/volumes/kubernetes.io~nfs/regpv-volume type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.10,local_lock=none,addr=10.14.6.143)

6. oc get pods -n default shows the docker-registry pod is healthy and running.

Jianwei- Were you able to reproduce the original issue on a build from before the patch went in?
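The fix described in the Doc Text can be sketched as follows. This is a minimal, hypothetical Go illustration, not the actual kubelet code: the names claimToVolume and volumesToKeep are invented for this example. The point is that the keep-set must contain the bound PersistentVolume's name (which names the mount directory on disk), not the PersistentVolumeClaim's name, or the periodic cleanup treats the mount as orphaned.

```go
package main

import "fmt"

// claimToVolume maps a PVC name to its bound PV name. In the real kubelet
// this resolution would come from a PVC lister/API lookup; here it is a
// hard-coded stand-in for illustration.
var claimToVolume = map[string]string{
	"registry-claim": "regpv-volume",
}

// volumesToKeep builds the set of volume names that the cleanup pass must
// not unmount. podVolumes holds the volume names referenced by pods that
// are still desired on this node.
func volumesToKeep(podVolumes []string) map[string]bool {
	keep := make(map[string]bool)
	for _, v := range podVolumes {
		if pv, ok := claimToVolume[v]; ok {
			// The bug kept the claim name ("registry-claim") here;
			// the fix keeps the actual PV name ("regpv-volume"),
			// which matches the mount directory on disk.
			keep[pv] = true
		} else {
			keep[v] = true
		}
	}
	return keep
}

func main() {
	keep := volumesToKeep([]string{"registry-claim", "registry-token"})
	fmt.Println(keep["regpv-volume"], keep["registry-claim"]) // prints "true false"
}
```

With the claim name in the keep-set instead, regpv-volume would look orphaned and be unmounted while the registry pod was still starting, which is exactly the symptom reported here.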
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0510