Bug 1656927 - PV(NFS) stop working after upgrade to 3.10.72 in atomic host
Summary: PV(NFS) stop working after upgrade to 3.10.72 in atomic host
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.10.0
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: 3.10.z
Assignee: Jan Safranek
QA Contact: Wenqi He
Depends On:
Reported: 2018-12-06 16:43 UTC by Nicolas Nosenzo
Modified: 2019-02-05 09:29 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-01-11 10:14:52 UTC
Target Upstream Version:

Attachments

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github openshift origin pull | 21284 | 0 | None | closed | 3.10 Fixed subpath cleanup when /var/lib/kubelet is a symlink | 2021-01-19 17:43:13 UTC
Red Hat Knowledge Base (Solution) | 3744791 | 0 | None | None | None | 2018-12-09 19:50:56 UTC

Description Nicolas Nosenzo 2018-12-06 16:43:12 UTC
Created attachment 1512194: node journal logs

Description of problem:
After upgrading the cluster to 3.10.72 (from 3.10.66), all NFS-backed PVs fail to mount in containers, even though the NFS export can be mounted directly on the nodes without issues.


Dec 06 11:19:54 node5.example.net atomic-openshift-node[21705]: I1206 11:19:54.606296   21717 reconciler.go:252] operationExecutor.MountVolume started for volume "example-pv" (UniqueName: "kubernetes.io/nfs/52f4d86d-f93f-11e8-94a6-0050569df267-example-pv") pod "mortimer-db-7d595fc8fb-nf2w5" (UID: "52f4d86d-f93f-11e8-94a6-0050569df267")
Dec 06 11:19:54 node5.example.net atomic-openshift-node[21705]: I1206 11:19:54.608059   21717 nsenter.go:151] failed to resolve symbolic links on /var/lib/origin/openshift.local.volumes/pods/52f4d86d-f93f-11e8-94a6-0050569df267/volumes/kubernetes.io~nfs/example-pv: exit status 1
Dec 06 11:19:54 node5.example.net atomic-openshift-node[21705]: E1206 11:19:54.608142   21717 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/52f4d86d-f93f-11e8-94a6-0050569df267-example-pv\" (\"52f4d86d-f93f-11e8-94a6-0050569df267\")" failed. No retries permitted until 2018-12-06 11:21:56.608117627 +0100 CET m=+1159.964741283 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"example-pv\" (UniqueName: \"kubernetes.io/nfs/52f4d86d-f93f-11e8-94a6-0050569df267-example-pv\") pod \"mortimer-db-7d595fc8fb-nf2w5\" (UID: \"52f4d86d-f93f-11e8-94a6-0050569df267\") : exit status 1"

Version-Release number of selected component (if applicable):

Atomic Host: redhat-release-atomic-host-7.6-20180503.0.atomic.el7.1.x86_64
Docker: docker-1.13.1-84.git07f3374.el7.x86_64 / container-selinux-2.74-1
Ansible playbook used for the update: openshift-ansible-3.10.73-1.git.0.8b65cea.el7.noarch

How reproducible:
Not reproducible on a cluster running on regular RHEL (non-Atomic) hosts.

Additional info:
- The virt_use_nfs SELinux boolean is enabled.

Comment 10 Jan Safranek 2018-12-17 13:32:17 UTC
Origin PR: https://github.com/openshift/origin/pull/21672

Comment 15 Jan Safranek 2019-01-02 09:57:40 UTC
>> Merged into Origin 3.9, it will be part of the next OSE 3.9.z

> I think you meant *3.10*, just to avoid any confusion.

Sorry, I indeed meant 3.10.z.

> What we need to know is whether the issue affects only OpenShift Container Platform 3.10 or 3.11

Only 3.10 is affected, 3.11 should be fine.

> if a fix can be pushed to become available within the next OpenShift Container Platform 3.10 Errata.

Yes, as noted above, the patch has been merged into 3.10.z.

Comment 18 Thomas Schilling 2019-01-07 16:29:45 UTC
The same issue actually also affects our 3.9 installation.

Upgrading from 3.9.43 -> 3.9.57 broke it.

Is there a chance this could be backported to the 3.9 branch (+ new patch release)?

Comment 19 Jan Safranek 2019-01-07 17:19:15 UTC
> Is there a chance this could be backported to the 3.9 branch (+ new patch release)?

Tracked as https://bugzilla.redhat.com/show_bug.cgi?id=1663260

Comment 21 Wenqi He 2019-01-08 09:57:32 UTC
Tested on the following version:

openshift v3.10.97
kubernetes v1.10.0+b81c8f8

# uname -a
Linux ip-172-18-0-253.ec2.internal 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
Red Hat Enterprise Linux Atomic Host release 7.5

The pod using the NFS volume is running.

# oc get pv -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: 2019-01-08T08:28:57Z
    finalizers:
    - kubernetes.io/pv-protection
    name: nfs-y49ir
    namespace: ""
    resourceVersion: "40305"
    selfLink: /api/v1/persistentvolumes/nfs-y49ir
    uid: 7117a724-131f-11e9-9a81-0e18e55051b0
  spec:
    accessModes:
    - ReadWriteMany
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: nfsc
      namespace: y49ir
      resourceVersion: "40303"
      uid: 732b270c-131f-11e9-9a81-0e18e55051b0
    nfs:
      path: /
    persistentVolumeReclaimPolicy: Retain
  status:
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc get pods
NAME         READY     STATUS    RESTARTS   AGE
nfs          1/1       Running   0          1h

Comment 23 Jan Safranek 2019-01-11 10:14:52 UTC
This bug is fixed in Errata RHBA-2019:0026.

Comment 24 David Schweikert 2019-01-28 12:17:12 UTC
Is this bugfix really released?

- RHBA-2019:0026 doesn't mention this bug
- The latest released version seems to be 3.10.89, and the test above was done with 3.10.97

Comment 25 Jan Safranek 2019-02-05 09:29:01 UTC
Yes, 3.10.89 is the right release:

* Mon Dec 17 2018 AOS Automation Release Team <aos-team-art@redhat.com> 3.10.89-1 
- UPSTREAM: 62304: Remove isNotDir error check (jsafrane@redhat.com) 

Do you still experience any issues in this area?
