Bug 1656927

Summary: PV(NFS) stop working after upgrade to 3.10.72 in atomic host
Product: OpenShift Container Platform Reporter: Nicolas Nosenzo <nnosenzo>
Component: Storage Assignee: Jan Safranek <jsafrane>
Status: CLOSED CURRENTRELEASE QA Contact: Wenqi He <wehe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.10.0 CC: andcosta, aos-bugs, aos-storage-staff, david.schweikert, jokerman, jsafrane, lxia, mmccomas, sreber, thomas.schilling, tsmetana
Target Milestone: --- Keywords: Regression
Target Release: 3.10.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-01-11 10:14:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Nicolas Nosenzo 2018-12-06 16:43:12 UTC
Created attachment 1512194
node journal logs

Description of problem:
After the cluster upgrade to 3.10.72 (from 3.10.66), all NFS-backed PVs fail to mount in containers, even though the NFS export can be mounted from the nodes without issues.

Error:

Dec 06 11:19:54 node5.example.net atomic-openshift-node[21705]: I1206 11:19:54.606296   21717 reconciler.go:252] operationExecutor.MountVolume started for volume "example-pv" (UniqueName: "kubernetes.io/nfs/52f4d86d-f93f-11e8-94a6-0050569df267-example-pv") pod "mortimer-db-7d595fc8fb-nf2w5" (UID: "52f4d86d-f93f-11e8-94a6-0050569df267")
Dec 06 11:19:54 node5.example.net atomic-openshift-node[21705]: I1206 11:19:54.608059   21717 nsenter.go:151] failed to resolve symbolic links on /var/lib/origin/openshift.local.volumes/pods/52f4d86d-f93f-11e8-94a6-0050569df267/volumes/kubernetes.io~nfs/example-pv: exit status 1
Dec 06 11:19:54 node5.example.net atomic-openshift-node[21705]: E1206 11:19:54.608142   21717 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/52f4d86d-f93f-11e8-94a6-0050569df267-example-pv\" (\"52f4d86d-f93f-11e8-94a6-0050569df267\")" failed. No retries permitted until 2018-12-06 11:21:56.608117627 +0100 CET m=+1159.964741283 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"example-pv\" (UniqueName: \"kubernetes.io/nfs/52f4d86d-f93f-11e8-94a6-0050569df267-example-pv\") pod \"mortimer-db-7d595fc8fb-nf2w5\" (UID: \"52f4d86d-f93f-11e8-94a6-0050569df267\") : exit status 1"
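
To find every affected volume directory on a node, the journal can be filtered for the nsenter message shown above. The following is a minimal sketch, not part of the original report: the function name find_symlink_failures and the regular expression are illustrative, and the pattern assumes the message keeps the exact wording seen in this log.

```python
import re
import sys

# Pattern for the nsenter failure line quoted in this report.
# Assumption: the wording "failed to resolve symbolic links on <path>: <reason>"
# is stable across the affected 3.10 builds.
SYMLINK_FAILURE = re.compile(
    r"nsenter\.go:\d+\] failed to resolve symbolic links on "
    r"(?P<path>\S+): (?P<reason>.+)$"
)


def find_symlink_failures(lines):
    """Return (path, reason) pairs for each journal line matching the failure."""
    failures = []
    for line in lines:
        match = SYMLINK_FAILURE.search(line.rstrip("\n"))
        if match:
            failures.append((match.group("path"), match.group("reason")))
    return failures


if __name__ == "__main__":
    # Reads journal text from standard input and prints one failure per line.
    for path, reason in find_symlink_failures(sys.stdin):
        print(f"{path}\t{reason}")
```

Assuming the node service runs as the systemd unit atomic-openshift-node (the identifier seen in the log lines above), the script could be fed from the journal of that unit, or from the attached node journal logs saved to a file.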


Version-Release number of selected component (if applicable):

Atomic Host: redhat-release-atomic-host-7.6-20180503.0.atomic.el7.1.x86_64
Docker: docker-1.13.1-84.git07f3374.el7.x86_64 / container-selinux-2.74-1
Ansible playbook used for the update: openshift-ansible-3.10.73-1.git.0.8b65cea.el7.noarch

How reproducible:
Cannot reproduce on a RHEL-hosted cluster

Additional info:
- virt_use_nfs is enabled

Comment 10 Jan Safranek 2018-12-17 13:32:17 UTC
Origin PR: https://github.com/openshift/origin/pull/21672

Comment 15 Jan Safranek 2019-01-02 09:57:40 UTC
>> Merged into Origin 3.9, it will be part of the next OSE 3.9.z

> I think you meant *3.10*, just to avoid any confusion.

Sorry, I indeed meant 3.10.z.


> What we need to know is whether the issue affects only OpenShift Container Platform 3.10 or 3.11

Only 3.10 is affected, 3.11 should be fine.

> if a fix can be pushed to become available within the next OpenShift Container Platform 3.10 Errata.

Yes, as noted above, the patch has been merged into 3.10.z

Comment 18 Thomas Schilling 2019-01-07 16:29:45 UTC
The same issue actually also affects our 3.9 installation.

Upgrading from 3.9.43 -> 3.9.57 broke it.

Is there a chance this could be backported to the 3.9 branch (+ new patch release)?

Comment 19 Jan Safranek 2019-01-07 17:19:15 UTC
> Is there a chance this could be backported to the 3.9 branch (+ new patch release)?

Tracked as https://bugzilla.redhat.com/show_bug.cgi?id=1663260

Comment 21 Wenqi He 2019-01-08 09:57:32 UTC
Tested on the version below:

openshift v3.10.97
kubernetes v1.10.0+b81c8f8

# uname -a
Linux ip-172-18-0-253.ec2.internal 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
Red Hat Enterprise Linux Atomic Host release 7.5

Pod using NFS volume is running.

# oc get pv -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: 2019-01-08T08:28:57Z
    finalizers:
    - kubernetes.io/pv-protection
    name: nfs-y49ir
    namespace: ""
    resourceVersion: "40305"
    selfLink: /api/v1/persistentvolumes/nfs-y49ir
    uid: 7117a724-131f-11e9-9a81-0e18e55051b0
  spec:
    accessModes:
    - ReadWriteMany
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: nfsc
      namespace: y49ir
      resourceVersion: "40303"
      uid: 732b270c-131f-11e9-9a81-0e18e55051b0
    nfs:
      path: /
      server: 172.30.227.52
    persistentVolumeReclaimPolicy: Retain
  status:
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc get pods
NAME         READY     STATUS    RESTARTS   AGE
nfs          1/1       Running   0          1h

Comment 23 Jan Safranek 2019-01-11 10:14:52 UTC
This bug is fixed in Errata RHBA-2019:0026

Comment 24 David Schweikert 2019-01-28 12:17:12 UTC
Is this bugfix really released?

- RHBA-2019:0026 doesn't mention this bug
- The latest released version seems to be 3.10.89 and the test above was done with 3.10.97

Comment 25 Jan Safranek 2019-02-05 09:29:01 UTC
Yes, 3.10.89 is the right release:

* Mon Dec 17 2018 AOS Automation Release Team <aos-team-art> 3.10.89-1 
- UPSTREAM: 62304: Remove isNotDir error check (jsafrane) 

Do you still experience any issues in this area?
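
For checking whether a given 3.10.z build already carries the patch, a numeric version comparison avoids the pitfalls of comparing version strings as text. This is a hedged sketch only: the helper names parse_version and includes_fix are hypothetical, and it assumes the 3.10.89-1 changelog entry quoted above marks the first 3.10 build with the fix. Other branches (for example 3.9, tracked in bug 1663260) are deliberately reported as not covered.

```python
def parse_version(text):
    """Turn 'v3.10.97' or '3.10.89-1' into a tuple of integers, e.g. (3, 10, 97)."""
    core = text.lstrip("v").split("-", 1)[0]  # drop a leading 'v' and the release suffix
    return tuple(int(part) for part in core.split("."))


# Assumption: first 3.10 build containing the fix, per the changelog entry above.
FIRST_FIXED = parse_version("3.10.89-1")


def includes_fix(version):
    """True only for 3.10.z versions at or after the first fixed build."""
    parsed = parse_version(version)
    same_branch = parsed[:2] == FIRST_FIXED[:2]
    return same_branch and parsed >= FIRST_FIXED
```

With these assumptions, the tested version v3.10.97 is reported as containing the fix, while 3.10.72 is not.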