Bug 1550372

Summary: NFS volume recycle failed with ErrImagePull
Product: OpenShift Container Platform
Reporter: Qin Ping <piqin>
Component: Release
Assignee: David Eads <deads>
Status: CLOSED ERRATA
QA Contact: Qin Ping <piqin>
Severity: medium
Priority: medium
Version: 3.9.0
CC: aos-bugs, aos-storage-staff, bchilds, bugzilla.com, byount, chrkim, fbrychta, hekumar, joelsmith, jokerman, jupierce, mmccomas, smunilla
Target Milestone: ---
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-05-17 06:42:42 UTC
Type: Bug

Description Qin Ping 2018-03-01 06:26:10 UTC
Description of problem:
NFS volume recycling fails because the recycler pod hits ErrImagePull.

Version-Release number of selected component (if applicable):
openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657

How reproducible:
always

Steps to Reproduce:
1. create a PV with persistentVolumeReclaimPolicy=Recycle
2. create a PVC using the PV created above
3. delete PVC
4. check pv status
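
A minimal sketch of the commands for the steps above, assuming the PV and PVC dumps below are saved as pv.json and pvc.json (file names are illustrative):
# oc create -f pv.json
# oc create -f pvc.json
# oc delete pvc nfsc1
# oc get pv nfs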

Actual results:
PV status stays Released; the recycler pod fails with ErrImagePull and never scrubs the volume.


Expected results:
PV status returns to Available after the recycler scrubs the volume.

Master Log:

Node Log (of failed PODs):
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 atomic-openshift-node[14615]: I0228 03:04:38.947835   14625 kuberuntime_manager.go:514] Container {Name:pv-recycler Image:registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1 Command:[/usr/bin/openshift-recycle] Args:[/scrub] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:vol ReadOnly:false MountPath:/scrub SubPath: MountPropagation:<nil>} {Name:pv-recycler-controller-token-9mdw7 ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[MKNOD],},Privileged:nil,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 atomic-openshift-node[14615]: I0228 03:04:38.948013   14625 kuberuntime_manager.go:725] Creating container &Container{Name:pv-recycler,Image:registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1,Command:[/usr/bin/openshift-recycle],Args:[/scrub],WorkingDir:,Ports:[],Env:[],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[{vol false /scrub  <nil>} {pv-recycler-controller-token-9mdw7 true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}],LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[MKNOD],},Privileged:nil,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[],TerminationMessagePolicy:File,VolumeDevices:[],} in pod recycler-for-nfs-hkbss_openshift-infra(b27a548b-1c5d-11e8-86a3-42010af0006b)
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 atomic-openshift-node[14615]: I0228 03:04:38.950448   14625 kuberuntime_manager.go:732] container start failed: ImagePullBackOff: Back-off pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1"
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 atomic-openshift-node[14615]: E0228 03:04:38.950488   14625 pod_workers.go:186] Error syncing pod b27a548b-1c5d-11e8-86a3-42010af0006b ("recycler-for-nfs-hkbss_openshift-infra(b27a548b-1c5d-11e8-86a3-42010af0006b)"), skipping: failed to "StartContainer" for "pv-recycler" with ImagePullBackOff: "Back-off pulling image \"registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1\""
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 atomic-openshift-node[14615]: I0228 03:04:38.950982   14625 server.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-infra", Name:"recycler-for-nfs-hkbss", UID:"b27a548b-1c5d-11e8-86a3-42010af0006b", APIVersion:"v1", ResourceVersion:"54243", FieldPath:"spec.containers{pv-recycler}"}): type: 'Normal' reason: 'BackOff' Back-off pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1"
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 runc[12519]: time="2018-02-28T03:04:38.949591095-05:00" level=error msg="Handler for GET /v1.26/images/registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1/json returned error: No such image: registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1"
Feb 28 03:04:38 qe-gpei-test4master-etcd-zone1-1 runc[12519]: time="2018-02-28T03:04:38.949997350-05:00" level=error msg="Handler for GET /v1.26/images/registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1/json returned error: No such image: registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1"



PV Dump:
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": {
    "name": "nfs",
    "labels": {
      "usedFor": "tc522215"
    }
  },
  "spec": {
    "capacity": {
        "storage": "5Gi"
    },
    "accessModes": [ "ReadWriteMany" ],
    "nfs": {
        "path": "/",
        "server": "172.30.163.146"
    },
    "persistentVolumeReclaimPolicy": "Recycle"
  }
}

PVC Dump:
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "nfsc1",
        "labels": {
            "usedFor": "tc522215"
        }
    },
    "spec": {
        "accessModes": [ "ReadWriteMany" ],
        "resources": {
            "requests": {
                "storage": "5Gi"
            }
        }
    }
}

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
# oc describe pod recycler-for-c8a69 -n openshift-infra
Name:         recycler-for-c8a69
Namespace:    openshift-infra
Node:         172.16.120.78/
Start Time:   Thu, 01 Mar 2018 01:15:24 -0500
Labels:       <none>
Annotations:  openshift.io/scc=hostmount-anyuid
Status:       Failed
Reason:       DeadlineExceeded
Message:      Pod was active on the node longer than the specified deadline
IP:           
Containers:
  pv-recycler:
    Image:  registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1
    Port:   <none>
    Command:
      /usr/bin/openshift-recycle
    Args:
      /scrub
    Environment:  <none>
    Mounts:
      /scrub from vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pv-recycler-controller-token-qfkqr (ro)
Volumes:
  vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    172.30.163.146
    Path:      /
    ReadOnly:  false
  pv-recycler-controller-token-qfkqr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pv-recycler-controller-token-qfkqr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age               From                    Message
  ----     ------                 ----              ----                    -------
  Normal   SuccessfulMountVolume  9m                kubelet, 172.16.120.78  MountVolume.SetUp succeeded for volume "pv-recycler-controller-token-qfkqr"
  Normal   SuccessfulMountVolume  9m                kubelet, 172.16.120.78  MountVolume.SetUp succeeded for volume "vol"
  Normal   Scheduled              9m                default-scheduler       Successfully assigned recycler-for-c8a69 to 172.16.120.78
  Normal   Pulling                9m                kubelet, 172.16.120.78  pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1"
  Warning  Failed                 9m                kubelet, 172.16.120.78  Failed to pull image "registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1": rpc error: code = Unknown desc = Error: image openshift3/ose-recycler:v1.9.1 not found
  Warning  Failed                 9m                kubelet, 172.16.120.78  Error: ErrImagePull
  Normal   SandboxChanged         8m (x20 over 9m)  kubelet, 172.16.120.78  Pod sandbox changed, it will be killed and re-created.
  Normal   DeadlineExceeded       4m (x2 over 4m)   kubelet, 172.16.120.78  Pod was active on the node longer than the specified deadline

Comment 5 Justin Pierce 2018-04-10 12:52:56 UTC
Why v1.9.1 for the image tag?

The recycler was shipped with 3.9 with the tags listed here: 
https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-recycler
(e.g. v3.9, v3.9.14)

Comment 6 Filip Brychta 2018-04-12 08:39:01 UTC
I see the same issue in our v3.9.14 instance. It is pulling ose-recycler:v1.9.1 for some reason. Is there any workaround for this? Can I update some template so that it pulls ose-recycler:latest instead?

Comment 7 Justin Pierce 2018-04-12 15:19:50 UTC
It looks like this would require a change to the recycler pod you are using, which is configured via controller arguments:
https://docs.openshift.com/container-platform/3.6/architecture/additional_concepts/storage.html

:latest should be relatively safe since I don't believe this image varies between releases.

Comment 8 Qin Ping 2018-04-13 03:32:29 UTC
Verified this issue in OCP v3.9.20; still getting the same error.

The image tag v3.9.20 for ose-recycler exists, and the following command runs successfully:
docker pull registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v3.9.20

Comment 14 Hemant Kumar 2018-04-16 18:36:18 UTC
Another workaround available to users is to define an environment variable:

export OPENSHIFT_RECYCLER_IMAGE="openshift/origin-recycler:v3.9.0"

or
export OPENSHIFT_RECYCLER_IMAGE="openshift3/ose-recycler:v3.9.20"

Comment 15 Hemant Kumar 2018-04-16 19:46:26 UTC
Opened a PR to fix this in 3.9 so that users don't have to use the environment variable: https://github.com/openshift/origin/pull/19374

Comment 17 Qin Ping 2018-04-17 05:10:18 UTC
The v1.9.1 tag workaround works for me.
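
For reference, one way to apply the v1.9.1 tag workaround on each node is to pull a tag that does exist and re-tag it locally as v1.9.1, so the kubelet's IfNotPresent pull policy finds the image (a sketch, assuming the v3.9.20 tag verified in comment 8):
# docker pull registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v3.9.20
# docker tag registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v3.9.20 registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v1.9.1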

A real fix is still needed; since the PR is not yet merged, changing the bug status back to "ASSIGNED".

Comment 18 Hemant Kumar 2018-04-18 15:10:36 UTC
https://github.com/openshift/origin/pull/19406

Comment 20 Qin Ping 2018-04-20 02:29:45 UTC
Verified in OCP:
oc v3.9.24
openshift v3.9.24
kubernetes v1.9.1+a0ce1bc657

# uname -a
Linux host-172-16-120-35 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Comment 23 errata-xmlrpc 2018-05-17 06:42:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1566

Comment 24 bugzilla.com 2018-06-18 04:07:33 UTC
(In reply to Hemant Kumar from comment #14)
> Another workaround available to users is - defining environment variable:
> 
> export OPENSHIFT_RECYCLER_IMAGE="openshift/origin-recycler:v3.9.0"
> 
> or
> export OPENSHIFT_RECYCLER_IMAGE="openshift3/ose-recycler:v3.9.20"

@Hemant, does one define this environment variable on every app node, and with the root account?

Comment 25 Hemant Kumar 2018-07-17 16:24:36 UTC
This should be defined on the master node where the controller manager runs.
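
For example, on an RPM/systemd based install this could be set in the controllers' sysconfig file and the service restarted (a sketch; the file and unit names assume the standard atomic-openshift packaging, so adjust for your install type):
# echo 'OPENSHIFT_RECYCLER_IMAGE=registry.reg-aws.openshift.com:443/openshift3/ose-recycler:v3.9.20' >> /etc/sysconfig/atomic-openshift-master-controllers
# systemctl restart atomic-openshift-master-controllers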