Description of problem:

What we are seeing is that when a pod lands on a server where it hasn't previously run, the image pull fails, leading to a back-off:

Mar 16 16:45:08 server-name atomic-openshift-node[5553]: I0316 16:45:08.883263 5553 docker_manager.go:1783] Need to restart pod infra container for "app-2-chjvl_the-app-prod(96863a37-0aa2-11e7-b4f5-0050568379a2)" because it is not found
Mar 16 16:45:09 server-name atomic-openshift-node[5553]: I0316 16:45:09.580560 5553 kubelet.go:2680] SyncLoop (PLEG): "app-2-chjvl_the-app-prod(96863a37-0aa2-11e7-b4f5-0050568379a2)", event: &pleg.PodLifecycleEvent{ID:"96863a37-0aa2-11e7-b4f5-0050568379a2", Type:"ContainerStarted", Data:"362eafa428eae765117afa4f8bc5297cf512375612f50a2123a6d204535e5b5c"}
Mar 16 16:45:09 server-name ovs-vsctl[51217]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vethd1c0f29
Mar 16 16:45:10 server-name atomic-openshift-node[5553]: I0316 16:45:10.976227 5553 provider.go:119] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: I0316 16:45:11.136301 5553 kube_docker_client.go:295] Stop pulling image "172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59": "Trying to pull repository 172.50.0.2:5000/the-app-tools/app ... "
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.136345 5553 docker_manager.go:2134] container start failed: ErrImagePull: manifest unknown: manifest unknown
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.136489 5553 pod_workers.go:183] Error syncing pod 96863a37-0aa2-11e7-b4f5-0050568379a2, skipping: failed to "StartContainer" for "the-app" with ErrImagePull: "manifest unknown: manifest unknown"
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: I0316 16:45:11.658992 5553 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/secret/96863a37-0aa2-11e7-b4f5-0050568379a2-default-token-p2mgj" (spec.Name: "default-token-p2mgj") to pod "96863a37-0aa2-11e7-b4f5-0050568379a2" (UID: "96863a37-0aa2-11e7-b4f5-0050568379a2"). Volume is already mounted to pod, but remount was requested.
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: I0316 16:45:11.665314 5553 operation_executor.go:749] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/96863a37-0aa2-11e7-b4f5-0050568379a2-default-token-p2mgj" (spec.Name: "default-token-p2mgj") pod "96863a37-0aa2-11e7-b4f5-0050568379a2" (UID: "96863a37-0aa2-11e7-b4f5-0050568379a2").
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.924919 5553 docker_manager.go:2134] container start failed: ImagePullBackOff: Back-off pulling image "172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59"
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.924969 5553 pod_workers.go:183] Error syncing pod 96863a37-0aa2-11e7-b4f5-0050568379a2, skipping: failed to "StartContainer" for "the-app" with ImagePullBackOff: "Back-off pulling image \"172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59\""

Whereas if the pod came back up on a server where the image already existed, it started fine.
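For reference, the registry error can presumably be reproduced directly on an affected node by pulling the digest by hand; illustrative command only, with the digest taken from the DC trigger shown further down:

# On the failing node: this should return the same "manifest unknown" error straight from the registry
docker pull 172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59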
On one of the servers where the image already exists, the container is running fine:

server-name | SUCCESS | rc=0 >>
003ab9db4e62 172.50.0.2:5000/the-app-tools/ap@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59 "container-entrypoint" 13 hours ago Up 13 hours k8s_the-app.2b378b9e_app-4-r6zxh_the-app-prod_05966eee-0abf-11e7-b4f5-0050568379a2_c1e1d35e

A project is set up as follows, with 4 projects / namespaces per group:

the-app-dev - dev - Here you will find deployment configs with an image change trigger pointing to the-app-tools/app:dev
the-app-test - test - Here you will find deployment configs with an image change trigger pointing to the-app-tools/app:test
the-app-prod - prod - Here you will find deployment configs with an image change trigger pointing to the-app-tools/app:prod
the-app-tools - tooling - This contains the buildconfigs, imagestream, imagestreamtags, images, etc.

When code is committed, a hook triggers the build in tools and the result is pushed to the-app-tools/app:latest and tagged to :dev. When it's promoted to test, it's tagged to :test, and then on to :prod.

Looking at the imagestream in -n the-app-tools, there are only dev and latest tags - test and prod are gone, and obviously no image with that sha exists anymore. The imagestreamtags in -n the-app-tools again only show dev and latest.

The DC trigger for the-app-prod shows an image that doesn't exist, from an imagestreamtag that doesn't exist anymore either:

  triggers:
  - imageChangeParams:
      automatic: true
      containerNames:
      - the-app
      from:
        kind: ImageStreamTag
        name: app:prod
        namespace: the-app-tools
      lastTriggeredImage: 172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59
    type: ImageChange
  - type: ConfigChange

A prune runs every night for builds, deployments and images:

/bin/oadm prune images --keep-younger-than=96h --confirm

A docker inspect of the running container, which points to the image that no longer exists, looks like:

    "Image": "sha256:9dee5d601febb8a89ad69684825a1a529db6a87125f0db50666c40aa5c5f3f36",
    ...
    "Config": {
        ...
        "Image": "172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59",

Version-Release number of selected component (if applicable):
3.3.1.7

How reproducible:
Unsure of cause, but prevalent across all projects.

Steps to Reproduce:
1.
2.
3.

Actual results:
Images, imagestreamtags, and imagestreams are missing in cases where there are still running pods and references in deployment configs.

Expected results:
They should still exist.

Additional info:
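Presumably the missing tags could be recreated by hand so the prod DC resolves a digest that still exists; a minimal sketch, assuming :latest (or :dev) still points at an image we are happy to promote:

# Recreate the missing tags from a surviving tag (the source tag here is an assumption)
oc tag the-app-tools/app:latest the-app-tools/app:test
oc tag the-app-tools/app:latest the-app-tools/app:prod

Because the imageChangeParams trigger above has automatic: true, re-tagging :prod should kick off a new deployment in the-app-prod on its own.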
In parallel we should assess two possibilities:
1. Prune ran on partial data and did not find all references (see the dry-run sketch below).
2. Someone accidentally deleted these.
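For possibility 1, one hedged way to check is to run the same prune without --confirm, which only reports what it would remove, and compare that list against digests still referenced by running pods and DC triggers. Illustrative commands only (--keep-tag-revisions shown at its default value):

# Dry run: without --confirm, oadm prune images only prints the candidates, it does not delete
/bin/oadm prune images --keep-younger-than=96h --keep-tag-revisions=3

# Does the cluster still know about the digest the prod DC last triggered on?
oc get images | grep 55a7243f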
@matthew Manifests have recently started being removed from etcd in OSE 3.3 clusters; however, that functionality was introduced in v3.3.1.14, and the version reported in this bz is 3.3.1.7. Is it possible that a newer docker-registry image is being used while the cluster is still at 3.3.1.7?

This seems to be a different problem, though. The error message refers to the ImageStreamImage not being found. If the image in question exists, it means that the image is no longer tagged in the mem-tfrs-tools/tfrs image stream. So far I have no idea why that would happen.
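To rule out a registry/cluster version mismatch and to see what the image stream still holds, something like the following could be checked (illustrative commands, run as cluster-admin; the <digest> placeholder is whatever digest the failing pod references):

# Which docker-registry image is actually deployed?
oc get dc docker-registry -n default -o yaml | grep "image:"

# Cluster version for comparison
oc version

# Which tags/digests are left in the image stream the error refers to?
oc describe is tfrs -n mem-tfrs-tools

# Look the image up directly through the ImageStreamImage API
oc get isimage tfrs@sha256:<digest> -n mem-tfrs-tools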
Since https://bugzilla.redhat.com/show_bug.cgi?id=1439926 was verified, I'm moving this to ON_QA to see if it helps here as well.
I could not recreate this problem, and based on the comments above this bug has the same root cause as Bug 1439926, so I am closing it since Bug 1439926 was verified on 3.3.
I've copied and pasted the doc text from Bug 1439926.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1129