Description of problem:

What we are seeing is that when a pod lands on a server where it hasn't previously run, the image pull fails, leading to a back-off:

Mar 16 16:45:08 server-name atomic-openshift-node[5553]: I0316 16:45:08.883263 5553 docker_manager.go:1783] Need to restart pod infra container for "app-2-chjvl_the-app-prod(96863a37-0aa2-11e7-b4f5-0050568379a2)" because it is not found
Mar 16 16:45:09 server-name atomic-openshift-node[5553]: I0316 16:45:09.580560 5553 kubelet.go:2680] SyncLoop (PLEG): "app-2-chjvl_the-app-prod(96863a37-0aa2-11e7-b4f5-0050568379a2)", event: &pleg.PodLifecycleEvent{ID:"96863a37-0aa2-11e7-b4f5-0050568379a2", Type:"ContainerStarted", Data:"362eafa428eae765117afa4f8bc5297cf512375612f50a2123a6d204535e5b5c"}
Mar 16 16:45:09 server-name ovs-vsctl[51217]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vethd1c0f29
Mar 16 16:45:10 server-name atomic-openshift-node[5553]: I0316 16:45:10.976227 5553 provider.go:119] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: I0316 16:45:11.136301 5553 kube_docker_client.go:295] Stop pulling image "172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59": "Trying to pull repository 172.50.0.2:5000/the-app-tools/app ... "
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.136345 5553 docker_manager.go:2134] container start failed: ErrImagePull: manifest unknown: manifest unknown
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.136489 5553 pod_workers.go:183] Error syncing pod 96863a37-0aa2-11e7-b4f5-0050568379a2, skipping: failed to "StartContainer" for "the-app" with ErrImagePull: "manifest unknown: manifest unknown"
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: I0316 16:45:11.658992 5553 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/secret/96863a37-0aa2-11e7-b4f5-0050568379a2-default-token-p2mgj" (spec.Name: "default-token-p2mgj") to pod "96863a37-0aa2-11e7-b4f5-0050568379a2" (UID: "96863a37-0aa2-11e7-b4f5-0050568379a2"). Volume is already mounted to pod, but remount was requested.
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: I0316 16:45:11.665314 5553 operation_executor.go:749] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/96863a37-0aa2-11e7-b4f5-0050568379a2-default-token-p2mgj" (spec.Name: "default-token-p2mgj") pod "96863a37-0aa2-11e7-b4f5-0050568379a2" (UID: "96863a37-0aa2-11e7-b4f5-0050568379a2").
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.924919 5553 docker_manager.go:2134] container start failed: ImagePullBackOff: Back-off pulling image "172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59"
Mar 16 16:45:11 server-name atomic-openshift-node[5553]: E0316 16:45:11.924969 5553 pod_workers.go:183] Error syncing pod 96863a37-0aa2-11e7-b4f5-0050568379a2, skipping: failed to "StartContainer" for "the-app" with ImagePullBackOff: "Back-off pulling image \"172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59\""

Whereas if the pod came back up on a server where the image already existed, it started fine.
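For reference, the registry error can presumably be reproduced directly on an affected node by pulling the digest by hand; illustrative command only, with the digest taken from the DC trigger shown further down:

# On the failing node: this should return the same "manifest unknown" error straight from the registry
docker pull 172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59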
On one of the servers where the image already exists, the container is running fine:

server-name | SUCCESS | rc=0 >>
003ab9db4e62 172.50.0.2:5000/the-app-tools/ap@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59 "container-entrypoint" 13 hours ago Up 13 hours k8s_the-app.2b378b9e_app-4-r6zxh_the-app-prod_05966eee-0abf-11e7-b4f5-0050568379a2_c1e1d35e

A project is set up as follows, with 4 projects / namespaces per group:

the-app-dev - dev - Here you will find deployment configs with an image change trigger pointing to the-app-tools/app:dev
the-app-test - test - Here you will find deployment configs with an image change trigger pointing to the-app-tools/app:test
the-app-prod - prod - Here you will find deployment configs with an image change trigger pointing to the-app-tools/app:prod
the-app-tools - tooling - This contains the buildconfigs, imagestream, imagestreamtags, images, etc.

When code is committed, a hook triggers the build in tools and the result is pushed to the-app-tools/app:latest and tagged to :dev. When it's promoted to test, it's tagged to :test, and then on to :prod.

Looking at the imagestream in -n the-app-tools, there are only dev and latest tags - test and prod are gone, and obviously no image with that sha exists anymore. The imagestreamtags in -n the-app-tools again only show dev and latest.

The DC trigger for the-app-prod shows an image that doesn't exist, from an imagestreamtag that doesn't exist anymore either:

  triggers:
  - imageChangeParams:
      automatic: true
      containerNames:
      - the-app
      from:
        kind: ImageStreamTag
        name: app:prod
        namespace: the-app-tools
      lastTriggeredImage: 172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59
    type: ImageChange
  - type: ConfigChange

A prune runs every night for builds, deployments and images:

/bin/oadm prune images --keep-younger-than=96h --confirm

A docker inspect of the running container, which points to the image that no longer exists, looks like:

    "Image": "sha256:9dee5d601febb8a89ad69684825a1a529db6a87125f0db50666c40aa5c5f3f36",
    ...
    "Config": {
        ...
        "Image": "172.50.0.2:5000/the-app-tools/app@sha256:55a7243f2ece49c12be4d5506f6f437dd8603c0b9fc649a57f648086c8692f59",

Version-Release number of selected component (if applicable):
3.3.1.7

How reproducible:
Unsure of cause, but prevalent across all projects.

Steps to Reproduce:
1.
2.
3.

Actual results:
Images, imagestreamtags, and imagestreams are missing in cases where there are still running pods and references in deployment configs.

Expected results:
They should still exist.

Additional info:
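Presumably the missing tags could be recreated by hand so the prod DC resolves a digest that still exists; a minimal sketch, assuming :latest (or :dev) still points at an image we are happy to promote:

# Recreate the missing tags from a surviving tag (the source tag here is an assumption)
oc tag the-app-tools/app:latest the-app-tools/app:test
oc tag the-app-tools/app:latest the-app-tools/app:prod

Because the imageChangeParams trigger above has automatic: true, re-tagging :prod should kick off a new deployment in the-app-prod on its own.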
In parallel we should assess two possibilities:
1. Prune ran on partial data and did not find all references (see the dry-run sketch below).
2. Someone accidentally deleted these.
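For possibility 1, one hedged way to check is to run the same prune without --confirm, which only reports what it would remove, and compare that list against digests still referenced by running pods and DC triggers. Illustrative commands only (--keep-tag-revisions shown at its default value):

# Dry run: without --confirm, oadm prune images only prints the candidates, it does not delete
/bin/oadm prune images --keep-younger-than=96h --keep-tag-revisions=3

# Does the cluster still know about the digest the prod DC last triggered on?
oc get images | grep 55a7243f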
@matthew Manifests have recently started being removed from etcd in OSE 3.3 clusters; however, that functionality was introduced in v3.3.1.14, and the version reported in this bz is 3.3.1.7. Is it possible that a newer docker-registry image is being used while the cluster is still at 3.3.1.7?

This seems to be a different problem, though. The error message refers to the ImageStreamImage not being found. If the image in question exists, it means that the image is no longer tagged in the mem-tfrs-tools/tfrs image stream. So far I have no idea why that would happen.
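To rule out a registry/cluster version mismatch and to see what the image stream still holds, something like the following could be checked (illustrative commands, run as cluster-admin; the <digest> placeholder is whatever digest the failing pod references):

# Which docker-registry image is actually deployed?
oc get dc docker-registry -n default -o yaml | grep "image:"

# Cluster version for comparison
oc version

# Which tags/digests are left in the image stream the error refers to?
oc describe is tfrs -n mem-tfrs-tools

# Look the image up directly through the ImageStreamImage API
oc get isimage tfrs@sha256:<digest> -n mem-tfrs-tools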
Since https://bugzilla.redhat.com/show_bug.cgi?id=1439926 was verified, I'm moving this to ON_QA to see if it helps here as well.
I could not recreate this problem, and based on the comments above this bug has the same root cause as Bug 1439926, so I am closing it since Bug 1439926 was verified on 3.3.
I've copied and pasted the doc text from Bug 1439926.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1129