ci-operator creates a single image stream with 50+ tags in one call. On 3.10 this works fine, but when we tried to upgrade api.ci to 3.11, some of the tags were not imported (the spec tag was set, but the status tag was missing). Investigating now; my first suspect is the collapseStatusTags function added when image layers were added, but since everything happens in one call, I somewhat doubt that is the cause. This blocks upgrading api.ci to 3.11.
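For context, here is a minimal sketch of what that symptom looks like in the API object, assuming the openshift/api image/v1 Go types; the tag names are illustrative, not the actual ci-operator tags. It lists spec tags that have no imported status tag ("spec tag set, status tag missing"):

package main

import (
	"fmt"

	imagev1 "github.com/openshift/api/image/v1"
)

// missingStatusTags returns the spec tags of an image stream that have no
// corresponding status tag, i.e. tags whose import never completed.
func missingStatusTags(is *imagev1.ImageStream) []string {
	imported := map[string]bool{}
	for _, t := range is.Status.Tags {
		if len(t.Items) > 0 { // at least one tag event means the import landed
			imported[t.Tag] = true
		}
	}
	var missing []string
	for _, t := range is.Spec.Tags {
		if !imported[t.Name] {
			missing = append(missing, t.Name)
		}
	}
	return missing
}

func main() {
	// Toy example: two spec tags, only one of which has a status entry.
	is := &imagev1.ImageStream{
		Spec:   imagev1.ImageStreamSpec{Tags: []imagev1.TagReference{{Name: "cli"}, {Name: "hyperkube"}}},
		Status: imagev1.ImageStreamStatus{Tags: []imagev1.NamedTagEventList{{Tag: "cli", Items: []imagev1.TagEvent{{}}}}},
	}
	fmt.Println(missingStatusTags(is)) // prints [hyperkube]
}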
Appears to be a timeout on image stream import. I suspect the long-running request exception for imagestreamimport got lost in the rebase, OR we suddenly got a bit slower and then hit the long-running request timeout (but that should never fire, so I think it's the former).

I0725 15:37:09.364025 1 imagestream_controller.go:158] Queued import of stream ci-op-i380vmvm/stable...
I0725 15:37:09.364134 1 imagestream_controller.go:235] Importing stream ci-op-i380vmvm/stable partial=true...
E0725 15:37:39.372550 1 imagestream_controller.go:133] Error syncing image stream "ci-op-i380vmvm/stable": Timeout: request did not complete within allowed duration
I0725 15:37:39.453068 1 imagestream_controller.go:158] Queued import of stream ci-op-i380vmvm/stable...
I0725 15:37:39.453092 1 imagestream_controller.go:235] Importing stream ci-op-i380vmvm/stable partial=true...
E0725 15:38:09.464802 1 imagestream_controller.go:133] Error syncing image stream "ci-op-i380vmvm/stable": Timeout: request did not complete within allowed duration
I0725 15:38:09.625433 1 imagestream_controller.go:158] Queued import of stream ci-op-i380vmvm/stable...
I0725 15:38:09.625484 1 imagestream_controller.go:235] Importing stream ci-op-i380vmvm/stable partial=true...
E0725 15:38:39.645567 1 imagestream_controller.go:133] Error syncing image stream "ci-op-i380vmvm/stable": Timeout: request did not complete within allowed duration
I0725 15:38:39.965799 1 imagestream_controller.go:158] Queued import of stream ci-op-i380vmvm/stable...
I0725 15:38:39.965830 1 imagestream_controller.go:235] Importing stream ci-op-i380vmvm/stable partial=true...
That's a 30s timeout, which shouldn't be applied to image stream import. I also don't see it showing up in long running requests:

$ oc get --raw /metrics | grep longrunning | grep -v WATCH
# HELP apiserver_longrunning_gauge Gauge of all active long-running apiserver requests broken out by verb, API resource, and scope. Not all requests are tracked this way.
# TYPE apiserver_longrunning_gauge gauge
apiserver_longrunning_gauge{resource="pods",scope="namespace",subresource="log",verb="GET"} 0
This is back to the image team: we are not setting a default, so we're timing out at 30s on create because ?timeout= defaults to 30s. We can set a longer timeout, but the client doesn't make that easy today.
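For illustration, a sketch of how a client could set a longer per-request timeout by dropping down to the REST client, assuming the openshift/client-go image clientset and a recent client-go (older versions call Do() without a context). The namespace, KUBECONFIG handling, and import spec are placeholders, not the actual ci-operator code:

package main

import (
	"context"
	"log"
	"os"
	"time"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// importWithTimeout issues an ImageStreamImport with an explicit ?timeout=
// query parameter, overriding the 30s apiserver default mentioned above.
func importWithTimeout(client imageclient.Interface, ns string, isi *imagev1.ImageStreamImport) (*imagev1.ImageStreamImport, error) {
	result := &imagev1.ImageStreamImport{}
	err := client.ImageV1().RESTClient().
		Post().
		Namespace(ns).
		Resource("imagestreamimports").
		Timeout(3 * time.Minute). // rest.Request.Timeout sets ?timeout= on the request
		Body(isi).
		Do(context.TODO()).
		Into(result)
	return result, err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := imageclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder import spec and namespace; a real caller would populate Spec.Images.
	isi := &imagev1.ImageStreamImport{
		ObjectMeta: metav1.ObjectMeta{Name: "stable"},
		Spec:       imagev1.ImageStreamImportSpec{Import: true},
	}
	if _, err := importWithTimeout(client, "ci-op-example", isi); err != nil {
		log.Fatal(err)
	}
}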
It looks like it takes a lot longer to import than before.
Is it possible we *should* treat it as a long-running request instead? Regardless of whether there is a regression in import time, imagestreamimport can take an arbitrary length of time depending on the number of tags and the speed of the registry we have to pull metadata from.
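To make that idea concrete, here is a minimal sketch of a long-running request check that exempts imagestreamimport creates from the default request timeout, using the upstream apiserver filter types. This is illustrative wiring only, not the exception origin actually carried (or lost in the rebase), and the verb/subresource sets are the usual watch/exec-style exemptions rather than exact defaults:

package longrunning

import (
	"net/http"

	"k8s.io/apimachinery/pkg/util/sets"
	apirequest "k8s.io/apiserver/pkg/endpoints/request"
	"k8s.io/apiserver/pkg/server/filters"
)

// NewLongRunningCheck wraps a basic long-running check and additionally exempts
// POSTs to imagestreamimports, whose duration depends on how many tags must be
// imported and how quickly the upstream registries answer.
func NewLongRunningCheck() apirequest.LongRunningRequestCheck {
	basic := filters.BasicLongRunningRequestCheck(
		sets.NewString("watch"),
		sets.NewString("proxy", "exec", "attach", "log", "portforward"),
	)
	return func(r *http.Request, info *apirequest.RequestInfo) bool {
		if basic(r, info) {
			return true
		}
		return info != nil &&
			info.IsResourceRequest &&
			info.APIGroup == "image.openshift.io" &&
			info.Resource == "imagestreamimports" &&
			info.Verb == "create"
	}
}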
https://github.com/openshift/origin/pull/20419 increases the timeout. I made changes to ci-operator to bypass this use case for most flows. The image layers change should reduce the amount of time to find the manifest by digest.
I'd still like to understand what would have taken us from "occasionally breaching 30s" to "takes 3 minutes", assuming that behavior is consistent. Has the imagestream itself grown (in number of tags)?
I believe all the performance improvements are merged now, primarily: https://github.com/openshift/image-registry/pull/101
Tested creation (via oc create) of an imagestream with 100 imagestreamtags for a docker image with 100 layers. Creation was successful and a subsequent oc import-image succeeded. Any other tests you'd like to see?
Imagestream + all imagestreamtags in comment 10 created in 2.5 seconds; the oc create command itself took 0.2 seconds. This is on 3.11.0-0.25.0.
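For reference, a sketch of how a manifest like the one used in comments 10 and 11 could be generated, assuming the openshift/api image/v1 types; the imagestream name and the source image are placeholders (any public image with many layers works for the test):

package main

import (
	"encoding/json"
	"fmt"
	"os"

	imagev1 "github.com/openshift/api/image/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	is := imagev1.ImageStream{
		TypeMeta:   metav1.TypeMeta{APIVersion: "image.openshift.io/v1", Kind: "ImageStream"},
		ObjectMeta: metav1.ObjectMeta{Name: "many-tags"},
	}
	// 100 spec tags all pointing at the same external image.
	for i := 0; i < 100; i++ {
		is.Spec.Tags = append(is.Spec.Tags, imagev1.TagReference{
			Name: fmt.Sprintf("tag-%d", i),
			From: &corev1.ObjectReference{
				Kind: "DockerImage",
				// Placeholder image reference.
				Name: "docker.io/example/hundred-layers:latest",
			},
		})
	}
	out, err := json.MarshalIndent(&is, "", "  ")
	if err != nil {
		os.Exit(1)
	}
	fmt.Println(string(out))
}

Piping the output to oc create -f - reproduces the creation step, and oc import-image then exercises the import path as described in comment 10.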
Sounds great to me, thanks Mike.
Verified on 3.11.0-0.25.0. See comment 10 and comment 11 for the verification scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652