1781290 – During a 4.2 to 4.3 upgrade skew test, build e2e tests continuously fail

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Build
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Adam Kaplan
QA Contact:	wewang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1781302

Reported:	2019-12-09 17:24 UTC by Clayton Coleman
Modified:	2020-01-23 11:18 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1781302
Environment:
Last Closed:	2020-01-23 11:18:20 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:0062	0	None	None	None	2020-01-23 11:18:42 UTC

Description Clayton Coleman 2019-12-09 17:24:55 UTC

We run skew tests from 4.2 to 4.3: we install 4.2 and upgrade to 4.3, but stop before updating the nodes. We then run the 4.2 e2e tests, which verify that the API function has not regressed.

Sometime in the last few weeks, the build subsystem e2e tests started failing continuously:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.3/160

[Feature:Builds][pruning] prune builds based on settings in the buildconfig [Conformance] buildconfigs should have a default history limit set when created via the group api [Suite:openshift/conformance/parallel/minimal] 

fail [github.com/openshift/origin/test/extended/builds/build_pruning.go:43]: Unexpected error:
    <*errors.errorString | 0xc0000a7a00>: {
        s: "Timed out waiting for internal registry hostname to be published",
    }
    Timed out waiting for internal registry hostname to be published
occurred

There are three possible cases, and this bug must be fixed in whichever place applies:

1. The 4.2 e2e tests do not run correctly against a 4.3 control plane. Fix the 4.2 tests to tolerate a 4.3 control plane so that we can see them pass.
2. We regressed a product API, so a 4.2 client fails against a 4.3 server. This must be fixed before we ship because we don't regress APIs.
3. We don't work correctly with the registry when the control plane is 4.3 and the nodes are 4.2. This must be fixed because it would break anyone who ran builds during an upgrade.

This bug may not be deferred.

Comment 1 Clayton Coleman 2019-12-09 17:27:57 UTC

Dec  9 12:47:39.809: INFO: Waiting up to 2 minutes for the internal registry hostname to be published
Dec  9 12:47:42.186: INFO: did not find the sequence in the OCM pod logs around the build controller getting started after the internal registry hostname has been set in the OCM config

Comment 2 Adam Kaplan 2019-12-09 18:01:50 UTC

> There are three possible cases, and this bug must be fixed in whichever place applies:
> 
> 1. The 4.2 e2e tests do not run correctly against a 4.3 control plane. Fix the 4.2 tests to tolerate a 4.3 control plane so that we can see them pass.
> 2. We regressed a product API, so a 4.2 client fails against a 4.3 server. This must be fixed before we ship because we don't regress APIs.
> 3. We don't work correctly with the registry when the control plane is 4.3 and the nodes are 4.2. This must be fixed because it would break anyone who ran builds during an upgrade.

I suspect #1. On 4.3 we found that the test started flaking because it depended on a specific controller start sequence. We are in fact getting the correct internal registry hostname synced:

```
2019-12-09T12:34:29.970319148Z I1209 12:34:29.970252       1 build_controller.go:474] Starting build controller
2019-12-09T12:34:29.970319148Z I1209 12:34:29.970305       1 build_controller.go:476] OpenShift image registry hostname: image-registry.openshift-image-registry.svc:5000
2019-12-09T12:34:30.033985328Z I1209 12:34:30.033931       1 deleted_token_secrets.go:72] caches synced
2019-12-09T12:34:30.034172215Z I1209 12:34:30.034145       1 docker_registry_service.go:154] caches synced
2019-12-09T12:34:30.034209042Z I1209 12:34:30.034188       1 create_dockercfg_secrets.go:220] urls found
2019-12-09T12:34:30.034209042Z I1209 12:34:30.034202       1 create_dockercfg_secrets.go:226] caches synced
2019-12-09T12:34:30.034705813Z I1209 12:34:30.034664       1 docker_registry_service.go:284] Updating registry URLs from map[172.30.107.118:5000:{} image-registry.openshift-image-registry.svc.cluster.local:5000:{} image-registry.openshift-image-registry.svc:5000:{}] to map[172.30.107.118:5000:{} image-registry.openshift-image-registry.svc.cluster.local:5000:{} image-registry.openshift-image-registry.svc:5000:{}]
```
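
One way to check for the published hostname without depending on the order of controller startup messages is to look only for the hostname line itself. The sketch below assumes the log format shown in the excerpt above. The function name `registryHostname` is hypothetical, and this is not the logic from the actual fix.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hostnameMarker is the text that precedes the hostname in the build
// controller log line shown in the excerpt above.
const hostnameMarker = "OpenShift image registry hostname: "

// registryHostname scans log text line by line and returns the hostname
// from the last line containing the marker. It ignores every other line,
// so it does not depend on the order in which controllers start.
func registryHostname(logs string) (string, bool) {
	host := ""
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, hostnameMarker); i >= 0 {
			host = strings.TrimSpace(line[i+len(hostnameMarker):])
		}
	}
	return host, host != ""
}

func main() {
	// Two lines taken from the log excerpt above.
	logs := "I1209 12:34:29.970252       1 build_controller.go:474] Starting build controller\n" +
		"I1209 12:34:29.970305       1 build_controller.go:476] OpenShift image registry hostname: image-registry.openshift-image-registry.svc:5000\n"
	host, ok := registryHostname(logs)
	fmt.Println(host, ok)
}
```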

Comment 3 Adam Kaplan 2019-12-09 18:30:10 UTC

Moving to MODIFIED. The logic that detects that the image registry was published was changed in https://github.com/openshift/origin/pull/24048

Comment 5 wewang 2019-12-11 03:38:51 UTC

"[Feature:Builds][pruning] prune builds based on settings in the buildconfig  [Conformance] buildconfigs should have a default history limit set when created via the group api [Suite:openshift/conformance/parallel/minimal]"

no longer fails in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.3/161/162/163

I also ran it in a local e2e run, and it passed:
STEP: waiting for openshift namespace imagestreams
Dec 11 11:31:19.484: INFO: Waiting up to 2 minutes for the internal registry hostname to be published
Dec 11 11:31:22.553: INFO: the OCM pod logs indicate the build controller was started after the internal registry hostname has been set in the OCM config
Dec 11 11:31:23.202: INFO: OCM rollout progressing status reports complete
Dec 11 11:31:23.202: INFO: Scanning openshift ImageStreams 

If you need me to check anything further, please feel free to contact me.

Comment 7 errata-xmlrpc 2020-01-23 11:18:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
