Bug 1745418
| Summary: | [Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal] | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qin Ping <piqin> |
| Component: | Samples | Assignee: | Gabe Montero <gmontero> |
| Status: | CLOSED ERRATA | QA Contact: | XiuJuan Wang <xiuwang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.2.0 | CC: | adam.kaplan, aos-bugs, bparees, wzheng |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:37:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Qin Ping
2019-08-26 06:12:58 UTC
Looks like the samples operator imagestreams are not importing. If imagestreams do not import, then the "not authorized" failures are expected. Moving to Samples for now, though this may be due to an intermittent flake.

Bottom line, flake related, as Adam notes. Details / tl;dr:

On the https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/95 run, I do see a TBR flake with the fuse imagestream:

    Get https://registry.redhat.io/v2/fuse7/fuse-console/manifests/1.0: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/articles/3399531

At the moment we are not cross-referencing our analysis of failed imports against the list of imagestreams the extended tests check, namely {"ruby", "nodejs", "perl", "php", "python", "mysql", "postgresql", "mongodb", "jenkins"}. We could ignore EAP-related flakes when running our extended tests, but that will require a tweak to the messaging done in the samples operator, followed by an update to the extended tests in openshift/origin. I'll leave this bug open to track that improvement (a rough sketch of the filtering follows).
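To make the cross-referencing idea concrete, here is a minimal Go sketch. It is illustrative only: the function names are made up, the input list of failed imports would in practice have to be parsed from the samples operator's condition messages, and the TBR-gated imagestream names used in `main` are placeholders, not verified names. It only demonstrates filtering the reported import failures down to the imagestreams the extended tests actually depend on, so that TBR-only failures (fuse, EAP, etc.) could be treated as an ignorable flake:

```go
package main

import "fmt"

// expectedImageStreams mirrors the list the extended tests in
// openshift/origin compare against, per the discussion above.
var expectedImageStreams = []string{
	"ruby", "nodejs", "perl", "php", "python",
	"mysql", "postgresql", "mongodb", "jenkins",
}

// relevantImportFailures returns only those failed imports that the
// extended tests actually depend on. If the result is empty, the
// failure is limited to TBR-gated streams and could be ignored.
func relevantImportFailures(failed []string) []string {
	expected := make(map[string]bool, len(expectedImageStreams))
	for _, name := range expectedImageStreams {
		expected[name] = true
	}
	var relevant []string
	for _, name := range failed {
		if expected[name] {
			relevant = append(relevant, name)
		}
	}
	return relevant
}

func main() {
	// Hypothetical input: in a real implementation this would come from
	// the samples operator's reported import failures.
	failed := []string{"fuse7-console", "jboss-eap72-openshift"}
	if len(relevantImportFailures(failed)) == 0 {
		fmt.Println("only TBR-gated imagestreams failed to import; treat as flake")
	} else {
		fmt.Println("an imagestream the extended tests rely on failed to import")
	}
}
```

For this to work, the samples operator's messaging only needs to expose the failed imagestream names in a form the extended tests can parse, which is the tweak described above.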
With https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/96 the failure is not related to samples or to "Failed to import expected imagestreams" (that string does not appear in https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/96/build-log.txt). From the full build log:

    Aug 26 00:00:24.029: INFO: At 2019-08-25 23:59:52 +0000 UTC - event for a234567890123456789012345678901234567890123456789012345678-1: {build-controller } BuildStarted: Build e2e-test-new-app-t5wwf/a234567890123456789012345678901234567890123456789012345678-1 is now running
    Aug 26 00:00:24.029: INFO: At 2019-08-26 00:00:17 +0000 UTC - event for a234567890123456789012345678901234567890123456789012345678-1: {build-controller } BuildFailed: Build e2e-test-new-app-t5wwf/a234567890123456789012345678901234567890123456789012345678-1 failed
    Aug 26 00:00:24.081: INFO: POD NODE PHASE GRACE CONDITIONS
    Aug 26 00:00:24.081: INFO: a234567890123456789012345678901234567890123456789012345678-1-build ci-op-j9spzp8n-282fe-zj4pp-worker-centralus1-8xpvx Failed [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-08-25 23:59:50 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2019-08-26 00:00:16 +0000 UTC ContainersNotReady containers with unready status: [sti-build]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2019-08-26 00:00:16 +0000 UTC ContainersNotReady containers with unready status: [sti-build]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-08-25 23:59:28 +0000 UTC }]
    Aug 26 00:00:24.082: INFO:
    Aug 26 00:00:24.163: INFO: skipping dumping cluster info - cluster too large
    Aug 26 00:00:24.228: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-new-app-t5wwf-user}, err: <nil>
    Aug 26 00:00:24.295: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-new-app-t5wwf}, err: <nil>
    Aug 26 00:00:24.358: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens 6QrNSoBTQaW3e_gOYGA15gAAAAAAAAAA}, err: <nil>
    [AfterEach] [Feature:Builds][Conformance] oc new-app
    /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:150
    Aug 26 00:00:24.358: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
    STEP: Destroying namespace "e2e-test-new-app-t5wwf" for this suite.
    Aug 26 00:00:32.757: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
    Aug 26 00:00:36.718: INFO: namespace e2e-test-new-app-t5wwf deletion completed in 12.311802456s
    Aug 26 00:00:36.722: INFO: Running AfterSuite actions on all nodes
    Aug 26 00:00:36.722: INFO: Running AfterSuite actions on node 1
    fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error:
        <*errors.errorString | 0xc0003335c0>: {
            s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"",
        }
        The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed"
    occurred

@Gabe looks like the latter flake [1] is our known issue around imagestreams, where we suspect some caching doesn't pick up that the imagestream is local to the internal registry.

[1] https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/96

The related e2e passed. Also, when there are imagestream import failures, the affected imagestreams are listed in the openshift-samples clusteroperator. Tested with payload 4.2.0-0.nightly-2019-09-02-172410.

Another theory on this: if the openshift controller manager (OCM) restarts *before* the openshift api server (OAS), it could re-cache imagestreams that still don't have the internal registry represented (the internal registry is only represented once the OAS has picked up the internal registry hostname, so that it can add it as part of the decoration on a GET of an imagestream). Right now I don't think anything tries to correlate those restarts; the OCM restart has to be smarter. We had previously pursued creating a "canary" imagestream that we could periodically GET to see whether it had the internal registry hostname yet (which would mean the OAS had picked up the hostname and restarted). Something along those lines may need to be revisited to ensure the OCM always restarts after the OAS has picked up an internal registry hostname change. The OCM operator itself could probe an imagestream periodically to determine this (a sketch of such a probe follows the closing note below).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
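Regarding the canary-imagestream theory above, here is a minimal sketch of such a probe. It assumes the default internal registry hostname (`image-registry.openshift-image-registry.svc:5000`) and uses the `ruby` imagestream in the `openshift` namespace as the canary; both are assumptions for illustration. An actual fix in the OCM operator would use client-go informers rather than shelling out to `oc`, so this only illustrates the check itself:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// internalRegistryHost is the hostname the OAS is expected to decorate
// imagestreams with once it has picked up the registry configuration.
// (Assumed default; the real value comes from the cluster's image config.)
const internalRegistryHost = "image-registry.openshift-image-registry.svc:5000"

// canaryHasRegistryHost probes a single "canary" imagestream and reports
// whether its status already references the internal registry, i.e.
// whether the OAS has picked up the internal registry hostname.
func canaryHasRegistryHost(namespace, name string) (bool, error) {
	out, err := exec.Command("oc", "get", "imagestream", name,
		"-n", namespace,
		"-o", "jsonpath={.status.dockerImageRepository}").Output()
	if err != nil {
		return false, err
	}
	return strings.HasPrefix(strings.TrimSpace(string(out)), internalRegistryHost), nil
}

func main() {
	// Poll until the canary imagestream is decorated with the internal
	// registry hostname; only then would it be safe for the OCM to
	// (re)build its imagestream cache.
	for i := 0; i < 30; i++ {
		ok, err := canaryHasRegistryHost("openshift", "ruby")
		if err == nil && ok {
			fmt.Println("OAS is serving imagestreams with the internal registry hostname")
			return
		}
		time.Sleep(10 * time.Second)
	}
	fmt.Println("timed out waiting for the internal registry hostname on the canary imagestream")
}
```

Only once a probe like this succeeds would restarting (or re-caching in) the OCM be safe, which addresses the restart-ordering problem described in the theory above.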