Bug 1896188

Summary: [sig-cli] oc debug deployment configs from a build: local-busybox-1-build not completed
Product: OpenShift Container Platform
Component: oc
Version: 4.7
Target Release: 4.7.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: W. Trevor King <wking>
Assignee: Maciej Szulik <maszulik>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, jack.ottofaro, jluhrsen, jokerman, maszulik, mfojtik
Keywords: UpcomingSprint
Doc Type: No Doc Update
Environment: [sig-cli] oc debug deployment configs from a build
Type: Bug
Last Closed: 2021-02-24 15:31:28 UTC

Description W. Trevor King 2020-11-09 22:37:32 UTC
test:
[sig-cli] oc debug deployment configs from a build 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-cli%5C%5D+oc+debug+deployment+configs+from+a+build

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=%5C%5Bsig-cli%5C%5D+oc+debug+deployment+configs+from+a+build' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy - 3 runs, 100% failed, 67% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere - 6 runs, 100% failed, 17% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 26 runs, 35% failed, 22% of failures match
pull-ci-openshift-cloud-credential-operator-master-e2e-aws - 4 runs, 100% failed, 25% of failures match
pull-ci-openshift-installer-master-e2e-aws-workers-rhel7 - 6 runs, 17% failed, 100% of failures match
pull-ci-openshift-kubernetes-master-e2e-aws-fips - 14 runs, 86% failed, 17% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-aws - 27 runs, 78% failed, 5% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7 - 11 runs, 64% failed, 14% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-vsphere-upi - 7 runs, 100% failed, 14% of failures match
pull-ci-openshift-machine-config-operator-master-okd-e2e-aws - 11 runs, 45% failed, 20% of failures match
pull-ci-openshift-oc-master-e2e-aws - 7 runs, 57% failed, 25% of failures match
pull-ci-openshift-openshift-controller-manager-master-e2e-aws - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn - 5 runs, 100% failed, 20% of failures match
pull-ci-openshift-sriov-network-operator-master-e2e-aws - 22 runs, 100% failed, 14% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.7 - 6 runs, 67% failed, 25% of failures match

Example job [1]:

fail [github.com/openshift/origin/test/extended/cli/debug.go:29]: Unexpected error:
    <*errors.errorString | 0xc0002d48a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

with stdout from that test-case including:

Nov  9 20:41:39.574: INFO: Error running /usr/bin/oc --namespace=e2e-test-oc-debug-754r9 --kubeconfig=/tmp/configfile291112964 get imagestreamtags local-busybox:latest:
StdOut>
Error from server (NotFound): imagestreamtags.image.openshift.io "local-busybox:latest" not found
StdErr>
Error from server (NotFound): imagestreamtags.image.openshift.io "local-busybox:latest" not found
...
Nov  9 20:41:39.798: INFO: POD                    NODE                                       PHASE      GRACE  CONDITIONS
Nov  9 20:41:39.798: INFO: busybox1-1-deploy      ip-10-0-61-59.us-west-1.compute.internal   Succeeded         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:38 +0000 UTC PodCompleted } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:38 +0000 UTC PodCompleted } {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:38 +0000 UTC PodCompleted } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:38 +0000 UTC  }]
Nov  9 20:41:39.798: INFO: busybox2-1-deploy      ip-10-0-78-157.us-west-1.compute.internal  Succeeded         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:39 +0000 UTC PodCompleted } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:39 +0000 UTC PodCompleted } {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:39 +0000 UTC PodCompleted } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:39 +0000 UTC  }]
Nov  9 20:41:39.798: INFO: local-busybox-1-build  ip-10-0-78-157.us-west-1.compute.internal  Running           [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:41 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:42 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:42 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-09 20:40:38 +0000 UTC  }]

Maybe we just need to bump a timeout or something?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1325878491849166848
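
For reference, the failing assertion at debug.go:29 is, conceptually, a poll that waits for the local-busybox:latest imagestreamtag the build is supposed to produce. Below is a minimal sketch of that kind of wait, shelling out to oc the way the extended tests do; the namespace, kubeconfig path, poll interval, and one-minute ceiling are illustrative assumptions, not the test's actual values.

    package main

    import (
        "fmt"
        "os/exec"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForImageStreamTag polls `oc get imagestreamtags <tag>` until the tag
    // exists or the timeout expires; while the build has not pushed the image,
    // oc returns the NotFound error seen in the output above.
    func waitForImageStreamTag(namespace, kubeconfig, tag string, timeout time.Duration) error {
        return wait.Poll(10*time.Second, timeout, func() (bool, error) {
            out, err := exec.Command("oc",
                "--namespace="+namespace,
                "--kubeconfig="+kubeconfig,
                "get", "imagestreamtags", tag).CombinedOutput()
            if err != nil {
                // NotFound is expected until the build completes; keep polling.
                fmt.Printf("still waiting for %s: %s", tag, out)
                return false, nil
            }
            return true, nil
        })
    }

    func main() {
        // Illustrative values; the real test generates its own namespace and kubeconfig.
        err := waitForImageStreamTag("e2e-test-oc-debug-754r9", "/tmp/configfile291112964",
            "local-busybox:latest", 1*time.Minute)
        if err != nil {
            fmt.Println(err) // "timed out waiting for the condition" when the build is too slow
        }
    }

When the build outlives the ceiling, wait.Poll returns exactly the "timed out waiting for the condition" error quoted above, which is why a slow build surfaces as this test failure.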

Comment 1 W. Trevor King 2020-11-09 23:07:03 UTC
Somewhat similar, but without the running pod:

  Error from server (NotFound): imagestreamtags.image.openshift.io "wildfly:latest" not found

out of:

  [sig-cli] oc debug ensure it works with image streams

Maybe related to Docker throttling (bug 1895107), and we are just missing clear ImageStreamTag reporting in some test suite handler?
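
If the suite really is missing clear ImageStreamTag reporting, one option would be a failure handler that dumps the relevant objects when a wait like the one sketched above times out. This is only a hypothetical sketch; the helper name and the choice of resources to dump are mine, not anything the suite is confirmed to do.

    package debugdump

    import (
        "fmt"
        "os/exec"
    )

    // dumpImageState is a hypothetical diagnostic helper: it prints the
    // imagestreamtags, imagestreams, and builds in the test namespace so the
    // junit output would show whether an image import or a build stalled.
    func dumpImageState(namespace, kubeconfig string) {
        for _, resource := range []string{"imagestreamtags", "imagestreams", "builds"} {
            out, err := exec.Command("oc",
                "--namespace="+namespace,
                "--kubeconfig="+kubeconfig,
                "get", resource, "-o", "wide").CombinedOutput()
            fmt.Printf("== %s ==\n%s(err: %v)\n", resource, out, err)
        }
    }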

Comment 5 Maciej Szulik 2020-12-04 16:27:42 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 jamo luhrsen 2021-01-22 22:43:54 UTC
both of the PRs attached to this bz are merged (side note: I don't know why 25803 still says open. it was merged 3 days ago).

BUT, I see this '[sig-cli] oc debug ensure it works with image streams' still failing a lot. at least in:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi

when it does fail, it will reject our release payloads, like this:
  https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2021-01-22-195722

search:
https://search.ci.openshift.org/?search=oc+debug+ensure+it+works+with+image+streams&maxAge=48h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

I came across this today while keeping an eye on the 4.7 release. Do we need a new bz, or does this
one need more work?

Comment 11 Maciej Szulik 2021-01-25 11:34:47 UTC
(In reply to jamo luhrsen from comment #9)
> both of the PRs attached to this bz are merged (side note: I don't know why
> 25803 still says open. it was merged 3 days ago).
> 
> BUT, I see this '[sig-cli] oc debug ensure it works with image streams'
> still failing a lot. at least in:

This BZ is not related to that test; please don't mix cases.

Comment 12 Maciej Szulik 2021-01-25 11:39:11 UTC
(In reply to zhou ying from comment #10)
> Checked the logs; this case is still failing a lot after all the PRs merged:
> 
> https://search.ci.openshift.org/?search=oc+debug+deployment+configs+from+a+build&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> The following are failing with the timeout error:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1352749515207086080
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/713/pull-ci-openshift-oc-master-e2e-aws/1353149155815460864
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.8/1353457507455995904
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15187/rehearse-15187-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi/1353315696192262144
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/693/pull-ci-openshift-oc-master-e2e-aws/1353155175140823040
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/693/pull-ci-openshift-oc-master-e2e-aws/1353155175140823040

All of the above failures are caused by the build not being completed; lemme bump that particular timeout to 10 minutes, like it is in the other tests.
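
For illustration, the shape of that change might be a wait on the build itself with a 10-minute ceiling before the imagestreamtag check. The helper below is only a sketch of the idea, not the actual origin patch; the jsonpath query and phase handling are my assumptions about how one would poll a Build from a test.

    package debugwait

    import (
        "fmt"
        "os/exec"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForBuildComplete polls the build's .status.phase until it reports
    // Complete, failing early on terminal phases and giving the build the full
    // 10 minutes proposed in this comment.
    func waitForBuildComplete(namespace, kubeconfig, build string) error {
        return wait.Poll(10*time.Second, 10*time.Minute, func() (bool, error) {
            out, err := exec.Command("oc",
                "--namespace="+namespace,
                "--kubeconfig="+kubeconfig,
                "get", "build", build,
                "-o", "jsonpath={.status.phase}").Output()
            if err != nil {
                return false, nil // the Build object may not exist yet
            }
            switch string(out) {
            case "Complete":
                return true, nil
            case "Failed", "Error", "Cancelled":
                return false, fmt.Errorf("build %s ended in phase %s", build, out)
            default: // New, Pending, Running
                return false, nil
            }
        })
    }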

Comment 14 jamo luhrsen 2021-01-25 19:17:38 UTC
(In reply to Maciej Szulik from comment #11)
> (In reply to jamo luhrsen from comment #9)
> > both of the PRs attached to this bz are merged (side note: I don't know why
> > 25803 still says open. it was merged 3 days ago).
> > 
> > BUT, I see this '[sig-cli] oc debug ensure it works with image streams'
> > still failing a lot. at least in:
> 
> This BZ is not related to this test, please don't mix cases.

I found this bz by searching, and it matched what @wking noted in comment #1. I was thinking it could
all be the same root cause. If you think it's different, I can file a new bz. But looking at this
failed job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi/1353683727850409984

and the stdout from that failed testcase, I see similar errors as to what is being discussed in this bz:

Jan 25 13:28:15.963: INFO: Running 'oc --namespace=e2e-test-oc-debug-lvbdc --kubeconfig=/tmp/configfile556966883 get imagestreamtags wildfly:latest'
Jan 25 13:28:16.088: INFO: Error running /usr/bin/oc --namespace=e2e-test-oc-debug-lvbdc --kubeconfig=/tmp/configfile556966883 get imagestreamtags wildfly:latest:
StdOut>
Error from server (NotFound): imagestreamtags.image.openshift.io "wildfly:latest" not found
StdErr>
Error from server (NotFound): imagestreamtags.image.openshift.io "wildfly:latest" not found

Error from server (NotFound): imagestreamtags.image.openshift.io "wildfly:latest" not found
[AfterEach] [sig-cli] oc debug
  github.com/openshift/origin/test/extended/util/client.go:138
STEP: Collecting events from namespace "e2e-test-oc-debug-lvbdc".
STEP: Found 1 events.
Jan 25 13:28:16.096: INFO: At 2021-01-25 13:26:58 +0000 UTC - event for e2e-test-oc-debug-lvbdc: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Jan 25 13:28:16.101: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Jan 25 13:28:16.101: INFO: 
Jan 25 13:28:16.110: INFO: skipping dumping cluster info - cluster too large
Jan 25 13:28:16.142: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-oc-debug-lvbdc-user}, err: <nil>
Jan 25 13:28:16.162: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-oc-debug-lvbdc}, err: <nil>
Jan 25 13:28:16.182: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  POvSR7LhT5WwUW5UGlwwJAAAAAAAAAAA}, err: <nil>
[AfterEach] [sig-cli] oc debug
  github.com/openshift/origin/test/extended/util/client.go:139
Jan 25 13:28:16.182: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-oc-debug-lvbdc" for this suite.
Jan 25 13:28:16.229: INFO: Running AfterSuite actions on all nodes
Jan 25 13:28:16.229: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/cli/debug.go:188]: Unexpected error:
    <*errors.errorString | 0xc0002969a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

I see that the PR to bump the timeout to 10m was merged 4h ago. I'll check back on this later and see
if that helped or not. If you think I need an entirely new bz, let me know and I'll file it asap.

Comment 15 Maciej Szulik 2021-01-26 12:32:44 UTC
> and the stdout from that failed testcase, I see similar errors as to what is being discussed in this bz:

The two cases are different: line 188 in the debug.go test waits for an image import, and that is the
problem you're seeing. It is connected with registry rate limits, probably docker.io's if that hasn't
changed yet. See https://github.com/openshift/origin/pull/25746 where I'm fixing parts of that problem.

This particular case is also waiting for an image stream tag to appear, which is confusing, but here the
input for the image stream is provided by the build. So even though both cases wait for an image stream
to appear, the source of the image is different, and that makes this a different issue.
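
To make the distinction concrete, one rough way to tell the two failure modes apart in a failed run is to look at what is supposed to feed the image stream: in this bug the tag comes from a Build in the test namespace, while in the other test it comes from an import of an external DockerImage reference, which is what registry rate limits affect. The helper below is only a sketch with made-up names.

    package debugclassify

    import (
        "os/exec"
        "strings"
    )

    // classifyImageStreamWait reports whether a missing tag was expected to come
    // from an in-cluster build (this bug) or from an image import of an external
    // DockerImage reference (the case hit by registry rate limits).
    func classifyImageStreamWait(namespace, kubeconfig, imagestream string) (string, error) {
        builds, err := exec.Command("oc",
            "--namespace="+namespace, "--kubeconfig="+kubeconfig,
            "get", "builds", "-o", "name").Output()
        if err != nil {
            return "", err
        }
        if strings.TrimSpace(string(builds)) != "" {
            return "tag is produced by a build; check build status and logs (this bug)", nil
        }
        from, err := exec.Command("oc",
            "--namespace="+namespace, "--kubeconfig="+kubeconfig,
            "get", "imagestream", imagestream,
            "-o", "jsonpath={.spec.tags[*].from.kind}").Output()
        if err != nil {
            return "", err
        }
        if strings.Contains(string(from), "DockerImage") {
            return "tag comes from an external image import; check registry rate limits", nil
        }
        return "unknown", nil
    }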

Comment 16 jamo luhrsen 2021-01-26 21:09:09 UTC
(In reply to Maciej Szulik from comment #15)
> > and the stdout from that failed testcase, I see similar errors as to what is being discussed in this bz:
> 
> The two cases are different: line 188 in the debug.go test waits for an image import, and that is the
> problem you're seeing. It is connected with registry rate limits, probably docker.io's if that hasn't
> changed yet. See https://github.com/openshift/origin/pull/25746 where I'm fixing parts of that problem.
> 
> This particular case is also waiting for an image stream tag to appear, which is confusing, but here the
> input for the image stream is provided by the build. So even though both cases wait for an image stream
> to appear, the source of the image is different, and that makes this a different issue.

thank you for that explanation. It makes sense to me :)

I'll check back on the failure I was noting once your PR is merged to help with the docker.io issues.
I know we've faced those in many places this entire 4.7 release. Thanks again.

Comment 17 Maciej Szulik 2021-01-27 10:47:35 UTC
> thank you for that explanation. It makes sense to me :)

Sure, I'm happy to help :-)

> I'll check back on the failure I was noting once your PR is merged to help
> with the docker.io issues.
> I know we've faced those in many places this entire 4.7 release. Thanks
> again.

Yeah, feel free to open a new one if you see this happening.

Comment 19 Maciej Szulik 2021-02-02 11:45:48 UTC
Zhou Ying, I've checked the failures and:
1. some are from 4.6, which doesn't contain the fix
2. others are from OKD, which IIRC is based on 4.6, so see 1.
3. the remaining failures are coming from the build part of the process

For 3, I'll check the queue in a few more days and create a bug for the Build component to investigate why the builds are not completing even
though we are waiting an extended period of time. For now I'd move this to VERIFIED.

Comment 22 errata-xmlrpc 2021-02-24 15:31:28 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been
closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and
enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633