Bug 1692832 - oc cli failed on build related calls
Summary: oc cli failed on build related calls
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Maciej Szulik
QA Contact: Hongkai Liu
Depends On:
Reported: 2019-03-26 14:18 UTC by Hongkai Liu
Modified: 2019-06-04 10:46 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:46:25 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:46:33 UTC

Description Hongkai Liu 2019-03-26 14:18:41 UTC
Description of problem:
oc cli failed on build related calls.
Not sure which component this bug should go to.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-23-222829   True        False         20h     Cluster version is 4.0.0-0.nightly-2019-03-23-222829

How reproducible: 2/2 (the first attempt was on an earlier OCP build)

Steps to Reproduce:
$ oc new-project testproject
$ oc new-app https://github.com/sclorg/cakephp-ex
$ curl -O https://raw.githubusercontent.com/hongkailiu/svt-case-doc/master/scripts/simple_oc_check.sh
$ bash -x ./simple_oc_check.sh 2>&1 | tee -a test.log
Wait for 24 hours, then check test.log.
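For context, a minimal sketch of what a periodic build-API health check such as simple_oc_check.sh might do is shown below. The real script lives at the URL above; the specific oc calls here are assumptions for illustration, not the script's actual contents.

```shell
#!/bin/bash
# Hypothetical sketch of a periodic build-API check (the real script is
# simple_oc_check.sh from the URL above; these calls are assumptions).

CHECKS=(
  "oc get buildconfigs -n testproject"
  "oc describe buildconfig cakephp-ex -n testproject"
  "oc get builds -n testproject"
)

run_checks() {
  for cmd in "${CHECKS[@]}"; do
    # Timestamp each failure so it can later be correlated with
    # apiserver availability transitions.
    if ! $cmd >/dev/null 2>&1; then
      echo "$(date -u +%FT%TZ) FAILED: $cmd"
    fi
  done
}

# Only exercise the checks when oc is on PATH (i.e. a live environment).
if command -v oc >/dev/null 2>&1; then
  run_checks
fi
```

Looping this every minute for 24 hours and teeing to test.log would match the reproducer above.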

Actual results:
Error from server (Forbidden): buildconfigs.build.openshift.io "cakephp-ex" is forbidden: Unauthorized
Error from server (NotFound): Unable to list {"build.openshift.io" "v1" "buildconfigs"}: the server could not find the requested resource (get buildconfigs.build.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (post buildconfigs.build.openshift.io cakephp-ex)
error: You must be logged in to the server (Unauthorized)
The ImageStreamTag "php:7.2" is invalid: from: Error resolving ImageStreamTag php:7.2 in namespace openshift: Unauthorized
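A small helper (not part of the original report) can tally the distinct error classes in a log such as test.log, which helps separate transient apiserver unavailability (ServiceUnavailable) from authentication failures (Unauthorized, Forbidden):

```shell
#!/bin/bash
# Tally API error classes in a log file, most frequent first.
# Written against the error strings quoted above; "test.log" is the
# log produced by the reproducer.

classify_errors() {
  grep -oE 'Forbidden|NotFound|ServiceUnavailable|Unauthorized' "$1" \
    | sort | uniq -c | sort -rn
}
```

On the five error lines above, Unauthorized dominates (it appears in three of them), which points at the auth path rather than a single missing resource.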

Expected results:
All build-related oc calls succeed without authorization errors.

Additional info:
Will upload test.log

This one blocks https://bugzilla.redhat.com/show_bug.cgi?id=1690149

Comment 2 Adam Kaplan 2019-03-27 17:20:48 UTC
Assigning to Master team. This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1665842

Comment 3 Gabe Montero 2019-03-27 21:00:31 UTC
If it helps, I think I saw something similar in a PR test at lunchtime today. See https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_jenkins/826/pull-ci-openshift-jenkins-master-e2e-aws-jenkins/184 and the openshift-tests [Feature:Builds][Slow] openshift pipeline build Sync plugin tests using the ephemeral template [Suite:openshift] test.

Comment 7 Gabe Montero 2019-03-29 19:08:47 UTC
A triage of an `Unauthorized` failure from a PR e2e was done today by David Eads, Trevor King, and myself, at https://coreos.slack.com/archives/CEKNRGF25/p1553872006116500

Reader's digest:
1) The failure occurred just as the install was completing and the e2e extended-test framework was booting up.
2) The apiservice was inaccessible, and openshift-apiserver's Available condition went from True to False (interestingly, the k8s apiserver was still Available=True, Progressing=True).
3) Use of a normal user's OAuth token while the openshift apiserver was unhappy resulted in the 401/Unauthorized.
4) By the time the artifacts were grabbed, the apiservices were all happy again (so the apiservice inaccessibility was transient).
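One way to catch the transient apiservice flip described in points 2 and 4 is to poll the APIService conditions while the test runs. The resource name v1.build.openshift.io is a standard OpenShift 4.x APIService, but this polling helper itself is an illustration, not a tool used in the triage:

```shell
#!/bin/bash
# Print each condition of an APIService as type=status (e.g.
# Available=True). Default target is the build API aggregated service.

watch_apiservice() {
  local svc="${1:-v1.build.openshift.io}"
  oc get apiservice "$svc" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Example against a live cluster (polls every 10 seconds):
#   while sleep 10; do date -u; watch_apiservice; done
```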

Comment 8 Gabe Montero 2019-04-01 21:28:44 UTC
Update from today's various triage efforts:
- have not seen much consistency between the various `Unauthorized` failures in e2e-aws-builds and e2e-aws-jenkins, but one item that has popped up on several occasions is log entries in the openshift-apiserver(s) like `E0401 15:19:52.465974       1 webhook.go:192] Failed to make webhook authorizer request: .authorization.k8s.io "" is invalid: spec.user: Invalid value: "": at least one of user or group must be specified`
- one such example was at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22435/pull-ci-openshift-origin-master-e2e-aws-builds/1180/artifacts/e2e-aws-builds/pods/openshift-apiserver_apiserver-4ptdr_openshift-apiserver.log.gz
- David Eads analyzed it and chimed in with:
-- "could be an issue with front proxy aggregation certs not being in sync."
-- and asked @auth-team: "Can you make the front-proxy authenticator fail if there is an asserted user without a valid cert? Looking at the log above, it looks like the authentication stack of openshift-apiserver isn't quite right and is allowing an empty user"
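For illustration only: the log entry above shows the apiserver rejecting an authorization check in which neither user nor group is set. A minimal (assumed) request of that invalid, empty-user shape would look like this; the validation message in the comment at the end is the one quoted from the log above.

```shell
#!/bin/bash
# Write out a SubjectAccessReview with an empty user and no groups,
# the invalid shape implied by the webhook authorizer error above.

sar="$(mktemp)"
cat > "$sar" <<'EOF'
{
  "apiVersion": "authorization.k8s.io/v1",
  "kind": "SubjectAccessReview",
  "spec": {
    "user": "",
    "resourceAttributes": {
      "group": "build.openshift.io",
      "resource": "buildconfigs",
      "verb": "get"
    }
  }
}
EOF
# Submitting this to a live apiserver (e.g. `oc create -f "$sar"`)
# yields the same validation error seen in the log:
#   spec.user: Invalid value: "": at least one of user or group must be specified
echo "wrote $sar"
```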

Comment 9 Gabe Montero 2019-04-01 22:13:08 UTC
So after more discussion with David, Comment #7 is pertinent to this bug, but Comment #8 is something different.  I'll be opening a new bug against auth team for it.

Comment 10 Hongkai Liu 2019-04-02 02:21:08 UTC
Thanks for all the work, Gabe and David.

Comment 11 Gabe Montero 2019-04-05 14:13:22 UTC
So I believe I have another instance of what David and I saw last week in Comment #7.

See results from the must-gather dev helper in https://github.com/openshift/origin/pull/22482#issuecomment-480288930 after analyzing https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22482/pull-ci-openshift-origin-master-e2e-aws-builds/1237/artifacts/e2e-aws-builds/

Comment 12 Michal Fojtik 2019-04-12 13:51:30 UTC
(In reply to Gabe Montero from comment #11)
> So I believe I have another instance of what David and I saw last week in
> #Comment 7
> See results from the must-gather dev helper in
> https://github.com/openshift/origin/pull/22482#issuecomment-480288930 after
> analyzing
> https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22482/
> pull-ci-openshift-origin-master-e2e-aws-builds/1237/artifacts/e2e-aws-builds/

"Available: v1.quota.openshift.io is not ready: 503" - this was fixed already by
moving cluster resource quota to CRD and is unrelated to this BZ.

Comment 13 Michal Fojtik 2019-04-12 13:54:58 UTC
Meanwhile, we fixed a lot of cert-rotation bugs and increased the graceful-termination period for API server shutdown,
so this issue might already be fixed. Also, this BZ seems to combine several distinct issues.

Moving to QA to verify that new-app works as expected without Unauthorized errors.

Comment 14 Hongkai Liu 2019-04-15 12:00:32 UTC
I did not see any build-related oc-cli failures in the recent nightly builds with the concurrent build test.
However, it was reported with a bigger-scale cluster in the scale lab.
Let me verify this after running the test again there this week.

Comment 15 Xingxing Xia 2019-04-25 03:33:26 UTC
Hongkai, hi, what's the result now? Thanks.

Comment 16 Hongkai Liu 2019-04-25 04:03:09 UTC
I haven't had the chance to test in the scale lab yet (it is shared by many people).

Comment 17 Hongkai Liu 2019-04-25 13:34:40 UTC
Checked the scale-lab list this morning.
Our test is the 4th one in the queue of cases.

Comment 18 Xingxing Xia 2019-05-06 06:33:25 UTC
(In reply to Hongkai Liu from comment #17)
> checked the scale-lab list this morning.
> Current test is the 4th one in the queue of cases.

How about now?

Comment 19 Hongkai Liu 2019-05-06 12:59:20 UTC
Rerun on the scale lab:
With the test running on a pbench-controller pod, we still saw the crashes (though far fewer than in the last round).
We later found out that those were caused by OS parameters on the client side.
Using a normal AWS VM worked around it.

The bug is related to the concurrent build test.

Comment 22 errata-xmlrpc 2019-06-04 10:46:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

