Description of problem:
oc cli failed on build-related calls. Not sure which component this bug should go to.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-23-222829   True        False         20h     Cluster version is 4.0.0-0.nightly-2019-03-23-222829

How reproducible:
2/2 (the first try was with an earlier OCP build)

Steps to Reproduce:
1. $ oc new-project testproject
   $ oc new-app https://github.com/sclorg/cakephp-ex
2. $ curl -O https://raw.githubusercontent.com/hongkailiu/svt-case-doc/master/scripts/simple_oc_check.sh
   $ bash -x ./simple_oc_check.sh 2>&1 | tee -a test.log
3. Wait for 24 hours and check the test.log

Actual results:
Error from server (Forbidden): buildconfigs.build.openshift.io "cakephp-ex" is forbidden: Unauthorized
Error from server (NotFound): Unable to list {"build.openshift.io" "v1" "buildconfigs"}: the server could not find the requested resource (get buildconfigs.build.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (post buildconfigs.build.openshift.io cakephp-ex)
error: You must be logged in to the server (Unauthorized)
The ImageStreamTag "php:7.2" is invalid: from: Error resolving ImageStreamTag php:7.2 in namespace openshift: Unauthorized

Expected results:
The build-related oc calls succeed without authorization or availability errors.

Additional info:
Will upload test.log
This one blocks https://bugzilla.redhat.com/show_bug.cgi?id=1690149
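For context, here is a minimal sketch of the kind of periodic check such a script might perform. The real simple_oc_check.sh lives at the URL in step 2; the loop below is only an illustrative assumption, not its actual contents:

# Hypothetical check loop: exercise build-related oc calls against the app from step 1
# and rely on the tee in step 2 to capture any failures in test.log.
while true; do
  date
  oc get buildconfigs -n testproject        # the list call that returned Forbidden/NotFound above
  oc get builds -n testproject
  oc start-build cakephp-ex -n testproject  # the post call that returned ServiceUnavailable above
  sleep 300                                 # poll every 5 minutes over the 24 hour window
done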
Assigning to Master team. This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1665842
If it helps, I think I saw something similar in a PR test around lunchtime today. See https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_jenkins/826/pull-ci-openshift-jenkins-master-e2e-aws-jenkins/184 and the openshift-tests "[Feature:Builds][Slow] openshift pipeline build Sync plugin tests using the ephemeral template [Suite:openshift]" test.
A triage of an `Unauthorized` failure from a PR e2e was done today by David Eads, Trevor King, and myself, at https://coreos.slack.com/archives/CEKNRGF25/p1553872006116500. Reader's digest:
1) It occurred just as the install was completing and the e2e extended test framework was booting up.
2) An apiservice was inaccessible and the openshift apiserver's Available condition went from true to false (interestingly, the k8s apiserver was still Available=true, Progressing=true); a quick way to re-check that state is sketched below.
3) Use of a normal user's oauth token, while the openshift apiserver was unhappy, resulted in the 401/Unauthorized.
4) By the time the artifacts were grabbed, the apiservices were all happy again (so the apiservice inaccessibility was transient).
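For reference, the state above can be re-checked on a live cluster with standard oc calls (which apiservice to drill into depends on what is actually inaccessible at the time):

$ oc get apiservices                           # any aggregated apiservice with AVAILABLE=False is suspect
$ oc get clusteroperator openshift-apiserver   # Available went true -> false in the window above
$ oc get clusteroperator kube-apiserver        # stayed Available=True, Progressing=True in that window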
Update from today's various triage efforts:
- Have not seen much consistency between the various `Unauthorized` failures in e2e-aws-build and e2e-aws-jenkins, but one item that has popped up on several occasions is logs in the openshift-apiserver(s) like `E0401 15:19:52.465974 1 webhook.go:192] Failed to make webhook authorizer request: .authorization.k8s.io "" is invalid: spec.user: Invalid value: "": at least one of user or group must be specified`
- One such example was at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22435/pull-ci-openshift-origin-master-e2e-aws-builds/1180/artifacts/e2e-aws-builds/pods/openshift-apiserver_apiserver-4ptdr_openshift-apiserver.log.gz
- David Eads analyzed it and chimed in with "could be an issue with front proxy aggregation certs not being in sync." and asked @auth-team: "Can you make the front-proxy authenticator fail if there is an asserted user without a valid cert? Looking at the log above, it looks like the authentication stack of openshift-apiserver isn't quite right and is allowing an empty user" (a sketch for eyeballing that wiring follows below)
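As a follow-up aid, one way to eyeball the front-proxy (request-header) wiring David is referring to, sketched against the standard configmap the aggregation layer publishes (whether the certs are actually out of sync in these runs is still unconfirmed):

# The request-header client CA and allowed names that openshift-apiserver should
# trust for asserted-user headers live here:
$ oc get configmap extension-apiserver-authentication -n kube-system \
    -o jsonpath='{.data.requestheader-allowed-names}'
# Dump the subject/validity of the request-header client CA bundle
# (openssl prints only the first cert if the bundle contains several):
$ oc get configmap extension-apiserver-authentication -n kube-system \
    -o jsonpath='{.data.requestheader-client-ca-file}' | openssl x509 -noout -subject -dates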
So after more discussion with David, Comment #7 is pertinent to this bug, but Comment #8 is something different. I'll be opening a new bug against auth team for it.
Thanks for all the work, Gabe and David.
So I believe I have another instance of what David and I saw last week in Comment #7. See the results from the must-gather dev helper in https://github.com/openshift/origin/pull/22482#issuecomment-480288930 after analyzing https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22482/pull-ci-openshift-origin-master-e2e-aws-builds/1237/artifacts/e2e-aws-builds/
(In reply to Gabe Montero from comment #11)
> So I believe I have another instance of what David and I saw last week in
> Comment #7
>
> See the results from the must-gather dev helper in
> https://github.com/openshift/origin/pull/22482#issuecomment-480288930 after
> analyzing
> https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22482/pull-ci-openshift-origin-master-e2e-aws-builds/1237/artifacts/e2e-aws-builds/

"Available: v1.quota.openshift.io is not ready: 503" - this was already fixed by moving cluster resource quota to a CRD and is unrelated to this BZ.
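For anyone hitting the same symptom, the 503 above can be confirmed (or ruled out) directly against the apiservice object; a small sketch:

$ oc get apiservice v1.quota.openshift.io \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}: {.message}){"\n"}{end}'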
Meanwhile, we fixed a lot of cert rotation bugs and increased the graceful termination periods for API server shutdown, so this issue might already be fixed. Also, this BZ seems to combine various things. Moving to QA to verify that new-app works as expected without unauthorized errors.
I did not see any build-related oc-cli failures in the recent nightly builds with the concurrent build test. However, it was reported with a bigger-scale cluster in the scale-lab. Let me verify this after running the test again there this week.
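For clarity on what the concurrent build test exercises, a rough sketch of the shape of that load (the real SVT tooling is more involved; this is only an illustrative assumption):

# Kick off N builds of the sample app in parallel, then list their outcomes
$ for i in $(seq 1 20); do oc start-build cakephp-ex -n testproject & done; wait
$ oc get builds -n testproject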
Hongkai, hi, what's the result now? Thanks.
I haven't had the chance to test in the scale-lab yet (it is shared by many people).
Checked the scale-lab list this morning. The current test is the 4th one in the queue of cases.
(In reply to Hongkai Liu from comment #17)
> checked the scale-lab list this morning.
> Current test is the 4th one in the queue of cases.

How about now?
Rerun on scale-lab: With the test running on the pbench-controller pod, we still saw the crashes (much fewer than in the last round, though). We later found out that those were caused by OS parameters on the client side; using a normal AWS VM worked around it. Bug related to the concurrent build test: https://bugzilla.redhat.com/show_bug.cgi?id=1704722
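The exact client-side OS parameters are not spelled out in this thread; as a hedged illustration, these are the kind of limits typically inspected on a busy client host before suspecting the cluster (the specific values tuned in the scale-lab runs may differ):

$ ulimit -n                              # per-process open file descriptor limit for the test client
$ sysctl fs.file-max                     # system-wide file descriptor ceiling
$ sysctl net.ipv4.ip_local_port_range    # ephemeral ports available for outbound API connections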
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758