Description of problem:
The cluster creation was initiated through cluster-bot. It appears to be an IPI install on vSphere: https://prow.ci.openshift.org/log?container=test&id=1379763885606703104&job=release-openshift-origin-installer-launch-vsphere

The build log indicates that Terraform variable issues cause a failure to create infrastructure resources: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1379771933565915136/build-log.txt

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Moving this over to the Installer component for further action. The "metal" installer job appears to be perma-failing due to undeclared variables: https://prow.ci.openshift.org/?job=release-openshift-origin-installer-launch-metal
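For reference, "undeclared variable" failures in Terraform come from referencing (or passing a value for) an input variable that no module declares. A minimal illustrative reproduction — not taken from the installer's actual modules — looks like this:

```hcl
# main.tf -- references an input variable that is never declared
# in any variables.tf, so `terraform plan` fails with
# "Reference to undeclared input variable".
resource "local_file" "example" {
  filename = "out.txt"
  content  = var.vsphere_cluster
}
```

Passing an extra `-var` or tfvars entry that nothing declares produces the related "Value for undeclared variable" diagnostic.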
This seems to be referring to two different jobs, one vSphere and one metal. Which is the problem? Also, the metal job is not IPI, so it wouldn't go to the Metal IPI installer team.
This is being tracked in https://issues.redhat.com/browse/CORS-1661.
I was using cluster-bot to spin up a vSphere cluster (we were having Jenkins issues for a couple of days). Cluster-bot continued to send over logs before it finally stopped. I got both logs back from cluster-bot when I requested a vSphere cluster, and included them in case they are connected (apologies for my ignorance; I am not sure how cluster-bot handles failures). I am fairly sure that when I issued 'list' I did not see any requests to cluster-bot to create a cluster on metal. There was consistent failure to set up a cluster on vSphere. If there is not much to work with for the vSphere installation failure, this bug can be used to figure out the metal installation failure.
We'll look at this next sprint.
*** Bug 1955209 has been marked as a duplicate of this bug. ***
Matthew, reacting on this from the duplicate:

> Re-assigning to the core installer team. This is a bug in the e2e-metal job where it is trying to use a Service that only exists on build01.

All these jobs *are* running on build01. For example, the last cluster-bot job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1388097840395325440#1:build-log.txt%3A5

INFO[2021-04-30T11:48:22Z] Using namespace https://console.build01.ci.openshift.org/k8s/cluster/projects/ci-ln-piqf92k

I also checked the 4.8 jobs and some 4.7 ones in https://prow.ci.openshift.org/?job=release-openshift-ocp-installer-e2e-metal-* ; they are also all on build01.
(I happened to discover this bug when I was chasing the jobs that drive down the infrastructure pass ratio on 4.8 https://sippy.ci.openshift.org/testdetails?release=4.8&test=%5Bsig-sippy%5D+infrastructure+should+work )
(In reply to Petr Muller from comment #7)
> Matthew, reacting on this from the duplicate:
>
> > Re-assigning to the core installer team. This is a bug in the e2e-metal job where it is trying to use a Service that only exists on build01.
>
> All these jobs *are* running on build01. For example, the last cluster-bot job:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1388097840395325440#1:build-log.txt%3A5
> INFO[2021-04-30T11:48:22Z] Using namespace https://console.build01.ci.openshift.org/k8s/cluster/projects/ci-ln-piqf92k
> I also checked the 4.8 jobs and some 4.7 ones in https://prow.ci.openshift.org/?job=release-openshift-ocp-installer-e2e-metal-* ; they are also all on build01.

Oh, thank you for pointing that out, Petr. I am sorry that I did not do the necessary diligence to notice that this error is different from the known e2e-metal error.
(In reply to Matthew Staebler from comment #9)
> Oh, thank you for pointing that out, Petr. I am sorry that I did not do the necessary diligence to notice that this error is different from the known e2e-metal error.

No, this is the same problem. The relevant error is the following:

failed to create Matchbox client or connect to a3558a943132041b48b20a67aa291d99-23796056.us-east-1.elb.amazonaws.com:8081: context deadline exceeded
Yes, the problem is identical, but "running on a different cluster" is not the real cause. It looks like the service went down or something; let me know if I can help out with build01 somehow!
Turns out the jobs *are* sometimes running on build02 after being automatically shuffled between clusters. https://github.com/openshift/release/pull/18372 will peg these jobs to build01. It will likely not make them pass, because they are failing on build01 too, but it at least removes some entropy.
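For context, pegging a Prow job to a specific build cluster is done with the job's `cluster` field in the openshift/release job config. A hedged sketch (the surrounding fields are abbreviated, and only the `cluster` line is the point):

```yaml
periodics:
- name: release-openshift-origin-installer-launch-metal
  # Pin to build01 instead of letting the Prow scheduler
  # place the job on any available build cluster.
  cluster: build01
  # ...remaining job spec unchanged...
```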
After some investigation of the matchbox service, I suspected that the matchbox client certificate was the issue. Petr confirmed the certificate was indeed expired and we're working on updating it.
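For future triage, an expired client certificate like this can be spotted quickly, e.g. with `openssl x509 -enddate -noout -in client.crt`, or scripted. A minimal sketch using only the Python standard library (the date strings below are illustrative, not from the actual matchbox certificate):

```python
import ssl
import time

def cert_expired(not_after, now=None):
    """Return True if a certificate's notAfter timestamp is in the past.

    `not_after` uses the format printed by `openssl x509 -enddate`,
    e.g. "Jun 26 21:41:46 2013 GMT".
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    return (now if now is not None else time.time()) > expiry

# A notAfter date in 2013 is long past:
print(cert_expired("Jun 26 21:41:46 2013 GMT"))  # -> True
```

Wiring this into the job's pre-flight checks would turn a "context deadline exceeded" mystery into an explicit "certificate expired" failure.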
Secrets updated, and jobs are passing again (Verified): https://prow.ci.openshift.org/?job=release-openshift-ocp-installer-e2e-metal*
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438