Bug 1947036 - "failed to create Matchbox client or connect" on e2e-metal jobs or metal clusters via cluster-bot
Summary: "failed to create Matchbox client or connect" on e2e-metal jobs or metal clusters via cluster-bot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Etienne Simard
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Duplicates: 1955209
Depends On:
Blocks:
 
Reported: 2021-04-07 14:23 UTC by Arti Sood
Modified: 2021-07-27 22:58 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
This is a CI-only fix.
Clone Of:
Environment:
job=release-openshift-ocp-installer-e2e-metal-serial-4.7=all job=release-openshift-ocp-installer-e2e-metal-4.7=all job=release-openshift-ocp-installer-e2e-metal-compact-4.7=all job=release-openshift-ocp-installer-e2e-metal-serial-4.6=all job=release-openshift-ocp-installer-e2e-metal-4.6=all job=release-openshift-ocp-installer-e2e-metal-compact-4.6=all
Last Closed: 2021-07-27 22:57:54 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:58:17 UTC

Description Arti Sood 2021-04-07 14:23:45 UTC
Description of problem:
Cluster creation was initiated through cluster-bot. It appears to be an IPI install on vSphere.

https://prow.ci.openshift.org/log?container=test&id=1379763885606703104&job=release-openshift-origin-installer-launch-vsphere

The build log indicates that terraform variable issues cause the failure to create the infrastructure resources.
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1379771933565915136/build-log.txt
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 brad.williams 2021-04-09 12:58:36 UTC
Moving this over to the Installer component for further action.  The "metal" installer job appears to be perma-failing due to undeclared variables:

https://prow.ci.openshift.org/?job=release-openshift-origin-installer-launch-metal

Comment 2 Ben Nemec 2021-04-13 16:08:13 UTC
This seems to be referring to two different jobs, one vsphere and one metal. Which is the problem? Also, the metal job is not IPI, so it wouldn't go to the Metal IPI installer team.

Comment 3 Matthew Staebler 2021-04-13 17:03:54 UTC
This is being tracked in https://issues.redhat.com/browse/CORS-1661.

Comment 4 Arti Sood 2021-04-13 17:15:01 UTC
I was using cluster-bot to spin up a vSphere cluster (we were having jenkins issues for a couple of days). Cluster-bot continues to send over the logs before it finally stops. I did get both logs returned to me by cluster-bot when I requested a vSphere cluster. I included both just in case they are connected (apologies for my ignorance; I am not sure how cluster-bot handles failures). I am pretty sure that when I issued 'list' I did not see any requests to cluster-bot to create a cluster on metal. There was a consistent failure to set up the cluster on vSphere. If there is not much to work with for the vSphere installation failure, this bug can be used to figure out the metal installation failure.

Comment 5 Matthew Staebler 2021-04-19 17:17:41 UTC
We'll look at this next sprint.

Comment 6 Matthew Staebler 2021-04-29 18:36:04 UTC
*** Bug 1955209 has been marked as a duplicate of this bug. ***

Comment 7 Petr Muller 2021-04-30 14:21:04 UTC
Matthew, reacting on this from the duplicate:

> Re-assigning to the core installer team. This is a bug in the e2e-metal job where it is trying to use a Service that only exists on build01.

All these jobs *are* running on build01, for example last cluster bot job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1388097840395325440#1:build-log.txt%3A5

INFO[2021-04-30T11:48:22Z] Using namespace https://console.build01.ci.openshift.org/k8s/cluster/projects/ci-ln-piqf92k 

I also checked 4.8 jobs and some 4.7 ones in https://prow.ci.openshift.org/?job=release-openshift-ocp-installer-e2e-metal-* ; they are all on build01 as well.

Comment 8 Petr Muller 2021-04-30 14:27:34 UTC
(I happened to discover this bug when I was chasing the jobs that drive down the infrastructure pass ratio on 4.8 https://sippy.ci.openshift.org/testdetails?release=4.8&test=%5Bsig-sippy%5D+infrastructure+should+work )

Comment 9 Matthew Staebler 2021-04-30 14:28:22 UTC
(In reply to Petr Muller from comment #7)
> Matthew, reacting on this from the duplicate:
> 
> > Re-assigning to the core installer team. This is a bug in the e2e-metal job where it is trying to use a Service that only exists on build01.
> 
> All these jobs *are* running on build01, for example last cluster bot job:
> 
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-
> origin-installer-launch-metal/1388097840395325440#1:build-log.txt%3A5
> 
> INFO[2021-04-30T11:48:22Z] Using namespace
> https://console.build01.ci.openshift.org/k8s/cluster/projects/ci-ln-piqf92k 
> 
> I also checked 4.8 jobs and some 4.7 ones in
> https://prow.ci.openshift.org/?job=release-openshift-ocp-installer-e2e-metal-
> * , they are also all on build01

Oh, thank you for pointing that out, Petr. I am sorry that I did not do the necessary diligence to notice that this error is different from the known e2e-metal error.

Comment 10 Matthew Staebler 2021-04-30 14:31:29 UTC
(In reply to Matthew Staebler from comment #9)
> Oh, thank you for pointing that out, Petr. I am sorry that I did not do the
> necessary diligence to notice that this error is different from the known
> e2e-metal error.

No, this is the same problem. The relevant error is the following.

  failed to create Matchbox client or connect to a3558a943132041b48b20a67aa291d99-23796056.us-east-1.elb.amazonaws.com:8081: context deadline exceeded
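The "context deadline exceeded" text is the symptom of the matchbox endpoint not answering within the client's timeout. A minimal reachability probe can be sketched in shell; the TEST-NET address below is a stand-in for the real ELB hostname, which should be taken from the job log:

```shell
# Hypothetical reachability probe for the matchbox gRPC port. 192.0.2.1 is a
# reserved TEST-NET-1 address (RFC 5737) used here as a stand-in; substitute
# the real ELB hostname from the job log. A hung or unreachable endpoint is
# what the terraform matchbox provider surfaces as "context deadline exceeded".
if timeout 5 bash -c 'exec 3<>/dev/tcp/192.0.2.1/8081' 2>/dev/null; then
  echo "reachable"
else
  echo "unreachable"
fi
```

This only tests TCP reachability; a TLS or gRPC handshake failure (such as a bad client certificate) would need a separate check.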

Comment 11 Petr Muller 2021-04-30 14:51:04 UTC
Yes, the problem is identical, but "running on different cluster" is not the real cause. Looks like the service went down or something: let me know if I can help out with build01 somehow!

Comment 12 Petr Muller 2021-05-06 12:44:01 UTC
Turns out the jobs *are* sometimes running on b02 after being automatically shuffled between clusters. https://github.com/openshift/release/pull/18372 will peg these jobs to b01. It will likely not make them pass b/c they are failing on b01 too, but it at least removes some entropy.

Comment 13 Etienne Simard 2021-05-07 15:52:19 UTC
After some investigation of the matchbox service, I suspected that the matchbox client certificate was the issue. Petr confirmed the certificate was indeed expired and we're working on updating it.
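Expiry of a PEM client certificate can be checked with openssl. A minimal sketch, assuming the matchbox client certificate is available as a local PEM file; the path and the self-signed certificate below are illustrative only:

```shell
# Generate a throwaway self-signed cert just to demonstrate the check;
# in practice, point the -in flag at the real matchbox client cert.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/key.pem -out /tmp/client.crt \
  -subj "/CN=matchbox-client" 2>/dev/null

# Print the notAfter date, then test whether the cert is already expired
# (-checkend 0 exits non-zero if the cert is past its expiry).
openssl x509 -noout -enddate -in /tmp/client.crt
if openssl x509 -noout -checkend 0 -in /tmp/client.crt >/dev/null; then
  echo "certificate still valid"
else
  echo "certificate expired"
fi
```

The same `-enddate`/`-checkend` check against the cert actually deployed in the CI secret would have flagged the expiry directly.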

Comment 14 Etienne Simard 2021-05-11 15:40:11 UTC
Secrets updated, and jobs are passing again (Verified) https://prow.ci.openshift.org/?job=release-openshift-ocp-installer-e2e-metal*

Comment 17 errata-xmlrpc 2021-07-27 22:57:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

