Description of problem:

We are seeing CI failures on the metal-ipi-dualstack job.

Sample failures:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1169/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1544741112055336960
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1169/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1545031688029999104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1170/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1544952520625885184
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1170/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1544915796273139712

Failing test: node count should match or exceed machine count

{ Timed out waiting for node count (5) to equal or exceed machine count (6).
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-9gzzc-master-0         Running                              60m
openshift-machine-api   ostest-9gzzc-master-1         Running                              60m
openshift-machine-api   ostest-9gzzc-master-2         Running                              60m
openshift-machine-api   ostest-9gzzc-worker-0-njmjc   Running                              43m
openshift-machine-api   ostest-9gzzc-worker-0-rvwqm   Provisioned                          43m
openshift-machine-api   ostest-9gzzc-worker-0-v5b47   Running                              43m

NAME                                 STATUS   ROLES    AGE   VERSION
master-0.ostest.test.metalkube.org   Ready    master   47m   v1.24.0+bd7662a
master-1.ostest.test.metalkube.org   Ready    master   47m   v1.24.0+bd7662a
master-2.ostest.test.metalkube.org   Ready    master   47m   v1.24.0+bd7662a
worker-0.ostest.test.metalkube.org   Ready    worker   30m   v1.24.0+bd7662a
worker-1.ostest.test.metalkube.org   Ready    worker   29m   v1.24.0+bd7662a
}

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
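For reference, the failing check is essentially a comparison of the number of Nodes against the number of Machines in the openshift-machine-api namespace. A rough, minimal sketch of that comparison using client-go's typed and dynamic clients is below (the kubeconfig path, GroupVersionResource and namespace are assumptions based on the output above; this is not the actual CI test code):

package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load a kubeconfig; the CI job would use its own credentials here.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    kube := kubernetes.NewForConfigOrDie(cfg)
    dyn := dynamic.NewForConfigOrDie(cfg)

    nodes, err := kube.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Machines are a custom resource, so list them via the dynamic client.
    machineGVR := schema.GroupVersionResource{
        Group:    "machine.openshift.io",
        Version:  "v1beta1",
        Resource: "machines",
    }
    machines, err := dyn.Resource(machineGVR).Namespace("openshift-machine-api").
        List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    if len(nodes.Items) < len(machines.Items) {
        fmt.Printf("node count (%d) is below machine count (%d)\n",
            len(nodes.Items), len(machines.Items))
    } else {
        fmt.Println("node count matches or exceeds machine count")
    }
}

In the failed runs above this comparison stays at 5 nodes versus 6 machines because one worker machine never progresses past Provisioned.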
I see the most recent jobs passing; is this issue still happening?
I have a PR that failed on this just last Friday: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-metal3_dev-scripts/1348/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-ovn-dualstack/1550409018281824256 I suppose it may have been fixed since then, but that wasn't very long ago.
Hi Riccardo, I haven't been following the recent jobs; I just reported the bug when it was happening that week. If you feel the root cause is well understood and already fixed, you can close this bug and point to the fix. However, as Ben reported, if this is still happening then we probably need to look into those runs. The run linked in the previous comment may help.
Actually I hit this again today: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1222/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1554054347992600576
Another sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1222/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1553075109843963904
Some excerpts from the machine-api logs:

2022-07-29T19:16:23.391045156Z I0729 19:16:23.391042 1 nodelink_controller.go:333] Finding node from machine "ostest-5x6x5-worker-0-pwxtb"
2022-07-29T19:16:23.391049363Z I0729 19:16:23.391044 1 nodelink_controller.go:350] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by providerID
2022-07-29T19:16:23.391055054Z I0729 19:16:23.391050 1 nodelink_controller.go:374] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by IP
2022-07-29T19:16:23.391055054Z I0729 19:16:23.391051 1 nodelink_controller.go:191] Reconciling Node /worker-0.ostest.test.metalkube.org
2022-07-29T19:16:23.391060404Z I0729 19:16:23.391056 1 nodelink_controller.go:379] Found internal IP for machine "ostest-5x6x5-worker-0-pwxtb": "192.168.111.25"
2022-07-29T19:16:23.391071717Z I0729 19:16:23.391064 1 nodelink_controller.go:403] Matching node not found for machine "ostest-5x6x5-worker-0-pwxtb" with internal IP "192.168.111.25"
2022-07-29T19:16:23.391071717Z I0729 19:16:23.391068 1 nodelink_controller.go:408] Finding machine from node "worker-0.ostest.test.metalkube.org"
2022-07-29T19:16:23.391077027Z I0729 19:16:23.391070 1 nodelink_controller.go:327] No-op: Node for machine "ostest-5x6x5-worker-0-pwxtb" not found
2022-07-29T19:16:23.391077027Z I0729 19:16:23.391072 1 nodelink_controller.go:425] Finding machine from node "worker-0.ostest.test.metalkube.org" by ProviderID
2022-07-29T19:16:23.391081465Z I0729 19:16:23.391074 1 nodelink_controller.go:290] Watched machine event, finding node to reconcile.Request
2022-07-29T19:16:23.391096283Z I0729 19:16:23.391087 1 nodelink_controller.go:333] Finding node from machine "ostest-5x6x5-worker-0-pwxtb"
2022-07-29T19:16:23.391096283Z I0729 19:16:23.391086 1 nodelink_controller.go:441] Found machine "ostest-5x6x5-worker-0-2f5x5" for node "worker-0.ostest.test.metalkube.org" with providerID "baremetalhost:///openshift-machine-api/ostest-worker-0/a4b95a25-03b3-4d81-af01-57b8ae990756"
2022-07-29T19:16:23.391096283Z I0729 19:16:23.391092 1 nodelink_controller.go:350] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by providerID
2022-07-29T19:16:23.391102705Z I0729 19:16:23.391099 1 nodelink_controller.go:374] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by IP
2022-07-29T19:16:23.391106964Z I0729 19:16:23.391103 1 nodelink_controller.go:379] Found internal IP for machine "ostest-5x6x5-worker-0-pwxtb": "192.168.111.25"
2022-07-29T19:16:23.391117904Z I0729 19:16:23.391110 1 nodelink_controller.go:403] Matching node not found for machine "ostest-5x6x5-worker-0-pwxtb" with internal IP "192.168.111.25"
2022-07-29T19:16:23.391117904Z I0729 19:16:23.391114 1 nodelink_controller.go:327] No-op: Node for machine "ostest-5x6x5-worker-0-pwxtb" not found
2022-07-29T19:39:52.438903891Z 2022/07/29 19:39:52 Checking if machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:39:52.438903891Z 2022/07/29 19:39:52 Machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:39:52.438950290Z I0729 19:39:52.438905 1 controller.go:305] ostest-5x6x5-worker-0-pwxtb: reconciling machine triggers idempotent update
2022-07-29T19:39:52.438950290Z 2022/07/29 19:39:52 Updating machine ostest-5x6x5-worker-0-pwxtb
2022-07-29T19:39:52.438956852Z 2022/07/29 19:39:52 Not setting Node ProviderID - Node does not yet exist for Machine ostest-5x6x5-worker-0-pwxtb
2022-07-29T19:39:52.438961391Z 2022/07/29 19:39:52 Finished updating machine ostest-5x6x5-worker-0-pwxtb .
2022-07-29T19:39:52.446448792Z I0729 19:39:52.446417 1 controller.go:333] ostest-5x6x5-worker-0-pwxtb: has no node yet, requeuing
2022-07-29T19:40:22.447169276Z I0729 19:40:22.447113 1 controller.go:179] ostest-5x6x5-worker-0-pwxtb: reconciling Machine
2022-07-29T19:40:22.447169276Z 2022/07/29 19:40:22 Checking if machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:40:22.447169276Z 2022/07/29 19:40:22 Machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:40:22.447219841Z I0729 19:40:22.447170 1 controller.go:305] ostest-5x6x5-worker-0-pwxtb: reconciling machine triggers idempotent update
2022-07-29T19:40:22.447219841Z 2022/07/29 19:40:22 Updating machine ostest-5x6x5-worker-0-pwxtb .
2022-07-29T19:40:22.447225643Z 2022/07/29 19:40:22 Not setting Node ProviderID - Node does not yet exist for Machine ostest-5x6x5-worker-0-pwxtb
2022-07-29T19:40:22.447225643Z 2022/07/29 19:40:22 Finished updating machine ostest-5x6x5-worker-0-pwxtb .
2022-07-29T19:40:22.456922006Z I0729 19:40:22.456884 1 controller.go:333] ostest-5x6x5-worker-0-pwxtb: has no node yet, requeuing

All the worker nodes are up and running. The nodes with IPs 192.168.111.23 and 192.168.111.24 are correctly assigned to their respective machines; the same does not happen for the node with IP 192.168.111.25.
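To make the log lines above easier to follow: the nodelink controller first tries to pair each Machine with a Node by providerID and, failing that, by internal IP. Below is a minimal sketch of that pairing flow, assuming the OpenShift machine v1beta1 and core v1 types; the function name and structure are illustrative, not the actual nodelink_controller.go code.

package main

import (
    "fmt"

    machinev1 "github.com/openshift/api/machine/v1beta1"
    corev1 "k8s.io/api/core/v1"
)

// findNodeForMachine is an illustrative reduction of the flow the logs
// describe: match a Node to a Machine by providerID first, then fall back
// to comparing internal IP addresses.
func findNodeForMachine(machine *machinev1.Machine, nodes []corev1.Node) *corev1.Node {
    // 1. Match by providerID when the Machine has one set.
    if machine.Spec.ProviderID != nil && *machine.Spec.ProviderID != "" {
        for i := range nodes {
            if nodes[i].Spec.ProviderID == *machine.Spec.ProviderID {
                return &nodes[i]
            }
        }
    }

    // 2. Fall back to matching on internal IP addresses reported in status.
    machineIPs := map[string]bool{}
    for _, addr := range machine.Status.Addresses {
        if addr.Type == corev1.NodeInternalIP {
            machineIPs[addr.Address] = true
        }
    }
    for i := range nodes {
        for _, addr := range nodes[i].Status.Addresses {
            if addr.Type == corev1.NodeInternalIP && machineIPs[addr.Address] {
                return &nodes[i]
            }
        }
    }

    // No match: this is the "Matching node not found for machine ... with
    // internal IP ..." case seen in the logs for 192.168.111.25.
    return nil
}

func main() {
    machine := &machinev1.Machine{}
    machine.Status.Addresses = []corev1.NodeAddress{{Type: corev1.NodeInternalIP, Address: "192.168.111.25"}}
    if findNodeForMachine(machine, nil) == nil {
        fmt.Println("matching node not found for machine")
    }
}

In the failing runs the fallback IP match for 192.168.111.25 can never succeed, because the corresponding Node simply never registered.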
Is this something you can help investigate?
@rpittau I've had a look through and as far as I can tell, the Machine API provisioned the node, the baremetal provisioning system then was supposed to do its thing and ensure the baremetal host got the correct configuration to join the cluster. However, somewhere between MAPI and the Node starting, something seems to have gone wrong. There's no record of the node attempting to request a client certificate so either the Baremetal provisioning failed, or the software layer broke and kubelet didn't start correctly. Looking at the baremetal logs it implies the node was provisioned correctly so I suspect this is a random software level failure. Is it possible we collect host serial console logs for baremetal?
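For anyone repeating the check above: the "no record of the node attempting to request a client certificate" observation corresponds to an absence of bootstrap CertificateSigningRequests for that host. A rough sketch of how one might list them with client-go follows; the kubeconfig path and the username matching are assumptions, not the exact check used here.

package main

import (
    "context"
    "fmt"
    "log"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (path is an assumption).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    csrs, err := client.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }
    for _, csr := range csrs.Items {
        // Bootstrap (node-client) CSRs are requested before a Node object exists;
        // a host whose kubelet never started leaves no such CSR behind.
        // The username matching below is a heuristic, not an exact filter.
        if strings.Contains(csr.Spec.Username, "node-bootstrapper") ||
            strings.HasPrefix(csr.Spec.Username, "system:node:") {
            fmt.Printf("%s requested by %s (signer %s)\n", csr.Name, csr.Spec.Username, csr.Spec.SignerName)
        }
    }
}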
(In reply to Joel Speed from comment #8)
> @rpittau I've had a look through and as far as I can tell, the
> Machine API provisioned the node, the baremetal provisioning system then was
> supposed to do its thing and ensure the baremetal host got the correct
> configuration to join the cluster. However, somewhere between MAPI and the
> Node starting, something seems to have gone wrong.
> There's no record of the node attempting to request a client certificate so
> either the Baremetal provisioning failed, or the software layer broke and
> kubelet didn't start correctly.
>
> Looking at the baremetal logs it implies the node was provisioned correctly
> so I suspect this is a random software level failure. Is it possible we
> collect host serial console logs for baremetal?

I am having good results with something I am trying out in MCO: https://github.com/openshift/machine-config-operator/pull/3209

Could this be caused by a problem there that the change above might somehow be fixing?
It could well be the case that a failure to resolve the hostname is preventing kubelet from starting. Kubelet needs to know the hostname before it can attempt to register, so the hostname being set is a prerequisite (I believe via systemd dependencies).
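For context on why the hostname matters: kubelet registers under a node name derived from the machine hostname, so an unset or localhost hostname at boot blocks registration under the expected name. A toy illustration of that dependency (the sanity checks are hypothetical, not kubelet's actual code):

package main

import (
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    // Kubelet's default node name is derived from the machine hostname,
    // so the hostname must be set before registration can work.
    hostname, err := os.Hostname()
    if err != nil {
        log.Fatal(err)
    }
    nodeName := strings.ToLower(strings.TrimSpace(hostname))
    // Hypothetical sanity check: an empty or "localhost" hostname would not
    // produce the expected worker-N.ostest.test.metalkube.org node name.
    if nodeName == "" || nodeName == "localhost" {
        log.Fatalf("hostname %q is not usable as a node name yet", nodeName)
    }
    fmt.Println("kubelet would register as node:", nodeName)
}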
@jspeed As far as I can tell, the bare metal provisioning didn't fail. You should be able to find the logs in the must-gather inside the job itself under Artifacts, but if the deployment of the bare metal nodes was successful I don't think we keep the console logs.
Without host-level logs there's not much more we can do here, then, as something went wrong during the boot process. It could be the hostname issue you linked, or possibly some other issue that prevented the networking stack from coming up properly.
Moving to VERIFIED. All recent jobs since Aug 10 have passed: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi-ovn-dualstack
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399