Description of problem:

We are seeing CI failures on the metal-ipi-dualstack job.

Sample failures:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1169/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1544741112055336960
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1169/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1545031688029999104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1170/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1544952520625885184
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1170/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1544915796273139712

Failing test: node count should match or exceed machine count

{ Timed out waiting for node count (5) to equal or exceed machine count (6).
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-9gzzc-master-0         Running                              60m
openshift-machine-api   ostest-9gzzc-master-1         Running                              60m
openshift-machine-api   ostest-9gzzc-master-2         Running                              60m
openshift-machine-api   ostest-9gzzc-worker-0-njmjc   Running                              43m
openshift-machine-api   ostest-9gzzc-worker-0-rvwqm   Provisioned                          43m
openshift-machine-api   ostest-9gzzc-worker-0-v5b47   Running                              43m

NAME                                 STATUS   ROLES    AGE   VERSION
master-0.ostest.test.metalkube.org   Ready    master   47m   v1.24.0+bd7662a
master-1.ostest.test.metalkube.org   Ready    master   47m   v1.24.0+bd7662a
master-2.ostest.test.metalkube.org   Ready    master   47m   v1.24.0+bd7662a
worker-0.ostest.test.metalkube.org   Ready    worker   30m   v1.24.0+bd7662a
worker-1.ostest.test.metalkube.org   Ready    worker   29m   v1.24.0+bd7662a
}

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
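For reference, the failing check is essentially a comparison of the number of Nodes against the number of Machines in the openshift-machine-api namespace. A rough, minimal sketch of that comparison using client-go's typed and dynamic clients is below (the kubeconfig path, GroupVersionResource and namespace are assumptions based on the output above; this is not the actual CI test code):

package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load a kubeconfig; the CI job would use its own credentials here.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    kube := kubernetes.NewForConfigOrDie(cfg)
    dyn := dynamic.NewForConfigOrDie(cfg)

    nodes, err := kube.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Machines are a custom resource, so list them via the dynamic client.
    machineGVR := schema.GroupVersionResource{
        Group:    "machine.openshift.io",
        Version:  "v1beta1",
        Resource: "machines",
    }
    machines, err := dyn.Resource(machineGVR).Namespace("openshift-machine-api").
        List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    if len(nodes.Items) < len(machines.Items) {
        fmt.Printf("node count (%d) is below machine count (%d)\n",
            len(nodes.Items), len(machines.Items))
    } else {
        fmt.Println("node count matches or exceeds machine count")
    }
}

In the failed runs above this comparison stays at 5 nodes versus 6 machines because one worker machine never progresses past Provisioned.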
I see the most recent jobs passing; is this issue still happening?
I have a PR that failed on this just last Friday: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-metal3_dev-scripts/1348/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-ovn-dualstack/1550409018281824256 I suppose it may have been fixed since then, but that wasn't very long ago.
Hi Riccardo, I haven't been following the recent jobs; I just reported the bug when it was happening that week. If you feel the root cause is well understood and already fixed, you can close this bug and point to the fix. However, as Ben reported, if this is still happening then we probably need to look into those runs. The run linked in the previous comment may help.
Actually I hit this again today: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1222/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1554054347992600576
Another sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1222/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/1553075109843963904
Some excerpts from the machine-api logs:

2022-07-29T19:16:23.391045156Z I0729 19:16:23.391042 1 nodelink_controller.go:333] Finding node from machine "ostest-5x6x5-worker-0-pwxtb"
2022-07-29T19:16:23.391049363Z I0729 19:16:23.391044 1 nodelink_controller.go:350] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by providerID
2022-07-29T19:16:23.391055054Z I0729 19:16:23.391050 1 nodelink_controller.go:374] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by IP
2022-07-29T19:16:23.391055054Z I0729 19:16:23.391051 1 nodelink_controller.go:191] Reconciling Node /worker-0.ostest.test.metalkube.org
2022-07-29T19:16:23.391060404Z I0729 19:16:23.391056 1 nodelink_controller.go:379] Found internal IP for machine "ostest-5x6x5-worker-0-pwxtb": "192.168.111.25"
2022-07-29T19:16:23.391071717Z I0729 19:16:23.391064 1 nodelink_controller.go:403] Matching node not found for machine "ostest-5x6x5-worker-0-pwxtb" with internal IP "192.168.111.25"
2022-07-29T19:16:23.391071717Z I0729 19:16:23.391068 1 nodelink_controller.go:408] Finding machine from node "worker-0.ostest.test.metalkube.org"
2022-07-29T19:16:23.391077027Z I0729 19:16:23.391070 1 nodelink_controller.go:327] No-op: Node for machine "ostest-5x6x5-worker-0-pwxtb" not found
2022-07-29T19:16:23.391077027Z I0729 19:16:23.391072 1 nodelink_controller.go:425] Finding machine from node "worker-0.ostest.test.metalkube.org" by ProviderID
2022-07-29T19:16:23.391081465Z I0729 19:16:23.391074 1 nodelink_controller.go:290] Watched machine event, finding node to reconcile.Request
2022-07-29T19:16:23.391096283Z I0729 19:16:23.391087 1 nodelink_controller.go:333] Finding node from machine "ostest-5x6x5-worker-0-pwxtb"
2022-07-29T19:16:23.391096283Z I0729 19:16:23.391086 1 nodelink_controller.go:441] Found machine "ostest-5x6x5-worker-0-2f5x5" for node "worker-0.ostest.test.metalkube.org" with providerID "baremetalhost:///openshift-machine-api/ostest-worker-0/a4b95a25-03b3-4d81-af01-57b8ae990756"
2022-07-29T19:16:23.391096283Z I0729 19:16:23.391092 1 nodelink_controller.go:350] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by providerID
2022-07-29T19:16:23.391102705Z I0729 19:16:23.391099 1 nodelink_controller.go:374] Finding node from machine "ostest-5x6x5-worker-0-pwxtb" by IP
2022-07-29T19:16:23.391106964Z I0729 19:16:23.391103 1 nodelink_controller.go:379] Found internal IP for machine "ostest-5x6x5-worker-0-pwxtb": "192.168.111.25"
2022-07-29T19:16:23.391117904Z I0729 19:16:23.391110 1 nodelink_controller.go:403] Matching node not found for machine "ostest-5x6x5-worker-0-pwxtb" with internal IP "192.168.111.25"
2022-07-29T19:16:23.391117904Z I0729 19:16:23.391114 1 nodelink_controller.go:327] No-op: Node for machine "ostest-5x6x5-worker-0-pwxtb" not found
2022-07-29T19:39:52.438903891Z 2022/07/29 19:39:52 Checking if machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:39:52.438903891Z 2022/07/29 19:39:52 Machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:39:52.438950290Z I0729 19:39:52.438905 1 controller.go:305] ostest-5x6x5-worker-0-pwxtb: reconciling machine triggers idempotent update
2022-07-29T19:39:52.438950290Z 2022/07/29 19:39:52 Updating machine ostest-5x6x5-worker-0-pwxtb
2022-07-29T19:39:52.438956852Z 2022/07/29 19:39:52 Not setting Node ProviderID - Node does not yet exist for Machine ostest-5x6x5-worker-0-pwxtb
2022-07-29T19:39:52.438961391Z 2022/07/29 19:39:52 Finished updating machine ostest-5x6x5-worker-0-pwxtb .
2022-07-29T19:39:52.446448792Z I0729 19:39:52.446417 1 controller.go:333] ostest-5x6x5-worker-0-pwxtb: has no node yet, requeuing
2022-07-29T19:40:22.447169276Z I0729 19:40:22.447113 1 controller.go:179] ostest-5x6x5-worker-0-pwxtb: reconciling Machine
2022-07-29T19:40:22.447169276Z 2022/07/29 19:40:22 Checking if machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:40:22.447169276Z 2022/07/29 19:40:22 Machine ostest-5x6x5-worker-0-pwxtb exists.
2022-07-29T19:40:22.447219841Z I0729 19:40:22.447170 1 controller.go:305] ostest-5x6x5-worker-0-pwxtb: reconciling machine triggers idempotent update
2022-07-29T19:40:22.447219841Z 2022/07/29 19:40:22 Updating machine ostest-5x6x5-worker-0-pwxtb .
2022-07-29T19:40:22.447225643Z 2022/07/29 19:40:22 Not setting Node ProviderID - Node does not yet exist for Machine ostest-5x6x5-worker-0-pwxtb
2022-07-29T19:40:22.447225643Z 2022/07/29 19:40:22 Finished updating machine ostest-5x6x5-worker-0-pwxtb .
2022-07-29T19:40:22.456922006Z I0729 19:40:22.456884 1 controller.go:333] ostest-5x6x5-worker-0-pwxtb: has no node yet, requeuing

All the worker nodes are up and running. The nodes with IPs 192.168.111.23 and 192.168.111.24 are correctly assigned to their respective machines; the same does not happen for the node with IP 192.168.111.25.
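To make the log lines above easier to follow: the nodelink controller first tries to pair each Machine with a Node by providerID and, failing that, by internal IP. Below is a minimal sketch of that pairing flow, assuming the OpenShift machine v1beta1 and core v1 types; the function name and structure are illustrative, not the actual nodelink_controller.go code.

package main

import (
    "fmt"

    machinev1 "github.com/openshift/api/machine/v1beta1"
    corev1 "k8s.io/api/core/v1"
)

// findNodeForMachine is an illustrative reduction of the flow the logs
// describe: match a Node to a Machine by providerID first, then fall back
// to comparing internal IP addresses.
func findNodeForMachine(machine *machinev1.Machine, nodes []corev1.Node) *corev1.Node {
    // 1. Match by providerID when the Machine has one set.
    if machine.Spec.ProviderID != nil && *machine.Spec.ProviderID != "" {
        for i := range nodes {
            if nodes[i].Spec.ProviderID == *machine.Spec.ProviderID {
                return &nodes[i]
            }
        }
    }

    // 2. Fall back to matching on internal IP addresses reported in status.
    machineIPs := map[string]bool{}
    for _, addr := range machine.Status.Addresses {
        if addr.Type == corev1.NodeInternalIP {
            machineIPs[addr.Address] = true
        }
    }
    for i := range nodes {
        for _, addr := range nodes[i].Status.Addresses {
            if addr.Type == corev1.NodeInternalIP && machineIPs[addr.Address] {
                return &nodes[i]
            }
        }
    }

    // No match: this is the "Matching node not found for machine ... with
    // internal IP ..." case seen in the logs for 192.168.111.25.
    return nil
}

func main() {
    machine := &machinev1.Machine{}
    machine.Status.Addresses = []corev1.NodeAddress{{Type: corev1.NodeInternalIP, Address: "192.168.111.25"}}
    if findNodeForMachine(machine, nil) == nil {
        fmt.Println("matching node not found for machine")
    }
}

In the failing runs the fallback IP match for 192.168.111.25 can never succeed, because the corresponding Node simply never registered.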
Is this something you can help investigate?
@rpittau I've had a look through and as far as I can tell, the Machine API provisioned the node, the baremetal provisioning system then was supposed to do its thing and ensure the baremetal host got the correct configuration to join the cluster. However, somewhere between MAPI and the Node starting, something seems to have gone wrong. There's no record of the node attempting to request a client certificate so either the Baremetal provisioning failed, or the software layer broke and kubelet didn't start correctly. Looking at the baremetal logs it implies the node was provisioned correctly so I suspect this is a random software level failure. Is it possible we collect host serial console logs for baremetal?
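For anyone repeating the check above: the "no record of the node attempting to request a client certificate" observation corresponds to an absence of bootstrap CertificateSigningRequests for that host. A rough sketch of how one might list them with client-go follows; the kubeconfig path and the username matching are assumptions, not the exact check used here.

package main

import (
    "context"
    "fmt"
    "log"
    "strings"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (path is an assumption).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    csrs, err := client.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }
    for _, csr := range csrs.Items {
        // Bootstrap (node-client) CSRs are requested before a Node object exists;
        // a host whose kubelet never started leaves no such CSR behind.
        // The username matching below is a heuristic, not an exact filter.
        if strings.Contains(csr.Spec.Username, "node-bootstrapper") ||
            strings.HasPrefix(csr.Spec.Username, "system:node:") {
            fmt.Printf("%s requested by %s (signer %s)\n", csr.Name, csr.Spec.Username, csr.Spec.SignerName)
        }
    }
}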
(In reply to Joel Speed from comment #8)
> @rpittau I've had a look through and as far as I can tell, the
> Machine API provisioned the node, the baremetal provisioning system then was
> supposed to do its thing and ensure the baremetal host got the correct
> configuration to join the cluster. However, somewhere between MAPI and the
> Node starting, something seems to have gone wrong.
> There's no record of the node attempting to request a client certificate so
> either the Baremetal provisioning failed, or the software layer broke and
> kubelet didn't start correctly.
>
> Looking at the baremetal logs it implies the node was provisioned correctly
> so I suspect this is a random software level failure. Is it possible we
> collect host serial console logs for baremetal?

I am having good results with something I am trying out in MCO: https://github.com/openshift/machine-config-operator/pull/3209

Could this be caused by a problem there that the change above might somehow be fixing?
It could well be the case that a failure to resolve the hostname is preventing kubelet from starting. Kubelet needs to know the hostname before it can attempt to register, so the hostname being set is a prerequisite (I believe via systemd dependencies).
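For context on why the hostname matters: kubelet registers under a node name derived from the machine hostname, so an unset or localhost hostname at boot blocks registration under the expected name. A toy illustration of that dependency (the sanity checks are hypothetical, not kubelet's actual code):

package main

import (
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    // Kubelet's default node name is derived from the machine hostname,
    // so the hostname must be set before registration can work.
    hostname, err := os.Hostname()
    if err != nil {
        log.Fatal(err)
    }
    nodeName := strings.ToLower(strings.TrimSpace(hostname))
    // Hypothetical sanity check: an empty or "localhost" hostname would not
    // produce the expected worker-N.ostest.test.metalkube.org node name.
    if nodeName == "" || nodeName == "localhost" {
        log.Fatalf("hostname %q is not usable as a node name yet", nodeName)
    }
    fmt.Println("kubelet would register as node:", nodeName)
}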
@jspeed As far as I can tell, the bare metal provisioning didn't fail. You should be able to find the logs in the must-gather inside the job itself under Artifacts, but if the deployment of the bare metal nodes was successful I don't think we keep the console logs.
Without host-level logs there's not much more we can do here, then, as something went wrong during the boot process. It could be the hostname issue you linked, or possibly some other issue that prevented the networking stack from coming up properly.
Moving to VERIFIED. All recent jobs since Aug 10 have passed: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi-ovn-dualstack
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399