Bug 2054896

Summary: worker times out during inspection under 4.10
Product: OpenShift Container Platform
Reporter: tonyg
Component: Bare Metal Hardware Provisioning
Sub component: ironic
Assignee: Tomas Sedovic <tsedovic>
QA Contact: Amit Ugol <augol>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: bfournie, josearod, manrodri, nsilla, tkrishto, yliu1
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Last Closed: 2022-03-01 17:09:50 UTC
Type: Bug
Attachments: must-gather from cluster

Description tonyg 2022-02-15 22:34:40 UTC
Created attachment 1861356 [details]
must-gather from cluster

Description of problem:

When attempting to install 4.10.0-rc.2, we have observed on several occasions a failure during inspection of a worker node.

Version-Release number of selected component (if applicable):

4.10.0-rc.2


How reproducible:

Most of the time with 4.10.0-rc.2.


Steps to Reproduce:
1. Create cluster with OCP 4.10.0-rc.2
2. Sometimes a worker node will fail during inspection


Actual results:

A worker is missing (worker-0):

$ oc get nodes
NAME       STATUS   ROLES    AGE    VERSION
master-0   Ready    master   138m   v1.23.3+2e8bad7
master-1   Ready    master   138m   v1.23.3+2e8bad7
master-2   Ready    master   139m   v1.23.3+2e8bad7
worker-1   Ready    worker   108m   v1.23.3+2e8bad7
worker-2   Ready    worker   109m   v1.23.3+2e8bad7
worker-3   Ready    worker   110m   v1.23.3+2e8bad7

$ oc get bmh -A
NAMESPACE               NAME                           STATE                    CONSUMER                        ONLINE   ERROR   AGE
openshift-machine-api   master-0.cluster6.dfwt5g.lab   externally provisioned   cluster6-jfwmb-master-0         true             60m
openshift-machine-api   master-1.cluster6.dfwt5g.lab   externally provisioned   cluster6-jfwmb-master-1         true             60m
openshift-machine-api   master-2.cluster6.dfwt5g.lab   externally provisioned   cluster6-jfwmb-master-2         true             60m
openshift-machine-api   worker-0.cluster6.dfwt5g.lab   inspecting                                               true             60m
openshift-machine-api   worker-1.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-kkld6   true             60m
openshift-machine-api   worker-2.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-4wq6h   true             60m
openshift-machine-api   worker-3.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-8vrzn   true             60m
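
To see why worker-0 is stuck, the BareMetalHost object itself can be inspected; a sketch using standard BareMetalHost fields:

```
# Dump the stuck host's full object, including status conditions
$ oc get bmh -n openshift-machine-api worker-0.cluster6.dfwt5g.lab -o yaml

# Or just the provisioning state and any error message
$ oc get bmh -n openshift-machine-api worker-0.cluster6.dfwt5g.lab \
    -o jsonpath='{.status.provisioning.state}{"\t"}{.status.errorMessage}{"\n"}'
```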

The metal3-ironic-inspector logs show that worker-0 and the other workers were being inspected at about the same time, but only worker-0 timed out.

### worker-0
2022-02-15 19:17:05.330 1 DEBUG ironic_inspector.node_cache [-] [node: 41d9591a-b058-4911-929e-b6111a6b87f5] Node missing in the cache; adding it now start_introspection /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:685
-- snip --
2022-02-15 20:17:08.030 1 ERROR ironic_inspector.node_cache [-] Introspection for nodes ['41d9591a-b058-4911-929e-b6111a6b87f5'] has timed out

### worker-1
2022-02-15 19:17:13.198 1 DEBUG ironic_inspector.node_cache [-] [node: 2fd0f590-af3b-4bab-a520-8b2aeb4ba832] Node missing in the cache; adding it now start_introspection /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:685
-- snip --
2022-02-15 19:24:28.193 1 DEBUG ironic_inspector.node_cache [-] The following nodes match the attributes: bmc_address=['192.168.10.62', 'fe80::9640:c9ff:fe37:8f4a'], mac=['08:f1:ea:85:d7:b1', 'f4:03:43:cd:10:88', 'ea:de:f7:ad:61:a6', '40:a6:b7:03:10:9c', 'b8:83:03:91:c5:e9', '08:f1:ea:85:d7:b2', '3c:fd:fe:bb:1f:44', '08:f1:ea:85:d7:b0', '40:a6:b7:03:10:9d', 'b8:83:03:91:c5:e8', 'f4:03:43:cd:10:80', '08:f1:ea:85:d7:b3', '3c:fd:fe:bb:1f:45'], scoring: 2fd0f590-af3b-4bab-a520-8b2aeb4ba832: 2 find_node /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:838  
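
For reference, these lines came from the inspector container in the metal3 pod; roughly how we pulled them (the pod label and container name here are assumptions about this release's metal3 deployment):

```
# Locate the metal3 pod (label is an assumption for this deployment)
$ oc get pods -n openshift-machine-api -l k8s-app=metal3 -o name

# Tail the inspector container and filter for the stuck node's UUID
# (pod name metal3-abcde is a placeholder from the previous command)
$ oc logs -n openshift-machine-api pod/metal3-abcde -c metal3-ironic-inspector \
    | grep 41d9591a-b058-4911-929e-b6111a6b87f5
```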


Expected results:

All workers join the cluster.


After the failure we removed the BMH (worker-0) and added it back; this time inspection completed without issue and provisioning finished as expected (a sketch of the commands follows the listing below):

$ oc get bmh -A
NAMESPACE               NAME                           STATE                    CONSUMER                        ONLINE   ERROR   AGE
openshift-machine-api   master-0.cluster6.dfwt5g.lab   externally provisioned   cluster6-jfwmb-master-0         true             3h24m
openshift-machine-api   master-1.cluster6.dfwt5g.lab   externally provisioned   cluster6-jfwmb-master-1         true             3h24m
openshift-machine-api   master-2.cluster6.dfwt5g.lab   externally provisioned   cluster6-jfwmb-master-2         true             3h24m
openshift-machine-api   worker-0.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-z2vcj   true             17m
openshift-machine-api   worker-1.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-kkld6   true             3h24m
openshift-machine-api   worker-2.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-4wq6h   true             3h23m
openshift-machine-api   worker-3.cluster6.dfwt5g.lab   provisioned              cluster6-jfwmb-worker-0-8vrzn   true             3h23m
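
For reference, the remove/re-add cycle was roughly the following (a sketch; worker-0-bmh.yaml stands in for the original host manifest and its BMC credentials secret):

```
# Delete the stuck BareMetalHost so ironic deregisters it
$ oc delete bmh -n openshift-machine-api worker-0.cluster6.dfwt5g.lab

# Re-apply the original manifest (placeholder file name) to restart inspection
$ oc apply -f worker-0-bmh.yaml

# Watch the host move through inspecting -> available -> provisioned
$ oc get bmh -n openshift-machine-api -w
```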


Additional info:

- Air-gapped cluster (disconnected)
- Including must-gather, but additional logs (including must-gather) are available here: https://www.distributed-ci.io/jobs/2ed83a7e-5ebf-4e70-b626-8f35e2212bde/files

We've observed similar behavior on other attempts and collected logs for those as well:
- https://www.distributed-ci.io/jobs/e35a2262-84ea-405c-a318-547eacaf93b5/files
- https://www.distributed-ci.io/jobs/d32d59a5-bc61-4712-b741-426c345fb1c9/files

Comment 1 Bob Fournier 2022-02-16 12:11:53 UTC
Are you able to attach to the console when the error occurs? Is it possible to get a screenshot so we can see what state the host is in? Thanks.

Comment 3 Nacho Silla 2022-02-17 08:42:51 UTC
Hi,

If it helps, we found a mismatch between the MAC address configured for the provisioning interface in the inventory and the actual value.

This resulted in some error messages in the console during the bootstrap phase:


```
Attempting to boot from MAC f4-03-43-d0-72-c0
pxelinux.cfg/f4-03-43-d0-72-c0... No such file or directory
(http://ipxe.org/2d0c618e)
pxelinux.cfg/f4-03-43-d0-72-c0... No such file or directory
(http://ipxe.org/2d0c618e)
```

We're currently verifying whether setting the right address has any effect on the result. In any case, this was not a problem with versions prior to 4.10, or even with older 4.10 RCs.
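
For reference, roughly how the configured value can be checked and corrected on a live cluster (bootMACAddress is the standard BareMetalHost spec field; the patched value below is the MAC from the console output above):

```
# Show the MAC currently configured for the provisioning interface
$ oc get bmh -n openshift-machine-api worker-0.cluster6.dfwt5g.lab \
    -o jsonpath='{.spec.bootMACAddress}{"\n"}'

# Patch in the correct MAC if it does not match the hardware
$ oc patch bmh -n openshift-machine-api worker-0.cluster6.dfwt5g.lab \
    --type merge -p '{"spec":{"bootMACAddress":"f4:03:43:d0:72:c0"}}'
```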

Comment 4 Bob Fournier 2022-02-17 22:34:19 UTC
Is f4-03-43-d0-72-c0 the correct MAC? Is it the same physical worker that fails every time? From the screenshot it looks like the host booted off the disk, which would happen if the IPA image doesn't boot, for example in the case of an invalid MAC.

Please let us know the results with this change; it's curious that this only started happening with 4.10, even with the same config.

Comment 5 Dmitry Tantsur 2022-02-22 11:23:39 UTC
Needs information per comment 4

Comment 6 Manuel Rodriguez 2022-02-22 14:08:36 UTC
Yes, f4-03-43-d0-72-c0 is the correct MAC. And so far it has only been this node; previous deployments with versions < 4.10 worked fine even with the wrong MAC.

Comment 7 tonyg 2022-03-01 16:41:37 UTC
We have not seen this issue anymore since we set the correct MAC. I guess prior versions did not take the MAC into account the way 4.10+ does.

Thanks for checking, I think this can be closed now.

Comment 8 Bob Fournier 2022-03-01 17:09:50 UTC
Thanks Tony. It's still unexplained why this worked on < 4.10, as we would have expected a config issue to cause a problem there too. We'll close this out. Feel free to open a lower-priority bug for the 4.9 issue.