Bug 1801238 - [Baremetal on IPI]: Master machines are not assigned to nodes
Summary: [Baremetal on IPI]: Master machines are not assigned to nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ian Main
QA Contact: Constantin Vultur
URL:
Whiteboard:
Duplicates: 1775494 1809664 1810430 1822054
Depends On:
Blocks: 1771572 1801970 1809664 1813800 1813801 1824241 1825318 1826505 1840133
 
Reported: 2020-02-10 13:45 UTC by Constantin Vultur
Modified: 2020-07-28 14:28 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1801970 1809664 1840133
Environment:
Last Closed: 2020-07-13 17:14:44 UTC
Target Upstream Version:
Embargoed:


Attachments
BMH showing no data (130.26 KB, image/png), 2020-02-10 13:50 UTC, Constantin Vultur
add-machine-ips.sh (541 bytes, application/x-shellscript), 2020-02-27 19:01 UTC, Marius Cornea
link-machine-and-node.sh (3.32 KB, application/x-shellscript), 2020-02-27 19:02 UTC, Marius Cornea
utils.sh (10.98 KB, application/x-shellscript), 2020-02-27 19:02 UTC, Marius Cornea


Links
Github openshift installer pull 3591 (closed): Baremetal: Bug 1801238: Pull data from ironic inspector and annotate BareMetalHost (last updated 2021-01-29 07:05:03 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:15:25 UTC)

Description Constantin Vultur 2020-02-10 13:45:00 UTC
Description of problem:
After the cluster is deployed, the master machines are not assigned to nodes. The master Nodes don't have the machine.openshift.io/machine annotation, and the Machines do not have status.addresses and status.nodeRef populated. Explicitly executing ./12_csr_hack.sh fixes the problem.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-02-03-115336-ipv6.1


How reproducible:
Steps to reproduce the behavior; please include all details about the dev-scripts version (git commit SHA), any local variable overrides or other customizations, and whether you're deploying on VMs or Baremetal.

Recent dev-scripts master (34d334f), default config, only NUM_WORKERS and PULL_SECRET set.

Deploy the cluster using the default 'make' command. Check one of the master Machine resources: status.addresses and status.nodeRef are not populated.
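
For reference, the missing links can be checked with something like the following (the machine and node names in angle brackets are placeholders for one of the masters; jq is used only for readable output):

# Machine side: status.addresses and status.nodeRef come back null/empty
oc get machine -n openshift-machine-api <cluster-name>-master-0 -o json \
  | jq '{addresses: .status.addresses, nodeRef: .status.nodeRef}'
# Node side: the machine.openshift.io/machine annotation is missing
oc get node <master-node-name> -o json \
  | jq '.metadata.annotations["machine.openshift.io/machine"]'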


Actual results:
The masters and nodes are not properly displaying information.
The Nodes are not displaying the number of pods.
The BMH does not display any information or graphs.


Expected results:
The master machines and nodes should reference each other correctly.

Additional info:

Comment 1 Constantin Vultur 2020-02-10 13:50:11 UTC
Created attachment 1662150 [details]
BMH showing no data

Comment 2 Jiri Tomasek 2020-02-10 14:19:25 UTC
Related Github issue: https://github.com/openshift-metal3/dev-scripts/issues/917

Comment 3 Marius Cornea 2020-02-10 14:31:43 UTC
The same issue is present on IPv6 environments deployed with the manual IPI installation process (no dev-scripts involved).

Comment 4 Stephen Benjamin 2020-02-24 14:40:09 UTC
Moving this to 4.5. To get this change in 4.4 at this point, you'll need to fix it in 4.5, and clone this bug to 4.4.

Comment 5 Wei Sun 2020-02-25 05:10:03 UTC
Hi, per comment #4, if this needs to be fixed in 4.4, please clone a bug for 4.4.

Comment 6 Russell Bryant 2020-02-27 17:37:24 UTC
This is expected behavior right now with bare metal IPI.  It can be worked around with an external script that writes Addresses to the master Machines.  dev-scripts used to have a script to do this.

AFAIK, the functional impact of this issue is that it breaks some of the bare metal host management in the UI.
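
For illustration only (this is not the dev-scripts script, just a rough sketch of the idea; the machine name and IP are placeholders): an address can be written into a master Machine's status through the status subresource, e.g.

TOKEN=$(oc whoami -t)
API=$(oc whoami --show-server)
MACHINE=<cluster-name>-master-0        # placeholder
IP=<master internal IP>                # placeholder
curl -k -X PATCH \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/merge-patch+json" \
  "$API/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/$MACHINE/status" \
  -d "{\"status\":{\"addresses\":[{\"type\":\"InternalIP\",\"address\":\"$IP\"}]}}"

The status subresource is patched directly because status set on a normal object update is ignored once the subresource is enabled.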

Comment 7 Marius Cornea 2020-02-27 19:01:45 UTC
Created attachment 1666271 [details]
add-machine-ips.sh

Comment 8 Marius Cornea 2020-02-27 19:02:10 UTC
Created attachment 1666272 [details]
link-machine-and-node.sh

Comment 9 Marius Cornea 2020-02-27 19:02:29 UTC
Created attachment 1666273 [details]
utils.sh

Comment 10 Marius Cornea 2020-02-27 19:05:03 UTC
Workaround:

Download the add-machine-ips.sh, link-machine-and-node.sh, and utils.sh scripts into the same directory, then run:

export KUBECONFIG=clusterconfigs/auth/kubeconfig 
export CLUSTER_NAME=ocp-edge-cluster
bash add-machine-ips.sh

Comment 11 Marius Cornea 2020-02-27 19:09:15 UTC
Also jq needs to be installed on the machine where the workaround steps are run.

Comment 12 Marius Cornea 2020-02-27 20:47:45 UTC
Also, link-machine-and-node.sh needs to be set as executable:

chmod +x link-machine-and-node.sh
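
To recap the full workaround from comments 10-12 in one sequence (cluster name and kubeconfig path as in comment 10; install jq with whatever package manager the host uses):

sudo dnf install -y jq                       # jq is required (comment 11)
chmod +x link-machine-and-node.sh            # comment 12
export KUBECONFIG=clusterconfigs/auth/kubeconfig
export CLUSTER_NAME=ocp-edge-cluster
bash add-machine-ips.sh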

Comment 13 Stephen Benjamin 2020-03-18 16:59:18 UTC
*** Bug 1809664 has been marked as a duplicate of this bug. ***

Comment 14 Beth White 2020-04-20 15:45:32 UTC
Hi Ian, can you look into this bug and confirm if it is fixed by your work on collecting introspection data or if there is a duplicate bug? Thanks, Beth

Comment 15 Jiri Tomasek 2020-04-22 07:54:52 UTC
This bug severely affects the UX of the OpenShift console's bare metal host management operations on masters, such as power off and maintenance. The Bare Metal Host includes node status when determining its state, and since the relationship is not represented on master hosts, the status calculation ends up not reflecting reality. This introduces several UI bugs (adding these as blocked by this bug).

Comment 16 Ian Main 2020-05-11 15:56:29 UTC
Yes, working on getting data into the masters from the installers.  This isn't IPV6 specific though I believe - correct me if I'm wrong.

Comment 17 Ian Main 2020-05-11 15:57:27 UTC
(In reply to Ian Main from comment #16)
> Yes, working on getting data into the masters from the installers.  This
> isn't IPV6 specific though I believe - correct me if I'm wrong.

To be clear, I'm looking at doing introspection during terraform operations and loading that data into the BMH CRs.

Comment 18 Stephen Benjamin 2020-05-12 17:14:04 UTC
> Yes, working on getting data into the masters from the installers.  This isn't IPV6 specific though I believe - correct me if I'm wrong.

Yup, that's correct, I'll fix the title. It applies to all masters regardless of IPv4 or IPv6. 

The node/machine/baremetalhost association is made based on node.InternalIP information, and we gather that through hardware inspection. We don't run inspection on masters, but Ian is working on running it once at install time, and then bringing that information over to the BareMetalHost object via annotations. Once that's populated, all the relevant associations will be made.
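
Once that data is in place, the association can be verified with something along these lines (jq required; the annotation key is the one from the bug description):

# Machines should now show a nodeRef and InternalIP addresses
oc get machines -n openshift-machine-api -o json \
  | jq '.items[] | {machine: .metadata.name, node: .status.nodeRef.name, addresses: .status.addresses}'
# Master Nodes should carry the machine.openshift.io/machine annotation
oc get nodes -o json \
  | jq '.items[] | {node: .metadata.name, machine: .metadata.annotations["machine.openshift.io/machine"]}'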

Comment 19 Stephen Benjamin 2020-05-12 17:14:39 UTC
*** Bug 1810430 has been marked as a duplicate of this bug. ***

Comment 23 Beth White 2020-05-19 16:45:53 UTC
*** Bug 1822054 has been marked as a duplicate of this bug. ***

Comment 24 Constantin Vultur 2020-05-25 13:18:54 UTC
I've just deployed a 4.4.4 cluster and the issue is still present. Given that the bootstrap VM was deleted, I was not able to check what happened.
Also, the discussion on the GitHub PR dates from the same day as the 4.4.4 release, so I wonder whether this fix got into the 4.4.4 build.

Looking for a 4.4.5 build after May 16th to recheck this.

Also, on a 4.5.0 build from May 22nd this seems to be fixed, but the masters still appear as "Host is powered off".

Comment 25 Steven Hardy 2020-05-26 09:10:09 UTC
(In reply to Constantin Vultur from comment #24)
> I've just deployed a 4.4.4 cluster and the issue is still present. Given the
> fact the bootstrap VM was deleted I was not able to check what happened. 
> Also the discussion from Github PR dates from the same date as the 4.4.4, so
> I wonder if this fix got into the 4.4.4 build.
> 
> Looking for a 4.4.5 build after May 16th, to recheck this. 
> 
> Also on a 4.5.0 from May 22nd, this seems that is fixed. Still the Masters
> appear as "Host is powered off".

The target release for this fix is 4.5.0 so what you're seeing is expected I think.

Stephen/Ian can confirm but I suspect this won't be backported to 4.4 unless there's a very strong justification, due to the complexity of the fix.

Comment 26 Constantin Vultur 2020-05-26 11:09:43 UTC
Deployed a cluster with 4.5.0-0.nightly-2020-05-26-063751 and now the information is being shown.

Still, the data is not accurate, since one status is Ready and the other status is "Host is powered off".
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1840090 for this issue, since the wrong status might not be related to the fix implemented for this BZ.

Comment 27 Constantin Vultur 2020-05-26 11:31:23 UTC
Another side-effect of the fix: https://bugzilla.redhat.com/show_bug.cgi?id=1840105

Comment 28 Stephen Benjamin 2020-05-26 11:37:05 UTC
> Stephen/Ian can confirm but I suspect this won't be backported to 4.4 unless there's a very strong justification, due to the complexity of the fix.


We ended up doing this in a simple way that could be backported; the 4.4 BZ is BZ1840106, and I've got a cherry-pick PR open.



> Still the data is not accurate, since the status1 is Ready and status2 : Host is powered off. 
> I filled https://bugzilla.redhat.com/show_bug.cgi?id=1840090 for this issue, since the wrong status might not be related to the fix implemented for this BZ.
> Another side-effect of the fix: https://bugzilla.redhat.com/show_bug.cgi?id=1840105

Both these bugs sound nearly identical to me, and I do not think they are side effects or have anything to do with this one. We'll take a look though.

Comment 29 Stephen Benjamin 2020-06-04 11:41:06 UTC
*** Bug 1775494 has been marked as a duplicate of this bug. ***

Comment 31 errata-xmlrpc 2020-07-13 17:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

