Bug 1966947

Summary: [4.9.0] v1beta1.Machine is not registered in scheme, causing bmh_agent_controller reconcileSpokeBMH to fail
Product: OpenShift Container Platform
Reporter: Flavio Percoco <fpercoco>
Component: assisted-installer
Sub component: assisted-service
Assignee: Flavio Percoco <fpercoco>
QA Contact: Trey West <trwest>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: low
CC: aos-bugs, asegurap, bjacot, rwsu, trwest, yobshans
Version: 4.9
Keywords: Triaged
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Platform
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1965007
Environment:
Last Closed: 2021-10-26 17:22:37 UTC
Type: ---
Bug Depends On: 1965007    
Bug Blocks:    

Description Flavio Percoco 2021-06-02 08:21:06 UTC
+++ This bug was initially created as a clone of Bug #1965007 +++

Description of problem:

When a remote worker is added for the spoke cluster, the bmh_agent_controller reconcile fails with:

time="2021-05-26T14:07:47Z" level=error msg="failed to create or update spoke Machine" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileSpokeBMH" file="/go/src/github.com/openshift/origin/internal/controller/controllers/bmh_agent_controller.go:522" error="no kind is registered for the type v1beta1.Machine in scheme \"pkg/runtime/scheme.go:100\""


Version-Release number of selected component (if applicable):


How reproducible:

always

Steps to Reproduce:

1. Install a dev-scripts cluster (3 masters + 1 worker) with 4 extra nodes for the spoke cluster.
2. make assisted_deployment
3. Create the pull secret, cluster-ssh-key, ClusterImageSet, InfraEnv, and ClusterDeployment resources.
4. Apply dev-scripts/ocp/ostest/extra_host_manifests.yaml for the first 3 BMHs. Add the following to each BMH definition:

metadata:
  annotations:
    # BMAC will add this annotation if not present
    inspect.metal3.io: disabled
  labels:
    infraenvs.agent-install.openshift.io: "bmac-test"
spec:
  automatedCleaningMode: disabled

5. After agents are discovered, approve them:
kubectl -n assisted-installer patch agents.agent-install.openshift.io 132fb56c-3d7b-4c00-8944-26d8fc6ac8ca -p '{"spec":{"approved":true}}' --type merge

6. Wait for the spoke cluster deployment to finish installing.

7. Create the worker node using the 4th BMH definition in extra_host_manifests.yaml.

Actual results:

bmh_agent_controller reconcileSpokeBMH fails.

Expected results:

bmh_agent_controller reconcileSpokeBMH succeeds.

Additional info:

--- Additional comment from Flavio Percoco on 2021-06-01 06:20:58 UTC ---


> time="2021-05-26T14:07:47Z" level=error msg="failed to create or update spoke Machine" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileSpokeBMH" file="/go/src/github.com/openshift/origin/internal/controller/controllers/bmh_agent_controller.go:522" error="no kind is registered for the type v1beta1.Machine in scheme \"pkg/runtime/scheme.go:100\""


Checked Richard's environment and the CRDs exist in the spoke cluster:

machinesets.machine.openshift.io
machines.machine.openshift.io

This suggests the issue is likely a simple Manager/Client instantiation problem, since the Machine types are currently not added to the runtime Scheme: https://github.com/openshift/assisted-service/blob/846f2dc89d10b74ed95cb99a6a8888902fb11497/cmd/main.go#L693-L710
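
A minimal sketch of the kind of fix this points to, assuming the controller-runtime scheme is assembled in cmd/main.go as the link above shows; the exact import path for the Machine types (machine-api-operator vs. openshift/api) is an assumption here:

package main

import (
	machinev1beta1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
)

var scheme = runtime.NewScheme()

func init() {
	// Core Kubernetes types.
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	// Machine/MachineSet types. Without this registration, creating a
	// v1beta1.Machine on the spoke cluster fails with
	// "no kind is registered for the type v1beta1.Machine in scheme".
	utilruntime.Must(machinev1beta1.AddToScheme(scheme))
}

func main() {
	// The scheme would then be passed to the Manager/Client options, e.g.
	// ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme}).
}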

Comment 3 Trey West 2021-06-18 15:05:00 UTC
@fpercoco 

I am trying to verify this with ACM 2.3. I currently don't see any logs from assisted-service regarding reconcileSpokeBMH. These are the only logs I see:

time="2021-06-18T14:54:47Z" level=error msg="failed to register host <276f3a07-fdca-46af-bed4-8b7f58426c20> to cluster d4afb783-c079-4d3c-bce0-321d0aa6d203 due to: Cannot add host to a cluster that is already installed, please use the day2 cluster option" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterHost" file="/remote-source/app/internal/bminventory/inventory.go:2494" cluster_id=d4afb783-c079-4d3c-bce0-321d0aa6d203 error="Cannot add host to a cluster that is already installed, please use the day2 cluster option" go-id=460011 pkg=Inventory request_id=9c271813-2bca-4c96-90ed-1c9bf0b05bc1
time="2021-06-18T14:54:47Z" level=error msg="RegisterHost failed" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterHost.func1" file="/remote-source/app/internal/bminventory/inventory.go:2469" cluster_id=d4afb783-c079-4d3c-bce0-321d0aa6d203 go-id=460011 pkg=Inventory request_id=9c271813-2bca-4c96-90ed-1c9bf0b05bc1

Comment 4 Flavio Percoco 2021-06-21 05:41:55 UTC
(In reply to Trey West from comment #3)
> @fpercoco 
> 
> I am trying to verify this with ACM 2.3. I currently don't see any logs from
> assisted-service regarding reconcileSpokeBMH. These are the only logs I see:
> 
> time="2021-06-18T14:54:47Z" level=error msg="failed to register host
> <276f3a07-fdca-46af-bed4-8b7f58426c20> to cluster
> d4afb783-c079-4d3c-bce0-321d0aa6d203 due to: Cannot add host to a cluster
> that is already installed, please use the day2 cluster option"
> func="github.com/openshift/assisted-service/internal/bminventory.
> (*bareMetalInventory).RegisterHost"
> file="/remote-source/app/internal/bminventory/inventory.go:2494"
> cluster_id=d4afb783-c079-4d3c-bce0-321d0aa6d203 error="Cannot add host to a
> cluster that is already installed, please use the day2 cluster option"
> go-id=460011 pkg=Inventory request_id=9c271813-2bca-4c96-90ed-1c9bf0b05bc1
> time="2021-06-18T14:54:47Z" level=error msg="RegisterHost failed"
> func="github.com/openshift/assisted-service/internal/bminventory.
> (*bareMetalInventory).RegisterHost.func1"
> file="/remote-source/app/internal/bminventory/inventory.go:2469"
> cluster_id=d4afb783-c079-4d3c-bce0-321d0aa6d203 go-id=460011 pkg=Inventory
> request_id=9c271813-2bca-4c96-90ed-1c9bf0b05bc1

I don't think this needs further verification, TBH. This was more a programmatic issue than a deployment one. It can be better verified when deploying a day 2 cluster, which is not a priority right now.

Feel free to switch it to VERIFIED if the regular 4.8 flow works.

Comment 5 Trey West 2021-06-22 17:55:02 UTC
Hi @fpercoco, let's wait to move this to VERIFIED until the Day 2 flow works. I know there are other open bugs currently blocking it, for example: https://bugzilla.redhat.com/show_bug.cgi?id=1959869

Once that one and any others blocking the Day 2 installation are moved to ON_QA, we can verify them all at the same time.

Comment 6 Trey West 2021-06-28 13:41:44 UTC
@fpercoco, since this is Day 2, can we move the target release to 4.9?

Comment 8 bjacot 2021-08-02 13:58:04 UTC
Moving to 4.9 for the Day 2 flow.

Comment 13 Trey West 2021-10-20 13:00:12 UTC
VERIFIED on 2.4.0-DOWNSTREAM-2021-10-15-19-58-05

Comment 15 errata-xmlrpc 2021-10-26 17:22:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3935