Description of problem:

Adding a day 2 worker does not work because the BMH reconcile cannot complete: InfraEnv.Status.ISODownloadURL is not set. The ISODownloadURL is missing because after the spoke cluster completes its install the InfraEnv is reconciled, but the reconcile fails:

status:
  conditions:
  - lastTransitionTime: "2021-05-12T10:23:08Z"
    message: 'Failed to create image: internal error'
    reason: ImageCreationError
    status: "False"
    type: ImageCreated

The assisted-service log shows:

time="2021-05-12T10:34:04Z" level=info msg="Reconcile has been called for InfraEnv name=bmac-test namespace=openshift-machine-api" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).Reconcile" file="/go/src/github.com/openshift/origin/internal/controller/controllers/infraenv_controller.go:66"
time="2021-05-12T10:34:04Z" level=error msg="Failed to add OCP version for release image: " func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).AddOpenshiftVersion" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:4329" error="no releaseImage nor releaseImageMirror provided" go-id=1091 pkg=Inventory request_id=
time="2021-05-12T10:34:04Z" level=error msg="infraEnv reconcile failed" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).handleEnsureISOErrors" file="/go/src/github.com/openshift/origin/internal/controller/controllers/infraenv_controller.go:326" error="no releaseImage nor releaseImageMirror provided"

Examining the database shows that there are two rows with the same cluster name. The newest row appears to be the one picked up by the infraenv_controller, and it is missing the ocp_release_image.
installer=# select name, openshift_cluster_id, ocp_release_image, install_started_at, install_completed_at, created_at from clusters;
   name    |         openshift_cluster_id         |                   ocp_release_image                    |     install_started_at     |    install_completed_at    |         created_at
-----------+--------------------------------------+--------------------------------------------------------+----------------------------+----------------------------+----------------------------
 bmac-test | 6e9aa9e0-48f3-47ea-9905-bf70dda1e51b | quay.io/openshift-release-dev/ocp-release:4.7.0-x86_64 | 2021-05-12 09:39:08.725+00 | 2021-05-12 10:23:07.508+00 | 2021-05-12 09:04:45.267+00
 bmac-test |                                      |                                                        | 2000-01-01 00:00:00+00     | 2000-01-01 00:00:00+00     | 2021-05-12 10:23:08.534+00
(2 rows)

Version-Release number of selected component (if applicable):

How reproducible: always

Steps to Reproduce:
1. Install a dev-scripts cluster (3 masters + 1 worker) with 4 extra nodes for the spoke cluster.
2. make assisted_deployment
3. Create the pull secret, cluster-ssh-key, ClusterImageSet, InfraEnv, and ClusterDeployment.
4. Apply dev-scripts/ocp/ostest/extra_host_manifests.yaml for the first 3 BMHs. Add these to the BMH definitions:

   annotations:
     # BMAC will add this annotation if not present
     inspect.metal3.io: disabled
   labels:
     infraenvs.agent-install.openshift.io: "bmac-test"
   spec:
     automatedCleaningMode: disabled

5. After the agents are discovered, approve them:
   kubectl -n assisted-installer patch agents.agent-install.openshift.io 132fb56c-3d7b-4c00-8944-26d8fc6ac8ca -p '{"spec":{"approved":true}}' --type merge
6. Wait for the cluster deployment to complete its install.
7. View the InfraEnv status.

Actual results:
InfraEnv.Status.ISODownloadURL is missing

Expected results:
InfraEnv.Status.ISODownloadURL is set, which would allow a new worker BMH to be provisioned.

Additional info:
After the day1 cluster is installed, it is removed and a day2 cluster is created instead. The InfraEnv points to the same ClusterDeployment, so a reconcile is probably not triggered in this case. I expect that after a few hours this resolves itself, because k8s periodically reconciles the InfraEnv; it will detect the new cluster in the backend and create a new ISO. Because day2 is a low focus for 4.8, I set the priority to low. When creating a cluster we can probably trigger a reconcile of the InfraEnv; that would resolve two issues: 1. the day2 image; 2. it would no longer require creating the ClusterDeployment before the InfraEnv.
Update: day1 is handled by watching the ClusterDeployment, so nothing needs to be done for day1.
After a few hours, it did not resolve itself. The ocp_release_image is still not set, and I think that's the crux of the problem. Adding more columns to the SQL query now shows that the extra row is actually the row for the day2 cluster, and the original cluster has deleted_at set:

installer=# select name, openshift_cluster_id, id, ocp_release_image, install_started_at, install_completed_at, created_at, deleted_at, status, status_info from clusters;
   name    |         openshift_cluster_id         |                  id                  |                   ocp_release_image                    |     install_started_at     |    install_completed_at    |         created_at         |          deleted_at           |    status    |                   status_info
-----------+--------------------------------------+--------------------------------------+--------------------------------------------------------+----------------------------+----------------------------+----------------------------+-------------------------------+--------------+-------------------------------------------------
 bmac-test | 24614200-9ec2-49b7-a05f-67dcd4356fe2 | cbefeadd-c816-47d1-a002-9ec95dbd6d2e | quay.io/openshift-release-dev/ocp-release:4.7.0-x86_64 | 2021-05-18 12:08:25.991+00 | 2021-05-18 12:39:46.659+00 | 2021-05-18 11:22:16.584+00 | 2021-05-18 12:39:46.936627+00 | installed    | Cluster is installed
 bmac-test |                                      | afca3bbc-4c33-4631-81c0-6576458ca0ed |                                                        | 2000-01-01 00:00:00+00     | 2000-01-01 00:00:00+00     | 2021-05-18 12:39:46.984+00 |                               | adding-hosts | cluster is adding hosts to existing OCP cluster
(2 rows)
This may be expected; you can trigger a reconcile by changing something in the InfraEnv, for example adding a label pointing at a non-existent NMState config. The thing is that we need to make sure the InfraEnv is reconciled after the day2 cluster is created. At that point nothing in the ClusterDeployment will change, so the notification should come from the ClusterDeployment controller. You can use https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L445 to get the relevant InfraEnv and then call r.CRDEventsHandler.NotifyInfraEnvUpdates(infraEnv.Name, infraEnv.Namespace) to notify it and trigger the reconcile.
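The notification mechanism suggested above can be sketched roughly as follows. This is a minimal stand-in, not the real assisted-service code: the actual CRDEventsHandler wraps a controller-runtime event channel watched by the InfraEnv controller, and the type names here are simplified for illustration.

```go
package main

import "fmt"

// NamespacedName is a minimal stand-in for k8s types.NamespacedName.
type NamespacedName struct {
	Namespace, Name string
}

// CRDEventsHandler sketches the handler the ClusterDeployment controller
// would use; the real one pushes generic events onto channels that the
// per-CRD controllers watch.
type CRDEventsHandler struct {
	infraEnvUpdates chan NamespacedName
}

func NewCRDEventsHandler() *CRDEventsHandler {
	return &CRDEventsHandler{infraEnvUpdates: make(chan NamespacedName, 16)}
}

// NotifyInfraEnvUpdates enqueues the InfraEnv so its Reconcile runs again,
// e.g. right after the day2 cluster is registered in the backend.
func (h *CRDEventsHandler) NotifyInfraEnvUpdates(name, namespace string) {
	h.infraEnvUpdates <- NamespacedName{Namespace: namespace, Name: name}
}

func main() {
	h := NewCRDEventsHandler()
	// After creating the day2 cluster, the ClusterDeployment controller
	// would look up the InfraEnv(s) referencing it and notify each one.
	h.NotifyInfraEnvUpdates("bmac-test", "openshift-machine-api")
	ev := <-h.infraEnvUpdates
	fmt.Printf("reconcile requested for %s/%s\n", ev.Namespace, ev.Name)
}
```

The point of routing the notification through the ClusterDeployment controller is that nothing on the InfraEnv itself changes when the day2 cluster is created, so the InfraEnv controller would otherwise only pick the change up on its periodic resync.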
The InfraEnv is being reconciled after the day2 cluster is created, but the reconcile fails:

time="2021-05-24T10:58:37Z" level=error msg="Failed to add OCP version for release image: " func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).AddOpenshiftVersion" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:4395" error="no releaseImage nor releaseImageMirror provided" go-id=649 pkg=Inventory request_id=
time="2021-05-24T10:58:37Z" level=error msg="infraEnv reconcile failed" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).handleEnsureISOErrors" file="/go/src/github.com/openshift/origin/internal/controller/controllers/infraenv_controller.go:329" error="no releaseImage nor releaseImageMirror provided"

The issue is that clusters.ocp_release_image is not being set when the day2 cluster is created. The cluster is created here: https://github.com/openshift/assisted-service/blob/master/internal/bminventory/inventory.go#L604

When the cluster is created, OcpReleaseImage isn't being set:

newCluster := common.Cluster{Cluster: models.Cluster{
    ID:               id,
    Href:             swag.String(url.String()),
    Kind:             swag.String(models.ClusterKindAddHostsCluster),
    Name:             clusterName,
    OpenshiftVersion: *openshiftVersion.ReleaseVersion,
    UserName:         ocm.UserNameFromContext(ctx),
    OrgID:            ocm.OrgIDFromContext(ctx),
    EmailDomain:      ocm.EmailDomainFromContext(ctx),
    UpdatedAt:        strfmt.DateTime{},
    APIVipDNSName:    swag.String(apivipDnsname),
    HostNetworks:     []*models.HostNetwork{},
    Hosts:            []*models.Host{},
},

Setting OcpReleaseImage fixed the issue and allowed my InfraEnv to reconcile with isoDownloadURL filled in.
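To make the failure mode concrete, here is a trimmed, self-contained sketch of the registration path. The struct fields follow the snippet above, but everything else (the function names, the plain string fields, the ISO check) is simplified for illustration and is not the real assisted-service API.

```go
package main

import (
	"errors"
	"fmt"
)

// OpenshiftVersion is a stand-in for the version record looked up from
// params.NewAddHostsClusterParams.OpenshiftVersion.
type OpenshiftVersion struct {
	ReleaseVersion string
	ReleaseImage   string
}

// Cluster is a trimmed stand-in for models.Cluster with only the fields
// relevant to this bug.
type Cluster struct {
	Name             string
	Kind             string
	OpenshiftVersion string
	OcpReleaseImage  string
}

// registerAddHostsCluster mimics RegisterAddHostsClusterInternal. Before
// the fix, the OcpReleaseImage assignment below was missing, so the day2
// row was stored with an empty ocp_release_image column.
func registerAddHostsCluster(name string, v OpenshiftVersion) Cluster {
	return Cluster{
		Name:             name,
		Kind:             "AddHostsCluster",
		OpenshiftVersion: v.ReleaseVersion,
		OcpReleaseImage:  v.ReleaseImage, // the previously missing assignment
	}
}

// generateISO mimics the check that fails during InfraEnv reconcile when
// the cluster row has no release image.
func generateISO(c Cluster) error {
	if c.OcpReleaseImage == "" {
		return errors.New("no releaseImage nor releaseImageMirror provided")
	}
	return nil
}

func main() {
	v := OpenshiftVersion{
		ReleaseVersion: "4.7.0",
		ReleaseImage:   "quay.io/openshift-release-dev/ocp-release:4.7.0-x86_64",
	}
	c := registerAddHostsCluster("bmac-test", v)
	fmt.Println(c.OcpReleaseImage, generateISO(c))
}
```

With the assignment removed, generateISO returns exactly the "no releaseImage nor releaseImageMirror provided" error seen in the logs; with it in place, ISO generation can proceed.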
@@ -602,6 +604,7 @@ func (b *bareMetalInventory) RegisterAddHostsClusterInternal(ctx context.Context
 		Kind:             swag.String(models.ClusterKindAddHostsCluster),
 		Name:             clusterName,
 		OpenshiftVersion: *openshiftVersion.ReleaseVersion,
+		OcpReleaseImage:  *openshiftVersion.ReleaseImage,
 		UserName:         ocm.UserNameFromContext(ctx),
 		OrgID:            ocm.OrgIDFromContext(ctx),
 		EmailDomain:      ocm.EmailDomainFromContext(ctx),

@mfilanov, is adding OcpReleaseImage the correct fix?
When creating a day2 cluster, one of the parameters we set is `params.NewAddHostsClusterParams.OpenshiftVersion`, from which the openshift version is extracted: https://github.com/openshift/assisted-service/blob/master/internal/bminventory/inventory.go#L589
I think that instead of adding another parameter you could use it when creating the cluster and add OcpReleaseImage: swag.StringValue(openshiftVersion.ReleaseImage), to https://github.com/openshift/assisted-service/blob/master/internal/bminventory/inventory.go#L599-L615
derez, correct me if I'm wrong.
Then, when the InfraEnv uses this parameter from the cluster, it will be correct: https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/infraenv_controller.go#L281
(In reply to Michael Filanov from comment #6)
> When creating a day2 cluster one of the parameters that we set is
> `params.NewAddHostsClusterParams.OpenshiftVersion` from it openshift version
> is extracted
> https://github.com/openshift/assisted-service/blob/master/internal/bminventory/inventory.go#L589
> I think that instead of adding another parameter you could use it when
> creating the cluster and add OcpReleaseImage:
> swag.StringValue(openshiftVersion.ReleaseImage), to
> https://github.com/openshift/assisted-service/blob/master/internal/bminventory/inventory.go#L599-L615
>
> derez correct me if i'm wrong.

Exactly. As mentioned, adding 'OcpReleaseImage: *openshiftVersion.ReleaseImage,' to RegisterAddHostsClusterInternal [*] should solve the issue. As OcpReleaseImage is stored in the db when creating the cluster, it could indeed be used before generating the iso.

[*] https://github.com/openshift/assisted-service/blob/0404f3865b9a30e1cfabaa0520ce05198e1168b1/internal/bminventory/inventory.go#L604
By the way, this means our tests are not covering this flow. I suggest adding another step to the day2 subsystem tests that validates the InfraEnv is synced after the day2 cluster is created, and that its ISO URL is different from the URL in day1.
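The suggested subsystem check could look roughly like this. This is a hypothetical sketch, not the actual assisted-service subsystem test: the function name and URLs are illustrative, and the real test would fetch the two ISO URLs from the InfraEnv status before and after day2 cluster creation.

```go
package main

import "fmt"

// validateDay2ISOURL encodes the proposed assertion: after the day2 cluster
// is registered, the InfraEnv must expose a regenerated ISO download URL,
// distinct from the one generated for the day1 cluster.
func validateDay2ISOURL(day1URL, day2URL string) error {
	if day2URL == "" {
		return fmt.Errorf("day2 ISODownloadURL is not set")
	}
	if day2URL == day1URL {
		return fmt.Errorf("day2 ISODownloadURL was not regenerated")
	}
	return nil
}

func main() {
	// Illustrative URLs; a real test would read them from InfraEnv status.
	day1 := "https://assisted.example.com/images/day1.iso"
	day2 := "https://assisted.example.com/images/day2.iso"
	if err := validateDay2ISOURL(day1, day2); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("day2 InfraEnv ISO URL regenerated")
}
```

Before the fix above, this check would have failed with "day2 ISODownloadURL is not set", which is exactly the gap the existing day2 subsystem tests missed.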
PR: https://github.com/openshift/assisted-service/pull/2095
AI day2 bugs are having their target release switched to 4.9. We need to double-check that IPI day2 is an alternative; otherwise we have no day2 for 4.8 with either SaaS or the AI operator.
I was able to add a day2 BMH worker node. The node moved from the ready to the provisioned state, and assisted-service continued adding the node as a day2 worker.

Verified on: 2.4.0-DOWNSTREAM-2021-10-29-15-11-27