2083006 – IPI cluster creation is unstable - fails with different errors at every retry

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.10
Hardware:	arm
OS:	Mac OS
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	OCP Installer
QA Contact:	MayXu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-05-09 05:05 UTC by Guna K Kambalimath
Modified:	2023-03-09 01:19 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-03-09 01:19:00 UTC
Target Upstream Version:
Embargoed:

Attachments
openshift_install.log for the cluster name ipitest, (126.25 KB, text/plain) 2022-05-09 05:05 UTC, Guna K Kambalimath
Cluster creation failed - attached logs (126.25 KB, text/plain) 2022-05-09 07:38 UTC, Guna K Kambalimath

Description Guna K Kambalimath 2022-05-09 05:05:38 UTC

Created attachment 1877972
openshift_install.log for the cluster name ipitest,

Version:

$ openshift-install version
./openshift-install unreleased-master-5680-gb9faa56e1a63d5aac107d4f059d30cc25702be93
built from commit b9faa56e1a63d5aac107d4f059d30cc25702be93
release image registry.ci.openshift.org/origin/release:4.10
release architecture amd64

Platform: ibmcloud

* IPI 

What happened?

IPI cluster creation on the ibmcloud platform fails with a different error on every attempt, so I have attached multiple logs.

What I expected to happen?

Successful creation of IPI cluster

How to reproduce it (as minimally and precisely as possible)?

Follow this doc for IPI cluster creation: https://deploy-preview-39767--osdocs.netlify.app/openshift-enterprise/latest/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html#installing-ibm-cloud-customizations

Comment 1 Guna K Kambalimath 2022-05-09 07:38:02 UTC

Created attachment 1877991
Cluster creation failed - attached logs

Adding one more log. This was observed during another cluster creation attempt.

Comment 2 Guna K Kambalimath 2022-05-09 07:43:05 UTC

Adding one more console log. (The cluster has since been cleaned up, so the .openshift_install.log for this run is lost.)

```
INFO Obtaining RHCOS image file from 'https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases/rhcos-4.10/410.84.202201251210-0/x86_64/rhcos-410.84.202201251210-0-ibmcloud.x86_64.qcow2.gz?sha256=8fc2f8c99b6fc4766907f0e793bdf6ce7d0e0160f5a8296e4b0c3bb05bb57f1d' 
INFO Creating infrastructure resources...         
ERROR                                              
ERROR Error: Invalid function argument             
ERROR                                              
ERROR   on ../../../../var/folders/65/fc3xk3hj4b9c7754yqtb5qkw0000gn/T/openshift-install-bootstrap-3569125112/main.tf line 34, in resource "ibm_is_instance" "bootstrap_node": 
ERROR   34:   user_data = templatefile("${path.module}/templates/bootstrap.ign", { 
ERROR   35:     HOSTNAME    = ibm_cos_bucket.bootstrap_ignition.s3_endpoint_public 
ERROR   36:     BUCKET_NAME = ibm_cos_bucket.bootstrap_ignition.bucket_name 
ERROR   37:     OBJECT_NAME = ibm_cos_bucket_object.bootstrap_ignition.key 
ERROR   38:     IAM_TOKEN   = data.ibm_iam_auth_token.iam_token.iam_access_token 
ERROR   39:   })                                   
ERROR     |----------------                        
ERROR     | path.module is "../../../../var/folders/65/fc3xk3hj4b9c7754yqtb5qkw0000gn/T/openshift-install-bootstrap-3569125112" 
ERROR                                              
ERROR Invalid value for "path" parameter: no file exists at 
ERROR ../../../../var/folders/65/fc3xk3hj4b9c7754yqtb5qkw0000gn/T/openshift-install-bootstrap-3569125112/templates/bootstrap.ign; 
ERROR this function works only with files that are distributed as part of the 
ERROR configuration source code, so if this file will be created by a resource in 
ERROR this configuration you must instead obtain this result from an attribute of 
ERROR that resource.                               
ERROR                                              
ERROR Failed to read tfstate: open /var/folders/65/fc3xk3hj4b9c7754yqtb5qkw0000gn/T/openshift-install-bootstrap-3569125112/terraform.bootstrap.tfstate: no such file or directory 
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change 
ambikanair@Ambika-Nairs-MacBook-Pro openshift-install-mac-4.10.0-rc.8 % 

```

Comment 3 Jan Safranek 2022-05-09 13:35:55 UTC

I am making this BZ public so people in IBM can access it. Logs from CI are public anyway.

Comment 4 Christopher J Schaefer 2022-05-09 13:45:04 UTC

In the first two attachments (attachment 1877972 and 1877991), the error appears to be due to the user or system cancelling the operation via an interrupt.


time="2022-04-19T16:35:19+05:30" level=debug msg="module.vpc.ibm_is_lb_pool.kubernetes_api_public[0]: Still creating... [30s elapsed]"
time="2022-04-19T16:35:21+05:30" level=debug msg="module.vpc.ibm_is_lb_pool.machine_config: Still creating... [1m0s elapsed]"
time="2022-04-19T16:35:21+05:30" level=debug msg="module.vpc.ibm_is_lb_pool.kubernetes_api_private: Still creating... [1m0s elapsed]"
time="2022-04-19T16:35:27+05:30" level=debug msg="Interrupt received."
time="2022-04-19T16:35:27+05:30" level=debug msg="Please wait for Terraform to exit or data loss may occur."
time="2022-04-19T16:35:27+05:30" level=debug msg="Gracefully shutting down..."
time="2022-04-19T16:35:27+05:30" level=debug msg="Stopping operation..."
time="2022-04-19T16:35:27+05:30" level=error msg="Two interrupts received. Exiting immediately. Note that data"
time="2022-04-19T16:35:27+05:30" level=error msg="loss may have occurred."
time="2022-04-19T16:35:27+05:30" level=error
time="2022-04-19T16:35:27+05:30" level=error msg="Error: operation canceled"


I would make sure the user or system that is performing the IPI deployment is not interrupted (machine goes to sleep, powers down, etc.).
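
To check other logs for the same pattern, here is a minimal Python sketch. It assumes the log uses the `time="..." level=... msg="..."` line format shown in the excerpt above; the function name `find_interrupts` and the marker list are illustrative and not part of the installer.

```python
import re

# Matches the logrus-style lines seen in the excerpt above,
# e.g. time="..." level=debug msg="Interrupt received."
# The msg field is optional because some lines carry only time and level.
LINE_RE = re.compile(
    r'time="(?P<time>[^"]*)"\s+level=(?P<level>\w+)(?:\s+msg="(?P<msg>.*)")?'
)

# Message fragments copied from the excerpt above (assumed to be stable).
INTERRUPT_MARKERS = (
    "Interrupt received",
    "Two interrupts received",
    "operation canceled",
)


def find_interrupts(lines):
    """Return (time, level, msg) for each log line that mentions an interrupt."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a logrus-style line
        msg = match.group("msg") or ""
        if any(marker in msg for marker in INTERRUPT_MARKERS):
            hits.append((match.group("time"), match.group("level"), msg))
    return hits
```

Passing an open file handle for `.openshift_install.log` to `find_interrupts` should list the interrupt-related entries, if any, which helps separate cancelled runs from genuine installer failures.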

Comment 5 Christopher J Schaefer 2022-05-09 13:56:44 UTC

As for the other failure:

ERROR Invalid value for "path" parameter: no file exists at 
ERROR ../../../../var/folders/65/fc3xk3hj4b9c7754yqtb5qkw0000gn/T/openshift-install-bootstrap-3569125112/templates/bootstrap.ign; 
ERROR this function works only with files that are distributed as part of the 
ERROR configuration source code, so if this file will be created by a resource in 
ERROR this configuration you must instead obtain this result from an attribute of 
ERROR that resource.                               
ERROR                                              
ERROR Failed to read tfstate: open /var/folders/65/fc3xk3hj4b9c7754yqtb5qkw0000gn/T/openshift-install-bootstrap-3569125112/terraform.bootstrap.tfstate: no such file or directory 
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change 

This looks like a potential path-resolution issue in Terraform, or perhaps one specific to macOS, as I do not see these issues in a fresh Ubuntu container. I'm not familiar enough with Terraform to determine the root cause or the expected behavior, but I would expect the path to be <cluster-dir>/terraform.bootstrap.tfstate.

Hopefully someone with more Terraform knowledge has better insight into this.

Comment 7 Guna K Kambalimath 2022-06-01 06:04:10 UTC

(In reply to Guna K Kambalimath from comment #6)
> Update:
> 
> Tried to create cluster inside a linux env (inside docker container). I
> could see some progress. But, I still see the following error. 
> 
> OS:
> root@9e6e50387922:/cco_credentials_mount# cat /etc/os-release 
> NAME="Ubuntu"
> VERSION="18.04.6 LTS (Bionic Beaver)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 18.04.6 LTS"
> VERSION_ID="18.04"
> HOME_URL="https://www.ubuntu.com/"
> SUPPORT_URL="https://help.ubuntu.com/"
> BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
> PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
> VERSION_CODENAME=bionic
> UBUNTU_CODENAME=bionic
> 
> Logs:
> INFO Cluster operator cloud-controller-manager CloudControllerOwner is True
> with AsExpected: Cluster Cloud Controller Manager Operator owns cloud
> controllers at 4.10.0-rc.8 
> INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted:  
> INFO Cluster operator image-registry Available is False with
> DeploymentNotFound: Available: The deployment does not exist 
> INFO NodeCADaemonAvailable: The daemon set node-ca has available replicas 
> INFO ImagePrunerAvailable: Pruner CronJob has been created 
> INFO Cluster operator image-registry Progressing is True with Error:
> Progressing: Unable to apply resources: unable to apply objects: failed to
> create object *v1.Secret, Namespace=openshift-image-registry,
> Name=image-registry-private-configuration: specified resource key
> credentials does not contain HMAC keys 
> ERROR Cluster operator image-registry Degraded is True with Unavailable:
> Degraded: The deployment does not exist 
> INFO Cluster operator insights Disabled is False with AsExpected:  
> INFO Cluster operator insights SCANotAvailable is True with NotFound: Failed
> to pull SCA certs from
> https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API
> https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP
> 404:
> {"code":"ACCT-MGMT-7","href":"/api/accounts_mgmt/v1/errors/7","id":"7",
> "kind":"Error","operation_id":"290a4842-1b16-4fc6-98d2-d7e3a13bc93c",
> "reason":"The organization (id= 25yGxao0DHRsqcIax7rSwDLAmjQ) does not have
> any certificate of type sca. Enable SCA at
> https://access.redhat.com/management."} 
> INFO Cluster operator network ManagementStateDegraded is False with :  
> ERROR Cluster initialization failed because one or more operators are not
> functioning properly. 
> ERROR The cluster should be accessible for troubleshooting as detailed in
> the documentation linked below, 
> ERROR
> https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
> ERROR The 'wait-for install-complete' subcommand can then be used to
> continue the installation 
> FATAL failed to initialize the cluster: Cluster operator image-registry is
> not available 
> 
> 
> Installer used: - openshift-install-linux-4.10.0-rc.8

Comment 8 Christopher J Schaefer 2022-06-01 14:56:39 UTC

It appears you are affected by a bug from last month, and likely by several other bugs as well.
https://bugzilla.redhat.com/show_bug.cgi?id=2082492

> INFO Cluster operator image-registry Available is False with
> DeploymentNotFound: Available: The deployment does not exist 
> INFO NodeCADaemonAvailable: The daemon set node-ca has available replicas 
> INFO ImagePrunerAvailable: Pruner CronJob has been created 
> INFO Cluster operator image-registry Progressing is True with Error:
> Progressing: Unable to apply resources: unable to apply objects: failed to
> create object *v1.Secret, Namespace=openshift-image-registry,
> Name=image-registry-private-configuration: specified resource key
> credentials does not contain HMAC keys 


I recommend you use an OCP release of 4.10.15 or later, or a very recent 4.11 CI/nightly build, which will include the fixes for these known bugs.
Please also keep in mind that the fix for the `image-registry` bug requires the user to perform the CredentialsRequests extraction again, as it requires changes to the IRO CredentialsRequest.
https://coreos.slack.com/archives/C01U40AM37F/p1652196272952479?thread_ts=1652074667.322439&cid=C01U40AM37F

The important item is Step 4, where the CredentialsRequests are extracted, along with the steps that follow using the new CredentialsRequests:
oc adm release extract --cloud=ibmcloud --credentials-requests $RELEASE_IMAGE \
    --to=<path_to_credential_requests_directory> 


The other bugs, which affected 4.11 CI/nightly releases back in early May:
https://bugzilla.redhat.com/show_bug.cgi?id=2082604
https://bugzilla.redhat.com/show_bug.cgi?id=2082687
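
As a rough way to apply the 4.10.15 recommendation above to a given installer version string, here is a small Python sketch. The helper names are hypothetical, pre-release suffixes such as `-rc.8` are simply ignored, and the check deliberately returns no verdict for 4.11 builds, because the comment gives no version cutoff for those.

```python
import re

# Minimum 4.10 z-stream recommended in the comment above.
MIN_FIXED_4_10 = (4, 10, 15)


def parse_version(version):
    """Parse 'X.Y.Z' or 'X.Y.Z-suffix' into (X, Y, Z); any suffix is ignored."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_4_10_fixes(version):
    """Return True/False for 4.10.z releases, or None when the rule does not apply."""
    parsed = parse_version(version)
    if parsed[:2] != (4, 10):
        return None  # e.g. 4.11 CI/nightly builds: no cutoff stated in this thread
    return parsed >= MIN_FIXED_4_10
```

Under these assumptions, `has_4_10_fixes("4.10.0-rc.8")` is `False` for the installer used in comment 7, while a 4.10.18 build passes the check.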

Comment 9 Patrick Dillon 2022-06-02 01:41:30 UTC

I'm marking this as not a blocker because it does not clearly identify an issue. I'm interested in the error in https://bugzilla.redhat.com/show_bug.cgi?id=2083006#c5 if you are able to reproduce it. We have seen errors like this when the installer is run on a symlink directory at a different level in the hierarchy. That issue was previously fixed. Please let me know (and file a bz) if you can reproduce that issue.
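
One way to check for the symlink situation described above is to list which components of the working or install directory path are symbolic links. The following is a minimal Python sketch; the function name is hypothetical, and it only reports symlinks without confirming that they caused the Terraform failure. On macOS, `/var` is typically itself a symlink to `/private/var`, which may be relevant to the `/var/folders/...` temporary path in the error.

```python
import os


def symlinked_components(path):
    """Return every leading part of `path` that is a symbolic link, outermost first."""
    current = os.path.abspath(path)  # normalizes, but does not resolve symlinks
    links = []
    while True:
        if os.path.islink(current):
            links.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return list(reversed(links))
```

An empty result for both the install directory and the temporary directory suggests the symlink explanation does not apply; a non-empty one is a hint worth including when reporting a reproduction.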

Comment 10 MayXu 2022-06-10 10:17:24 UTC

Tested with 4.10.18: the IBMCloud IPI install succeeded.

Comment 12 Shiftzilla 2023-03-09 01:19:00 UTC

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9261
