Bug 1866543 - Spoke cluster creation in a disconnected env: Cluster is stuck in "Cluster is pending import"
Summary: Spoke cluster creation in a disconnected env: Cluster is stuck in "Cluster is pending import"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Cluster Lifecycle
Version: rhacm-2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: rhacm-2.1
Assignee: Hao Liu
QA Contact: magchen@redhat.com
Docs Contact: Christopher Dawson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-05 20:36 UTC by Alexander Chuzhoy
Modified: 2021-04-09 18:06 UTC
CC: 2 users

Fixed In Version: rhacm-2.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-05 11:55:17 UTC
Target Upstream Version:
Embargoed:
sasha: rhacm-2.0.z?
gghezzo: rhacm-2.1+




Links
System ID Private Priority Status Summary Last Updated
Github open-cluster-management backlog issues 4276 0 None None None 2020-09-29 20:19:21 UTC
Red Hat Product Errata RHEA-2020:4954 0 None None None 2020-11-05 11:55:36 UTC

Description Alexander Chuzhoy 2020-08-05 20:36:32 UTC
Version: ACM2.0

Attempted to deploy a spoke cluster with ACM in a disconnected setup.

I see in the hive logs that the spoke cluster was created successfully:

time="2020-08-05T19:54:15Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/output/auth/kubeconfig'"
time="2020-08-05T19:54:15Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ocp-sasha-1.qe.lab.redhat.com"
REDACTED LINE OF OUTPUT
time="2020-08-05T19:54:15Z" level=debug msg="Time elapsed per stage:"
time="2020-08-05T19:54:15Z" level=debug msg="    Infrastructure: 49m17s"
time="2020-08-05T19:54:15Z" level=debug msg="Bootstrap Complete: 7m29s"
time="2020-08-05T19:54:15Z" level=debug msg=" Bootstrap Destroy: 13s"
time="2020-08-05T19:54:15Z" level=debug msg=" Cluster Operators: 27m27s"
time="2020-08-05T19:54:15Z" level=info msg="Time elapsed: 1h25m10s"
time="2020-08-05T19:54:16Z" level=info msg="command completed successfully" installID=hntlg9qn
time="2020-08-05T19:54:16Z" level=info msg="saving installer output" installID=hntlg9qn
time="2020-08-05T19:54:16Z" level=debug msg="installer console log: level=info msg=\"Consuming Install Config from target directory\"\nlevel=warning msg=\"Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated\"\nlevel=warning msg=\"Found override for release image. Please be warned, this is not advised\"\nlevel=info msg=\"Consuming Openshift Manifests from target directory\"\nlevel=info msg=\"Consuming Master Machines from target directory\"\nlevel=info msg=\"Consuming Common Manifests from target directory\"\nlevel=info msg=\"Consuming Worker Machines from target directory\"\nlevel=info msg=\"Consuming OpenShift Install (Manifests) from target directory\"\nlevel=info msg=\"Obtaining RHCOS image file from 'http://registry.ocp-sasha-1.qe.lab.redhat.com:8080/images/rhcos-45.82.202007141718-0-qemu.x86_64.qcow2.gz?sha256=f991f93293fea5d9a5f34da7eac4d2f1a2efc9816c13803484c702dee0818feb'\"\nlevel=info msg=\"Consuming Bootstrap Ignition Config from target directory\"\nlevel=info msg=\"Consuming Worker Ignition Config from target directory\"\nlevel=info msg=\"Consuming Master Ignition Config from target directory\"\nlevel=info msg=\"Creating infrastructure resources...\"\nlevel=info msg=\"Waiting up to 20m0s for the Kubernetes API at https://api.ocp-sasha-1.qe.lab.redhat.com:6443...\"\nlevel=info msg=\"API v1.18.3+08c38ef up\"\nlevel=info msg=\"Waiting up to 40m0s for bootstrapping to complete...\"\nlevel=info msg=\"Destroying the bootstrap resources...\"\nlevel=info msg=\"Waiting up to 1h0m0s for the cluster at https://api.ocp-sasha-1.qe.lab.redhat.com:6443 to initialize...\"\nlevel=info msg=\"Waiting up to 10m0s for the openshift-console route to be created...\"\nlevel=info msg=\"Install complete!\"\nlevel=info msg=\"To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/output/auth/kubeconfig'\"\nlevel=info msg=\"Access the OpenShift web-console here: https://console-openshift-console.apps.ocp-sasha-1.qe.lab.redhat.com\"\nREDACTED LINE OF OUTPUT\nlevel=info msg=\"Time elapsed: 1h25m10s\"\n" installID=hntlg9qn
time="2020-08-05T19:54:16Z" level=info msg="install completed successfully" installID=hntlg9qn
[kni@provisionhost-0-0 ~]$ 



Yet in the ACM UI, the cluster is stuck with:
"
Cluster is pending import


Run the command with kubectl configured for your targeted cluster if you have not already completed this step.
echo <The encoded CRD is only displayed in the command when you click the Copy button.> | base64 --decode | kubectl apply -f - && sleep 2 && echo <The encoded YAML is only displayed in the command when you click the Copy button.> | base64 --decode | kubectl apply -f -
"



Checking the CSRs - all are approved:
[kni@provisionhost-0-0 ~]$ oc get csr -A
NAME        AGE   SIGNERNAME                                    REQUESTOR                CONDITION
csr-bnmhp   13m   kubernetes.io/kubelet-serving                 system:node:master-0-0   Approved,Issued
csr-z5n5r   28m   kubernetes.io/kube-apiserver-client-kubelet   system:node:master-0-0   Approved,Issued
csr-zmslh   57m   kubernetes.io/kube-apiserver-client-kubelet   system:node:master-0-1   Approved,Issued
[kni@provisionhost-0-0 ~]$ 



[kni@provisionhost-0-0 ~]$ oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE                                          NAME                                                             READY   STATUS      RESTARTS   AGE



[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
cloud-credential                           4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
cluster-autoscaler                         4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
config-operator                            4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
console                                    4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
csi-snapshot-controller                    4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
dns                                        4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
etcd                                       4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
image-registry                             4.5.0-0.nightly-2020-08-03-123303   True        False         True       20h
ingress                                    4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
insights                                   4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
kube-apiserver                             4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
kube-controller-manager                    4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
kube-scheduler                             4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
kube-storage-version-migrator              4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
machine-api                                4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
machine-approver                           4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
machine-config                             4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
marketplace                                4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
monitoring                                 4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
network                                    4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
node-tuning                                4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
openshift-apiserver                        4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
openshift-controller-manager               4.5.0-0.nightly-2020-08-03-123303   True        False         False      113m
openshift-samples                          4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
operator-lifecycle-manager                 4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
service-ca                                 4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
storage                                    4.5.0-0.nightly-2020-08-03-123303   True        False         False      20h
[kni@provisionhost-0-0 ~]$

Comment 1 Alexander Chuzhoy 2020-08-05 21:25:29 UTC
This is related to an image not being reachable:

The following command was executed against the spoke cluster:
[kni@provisionhost-0-0 ~]$ oc describe pod -n open-cluster-management-agent klusterlet-8fc468666-q94xf
Name:         klusterlet-8fc468666-q94xf
Namespace:    open-cluster-management-agent
Priority:     0
Node:         worker-1-1/192.168.124.135
Start Time:   Wed, 05 Aug 2020 19:54:18 +0000
Labels:       app=klusterlet
              pod-template-hash=8fc468666
Annotations:  k8s.ovn.org/pod-networks:
                {"default":{"ip_addresses":["10.128.2.6/23"],"mac_address":"1a:f8:16:80:02:07","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.6/23","...
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.6"
                    ],
                    "mac": "1a:f8:16:80:02:07",
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.6"
                    ],
                    "mac": "1a:f8:16:80:02:07",
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Pending
IP:           10.128.2.6
IPs:
  IP:           10.128.2.6
Controlled By:  ReplicaSet/klusterlet-8fc468666
Containers:
  klusterlet:
    Container ID:  
    Image:         registry.redhat.io/rhacm2/registration-rhel8-operator@sha256:0630bca8263f93a4a1348e6bd5a8689157739e8ab09d4f3354acb4be1bf66dda
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      /registration-operator
      klusterlet
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Liveness:       http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from klusterlet-token-t2f5p (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  klusterlet-token-t2f5p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  klusterlet-token-t2f5p
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                    From                 Message
  ----     ------          ----                   ----                 -------
  Normal   Scheduled       <unknown>              default-scheduler    Successfully assigned open-cluster-management-agent/klusterlet-8fc468666-q94xf to worker-1-1
  Normal   AddedInterface  90m                    multus               Add eth0 [10.128.2.6/23]
  Warning  Failed          84m (x4 over 89m)      kubelet, worker-1-1  Failed to pull image "registry.redhat.io/rhacm2/registration-rhel8-operator@sha256:0630bca8263f93a4a1348e6bd5a8689157739e8ab09d4f3354acb4be1bf66dda": rpc error: code = Unknown desc = error pinging docker registry registry.redhat.io: Get https://registry.redhat.io/v2/: dial tcp 104.126.247.209:443: i/o timeout
  Normal   BackOff         20m (x240 over 89m)    kubelet, worker-1-1  Back-off pulling image "registry.redhat.io/rhacm2/registration-rhel8-operator@sha256:0630bca8263f93a4a1348e6bd5a8689157739e8ab09d4f3354acb4be1bf66dda"
  Warning  Failed          9m59s (x275 over 89m)  kubelet, worker-1-1  Error: ImagePullBackOff
  Warning  Failed          5m7s (x18 over 89m)    kubelet, worker-1-1  Error: ErrImagePull
  Normal   Pulling         3s (x19 over 90m)      kubelet, worker-1-1  Pulling image "registry.redhat.io/rhacm2/registration-rhel8-operator@sha256:0630bca8263f93a4a1348e6bd5a8689157739e8ab09d4f3354acb4be1bf66dda"
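
For reference, one way to confirm from the node itself that registry.redhat.io is unreachable (a hedged sketch, not part of the original report; the node name and image digest are the ones from the output above):

# Open a debug shell on the affected node and attempt the same pull with podman.
# In a disconnected environment this is expected to fail with the same
# "dial tcp ...:443: i/o timeout" seen in the pod events.
oc debug node/worker-1-1 -- chroot /host \
  podman pull registry.redhat.io/rhacm2/registration-rhel8-operator@sha256:0630bca8263f93a4a1348e6bd5a8689157739e8ab09d4f3354acb4be1bf66dda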

Comment 2 Alexander Chuzhoy 2020-08-07 17:06:44 UTC
In order to make the import complete successfully, I had to:
1. mirror the following images to the local registry:
registry.redhat.io/rhacm2/registration-rhel8-operator
registry.redhat.io/rhacm2/work-rhel8
registry.redhat.io/rhacm2/registration-rhel8
registry.redhat.io/rhacm2/endpoint-component-rhel8-operator

2. Create an ImageContentSourcePolicy for registry.redhat.io/rhacm2 pointing to the local mirror (a sketch of both steps is included below).
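
A hedged sketch of both steps (the mirror registry host and ICSP name below are placeholders, not taken from this report; only the registration-rhel8-operator digest comes from the pod spec in comment 1, and the other three images would be mirrored the same way):

# Placeholder for the disconnected mirror registry (not from this report)
LOCAL_REGISTRY=mirror.registry.example.com:5000

# Step 1: mirror the required image(s) from registry.redhat.io to the local registry
oc image mirror \
  registry.redhat.io/rhacm2/registration-rhel8-operator@sha256:0630bca8263f93a4a1348e6bd5a8689157739e8ab09d4f3354acb4be1bf66dda \
  ${LOCAL_REGISTRY}/rhacm2/registration-rhel8-operator

# Step 2: create an ImageContentSourcePolicy on the spoke cluster so digest-based
# pulls from registry.redhat.io/rhacm2 are redirected to the local mirror
cat <<EOF | oc apply -f -
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: rhacm-mirror
spec:
  repositoryDigestMirrors:
  - source: registry.redhat.io/rhacm2
    mirrors:
    - ${LOCAL_REGISTRY}/rhacm2
EOF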

Comment 3 Mike Ng 2020-08-07 18:42:28 UTC
G2Bsync comment 670198597 from TheRealHaoLiu, Thu, 06 Aug 2020 21:22:14 UTC:
`ImagePullBackOff`  `Get https://registry.redhat.io/v2/: dial tcp 104.126.247.209:443: i/o timeout`

Is this cluster disconnected? If yes, was this cluster created with the correct ImageContentSource?
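
For context, one way to check whether the spoke cluster already has a mirror policy covering the rhacm2 images (a hedged sketch, not from the report):

# On the spoke cluster: list ImageContentSourcePolicies and inspect their
# source/mirror pairs for registry.redhat.io/rhacm2
oc get imagecontentsourcepolicy
oc get imagecontentsourcepolicy -o yaml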

Comment 9 errata-xmlrpc 2020-11-05 11:55:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Advanced Cluster Management for Kubernetes version 2.1 images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4954

