Bug 2111663 - Hosted cluster in Pending import state
Summary: Hosted cluster in Pending import state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Hypershift
Version: rhacm-2.6
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: rhacm-2.6
Assignee: Roke Jung
QA Contact: txue
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-27 18:42 UTC by Thuy Nguyen
Modified: 2022-09-06 22:35 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-06 22:34:44 UTC
Target Upstream Version:
Embargoed:
bot-tracker-sync: rhacm-2.6+
cbynum: rhacm-2.6.z+


Attachments
Clusters UI (80.88 KB, image/png)
2022-07-27 18:42 UTC, Thuy Nguyen


Links
System ID Private Priority Status Summary Last Updated
Github stolostron backlog issues 24589 0 None None None 2022-07-27 21:18:17 UTC
Red Hat Product Errata RHSA-2022:6370 0 None None None 2022-09-06 22:35:08 UTC

Description Thuy Nguyen 2022-07-27 18:42:02 UTC
Created attachment 1899767: Clusters UI

Description of problem: After deploying a hosted cluster via HypershiftDeployment, the cluster remains stuck in the 'Pending import' state in the Clusters UI.


Version-Release number of selected component (if applicable):
2.6.0-DOWNSTREAM-2022-07-27-06-01-34

How reproducible:


Steps to Reproduce:
1. Deploy a hosted cluster on AWS by applying the following YAML:
```
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: HypershiftDeployment
metadata:
  name: hs0
  namespace: default
spec:
  hostingCluster: local-cluster     # the hypershift management cluster name.
  hostingNamespace: hs0-ns     # specify the namespace to which the hostedcluster and nodepools belong on the hypershift management cluster.
  infrastructure:
    cloudProvider:
      name: aws-creds
    configure: True
    platform:
      aws:
        region: us-east-1
  hostedClusterSpec:
    etcd:
      managed:
        storage:
          persistentVolume:
            size: 4Gi
          type: PersistentVolume
      managementType: Managed
    controllerAvailabilityPolicy: SingleReplica
    #controllerAvailabilityPolicy: HighlyAvailable
    release:
      image: quay.io/openshift-release-dev/ocp-release:4.11.0-rc.3-x86_64
    networking:
      networkType: OpenShiftSDN
      machineCIDR: ""           # Can be left empty, when configure: true
      podCIDR: ""               # Can be left empty, when configure: true
      serviceCIDR: ""           # Can be left empty, when configure: true
    platform:
      type: AWS
    pullSecret: {}  # Can be left empty, when configure: true
    sshKey: {}      # Can be left empty, when configure: true
    services: []    # Can be left empty, when configure: true
  nodePools:
  - name: hs0
    spec:
      clusterName: hs0
      management:
        autoRepair: false
        replace:
          rollingUpdate:
            maxSurge: 1
            maxUnavailable: 0
          strategy: RollingUpdate
        upgradeType: Replace
      nodeCount: 2
      platform:
        aws:
          instanceType: t3.large
          rootVolume:
            size: 35
            type: gp3
        type: AWS
      release:
        image: quay.io/openshift-release-dev/ocp-release:4.11.0-rc.3-x86_64
```
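
For reference, a minimal sketch of applying the deployment and watching its progress; the filename is an assumption, and the PROGRESS column is the one shown in the Additional info section below:
```
# Apply the HypershiftDeployment on the hub cluster
# (the filename hs0-hypershiftdeployment.yaml is assumed)
oc apply -f hs0-hypershiftdeployment.yaml

# Watch the deployment until its PROGRESS column moves past Partial
oc get hypershiftdeployment hs0 -n default -w
```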


Actual results:
The hosted cluster is stuck in the 'Pending import' state.

Expected results:
The hosted cluster is imported successfully and reaches the Ready state.

Additional info:
```
# oc get hypershiftdeployment -n default
NAME   TYPE   INFRA                  IAM                    MANIFESTWORK           PROVIDER REF   PROGRESS   AVAILABLE
hs0    AWS    ConfiguredAsExpected   ConfiguredAsExpected   ConfiguredAsExpected   AsExpected     Partial    True

# oc get nodepool -n hs0-ns
NAME   CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION       UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
hs0    hs0       2               2               False         False        4.11.0-rc.3

# oc get po -n hs0-ns-hs0
NAME                                              READY   STATUS     RESTARTS   AGE
capi-provider-884f969fc-dgrjs                     2/2     Running    0          33m
catalog-operator-68f6769b5d-gv2jm                 2/2     Running    0          30m
certified-operators-catalog-6958dd669d-6sbpd      1/1     Running    0          30m
cluster-api-667459ffbf-d7tjr                      1/1     Running    0          33m
cluster-autoscaler-7ff9545889-wbhwt               1/1     Running    0          32m
cluster-network-operator-6bd98c86d-v8bp7          1/1     Running    0          30m
cluster-policy-controller-ff4844c4c-qbzh8         1/1     Running    0          30m
cluster-version-operator-65477f7dcc-cbjj5         1/1     Running    0          30m
community-operators-catalog-6bb566658b-q2cd4      1/1     Running    0          30m
control-plane-operator-b75794859-n28cx            2/2     Running    0          33m
etcd-0                                            1/1     Running    0          32m
hosted-cluster-config-operator-76f79456f6-rxzqf   1/1     Running    0          30m
ignition-server-646549b976-965xv                  1/1     Running    0          32m
ingress-operator-7d6d49659b-lcbhj                 0/3     Init:0/1   0          30m
konnectivity-agent-ddb8c46bb-v89vl                1/1     Running    0          32m
konnectivity-server-699bfb7d94-gvrnn              1/1     Running    0          32m
kube-apiserver-64bfd8f6c-xpbqv                    5/5     Running    0          32m
kube-controller-manager-5b46896bdd-pzr4q          2/2     Running    0          26m
kube-scheduler-6996497954-2bk9v                   1/1     Running    0          31m
machine-approver-5b676fffff-zpv8n                 1/1     Running    0          32m
oauth-openshift-754cfd88dc-8d5hl                  2/2     Running    0          30m
olm-operator-6c95dd9795-6s58s                     2/2     Running    0          30m
openshift-apiserver-68f6f6f847-mq2dc              2/2     Running    0          26m
openshift-controller-manager-85577f8956-7b4gf     1/1     Running    0          30m
openshift-oauth-apiserver-65b9df4664-l5lvk        1/1     Running    0          30m
packageserver-674c4857b6-rzbwj                    2/2     Running    0          30m
redhat-marketplace-catalog-d5f58875d-fz25k        1/1     Running    0          30m
redhat-operators-catalog-7cb4657b5f-j7txj         1/1     Running    0          30m
```
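
The ingress-operator pod above is the only one stuck in Init:0/1. A sketch of how it can be inspected; the pod name is taken from the listing above, and the init container name (a placeholder here) has to be read from the describe output:
```
# Show events and init container status for the stuck pod
oc describe pod ingress-operator-7d6d49659b-lcbhj -n hs0-ns-hs0

# Fetch logs from the init container; replace <init-container>
# with the name reported by `oc describe`
oc logs ingress-operator-7d6d49659b-lcbhj -n hs0-ns-hs0 -c <init-container>
```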

Hosted cluster status -
```
# oc get hostedcluster -n hs0-ns
NAME   VERSION   KUBECONFIG             PROGRESS   AVAILABLE   REASON                    MESSAGE
hs0              hs0-admin-kubeconfig   Partial    True        HostedClusterAsExpected

# oc get nodes --kubeconfig=hs0-kubeconfig
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-133-75.ec2.internal    Ready    worker   16m   v1.24.0+9546431
ip-10-0-134-216.ec2.internal   Ready    worker   27m   v1.24.0+9546431

```
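
Since the console reports 'Pending import', the ManagedCluster resource on the hub is worth checking too; a sketch, assuming the ManagedCluster carries the same name as the HypershiftDeployment:
```
# Check whether the managed cluster has joined the hub
oc get managedcluster hs0

# Inspect the registration conditions (e.g. HubAcceptedManagedCluster,
# ManagedClusterJoined, ManagedClusterConditionAvailable)
oc get managedcluster hs0 -o yaml | grep -A 5 'conditions:'
```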

Comment 2 bot-tracker-sync 2022-07-29 15:53:36 UTC
G2Bsync 1198403949 comment 
 dtthuynh Thu, 28 Jul 2022 16:59:35 UTC 
 G2Bsync @rokej Seeing this using `quay.io/openshift-release-dev/ocp-release:4.11.0-rc.6-x86_64` on OCP 4.11.0 and ACM build `2.6.0-DOWNSTREAM-2022-07-27-12-58-38`. I'm also seeing it happen on OCP 4.10.23 + ACM build `2.5.2-DOWNSTREAM-2022-07-27-20-49-17`. My 4.11 clusters are sitting in the Pending import state.

Comment 3 bot-tracker-sync 2022-07-29 15:53:37 UTC
G2Bsync 1198536603 comment 
 rokej Thu, 28 Jul 2022 19:15:11 UTC 
 G2Bsync Since @dtthuynh was able to reproduce the problem with the rc.6 release image, I am re-opening it.

Comment 4 bot-tracker-sync 2022-07-29 15:53:38 UTC
G2Bsync 1198550131 comment 
 rokej Thu, 28 Jul 2022 19:31:36 UTC 
 G2Bsync I cannot reproduce the problem with the same HD/HC CRs on my cluster with the same MCE downstream build (the one that is included in the ACM downstream build). It is also strange that a hosted cluster with a 4.10 release image works fine on this cluster, but not with a 4.11 release image.

```
oc get pod -n clusters-clc-hypershift-demo-411
NAME                                      READY   STATUS     RESTARTS         AGE
capi-provider-c4f895dd6-9pxqh             0/2     Init:0/1   0                154m
cluster-api-845d5dfbdf-n9xwt              1/1     Running    0                154m
control-plane-operator-845859d769-z89dg   1/2     Running    27 (5m16s ago)   154m
```

In the control-plane-operator pod, I see these errors:

```
{"level":"info","ts":"2022-07-28T19:05:43Z","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","hostedControlPlane":{"name":"clc-hypershift-demo-411","namespace":"clusters-clc-hypershift-demo-411"},"namespace":"clusters-clc-hypershift-demo-411","name":"clc-hypershift-demo-411","reconcileID":"3bb5b2a5-a73f-4d88-9f3b-74bf481e3ab2"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2bc0df7]

goroutine 1973 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0x2d7
panic({0x2f8e460, 0x49b8540})
	/usr/lib/golang/src/runtime/panic.go:844 +0x25a
github.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).Reconcile(0xc0003867e0, {0x37f25b0, 0xc0009671a0}, {{{0xc000836f80, 0x20}, {0xc000e947c8, 0x17}}})
	/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:247 +0x1497
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000419900, {0x37f25b0, 0xc0009671a0}, {{{0xc000836f80, 0x20}, {0xc000e947c8, 0x17}}})
	/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121 +0x172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000419900, {0x37f25b0, 0xc0009671a0}, {0x30936c0, 0xc000577e40})
	/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320 +0x3d7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000419900, {0x37f2508, 0xc000e5b540})
```
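
For reference, a sketch of pulling the full panic trace from the restarting pod; the pod name comes from the listing above, and the container name control-plane-operator is an assumption:
```
# With 27 restarts, --previous returns the log of the crashed
# container instance, which contains the panic stack trace
oc logs control-plane-operator-845859d769-z89dg \
  -n clusters-clc-hypershift-demo-411 \
  -c control-plane-operator --previous
```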

I am asking the hypershift team about this.

Comment 5 bot-tracker-sync 2022-07-29 15:53:39 UTC
G2Bsync 1198716728 comment 
 rokej Thu, 28 Jul 2022 23:18:40 UTC 
 G2Bsync @philipwu08 FYI https://coreos.slack.com/archives/C01C8502FMM/p1659035464648789

Comment 6 bot-tracker-sync 2022-07-29 15:53:41 UTC
G2Bsync 1199515885 comment 
 philipwu08 Fri, 29 Jul 2022 15:19:28 UTC 
 G2Bsync  Still a problem with `2.6.0-DOWNSTREAM-2022-07-28-23-21-09` - pinged Brian Smith.

Comment 7 bot-tracker-sync 2022-08-02 15:18:08 UTC
G2Bsync 1202686078 comment 
 dtthuynh Tue, 02 Aug 2022 14:24:21 UTC 
 G2Bsync Raising to blocker: still happening on ACM 2.6.0 FC1, and it now happens with the 4.10 imageset as well.

Comment 8 bot-tracker-sync 2022-08-04 20:30:17 UTC
G2Bsync 1205695611 comment 
 thuyn-581 Thu, 04 Aug 2022 19:42:58 UTC 
 G2Bsync -
Retested OK on 2.6.0-FC2.
The Hypershift cluster was created successfully on AWS with the OpenShiftSDN network type.

Comment 11 errata-xmlrpc 2022-09-06 22:34:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6370

