Created attachment 1899767 [details]
Clusters UI

Description of problem:
Hosted cluster in Pending import state

Version-Release number of selected component (if applicable):
2.6.0-DOWNSTREAM-2022-07-27-06-01-34

How reproducible:

Steps to Reproduce:
1. Deploy a hosted cluster on AWS by applying the following YAML:
```
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: HypershiftDeployment
metadata:
  name: hs0
  namespace: default
spec:
  hostingCluster: local-cluster   # the hypershift management cluster name.
  hostingNamespace: hs0-ns        # the namespace on the hypershift management cluster to which the hostedcluster and nodepools belong.
  infrastructure:
    cloudProvider:
      name: aws-creds
    configure: True
    platform:
      aws:
        region: us-east-1
  hostedClusterSpec:
    etcd:
      managed:
        storage:
          persistentVolume:
            size: 4Gi
          type: PersistentVolume
      managementType: Managed
    controllerAvailabilityPolicy: SingleReplica
    #controllerAvailabilityPolicy: HighlyAvailable
    release:
      image: quay.io/openshift-release-dev/ocp-release:4.11.0-rc.3-x86_64
    networking:
      networkType: OpenShiftSDN
      machineCIDR: ""   # Can be left empty, when configure: true
      podCIDR: ""       # Can be left empty, when configure: true
      serviceCIDR: ""   # Can be left empty, when configure: true
    platform:
      type: AWS
    pullSecret: {}      # Can be left empty, when configure: true
    sshKey: {}          # Can be left empty, when configure: true
    services: []        # Can be left empty, when configure: true
  nodePools:
    - name: hs0
      spec:
        clusterName: hs0
        management:
          autoRepair: false
          replace:
            rollingUpdate:
              maxSurge: 1
              maxUnavailable: 0
            strategy: RollingUpdate
          upgradeType: Replace
        nodeCount: 2
        platform:
          aws:
            instanceType: t3.large
            rootVolume:
              size: 35
              type: gp3
          type: AWS
        release:
          image: quay.io/openshift-release-dev/ocp-release:4.11.0-rc.3-x86_64
```

Actual results:
Hosted cluster is in 'Pending import' state

Expected results:

Additional info:
```
# oc get hypershiftdeployment -n default
NAME   TYPE   INFRA                  IAM                    MANIFESTWORK           PROVIDER REF   PROGRESS   AVAILABLE
hs0    AWS    ConfiguredAsExpected   ConfiguredAsExpected   ConfiguredAsExpected   AsExpected     Partial    True

# oc get nodepool -n hs0-ns
NAME   CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION       UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
hs0    hs0       2               2               False         False        4.11.0-rc.3

# oc get po -n hs0-ns-hs0
NAME                                              READY   STATUS     RESTARTS   AGE
capi-provider-884f969fc-dgrjs                     2/2     Running    0          33m
catalog-operator-68f6769b5d-gv2jm                 2/2     Running    0          30m
certified-operators-catalog-6958dd669d-6sbpd      1/1     Running    0          30m
cluster-api-667459ffbf-d7tjr                      1/1     Running    0          33m
cluster-autoscaler-7ff9545889-wbhwt               1/1     Running    0          32m
cluster-network-operator-6bd98c86d-v8bp7          1/1     Running    0          30m
cluster-policy-controller-ff4844c4c-qbzh8         1/1     Running    0          30m
cluster-version-operator-65477f7dcc-cbjj5         1/1     Running    0          30m
community-operators-catalog-6bb566658b-q2cd4      1/1     Running    0          30m
control-plane-operator-b75794859-n28cx            2/2     Running    0          33m
etcd-0                                            1/1     Running    0          32m
hosted-cluster-config-operator-76f79456f6-rxzqf   1/1     Running    0          30m
ignition-server-646549b976-965xv                  1/1     Running    0          32m
ingress-operator-7d6d49659b-lcbhj                 0/3     Init:0/1   0          30m
konnectivity-agent-ddb8c46bb-v89vl                1/1     Running    0          32m
konnectivity-server-699bfb7d94-gvrnn              1/1     Running    0          32m
kube-apiserver-64bfd8f6c-xpbqv                    5/5     Running    0          32m
kube-controller-manager-5b46896bdd-pzr4q          2/2     Running    0          26m
kube-scheduler-6996497954-2bk9v                   1/1     Running    0          31m
machine-approver-5b676fffff-zpv8n                 1/1     Running    0          32m
oauth-openshift-754cfd88dc-8d5hl                  2/2     Running    0          30m
olm-operator-6c95dd9795-6s58s                     2/2     Running    0          30m
openshift-apiserver-68f6f6f847-mq2dc              2/2     Running    0          26m
openshift-controller-manager-85577f8956-7b4gf     1/1     Running    0          30m
openshift-oauth-apiserver-65b9df4664-l5lvk        1/1     Running    0          30m
packageserver-674c4857b6-rzbwj                    2/2     Running    0          30m
redhat-marketplace-catalog-d5f58875d-fz25k        1/1     Running    0          30m
redhat-operators-catalog-7cb4657b5f-j7txj         1/1     Running    0          30m
```

Hosted cluster status -
```
# oc get hostedcluster -n hs0-ns
NAME   VERSION   KUBECONFIG             PROGRESS   AVAILABLE   REASON                    MESSAGE
hs0              hs0-admin-kubeconfig   Partial    True        HostedClusterAsExpected

# oc get nodes --kubeconfig=hs0-kubeconfig
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-133-75.ec2.internal    Ready    worker   16m   v1.24.0+9546431
ip-10-0-134-216.ec2.internal   Ready    worker   27m   v1.24.0+9546431
```
G2Bsync 1198403949 comment dtthuynh Thu, 28 Jul 2022 16:59:35 UTC G2Bsync @rokej Seeing this using `quay.io/openshift-release-dev/ocp-release:4.11.0-rc.6-x86_64` on OCP 4.11.0 and ACM build `2.6.0-DOWNSTREAM-2022-07-27-12-58-38`. I'm also seeing it happen on OCP 4.10.23 + ACM build `2.5.2-DOWNSTREAM-2022-07-27-20-49-17` My 4.11 ones are sitting in Pending Import state
G2Bsync 1198536603 comment rokej Thu, 28 Jul 2022 19:15:11 UTC G2Bsync Since @dtthuynh was able to reproduce the problem with the rc6 release image, I am re-opening this issue.
G2Bsync 1198550131 comment rokej Thu, 28 Jul 2022 19:31:36 UTC G2Bsync I cannot reproduce the problem with the same HD/HC CRs on my cluster with the same MCE downstream build (the one that is included in the ACM downstream build). It is also strange that a hosted cluster with the 4.10 release image works fine on this cluster but not with the 4.11 release image.
```
oc get pod -n clusters-clc-hypershift-demo-411
NAME                                      READY   STATUS     RESTARTS         AGE
capi-provider-c4f895dd6-9pxqh             0/2     Init:0/1   0                154m
cluster-api-845d5dfbdf-n9xwt              1/1     Running    0                154m
control-plane-operator-845859d769-z89dg   1/2     Running    27 (5m16s ago)   154m
```
In the control-plane-operator pod, I see errors:
```
{"level":"info","ts":"2022-07-28T19:05:43Z","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","hostedControlPlane":{"name":"clc-hypershift-demo-411","namespace":"clusters-clc-hypershift-demo-411"},"namespace":"clusters-clc-hypershift-demo-411","name":"clc-hypershift-demo-411","reconcileID":"3bb5b2a5-a73f-4d88-9f3b-74bf481e3ab2"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2bc0df7]

goroutine 1973 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0x2d7
panic({0x2f8e460, 0x49b8540})
	/usr/lib/golang/src/runtime/panic.go:844 +0x25a
github.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).Reconcile(0xc0003867e0, {0x37f25b0, 0xc0009671a0}, {{{0xc000836f80, 0x20}, {0xc000e947c8, 0x17}}})
	/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:247 +0x1497
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000419900, {0x37f25b0, 0xc0009671a0}, {{{0xc000836f80, 0x20}, {0xc000e947c8, 0x17}}})
	/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121 +0x172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000419900, {0x37f25b0, 0xc0009671a0}, {0x30936c0, 0xc000577e40})
	/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320 +0x3d7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000419900, {0x37f2508, 0xc000e5b540})
```
I am asking the hypershift team about this.
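The panic above is the classic symptom of dereferencing an optional (pointer) field of the CR inside `Reconcile` without a nil check. The trace does not show which field is involved at hostedcontrolplane_controller.go:247, so the sketch below uses hypothetical `NetworkingSpec`/`SDNConfig` types purely to illustrate the guard pattern that avoids this class of crash:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for optional nested fields of a
// HostedControlPlane spec -- the real field that panicked is not
// identified in this report.
type SDNConfig struct {
	MTU int
}

type NetworkingSpec struct {
	SDN *SDNConfig // optional: nil when the CR omits this block
}

// effectiveMTU checks the pointer before dereferencing and falls back
// to a default (1450 is an assumed value for illustration). Skipping
// this check on a nil SDN is exactly what yields "runtime error:
// invalid memory address or nil pointer dereference" in a reconciler.
func effectiveMTU(n NetworkingSpec) int {
	if n.SDN == nil {
		return 1450
	}
	return n.SDN.MTU
}

func main() {
	fmt.Println(effectiveMTU(NetworkingSpec{}))                           // optional block unset -> default
	fmt.Println(effectiveMTU(NetworkingSpec{SDN: &SDNConfig{MTU: 9000}})) // optional block set
}
```

Note that controller-runtime only recovers the panic and logs it (as in the JSON line above) when `RecoverPanic` is enabled; otherwise the whole operator process crashes and restarts, which matches the restart count on the control-plane-operator pod.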
G2Bsync 1198716728 comment rokej Thu, 28 Jul 2022 23:18:40 UTC G2Bsync @philipwu08 FYI https://coreos.slack.com/archives/C01C8502FMM/p1659035464648789
G2Bsync 1199515885 comment philipwu08 Fri, 29 Jul 2022 15:19:28 UTC G2Bsync Still a problem with `2.6.0-DOWNSTREAM-2022-07-28-23-21-09` - pinged Brian Smith.
G2Bsync 1202686078 comment dtthuynh Tue, 02 Aug 2022 14:24:21 UTC G2Bsync Raising to blocker state; this is still happening on ACM 2.6.0 FC1 and now also happens with the 4.10 imageset.
G2Bsync 1205695611 comment thuyn-581 Thu, 04 Aug 2022 19:42:58 UTC G2BSync - Retested OK on 2.6.0-FC2. Hypershift cluster created successfully on AWS with the OpenShiftSDN network type.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6370