Description of problem:
1. Upgraded a cluster 4.2.29 -> 4.3.18 -> 4.4.0-rc.13 on GCP
2. After upgrade, scaled all machinesets to 3

The new machines were created and are in the Provisioned state, but they never joined the cluster as nodes. There are no pending CSRs. runcom suggested starting this with RHCOS.

# oc get machineset
NAME               DESIRED   CURRENT   READY   AVAILABLE   AGE
mffied-8v28p-w-a   3         3         1       1           26h
mffied-8v28p-w-b   3         3         1       1           26h
mffied-8v28p-w-c   3         3         1       1           26h
mffied-8v28p-w-f   3         3                             26h

# oc get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
mffied-8v28p-m-0.c.openshift-qe.internal         Ready    master   26h   v1.17.1
mffied-8v28p-m-1.c.openshift-qe.internal         Ready    master   26h   v1.17.1
mffied-8v28p-m-2.c.openshift-qe.internal         Ready    master   26h   v1.17.1
mffied-8v28p-w-a-gr4pb.c.openshift-qe.internal   Ready    worker   26h   v1.17.1
mffied-8v28p-w-b-w5zrf.c.openshift-qe.internal   Ready    worker   26h   v1.17.1
mffied-8v28p-w-c-ngfwh.c.openshift-qe.internal   Ready    worker   26h   v1.17.1

# oc get machines
NAME                     PHASE         TYPE            REGION        ZONE            AGE
mffied-8v28p-m-0         Running       n1-standard-4   us-central1   us-central1-a   26h
mffied-8v28p-m-1         Running       n1-standard-4   us-central1   us-central1-b   26h
mffied-8v28p-m-2         Running       n1-standard-4   us-central1   us-central1-c   26h
mffied-8v28p-w-a-gr4pb   Running       n1-standard-4   us-central1   us-central1-a   26h
mffied-8v28p-w-a-nxcbx   Provisioned   n1-standard-4   us-central1   us-central1-a   39m
mffied-8v28p-w-a-t5pw9   Provisioned   n1-standard-4   us-central1   us-central1-a   39m
mffied-8v28p-w-b-582jt   Provisioned   n1-standard-4   us-central1   us-central1-b   39m
mffied-8v28p-w-b-gkqsw   Provisioned   n1-standard-4   us-central1   us-central1-b   39m
mffied-8v28p-w-b-w5zrf   Running       n1-standard-4   us-central1   us-central1-b   26h
mffied-8v28p-w-c-7rkrp   Provisioned   n1-standard-4   us-central1   us-central1-c   39m
mffied-8v28p-w-c-ngfwh   Running       n1-standard-4   us-central1   us-central1-c   26h
mffied-8v28p-w-c-z9vll   Provisioned   n1-standard-4   us-central1   us-central1-c   39m
mffied-8v28p-w-f-2s479   Provisioned   n1-standard-4   us-central1   us-central1-f   39m
mffied-8v28p-w-f-47744   Provisioned   n1-standard-4   us-central1   us-central1-f   39m
mffied-8v28p-w-f-sqf2c   Provisioned   n1-standard-4   us-central1   us-central1-f   39m

Version-Release number of selected component (if applicable):
4.2.29 -> 4.3.18 -> 4.4.0-rc.13

How reproducible:
Unknown. One cluster so far.

Additional info:
Will provide the location of the must-gather and console logs from the machines not joining the cluster shortly.
Unfortunately, this is yet another case where we want https://github.com/openshift/machine-config-operator/issues/1365 - the console logs right now just don't show a lot. But that's something we need to fix on the RHCOS side too. I looked at your must-gather, but we need the systemd journals from the workers.
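For anyone who needs to collect these: since the machine never joins as a Node, `oc adm node-logs` is not an option, so on GCP the logs have to come from the instance itself. A sketch (the instance name/zone are this bug's; the `gcloud` commands are printed for review rather than executed here):

```shell
# Hypothetical example -- substitute your own instance name and zone.
INSTANCE=mffied-8v28p-w-c-z9vll
ZONE=us-central1-c

# Serial console output (currently not very verbose; see the MCO issue above):
echo "gcloud compute instances get-serial-port-output $INSTANCE --zone $ZONE"

# Full systemd journal over SSH as the 'core' user, if the instance is reachable:
echo "gcloud compute ssh core@$INSTANCE --zone $ZONE -- journalctl --no-pager"
```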
Created attachment 1683160 [details] journal from system not joining the cluster
Apr 30 00:03:10 mffied-8v28p-w-c-z9vll.c.openshift-qe.internal crio[9281]: time="2020-04-30 00:03:10.046153145Z" level=fatal msg="config validation: invalid runtime_path for runtime 'runc': "stat : no such file or directory""
Created attachment 1683161 [details] /etc/crio/crio.conf from same system
It appears that we are starting crio 1.14 (the 4.2 version), which is failing on the newer config.
When we update a cluster from 4.2 to 4.3 to 4.4 and then scale up, the machineset is still pointing at the 4.2 boot image. This results in crio-1.14 (from the 4.2 boot image) trying to parse a crio-1.17 (4.4) crio.conf and failing. We tested a workaround where we modified the machineset to point at a newer boot image, and we were then able to scale up successfully with the nodes coming up.

One possibility is that we regressed in pivot (or there was a gap in the understanding of it), where we always expect to start with a new crio pulled from the latest machine-os-content; or we were just lucky so far in that older crio versions were able to understand the newer configs. There is an enhancement open for supporting updating boot images as well: https://github.com/openshift/enhancements/pull/201/files.
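For anyone checking whether their cluster is affected, the boot image each machineset references lives in the providerSpec; the field path differs per platform (GCP keeps it in `disks[0].image`, AWS in `ami.id`, matching the providerSpec YAML shown elsewhere in this bug). A sketch that prints the `oc` invocations for review:

```shell
# jsonpath expressions listing each machineset and its boot image reference.
# GCP machinesets carry the image in .disks[0].image; AWS in .ami.id.
GCP_PATH='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.providerSpec.value.disks[0].image}{"\n"}{end}'
AWS_PATH='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.providerSpec.value.ami.id}{"\n"}{end}'

# Printed for review rather than executed -- run the right one for your platform:
echo "oc -n openshift-machine-api get machinesets -o jsonpath='$GCP_PATH'"
echo "oc -n openshift-machine-api get machinesets -o jsonpath='$AWS_PATH'"
```

Comparing the printed image/AMI against the installer's rhcos.json for your release shows whether the machineset is still on the born-in boot image.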
Tagging as an UpgradeBlocker, because we really want to keep folks with born-in-4.2-or-earlier clusters from moving up to 4.4 and having to manually rotate bootimages if they want to create new machines.
Reproducible upgrading from 4.2.29 -> 4.3.18 -> 4.4.0
Not reproducible on 4.2.29 -> 4.3.18
Not reproducible on 4.3.18 -> 4.4.0
Workaround steps/outline:
1) Successfully upgrade to 4.4 from a cluster which was originally installed as 4.1 or 4.2
2) Identify and copy the RHCOS 4.4 image/AMI (e.g. for AWS: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#optional-create-encrypted-amis)
3) Update each MachineSet with the newer RHCOS 4.4 image reference
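On AWS, step 3 can be done with a merge patch against each machineset. A sketch with placeholder values (the machineset name and AMI id below are hypothetical; substitute your own machineset and the RHCOS 4.4 AMI for your region from the installer's rhcos.json):

```shell
# Placeholders -- substitute your machineset name and your region's 4.4 AMI.
MACHINESET=mycluster-abcde-worker-us-east-1a
NEW_AMI=ami-0123456789abcdef0

# Merge patch swapping the boot image. Existing Machines keep running;
# only Machines created after the patch boot from the updated AMI.
PATCH="{\"spec\":{\"template\":{\"spec\":{\"providerSpec\":{\"value\":{\"ami\":{\"id\":\"$NEW_AMI\"}}}}}}}"

# Printed for review rather than executed:
echo "oc -n openshift-machine-api patch machineset $MACHINESET --type merge -p '$PATCH'"
```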
With the upgrade issues resolved in 4.4.0-0.nightly-2020-04-30-145451, this now seems to be 100% reproducible (7 of 7 clusters born on 4.2.x).
I think https://github.com/openshift/machine-config-operator/pull/1706 is likely to work for this, but needs testing.
Moving back to 4.5 as we want to target a different fix for this in the z-stream.
Verified on 4.5.0-0.nightly-2020-05-05-205255. I had to update from 4.2.29 -> 4.3.18 -> 4.4.0-rc.13 -> 4.5.0-0.nightly-2020-05-05-205255 to get the system into the right state, and verified that scaleup works with the fix. The easiest way would have been to modify the machineset to point at a 4.2 boot image (as Colin said), but I wanted to make sure that the problem would be resolved through updates.

$ oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2020-05-06T13:11:36Z
  Generation:          5
  Resource Version:    127734
  Self Link:           /apis/config.openshift.io/v1/clusterversions/version
  UID:                 1d9375be-8f9b-11ea-8b1a-0a09ab9c60e7
Spec:
  Channel:     stable-4.3
  Cluster ID:  ac4727b2-fbf1-4c01-bf15-9c2e48128e16
  Desired Update:
    Force:    true
    Image:    registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-05-05-205255
    Version:
  Upstream:  https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2020-05-06T13:30:10Z
    Message:               Done applying 4.5.0-0.nightly-2020-05-05-205255
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-06T16:58:01Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2020-05-06T16:58:16Z
    Message:               Cluster version is 4.5.0-0.nightly-2020-05-05-205255
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-05-06T15:18:00Z
    Message:               Unable to retrieve available updates: currently installed version 4.5.0-0.nightly-2020-05-05-205255 not found in the "stable-4.3" channel
    Reason:                VersionNotFound
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Force:    true
    Image:    registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-05-05-205255
    Version:  4.5.0-0.nightly-2020-05-05-205255
  History:
    Completion Time:  2020-05-06T16:58:16Z
    Image:            registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-05-05-205255
    Started Time:     2020-05-06T16:16:35Z
    State:            Completed
    Verified:         false
    Version:          4.5.0-0.nightly-2020-05-05-205255
    Completion Time:  2020-05-06T16:03:21Z
    Image:            quay.io/openshift-release-dev/ocp-release:4.4.0-rc.13-x86_64
    Started Time:     2020-05-06T15:17:44Z
    State:            Completed
    Verified:         false
    Version:          4.4.0-rc.13
    Completion Time:  2020-05-06T14:56:29Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:1f0fd38ac0640646ab8e7fec6821c8928341ad93ac5ca3a48c513ab1fb63bc4b
    Started Time:     2020-05-06T14:13:55Z
    State:            Completed
    Verified:         true
    Version:          4.3.18
    Completion Time:  2020-05-06T13:30:10Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:3bff53ce2202ec59ed87581106b05f364fea0e7459f5806e4dc6e5129f130b36
    Started Time:     2020-05-06T13:11:51Z
    State:            Completed
    Verified:         false
    Version:          4.2.29
  Observed Generation:  5
  Version Hash:         Eh-gywWFNgQ=
Events:                 <none>

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         6m15s   Cluster version is 4.5.0-0.nightly-2020-05-05-205255

$ oc -n openshift-machine-api get machineset
NAME                                      DESIRED   CURRENT   READY   AVAILABLE   AGE
mnguyen4229to45-9hcvx-worker-us-east-1a   1         1         1       1           3h56m
mnguyen4229to45-9hcvx-worker-us-east-1b   1         1         1       1           3h56m
mnguyen4229to45-9hcvx-worker-us-east-1c   1         1         1       1           3h56m
mnguyen4229to45-9hcvx-worker-us-east-1d   0         0                             3h56m
mnguyen4229to45-9hcvx-worker-us-east-1e   0         0                             3h56m
mnguyen4229to45-9hcvx-worker-us-east-1f   0         0                             3h56m

$ oc -n openshift-machine-api get machineset/mnguyen4229to45-9hcvx-worker-us-east-1a -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "8192"
    machine.openshift.io/vCPU: "2"
  creationTimestamp: "2020-05-06T13:12:32Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
  name: mnguyen4229to45-9hcvx-worker-us-east-1a
  namespace: openshift-machine-api
  resourceVersion: "118060"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/mnguyen4229to45-9hcvx-worker-us-east-1a
  uid: 3f12baaa-8f9b-11ea-8b1a-0a09ab9c60e7
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
      machine.openshift.io/cluster-api-machineset: mnguyen4229to45-9hcvx-worker-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: mnguyen4229to45-9hcvx-worker-us-east-1a
    spec:
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-01e7fdcb66157b224
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              iops: 0
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: mnguyen4229to45-9hcvx-worker-profile
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
          publicIp: null
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - mnguyen4229to45-9hcvx-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - mnguyen4229to45-9hcvx-private-us-east-1a
          tags:
          - name: kubernetes.io/cluster/mnguyen4229to45-9hcvx
            value: owned
          userDataSecret:
            name: worker-user-data
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

== NOTE: ami-01e7fdcb66157b224 = 4.2 AMI, per https://github.com/openshift/installer/blob/release-4.2/data/data/rhcos.json ==

$ oc -n openshift-machine-api scale --replicas=2 machineset/mnguyen4229to45-9hcvx-worker-us-east-1a
machineset.machine.openshift.io/mnguyen4229to45-9hcvx-worker-us-east-1a scaled

$ oc -n openshift-machine-api get machineset/mnguyen4229to45-9hcvx-worker-us-east-1a
NAME                                      DESIRED   CURRENT   READY   AVAILABLE   AGE
mnguyen4229to45-9hcvx-worker-us-east-1a   2         2         1       1           4h8m

$ oc get nodes
NAME                           STATUS   ROLES    AGE    VERSION
ip-10-0-130-15.ec2.internal    Ready    master   4h8m   v1.18.0-rc.1
ip-10-0-141-160.ec2.internal   Ready    worker   4h2m   v1.18.0-rc.1
ip-10-0-157-240.ec2.internal   Ready    worker   4h2m   v1.18.0-rc.1
ip-10-0-159-40.ec2.internal    Ready    master   4h8m   v1.18.0-rc.1
ip-10-0-172-230.ec2.internal   Ready    master   4h8m   v1.18.0-rc.1
ip-10-0-175-163.ec2.internal   Ready    worker   4h2m   v1.18.0-rc.1

$ oc -n openshift-machine-api get machines
NAME                                            PHASE         TYPE        REGION      ZONE         AGE
mnguyen4229to45-9hcvx-master-0                  Running       m4.xlarge   us-east-1   us-east-1a   4h8m
mnguyen4229to45-9hcvx-master-1                  Running       m4.xlarge   us-east-1   us-east-1b   4h8m
mnguyen4229to45-9hcvx-master-2                  Running       m4.xlarge   us-east-1   us-east-1c   4h8m
mnguyen4229to45-9hcvx-worker-us-east-1a-6wt2g   Running       m4.large    us-east-1   us-east-1a   4h6m
mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j   Provisioned   m4.large    us-east-1   us-east-1a   55s
mnguyen4229to45-9hcvx-worker-us-east-1b-j597b   Running       m4.large    us-east-1   us-east-1b   4h6m
mnguyen4229to45-9hcvx-worker-us-east-1c-9fhhl   Running       m4.large    us-east-1   us-east-1c   4h6m

$ oc -n openshift-machine-api get machines
NAME                                            PHASE     TYPE        REGION      ZONE         AGE
mnguyen4229to45-9hcvx-master-0                  Running   m4.xlarge   us-east-1   us-east-1a   4h12m
mnguyen4229to45-9hcvx-master-1                  Running   m4.xlarge   us-east-1   us-east-1b   4h12m
mnguyen4229to45-9hcvx-master-2                  Running   m4.xlarge   us-east-1   us-east-1c   4h12m
mnguyen4229to45-9hcvx-worker-us-east-1a-6wt2g   Running   m4.large    us-east-1   us-east-1a   4h10m
mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j   Running   m4.large    us-east-1   us-east-1a   4m47s
mnguyen4229to45-9hcvx-worker-us-east-1b-j597b   Running   m4.large    us-east-1   us-east-1b   4h10m
mnguyen4229to45-9hcvx-worker-us-east-1c-9fhhl   Running   m4.large    us-east-1   us-east-1c   4h10m

$ oc get nodes
NAME                           STATUS     ROLES    AGE     VERSION
ip-10-0-130-15.ec2.internal    Ready      master   4h12m   v1.18.0-rc.1
ip-10-0-136-255.ec2.internal   NotReady   worker   39s     v1.18.0-rc.1
ip-10-0-141-160.ec2.internal   Ready      worker   4h6m    v1.18.0-rc.1
ip-10-0-157-240.ec2.internal   Ready      worker   4h6m    v1.18.0-rc.1
ip-10-0-159-40.ec2.internal    Ready      master   4h12m   v1.18.0-rc.1
ip-10-0-172-230.ec2.internal   Ready      master   4h12m   v1.18.0-rc.1
ip-10-0-175-163.ec2.internal   Ready      worker   4h6m    v1.18.0-rc.1

$ oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-130-15.ec2.internal    Ready    master   4h12m   v1.18.0-rc.1
ip-10-0-136-255.ec2.internal   Ready    worker   76s     v1.18.0-rc.1
ip-10-0-141-160.ec2.internal   Ready    worker   4h7m    v1.18.0-rc.1
ip-10-0-157-240.ec2.internal   Ready    worker   4h7m    v1.18.0-rc.1
ip-10-0-159-40.ec2.internal    Ready    master   4h12m   v1.18.0-rc.1
ip-10-0-172-230.ec2.internal   Ready    master   4h12m   v1.18.0-rc.1
ip-10-0-175-163.ec2.internal   Ready    worker   4h7m    v1.18.0-rc.1

$ oc -n openshift-machine-api get machines/mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: running
  creationTimestamp: "2020-05-06T17:20:00Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: mnguyen4229to45-9hcvx-worker-us-east-1a-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: mnguyen4229to45-9hcvx-worker-us-east-1a
    machine.openshift.io/instance-type: m4.large
    machine.openshift.io/region: us-east-1
    machine.openshift.io/zone: us-east-1a
  name: mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: mnguyen4229to45-9hcvx-worker-us-east-1a
    uid: 3f12baaa-8f9b-11ea-8b1a-0a09ab9c60e7
  resourceVersion: "135908"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j
  uid: 3375313b-8333-4025-9541-e4be1f32ad14
spec:
  metadata: {}
  providerID: aws:///us-east-1a/i-08ae6c4c2277d254a
  providerSpec:
    value:
      ami:
        id: ami-01e7fdcb66157b224
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: mnguyen4229to45-9hcvx-worker-profile
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-1a
        region: us-east-1
      publicIp: null
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - mnguyen4229to45-9hcvx-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - mnguyen4229to45-9hcvx-private-us-east-1a
      tags:
      - name: kubernetes.io/cluster/mnguyen4229to45-9hcvx
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: 10.0.136.255
    type: InternalIP
  - address: ip-10-0-136-255.ec2.internal
    type: InternalDNS
  - address: ip-10-0-136-255.ec2.internal
    type: Hostname
  lastUpdated: "2020-05-06T17:24:57Z"
  nodeRef:
    kind: Node
    name: ip-10-0-136-255.ec2.internal
    uid: a8337740-da48-493e-8912-d30afdde4410
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2020-05-06T17:20:02Z"
      lastTransitionTime: "2020-05-06T17:20:02Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-08ae6c4c2277d254a
    instanceState: running

$ oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-130-15.ec2.internal    Ready    master   4h13m   v1.18.0-rc.1
ip-10-0-136-255.ec2.internal   Ready    worker   2m4s    v1.18.0-rc.1
ip-10-0-141-160.ec2.internal   Ready    worker   4h7m    v1.18.0-rc.1
ip-10-0-157-240.ec2.internal   Ready    worker   4h7m    v1.18.0-rc.1
ip-10-0-159-40.ec2.internal    Ready    master   4h13m   v1.18.0-rc.1
ip-10-0-172-230.ec2.internal   Ready    master   4h13m   v1.18.0-rc.1
ip-10-0-175-163.ec2.internal   Ready    worker   4h7m    v1.18.0-rc.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409