Bug 1829642
| Summary: | 4.4 MachineSet with 4.2 or earlier bootimages fails to scale up because old CRI-O chokes on new CRI-O config |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | RHCOS |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | urgent |
| Version: | 4.4 |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Mike Fiedler <mifiedle> |
| Assignee: | Colin Walters <walters> |
| QA Contact: | Michael Nguyen <mnguyen> |
| CC: | bbreard, choag, imcleod, jhou, jligon, jupierce, lmohanty, mpatel, nstielau, scuppett, sdodson, smilner, walters, wking |
| Keywords: | Upgrades |
| Type: | Bug |
| Bug Blocks: | 1830102 (view as bug list) |
| Last Closed: | 2020-07-13 17:32:59 UTC |
Description (Mike Fiedler, 2020-04-30 00:26:47 UTC):

Unfortunately, this is yet another case where we want https://github.com/openshift/machine-config-operator/issues/1365 - the console logs right now just don't show a lot. But that's something we need to fix on the RHCOS side too. I looked at your must-gather, but we need the systemd journals from the workers.

Created attachment 1683160 [details]: journal from system not joining the cluster

Apr 30 00:03:10 mffied-8v28p-w-c-z9vll.c.openshift-qe.internal crio[9281]: time="2020-04-30 00:03:10.046153145Z" level=fatal msg="config validation: invalid runtime_path for runtime 'runc': \"stat : no such file or directory\""

Created attachment 1683161 [details]: /etc/crio/crio.conf from same system
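The failing stanza is of roughly this shape (a sketch of the newer, 1.16+/1.17-era crio.conf layout, not the exact attached file; the `runtime_type` value is an assumption): the per-runtime table carries an empty `runtime_path`, which the older binary tries to stat and rejects.

```toml
# Sketch of the newer per-runtime table in /etc/crio/crio.conf.
# An empty runtime_path causes older CRI-O to stat "" and fail with
# "invalid runtime_path for runtime 'runc'".
[crio.runtime.runtimes.runc]
runtime_path = ""
runtime_type = "oci"
```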
It appears that we are starting crio 1.14, which is failing on the newer config. When we update a cluster from 4.2 to 4.3 to 4.4 and then scale up, the machineset is still pointing to the 4.2 boot image. This results in crio 1.14 (on 4.2) trying to parse a crio 1.17 (4.4) crio.conf and failing.

We tested a workaround where we modified the machineset image to point to a newer boot image, and then we were able to successfully scale up, with the nodes coming up.

One possible issue is that we regressed in pivot (or there was a disconnect in understanding): we always expect to start with a new crio pulled from the latest machine-os-content, or we were just lucky so far in that older crio versions were able to understand the newer configs.

There is an enhancement open for supporting updating boot images as well: https://github.com/openshift/enhancements/pull/201/files

Tagging as an UpgradeBlocker, because we really want to keep folks with born-in-4.2-or-earlier clusters from moving up to 4.4 and then having to manually rotate bootimages if they want to create new machines.

Reproducible upgrading from 4.2.29 -> 4.3.18 -> 4.4.0
Not reproducible on 4.2.29 -> 4.3.18
Not reproducible on 4.3.18 -> 4.4.0

Workaround steps/outline:
1) Successfully upgrade to 4.4 from a cluster which was originally installed as 4.1 or 4.2
2) Identify and copy the RHCOS 4.4 image/AMI (e.g. for AWS: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#optional-create-encrypted-amis)
3) Update each MachineSet with the newer RHCOS 4.4 image reference

With the upgrade issues resolved in 4.4.0-0.nightly-2020-04-30-145451, this now seems to be 100% reproducible (7 for 7 for clusters born on 4.2.x).

I think https://github.com/openshift/machine-config-operator/pull/1706 is likely to work for this, but it needs testing.

Moving back to 4.5, as we want to target a different fix for this in the z-stream.

Verified on 4.5.0-0.nightly-2020-05-05-205255.
I had to update from 4.2.29 -> 4.3.18 -> 4.4.0-rc13 -> 4.5.0-0.nightly-2020-05-05-205255 to get the system into the right state, and verified that scaleup works with the fix. The easiest approach would have been just to modify the machineset to point to a 4.2 boot image (like Colin said), but I wanted to make sure that the problem would be resolved through updates.
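Step 3 of the workaround above can be sketched with `oc patch` (the machineset name and AMI ID below are placeholders, not values from this bug; the `oc` command itself is left commented):

```shell
# Placeholders: substitute your machineset name and the RHCOS 4.4 AMI for your region.
MACHINESET=my-cluster-worker-us-east-1a
NEW_AMI=ami-00000000000000000

# Build a merge patch that only replaces providerSpec.value.ami.id.
PATCH=$(printf '{"spec":{"template":{"spec":{"providerSpec":{"value":{"ami":{"id":"%s"}}}}}}}' "$NEW_AMI")
echo "$PATCH"

# Then apply it (hypothetical usage, not run here):
# oc -n openshift-machine-api patch machineset/"$MACHINESET" --type merge -p "$PATCH"
```

Note that only newly created Machines pick up the new AMI; existing nodes are unaffected by the patch.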
$ oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2020-05-06T13:11:36Z
  Generation:          5
  Resource Version:    127734
  Self Link:           /apis/config.openshift.io/v1/clusterversions/version
  UID:                 1d9375be-8f9b-11ea-8b1a-0a09ab9c60e7
Spec:
  Channel:     stable-4.3
  Cluster ID:  ac4727b2-fbf1-4c01-bf15-9c2e48128e16
  Desired Update:
    Force:    true
    Image:    registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-05-05-205255
    Version:
  Upstream:  https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2020-05-06T13:30:10Z
    Message:               Done applying 4.5.0-0.nightly-2020-05-05-205255
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-06T16:58:01Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2020-05-06T16:58:16Z
    Message:               Cluster version is 4.5.0-0.nightly-2020-05-05-205255
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-05-06T15:18:00Z
    Message:               Unable to retrieve available updates: currently installed version 4.5.0-0.nightly-2020-05-05-205255 not found in the "stable-4.3" channel
    Reason:                VersionNotFound
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Force:    true
    Image:    registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-05-05-205255
    Version:  4.5.0-0.nightly-2020-05-05-205255
  History:
    Completion Time:  2020-05-06T16:58:16Z
    Image:            registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-05-05-205255
    Started Time:     2020-05-06T16:16:35Z
    State:            Completed
    Verified:         false
    Version:          4.5.0-0.nightly-2020-05-05-205255
    Completion Time:  2020-05-06T16:03:21Z
    Image:            quay.io/openshift-release-dev/ocp-release:4.4.0-rc.13-x86_64
    Started Time:     2020-05-06T15:17:44Z
    State:            Completed
    Verified:         false
    Version:          4.4.0-rc.13
    Completion Time:  2020-05-06T14:56:29Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:1f0fd38ac0640646ab8e7fec6821c8928341ad93ac5ca3a48c513ab1fb63bc4b
    Started Time:     2020-05-06T14:13:55Z
    State:            Completed
    Verified:         true
    Version:          4.3.18
    Completion Time:  2020-05-06T13:30:10Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:3bff53ce2202ec59ed87581106b05f364fea0e7459f5806e4dc6e5129f130b36
    Started Time:     2020-05-06T13:11:51Z
    State:            Completed
    Verified:         false
    Version:          4.2.29
  Observed Generation:  5
  Version Hash:         Eh-gywWFNgQ=
Events:  <none>
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.0-0.nightly-2020-05-05-205255 True False 6m15s Cluster version is 4.5.0-0.nightly-2020-05-05-205255
$ oc -n openshift-machine-api get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
mnguyen4229to45-9hcvx-worker-us-east-1a 1 1 1 1 3h56m
mnguyen4229to45-9hcvx-worker-us-east-1b 1 1 1 1 3h56m
mnguyen4229to45-9hcvx-worker-us-east-1c 1 1 1 1 3h56m
mnguyen4229to45-9hcvx-worker-us-east-1d 0 0 3h56m
mnguyen4229to45-9hcvx-worker-us-east-1e 0 0 3h56m
mnguyen4229to45-9hcvx-worker-us-east-1f 0 0 3h56m
$ oc -n openshift-machine-api get machineset/mnguyen4229to45-9hcvx-worker-us-east-1a -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "8192"
    machine.openshift.io/vCPU: "2"
  creationTimestamp: "2020-05-06T13:12:32Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
  name: mnguyen4229to45-9hcvx-worker-us-east-1a
  namespace: openshift-machine-api
  resourceVersion: "118060"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/mnguyen4229to45-9hcvx-worker-us-east-1a
  uid: 3f12baaa-8f9b-11ea-8b1a-0a09ab9c60e7
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
      machine.openshift.io/cluster-api-machineset: mnguyen4229to45-9hcvx-worker-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: mnguyen4229to45-9hcvx-worker-us-east-1a
    spec:
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-01e7fdcb66157b224
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              iops: 0
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: mnguyen4229to45-9hcvx-worker-profile
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
          publicIp: null
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - mnguyen4229to45-9hcvx-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - mnguyen4229to45-9hcvx-private-us-east-1a
          tags:
          - name: kubernetes.io/cluster/mnguyen4229to45-9hcvx
            value: owned
          userDataSecret:
            name: worker-user-data
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
== NOTE: ami-01e7fdcb66157b224 = 4.2 AMI https://github.com/openshift/installer/blob/release-4.2/data/data/rhcos.json ==
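To spot machinesets still pinned to the old boot image, the AMI can be pulled out of each MachineSet's providerSpec. A sketch (the live `oc` query is commented out; an example output line is substituted for illustration):

```shell
OLD_AMI=ami-01e7fdcb66157b224   # the 4.2 boot image AMI noted above

# On a live cluster the list would come from (hypothetical usage, not run here):
#   oc -n openshift-machine-api get machinesets -o \
#     jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.template.spec.providerSpec.value.ami.id}{"\n"}{end}'
# Example output line substituted for illustration:
machinesets='mnguyen4229to45-9hcvx-worker-us-east-1a ami-01e7fdcb66157b224'

# Print the name of every machineset still pinned to the old AMI.
stale=$(printf '%s\n' "$machinesets" | awk -v ami="$OLD_AMI" '$2 == ami {print $1}')
echo "$stale"
```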
$ oc -n openshift-machine-api scale --replicas=2 machineset/mnguyen4229to45-9hcvx-worker-us-east-1a
machineset.machine.openshift.io/mnguyen4229to45-9hcvx-worker-us-east-1a scaled
$ oc -n openshift-machine-api get machineset/mnguyen4229to45-9hcvx-worker-us-east-1a
NAME DESIRED CURRENT READY AVAILABLE AGE
mnguyen4229to45-9hcvx-worker-us-east-1a 2 2 1 1 4h8m
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-15.ec2.internal Ready master 4h8m v1.18.0-rc.1
ip-10-0-141-160.ec2.internal Ready worker 4h2m v1.18.0-rc.1
ip-10-0-157-240.ec2.internal Ready worker 4h2m v1.18.0-rc.1
ip-10-0-159-40.ec2.internal Ready master 4h8m v1.18.0-rc.1
ip-10-0-172-230.ec2.internal Ready master 4h8m v1.18.0-rc.1
ip-10-0-175-163.ec2.internal Ready worker 4h2m v1.18.0-rc.1
$ oc -n openshift-machine-api get machines
NAME PHASE TYPE REGION ZONE AGE
mnguyen4229to45-9hcvx-master-0 Running m4.xlarge us-east-1 us-east-1a 4h8m
mnguyen4229to45-9hcvx-master-1 Running m4.xlarge us-east-1 us-east-1b 4h8m
mnguyen4229to45-9hcvx-master-2 Running m4.xlarge us-east-1 us-east-1c 4h8m
mnguyen4229to45-9hcvx-worker-us-east-1a-6wt2g Running m4.large us-east-1 us-east-1a 4h6m
mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j Provisioned m4.large us-east-1 us-east-1a 55s
mnguyen4229to45-9hcvx-worker-us-east-1b-j597b Running m4.large us-east-1 us-east-1b 4h6m
mnguyen4229to45-9hcvx-worker-us-east-1c-9fhhl Running m4.large us-east-1 us-east-1c 4h6m
$ oc -n openshift-machine-api get machines
NAME PHASE TYPE REGION ZONE AGE
mnguyen4229to45-9hcvx-master-0 Running m4.xlarge us-east-1 us-east-1a 4h12m
mnguyen4229to45-9hcvx-master-1 Running m4.xlarge us-east-1 us-east-1b 4h12m
mnguyen4229to45-9hcvx-master-2 Running m4.xlarge us-east-1 us-east-1c 4h12m
mnguyen4229to45-9hcvx-worker-us-east-1a-6wt2g Running m4.large us-east-1 us-east-1a 4h10m
mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j Running m4.large us-east-1 us-east-1a 4m47s
mnguyen4229to45-9hcvx-worker-us-east-1b-j597b Running m4.large us-east-1 us-east-1b 4h10m
mnguyen4229to45-9hcvx-worker-us-east-1c-9fhhl Running m4.large us-east-1 us-east-1c 4h10m
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-15.ec2.internal Ready master 4h12m v1.18.0-rc.1
ip-10-0-136-255.ec2.internal NotReady worker 39s v1.18.0-rc.1
ip-10-0-141-160.ec2.internal Ready worker 4h6m v1.18.0-rc.1
ip-10-0-157-240.ec2.internal Ready worker 4h6m v1.18.0-rc.1
ip-10-0-159-40.ec2.internal Ready master 4h12m v1.18.0-rc.1
ip-10-0-172-230.ec2.internal Ready master 4h12m v1.18.0-rc.1
ip-10-0-175-163.ec2.internal Ready worker 4h6m v1.18.0-rc.1
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-15.ec2.internal Ready master 4h12m v1.18.0-rc.1
ip-10-0-136-255.ec2.internal Ready worker 76s v1.18.0-rc.1
ip-10-0-141-160.ec2.internal Ready worker 4h7m v1.18.0-rc.1
ip-10-0-157-240.ec2.internal Ready worker 4h7m v1.18.0-rc.1
ip-10-0-159-40.ec2.internal Ready master 4h12m v1.18.0-rc.1
ip-10-0-172-230.ec2.internal Ready master 4h12m v1.18.0-rc.1
ip-10-0-175-163.ec2.internal Ready worker 4h7m v1.18.0-rc.1
$ oc -n openshift-machine-api get machines/mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: running
  creationTimestamp: "2020-05-06T17:20:00Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: mnguyen4229to45-9hcvx-worker-us-east-1a-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: mnguyen4229to45-9hcvx
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: mnguyen4229to45-9hcvx-worker-us-east-1a
    machine.openshift.io/instance-type: m4.large
    machine.openshift.io/region: us-east-1
    machine.openshift.io/zone: us-east-1a
  name: mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: mnguyen4229to45-9hcvx-worker-us-east-1a
    uid: 3f12baaa-8f9b-11ea-8b1a-0a09ab9c60e7
  resourceVersion: "135908"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/mnguyen4229to45-9hcvx-worker-us-east-1a-fdc7j
  uid: 3375313b-8333-4025-9541-e4be1f32ad14
spec:
  metadata: {}
  providerID: aws:///us-east-1a/i-08ae6c4c2277d254a
  providerSpec:
    value:
      ami:
        id: ami-01e7fdcb66157b224
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: mnguyen4229to45-9hcvx-worker-profile
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-1a
        region: us-east-1
      publicIp: null
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - mnguyen4229to45-9hcvx-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - mnguyen4229to45-9hcvx-private-us-east-1a
      tags:
      - name: kubernetes.io/cluster/mnguyen4229to45-9hcvx
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: 10.0.136.255
    type: InternalIP
  - address: ip-10-0-136-255.ec2.internal
    type: InternalDNS
  - address: ip-10-0-136-255.ec2.internal
    type: Hostname
  lastUpdated: "2020-05-06T17:24:57Z"
  nodeRef:
    kind: Node
    name: ip-10-0-136-255.ec2.internal
    uid: a8337740-da48-493e-8912-d30afdde4410
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2020-05-06T17:20:02Z"
      lastTransitionTime: "2020-05-06T17:20:02Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-08ae6c4c2277d254a
    instanceState: running
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-15.ec2.internal Ready master 4h13m v1.18.0-rc.1
ip-10-0-136-255.ec2.internal Ready worker 2m4s v1.18.0-rc.1
ip-10-0-141-160.ec2.internal Ready worker 4h7m v1.18.0-rc.1
ip-10-0-157-240.ec2.internal Ready worker 4h7m v1.18.0-rc.1
ip-10-0-159-40.ec2.internal Ready master 4h13m v1.18.0-rc.1
ip-10-0-172-230.ec2.internal Ready master 4h13m v1.18.0-rc.1
ip-10-0-175-163.ec2.internal Ready worker 4h7m v1.18.0-rc.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409