Created attachment 1682663 [details]
cluster autoscaler logs

Description of problem:

Something occurs in the cluster autoscaler (it looks like a very quick addition of N nodes, then a quick removal of those nodes) that causes the MachineWithNoRunningPhase and MachineWithoutValidNode alerts to fire.

See https://coreos.slack.com/archives/CHY2E1BL4/p1588119655188100

Version-Release number of selected component (if applicable):

$ oc --context build01 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-rc.12   True        False         25h     Cluster version is 4.4.0-rc.12

Cluster autoscaler logs attached.
I've had a look through the logs to try and understand what is going on here. I saw the same thing happen three times in the logs:

- A bunch of pods get scheduled and become "unschedulable"
- The autoscaler detects the unschedulable pods and scales up one or more MachineSets
- In doing so, we start seeing "x unregistered nodes present", which means the autoscaler is expecting that many nodes to join the cluster; it maps them to Nodes based on the providerID fields
- After 15 minutes, the nodes the CA was expecting to see still haven't been registered with the K8s API
- This means the scale up failed
- Because no nodes have joined, these unregistered nodes are scaled back down

I don't see anything jumping out as wrong within the CA logs you've provided; the IDs it is marking as unregistered look to be in the correct format, for instance. I'm going to suggest that the issue isn't within the CA, but rather in the process of Machines becoming Nodes.

Since this seems fairly reproducible (create some unschedulable pods), could we try to reproduce this and watch to see (a sketch of the commands to gather this is below):

- Do EC2 instances get created and do they start running?
- Do the Machines become Provisioned/Running? Are there any errors in the status?
- Do any Nodes join the cluster during this period? If so, do they have their providerID set to the expected format `aws:///<az>/<instance-id>`?
- Are the machine-api controllers happy? In particular the machine and nodelink controllers may be having issues.

If you can provide more info/logs around the components I've mentioned above, we may be able to narrow down where the issue is.
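Roughly the kind of commands that would answer those questions (container names are from memory and may differ slightly between releases):

$ oc get machines -n openshift-machine-api -o wide
$ oc describe machine <machine-name> -n openshift-machine-api    # look for errorReason/errorMessage in the status
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
$ oc logs deployment/machine-api-controllers -n openshift-machine-api -c machine-controller
$ oc logs deployment/machine-api-controllers -n openshift-machine-api -c nodelink-controller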
One possible root cause, which might answer Joel's questions, is hitting cloud rate limits/quotas. This would be visible in the Machine's status and in the machine controller logs.
There are some associated flapping alerts according to the chat logs:

[FIRING:3] MachineWithoutValidNode machine-api-operator (https openshift-machine-api 10.130.64.27:8443 openshift-machine-api Provisioned machine-api-operator-7d48cbf7f-tfctb openshift-monitoring/k8s machine-api-operator critical) 8:25
[FIRING:7] MachineWithNoRunningPhase machine-api-operator (https openshift-machine-api 10.130.64.27:8443 openshift-machine-api Provisioned machine-api-operator-7d48cbf7f-tfctb openshift-monitoring/k8s machine-api-operator critical)

NoRunningPhase implies the machine never got networking and was not actually provisioned:
https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-instance-lifecycle.md#running

I agree, API quota is a likely suspect, and/or a bad configuration on the MachineSet resulting in a machine AWS won't accept.
Created attachment 1682952 [details]
machine-healthcheck-controller log
As a follow-on for our team, regardless of the bug outcome here, we need to make sure we're properly aligned on the timeouts for the autoscaler, alerting, and MHC, both in docs and in default values. These were described as 'flapping', indicating that the system is alerting and taking automated action without leaving any window of time for investigation by the administrator.

Also as a follow-on, maybe we should emit an event when cleaning up failed machines and capture that info. Maybe an alert if the autoscaler cleans up more than X failed machines in some time period?

I'm not sure what to do about the alerts, because it seems we're already somewhat creating an alert storm for a single condition: we have 'MachineWithNoRunningPhase' firing at the same time as 'MachineWithoutValidNode'. Perhaps we need additional filtering on MachineWithoutValidNode so it does not include Failed machines and machines with no Running phase.
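A very rough sketch of what that extra filtering could look like; the metric and label names here (mapi_machine_created_timestamp_seconds, phase, node) are assumptions from memory, not the shipped alert expressions:

# Sketch only; names are assumptions, not the current alert definitions.
# Idea: only alert on machines that claim to be Running but have no matching
# Node, and leave Failed / not-yet-Running machines to MachineWithNoRunningPhase.
(
  mapi_machine_created_timestamp_seconds{phase="Running"}
    unless on(node) kube_node_info
) > 0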
Created attachment 1682954 [details]
machine-api-controllers log
Is part of this that the nodes are joining the cluster slowly?
(In reply to Colin Walters from comment #7)
> Is part of this that the nodes are joining the cluster slowly?

No, the nodes never join the cluster. We can see in the logs that they requested the proper MCP from the MCS, but no CSRs are ever generated.

The console shows the old 4.3 CoreOS version where it should be 4.4 based on the cluster version; only 4.3 hosts are able to join (other MachineSets are still getting new machines at 4.3 and those work, even though they are N-1).
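For anyone retracing this, the checks behind that answer are just the standard ones (the -o wide columns vary a bit by release):

$ oc get csr                                           # no new Pending/Approved CSRs appear for the scaled-up machines
$ oc get nodes -o wide                                 # OS image / kubelet version per node, still showing 4.3 content
$ oc get machines -n openshift-machine-api -o wide     # new machines stuck in Provisioned with no Node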
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from 4.3 to 4.4. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  Example: Customers upgrading from 4.3.z to 4.4.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet (do we know how widely autoscaling is deployed in 4.3?).
  Example: All customers upgrading from 4.3.z to 4.4.z fail approximately 10% of the time.

What is the impact? Is it serious enough to warrant blocking edges?
  Example: No further autoscaling occurs, or some such.
  Example: Up to 90 seconds of API downtime.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Example: Issue resolves itself after five minutes.
  Example: Autoscaling is broken, and admins have to manually scale the cluster until they can update to a later release with a fix.
  Example: Admin must SSH to hosts, restore from backups, or other non-standard admin activity.
From what I see, this has nothing to do with upgrades. The only question that needs to be asked here is: is this an issue we would block 4.4 GA for (what customer clusters will be impacted, what is the impact, is there a workaround)? If we don't want customers upgrading to it, we don't want customers installing it fresh either, and vice versa.

Upgrade questions should be constrained to bugs/cases where we think the bug is specifically triggered by performing an upgrade, not something a new install could also hit. I don't see a reason to think that is the case here.

Sounds like we need some more analysis from the cloud compute team to determine what the risk/impact is if this ships as is. What other cluster autoscaler testing has been done on AWS?
Why this should be a 4.4.0 GA blocker: our internal CI cluster upgraded from a 4.3 release to the latest 4.4 RC. We don't yet have a root cause. The upgrade reported as complete, but the kubelets are reporting a version behind the masters (1.16 where it should be 1.17). RHEL CoreOS should be 4.4, but the running workers are still on 4.3.

Some MachineSets scale up, get 4.3 RHEL CoreOS, and those machines work. Other MachineSets scale up, get 4.4 RHEL CoreOS, and are not working. In this case, the differentiator is between the installer-created MCP and a customized one: the customized one is getting 4.4 and not working, while the installer-generated one is still getting 4.3 and its new machines join the cluster. Something went really wrong on the upgrade.
> If we don't want customers upgrading to it, we don't want customers installing it fresh either, and vice versa.

Fresh 4.4 installs with broken autoscaling would not see this as a regression. For folks with working 4.3 autoscaling moving to 4.4 and hitting broken autoscaling, the update would be a regression. Without a better understanding of the failure mode, I'm not clear on the regression-ness of this bug.

> Sounds like we need some more analysis from the cloud compute team to determine what the risk/impact is if this ships as is.

Yup, hence the impact-statement request.

> We don't yet have a root cause, upgrade reported as complete but kubelets are reporting a version behind (1.16 should be 1.17) compared to masters.

The machine-config operator does not consider compute Machine(Set)s as update blockers; it levels once the control plane has been updated. Compute follows along behind and should wrap up eventually (possibly after the update has been claimed "complete"). 4.4 components need to be able to handle these remaining 4.3 compute nodes gracefully while that reconciliation happens (which would be the case even if the MCO did block on compute as well).

> Customized one is getting 4.4 and not working, installer generated is getting 4.3 still and the new machines join the cluster.

Can you elaborate on the customizations, and generally unpack this sentence to show example YAML snippets and such that you're summarizing?
Turns out the CRD validation in the MCO is wiping out unknown fields (unknown as in "not specified in the CRD schema"). That can cause all sorts of issues if one part of the config depends on another part that gets silently dropped by the time the MachineConfig is created. There's a PR already for the 4.4 branch; I'll work on that asap.
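For context on the pruning behaviour, a generic illustration (not the MCO's actual CRD): with v1beta1 CRDs that set preserveUnknownFields: false, any field not described in the structural schema is silently dropped on create/update, unless the schema opts out with x-kubernetes-preserve-unknown-fields.

# Generic example of the pruning behaviour, not the real MachineConfig schema.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
spec:
  preserveUnknownFields: false          # fields missing from the schema below get pruned
  validation:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          x-kubernetes-preserve-unknown-fields: true   # opt-out: keep unknown fields under spec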
WIP PR here (I'll adjust BZs tomorrow) https://github.com/openshift/machine-config-operator/pull/1698
The root cause for this is pretty simple. The CI cluster started out as 4.3. We wanted to optimize cost and use *local* NVMe drives, so the DPTP team, Michael, and I collaborated on getting them a custom Ignition config to use RAID for /var/lib/containers: https://github.com/openshift/release/pull/8102

That all worked in 4.3, but the MCO changed in 4.4 to do more CRD validation, and it inadvertently dropped *part* of the RAID config. The kubelet was configured to wait for /var/lib/containers before starting, and since the device never appeared, the kubelet never started.
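To make the failure mode concrete: the custom config pairs the RAID definition with a var-lib-containers.mount unit and orders the kubelet after that mount, roughly like the sketch below (the exact drop-in in the build cluster config may differ; see the linked openshift/release PR for the real thing). With the raid section pruned, /dev/md/containerraid never appears, the mount never starts, and the kubelet sits waiting.

# Hypothetical kubelet drop-in illustrating the ordering dependency; the
# actual unit in the custom MachineConfig may be named/structured differently.
# /etc/systemd/system/kubelet.service.d/10-wait-for-containers.conf
[Unit]
Requires=var-lib-containers.mount
After=var-lib-containers.mount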
Test plan, from Colin in [1]:

> Anyone who wants to test this, it should be sufficient to do oc create on this:
> https://github.com/openshift/release/blob/23b3ddb6b32e8157e9b882172264b8d1b008070c/clusters/build-clusters/01_cluster/machine_config/m5d4x_machineconfig.yaml
>
> Then, notice the storage/disks field is dropped:
>
> $ oc describe machineconfig/m5d4x
> ...
>   storage: {}
>
> After this PR merges you should see the storage section from the submitted object.

[1]: https://github.com/openshift/machine-config-operator/pull/1698#issuecomment-621572770
Verified in 4.4.0-0.nightly-2020-04-30-051505.

Created the machineconfig from https://github.com/openshift/release/blob/23b3ddb6b32e8157e9b882172264b8d1b008070c/clusters/build-clusters/01_cluster/machine_config/m5d4x_machineconfig.yaml; the storage field is not dropped:

oc describe machineconfig m5d4x
Name:         m5d4x
Namespace:
Labels:       machineconfiguration.openshift.io/role=worker-m5d4x
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfig
Metadata:
  Creation Timestamp:  2020-04-30T06:34:32Z
  Generation:          1
  Resource Version:    23139
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigs/m5d4x
  UID:                 878865b9-3117-4328-8564-af43ca5ef337
Spec:
  Config:
    Ignition:
      Version:  2.2.0
    Storage:
      Disks:
        Device:  /dev/nvme1n1
        Partitions:
          Label:     containerraid1
          Number:    0
          Size:      0
          Start:     0
        Wipe Table:  true
        Device:      /dev/nvme2n1
        Partitions:
          Label:     containerraid2
          Number:    0
          Size:      0
          Start:     0
        Wipe Table:  true
      Filesystems:
        Mount:
          Device:  /dev/md/containerraid
          Format:  xfs
          Label:   containers
      Raid:
        Devices:
          /dev/disk/by-partlabel/containerraid1
          /dev/disk/by-partlabel/containerraid2
        Level:  stripe
        Name:   containerraid
    Systemd:
      Units:
        Contents:  [Mount]
                   What=/dev/md/containerraid
                   Where=/var/lib/containers
                   Type=xfs

                   [Install]
                   WantedBy=local-fs.target
        Name:      var-lib-containers.mount
Sorry for comment 22, intended for 1829651
(In reply to Jianwei Hou from comment #22)
> Verified in 4.4.0-0.nightly-2020-04-30-051505
> [...]

Why is this still NEW?
Sorry this is the 4.5 BZ, forget my previous comment.
Verified on 4.5.0-0.nightly-2020-05-05-205255. Spec.Config.Storage is not empty when using the MC.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         4m41s   Cluster version is 4.5.0-0.nightly-2020-05-05-205255

$ cat m5d4x_machineconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-m5d4x
  name: m5d4x
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      disks:
        - device: "/dev/nvme1n1"
          wipeTable: true
          partitions:
            - label: containerraid1
              number: 0
              start: 0
              size: 0
        - device: "/dev/nvme2n1"
          wipeTable: true
          partitions:
            - label: containerraid2
              number: 0
              start: 0
              size: 0
      raid:
        - devices:
            - "/dev/disk/by-partlabel/containerraid1"
            - "/dev/disk/by-partlabel/containerraid2"
          level: stripe
          name: containerraid
      filesystems:
        - mount:
            device: "/dev/md/containerraid"
            format: xfs
            label: containers
    systemd:
      units:
        - name: var-lib-containers.mount
          enable: true
          contents: |-
            [Mount]
            What=/dev/md/containerraid
            Where=/var/lib/containers
            Type=xfs

            [Install]
            WantedBy=local-fs.target

$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
00-worker                                                   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
01-master-container-runtime                                 69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
01-master-kubelet                                           69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
01-worker-container-runtime                                 69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
01-worker-kubelet                                           69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
99-master-de3720e9-b4e9-4fe1-8516-6890fe9d957f-registries   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
99-master-ssh                                                                                          2.2.0             26m
99-worker-4bdd4bc2-7969-4e9d-b611-e657689f9604-registries   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
99-worker-ssh                                                                                          2.2.0             26m
rendered-master-d869522fa55a648dadd5bdad77677406            69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m
rendered-worker-2fe438dc9093f2113443d5e1e815e7bd            69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             19m

$ oc apply -f m5d4x_machineconfig.yaml
machineconfig.machineconfiguration.openshift.io/m5d4x created

$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
00-worker                                                   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
01-master-container-runtime                                 69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
01-master-kubelet                                           69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
01-worker-container-runtime                                 69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
01-worker-kubelet                                           69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
99-master-de3720e9-b4e9-4fe1-8516-6890fe9d957f-registries   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
99-master-ssh                                                                                          2.2.0             27m
99-worker-4bdd4bc2-7969-4e9d-b611-e657689f9604-registries   69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
99-worker-ssh                                                                                          2.2.0             27m
m5d4x                                                                                                  2.2.0             3s
rendered-master-d869522fa55a648dadd5bdad77677406            69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m
rendered-worker-2fe438dc9093f2113443d5e1e815e7bd            69cc7815293c2635b9b35f558bddf52559f50f46   2.2.0             20m

$ oc describe mc/m5d4x
Name:         m5d4x
Namespace:
Labels:       machineconfiguration.openshift.io/role=worker-m5d4x
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","metadata":{"annotations":{},"labels":{"machineconfiguration.o...
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfig
Metadata:
  Creation Timestamp:  2020-05-06T13:41:44Z
  Generation:          1
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/role:
      f:spec:
        .:
        f:config:
          .:
          f:ignition:
            .:
            f:version:
          f:storage:
            .:
            f:disks:
            f:filesystems:
            f:raid:
          f:systemd:
            .:
            f:units:
    Manager:         oc
    Operation:       Update
    Time:            2020-05-06T13:41:44Z
  Resource Version:  24420
  Self Link:         /apis/machineconfiguration.openshift.io/v1/machineconfigs/m5d4x
  UID:               a471bbf8-33d9-47b5-abea-f61c1f7257c9
Spec:
  Config:
    Ignition:
      Version:  2.2.0
    Storage:
      Disks:
        Device:  /dev/nvme1n1
        Partitions:
          Label:     containerraid1
          Number:    0
          Size:      0
          Start:     0
        Wipe Table:  true
        Device:      /dev/nvme2n1
        Partitions:
          Label:     containerraid2
          Number:    0
          Size:      0
          Start:     0
        Wipe Table:  true
      Filesystems:
        Mount:
          Device:  /dev/md/containerraid
          Format:  xfs
          Label:   containers
      Raid:
        Devices:
          /dev/disk/by-partlabel/containerraid1
          /dev/disk/by-partlabel/containerraid2
        Level:  stripe
        Name:   containerraid
    Systemd:
      Units:
        Contents:  [Mount]
                   What=/dev/md/containerraid
                   Where=/var/lib/containers
                   Type=xfs

                   [Install]
                   WantedBy=local-fs.target
        Name:      var-lib-containers.mount
Events:  <none>
Looking for some assistance on how to safely remove these checks on a UPI VMware deployment. Apologies in advance for cross-posting; I found this issue after asking the same question on BZ https://bugzilla.redhat.com/show_bug.cgi?id=1810443

I have the same issue on a fresh cluster running on VMware 6.7. Curious how to safely remove the checks, since there's no machineset controller for VMware?

Client Version: 4.4.3
Server Version: 4.4.3
Kubernetes Version: v1.17.1

oc get machinesets
NAME                DESIRED   CURRENT   READY   AVAILABLE   AGE
ocp4-ctmtp-worker   0         0                             28h

oc get machines -n openshift-machine-api
NAME                  PHASE   TYPE   REGION   ZONE   AGE
ocp4-ctmtp-master-0                                  28h
ocp4-ctmtp-master-1                                  28h
ocp4-ctmtp-master-2                                  28h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.