Created attachment 1554651 [details]
Invalid Machine Config

Description of problem:
We use a client to apply our machine configurations and to verify that the configs have been applied successfully. During a recent install, we unintentionally applied an invalid configuration (missing Ignition version) to a brand-new cluster, and our client code eventually returned successfully. The specific configuration we were trying to apply is our SSH keys for access to the cluster nodes. When we were not able to access the cluster as ourselves, we began to investigate why. The MCO only logged that the configuration we applied was invalid and silently moved on.

Version-Release number of selected component (if applicable):
Client Version: version.Info{Major:"4", Minor:"0+", GitVersion:"v4.0.22", GitCommit:"509916ce1", GitTreeState:"", BuildDate:"2019-03-28T17:17:29Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4+0ba401e", GitCommit:"0ba401e", GitTreeState:"clean", BuildDate:"2019-03-31T22:28:12Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
For "invalid config version", 100%

Steps to Reproduce:
1. Create a v4.0 cluster
2. Apply a machine config that is missing the "ignition config version" field

Actual results:
The invalid configuration is ignored by the operator daemon and the config is not applied to the cluster.

Expected results:
These types of failures should be bubbled up to the "clusteroperators" status as "FAILING = True". This at least provides some indication that something went wrong and where to start looking for the problem.

Additional info:
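For reference, a minimal sketch of the kind of invalid MachineConfig that triggers this (the name matches the one in this report, but the key material is a placeholder; the defect is that spec.config.ignition carries no version field):

```yaml
# Hypothetical MachineConfig carrying SSH keys; the Ignition version is
# missing, so the rendered config cannot be parsed.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: managed-ssh-keys-master
spec:
  config:
    ignition: {}          # defect: should contain "version: 2.2.0"
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3... admin@example.com
```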
This BZ has a 3.11 version set? Anyway, we don't need to set Failing=True in the clusteroperator; the operator, the MCO, is working just fine. MachineConfigs are a per-pool piece of the whole MCO, and a bad MachineConfig shouldn't result in the whole operator going Failing.

When you have a bad MachineConfig, the first thing you should always check is:

oc get machineconfigpools

That will tell you whether the pool is progressing towards a configuration that includes your MachineConfig (it will stay Updating if there's something wrong and you need to take action).

For the exact reason, you should usually look at the per-node MCD (machine-config-daemon). Grabbing logs from the MCDs will tell you the error. We have a PR in flight which bubbles up the MCD error about a bad MachineConfig to a node annotation for admins to check (and we will get better at that generally).

Bottom line: we need to enhance how we report errors on bad MachineConfigs, but not flip the whole operator to Failing=True. (Also, removing the bad MachineConfig will result in a reconcile back to the previous state, which will fix things up.)
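The checks suggested above, as a command sketch (pod names are cluster-specific, and the k8s-app=machine-config-daemon label is an assumption based on the MCO daemonset; verify both against your cluster):

```console
$ oc get machineconfigpools
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod>
```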
We do actually already flip to Failing=True if you apply a bad MachineConfig on masters (but not on workers).
Sorry, this is for OCP 4.1. We apply the same configuration for all the nodes (masters and workers).
(In reply to brad.williams from comment #3)
> Sorry, this is for OCP 4.1.
>
> We apply the same configuration for all the nodes (masters and workers).

If masters go down, you should have an error bubbling up in the clusteroperator already. BTW, we're enhancing this here: https://github.com/openshift/machine-config-operator/pull/597
Also, just pointing out that managing SSH keys is done through https://github.com/openshift/machine-config-operator/blob/master/docs/Update-SSHKeys.md and not via raw MachineConfigs.
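For comparison with the invalid config above, a sketch of the supported shape per the linked Update-SSHKeys doc (the name follows the 99-<role>-ssh convention mentioned later in this thread; the key value is a placeholder, and the Ignition version is assumed to be 2.2.0 as elsewhere in this report):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-ssh
spec:
  config:
    ignition:
      version: 2.2.0       # the field missing from the invalid config
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3... admin@example.com
```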
Thanks for the link to the PR and the SSHKeys doc.

Based on your comments above, I manually ran our updates against another cluster and here are my findings...

I applied the invalid master config (missing Ignition version) and the only indication that it failed was in the controller log. None of the calls to "machineconfigpools", "machineconfigs", or "clusteroperators" gave any indication that they even attempted to apply a configuration or that there was any type of failure.

$ oc apply -f master-bad.yaml
machineconfig.machineconfiguration.openshift.io/managed-ssh-keys-master created

$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING
master   rendered-master-80ce547eac313139b113203257f682bb   True      False
worker   rendered-worker-ec502c465df911064dfdda3d6904771f   True      False

$ oc get machineconfigs
NAME                                                        GENERATEDBYCONTROLLER       IGNITIONVERSION   CREATED
00-master                                                   4.0.22-201904011459-dirty   2.2.0             3d
00-worker                                                   4.0.22-201904011459-dirty   2.2.0             3d
01-master-container-runtime                                 4.0.22-201904011459-dirty   2.2.0             3d
01-master-kubelet                                           4.0.22-201904011459-dirty   2.2.0             3d
01-worker-container-runtime                                 4.0.22-201904011459-dirty   2.2.0             3d
01-worker-kubelet                                           4.0.22-201904011459-dirty   2.2.0             3d
99-master-edf60ffa-5d3c-11e9-81f3-029c8ab2a61c-registries   4.0.22-201904011459-dirty   2.2.0             3d
99-master-ssh                                                                           2.2.0             3d
99-worker-edf88b89-5d3c-11e9-81f3-029c8ab2a61c-registries   4.0.22-201904011459-dirty   2.2.0             3d
99-worker-ssh                                                                           2.2.0             3d
managed-ssh-keys-master                                                                                   65s
rendered-master-3e70eeafed7430563737ca2a16dc9b67            4.0.22-201904011459-dirty   2.2.0             3d
rendered-master-80ce547eac313139b113203257f682bb            4.0.22-201904011459-dirty   2.2.0             3d
rendered-worker-ec502c465df911064dfdda3d6904771f            4.0.22-201904011459-dirty   2.2.0             3d
rendered-worker-f1bd69edc8339bfa1b7ca8e707245994            4.0.22-201904011459-dirty   2.2.0             3d

$ oc get clusteroperators
NAME                                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.0.0-0.9   True        False         False     117s
cloud-credential                     4.0.0-0.9   True        False         False     3d
cluster-autoscaler                   4.0.0-0.9   True        False         False     3d
console                              4.0.0-0.9   True        False         False     23m
dns                                  4.0.0-0.9   True        False         False     3d
image-registry                       4.0.0-0.9   True        False         False     21m
ingress                              4.0.0-0.9   True        False         False     3d
kube-apiserver                       4.0.0-0.9   True        False         False     20m
kube-controller-manager              4.0.0-0.9   True        False         False     19m
kube-scheduler                       4.0.0-0.9   True        False         False     20m
machine-api                          4.0.0-0.9   True        False         False     3d
machine-config                       4.0.0-0.9   True        False         False     20m
marketplace                          4.0.0-0.9   True        False         False     20m
monitoring                           4.0.0-0.9   True        False         False     16m
network                              4.0.0-0.9   True        False         False     3d
node-tuning                          4.0.0-0.9   True        False         False     3d
openshift-apiserver                  4.0.0-0.9   True        False         False     18m
openshift-controller-manager         4.0.0-0.9   True        False         False     21m
openshift-samples                    4.0.0-0.9   True        False         False     3d
operator-lifecycle-manager           4.0.0-0.9   True        False         False     3d
operator-lifecycle-manager-catalog   4.0.0-0.9   True        False         False     3d
service-ca                           4.0.0-0.9   True        False         False     20m
service-catalog-apiserver            4.0.0-0.9   True        False         False     19m
service-catalog-controller-manager   4.0.0-0.9   True        False         False     21m
storage                              4.0.0-0.9   True        False         False     3d

$ oc logs -f machine-config-controller-5f78744567-hfnw2
<SNIP>
I0415 16:50:38.560147       1 render_controller.go:380] Error syncing machineconfigpool master: machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
I0415 16:51:19.520563       1 render_controller.go:380] Error syncing machineconfigpool master: machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
E0415 16:52:41.440942       1 render_controller.go:385] machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
I0415 16:52:41.440974       1 render_controller.go:386] Dropping machineconfigpool "master" out of the queue: machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
</SNIP>
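Since our client returned success even though only the controller log recorded the failure, one mitigation on our side is a client-side pre-check before applying. A minimal sketch in Python (the helper name and field walk are ours, based on the MachineConfig shape used in this report; this is not MCO code):

```python
def has_ignition_version(machine_config: dict) -> bool:
    """Return True if spec.config.ignition.version is present and non-empty."""
    ignition = (
        machine_config.get("spec", {})
                      .get("config", {})
                      .get("ignition") or {}   # "or {}" handles "ignition:" parsed as None
    )
    return bool(ignition.get("version"))

# A config like managed-ssh-keys-master above, with the version absent:
bad = {"spec": {"config": {"ignition": {}}}}
good = {"spec": {"config": {"ignition": {"version": "2.2.0"}}}}

assert not has_ignition_version(bad)
assert has_ignition_version(good)
```

A client would run this on the parsed YAML and refuse to apply (or at least warn) when the check fails.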
Yeah, I realized that; that's a rendering issue and we should bubble that up, I guess.
After a quick discussion about this, I feel like the operator should go Failing, because e.g. OS updates won't be applied either, since we'll fail to render the new config.
(In reply to Colin Walters from comment #8)
> In a quick discussion about this I feel like the operator should go failing,
> because e.g. OS updates won't be applied either because we'll fail to render
> the new config.

This is not as easy as it sounds, though. We can't just flip the operator's Failing to True from the render_controller; Failing follows its own logic in the operator code. We need a dedicated sync function where the operator checks (like the current ones we have today). But we can't really rely on the MCP status if we add Degraded back. I'll think more about this...
Alrighty, figured it out: we can still rely on the Degraded state on MCPs. The thing with flipping the operator to Failing=True is only valid, as it is today, for the master MCP though. We won't flip the operator to Failing=True if the worker pool can't render due to a bad MachineConfig. I believe we all agree on that, right?
The PR has been merged, and we now bubble up errors to the MachineConfigPool (and also to the operator if it's the master pool).
Using the following release payload:

$ oc adm release info
Name:      4.1.0-0.okd-2019-05-07-124355
Digest:    sha256:52168017b3530f38e29dae2de1f3cd165406660a4c6ef9030bdfa5c610ae0cd0
Created:   2019-05-07T12:44:03Z
OS/Arch:   linux/amd64
Manifests: 289

Pull From: registry.svc.ci.openshift.org/origin/release@sha256:52168017b3530f38e29dae2de1f3cd165406660a4c6ef9030bdfa5c610ae0cd0

Release Metadata:
  Version:  4.1.0-0.okd-2019-05-07-124355
  Upgrades: <none>

Component Versions:
  Kubernetes 1.13.4
...

I took the example `chrony.conf` from the upstream MCO repo (https://github.com/openshift/machine-config-operator/blob/master/docs/README.md) and removed the Ignition `version` field:

$ cat -p ~/Documents/faulty-machineconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-examplecorp-chrony
spec:
  config:
    ignition:
    storage:
      files:
      - contents:
          source: data:,server%20foo.example.net%20maxdelay%200.4%20offline%0Aserver%20bar.example.net%20maxdelay%200.4%20offline%0Aserver%20baz.example.net%20maxdelay%200.4%20offline
        filesystem: root
        mode: 0644
        path: /etc/chrony.conf

Applied the MachineConfig and checked the MachineConfigPool and ClusterOperator; both showed that the supplied Ignition config failed to render:

$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-5bd1781a83bfcfaab496d807776058ad   True      False      False
worker   rendered-worker-a854b8292232473efd04c4c670778147   True      False      True

$ oc describe machineconfigpool worker
Name:         worker
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-05-07T14:17:09Z
  Generation:          1
  Resource Version:    50736
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 cca0e719-70d2-11e9-b5bc-0200ffef4618
Spec:
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Max Unavailable:  <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2019-05-07T14:17:46Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-05-07T14:22:53Z
    Message:
    Reason:                All nodes are updated with rendered-worker-a854b8292232473efd04c4c670778147
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-05-07T14:22:53Z
    Message:
    Reason:
    Status:                False
    Type:                  Updating
    Last Transition Time:  2019-05-07T16:51:25Z
    Message:
    Reason:                Failed to render configuration for pool worker: machine config: 50-examplecorp-chrony contains invalid ignition config: error: invalid config version (couldn't parse)
    Status:                True
    Type:                  RenderDegraded
    Last Transition Time:  2019-05-07T16:51:30Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
...

$ oc describe clusteroperator/machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-05-07T14:17:08Z
  Generation:          1
  Resource Version:    50722
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 cc952453-70d2-11e9-b5bc-0200ffef4618
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-05-07T14:18:06Z
    Message:               Cluster has deployed 4.1.0-0.okd-2019-05-07-124355
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-05-07T14:18:06Z
    Message:               Cluster version is 4.1.0-0.okd-2019-05-07-124355
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-05-07T14:17:08Z
    Status:                False
    Type:                  Degraded
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-5bd1781a83bfcfaab496d807776058ad
    Worker:  pool is degraded because rendering fails with "Failed to render configuration for pool worker: machine config: 50-examplecorp-chrony contains invalid ignition config: error: invalid config version (couldn't parse)"
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.1.0-0.okd-2019-05-07-124355
Events:  <none>
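With the conditions shown above, a script can now detect this failure mode from the pool status instead of scraping controller logs; the jsonpath expression below is a sketch targeting the RenderDegraded condition, and the output shown matches the degraded worker pool in this verification:

```console
$ oc get machineconfigpool worker \
    -o jsonpath='{.status.conditions[?(@.type=="RenderDegraded")].status}'
True
```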
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758