Created attachment 1554651 [details]
Invalid Machine Config

Description of problem:
We use a client to apply our machine configurations and to verify that the configs have been applied successfully. During a recent install, we unintentionally applied an invalid configuration (missing Ignition version) to a brand-new cluster, and our client code eventually returned successfully. The specific configuration we were trying to apply is our SSH keys for access to the cluster nodes. When we were not able to access the cluster as ourselves, we began to investigate why. The MCO only logged that the configuration we applied was invalid and silently moved on.

Version-Release number of selected component (if applicable):
Client Version: version.Info{Major:"4", Minor:"0+", GitVersion:"v4.0.22", GitCommit:"509916ce1", GitTreeState:"", BuildDate:"2019-03-28T17:17:29Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4+0ba401e", GitCommit:"0ba401e", GitTreeState:"clean", BuildDate:"2019-03-31T22:28:12Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
For "invalid config version", 100%

Steps to Reproduce:
1. Create a v4.0 cluster
2. Apply a machine config that is missing the "ignition config version" field

Actual results:
The invalid configuration is ignored by the operator daemon and the config is not applied to the cluster.

Expected results:
These types of failures should be bubbled up to the "clusteroperators" status as "FAILING = True". This at least provides some indication that something went wrong and where to start looking for the problem.

Additional info:
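For reference, a minimal sketch of the kind of invalid MachineConfig that triggers this (the name matches the one in this report, but the key material is a placeholder; the defect is that spec.config.ignition carries no version field):

```yaml
# Hypothetical MachineConfig carrying SSH keys; the Ignition version is
# missing, so the rendered config cannot be parsed.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: managed-ssh-keys-master
spec:
  config:
    ignition: {}          # defect: should contain "version: 2.2.0"
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3... admin@example.com
```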
This BZ has a 3.11 version set? Anyway, we don't need to set Failing=True in the clusteroperator; the operator, the MCO, is working just fine. MachineConfigs are a per-pool piece of the whole MCO, and a bad MachineConfig shouldn't result in the whole operator going Failing.

When you have a bad MachineConfig, the first thing you should always check is:

oc get machineconfigpools

That will tell you whether the pool is progressing towards a configuration that includes your MachineConfig (it will stay Updating if there's something wrong and you need to take action).

For the exact reason, you should usually look at the per-node MCD (machine-config-daemon). Grabbing logs from the MCDs will tell you the error. We have a PR in flight which bubbles up the MCD error about a bad MachineConfig to a node annotation for admins to check (and we will get better at that generally).

Bottom line: we need to enhance how we report errors on bad MachineConfigs, but not flip the whole operator to Failing=True. (Also, removing the bad MachineConfig will result in a reconcile back to the previous state, which will fix things up.)
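The checks suggested above, as a command sketch (pod names are cluster-specific, and the k8s-app=machine-config-daemon label is an assumption based on the MCO daemonset; verify both against your cluster):

```console
$ oc get machineconfigpools
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod>
```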
We do actually already flip to Failing=True if you apply a bad MachineConfig on masters (but not on workers).
Sorry, this is for OCP 4.1. We apply the same configuration for all the nodes (masters and workers).
(In reply to brad.williams from comment #3)
> Sorry, this is for OCP 4.1.
>
> We apply the same configuration for all the nodes (masters and workers).

If masters go down, you should have an error bubbling up in the clusteroperator already. BTW, we're enhancing this here: https://github.com/openshift/machine-config-operator/pull/597
Also, just pointing out that managing SSH keys is done through https://github.com/openshift/machine-config-operator/blob/master/docs/Update-SSHKeys.md and not via raw MachineConfigs.
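For comparison with the invalid config above, a sketch of the supported shape per the linked Update-SSHKeys doc (the name follows the 99-<role>-ssh convention mentioned later in this thread; the key value is a placeholder, and the Ignition version is assumed to be 2.2.0 as elsewhere in this report):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-ssh
spec:
  config:
    ignition:
      version: 2.2.0       # the field missing from the invalid config
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3... admin@example.com
```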
Thanks for the link to the PR and the SSHKeys doc.

Based on your comments above, I manually ran our updates against another cluster and here are my findings...

I applied the invalid master config (missing Ignition version) and the only indication that it failed was in the controller log. None of the calls to "machineconfigpools", "machineconfigs", or "clusteroperators" gave any indication that they even attempted to apply a configuration or that there was any type of failure.

$ oc apply -f master-bad.yaml
machineconfig.machineconfiguration.openshift.io/managed-ssh-keys-master created

$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING
master   rendered-master-80ce547eac313139b113203257f682bb   True      False
worker   rendered-worker-ec502c465df911064dfdda3d6904771f   True      False

$ oc get machineconfigs
NAME                                                        GENERATEDBYCONTROLLER       IGNITIONVERSION   CREATED
00-master                                                   4.0.22-201904011459-dirty   2.2.0             3d
00-worker                                                   4.0.22-201904011459-dirty   2.2.0             3d
01-master-container-runtime                                 4.0.22-201904011459-dirty   2.2.0             3d
01-master-kubelet                                           4.0.22-201904011459-dirty   2.2.0             3d
01-worker-container-runtime                                 4.0.22-201904011459-dirty   2.2.0             3d
01-worker-kubelet                                           4.0.22-201904011459-dirty   2.2.0             3d
99-master-edf60ffa-5d3c-11e9-81f3-029c8ab2a61c-registries   4.0.22-201904011459-dirty   2.2.0             3d
99-master-ssh                                                                           2.2.0             3d
99-worker-edf88b89-5d3c-11e9-81f3-029c8ab2a61c-registries   4.0.22-201904011459-dirty   2.2.0             3d
99-worker-ssh                                                                           2.2.0             3d
managed-ssh-keys-master                                                                                   65s
rendered-master-3e70eeafed7430563737ca2a16dc9b67            4.0.22-201904011459-dirty   2.2.0             3d
rendered-master-80ce547eac313139b113203257f682bb            4.0.22-201904011459-dirty   2.2.0             3d
rendered-worker-ec502c465df911064dfdda3d6904771f            4.0.22-201904011459-dirty   2.2.0             3d
rendered-worker-f1bd69edc8339bfa1b7ca8e707245994            4.0.22-201904011459-dirty   2.2.0             3d

$ oc get clusteroperators
NAME                                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.0.0-0.9   True        False         False     117s
cloud-credential                     4.0.0-0.9   True        False         False     3d
cluster-autoscaler                   4.0.0-0.9   True        False         False     3d
console                              4.0.0-0.9   True        False         False     23m
dns                                  4.0.0-0.9   True        False         False     3d
image-registry                       4.0.0-0.9   True        False         False     21m
ingress                              4.0.0-0.9   True        False         False     3d
kube-apiserver                       4.0.0-0.9   True        False         False     20m
kube-controller-manager              4.0.0-0.9   True        False         False     19m
kube-scheduler                       4.0.0-0.9   True        False         False     20m
machine-api                          4.0.0-0.9   True        False         False     3d
machine-config                       4.0.0-0.9   True        False         False     20m
marketplace                          4.0.0-0.9   True        False         False     20m
monitoring                           4.0.0-0.9   True        False         False     16m
network                              4.0.0-0.9   True        False         False     3d
node-tuning                          4.0.0-0.9   True        False         False     3d
openshift-apiserver                  4.0.0-0.9   True        False         False     18m
openshift-controller-manager         4.0.0-0.9   True        False         False     21m
openshift-samples                    4.0.0-0.9   True        False         False     3d
operator-lifecycle-manager           4.0.0-0.9   True        False         False     3d
operator-lifecycle-manager-catalog   4.0.0-0.9   True        False         False     3d
service-ca                           4.0.0-0.9   True        False         False     20m
service-catalog-apiserver            4.0.0-0.9   True        False         False     19m
service-catalog-controller-manager   4.0.0-0.9   True        False         False     21m
storage                              4.0.0-0.9   True        False         False     3d

$ oc logs -f machine-config-controller-5f78744567-hfnw2
<SNIP>
I0415 16:50:38.560147       1 render_controller.go:380] Error syncing machineconfigpool master: machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
I0415 16:51:19.520563       1 render_controller.go:380] Error syncing machineconfigpool master: machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
E0415 16:52:41.440942       1 render_controller.go:385] machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
I0415 16:52:41.440974       1 render_controller.go:386] Dropping machineconfigpool "master" out of the queue: machine config: managed-ssh-keys-master contains invalid ignition config: error: invalid config version (couldn't parse)
</SNIP>
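Since our client returned success even though only the controller log recorded the failure, one mitigation on our side is a client-side pre-check before applying. A minimal sketch in Python (the helper name and field walk are ours, based on the MachineConfig shape used in this report; this is not MCO code):

```python
def has_ignition_version(machine_config: dict) -> bool:
    """Return True if spec.config.ignition.version is present and non-empty."""
    ignition = (
        machine_config.get("spec", {})
                      .get("config", {})
                      .get("ignition") or {}   # "or {}" handles "ignition:" parsed as None
    )
    return bool(ignition.get("version"))

# A config like managed-ssh-keys-master above, with the version absent:
bad = {"spec": {"config": {"ignition": {}}}}
good = {"spec": {"config": {"ignition": {"version": "2.2.0"}}}}

assert not has_ignition_version(bad)
assert has_ignition_version(good)
```

A client would run this on the parsed YAML and refuse to apply (or at least warn) when the check fails.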
Yeah, I realized that; that's a rendering issue and we should bubble that up, I guess.
After a quick discussion about this, I feel like the operator should go Failing, because e.g. OS updates won't be applied either, since we'll fail to render the new config.
(In reply to Colin Walters from comment #8)
> In a quick discussion about this I feel like the operator should go failing,
> because e.g. OS updates won't be applied either because we'll fail to render
> the new config.

This is not as easy as it sounds, though. We can't just flip the operator's Failing to True from the render_controller; Failing follows its own logic in the operator code. We need a dedicated sync function where the operator checks (like the current ones we have today). But we can't really rely on the MCP status if we add Degraded back. I'll think more about this...
Alrighty, figured it out: we can still rely on the Degraded state on MCPs. The thing with flipping the operator to Failing=True is only valid, as it is today, for the master MCP though. We won't flip the operator to Failing=True if the worker pool can't render due to a bad MachineConfig. I believe we all agree on that, right?
The PR has been merged, and we now bubble up errors to the MachineConfigPool (and also to the operator if it's the master pool).
Using the following release payload:

$ oc adm release info
Name:      4.1.0-0.okd-2019-05-07-124355
Digest:    sha256:52168017b3530f38e29dae2de1f3cd165406660a4c6ef9030bdfa5c610ae0cd0
Created:   2019-05-07T12:44:03Z
OS/Arch:   linux/amd64
Manifests: 289

Pull From: registry.svc.ci.openshift.org/origin/release@sha256:52168017b3530f38e29dae2de1f3cd165406660a4c6ef9030bdfa5c610ae0cd0

Release Metadata:
  Version:  4.1.0-0.okd-2019-05-07-124355
  Upgrades: <none>

Component Versions:
  Kubernetes 1.13.4
...

I took the example `chrony.conf` from the upstream MCO repo (https://github.com/openshift/machine-config-operator/blob/master/docs/README.md) and removed the Ignition `version` field:

$ cat -p ~/Documents/faulty-machineconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-examplecorp-chrony
spec:
  config:
    ignition:
    storage:
      files:
      - contents:
          source: data:,server%20foo.example.net%20maxdelay%200.4%20offline%0Aserver%20bar.example.net%20maxdelay%200.4%20offline%0Aserver%20baz.example.net%20maxdelay%200.4%20offline
        filesystem: root
        mode: 0644
        path: /etc/chrony.conf

Applied the MachineConfig and checked the MachineConfigPool and ClusterOperator; both showed that the supplied Ignition config failed to render:

$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-5bd1781a83bfcfaab496d807776058ad   True      False      False
worker   rendered-worker-a854b8292232473efd04c4c670778147   True      False      True

$ oc describe machineconfigpool worker
Name:         worker
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-05-07T14:17:09Z
  Generation:          1
  Resource Version:    50736
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 cca0e719-70d2-11e9-b5bc-0200ffef4618
Spec:
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Max Unavailable:  <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2019-05-07T14:17:46Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-05-07T14:22:53Z
    Message:
    Reason:                All nodes are updated with rendered-worker-a854b8292232473efd04c4c670778147
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-05-07T14:22:53Z
    Message:
    Reason:
    Status:                False
    Type:                  Updating
    Last Transition Time:  2019-05-07T16:51:25Z
    Message:
    Reason:                Failed to render configuration for pool worker: machine config: 50-examplecorp-chrony contains invalid ignition config: error: invalid config version (couldn't parse)
    Status:                True
    Type:                  RenderDegraded
    Last Transition Time:  2019-05-07T16:51:30Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
...

$ oc describe clusteroperator/machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-05-07T14:17:08Z
  Generation:          1
  Resource Version:    50722
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 cc952453-70d2-11e9-b5bc-0200ffef4618
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-05-07T14:18:06Z
    Message:               Cluster has deployed 4.1.0-0.okd-2019-05-07-124355
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-05-07T14:18:06Z
    Message:               Cluster version is 4.1.0-0.okd-2019-05-07-124355
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-05-07T14:17:08Z
    Status:                False
    Type:                  Degraded
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-5bd1781a83bfcfaab496d807776058ad
    Worker:  pool is degraded because rendering fails with "Failed to render configuration for pool worker: machine config: 50-examplecorp-chrony contains invalid ignition config: error: invalid config version (couldn't parse)"
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.1.0-0.okd-2019-05-07-124355
Events:  <none>
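With the conditions shown above, a script can now detect this failure mode from the pool status instead of scraping controller logs; the jsonpath expression below is a sketch targeting the RenderDegraded condition, and the output shown matches the degraded worker pool in this verification:

```console
$ oc get machineconfigpool worker \
    -o jsonpath='{.status.conditions[?(@.type=="RenderDegraded")].status}'
True
```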
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758