Bug 1879283 - panic after nil pointer dereference in pkg/daemon/update.go
Summary: panic after nil pointer dereference in pkg/daemon/update.go
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.7.0
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-15 21:04 UTC by Michael Hrivnak
Modified: 2021-02-24 15:18 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:18:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator issues 2086 0 None closed panic after nil pointer dereference in pkg/daemon/update.go 2021-01-27 02:12:14 UTC
Github openshift machine-config-operator pull 2123 0 None closed Bug 1879283: *: add nil check when decoding File.Contents.Source 2021-01-27 02:12:15 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:18:40 UTC

Description Michael Hrivnak 2020-09-15 21:04:26 UTC
Description of problem:

A particular arrangement of data in a File object in ignition will cause MCO to panic.


Version-Release number of selected component (if applicable):

OCP build: 4.6.0-0.nightly-2020-08-31-220837

$ oc adm release info --commits quay.io/ocpmetal/ocp-release@sha256:ffba977ed01cd239d2753b54ac37b563644f9212120adce508781a60e4668c3e | grep machine-config-operator
  machine-config-operator                        https://github.com/openshift/machine-config-operator                        36f37f2d6009affe8174854f5ef5538e0cc49034


How reproducible:

always


Steps to Reproduce:
1. use openshift-baremetal-install to generate ignition files for bare metal provisioning
2. boot a host with bootstrap.ign


Actual results:

time=\"2020-09-14T21:48:21Z\" level=info msg=\"I0914 21:48:21.264243       1 update.go:1265] Writing file \\\"/etc/ignition-machine-config-encapsulated.json\\\"\\n\"\n
time=\"2020-09-14T21:48:21Z\" level=info msg=\"I0914 21:48:21.265137       1 update.go:1265] Writing file \\\"/etc/motd\\\"\\n\"\n
time=\"2020-09-14T21:48:21Z\" level=info msg="panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x16833ea]\n\ngo>
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=info msg=", 0xc00025c401, 0x0, 0x0)\n\t/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:528 +0x2cd\nmain.runStartCmd(0x2a43400, 0xc0006>
Sep 14 21:48:21 localhost systemd[1]: libpod-6ade6eb4e758e0b467c9caf74c2c7e38831aaca2aad6b22449f8938613261936.scope: Consumed 281ms CPU time
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=error msg="Failed to extract ignition to disk"
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=error msg="Bootstrap failed failed executing nsenter [-t 1 -m -i -- podman run --net host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree>
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=info msg="Exiting updateConfiguringStatus go routine"


Additional info:

I filed an issue on github that includes the ignition data that induces this bug plus a root cause for the bug including a specific line of code: https://github.com/openshift/machine-config-operator/issues/2086

Comment 1 Yu Qi Zhang 2020-09-17 15:13:19 UTC
Tentatively setting target to 4.7 as this is not a release blocker at the moment.

Comment 2 Kirsten Garrison 2020-09-17 16:00:37 UTC
Can we have a full must gather ?

Comment 4 Michael Hrivnak 2020-09-17 16:04:13 UTC
Just from reading code, this appears to be a regression since OCP 4.5. In that release, which used ignition 2.2, the "Source" field was not a pointer, so there was no opportunity for a nil pointer dereference.

https://github.com/coreos/ignition/blob/v0.35.0/config/v2_2/types/schema.go#L54

Comment 5 Michael Hrivnak 2020-09-17 16:06:04 UTC
> Can we have a full must gather ?

We are seeing the bug during cluster installation, and it prevents installation from completing. So I can't run must-gather without the cluster.

Comment 6 Michael Hrivnak 2020-09-17 16:19:02 UTC
One other note: There is an implied secondary bug underneath the segfault. You'll discover this as soon as you write "if file.Content.Source == nil {" and then think "ok what next?", but I thought it may help to point out here.

The code assumes that an ignition File is constructed to overwrite any prior contents, using the File.Contents.Source field. But that is not a required field (hence the current attempt to read it causing the segfault), and instead a File can be configured to append content. In that case the content itself will show up in the array File.Append. You can see in the github issue that the File in question uses the Append array, and likewise MCO should presumably be able to carry that out. So the fix to this issue is likely to be: A) don't try to derefernce a nil pointer, B) as the alternative implement the ability to append content.

Comment 10 Yu Qi Zhang 2020-09-28 15:43:34 UTC
Ok, we had a discussion on this, moving back to 4.6 with higher priority to fix the panic.

In summary, we realized that `append` has always been broken in the MCO since forever. We implicitly support it during install (since its valid ignition) but not during updates, so anytime we updated that cluster, the append sections would get dropped. This is because during updates we re-write all sections of the file, but ONLY source contents and never append. We only check for append and error in `reconcilable` function which only happens to diffs on upgrades, but not anything in the machineconfigs that existed since install time. In 4.5 this wouldn't even error because the contents will just be seen as empty. In 4.6 because its now a pointer we de-reference, it panics if you have it as empty.

So our proposal is this:

4.6:
1. fix the panic by adding a nil check.
2. If your data.source is empty (maybe also check if data.append is not empty), add an explicit warning telling you not to use append
3. change baremetal-install binary to not use append, since it doesn't work anyways

4.7
re-evaluate how we should support append. Either
1. don't allow append (merge in spec 3) at all, not even during install time, explicitly
2. support append fully both during install and upgrades

Comment 11 Yu Qi Zhang 2020-09-28 23:02:59 UTC
So I applied a fix that would explicitly reject writing append sections, as well as add a nil check to remove the panic.

So this won't fix the problem of an append being used since it will still fail. Looking at the installer a bit: https://github.com/openshift/installer/blame/7f8e8eab0b1a6fff4bf284957b7c390267c1a9d0/pkg/asset/ignition/bootstrap/bootstrap.go#L313

I'm not actually sure where the append is coming from. That's the only reference to append and its on bootstrap, which shouldn't involve the MCD, and we shouldn't have seen that error. Where are you actually getting the append section from?

Comment 12 Michael Hrivnak 2020-09-29 19:54:19 UTC
That is the correct occurrence of append. The assisted-installer uses MCD on the bootstrap here:

https://github.com/openshift/assisted-installer/blob/726f8a0/src/installer/installer.go#L183-L196

Comment 13 Yu Qi Zhang 2020-09-29 20:31:37 UTC
Ah, I see, so that's a conflict of operations. Hm...

So the bootstrap node never interacts with the MCD normally, so we technically allowed "append" being used on bootstrap as it was a one-time provisioning from ignition binary. Mayhaps we should not allow it at all, but that's another discussion and it currently shouldn't break anything.

The MCD never had any explicit support for append. If you tried to apply a pre-install config that has append, although the contents will be written by ignition on firstboot, the MCD will immediately complain when it tries to check the file on validation and degrade. This behaviour exists in all versions (4.5 and before)

I guess since if we did use onceFrom we'd never have validated the node nor used the MCD later to write anything, so it would have been "fine" from a 4.5 perspective to use the assisted installer like this. Then it seems our options are:

1. explicitly disallow the use of append anywhere, which would be my PR + a change to either the installer or the assisted installer
2. explicitly error during checks for append, but only warn when you try to append during writes. This allows the onceFrom to go through. This would be a modification of my PR, but technically is "Wrong" from a normal operation perspective. This is the best stop-gap for 4.6 maybe

Side note: Funnily enough if you try to provide a ign spec 2.2 machineconfig to append, it will actually fail the MCC on bootstrap. This feels like an error somewhere in translation and is probably a bug. I think the assisted installer would provide a spec 3 config so it shouldn't be a problem?

Comment 14 Yu Qi Zhang 2020-09-29 20:36:48 UTC
Also, if we do 2, we'd also have to actually add the ability to write files with "append", since we don't have that today. So it would either be

1. very hacky since append is not idempotent
or
2. slightly complex to ensure some sort of idempotency
or
3. we don't write appends at all for now, and assisted installer bm installs just wouldn't have that snippet in bootstrap MOTD

I somewhat dislike 2 and I personally think should only be a last resort. 1 is better in practice, although I'm not sure what the cleanest way to fix that would be, since the point of the original MOTD write was so that it gets appended to the bootstrap node, I could be missing something but it seems like it would be harder to implement

Comment 15 Yu Qi Zhang 2020-09-29 20:39:56 UTC
Would it be possible for the assisted installer not to do this MCD-write for the bootstrap node? This wasn't ever really a supported operation

Comment 16 Kirsten Garrison 2020-09-29 21:08:25 UTC
Leaning towards 1 from comment 13. Clearly not supporting something is more supportable than supporting something that's "Wrong" esp given that supporting the "wrong" behaviour via 2 reqs more code in MCO to support append that we don't support. 1 would also just require your PR and a change to assisted installer.

Also the point (if I'm reading correctly) is that the assisted installer is operating in an unexpected way ie interacting with the MCD in bootstrap when it should probably be striving to stay consistent why how regular installer does things from a POV of maintainability (ie we expect that you do things one way but you are doing something completely different that we can't keep track of) and supportability. No?

Comment 17 Yu Qi Zhang 2020-09-30 18:29:32 UTC
Ok, after some more discussion let's merge the fix to get rid of the panic and make failures more explicit. See: https://github.com/openshift/machine-config-operator/pull/2123#issuecomment-701564641

To summarize, the underlying issue is not on the MCD operation. Append is not supported at all in ignition/files section. There is a workaround in the assisted installer now to remove that append section. In 4.7 we can fix the installer or the assisted installer further, to not use append preferably.

Comment 18 Yu Qi Zhang 2020-09-30 18:30:47 UTC
Lowering priority given the above. This is not a testblocker or upgradeblocker but is nice to fix in 4.6 without much risk.

Comment 19 Kirsten Garrison 2020-09-30 18:57:26 UTC
SGTM

Comment 22 Michael Nguyen 2020-12-14 21:13:51 UTC
Verification Part 1:  Bad append post-install

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-14-080124   True        False         24s     Cluster version is 4.7.0-0.nightly-2020-12-14-080124
$ cat 99_openshift-machineconfig_99-worker-motd.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: bad-append-3
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    storage:
      files:
      - mode: 420
        path: /etc/motd
        append:
          - source: data:,hello%20world%0A
$ oc create -f ../99_openshift-machineconfig_99-worker-motd.yaml
machineconfig.machineconfiguration.openshift.io/bad-append-3 created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
00-worker                                          d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-master-container-runtime                        d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-master-kubelet                                  d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-worker-container-runtime                        d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-worker-kubelet                                  d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
99-master-generated-registries                     d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
99-master-ssh                                                                                 3.1.0             36m
99-worker-generated-registries                     d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
99-worker-ssh                                                                                 3.1.0             36m
bad-append-3                                                                                  3.1.0             9s
rendered-master-53ea1f220d2dc663af48dc3c6ec9680d   d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
rendered-worker-19950adde7f662bc39ed8ec6c140b89e   d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             4s
rendered-worker-8f9c62dc20df06750a98214636b1f89f   d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-8f9c62dc20df06750a98214636b1f89f   False     True       True       3              0                   0                     1                      29m

$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-12-14T20:42:57Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2020-12-14T20:42:57Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-12-14T21:10:09Z"
  name: worker
  resourceVersion: "31421"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 898bb337-5f14-4b50-8951-5fec7a7a7390
spec:
  configuration:
    name: rendered-worker-19950adde7f662bc39ed8ec6c140b89e
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: bad-append-3
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-12-14T20:43:48Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-12-14T21:10:04Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-12-14T21:10:04Z"
    message: All nodes are updating to rendered-worker-19950adde7f662bc39ed8ec6c140b89e
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-12-14T21:10:09Z"
    message: 'Node ip-10-0-177-88.us-west-2.compute.internal is reporting: "can''t
      reconcile config rendered-worker-8f9c62dc20df06750a98214636b1f89f with rendered-worker-19950adde7f662bc39ed8ec6c140b89e:
      ignition file /etc/motd includes append: unreconcilable"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-12-14T21:10:09Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-8f9c62dc20df06750a98214636b1f89f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 3
  readyMachineCount: 0
  unavailableMachineCount: 0
  updatedMachineCount: 0
$

Comment 23 Michael Nguyen 2020-12-15 16:44:22 UTC
Pre-install verification

$ openshift-install create manifests

$ cat << EOF > 99_openshift-machineconfig_99-worker-motd.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: bad-append-3
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    storage:
      files:
      - mode: 420
        path: /etc/motd
        append:
          - source: data:,hello%20world%0A
EOF

$ cp 99_openshift-machineconfig_99-worker-motd.yaml openshift/
$ openshift-install create cluster

Cluster fails with the append error at the very last step.  Worker machine config pool is degraded.

Comment 25 Yu Qi Zhang 2021-01-06 17:03:20 UTC
No doc needed: this simply made error reporting more clear, does not change behaviour

Comment 27 errata-xmlrpc 2021-02-24 15:18:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.