Bug 1927731 - /usr/lib/dracut/modules.d/30ignition/ignition --version sigsev
Summary: /usr/lib/dracut/modules.d/30ignition/ignition --version sigsev
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Benjamin Gilbert
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1933205
Reported: 2021-02-11 12:33 UTC by Pablo Alonso Rodriguez
Modified: 2021-11-19 00:30 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Machine Config Operator invokes Ignition at startup to check the Ignition version, and Ignition crashes. Consequence: The MCO fails to start. Fix: The MCO no longer queries the Ignition version. Result: The MCO starts successfully.
Clone Of:
Clones: 1933206
Environment:
Last Closed: 2021-07-27 22:43:44 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2431 | 0 | None | open | Revert "pkg/daemon: Add IgnitionVersion to Daemon" | 2021-02-24 21:50:49 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:44:07 UTC

Description Pablo Alonso Rodriguez 2021-02-11 12:33:04 UTC
Description of problem:

The MCO needs to run the /usr/lib/dracut/modules.d/30ignition/ignition --version command at some point, and that command appears to fail with a SIGSEGV. I tried to understand why it happened, but the few things I could find do not make much sense to me.

Version-Release number of selected component (if applicable):

4.5.17 (ignition-0.35.1-11.rhaos4.5.gitb4d18ad.el8.x86_64)

This happened during an upgrade to 4.6.12, but the node is still on the version above.

How reproducible:

Only on one node.

Steps to Reproduce:
1. Run /usr/lib/dracut/modules.d/30ignition/ignition --version

Actual results:

SIGSEGV

Expected results:

Version printed, so that MCO can work as expected.

Additional info:

I have a coredump (gathered by systemd) and a strace. I'll be attaching privately.

Comment 4 Micah Abbott 2021-02-11 18:14:01 UTC
I'm not able to reproduce on an RHCOS node running in qemu; tried a recent 4.6 and 4.7 build:

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://478636b3f9d960112359f202481507e4c5467dcccdc4e8faba291de4abb3b8bc
                   Version: 46.82.202101191342-0 (2021-01-19T13:45:40Z)
[core@cosa-devsh ~]$ /usr/lib/dracut/modules.d/30ignition/ignition --version 
Ignition 2.6.0
```

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://626f6034804cbf93f2bdf5bfec9e4a6d2aff0aa40b376a534e4a73aba0799fbf
                   Version: 47.83.202102011415-0 (2021-02-01T14:17:51Z)
[core@cosa-devsh ~]$ /usr/lib/dracut/modules.d/30ignition/ignition --version
Ignition 2.9.0
```

One question is why the MCO needs to know what version of Ignition is on RHCOS?

Comment 5 Benjamin Gilbert 2021-02-11 19:44:28 UTC
The MCO code to query the Ignition version was added in https://github.com/openshift/machine-config-operator/pull/1729.

Comment 6 Colin Walters 2021-02-11 21:24:59 UTC
May be related to https://github.com/openshift/machine-config-operator/pull/2299

Comment 7 Benjamin Gilbert 2021-02-12 00:53:32 UTC
The core file indicates that Ignition died on a SIGABRT:

Core was generated by `/usr/lib/dracut/modules.d/30ignition/ignition --version'.
Program terminated with signal SIGABRT, Aborted.

#0  runtime.raise () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:150
#1  0x0000557448f5fe3b in runtime.dieFromSignal (sig=6)
    at /usr/lib/golang/src/runtime/signal_unix.go:428
#2  0x0000557448f6029d in runtime.sigfwdgo (sig=6, info=0xc000009d70, 
    ctx=0xc000009c40, ~r3=<optimized out>)
    at /usr/lib/golang/src/runtime/signal_unix.go:631
#3  0x0000557448f5f510 in runtime.sigtrampgo (sig=<optimized out>, 
    info=0xc000009d70, ctx=0xc000009c40)
    at /usr/lib/golang/src/runtime/signal_unix.go:289
#4  0x0000557448f79bb3 in runtime.sigtramp ()
    at /usr/lib/golang/src/runtime/sys_linux_amd64.s:357
#5  0x00007fefd142edd0 in ?? ()
#6  0x0000000000000006 in ?? ()
#7  0x0000000000000000 in ?? ()

It's possible that the abort was caused by a SIGSEGV that the runtime failed to handle properly.
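
As an aside for anyone reproducing this without a core file: the signal that terminated a child process can also be read from its wait status. The Go sketch below is a diagnostic illustration only, not code from the MCO or Ignition. The helper name `describeExit` is hypothetical, and the sketch assumes a Unix-like system where the wait status is a `syscall.WaitStatus`.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"syscall"
)

// describeExit (hypothetical helper) runs a command and reports whether it
// succeeded, exited with a non-zero code, or was terminated by a signal.
func describeExit(name string, args ...string) string {
	out, err := exec.Command(name, args...).Output()
	if err == nil {
		return "ok: " + strings.TrimSpace(string(out))
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// On Unix-like systems Sys() holds a syscall.WaitStatus.
		if ws, ok := exitErr.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
			return "killed by signal: " + ws.Signal().String()
		}
		return fmt.Sprintf("exit code %d", exitErr.ExitCode())
	}
	// The command could not be started at all (for example, not found).
	return "could not run: " + err.Error()
}

func main() {
	// Path taken from this report; on systems without it, the
	// "could not run" branch is taken.
	fmt.Println(describeExit("/usr/lib/dracut/modules.d/30ignition/ignition", "--version"))
}
```

The wait status only shows the final signal. It cannot tell whether an earlier SIGSEGV led to the SIGABRT, so the core file remains the better source for that question.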

--version exits very early; essentially all of the code that runs before that point is init() code.  This could be a problem in Ignition, in a vendored library, or in the Go runtime.  It's interesting that we haven't had a previous report of this crash, though.

Christian, from a quick grep of the MCO codebase, it doesn't look as though the Ignition version is used anywhere, other than for logging.  Should we back out https://github.com/openshift/machine-config-operator/pull/1729?  Ignition isn't designed to be run from the real root anyway.
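
The point above about `--version` exiting early while all init() code still runs first can be illustrated with a small Go sketch. This is not Ignition's code. The names `events`, `registry`, `setup`, and `run`, and the version string, are hypothetical. It only shows that package-level initializers and init() functions execute before any flag handling, so a fault in them would affect even a trivial `--version` call.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// events records the order in which startup code runs.
var events []string

// Package-level variable initializers run before any init() function.
var registry = setup()

func setup() map[string]bool {
	events = append(events, "package var initializer")
	return map[string]bool{}
}

func init() {
	events = append(events, "init()")
}

// run stands in for a real entry point that handles --version first.
func run(args []string) (string, error) {
	events = append(events, "run")
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	showVersion := fs.Bool("version", false, "print the version and exit")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *showVersion {
		return "demo 0.0.0", nil // hypothetical version string
	}
	return "", nil
}

func main() {
	out, err := run(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println(strings.Join(events, " -> "))
	if out != "" {
		fmt.Println(out)
	}
}
```

In a real binary the same ordering applies to every imported package, including vendored libraries, which is why the crash could originate outside Ignition's own code.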

Comment 8 Christian Glombek 2021-02-12 13:31:35 UTC
I wouldn't mind removing those lines again although I'd really like to know what broke this all of a sudden, too.

Comment 9 Christian Carlé 2021-02-15 08:14:15 UTC
Hi,

During the upgrade, multiple worker and infra nodes got stuck for about 2 hours. I first rebooted a worker node, but it did not pick up the newly rendered machine config, so I had to change that manually in its node YAML file. After that change, the upgrade of the workers went fine. I then waited another hour for the upgrade to finish, but it didn't; it was now stuck on the last infra node. So I rebooted the infra node from an SSH session and also noticed that I wasn't logged in as core@<node name>; instead the prompt showed a different node name that I didn't recognise...

Br.
Christian

Comment 11 Pablo Alonso Rodriguez 2021-02-15 09:08:40 UTC
Hi Christian Carlé

Thanks for providing the insights.

Just one thing: this bug is focused on the specific SIGSEGV problem we found in Ignition, as it is assigned to the engineering teams that can work on that.

It is good to have the context, but any other problem needs to be worked through the support case, so that we can either solve it directly there or open separate Bugzillas for the teams that can handle them (if needed).

Thanks and regards.

Comment 12 Benjamin Gilbert 2021-02-24 21:50:50 UTC
Proposed dropping the `ignition --version` invocation from the MCD in https://github.com/openshift/machine-config-operator/pull/2431.

Comment 14 Michael Nguyen 2021-03-02 14:41:15 UTC
Verified on 4.8.0-0.nightly-2021-03-01-143026.  Message about ignition version is no longer in the log.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-01-143026   True        False         32m     Cluster version is 4.8.0-0.nightly-2021-03-01-143026

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-5373ry2-f76d1-227tx-master-0         Ready    master   53m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-master-1         Ready    master   52m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-master-2         Ready    master   53m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-worker-b-lndk2   Ready    worker   43m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-worker-c-c62jb   Ready    worker   47m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-worker-d-bld4f   Ready    worker   44m   v1.20.0+ac0db7d

$ oc get pods -A --field-selector spec.nodeName=ci-ln-5373ry2-f76d1-227tx-worker-b-lndk2
NAMESPACE                                NAME                           READY   STATUS              RESTARTS   AGE
openshift-cluster-csi-drivers            gcp-pd-csi-driver-node-dq5j9   3/3     Running             0          44m
openshift-cluster-node-tuning-operator   tuned-7w7qp                    1/1     Running             0          44m
openshift-dns                            dns-default-gtwhs              3/3     Running             0          44m
openshift-image-registry                 node-ca-f5n45                  1/1     Running             0          44m
openshift-ingress-canary                 ingress-canary-vtmtn           1/1     Running             0          43m
openshift-machine-config-operator        machine-config-daemon-872jh    2/2     Running             0          44m
openshift-marketplace                    certified-operators-rrs8s      0/1     ContainerCreating   0          4s
openshift-marketplace                    community-operators-bkh5x      0/1     ContainerCreating   0          4s
openshift-marketplace                    redhat-operators-rth4q         0/1     ContainerCreating   0          3s
openshift-monitoring                     node-exporter-bk9th            2/2     Running             0          44m
openshift-multus                         multus-nrmqw                   1/1     Running             0          44m
openshift-multus                         network-metrics-daemon-4v2tc   2/2     Running             0          44m
openshift-network-diagnostics            network-check-target-25rrs     1/1     Running             0          44m
openshift-sdn                            ovs-t82n2                      1/1     Running             0          44m
openshift-sdn                            sdn-wlhfz                      2/2     Running             0          44m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-872jh -c machine-config-daemon | grep "Installed Ignition"

Comment 17 errata-xmlrpc 2021-07-27 22:43:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

