Description of problem:
The MCO needs to run /usr/lib/dracut/modules.d/30ignition/ignition --version at some point, and that command appears to fail with a SIGSEGV. I tried to understand why it happened, but the few things I could find don't make much sense to me.

Version-Release number of selected component (if applicable):
4.5.17 (ignition-0.35.1-11.rhaos4.5.gitb4d18ad.el8.x86_64)
This happened during an upgrade to 4.6.12, but the node is still on the version above.

How reproducible:
Only on one node.

Steps to Reproduce:
1. Run /usr/lib/dracut/modules.d/30ignition/ignition --version

Actual results:
SIGSEGV

Expected results:
The version is printed, so that the MCO can work as expected.

Additional info:
I have a coredump (gathered by systemd) and an strace. I'll attach them privately.
I'm not able to reproduce on an RHCOS node running in qemu; tried a recent 4.6 and 4.7 build:

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://478636b3f9d960112359f202481507e4c5467dcccdc4e8faba291de4abb3b8bc
                   Version: 46.82.202101191342-0 (2021-01-19T13:45:40Z)
[core@cosa-devsh ~]$ /usr/lib/dracut/modules.d/30ignition/ignition --version
Ignition 2.6.0
```

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://626f6034804cbf93f2bdf5bfec9e4a6d2aff0aa40b376a534e4a73aba0799fbf
                   Version: 47.83.202102011415-0 (2021-02-01T14:17:51Z)
[core@cosa-devsh ~]$ /usr/lib/dracut/modules.d/30ignition/ignition --version
Ignition 2.9.0
```

One question is why the MCO needs to know what version of Ignition is on RHCOS?
The MCO code to query the Ignition version was added in https://github.com/openshift/machine-config-operator/pull/1729.
May be related to https://github.com/openshift/machine-config-operator/pull/2299
The core file indicates that Ignition died on a SIGABRT:

Core was generated by `/usr/lib/dracut/modules.d/30ignition/ignition --version'.
Program terminated with signal SIGABRT, Aborted.
#0  runtime.raise () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:150
#1  0x0000557448f5fe3b in runtime.dieFromSignal (sig=6) at /usr/lib/golang/src/runtime/signal_unix.go:428
#2  0x0000557448f6029d in runtime.sigfwdgo (sig=6, info=0xc000009d70, ctx=0xc000009c40, ~r3=<optimized out>) at /usr/lib/golang/src/runtime/signal_unix.go:631
#3  0x0000557448f5f510 in runtime.sigtrampgo (sig=<optimized out>, info=0xc000009d70, ctx=0xc000009c40) at /usr/lib/golang/src/runtime/signal_unix.go:289
#4  0x0000557448f79bb3 in runtime.sigtramp () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:357
#5  0x00007fefd142edd0 in ?? ()
#6  0x0000000000000006 in ?? ()
#7  0x0000000000000000 in ?? ()

It's possible that the abort was caused by a SIGSEGV that the runtime failed to handle properly. --version exits very early; essentially all of the code that runs before that point is init() code. This could be a problem in Ignition, in a vendored library, or in the Go runtime. It's interesting that we haven't had a previous report of this crash, though.

Christian, from a quick grep of the MCO codebase, it doesn't look as though the Ignition version is used anywhere, other than for logging. Should we back out https://github.com/openshift/machine-config-operator/pull/1729? Ignition isn't designed to be run from the real root anyway.
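For anyone triaging a similar report, it can help to record exactly how the child process ended, since the original description said SIGSEGV while the core shows SIGABRT (sig=6). The following is only a minimal sketch in Python, not what the MCO does: `describe_exit` and `run_and_describe` are hypothetical helper names, and it relies on the documented POSIX behaviour of `subprocess`, where a child killed by signal N is reported with return code -N.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable outcome."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform's signal module.
            name = "signal %d" % -returncode
        return "killed by %s" % name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run argv (a list of strings) and describe how the process ended."""
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )
    return describe_exit(proc.returncode)
```

On an affected node this could be called as `run_and_describe(["/usr/lib/dracut/modules.d/30ignition/ignition", "--version"])`. It only reports the final signal, so it cannot tell whether an abort was preceded by a fault inside the Go runtime; the core file is still needed for that.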
I wouldn't mind removing those lines again although I'd really like to know what broke this all of a sudden, too.
Hi,

What happened during the upgrade was that multiple nodes, both worker and infra nodes, got stuck for about 2 hours. I first rebooted a worker node, but it still wouldn't pick up the newly rendered machine config; I needed to change that manually in its node YAML file. After that change, the upgrade of the workers went fine. I then waited another hour so that the upgrade could finish, but it didn't: it was now stuck on the last infra node. So I rebooted the infra node from an SSH session, and also noticed it wasn't logged in as core@<node name>; instead it showed a different node name that I didn't recognise...

Br. Christian
Hi Christian Carlé,

Thanks for providing the insights. One note: this bug is focused on the specific SIGSEGV problem we found in Ignition, as it is assigned to the engineering teams that can work on that. It is good to have the context, but any other problem needs to be worked through the support case, so we can either solve it directly there or open separate Bugzillas for the teams that can handle them (if needed).

Thanks and regards.
Proposed dropping the `ignition --version` invocation from the MCD in https://github.com/openshift/machine-config-operator/pull/2431.
Verified on 4.8.0-0.nightly-2021-03-01-143026. Message about ignition version is no longer in the log.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-01-143026   True        False         32m     Cluster version is 4.8.0-0.nightly-2021-03-01-143026

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-5373ry2-f76d1-227tx-master-0         Ready    master   53m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-master-1         Ready    master   52m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-master-2         Ready    master   53m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-worker-b-lndk2   Ready    worker   43m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-worker-c-c62jb   Ready    worker   47m   v1.20.0+ac0db7d
ci-ln-5373ry2-f76d1-227tx-worker-d-bld4f   Ready    worker   44m   v1.20.0+ac0db7d

$ oc get pods -A --field-selector spec.nodeName=ci-ln-5373ry2-f76d1-227tx-worker-b-lndk2
NAMESPACE                                NAME                           READY   STATUS              RESTARTS   AGE
openshift-cluster-csi-drivers            gcp-pd-csi-driver-node-dq5j9   3/3     Running             0          44m
openshift-cluster-node-tuning-operator   tuned-7w7qp                    1/1     Running             0          44m
openshift-dns                            dns-default-gtwhs              3/3     Running             0          44m
openshift-image-registry                 node-ca-f5n45                  1/1     Running             0          44m
openshift-ingress-canary                 ingress-canary-vtmtn           1/1     Running             0          43m
openshift-machine-config-operator        machine-config-daemon-872jh    2/2     Running             0          44m
openshift-marketplace                    certified-operators-rrs8s      0/1     ContainerCreating   0          4s
openshift-marketplace                    community-operators-bkh5x      0/1     ContainerCreating   0          4s
openshift-marketplace                    redhat-operators-rth4q         0/1     ContainerCreating   0          3s
openshift-monitoring                     node-exporter-bk9th            2/2     Running             0          44m
openshift-multus                         multus-nrmqw                   1/1     Running             0          44m
openshift-multus                         network-metrics-daemon-4v2tc   2/2     Running             0          44m
openshift-network-diagnostics            network-check-target-25rrs     1/1     Running             0          44m
openshift-sdn                            ovs-t82n2                      1/1     Running             0          44m
openshift-sdn                            sdn-wlhfz                      2/2     Running             0          44m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-872jh -c machine-config-daemon | grep "Installed Ignition"
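The empty grep result above is the whole verification. If the same check needs to run against saved daemon logs on several nodes, it could be scripted roughly as follows. This is a sketch only: `find_marker_lines` is a hypothetical helper, and the only string taken from this report is the "Installed Ignition" marker used in the grep; nothing here assumes the full format of the log line.

```python
def find_marker_lines(log_text, marker="Installed Ignition"):
    """Return every log line that contains the marker substring.

    An empty list corresponds to the empty grep output in the
    verification above, i.e. the message is no longer logged.
    """
    return [line for line in log_text.splitlines() if marker in line]
```

The log text would come from `oc logs` output saved to a file; an empty return value for a node means the version message is gone from that daemon's log.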
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438