Description of problem:
MachineConfig `kernel-devel` extension does not reconcile, and results in Degraded node.
Version-Release number of selected component (if applicable):
Cluster version is 4.6.0-0.nightly-ppc64le-2020-10-02-132152
Client Version: openshift-clients-4.6.0-202006250705.p0-156-geadaf8954
Kubernetes Version: v1.19.0+9ec24d6
Steps to Reproduce:
1. oc apply
2. watch oc describe mcp worker; first node to attempt reconcile goes Degraded
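For reference, a minimal sketch of the kind of manifest that enables an extension on the worker pool. The exact manifest used in this report was not attached, so the file path and the MachineConfig name `80-worker-extensions` below are assumptions; only the `extensions` entry is taken from the report.

```shell
# Hypothetical manifest path and MachineConfig name (not from the report).
manifest="${TMPDIR:-/tmp}/80-worker-extensions.yaml"

# Write a MachineConfig that requests the kernel-devel extension for workers.
cat > "$manifest" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 80-worker-extensions
spec:
  config:
    ignition:
      version: 3.1.0
  extensions:
    - kernel-devel
EOF
printf 'wrote %s\n' "$manifest"

# With a logged-in oc session, the reproduction would then be roughly:
#   oc apply -f "$manifest"
#   watch oc describe mcp worker
```

Swapping `kernel-devel` for `usbguard` in the same manifest gives the sanity check mentioned later in the report.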
oc logs of machine-config-daemon on the Degraded node:
I1005 15:17:25.464607 2965 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["update" "--install" "kernel-devel"] : exit status 1
I1005 15:17:25.464642 2965 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
I1005 16:00:52.789976 2965 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-027941479 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9213d5f307a6ffa01005e4b548ec686db1bc38a4b909c9ab06e129325bd6e4f5
I1005 16:01:25.388027 2965 update.go:910] Applying extensions : ["update" "--install" "kernel-devel"]
I1005 16:01:34.394648 2965 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["update" "--install" "kernel-devel"] : exit status 1
Expected results:
Should reconcile to Done, with the kernel-devel package installed on worker nodes.
Note: applied `usbguard` extension as a sanity check, and it did work.
The current error logging is not very useful for identifying the root cause of the problem; we have an ongoing patch (https://github.com/openshift/machine-config-operator/pull/2097) to improve it.
Can you please provide a must-gather, or at least the complete machine-config-daemon pod log and the rpm-ostreed service log (journalctl -u rpm-ostreed), for the worker node where the extension failed to apply?
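A rough sketch of how the requested data could be collected. The helper names `run` and `collect_mcd_logs` are made up for illustration, and the node and pod names are placeholders you would look up yourself (for example with `oc get pods -n openshift-machine-config-operator -o wide`). By default the sketch only prints the commands; set DRY_RUN=0 to execute them against a cluster.

```shell
# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Usage: collect_mcd_logs <node-name> <machine-config-daemon-pod-name>
collect_mcd_logs() {
  node="$1"
  pod="$2"
  ns=openshift-machine-config-operator

  # Full cluster diagnostics.
  run oc adm must-gather

  # Complete log of the machine-config-daemon container on the failing node.
  run oc logs -n "$ns" "$pod" -c machine-config-daemon

  # rpm-ostreed journal from the host, read through a debug pod.
  run oc debug "node/$node" -- chroot /host journalctl -u rpm-ostreed
}
```

For example, `collect_mcd_logs worker-0 machine-config-daemon-abcde` prints the three commands for that (placeholder) node and pod.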
I don't think we should support installing `kernel-devel` as an extension. It's just there so other tools can get it from the machine-os-content on their own and use it as part of a kernel module build process.
Installing it on the host would also require other dependencies we don't want to ship in the host.
Created attachment 1719444: journalctl -u rpm-ostreed
Created attachment 1719445: machine-config-daemon log of the Degraded node
Relevant error log from rpm-ostreed service:
Oct 05 16:28:29 <hostname> rpm-ostree: Preparing pkg txn; enabled repos: ['coreos-extensions'] solvables: 8
Oct 05 16:28:29 <hostname> rpm-ostree: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos failed: Could not depsolve transaction; 1 problem detected:
Problem: conflicting requests
- nothing provides perl-interpreter needed by kernel-devel-4.18.0-193.19.1.el8_2.ppc64le
Oct 05 16:28:29 <hostname> rpm-ostree: client(id:machine-config-operator dbus:1.224 unit:crio-0d450d31f7b532bccb5145ec3a1877fc79f1b629e9fe236202044f330109483f.scope uid:0) vanished; remaining=0
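Since the machine-config-daemon log only reports "exit status 1", the depsolve lines in the rpm-ostreed journal are the useful part. A small sketch for pulling them out follows; the function name `missing_deps` is made up for illustration, and it assumes the "nothing provides X needed by Y" wording seen in the log above.

```shell
# Read rpm-ostreed log lines on stdin and print one line per unmet
# dependency, in the form "<capability> (needed by <package>)".
missing_deps() {
  sed -n 's/.*nothing provides \(.*\) needed by \([^ ]*\).*/\1 (needed by \2)/p'
}
```

On an affected node this could be used as `journalctl -u rpm-ostreed | missing_deps`.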
It seems perl-interpreter is not available in base RHCOS on ppc64le, hence the failure. I believe kernel-devel is not a supported extension and is available primarily for the use case Colin mentioned.
@Steve What are your thoughts?
Moving to RHCOS to rebuild with that missing package, I guess. Sorry if that's wrong, but it's not super clear who owns the ppc64le base RHCOS.
RHCOS is the correct component, since the problem is a package missing from the extensions. I'll ask someone from multiarch to take a look as well.
It was identified that one architecture (ppc64le) doesn't require perl-interpreter, and thus the install fails due to the missing dependency. Prashanth is working on an update now.
The fix landed in the ppc64le RHCOS build 46.82.202010081539-0
Waiting for the build to get promoted to a 4.6 nightly release payload - https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6-ppc64le
Tested with the latest ppc64le nightly, 4.6.0-0.nightly-ppc64le-2020-10-09-033704, and it works.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.