Bug 1885408 - MCO kernel-devel extension fails with Degraded node: failed to execute rpm-ostree ["update" "--install" "kernel-devel"]
Summary: MCO kernel-devel extension fails with Degraded node: failed to execute rpm-os...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: ppc64le
OS: Linux
Target Milestone: ---
Target Release: 4.6.0
Assignee: Prashanth Sundararaman
QA Contact: Michael Nguyen
Depends On:
Reported: 2020-10-05 20:23 UTC by Evan Dunn
Modified: 2020-10-27 16:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-10-27 16:47:41 UTC
Target Upstream Version:

Attachments
journalctl -u rpm-ostreed (2.21 MB, text/plain)
2020-10-06 16:00 UTC, Evan Dunn
machine-config-daemon of Degraded node (8.80 MB, text/plain)
2020-10-06 16:01 UTC, Evan Dunn

System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2020:4196  0        None      None    None     2020-10-27 16:48:11 UTC

Description Evan Dunn 2020-10-05 20:23:21 UTC
Description of problem:

MachineConfig `kernel-devel` extension does not reconcile, and results in Degraded node.

Version-Release number of selected component (if applicable):

Cluster version is 4.6.0-0.nightly-ppc64le-2020-10-02-132152

Client Version: openshift-clients-4.6.0-202006250705.p0-156-geadaf8954
Kubernetes Version: v1.19.0+9ec24d6

How reproducible:

Steps to Reproduce:
1. oc apply 

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 02-worker-kernel-devel
spec:
  extensions:
    - kernel-devel

2. watch oc describe mcp worker; first node to attempt reconcile goes Degraded
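
The manifest from step 1 can also be generated programmatically. The following is a minimal sketch, assuming the usual MachineConfig layout (role label under metadata.labels, extension list under spec.extensions). The helper name kernel_devel_machineconfig is hypothetical. It only builds and prints JSON, which could then be passed to oc apply -f.

```python
import json


def kernel_devel_machineconfig(role="worker", name="02-worker-kernel-devel"):
    """Build a MachineConfig manifest that requests the kernel-devel extension."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            # The role label selects which MachineConfigPool picks this up.
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        # Assumption: extensions are listed under spec.extensions.
        "spec": {"extensions": ["kernel-devel"]},
    }


manifest = kernel_devel_machineconfig()
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Swapping "kernel-devel" for "usbguard" in the list would give the sanity-check manifest mentioned under Additional info.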

Actual results:

oc logs output from the machine-config-daemon pod on the Degraded node:

I1005 15:17:25.464607    2965 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["update" "--install" "kernel-devel"] : exit status 1
I1005 15:17:25.464642    2965 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p

I1005 16:00:52.789976    2965 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-027941479 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9213d5f307a6ffa01005e4b548ec686db1bc38a4b909c9ab06e129325bd6e4f5
I1005 16:01:25.388027    2965 update.go:910] Applying extensions : ["update" "--install" "kernel-devel"]
I1005 16:01:34.394648    2965 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["update" "--install" "kernel-devel"] : exit status 1
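
When scanning a large machine-config-daemon log for this failure, the rollback lines above can be picked out with a regular expression. This is a rough sketch that assumes the message format shown in these excerpts. The function name failed_rpm_ostree_calls is hypothetical and is not part of the MCO.

```python
import re

# Matches: failed to execute rpm-ostree ["a" "b" ...] : exit status N
FAILED_RE = re.compile(
    r'failed to execute rpm-ostree \[(?P<args>[^\]]*)\] : exit status (?P<code>\d+)'
)


def failed_rpm_ostree_calls(log_text):
    """Return (argument list, exit status) for each failed rpm-ostree call."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            # Arguments are logged as space-separated quoted strings.
            args = re.findall(r'"([^"]*)"', match.group("args"))
            results.append((args, int(match.group("code"))))
    return results


sample = (
    'I1005 16:01:34.394648    2965 update.go:361] Rolling back applied changes '
    'to OS due to error: failed to execute rpm-ostree '
    '["update" "--install" "kernel-devel"] : exit status 1'
)
calls = failed_rpm_ostree_calls(sample)
print(calls)
```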

Expected results:

Should reconcile to Done, and have kernel-devel package installed on worker nodes.

Additional info:

Note: applied `usbguard` extension as a sanity check, and it did work.

Comment 1 Sinny Kumari 2020-10-06 12:08:09 UTC
The current error logging is not very useful for identifying the root cause of the problem. We have an ongoing patch (https://github.com/openshift/machine-config-operator/pull/2097) to improve it.
Can you please provide must-gather or at least complete machine-config-daemon pod log and rpm-ostreed service log (journalctl -u rpm-ostreed) for the worker node where extensions failed to apply?
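
For reference, the two logs requested here are typically collected with oc logs and oc debug. The sketch below only assembles the command lines and does not run them. The node name worker-0 and the pod name machine-config-daemon-abcde are placeholders, and the helper log_collection_commands is hypothetical.

```python
import shlex

MCO_NAMESPACE = "openshift-machine-config-operator"


def log_collection_commands(node, mcd_pod):
    """Return argv lists for fetching the MCD pod log and the rpm-ostreed journal."""
    return [
        # Full machine-config-daemon container log from the affected node's pod.
        ["oc", "logs", "-n", MCO_NAMESPACE, mcd_pod, "-c", "machine-config-daemon"],
        # rpm-ostreed service journal, read from the host via a debug pod.
        ["oc", "debug", f"node/{node}", "--",
         "chroot", "/host", "journalctl", "-u", "rpm-ostreed"],
    ]


# Placeholder names; substitute the Degraded node and its daemon pod.
commands = log_collection_commands("worker-0", "machine-config-daemon-abcde")
for argv in commands:
    print(shlex.join(argv))
```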

Comment 2 Colin Walters 2020-10-06 14:13:16 UTC
I don't think we should support installing `kernel-devel` as an extension.  It's just there so other tools can get it from the machine-os-content on their own and use it as part of a kernel module build process.

Installing it on the host would also require other dependencies we don't want to ship in the host.

Comment 3 Evan Dunn 2020-10-06 16:00:27 UTC
Created attachment 1719444 [details]
journalctl -u rpm-ostreed

Comment 4 Evan Dunn 2020-10-06 16:01:12 UTC
Created attachment 1719445 [details]
machine-config-daemon of Degraded node

Comment 5 Sinny Kumari 2020-10-06 16:48:33 UTC
Relevant error log from rpm-ostreed service:

Oct 05 16:28:29 <hostname> rpm-ostree[18499]: Preparing pkg txn; enabled repos: ['coreos-extensions'] solvables: 8
Oct 05 16:28:29 <hostname> rpm-ostree[18499]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos failed: Could not depsolve transaction; 1 problem detected:
                                                                                         Problem: conflicting requests
                                                                                          - nothing provides perl-interpreter needed by kernel-devel-4.18.0-193.19.1.el8_2.ppc64le
Oct 05 16:28:29 <hostname> rpm-ostree[18499]: client(id:machine-config-operator dbus:1.224 unit:crio-0d450d31f7b532bccb5145ec3a1877fc79f1b629e9fe236202044f330109483f.scope uid:0) vanished; remaining=0

It seems perl-interpreter is not available in base RHCOS on ppc64le, and hence the install is failing. I believe we do not support kernel-devel as an extension; it is primarily available for the use case which Colin mentioned.
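
The unsatisfied dependency can be extracted from the rpm-ostreed journal automatically. Here is a small sketch, assuming the depsolve message keeps the "nothing provides X needed by Y" wording seen in the log above. The function name missing_dependencies is hypothetical.

```python
import re

# Matches: nothing provides <dependency> needed by <package>
MISSING_RE = re.compile(r"nothing provides (?P<dep>.+?) needed by (?P<pkg>\S+)")


def missing_dependencies(journal_text):
    """Return (missing dependency, requiring package) pairs from depsolve errors."""
    return [(m.group("dep"), m.group("pkg")) for m in MISSING_RE.finditer(journal_text)]


sample = """\
Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos failed: Could not depsolve transaction; 1 problem detected:
 Problem: conflicting requests
  - nothing provides perl-interpreter needed by kernel-devel-4.18.0-193.19.1.el8_2.ppc64le
"""
missing = missing_dependencies(sample)
print(missing)
```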

@Steve What are your thoughts?

Comment 6 Antonio Murdaca 2020-10-07 12:13:44 UTC
Moving to RHCOS to rebuild with that missing package, I guess. Sorry if that's wrong, but it's not super clear who owns the ppc64le base RHCOS.

Comment 7 Steve Milner 2020-10-07 12:59:44 UTC
RHCOS is the correct location being that it's a missing package in extensions. I'll ask someone from multiarch to take a look as well.

Comment 9 Steve Milner 2020-10-07 14:15:26 UTC
It was identified that one architecture (ppc64le) doesn't require perl-interpreter and thus the install fails due to a missing dependency. Prashanth is working on an update now.

Comment 13 Micah Abbott 2020-10-08 19:03:31 UTC
The fix landed in the ppc64le RHCOS build 46.82.202010081539-0

Waiting for the build to get promoted to a 4.6 nightly release payload - https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6-ppc64le

Comment 14 Prashanth Sundararaman 2020-10-09 15:23:37 UTC
Tested with the latest ppc64le nightly, 4.6.0-0.nightly-ppc64le-2020-10-09-033704, and it works.

Comment 17 errata-xmlrpc 2020-10-27 16:47:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

