Bug 1859269 - [4.6] rpm-ostree crashes when deploying realtime kernel
Summary: [4.6] rpm-ostree crashes when deploying realtime kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.6.0
Assignee: Jonathan Lebon
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1861786
Blocks: 1860926
TreeView+ depends on / blocked
 
Reported: 2020-07-21 15:25 UTC by To Hung Sze
Modified: 2020-10-27 16:17 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:16:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
must-gather log (41.23 KB, application/gzip)
2020-07-21 15:25 UTC, To Hung Sze
no flags Details
ostreed.service (25.41 KB, text/plain)
2020-07-22 15:30 UTC, To Hung Sze
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github coreos rpm-ostree pull 2178 0 None closed daemon/deploy: Finish OstreeAsyncProgress after pull 2020-11-24 19:15:17 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:17:00 UTC

Description To Hung Sze 2020-07-21 15:25:08 UTC
Created attachment 1701927 [details]
must-gather log

Description of problem:
Specifying realkernel and OpenShift cluster installation fails.

Steps to Reproduce:
1.
Create a MachineConfig that sets kernelType to realtime

cat > <installation_directory>/openshift/99-master-kerneltype.yaml <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: 99-master-kerneltype
spec:
  kernelType: realtime
EOF
cat > <installation_directory>/openshift/99-worker-kerneltype.yaml <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime
EOF

2.
Install ipi cluster.


Actual results:
Installation fails with only bootstrap and master created - no worker.

On bootstrap:
[#162] failed to create some manifests:
p.c.openshift-qe.internal bootkube.sh[2299]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"

On master:
config-daemon[2057]: I0721 08:31:15.928575    2057 rpm-ostree.go:252] Pivoting to: 46.82.202007171740-0 (51be2cbe77b26ec90d4f4caa956615>
config-daemon[2057]: I0721 08:31:44.222388    2057 update.go:1462] Initiating switch from kernel realtime to default
config-daemon[2057]: I0721 08:31:44.228724    2057 rpm-ostree.go:360] Running captured: rpm-ostree status --json
config-daemon[2057]: I0721 08:31:44.313764    2057 update.go:866] Updating files
config-daemon[2057]: I0721 08:31:44.314472    2057 update.go:903] Deleting stale data
config-daemon[2057]: error: error rolling back Real time Kernel No kernel-rt package installed on host: exit status 1
1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
1]: Failed to start Machine Config Daemon Firstboot.
1]: machine-config-daemon-firstboot.service: Consumed 2min 14.861s CPU time

This is the error (from above):
error: error rolling back Real time Kernel No kernel-rt package installed on host: exit status 1


Additional info:
Tried with 4.6 nightly 7.20-093546 (and a few earlier builds)

Comment 1 Sinny Kumari 2020-07-21 16:18:47 UTC
Since cluster failed a bit early, must-gather didn't capture journal logs. Since we are trying to switch to realtime kernel during install time, detailed log will be in machine-config-daemon-firstboot.service . Can you please provide also provide journal log from one of the worker node for rpm-ostreed.service and machine-config-daemon-firstboot.service .

Comment 2 To Hung Sze 2020-07-21 18:24:43 UTC
from one of the master machines:
$ journalctl --unit=machine-config-daemon-firstboot.service
-- Logs begin at Tue 2020-07-21 16:50:14 UTC, end at Tue 2020-07-21 18:21:21 UTC. --
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Firstboot...
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.148869    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.415015    2046 daemon.go:220] Booted osImageURL: >
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.471123    2046 daemon.go:227] Installed Ignition >
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.472624    2046 update.go:284] Checking Reconcilab>
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.474187    2046 update.go:1462] Starting update fr>
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.478878    2046 update.go:866] Updating files
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.479336    2046 update.go:903] Deleting stale data
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.481999    2046 update.go:1462] Initiating switch >
Jul 21 16:55:34 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:55:34.485091    2046 run.go:18] Running: podman pull -q>
Jul 21 16:56:05 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: 2381dd334aaf1b9c25351bc5a9ac7a48baf063fbfcf3c8b53d94526c7fe45588
Jul 21 16:56:05 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:56:05.639197    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:56:05 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:56:05.781138    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:56:05 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:56:05.899125    2046 update.go:1462] Switching to kerne>
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.554239    2046 update.go:1462] Deleted container >
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.559825    2046 update.go:1320] Updating OS to qua>
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.559855    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.598968    2046 rpm-ostree.go:159] Current origin >
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.599124    2046 run.go:18] Running: podman pull -q>
Jul 21 16:58:44 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: 2381dd334aaf1b9c25351bc5a9ac7a48baf063fbfcf3c8b53d94526c7fe45588
Jul 21 16:58:44 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:44.227291    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:58:44 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:44.348389    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:58:44 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:44.503162    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:58:44 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:44.622152    2046 rpm-ostree.go:252] Pivoting to: 46>
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:59:13.336235    2046 update.go:1462] Initiating switch >
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:59:13.342546    2046 rpm-ostree.go:360] Running capture>
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:59:13.424375    2046 update.go:866] Updating files
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:59:13.425196    2046 update.go:903] Deleting stale data
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: error: error rolling back Real time Kernel No kernel-rt package >
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status>
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal systemd[1]: Failed to start Machine Config Daemon Firstboot.
Jul 21 16:59:13 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal systemd[1]: machine-config-daemon-firstboot.service: Consumed 2min 11.916s CPU time

Comment 3 To Hung Sze 2020-07-21 18:25:08 UTC
There is no worker node created.

Comment 4 Micah Abbott 2020-07-21 21:09:01 UTC
This is probably related to a kernel vs kernel-rt NVR mis-match.

From the RHCOS build logs, we are using `kernel-4.18.0-211.el8.x86_64` in RHCOS itself.  But the we are shipping `kernel-rt-4.18.0-193.13.2.rt13.65.el8_2` in the machine-os-content.

@Sinny if you agree, re-assign to RHCOS and we'll handle it.

Comment 5 Sinny Kumari 2020-07-22 05:39:10 UTC
Umm, I think this issue happens when we overlay a package first and then do rebase in same deployment. This is most likely a regression from rpm-ostree or ostree update

I have seen this problem happening while working on adding extensions https://github.com/openshift/machine-config-operator/pull/1941 . To overcome this in the PR I started doing rebase first and then layer packages which is actually good thing to do since layering package will resolve its dependency from the latest version available in staged deployment.

From the journal log, the actual error is a bit tricky. The problem started happening a bit earlier:

Jul 21 16:56:05 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:56:05.899125    2046 update.go:1462] Switching to kerne>  <----
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.554239    2046 update.go:1462] Deleted container >
Jul 21 16:58:14 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:14.559825    2046 update.go:1320] Updating OS to qua
...
Jul 21 16:58:44 tszegcp72120a-hbtss-master-0.c.openshift-qe.internal machine-config-daemon[2046]: I0721 16:58:44.622152    2046 rpm-ostree.go:252] Pivoting to: 46>   <-----

Since this is install time kernel-rt switch, kernel-rt switch happened first and then we imitated pivot to the new OSImageURL which I suspect in this case didn't succeed. As a result MCO initiated rollback of applied MachineConfigs. While doing rollback it failed because we are trying to query for installed kernel-rt packages which is ofcourse not yet installed because layered package is still in staged deployment.

@micah yes, it will be good to move to RHCOS to see what changes could have caused rebase behavior change. I believe day1 rt-kernel switch has been working fine during 4.5.

@tsze Some of the important messages is not visible in journal log as it opens in `less` . For better clarity, can you please provide again journal log of machine-config-daemon-firstboot.service and rpm-ostreed.service.
Try something like:
$ journalctl --unit=machine-config-daemon-firstboot.service | cat
$ journalctl --unit=rpm-ostreed.service | cat

Comment 6 Micah Abbott 2020-07-22 13:21:06 UTC
A workaround would be to complete the install *without* using the MachineConfig that sets `kernelType=realtime` and then switching the kernel as part of a day 2 operation.

Setting as medium priority and targeting for 4.6

Comment 7 Jonathan Lebon 2020-07-22 14:10:38 UTC
(In reply to Sinny Kumari from comment #5)
> Umm, I think this issue happens when we overlay a package first and then do
> rebase in same deployment. This is most likely a regression from rpm-ostree
> or ostree update
> 
> I have seen this problem happening while working on adding extensions
> https://github.com/openshift/machine-config-operator/pull/1941 . To overcome
> this in the PR I started doing rebase first and then layer packages which is
> actually good thing to do since layering package will resolve its dependency
> from the latest version available in staged deployment.

It's not just a workaround, it's the correct behaviour. :)  Otherwise we'd essentially be trying to do a depsolve on a pkgset that we might've not tested at all. The extensions are made to match the OSTree in the same oscontainer, not whatever happens to be the customer's bootstrap version.

That said, it's not 100% clear if that's the issue here. It did successfully stage the kernel-rt switchover and then something went wrong during the pivot. We need the rpm-ostreed logs (or even better the full journal) to better diagnose.

> As a result MCO initiated rollback of applied MachineConfigs. While doing
> rollback it failed because we are trying to query for installed kernel-rt
> packages which is ofcourse not yet installed because layered package is
> still in staged deployment.

Hmm yeah I think `installedRTKernelRpmsOnHost()` should probably be querying the staged deployment instead? But actually, if a failure happens before rebooting, the MCD should be able to just wipe the staged deployment (`rpm-ostree cleanup -p`).

> @micah yes, it will be good to move to RHCOS to see what changes could have
> caused rebase behavior change. I believe day1 rt-kernel switch has been
> working fine during 4.5.

Weird, AFAIK nothing in rpm-ostree changed recently wrt package layering handling. Note also that both 4.5 and 4.6 currently ship the same rpm-ostree version.

Comment 8 To Hung Sze 2020-07-22 15:22:31 UTC
journalctl --unit=machine-config-daemon-firstboot.service | cat:

-- Logs begin at Wed 2020-07-22 13:02:31 UTC, end at Wed 2020-07-22 15:20:40 UTC. --
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Firstboot...
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.107537    2050 rpm-ostree.go:360] Running captured: rpm-ostree status --json
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.367292    2050 daemon.go:220] Booted osImageURL:  (46.82.202007051540-0)
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.430824    2050 daemon.go:227] Installed Ignition binary version: 0.35.1
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.432461    2050 update.go:284] Checking Reconcilable for config mco-empty-mc to rendered-master-92a047670efa0f674d3138da376704b4
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.434028    2050 update.go:1462] Starting update from mco-empty-mc to rendered-master-92a047670efa0f674d3138da376704b4: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:true}
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.438563    2050 update.go:866] Updating files
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.439163    2050 update.go:903] Deleting stale data
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.442480    2050 update.go:1462] Initiating switch from kernel default to realtime
Jul 22 13:07:43 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:07:43.445845    2050 run.go:18] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dcfa26c63a08d65feb11f621df29719b4d46b7999c76315c137a21398830fba
Jul 22 13:08:17 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: 2381dd334aaf1b9c25351bc5a9ac7a48baf063fbfcf3c8b53d94526c7fe45588
Jul 22 13:08:17 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:08:17.291545    2050 rpm-ostree.go:360] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name mcd-3ad80762-7ed6-4faf-b943-258267af1c25 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dcfa26c63a08d65feb11f621df29719b4d46b7999c76315c137a21398830fba
Jul 22 13:08:17 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:08:17.449506    2050 rpm-ostree.go:360] Running captured: podman mount 90035ffa353b52be1a3f0656c09c1cb6a127b42ea68db1a90bc10fcf94c67832
Jul 22 13:08:17 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:08:17.567684    2050 update.go:1462] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "/var/lib/containers/storage/overlay/6db296ceafe90a6b4461b4fe1c2128c7a31d5da32735d971f4142c61cc291173/merged/kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm" "--install" "/var/lib/containers/storage/overlay/6db296ceafe90a6b4461b4fe1c2128c7a31d5da32735d971f4142c61cc291173/merged/kernel-rt-kvm-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm" "--install" "/var/lib/containers/storage/overlay/6db296ceafe90a6b4461b4fe1c2128c7a31d5da32735d971f4142c61cc291173/merged/kernel-rt-modules-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm" "--install" "/var/lib/containers/storage/overlay/6db296ceafe90a6b4461b4fe1c2128c7a31d5da32735d971f4142c61cc291173/merged/kernel-rt-modules-extra-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm"]
Jul 22 13:10:26 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:26.867576    2050 update.go:1462] Deleted container and removed OSContainer image
Jul 22 13:10:26 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:26.872656    2050 update.go:1320] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dcfa26c63a08d65feb11f621df29719b4d46b7999c76315c137a21398830fba
Jul 22 13:10:26 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:26.872687    2050 rpm-ostree.go:360] Running captured: rpm-ostree status --json
Jul 22 13:10:26 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:26.913459    2050 rpm-ostree.go:159] Current origin is not custom
Jul 22 13:10:26 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:26.913630    2050 run.go:18] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dcfa26c63a08d65feb11f621df29719b4d46b7999c76315c137a21398830fba
Jul 22 13:10:57 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: 2381dd334aaf1b9c25351bc5a9ac7a48baf063fbfcf3c8b53d94526c7fe45588
Jul 22 13:10:57 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:57.704513    2050 rpm-ostree.go:360] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dcfa26c63a08d65feb11f621df29719b4d46b7999c76315c137a21398830fba
Jul 22 13:10:57 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:57.821724    2050 rpm-ostree.go:360] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name ostree-container-pivot-26a62e78-c35d-43ba-8319-52382dea2c84 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dcfa26c63a08d65feb11f621df29719b4d46b7999c76315c137a21398830fba
Jul 22 13:10:57 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:57.972459    2050 rpm-ostree.go:360] Running captured: podman mount 13ece144809b6b4ba4c82e46eb6fcd8b0b6e9850eccdff22b90031947c9987ce
Jul 22 13:10:58 tsze72220-x7864-master-0.c.openshift-qe.internal machine-config-daemon[2050]: I0722 13:10:58.091518    2050 rpm-ostree.go:252] Pivoting to: 46.82.202007171740-0 (51be2cbe77b26ec90d4f4caa956615c202ede5e122b488ff53bb2eb777362e07)

Comment 9 To Hung Sze 2020-07-22 15:30:06 UTC
Created attachment 1702106 [details]
ostreed.service

ostreed.service as attached

Comment 10 To Hung Sze 2020-07-22 15:35:15 UTC
Please feel free to ping me for more info or access to the cluster.

Comment 11 Sinny Kumari 2020-07-22 17:06:04 UTC
(In reply to Jonathan Lebon from comment #7)
> Weird, AFAIK nothing in rpm-ostree changed recently wrt package layering
> handling. Note also that both 4.5 and 4.6 currently ship the same rpm-ostree
> version.

You are right, latest 4.5 bootimage has rpm-ostree-2020.2-2.el8.x86_64. I haven't ran kernel-rt as day1 with latest 4.5, maybe it will worth running a cluster here and check.
It will really nice if we can have ci test for day1 MachineConfig changes to check these issues in advance.

Comment 12 Jonathan Lebon 2020-07-22 18:00:10 UTC
Ahhh, we have a core dump:

Jul 22 13:11:37 tsze72220-x7864-master-1.c.openshift-qe.internal kernel: pool[16411]: segfault at 7fab14769480 ip 00007fab141f1879 sp 00007fab03ffbd08 error 7 in libglib-2.0.so.0.5600.4[7fab1415d000+116000]
...
Jul 22 13:11:37 tsze72220-x7864-master-1.c.openshift-qe.internal systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=11/SEGV
Jul 22 13:11:37 tsze72220-x7864-master-1.c.openshift-qe.internal systemd[1]: rpm-ostreed.service: Failed with result 'signal'.
Jul 22 13:11:37 tsze72220-x7864-master-1.c.openshift-qe.internal systemd[1]: rpm-ostreed.service: Consumed 2min 17.768s CPU time
Jul 22 13:11:38 tsze72220-x7864-master-1.c.openshift-qe.internal systemd-coredump[16565]: Process 2047 (rpm-ostree) of user 0 dumped core.

Stack trace of thread 16411:
#0  0x00007fab141f1879 g_mutex_lock (libglib-2.0.so.0)
#1  0x00007fab141a7778 g_source_unref_internal (libglib-2.0.so.0)
#2  0x00007fab15c20d3f idle_invoke_async_progress (libostree-1.so.1)
#3  0x00007fab141a6fbb g_idle_dispatch (libglib-2.0.so.0)
#4  0x00007fab141aa67d g_main_context_dispatch (libglib-2.0.so.0)
#5  0x00007fab141aaa48 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
#6  0x00007fab141aaae0 g_main_context_iteration (libglib-2.0.so.0)
#7  0x00005621d183255d relabel_if_necessary (rpm-ostree)
#8  0x00005621d183917e rpmostree_context_assemble (rpm-ostree)
#9  0x00005621d1872246 rpmostree_sysroot_upgrader_deploy (rpm-ostree)
#10 0x00005621d186abd0 deploy_transaction_execute (rpm-ostree)
#11 0x00005621d185d36d transaction_execute_thread (rpm-ostree)
#12 0x00007fab1475d377 g_task_thread_pool_thread (libgio-2.0.so.0)
#13 0x00007fab141d3753 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#14 0x00007fab141d2d4a g_thread_proxy (libglib-2.0.so.0)
#15 0x00007fab131222de start_thread (libpthread.so.0)
#16 0x00007fab12e53e83 __clone (libc.so.6)

Stack trace of thread 16560:
#0  0x00007fab12e4446d __fxstatat64 (libc.so.6)
#1  0x00007fab15c34659 _ostree_repo_load_file_bare (libostree-1.so.1)
#2  0x00007fab15c34f8e ostree_repo_load_file (libostree-1.so.1)
#3  0x00007fab15c39e13 checkout_one_file_at (libostree-1.so.1)
#4  0x00007fab15c3b923 checkout_tree_at_recurse (libostree-1.so.1)
#5  0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#6  0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#7  0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#8  0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#9  0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#10 0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#11 0x00007fab15c3bee0 checkout_tree_at_recurse (libostree-1.so.1)
#12 0x00007fab15c3c646 checkout_tree_at (libostree-1.so.1)
#13 0x00007fab15c3ca87 ostree_repo_checkout_at (libostree-1.so.1)
#14 0x00005621d1831d6b relabel_in_thread_impl.constprop.50 (rpm-ostree)
#15 0x00005621d18321d6 relabel_in_thread (rpm-ostree)
#16 0x00007fab1475d377 g_task_thread_pool_thread (libgio-2.0.so.0)
#17 0x00007fab141d3753 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#18 0x00007fab141d2d4a g_thread_proxy (libglib-2.0.so.0)
#19 0x00007fab131222de start_thread (libpthread.so.0)
#20 0x00007fab12e53e83 __clone (libc.so.6)

...

Stack trace of thread 16563:
#0  0x00007fab12e4472b mkdirat (libc.so.6)
#1  0x00007fab15c3b727 checkout_tree_at_recurse (libostree-1.so.1)
#2  0x00007fab15c3c646 checkout_tree_at (libostree-1.so.1)
#3  0x00007fab15c3ca87 ostree_repo_checkout_at (libostree-1.so.1)
#4  0x00005621d1831d6b relabel_in_thread_impl.constprop.50 (rpm-ostree)
#5  0x00005621d18321d6 relabel_in_thread (rpm-ostree)
#6  0x00007fab1475d377 g_task_thread_pool_thread (libgio-2.0.so.0)
#7  0x00007fab141d3753 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#8  0x00007fab141d2d4a g_thread_proxy (libglib-2.0.so.0)
#9  0x00007fab131222de start_thread (libpthread.so.0)
#10 0x00007fab12e53e83 __clone (libc.so.6)

Stack trace of thread 16561:
#0  0x00007fab12e51c2e fsetxattr (libc.so.6)
#1  0x00007fab15c9bd71 glnx_fd_set_all_xattrs (libostree-1.so.1)
#2  0x00007fab15c3b81f checkout_tree_at_recurse (libostree-1.so.1)
#3  0x00007fab15c3c646 checkout_tree_at (libostree-1.so.1)
#4  0x00007fab15c3ca87 ostree_repo_checkout_at (libostree-1.so.1)
#5  0x00005621d1831d6b relabel_in_thread_impl.constprop.50 (rpm-ostree)
#6  0x00005621d18321d6 relabel_in_thread (rpm-ostree)
#7  0x00007fab1475d377 g_task_thread_pool_thread (libgio-2.0.so.0)
#8  0x00007fab141d3753 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#9  0x00007fab141d2d4a g_thread_proxy (libglib-2.0.so.0)
#10 0x00007fab131222de start_thread (libpthread.so.0)
#11 0x00007fab12e53e83 __clone (libc.so.6)

Stack trace of thread 16562:
#0  0x00007fab12e4472b mkdirat (libc.so.6)
#1  0x00007fab15c3b727 checkout_tree_at_recurse (libostree-1.so.1)
#2  0x00007fab15c3c646 checkout_tree_at (libostree-1.so.1)
#3  0x00007fab15c3ca87 ostree_repo_checkout_at (libostree-1.so.1)
#4  0x00005621d1831d6b relabel_in_thread_impl.constprop.50 (rpm-ostree)
#5  0x00005621d18321d6 relabel_in_thread (rpm-ostree)
#6  0x00007fab1475d377 g_task_thread_pool_thread (libgio-2.0.so.0)
#7  0x00007fab141d3753 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#8  0x00007fab141d2d4a g_thread_proxy (libglib-2.0.so.0)
#9  0x00007fab131222de start_thread (libpthread.so.0)
#10 0x00007fab12e53e83 __clone (libc.so.6)



Digging...

Comment 13 Jonathan Lebon 2020-07-31 19:27:28 UTC
Sorry, got mixed up.

This is waiting on https://projects.engineering.redhat.com/browse/CLOUDBLD-2104.

Comment 14 Jonathan Lebon 2020-08-10 14:26:49 UTC
In latest RHCOS 4.6 builds.

Comment 17 Micah Abbott 2020-08-12 14:22:22 UTC
Verified with 4.6.0-0.nightly-2020-08-12-071533

```
$ ./openshift-install create install-configs --dir=cluster

$ ./openshift-install create manifests --dir=cluster

$ cat machineConfigs/worker-realtime.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime

$ cat machineConfigs/master-realtime.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: 99-master-kerneltype
spec:
  kernelType: realtime

$ cp machineConfigs/*-realtime.yaml cluster/manifests/

$ ./openshift-install --dir=cluster --log-level=debug create cluster

$ $ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-12-071533   True        False         31s     Cluster version is 4.6.0-0.nightly-2020-08-12-071533

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-139-89.us-west-2.compute.internal    Ready    worker   8m11s   v1.19.0-rc.2+edbf229-dirty                                                                                                                    
ip-10-0-142-22.us-west-2.compute.internal    Ready    master   21m     v1.19.0-rc.2+edbf229-dirty
ip-10-0-165-202.us-west-2.compute.internal   Ready    worker   8m29s   v1.19.0-rc.2+edbf229-dirty
ip-10-0-167-103.us-west-2.compute.internal   Ready    master   22m     v1.19.0-rc.2+edbf229-dirty
ip-10-0-210-82.us-west-2.compute.internal    Ready    master   22m     v1.19.0-rc.2+edbf229-dirty
ip-10-0-212-70.us-west-2.compute.internal    Ready    worker   8m44s   v1.19.0-rc.2+edbf229-dirty

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
00-worker                                          4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m             
01-master-container-runtime                        4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
01-master-kubelet                                  4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
01-worker-container-runtime                        4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
01-worker-kubelet                                  4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
99-master-generated-registries                     4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
99-master-kerneltype                                                                                            31m
99-master-ssh                                                                                 3.1.0             31m
99-worker-generated-registries                     4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m
99-worker-kerneltype                                                                                            31m
99-worker-ssh                                                                                 3.1.0             31m
rendered-master-b71d0fb8c61d50f6d2363fbeaab57165   4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m           
rendered-worker-8deff4de3b484f57cace925f88f1b214   4865f7831e691d984623c52165a085444a6875ee   3.1.0             20m

$ oc debug node/ip-10-0-139-89.us-west-2.compute.internal
Starting pod/ip-10-0-139-89us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.139.89
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e519481ef0ad982c860b036907d0e9622c524b59ea691e9ea85d7836d345a6e
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202008111934-0 (2020-08-11T19:38:09Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-211.el8
           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules kernel-rt-modules-extra

  ostree://a5f61bca284efdd814e71e250a77afcbdc302f6eadee9836013d29fcae89e234
                   Version: 46.82.202008111140-0 (2020-08-11T11:44:32Z)
sh-4.4# rpm -q rpm-ostree
rpm-ostree-2020.4-1.el8.x86_64
sh-4.4# uname -a
Linux ip-10-0-139-89 4.18.0-211.rt5.23.el8.x86_64 #1 SMP PREEMPT_RT Thu Jun 4 13:01:45 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...


$ oc debug node/ip-10-0-142-22.us-west-2.compute.internal 
Starting pod/ip-10-0-142-22us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.142.22
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e519481ef0ad982c860b036907d0e9622c524b59ea691e9ea85d7836d345a6e
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202008111934-0 (2020-08-11T19:38:09Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-211.el8
           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules kernel-rt-modules-extra

  ostree://a5f61bca284efdd814e71e250a77afcbdc302f6eadee9836013d29fcae89e234
                   Version: 46.82.202008111140-0 (2020-08-11T11:44:32Z)
sh-4.4# rpm -q rpm-ostree
rpm-ostree-2020.4-1.el8.x86_64
sh-4.4# rpm -q kernel-rt-core
kernel-rt-core-4.18.0-211.rt5.23.el8.x86_64
sh-4.4# uname -a
Linux ip-10-0-142-22 4.18.0-211.rt5.23.el8.x86_64 #1 SMP PREEMPT_RT Thu Jun 4 13:01:45 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Comment 19 errata-xmlrpc 2020-10-27 16:16:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196


Note You need to log in before you can comment on or make changes to this bug.