1820401 – blocking gcp-op failure: Failed to execute rpm-ostree

Summary: blocking gcp-op failure: Failed to execute rpm-ostree

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Sinny Kumari
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-03 01:11 UTC by Kirsten Garrison
Modified:	2020-07-13 17:25 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:25:20 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1612	0	None	closed	Bug 1820401: daemon: better way to find installed kernel-rt packages on host	2020-06-23 09:47:43 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:25:41 UTC

Description Kirsten Garrison 2020-04-03 01:11:31 UTC

Description of problem:
The gcp-op job failed for all of April 2, 2020. Looking through the MCD logs, I see a few tests taking a long time; the node never seems to finish its update and reboot, and it finally gives up with this error:

Marking Degraded due to: Failed to execute rpm-ostree ["override" "reset" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--uninstall" "kernel-rt-core" "--uninstall" "kernel-rt-modules" "--uninstall" "kernel-rt-modules-extra"] : exit status 1

All runs:
https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

Not sure what happened, but we need to dig into this because it is blocking PRs from merging.

Examples: 
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1601/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1666/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-rlwzn_machine-config-daemon.log
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1602/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1665/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-pnvn7_machine-config-daemon.log
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1602/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1645/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-tm8n7_machine-config-daemon.log

Comment 1 Colin Walters 2020-04-03 01:54:58 UTC

Hum....this might be fallout from https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/877

Comment 2 Sinny Kumari 2020-04-03 07:23:27 UTC

Right, this is happening because additional kernel-rt packages are now being shipped in machine-os-content.

Currently, during the switch to the RT kernel, we install every kernel-rt package available in machine-os-content (https://github.com/openshift/machine-config-operator/blob/db561314c7afae1d77c16cfdb95f0f0ce6b8977d/pkg/daemon/update.go#L721 ). As a result, switching to kernel-rt works fine and installs all kernel-rt packages. But during rollback to the traditional kernel, we rely on a fixed list of kernel-rt packages to uninstall, which doesn't take the additional packages into account (https://github.com/openshift/machine-config-operator/blob/db561314c7afae1d77c16cfdb95f0f0ce6b8977d/pkg/daemon/update.go#L663). This causes the rollback to fail, and the node goes into a degraded state.

We will need a better way to identify the layered kernel-rt packages; maybe it is time to parse the output of `rpm-ostree status --json`.
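
A rough sketch of what parsing that output could look like, to illustrate the idea only; it is not the change that eventually merged. It assumes the JSON has a top-level `deployments` array whose entries carry `booted`, `packages`, and `requested-local-packages` fields, and that layered entries can be matched by a `kernel-rt` prefix. Those field names should be checked against real `rpm-ostree status --json` output. The helper names `layeredKernelRTPackages` and `rollbackArgs` are hypothetical, and the sample JSON is hand-written rather than captured from a node.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// deployment mirrors only the fields this sketch reads from one entry of
// the "deployments" array. The field names are assumptions to verify
// against real `rpm-ostree status --json` output.
type deployment struct {
	Booted                 bool     `json:"booted"`
	Packages               []string `json:"packages"`
	RequestedLocalPackages []string `json:"requested-local-packages"`
}

type status struct {
	Deployments []deployment `json:"deployments"`
}

// layeredKernelRTPackages (hypothetical helper) returns every layered entry
// of the booted deployment that starts with "kernel-rt".
func layeredKernelRTPackages(statusJSON []byte) ([]string, error) {
	var st status
	if err := json.Unmarshal(statusJSON, &st); err != nil {
		return nil, fmt.Errorf("parsing rpm-ostree status: %w", err)
	}
	for _, d := range st.Deployments {
		if !d.Booted {
			continue
		}
		// Copy before appending so the decoded slices are not modified.
		layered := append(append([]string{}, d.Packages...), d.RequestedLocalPackages...)
		var found []string
		for _, pkg := range layered {
			if strings.HasPrefix(pkg, "kernel-rt") {
				found = append(found, pkg)
			}
		}
		return found, nil
	}
	return nil, fmt.Errorf("no booted deployment found")
}

// rollbackArgs (hypothetical helper) builds arguments in the same shape as
// the failing command in the description, but with one --uninstall per
// package that was actually found instead of a fixed list.
func rollbackArgs(rtPackages []string) []string {
	args := []string{"override", "reset", "kernel", "kernel-core", "kernel-modules", "kernel-modules-extra"}
	for _, pkg := range rtPackages {
		args = append(args, "--uninstall", pkg)
	}
	return args
}

func main() {
	// Hand-written sample input, not taken from a real node.
	sample := []byte(`{"deployments":[{"booted":true,"packages":["kernel-rt-core","kernel-rt-kvm","usbguard"]}]}`)
	pkgs, err := layeredKernelRTPackages(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rollbackArgs(pkgs))
}
```

The point of deriving the list from the host is that the rollback keeps working when machine-os-content starts shipping another kernel-rt package, without anyone having to update a list in the daemon.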

Comment 5 Michael Nguyen 2020-04-07 18:40:38 UTC

The tests at https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op are passing again, so I am considering this verified.

Comment 7 errata-xmlrpc 2020-07-13 17:25:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
