Description of problem:
For all of Apr 2, 2020, the gcp-op job was failing. Looking through the MCD logs, I see a few tests taking a long time. The node never seems to finish its update and reboot, and it finally gives up with this error:
Marking Degraded due to: Failed to execute rpm-ostree ["override" "reset" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--uninstall" "kernel-rt-core" "--uninstall" "kernel-rt-modules" "--uninstall" "kernel-rt-modules-extra"] : exit status 1
Not sure what happened yet, but we need to dig into this because it is blocking PRs from merging.
Hmm, this might be fallout from https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/877
Right, this is happening because additional kernel-rt packages are now shipped in machine-os-content.
Currently, during the switch to the RT kernel, we install every kernel-rt package available in machine-os-content (https://github.com/openshift/machine-config-operator/blob/db561314c7afae1d77c16cfdb95f0f0ce6b8977d/pkg/daemon/update.go#L721 ). As a result, switching to kernel-rt works fine and installs all of the kernel-rt packages, including the new ones. But during rollback to the traditional kernel, we rely on a fixed list of kernel-rt packages to uninstall, and that list does not account for the additional packages (https://github.com/openshift/machine-config-operator/blob/db561314c7afae1d77c16cfdb95f0f0ce6b8977d/pkg/daemon/update.go#L663). This causes the rollback to fail, and the node goes into a degraded state.
We will need a better way to find out which kernel-rt packages are actually layered; maybe it is time to parse the output of `rpm-ostree status --json`.
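A minimal sketch of that idea, not the actual MCO fix: read the booted deployment from the `rpm-ostree status --json` output, collect whatever kernel-rt packages are layered, and build the rollback arguments from that list instead of a fixed one. The JSON keys used here (`deployments`, `booted`, `packages`, `requested-local-packages`) are assumptions to verify against real output, the function names are hypothetical, and `kernel-rt-kvm` in the sample is only a stand-in for "an additional package". Local packages may be reported in a different form than a bare name, so the prefix match may need adjusting.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// deployment mirrors only the fields this sketch needs. The JSON keys are
// assumptions; check them against real `rpm-ostree status --json` output.
type deployment struct {
	Booted                 bool     `json:"booted"`
	Packages               []string `json:"packages"`
	RequestedLocalPackages []string `json:"requested-local-packages"`
}

type status struct {
	Deployments []deployment `json:"deployments"`
}

// layeredKernelRTPackages (hypothetical helper) returns the kernel-rt
// packages listed for the booted deployment, without duplicates.
func layeredKernelRTPackages(statusJSON []byte) ([]string, error) {
	var st status
	if err := json.Unmarshal(statusJSON, &st); err != nil {
		return nil, fmt.Errorf("parsing rpm-ostree status: %w", err)
	}
	for _, d := range st.Deployments {
		if !d.Booted {
			continue
		}
		var found []string
		seen := map[string]bool{}
		for _, list := range [][]string{d.Packages, d.RequestedLocalPackages} {
			for _, p := range list {
				if strings.HasPrefix(p, "kernel-rt") && !seen[p] {
					seen[p] = true
					found = append(found, p)
				}
			}
		}
		return found, nil
	}
	return nil, fmt.Errorf("no booted deployment found")
}

// rollbackArgs (hypothetical helper) builds the same rpm-ostree argument
// shape seen in the failing command, but from the discovered package list.
func rollbackArgs(layered []string) []string {
	args := []string{"override", "reset", "kernel", "kernel-core", "kernel-modules", "kernel-modules-extra"}
	for _, p := range layered {
		args = append(args, "--uninstall", p)
	}
	return args
}

func main() {
	// Hand-written sample, not captured from a real node.
	sample := []byte(`{"deployments":[{"booted":true,"packages":["kernel-rt-core","kernel-rt-kvm","kernel-rt-modules"]}]}`)
	pkgs, err := layeredKernelRTPackages(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(rollbackArgs(pkgs))
}
```

With this approach the rollback no longer depends on the daemon knowing in advance which kernel-rt packages machine-os-content ships, which is the assumption that broke here.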
The https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op tests are passing again, so I am considering this verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.