This bug was initially created as a copy of Bug #2057544.

I am copying this bug because we want to ship the fix to 4.7.

Description of problem:

The ARO team recently encountered a failed cluster upgrade caused by a timeout during "rpm-ostree rebase". The cause of the timeout is unknown, but all subsequent attempts failed with this error message in the pod log:

---
error running rpm-ostree rebase --experimental (...args omitted...): error: Transaction in progress: (null)
---

rpm-ostree recently clarified this error message to include "You can cancel the current transaction with `rpm-ostree cancel`" [1], which in this context implies machine-config-operator should be doing this on error. It currently does not [2], which can lead to the situation described above.

I suggest that, as part of its error recovery, machine-config-operator should attempt to run "rpm-ostree cancel" after a failed "rpm-ostree rebase" and any other transaction-based commands.

[1] https://github.com/coreos/rpm-ostree/commit/695312f
[2] https://github.com/openshift/machine-config-operator/blob/7c1ac8b51423448397bda3f349f56f6e94261b64/pkg/daemon/rpm-ostree.go#L299-L304
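To make that suggestion concrete, here is a minimal sketch in Go. The runRpmOstree helper and rebaseWithRecovery function are hypothetical names chosen for illustration; this is not the actual machine-config-operator code at [2], only one way the recovery could look:

package daemon

import (
	"fmt"
	"os/exec"
)

// runRpmOstree is a hypothetical wrapper that shells out to the
// rpm-ostree CLI and folds its combined output into the error.
func runRpmOstree(args ...string) error {
	out, err := exec.Command("rpm-ostree", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("error running rpm-ostree %v: %s: %w", args, out, err)
	}
	return nil
}

// rebaseWithRecovery runs the rebase and, if it fails, attempts a
// best-effort "rpm-ostree cancel" so a stuck transaction does not
// make every subsequent retry fail with "Transaction in progress".
func rebaseWithRecovery(rebaseArgs ...string) error {
	err := runRpmOstree(append([]string{"rebase"}, rebaseArgs...)...)
	if err != nil {
		if cancelErr := runRpmOstree("cancel"); cancelErr != nil {
			// Surface but do not mask the original rebase error.
			fmt.Printf("warning: failed to cancel pending transaction: %v\n", cancelErr)
		}
	}
	return err
}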
Hello,

I have tried to verify the BZ using the strace command as we did in 4.8. Unfortunately, the strace version shipped in 4.6 and 4.7 does not support the --fault and --inject options:

sh-4.4# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
strace: invalid option -- '-'
Try 'strace -h' for more information.

# strace -V
strace -- version 5.1
Copyright (c) 1991-2019 The strace developers <https://strace.io>.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Optional features enabled: stack-trace=libdw stack-demangle m32-mpers mx32-mpers

Do you know if there is any way to reproduce this error with this strace version?

Thank you very much!
It should work to pull a newer strace binary from RHEL; you're not restricted to the stock one. `rpm-ostree usroverlay` will make /usr/ transiently writable; then get the latest build from https://access.redhat.com/errata/RHEA-2022:2026 or so and install it with `rpm -Uvh http://url/strace.rpm`.
Verified the BZ upgrading 4.6.59 to 4.7.

In order to verify the BZ we need to update the strace package version on the 4.6.59 cluster nodes.

To update the strace package on 4.6.59 nodes:

1) oc debug node/$NODENAME; chroot /host
2) curl http://mirror.centos.org/centos/8-stream/BaseOS/x86_64/os/Packages/strace-5.13-4.el8.x86_64.rpm -o /tmp/strace-5.13-4.el8.x86_64.rpm
3) rpm-ostree usroverlay
4) rpm-ostree override replace /tmp/strace-5.13-4.el8.x86_64.rpm
5) Reboot the node
6) Verify the strace version:

sh-4.4# strace -V
strace -- version 5.13

Reproducing the bug, upgrading 4.6.59 -> 4.7.54 (does not contain the fix):

1) Update the strace package to 5.13 as described above on the master and worker nodes
2) On the master and worker nodes, execute: strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
3) On the master and worker nodes, run "rpm-ostree upgrade"; the command will be stuck, type ctrl+z to continue
4) Upgrade to 4.7.54: oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.7.54-x86_64 --force --allow-explicit-upgrade
5) After the upgrade we can see the pools reporting degraded status because:

- lastTransitionTime: "2022-07-11T13:39:52Z"
  message: 'Node ip-10-0-133-239.us-east-2.compute.internal is reporting: "failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1d1872a8db06bd1a60c222a8c26d9615e295b6837fa140ea57c0690028858e2 : with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-523033659/srv/repo:db8b72f0b3fe7f887f9bbe64b2317eaff79aec45e7ee3e0d3f98be32253fb11f --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1d1872a8db06bd1a60c222a8c26d9615e295b6837fa140ea57c0690028858e2 --custom-origin-description Managed by machine-config-operator: exit status 1\nerror: Transaction in progress: (null)\n"'
  reason: 1 nodes are reporting degraded status on sync

6) Executing `systemctl restart rpm-ostreed` manually fixes the issue and the upgrade then finishes OK.
Verifying the fix, upgrading 4.6.59 -> 4.7.0-0.nightly-2022-07-08-193842 (contains the fix):

1) Update the strace package to 5.13 as described above
2) On the master and worker nodes, execute: strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
3) Run "rpm-ostree upgrade"; the command will be stuck, type ctrl+z to continue
4) Upgrade to 4.7.0-0.nightly-2022-07-08-193842: oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2022-07-08-193842 --force --allow-explicit-upgrade
5) The upgrade finishes OK without manual intervention
6) We can see in the MCD logs that the pending transaction is detected and the service is restarted to fix it:

I0711 22:45:54.654047   81279 start.go:108] Version: v4.7.0-202207081845.p0.g54116a9.assembly.stream-dirty (54116a94477f669acb9b00b43069a6c8b1ca6282)
I0711 22:45:54.656251   81279 start.go:121] Calling chroot("/rootfs")
I0711 22:45:54.656319   81279 update.go:1969] Running: systemctl start rpm-ostreed
I0711 22:45:54.664154   81279 rpm-ostree.go:325] Running captured: rpm-ostree status --json
W0711 22:45:54.700073   81279 rpm-ostree.go:129] Detected active transaction during daemon startup, restarting to clear it
I0711 22:45:54.700092   81279 update.go:1969] Running: systemctl restart rpm-ostreed
I0711 22:45:54.760683   81279 rpm-ostree.go:325] Running captured: rpm-ostree status --json
I0711 22:45:54.795165   81279 daemon.go:222] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedf46061ae5b3c7915b6ce5d1b7f4c0f0f5be761d467a95edc6b6e150d3b727 (46.82.202206080340-0)

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2022-07-08-193842   True        False         27s     Cluster version is 4.7.0-0.nightly-2022-07-08-193842

We move the status to VERIFIED.
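For reference, the detect-and-restart behavior visible in the MCD log above amounts to roughly the following Go sketch. This is a reconstruction from the log messages, not the actual rpm-ostree.go code; in particular, the "transaction" JSON field name is an assumption about the rpm-ostree status --json output:

package daemon

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// clearPendingTransaction mirrors what the log above shows: on daemon
// startup, capture "rpm-ostree status --json", and if a transaction is
// still active, restart rpm-ostreed so later rebase calls can proceed.
func clearPendingTransaction() error {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return fmt.Errorf("running rpm-ostree status --json: %w", err)
	}
	// Assumed shape: the status JSON carries a non-null "transaction"
	// entry while a transaction is in flight.
	var status struct {
		Transaction []string `json:"transaction"`
	}
	if err := json.Unmarshal(out, &status); err != nil {
		return fmt.Errorf("parsing rpm-ostree status: %w", err)
	}
	if status.Transaction != nil {
		fmt.Println("Detected active transaction during daemon startup, restarting to clear it")
		return exec.Command("systemctl", "restart", "rpm-ostreed").Run()
	}
	return nil
}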
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.55 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5660