Description of problem:

The ARO team recently encountered a failed cluster upgrade caused by a timeout during "rpm-ostree rebase". The cause of the timeout is unknown, but all subsequent attempts failed with this error message in the pod log:

---
error running rpm-ostree rebase --experimental (...args omitted...): error: Transaction in progress: (null)
---

rpm-ostree recently clarified this error message to include "You can cancel the current transaction with `rpm-ostree cancel`" [1], which in this context implies that machine-config-operator should be doing this on error. It currently does not [2], which can lead to the situation described above.

I suggest that, as part of its error recovery, machine-config-operator should attempt to run "rpm-ostree cancel" after a failed "rpm-ostree rebase" and any other transaction-based commands.

[1] https://github.com/coreos/rpm-ostree/commit/695312f
[2] https://github.com/openshift/machine-config-operator/blob/7c1ac8b51423448397bda3f349f56f6e94261b64/pkg/daemon/rpm-ostree.go#L299-L304
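To illustrate the suggested recovery, here is a minimal shell sketch; this is not the actual machine-config-operator code (which is Go and shells out to rpm-ostree), and REPO_SPEC plus the daemon-restart fallback are assumptions for illustration only:

if ! rpm-ostree rebase --experimental "${REPO_SPEC}"; then
    # A failed or interrupted rebase can leave a transaction registered in rpm-ostreed.
    # Clear it before retrying; fall back to restarting the daemon if cancel fails.
    rpm-ostree cancel || systemctl restart rpm-ostreed
    rpm-ostree rebase --experimental "${REPO_SPEC}"
fi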
Can we have a must-gather from any of the failed upgrade clusters?
Thanks Fabio, I didn't realize we had a must-gather already; I didn't have the support ticket case number.

Kirsten, if you search the "openshift-machine-config-operator" namespace in the must-gather logs for the string "error running rpm-ostree rebase", it shows what I described in comment #0:

error running rpm-ostree rebase ... error: Timeout was reached

followed by many attempts to retry, all failing with:

error running rpm-ostree rebase ... error: Transaction in progress: (null)
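For anyone else looking at the data: once the must-gather archive is extracted, a recursive grep is usually the quickest way to find these messages (the exact directory layout varies, so this is only a rough example):

grep -rl "error running rpm-ostree rebase" ./must-gather*/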
Likely a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1982389 (but feel free to reopen if evidence shows otherwise).

*** This bug has been marked as a duplicate of bug 1982389 ***
Actually, I'm going to reopen this and use it to cherry-pick the MCO-side workaround https://github.com/openshift/machine-config-operator/pull/2677 to 4.8 and probably further back.
To make this more visible: a workaround today is to log into the affected node and run `systemctl restart rpm-ostreed`.
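For example, via a debug pod instead of SSH (NODENAME is a placeholder for the affected node):

$ oc debug node/$NODENAME
# chroot /host
# systemctl restart rpm-ostreed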
Hi, do we have any idea when the fix for this issue will be available? Thanks.
The timing window to reproduce this on a relatively idle system is really small. A lot of race conditions like this are unfortunately hard to find until the code runs at scale across diverse, loaded environments.

OK, that said, I figured out an easy way to reproduce this with strace fault injection. Look at this strace:

[root@cosa-devsh ~]# strace -f -e connect rpm-ostree upgrade
strace: Process 1800 attached
[pid 1799] connect(5, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 110) = 0
strace: Process 1801 attached
[pid 1799] connect(7, {sa_family=AF_UNIX, sun_path=@"/tmp/rpm-ostree/dbus-Yp34qhCg"}, 32) = 0
Pinned to commit; no upgrade available
[pid 1801] +++ exited with 0 +++
[pid 1800] +++ exited with 0 +++
+++ exited with 0 +++
[root@cosa-devsh ~]#

You can see the client makes two connect() system calls. What we want is for the client to fail to connect on the second one. That's easy to do with strace fault injection:
https://archive.fosdem.org/2017/schedule/event/failing_strace/attachments/slides/1630/export/events/attachments/failing_strace/slides/1630/strace_fosdem2017_ta_slides.pdf

Run this:

strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade

Now subsequent operations should fail with that symptom:

[root@cosa-devsh ~]# rpm-ostree upgrade
error: Transaction in progress: (null)
[root@cosa-devsh ~]#

Then you can verify this BZ by (a quick status check for the wedged state is sketched after this list):
- Provision a cluster on a version before this bugfix
- Use oc debug node or ssh to run the above strace command to wedge the rpm-ostree daemon
- Start an upgrade and verify the MCD gets stuck
- Run systemctl restart rpm-ostreed on the affected node
- Upgrade to a version with this bugfix (i.e., a new MCD)
- Rerun the above command to wedge the rpm-ostree daemon again
- Do *another* upgrade; you should see in the logs that the MCD is restarting rpm-ostreed
- The upgrade to the third version should succeed
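Assuming the daemon has been wedged with the fault injection above, a quick way to confirm the stuck and cleared states on the node is `rpm-ostree status` (the exact output wording may differ between rpm-ostree versions):

# rpm-ostree status
State: busy     <-- a stuck/pending transaction keeps the daemon busy
...
# systemctl restart rpm-ostreed
# rpm-ostree status
State: idle     <-- restarting the daemon clears it
...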
> when should be the right moment to kill the pod?

Actually, now that I think about it, instead of injecting a fault you could do this on a node:

rpm-ostree usroverlay
mv /usr/bin/rpm-ostree{,.real}
cat > /usr/bin/rpm-ostree << 'EOF'
#!/bin/bash
set -xeuo pipefail
strace -o /tmp/trace --inject connect:delay_enter=30s:when=2 rpm-ostree.real "$@"
EOF
chmod a+x /usr/bin/rpm-ostree

That will *greatly* widen the window to reproduce the failure; you'd then need to ensure an MCD pod is killed while it's trying to do an OS upgrade or other change.
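To undo the wrapper afterwards, keep in mind that the usroverlay is transient and disappears on reboot; without rebooting, restoring the real binary should be enough (untested sketch):

mv -f /usr/bin/rpm-ostree.real /usr/bin/rpm-ostree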
Verified upgrading from 4.7.49 to 4.8.0-0.nightly-2022-04-27-190611.

Verify that the upgrade finishes properly if we manually restart the service in case of a "Transaction in progress" error:

1) In a 4.7.49 cluster, run this command on the nodes in order to force a "Transaction in progress" error:

$ oc debug node/$NODENAME
# chroot /host
# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade

2) Upgrade to a 4.8 version without the fix; we used 4.7.49 -> 4.8.0-0.nightly-2022-04-26-064801:

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2022-04-26-064801 --allow-explicit-upgrade --force

3) We can see that the MCDs report this error when they try to upgrade the nodes:

E0428 14:24:10.687057  341849 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e661713d2bd5ec3a1a3204900a65ababf7567cc6d24d65b906c50d1f491f4ec : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-655450606/srv/repo:7d092c228fa7078656c8aa2c8c171ed74870451810b1dbea3543cec42832457e --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e661713d2bd5ec3a1a3204900a65ababf7567cc6d24d65b906c50d1f491f4ec --custom-origin-description Managed by machine-config-operator: error: Transaction in progress: (null) : exit status 1

4) We manually restart the rpm-ostreed service on the failing nodes:

$ oc debug node/$NODENAME
# chroot /host
# systemctl restart rpm-ostreed

5) We repeat the operation on every node until all nodes are OK and the upgrade finishes OK:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-04-26-064801   True        False         81m     Cluster version is 4.8.0-0.nightly-2022-04-26-064801

Verify that the fix created for this issue solves the problem:

1) In a 4.7.49 cluster, run this command on the nodes in order to force a "Transaction in progress" error:

$ oc debug node/$NODENAME
# chroot /host
# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade

2) Upgrade to a 4.8 version with the fix; we are using 4.8.0-0.nightly-2022-04-27-190611:

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2022-04-27-190611 --allow-explicit-upgrade --force

3) Verify that the upgrade finishes OK and the nodes are not stuck reporting any failure.
We can see in the MCD logs that the pending transaction is detected and the service is restarted to fix it:

I0428 15:33:47.223595  143481 start.go:108] Version: v4.8.0-202204271537.p0.g58e7362.assembly.stream-dirty (58e73629d83330deae8b829b890b004de22b836b)
I0428 15:33:47.225812  143481 start.go:121] Calling chroot("/rootfs")
I0428 15:33:47.225898  143481 update.go:1859] Running: systemctl start rpm-ostreed
I0428 15:33:47.234278  143481 rpm-ostree.go:325] Running captured: rpm-ostree status --json
W0428 15:33:47.276261  143481 rpm-ostree.go:129] Detected active transaction during daemon startup, restarting to clear it
I0428 15:33:47.276282  143481 update.go:1859] Running: systemctl restart rpm-ostreed
I0428 15:33:47.393821  143481 rpm-ostree.go:325] Running captured: rpm-ostree status --json

There is no degraded MCP:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-baf50f5009da161217006b02e99f8268   True      False      False      3              3                   3                     0                      127m
worker   rendered-worker-289eccd43cdfff2a1cfd91e1faf59977   True      False      False      3              3                   3                     0                      127m

All nodes are OK:

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-134-121.us-east-2.compute.internal   Ready    master   128m   v1.21.8+ed4d8fd
ip-10-0-153-46.us-east-2.compute.internal    Ready    worker   120m   v1.21.8+ed4d8fd
ip-10-0-169-243.us-east-2.compute.internal   Ready    worker   122m   v1.21.8+ed4d8fd
ip-10-0-169-44.us-east-2.compute.internal    Ready    master   128m   v1.21.8+ed4d8fd
ip-10-0-206-110.us-east-2.compute.internal   Ready    worker   120m   v1.21.8+ed4d8fd
ip-10-0-221-60.us-east-2.compute.internal    Ready    master   128m   v1.21.8+ed4d8fd

And the upgrade finishes OK:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-04-27-190611   True        False         16s     Cluster version is 4.8.0-0.nightly-2022-04-27-190611

We move the issue to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.41 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:2272
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days