Bug 2057544 - Cancel rpm-ostree transaction after failed rebase [NEEDINFO]
Summary: Cancel rpm-ostree transaction after failed rebase
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.8.z
Assignee: Colin Walters
QA Contact: Rio Liu
Depends On: 2058738
Blocks: 2105082
Reported: 2022-02-23 15:28 UTC by Matthew Barnes
Modified: 2023-08-11 04:24 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-05-25 21:48:21 UTC
Target Upstream Version:
ychoukse: needinfo? (walters)

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2967 0 None Merged Bug 2057544: daemon: Explicitly start rpm-ostreed, restart if we detect active txn 2023-02-09 14:35:31 UTC
Red Hat Product Errata RHSA-2022:2272 0 None None None 2022-05-25 21:48:38 UTC

Description Matthew Barnes 2022-02-23 15:28:14 UTC
Description of problem:

The ARO team recently encountered a failed cluster upgrade caused by a timeout during "rpm-ostree rebase".  The cause of the timeout is unknown, but all subsequent attempts failed with this error message in the pod log:

error running rpm-ostree rebase --experimental (...args omitted...): error: Transaction in progress: (null)

rpm-ostree recently clarified this error message to include "You can cancel the current transaction with `rpm-ostree cancel`" [1], which in this context implies machine-config-operator should be doing this on error.

It currently does not [2], which can lead to the situation described above.

I suggest that, as part of error recovery, machine-config-operator attempt to run "rpm-ostree cancel" after a failed "rpm-ostree rebase" or any other transaction-based command.

[1] https://github.com/coreos/rpm-ostree/commit/695312f
[2] https://github.com/openshift/machine-config-operator/blob/7c1ac8b51423448397bda3f349f56f6e94261b64/pkg/daemon/rpm-ostree.go#L299-L304

Comment 1 Kirsten Garrison 2022-02-23 18:12:11 UTC
Can we have a must-gather from any of the failed upgrade clusters?

Comment 5 Matthew Barnes 2022-02-25 10:19:21 UTC
Thanks Fabio, I didn't realize we had a must-gather already.  Didn't have the support ticket case #.

Kirsten, if you search the "openshift-machine-config-operator" namespace in the must-gather logs for the string "error running rpm-ostree rebase", it shows what I described in comment #0:

    error running rpm-ostree rebase ... error: Timeout was reached

followed by many attempts to retry, all failing with:

    error running rpm-ostree rebase ... error: Transaction in progress: (null)

Comment 6 Colin Walters 2022-02-25 14:21:21 UTC
Likely a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1982389

(But feel free to reopen if evidence shows otherwise)

*** This bug has been marked as a duplicate of bug 1982389 ***

Comment 7 Colin Walters 2022-02-25 14:57:10 UTC
Actually I'm going to reopen this and use it to cherry pick the MCO-side workaround https://github.com/openshift/machine-config-operator/pull/2677 to 4.8 and probably farther back.

Comment 8 Colin Walters 2022-02-25 15:51:41 UTC
To make this more visible, a workaround today is to log into the affected node and just `systemctl restart rpm-ostreed`.
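The linked PR 2967 automates this workaround: on startup the MCD checks `rpm-ostree status --json` and restarts the daemon if a transaction is still active. A hedged Go sketch of such a detection check (it assumes, for illustration only, that the status JSON exposes a non-null top-level "transaction" field while a transaction is in flight; this is not the MCO's exact code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasActiveTxn reports whether an `rpm-ostree status --json` payload shows an
// in-progress transaction. Assumption (hypothetical, for illustration): the
// daemon publishes a non-null top-level "transaction" field while busy.
func hasActiveTxn(statusJSON []byte) (bool, error) {
	var status map[string]json.RawMessage
	if err := json.Unmarshal(statusJSON, &status); err != nil {
		return false, fmt.Errorf("parsing rpm-ostree status: %w", err)
	}
	txn, ok := status["transaction"]
	return ok && string(txn) != "null", nil
}
```

If such a check returned true at daemon startup, the MCD would run the equivalent of `systemctl restart rpm-ostreed` before proceeding, which is exactly the manual workaround above.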

Comment 9 faldana 2022-04-20 03:57:48 UTC
Hi all,

Do we have any idea when the fix for this issue will be available?


Comment 14 Colin Walters 2022-04-28 13:13:34 UTC
The timing window to reproduce this on a relatively idle system is really small.  A lot of race conditions like this are unfortunately hard to find until the code runs at scale across diverse, loaded environments.

OK that said, I figured out an easy way to use strace fault injection:

Look at this strace:

[root@cosa-devsh ~]# strace -f -e connect rpm-ostree upgrade
strace: Process 1800 attached
[pid  1799] connect(5, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 110) = 0
strace: Process 1801 attached
[pid  1799] connect(7, {sa_family=AF_UNIX, sun_path=@"/tmp/rpm-ostree/dbus-Yp34qhCg"}, 32) = 0
Pinned to commit; no upgrade available
[pid  1801] +++ exited with 0 +++
[pid  1800] +++ exited with 0 +++
+++ exited with 0 +++
[root@cosa-devsh ~]# 

You can see the client makes two connect() system calls.  What we want is for the client to fail to connect to the second one.  That's easy to do with strace fault injection:

Run this:

strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade

Now, subsequent operations should fail with that symptom:

[root@cosa-devsh ~]# rpm-ostree upgrade
error: Transaction in progress: (null)
[root@cosa-devsh ~]#

Then you can verify this BZ by:

- Provision cluster before this bugfix
- Use oc debug node or ssh to run the above strace to wedge rpm-ostree daemon
- Start an upgrade, verify MCD gets stuck
- Run systemctl restart rpm-ostreed on affected node
- Upgrade to version with this bugfix (i.e. new MCD)
- Rerun above command to wedge rpm-ostree daemon
- Do *another* upgrade and you should see in the logs that the MCD is restarting rpm-ostreed
- Upgrade to 3rd version should succeed

Comment 15 Colin Walters 2022-04-28 13:27:59 UTC
> when should be the right moment to kill the pod?

Actually, now that I think about it, instead of injecting a fault you could do this on a node:

rpm-ostree usroverlay
mv /usr/bin/rpm-ostree{,.real}
cat > /usr/bin/rpm-ostree << 'EOF'
#!/bin/bash
set -xeuo pipefail
strace -o /tmp/trace --inject connect:delay_enter=30s:when=2 rpm-ostree.real "$@"
EOF
chmod a+x /usr/bin/rpm-ostree

That will *greatly* widen the window to reproduce the failure; you'd need to ensure an MCD pod is killed while it's trying to do an OS upgrade or other change.

Comment 16 Sergio 2022-04-28 16:05:47 UTC
Verified upgrading from 4.7.49 to 4.8.0-0.nightly-2022-04-27-190611

Verify that the upgrade finishes properly if we manually restart the service in case of a "transaction in progress" error:

1) In a 4.7.49 cluster we run this command on the nodes in order to force a "transaction in progress" error

$ oc debug node/$NODENAME
# chroot /host
# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade

2) Upgrade to a 4.8 version without the fix; we used 4.7.49 -> 4.8.0-0.nightly-2022-04-26-064801

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2022-04-26-064801 --allow-explicit-upgrade --force

3) We can see that the MCDs report this error when they try to upgrade the nodes:

E0428 14:24:10.687057  341849 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e661713d2bd5ec3a1a3204900a65ababf7567cc6d24d65b906c50d1f491f4ec : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-655450606/srv/repo:7d092c228fa7078656c8aa2c8c171ed74870451810b1dbea3543cec42832457e --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e661713d2bd5ec3a1a3204900a65ababf7567cc6d24d65b906c50d1f491f4ec --custom-origin-description Managed by machine-config-operator: error: Transaction in progress: (null)
: exit status 1

4) We manually restart the rpm-ostreed service on the failing nodes

$ oc debug node/$NODENAME
# chroot /host
# systemctl restart rpm-ostreed

5) We repeat the operation with every node until all nodes are OK and the upgrade ends OK.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-04-26-064801   True        False         81m     Cluster version is 4.8.0-0.nightly-2022-04-26-064801

Verify that the fix created for this issue solves this problem:

1) In a 4.7.49 cluster we run this command on the nodes in order to force a "transaction in progress" error

$ oc debug node/$NODENAME
# chroot /host
# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade

2) Upgrade to a 4.8 version with the fix; we are using 4.8.0-0.nightly-2022-04-27-190611

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2022-04-27-190611 --allow-explicit-upgrade --force

3) Verify that the upgrade finishes OK and nodes are not stuck reporting any failure.

We can see in the MCD logs that the pending transaction is detected and the service is restarted to fix it:

I0428 15:33:47.223595  143481 start.go:108] Version: v4.8.0-202204271537.p0.g58e7362.assembly.stream-dirty (58e73629d83330deae8b829b890b004de22b836b)
I0428 15:33:47.225812  143481 start.go:121] Calling chroot("/rootfs")
I0428 15:33:47.225898  143481 update.go:1859] Running: systemctl start rpm-ostreed
I0428 15:33:47.234278  143481 rpm-ostree.go:325] Running captured: rpm-ostree status --json
W0428 15:33:47.276261  143481 rpm-ostree.go:129] Detected active transaction during daemon startup, restarting to clear it
I0428 15:33:47.276282  143481 update.go:1859] Running: systemctl restart rpm-ostreed
I0428 15:33:47.393821  143481 rpm-ostree.go:325] Running captured: rpm-ostree status --json

No MCP is degraded:
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-baf50f5009da161217006b02e99f8268   True      False      False      3              3                   3                     0                      127m
worker   rendered-worker-289eccd43cdfff2a1cfd91e1faf59977   True      False      False      3              3                   3                     0                      127m

All nodes are OK
$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-134-121.us-east-2.compute.internal   Ready    master   128m   v1.21.8+ed4d8fd
ip-10-0-153-46.us-east-2.compute.internal    Ready    worker   120m   v1.21.8+ed4d8fd
ip-10-0-169-243.us-east-2.compute.internal   Ready    worker   122m   v1.21.8+ed4d8fd
ip-10-0-169-44.us-east-2.compute.internal    Ready    master   128m   v1.21.8+ed4d8fd
ip-10-0-206-110.us-east-2.compute.internal   Ready    worker   120m   v1.21.8+ed4d8fd
ip-10-0-221-60.us-east-2.compute.internal    Ready    master   128m   v1.21.8+ed4d8fd

And the upgrade finishes OK

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-04-27-190611   True        False         16s     Cluster version is 4.8.0-0.nightly-2022-04-27-190611

We move the issue to VERIFIED status.

Comment 21 errata-xmlrpc 2022-05-25 21:48:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.41 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

