Bug 2105082 - Cancel rpm-ostree transaction after failed rebase
Summary: Cancel rpm-ostree transaction after failed rebase
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.z
Assignee: Colin Walters
QA Contact: Rio Liu
Depends On: 2057544
Reported: 2022-07-07 20:37 UTC by Colin Walters
Modified: 2022-07-25 14:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-07-25 14:20:09 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3240 0 None open Bug 2105082: 4.7: daemon: Explicitly start rpm-ostreed, restart if we detect active txn 2022-07-07 20:39:43 UTC
Red Hat Product Errata RHBA-2022:5660 0 None None None 2022-07-25 14:20:15 UTC

Description Colin Walters 2022-07-07 20:37:34 UTC
This bug was initially created as a copy of Bug #2057544

I am copying this bug because we want to ship the fix to 4.7.

Description of problem:

The ARO team recently encountered a failed cluster upgrade caused by a timeout during "rpm-ostree rebase".  The cause of the timeout is unknown, but all subsequent attempts failed with this error message in the pod log:

error running rpm-ostree rebase --experimental (...args omitted...): error: Transaction in progress: (null)

rpm-ostree recently clarified this error message to include "You can cancel the current transaction with `rpm-ostree cancel`" [1], which in this context implies machine-config-operator should be doing this on error.

It currently does not [2], which can lead to the situation described above.

I suggest that, as part of error recovery, machine-config-operator attempt to run "rpm-ostree cancel" after a failed "rpm-ostree rebase" or any other transaction-based command.

[1] https://github.com/coreos/rpm-ostree/commit/695312f
[2] https://github.com/openshift/machine-config-operator/blob/7c1ac8b51423448397bda3f349f56f6e94261b64/pkg/daemon/rpm-ostree.go#L299-L304
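The suggested recovery could be sketched roughly as follows. This is a hypothetical Go sketch, not the actual machine-config-operator code; the helper names are invented, and matching on the "Transaction in progress" substring is an assumption based on the error message quoted above.

```go
// Hypothetical sketch: after a failed transaction-based rpm-ostree
// command, check whether the failure was caused by a stuck transaction
// and, if so, attempt a best-effort `rpm-ostree cancel`.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// isTransactionInProgress reports whether rpm-ostree's output indicates
// that another transaction is still holding the daemon.
func isTransactionInProgress(output string) bool {
	return strings.Contains(output, "Transaction in progress")
}

// runRpmOstree runs an rpm-ostree subcommand and, on a "Transaction in
// progress" failure, tries `rpm-ostree cancel` once before returning.
func runRpmOstree(args ...string) error {
	out, err := exec.Command("rpm-ostree", args...).CombinedOutput()
	if err == nil {
		return nil
	}
	if isTransactionInProgress(string(out)) {
		// Best-effort recovery, as suggested in this bug report.
		if cancelErr := exec.Command("rpm-ostree", "cancel").Run(); cancelErr != nil {
			return fmt.Errorf("cancel failed after %v: %v", err, cancelErr)
		}
	}
	return fmt.Errorf("rpm-ostree %s: %v: %s", strings.Join(args, " "), err, out)
}

func main() {
	fmt.Println(isTransactionInProgress("error: Transaction in progress: (null)")) // prints: true
}
```

Note that the fix that actually landed in PR 3240 takes a different approach: rather than cancelling after each failure, the daemon detects an active transaction at startup and restarts rpm-ostreed.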

Comment 1 Sergio 2022-07-08 16:44:17 UTC

I have tried to verify the BZ using the strace command as we did in 4.8.

Unfortunately, the strace version shipped in 4.6 and 4.7 does not support the --fault and --inject options.

sh-4.4# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
strace: invalid option -- '-'
Try 'strace -h' for more information.

# strace -V
strace -- version 5.1
Copyright (c) 1991-2019 The strace developers <https://strace.io>.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Optional features enabled: stack-trace=libdw stack-demangle m32-mpers mx32-mpers

Do you know if there is any way to reproduce this error with this strace version?

Thank you very much!

Comment 3 Colin Walters 2022-07-08 19:19:43 UTC
It should work to pull a newer strace binary from RHEL; you're not restricted to the stock one.

`rpm-ostree usroverlay` will make /usr/ transiently writable; then fetch the latest strace RPM and install it with `rpm -Uvh http://url/strace.rpm`.

Comment 4 Sergio 2022-07-11 23:07:41 UTC
Verified the BZ by upgrading from 4.6.59 to 4.7.

In order to verify the BZ we need to update the version of the strace package on the 4.6.59 cluster nodes.

To update the strace package on 4.6.59 nodes:

1) oc debug node/$NODENAME; chroot /host
2) curl http://mirror.centos.org/centos/8-stream/BaseOS/x86_64/os/Packages/strace-5.13-4.el8.x86_64.rpm -o /tmp/strace-5.13-4.el8.x86_64.rpm
3) rpm-ostree usroverlay
4) rpm-ostree override replace /tmp/strace-5.13-4.el8.x86_64.rpm
5) reboot node
6) verify strace version
sh-4.4# strace -V
strace -- version 5.13

Reproducing the bug, upgrade 4.6.59 -> 4.7.54 (does not contain the fix):

1) Update strace package to 5.13 as described above in master and worker nodes
2) In master and worker nodes execute:
strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
3) The strace'd "rpm-ostree upgrade" command will hang, leaving an active rpm-ostree transaction; type ctrl+z to continue
4) Upgrade to 4.7.54
 oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.7.54-x86_64 --force --allow-explicit-upgrade
5) After the upgrade we can see the pools reporting degraded status because:
  - lastTransitionTime: "2022-07-11T13:39:52Z"
    message: 'Node ip-10-0-133-239.us-east-2.compute.internal is reporting: "failed
      to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1d1872a8db06bd1a60c222a8c26d9615e295b6837fa140ea57c0690028858e2
      : with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-523033659/srv/repo:db8b72f0b3fe7f887f9bbe64b2317eaff79aec45e7ee3e0d3f98be32253fb11f
      --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1d1872a8db06bd1a60c222a8c26d9615e295b6837fa140ea57c0690028858e2
      --custom-origin-description Managed by machine-config-operator: exit status
      1\nerror: Transaction in progress: (null)\n"'
    reason: 1 nodes are reporting degraded status on sync
6) Executing `systemctl restart rpm-ostreed` manually fixes the issue, and the upgrade finishes OK.

Verifying the fix, upgrade 4.6.59 -> 4.7.0-0.nightly-2022-07-08-193842 (contains the fix):

1) Update strace package to 5.13 as described above
2) In master and worker nodes execute:
strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
3) The strace'd "rpm-ostree upgrade" command will hang, leaving an active rpm-ostree transaction; type ctrl+z to continue
4) Upgrade to 4.7.0-0.nightly-2022-07-08-193842
  oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2022-07-08-193842 --force --allow-explicit-upgrade
5) The upgrade should finish OK without manual intervention
6) In the MCD logs we can see that the pending transaction was detected and the service was restarted to clear it:

I0711 22:45:54.654047   81279 start.go:108] Version: v4.7.0-202207081845.p0.g54116a9.assembly.stream-dirty (54116a94477f669acb9b00b43069a6c8b1ca6282)
I0711 22:45:54.656251   81279 start.go:121] Calling chroot("/rootfs")
I0711 22:45:54.656319   81279 update.go:1969] Running: systemctl start rpm-ostreed
I0711 22:45:54.664154   81279 rpm-ostree.go:325] Running captured: rpm-ostree status --json
W0711 22:45:54.700073   81279 rpm-ostree.go:129] Detected active transaction during daemon startup, restarting to clear it
I0711 22:45:54.700092   81279 update.go:1969] Running: systemctl restart rpm-ostreed
I0711 22:45:54.760683   81279 rpm-ostree.go:325] Running captured: rpm-ostree status --json
I0711 22:45:54.795165   81279 daemon.go:222] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedf46061ae5b3c7915b6ce5d1b7f4c0f0f5be761d467a95edc6b6e150d3b727 (46.82.202206080340-0)
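The startup check shown in the log above could be sketched roughly like this. This is a hypothetical Go sketch, not the code from PR 3240; treating a non-null top-level "transaction" field in the `rpm-ostree status --json` output as an active transaction is an assumption, and the real daemon may inspect the status differently.

```go
// Hypothetical sketch: decide from `rpm-ostree status --json` output
// whether an active transaction is pending, in which case the daemon
// restarts rpm-ostreed to clear it (as in the log above).
package main

import (
	"encoding/json"
	"fmt"
)

// hasActiveTransaction reports whether the status JSON carries a
// non-null top-level "transaction" entry (assumed field name).
func hasActiveTransaction(statusJSON []byte) (bool, error) {
	var status map[string]json.RawMessage
	if err := json.Unmarshal(statusJSON, &status); err != nil {
		return false, err
	}
	txn, ok := status["transaction"]
	return ok && string(txn) != "null", nil
}

func main() {
	busy := []byte(`{"deployments": [], "transaction": ["upgrade"]}`)
	idle := []byte(`{"deployments": []}`)
	b1, _ := hasActiveTransaction(busy)
	b2, _ := hasActiveTransaction(idle)
	fmt.Println(b1, b2) // prints: true false
}
```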

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2022-07-08-193842   True        False         27s     Cluster version is 4.7.0-0.nightly-2022-07-08-193842

We move the status to VERIFIED.

Comment 7 errata-xmlrpc 2022-07-25 14:20:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.55 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

