Bug 2105082

Summary: Cancel rpm-ostree transaction after failed rebase
Product: OpenShift Container Platform
Component: Machine Config Operator
Sub component: Machine Config Operator
Reporter: Colin Walters <walters>
Assignee: Colin Walters <walters>
QA Contact: Rio Liu <rioliu>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: mkrejci, skumari, sregidor
Version: 4.7
Target Release: 4.7.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-07-25 14:20:09 UTC
Bug Depends On: 2057544

Description Colin Walters 2022-07-07 20:37:34 UTC
This bug was initially created as a copy of Bug #2057544

I am copying this bug because we want to ship the fix to 4.7.



Description of problem:

The ARO team recently encountered a failed cluster upgrade caused by a timeout during "rpm-ostree rebase".  The cause of the timeout is unknown, but all subsequent attempts failed with this error message in the pod log:

---
error running rpm-ostree rebase --experimental (...args omitted...): error: Transaction in progress: (null)
---

rpm-ostree recently clarified this error message to include "You can cancel the current transaction with `rpm-ostree cancel`" [1], which in this context implies that machine-config-operator should be doing this on error.

It currently does not [2], which can lead to the situation described above.

I suggest that, as part of error recovery, machine-config-operator should attempt to run "rpm-ostree cancel" after a failed "rpm-ostree rebase" or any other transaction-based command.


[1] https://github.com/coreos/rpm-ostree/commit/695312f
[2] https://github.com/openshift/machine-config-operator/blob/7c1ac8b51423448397bda3f349f56f6e94261b64/pkg/daemon/rpm-ostree.go#L299-L304
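
A minimal Go sketch of the suggested recovery, written as a wrapper around the rpm-ostree CLI. The function name, logging, and error formatting here are illustrative assumptions, not the actual machine-config-operator code:

---
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// runRpmOstree is a hypothetical wrapper illustrating the suggestion:
// when a transaction-based rpm-ostree command fails, attempt
// "rpm-ostree cancel" so a stuck transaction cannot wedge later retries.
func runRpmOstree(args ...string) error {
	out, err := exec.Command("rpm-ostree", args...).CombinedOutput()
	if err == nil {
		return nil
	}
	// Best-effort cleanup; the original error is what we surface.
	if cancelErr := exec.Command("rpm-ostree", "cancel").Run(); cancelErr != nil {
		log.Printf("rpm-ostree cancel failed: %v", cancelErr)
	}
	return fmt.Errorf("error running rpm-ostree %s: %s: %w",
		strings.Join(args, " "), string(out), err)
}

func main() {
	if err := runRpmOstree("upgrade"); err != nil {
		log.Fatal(err)
	}
}
---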

Comment 1 Sergio 2022-07-08 16:44:17 UTC
Hello,

I have tried to verify the BZ using the strace command as we did in 4.8.

Unfortunately, the strace version shipped in 4.6 and 4.7 does not support the --fault and --inject options.

sh-4.4# strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
strace: invalid option -- '-'
Try 'strace -h' for more information.

# strace -V
strace -- version 5.1
Copyright (c) 1991-2019 The strace developers <https://strace.io>.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Optional features enabled: stack-trace=libdw stack-demangle m32-mpers mx32-mpers


Do you know if there is any way to reproduce this error with this strace version?

Thank you very much!

Comment 3 Colin Walters 2022-07-08 19:19:43 UTC
It should work to pull a newer strace binary from RHEL; you're not restricted to the stock one.

`rpm-ostree usroverlay` will make /usr/ transiently writable; then fetch the latest from
https://access.redhat.com/errata/RHEA-2022:2026
or so and install it with `rpm -Uvh http://url/strace.rpm`

Comment 4 Sergio 2022-07-11 23:07:41 UTC
Verified the BZ by upgrading 4.6.59 to 4.7.

In order to verify the BZ we need to update the strace package in the 4.6.59 cluster nodes.

To update the strace package in 4.6.59 nodes:

1) oc debug node/$NODENAME; chroot /host
2) curl http://mirror.centos.org/centos/8-stream/BaseOS/x86_64/os/Packages/strace-5.13-4.el8.x86_64.rpm -o /tmp/strace-5.13-4.el8.x86_64.rpm
3) rpm-ostree usroverlay
4) rpm-ostree override replace /tmp/strace-5.13-4.el8.x86_64.rpm
5) reboot node
6) verify strace version
sh-4.4# strace -V
strace -- version 5.13


Reproducing the bug, upgrade 4.6.59 -> 4.7.54 (does not contain the fix):

1) Update the strace package to 5.13 in master and worker nodes, as described above
2) In master and worker nodes execute
strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
3) In master and worker nodes run "rpm-ostree upgrade"; the command will get stuck, so type ctrl+z to continue
4) Upgrade to 4.7.54
 oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.7.54-x86_64 --force --allow-explicit-upgrade
5) After the upgrade we can see the pools reporting degraded status because:
  - lastTransitionTime: "2022-07-11T13:39:52Z"
    message: 'Node ip-10-0-133-239.us-east-2.compute.internal is reporting: "failed
      to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1d1872a8db06bd1a60c222a8c26d9615e295b6837fa140ea57c0690028858e2
      : with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-523033659/srv/repo:db8b72f0b3fe7f887f9bbe64b2317eaff79aec45e7ee3e0d3f98be32253fb11f
      --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1d1872a8db06bd1a60c222a8c26d9615e295b6837fa140ea57c0690028858e2
      --custom-origin-description Managed by machine-config-operator: exit status
      1\nerror: Transaction in progress: (null)\n"'
    reason: 1 nodes are reporting degraded status on sync
6) Executing `systemctl restart rpm-ostreed` will manually fix the issue and the upgrade will finish OK.



Verifying the fix, upgrade 4.6.59 -> 4.7.0-0.nightly-2022-07-08-193842 (contains the fix):

1) Update strace package to 5.13 as described above
2) In master and worker nodes execute
strace -f --fault connect:error=EPERM:when=2 rpm-ostree upgrade
3) Run "rpm-ostree upgrade", the command will be stuck, type ctrl+z to continue
4) Upgrade to 4.7.0-0.nightly-2022-07-08-193842
  oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2022-07-08-193842 --force --allow-explicit-upgrade
5) The upgrade should finish OK without manual intervention
6) We can see in the MCD logs that the pending transaction is detected and the service is restarted to fix it:

I0711 22:45:54.654047   81279 start.go:108] Version: v4.7.0-202207081845.p0.g54116a9.assembly.stream-dirty (54116a94477f669acb9b00b43069a6c8b1ca6282)
I0711 22:45:54.656251   81279 start.go:121] Calling chroot("/rootfs")
I0711 22:45:54.656319   81279 update.go:1969] Running: systemctl start rpm-ostreed
I0711 22:45:54.664154   81279 rpm-ostree.go:325] Running captured: rpm-ostree status --json
W0711 22:45:54.700073   81279 rpm-ostree.go:129] Detected active transaction during daemon startup, restarting to clear it
I0711 22:45:54.700092   81279 update.go:1969] Running: systemctl restart rpm-ostreed
I0711 22:45:54.760683   81279 rpm-ostree.go:325] Running captured: rpm-ostree status --json
I0711 22:45:54.795165   81279 daemon.go:222] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedf46061ae5b3c7915b6ce5d1b7f4c0f0f5be761d467a95edc6b6e150d3b727 (46.82.202206080340-0)
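
Based on the log lines above (rpm-ostree.go:129), the fix appears to check for an active transaction at daemon startup and restart rpm-ostreed to clear it. A rough Go sketch of that check, assuming `rpm-ostree status --json` reports a non-null "transaction" field while a transaction is in progress; the type and function names are illustrative, not the actual MCO code:

---
package main

import (
	"encoding/json"
	"log"
	"os/exec"
)

// rpmOstreeStatus models only the field we need from
// "rpm-ostree status --json"; the exact JSON shape is an assumption here.
type rpmOstreeStatus struct {
	Transaction []string `json:"transaction"`
}

// clearStuckTransaction restarts rpm-ostreed if a transaction is
// still active, mirroring the behavior visible in the MCD log above.
func clearStuckTransaction() error {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return err
	}
	var status rpmOstreeStatus
	if err := json.Unmarshal(out, &status); err != nil {
		return err
	}
	if status.Transaction != nil {
		log.Print("Detected active transaction during daemon startup, restarting to clear it")
		return exec.Command("systemctl", "restart", "rpm-ostreed").Run()
	}
	return nil
}

func main() {
	if err := clearStuckTransaction(); err != nil {
		log.Fatal(err)
	}
}
---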


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2022-07-08-193842   True        False         27s     Cluster version is 4.7.0-0.nightly-2022-07-08-193842


We move the status to VERIFIED.

Comment 7 errata-xmlrpc 2022-07-25 14:20:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.55 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5660