Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2101379

Summary: Upgrade failed with "Transaction in progress" in MCO
Product: Red Hat Enterprise Linux 8
Reporter: Jaspreet Kaur <jkaur>
Component: rpm-ostree
Assignee: Colin Walters <walters>
Status: CLOSED CURRENTRELEASE
QA Contact: RHCOS SST QE <rhcos-sst-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 8.4
CC: agogala, dornelas, jerzhang, mkrejci, rioliu, skumari, walters
Target Milestone: rc
Keywords: Reopened, Triaged, ZStream
Target Release: 8.4
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2137418 2137419 2137420
Environment:
Last Closed: 2022-11-28 14:50:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2137418, 2137419, 2137420

Description Jaspreet Kaur 2022-06-27 10:33:10 UTC
Upgrade failed with the following error message in the MCO:

2022-06-23T09:50:20.316444637Z E0623 09:50:20.316413 2761140 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-661488103/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator: error: Transaction in progress: (null)


Similar to the Bugzilla bug below:

https://bugzilla.redhat.com/show_bug.cgi?id=2057544

Comment 2 Yu Qi Zhang 2022-06-27 14:33:18 UTC
So the rebase command you see above is being run by the rpm-ostree package installed on the node. This means that during the update, the OLD rpm-ostree is performing the update.

Since you are upgrading from 4.8.24, the 4.8.24 rpm-ostree does not have the fix and thus you see the above.

I'm going to close this as fixed. Please apply the workaround as needed: log into the affected node and run `systemctl restart rpm-ostreed`.

Comment 6 Yu Qi Zhang 2022-07-27 01:05:59 UTC
Sorry for the delay. Looking at the case and the must-gather again, the very first OS update attempt is logged as follows:

2022-06-23T09:48:23.256657035Z I0623 09:48:23.256624 2761140 update.go:1859] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-337129471/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator
2022-06-23T09:48:49.009657439Z I0623 09:48:49.009551 2761140 update.go:1689] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
2022-06-23T09:48:49.011259467Z I0623 09:48:49.011181 2761140 update.go:1173] Updating files
2022-06-23T09:48:49.028119752Z I0623 09:48:49.027991 2761140 update.go:1570] Writing file "/etc/NetworkManager/conf.d/99-keyfiles.conf"
2022-06-23T09:48:49.029472593Z I0623 09:48:49.029404 2761140 update.go:1570] Writing file "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt"

which indicates that the update failed, the subsequent file writes being the rollback we perform when the OS update fails, and indeed, not much later,

2022-06-23T09:48:50.146723187Z E0623 09:48:50.146644 2761140 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-337129471/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
2022-06-23T09:48:50.146723187Z : exit status 1
2022-06-23T09:48:50.189742552Z I0623 09:48:50.189673 2761140 update.go:1896] Adding SIGTERM protection
2022-06-23T09:48:50.189742552Z I0623 09:48:50.189703 2761140 update.go:546] Checking Reconcilable for config rendered-storage-2f0946ceb3704780306aeda3aa8396b0 to rendered-storage-c31160bb30de2323898d7ef24ca2a625

So the actual failure logged here is "Timeout was reached" (a new failure).
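To tell the two failure modes apart when scanning a must-gather, the MCD lines can be split into their header and the trailing rpm-ostree error. The following is only a sketch: the regular expression assumes the layout seen in the excerpts in this bug (a container timestamp followed by a klog-style header), the function name `parse_mcd_line` is hypothetical, and the sample line is abbreviated from the log above.

```python
import re

# Assumed layout, based on the excerpts in this bug:
# <container timestamp> <level><mmdd> <time> <pid> <file>:<line>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) "
    r"(?P<level>[IWEF])\d{4} \S+\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] "
    r"(?P<msg>.*)$"
)


def parse_mcd_line(line):
    """Return header fields and the trailing rpm-ostree error, or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # rpm-ostree reports its failure after the last "error: " in the message.
    _, sep, tail = match.group("msg").rpartition("error: ")
    return {
        "timestamp": match.group("ts"),
        "level": match.group("level"),
        "source": match.group("src"),
        "rpm_ostree_error": tail if sep else None,
    }


if __name__ == "__main__":
    # Abbreviated sample; the real line carries the full rebase command.
    sample = (
        "2022-06-23T09:48:50.146723187Z E0623 09:48:50.146644 2761140 "
        "writer.go:135] Marking Degraded due to: failed to update OS: "
        "error running rpm-ostree rebase: error: Timeout was reached"
    )
    print(parse_mcd_line(sample))
```

Run against the full log, this would separate the first "Timeout was reached" failure from the later "Transaction in progress: (null)" ones.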

And then subsequent failures are:

2022-06-23T09:50:20.316444637Z E0623 09:50:20.316413 2761140 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-661488103/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator: error: Transaction in progress: (null)
2022-06-23T09:50:20.316444637Z : exit status 1

Which, in theory, should have been fixed by https://github.com/openshift/machine-config-operator/pull/2677 and https://github.com/coreos/rpm-ostree/pull/2995. Looking at https://github.com/openshift/machine-config-operator/pull/2677/, I think it would actually only apply if the MCD somehow restarts. In this case, it is retrying after the first attempt timed out (and the MCD is just continuing in its logic loop, so I don't think that helps).

The second PR, https://github.com/coreos/rpm-ostree/pull/2995, I think also doesn't help in this scenario, since the client technically never exits, although I am not sure about this as I am not as familiar with the rpm-ostree code. Maybe it had not made it into the version of RHCOS at the time of the bug (49.84.202203112054-0, I think?).

So then, in the case of the transaction timing out, I wonder if we need to restart rpm-ostree in the MCO fail loop, or do something similar, for the workaround to take effect.
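To make that idea concrete, here is a rough sketch of a retry loop that restarts the daemon before the next attempt. It is an illustration only, not MCO code and not what any of the linked PRs implement: the function names, the `attempts` parameter, and matching on the error strings from the logs above are all assumptions. The runner is injectable so the loop can be exercised without touching a real node.

```python
import subprocess

# Error strings as they appear in the MCD logs in this bug.
TIMEOUT_MARKER = "Timeout was reached"
BUSY_MARKER = "Transaction in progress"


def run(cmd):
    """Run a command and return (exit_code, stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def rebase_with_recovery(rebase_cmd, attempts=3, runner=run):
    """Retry rebase_cmd, restarting rpm-ostreed after a stuck-transaction failure.

    Hypothetical helper: returns True on the first successful attempt,
    False if every attempt fails.
    """
    for _ in range(attempts):
        code, stderr = runner(rebase_cmd)
        if code == 0:
            return True
        if TIMEOUT_MARKER in stderr or BUSY_MARKER in stderr:
            # Same workaround as the manual one: restart the daemon so the
            # lingering transaction is dropped before the next attempt.
            runner(["systemctl", "restart", "rpm-ostreed"])
    return False
```

A caller would pass the full `rpm-ostree rebase ...` argument list as `rebase_cmd`. Failures that match neither marker are simply retried without a restart.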

I have to defer to Colin here. Colin: could you look at the assessment I made in this comment to see if this is correct? There is also the second issue of what caused that timeout in the first place.

Comment 12 Colin Walters 2022-11-04 16:16:00 UTC
You're right that when we hit the DBus timeout, we'll have the same symptom of a lingering null transaction.
This seems to be a distinct failure scenario from the one I analyzed originally.

OK, I did https://github.com/openshift/machine-config-operator/pull/3402, though it will need some paperwork and agreement from the MCO team to ship there too.

Comment 14 Colin Walters 2022-11-28 14:50:42 UTC
This bug is already fixed in 8.6+ per above.