Bug 2101379
| Summary: | Upgrade failed with "Transaction in progress" in MCO | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jaspreet Kaur <jkaur> | |
| Component: | rpm-ostree | Assignee: | Colin Walters <walters> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | RHCOS SST QE <rhcos-sst-qe> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 8.4 | CC: | agogala, dornelas, jerzhang, mkrejci, rioliu, skumari, walters | |
| Target Milestone: | rc | Keywords: | Reopened, Triaged, ZStream | |
| Target Release: | 8.4 | Flags: | pm-rhel: mirror+ | |
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2137418, 2137419, 2137420 | Environment: | ||
| Last Closed: | 2022-11-28 14:50:42 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2137418, 2137419, 2137420 | |||
|
Description
Jaspreet Kaur
2022-06-27 10:33:10 UTC
So rpm-ostree on the node (package) is doing the rebase command you see above. This means that during the update, the OLD rpm-ostree is performing the update. Since you are upgrading from 4.8.24, the 4.8.24 rpm-ostree does not have the fix and thus you see the above. I'm going to close this as fixed. Please apply workarounds as needed: log into the affected node and just `systemctl restart rpm-ostreed`.

Sorry for the delay. Looking at the case and the must-gather again, the very first OS update attempt is logged as follows:

2022-06-23T09:48:23.256657035Z I0623 09:48:23.256624 2761140 update.go:1859] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-337129471/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator
2022-06-23T09:48:49.009657439Z I0623 09:48:49.009551 2761140 update.go:1689] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
2022-06-23T09:48:49.011259467Z I0623 09:48:49.011181 2761140 update.go:1173] Updating files
2022-06-23T09:48:49.028119752Z I0623 09:48:49.027991 2761140 update.go:1570] Writing file "/etc/NetworkManager/conf.d/99-keyfiles.conf"
2022-06-23T09:48:49.029472593Z I0623 09:48:49.029404 2761140 update.go:1570] Writing file "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt"

This indicates that the update failed: the subsequent file writes are the rollback we perform when the OS update fails. And indeed, not much later:

2022-06-23T09:48:50.146723187Z E0623 09:48:50.146644 2761140 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-337129471/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
2022-06-23T09:48:50.146723187Z : exit status 1
2022-06-23T09:48:50.189742552Z I0623 09:48:50.189673 2761140 update.go:1896] Adding SIGTERM protection
2022-06-23T09:48:50.189742552Z I0623 09:48:50.189703 2761140 update.go:546] Checking Reconcilable for config rendered-storage-2f0946ceb3704780306aeda3aa8396b0 to rendered-storage-c31160bb30de2323898d7ef24ca2a625

So the failure actually logged first is "Timeout was reached" (a new failure), and the subsequent failures are:

2022-06-23T09:50:20.316444637Z E0623 09:50:20.316413 2761140 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-661488103/srv/repo:c7995a310da90340b572f2a6fbc4d454f206c45ef990e2eb948e4c0325ac47eb --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1025a8df0b59e4b3e9c98f7f6668c16b83982e17bf14b2ab92215093bf745bb --custom-origin-description Managed by machine-config-operator: error: Transaction in progress: (null)
2022-06-23T09:50:20.316444637Z : exit status 1

Which, in theory, should have been fixed by https://github.com/openshift/machine-config-operator/pull/2677 and https://github.com/coreos/rpm-ostree/pull/2995.

Looking at https://github.com/openshift/machine-config-operator/pull/2677/, I think it actually would only apply if the MCD somehow restarts. In this case, it is retrying after the first attempt timed out (and the MCD is just continuing in its logic loop, so I don't think that helps).

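The two failure modes in the log excerpts above can be told apart mechanically, which may help when scanning a must-gather for the first failing attempt. The following is only a rough sketch in Python: the function name and the category labels are hypothetical, and the only strings taken from this bug are the two rpm-ostree error messages.

```python
import re

# Category labels are made up for this sketch; only the two error
# messages in the pattern below come from the MCD logs in this bug.
TIMEOUT = "timeout"
STUCK_TRANSACTION = "stuck-transaction"
OTHER = "other"

_ERROR_RE = re.compile(r"error: (Timeout was reached|Transaction in progress)")


def classify_rebase_error(log_line: str) -> str:
    """Classify an MCD log line by the rpm-ostree error it carries."""
    match = _ERROR_RE.search(log_line)
    if match is None:
        return OTHER
    if match.group(1) == "Timeout was reached":
        return TIMEOUT
    return STUCK_TRANSACTION
```

Applied to the excerpts in this comment, the 09:48:50 "Marking Degraded" line would fall into the timeout category and the 09:50:20 line into the stuck-transaction category.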
The second PR, https://github.com/coreos/rpm-ostree/pull/2995, I think also doesn't help in this scenario, since the client technically never exits, although I am not sure about this as I am not as familiar with the rpm-ostree code. Maybe it had not made it into the version of RHCOS at the time of the bug (49.84.202203112054-0, I think?).

So in the case of the transaction timing out, I wonder if we need to restart rpm-ostree in the MCO fail loop, or do something similar, for the workaround to take effect. I have to defer to Colin here.

Colin: could you look at the assessment I made in this comment to see if this is correct? There is also the second issue of what caused that timeout in the first place.

You're right that if we hit the DBus timeout, we'll have the same symptom of a lingering null transaction. This seems to be a distinct failure scenario from what I analyzed originally.

OK, I did https://github.com/openshift/machine-config-operator/pull/3402, though it will need some paperwork and agreement from the MCO team to ship there too.

This bug is already fixed in 8.6+ per above.
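For illustration only, the idea raised in this thread (restart rpm-ostreed inside the retry loop when a rebase fails with a timeout or a lingering transaction) could be sketched as below. This is not the content of the MCO pull request mentioned above, and the MCO itself is not written in Python. The function names, the `run` parameter, and the attempt limit are assumptions made for this sketch; the error strings and the `systemctl restart rpm-ostreed` workaround are taken from this bug.

```python
import subprocess
from typing import Callable, Sequence

# Error messages seen in this bug's MCD logs.
STALE_DAEMON_MARKERS = ("Transaction in progress", "Timeout was reached")
# Manual workaround quoted in this bug.
RESTART_ARGV = ["systemctl", "restart", "rpm-ostreed"]

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def run_command(argv: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output without raising on failure."""
    return subprocess.run(list(argv), capture_output=True, text=True, check=False)


def rebase_with_daemon_restart(
    rebase_argv: Sequence[str],
    run: Runner = run_command,  # hypothetical hook so the loop can be tested
    max_attempts: int = 3,      # arbitrary limit chosen for this sketch
) -> bool:
    """Retry a rebase, restarting rpm-ostreed after timeout-like failures."""
    for attempt in range(1, max_attempts + 1):
        result = run(rebase_argv)
        if result.returncode == 0:
            return True
        stale = any(marker in (result.stderr or "") for marker in STALE_DAEMON_MARKERS)
        # Only restart the daemon if another attempt will follow.
        if stale and attempt < max_attempts:
            run(RESTART_ARGV)
    return False
```

Passing the command runner in as a parameter keeps the loop testable without a real rpm-ostreed. On a node, the default runner would execute the real commands and would need root privileges.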