Bug 1958812
Summary: | Cluster upgrade halts as machine-config-daemon fails to parse `rpm-ostree status` output | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Robert Sandu <rsandu> |
Component: | Machine Config Operator | Assignee: | Colin Walters <walters> |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.8 | CC: | agogala, dornelas, gferrazs, jerzhang, jligon, lucab, miabbott, mnguyen, mrussell, nstielau, rioliu, smilner, walters |
Target Milestone: | --- | ||
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 23:07:32 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1970590 |
Description
Robert Sandu
2021-05-10 08:15:51 UTC
I looked through the node journal and didn't see anything relevant in the rpm-ostreed.service logs either. Looking at the MCD code, I think `rpm-ostree status --json` is exiting without error (but it may be logging something problematic to stderr). I'm dispatching to MCO folks to get that portion of the logic augmented to report a more useful error message. In particular, logging the problematic input to the journal will help track down the root cause.

Assorted notes for possible followups after that:

* splitting stdout and stderr could help, in order to avoid parsing any warnings from stderr (see the sketch after the JSON excerpt below)
* adding a retry loop could help overcome transient errors

The above PR will get us the actual error, which will hopefully be informative. In the short term, can you try reproducing a failure by doing something like this in an "oc debug node":

`while sleep 30; do rpm-ostree status --json >/dev/null; done`

And see if any errors appear.

Hi Colin, Yu.
> `while sleep 30; do rpm-ostree status --json >/dev/null; done`
Just in case it helps: running `rpm-ostree status --json` 35 times in a loop produced one result that doesn't seem to pass JSON validation (line 2705 below):
~~~
$ cat -n rpm-ostree_status_json_loop.txt | grep 2705 -B10 -A10
2695 "custom-origin" : [
2696 "pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6028a87f17b7ecb9fe4b5a9f1ab25999cd414924cbba5721320f14df9faec44a",
2697 "Managed by machine-config-operator"
2698 ],
2699 "requested-base-removals" : [
2700 ]
2701 }
2702 ],
2703 "transaction" : null,
2704 "cached-update" : null
2705 },
2706 {
2707 "base-commit-meta" : {
2708 "rpmostree.initramfs-args" : [
2709 "--add=ignition",
2710 "--no-hostonly",
2711 "--add-drivers",
2712 "mptspi vmw_pvscsi",
2713 "--omit-drivers",
2714 "nouveau",
2715 "--omit",
~~~
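To illustrate the stdout/stderr split suggested in the followup notes above, here is a minimal Go sketch of one way a caller could keep the two streams separate before parsing the JSON. This is not the MCO implementation: the `status` struct and `runStatus` function are hypothetical names, and only a single top-level field of the `rpm-ostree status --json` document is modeled.

```
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// status models only the top-level shape of `rpm-ostree status --json`
// that matters for this sketch; the real schema has many more fields.
type status struct {
	Deployments []json.RawMessage `json:"deployments"`
}

// runStatus runs `rpm-ostree status --json`, keeping stdout and stderr
// separate so that warnings printed to stderr cannot corrupt the JSON
// document read from stdout.
func runStatus() (*status, error) {
	cmd := exec.Command("rpm-ostree", "status", "--json")
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		// Surface stderr in the error so the journal shows the real cause.
		return nil, fmt.Errorf("rpm-ostree status --json failed: %w; stderr: %q", err, stderr.String())
	}
	var st status
	if err := json.Unmarshal(stdout.Bytes(), &st); err != nil {
		// Log the problematic input, as suggested above, to aid debugging.
		return nil, fmt.Errorf("parsing rpm-ostree status output %q: %w", stdout.String(), err)
	}
	return &st, nil
}

func main() {
	st, err := runStatus()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("found %d deployments\n", len(st.Deployments))
}
```

Keeping stderr out of the parsed stream means a stray warning can no longer break JSON decoding, and including it in the returned error is what would make the failure visible in the daemon logs.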
I wasn't able to get `rpm-ostree status --json` to error in order to verify the fix. I upgraded from 4.8.0-0.nightly-2021-06-07-083552 to 4.8.0-0.nightly-2021-06-08-034312.

> Just in case it helps: running `rpm-ostree status --json` 35 times in a loop produced one result that doesn't seem to pass JSON validation (line 2705 below):
That's...odd. I don't have a theory for how that can happen; I wonder if it's an artifact of how you're receiving output (ssh/oc debug node) and terminal wrapping or so.
How about this?
$ t=$(mktemp); while true; do echo blah >${t}; if ! jq < ${t} >/dev/null; then cat ${t}; break; fi; sleep 30; done
Sorry, I meant:

```
$ t=$(mktemp); while true; do rpm-ostree status --json >${t}; if ! jq < ${t} >/dev/null; then cat ${t}; break; fi; sleep 30; done
```

(I replaced it with `echo blah` to validate the error case.)

OK, here's a debugging hotfix. This script uses `rpm-ostree usroverlay` to make a transient overlayfs for `/usr`, and intercepts the `rpm-ostree` binary. If it fails, we'll dump any error messages into a /tmp/rpmostreeout.XXXXXX file. Can you try this?

```
#!/bin/bash
set -xeuo pipefail
# Enter root mount namespace (needed if we're in `oc debug node`)
if test "$(readlink /proc/self/ns/mnt)" != "$(readlink /proc/1/ns/mnt)"; then
  if test -n "${nsenter_recurse:-}"; then exit 1; fi
  export nsenter_recurse=1
  exec nsenter -m -t 1 $(pwd)/$0
fi
unset nsenter_recurse
if ! test -w /usr; then
  rpm-ostree usroverlay
fi
if ! test -f /usr/bin/rpm-ostree.orig; then
  cp -a /usr/bin/rpm-ostree{,.orig}
fi
cat >/usr/bin/rpm-ostree.new << 'EOF'
#!/bin/bash
# Temporary debugging wrapper script
set -euo pipefail
tmpf=$(mktemp --tmpdir rpmostreeout.XXXXXXXX)
if ! rpm-ostree.orig "$@" |& tee ${tmpf}; then
  echo "Logs in ${tmpf}" 1>&2
  exit 1
else
  rm -f ${tmpf}
fi
EOF
chmod a+x /usr/bin/rpm-ostree.new
mv /usr/bin/rpm-ostree{.new,}
```

OK, https://github.com/coreos/rpm-ostree/pull/2932 should help a lot to debug this type of thing in the future.

OK, please try this new debugging script, which should more carefully gather errors. Note that logs now go to the journal (`journalctl -t rpm-ostree-client-errors`) instead of `/var/tmp`.

```
#!/bin/bash
set -xeuo pipefail
# Enter root mount namespace (needed if we're in `oc debug node`)
if test "$(readlink /proc/self/ns/mnt)" != "$(readlink /proc/1/ns/mnt)"; then
  if test -n "${nsenter_recurse:-}"; then exit 1; fi
  export nsenter_recurse=1
  exec nsenter -m -t 1 $(pwd)/$0
fi
unset nsenter_recurse
if ! test -w /usr; then
  rpm-ostree usroverlay
fi
if ! test -f /usr/bin/rpm-ostree.orig; then
  cp -a /usr/bin/rpm-ostree{,.orig}
fi
# See https://stackoverflow.com/questions/1221833/pipe-output-and-capture-exit-status-in-bash
cat >/usr/bin/rpm-ostree.new << 'EOF'
#!/bin/bash
# Temporary debugging wrapper script
set -euo pipefail
tmpd=$(mktemp -t -d rpmostreeout.XXXXXXXX)
mkfifo ${tmpd}/fifo
# Copy stderr to the journal
logger -s -t rpm-ostree-client-errors < ${tmpd}/fifo &
rc=0
if ! rpm-ostree.orig "$@" 2>${tmpd}/fifo; then
  echo "Command failed, see journalctl -t rpm-ostree-client-errors" 1>&2
  wait
  exit 1
fi
wait
rm "${tmpd}" -rf
EOF
chmod a+x /usr/bin/rpm-ostree.new
mv /usr/bin/rpm-ostree{.new,}
```

OK, so to satisfy the mechanics, we need this BZ in VERIFIED in order to proceed with the backport PR https://github.com/openshift/machine-config-operator/pull/2609. But the PR so far is mostly about additional logging. My proposal is:

- Flag this BZ as VERIFIED (leave the customer case open, of course).
- This will allow https://github.com/openshift/machine-config-operator/pull/2609 to merge, which, based on everything I've seen so far, will likely fix the bug *or* at least surface the actual problem.

If the backport doesn't solve the problem, then we can file a new BZ.

Thank you, Colin! @mnguyen, could you help with verifying the PRs linked to this case and push it towards an errata? Please see the comments above for context and customer business impact. Regards.

Verified on 4.9.0-0.nightly-2021-07-12-203753. Ran `rpm-ostree usroverlay`, then changed /usr/bin/rpm-ostree into a script:

```
#!/bin/sh
echo oops 1>&2
exit 1
```

Then ran an upgrade and captured the detailed log messages.
```
$ oc -n openshift-machine-config-operator logs machine-config-daemon-l8mww -c machine-config-daemon
I0713 19:55:47.449373 248281 start.go:108] Version: v4.9.0-202107122135.p0.git.a542196.assembly.stream-dirty (a5421965cf4e440b94bd1a4774072c517291fda0)
I0713 19:55:47.451451 248281 start.go:121] Calling chroot("/rootfs")
I0713 19:55:47.451558 248281 rpm-ostree.go:278] Running captured: rpm-ostree status --json
● rpm-ostreed.service - rpm-ostree System Management Daemon
   Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:rpm-ostree(1)

Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: Started rpm-ostree System Management Daemon.
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.77 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) added; new total=1
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.77 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) vanished; remaining=0
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 63 seconds
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.83 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) added; new total=1
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.83 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) vanished; remaining=0
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 61 seconds
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 62 seconds
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: rpm-ostreed.service: Succeeded.
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: rpm-ostreed.service: Consumed 68ms CPU time
F0713 19:55:47.467400 248281 start.go:150] Failed to initialize single run daemon: error reading osImageURL from rpm-ostree: error running rpm-ostree status --json: exit status 1 oops
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
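The other followup idea noted earlier in this bug was a retry loop to ride out transient `rpm-ostree status` failures. Below is a minimal, hypothetical Go sketch of that idea, not the actual MCO change; the `statusJSON` helper, the retry count, and the delay are illustrative assumptions.

```
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// statusJSON runs `rpm-ostree status --json` and returns the decoded
// top-level document, retrying on failure. The retry count and delay
// are illustrative values, not taken from the MCO code.
func statusJSON(retries int, delay time.Duration) (map[string]interface{}, error) {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		if attempt > 0 {
			time.Sleep(delay)
		}
		out, err := exec.Command("rpm-ostree", "status", "--json").Output()
		if err != nil {
			lastErr = fmt.Errorf("running rpm-ostree status --json: %w", err)
			continue
		}
		var doc map[string]interface{}
		if err := json.Unmarshal(out, &doc); err != nil {
			// Include the raw output so a transiently corrupted document
			// can be inspected later in the logs.
			lastErr = fmt.Errorf("parsing rpm-ostree output %q: %w", out, err)
			continue
		}
		return doc, nil
	}
	return nil, lastErr
}

func main() {
	doc, err := statusJSON(3, 5*time.Second)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	if deployments, ok := doc["deployments"].([]interface{}); ok {
		fmt.Printf("rpm-ostree reports %d deployments\n", len(deployments))
	}
}
```

A retry would only mask genuinely transient failures; the logging change tracked by this bug is still what surfaces the underlying error, as the verification log above shows with the wrapped "oops" stderr output.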