OCP Version at Install Time: each upgrade, from 4.6.22 to 4.7.9
OCP Version after Upgrade (if applicable): 4.7.9
Platform: AWS
Architecture: x86_64

What are you trying to do? What is your use case?
- Upgrading the cluster.

What happened? What went wrong or what did you expect?
- Nodes are reported in "Degraded" state because machine-config-daemon fails to run and parse the `rpm-ostree status --json` output:

$ cat namespaces/openshift-machine-config-operator/pods/machine-config-daemon-4msf4/machine-config-daemon/machine-config-daemon/logs/current.log
2021-05-07T10:48:38.348971166Z I0507 10:48:38.348822 246746 start.go:108] Version: v4.7.0-202104250659.p0-dirty (ac79a2ffc6002f086b3fa80003b278b635d2055a)
2021-05-07T10:48:38.356013527Z I0507 10:48:38.355980 246746 start.go:121] Calling chroot("/rootfs")
2021-05-07T10:48:38.356082310Z I0507 10:48:38.356067 246746 rpm-ostree.go:261] Running captured: rpm-ostree status --json
2021-05-07T10:48:38.423052266Z F0507 10:48:38.423004 246746 start.go:150] Failed to initialize single run daemon: error reading osImageURL from rpm-ostree: failed to parse `rpm-ostree status --json` output: invalid character 'E' looking for beginning of value

What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.
- Standard cluster upgrade.

Please add anything else that might be useful, for example:
- journalctl --verify: no truncated/corrupted journal logs at the time of reproducing the issue.
- rpm-ostreed.service is up, and manually running `rpm-ostree status --json` on the affected nodes shows no visible issue.
- The same issue is reproduced:
  - during every single upgrade, from 4.6.22 up to 4.7.9;
  - on individual worker nodes;
  - on multiple clusters.
- While the error is present, the MCP reports its nodes as degraded.
- After a manual reboot of the impacted nodes, the upgrade finishes successfully.
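For anyone triaging a similar occurrence, the degraded pool/nodes and the failing daemon log can be located with something like the following (a sketch; `<node>` and the daemon pod name are placeholders):

```
# Which pools and nodes are degraded?
oc get mcp
oc get nodes

# The MCD state annotation on a suspect node (reports "Degraded" in this case):
oc get node <node> -o yaml | grep machineconfiguration.openshift.io/state

# Find and read the machine-config-daemon log for that node:
oc -n openshift-machine-config-operator get pods -o wide | grep <node>
oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon
```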
I looked through the node journal and didn't see anything relevant in the rpm-ostreed.service logs either. Looking at the MCD code, I think `rpm-ostree status --json` is exiting without error (but it may be logging something problematic to stderr). I'm dispatching this to the MCO folks to get that portion of the logic augmented to report a more useful error message. In particular, logging the problematic input to the journal will help track down the root cause.

Assorted notes for possible followups after that (a rough shell sketch of both ideas follows):

* splitting stdout and stderr would avoid parsing any warnings printed to stderr
* adding a retry loop could help overcome transient errors
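Not the MCO change itself, just a shell sketch to make those two followups concrete (it assumes `jq` is available for validation; in the real daemon this would live in the Go code that wraps rpm-ostree):

```
#!/bin/bash
# Sketch only: keep stdout (the JSON) separate from stderr (warnings), and
# retry a few times so a transient failure doesn't degrade the node.
set -euo pipefail

for attempt in 1 2 3; do
    stderr_file=$(mktemp)
    # Only stdout is treated as JSON; anything on stderr is logged, not parsed.
    if json=$(rpm-ostree status --json 2>"${stderr_file}") && jq . >/dev/null <<<"${json}"; then
        rm -f "${stderr_file}"
        printf '%s\n' "${json}"
        exit 0
    fi
    echo "attempt ${attempt} failed; stderr was:" >&2
    cat "${stderr_file}" >&2
    rm -f "${stderr_file}"
    sleep 5
done
echo "rpm-ostree status --json did not produce valid JSON after 3 attempts" >&2
exit 1
```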
The above PR will get us the actual error, which will hopefully be informative. In the short term, can you try to reproduce a failure by running something like this in an `oc debug node` shell, and see if any errors appear?

`while sleep 30; do rpm-ostree status --json >/dev/null; done`
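If it's easier to leave running unattended, here's a slightly expanded variant of that loop (same idea; it just records when the command fails and what it printed; the paths are arbitrary choices for this experiment):

```
#!/bin/bash
# Expanded variant of the one-liner above: log failures with a timestamp.
logfile=/var/tmp/rpm-ostree-status-loop.log     # arbitrary paths for this experiment
errfile=/var/tmp/rpm-ostree-status-stderr.txt
while sleep 30; do
    if ! rpm-ostree status --json >/dev/null 2>"${errfile}"; then
        {
            echo "$(date -u +%FT%TZ) rpm-ostree status --json failed"
            cat "${errfile}"
        } >>"${logfile}"
    fi
done
```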
Hi Colin, Yu.

> `while sleep 30; do rpm-ostree status --json >/dev/null; done`

Just in case it helps: running `rpm-ostree status --json` 35 times in a loop produced one outlying result that doesn't seem to pass JSON validation (line 2705 below):

~~~
$ cat -n rpm-ostree_status_json_loop.txt | grep 2705 -B10 -A10
  2695          "custom-origin" : [
  2696            "pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6028a87f17b7ecb9fe4b5a9f1ab25999cd414924cbba5721320f14df9faec44a",
  2697            "Managed by machine-config-operator"
  2698          ],
  2699          "requested-base-removals" : [
  2700          ]
  2701        }
  2702      ],
  2703      "transaction" : null,
  2704      "cached-update" : null
  2705    },
  2706    {
  2707      "base-commit-meta" : {
  2708        "rpmostree.initramfs-args" : [
  2709          "--add=ignition",
  2710          "--no-hostonly",
  2711          "--add-drivers",
  2712          "mptspi vmw_pvscsi",
  2713          "--omit-drivers",
  2714          "nouveau",
  2715          "--omit",
~~~
I wasn't able to get `rpm-ostree status --json` to error out in order to verify the fix. I upgraded from 4.8.0-0.nightly-2021-06-07-083552 to 4.8.0-0.nightly-2021-06-08-034312.
> Just in case it helps: running `rpm-ostree status --json` 35 times in a loop produced one outlying result that doesn't seem to pass JSON validation (line 2705 below):

That's...odd. I don't have a theory for how that can happen; I wonder if it's an artifact of how you're receiving the output (ssh / `oc debug node`) and terminal wrapping or similar. How about this?

$ t=$(mktemp); while true; do echo blah >${t}; if ! jq < ${t} >/dev/null; then cat ${t}; break; fi; sleep 30; done
Sorry, I meant

$ t=$(mktemp); while true; do rpm-ostree status --json >${t}; if ! jq < ${t} >/dev/null; then cat ${t}; break; fi; sleep 30; done

(I had swapped in `echo blah` above to validate the error path.)
OK here's a debugging hotfix: this script uses `rpm-ostree usroverlay` to create a transient overlayfs for `/usr` and intercepts the `rpm-ostree` binary. If the command fails, we'll dump any error messages into a /tmp/rpmostreeout.XXXXXX file. Can you try this?

```
#!/bin/bash
set -xeuo pipefail
# Enter the root mount namespace (needed if we're in `oc debug node`)
if test "$(readlink /proc/self/ns/mnt)" != "$(readlink /proc/1/ns/mnt)"; then
  if test -n "${nsenter_recurse:-}"; then exit 1; fi
  export nsenter_recurse=1
  exec nsenter -m -t 1 $(pwd)/$0
fi
unset nsenter_recurse
if ! test -w /usr; then
  rpm-ostree usroverlay
fi
if ! test -f /usr/bin/rpm-ostree.orig; then
  cp -a /usr/bin/rpm-ostree{,.orig}
fi
cat >/usr/bin/rpm-ostree.new << 'EOF'
#!/bin/bash
# Temporary debugging wrapper script
set -euo pipefail
tmpf=$(mktemp --tmpdir rpmostreeout.XXXXXXXX)
if ! rpm-ostree.orig "$@" |& tee ${tmpf}; then
  echo "Logs in ${tmpf}" 1>&2
  exit 1
else
  rm -f ${tmpf}
fi
EOF
chmod a+x /usr/bin/rpm-ostree.new
mv /usr/bin/rpm-ostree{.new,}
```
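One way to get the script onto a node and run it (a sketch; `<node>` is a placeholder and the path under /var/tmp is arbitrary; the script re-execs itself via `nsenter`, so it should be saved to a path on the host filesystem, here /var/tmp after chrooting into /host):

```
oc debug node/<node>
# inside the debug shell:
chroot /host
vi /var/tmp/rpm-ostree-debug-wrapper.sh     # paste the script above
chmod +x /var/tmp/rpm-ostree-debug-wrapper.sh
/var/tmp/rpm-ostree-debug-wrapper.sh
# after the next failure, look for the captured output:
ls -l /tmp/rpmostreeout.*
```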
OK, https://github.com/coreos/rpm-ostree/pull/2932 should help a lot with debugging this type of thing in the future.
Also https://github.com/openshift/machine-config-operator/pull/2642
OK, please try this new debugging script, which should gather errors more carefully. Note that logs now go to the journal (`journalctl -t rpm-ostree-client-errors`) instead of `/var/tmp`.

```
#!/bin/bash
set -xeuo pipefail
# Enter the root mount namespace (needed if we're in `oc debug node`)
if test "$(readlink /proc/self/ns/mnt)" != "$(readlink /proc/1/ns/mnt)"; then
  if test -n "${nsenter_recurse:-}"; then exit 1; fi
  export nsenter_recurse=1
  exec nsenter -m -t 1 $(pwd)/$0
fi
unset nsenter_recurse
if ! test -w /usr; then
  rpm-ostree usroverlay
fi
if ! test -f /usr/bin/rpm-ostree.orig; then
  cp -a /usr/bin/rpm-ostree{,.orig}
fi
# See https://stackoverflow.com/questions/1221833/pipe-output-and-capture-exit-status-in-bash
cat >/usr/bin/rpm-ostree.new << 'EOF'
#!/bin/bash
# Temporary debugging wrapper script
set -euo pipefail
tmpd=$(mktemp -t -d rpmostreeout.XXXXXXXX)
mkfifo ${tmpd}/fifo
# Copy stderr to the journal
logger -s -t rpm-ostree-client-errors < ${tmpd}/fifo &
rc=0
if ! rpm-ostree.orig "$@" 2>${tmpd}/fifo; then
  echo "Command failed, see journalctl -t rpm-ostree-client-errors" 1>&2
  wait
  exit 1
fi
wait
rm "${tmpd}" -rf
EOF
chmod a+x /usr/bin/rpm-ostree.new
mv /usr/bin/rpm-ostree{.new,}
```
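Once the wrapper is in place, checking for captured errors and later undoing it might look like this (a sketch; the `usroverlay` is transient, so a reboot also reverts everything):

```
# Run as root on the node (e.g. via `oc debug node/<node>` + `chroot /host`).

# See whether the wrapper has captured any client-side errors:
journalctl -t rpm-ostree-client-errors --no-pager

# To undo the wrapper without waiting for a reboot, restore the saved binary:
cp -a /usr/bin/rpm-ostree.orig /usr/bin/rpm-ostree
```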
OK, so to satisfy the mechanics, we need this BZ in VERIFIED in order to proceed with the backport PR https://github.com/openshift/machine-config-operator/pull/2609. But the PR so far is mostly about additional logging. My proposal is:

- Flag this BZ as VERIFIED (leaving the customer case open, of course).
- This will allow https://github.com/openshift/machine-config-operator/pull/2609 to merge, which, based on everything I've seen so far, will likely fix the bug *or* at least surface the actual problem.

If the backport doesn't solve the problem, we can file a new BZ.
Thank you, Colin! @mnguyen, could you help with verifying the PRs linked to this case and push them towards an errata? Please see the comments above for context and customer business impact. Regards.
Verified on 4.9.0-0.nightly-2021-07-12-203753.

Ran `rpm-ostree usroverlay`, then changed /usr/bin/rpm-ostree into a script:

#!/bin/sh
echo oops 1>&2
exit 1

Then ran an upgrade and captured the detailed log messages:

$ oc -n openshift-machine-config-operator logs machine-config-daemon-l8mww -c machine-config-daemon
I0713 19:55:47.449373 248281 start.go:108] Version: v4.9.0-202107122135.p0.git.a542196.assembly.stream-dirty (a5421965cf4e440b94bd1a4774072c517291fda0)
I0713 19:55:47.451451 248281 start.go:121] Calling chroot("/rootfs")
I0713 19:55:47.451558 248281 rpm-ostree.go:278] Running captured: rpm-ostree status --json
● rpm-ostreed.service - rpm-ostree System Management Daemon
   Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:rpm-ostree(1)
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: Started rpm-ostree System Management Daemon.
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.77 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) added; new total=1
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.77 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) vanished; remaining=0
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 63 seconds
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.83 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) added; new total=1
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.83 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) vanished; remaining=0
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 61 seconds
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 62 seconds
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: rpm-ostreed.service: Succeeded.
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: rpm-ostreed.service: Consumed 68ms CPU time
F0713 19:55:47.467400 248281 start.go:150] Failed to initialize single run daemon: error reading osImageURL from rpm-ostree: error running rpm-ostree status --json: exit status 1
oops
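For anyone re-running this verification, the fault injection above amounts to roughly the following (a sketch, run as root on the node; both the overlay and the stub disappear on the next reboot):

```
#!/bin/bash
set -euo pipefail
# Make /usr transiently writable, then replace rpm-ostree with a stub that
# always fails, so the MCD's error path (and its new logging) gets exercised.
rpm-ostree usroverlay
cp -a /usr/bin/rpm-ostree /usr/bin/rpm-ostree.orig   # keep a copy for manual restore
cat > /usr/bin/rpm-ostree << 'EOF'
#!/bin/sh
echo oops 1>&2
exit 1
EOF
chmod a+x /usr/bin/rpm-ostree
# Then trigger an upgrade and watch the machine-config-daemon logs.
```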
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438