Bug 1958812 - Cluster upgrade halts as machine-config-daemon fails to parse `rpm-ostree status` during cluster upgrades
Summary: Cluster upgrade halts as machine-config-daemon fails to parse `rpm-ostree sta...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.8.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1970590
TreeView+ depends on / blocked
 
Reported: 2021-05-10 08:15 UTC by Robert Sandu
Modified: 2021-08-03 03:01 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:07:32 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2583 0 None open Bug 1958812: daemon: Change runGetOut to not intermix stdout/stderr 2021-05-20 19:43:34 UTC
Github openshift machine-config-operator pull 2587 0 None open Bug 1958812: daemon: Rework rpm-ostree synchronous invocation logging 2021-06-09 16:47:37 UTC
Github openshift machine-config-operator pull 2588 0 None open Bug 1958812: daemon: Display more non-JSON text from rpm-ostree status --json 2021-05-21 19:21:34 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:07:50 UTC

Description Robert Sandu 2021-05-10 08:15:51 UTC
OCP Version at Install Time: each upgrade, from 4.6.22 to 4.7.9
OCP Version after Upgrade (if applicable): 4.7.9
Platform: AWS
Architecture: x86_64

What are you trying to do? What is your use case?
- Upgrading the cluster.

What happened? What went wrong or what did you expect?
- Node are reported in "Degraded" state as machine-config-daemon fails to run and parse rpm-ostree status --json output:

$ cat namespaces/openshift-machine-config-operator/pods/machine-config-daemon-4msf4/machine-config-daemon/machine-config-daemon/logs/current.log
2021-05-07T10:48:38.348971166Z I0507 10:48:38.348822  246746 start.go:108] Version: v4.7.0-202104250659.p0-dirty (ac79a2ffc6002f086b3fa80003b278b635d2055a)
2021-05-07T10:48:38.356013527Z I0507 10:48:38.355980  246746 start.go:121] Calling chroot("/rootfs")
2021-05-07T10:48:38.356082310Z I0507 10:48:38.356067  246746 rpm-ostree.go:261] Running captured: rpm-ostree status --json
2021-05-07T10:48:38.423052266Z F0507 10:48:38.423004  246746 start.go:150] Failed to initialize single run daemon: error reading osImageURL from rpm-ostree: failed to parse `rpm-ostree status --json` output: invalid character 'E' looking for beginning of value

What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.
- Standard cluster upgrade.

Please add anything else that might be useful, for example:
- journalctl --verify: seeing no truncated/corrupted journalctl logs at the time of reproducing the issue.
- rpm-ostreed.service is up and manually running `rpm-ostree status --json` on the affected nodes shown no visible issue.
- The same issue is reproduced:
 - During every single upgrade, from 4.6.22 up to 4.7.9. 
 - Impacts individual workers nodes.
 - Impacts multiple clusters.
- As the error is reproduced, MCP is reporting their nodes in degraded state.
- Following a manual reboot of the impacted nodes, the upgrade finishes successfully.

Comment 3 Luca BRUNO 2021-05-10 09:37:57 UTC
I looked through node journal and didn't see anything relevant in rpm-ostreed.service logs either. And looking at MCD code I think `rpm-ostree status --json` is exiting without error (but it may be logging something problematic to stderr).

I'm dispatching to MCO folks in order to get that portion of logic augmented to report a more useful error message. In particular, logging the problematic input to journal will help tracking down the root cause.

Assorted notes for possible followups after that:
 * splitting stdout and stderr could help, in order to avoid parsing any warnings from stderr
 * adding a retry loop could help overcoming transient errors

Comment 13 Colin Walters 2021-05-20 21:44:52 UTC
The above PR will get us the actual error, which will hopefully be informative.

In the short term, can you try reproducing a failure by doing something like this in an "oc debug node":

`while sleep 30; do rpm-ostree status --json >/dev/null; done`

And see if any errors appear.

Comment 14 Robert Sandu 2021-05-25 11:20:08 UTC
Hi Colin, Yu.

> `while sleep 30; do rpm-ostree status --json >/dev/null; done`

Just in case it help: while running the `rpm-ostree status --json` 35 times in a loop has shown 1 outstanding result that doesn't seem to pass JSON validation (line 2705 below):

  ~~~
  $ cat -n rpm-ostree_status_json_loop.txt | grep 2705 -B10 -A10
    2695        "custom-origin" : [
    2696          "pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6028a87f17b7ecb9fe4b5a9f1ab25999cd414924cbba5721320f14df9faec44a",
    2697          "Managed by machine-config-operator"
    2698        ],
    2699        "requested-base-removals" : [
    2700        ]
    2701      }
    2702    ],
    2703    "transaction" : null,
    2704    "cached-update" : null
    2705  },
    2706      {
    2707        "base-commit-meta" : {
    2708          "rpmostree.initramfs-args" : [
    2709            "--add=ignition",
    2710            "--no-hostonly",
    2711            "--add-drivers",
    2712            "mptspi vmw_pvscsi",
    2713            "--omit-drivers",
    2714            "nouveau",
    2715            "--omit",
  ~~~

Comment 19 Michael Nguyen 2021-06-08 17:00:42 UTC
I wasn't able to get the `rpm-ostree status --json` to error to verify the fix.  I upgraded from 4.8.0-0.nightly-2021-06-07-083552 to 4.8.0-0.nightly-2021-06-08-034312.

Comment 20 Colin Walters 2021-06-09 13:11:46 UTC
> Just in case it help: while running the `rpm-ostree status --json` 35 times in a loop has shown 1 outstanding result that doesn't seem to pass JSON validation (line 2705 below):

That's...odd.  I don't have a theory for how that can happen; I wonder if it's an artifact of how you're receiving output (ssh/oc debug node) and terminal wrapping or so.

How about this?

$ t=$(mktemp); while true; do echo blah >${t}; if ! jq < ${t} >/dev/null; then cat ${t}; break; fi; sleep 30; done

Comment 21 Colin Walters 2021-06-09 13:49:11 UTC
Sorry, I meant
$ t=$(mktemp); while true; do rpm-ostree status --json >${t}; if ! jq < ${t} >/dev/null; then cat ${t}; break; fi; sleep 30; done

(I replaced it with `echo blah` to validate the error case)

Comment 25 Colin Walters 2021-06-09 19:09:18 UTC
OK here's a debugging hotfix:

This script uses `rpm-ostree usroverlay` to make a transient overlayfs for `/usr`, and intercepts the `rpm-ostree` binary.  If it fails, we'll dump any error messages into a /tmp/rpmostreeout.XXXXXX file.

Can you try this?

```
#!/bin/bash
set -xeuo pipefail
# Enter root mountnfs (needed if we're in `oc debug node`)
if test "$(readlink /proc/self/ns/mnt)" != "$(readlink /proc/1/ns/mnt)"; then
  if test -n "${nsenter_recurse:-}"; then exit 1; fi
  export nsenter_recurse=1
  exec nsenter -m -t 1 $(pwd)/$0
fi
unset nsenter_recurse
if ! test -w /usr; then
   rpm-ostree usroverlay
fi
if ! test -f /usr/bin/rpm-ostree.orig; then
  cp -a /usr/bin/rpm-ostree{,.orig}
fi
cat >/usr/bin/rpm-ostree.new << 'EOF'
#!/bin/bash
# Temporary debugging wrapper script
set -euo pipefail
tmpf=$(mktemp --tmpdir rpmostreeout.XXXXXXXX)
if ! rpm-ostree.orig "$@" |& tee ${tmpf}; then
  echo "Logs in ${tmpf}" 1>&2
  exit 1
else
  rm -f ${tmpf}
fi
EOF
chmod a+x /usr/bin/rpm-ostree.new
mv /usr/bin/rpm-ostree{.new,}
```

Comment 36 Colin Walters 2021-06-25 20:33:19 UTC
OK https://github.com/coreos/rpm-ostree/pull/2932 should help a lot to debug this type of thing in the future.

Comment 40 Colin Walters 2021-06-29 19:07:58 UTC
OK please try this new debugging script that should more carefully gather errors.  Note that logs now go to the journal (`journalctl -t rpm-ostree-client-errors`) instead of `/var/tmp`.

```
#!/bin/bash
set -xeuo pipefail
# Enter root mountnfs (needed if we're in `oc debug node`)
if test "$(readlink /proc/self/ns/mnt)" != "$(readlink /proc/1/ns/mnt)"; then
  if test -n "${nsenter_recurse:-}"; then exit 1; fi
  export nsenter_recurse=1
  exec nsenter -m -t 1 $(pwd)/$0
fi
unset nsenter_recurse
if ! test -w /usr; then
   rpm-ostree usroverlay
fi
if ! test -f /usr/bin/rpm-ostree.orig; then
  cp -a /usr/bin/rpm-ostree{,.orig}
fi
# See https://stackoverflow.com/questions/1221833/pipe-output-and-capture-exit-status-in-bash
cat >/usr/bin/rpm-ostree.new << 'EOF'
#!/bin/bash
# Temporary debugging wrapper script
set -euo pipefail
tmpd=$(mktemp -t -d rpmostreeout.XXXXXXXX)
mkfifo ${tmpd}/fifo
# Copy stderr to the journal
logger -s -t rpm-ostree-client-errors < ${tmpd}/fifo &
rc=0
if ! rpm-ostree.orig "$@" 2>${tmpd}/fifo; then
  echo "Command failed, see journalctl -t rpm-ostree-client-errors" 1>&2
  wait
  exit 1
fi
wait
rm "${tmpd}" -rf
EOF
chmod a+x /usr/bin/rpm-ostree.new
mv /usr/bin/rpm-ostree{.new,}
```

Comment 45 Colin Walters 2021-07-08 21:50:45 UTC
OK so to satisfy the mechanics, we need this BZ in VERIFIED in order to proceed with the backport PR https://github.com/openshift/machine-config-operator/pull/2609

But, the PR so far is mostly about additional logging.  

I think my proposal is:

- Flag this BZ as VERIFIED (leave the customer case open of course)
- This will allow https://github.com/openshift/machine-config-operator/pull/2609 to merge which based on everything I've seen so far will
  likely fix the bug *or* at least surface the actual problem.

If the backport doesn't solve the problem, then we can file a new BZ.

Comment 46 Robert Sandu 2021-07-09 06:37:27 UTC
Thank you, Colin!

@mnguyen@redhat.com could you help with verifying the PRs linked to this case and push it towards an errata? Please see the comments above for context and customer business impact.

Regards.

Comment 47 Michael Nguyen 2021-07-13 20:07:43 UTC
Verified on 4.9.0-0.nightly-2021-07-12-203753.

Ran `rpm-ostree usroverlay` then changed /usr/bin/rpm-osree into a script

#!/bin/sh
echo oops 1>&2
exit 1

Then ran an upgrade and captured the detailed log messages.

$ oc -n openshift-machine-config-operator logs machine-config-daemon-l8mww -c machine-config-daemon 
I0713 19:55:47.449373  248281 start.go:108] Version: v4.9.0-202107122135.p0.git.a542196.assembly.stream-dirty (a5421965cf4e440b94bd1a4774072c517291fda0)
I0713 19:55:47.451451  248281 start.go:121] Calling chroot("/rootfs")
I0713 19:55:47.451558  248281 rpm-ostree.go:278] Running captured: rpm-ostree status --json
● rpm-ostreed.service - rpm-ostree System Management Daemon
   Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:rpm-ostree(1)

Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: Started rpm-ostree System Management Daemon.
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.77 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) added; new total=1
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.77 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) vanished; remaining=0
Jul 13 19:02:49 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 63 seconds
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.83 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) added; new total=1
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: client(id:machine-config-operator dbus:1.83 unit:crio-5dfc0f6d88f86c68d56c1b9937eaf5806c7a47898e5a42c22f9a6d3d32bd371d.scope uid:0) vanished; remaining=0
Jul 13 19:02:50 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 61 seconds
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal rpm-ostree[2811]: In idle state; will auto-exit in 62 seconds
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: rpm-ostreed.service: Succeeded.
Jul 13 19:03:51 ip-10-0-157-3.us-west-2.compute.internal systemd[1]: rpm-ostreed.service: Consumed 68ms CPU time
F0713 19:55:47.467400  248281 start.go:150] Failed to initialize single run daemon: error reading osImageURL from rpm-ostree: error running rpm-ostree status --json: exit status 1
oops

Comment 50 errata-xmlrpc 2021-07-27 23:07:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.