Bug 1982389
| Summary: | rpm-ostree stuck in UpdateDeployment, no further logs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Neil Girard <ngirard> |
| Component: | rpm-ostree | Assignee: | Colin Walters <walters> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Micah Abbott <miabbott> |
| Severity: | urgent | Docs Contact: | |
| Priority: | medium | ||
| Version: | 8.5 | CC: | dornelas, jlebon, jligon, mbarnes, mrobson, mrussell, nstielau, oarribas, vkuss, walters |
| Target Milestone: | beta | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-07-07 18:59:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Neil Girard
2021-07-14 18:28:07 UTC
This is really similar to https://bugzilla.redhat.com/show_bug.cgi?id=1976435

OK, so from the journal we have:
> Jul 14 11:47:33 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: Initiated txn UpdateDeployment for client(id:machine-config-operator dbus:1.111485 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0): /org/projectatomic/rpmostree1/rhcos
And...that's it. No further logs except messages from the MCD connecting. It seems the transaction thread is somehow stuck.
I think we'll need something like `strace -p $(systemctl show -p MainPID rpm-ostreed |cut -f 2 -d =)` from inside an `oc debug node` shell.
Or we could just try a `systemctl restart rpm-ostreed` but I'd like to understand why and how it's stuck like this.
Other commands that may give us useful data:

```
systemctl status rpm-ostreed
export pid=$(systemctl show -p MainPID rpm-ostreed |cut -f 2 -d =)
cat /proc/$pid/status
cat /proc/$pid/task/*/stack
ls -al /proc/$pid/fd
```

Also, when we reach the limit of noninvasive inspection, we can try forcing the process to dump core with `kill -SEGV $pid`, then see if we can get a stack trace from that core dump via `coredumpctl`. Failing that, attaching the core to the case should give us all the data we need. It will also be useful to see after that whether the MCD retries and whether that succeeds or gets stuck again.

Hello, here's the output we get from the strace command:

```
sh-4.4# strace -p $(systemctl show -p MainPID rpm-ostreed |cut -f 2 -d =)
strace: Process 1176421 attached
restart_syscall(<... resuming interrupted read ...>
^Cstrace: Process 1176421 detached
<detached ...>
```

Basically nothing. The customer tried the `systemctl restart rpm-ostreed` and the update progressed. Thanks!

Hmm, it would have been good to get the core dump; I should have argued for that to start. I may try to extend sosreport to gather more data like this.

Looking through the code, I do have a theory for how this could happen:

- The MCD starts `rpm-ostree upgrade` (the client process).
- The daemon receives the DBus request and waits for the client to connect to the transaction progress DBus socket.
- The MCO is also upgrading the MCD daemonset, and this ends up killing the MCD pod, which also kills the rpm-ostree client process.
- We're now stuck because the transaction is pending the client connection.

The rpm-ostree daemon code looks like it's *trying* to handle this situation by monitoring the DBus state, but I'm not sure it's correct. We definitely don't have much coverage of killing the client process in CI, which would explain how this race (if present) hasn't been caught earlier.
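The hypothesized deadlock can be illustrated with a small self-contained sketch. This is not rpm-ostree code: a plain TCP socket stands in for the transaction progress DBus socket, and the names are made up. The daemon-side thread blocks waiting for a client that has already died; the one-second timeout below exists only so the demo terminates, whereas the real daemon had no such guard, which is exactly the wedge.

```python
import socket
import threading

def daemon_wait_for_client(server, result):
    # Simulates the daemon thread that waits for the transaction client
    # to connect. The timeout is for demo purposes only; without it this
    # accept() blocks forever once the client process is gone.
    server.settimeout(1.0)
    try:
        conn, _ = server.accept()
        result["connected"] = True
        conn.close()
    except socket.timeout:
        # Client was killed before connecting: the "transaction" is wedged.
        result["connected"] = False

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

result = {}
t = threading.Thread(target=daemon_wait_for_client, args=(server, result))
t.start()
# The "client" dies here: it never calls connect(), mirroring the MCD pod
# being killed mid-upgrade.
t.join()
server.close()
print("client connected:", result["connected"])
```

With no timeout the daemon thread would stay parked in the blocking wait, which matches the strace output above showing nothing but an interrupted read.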
In current upstream git we have https://github.com/coreos/rpm-ostree/pull/2387, which would have helped prove/disprove this theory, but that's not in RHEL 8.2/OpenShift 4.6. I'm going to look at beefing up the logging here and seeing if I can reproduce this race.

OK, yeah, this is easy to reproduce with:
```
diff --git a/src/app/rpmostree-clientlib.cxx b/src/app/rpmostree-clientlib.cxx
index 93401d64..8a4555b4 100644
--- a/src/app/rpmostree-clientlib.cxx
+++ b/src/app/rpmostree-clientlib.cxx
@@ -713,11 +713,17 @@ impl_transaction_get_response_sync (GDBusConnection *connection,
G_CALLBACK (on_owner_changed),
tp);
}
+
+ if (getenv ("exit_connect"))
+ _exit (1);
transaction = transaction_connect (transaction_address, cancellable, error);
if (!transaction)
goto out;
+ if (getenv ("exit_before_start"))
+ _exit (1);
+
sigintid = g_unix_signal_add (SIGINT, on_sigint, cancellable);
/* setup cancel handler */
@@ -736,6 +742,9 @@ impl_transaction_get_response_sync (GDBusConnection *connection,
error))
goto out;
+ if (getenv ("exit_after_start"))
+ _exit (1);
+
/* FIXME Use the 'just_started' flag to determine whether to print
* a message about reattaching to an in-progress transaction,
* like:
```
Then `env exit_connect=1 rpm-ostree upgrade` wedges the daemon; further operations just hang. You can't even `rpm-ostree cancel`. It needs a `systemctl restart rpm-ostreed`.
@walters Excellent. Great analysis. The customer only hit this on two nodes; the remaining nodes upgraded successfully. They will be attempting to upgrade to 4.7 today, and we'll monitor to see if this situation is hit again. If it is, do you want me to grab additional data?

> We'll monitor to see if this situation is hit again. If it is, do you want me to grab additional data?

Yes, the commands from https://bugzilla.redhat.com/show_bug.cgi?id=1982389#c3, but the core dump is probably best. Also, what would help support my theory is if we see the MCD pod being killed at the same time as it's trying to update. The code in the MCO tries to prevent this, but there have been some bugs found in that, I think, e.g.:

- https://github.com/openshift/machine-config-operator/pull/2395
- https://github.com/openshift/machine-config-operator/pull/2631

I'll see if I can reproduce this, though when I was talking about CI above I was thinking of the rpm-ostree CI. OpenShift CI obviously covers upgrades, which, if my theory is correct, would also sometimes hit this, but I haven't seen it there. Though we may not have looked hard enough.

We also probably should look at an MCO workaround for this; the simplest may be just `systemctl restart rpm-ostreed` when the MCD pod comes up.

*** Bug 2057544 has been marked as a duplicate of this bug. ***

https://bugzilla.redhat.com/show_bug.cgi?id=2057544 will track the backport of the MCO-side workaround for this.

This has been fixed by https://access.redhat.com/errata/RHBA-2022:1883, but I missed attaching this BZ to the errata. Currently that's just for 8.6, which will ship in 4.11. So for now I'm closing this as fixed; we could backport to 8.4, but it looks like we have the MCO-side workaround backported to 4.8 in https://bugzilla.redhat.com/show_bug.cgi?id=2057544, which I think will be sufficient to tide us over.
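The proposed MCO-side workaround is just a one-shot daemon restart when the MCD pod comes up, clearing any transaction left wedged by a previous client death. A minimal sketch of that idea, assuming the pod can invoke the host's `systemctl` (the function name is hypothetical, and the command is parameterized only so the logic can be exercised without systemd):

```python
import subprocess

def clear_stuck_daemon(restart_cmd=("systemctl", "restart", "rpm-ostreed")):
    # Run the restart command once at pod startup; returns True on success.
    # check=False: a failure is reported via the return value, not raised.
    return subprocess.run(restart_cmd, check=False).returncode == 0
```

Restarting unconditionally is coarse but safe here, since rpm-ostreed is DBus-activated and holds no state that survives a restart anyway.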
If the fix is needed for 4.7, let's just get the MCO workaround shipped there; I looked at that in https://github.com/openshift/machine-config-operator/pull/2967#issuecomment-1178074956

This was backported to 8.4 in https://bugzilla.redhat.com/show_bug.cgi?id=2137420

See also https://issues.redhat.com/browse/OCPBUGS-13407

Older systems may still need a `systemctl restart rpm-ostreed`, but future versions will contain the fix and not need this.