Bug 1982389 - rpm-ostree stuck in UpdateDeployment, no further logs
Summary: rpm-ostree stuck in UpdateDeployment, no further logs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: rpm-ostree
Version: 8.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: beta
Target Release: ---
Assignee: Colin Walters
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-14 18:28 UTC by Neil Girard
Modified: 2024-12-20 20:28 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-07 18:59:39 UTC
Type: Bug
Target Upstream Version:
Embargoed:
Flags: pm-rhel: mirror+


Attachments: none


Links
- GitHub coreos/rpm-ostree pull 2995 (open): daemon: Correctly abort not-started transactions after client exit (last updated 2021-07-15 11:40:25 UTC)
- GitHub openshift/machine-config-operator pull 2677 (open): daemon: Explicitly start rpm-ostreed, restart if we detect active txn (last updated 2021-07-15 18:24:51 UTC)

Description Neil Girard 2021-07-14 18:28:07 UTC
Description of problem:

Customer was performing an upgrade from 4.6.15 to 4.6.36, and during the upgrade 2 worker nodes are stuck applying new machine configs.  The MCD (Machine Config Daemon) is showing that the rpm-ostree process seems to be stuck in some sort of loop:

Jul 14 13:08:15 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111673 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) vanished; remaining=0
Jul 14 13:09:30 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111674 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) added; new total=1
Jul 14 13:09:30 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111674 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) vanished; remaining=0
Jul 14 13:09:30 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111675 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) added; new total=1
Jul 14 13:09:30 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111675 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) vanished; remaining=0
Jul 14 13:11:10 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111676 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) added; new total=1
Jul 14 13:11:10 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111676 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) vanished; remaining=0
Jul 14 13:11:11 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111677 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) added; new total=1
Jul 14 13:11:11 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: client(id:machine-config-operator dbus:1.111677 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0) vanished; remaining=0

We are currently not sure of a proper workaround to get the customer progressing.

Version-Release number of selected component (if applicable):

4.6.15

How reproducible:

Currently occurring in the customer's cluster.

Steps to Reproduce:
1. N/A

Actual results:

Unable to apply os image

Expected results:

Machine config applies successfully and the node has the new OS image.

Additional info:

A sosreport and must-gather are attached to the case.

Comment 1 Colin Walters 2021-07-14 18:40:38 UTC
This is really similar to https://bugzilla.redhat.com/show_bug.cgi?id=1976435

Comment 2 Colin Walters 2021-07-14 18:51:45 UTC
OK so from the journal we have:

> Jul 14 11:47:33 tor63secapworker05.ocp4-stg.soc.int.bell.ca rpm-ostree[2689365]: Initiated txn UpdateDeployment for client(id:machine-config-operator dbus:1.111485 unit:crio-6ef8ee933b95235c515b6109cb6f7aba203e92e3e87c692703bfbb98c5225e2a.scope uid:0): /org/projectatomic/rpmostree1/rhcos


And...that's it.  No further logs except messages from the MCD connecting.  It seems the transaction thread is somehow stuck.  

I think we'll need something like `strace -p $(systemctl show -p MainPID rpm-ostreed |cut -f 2 -d =)` from inside an `oc debug node` shell.  

Or we could just try a `systemctl restart rpm-ostreed` but I'd like to understand why and how it's stuck like this.

Comment 3 Colin Walters 2021-07-14 19:17:00 UTC
Other commands that may give us useful data:

systemctl status rpm-ostreed

export pid=$(systemctl show -p MainPID rpm-ostreed |cut -f 2 -d =)
cat /proc/$pid/status
cat /proc/$pid/task/*/stack
ls -al /proc/$pid/fd


Also, when we reach the limit of noninvasive inspection, we can try forcing the process to dump core:

kill -SEGV $pid

Then see if we can get a stack trace from that core dump via `coredumpctl`.  Failing that, attaching the core to the case should give us all the data we need.
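
For reference, a minimal sketch of pulling a backtrace out of the dump (assuming systemd-coredump caught it and gdb is available in the debug shell; `$pid` is the daemon PID captured above):
```
# list recent core dumps from the daemon
coredumpctl list rpm-ostreed

# show metadata for the dump (signal, time, core file location)
coredumpctl info $pid

# open the dump in gdb and collect backtraces from all threads
coredumpctl gdb $pid
(gdb) thread apply all bt
```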

It will also be useful to see after that whether the MCD retries and if that succeeds or gets stuck again.

Comment 4 Neil Girard 2021-07-14 19:42:39 UTC
Hello,

Here's the output we get from the strace command:

sh-4.4# strace -p $(systemctl show -p MainPID rpm-ostreed |cut -f 2 -d =)
strace: Process 1176421 attached
restart_syscall(<... resuming interrupted read ...>
^Cstrace: Process 1176421 detached
 <detached ...>


Basically nothing.

Customer tried the `systemctl restart rpm-ostreed` and the update progressed.

Thanks!

Comment 5 Colin Walters 2021-07-14 19:59:23 UTC
Hmm, it would have been good to get the core dump; I should have argued for that from the start.

I may try to extend sosreport to gather more data like this.

Looking through the code, I do have a theory for how this could happen:

- MCD starts `rpm-ostree upgrade` (client process)
- Daemon receives DBus request, and waits for client to connect to the transaction progress DBus socket
- MCO is also upgrading the MCD daemonset, and this ends up killing the MCD pod, which also kills the rpm-ostree client process
- We're now stuck because the transaction is pending the client connection

The rpm-ostree daemon code looks like it's *trying* to handle this situation by monitoring the DBus state, but I'm not sure it's correct.  We definitely don't have much coverage of killing the client process in CI, which would explain how this race (if present) hasn't been caught earlier.
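
If that theory holds, something like the following shell sketch (purely timing-dependent, so not a reliable reproducer) should occasionally wedge the daemon:
```
# start a client-driven upgrade, then kill the client while the daemon
# may still be waiting for it to connect to the transaction socket
rpm-ostree upgrade &
client=$!
sleep 0.1          # timing-dependent window between the D-Bus request and the connect
kill -9 $client

# if we lost the race, the daemon is stuck holding the pending transaction
rpm-ostree status   # hangs until `systemctl restart rpm-ostreed`
```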

Comment 6 Colin Walters 2021-07-14 20:03:16 UTC
In current upstream git we have https://github.com/coreos/rpm-ostree/pull/2387 which would have helped prove/disprove this theory, but that's not in RHEL 8.2/OpenShift 4.6.  I'm going to look at beefing up logging here more and seeing if I can reproduce this race.

Comment 7 Colin Walters 2021-07-14 21:53:42 UTC
OK yeah, this is easy to reproduce with:
```
diff --git a/src/app/rpmostree-clientlib.cxx b/src/app/rpmostree-clientlib.cxx
index 93401d64..8a4555b4 100644
--- a/src/app/rpmostree-clientlib.cxx
+++ b/src/app/rpmostree-clientlib.cxx
@@ -713,11 +713,17 @@ impl_transaction_get_response_sync (GDBusConnection *connection,
                         G_CALLBACK (on_owner_changed),
                         tp);
     }
+    
+  if (getenv ("exit_connect"))
+    _exit (1);
 
   transaction = transaction_connect (transaction_address, cancellable, error);
   if (!transaction)
     goto out;
 
+  if (getenv ("exit_before_start"))
+    _exit (1);
+
   sigintid = g_unix_signal_add (SIGINT, on_sigint, cancellable);
 
   /* setup cancel handler */
@@ -736,6 +742,9 @@ impl_transaction_get_response_sync (GDBusConnection *connection,
                                               error))
     goto out;
 
+  if (getenv ("exit_after_start"))
+    _exit (1);
+
   /* FIXME Use the 'just_started' flag to determine whether to print
    *       a message about reattaching to an in-progress transaction,
    *       like:
```

Then `env exit_connect=1 rpm-ostree upgrade` wedges the daemon; further operations just hang.  You can't even `rpm-ostree cancel`.  It needs a `systemctl restart rpm-ostreed`.

Comment 9 Neil Girard 2021-07-15 14:13:19 UTC
@walters Excellent.  Great analysis.  Customer only hit this on two nodes.  The remaining nodes upgraded successfully.  They will be attempting to upgrade to 4.7 today.  We'll monitor to see if this situation is hit again.  If it is, do you want me to grab additional data?

Comment 10 Colin Walters 2021-07-15 15:21:10 UTC
>   We'll monitor to see if this situation is hit again.  If it is, do you want me to grab additional data?

Yes, the commands from https://bugzilla.redhat.com/show_bug.cgi?id=1982389#c3 but the core dump is probably best.

Also, what would help support my theory is seeing the MCD pod being killed at the same time as it's trying to update.  The code in the MCO tries to prevent this, but there have been some bugs found in that area, I think, e.g.:
- https://github.com/openshift/machine-config-operator/pull/2395
- https://github.com/openshift/machine-config-operator/pull/2631

I'll see if I can reproduce this, though when I was talking about CI above I was thinking of the rpm-ostree CI.  OpenShift CI obviously covers upgrades, which, if my theory is correct, would also sometimes hit this, but I haven't seen it there.  Though we may not have looked hard enough.

We also probably should look at an MCO workaround for this; the simplest may be just running `systemctl restart rpm-ostreed` when the MCD pod comes up.
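
Roughly, such a workaround at MCD startup could look like the sketch below (illustrative only, not the actual MCO change; the `State: busy` line is what `rpm-ostree status` prints while a transaction is active):
```
# on MCD pod startup: if the daemon reports a transaction still in
# flight (e.g. left over from a killed client), bounce the daemon
if rpm-ostree status | grep -q '^State: busy'; then
  systemctl restart rpm-ostreed
fi
```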

Comment 12 Colin Walters 2022-02-25 14:21:21 UTC
*** Bug 2057544 has been marked as a duplicate of this bug. ***

Comment 13 Colin Walters 2022-02-25 15:54:30 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2057544 will track the backport of the MCO-side workaround to this.

Comment 14 Colin Walters 2022-07-07 18:59:39 UTC
This has been fixed by https://access.redhat.com/errata/RHBA-2022:1883
but I missed attaching this BZ to the errata.  Although currently that's just for 8.6, which will ship in 4.11.

So for now, I'm closing this as fixed.  We could backport to 8.4, but it looks like we have the MCO-side workaround backported to 4.8 in https://bugzilla.redhat.com/show_bug.cgi?id=2057544,
which I think will be sufficient to tide us over.

If the fix is needed for 4.7...let's just get the MCO workaround shipped there; I looked at that in
https://github.com/openshift/machine-config-operator/pull/2967#issuecomment-1178074956

Comment 16 Colin Walters 2023-05-11 19:46:29 UTC
This was backported to 8.4 in https://bugzilla.redhat.com/show_bug.cgi?id=2137420
See also https://issues.redhat.com/browse/OCPBUGS-13407

Older systems may still need a "systemctl restart rpm-ostreed", but future versions will contain the fix and will not need this.

