Bug 2135778

Summary: systemd-coredump times out while processing a crash, gdb can't attach to a stuck process
Product: [Fedora] Fedora
Reporter: Kamil Páral <kparal>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 37
CC: awilliam, fedoraproject, filbranden, flepied, lnykryn, msekleta, robatino, ryncsn, ssahani, s, systemd-maint, yuwatana, zbyszek
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: RejectedBlocker AcceptedFreezeException
Fixed In Version: systemd-251.7-611.fc37 systemd-250.9-1.fc36
Last Closed: 2022-11-02 19:53:14 UTC
Type: Bug
Bug Blocks: 2009540    
Attachments:
system journal (flags: none)
rpm -qa output (flags: none)

Description Kamil Páral 2022-10-18 12:00:23 UTC
Description of problem:
I see something very weird happening. If I make gnome-calendar freeze as described in [1], 
the journal says:

Oct 18 13:33:38 f37 kernel: gnome-calendar[2187]: segfault at 18 ip 00007fc854e467b4 sp 00007ffe800e3718 error 4 in libglib-2.0.so.0.7400.0[7fc854de4000+9200>
Oct 18 13:33:38 f37 kernel: Code: 0a 6c 04 00 ba 32 03 00 00 48 8d 35 9f 05 04 00 48 8d 3d 08 f9 02 00 e8 6a fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e >
Oct 18 13:33:38 f37 systemd[1]: Created slice system-systemd\x2dcoredump.slice - Slice /system/systemd-coredump.
Oct 18 13:33:38 f37 systemd[1]: Started systemd-coredump - Process Core Dump (PID 2318/UID 0).

However, that systemd-coredump service never finishes and times out. It waits 5 minutes in this state:

● systemd-coredump - Process Core Dump (PID 2318/UID 0)
     Loaded: loaded (/usr/lib/systemd/system/systemd-coredump@.service; static)
     Active: active (running) since Tue 2022-10-18 13:33:38 CEST; 1min 17s ago
      Until: Tue 2022-10-18 13:38:38 CEST; 3min 42s left
TriggeredBy: ● systemd-coredump.socket
       Docs: man:systemd-coredump(8)
   Main PID: 2319 (systemd-coredum)
      Tasks: 2 (limit: 3358)
     Memory: 312.4M
        CPU: 797ms
     CGroup: /system.slice/system-systemd\x2dcoredump.slice/systemd-coredump
             ├─2319 /usr/lib/systemd/systemd-coredump
             └─2323 "(sd-parse-elf)"

Oct 18 13:33:38 f37 systemd[1]: Started systemd-coredump - Process Core Dump (PID 2318/UID 0).


and then looks like this:

× systemd-coredump - Process Core Dump (PID 2318/UID 0)
     Loaded: loaded (/usr/lib/systemd/system/systemd-coredump@.service; static)
     Active: failed (Result: timeout) since Tue 2022-10-18 13:38:38 CEST; 9min ago
   Duration: 5min 164ms
TriggeredBy: ● systemd-coredump.socket
       Docs: man:systemd-coredump(8)
    Process: 2319 ExecStart=/usr/lib/systemd/systemd-coredump (code=killed, signal=TERM)
   Main PID: 2319 (code=killed, signal=TERM)
        CPU: 831ms

Oct 18 13:33:38 f37 systemd[1]: Started systemd-coredump - Process Core Dump (PID 2318/UID 0).
Oct 18 13:38:38 f37 systemd[1]: systemd-coredump: Service reached runtime time limit. Stopping.
Oct 18 13:38:38 f37 systemd[1]: systemd-coredump: Failed with result 'timeout'.


Coredumpctl doesn't list that problem at all.


Also, even though the journal says gnome-calendar segfaulted, I can still see its window, and gnome-shell reports it as "not responding: kill / wait". If I try to attach to the stuck process using gdb, gdb gets stuck as well. It prints "Attaching to process NNNN" and doesn't respond to Ctrl+C or anything else.

It seems to me that something is going wrong here, perhaps caused by some low-level library, but I don't know how to debug this further.


[1] https://gitlab.gnome.org/GNOME/gnome-calendar/-/issues/892


Version-Release number of selected component (if applicable):
systemd-251.6-609.fc37.x86_64
gnome-calendar-43.0-6.fc37.x86_64
gdb-12.1-4.fc37.x86_64

How reproducible:
always

Steps to Reproduce:
1. make gnome-calendar freeze as described in [1]
2. see that systemd-coredump process was started, but it doesn't finish (`systemctl status systemd-coredump*`), times out after 5 minutes
3. see that `coredumpctl` doesn't list that problem
4. try to attach to gnome-calendar PID using gdb (`gdb -p PID`), while the gnome-calendar window is still visible (you haven't killed it). Gdb hangs after "Attaching to process".
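
Steps 2 and 3 can be checked mechanically. The following is a minimal Python sketch, assuming the `systemctl status` text has already been captured into a string (for example from the command in step 2). The helper names `parse_active_line` and `timed_out` are hypothetical, and the parsing is based only on the `Active:` lines shown in this report. The `systemctl status` output is meant for humans, so for anything more robust `systemctl show` is probably the better source.

```python
import re
from typing import Optional, Tuple

# Matches lines such as:
#   "     Active: failed (Result: timeout) since Tue 2022-10-18 ..."
#   "     Active: active (running) since Tue 2022-10-18 ..."
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)", re.MULTILINE)


def parse_active_line(status_text: str) -> Optional[Tuple[str, str]]:
    """Return (state, detail) from the first 'Active:' line, or None if absent."""
    match = ACTIVE_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def timed_out(status_text: str) -> bool:
    """True if the unit is shown as failed with result 'timeout'."""
    return parse_active_line(status_text) == ("failed", "Result: timeout")


if __name__ == "__main__":
    sample = (
        "× systemd-coredump - Process Core Dump (PID 2318/UID 0)\n"
        "     Active: failed (Result: timeout) since Tue 2022-10-18 13:38:38 CEST; 9min ago\n"
    )
    print(parse_active_line(sample), timed_out(sample))
```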

Comment 1 Kamil Páral 2022-10-18 12:00:52 UTC
Created attachment 1918736
system journal

Comment 2 Kamil Páral 2022-10-18 12:00:58 UTC
Created attachment 1918737
rpm -qa output

Comment 3 Kamil Páral 2022-10-18 12:04:38 UTC
I'd like to raise awareness of this issue and potentially propose it as a blocker. It seems to me that this could be an issue somewhere in the lower stack and could cause us some trouble. It definitely seems to break ABRT, because ABRT can't notify the user about a crash and offer to report it when not even systemd-coredump completes. However, that seems to be happening just for this gnome-calendar crash. When I send SIGSEGV e.g. to gnome-calculator, the whole systemd-coredump -> ABRT workflow works correctly.

Comment 4 Yu Watanabe 2022-10-18 12:32:22 UTC
This seems to be a duplicate of RHBZ#2134741, which is already fixed upstream by https://github.com/systemd/systemd/commit/f6e88aac2c30392a934507591d70a35ca1ea7acf.

Comment 5 Zbigniew Jędrzejewski-Szmek 2022-10-18 12:46:52 UTC
No, I don't think it can be the same issue. Here we're observing a timeout, not a crash.

Comment 6 Zbigniew Jędrzejewski-Szmek 2022-10-18 13:12:26 UTC
Actually, systemd-251.6 doesn't have the bug fixed by f6e88aac2c30392a934507591d70a35ca1ea7acf.
So it's definitely a different issue.

I managed to make gnome-calendar crash as described, and systemd-coredump (from git) runs fine.
But with 251.6 I can reproduce the issue.

Comment 7 Zbigniew Jędrzejewski-Szmek 2022-10-18 14:21:51 UTC
Phew, it's just a straightforward deadlock. We fork the child and wait for it to exit. The child
tries to write 69596 bytes to a pipe, and after the first 64k it blocks on the parent because
the pipe is full.
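
The pattern described here can be sketched in Python. This is not the systemd code and not the patch from the upstream pull request. It is a minimal illustration, assuming a parent that forks a child and receives 69596 bytes over a pipe, and the function name `collect_from_child` is hypothetical. The point is the ordering: the parent drains the pipe until EOF and only then waits for the child. Waiting first would leave the child blocked in `write()` once the pipe buffer fills up, while the parent is blocked waiting for the child to exit.

```python
import os

PAYLOAD_SIZE = 69596  # the size mentioned in this comment


def collect_from_child(payload_size: int = PAYLOAD_SIZE) -> bytes:
    """Fork a child that writes payload_size bytes to a pipe; return what the parent read."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write the whole payload, then exit immediately.
        exit_code = 1
        try:
            os.close(read_fd)
            data = b"x" * payload_size
            written = 0
            while written < len(data):
                # Blocks here whenever the pipe is full and nobody is reading.
                written += os.write(write_fd, data[written:])
            os.close(write_fd)
            exit_code = 0
        finally:
            os._exit(exit_code)

    # Parent: close our copy of the write end so that read() can return EOF.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 65536)
        if not chunk:  # EOF: the child closed its write end
            break
        chunks.append(chunk)
    os.close(read_fd)

    # Only now wait for the child. Calling waitpid() before draining the pipe
    # is the deadlock-prone ordering: the child cannot finish writing, so it
    # never exits.
    _, status = os.waitpid(pid, 0)
    if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
        raise RuntimeError("child did not exit cleanly")
    return b"".join(chunks)


if __name__ == "__main__":
    print(len(collect_from_child()))
```

With the read loop and the `waitpid()` call swapped, the same program would be expected to hang for any payload larger than the pipe buffer, which matches the symptom in this report.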

Comment 8 Adam Williamson 2022-10-18 16:26:02 UTC
Zbigniew: can you give us a sense whether this is likely to be quite a general problem - will it affect a lot of crashes - or is it quite an unusual scenario that likely won't affect many?

Comment 9 Zbigniew Jędrzejewski-Szmek 2022-10-18 16:36:04 UTC
I don't think it'll be super-common. You need a program linked to way too many libraries for the issue to occur.

https://github.com/systemd/systemd/pull/25055 should fix the issue.
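
To illustrate why only larger payloads hit the problem, here is a small Python sketch that measures how much data a freshly created pipe accepts before a write would block. The helper names `pipe_capacity` and `fits_in_pipe` are hypothetical. On Linux with 4 KiB pages the default pipe capacity is 64 KiB, but the value is system-dependent, so treat the printed number as a measurement rather than a constant.

```python
import os


def pipe_capacity() -> int:
    """Fill a new pipe with non-blocking writes and return the number of bytes accepted."""
    read_fd, write_fd = os.pipe()
    try:
        os.set_blocking(write_fd, False)
        chunk = b"\0" * 4096
        total = 0
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # The pipe is full: a blocking writer would stall at this point.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)


def fits_in_pipe(payload_size: int, capacity: int) -> bool:
    """True if a writer can finish without any reader draining the pipe."""
    return payload_size <= capacity


if __name__ == "__main__":
    capacity = pipe_capacity()
    print("pipe capacity:", capacity)
    print("69596 bytes fit without a reader:", fits_in_pipe(69596, capacity))
```

A payload that fits within the capacity is written in full even if the parent never reads before waiting, which would explain why most crashes are processed without trouble.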

Comment 10 Adam Williamson 2022-10-18 21:00:01 UTC
-4 blocker / +3 FE in https://pagure.io/fedora-qa/blocker-review/issue/978 , marking rejected blocker, accepted FE.

Comment 11 Kamil Páral 2022-10-20 10:54:58 UTC
I just found a crash in gnome-control-center, which is also affected by this systemd bug. So it's not that rare, probably.

Comment 12 Kamil Páral 2022-10-21 09:21:45 UTC
@zbyszek.pl We delayed F37 release by a week. Would it be possible to backport a patch to fix this? Or is a new systemd release planned soon after F37 is go (to at least shorten the window where people can't report certain crashes)? Thanks.

Comment 13 Adam Williamson 2022-10-21 16:43:40 UTC
kparal: I did a scratch build with a backport of the fix and found it still didn't work as the coredump generation timed out. If you want to try it, it's at https://koji.fedoraproject.org/koji/taskinfo?taskID=93196487 .

Comment 14 Kamil Páral 2022-10-24 08:29:22 UTC
(In reply to Adam Williamson from comment #13)
> kparal: I did a scratch build with a backport of the fix and found it still
> didn't work as the coredump generation timed out. If you want to try it,
> it's at https://koji.fedoraproject.org/koji/taskinfo?taskID=93196487 .

Well, it worked for me. I reproduced the calendar crash again, and it got printed to the system journal and is visible in coredumpctl.

Comment 15 Zbigniew Jędrzejewski-Szmek 2022-10-24 19:29:34 UTC
I'm fairly confident that the patch fixes the issue, though not 100%. Please test the build
that will be started soon.

Comment 16 Fedora Update System 2022-10-24 20:05:46 UTC
FEDORA-2022-c72fd8b071 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-c72fd8b071

Comment 17 Fedora Update System 2022-10-25 11:34:54 UTC
FEDORA-2022-c72fd8b071 has been pushed to the Fedora 37 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-c72fd8b071`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-c72fd8b071

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 18 Kamil Páral 2022-10-25 12:31:48 UTC
(In reply to Fedora Update System from comment #16)
> FEDORA-2022-c72fd8b071 has been submitted as an update to Fedora 37.
> https://bodhi.fedoraproject.org/updates/FEDORA-2022-c72fd8b071

Thanks, it is fixed for me.

(Note: While systemd-coredump works now, ABRT doesn't, for that particular crash. I reported that as bug 2137249.)

Comment 19 Fedora Update System 2022-11-02 19:53:14 UTC
FEDORA-2022-c72fd8b071 has been pushed to the Fedora 37 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 20 Fedora Update System 2022-12-20 19:17:35 UTC
FEDORA-2022-ef4f57b072 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2022-ef4f57b072

Comment 21 Fedora Update System 2022-12-21 02:26:24 UTC
FEDORA-2022-ef4f57b072 has been pushed to the Fedora 36 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-ef4f57b072`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-ef4f57b072

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 22 Fedora Update System 2022-12-31 01:16:10 UTC
FEDORA-2022-ef4f57b072 has been pushed to the Fedora 36 stable repository.
If the problem still persists, please make note of it in this bug report.