Bug 2253099 - evolution frequently dies on Rawhide, logged only as "killed"
Summary: evolution frequently dies on Rawhide, logged only as "killed"
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: webkitgtk
Version: rawhide
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Michael Catanzaro
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-12-06 00:55 UTC by Adam Williamson
Modified: 2024-02-08 15:54 UTC

Fixed In Version: webkitgtk-2.43.4-2.fc40
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-08 15:05:39 UTC
Type: Bug
Embargoed:


Links
System          ID      Private  Priority  Status  Summary  Last Updated
WebKit Project  268744  0        None      None    None     2024-02-06 14:32:27 UTC

Description Adam Williamson 2023-12-06 00:55:17 UTC
Since I upgraded my system from Fedora 39 to Rawhide, Evolution has been dying constantly, to the point of being almost unusable. It is not, seemingly, crashing - there's no core dumped, abrt sees nothing, coredumpctl sees nothing. If I run Evolution from a console and wait for it to die, all I see is this:

[adamw@xps13a closebugs (eol-close-date %)]$ evolution 
Killed

...that's *it*. Not a sausage more. Nothing in the system journal either.
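
For reference, a shell prints `Killed` when the child ended on SIGKILL, and the exit status it records then encodes the signal. The following is a minimal sketch, assuming bash or dash (which report 128 plus the signal number); `describe_exit` is a hypothetical helper name, not part of Evolution or any tool mentioned here.

```shell
# describe_exit STATUS: say whether an exit status means "ended by a signal".
# bash and dash report 128 + signal number for a signal-terminated child.
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

# Demonstration: a child shell sends SIGKILL (signal 9) to itself.
status=0
sh -c 'kill -s KILL $$' || status=$?
describe_exit "$status"   # prints: terminated by signal 9
```

Running `evolution; describe_exit $?` from a terminal would confirm in the same way that the process ended on signal 9 rather than exiting on its own.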

I can't seem to find anything much consistent about what triggers this. Sometimes Evo just dies in the background when I'm not interacting with it at all (presumably while refreshing something).

My upgrade to Rawhide didn't change the version of Evolution itself - it was evolution-3.50.1-1.fc39.x86_64 , and went to evolution-3.50.1-1.fc40.x86_64 . So I guess the issue must be triggered by something Evolution uses, but I'm not sure what.

I have two mail accounts configured - my work (Red Hat) mail via GNOME Online Accounts, and my personal mail just as a regular IMAPX account. I also have calendars and contacts from my work account, and from my personal account via caldav/carddav.

Comment 1 Adam Williamson 2023-12-06 00:55:43 UTC
Oh, I've since updated to evolution-3.50.2-1.fc40.x86_64 and related packages, but that didn't solve the problem.

Comment 2 Milan Crha 2023-12-06 12:28:16 UTC
Thanks for the bug report. I have an up-to-date rawhide test machine and evolution is not killed within 2 minutes. I'll keep it running for a longer time to see whether it'll be triggered.

I'm running it under gdb, in case the kill is caused by anything internal to Evolution.

Otherwise I do not know what might kill it, maybe an OOM service?

Comment 3 Adam Williamson 2023-12-06 14:59:42 UTC
It can happen more or less immediately, or it can take an hour or two. I don't think I've made it through a day, or even half a day, without it dying, since the upgrade. But of course it may depend on your accounts, emails, refresh settings, yadda yadda.

I can try running it in gdb here too, I guess. Hopefully it won't overpower my weedy laptop.

I don't think it's an OOM kill; that would be logged, whether it was the systemd or the kernel OOM killer doing it.
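
One way to check that claim is to search the logs for OOM activity around the time of the kill. This is only a sketch: `oom_lines` is a hypothetical helper, and the patterns are an assumption about typical kernel wording rather than an exhaustive list.

```shell
# oom_lines: print lines from stdin that look like kernel OOM-killer activity.
# The patterns are an assumption about common kernel messages, not a complete list.
oom_lines() {
    grep -iE 'out of memory|oom-kill' || true   # no match is not an error here
}

# Typical use on a systemd machine (not run here):
#   journalctl -k -b | oom_lines        # kernel messages from the current boot
#   journalctl -b -u systemd-oomd       # userspace OOM daemon, if it is enabled
```

An empty result from both commands supports the conclusion that neither OOM killer was involved.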

Comment 4 Milan Crha 2023-12-06 16:19:50 UTC
My accounts are rather mute, not much traffic in them. I've had evo running for a few hours now, and it has not crashed or been killed so far. This is with an up-to-date Fedora Rawhide.

Comment 5 Adam Williamson 2023-12-06 16:34:54 UTC
And of course, now that I've started running it through gdb, it hasn't died here either...let's hope it's just been "lucky" so far, and not that running it through gdb somehow "fixes" the problem...

Comment 6 Adam Williamson 2023-12-07 02:50:31 UTC
whole day running through gdb, didn't die once. sigh. I guess running through gdb stops the bug happening, somehow.

Comment 7 Milan Crha 2023-12-07 08:18:35 UTC
That's bad :-/ and when you run without gdb, then it'll crash immediately/soon?

Comment 8 Adam Williamson 2023-12-07 20:10:28 UTC
Yeah, within an hour usually.

So, I just got it to die in gdb after a suspend/resume (not sure if that breaks anything). This is what I have in gdb right now:

Program terminated with signal SIGKILL, Killed.
The program no longer exists.

I can't get any kind of backtrace. Can I do anything useful?

Comment 9 Milan Crha 2023-12-08 08:28:42 UTC
That's odd. Evolution surely doesn't kill itself, something did it, but I do not know how to figure out what it was and why it did that.

Comment 10 Adam Williamson 2023-12-08 18:50:56 UTC
Yeah, me either. Whatever it is is not logging a damn thing that I can see. There's absolutely no message that looks relevant anywhere that I can find, around the time Evo was last killed (according to the timestamp on comment #8). very mysterious :|

Comment 11 Adam Williamson 2024-01-03 19:30:09 UTC
FWIW, this seemed to go away for some time, but recently it's come back with a vengeance, again without any obvious rhyme or reason or helpful logging.

I'm reluctantly going to have to try switching to Thunderbird for a while at least :(

Comment 12 Milan Crha 2024-01-04 06:48:48 UTC
There is going to be a release this Friday (tomorrow), but I guess it won't help much, because what you face is a "killed" message, which means something killed the process. Evolution did not kill itself for sure, thus it might be something in the system doing so for whatever reason. The fact that "that thing" did not log what it killed or why doesn't help at all.

That being said, without knowing what killed the process and why, it's really hard to help. Even when you ran it under gdb, it was not able to catch where the kill came from, which I take to mean that "something from the outside" killed evo for whatever reason.

Comment 13 Adam Williamson 2024-01-04 16:21:23 UTC
I understand that, but the fly in the ointment is - this isn't happening to anything else. I run several apps permanently (a browser, a text editor, a console, some chats), and none of them are getting mysteriously killed. Since I set up Thunderbird yesterday it has not been mysteriously killed one time.

Whoever the murderer is, they apparently only like to take Evolution as a victim...

Comment 14 Milan Crha 2024-01-30 06:49:33 UTC
Evolution itself doesn't do any really crazy things, it might be some of the libraries evo uses, I guess. I might be completely wrong with this, still, maybe it's something WebKitGTK related, its gigacage and maybe bubblewrap features. I do not know, it's the only complicated thing I can think of. It would help to know what killed evolution, but when there's nothing in the journal nor on the terminal, then it's kinda hard to address this.

If you'd like to give it a try, to disable the gigacage, run Evolution with the following environment variable set:

   GIGACAGE_ENABLED=0

and to disable bubblewrap:

   WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1

Comment 15 Milan Crha 2024-01-31 10:10:45 UTC
Okay, I tried to play with this on a rawhide machine and I get the crash too, from time to time, on different occasions, which are:
- when opening mail account properties from Edit->Preferences->Mail Accounts;
- when scrolling the folder tree with a mouse wheel;
- when switching from the Mail to the Calendar view.

There's nothing common here, it's just semi-random (or at least it looks like that). There is nothing logged in dmesg or journalctl about that kill, and I suppose the oom-killer would log its own activity, thus it's not it.

I opened a thread on the devel list, in the hope that someone could help to investigate this. It's here:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/22WZRFTJT5RKSUSTY6GJB2VVQTVBYGDQ/

Comment 16 Yanko Kaneti 2024-01-31 12:45:56 UTC
Just my 2c. 
Here it started somewhere around the kernel 6.7 rcs and (or maybe) the glib2 upgrade to 2.79.
The other thing besides evolution that dies randomly here is liferea. It definitely dies from a SIGKILL, and bpf-sigsnoop tells me it's sending the signal to itself.
Haven't been able to catch the evolution case yet.

Again not really reproducible.
I would think it's either glib or the kernel at fault here.

Comment 17 Milan Crha 2024-01-31 13:10:17 UTC
Good it's not only Evolution. It would be weird if it were.

With help from folks on Fedora's devel list, I also figured out that evo sends the kill signal to itself, so I'd guess I could catch it in gdb, but I had no luck (I may have used the wrong function). Check the referenced thread, if you wish, for more details. I'll be happy to test other things if anybody has an idea for how to catch the part or library that sends the signal.

Comment 18 Adam Williamson 2024-01-31 17:35:38 UTC
still, it's definitely not *everything*. I'm reluctantly back on Thunderbird for now as it was just happening too often to Evo, and it hasn't happened once to that, or any of the other apps I run (I typically run Firefox, gedit, gnome-terminal, and Slack, Discord and Element out of flatpaks; often I'll open up virt-manager).

So...it still feels like there must be some path here that Evo and liferea and maybe some other things hit, but not *all* things hit. Not even "all glib2 things", since I'm pretty sure gedit and gnome-terminal at least also use glib2?

Comment 19 Yanko Kaneti 2024-01-31 19:59:50 UTC
The bit on the list about a seccomp kill from the kernel looking like that, and that webkit sets a seccomp profile, probably points to something in the kernel? While I would still expect something like that to be logged either in the kernel log or audit

Comment 20 Yanko Kaneti 2024-01-31 20:07:59 UTC
@awilliam if it happens to you quite often perhaps you could try running Evo with WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1 for a bit and see if it helps.

Comment 21 Milan Crha 2024-02-01 08:28:23 UTC
> While I would still expect something like that to be logged either in the kernel log or audit

I apology for my ignorance, the audit log, it's part of the journalctl output, right?

I wasn't sure, so I checked: liferea also uses WebKitGTK, the 4.1 API, the same as Evolution. That's one thing the two apps have in common.

Comment 22 Milan Crha 2024-02-01 09:37:02 UTC
(In reply to Yanko Kaneti from comment #20)
> if it happens to you quite often perhaps you could try
> running Evo with WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1 for a bit and
> see if it helps.

I tried it and it crashed even sooner, which might be just a coincidence.

I do not have any precise steps, I roughly:
a) run evo in the Mail view from a terminal
b) click on some mails
c) press Ctrl+N 3-5 times (it opens 3-5 composer windows)
d) press Esc 3-5 times (it closes the opened composers)
e) repeat c) and d) 3-5 times
f) open Edit->Preferences->Mail Accounts
g) double-click on some accounts and close the dialog with Esc
h) switch to the Calendars window
i) switch to the Contacts window
j) switch to the Tasks window
k) switch to the Memos window
l) switch to the Mail window
m) repeat from b)

It sometimes crashes when switching views (steps h) to l)), sometimes when opening mail account properties (step g)), sometimes when playing with the composers (steps c), d) and e)).

Comment 23 Adam Williamson 2024-02-01 16:16:04 UTC
> I apology for my ignorance, the audit log, it's part of the journalctl output, right?

Some audit stuff is logged to the journal, but there is also /var/log/audit/audit.log .
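
A rough sketch of how both places could be searched for seccomp records, assuming the raw audit.log format in which each record begins with `type=NAME`; `seccomp_records` is a hypothetical helper name.

```shell
# seccomp_records: keep only seccomp records from raw audit log text on stdin.
# Assumes the raw audit.log format, where each record starts with type=NAME.
seccomp_records() {
    grep 'type=SECCOMP' || true   # no match is not an error here
}

# Typical use (needs root; not run here):
#   sudo cat /var/log/audit/audit.log | seccomp_records
#   sudo ausearch -m SECCOMP -ts today     # same idea via the audit tools
#   journalctl -b _TRANSPORT=audit         # audit records that reached the journal
```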

Comment 24 Yanko Kaneti 2024-02-02 16:58:00 UTC
It seems it helps to trigger the issue if the machine is under load when evolution tries to do something heavy, like display the messages list of a 200k+ folder.

 $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95

Helps it happen almost immediately.

Still at a loss how to debug it further.

Comment 25 Adam Williamson 2024-02-02 17:10:12 UTC
ah, that definitely seems to make sense. I run a fairly underpowered laptop as my main system and it certainly does seem plausible that it was particularly bad when I'm running VMs or containers alongside my usual workload (which already stresses the CPU a bit, thanks giant web frameworks we need just to run chat apps these days).

I'm still on Thunderbird ATM, but big thanks to everyone who's helping to debug this...

Comment 26 Yanko Kaneti 2024-02-04 11:42:09 UTC
After the helpful information on fedora-devel I tried running the crashing scenario with a loaded machine, but with rtkit-daemon.service masked and stopped, which helped to avoid the kernel kill.

A little bit of grepping makes me think it's not evolution itself that makes its main process realtime priority, but webkitgtk, which also explains why only evolution and liferea are affected.

I would think that no one wants webkit as used by evolution, to have anything realtime in it. Even if your HTML mails may have funny cat videos in them. I can't seem to find any webkit knobs to disable the threads prioritization for this usecase.

As to why it started happening recently, I suspect it's either a new kernel issue or a kernel fix that actually acts on realtime threads.
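
For anyone wanting to repeat that experiment, masking and stopping the service could look roughly like the sketch below. The helper names and the `RUN` variable are hypothetical; the sketch only prints the commands by default, since masking a system service affects every application that relies on it.

```shell
# Dry run by default: commands are printed, not executed.
# Set RUN=sudo in the environment to actually apply them.
RUN=${RUN:-echo}

rtkit_disable() {
    # mask blocks any later activation; --now also stops the running unit
    $RUN systemctl mask --now rtkit-daemon.service
}

rtkit_restore() {
    $RUN systemctl unmask rtkit-daemon.service
    $RUN systemctl start rtkit-daemon.service
}

rtkit_disable
```

`rtkit_restore` undoes the change once testing is finished.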

Comment 27 Michael Catanzaro 2024-02-04 23:22:38 UTC
Carlos Garcia is investigating this. He thinks something is wrong with the display link thread (whatever that is) and that it probably doesn't need to be real time at all.

Comment 28 Michael Catanzaro 2024-02-06 14:34:19 UTC
It seems we have an extra display link monitor by mistake. Maybe the killing will stop after fixing that? I will do a new rawhide build with this patch for testing purposes.

Comment 29 Milan Crha 2024-02-06 16:40:01 UTC
If I'm not mistaken, once this is fixed, there should be no journalctl entry like this one:

>   Jan 31 10:49:22 localhost.localdomain rtkit-daemon[826]:
>   Successfully made thread 4820 of process 4640 (/usr/bin/evolution)
>   owned by '1000' RT at priority 5.

i.e. rtkit-daemon should not be making threads realtime for evolution, or even for the MiniBrowser:

   $ /usr/libexec/webkit2gtk-4.1/MiniBrowser https://bugzilla.redhat.com/show_bug.cgi?id=2253099

   $ sudo journalctl -xb | grep MiniBrowser
   Feb 06 17:37:06 localhost.localdomain rtkit-daemon[839]: Successfully made thread 3117 of
   process 3048 (/usr/libexec/webkit2gtk-4.1/MiniBrowser) owned by '1000' RT at priority 5.
   Feb 06 17:37:06 localhost.localdomain rtkit-daemon[839]: Successfully made thread 3048 of
   process 3048 (/usr/libexec/webkit2gtk-4.1/MiniBrowser) owned by '1000' RT at priority 5.
   Feb 06 17:38:09 localhost.localdomain rtkit-daemon[839]: Successfully made thread 3363 of
   process 3048 (/usr/libexec/webkit2gtk-4.1/MiniBrowser) owned by '1000' RT at priority 5.
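
Counting such entries per process can be scripted. A minimal sketch, assuming each rtkit-daemon message sits on a single journal line in the wording quoted above and that the executable path contains no spaces; `rt_grants` is a hypothetical helper name.

```shell
# rt_grants: read journal text on stdin and count realtime grants per process.
# Output columns: PID, executable path, number of threads made RT.
rt_grants() {
    sed -n 's/.*Successfully made thread [0-9]* of process \([0-9]*\) (\([^)]*\)).* RT at priority.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Typical use (not run here):
#   journalctl -b -u rtkit-daemon | rt_grants
```

Fed the three MiniBrowser lines above, this would report `3048 /usr/libexec/webkit2gtk-4.1/MiniBrowser 3`.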

Comment 30 Michael Catanzaro 2024-02-06 16:45:35 UTC
No, the use of realtime threads has not been removed. But now there will be only one, rather than two.

Comment 31 Milan Crha 2024-02-06 17:36:40 UTC
The log from the MiniBrowser shows 3, created at different times. I did not even reload the page; I just left it open while writing the comment #29 text. I've no idea why it thought asking for the third thread is a good idea.

It would be sad if the rtkit daemon kills the app again, instead of the WebKitWebProcess (which would be a pita too, when reading a page content, but less painful than having the app itself killed, which does more than WebKit pages). Honestly, realtime threads are useless for apps like evo; an option letting the consumer choose whether it wants them (with `false` being the default) would be a better choice from my point of view. I do not know why you think everyone wants realtime threads and, when they cannot be satisfied, should panic in a "kill me now" way. That does not make sense to me and is, obviously, harmful.

I can give your test package a try in "real action" (aka in evolution). Let me know when it's built, please.

Comment 32 Michael Catanzaro 2024-02-06 20:14:10 UTC
Carlos says that 3 threads is expected if you have multiple monitors.

With this patch, there should generally be only 1, except for a short while when moving the web view between monitors or for a short while if the web view is created on a secondary monitor.

I haven't started the build yet but will get to it soon.

Comment 33 Michael Catanzaro 2024-02-07 00:06:21 UTC
(In reply to Milan Crha from comment #31)
> Let me know when it's built, please.

Build started here: https://koji.fedoraproject.org/koji/taskinfo?taskID=113076130

You'll probably notice it finish before I do, but I'll post here tomorrow if not.

Comment 34 Fedora Update System 2024-02-07 05:51:33 UTC
FEDORA-2024-98136d71b3 (webkitgtk-2.43.4-2.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-98136d71b3

Comment 35 Milan Crha 2024-02-07 07:15:44 UTC
No, I'm sorry, I've installed webkit2gtk4.1-2.43.4-2.fc40.x86_64 and evolution is still killed while doing the things from comment #22.

Comment 36 Fedora Update System 2024-02-07 07:20:31 UTC
FEDORA-2024-98136d71b3 (webkitgtk-2.43.4-2.fc40) has been pushed to the Fedora 40 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 37 Milan Crha 2024-02-07 07:24:04 UTC
Hmm, negative karma on a rawhide update does not stop it from merging. I'm reopening this.

Comment 38 Milan Crha 2024-02-07 07:29:05 UTC
> Hmm, negative karma on a rawhide update does not stop it from merging

Aha, there's a difference between setting negative feedback on a respective bug and negative feedback on "is generally useful", where I did not change the latter, because I cannot speak of general usability. I'll try to not forget this little detail next time.

Comment 39 Adam Williamson 2024-02-07 07:31:37 UTC
It still won't cause the update not to go stable. It's designed that way intentionally. If you get -3 (by default) before the update clears any gating tests it's subject to it'll be obsoleted, though. Well, uh. Probably.

Comment 40 Yanko Kaneti 2024-02-07 09:07:00 UTC
From the little I've gleaned from google about signal processing today, if any thread of a process is targeted by a SIGKILL, the whole process goes.
Which is consistent with what we've been seeing here.

- webkit (in the context of the main evolution process) assigns (via rtkit-daemon) a rt priority for a thread it has created from the same process

- the rt hard timeout (200000 us, which in this case is the same as the soft) is hit and the kernel RT watchdog sends SIGKILL which goes for the whole process

Why a "display link thread" (whatever this is) hits the limit is a question in itself, but I think fundamentally using rt threads in webkit as used in this context seems inappropriate.
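
The timeout mentioned above corresponds to the RLIMIT_RTTIME resource limit, which setrlimit(2) describes as sending SIGXCPU at the soft limit and SIGKILL at the hard limit. A small sketch for inspecting it on Linux follows; `rttime_limit` is a hypothetical helper that parses the `Max realtime timeout` row of `/proc/PID/limits`.

```shell
# rttime_limit: read /proc/PID/limits text on stdin and print the
# "Max realtime timeout" (RLIMIT_RTTIME) soft and hard values, in microseconds.
rttime_limit() {
    awk '/^Max realtime timeout/ { print "soft=" $4, "hard=" $5 }'
}

# Show the limit for the current shell, if /proc is available (Linux only).
if [ -r "/proc/$$/limits" ]; then
    rttime_limit < "/proc/$$/limits"
fi

# For a running application (hypothetical invocation, not run here):
#   rttime_limit < "/proc/$(pidof evolution)/limits"
```

On an affected process the thread above would lead one to expect `soft=200000 hard=200000`; an ordinary shell typically shows `unlimited` for both.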

Comment 41 Michael Catanzaro 2024-02-07 13:26:10 UTC
(In reply to Yanko Kaneti from comment #40)
> Why a "display link thread" (whatever this is) hits the limit is a question
> in itself, but I think fundamentally using rt threads in webkit as used in
> this context seems inappropriate.

This thread is required to update the monitor every 16ms (60fps). It should never take longer than that. 200ms is missing the deadline by more than an order of magnitude, so something has to go *really* wrong for this to happen.

I'll let Carlos Garcia know that it's not fixed.
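
For scale, the arithmetic behind that comparison can be checked with shell integer math; the variable names are illustrative only.

```shell
frame_us=$((1000000 / 60))   # one frame at 60 fps, in microseconds (integer division)
limit_us=200000              # RT timeout quoted in comment #40

echo "frame budget: ${frame_us} us"                            # prints: frame budget: 16666 us
echo "limit spans about $((limit_us / frame_us)) frames"       # prints: limit spans about 12 frames
```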

Comment 42 Michael Catanzaro 2024-02-07 13:27:18 UTC
(P.S. Since this was a rawhide update, there is no harm in letting it go stable. That just makes it easier to test. If it was a stable release update, then I would have prepared a scratch build instead.)

Comment 43 Michael Catanzaro 2024-02-07 13:32:14 UTC
It seems I'm several hours behind in this conversation, and Carlos has already decided to stop making the thread realtime. I'll build another update, and that should really fix this.

Comment 44 Yanko Kaneti 2024-02-08 08:33:46 UTC
webkitgtk-2.43.4-3.fc40 fixes it here.
One RT thread but only for the WebKitWebProcess, with either evolution or liferea.

Thanks
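
A result like this can be checked by listing threads that sit in a realtime scheduling class. This is a sketch assuming the procps `ps` found on Linux; `rt_threads` is a hypothetical helper that filters on the scheduling-class column.

```shell
# rt_threads: from `ps -eLo pid,tid,cls,rtprio,comm` output on stdin,
# keep only threads in a realtime class (RR = round robin, FF = FIFO).
rt_threads() {
    awk '$3 == "RR" || $3 == "FF"'
}

# Typical use (procps ps on Linux; not run here):
#   ps -eLo pid,tid,cls,rtprio,comm | rt_threads
```

With the fixed package, the expectation from this thread is that no line for the main evolution or liferea process shows up, only one for WebKitWebProcess.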

Comment 45 Adam Williamson 2024-02-08 15:54:36 UTC
Thanks a bunch to everyone who contributed to help fix this! It was really awesome to see the collaboration.

