Created attachment 1949495 [details]
//EDIT: Please see comment 25 for a discovered root cause of this.
Description of problem:
As shown in the attached screencast, a coreutils problem and a gnome-shell problem are listed, but after I triggered one known gnome-shell crash, the coreutils problem was erased, and the count for the gnome-shell crash is 1 when it should be 2.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Created attachment 1949496 [details]
Proposed as a Blocker for 38-final by Fedora user lnie using the blocker tracking app because:
violates this criterion:
For all release-blocking desktop / arch combinations, the following applications must start successfully and withstand a basic functionality test.
Plus: erasing data sounds like a serious problem.
$ grep abrt journal.txt | grep -i size
Mar 10 02:59:57 fedora abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-03-10-02:33:24.72823-10404'
Mar 10 03:00:49 fedora abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-03-10-02:59:56.614799-13965'
Mar 10 03:04:20 fedora abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-03-10-03:00:48.754161-12450'
Have you tried to increase the default size of 5GB for crashes to something bigger? The older reports seem to be removed because your /var/spool/abrt is oversize. That's expected behavior.
I think this fails the basic functionality test, but it's not new behavior: ABRT has worked like this for a decade or longer.
It's reasonable for ABRT to clear old core dumps after some size limit is exceeded, but it's not reasonable for it to completely delete all data from old reports and then hide the old report from the ABRT UI without informing the user that older reports will be disappearing. That's just confusing.
(In reply to Kamil Páral from comment #3)
> $ grep abrt journal.txt | grep -i size
> Mar 10 02:59:57 fedora abrtd: Size of '/var/spool/abrt' >= 5000 MB
> (MaxCrashReportsSize), deleting old directory
> Mar 10 03:00:49 fedora abrtd: Size of '/var/spool/abrt' >= 5000 MB
> (MaxCrashReportsSize), deleting old directory
> Mar 10 03:04:20 fedora abrtd: Size of '/var/spool/abrt' >= 5000 MB
> (MaxCrashReportsSize), deleting old directory
> Have you tried to increase the default size of 5GB for crashes to something
> bigger? The older reports seem to be removed because your /var/spool/abrt is
> oversize. That's expected behavior.
Nope, but just like Michael said, I don't think it's reasonable for it to completely delete *ALL* data from old reports,
as users may not get a chance to report the bugs before the bug data is erased. And if the old data was created recently, like within one day,
I do think users should be informed that it is going to be deleted when the contents of that path go over the size limit.
> I think this fails the basic functionality test, but it's not new behavior: ABRT has worked like this for a decade or longer.
I haven't seen this before, I mean, all the old data being deleted at once.
> I don't think it's reasonable for it to completely delete *ALL* data from old reports
But it's not doing that. I just tested it. I set a limit of 2 GB and slowly crashed applications one by one, seeing /var/spool/abrt grow. Any time it hits the max size, it deletes one directory (or more, if it needs more space). It doesn't delete all of them. It seems to pick the largest directories first. So it's behaving quite reasonably, I think.
I'm not sure why in Lili's case the directory is oversize at the very moment a new crash happens. Lili, look into the directory, is there something weird occupying a lot of space? Is there some directory present which isn't shown in abrt gui or abrt-cli output?
Or perhaps when the whole session crashed including firefox, the crash data were so large? Try to crash just a gnome-calculator or something similarly small, it should behave as expected. Can you confirm?
> Lili, look into the directory, is there something weird occupying a lot of space? Is there some directory present which isn't shown in abrt gui or abrt-cli output?
I have reinstalled the test system to test other cases, but I remember clearly that there was only one dir left in /var/spool/abrt/.
I think your crash-data-so-large hypothesis is reasonable, so I tried to verify it. But as you can see from the new screencast I attached, the crash data is only about 800 MB,
and during that process I found that today's firefox crash data was erased, while the totem crash data created several days ago was kept.
I reported several gnome-abrt bugs last week, and I do feel this version of gnome-abrt is pretty fragile.
Created attachment 1951159 [details]
Created attachment 1951160 [details]
It's hard to judge when you don't show the directory sizes (the `ls -l` output doesn't display directory sizes, you need to use `du` for that).
Here is the output of the du command; I didn't know that gnome-abrt calculates the actual used space.
Just in case, I crashed firefox in the same way as shown in the screencast, and the data size is only about 363 MB, way smaller than 5 GB:
[root@fedora abrt]# du -h
[root@fedora abrt]# cat ccpp-2023-03-16-11\:31\:31.634567-3081/cmdline
[root@fedora abrt]# cat ccpp-2023-03-16-08:28:03.826891-1756/cmdline
Proposal: crashes should remain visible in gnome-abrt UI for at least 6 months after the crash occurs.
It's OK to free disk space by removing core dumps or big logs, even though that makes it impossible to report a crash. It's not OK to hide the crash from the gnome-abrt UI when this happens. That's really confusing and makes it impossible to use gnome-abrt to track crashes.
In my experience, ABRT rarely ever shows more than one crash at a time.
It's interesting how different our experience is. I very often have 10-20 crashes displayed in ABRT. It's true that I've increased the default 5GB max size to 10GB. But that's just doubling the size, and our experience is different by a larger factor.
So according to coredumpctl I had 8 crashes yesterday and today. All still have core dumps (the big part) present in coredumpctl. Let's see how big they are:
$ coredumpctl info 24683 | grep Size
Size on Disk: 14.4M
$ coredumpctl info 24861 | grep Size
Size on Disk: 14.3M
$ coredumpctl info 27318 | grep Size
Size on Disk: 9.3M
$ coredumpctl info 2721 | grep Size
Size on Disk: 8.2M
$ coredumpctl info 34581 | grep Size
Size on Disk: 23.2M
$ coredumpctl info 67917 | grep Size
Size on Disk: 9.9M
$ coredumpctl info 92457 | grep Size
Size on Disk: 9.0M
$ coredumpctl info 91138 | grep Size
Size on Disk: 6.7M
So I don't know what all ABRT is storing, but the cumulative size of these core dumps in coredumpctl is only 95 MB.
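Summing the quoted sizes confirms that figure (a quick sketch; the values are simply copied from the coredumpctl output above):

```shell
# per-crash "Size on Disk" values from the coredumpctl output above, in MB
printf '%s\n' 14.4 14.3 9.3 8.2 23.2 9.9 9.0 6.7 |
  awk '{ s += $1 } END { printf "%.1f MB total\n", s }'
```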
Now, six of these crashes were not packaged executables, so I guess it's reasonable for ABRT to not display them. Two of them were packaged by Fedora: /usr/libexec/xdg-desktop-portal-gnome first, and /usr/bin/gnome-text-editor second. I see only the gnome-text-editor crash in gnome-abrt, so I assume the xdg-desktop-portal-gnome crash has already been deleted by ABRT as if it had never existed. (There's not really any way to know whether ABRT ever processed it, is there? I just assume it was processed and then deleted.)
It's reasonable to delete large data if needed, but not to delete all evidence that the crash occurred.
To test Kamil's many-individual-crashes hypothesis mentioned in the blocker ticket, I set maxsize to 30 GB and reproduced the firefox and gnome-shell crashes I mentioned in comment 7,
and found that the firefox crash data is NOT deleted after the whole session crashes, but the total crash data is only 1.1 GB (screenshot1), much smaller than the default 5 GB.
So I don't think abrt should delete the firefox crash data in the default 5 GB situation.
And during that process I also ran into a situation similar to what Michael mentioned in comment 14: abrt starts deleting data when there are more than 5 crashes, with maxsize set to 30 GB (screenshot2).
Actually, abrt deletes the firefox crash data when the third (or maybe fourth) crash is created.
Created attachment 1951895 [details]
Created attachment 1951896 [details]
Created attachment 1951898 [details]
journal for screenshot2
The "abrt starts to delete data when there is plenty of space left" issue is 100% reproducible for me; would you please confirm it, Kamil? In the no-gnome-shell-crash situation, abrt does delete crash data one or two directories at a time, just as you said in comment 6, but if you look into the /var/spool/abrt directory, you will find that it starts to delete data while there is still plenty of space left.
Created attachment 1951916 [details]
screencast reproducing the issue on a newly installed machine (maxsize is the default 5 GB)
Hey @msrb, can you please have a look at what's going on here?
aren't reports *moved* from /var/spool/abrt to...somewhere else...quite quickly after creation? I wonder if there's some problem with that on lnie's system?
Discussed at 2023-03-20 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2023-03-20/f38-blocker-review.2023-03-20-16.00.html . We agreed that, for now, this is rejected as a blocker as there doesn't seem to be clear reproduction of a sufficiently serious problem to violate the release criteria. If further testing provides a better indication that there's a real problem here that may hit multiple users, we can reconsider the decision.
>aren't reports *moved* from /var/spool/abrt to...somewhere else...quite quickly after creation?
I don't think that's what happens, but just in case I searched for the disappeared ccpp dir: no, it was not moved but deleted.
>I wonder if there's some problem with that on lnie's system?
I tested on two local bare-metal machines with Fedora-Workstation-Live-x86_64-38-20230318.n.0.iso installed; it's 100% reproducible.
Here are the reproduction steps:
1) pkill -SEGV firefox
2) pkill -SEGV sleep
3) pkill -SEGV nautilus
4) pkill -SEGV gnome-photos
5) pkill -SEGV gnome-calendar
6) pkill -SEGV gnome-software
7) pkill -SEGV gnome-clocks
You can crash the apps in whatever order you like, of course.
You will see the firefox crash data removed at step 3), or maybe 4).
Starting from step 5), you will find that abrt starts to delete crash data in the default-size or large-size situations mentioned below.
We do have several problems here:
With the default max size, or a large one (say more than 25 GB; I didn't try smaller to avoid the argument about not having enough space):
1) abrt starts to delete data when there is tons of space left, and old crash data + new crash data is still way smaller than the max size.
2) abrt deletes *all* the data if the new crash data is larger than about 1 GB (I don't know the exact number; gnome-photos crash data on my t490s system is sometimes 1.2 GB, and I'm able to reproduce this issue every time with that crash). I think that may be the original issue I ran into.
In the small-size situation (say 1 GB or 2 GB) abrt seems to work well, just as mentioned in comment 6,
but even in that situation, there are some smaller problems:
3) abrt will delete the newest crash data if its size is larger than all the other crash data (and new + old data > max size), which means users will not be able to report the new issue.
4) okay, actually a little similar to the issue above: abrt will delete the largest old crash data even if it was created later than the second-largest one and is only a little bigger, say 60 MB.
I'm not sure about this one, but I don't think abrt should delete the second-oldest data instead of the oldest one.
I do think it is a big problem when users try to keep more crash data by setting maxsize larger but get the opposite result.
Adam, Kamil, and maybe others, would you please confirm? Thanks.
Hi @msrb, would you please check? Thanks.
OK, I believe I've found the problem. In an out-of-the-box scenario, where /etc/abrt/abrt.conf doesn't contain any explicit value, MaxCrashReportsSize is supposed to be 5000 MB, but it actually is 1000 MB, even though the abrtd printouts in the journal still claim 5000 MB.
I can consistently reproduce that by crashing one program after another: the overall size of /var/spool/abrt rises towards the 1000 MB limit, but when it would exceed this value, abrtd starts to delete older crashes in order to keep it under 1000 MB. There's some complex logic determining which directory to delete, but in practice it usually deletes the largest one, and if that's not enough, another one, and so on, until everything fits under 1000 MB again.
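That deletion behavior can be mimicked with a small shell sketch. This is my own illustration of the observed policy, not abrt's actual code; the spool directory and sizes are made up for the demo:

```shell
# Hedged sketch of the observed policy: while the spool exceeds the
# limit, remove the largest crash directory first. Not abrt's real code.
spool=$(mktemp -d)            # stand-in for /var/spool/abrt
limit_kb=600                  # stand-in for the effective 1000 MB limit
mkdir -p "$spool/a" "$spool/b" "$spool/c"
dd if=/dev/zero of="$spool/a/core" bs=1024 count=500 2>/dev/null
dd if=/dev/zero of="$spool/b/core" bs=1024 count=300 2>/dev/null
dd if=/dev/zero of="$spool/c/core" bs=1024 count=100 2>/dev/null
while [ "$(du -sk "$spool" | cut -f1)" -gt "$limit_kb" ]; do
  victim=$(du -sk "$spool"/*/ | sort -rn | head -n1 | cut -f2)
  rm -rf "$victim"
done
ls "$spool"    # the largest directory ("a") is gone, "b" and "c" survive
```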
But if I edit /etc/abrt/abrt.conf and explicitly add:
MaxCrashReportsSize = 2000
then my /var/spool/abrt can have 1.9 GB in size, and only then some directory gets deleted, as expected. If the overall size is e.g. 1.5 GB and I comment out MaxCrashReportsSize from abrt.conf, and cause another crash (an extremely small one, e.g. crashing `sleep` or `cat`), it immediately removes as many dirs as needed to go under 1000 MB.
The default is clearly not 5000 MB, as documented.
For the record, my testing was done in a VM with 20 GB disk containing 14 GB free space, in case it matters.
I can reproduce the same problem on my desktop with 160 GB of free space, so this is not affected by a limited space in a VM.
PLEASE NOTE: Make sure to restart abrtd.service after editing abrt.conf, otherwise the changes are sometimes not detected.
Reproposing for a blocker discussion. A 1 GB space for crashes is really quite low; some crashes consume 300-600 MB, especially web browsers or apps that rely on web tech. It might happen that you only fit 2 crashes in that space, or just 1, and the second one will cause the first one to be deleted. If the whole session crashes, follow-up app crashes (due to the environment disappearing) might overwrite the primary crash. We can then really discuss whether this violates basic app functionality.
The easiest fix here is to provide the value in abrt.conf until the root cause is found and fixed.
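A minimal sketch of that workaround, assuming the stock config path discussed in this thread (and, as noted above, restart abrtd.service after editing, since changes are sometimes not detected otherwise):

```ini
# /etc/abrt/abrt.conf -- state the documented default explicitly
MaxCrashReportsSize = 5000
```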
Thanks Kamil for confirming this ^^. To be honest, I also considered the possibility that the default value is much smaller than documented,
but the next second I dismissed that idea myself, as I had set maxsize to 30 GB on a VM with 120 GB of disk space when I reproduced the issue I mentioned in comment 15.
Just to be clear, I reproduced the four issues I mentioned in comment 24 on a bare-metal machine which has more than 200 GB of disk space.
Thanks for the thorough analysis everyone!
I just experienced this strange behavior on my laptop as well. The default value of 5000 MB is hardcoded in the source code, but it doesn't seem to be properly honored. I suspect that the problem might be an integer overflow somewhere (5 GB in bytes doesn't fit into a 32-bit unsigned integer). But that is just a wild guess. I will take a closer look later this week ;)
That sounds likely. Here are some locations where the 5000 (or whatever the setting is) is multiplied by (1024*1024):
which can probably overflow, as only ints are involved.
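A quick sanity check of the overflow hypothesis (just arithmetic, not abrt's code): wrapping 5000 MB expressed in bytes to 32 bits lands at roughly 904 MB, in the same ballpark as the observed ~1000 MB effective limit:

```shell
# 5000 MB expressed in bytes; fine in the shell's 64-bit arithmetic
bytes=$(( 5000 * 1024 * 1024 ))          # 5242880000, > 2^32
# what a 32-bit unsigned integer would hold after wraparound
wrapped=$(( bytes % (1 << 32) ))
echo "$bytes wraps to $wrapped bytes (~$(( wrapped / 1024 / 1024 )) MB)"
```

That's not exactly the 1000 MB measured above, so a plain wraparound may not be the whole story, but it does show the multiplication exceeds 32 bits.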