Bug 1050154

Summary: crash reporting causes system hang
Product: [Fedora] Fedora
Reporter: Jon McCann <william.jon.mccann>
Component: abrt
Assignee: abrt <abrt-devel-list>
Status: CLOSED NEXTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 20
CC: abrt-devel-list, aday, bnocera, dfediuck, dvlasenk, fedora, iprikryl, jfilak, mmilata, redhat, rvokal
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: abrt-2.5.0-1.fc22
Doc Type: Bug Fix
Last Closed: 2015-06-19 09:34:22 UTC
Type: Bug

Description Jon McCann 2014-01-08 18:20:38 UTC
The way we are collecting information about a crash can cause the system to hang by consuming too many resources.

There are a number of problems here.

1. The process consumes too much CPU
2. The process consumes too much memory and frequently causes the system to swap
3. The process takes far too long and the user is notified many minutes after the problem occurred. This is not acceptable for user visible crashes.
4. In some cases the entire system becomes unresponsive due to swapping and the system is likely to be powered off and data loss may occur.

Top output:
29205 root      20   0  857448   1652   1316 R  14.9  0.0   0:31.32 journalctl                                                                     
   61 root      20   0       0      0      0 D  10.6  0.0   1:57.78 kswapd0               

PS output:
ps -eaf|grep 29204
root     29204 29143  0 13:06 ?        00:00:00 /bin/sh -c if grep '^TracerPid:[[:space:]]*[123456789]' proc_pid_status >/dev/null 2>&1; then
            # We see 'TracerPid: <nonzero>" in /proc/PID/status
            # Process is ptraced (gdb, strace, ltrace)
            # Debuggers have wide variety of bugs where they leak SIGTRAP
            # to traced process and nuke it. Ignore this crash.
            echo "The crashed process was ptraced - not saving the crash"
            exit 1  # abrt will remove the problem directory
        fi
        if grep -q ^ABRT_IGNORE_ALL=1 environ \
        || grep -q ^ABRT_IGNORE_CCPP=1 environ \
        ; then
            echo "ABRT_IGNORE variable is 1 - not saving the crash"
            # abrtd will delete the problem directory when we exit nonzero:
            exit 1
        fi
        # Try generating backtrace, if it fails we can still use
        # the hash generated by abrt-action-analyze-c
        ##satyr migration:
        #satyr abrt-create-core-stacktrace "$DUMP_DIR"
        abrt-action-generate-core-backtrace
        # Run GDB plugin to see if crash looks exploitable
        abrt-action-analyze-vulnerability
        # Generate hash
        abrt-action-analyze-c &&
        abrt-action-list-dsos -m maps -o dso_list &&
        (
            # Try to save relevant log lines.
            # Can't do it as analyzer step, non-root can't read log.
            executable=`cat executable` &&
            base_executable=${executable##*/} &&
            # Test if the current version of journalctl has --system switch
            journalctl --system -n1 >/dev/null
            if [ $? -ne 0 ];
            then
                # It's not an error if /var/log/messages isn't readable:
                test -f /var/log/messages || exit 0
                test -r /var/log/messages || exit 0
                log=`grep -F -e "$base_executable" /var/log/messages | tail -99`
            else
                uid=`cat uid` &&
                log="[System Logs]:\n" &&
                log=$log`journalctl -b --system | grep -F -e "$base_executable" | tail -99` &&
                log=$log"\n[User Logs]:\n" &&
                log=$log`journalctl _UID="$uid" -b | grep -F -e "$base_executable" | tail -99` &&
                log=`echo -e "$log"`
            fi
            if test -n "$log"; then
                printf "%s\n" "$log" >var_log_messages
                # echo "Element 'var_log_messages' saved"
            fi
        )
root     29205 29204 18 13:06 ?        00:00:44 journalctl _UID=0 -b
root     29206 29204  0 13:06 ?        00:00:00 grep -F -e plymouth
root     29207 29204  0 13:06 ?        00:00:00 tail -99

Comment 1 Christian Kujau 2014-01-16 02:53:15 UTC
Duplicate of bug 1015922?

Comment 2 Jon McCann 2014-01-16 20:55:50 UTC
Doesn't look like a dup of that to me. More like this being terribly inefficient:
https://github.com/abrt/abrt/blob/master/src/plugins/ccpp_event.conf

Comment 3 Jiri Moskovcak 2014-01-20 14:28:34 UTC
(In reply to Jon McCann from comment #0)
> The way we are collecting information about a crash can cause the system to
> hang by consuming too many resources.
> 
> There are a number of problems here.
> 
> 1. The process consumes too much CPU
> 2. The process consumes too much memory and frequently causes the system to
> swap
> 3. The process takes far too long and the user is notified many minutes
> after the problem occurred. This is not acceptable for user visible crashes.

Saving the coredump takes most of the processing time. ABRT can display the popup before the dump is complete, but the user won't be able to report the crash until it is. Would that really be better from the UX perspective?

> 4. In some cases the entire system becomes unresponsive due to swapping and
> the system is likely to be powered off and data loss may occur.
> 

Can you be more specific? What application crashed and made abrt behave like that?

Comment 4 Jon McCann 2014-01-20 15:02:56 UTC
Saving the coredump may very well take a lot of time. But grepping through a dump of the journal is grossly inefficient, and we should certainly try to fix it. There are APIs to search the journal directly, and that should probably be done from C. In these situations I am seeing journalctl and the ccpp_event.conf script consuming most of the CPU in the output of top.
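
A rough sketch of that direction, kept in shell for illustration rather than C: journalctl accepts FIELD=VALUE matches and a line limit (-n), so the journal can do the filtering itself instead of streaming the whole boot log into grep. The helper name journal_match_cmd and the example path /usr/bin/plymouth are hypothetical. Matching on _COMM is also an assumption: it selects entries logged by a process with that name, which is not the same set of lines that grep -F over the message text would find. The helper only prints the command so it can be inspected before anything is run.

```shell
# Hypothetical helper: print a journalctl command that lets the journal
# filter by field match instead of piping the full boot log through grep.
#   $1 = path of the crashed executable
#   $2 = uid of the crashed process
journal_match_cmd() {
    # Same basename expansion that the ccpp_event.conf script uses.
    base_executable=${1##*/}
    # Matches on two different fields are ANDed; -n 99 keeps the last 99 entries.
    printf 'journalctl -b -n 99 _UID=%s _COMM=%s\n' "$2" "$base_executable"
}

# Example with assumed values (path and uid are placeholders):
journal_match_cmd /usr/bin/plymouth 0
```

Whether a field match captures the log lines ABRT actually wants is a separate question from the cost of the query. The sketch only shows that the tail -99 and the per-process selection can be pushed into journalctl.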

Comment 5 Christian Stadelmann 2014-01-27 11:03:20 UTC
And I don't think that ABRT needs to search through the whole journal. How about just the last 24 hours?
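
As a minimal sketch of this suggestion, journalctl's --since option can bound the read to a recent window while leaving the rest of the pipeline from the report unchanged. The helper name recent_log_cmd and the example path are hypothetical, and nothing in this thread establishes that a 24-hour window reduces the load enough. The helper prints the pipeline instead of running it.

```shell
# Hypothetical helper: print the grep pipeline from the report, with the
# journal read limited to a recent time window.
#   $1 = path of the crashed executable
#   $2 = time window in a form journalctl --since understands
recent_log_cmd() {
    base_executable=${1##*/}
    printf 'journalctl -b --since "%s" | grep -F -e "%s" | tail -99\n' \
        "$2" "$base_executable"
}

# Example with assumed values:
recent_log_cmd /usr/bin/plymouth "24 hours ago"
```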

Comment 6 Jakub Filak 2014-07-18 05:19:16 UTC
Hello, there is an ongoing discussion of the "grepping" problem on bug #1043670. Please post your comments and suggestions there.

Comment 7 Fedora End Of Life 2015-05-29 10:23:45 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 20 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.