Bug 480323

Summary: RHEL 4.8 PTRACE_ATTACH failure after auditd start
Product: Red Hat Enterprise Linux 4
Reporter: Vivek Goyal <vgoyal>
Component: kernel
Assignee: Oleg Nesterov <onestero>
Status: CLOSED WONTFIX
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: low
Version: 4.8
CC: jan.kratochvil, jburke, roland
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-06-20 16:01:31 UTC
Attachments: PTRACE_ATTACH reproducer. (flags: none)

Description Vivek Goyal 2009-01-16 14:20:19 UTC
Description of problem:

The rhts localwatchdog was hit because the vsyscall test did not finish in time.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/syscalls/vsyscall&result=Warn&rwhiteboard=kernel%202.6.9-78.30.EL%20largesmp&arch=x86_64&jobids=42002

Version-Release number of selected component (if applicable):

2.6.9-78.30.EL

How reproducible:
I have seen it 2-3 times now during various rhts runs.


Comment 1 Oleg Nesterov 2009-01-20 19:00:44 UTC
I spent a lot of time trying to reproduce this but failed. Because I know
nothing about gdb, I am not sure I understand from these logs what
really happens. I guess I have to dig into gdb's sources, but perhaps
Jan or Roland already have the answer.

What seems to happen is this.
We have small.c:

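        #include <assert.h>
        #include <signal.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/time.h>
        #include <unistd.h>

        static int wait;

        static void handle_alrm (int signo)
        {
                /* Started with an argument: sleep forever so gdb can attach. */
                if (wait)
                        for (;;)
                                pause ();
                kill (getpid(), SIGSEGV);
                abort ();
        }

        int main (int argc, char **argv)
        {
                struct itimerval itimerval;
                int i;

                wait = (argc > 1);
                signal (SIGALRM, handle_alrm);
                /* One-shot timer: deliver SIGALRM after 0.1 s. */
                memset (&itimerval, 0, sizeof (itimerval));
                itimerval.it_value.tv_usec = 1000000 / 10;
                i = setitimer (ITIMER_REAL, &itimerval, NULL);
                assert (i == 0);
                pause ();
                abort ();
                return 0;
        }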

we have gdbinit-maps1:
        
        set width 0
        set height 0
        gcore core.gcore
        quit

and runtest.sh does:

        ./small wait &
        PID=$!
        sleep 5
        gdb -silent --command=./gdbinit-maps1 ./small $PID

and, according to http://rhts.redhat.com/testlogs/42002/145415/1207053/current.log
the last command "hangs" and outputs:

        Using host libthread_db library "/lib64/tls/libthread_db.so.1".
        Attaching to program: /mnt/tests/kernel/syscalls/vsyscall/small, process 14579
        Redelivering pending Trace/breakpoint trap.
        Redelivering pending Trace/breakpoint trap.
        Program process 0 exited: Unknown signal 0 (terminated)

        /mnt/tests/kernel/syscalls/vsyscall/14579: No such file or directory.
        ./gdbinit-maps1:3: Error in sourced command file:
        You can't do that without a process to debug.
        (gdb)

This is why the test did not finish: gdb can't proceed and waits for
input. Note that the tracee really has exited; there is no "small" process
in the sysrq-t output.

What does this "Redelivering" mean? Google finds this patch:

        http://sourceware.org/ml/gdb-patches/2007-06/msg00059.html

Trace/breakpoint trap? strings `which gdb` shows this means SIGTRAP.

So. It looks like gdb does PTRACE_ATTACH, ptrace(PTRACE_CONT, SIGSTOP),
and gets WIFSTOPPED() with WSTOPSIG() == SIGTRAP ?

Currently, I don't see how this is possible. Perhaps gdb does something
strange. Will continue tomorrow, unless somebody knowledgeable can save
me from studying gdb's sources ;)
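
The attach step guessed at above can be sketched in isolation. This is only an illustrative sketch, not gdb's code and not anything attached to this report; the helper name stop_signal_after_attach is made up for the example. It forks a child that sits in pause(), attaches to it, and returns the signal the child is reported stopped with.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and shape are assumptions for illustration):
 * attach to a freshly forked child and return the signal it is reported
 * stopped with, or -1 if fork, attach or wait fails.
 */
int stop_signal_after_attach(void)
{
        int status;
        int sig = -1;
        pid_t child = fork();

        if (child < 0)
                return -1;
        if (child == 0) {
                /* Tracee: do nothing but sleep until a signal arrives. */
                for (;;)
                        pause();
        }

        if (ptrace(PTRACE_ATTACH, child, NULL, NULL) == 0) {
                /* Attaching sends SIGSTOP; wait for the stop to be reported. */
                if (waitpid(child, &status, 0) == child && WIFSTOPPED(status))
                        sig = WSTOPSIG(status);
        }

        /* Clean up: kill and reap the child whether or not attach worked. */
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return sig;
}
```

Called from a main(), the return value would normally be expected to equal SIGSTOP; a value of SIGTRAP would correspond to the behaviour suspected here.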


As for small.c, I think it could be just

        #include <unistd.h>

        int main()
        {
                pause();
        }

but again, since I can't reproduce the problem, I am not sure.

Comment 2 Jan Kratochvil 2009-01-20 22:31:29 UTC
The message `Redelivering pending ...' was present in RHEL-5.2; it was a bug in:
gdb-6.5-bz292971-attach-signalled-fix.patch

The design defect behind this message was found by Roland in:
  http://sourceware.org/ml/archer/2008-q3/msg00003.html

(I originally meant it for SIGSTOP, but it is wrong for other signals.)

This GDB defect is no longer present in RHEL-5.3.

I still have to find out how it can encounter the SIGTRAP ("Trace/breakpoint trap") signal at all, as small.c does not generate any SIGTRAP (it looks suspicious).

Comment 3 Jan Kratochvil 2009-02-01 13:59:47 UTC
Created attachment 330548 [details]
PTRACE_ATTACH reproducer.

Note that the vsyscall RHTS test was preceded by various other tests on that host.
The problem is an interaction between auditd and kernel ptrace.
IIRC, according to Roland, after auditd is started the kernel switches the syscall enter/exit code to a slower path, which can be undone only by a reset.
After starting auditd, a simple PTRACE_ATTACH generates SIGTRAP instead of SIGSTOP.
Curiously, it may be reproducible only on this RHTS host.
Anyway, it is unrelated to GDB, so I am handing this bug over.

HOSTNAME=dell-per905-01.rhts.bos.redhat.com
JOBID=44290
DISTRO=RHEL4-U8-re20090128.1
ARCHITECTURE=x86_64
# cat /proc/version
Linux version 2.6.9-80.ELlargesmp (mockbuild.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-10)) #1 SMP Fri Jan 23 16:39:07 EST 2009
# gcc -o attach-ok attach-ok.c -Wall -g
# ./attach-ok 
PASS - SIGSTOP
Optionally (it has no effect on the results): # setenforce 0
# /etc/init.d/auditd start
Starting auditd: [  OK  ]
# ./attach-ok 
FAIL - SIGTRAP
# /etc/init.d/auditd stop
Stopping auditd: [  OK  ]
# ./attach-ok 
FAIL - SIGTRAP
# rpm --qf '%{name}-%{version}-%{release}.%{arch}\n' -q audit kernel-largesmp
audit-1.0.16-4.el4.x86_64
kernel-largesmp-2.6.9-80.EL.x86_64
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       57G  1.4G   53G   3% /
/dev/sda1              99M   15M   80M  16% /boot
none                  3.9G     0  3.9G   0% /dev/shm


The original RHTS test logged the messages below, but according to my tests the disk-full state is not required for the `attach-ok' reproducer:
messages.gz:
Jan 15 18:48:28 dell-per905-01 rhts: /mnt/tests/kernel/security/audit/audit-test /
...
Jan 15 18:52:27 dell-per905-01 auditd[11029]: Audit daemon has no space left on logging partition
Jan 15 18:52:27 dell-per905-01 auditd[11029]: The audit daemon is now halting the system due to no space left on logging partition
Jan 15 18:52:27 dell-per905-01 auditd[11029]: Record was not written to disk (No space left on device)
Jan 15 18:52:27 dell-per905-01 auditd[11029]: write: Audit daemon detected an error writing an event to disk (No space left on device)

Comment 4 Roland McGrath 2009-02-07 02:36:31 UTC
I can't see how the RHEL4 code might produce that SIGTRAP.  Jan mentioned that this may not reproduce with the same kernel on all machines.  If that's so, it's especially weird and the distinguishing factor should be figured out.  But offhand, some hardware weirdness is almost easier to believe than a plain bug, since I really can't see where it would come from in the code we have.

Comment 5 Jiri Pallich 2012-06-20 16:01:31 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.