Bug 468903

Summary: [utrace] exec in ptraced process sometimes hangs
Product: [Fedora] Fedora
Reporter: Tom Horsley <horsley1953>
Component: kernel
Assignee: Roland McGrath <roland>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: medium
Version: 8
CC: dvlasenk, kernel-maint, quintela
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Last Closed: 2009-01-09 07:53:40 UTC
Attachments:
Description                          Flags
test-clone.c test program            none
Updated test program test-clone2.c   none

Description Tom Horsley 2008-10-28 18:39:04 UTC
Description of problem:

When a debugged multi-threaded process with one thread sitting around stopped
and other threads running does an exec() call, sometimes the process winds
up stopped in a disk wait state, and no SIGTRAP for the exec event is ever
delivered to the debugger.

Version-Release number of selected component (if applicable):
kernel-2.6.26.6-49.fc8 (i686 kernel)

How reproducible:

Apparently random timing dependencies affect the results. Often the
program runs OK, occasionally it gets in the hung state.

Steps to Reproduce:
1. g++ -o test-clone -g test-clone.c -lpthread
2. ./test-clone
3. repeat till test reports failure
  
Actual results:
DEF: ACCESS_CLONE_SAFE=0

Expected results:
DEF: ACCESS_CLONE_SAFE=1

Additional info:

This was tested on a not very fast dual processor system, and system load
seems to make the test more likely to fail (starting firefox while running
test, etc).

Why is one thread not running, you ask? It's a feature: If I keep a stopped
thread around I can read and write memory while the rest of the process
is running :-).

I was always able to do this safely on fedora 7 and on the 2.6.25 kernel in
original fedora 8. Much earlier kernels would always get hung, and my test
would always fail. The random failure is new with latest 2.6.26 (I haven't
tried it on a 2.6.27 kernel yet).

Comment 1 Tom Horsley 2008-10-28 18:40:07 UTC
Created attachment 321720 [details]
test-clone.c test program

Comment 2 Tom Horsley 2008-10-28 21:26:24 UTC
I've now tried the same test on my x86_64 system at home (somewhat faster
AMD Athlon 64 X2 4400+ dual core box), and I see the same failures.

I also tried it on kernel-2.6.27.4-51.fc10.x86_64 on my f10 beta
partition and 2.6.27 also gives the same error.

Comment 3 Denys Vlasenko 2008-11-14 14:27:07 UTC
taking a look at it

Comment 4 Denys Vlasenko 2008-11-14 15:04:45 UTC
Trying on 2.6.27.5-32.fc9.x86_64 on Fedora 9 on an Intel Core 2 Duo. Doesn't fail. I get "DEF: ACCESS_CLONE_SAFE=1" every time I run the test.

btw:
      struct timespec a_bit;
      memset((void *)&a_bit, 0, sizeof(a_bit));
      a_bit.tv_nsec = 20*1000; /* (a bit == 20 milliseconds :-) */
      nanosleep(&a_bit, NULL);
tv_nsec is in nanoseconds, so this is 20 microseconds. Add another *1000 to get milliseconds.
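
For reference, a minimal sketch of the unit arithmetic, not taken from either attached test program. tv_nsec counts nanoseconds, so 20*1000 is 20 microseconds and 20 milliseconds needs 20*1000*1000. The helper names timespec_from_ms and sleep_ms are made up for illustration, and the sketch assumes a non-negative delay.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

#define NSEC_PER_MSEC 1000000L  /* 1 ms = 1000 us = 1000 * 1000 ns */

/* Build a timespec from a delay in milliseconds (hypothetical helper). */
struct timespec timespec_from_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;                     /* whole seconds */
    ts.tv_nsec = (ms % 1000) * NSEC_PER_MSEC;  /* remainder, in nanoseconds */
    return ts;
}

/* Sleep for ms milliseconds, resuming if a signal interrupts the sleep. */
int sleep_ms(long ms)
{
    struct timespec req = timespec_from_ms(ms);

    /* On EINTR nanosleep stores the remaining time in its second argument. */
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

With this, sleep_ms(20) waits 20 milliseconds, which is what the quoted comment intended.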

I tried this:

while sleep 0.05; do echo -n .; ./test-clone >/dev/null 2>&1 || exit; done

and got a stream of dots, no failures.

Comment 5 Denys Vlasenko 2008-11-14 16:28:58 UTC
Tested on kernel-2.6.27.4-51.fc10.x86_64.

At first I seemed to get failures under load, but I think they were false positives. I bumped hang_count up to 100 (and of course fixed the 20 microsecond delay, replacing it with 10 milliseconds), and now

while sleep 0.05; do echo -n .; ./test-clone2 >RES 2>&1 || exit; done

runs without failures while I have four "while true; do true; done" CPU hogging shells running on my 2 core machine.

I edited the testcase a bit. For example, it is now C, not C++ (there was one instance of "char&"). I will attach the testcase.

Tom, can you run it on your machine(s)? If it runs ok but your original testcase fails, can the original testcase be fixed by correcting the "microsecond" bug and using if (hang_count > 100) instead of if (hang_count > 10)?

Comment 6 Denys Vlasenko 2008-11-14 16:33:33 UTC
Created attachment 323597 [details]
Updated test program test-clone2.c

Changes:
* after PTRACE_TRACEME, the usual practice is to raise(SIGSTOP) immediately. Removed the SIGUSR1 usage for this.
* replaced nanosleep with usleep, fixing the "microsecond" bug in the process.
* switched stdout to unbuffered mode, removed all fflush() calls.
* removed the lone C++-ism.
* other minor simplifications.
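
A minimal sketch of the PTRACE_TRACEME plus raise(SIGSTOP) handshake mentioned in the first item. This is not the code from test-clone2.c; the function name traceme_handshake is hypothetical, and the sketch is single-threaded and only shows how the tracer learns that the child is ready.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that asks to be traced and then stops itself.
 * Returns the signal that stopped the child as seen by the parent
 * (SIGSTOP when the handshake works), or -1 on error.
 */
int traceme_handshake(void)
{
    int status;
    int stop_signal;
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: become a tracee of the parent, then stop right away. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(0);
    }

    /* Parent: the first wait status should report the SIGSTOP stop. */
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;  /* child exited instead of stopping; already reaped */
    stop_signal = WSTOPSIG(status);

    /* A real tracer would set options and PTRACE_CONT here; clean up instead. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return stop_signal;
}
```

Because the child stops itself, the parent needs no extra signal such as SIGUSR1 to know the tracee is ready. A single waitpid() is enough.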

Comment 7 Denys Vlasenko 2008-11-14 16:57:07 UTC
Unable to reproduce on vanilla 2.6.26.7 either.

Comment 8 Tom Horsley 2008-11-14 18:15:50 UTC
I'll check out the revised test when I get home, but unfortunately if it no
longer fails, it doesn't mean there isn't a bug, just that the test can't
reproduce it :-(. My real debugger really gets hung sometimes, and it is
definitely related to the stopped thread since the hangs don't happen without it.
Maybe I need more realistic activity in the threads themselves - waiting on a
mutex or something rather than merely sleeping. I'll play with it some more and
see what I can come up with.

Comment 9 Denys Vlasenko 2008-11-18 13:29:46 UTC
> I'll check out the revised test when I get home, but unfortunately if it no
> longer fails, it doesn't mean there isn't a bug, just that the test can't
> reproduce it :-(

I fully agree that it doesn't rule out the bug. But I do need a testcase for it.

So far it is possible that what you saw was caused by far too short a waiting period (20 microseconds x5 times). IOW: the code was too eager to declare the thread stuck when in fact it was not.
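
To illustrate the point, here is a sketch of a polling loop whose total wait budget is roughly max_polls times delay_ms. It is not the hang detection code from the test program. All names (wait_until_ready, ready_fn, countdown, ready_after_n_calls) are hypothetical, and it sleeps with nanosleep rather than usleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical readiness check supplied by the caller. */
typedef int (*ready_fn)(void *ctx);

/*
 * Poll is_ready up to max_polls times, sleeping delay_ms between polls.
 * Returns 1 once the check succeeds, 0 if the budget runs out first.
 */
int wait_until_ready(ready_fn is_ready, void *ctx, int max_polls, long delay_ms)
{
    struct timespec delay;
    int i;

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;  /* ms to ns */

    for (i = 0; i < max_polls; i++) {
        if (is_ready(ctx))
            return 1;
        nanosleep(&delay, NULL);
    }
    return 0;  /* budget exhausted: "stuck", or merely slow? */
}

/* Example check: reports ready from the Nth call onward. */
struct countdown {
    int calls_left;
};

int ready_after_n_calls(void *ctx)
{
    struct countdown *c = ctx;

    if (c->calls_left > 0)
        c->calls_left--;
    return c->calls_left == 0;
}
```

With a check that only succeeds on its third call, a budget of 2 polls reports "stuck" while a budget of 10 polls reports success. A budget that is too small looks exactly like a hang.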

Please try to produce a testcase which works on a vanilla kernel (with utrace disabled - the bug, if it exists, may also affect the version of utrace which is in mainline), but fails on a utrace kernel (vanilla or the Fedora one).

The .config of both kernels should be similar apart from CONFIG_UTRACE.

Comment 10 Tom Horsley 2008-11-18 13:56:28 UTC
Yea, I'm still trying to reproduce the bug reliably. I may have to get
a complete trace of everything that happens when the real debugger
hangs then see if I can turn that into a test program. It may take
a while to figure out :-(.

Comment 11 Denys Vlasenko 2008-11-19 13:59:54 UTC
I was talking nonsense - there is no utrace in mainline yet. I meant "upstream utrace" (Roland's patchset against the upstream kernel).

Comment 12 Bug Zapper 2008-11-26 11:16:02 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 13 Bug Zapper 2009-01-09 07:53:40 UTC
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.