Bug 1373197

Summary: dyninst: Repeated execution of Process::temporaryDetach(), Process::reAttach() fails
Product: [Fedora] Fedora Reporter: Florian Weimer <fweimer>
Component: dyninstAssignee: Josh Stone <jistone>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: urgent    
Version: 24CC: fche, jistone, lberk, mbenitez, orion, wcohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: dyninst-9.2.0-5.fc25 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-07 05:24:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dyninst-reattach-loop.cpp none

Description Florian Weimer 2016-09-05 13:12:38 UTC
Created attachment 1197942 [details]
dyninst-reattach-loop.cpp

Description of problem:

The dyninst process either fails in reAttach() or deadlocks after repeatedly detaching and reattaching to a target process.

Version-Release number of selected component (if applicable):

dyninst-9.0.3-3.fc23.x86_64

How reproducible:

The loop has to execute for a few thousand times, likely depending on the activity in the target process and overall system load.

Steps to Reproduce:
1. Run the reproducer against a target PID and wait.

Actual results:

Process hangs (with the message “error: process appears to be stuck”), or reAttach() fails.

I have seen a reAttach() failure only with a target process which frequently starts and terminates threads.

Expected results:

No hang.

Additional info:

A backtrace from the hang looks like this:

Thread 5 (Thread 0x7ffff3404700 (LWP 31816)):
#0  0x00007ffff74d334d in pause () at ../sysdeps/unix/syscall-template.S:84
#1  0x000000000040142e in <lambda()>::operator() (__closure=0x62fda8) at dyninst-reattach-loop.cpp:43
#2  std::_Bind_simple<main(int, char**)::<lambda()>()>::_M_invoke<> (this=0x62fda8)
    at /usr/include/c++/5.3.1/functional:1531
#3  std::_Bind_simple<main(int, char**)::<lambda()>()>::operator() (this=0x62fda8)
    at /usr/include/c++/5.3.1/functional:1520
#4  std::thread::_Impl<std::_Bind_simple<main(int, char**)::<lambda()>()> >::_M_run(void) (this=0x62fd90)
    at /usr/include/c++/5.3.1/thread:115
#5  0x00007ffff71f9f20 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
    at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
#6  0x00007ffff74ca61a in start_thread (arg=0x7ffff3404700) at pthread_create.c:334
#7  0x00007ffff69695fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7ffff40af700 (LWP 31815)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff7a7d5e8 in boost::condition_variable_any::wait<Mutex<false> > (m=..., this=0x62dc60)
    at /usr/include/boost/thread/pthread/condition_variable.hpp:184
#2  CondVar<Mutex<false> >::wait (this=0x62dc60)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/common/src/dthread.h:121
#3  Dyninst::ProcControlAPI::GeneratorMT::processWait (this=0x62fa80, block=true)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/generator.C:438
#4  0x00007ffff7a7e218 in Dyninst::ProcControlAPI::Generator::getAndQueueEventInt (this=this@entry=0x62fa80, 
    block=block@entry=true) at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/generator.C:188
#5  0x00007ffff7a6e381 in Dyninst::ProcControlAPI::GeneratorMT::main (this=this@entry=0x62fa80)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/generator.C:418
#6  0x00007ffff7a7ebeb in Dyninst::ProcControlAPI::GeneratorMT::start (this=0x62fa80)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/generator.C:410
#7  0x00007ffff7b525cc in thread_init (d=<optimized out>)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/common/src/dthread-unix.C:56
#8  0x00007ffff74ca61a in start_thread (arg=0x7ffff40af700) at pthread_create.c:334
#9  0x00007ffff69695fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7ffff4ab8700 (LWP 31814)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff7a6f05a in boost::condition_variable_any::wait<Mutex<false> > (m=..., this=0x62ded0)
    at /usr/include/boost/thread/pthread/condition_variable.hpp:184
#2  CondVar<Mutex<false> >::wait (this=0x62ded0)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/common/src/dthread.h:121
#3  LinuxPtrace::main (this=this@entry=0x62de10)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/linux.C:3346
#4  0x00007ffff7b3e869 in start_ptrace (lp=lp@entry=0x62de10)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/linux.C:3328
#5  0x00007ffff7b525cc in thread_init (d=<optimized out>)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/common/src/dthread-unix.C:56
#6  0x00007ffff74ca61a in start_thread (arg=0x7ffff4ab8700) at pthread_create.c:334
#7  0x00007ffff69695fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7ffff52b9700 (LWP 31813)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff74cca46 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x624b88) at ../nptl/pthread_mutex_lock.c:116
#2  0x00007ffff7ac3c85 in boost::recursive_mutex::lock (this=0x624b88)
    at /usr/include/boost/thread/pthread/recursive_mutex.hpp:113
#3  MTManager::startWork (this=0x624b10)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:8640
#4  MTManager::evhandler_main (this=0x624b10)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:8583
#5  0x00007ffff7b525cc in thread_init (d=<optimized out>)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/common/src/dthread-unix.C:56
#6  0x00007ffff74ca61a in start_thread (arg=0x7ffff52b9700) at pthread_create.c:334
#7  0x00007ffff69695fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7fc9740 (LWP 31812)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff7b3e9e4 in boost::condition_variable_any::wait<Mutex<false> > (m=..., this=0x62df38)
    at /usr/include/boost/thread/pthread/condition_variable.hpp:184
#2  CondVar<Mutex<false> >::wait (this=0x62df38)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/common/src/dthread.h:121
#3  LinuxPtrace::waitfor_ret (this=this@entry=0x62de10)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/linux.C:3381
#4  0x00007ffff7b3ef48 in LinuxPtrace::ptrace_write (this=0x62de10, inTrace=139903394877056, size_=1, 
    inSelf=0xa47750, pid_=31390) at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/linux.C:3446
#5  0x00007ffff7a9b43a in int_process::writeMem (this=this@entry=0x62e458, local=local@entry=0xa47750, 
    remote=139903394877056, size=1, result=..., thr=0x633b78, thr@entry=0x0, bp_write=int_process::bp_clear)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:1548
#6  0x00007ffff7a9e322 in sw_breakpoint::restoreBreakpointData (this=this@entry=0xa476c0, proc=proc@entry=0x62e458, 
    res_resp=...) at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:5075
#7  0x00007ffff7ab3127 in sw_breakpoint::suspend (this=0xa476c0, proc=0x62e458, resps=std::set with 0 elements)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:5119
#8  0x00007ffff7a89141 in HandleDetach::handleEvent (this=<optimized out>, ev=...)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/handler.C:1862
#9  0x00007ffff7a8de5c in HandlerPool::handleEvent (this=this@entry=0x62f880, orig_ev=...)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/handler.C:438
#10 0x00007ffff7ac15db in int_process::waitAndHandleEvents (block=block@entry=false)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:1106
#11 0x00007ffff7aecba3 in do_detach (procset=0xa3bd20, temporary=temporary@entry=true, 
    leaveStopped=leaveStopped@entry=false)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/procset.C:1560
#12 0x00007ffff7aecca0 in Dyninst::ProcControlAPI::ProcessSet::temporaryDetach (this=<optimized out>)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/procset.C:1591
#13 0x00007ffff7aad02e in Dyninst::ProcControlAPI::Process::temporaryDetach (this=<optimized out>)
    at /usr/src/debug/dyninst-9.0.3/DyninstAPI-9.0.3/proccontrol/src/process.C:6470
#14 0x00000000004010f5 in main (argc=<optimized out>, argv=0x7fffffffe058) at dyninst-reattach-loop.cpp:52

Comment 1 Florian Weimer 2016-09-05 14:43:22 UTC
Stuck process also reproduces with dyninst-9.1.0-5.fc24.x86_64.

Comment 2 Florian Weimer 2016-09-05 15:02:04 UTC
I can test rawhide due to bug 1373239.

Comment 3 Josh Stone 2016-09-14 19:51:31 UTC
With dyninst-9.2.0-4.fc26, rawhide now gets stuck too. (yay?)

Comment 4 Josh Stone 2016-09-27 21:50:01 UTC
Aha - the watchdog is rigged!

      while (true) {
        if (!running) {
          fprintf(stderr,
                  "error: process appears to be stuck (%zu iterations)\n",
                  loops.load());
          pause();
        }
        usleep (1000 * 1000);
        running = false;
      }

That's an almost impossible window between "running = false;" and looping around to check it again.  I was baffled why this always seemed to get stuck in "about" one second, when in fact it was exactly one usleep's worth.

Having swapped the usleep line after "running = false", I have not seen the watchdog report a problem anymore.

> Process hangs (with the message “error: process appears to be stuck”), or
> reAttach() fails.
> 
> I have seen a reAttach() failure only with a target process which frequently
> starts and terminates threads.

Now I'll see if I can reproduce this actual reAttach failure.  Do you have a suggested process which behaves this way?  (Or I can mock one up, of course.)

Comment 5 Josh Stone 2016-09-27 23:12:31 UTC
I can reproduce a reAttach failure by targeting Firefox.  It's not consistent in how long that takes to fail.

Comment 6 Josh Stone 2016-09-28 00:29:39 UTC
This fails reliably quickly as a target process:


#include <thread>
#include <unistd.h>

using namespace std;

void
spawn() {
    this_thread::sleep_for(100ms);
    thread(spawn).detach();
}

int
main()
{
    for (int i = 0; i < 10; ++i)
        spawn();
    pause();
    return 0;
}

Comment 7 Fedora Update System 2016-11-22 22:30:12 UTC
dyninst-9.2.0-5.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-2401f115af

Comment 8 Fedora Update System 2016-11-25 09:40:52 UTC
dyninst-9.2.0-5.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-2401f115af

Comment 9 Fedora Update System 2016-12-07 05:24:01 UTC
dyninst-9.2.0-5.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.