| Summary: | RFC: SIGPROF keeps a large task from ever completing a fork() | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Paulo Andrade <pandrade> |
| Component: | glibc | Assignee: | Carlos O'Donell <codonell> |
| Status: | CLOSED WONTFIX | QA Contact: | qe-baseos-tools-bugs |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.7 | CC: | ashankar, fweimer, mnewsome, pfrankli |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-04-02 00:44:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Use perf to profile your application. Signal-based profiling has limited uses.

Assume any signal occurring with period Tp, and any restartable syscall taking time Ts. When Tp < Ts you always have an infinite loop. There is nothing that userspace or the kernel can do in general; there are no forward-progress guarantees for this scenario. Hardware transactional memory suffers similar problems.

Given that no general solution exists, any changes in userspace or the kernel would penalize the vast majority of programs, which probably have Tp > Ts and don't suffer from infinite restart loops. Any solution that blocks SIGPROF would increase signal latency and degrade SIGPROF results, not to mention add latency costs to clone.

If you *must* use -pg and can't use perf, then my only suggestion is that the application block SIGPROF before calling syscalls that it expects might take a long time, or that it detect that it has made no forward progress, block SIGPROF (sigprocmask, pthread_sigmask), and then re-enable it at a later progress checkpoint (this will skew -pg results, which are based on statistical profiling).

If other libraries use clone, and I believe ASAN might, you will need to talk to the authors of those libraries to determine how they want to handle the general problem noted above without impacting all of userspace.

Again, neither glibc nor the kernel can fix this problem, and adding latency to clone for the sake of -pg is not acceptable. Please use perf to profile your application.

Thanks for the comments. I will report the issue to upstream. Probably it should be handled only when built with -pg, by generating stubs with gcc, and may require some option to select the syscalls around which SIGPROF is blocked.

(In reply to Paulo Andrade from comment #3)
> Thanks for the comments.
>
> I will report the issue to upstream.

Please report it as a kernel bug. There is little glibc can do to work around this without distorting profiling.
I just reported it at https://sourceware.org/bugzilla/show_bug.cgi?id=19904
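The Tp < Ts argument from the first comment can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions: signals arrive strictly periodically, signal handling takes no time, and an interrupted call restarts from scratch immediately after the signal. The function name `restarts_until_done` and the `first_gap` parameter are hypothetical and do not come from glibc or the kernel.

```c
#include <assert.h>

/*
 * Toy model of a restartable syscall under a periodic signal.
 * Assumptions (not taken from any real implementation):
 *   tp        - signal period
 *   ts        - uninterrupted time the syscall needs to complete
 *   first_gap - time from the first attempt until the first signal
 * Returns the number of restarts before the call completes,
 * or -1 if it can never complete.
 */
int restarts_until_done(long tp, long ts, long first_gap)
{
    assert(tp > 0 && ts > 0 && first_gap > 0 && first_gap <= tp);

    if (ts <= first_gap)
        return 0;   /* finished before the first signal arrived */
    if (ts <= tp)
        return 1;   /* the restarted attempt gets a full period */
    return -1;      /* every attempt gets at most tp < ts: no forward progress */
}
```

Under these assumptions, a call such as `restarts_until_done(10, 25, 5)` returns -1: each restarted attempt has only Tp available, so retrying never helps once Tp < Ts.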
A large program (one that has allocated a lot of memory) may enter an infinite loop if compiled with -pg, because the clone syscall is restarted over and over and never completes.

Previously, Red Hat Enterprise Linux 5 had a patch in the kernel to work around it and correct gprof issues:

"""
commit 122c17ac54c9b3f53e80bc6f0786cc5f2a8dc486
Author: Stefan Ring <str>
Date:   Fri May 8 13:19:55 2015 +0200

    the patch (v2.6.18)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34ed0d9..808f79d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1478,6 +1478,7 @@ static inline int lock_need_resched(spinlock_t *lock)

 extern FASTCALL(void recalc_sigpending_tsk(struct task_struct *t));
 extern void recalc_sigpending(void);
+extern int fork_recalc_sigpending(void);

 extern void signal_wake_up(struct task_struct *t, int resume_stopped);

diff --git a/kernel/fork.c b/kernel/fork.c
index f9b014e..21f9a0d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1193,8 +1193,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * A fatal signal pending means that current will exit, so the new
 	 * thread can't slip out of an OOM kill (or normal SIGKILL).
 	 */
-	recalc_sigpending();
-	if (signal_pending(current)) {
+	if (fork_recalc_sigpending()) {
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
diff --git a/kernel/signal.c b/kernel/signal.c
index bfdb568..bd7e794 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -227,6 +227,31 @@ void recalc_sigpending(void)
 	recalc_sigpending_tsk(current);
 }

+int fork_recalc_sigpending(void)
+{
+	struct task_struct *tsk = current;
+	int pending;
+
+	recalc_sigpending();
+	if (likely(!signal_pending(tsk)))
+		return 0;
+
+	pending = 1;
+	/*
+	 * HACK. If SIGPROF is the sole reason for TIF_SIGPENDING
+	 * we assume it was sent by ITIMER_PROF and return false,
+	 * otherwise fork() can never succeed if it takes more than
+	 * it_prof_incr. bz645528.
+	 */
+	if (!sigismember(&tsk->blocked, SIGPROF)) {
+		sigaddset(&tsk->blocked, SIGPROF);
+		pending = recalc_sigpending_tsk(tsk);
+		sigdelset(&tsk->blocked, SIGPROF);
+	}
+
+	return pending;
+}
+
 /* Given the mask, find the first available signal that should be serviced. */

 static int
"""

But newer RHEL releases do not carry the above patch.

After some discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1309789, with the user in that report preferring to use perf for profiling, it was suggested that SIGPROF could be blocked in glibc during the clone syscall.

Gprof is still a useful tool, and simply telling users that this is a known failure and that they should use perf may not be a good solution, or may not be a viable one on other architectures.
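For applications that must keep using -pg, the blocking idea discussed in this report can be applied at application level. The following is a minimal sketch, not a glibc change: the helper name `fork_sigprof_blocked` is hypothetical, and it assumes a single-threaded caller (a multithreaded program would use pthread_sigmask rather than sigprocmask). A SIGPROF that fires during the call stays pending until the mask is restored, so the -pg profile is skewed accordingly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: call fork() with SIGPROF blocked so that the
 * profiling timer cannot force the underlying clone to restart.
 * Returns what fork() returns; errno is preserved from fork().
 */
pid_t fork_sigprof_blocked(void)
{
    sigset_t block, saved;
    pid_t pid;
    int saved_errno, rc;

    sigemptyset(&block);
    sigaddset(&block, SIGPROF);

    /* Block SIGPROF and remember the caller's previous mask. */
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    pid = fork();
    saved_errno = errno;

    /* Restore the previous mask in both the parent and the child. */
    rc = sigprocmask(SIG_SETMASK, &saved, NULL);
    assert(rc == 0);    /* valid arguments, so this is not expected to fail */
    (void)rc;

    errno = saved_errno;
    return pid;
}
```

A caller would use it as a drop-in replacement for fork() at the call sites known to be slow, for example `pid_t pid = fork_sigprof_blocked();`, and handle the return value exactly as for fork(). Blocking only around selected calls keeps the added latency and profile distortion limited to the places where the restart loop is actually observed.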