Bug 528544 - condor_master hangs on shutdown
Summary: condor_master hangs on shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: All
OS: Linux
high
high
Target Milestone: 1.3
: ---
Assignee: Ken Giusti
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On: 491995 531911 538471
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-10-12 17:51 UTC by Luigi Toscano
Modified: 2015-11-16 01:11 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-12-03 09:15:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Core file (221.30 KB, application/octet-stream)
2009-10-12 17:51 UTC, Luigi Toscano
no flags Details
Log after a stop (hung) (4.40 KB, text/plain)
2009-10-12 17:52 UTC, Luigi Toscano
no flags Details
Log after a stop (success) (4.77 KB, text/plain)
2009-10-12 17:52 UTC, Luigi Toscano
no flags Details
Backtraces (2.26 KB, text/plain)
2009-10-12 17:53 UTC, Luigi Toscano
no flags Details
the bug 528544 semiautomated reproducer (5.07 KB, application/x-shellscript)
2009-11-09 11:53 UTC, Frantisek Reznicek
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:1633 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid Version 1.2 2009-12-03 09:15:33 UTC

Description Luigi Toscano 2009-10-12 17:51:10 UTC
Created attachment 364497 [details]
Core file

Description of problem:

When condor is restarted through 

service condor restart

it sometimes hangs and condor_master stays alive even if all the children have exited correctly.

From logs and backtraces it seems that management plugin is involved.

Attached files:
- core file of hung condor_master.
- log (D_FULLDEBUG) of successful condor shutdown
- log (D_FULLDEBUG) of hung shutdown
- backtraces

Comment 1 Luigi Toscano 2009-10-12 17:52:16 UTC
Created attachment 364498 [details]
Log after a stop (hung)

Comment 2 Luigi Toscano 2009-10-12 17:52:56 UTC
Created attachment 364499 [details]
Log after a stop (success)

Comment 3 Luigi Toscano 2009-10-12 17:53:24 UTC
Created attachment 364500 [details]
Backtraces

Comment 4 Luigi Toscano 2009-10-12 17:54:27 UTC
Version of tested packages:
qpid* 0.5.752581-28.el5
condor* 7.4.0-0.6.el5

Comment 5 Ken Giusti 2009-10-13 13:24:06 UTC
Hi, 

We have submitted a fix for a similar bug which fixes the thread cleanup in the management plug in.  See:

https://bugzilla.redhat.com/show_bug.cgi?id=489557

This fix adds a new wakeup/cleanup for the pubthread (which appears to be hung here).   This fix will be on the next release (post 28).  Please let me know if this fix solves your problem,

thanks,

-K

Comment 7 Frantisek Reznicek 2009-10-26 17:45:01 UTC
The issue was retested and proved as still present at least on following packages:

RHEL4.8 i386:
condor-7.4.0-0.6.el4
condor-qmf-plugins-7.4.0-0.6.el4
condor-remote-configuration-1.0-21.el4
python-qpid-0.5.760500-6.el4
qpid*-0.5.752581-30.el4

RHEL5.4 x86_64: (latest candidate MRG/M+G packages)
condor-7.4.1-0.1.el5
condor-qmf-plugins-7.4.1-0.1.el5
condor-remote-configuration-1.0-21.el5
condor-remote-configuration-server-1.0-21.el5
python-qpid-0.5.752581-4.el5
qpid*-0.5.752581-30.el5


Transcript of the shell commands from (rhel 5.4 x86_64):

[root@mrg-qe-02 ~]# service condor status
condor_master is stopped
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
[root@mrg-qe-02 ~]# service condor start
Starting Condor daemons:                                   [  OK  ]
[root@mrg-qe-02 ~]# service condor status
condor_master (pid  17585) is running...
[root@mrg-qe-02 ~]# service condor stop
Stopping Condor daemons:                                   [  OK  ]
[root@mrg-qe-02 ~]# rpm -qlc condor
/etc/condor/condor_config
/var/lib/condor/condor_config.local
[root@mrg-qe-02 ~]# hostname
mrg-qe-02.lab.eng.brq.redhat.com
[root@mrg-qe-02 ~]# vi /var/lib/condor/condor_config.local
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
[root@mrg-qe-02 ~]# service condor status
condor_master is stopped
[root@mrg-qe-02 ~]# service condor start
Starting Condor daemons:                                   [  OK  ]
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   17680  1.3  0.0 214032  7052 ?        Ssl  13:36   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   17684  2.0  0.1 216600  8280 ?        Ssl  13:36   0:00 condor_collector -f
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   17680  1.4  0.0 214032  7872 ?        Ssl  13:36   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   17684  0.7  0.1 216600  8448 ?        Ssl  13:36   0:00 condor_collector -f
condor   17692  0.8  0.1 214248  8272 ?        Ssl  13:36   0:00 condor_negotiator -f
condor   17693  1.1  0.1 215896  9208 ?        Ssl  13:36   0:00 condor_schedd -f
condor   17694  3.5  0.0  58288  3996 ?        Rs   13:36   0:00 condor_startd -f
root     17695  0.1  0.0  52096  3084 ?        S    13:36   0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
[root@mrg-qe-02 ~]# condor_status
[root@mrg-qe-02 ~]# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1 LINUX      X86_64 Unclaimed Idle     0.160  1006  0+00:00:04
slot2 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:05
slot3 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:06
slot4 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:07
slot5 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:08
slot8 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:03
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     6     0       0         6       0          0        0

               Total     6     0       0         6       0          0        0
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   17680  0.5  0.0 214032  7872 ?        Ssl  13:36   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   17684  0.4  0.1 216600  8756 ?        Ssl  13:36   0:00 condor_collector -f
condor   17692  0.2  0.1 214248  8452 ?        Ssl  13:36   0:00 condor_negotiator -f
condor   17693  0.2  0.1 215896  9240 ?        Ssl  13:36   0:00 condor_schedd -f
condor   17694 23.4  0.0  58640  4684 ?        Ss   13:36   0:05 condor_startd -f
root     17695  0.0  0.0  52096  3084 ?        S    13:36   0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
[root@mrg-qe-02 ~]# service condor stop
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   17680  0.3  0.0 214032  7896 ?        Ssl  13:36   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   17693  0.1  0.1 215896  9340 ?        Ssl  13:36   0:00 condor_schedd -f
root     17695  0.0  0.0  52096  3084 ?        S    13:36   0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
[root@mrg-qe-02 ~]# less /var/log/condor/StartLog
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   17680  0.2  0.0 214032  7896 ?        Ssl  13:36   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   17693  0.0  0.1 215896  9340 ?        Ssl  13:36   0:00 condor_schedd -f
root     17695  0.0  0.0  52628  3972 ?        S    13:36   0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
[root@mrg-qe-02 ~]# rpm -qa | egrep '(condor|qpid)' | sort -u
condor-7.4.1-0.1.el5
condor-qmf-plugins-7.4.1-0.1.el5
condor-remote-configuration-1.0-21.el5
condor-remote-configuration-server-1.0-21.el5
python-qpid-0.5.752581-4.el5
qpidc-0.5.752581-30.el5
qpidc-debuginfo-0.5.752581-30.el5
qpidc-devel-0.5.752581-30.el5
qpidc-perftest-0.5.752581-30.el5
qpidc-rdma-0.5.752581-30.el5
qpidc-ssl-0.5.752581-30.el5
qpidd-0.5.752581-30.el5
qpidd-acl-0.5.752581-30.el5
qpidd-cluster-0.5.752581-30.el5
qpidd-devel-0.5.752581-30.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-30.el5
qpidd-ssl-0.5.752581-30.el5
qpidd-xml-0.5.752581-30.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5
rh-qpid-tests-0.5.752581-30.el5
[root@mrg-qe-02 ~]# cat /var/lib/condor/condor_config.local
# This config disables advertising to UW's world collector. Changing
# this config option will have your pool show up in UW's world
# collector and eventually on the world map of Condor pools.
CONDOR_DEVELOPERS = NONE

CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = Personal Condor
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
NEGOTIATOR_INTERVAL = 20
TRUST_UID_DOMAIN = TRUE

SCHEDD.PLUGINS = $(LIB)/plugins/MgmtScheddPlugin-plugin.so
COLLECTOR.PLUGINS = $(LIB)/plugins/MgmtCollectorPlugin-plugin.so
NEGOTIATOR.PLUGINS = $(LIB)/plugins/MgmtNegotiatorPlugin-plugin.so

MASTER.PLUGINS = $(LIB)/plugins/MgmtMasterPlugin-plugin.so
QMF_BROKER_HOST = mrg-qe-02.lab.eng.brq.redhat.com

[root@mrg-qe-02 ~]# uname -a
Linux mrg-qe-02.lab.eng.brq.redhat.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Above described problem occurs only if management agents are attached to condor deamons (*.PLUGINS + QMF_BROKER_HOST).

Comment 8 Frantisek Reznicek 2009-10-26 17:54:33 UTC
This BZ is extremely slowing down the MRG/G 1.2 BZ verification ... thus marking as TestBlocker.

Feel free to raise objections/questions about this decision.

Comment 9 Ken Giusti 2009-10-26 18:26:42 UTC
Hi Frantisek,

If you can reproduce this, can you get pstack backtraces on those hung processes?  Or ping me if you repro it and I can poke around the system.  Thanks,

-K

Comment 10 Ted Ross 2009-10-26 20:49:08 UTC
We just checked the patch in distcvs and it appears that the changes for this fix are still not in the build.  We are investigating why this is the case.

-Ted

Comment 11 Ted Ross 2009-10-26 21:04:09 UTC
(In reply to comment #10)
> We just checked the patch in distcvs and it appears that the changes for this
> fix are still not in the build.  We are investigating why this is the case.
> 
> -Ted  

Check that... The changes are in the patch.  Please attach a pstack trace of a hung process.

-Ted

Comment 12 Frantisek Reznicek 2009-10-27 07:46:54 UTC
Hello Ted, please find below the pstack and also detailed gdb backtrace.

[root@mrg-qe-02 ~]# service condor status
condor_master is stopped
[root@mrg-qe-02 ~]# service condor start
Starting Condor daemons:                                   [  OK  ]
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   20260  0.8  0.0 214032  7848 ?        Ssl  03:37   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   20264  0.5  0.1 216596  8544 ?        Ssl  03:37   0:00 condor_collector -f
condor   20269  0.3  0.1 214308  8272 ?        Ssl  03:37   0:00 condor_negotiator -f
condor   20270  0.3  0.1 216016  9208 ?        Ssl  03:37   0:00 condor_schedd -f
condor   20271 31.0  0.0  58640  4672 ?        Ss   03:37   0:05 condor_startd -f
root     20272  0.2  0.0  52096  3072 ?        S    03:37   0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
[root@mrg-qe-02 ~]# condor_stat
condor_stats   condor_status
[root@mrg-qe-02 ~]# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1 LINUX      X86_64 Unclaimed Idle     0.430  1006  0+00:00:04
slot2 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:05
slot3 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:06
slot4 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:07
slot5 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:08
slot6 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:09
slot7 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:10
slot8 LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:00:03
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     8     0       0         8       0          0        0

               Total     8     0       0         8       0          0        0
[root@mrg-qe-02 ~]# service condor stop
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail
[root@mrg-qe-02 ~]# ps auxw | grep condor | grep -v grep
condor   20260  0.3  0.0 214032  7872 ?        Ssl  03:37   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor   20270  0.0  0.1 216016  9304 ?        Ssl  03:37   0:00 condor_schedd -f
root     20272  0.0  0.0  52628  3968 ?        S    03:37   0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
[root@mrg-qe-02 ~]# pstree -ap 20260
condor_master,20260 -pidfile /var/run/condor/condor_master.pid
  ├─condor_schedd,20270 -f
  │   ├─condor_procd,20272 -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
  │   ├─{condor_schedd},20288
  │   ├─{condor_schedd},20289
  │   └─{condor_schedd},20293
  ├─{condor_master},20262
  ├─{condor_master},20263
  └─{condor_master},20265
[root@mrg-qe-02 ~]# pstack 20260
Thread 3 (Thread 0x41383940 (LWP 20262)):
#0  0x000000323920ab99 in pthread_cond_wait@@GLIBC_2.3.2 ()
#1  0x00000030c60795ee in qpid::client::Dispatcher::run ()
#2  0x00000030c6413b90 in qpid::management::ManagementAgentImpl::ConnectionThread::run () from /usr/lib64/libqmfagent.so.0
#3  0x00000030c5b76bca in ?? () from /usr/lib64/libqpidcommon.so.0
#4  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#5  0x00000032386d3c2d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x41f43940 (LWP 20263)):
#0  0x00000032386cced2 in select () from /lib64/libc.so.6
#1  0x00000000004bf4ee in sleep ()
#2  0x00000030c6414fb2 in qpid::management::ManagementAgentImpl::PublishThread::run () from /usr/lib64/libqmfagent.so.0
#3  0x00000030c5b76bca in ?? () from /usr/lib64/libqpidcommon.so.0
#4  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#5  0x00000032386d3c2d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x42944940 (LWP 20265)):
#0  0x00000032386d4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000030c5b7ffbd in qpid::sys::Poller::wait ()
#2  0x00000030c5b80b67 in qpid::sys::Poller::run ()
#3  0x00000030c6069944 in qpid::client::TCPConnector::run ()
#4  0x00000030c5b76bca in ?? () from /usr/lib64/libqpidcommon.so.0
#5  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#6  0x00000032386d3c2d in clone () from /lib64/libc.so.6
#0  0x00000032386cced2 in select () from /lib64/libc.so.6
[root@mrg-qe-02 ~]# gdb
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) attach 20260
Attaching to process 20260
Reading symbols from /usr/sbin/condor_master...(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
warning: Can't attach LWP -1: No such process
[New Thread 0x42944940 (LWP 20265)]
[New Thread 0x41f43940 (LWP 20263)]
[New Thread 0x41383940 (LWP 20262)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libpcre.so.0...(no debugging symbols found)...done.
...
Loaded symbols for /usr/lib64/sasl2/liblogin.so.2
0x00000032386cced2 in select () from /lib64/libc.so.6
(gdb) info threads
warning: Can't attach LWP -1: No such process
  3 Thread 0x41383940 (LWP 20262)  0x000000323920ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x41f43940 (LWP 20263)  0x00000032386cced2 in select () from /lib64/libc.so.6
  1 Thread 0x42944940 (LWP 20265)  0x00000032386d4018 in epoll_wait () from /lib64/libc.so.6
(gdb) thread apply all bt
warning: Can't attach LWP -1: No such process

Thread 3 (Thread 0x41383940 (LWP 20262)):
#0  0x000000323920ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000030c60795ee in qpid::client::Dispatcher::run (this=<value optimized out>) at qpid/sys/posix/Condition.h:63
#2  0x00000030c6413b90 in qpid::management::ManagementAgentImpl::ConnectionThread::run (this=<value optimized out>)
    at qpid/agent/ManagementAgentImpl.cpp:814
#3  0x00000030c5b76bca in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#5  0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x41f43940 (LWP 20263)):
#0  0x00000032386cced2 in select () from /lib64/libc.so.6
#1  0x00000000004bf4ee in sleep ()
#2  0x00000030c6414fb2 in qpid::management::ManagementAgentImpl::PublishThread::run (this=<value optimized out>)
    at qpid/agent/ManagementAgentImpl.cpp:924
#3  0x00000030c5b76bca in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#5  0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x42944940 (LWP 20265)):
#0  0x00000032386d4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000030c5b7ffbd in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>)
    at qpid/sys/epoll/EpollPoller.cpp:439
#2  0x00000030c5b80b67 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:405
#3  0x00000030c6069944 in qpid::client::TCPConnector::run (this=<value optimized out>) at qpid/client/Connector.cpp:408
#4  0x00000030c5b76bca in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#5  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#6  0x00000032386d3c2d in clone () from /lib64/libc.so.6
#0  0x00000032386cced2 in select () from /lib64/libc.so.6
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /usr/sbin/condor_master, process 20260

Comment 13 Ken Giusti 2009-10-27 14:58:02 UTC
Thanks Frantisek - it looks like the cleanup code (which has the fix) isn't getting invoked.   I'm going to try to repro this on an RHTS system to see if I can find out why the shutdown isn't being caught by the qpid agent code.

-K

Comment 14 Ken Giusti 2009-10-27 18:26:57 UTC
I've repro'd this a few times, and the failure is consistent.  What I suspect may be happening is the pthreads library is getting corrupted, leaving threads in a state where they cannot correctly terminate.


I'm basing this on the fact that I'm getting some very inconsistent results from pstack/gdb on the hung threads.   Below is a sample debug trace, and the trace is consistent each time I repro the problem.



First, there are four threads "hung" after service condor shutdown:

[root@mrg16 task]# ps -ef | grep condor
condor    3435     1 96 13:34 ?        00:20:07 condor_master -pidfile /var/run/condor/condor_master.pid

[root@mrg16 task]# ps -eL | grep condor
 3435  3435 ?        00:00:00 condor_master
 3435  3437 ?        00:00:00 condor_master
 3435  3438 ?        00:30:10 condor_master
 3435  3441 ?        00:00:00 condor_master

[root@mrg16 task]# pstree -ap 3435
condor_master,3435 -pidfile /var/run/condor/condor_master.pid
  ├─{condor_master},3437
  ├─{condor_master},3438
  └─{condor_master},3441


However, pstack only dumps one thread trace:


[root@mrg16 yum.repos.d]# pstack 3435
Thread 1 (Thread 0x40b5b940 (LWP 3437)):
#0  0x0000003f794077e5 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b2136bb6dd2 in qpid::sys::Thread::join ()
#2  0x00002b2136ef309b in qpid::client::TCPConnector::closeInternal ()
#3  0x00002b2136eee25a in qpid::client::ConnectionImpl::closeInternal<boost::_bi::bind_t<void, boost::_mfi::mf2<void, qpid::client::SessionImpl, unsigned short, std::string const&>, boost::_bi::list3<boost::arg<1>, boost::_bi::value<unsigned short>, boost::_bi::value<std::string> > > > ()
#4  0x00002b2136eeaaff in qpid::client::ConnectionImpl::closed ()
#5  0x00002b2136eeadb0 in qpid::client::ConnectionImpl::close ()
#6  0x00002b2137172c84 in qpid::management::ManagementAgentImpl::ConnectionThread::run () from /usr/lib64/libqmfagent.so.0
#7  0x00002b2136bb6bca in ?? () from /usr/lib64/libqpidcommon.so.0
#8  0x0000003f794064a7 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003f78cd3c2d in clone () from /lib64/libc.so.6
#0  0x0000003f794077e5 in pthread_join () from /lib64/libpthread.so.0
[root@mrg16 yum.repos.d]# 



In addition, when I attach to each thread via gdb, I see two threads waiting in a join(), which is correct behavior for a shutdown.  However, I also see two threads that appear to be stuck in start_thread()!   My guess is that they are in the process of cleaning up, but are stuck.  Top indicates that one of these thread "stubs" is stuck at 100% cpu.



(gdb) attach 3435
Attaching to process 3435
(gdb) bt
#0  0x0000003f794077e5 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b2136bb6dd2 in qpid::sys::Thread::join () from /usr/lib64/libqpidcommon.so.0
#2  0x00002b2137174d8c in qpid::management::ManagementAgentImpl::~ManagementAgentImpl () from /usr/lib64/libqmfagent.so.0
#3  0x00002b213716f470 in qpid::management::ManagementAgent::Singleton::~Singleton () from /usr/lib64/libqmfagent.so.0
#4  0x00002b213682b999 in MgmtMasterPlugin::shutdown () from /usr/lib64/condor/plugins/MgmtMasterPlugin-plugin.so
#5  0x000000000048fbdb in MasterPluginManager::Shutdown ()
#6  0x000000000047b0b5 in master_exit ()
#7  0x000000000047ee1e in Daemons::AllDaemonsGone ()
#8  0x000000000047ee96 in Daemons::AllReaper ()
#9  0x000000000049743e in DaemonCore::CallReaper ()
#10 0x000000000049e9aa in DaemonCore::HandleProcessExit ()
#11 0x000000000049eb26 in DaemonCore::HandleDC_SERVICEWAITPIDS ()
#12 0x0000000000497e2d in DaemonCore::Driver ()
#13 0x00000000004b0b21 in main ()

(gdb) attach 3437
(gdb) bt
#0  0x0000003f794077e5 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b2136bb6dd2 in qpid::sys::Thread::join () from /usr/lib64/libqpidcommon.so.0
#2  0x00002b2136ef309b in qpid::client::TCPConnector::closeInternal () from /usr/lib64/libqpidclient.so.0
#3  0x00002b2136eee25a in qpid::client::ConnectionImpl::closeInternal<boost::_bi::bind_t<void, boost::_mfi::mf2<void, qpid::client::SessionImpl, unsigned short, std::string const&>, boost::_bi::list3<boost::arg<1>, boost::_bi::value<unsigned short>, boost::_bi::value<std::string> > > > ()
   from /usr/lib64/libqpidclient.so.0
#4  0x00002b2136eeaaff in qpid::client::ConnectionImpl::closed () from /usr/lib64/libqpidclient.so.0
#5  0x00002b2136eeadb0 in qpid::client::ConnectionImpl::close () from /usr/lib64/libqpidclient.so.0
#6  0x00002b2137172c84 in qpid::management::ManagementAgentImpl::ConnectionThread::run () from /usr/lib64/libqmfagent.so.0
#7  0x00002b2136bb6bca in ?? () from /usr/lib64/libqpidcommon.so.0
#8  0x0000003f794064a7 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003f78cd3c2d in clone () from /lib64/libc.so.6


(gdb) detach
Detaching from program: /usr/sbin/condor_master, process 3437
(gdb) attach 3438
(gdb) bt
#0  0x0000003f79406674 in start_thread () from /lib64/libpthread.so.0
#1  0x0000003f78cd3c2d in clone () from /lib64/libc.so.6


(gdb) detach 
Detaching from program: /usr/sbin/condor_master, process 3438
(gdb) attach 3441
(gdb) bt
#0  0x0000003f79406674 in start_thread () from /lib64/libpthread.so.0
#1  0x0000003f78cd3c2d in clone () from /lib64/libc.so.6


Note also that "info thread" in gdb only shows the current thread - once again appears as if the thread state is corrupt. 


(gdb) info thread
warning: Can't attach LWP -1: No such process
  1 Thread 0x40b5b940 (LWP 3437)  0x0000003f794077e5 in pthread_join () from /lib64/libpthread.so.0
(gdb)

Comment 15 Ken Giusti 2009-10-29 13:59:28 UTC
Still don't have a solution to this, but I have a better understanding of the failure condition:

The problem seems to be related to the use of setuid()/getuid() and pthreads.

From reading the pthread library sources, I've discovered that pthreads uses signals to implement setuid()/getuid() processing.

When a threaded application (condor, in this case) makes a setuid()/getuid() call, the pthread library is invoked as part of the setXid() processing.  All threads need to be notified of the setXid() change, and this is done by the pthread library.  The pthread library's handling of the setXid() call works something like this:

  for (each thread) do
    set a flag in the thread (SETXID_BITMASK)
    signal the thread
  done

  wait for all threads to process the signal.

Each thread runs a signal handler to process the setXid() event:

  process the xid change
  clear the SETXID_BITMASK
  update a futex to indicate setXid() processing done.

Make sense?  Back to the problem at hand:

This bug is due to threads hanging in the pthread library's thread cleanup processing.  The application's thread routine(s) have returned properly, and control is returned to the pthread routine that has invoked the app's thread function (start_thread() in libpthread).  This function then cleans up after the thread.

The thread is hanging because the SETXID_BITMASK bit is set.  The pthread cleanup code checks this bit, and loops until the bit is cleared.  The logic is correct: if a setXid() signal is pending [bit is set], then wait for the signal to be handled before thread cleanup can proceed.

The problem is that there shouldn't be a pending SETXID event - the application is shutting down, and no other thread is waiting for a setXid() change response.

Furthermore, the futex of the spinning threads is set to "done"!  This is why one of the threads is taking 100% cpu - it checks that SETXID is set, then waits for the futex to change to done.  Since the futex is already marked done, the wait returns immediately, and we continue to loop.

From what I can tell, the fact that the SETXID bit is set -and- the futex is marked done is an "impossible" condition.  The code is written to update these items atomically, and the above state should not exist.

I haven't been able to repro this without both qmf & condor.  I'd like to isolate this as a pthread issue, but have not.

Comment 16 Ken Giusti 2009-10-29 20:17:02 UTC
This appears to be a pthread/clone() issue.  Can repo without using condor or qmf.  See BZ531911.

condor calls clone() with CLONE_VM flag.  Cloned process does a setresuid(), this causes the pthreads to hang when exiting.

-K

Comment 17 Frantisek Reznicek 2009-11-09 11:49:47 UTC
Let me highlight one more observation on RHEL 4.8 x86_64, which will be for sure not corrected by glibc bug 531911.

The semi-automated test 'run.sh' is attached.


I can see on 8 core RHEL 4.8 x86_64 that 'service condor stop' does not stop all condor deamons (condor configured to use qmf plugins and qpidd is running with std. 'service params').

tst_cnt:85, err_cnt:0 (wait:2374mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:86, err_cnt:0 (wait:140mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Warning: condor_master may not have exited, start/restart may fail
condor     309  0.6  0.1 94640 8640 ?        Ssl  06:17   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor     353 32.8  0.0 30128 4268 ?        Rs   06:17   0:04 condor_startd -f
pstack 309
Thread 4 (Thread 1084229984 (LWP 313)):
#0  0x0000003527908b3a in pthread_cond_wait@@GLIBC_2.3.2 ()
#1  0x0000003b4b3a9cb2 in qpid::sys::BlockingQueue<boost::shared_ptr<qpid::framing::FrameSet> >::pop () from /usr/lib64/libqpidclient.so.0
#2  0x0000003b4b3a56cf in qpid::client::Dispatcher::run ()
#3  0x0000003b4b61c3ba in qpid::management::ManagementAgentImpl::ConnectionThread::run () from /usr/lib64/libqmfagent.so.0
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable ()
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
Thread 3 (Thread 1094719840 (LWP 314)):
#0  0x0000003526cc2f56 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x00000000004ae6d2 in sleep ()
#2  0x0000003b4b61d960 in qpid::management::ManagementAgentImpl::PublishThread::run () from /usr/lib64/libqmfagent.so.0
#3  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable ()
#4  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
Thread 2 (Thread 1105209696 (LWP 316)):
#0  0x0000003526cca2dc in epoll_wait () from /lib64/tls/libc.so.6
#1  0x0000003b4b0c895c in qpid::sys::Poller::wait ()
#2  0x0000003b4b0c9f07 in qpid::sys::Poller::run ()
#3  0x0000003b4b3904d9 in qpid::client::TCPConnector::run ()
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable ()
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
Error while reading shared library symbols:
Can't attach LWP -1: No such process
Couldn't write debug register: No such process.
Variable "timeout" is not available.
Variable "timeout" is not available.
Quitting: Can't detach LWP -1: No such process
pstack 353
/bin/ls: /proc/353/task: No such file or directory
ptrace: Operation not permitted.
No registers.
killing 309 353
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:87, err_cnt:1 (wait:2747mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:88, err_cnt:1 (wait:4687mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:89, err_cnt:1 (wait:658mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Warning: condor_master may not have exited, start/restart may fail
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:90, err_cnt:1 (wait:1865mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:91, err_cnt:1 (wait:2131mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:92, err_cnt:1 (wait:3565mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:93, err_cnt:1 (wait:1550mili-sec[s])


gdb showed this:
~~~~~~~~~~~~~~~
[root@mrg-qe-03 bz]# cat bt_dump_87_309
GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) Attaching to process 309
Redelivering pending Stopped (signal).
Reading symbols from /usr/sbin/condor_master...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182894196864 (LWP -1)]
Reading symbols from /lib64/libpcre.so.0...
...
Loaded symbols for /usr/lib64/sasl2/libanonymous.so.2

0x0000003526cc2f56 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) [New Thread 182894196864 (LWP -1)]
(gdb) [New Thread 1105209696 (LWP 316)]
[New Thread 1094719840 (LWP 314)]
[New Thread 1084229984 (LWP 313)]

Thread 5 (Thread 1084229984 (LWP 313)):
#0  0x0000003527908b3a in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000003b4b3a9cb2 in qpid::sys::BlockingQueue<boost::shared_ptr<qpid::framing::FrameSet> >::pop (this=0x8db990, result=@0x409ffbd0, timeout=)
    at ./qpid/sys/posix/Condition.h:63
#2  0x0000003b4b3a56cf in qpid::client::Dispatcher::run (this=0x8db230)
    at ./qpid/sys/BlockingQueue.h:71
#3  0x0000003b4b61c3ba in qpid::management::ManagementAgentImpl::ConnectionThread::run (this=0x2a957bf218) at /usr/include/boost/shared_ptr.hpp:252
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x8db9bc) at qpid/sys/posix/Thread.cpp:35
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6

Thread 4 (Thread 1094719840 (LWP 314)):
#0  0x0000003526cc2f56 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x00000000004ae6d2 in sleep ()
#2  0x0000003b4b61d960 in qpid::management::ManagementAgentImpl::PublishThread::run (this=0x2a957bf440) at qpid/agent/ManagementAgentImpl.cpp:924
#3  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x0)
    at qpid/sys/posix/Thread.cpp:35
#4  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6

Thread 3 (Thread 1105209696 (LWP 316)):
#0  0x0000003526cca2dc in epoll_wait () from /lib64/tls/libc.so.6
#1  0x0000003b4b0c895c in qpid::sys::Poller::wait (this=0x8dabe0, timeout=)
    at qpid/sys/epoll/EpollPoller.cpp:439
#2  0x0000003b4b0c9f07 in qpid::sys::Poller::run (this=0x8dabe0)
    at qpid/sys/epoll/EpollPoller.cpp:405
#3  0x0000003b4b3904d9 in qpid::client::TCPConnector::run (this=0x8da7c0)
    at qpid/client/Connector.cpp:408
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x10) at qpid/sys/posix/Thread.cpp:35
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6

One more...

tst_cnt:146, err_cnt:1 (wait:1181mili-sec[s])
Stopping Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
Warning: condor_master may not have exited, start/restart may fail
condor    9325  0.6  0.1 94640 8644 ?        Ssl  06:36   0:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor    9364 34.3  0.0 30128 4268 ?        Rs   06:36   0:05 condor_startd -f
pstack 9325
Thread 4 (Thread 1084229984 (LWP 9329)):
#0  0x0000003527908b3a in pthread_cond_wait@@GLIBC_2.3.2 ()
#1  0x0000003b4b3a9cb2 in qpid::sys::BlockingQueue<boost::shared_ptr<qpid::framing::FrameSet> >::pop () from /usr/lib64/libqpidclient.so.0
#2  0x0000003b4b3a56cf in qpid::client::Dispatcher::run ()
#3  0x0000003b4b61c3ba in qpid::management::ManagementAgentImpl::ConnectionThread::run () from /usr/lib64/libqmfagent.so.0
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable ()
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
Thread 3 (Thread 1094719840 (LWP 9330)):
#0  0x0000003526cc2f56 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x00000000004ae6d2 in sleep ()
#2  0x0000003b4b61d960 in qpid::management::ManagementAgentImpl::PublishThread::run () from /usr/lib64/libqmfagent.so.0
#3  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable ()
#4  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
Thread 2 (Thread 1105209696 (LWP 9332)):
#0  0x0000003526cca2dc in epoll_wait () from /lib64/tls/libc.so.6
#1  0x0000003b4b0c895c in qpid::sys::Poller::wait ()
#2  0x0000003b4b0c9f07 in qpid::sys::Poller::run ()
#3  0x0000003b4b3904d9 in qpid::client::TCPConnector::run ()
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable ()
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
Error while reading shared library symbols:
Can't attach LWP -1: No such process
Couldn't write debug register: No such process.
Variable "timeout" is not available.
Variable "timeout" is not available.
Quitting: Can't detach LWP -1: No such process
pstack 9364
#0  0x00000000005740cc in daxpy ()
#1  0x00000000005742f3 in dgefa ()
#2  0x0000000000575291 in clinpack_kflops ()
#3  0x00000000005753cd in sysapi_kflops_raw ()
#4  0x0000000000489993 in MachAttributes::benchmark ()
#5  0x00000000004976bb in Resource::force_benchmark ()
#6  0x000000000048fe32 in ResMgr::force_benchmark ()
#7  0x00000000004a4f1e in main_init ()
#8  0x00000000004e4193 in main ()
killing 9325 9364
Starting Condor daemons: ^[[60G[  ^[[0;32mOK^[[0;39m  ]^M
tst_cnt:147, err_cnt:2 (wait:447mili-sec[s])


gdb showed this:
~~~~~~~~~~~~~~~
[root@mrg-qe-03 bz]# cat bt_dump_147_9325
GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) Attaching to process 9325
Redelivering pending Stopped (signal).
Reading symbols from /usr/sbin/condor_master...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182894196864 (LWP -1)]
Reading symbols from /lib64/libpcre.so.0...
...
Loaded symbols for /usr/lib64/sasl2/libanonymous.so.2

0x0000003526cc2f56 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) [New Thread 182894196864 (LWP -1)]
(gdb) [New Thread 1105209696 (LWP 9332)]
[New Thread 1094719840 (LWP 9330)]
[New Thread 1084229984 (LWP 9329)]

Thread 5 (Thread 1084229984 (LWP 9329)):
#0  0x0000003527908b3a in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000003b4b3a9cb2 in qpid::sys::BlockingQueue<boost::shared_ptr<qpid::framing::FrameSet> >::pop (this=0x8dcba0, result=@0x409ffbd0, timeout=)
    at ./qpid/sys/posix/Condition.h:63
#2  0x0000003b4b3a56cf in qpid::client::Dispatcher::run (this=0x8db9c0)
    at ./qpid/sys/BlockingQueue.h:71
#3  0x0000003b4b61c3ba in qpid::management::ManagementAgentImpl::ConnectionThread::run (this=0x2a957bf218) at /usr/include/boost/shared_ptr.hpp:252
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x8dcbcc) at qpid/sys/posix/Thread.cpp:35
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6

Thread 4 (Thread 1094719840 (LWP 9330)):
#0  0x0000003526cc2f56 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x00000000004ae6d2 in sleep ()
#2  0x0000003b4b61d960 in qpid::management::ManagementAgentImpl::PublishThread::run (this=0x2a957bf440) at qpid/agent/ManagementAgentImpl.cpp:924
#3  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x0)
    at qpid/sys/posix/Thread.cpp:35
#4  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6

Thread 3 (Thread 1105209696 (LWP 9332)):
#0  0x0000003526cca2dc in epoll_wait () from /lib64/tls/libc.so.6
#1  0x0000003b4b0c895c in qpid::sys::Poller::wait (this=0x8dabe0, timeout=)
    at qpid/sys/epoll/EpollPoller.cpp:439
#2  0x0000003b4b0c9f07 in qpid::sys::Poller::run (this=0x8dabe0)
    at qpid/sys/epoll/EpollPoller.cpp:405
#3  0x0000003b4b3904d9 in qpid::client::TCPConnector::run (this=0x8da7c0)
    at qpid/client/Connector.cpp:408
#4  0x0000003b4b0bdfaa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x10) at qpid/sys/posix/Thread.cpp:35
#5  0x0000003527906137 in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003526cc9f03 in clone () from /lib64/tls/libc.so.6
(gdb) 

bt_dump_147_9325  bt_dump_147_9364
[root@mrg-qe-03 bz]# cat bt_dump_147_9364
GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) Attaching to process 9364
Reading symbols from /usr/sbin/condor_startd...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182894196864 (LWP 9364)]
Loaded symbols for /lib64/tls/libpthread.so.0
...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x0000003526c688fe in malloc_consolidate ()
   from /lib64/tls/libc.so.6
(gdb)   1 Thread 182894196864 (LWP 9364)  0x0000003526c688fe in malloc_consolidate ()
   from /lib64/tls/libc.so.6
(gdb)
Thread 1 (Thread 182894196864 (LWP 9364)):
#0  0x0000003526c688fe in malloc_consolidate () from /lib64/tls/libc.so.6
#1  0x0000003526c6914f in _int_free () from /lib64/tls/libc.so.6
#2  0x0000003526c69846 in free () from /lib64/tls/libc.so.6
#3  0x000000352b2ae29e in operator delete () from /usr/lib64/libstdc++.so.6
#4  0x00000000005c713c in ExtArray<StringSpace::SSStringEnt>::~ExtArray ()
#5  0x00000000005c5691 in StringSpace::~StringSpace ()
#6  0x0000000000593138 in ExprTree::~ExprTree$base ()
#7  0x0000000000590c3c in BinaryOpBase::~BinaryOpBase$base ()
#8  0x000000000058f723 in AssignOpBase::~AssignOpBase$base ()
#9  0x000000000058f75f in AssignOp::~AssignOp$delete ()
#10 0x0000000000581f40 in AttrListElem::~AttrListElem ()
#11 0x0000000000580834 in AttrList::clear ()
#12 0x000000000057d54a in AttrList::~AttrList$base ()
#13 0x00000000005839fa in ClassAd::~ClassAd$delete ()
#14 0x000000000049b3b7 in Starter::~Starter$delete ()
#15 0x00000000004a7ebe in StarterMgr::~StarterMgr ()
#16 0x000000000048c648 in ResMgr::~ResMgr$delete ()
#17 0x00000000004a600f in startd_exit ()
#18 0x00000000004a6337 in startd_check_free ()
#19 0x00000000004a607c in main_shutdown_fast ()
#20 0x00000000004e2958 in handle_dc_sigquit ()
#21 0x00000000004ce79f in DaemonCore::Driver ()
#22 0x00000000004e419f in main ()
(gdb) Detaching from program: /usr/sbin/condor_startd, process 9364

Comment 18 Frantisek Reznicek 2009-11-09 11:53:17 UTC
Created attachment 368189 [details]
the bug 528544
semiautomated reproducer

To run the reproducer, make sure you have qpidd + condor + qmf plugins installed.
The run is done following way:
./run.sh 1000 1
         ^ nr of runs
              ^ turn on verbose

Comment 19 Frantisek Reznicek 2009-11-09 11:58:54 UTC
The long term running tests show atm that issue form comment #17 appears on  rhel 4.8 x86_64 only:
[root@mrg-qe-03 bz]# rpm -qa | egrep '(condor|qmf|qpid|rhm)' | sort -u
condor-7.4.1-0.4.el4
condor-qmf-plugins-7.4.1-0.4.el4
condor-remote-configuration-1.0-23.el4
python-qpid-0.5.760500-6.el4
qmf-0.5.752581-32.el4
qmf-devel-0.5.752581-32.el4
qpidc-0.5.752581-32.el4
qpidc-debuginfo-0.5.752581-32.el4
qpidc-devel-0.5.752581-32.el4
qpidc-ssl-0.5.752581-32.el4
qpidd-0.5.752581-32.el4
qpidd-acl-0.5.752581-32.el4
qpidd-devel-0.5.752581-32.el4
qpid-dotnet-0.4.738274-2.el4
qpidd-ssl-0.5.752581-32.el4
qpidd-xml-0.5.752581-32.el4
qpid-java-client-0.5.751061-9.el4
qpid-java-common-0.5.751061-9.el4
rhm-0.5.3206-25.el4
rhm-debuginfo-0.5.3206-23.el4
rhm-docs-0.5.756148-1.el4


Removing TestBlocker keyword (workaround known QMF_DELETE_ON_SHUTDOWN = FALSE)

Comment 20 Matthew Farrellee 2009-11-09 16:32:18 UTC
It is expected that the condor_master is still running whenever any of its children are still running. In the examples above the condor_startd is still running. The only trace of the condor_startd above indicates some benchmarking is happening. I would expect the startd to exit after that operation. Please check the StartLog whenever you see this issue to see what the startd is doing instead of exiting.

Unless the startd and master are truly hung and not just busy with normal operations, I'd consider this normal behavior.

Typically the hang is evident when the daemon's log indicates the daemon has exited when the process has not.

Comment 21 Ken Giusti 2009-11-09 19:05:55 UTC
After looking a bit at this new behavior, I agree - this is NAB.  In the fault cases above, I believe that condor is still in the process of shutting down, and would shutdown cleanly had the test script not kill -9 it.

Here's why:

I provisioned mrg44 and executed the test script.  Ran 1000 times without repro the problem.  Frantisek allowed me to use mrg-qe-03, which has his setup already in place.  On mrg-qe-03, the problem was reproduced.

However, it turns out that the load average on mrg-qe-03 is around 25%.   I returned to mrg44, and spawned 2*#cpus bash while[1] loops.  mrg44 load ave hit 15% and the test reproduced the problem.

Given Matt's input (comment 20), I thought that the load was preventing condor from shutting down in a timely manner.  I modified the test script to NOT issue a kill on the processes, but rather monitor the processes to see if they eventually exit:

  service condor stop || let "err_cnt++"

  sleep 1
  
  # check whether all deamons down
  cpids="$(ps auxw | grep condor_ | grep -v grep | awk '{printf $2 " "}')"
  ps auxw | grep condor_ | grep -v grep
  
  while [ -n "${cpids}" ]; do
	echo "condor processes still running:"
	echo "$cpids"
	echo "Sleeping and retrying..."
	sleep 1
	cpids="$(ps auxw | grep condor_ | grep -v grep | awk '{printf $2 " "}')"
  done 

  echo "condor has exited!!!"
  ps auxw | grep condor_ | grep -v grep


Every time the test runs, the processes eventually clean up and exit correctly.  In some cases, the script will wait over 10 seconds, but the processes do clean up and exit.  I'll continue the run and update this bug if I get a permanent hang.

Comment 22 Frantisek Reznicek 2009-11-10 13:06:34 UTC
Comments #17 ... #19 are proved to be NAB, it is the problem of early condor shutdown check & kill.

Comment 26 errata-xmlrpc 2009-12-03 09:15:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html


Note You need to log in before you can comment on or make changes to this bug.