Bug 499317

Summary: [RHEL 5.2] Problems with futex()
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.2
Status: CLOSED NOTABUG
Severity: high
Priority: high
Target Milestone: rc
Target Release: 5.4
Hardware: All
OS: Linux
Reporter: Lachlan McIlroy <lmcilroy>
Assignee: Dave Anderson <anderson>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: cdahlin, cwyse, dzickus, pzijlstr, tao, vgaikwad
Doc Type: Bug Fix
Last Closed: 2009-09-03 18:22:27 UTC

Attachments:
  running backtrace
  process /proc info
  new backtrace

Description Lachlan McIlroy 2009-05-06 05:48:07 UTC
Description of problem:
The customer is experiencing render process hangs during a futex system call.  This is happening on RHEL 4 and RHEL 5 systems.

Version-Release number of selected component (if applicable):
kernel is 2.6.18-92.1.13.el5

How reproducible:
The customer sees this problem frequently, but it is not reliably reproducible.

Steps to Reproduce:
Customer is working on a test case.
  
Actual results:


Expected results:


Additional info:
Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=400541

Comment 2 Issue Tracker 2009-05-11 18:10:02 UTC
I listened to Pandora and logged into Facebook, and saw an error that looks
very similar to what they were seeing.  Below I have added comments marking
where the futex hang started and stopped.

fcntl64(23, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
len=1}) = 0
fcntl64(23, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741826,
len=510}) = 0
fcntl64(23, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=1073741824,
len=1}) = 0
access("/home/cwyse/.mozilla/firefox/Buffalo/cookies.sqlite-journal",
F_OK) = -1 ENOENT (No such file or directory)
fstat64(23, {st_mode=S_IFREG|0644, st_size=143360, ...}) = 0
lseek(23, 24, SEEK_SET)                 = 24
read(23, "\0\0Y\305\0\0\0\0\0\0\0\0\0\0\0\23", 16) = 16
fcntl64(23, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825,
len=1}) = 0
open("/home/cwyse/.mozilla/firefox/Buffalo/cookies.sqlite-journal",
O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE|O_NOFOLLOW, 0644) = 47
open("/home/cwyse/.mozilla/firefox/Buffalo", O_RDONLY) = 50
fcntl64(50, F_GETFD)                    = 0
fcntl64(50, F_SETFD, FD_CLOEXEC)        = 0
fcntl64(47, F_GETFD)                    = 0
fcntl64(47, F_SETFD, FD_CLOEXEC)        = 0
fstat64(47, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(47, 0, SEEK_SET)                  = 0
write(47, "\331\325\5\371
\241c\327\377\377\377\377\236\6\2167\0\0\0\214\0\0\2\0\0\0\4\0\0\0\0\0"...,
512) = 512
close(50)                               = 0
close(47)                               = 0
unlink("/home/cwyse/.mozilla/firefox/Buffalo/cookies.sqlite-journal") =
0
fcntl64(23, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741826,
len=510}) = 0
fcntl64(23, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=1073741824,
len=2}) = 0
fcntl64(23, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
gettimeofday({1242063738, 459802}, NULL) = 0
gettimeofday({1242063738, 460090}, NULL) = 0
futex(0x8116738, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8116734, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0xb3604b0, FUTEX_WAIT_PRIVATE, 1, NULL

#################################################
# When it locked up I hit enter a bunch of times,
# so the lines above are where Firefox locks up.
# The lines below are what we see when it comes back.
###################################################

) = 0
futex(0xb3604ec, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1242063761, 118042}, NULL) = 0
gettimeofday({1242063761, 118290}, NULL) = 0
gettimeofday({1242063761, 118958}, NULL) = 0
gettimeofday({1242063761, 118999}, NULL) = 0
futex(0x80aa688, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x80aa684, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
gettimeofday({1242063761, 119757}, NULL) = 0
gettimeofday({1242063761, 120065}, NULL) = 0
gettimeofday({1242063761, 120129}, NULL) = 0
write(16, "8", 1)                       = 1
futex(0x8114480, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8e54bb8, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x8e54bf4, 354)
= 1
gettimeofday({1242063761, 121321}, NULL) = 0
write(3, "\22\0\7\0\331\0 \2\30\1\0\0\6\0\0\0
\0\0\0\1\0\0\0B\226\3040\2\333\4\0"..., 44) = 44
ioctl(3, FIONREAD, [14336])             = 0
read(3, "\3$\276\1\262\226\3040\266\1\0\0\332\0
\2\0\0\0\0\211\3\317\0\212\3\237\0\20\0\1\0"..., 2048) =
2048
gettimeofday({1242063761, 121893}, NULL) = 0
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8,
events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=12,
events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=17,
events=POLLIN}, {fd=10, events=POLLIN|POLLPRI}, {fd=49, events=POLLIN},
{fd=51, events=POLLIN}, {fd=48, events=POLLIN}, {fd=18, events=POLLIN}],
12, 0) = 2 ([{fd=3, revents=POLLIN}, {fd=18, revents=POLLIN}])
gettimeofday({1242063761, 122098}, NULL) = 0
read(18, "\372", 1)                     = 1
inotify_add_watch(49, "/usr/local/share/applications",
IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF|IN_UNMOUNT|IN_ONLYDIR)
= -1 ENOENT (No such file or directory)
inotify_add_watch(49, "/usr/local/share/applications",
IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF|IN_UNMOUNT|IN_ONLYDIR)
= -1 ENOENT (No such file or directory)
gettimeofday({1242063761, 122504}, NULL) = 0



This event sent from IssueTracker by cwyse 
 issue 285050

Comment 10 Lachlan McIlroy 2009-05-13 01:11:38 UTC
I can get a kernel dump from the ctime-hang test program, but we don't need one.  We know what the problem is in that case and how to fix it.  What we need to do is confirm that what the customer is seeing is the same problem.  The symptoms are the same, but without the stack trace from the python process we don't know if it is ctime recursion or something else.

The backtrace file and python binary haven't helped.  Can we get the customer to capture a backtrace?  The next time the application gets stuck they just need to run:

gdb <path to binary of process that is stuck> <pid of process>

and use the 'bt' command.  That should be enough information.  We won't need a kernel dump.  I'll look at the Pandora/Facebook lockup now.
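
If it's easier, the same thing can be captured non-interactively (this assumes gdb is installed on the box and the binary hasn't been stripped):

gdb -batch -ex 'bt' <path to binary of process that is stuck> <pid of process>

That attaches, prints the backtrace and detaches without any further prompting.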

Comment 14 Charlie Wyse 2009-05-14 22:14:24 UTC
Created attachment 344058 [details]
running backtrace

Here is the running backtrace you requested.

Comment 15 Charlie Wyse 2009-05-14 23:02:43 UTC
Created attachment 344063 [details]
process /proc info

Here is all the info I pulled from /proc/<pid>.  Hope it proves helpful.

Comment 17 Lachlan McIlroy 2009-05-15 02:26:06 UTC
It's a long backtrace, but here's the business end of it.

#0  0x0000003fbd60c888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003fbd6088a5 in _L_mutex_lock_107 () from /lib64/libpthread.so.0
#2  0x0000003fbd608333 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00002ac0da30901c in pthread_attr_setdetachstate@plt () from /lib64/liblm.so
#4  0x00002ac0da30eb44 in ppoll () from /lib64/liblm.so
#5  <signal handler called>
#6  0x0000003fbca7a1ab in memcpy () from /lib64/libc.so.6
#7  0x0000003fbca73ed0 in _int_realloc () from /lib64/libc.so.6
#8  0x0000003fbca74f80 in realloc () from /lib64/libc.so.6

The application was attempting to realloc() some memory when it received a signal, and the signal handler called ppoll().  While executing ppoll() it needs to acquire a mutex that is already held, so it is stuck waiting for that mutex to be released.
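
To make the pattern clearer, here is a minimal, self-contained sketch of that failure mode.  The function and lock names are made up for illustration; this is not the customer's code:

/* Hypothetical sketch of a signal-handler self-deadlock.  A function that
 * takes a mutex is called both from normal context and from a signal
 * handler.  If the signal arrives while the mutex is already held, the
 * handler blocks on the same mutex and the thread deadlocks on itself. */
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void lock_and_work(void)
{
    pthread_mutex_lock(&lock);   /* a second acquisition from the handler never returns */
    usleep(1000);                /* widen the window while the lock is held */
    pthread_mutex_unlock(&lock);
}

static void handler(int sig)
{
    (void)sig;
    lock_and_work();             /* unsafe: may run while this thread already holds the lock */
}

int main(void)
{
    signal(SIGALRM, handler);
    alarm(1);
    for (;;)
        lock_and_work();         /* if SIGALRM lands inside the critical section, we hang */
}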

Comment 18 Lachlan McIlroy 2009-05-15 05:52:01 UTC
What is /lib64/liblm.so?  I can't find it anywhere.  Can you ask the customer what it is and whether we can get the source code for it?

In realloc() the code acquires the __libc_malloc_lock lock.  I doubt this is the lock that pthread_attr_setdetachstate() is trying to acquire, but if it is, then it's a case of double-locking the same lock.  What I think is more likely here is an ABBA deadlock: this thread holds the __libc_malloc_lock lock and is trying to acquire lock X, while another thread holds lock X and is trying to acquire the __libc_malloc_lock lock.
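
For reference, the ABBA pattern boils down to something like the sketch below.  The lock names are illustrative stand-ins, not the real glibc internals:

/* Illustrative ABBA deadlock.  Thread 1 takes A then wants B; thread 2
 * takes B then wants A; both block forever. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;  /* stands in for __libc_malloc_lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;  /* stands in for "lock X" */

static void *thread1(void *arg)
{
    pthread_mutex_lock(&lock_a);
    sleep(1);                       /* widen the race window */
    pthread_mutex_lock(&lock_b);    /* blocks forever: thread2 holds B and wants A */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return arg;
}

static void *thread2(void *arg)
{
    pthread_mutex_lock(&lock_b);
    sleep(1);
    pthread_mutex_lock(&lock_a);    /* blocks forever: thread1 holds A and wants B */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return arg;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);         /* never returns - both threads are deadlocked */
    pthread_join(t2, NULL);
    puts("not reached");
    return 0;
}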

Comment 19 Casey Dahlin 2009-05-15 14:29:54 UTC
From sbeal on the related IT:

Looks like liblm.so comes from Librato:

Librato offers two different load management products, Load Monitor and Load Manager. Although both ... LM Library is a Linux shared library named liblm.so

grep librato rpmpkgs
librato-lm-lmc-1.4.4-1.x86_64.rpm
librato-lm-server-1.4.4-1.x86_64.rpm

http://www.librato.com/products/load_manager

Comment 21 Charlie Wyse 2009-05-15 22:08:09 UTC
Created attachment 344242 [details]
new backtrace

Here is a backtrace on a machine without Librato.

Comment 22 Lachlan McIlroy 2009-05-18 02:14:34 UTC
It's another long backtrace:

#0  0x00000033fd20a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaab180492b in ZThread::Monitor::wait () from /rel/depot/third_party_lib/ZThread/all-2.3.1-15/lib/ZThread/lib/opt-gccWS5_64/libZThread-2.3.so.1
#2  0x00002aaab180dce3 in ZThread::(anonymous namespace)::WaiterQueue::wait () from /rel/depot/third_party_lib/ZThread/all-2.3.1-15/lib/ZThread/lib/opt-gccWS5_64/libZThread-2.3.so.1
#3  0x00002aaab6e278a7 in image_filter::MultiThreadDispatch::operator() () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#4  0x00002aaab6e2ab79 in image_filter::MultiThreadFilter::computeImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#5  0x00002aaab6e0bd00 in image_filter::Filter::processImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#6  0x00002aaab6e0d133 in image_filter::Filter::setupScanlineInputs () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#7  0x00002aaab6e45946 in image_filter::ScanlineFilter::setupScanlineInputs () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#8  0x00002aaab6e66d79 in image_filter::Crop_sw::setupScanlineInputs () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#9  0x00002aaab6e2aa48 in image_filter::MultiThreadFilter::computeImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#10 0x00002aaab6e0bd00 in image_filter::Filter::processImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#11 0x00002aaab6d9f621 in image::Image::run () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#12 0x00002aaab6d9fbc6 in image::Image::getRow () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
#13 0x00002aaabca494de in image_mov::Encoder::encodeVideo () from /rel/map/mvp-2008.03.83/python/opt-gccWS5_64/lib/python2.5/studio/image_mov/_image_movmodule.so
#14 0x00002aaabca5f081 in image_mov::SimpleEncoder::encodeRgbVideo () from /rel/map/mvp-2008.03.83/python/opt-gccWS5_64/lib/python2.5/studio/image_mov/_image_movmodule.so
#15 0x00002aaabca422e1 in image_mov::BasicEncodeProcessor::videoProcess () from /rel/map/mvp-2008.03.83/python/opt-gccWS5_64/lib/python2.5/studio/image_mov/_image_movmodule.so
...

I don't see anything obviously wrong with this.  There's no signal handler called in this one, so it's definitely not a case of recursive locking.  Almost the entire stack trace is third-party software, so I'm not sure what we can do for them.  The thread appears to be processing an image, has (maybe?) launched some threads to do the work, and is now synchronising on a semaphore.  This thread state may be valid, and whatever thread it is waiting for may simply not have completed its work yet.

We're going to need more information to diagnose this.  In the backtrace output there are these lines:

[New Thread 46912498771056 (LWP 14319)]
[New Thread 1080043840 (LWP 14374)]
[New Thread 1077942592 (LWP 14373)]
[New Thread 1075841344 (LWP 14372)]

The first line is the thread we have a backtrace for (note that the LWP is the pid).  The other lines, I think, are the threads started by the first thread - one of these may hold the clue to why we are stuck.  What we will need is a backtrace for each of these threads too.  Since the system is probably no longer in the state we need, we will have to wait for the customer to hit this problem again.
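
When it does happen again, the stacks of all threads in the process can be grabbed in one go (assuming gdb is available on that machine):

gdb -batch -ex 'thread apply all bt' <path to binary> <pid>

That would give us the stack of the waiting thread as well as the worker threads it started.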

Comment 26 Casey Dahlin 2009-06-26 21:06:16 UTC
Customer is looking for an update on this one. Can we get them something?

Comment 28 Dave Anderson 2009-06-29 12:24:41 UTC
If you're asking me, I am still waiting for some evidence of a kernel bug.
Or should this BZ have been assigned to a user-space library?