Bug 499317
| Summary: | [RHEL 5.2] Problems with futex() | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lachlan McIlroy <lmcilroy> |
| Component: | kernel | Assignee: | Dave Anderson <anderson> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.2 | CC: | cdahlin, cwyse, dzickus, pzijlstr, tao, vgaikwad |
| Target Milestone: | rc | | |
| Target Release: | 5.4 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-09-03 18:22:27 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Lachlan McIlroy
2009-05-06 05:48:07 UTC
Listened to Pandora and logged into Facebook. I saw an error that looks very similar to what they were seeing. Below I have added comments marking where the futex hang started and stopped.

    fcntl64(23, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = 0
    fcntl64(23, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741826, len=510}) = 0
    fcntl64(23, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=1073741824, len=1}) = 0
    access("/home/cwyse/.mozilla/firefox/Buffalo/cookies.sqlite-journal", F_OK) = -1 ENOENT (No such file or directory)
    fstat64(23, {st_mode=S_IFREG|0644, st_size=143360, ...}) = 0
    lseek(23, 24, SEEK_SET) = 24
    read(23, "\0\0Y\305\0\0\0\0\0\0\0\0\0\0\0\23", 16) = 16
    fcntl64(23, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1}) = 0
    open("/home/cwyse/.mozilla/firefox/Buffalo/cookies.sqlite-journal", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE|O_NOFOLLOW, 0644) = 47
    open("/home/cwyse/.mozilla/firefox/Buffalo", O_RDONLY) = 50
    fcntl64(50, F_GETFD) = 0
    fcntl64(50, F_SETFD, FD_CLOEXEC) = 0
    fcntl64(47, F_GETFD) = 0
    fcntl64(47, F_SETFD, FD_CLOEXEC) = 0
    fstat64(47, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
    lseek(47, 0, SEEK_SET) = 0
    write(47, "\331\325\5\371 \241c\327\377\377\377\377\236\6\2167\0\0\0\214\0\0\2\0\0\0\4\0\0\0\0\0"..., 512) = 512
    close(50) = 0
    close(47) = 0
    unlink("/home/cwyse/.mozilla/firefox/Buffalo/cookies.sqlite-journal") = 0
    fcntl64(23, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741826, len=510}) = 0
    fcntl64(23, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=1073741824, len=2}) = 0
    fcntl64(23, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
    gettimeofday({1242063738, 459802}, NULL) = 0
    gettimeofday({1242063738, 460090}, NULL) = 0
    futex(0x8116738, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8116734, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0xb3604b0, FUTEX_WAIT_PRIVATE, 1, NULL
    #################################################
    # When it locked up I hit enter a bunch of times
    # So the lines above are where it locks up
    firefox
    # The lines below this are what we see when it comes back.
    ###################################################
    ) = 0
    futex(0xb3604ec, FUTEX_WAKE_PRIVATE, 1) = 0
    gettimeofday({1242063761, 118042}, NULL) = 0
    gettimeofday({1242063761, 118290}, NULL) = 0
    gettimeofday({1242063761, 118958}, NULL) = 0
    gettimeofday({1242063761, 118999}, NULL) = 0
    futex(0x80aa688, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x80aa684, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    gettimeofday({1242063761, 119757}, NULL) = 0
    gettimeofday({1242063761, 120065}, NULL) = 0
    gettimeofday({1242063761, 120129}, NULL) = 0
    write(16, "8", 1) = 1
    futex(0x8114480, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x8e54bb8, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x8e54bf4, 354) = 1
    gettimeofday({1242063761, 121321}, NULL) = 0
    write(3, "\22\0\7\0\331\0 \2\30\1\0\0\6\0\0\0 \0\0\0\1\0\0\0B\226\3040\2\333\4\0"..., 44) = 44
    ioctl(3, FIONREAD, [14336]) = 0
    read(3, "\3$\276\1\262\226\3040\266\1\0\0\332\0 \2\0\0\0\0\211\3\317\0\212\3\237\0\20\0\1\0"..., 2048) = 2048
    gettimeofday({1242063761, 121893}, NULL) = 0
    poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN}, {fd=10, events=POLLIN|POLLPRI}, {fd=49, events=POLLIN}, {fd=51, events=POLLIN}, {fd=48, events=POLLIN}, {fd=18, events=POLLIN}], 12, 0) = 2 ([{fd=3, revents=POLLIN}, {fd=18, revents=POLLIN}])
    gettimeofday({1242063761, 122098}, NULL) = 0
    read(18, "\372", 1) = 1
    inotify_add_watch(49, "/usr/local/share/applications", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF|IN_UNMOUNT|IN_ONLYDIR) = -1 ENOENT (No such file or directory)
    inotify_add_watch(49, "/usr/local/share/applications", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF|IN_UNMOUNT|IN_ONLYDIR) = -1 ENOENT (No such file or directory)
    gettimeofday({1242063761, 122504}, NULL) = 0

This event sent from IssueTracker by cwyse, issue 285050.

I can get a kernel dump from the ctime-hang test program, but we don't need to. We know what the problem is in that case and how to fix it. What we need to do is confirm that what the customer is seeing is the same problem. The symptoms are the same, but without the stack trace from python we don't know whether it is ctime recursion or something else. The backtrace file and python binary haven't helped. Can we get the customer to capture the backtrace? The next time the application gets stuck they just need to run:

    gdb <path to binary of process that is stuck> <pid of process>

and use the 'bt' command. That should be enough information. We won't need a kernel dump. I'll look at the pandora/facebook lockup now.

Created attachment 344058 [details]
running backtrace
Here is the running backtrace you requested.
Created attachment 344063 [details]
process /proc info
Here is all the info I pulled from /proc/<pid>. Hope it proves helpful.
It's a long backtrace, but here's the business end of it:

    #0  0x0000003fbd60c888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
    #1  0x0000003fbd6088a5 in _L_mutex_lock_107 () from /lib64/libpthread.so.0
    #2  0x0000003fbd608333 in pthread_mutex_lock () from /lib64/libpthread.so.0
    #3  0x00002ac0da30901c in pthread_attr_setdetachstate@plt () from /lib64/liblm.so
    #4  0x00002ac0da30eb44 in ppoll () from /lib64/liblm.so
    #5  <signal handler called>
    #6  0x0000003fbca7a1ab in memcpy () from /lib64/libc.so.6
    #7  0x0000003fbca73ed0 in _int_realloc () from /lib64/libc.so.6
    #8  0x0000003fbca74f80 in realloc () from /lib64/libc.so.6

The application was attempting to realloc() some memory when it received a signal, and the signal handler called ppoll(). While executing ppoll() it needs to acquire a mutex that is already held, and it is stuck waiting for that mutex to be released.

What is /lib64/liblm.so? I can't find it anywhere. Can you ask the customer what it is and whether we can get the source code to it?

In realloc() the code acquires the __libc_malloc_lock lock. I doubt this is the lock that pthread_attr_setdetachstate() is trying to acquire, but if it is then it's a case of double-locking the same lock. What I think is more likely here is that we've got an ABBA deadlock: this thread holds the __libc_malloc_lock lock and is trying to acquire lock X, while another thread holds lock X and is trying to acquire the __libc_malloc_lock lock.

From sbeal on the related IT: Looks like liblm.so comes from Librato:

    Librato offers two different load management products, Load Monitor and Load Manager. Although both ... LM Library is a Linux shared library named liblm.so

    grep librato rpmpkgs
    librato-lm-lmc-1.4.4-1.x86_64.rpm
    librato-lm-server-1.4.4-1.x86_64.rpm

http://www.librato.com/products/load_manager

Created attachment 344242 [details]
new backtrace
Here is a backtrace on a machine without Librato.
It's another long backtrace:

    #0  0x00000033fd20a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x00002aaab180492b in ZThread::Monitor::wait () from /rel/depot/third_party_lib/ZThread/all-2.3.1-15/lib/ZThread/lib/opt-gccWS5_64/libZThread-2.3.so.1
    #2  0x00002aaab180dce3 in ZThread::(anonymous namespace)::WaiterQueue::wait () from /rel/depot/third_party_lib/ZThread/all-2.3.1-15/lib/ZThread/lib/opt-gccWS5_64/libZThread-2.3.so.1
    #3  0x00002aaab6e278a7 in image_filter::MultiThreadDispatch::operator() () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #4  0x00002aaab6e2ab79 in image_filter::MultiThreadFilter::computeImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #5  0x00002aaab6e0bd00 in image_filter::Filter::processImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #6  0x00002aaab6e0d133 in image_filter::Filter::setupScanlineInputs () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #7  0x00002aaab6e45946 in image_filter::ScanlineFilter::setupScanlineInputs () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #8  0x00002aaab6e66d79 in image_filter::Crop_sw::setupScanlineInputs () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #9  0x00002aaab6e2aa48 in image_filter::MultiThreadFilter::computeImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #10 0x00002aaab6e0bd00 in image_filter::Filter::processImage () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #11 0x00002aaab6d9f621 in image::Image::run () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #12 0x00002aaab6d9fbc6 in image::Image::getRow () from /rel/depot/so/dwadbmain/all-0.0.3618833-4/lib/dwadbmain/lib/opt-gccWS5_64/libdwadbmain.so
    #13 0x00002aaabca494de in image_mov::Encoder::encodeVideo () from /rel/map/mvp-2008.03.83/python/opt-gccWS5_64/lib/python2.5/studio/image_mov/_image_movmodule.so
    #14 0x00002aaabca5f081 in image_mov::SimpleEncoder::encodeRgbVideo () from /rel/map/mvp-2008.03.83/python/opt-gccWS5_64/lib/python2.5/studio/image_mov/_image_movmodule.so
    #15 0x00002aaabca422e1 in image_mov::BasicEncodeProcessor::videoProcess () from /rel/map/mvp-2008.03.83/python/opt-gccWS5_64/lib/python2.5/studio/image_mov/_image_movmodule.so
    ...

I don't see anything obviously wrong with this. There's no signal handler called in this one, so it's definitely not a case of recursive locking. Almost the entire stack trace is third-party software, so I'm not sure what we can do for them. The thread appears to be processing an image and has (maybe?) launched some threads to do the work, and is now synchronising on a semaphore. This thread state may be valid; whatever thread it is waiting for may simply not have completed its work yet.

We're going to need more information to diagnose this. In the backtrace output there are these lines:

    [New Thread 46912498771056 (LWP 14319)]
    [New Thread 1080043840 (LWP 14374)]
    [New Thread 1077942592 (LWP 14373)]
    [New Thread 1075841344 (LWP 14372)]

The first line is the thread we have a backtrace for (note that the LWP is the pid). The other lines, I think, are the threads started by the first thread; one of these may hold the clue to why we are stuck. We will need backtraces for those threads too. Since the system is probably no longer in the state we need, we will have to wait for the customer to hit this problem again.

Customer is looking for an update on this one. Can we get them something?

If you're asking me, I am still waiting for some evidence of a kernel bug. Or should this BZ have been assigned to a user-space library?