<davids> kpvdr: I've lately seen a lot of "Directory creation failed" exceptions .... with the explanation "File exists" ... is that something you know about?
<kpvdr> davids: no
<davids> kpvdr: 2008-jul-20 15:41:
<davids> 35 error Unexpected exception: Queue 33c46d96-1601-4f6e-801d-8ed5f71c5e3e@guest: create() failed: jexception 0x0301 jdir::create_dir() threw JERR_JDIR_MKDIR: Directory creation failed.
<davids> (dir="/tmp/rhts_qpidd/qpid-data/pt_broker.568/rhm/jrnl/0017" errno=17 (File exists)) (BdbMessageStore.cpp:356)
<davids> kpvdr: on one box I have this 3-4 times using the RHTS test script ... and the broker is started all the times from scratch in a new empty data directory
<davids> kpvdr: Can I do something in another way to pin-point where/how/why it happens?
<kpvdr> davids: thinking
<kpvdr> davids: what is the test doing?
<davids> kpvdr: [15:41:17] Running perftest in topic mode (with storage): 1 iterations with 25000 msgs. Msg size: 64 bytes. Extra test params: --nsubs 10 --qt 4 --durable yes
<kpvdr> davids: Hmmm, this is odd
<davids> kpvdr: it happens only on topic tests
<kpvdr> davids: I have an idea on this..
<kpvdr> davids: ie when more than one journal maps into the same dir - in this case 0017
I have been unable to reproduce this error. I have eliminated the following possibilities:
1. Directory permissions: this results in a different error message;
2. Directory already exists: this works fine, and the test completes with up to 4 queues per directory;
3. Too many file handles: this results in a different error message on file creation, not dir creation.

Since the code checks for the existence of a dir before creating it, the only explanation for this error is a thread-safety issue, i.e. two threads happen to create the same dir at the same time: both see that the dir does not exist, both call mkdir, and the slower one fails with errno 17 (File exists). Examination of the code shows that the current algorithm for creating the first-level dir uses a simple hash of the queue name to select one of 20 possible dirs. For random names there is therefore a 5% (1 in 20) probability that a second, equally paced thread will attempt to create the same top-level dir at the same time for another queue.

Although I have not reproduced the error, I am checking in a fix for this oversight in the hope that it will eliminate this bug. Not checking for dir existence beforehand, and instead tolerating a possible duplicate creation, solves this problem without the need for a lock.

I will leave this assigned for a little while longer and see if the bug can be reproduced on the RHTS hardware which originally found this problem.

r2215 on trunk; r2216 on 1.0 branch
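The check-then-create race described in the analysis above, and the lock-free alternative, can be illustrated with a minimal Python sketch. This is not the store's C++ code from r2215/r2216. The function names are made up for illustration, and `top_level_dir` is a hypothetical stand-in for the "simple hash of the queue name", whose real form is not given in this report.

```python
import errno
import os
import threading


def top_level_dir(queue_name, ndirs=20):
    """Hypothetical stand-in for the store's hash: map a queue name to one of ndirs dirs."""
    return "%04d" % (sum(queue_name.encode()) % ndirs)


def create_dir_checked(path):
    """Racy: another thread can create path between the check and the mkdir."""
    if not os.path.isdir(path):
        os.mkdir(path)  # the loser of the race gets errno 17 (EEXIST)


def create_dir_tolerant(path):
    """Lock-free: always attempt mkdir and treat an existing directory as success."""
    try:
        os.mkdir(path)
    except OSError as exc:
        # Re-raise anything other than "a directory is already there".
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise


def race(create, path, nthreads=8):
    """Call create(path) from nthreads threads released together; return the errors raised."""
    errors = []
    barrier = threading.Barrier(nthreads)

    def worker():
        barrier.wait()  # line the threads up so that a collision is as likely as possible
        try:
            create(path)
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

With this sketch, `race(create_dir_tolerant, path)` should always return an empty list. `race(create_dir_checked, path)` on a fresh path may occasionally return a "File exists" error, depending on thread timing, which would be consistent with the intermittent failures reported here.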
A reproducer is now available from mrg-team SVN ... mrg-team/people/dsommers/bz456272
This reproducer works, to some extent, on hp-xw4800-01.rhts.bos.redhat.com.

Start qpidd with:
    --auth no --tpl-wcache-page-size 128 --tpl-jfile-size-pgs 32 --num-jfiles 16 --jfile-size-pgs 32

In another screen, run this command:
    $ (find /usr -type f -exec cat {} \; > /tmp/filedata.dat) & python ./bz456272.py

The issue seems to arise much more often when the disk is busy. The failure rate is roughly 10% with this script on this box.
With the latest qpid and storage module from SVN (qpid.0-10, mrg-1.0) this bug seems to be fixed.
RHTS test developed (MRG/qpid_broker_jfail_bz456272). Test results coming soon.
The RHTS test (MRG/qpid_broker_jfail_bz456272) no longer shows qpidd 'file already exists' failures. Bug going to VERIFIED. The latest results show that the MRG/qpid_broker_jfail_bz456272 test is unstable: it sometimes fails when disconnecting queues from the broker. This behavior is under investigation by QA and might be reported to DEV as a new bug.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0640.html