Bug 460109

Summary: Broker fails to create journal directory - File already exists (RHEL 4)
Product: Red Hat Enterprise MRG
Reporter: Gordon Sim <gsim>
Component: qpid-cpp
Assignee: messaging-bugs <messaging-bugs>
Status: CLOSED ERRATA
QA Contact: Kim van der Riet <kim.vdriet>
Severity: high
Docs Contact:
Priority: medium
Version: 1.0
CC: freznice
Target Milestone: 1.0.1
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-06 19:00:00 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Bug Depends On: 456272    
Bug Blocks:    

Description Gordon Sim 2008-08-26 08:46:15 UTC
+++ This bug was initially created as a clone of Bug #456272 +++

<davids> kpvdr:  I've lately seen a lot of "Directory creation failed"
exceptions .... with the explanation "File exists" ... is that something you
know about?
<kpvdr> davids: no
<davids> kpvdr:  2008-jul-20 15:41:
<davids> 35 error Unexpected exception: Queue
33c46d96-1601-4f6e-801d-8ed5f71c5e3e@guest: create() failed: jexception 0x0301
jdir::create_dir() threw JERR_JDIR_MKDIR: Directory creation failed.
<davids>  (dir="/tmp/rhts_qpidd/qpid-data/pt_broker.568/rhm/jrnl/0017" errno=17
(File exists)) (BdbMessageStore.cpp:356)
<davids> kpvdr:  on one box I have this 3-4 times using the RHTS test script ...
and the broker is started all the times from scratch in a new empty data directory
<davids> kpvdr:  Can I do something in another way to pin-point where/how/why it
happens?
<kpvdr> davids: thinking
<kpvdr> davids: what is the test doing?
<davids> kpvdr:  [15:41:17] Running perftest in topic mode (with storage): 1
iterations with 25000 msgs. Msg size: 64 bytes.  Extra test params: --nsubs 10
--qt 4 --durable yes
<kpvdr> davids: Hmmm, this is odd
<davids> kpvdr:  it happens only on topic tests
<kpvdr> davids: I have an idea on this..
<kpvdr> davids: ie when more than one journal maps into the same dir - in this
case 0017

--- Additional comment from kim.vdriet on 2008-07-24 16:33:38 EDT ---

I have been unable to reproduce this error. I have eliminated the following
possibilities:

1. Directory permission: this results in a different error message;
2. Directory exists: this works fine and the test completes with up to 4 queues
per directory;
3. Too many file handles: this results in a different error message on file
creation, not dir creation.

Since the code checks for the existence of a dir prior to creating it, the only
explanation for this error is a thread safety issue - ie two threads happen to
create the same dir at the same time. Examination of the code shows that the
current algorithm for creating the first level dir uses a simple hash of the
queue name to select one of 20 possible dirs. For random dir names, there is
a 5% (1 in 20) probability that a second, equally paced thread will attempt to
create the same top-level dir at the same time for another queue.
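
The 5% figure can be checked with a short sketch. The hash below is a hypothetical stand-in (the store's real hash function is not shown in this report), and the four-digit dir name format is only an assumption based on the "0017" seen in the log. The probability calculation itself relies only on the "one of 20 possible dirs" statement above.

```python
from fractions import Fraction

NUM_DIRS = 20  # from the comment above: one of 20 possible first-level dirs


def dir_for_queue(queue_name: str) -> str:
    """Hypothetical stand-in for the store's hash; NOT the real algorithm."""
    bucket = sum(queue_name.encode("utf-8")) % NUM_DIRS
    # Four-digit formatting is an assumption modelled on "0017" in the log.
    return f"{bucket:04d}"


def same_dir_probability(num_dirs: int = NUM_DIRS) -> Fraction:
    """Chance that two independent, uniformly distributed names share a dir."""
    matching = sum(
        1 for a in range(num_dirs) for b in range(num_dirs) if a == b
    )
    return Fraction(matching, num_dirs * num_dirs)


if __name__ == "__main__":
    print(dir_for_queue("33c46d96-1601-4f6e-801d-8ed5f71c5e3e@guest"))
    print(float(same_dir_probability()))
```

With ten subscribers per topic test (--nsubs 10), several queues are declared close together, so a 1-in-20 chance per pair of landing in the same dir is not negligible.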

Although I have not reproduced the error, I am checking in a fix for this
oversight in the hope that it will eliminate this bug. Skipping the dir
existence check and allowing for a possible duplicate solves this problem
without the need for a lock.
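
A minimal sketch of the idea, in Python rather than the broker's C++ and not taken from r2215/r2216: the function names are hypothetical, and the "tolerant" variant is one plausible reading of "allowing for a possible duplicate". The checked variant shows the check-then-create pattern that can race; the tolerant variant calls mkdir unconditionally and treats errno 17 (File exists) as success when the existing entry is a directory.

```python
import errno
import os
import tempfile
import threading


def create_dir_checked(path: str) -> None:
    """Check-then-create. Racy: another thread can create the dir between
    the existence check and the mkdir call."""
    if not os.path.exists(path):
        os.mkdir(path)  # may raise FileExistsError under contention


def create_dir_tolerant(path: str) -> None:
    """Create unconditionally and treat an existing directory as success."""
    try:
        os.mkdir(path)
    except OSError as exc:
        # Swallow "File exists" only when the existing entry is a directory.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise


def run_concurrently(create, path: str, workers: int = 8) -> list:
    """Call create(path) from several threads at once; return any errors."""
    errors = []
    barrier = threading.Barrier(workers)

    def worker() -> None:
        barrier.wait()  # line the threads up to maximise contention
        try:
            create(path)
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "0017")
        errors = run_concurrently(create_dir_tolerant, target)
        print("errors:", errors, "dir exists:", os.path.isdir(target))
```

No lock is needed because the kernel already serialises mkdir: exactly one caller creates the directory and the others see EEXIST. In Python the same effect is available through os.makedirs(path, exist_ok=True).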

I will leave this assigned for a little while longer and see if the bug can be
reproduced on the RHTS hardware which originally found this problem.

r2215 on trunk; r2216 on 1.0 branch

--- Additional comment from davids on 2008-07-29 14:44:34 EDT ---

Reproducer is now available from mrg-team SVN ... mrg-team/people/dsommers/bz456272

This reproducer works to some extent on hp-xw4800-01.rhts.bos.redhat.com.  Start
qpidd with: --auth no --tpl-wcache-page-size 128 --tpl-jfile-size-pgs 32
--num-jfiles 16 --jfile-size-pgs 32

In another screen, run this command:

  $ (find /usr -type f -exec cat {} \; > /tmp/filedata.dat) & python ./bz456272.py


It seems this issue arises much more often when the disk is busy.  The
failure rate is around 10% with this script on this box.


--- Additional comment from davids on 2008-07-30 09:45:42 EDT ---

With the latest qpid and storage module from SVN (qpid.0-10, mrg-1.0) this bug
seems to be fixed.

Comment 2 Frantisek Reznicek 2008-08-29 07:22:07 UTC
RHTS test developed (MRG/qpid_broker_jfail_bz456272).
Test results coming soon.

Comment 3 Frantisek Reznicek 2008-09-04 10:17:55 UTC
RHTS test (MRG/qpid_broker_jfail_bz456272) shows no more qpidd 'File already exists' failures. Bug going to VERIFIED.

The latest results show that the MRG/qpid_broker_jfail_bz456272 test is unstable: it sometimes fails when disconnecting queues from the broker. This behavior is under investigation by QA and might be reported to DEV as a new bug.

Comment 5 errata-xmlrpc 2008-10-06 19:00:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0867.html