456272 – Broker fails to create journal directory - File already exists

Summary: Broker fails to create journal directory - File already exists

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	1.0.1
Target Release:	---
Assignee:	Kim van der Riet
QA Contact:	Kim van der Riet
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	460109

Reported:	2008-07-22 15:38 UTC by David Sommerseth
Modified:	2016-05-22 23:27 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-10-06 19:09:09 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0640	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG bug fix and enhancement update	2008-10-06 19:08:07 UTC

Description David Sommerseth 2008-07-22 15:38:22 UTC

<davids> kpvdr:  I've lately seen a lot of "Directory creation failed" exceptions .... with the explanation "File exists" ... is that something you know about?
<kpvdr> davids: no
<davids> kpvdr:  2008-jul-20 15:41:
<davids> 35 error Unexpected exception: Queue 33c46d96-1601-4f6e-801d-8ed5f71c5e3e@guest: create() failed: jexception 0x0301 jdir::create_dir() threw JERR_JDIR_MKDIR: Directory creation failed.
<davids>  (dir="/tmp/rhts_qpidd/qpid-data/pt_broker.568/rhm/jrnl/0017" errno=17 (File exists)) (BdbMessageStore.cpp:356)
<davids> kpvdr:  on one box I have this 3-4 times using the RHTS test script ... and the broker is started all the times from scratch in a new empty data directory
<davids> kpvdr:  Can I do something in another way to pin-point where/how/why it happens?
<kpvdr> davids: thinking
<kpvdr> davids: what is the test doing?
<davids> kpvdr:  [15:41:17] Running perftest in topic mode (with storage): 1 iterations with 25000 msgs. Msg size: 64 bytes.  Extra test params: --nsubs 10 --qt 4 --durable yes
<kpvdr> davids: Hmmm, this is odd
<davids> kpvdr:  it happens only on topic tests
<kpvdr> davids: I have an idea on this..
<kpvdr> davids: ie when more than one journal maps into the same dir - in this case 0017

Comment 1 Kim van der Riet 2008-07-24 20:33:38 UTC

I have been unable to reproduce this error. I have eliminated the following
possibilities:

1. Directory permission: this results in a different error message;
2. Directory exists: this works fine and the test completes with up to 4 queues
per directory;
3. Too many file handles: this results in a different error message on file
creation, not dir creation.

Since the code checks for the existence of a dir prior to creating it, the only
explanation for this error is a thread safety issue - ie two threads happen to
create the same dir at the same time. Examination of the code shows that the
current algorithm for creating the first level dir uses a simple hash of the
queue name to create one of 20 possible dirs. There exists for random dir names
a 5% probability that a second equally paced thread may attempt to create the
same top-level dir at the same time for another queue.
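
The 5% figure follows from spreading queues uniformly over 20 dirs (1/20). As a
rough sketch only: the CRC32 hash and the zero-padded 4-digit dir name below are
assumptions made for illustration (the report does not show the store's actual
hash), and both function names are hypothetical.

```python
import zlib

NUM_DIRS = 20  # from the comment above: one of 20 possible first-level dirs


def bucket_dir(queue_name, num_dirs=NUM_DIRS):
    """Map a queue name to a first-level dir name such as '0017'.

    ASSUMPTION: CRC32 and the '%04d' format are stand-ins; the real
    hash used by the store is not shown in this report.
    """
    return "%04d" % (zlib.crc32(queue_name.encode("utf-8")) % num_dirs)


def same_dir_probability(num_dirs=NUM_DIRS):
    """Chance that a second, uniformly hashed queue name maps to the
    same first-level dir as the first one."""
    return 1.0 / num_dirs
```

With 20 dirs, same_dir_probability() gives 0.05, which matches the 5% quoted
above.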

Although I have not reproduced the error, I am checking in a fix for this
oversight in the hope that it will eliminate this bug. Not checking for dir
existence, and instead allowing for a possible duplicate, solves this problem
without the need for a lock.
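
The idea can be sketched as follows. This is not the actual change in r2215 and
r2216 (the broker is C++ and that code is not shown here); it is a minimal Python
illustration, with hypothetical function names, of replacing a check-then-create
sequence with a create that treats errno 17 (EEXIST) as success.

```python
import errno
import os


def create_dir_checked(path):
    """Racy pattern: another thread can create `path` between the
    existence check and mkdir(), which then fails with EEXIST."""
    if not os.path.isdir(path):
        os.mkdir(path)


def create_dir_tolerant(path):
    """Lock-free alternative: always attempt mkdir() and accept
    'already exists' as long as the existing entry is a directory."""
    try:
        os.mkdir(path)
    except OSError as e:
        # Re-raise anything other than "a directory is already there".
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

Calling create_dir_tolerant() twice on the same path succeeds both times, so it
does not matter which thread gets there first. A regular file at that path still
raises an error.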

I will leave this assigned for a little while longer and see if the bug can be
reproduced on the RHTS hardware which originally found this problem.

r2215 on trunk; r2216 on 1.0 branch

Comment 2 David Sommerseth 2008-07-29 18:44:34 UTC

Reproducer is now available from mrg-team SVN ... mrg-team/people/dsommers/bz456272

This reproducer works to some extent on hp-xw4800-01.rhts.bos.redhat.com.  Start
qpidd with: --auth no --tpl-wcache-page-size 128 --tpl-jfile-size-pgs 32
--num-jfiles 16 --jfile-size-pgs 32

In another screen, run this command:

  $ (find /usr -type f -exec cat {} \; > /tmp/filedata.dat) & python ./bz456272.py


It seems this issue arises much more often when the disk is busy.  The failure
rate is around 10% with this script on this box.

Comment 3 David Sommerseth 2008-07-30 13:45:42 UTC

With the latest qpid and storage module from SVN (qpid.0-10, mrg-1.0) this bug
seems to be fixed.

Comment 5 Frantisek Reznicek 2008-08-29 07:21:27 UTC

RHTS test developed (MRG/qpid_broker_jfail_bz456272).
Test results coming soon.

Comment 6 Frantisek Reznicek 2008-09-04 10:16:47 UTC

RHTS test (MRG/qpid_broker_jfail_bz456272) shows no more qpidd 'file already exists' failures. Bug going to VERIFIED.

The latest results show that the MRG/qpid_broker_jfail_bz456272 test is unstable: it sometimes fails when disconnecting queues from the broker. This behavior is under investigation by QA and might be reported to DEV as a new bug.

Comment 8 errata-xmlrpc 2008-10-06 19:09:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0640.html
