Bug 656385

Summary: Data store can become corrupt with small store file size and large wcache-page-size
Product: Red Hat Enterprise MRG Reporter: Mike Cressman <mcressma>
Component: qpid-cppAssignee: Kim van der Riet <kim.vdriet>
Status: CLOSED ERRATA QA Contact: Petr Matousek <pematous>
Severity: medium Docs Contact:
Priority: high    
Version: 1.3CC: freznice, gsim, iboverma, tross
Target Milestone: 1.3.2-RC2   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-mrg-0.7.946106-28 Doc Type: Bug Fix
Doc Text:
Prior to this update, running the qpidd service with the page cache size (the "wcache-page-size" option) greater than the file size could corrupt the data store. With this update, the underlying source code has been modified to reduce the page cache size to the file size, and the qpid-config utility no longer allows users to specify the "--file-size" option with a value that is smaller than the page cache size.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-15 12:12:22 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
simple python consumer
none
simple python producer
none
Patch which adds check for illegal file size / page cache size condition
none
Patch to fix issues described in comments 11 - 14 none

Description Mike Cressman 2010-11-23 16:14:41 UTC
Created attachment 462375 [details]
simple python consumer

Description of problem:

If wcache-page-size is set to 128, and a queue is created with file-size=1 and file-count=4, the store can become corrupted, producing the following error:

2010-11-23 09:18:44 error Unexpected exception: Queue myqueue: flush() failed: jexception 0x0403 fcntl::add_wr_subm_cnt_dblks() threw JERR_FCNTL_FILEOFFSOVFL: Attempted increase file offset past file size. (pfid=1 lfid=2 wr_subm_cnt_dblks=516 incr=4 fsize=516 dblks) (MessageStoreImpl.cpp:1257)


Version-Release number of selected component (if applicable):
1.3

How reproducible:
Always

Steps to Reproduce:
1. set wcache-page-size=128 in qpidd.conf
2. delete the datadir/store (start clean)
3. start broker
4. create queue/exchange/binding as follows:

   qpid-config add queue myqueue --durable --file-count 4 --file-size 1
   qpid-config add exchange direct myqueue --durable
   qpid-config bind myqueue myqueue mybind

5. Run msg-send.py

You should get an error message in the log about journal full.

6. Run msg-read.py

This reads the current messages from the queue.

7. Restart the broker

8. Run msg-send.py
  
Actual results:

2010-11-23 09:18:44 error Unexpected exception: Queue myqueue: flush() failed: jexception 0x0403 fcntl::add_wr_subm_cnt_dblks() threw JERR_FCNTL_FILEOFFSOVFL: Attempted increase file offset past file size. (pfid=1 lfid=2 wr_subm_cnt_dblks=516 incr=4 fsize=516 dblks) (MessageStoreImpl.cpp:1257)

Expected results:

System handles the error without corrupting the store

Additional info:
If you don't read the messages in step 6, then restart the broker, you just get a journal full error again but no corruption.  Reading them seems to cause the problem.

Comment 1 Mike Cressman 2010-11-23 16:15:31 UTC
Created attachment 462376 [details]
simple python producer

Comment 2 Kim van der Riet 2010-11-30 13:44:26 UTC
The current trunk (r.4412) exhibits this bug.

Comment 3 Kim van der Riet 2010-11-30 15:51:30 UTC
This would appear to be a condition in which the page cache (128k) is larger than the file size (64k), a condition for which the store does not check.

This is a fundamentally incorrect situation, since the use of DMA for writing files means that each page must fit into the file being written. The example above recovers and continues to use the journal under these conditions; all subsequent errors are cascaded (consequential) and compounded.

There are two possible solutions:
1. Add a check to catch and fail startup of the broker if this condition exists;
2. Auto-fix the cache size to the maximum allowed, and log the action.

Thoughts?

Being an undetected misconfiguration, I am lowering this to high/medium.

Suggest placing an article in the knowledge base for this.

Comment 4 Kim van der Riet 2010-11-30 16:23:46 UTC
Where QMF is used to create a persistent queue with illegal conditions (such as in this reproducer), the action should fail with a well-articulated error message.

Comment 5 Kim van der Riet 2010-11-30 19:50:02 UTC
Created attachment 463811 [details]
Patch which adds check for illegal file size / page cache size condition

This patch checks for the case of the page cache size being greater than the file size.

If this happens during broker startup (ie through cmd-line params and/or config file settings), then the write cache size will be reduced to the same size as the file size with a log file warning:

"warning parameter wcache-page-size (128) cannot set a page size greater than the journal file size; changing this parameter to the journal file size (64)"

If this happens using qpid tools/qmf at runtime, then an exception is thrown and the connection closed with an error message of the form:

"Cannot create journal with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB]"

Comment 6 Kim van der Riet 2011-01-10 18:47:21 UTC
This patch has been checked in on trunk as r.4418

Comment 9 Mike Cressman 2011-01-24 14:30:54 UTC
In build for 1.3.2 RC 2

Comment 10 Kim van der Riet 2011-01-27 18:03:05 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size.

Comment 11 Petr Matousek 2011-01-28 11:53:09 UTC
This issue is not fixed

Version-Release number of selected component:
qpid-cpp-mrg-0.7.946106-27

When trying to create a queue with --file-size smaller than wcache-page-size set in qpidd.conf file using qpid-tools/qmf, error is displaed and queue is not created (correct). 
Second execution of the same command ends with success, moreover the created queue is corrupted. Sending messages using msg-send.py on that queue ends with error. The queue is also not persistent, the queue is not present after the broker restart.

# grep wcache-page-size /etc/qpidd.conf
wcache-page-size=128
# qpid-config add queue myqueue --durable --file-count 4 --file-size 1
Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:114)')
# echo $?
1
# qpid-config add queue myqueue --durable --file-count 4 --file-size 1
# echo $?
0
# qpid-config queues myqueue
Queue Name  Attributes
===================================
myqueue     --durable
# qpid-config add exchange direct myqueue --durable
# qpid-config bind myqueue myqueue mybind
# ./msg-send.py 
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 183, in run
    self.delegate.received(op)
  File "/usr/lib/python2.4/site-packages/qpid/delegates.py", line 52, in received
    getattr(self, op.NAME)(ch, op)
  File "/usr/lib/python2.4/site-packages/qpid/delegates.py", line 106, in session_completed
    ch.session_known_completed(cmp.commands)
  File "/usr/lib/python2.4/site-packages/qpid/generator.py", line 25, in <lambda>
    method = lambda self, *args, **kwargs: self.invoke(op, args, kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 229, in invoke
    self.connection.write_op(ctl)
  File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 200, in write_op
    self.flush()
  File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 57, in flush
    self._write(cipher_buf)
  File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 68, in _write
    n = self.sock.send(buf)
error: (32, 'Broken pipe')

Traceback (most recent call last):
  File "./msg-send.py", line 48, in ?
    session.close(timeout=10)
  File "/usr/lib/python2.4/site-packages/qpid/session.py", line 109, in close
    self.channel.session_detach(self.name)
  File "/usr/lib/python2.4/site-packages/qpid/generator.py", line 25, in <lambda>
    method = lambda self, *args, **kwargs: self.invoke(op, args, kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 229, in invoke
    self.connection.write_op(ctl)
  File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 200, in write_op
    self.flush()
  File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 57, in flush
    self._write(cipher_buf)
  File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 68, in _write
    n = self.sock.send(buf)
socket.error: (32, 'Broken pipe')

# service qpidd restart
Stopping Qpid AMQP daemon:                                 [  OK  ]
Starting Qpid AMQP daemon:                                 [  OK  ]
# qpid-config queues myqueue
Queue Name  Attributes
===================================

Comment 13 Petr Matousek 2011-01-28 11:56:14 UTC
Please see comment 11

Default durrable queue created with qpid-config command do not suffer from that.

+ service qpidd stop
Stopping Qpid AMQP daemon:                                 [  OK  ]
+ rm -rf /var/lib/qpidd/lock /var/lib/qpidd/rhm /var/lib/qpidd/systemId
+ service qpidd start
Starting Qpid AMQP daemon:                                 [  OK  ]
+ qpid-config add queue myqueue --durable
+ qpid-config queues myqueue
Queue Name  Attributes
===================================
myqueue     --durable --file-size=24 --file-count=8
+ qpid-config add exchange direct myqueue --durable
+ qpid-config bind myqueue myqueue mybind
+ ./msg-send.py
sent hello 0
sent hello 1
sent hello 2
+ ./msg-read.py
read hello 0
read hello 1
read hello 2
+ service qpidd restart
Stopping Qpid AMQP daemon:                                 [  OK  ]
Starting Qpid AMQP daemon:                                 [  OK  ]
+ qpid-config queues myqueue
Queue Name  Attributes
===================================
myqueue     --durable --file-size=24 --file-count=8
+ ./msg-send.py
sent hello 0
sent hello 1
sent hello 2
+ ./msg-read.py
read hello 0
read hello 1
read hello 2


ASSIGNED

Comment 14 Kim van der Riet 2011-01-28 12:39:24 UTC
This is a good catch. It appears that the queue is not handling the exception correctly. Attempting to add the queue the first time results in the connection being closed, and no journal is created for the queue, but the queue exists and remains in the broker. After starting a fresh broker:

$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config queues
Queue Name                                   Attributes
=================================================================
qmfc-v2-hb-mrg42.lab.bos.redhat.com.10155.1  auto-del excl --limit-policy=ring
qmfc-v2-mrg42.lab.bos.redhat.com.10155.1     auto-del excl
qmfc-v2-ui-mrg42.lab.bos.redhat.com.10155.1  auto-del excl --limit-policy=ring
reply-mrg42.lab.bos.redhat.com.10155.1       auto-del excl
topic-mrg42.lab.bos.redhat.com.10155.1       auto-del excl --limit-policy=ring

$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1
Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)')

$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config queues
Queue Name                                   Attributes
=================================================================
myqueue                                      --durable
qmfc-v2-hb-mrg42.lab.bos.redhat.com.10183.1  auto-del excl --limit-policy=ring
qmfc-v2-mrg42.lab.bos.redhat.com.10183.1     auto-del excl
qmfc-v2-ui-mrg42.lab.bos.redhat.com.10183.1  auto-del excl --limit-policy=ring
reply-mrg42.lab.bos.redhat.com.10183.1       auto-del excl
topic-mrg42.lab.bos.redhat.com.10183.1       auto-del excl --limit-policy=ring
[kpvdr@mrg42 python]$

This shows that the broker retained the failed queue, and subsequent attempts to create it appear to succeed, however, I believe that the request is ignored - this is normal behaviour for attempting to create a queue that already exists. (Perhaps a warning would be appropriate, but this is a separate issue.)

I should imagine that the behaviour of this queue is undefined, especially since the broker believes that it is durable, but no journal exists.

Comment 15 Kim van der Riet 2011-01-28 12:59:52 UTC
I believe that the demonstration in comments 11 - 14 illustrates a separate bug from this one; further, I don't believe that it is a regression and has been present all along. These comments reflect a problem in the broker's handling of exceptions during management calls, and may also exist for exchange and binding creation (although this has not been demonstrated).

I would recommend closing this bug and opening another on handling exceptions around management calls in the broker.

Comment 16 Kim van der Riet 2011-01-28 14:18:54 UTC
Created attachment 475811 [details]
Patch to fix issues described in comments 11 - 14

Comment 17 Kim van der Riet 2011-01-28 14:24:26 UTC
With the patch in comment #15, the problem described above is now solved. After starting a fresh store/broker with --wcache-page-size 128:


$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1
Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)')

$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config queues
Queue Name                                   Attributes
=================================================================
qmfc-v2-hb-mrg42.lab.bos.redhat.com.11233.1  auto-del excl --limit-policy=ring
qmfc-v2-mrg42.lab.bos.redhat.com.11233.1     auto-del excl
qmfc-v2-ui-mrg42.lab.bos.redhat.com.11233.1  auto-del excl --limit-policy=ring
reply-mrg42.lab.bos.redhat.com.11233.1       auto-del excl
topic-mrg42.lab.bos.redhat.com.11233.1       auto-del excl --limit-policy=ring

$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1
Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)')

$ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1
Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)')

Comment 18 Gordon Sim 2011-01-28 14:28:29 UTC
Decision pending on whether to respin for 1.3.2 with the attached fix.

Comment 19 Kim van der Riet 2011-01-28 14:57:00 UTC
Checked onto trunk as r.1064711.

-> POST

Comment 20 Kim van der Riet 2011-01-28 14:59:53 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size.+Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size.
+
+Checkin r.1064711 resolves the problem that if a queue is created using these illegal parameters, the store raises an error, but the queue remains partially initialized in the broker.

Comment 21 Petr Matousek 2011-02-04 10:53:40 UTC
The issue has been fixed, tested on RHEL 4.8 / 5.6, i386 / x86_64 on packages:
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-28.el5
qpid-cpp-client-devel-0.7.946106-28.el5
qpid-cpp-client-devel-docs-0.7.946106-28.el5
qpid-cpp-client-ssl-0.7.946106-28.el5
qpid-cpp-mrg-debuginfo-0.7.946106-28.el5
qpid-cpp-server-0.7.946106-28.el5
qpid-cpp-server-cluster-0.7.946106-28.el5
qpid-cpp-server-devel-0.7.946106-28.el5
qpid-cpp-server-ssl-0.7.946106-28.el5
qpid-cpp-server-store-0.7.946106-28.el5
qpid-cpp-server-xml-0.7.946106-28.el5
qpid-java-client-0.7.946106-15.el5
qpid-java-common-0.7.946106-15.el5
qpid-java-example-0.7.946106-15.el5
qpid-tools-0.7.946106-12.el5

VERIFIED

Comment 22 Jaromir Hradilek 2011-02-09 17:09:35 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1 @@
-Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size.
+Prior to this update, running the qpidd service with the page cache size (the "wcache-page-size" option) greater than the file size could corrupt the data store. With this update, the underlying source code has been modified to reduce the page cache size to the file size, and the qpid-config utility no longer allows users to specify the "--file-size" option with a value that is smaller than the page cache size.-
-Checkin r.1064711 resolves the problem that if a queue is created using these illegal parameters, the store raises an error, but the queue remains partially initialized in the broker.

Comment 23 errata-xmlrpc 2011-02-15 12:12:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html