Bug 656385
Summary: | Data store can become corrupt with small store file size and large wcache-page-size | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Mike Cressman <mcressma> |
Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
Status: | CLOSED ERRATA | QA Contact: | Petr Matousek <pematous> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 1.3 | CC: | freznice, gsim, iboverma, tross |
Target Milestone: | 1.3.2-RC2 | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | qpid-cpp-mrg-0.7.946106-28 | Doc Type: | Bug Fix |
Doc Text: |
Prior to this update, running the qpidd service with the page cache size (the "wcache-page-size" option) greater than the file size could corrupt the data store. With this update, the underlying source code has been modified to reduce the page cache size to the file size, and the qpid-config utility no longer allows users to specify the "--file-size" option with a value that is smaller than the page cache size.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-02-15 12:12:22 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Created attachment 462376 [details]
simple python producer
The current trunk (r.4412) exhibits this bug. This would appear to be a condition in which the page cache (128k) is larger than the file size (64k), a condition for which the store does not check. This is a fundamentally incorrect situation, since the use of DMA for writing files means that each page must fit into the file being written. The example above recovers and continues to use the journal under these conditions; all subsequent errors are cascaded (consequential) and compounded. There are two possible solutions: 1. Add a check to catch and fail startup of the broker if this condition exists; 2. Auto-fix the cache size to the maximum allowed, and log the action. Thoughts? Being an undetected misconfiguration, I am lowering this to high/medium. Suggest placing an article in the knowledge base for this. Where QMF is used to create a persistent queue with illegal conditions (such as in this reproducer), the action should fail with a well-articulated error message. Created attachment 463811 [details]
Patch which adds check for illegal file size / page cache size condition
This patch checks for the case of the page cache size being greater than the file size.
If this happens during broker startup (ie through cmd-line params and/or config file settings), then the write cache size will be reduced to the same size as the file size with a log file warning:
"warning parameter wcache-page-size (128) cannot set a page size greater than the journal file size; changing this parameter to the journal file size (64)"
If this happens using qpid tools/qmf at runtime, then an exception is thrown and the connection closed with an error message of the form:
"Cannot create journal with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB]"
This patch has been checked in on trunk as r.4418 In build for 1.3.2 RC 2 Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size. This issue is not fixed Version-Release number of selected component: qpid-cpp-mrg-0.7.946106-27 When trying to create a queue with --file-size smaller than wcache-page-size set in qpidd.conf file using qpid-tools/qmf, error is displaed and queue is not created (correct). Second execution of the same command ends with success, moreover the created queue is corrupted. Sending messages using msg-send.py on that queue ends with error. The queue is also not persistent, the queue is not present after the broker restart. # grep wcache-page-size /etc/qpidd.conf wcache-page-size=128 # qpid-config add queue myqueue --durable --file-count 4 --file-size 1 Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:114)') # echo $? 1 # qpid-config add queue myqueue --durable --file-count 4 --file-size 1 # echo $? 0 # qpid-config queues myqueue Queue Name Attributes =================================== myqueue --durable # qpid-config add exchange direct myqueue --durable # qpid-config bind myqueue myqueue mybind # ./msg-send.py Exception in thread Thread-1: Traceback (most recent call last): File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap self.run() File "/usr/lib64/python2.4/threading.py", line 422, in run self.__target(*self.__args, **self.__kwargs) File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 183, in run self.delegate.received(op) File "/usr/lib/python2.4/site-packages/qpid/delegates.py", line 52, in received getattr(self, op.NAME)(ch, op) File "/usr/lib/python2.4/site-packages/qpid/delegates.py", line 106, in session_completed ch.session_known_completed(cmp.commands) File "/usr/lib/python2.4/site-packages/qpid/generator.py", line 25, in <lambda> method = lambda self, *args, **kwargs: self.invoke(op, args, kwargs) File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 229, in invoke self.connection.write_op(ctl) File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 200, in write_op self.flush() File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 57, in flush self._write(cipher_buf) File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 68, in _write n = self.sock.send(buf) error: (32, 'Broken pipe') Traceback (most recent call last): File "./msg-send.py", line 48, in ? session.close(timeout=10) File "/usr/lib/python2.4/site-packages/qpid/session.py", line 109, in close self.channel.session_detach(self.name) File "/usr/lib/python2.4/site-packages/qpid/generator.py", line 25, in <lambda> method = lambda self, *args, **kwargs: self.invoke(op, args, kwargs) File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 229, in invoke self.connection.write_op(ctl) File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 200, in write_op self.flush() File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 57, in flush self._write(cipher_buf) File "/usr/lib/python2.4/site-packages/qpid/framer.py", line 68, in _write n = self.sock.send(buf) socket.error: (32, 'Broken pipe') # service qpidd restart Stopping Qpid AMQP daemon: [ OK ] Starting Qpid AMQP daemon: [ OK ] # qpid-config queues myqueue Queue Name Attributes =================================== Please see comment 11 Default durrable queue created with qpid-config command do not suffer from that. + service qpidd stop Stopping Qpid AMQP daemon: [ OK ] + rm -rf /var/lib/qpidd/lock /var/lib/qpidd/rhm /var/lib/qpidd/systemId + service qpidd start Starting Qpid AMQP daemon: [ OK ] + qpid-config add queue myqueue --durable + qpid-config queues myqueue Queue Name Attributes =================================== myqueue --durable --file-size=24 --file-count=8 + qpid-config add exchange direct myqueue --durable + qpid-config bind myqueue myqueue mybind + ./msg-send.py sent hello 0 sent hello 1 sent hello 2 + ./msg-read.py read hello 0 read hello 1 read hello 2 + service qpidd restart Stopping Qpid AMQP daemon: [ OK ] Starting Qpid AMQP daemon: [ OK ] + qpid-config queues myqueue Queue Name Attributes =================================== myqueue --durable --file-size=24 --file-count=8 + ./msg-send.py sent hello 0 sent hello 1 sent hello 2 + ./msg-read.py read hello 0 read hello 1 read hello 2 ASSIGNED This is a good catch. It appears that the queue is not handling the exception correctly. Attempting to add the queue the first time results in the connection being closed, and no journal is created for the queue, but the queue exists and remains in the broker. After starting a fresh broker: $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config queues Queue Name Attributes ================================================================= qmfc-v2-hb-mrg42.lab.bos.redhat.com.10155.1 auto-del excl --limit-policy=ring qmfc-v2-mrg42.lab.bos.redhat.com.10155.1 auto-del excl qmfc-v2-ui-mrg42.lab.bos.redhat.com.10155.1 auto-del excl --limit-policy=ring reply-mrg42.lab.bos.redhat.com.10155.1 auto-del excl topic-mrg42.lab.bos.redhat.com.10155.1 auto-del excl --limit-policy=ring $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1 Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)') $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config queues Queue Name Attributes ================================================================= myqueue --durable qmfc-v2-hb-mrg42.lab.bos.redhat.com.10183.1 auto-del excl --limit-policy=ring qmfc-v2-mrg42.lab.bos.redhat.com.10183.1 auto-del excl qmfc-v2-ui-mrg42.lab.bos.redhat.com.10183.1 auto-del excl --limit-policy=ring reply-mrg42.lab.bos.redhat.com.10183.1 auto-del excl topic-mrg42.lab.bos.redhat.com.10183.1 auto-del excl --limit-policy=ring [kpvdr@mrg42 python]$ This shows that the broker retained the failed queue, and subsequent attempts to create it appear to succeed, however, I believe that the request is ignored - this is normal behaviour for attempting to create a queue that already exists. (Perhaps a warning would be appropriate, but this is a separate issue.) I should imagine that the behaviour of this queue is undefined, especially since the broker believes that it is durable, but no journal exists. I believe that the demonstration in comments 11 - 14 illustrates a separate bug from this one; further, I don't believe that it is a regression and has been present all along. These comments reflect a problem in the broker's handling of exceptions during management calls, and may also exist for exchange and binding creation (although this has not been demonstrated). I would recommend closing this bug and opening another on handling exceptions around management calls in the broker. Created attachment 475811 [details]
Patch to fix issues described in comments 11 - 14
With the patch in comment #15, the problem described above is now solved. After starting a fresh store/broker with --wcache-page-size 128: $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1 Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)') $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config queues Queue Name Attributes ================================================================= qmfc-v2-hb-mrg42.lab.bos.redhat.com.11233.1 auto-del excl --limit-policy=ring qmfc-v2-mrg42.lab.bos.redhat.com.11233.1 auto-del excl qmfc-v2-ui-mrg42.lab.bos.redhat.com.11233.1 auto-del excl --limit-policy=ring reply-mrg42.lab.bos.redhat.com.11233.1 auto-del excl topic-mrg42.lab.bos.redhat.com.11233.1 auto-del excl --limit-policy=ring $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1 Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)') $ /home/kpvdr/mrg/qpid/tools/src/py/qpid-config add queue myqueue --durable --file-count 4 --file-size 1 Failed: SessionException: (None, 'Cannot create store with file size less than write page cache size. [file size = 1 (64 kB); write page cache = 128 kB] (MessageStoreImpl.cpp:108)') Decision pending on whether to respin for 1.3.2 with the attached fix. Checked onto trunk as r.1064711. -> POST Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1,3 @@ -Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size.+Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size. + +Checkin r.1064711 resolves the problem that if a queue is created using these illegal parameters, the store raises an error, but the queue remains partially initialized in the broker. The issue has been fixed, tested on RHEL 4.8 / 5.6, i386 / x86_64 on packages: python-qpid-0.7.946106-15.el5 qpid-cpp-client-0.7.946106-28.el5 qpid-cpp-client-devel-0.7.946106-28.el5 qpid-cpp-client-devel-docs-0.7.946106-28.el5 qpid-cpp-client-ssl-0.7.946106-28.el5 qpid-cpp-mrg-debuginfo-0.7.946106-28.el5 qpid-cpp-server-0.7.946106-28.el5 qpid-cpp-server-cluster-0.7.946106-28.el5 qpid-cpp-server-devel-0.7.946106-28.el5 qpid-cpp-server-ssl-0.7.946106-28.el5 qpid-cpp-server-store-0.7.946106-28.el5 qpid-cpp-server-xml-0.7.946106-28.el5 qpid-java-client-0.7.946106-15.el5 qpid-java-common-0.7.946106-15.el5 qpid-java-example-0.7.946106-15.el5 qpid-tools-0.7.946106-12.el5 VERIFIED Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1,3 +1 @@ -Before this fix, it was possible to specify a cache page size that was greater than the file size. Since the file written by committing the entire cache page to disk, it was not possible to keep the file at the correct size. This is a corner case in which the file size had to be set to the smallest possible value and the page cache to the largest possible value. This fix adds a parameter check which prevents this combination from occurring, keeping the cache page size equal to or larger than the file size. +Prior to this update, running the qpidd service with the page cache size (the "wcache-page-size" option) greater than the file size could corrupt the data store. With this update, the underlying source code has been modified to reduce the page cache size to the file size, and the qpid-config utility no longer allows users to specify the "--file-size" option with a value that is smaller than the page cache size.- -Checkin r.1064711 resolves the problem that if a queue is created using these illegal parameters, the store raises an error, but the queue remains partially initialized in the broker. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0217.html |
Created attachment 462375 [details] simple python consumer Description of problem: If wcache-page-size is set to 128, and a queue is created with file-size=1 and file-count=4, the store can become corrupted, producing the following error: 2010-11-23 09:18:44 error Unexpected exception: Queue myqueue: flush() failed: jexception 0x0403 fcntl::add_wr_subm_cnt_dblks() threw JERR_FCNTL_FILEOFFSOVFL: Attempted increase file offset past file size. (pfid=1 lfid=2 wr_subm_cnt_dblks=516 incr=4 fsize=516 dblks) (MessageStoreImpl.cpp:1257) Version-Release number of selected component (if applicable): 1.3 How reproducible: Always Steps to Reproduce: 1. set wcache-page-size=128 in qpidd.conf 2. delete the datadir/store (start clean) 3. start broker 4. create queue/exchange/binding as follows: qpid-config add queue myqueue --durable --file-count 4 --file-size 1 qpid-config add exchange direct myqueue --durable qpid-config bind myqueue myqueue mybind 5. Run msg-send.py You should get an error message in the log about journal full. 6. Run msg-read.py This reads the current messages from the queue. 7. Restart the broker 8. Run msg-send.py Actual results: 2010-11-23 09:18:44 error Unexpected exception: Queue myqueue: flush() failed: jexception 0x0403 fcntl::add_wr_subm_cnt_dblks() threw JERR_FCNTL_FILEOFFSOVFL: Attempted increase file offset past file size. (pfid=1 lfid=2 wr_subm_cnt_dblks=516 incr=4 fsize=516 dblks) (MessageStoreImpl.cpp:1257) Expected results: System handles the error without corrupting the store Additional info: If you don't read the messages in step 6, then restart the broker, you just get a journal full error again but no corruption. Reading them seems to cause the problem.