Hide Forgot
Description of problem: When running qemu xfs tests out of order op and crash in osd seen Version-Release number of selected component (if applicable): 1.3.1Async - 0.94.3-6 How reproducible: 1/1 Steps to Reproduce: 1. 2. 3. Actual results: 2016-01-19T22:14:40.782 INFO:tasks.qemu.client.0.clara005.stdout:Processing triggers for libc-bin ... 2016-01-19T22:14:40.787 INFO:tasks.qemu.client.0.clara005.stdout:ldconfig deferred processing now taking place 2016-01-19T22:14:42.120 INFO:tasks.qemu.client.0.clara005.stdout:mount.nfs: timeout set for Wed Jan 20 03:16:40 2016 2016-01-19T22:14:42.188 INFO:tasks.qemu.client.0.clara005.stdout:mount.nfs: trying text-based options 'proto=tcp,vers=4,addr=10.0.2.2,clientaddr=10.0.2.15' 2016-01-19T22:14:42.189 INFO:tasks.qemu.client.0.clara005.stdout:10.0.2.2:/export/client.0 on /mnt/log type nfs (rw,proto=tcp) 2016-01-19T22:14:42.190 INFO:tasks.qemu.client.0.clara005.stdout:mount: block device /dev/sr0 is write-protected, mounting read-only 2016-01-19T22:19:48.453 INFO:tasks.ceph.osd.3.clara006.stderr:2016-01-19 22:19:48.452608 7f0affd87700 -1 osd.3 pg_epoch: 6 pg[0.8( v 6'2220 (0'0,6'2220] local-les=4 n=76 ec=1 les/c 4/4 3/3/3) [3,5] r=0 lpr=3 crt=6'2218 lcod 6'2219 mlcod 6'2219 active+clean] bad op order, already applied 17686 > this 17682 2016-01-19T22:19:48.456 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f0affd87700 time 2016-01-19 22:19:48.452616 2016-01-19T22:19:48.457 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op") 2016-01-19T22:19:48.458 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 2016-01-19T22:19:48.459 INFO:tasks.ceph.osd.3.clara006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb21595] 2016-01-19T22:19:48.460 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb] 2016-01-19T22:19:48.461 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7] 2016-01-19T22:19:48.462 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a] 2016-01-19T22:19:48.463 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35] 2016-01-19T22:19:48.464 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0] 2016-01-19T22:19:48.465 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f] 2016-01-19T22:19:48.466 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50] 2016-01-19T22:19:48.467 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (()+0x7dc5) [0x7f0b1eabadc5] 2016-01-19T22:19:48.468 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (clone()+0x6d) [0x7f0b1d59d21d] 2016-01-19T22:19:48.469 INFO:tasks.ceph.osd.3.clara006.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-01-19T22:19:48.470 INFO:tasks.ceph.osd.3.clara006.stderr:2016-01-19 22:19:48.455393 7f0affd87700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f0affd87700 time 2016-01-19 22:19:48.452616 2016-01-19T22:19:48.471 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op") 2016-01-19T22:19:48.472 INFO:tasks.ceph.osd.3.clara006.stderr: 2016-01-19T22:19:48.473 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 2016-01-19T22:19:48.474 INFO:tasks.ceph.osd.3.clara006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb21595] 2016-01-19T22:19:48.475 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb] 2016-01-19T22:19:48.476 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7] 2016-01-19T22:19:48.477 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a] 2016-01-19T22:19:48.478 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35] 2016-01-19T22:19:48.479 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0] 2016-01-19T22:19:48.480 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f] 2016-01-19T22:19:48.481 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50] 2016-01-19T22:19:48.482 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (()+0x7dc5) [0x7f0b1eabadc5] 2016-01-19T22:19:48.483 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (clone()+0x6d) [0x7f0b1d59d21d] 2016-01-19T22:19:48.484 INFO:tasks.ceph.osd.3.clara006.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-01-19T22:19:48.485 INFO:tasks.ceph.osd.3.clara006.stderr: 2016-01-19T22:19:48.544 INFO:tasks.ceph.osd.3.clara006.stderr: -1> 2016-01-19 22:19:48.452608 7f0affd87700 -1 osd.3 pg_epoch: 6 pg[0.8( v 6'2220 (0'0,6'2220] local-les=4 n=76 ec=1 les/c 4/4 3/3/3) [3,5] r=0 lpr=3 crt=6'2218 lcod 6'2219 mlcod 6'2219 active+clean] bad op order, already applied 17686 > this 17682 2016-01-19T22:19:48.545 INFO:tasks.ceph.osd.3.clara006.stderr: 0> 2016-01-19 22:19:48.455393 7f0affd87700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f0affd87700 time 2016-01-19 22:19:48.452616 2016-01-19T22:19:48.546 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op") 2016-01-19T22:19:48.547 INFO:tasks.ceph.osd.3.clara006.stderr: 2016-01-19T22:19:48.548 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 2016-01-19T22:19:48.549 INFO:tasks.ceph.osd.3.clara006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb21595] 2016-01-19T22:19:48.550 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb] 2016-01-19T22:19:48.551 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7] 2016-01-19T22:19:48.552 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a] 2016-01-19T22:19:48.553 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35] 2016-01-19T22:19:48.553 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0] 2016-01-19T22:19:48.554 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f] 2016-01-19T22:19:48.555 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50] 2016-01-19T22:19:48.556 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (()+0x7dc5) [0x7f0b1eabadc5] 2016-01-19T22:19:48.557 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (clone()+0x6d) [0x7f0b1d59d21d] 2016-01-19T22:19:48.557 INFO:tasks.ceph.osd.3.clara006.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-01-19T22:19:48.558 INFO:tasks.ceph.osd.3.clara006.stderr: 2016-01-19T22:19:48.561 INFO:tasks.ceph.osd.3.clara006.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion' 2016-01-19T22:19:48.561 INFO:tasks.ceph.osd.3.clara006.stderr:*** Caught signal (Aborted) ** 2016-01-19T22:19:48.562 INFO:tasks.ceph.osd.3.clara006.stderr: in thread 7f0affd87700 2016-01-19T22:19:48.563 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 2016-01-19T22:19:48.564 INFO:tasks.ceph.osd.3.clara006.stderr: 1: ceph-osd() [0xa20512] 2016-01-19T22:19:48.565 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (()+0xf100) [0x7f0b1eac2100] 2016-01-19T22:19:48.565 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (gsignal()+0x37) [0x7f0b1d4dc5f7] 2016-01-19T22:19:48.566 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (abort()+0x148) [0x7f0b1d4ddce8] 2016-01-19T22:19:48.567 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f0b1dde09b5] 2016-01-19T22:19:48.568 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (()+0x5e926) [0x7f0b1ddde926] 2016-01-19T22:19:48.568 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (()+0x5e953) [0x7f0b1ddde953] 2016-01-19T22:19:48.569 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (()+0x5eb73) [0x7f0b1dddeb73] 2016-01-19T22:19:48.570 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb2178a] 2016-01-19T22:19:48.570 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb] 2016-01-19T22:19:48.571 INFO:tasks.ceph.osd.3.clara006.stderr: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7] 2016-01-19T22:19:48.572 INFO:tasks.ceph.osd.3.clara006.stderr: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a] 2016-01-19T22:19:48.572 INFO:tasks.ceph.osd.3.clara006.stderr: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35] 2016-01-19T22:19:48.573 INFO:tasks.ceph.osd.3.clara006.stderr: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0] 2016-01-19T22:19:48.574 INFO:tasks.ceph.osd.3.clara006.stderr: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f] 2016-01-19T22:19:48.574 INFO:tasks.ceph.osd.3.clara006.stderr: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50] 2016-01-19T22:19:48.575 INFO:tasks.ceph.osd.3.clara006.stderr: 17: (()+0x7dc5) [0x7f0b1eabadc5] 2016-01-19T22:19:48.576 INFO:tasks.ceph.osd.3.clara006.stderr: 18: (clone()+0x6d) [0x7f0b1d59d21d] 2016-01-19T22:19:48.576 INFO:tasks.ceph.osd.3.clara006.stderr:2016-01-19 22:19:48.548237 7f0affd87700 -1 *** Caught signal (Aborted) ** 2016-01-19T22:19:48.577 INFO:tasks.ceph.osd.3.clara006.stderr: in thread 7f0affd87700 2016-01-19T22:19:48.578 INFO:tasks.ceph.osd.3.clara006.stderr: 2016-01-19T22:19:48.578 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 2016-01-19T22:19:48.579 INFO:tasks.ceph.osd.3.clara006.stderr: 1: ceph-osd() [0xa20512] 2016-01-19T22:19:48.580 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (()+0xf100) [0x7f0b1eac2100] 2016-01-19T22:19:48.580 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (gsignal()+0x37) [0x7f0b1d4dc5f7] 2016-01-19T22:19:48.581 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (abort()+0x148) [0x7f0b1d4ddce8] 2016-01-19T22:19:48.582 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f0b1dde09b5] 2016-01-19T22:19:48.583 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (()+0x5e926) [0x7f0b1ddde926] 2016-01-19T22:19:48.584 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (()+0x5e953) [0x7f0b1ddde953] 2016-01-19T22:19:48.584 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (()+0x5eb73) [0x7f0b1dddeb73] 2016-01-19T22:19:48.585 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb2178a] 2016-01-19T22:19:48.586 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb] 2016-01-19T22:19:48.587 INFO:tasks.ceph.osd.3.clara006.stderr: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7] 2016-01-19T22:19:48.588 INFO:tasks.ceph.osd.3.clara006.stderr: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a] 2016-01-19T22:19:48.588 INFO:tasks.ceph.osd.3.clara006.stderr: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35] 2016-01-19T22:19:48.589 INFO:tasks.ceph.osd.3.clara006.stderr: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0] 2016-01-19T22:19:48.590 INFO:tasks.ceph.osd.3.clara006.stderr: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f] 2016-01-19T22:19:48.591 INFO:tasks.ceph.osd.3.clara006.stderr: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50] 2016-01-19T22:19:48.592 INFO:tasks.ceph.osd.3.clara006.stderr: 17: (()+0x7dc5) [0x7f0b1eabadc5] 2016-01-19T22:19:48.592 INFO:tasks.ceph.osd.3.clara006.stderr: 18: (clone()+0x6d) [0x7f0b1d59d21d] Additional info: http://magna002.ceph.redhat.com/vasu-2016-01-19_16:33:29-rbd-v0.94.3---basic-clara/211478/teuthology.log
Looks like an OSD bug that's always been there, but very hard to hit.
This will be actually quite hard to fix, won't happen soon. If people start hitting it in real life, we'll have to disable the assert which is triggering for now.
This has been fixed upstream for kraken, but won't be backported unless someone tells me that it's really super urgent (it's a very involved high-risk patch series).
(In reply to Samuel Just from comment #5) > This has been fixed upstream for kraken, but won't be backported unless > someone tells me that it's really super urgent (it's a very involved > high-risk patch series). @Josh, will this be in 3.0? If yes, please let us know the use cases this code affects.
Yes, this is in 3.0. It affects all use cases and will crash the osd if the bug reappears. It is covered by automated tests.
Haven't seen this in rados regression runs, Closing this as verified for 12.2.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387