Bug 1300803 - [1.3.1Async - 0.94.3-6] qemu xfs tests - osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op")
[1.3.1Async - 0.94.3-6] qemu xfs tests - osd/ReplicatedPG.cc: 2351: FAILED as...
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS (Show other bugs)
1.3.2
Unspecified Unspecified
low Severity medium
: rc
: 3.0
Assigned To: Josh Durgin
Vasu Kulkarni
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-01-21 13:32 EST by Vasu Kulkarni
Modified: 2017-12-05 18:29 EST (History)
9 users (show)

See Also:
Fixed In Version: RHEL: ceph-12.1.2-1.el7cp Ubuntu: ceph_12.1.2-2redhat1xenial
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-05 18:29:38 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 14468 None None None 2016-01-21 20:45 EST

  None (edit)
Description Vasu Kulkarni 2016-01-21 13:32:42 EST
Description of problem:
When running qemu xfs tests out of order op and crash in osd seen

Version-Release number of selected component (if applicable):
1.3.1Async - 0.94.3-6

How reproducible:
1/1

Steps to Reproduce:
1.
2.
3.

Actual results:

2016-01-19T22:14:40.782 INFO:tasks.qemu.client.0.clara005.stdout:Processing triggers for libc-bin ...
2016-01-19T22:14:40.787 INFO:tasks.qemu.client.0.clara005.stdout:ldconfig deferred processing now taking place
2016-01-19T22:14:42.120 INFO:tasks.qemu.client.0.clara005.stdout:mount.nfs: timeout set for Wed Jan 20 03:16:40 2016
2016-01-19T22:14:42.188 INFO:tasks.qemu.client.0.clara005.stdout:mount.nfs: trying text-based options 'proto=tcp,vers=4,addr=10.0.2.2,clientaddr=10.0.2.15'
2016-01-19T22:14:42.189 INFO:tasks.qemu.client.0.clara005.stdout:10.0.2.2:/export/client.0 on /mnt/log type nfs (rw,proto=tcp)
2016-01-19T22:14:42.190 INFO:tasks.qemu.client.0.clara005.stdout:mount: block device /dev/sr0 is write-protected, mounting read-only
2016-01-19T22:19:48.453 INFO:tasks.ceph.osd.3.clara006.stderr:2016-01-19 22:19:48.452608 7f0affd87700 -1 osd.3 pg_epoch: 6 pg[0.8( v 6'2220 (0'0,6'2220] local-les=4 n=76 ec=1 les/c 4/4 3/3/3) [3,5] r=0 lpr=3 crt=6'2218 lcod 6'2219 mlcod 6'2219 active+clean] bad op order, already applied 17686 > this 17682
2016-01-19T22:19:48.456 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f0affd87700 time 2016-01-19 22:19:48.452616
2016-01-19T22:19:48.457 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op")
2016-01-19T22:19:48.458 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
2016-01-19T22:19:48.459 INFO:tasks.ceph.osd.3.clara006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb21595]
2016-01-19T22:19:48.460 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb]
2016-01-19T22:19:48.461 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7]
2016-01-19T22:19:48.462 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a]
2016-01-19T22:19:48.463 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35]
2016-01-19T22:19:48.464 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0]
2016-01-19T22:19:48.465 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f]
2016-01-19T22:19:48.466 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50]
2016-01-19T22:19:48.467 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (()+0x7dc5) [0x7f0b1eabadc5]
2016-01-19T22:19:48.468 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (clone()+0x6d) [0x7f0b1d59d21d]
2016-01-19T22:19:48.469 INFO:tasks.ceph.osd.3.clara006.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-01-19T22:19:48.470 INFO:tasks.ceph.osd.3.clara006.stderr:2016-01-19 22:19:48.455393 7f0affd87700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f0affd87700 time 2016-01-19 22:19:48.452616
2016-01-19T22:19:48.471 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op")
2016-01-19T22:19:48.472 INFO:tasks.ceph.osd.3.clara006.stderr:
2016-01-19T22:19:48.473 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
2016-01-19T22:19:48.474 INFO:tasks.ceph.osd.3.clara006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb21595]
2016-01-19T22:19:48.475 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb]
2016-01-19T22:19:48.476 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7]
2016-01-19T22:19:48.477 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a]
2016-01-19T22:19:48.478 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35]
2016-01-19T22:19:48.479 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0]
2016-01-19T22:19:48.480 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f]
2016-01-19T22:19:48.481 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50]
2016-01-19T22:19:48.482 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (()+0x7dc5) [0x7f0b1eabadc5]
2016-01-19T22:19:48.483 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (clone()+0x6d) [0x7f0b1d59d21d]
2016-01-19T22:19:48.484 INFO:tasks.ceph.osd.3.clara006.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-01-19T22:19:48.485 INFO:tasks.ceph.osd.3.clara006.stderr:
2016-01-19T22:19:48.544 INFO:tasks.ceph.osd.3.clara006.stderr:    -1> 2016-01-19 22:19:48.452608 7f0affd87700 -1 osd.3 pg_epoch: 6 pg[0.8( v 6'2220 (0'0,6'2220] local-les=4 n=76 ec=1 les/c 4/4 3/3/3) [3,5] r=0 lpr=3 crt=6'2218 lcod 6'2219 mlcod 6'2219 active+clean] bad op order, already applied 17686 > this 17682
2016-01-19T22:19:48.545 INFO:tasks.ceph.osd.3.clara006.stderr:     0> 2016-01-19 22:19:48.455393 7f0affd87700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f0affd87700 time 2016-01-19 22:19:48.452616
2016-01-19T22:19:48.546 INFO:tasks.ceph.osd.3.clara006.stderr:osd/ReplicatedPG.cc: 2351: FAILED assert(0 == "out of order op")
2016-01-19T22:19:48.547 INFO:tasks.ceph.osd.3.clara006.stderr:
2016-01-19T22:19:48.548 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
2016-01-19T22:19:48.549 INFO:tasks.ceph.osd.3.clara006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb21595]
2016-01-19T22:19:48.550 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb]
2016-01-19T22:19:48.551 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7]
2016-01-19T22:19:48.552 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a]
2016-01-19T22:19:48.553 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35]
2016-01-19T22:19:48.553 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0]
2016-01-19T22:19:48.554 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f]
2016-01-19T22:19:48.555 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50]
2016-01-19T22:19:48.556 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (()+0x7dc5) [0x7f0b1eabadc5]
2016-01-19T22:19:48.557 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (clone()+0x6d) [0x7f0b1d59d21d]
2016-01-19T22:19:48.557 INFO:tasks.ceph.osd.3.clara006.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-01-19T22:19:48.558 INFO:tasks.ceph.osd.3.clara006.stderr:
2016-01-19T22:19:48.561 INFO:tasks.ceph.osd.3.clara006.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2016-01-19T22:19:48.561 INFO:tasks.ceph.osd.3.clara006.stderr:*** Caught signal (Aborted) **
2016-01-19T22:19:48.562 INFO:tasks.ceph.osd.3.clara006.stderr: in thread 7f0affd87700
2016-01-19T22:19:48.563 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
2016-01-19T22:19:48.564 INFO:tasks.ceph.osd.3.clara006.stderr: 1: ceph-osd() [0xa20512]
2016-01-19T22:19:48.565 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (()+0xf100) [0x7f0b1eac2100]
2016-01-19T22:19:48.565 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (gsignal()+0x37) [0x7f0b1d4dc5f7]
2016-01-19T22:19:48.566 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (abort()+0x148) [0x7f0b1d4ddce8]
2016-01-19T22:19:48.567 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f0b1dde09b5]
2016-01-19T22:19:48.568 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (()+0x5e926) [0x7f0b1ddde926]
2016-01-19T22:19:48.568 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (()+0x5e953) [0x7f0b1ddde953]
2016-01-19T22:19:48.569 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (()+0x5eb73) [0x7f0b1dddeb73]
2016-01-19T22:19:48.570 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb2178a]
2016-01-19T22:19:48.570 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb]
2016-01-19T22:19:48.571 INFO:tasks.ceph.osd.3.clara006.stderr: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7]
2016-01-19T22:19:48.572 INFO:tasks.ceph.osd.3.clara006.stderr: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a]
2016-01-19T22:19:48.572 INFO:tasks.ceph.osd.3.clara006.stderr: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35]
2016-01-19T22:19:48.573 INFO:tasks.ceph.osd.3.clara006.stderr: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0]
2016-01-19T22:19:48.574 INFO:tasks.ceph.osd.3.clara006.stderr: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f]
2016-01-19T22:19:48.574 INFO:tasks.ceph.osd.3.clara006.stderr: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50]
2016-01-19T22:19:48.575 INFO:tasks.ceph.osd.3.clara006.stderr: 17: (()+0x7dc5) [0x7f0b1eabadc5]
2016-01-19T22:19:48.576 INFO:tasks.ceph.osd.3.clara006.stderr: 18: (clone()+0x6d) [0x7f0b1d59d21d]
2016-01-19T22:19:48.576 INFO:tasks.ceph.osd.3.clara006.stderr:2016-01-19 22:19:48.548237 7f0affd87700 -1 *** Caught signal (Aborted) **
2016-01-19T22:19:48.577 INFO:tasks.ceph.osd.3.clara006.stderr: in thread 7f0affd87700
2016-01-19T22:19:48.578 INFO:tasks.ceph.osd.3.clara006.stderr:
2016-01-19T22:19:48.578 INFO:tasks.ceph.osd.3.clara006.stderr: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
2016-01-19T22:19:48.579 INFO:tasks.ceph.osd.3.clara006.stderr: 1: ceph-osd() [0xa20512]
2016-01-19T22:19:48.580 INFO:tasks.ceph.osd.3.clara006.stderr: 2: (()+0xf100) [0x7f0b1eac2100]
2016-01-19T22:19:48.580 INFO:tasks.ceph.osd.3.clara006.stderr: 3: (gsignal()+0x37) [0x7f0b1d4dc5f7]
2016-01-19T22:19:48.581 INFO:tasks.ceph.osd.3.clara006.stderr: 4: (abort()+0x148) [0x7f0b1d4ddce8]
2016-01-19T22:19:48.582 INFO:tasks.ceph.osd.3.clara006.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f0b1dde09b5]
2016-01-19T22:19:48.583 INFO:tasks.ceph.osd.3.clara006.stderr: 6: (()+0x5e926) [0x7f0b1ddde926]
2016-01-19T22:19:48.584 INFO:tasks.ceph.osd.3.clara006.stderr: 7: (()+0x5e953) [0x7f0b1ddde953]
2016-01-19T22:19:48.584 INFO:tasks.ceph.osd.3.clara006.stderr: 8: (()+0x5eb73) [0x7f0b1dddeb73]
2016-01-19T22:19:48.585 INFO:tasks.ceph.osd.3.clara006.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb2178a]
2016-01-19T22:19:48.586 INFO:tasks.ceph.osd.3.clara006.stderr: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1c7b) [0x873bcb]
2016-01-19T22:19:48.587 INFO:tasks.ceph.osd.3.clara006.stderr: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8782f7]
2016-01-19T22:19:48.588 INFO:tasks.ceph.osd.3.clara006.stderr: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x81053a]
2016-01-19T22:19:48.588 INFO:tasks.ceph.osd.3.clara006.stderr: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x661f35]
2016-01-19T22:19:48.589 INFO:tasks.ceph.osd.3.clara006.stderr: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x330) [0x6624a0]
2016-01-19T22:19:48.590 INFO:tasks.ceph.osd.3.clara006.stderr: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xb10d1f]
2016-01-19T22:19:48.591 INFO:tasks.ceph.osd.3.clara006.stderr: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb12e50]
2016-01-19T22:19:48.592 INFO:tasks.ceph.osd.3.clara006.stderr: 17: (()+0x7dc5) [0x7f0b1eabadc5]
2016-01-19T22:19:48.592 INFO:tasks.ceph.osd.3.clara006.stderr: 18: (clone()+0x6d) [0x7f0b1d59d21d]

Additional info:

http://magna002.ceph.redhat.com/vasu-2016-01-19_16:33:29-rbd-v0.94.3---basic-clara/211478/teuthology.log
Comment 2 Josh Durgin 2016-01-21 20:45:52 EST
Looks like an OSD bug that's always been there, but very hard to hit.
Comment 3 Samuel Just 2016-01-26 10:13:46 EST
This will be actually quite hard to fix, won't happen soon.  If people start hitting it in real life, we'll have to disable the assert which is triggering for now.
Comment 5 Samuel Just 2016-09-21 16:00:36 EDT
This has been fixed upstream for kraken, but won't be backported unless someone tells me that it's really super urgent (it's a very involved high-risk patch series).
Comment 7 Harish NV Rao 2017-07-20 14:13:18 EDT
(In reply to Samuel Just from comment #5)
> This has been fixed upstream for kraken, but won't be backported unless
> someone tells me that it's really super urgent (it's a very involved
> high-risk patch series).

@Josh, will this be in 3.0? If yes, please let us know the use cases this code affects.
Comment 8 Josh Durgin 2017-07-20 16:51:42 EDT
Yes, this is in 3.0. It affects all use cases and will crash the osd if the bug reappears. It is covered by automated tests.
Comment 11 Vasu Kulkarni 2017-10-24 13:29:50 EDT
Haven't seen this in rados regression runs, Closing this as verified for 12.2.1
Comment 14 errata-xmlrpc 2017-12-05 18:29:38 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

Note You need to log in before you can comment on or make changes to this bug.