Bug 1253486

Summary: ffsb osd thrash test - osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op")
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasu Kulkarni <vakulkar>
Component: RBDAssignee: Jason Dillaman <jdillama>
Status: CLOSED ERRATA QA Contact: Vasu Kulkarni <vakulkar>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.3.0CC: anharris, ceph-eng-bugs, dzafman, flucifre, hnallurv, jdillama, kchai, kdreyer
Target Milestone: rcKeywords: TestOnly
Target Release: 3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-26 18:16:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vasu Kulkarni 2015-08-13 19:42:48 UTC
Description of problem:

I think this is a Z-stream kernel which should be now on CDN
Linux magna079 3.10.0-229.9.1.el7.x86_64 #1 SMP Sat Jul 4 17:21:43 EDT 2015 x86_64 x86_64 x86_64 GNU/Linu

Run OSD trhash ffsb test and in 1/3 runs I am seeing the following crash


Version-Release number of selected component (if applicable):


How reproducible:
1/3

Steps to Reproduce:
1. Run Krbd: osd thrash test


Actual results:

2015-08-12T20:37:51.883 INFO:tasks.ceph.osd.4.magna078.stderr:osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op")
2015-08-12T20:37:51.884 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
2015-08-12T20:37:51.885 INFO:tasks.ceph.osd.4.magna078.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xaf6715]
2015-08-12T20:37:51.886 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24]
2015-08-12T20:37:51.887 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677]
2015-08-12T20:37:51.887 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a]
2015-08-12T20:37:51.888 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185]
2015-08-12T20:37:51.889 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9]
2015-08-12T20:37:51.890 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2]
2015-08-12T20:37:51.890 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0]
2015-08-12T20:37:51.891 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (()+0x7df5) [0x7f34a4edbdf5]
2015-08-12T20:37:51.892 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (clone()+0x6d) [0x7f34a39be1ad]
2015-08-12T20:37:51.893 INFO:tasks.ceph.osd.4.magna078.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-08-12T20:37:51.893 INFO:tasks.ceph.osd.4.magna078.stderr:2015-08-12 20:37:51.880548 7f34899e1700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f34899e1700 time 2015-08-12 20:37:51.878724
2015-08-12T20:37:51.894 INFO:tasks.ceph.osd.4.magna078.stderr:osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op")
2015-08-12T20:37:51.895 INFO:tasks.ceph.osd.4.magna078.stderr:
2015-08-12T20:37:51.896 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
2015-08-12T20:37:51.896 INFO:tasks.ceph.osd.4.magna078.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xaf6715]
2015-08-12T20:37:51.897 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24]
2015-08-12T20:37:51.898 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677]
2015-08-12T20:37:51.898 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a]
2015-08-12T20:37:51.899 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185]
2015-08-12T20:37:51.900 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9]
2015-08-12T20:37:51.901 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2]
2015-08-12T20:37:51.901 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0]
2015-08-12T20:37:51.902 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (()+0x7df5) [0x7f34a4edbdf5]
2015-08-12T20:37:51.903 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (clone()+0x6d) [0x7f34a39be1ad]
2015-08-12T20:37:51.904 INFO:tasks.ceph.osd.4.magna078.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-08-12T20:37:51.905 INFO:tasks.ceph.osd.4.magna078.stderr:
2015-08-12T20:37:51.957 INFO:tasks.ceph.osd.4.magna078.stderr:  -209> 2015-08-12 20:37:51.878713 7f34899e1700 -1 osd.4 pg_epoch: 197 pg[0.e( v 194'11837 (106'8625,194'11837] local-les=181 n=198 ec=1 les/c 181/165 180/180/173) [3,5]/[4,5] r=0 lpr=180 pi=164-179/4 rops=5 bft=3 crt=188'11821 lcod 194'11836 mlcod 194'11836 active+remapped+backfilling] bad op order, already applied 199013 > this 198728
2015-08-12T20:37:51.958 INFO:tasks.ceph.osd.4.magna078.stderr:   -21> 2015-08-12 20:37:51.880548 7f34899e1700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f34899e1700 time 2015-08-12 20:37:51.878724
2015-08-12T20:37:51.959 INFO:tasks.ceph.osd.4.magna078.stderr:osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op")
2015-08-12T20:37:51.960 INFO:tasks.ceph.osd.4.magna078.stderr:
2015-08-12T20:37:51.961 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
2015-08-12T20:37:51.962 INFO:tasks.ceph.osd.4.magna078.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xaf6715]
2015-08-12T20:37:51.962 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24]
2015-08-12T20:37:51.963 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677]
2015-08-12T20:37:51.964 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a]
2015-08-12T20:37:51.965 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185]
2015-08-12T20:37:51.966 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9]
2015-08-12T20:37:51.967 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2]
2015-08-12T20:37:51.968 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0]
2015-08-12T20:37:51.968 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (()+0x7df5) [0x7f34a4edbdf5]
2015-08-12T20:37:51.969 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (clone()+0x6d) [0x7f34a39be1ad]
2015-08-12T20:37:51.970 INFO:tasks.ceph.osd.4.magna078.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-08-12T20:37:51.971 INFO:tasks.ceph.osd.4.magna078.stderr:
2015-08-12T20:37:51.973 INFO:tasks.ceph.osd.4.magna078.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2015-08-12T20:37:51.973 INFO:tasks.ceph.osd.4.magna078.stderr:*** Caught signal (Aborted) **
2015-08-12T20:37:51.974 INFO:tasks.ceph.osd.4.magna078.stderr: in thread 7f34899e1700
2015-08-12T20:37:51.975 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
2015-08-12T20:37:51.975 INFO:tasks.ceph.osd.4.magna078.stderr: 1: ceph-osd() [0x9f63f2]
2015-08-12T20:37:51.976 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (()+0xf130) [0x7f34a4ee3130]
2015-08-12T20:37:51.977 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (gsignal()+0x37) [0x7f34a38fd5d7]
2015-08-12T20:37:51.977 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (abort()+0x148) [0x7f34a38fecc8]
2015-08-12T20:37:51.978 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f34a42019b5]
2015-08-12T20:37:51.979 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (()+0x5e926) [0x7f34a41ff926]
2015-08-12T20:37:51.979 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (()+0x5e953) [0x7f34a41ff953]
2015-08-12T20:37:51.980 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (()+0x5eb73) [0x7f34a41ffb73]
2015-08-12T20:37:51.981 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xaf690a]
2015-08-12T20:37:51.981 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24]
2015-08-12T20:37:51.982 INFO:tasks.ceph.osd.4.magna078.stderr: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677]
2015-08-12T20:37:51.983 INFO:tasks.ceph.osd.4.magna078.stderr: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a]
2015-08-12T20:37:51.983 INFO:tasks.ceph.osd.4.magna078.stderr: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185]
2015-08-12T20:37:51.984 INFO:tasks.ceph.osd.4.magna078.stderr: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9]
2015-08-12T20:37:51.985 INFO:tasks.ceph.osd.4.magna078.stderr: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2]
2015-08-12T20:37:51.985 INFO:tasks.ceph.osd.4.magna078.stderr: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0]
2015-08-12T20:37:51.986 INFO:tasks.ceph.osd.4.magna078.stderr: 17: (()+0x7df5) [0x7f34a4edbdf5]
2015-08-12T20:37:51.986 INFO:tasks.ceph.osd.4.magna078.stderr: 18: (clone()+0x6d) [0x7f34a39be1ad]
2015-08-12T20:37:51.987 INFO:tasks.ceph.osd.4.magna078.stderr:2015-08-12 20:37:51.961266 7f34899e1700 -1 *** Caught signal (Aborted) **

Expected results:


Additional info:

Comment 2 Ken Dreyer (Red Hat) 2015-12-11 22:04:48 UTC
After discussing this issue with Sam, and reviewing the upstream source, apparently this is an intentional assert in the OSD and Sam wants to keep it. It reflects a bug in kRBD.

Josh, can you fix this or re-assign as appropriate?

Comment 3 Josh Durgin 2015-12-11 22:11:39 UTC
This is should be fixed by the osd client sync Ilya's working on. It's purely a kernel client change, I'm not sure how we want to mark that here.

We hadn't been able to reproduce in a way that told us enough information to fix it in the existing kernel osd client yet.

Comment 4 Ken Dreyer (Red Hat) 2015-12-17 20:40:20 UTC
Ok, thanks. Is there a BZ tracking Ilya's osd client sync?

Comment 5 Josh Durgin 2015-12-18 00:32:13 UTC
No bz for it yet, just the upstream ticket.

Comment 7 Ken Dreyer (Red Hat) 2016-08-03 19:48:58 UTC
RHEL 7.3 kRBD bug is 1344930

Comment 8 Ken Dreyer (Red Hat) 2017-08-08 20:12:19 UTC
Sounds like we just need a re-test here with RHEL 7.4.

Comment 11 Vasu Kulkarni 2018-08-20 06:38:06 UTC
In recent regression runs this is not seen.

Comment 13 errata-xmlrpc 2018-09-26 18:16:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819