Description of problem: I think this is a Z-stream kernel which should be now on CDN Linux magna079 3.10.0-229.9.1.el7.x86_64 #1 SMP Sat Jul 4 17:21:43 EDT 2015 x86_64 x86_64 x86_64 GNU/Linu Run OSD trhash ffsb test and in 1/3 runs I am seeing the following crash Version-Release number of selected component (if applicable): How reproducible: 1/3 Steps to Reproduce: 1. Run Krbd: osd thrash test Actual results: 2015-08-12T20:37:51.883 INFO:tasks.ceph.osd.4.magna078.stderr:osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op") 2015-08-12T20:37:51.884 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 2015-08-12T20:37:51.885 INFO:tasks.ceph.osd.4.magna078.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xaf6715] 2015-08-12T20:37:51.886 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24] 2015-08-12T20:37:51.887 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677] 2015-08-12T20:37:51.887 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a] 2015-08-12T20:37:51.888 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185] 2015-08-12T20:37:51.889 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9] 2015-08-12T20:37:51.890 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2] 2015-08-12T20:37:51.890 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0] 2015-08-12T20:37:51.891 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (()+0x7df5) [0x7f34a4edbdf5] 2015-08-12T20:37:51.892 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (clone()+0x6d) [0x7f34a39be1ad] 2015-08-12T20:37:51.893 INFO:tasks.ceph.osd.4.magna078.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2015-08-12T20:37:51.893 INFO:tasks.ceph.osd.4.magna078.stderr:2015-08-12 20:37:51.880548 7f34899e1700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f34899e1700 time 2015-08-12 20:37:51.878724 2015-08-12T20:37:51.894 INFO:tasks.ceph.osd.4.magna078.stderr:osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op") 2015-08-12T20:37:51.895 INFO:tasks.ceph.osd.4.magna078.stderr: 2015-08-12T20:37:51.896 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 2015-08-12T20:37:51.896 INFO:tasks.ceph.osd.4.magna078.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xaf6715] 2015-08-12T20:37:51.897 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24] 2015-08-12T20:37:51.898 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677] 2015-08-12T20:37:51.898 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a] 2015-08-12T20:37:51.899 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185] 2015-08-12T20:37:51.900 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9] 2015-08-12T20:37:51.901 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2] 2015-08-12T20:37:51.901 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0] 2015-08-12T20:37:51.902 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (()+0x7df5) [0x7f34a4edbdf5] 2015-08-12T20:37:51.903 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (clone()+0x6d) [0x7f34a39be1ad] 2015-08-12T20:37:51.904 INFO:tasks.ceph.osd.4.magna078.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2015-08-12T20:37:51.905 INFO:tasks.ceph.osd.4.magna078.stderr: 2015-08-12T20:37:51.957 INFO:tasks.ceph.osd.4.magna078.stderr: -209> 2015-08-12 20:37:51.878713 7f34899e1700 -1 osd.4 pg_epoch: 197 pg[0.e( v 194'11837 (106'8625,194'11837] local-les=181 n=198 ec=1 les/c 181/165 180/180/173) [3,5]/[4,5] r=0 lpr=180 pi=164-179/4 rops=5 bft=3 crt=188'11821 lcod 194'11836 mlcod 194'11836 active+remapped+backfilling] bad op order, already applied 199013 > this 198728 2015-08-12T20:37:51.958 INFO:tasks.ceph.osd.4.magna078.stderr: -21> 2015-08-12 20:37:51.880548 7f34899e1700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7f34899e1700 time 2015-08-12 20:37:51.878724 2015-08-12T20:37:51.959 INFO:tasks.ceph.osd.4.magna078.stderr:osd/ReplicatedPG.cc: 2348: FAILED assert(0 == "out of order op") 2015-08-12T20:37:51.960 INFO:tasks.ceph.osd.4.magna078.stderr: 2015-08-12T20:37:51.961 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 2015-08-12T20:37:51.962 INFO:tasks.ceph.osd.4.magna078.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xaf6715] 2015-08-12T20:37:51.962 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24] 2015-08-12T20:37:51.963 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677] 2015-08-12T20:37:51.964 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a] 2015-08-12T20:37:51.965 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185] 2015-08-12T20:37:51.966 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9] 2015-08-12T20:37:51.967 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2] 2015-08-12T20:37:51.968 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0] 2015-08-12T20:37:51.968 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (()+0x7df5) [0x7f34a4edbdf5] 2015-08-12T20:37:51.969 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (clone()+0x6d) [0x7f34a39be1ad] 2015-08-12T20:37:51.970 INFO:tasks.ceph.osd.4.magna078.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2015-08-12T20:37:51.971 INFO:tasks.ceph.osd.4.magna078.stderr: 2015-08-12T20:37:51.973 INFO:tasks.ceph.osd.4.magna078.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion' 2015-08-12T20:37:51.973 INFO:tasks.ceph.osd.4.magna078.stderr:*** Caught signal (Aborted) ** 2015-08-12T20:37:51.974 INFO:tasks.ceph.osd.4.magna078.stderr: in thread 7f34899e1700 2015-08-12T20:37:51.975 INFO:tasks.ceph.osd.4.magna078.stderr: ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 2015-08-12T20:37:51.975 INFO:tasks.ceph.osd.4.magna078.stderr: 1: ceph-osd() [0x9f63f2] 2015-08-12T20:37:51.976 INFO:tasks.ceph.osd.4.magna078.stderr: 2: (()+0xf130) [0x7f34a4ee3130] 2015-08-12T20:37:51.977 INFO:tasks.ceph.osd.4.magna078.stderr: 3: (gsignal()+0x37) [0x7f34a38fd5d7] 2015-08-12T20:37:51.977 INFO:tasks.ceph.osd.4.magna078.stderr: 4: (abort()+0x148) [0x7f34a38fecc8] 2015-08-12T20:37:51.978 INFO:tasks.ceph.osd.4.magna078.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f34a42019b5] 2015-08-12T20:37:51.979 INFO:tasks.ceph.osd.4.magna078.stderr: 6: (()+0x5e926) [0x7f34a41ff926] 2015-08-12T20:37:51.979 INFO:tasks.ceph.osd.4.magna078.stderr: 7: (()+0x5e953) [0x7f34a41ff953] 2015-08-12T20:37:51.980 INFO:tasks.ceph.osd.4.magna078.stderr: 8: (()+0x5eb73) [0x7f34a41ffb73] 2015-08-12T20:37:51.981 INFO:tasks.ceph.osd.4.magna078.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xaf690a] 2015-08-12T20:37:51.981 INFO:tasks.ceph.osd.4.magna078.stderr: 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1914) [0x84dd24] 2015-08-12T20:37:51.982 INFO:tasks.ceph.osd.4.magna078.stderr: 11: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x852677] 2015-08-12T20:37:51.983 INFO:tasks.ceph.osd.4.magna078.stderr: 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x7ee58a] 2015-08-12T20:37:51.983 INFO:tasks.ceph.osd.4.magna078.stderr: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x652185] 2015-08-12T20:37:51.984 INFO:tasks.ceph.osd.4.magna078.stderr: 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6526a9] 2015-08-12T20:37:51.985 INFO:tasks.ceph.osd.4.magna078.stderr: 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x882) [0xae5ec2] 2015-08-12T20:37:51.985 INFO:tasks.ceph.osd.4.magna078.stderr: 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xae7fe0] 2015-08-12T20:37:51.986 INFO:tasks.ceph.osd.4.magna078.stderr: 17: (()+0x7df5) [0x7f34a4edbdf5] 2015-08-12T20:37:51.986 INFO:tasks.ceph.osd.4.magna078.stderr: 18: (clone()+0x6d) [0x7f34a39be1ad] 2015-08-12T20:37:51.987 INFO:tasks.ceph.osd.4.magna078.stderr:2015-08-12 20:37:51.961266 7f34899e1700 -1 *** Caught signal (Aborted) ** Expected results: Additional info:
After discussing this issue with Sam, and reviewing the upstream source, apparently this is an intentional assert in the OSD and Sam wants to keep it. It reflects a bug in kRBD. Josh, can you fix this or re-assign as appropriate?
This is should be fixed by the osd client sync Ilya's working on. It's purely a kernel client change, I'm not sure how we want to mark that here. We hadn't been able to reproduce in a way that told us enough information to fix it in the existing kernel osd client yet.
Ok, thanks. Is there a BZ tracking Ilya's osd client sync?
No bz for it yet, just the upstream ticket.
RHEL 7.3 kRBD bug is 1344930
Sounds like we just need a re-test here with RHEL 7.4.
In recent regression runs this is not seen.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819