Bug 1574556

Summary: FAILED assert(oi.version == i->first)
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RADOS
Version: 2.5
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: 3.*
Keywords: Reopened
Reporter: Jos Collin <jcollin>
Assignee: Josh Durgin <jdurgin>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, dzafman, kchai
Type: Bug
Last Closed: 2018-05-15 06:58:12 UTC

Description Jos Collin 2018-05-03 14:36:50 UTC
Description of problem:
FAILED assert(oi.version == i->first) when trying to start an OSD that is down.

 ceph osd tree
ID  WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 71.97299 root default
 -2 23.99100     host ceph03
198  1.09050         osd.198      up  1.00000          1.00000
200  1.09050         osd.200      up  1.00000          1.00000
203  1.09050         osd.203      up  1.00000          1.00000
206  1.09050         osd.206      up  1.00000          1.00000
209  1.09050         osd.209      up  1.00000          1.00000
212  1.09050         osd.212      up  1.00000          1.00000
215  1.09050         osd.215      up  1.00000          1.00000
218  1.09050         osd.218      up  1.00000          1.00000
221  1.09050         osd.221      up  1.00000          1.00000
223  1.09050         osd.223      up  1.00000          1.00000
227  1.09050         osd.227      up  1.00000          1.00000
230  1.09050         osd.230      up  1.00000          1.00000
233  1.09050         osd.233      up  1.00000          1.00000
236  1.09050         osd.236      up  1.00000          1.00000
239  1.09050         osd.239      up  1.00000          1.00000
241  1.09050         osd.241      up  1.00000          1.00000
244  1.09050         osd.244      up  1.00000          1.00000
248  1.09050         osd.248      up  1.00000          1.00000
251  1.09050         osd.251      up  1.00000          1.00000
253  1.09050         osd.253      up  1.00000          1.00000
257  1.09050         osd.257      up  1.00000          1.00000
260  1.09050         osd.260      up  1.00000          1.00000
 -3 23.99100     host ceph02
201  1.09050         osd.201      up  1.00000          1.00000
204  1.09050         osd.204      up  1.00000          1.00000
207  1.09050         osd.207      up  1.00000          1.00000
210  1.09050         osd.210      up  1.00000          1.00000
213  1.09050         osd.213      up  1.00000          1.00000
216  1.09050         osd.216      up  1.00000          1.00000
219  1.09050         osd.219    down        0          1.00000
222  1.09050         osd.222      up  1.00000          1.00000
225  1.09050         osd.225      up  1.00000          1.00000
228  1.09050         osd.228      up  1.00000          1.00000
231  1.09050         osd.231      up  1.00000          1.00000
234  1.09050         osd.234      up  1.00000          1.00000
237  1.09050         osd.237      up  1.00000          1.00000
240  1.09050         osd.240      up  1.00000          1.00000
243  1.09050         osd.243      up  1.00000          1.00000
246  1.09050         osd.246      up  1.00000          1.00000
249  1.09050         osd.249      up  1.00000          1.00000
252  1.09050         osd.252      up  1.00000          1.00000
255  1.09050         osd.255      up  1.00000          1.00000
258  1.09050         osd.258      up  1.00000          1.00000
261  1.09050         osd.261      up  1.00000          1.00000
263  1.09050         osd.263      up  1.00000          1.00000
 -4 23.99100     host ceph01
199  1.09050         osd.199      up  1.00000          1.00000
202  1.09050         osd.202      up  1.00000          1.00000
205  1.09050         osd.205      up  1.00000          1.00000
208  1.09050         osd.208      up  1.00000          1.00000
211  1.09050         osd.211      up  1.00000          1.00000
214  1.09050         osd.214      up  1.00000          1.00000
217  1.09050         osd.217      up  1.00000          1.00000
220  1.09050         osd.220      up  1.00000          1.00000
224  1.09050         osd.224      up  1.00000          1.00000
226  1.09050         osd.226      up  1.00000          1.00000
229  1.09050         osd.229      up  1.00000          1.00000
232  1.09050         osd.232      up  1.00000          1.00000
235  1.09050         osd.235      up  1.00000          1.00000
238  1.09050         osd.238      up  1.00000          1.00000
242  1.09050         osd.242      up  1.00000          1.00000
245  1.09050         osd.245      up  1.00000          1.00000
247  1.09050         osd.247      up  1.00000          1.00000
250  1.09050         osd.250      up  1.00000          1.00000
254  1.09050         osd.254      up  1.00000          1.00000
256  1.09050         osd.256      up  1.00000          1.00000
259  1.09050         osd.259      up  1.00000          1.00000
262  1.09050         osd.262      up  1.00000          1.00000

When we try to restart the service, we see the following issue:
"
systemctl status ceph-osd
● ceph-osd - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: activating (auto-restart) (Result: signal) since Thu 2018-05-03 07:45:09 UTC; 8s ago
  Process: 485163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 485112 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 485163 (code=killed, signal=ABRT)

May 03 07:45:09 ceph02.iaasn.ic.nis systemd[1]: Unit ceph-osd entered failed state.
May 03 07:45:09 ceph02.iaasn.ic.nis systemd[1]: ceph-osd failed.
"

We saw the following error in the Ceph OSD log (/var/log/ceph/ceph-osd.219.log):
"
   -13> 2018-05-03 07:45:51.191917 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[7.80(unlocked)] enter Initial
   -12> 2018-05-03 07:45:51.207133 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[7.80( v 4748'2348670 (3602'2343955,4748'2348670] local-les=4613 n=875 ec=1064 les/c/f 4613/4542/3609 4583/4595/4595) [211,260,219] r=2 lpr=0 pi=1064-4594/36 crt=4748'2348670 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.015215 0 0.000000
   -11> 2018-05-03 07:45:51.207148 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[7.80( v 4748'2348670 (3602'2343955,4748'2348670] local-les=4613 n=875 ec=1064 les/c/f 4613/4542/3609 4583/4595/4595) [211,260,219] r=2 lpr=0 pi=1064-4594/36 crt=4748'2348670 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
   -10> 2018-05-03 07:45:51.213586 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.34(unlocked)] enter Initial
    -9> 2018-05-03 07:45:51.226944 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.34( v 4748'6639514 (4748'6636497,4748'6639514] local-les=4742 n=1507 ec=1065 les/c/f 4742/4742/3609 4579/4739/4005) [219,218,238] r=0 lpr=0 crt=4748'6639514 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.013357 0 0.000000
    -8> 2018-05-03 07:45:51.226959 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.34( v 4748'6639514 (4748'6636497,4748'6639514] local-les=4742 n=1507 ec=1065 les/c/f 4742/4742/3609 4579/4739/4005) [219,218,238] r=0 lpr=0 crt=4748'6639514 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -7> 2018-05-03 07:45:51.227104 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.51(unlocked)] enter Initial
    -6> 2018-05-03 07:45:51.244545 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.51( v 4748'18032888 (4748'18029847,4748'18032888] local-les=4614 n=1460 ec=1065 les/c/f 4614/4641/3609 4583/4595/3997) [219,226,241] r=0 lpr=0 crt=4748'18032888 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017441 0 0.000000
    -5> 2018-05-03 07:45:51.244564 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.51( v 4748'18032888 (4748'18029847,4748'18032888] local-les=4614 n=1460 ec=1065 les/c/f 4614/4641/3609 4583/4595/3997) [219,226,241] r=0 lpr=0 crt=4748'18032888 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -4> 2018-05-03 07:45:51.247288 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.9c(unlocked)] enter Initial
    -3> 2018-05-03 07:45:51.297356 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.9c( v 4748'5545162 (4543'5524651,4748'5545162] local-les=4615 n=1543 ec=1065 les/c/f 4615/4543/3609 4581/4614/3997) [219,198,205] r=0 lpr=0 pi=4534-4613/2 crt=4748'5545162 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.050067 0 0.000000
    -2> 2018-05-03 07:45:51.297373 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.9c( v 4748'5545162 (4543'5524651,4748'5545162] local-les=4615 n=1543 ec=1065 les/c/f 4615/4543/3609 4581/4614/3997) [219,198,205] r=0 lpr=0 pi=4534-4613/2 crt=4748'5545162 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -1> 2018-05-03 07:45:51.297708 7f81c2d4b800  5 osd.219 pg_epoch: 4748 pg[8.a0(unlocked)] enter Initial
     0> 2018-05-03 07:45:51.376792 7f81c2d4b800 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, const DoutPrefixProvider*, std::set<std::basic_string<char> >*)' thread 7f81c2d4b800 time 2018-05-03 07:45:51.375136
osd/PGLog.cc: 1047: FAILED assert(oi.version == i->first)

 ceph version 10.2.5-37.el7cp (033f137cde8573cfc5a4662b4ed6a63b8a8d1464)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f81c3781585]
 2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xdc7) [0x7f81c336bfa7]
 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x490) [0x7f81c31c9190]
 4: (OSD::load_pgs()+0x9e2) [0x7f81c30feab2]
 5: (OSD::init()+0x2086) [0x7f81c3111276]
 6: (main()+0x2c55) [0x7f81c3074db5]
 7: (__libc_start_main()+0xf5) [0x7f81bfc5eb35]
 8: (()+0x3551d7) [0x7f81c30bf1d7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
"

Version-Release number of selected component (if applicable):
10.2.5-37.el7cp

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
FAILED assert(oi.version == i->first) when trying to start an OSD that is down.

Expected results:
The OSD should start without hitting the assert.

Additional info:
The issue is fixed upstream: http://tracker.ceph.com/issues/17916#note-21
and has been backported to upstream Jewel: https://github.com/ceph/ceph/pull/14596.

However, this patch is not present in RHCS 10.2.5-37.

Comment 13 Jos Collin 2018-05-15 06:58:12 UTC

*** This bug has been marked as a duplicate of bug 1436752 ***