Description of problem:

FAILED assert(oi.version == i->first) when trying to start an OSD that is down.

# ceph osd tree
ID  WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 71.97299 root default
 -2 23.99100     host ceph03
198  1.09050         osd.198       up  1.00000          1.00000
200  1.09050         osd.200       up  1.00000          1.00000
203  1.09050         osd.203       up  1.00000          1.00000
206  1.09050         osd.206       up  1.00000          1.00000
209  1.09050         osd.209       up  1.00000          1.00000
212  1.09050         osd.212       up  1.00000          1.00000
215  1.09050         osd.215       up  1.00000          1.00000
218  1.09050         osd.218       up  1.00000          1.00000
221  1.09050         osd.221       up  1.00000          1.00000
223  1.09050         osd.223       up  1.00000          1.00000
227  1.09050         osd.227       up  1.00000          1.00000
230  1.09050         osd.230       up  1.00000          1.00000
233  1.09050         osd.233       up  1.00000          1.00000
236  1.09050         osd.236       up  1.00000          1.00000
239  1.09050         osd.239       up  1.00000          1.00000
241  1.09050         osd.241       up  1.00000          1.00000
244  1.09050         osd.244       up  1.00000          1.00000
248  1.09050         osd.248       up  1.00000          1.00000
251  1.09050         osd.251       up  1.00000          1.00000
253  1.09050         osd.253       up  1.00000          1.00000
257  1.09050         osd.257       up  1.00000          1.00000
260  1.09050         osd.260       up  1.00000          1.00000
 -3 23.99100     host ceph02
201  1.09050         osd.201       up  1.00000          1.00000
204  1.09050         osd.204       up  1.00000          1.00000
207  1.09050         osd.207       up  1.00000          1.00000
210  1.09050         osd.210       up  1.00000          1.00000
213  1.09050         osd.213       up  1.00000          1.00000
216  1.09050         osd.216       up  1.00000          1.00000
219  1.09050         osd.219     down        0          1.00000
222  1.09050         osd.222       up  1.00000          1.00000
225  1.09050         osd.225       up  1.00000          1.00000
228  1.09050         osd.228       up  1.00000          1.00000
231  1.09050         osd.231       up  1.00000          1.00000
234  1.09050         osd.234       up  1.00000          1.00000
237  1.09050         osd.237       up  1.00000          1.00000
240  1.09050         osd.240       up  1.00000          1.00000
243  1.09050         osd.243       up  1.00000          1.00000
246  1.09050         osd.246       up  1.00000          1.00000
249  1.09050         osd.249       up  1.00000          1.00000
252  1.09050         osd.252       up  1.00000          1.00000
255  1.09050         osd.255       up  1.00000          1.00000
258  1.09050         osd.258       up  1.00000          1.00000
261  1.09050         osd.261       up  1.00000          1.00000
263  1.09050         osd.263       up  1.00000          1.00000
 -4 23.99100     host ceph01
199  1.09050         osd.199       up  1.00000          1.00000
202  1.09050         osd.202       up  1.00000          1.00000
205  1.09050         osd.205       up  1.00000          1.00000
208  1.09050         osd.208       up  1.00000          1.00000
211  1.09050         osd.211       up  1.00000          1.00000
214  1.09050         osd.214       up  1.00000          1.00000
217  1.09050         osd.217       up  1.00000          1.00000
220  1.09050         osd.220       up  1.00000          1.00000
224  1.09050         osd.224       up  1.00000          1.00000
226  1.09050         osd.226       up  1.00000          1.00000
229  1.09050         osd.229       up  1.00000          1.00000
232  1.09050         osd.232       up  1.00000          1.00000
235  1.09050         osd.235       up  1.00000          1.00000
238  1.09050         osd.238       up  1.00000          1.00000
242  1.09050         osd.242       up  1.00000          1.00000
245  1.09050         osd.245       up  1.00000          1.00000
247  1.09050         osd.247       up  1.00000          1.00000
250  1.09050         osd.250       up  1.00000          1.00000
254  1.09050         osd.254       up  1.00000          1.00000
256  1.09050         osd.256       up  1.00000          1.00000
259  1.09050         osd.259       up  1.00000          1.00000
262  1.09050         osd.262       up  1.00000          1.00000

When we try to restart the service, we see the following issue:
"
# systemctl status ceph-osd
● ceph-osd - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: activating (auto-restart) (Result: signal) since Thu 2018-05-03 07:45:09 UTC; 8s ago
  Process: 485163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 485112 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 485163 (code=killed, signal=ABRT)

May 03 07:45:09 ceph02.iaasn.ic.nis systemd[1]: Unit ceph-osd entered failed state.
May 03 07:45:09 ceph02.iaasn.ic.nis systemd[1]: ceph-osd failed.
"
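In a tree this large the single down OSD is easy to miss. A quick filter over the `ceph osd tree` output shown above can list only the down entries (a sketch, assuming the plain-text column layout above; the sample lines fed in below are illustrative):

```shell
# Print only the OSD rows whose status column is "down".
# On a live cluster you would pipe the real command instead:
#   ceph osd tree | awk '$3 ~ /^osd\./ && $4 == "down"'
# Here we feed a saved sample of the tree output for illustration.
awk '$3 ~ /^osd\./ && $4 == "down"' <<'EOF'
198  1.09050         osd.198   up 1.00000 1.00000
219  1.09050         osd.219 down       0 1.00000
222  1.09050         osd.222   up 1.00000 1.00000
EOF
```

The field positions (`$3` for the name, `$4` for the status) match the column order of `ceph osd tree` in Jewel; adjust them if your output format differs.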
" We saw the following error on Ceph OSD log (/var/log/ceph/ceph-osd.219.log) : " -13> 2018-05-03 07:45:51.191917 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[7.80(unlocked)] enter Initial -12> 2018-05-03 07:45:51.207133 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[7.80( v 4748'2348670 (3602'2343955,4748'2348670] local-les=4613 n=875 ec=1064 les/c/f 4613/4542/3609 4583/4595/4595) [211,260,219] r=2 lpr=0 pi=1064-4594/36 crt=4748'2348670 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.015215 0 0.000000 -11> 2018-05-03 07:45:51.207148 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[7.80( v 4748'2348670 (3602'2343955,4748'2348670] local-les=4613 n=875 ec=1064 les/c/f 4613/4542/3609 4583/4595/4595) [211,260,219] r=2 lpr=0 pi=1064-4594/36 crt=4748'2348670 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset -10> 2018-05-03 07:45:51.213586 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.34(unlocked)] enter Initial -9> 2018-05-03 07:45:51.226944 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.34( v 4748'6639514 (4748'6636497,4748'6639514] local-les=4742 n=1507 ec=1065 les/c/f 4742/4742/3609 4579/4739/4005) [219,218,238] r=0 lpr=0 crt=4748'6639514 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.013357 0 0.000000 -8> 2018-05-03 07:45:51.226959 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.34( v 4748'6639514 (4748'6636497,4748'6639514] local-les=4742 n=1507 ec=1065 les/c/f 4742/4742/3609 4579/4739/4005) [219,218,238] r=0 lpr=0 crt=4748'6639514 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset -7> 2018-05-03 07:45:51.227104 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.51(unlocked)] enter Initial -6> 2018-05-03 07:45:51.244545 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.51( v 4748'18032888 (4748'18029847,4748'18032888] local-les=4614 n=1460 ec=1065 les/c/f 4614/4641/3609 4583/4595/3997) [219,226,241] r=0 lpr=0 crt=4748'18032888 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017441 0 0.000000 -5> 2018-05-03 07:45:51.244564 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.51( v 4748'18032888 
(4748'18029847,4748'18032888] local-les=4614 n=1460 ec=1065 les/c/f 4614/4641/3609 4583/4595/3997) [219,226,241] r=0 lpr=0 crt=4748'18032888 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset -4> 2018-05-03 07:45:51.247288 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.9c(unlocked)] enter Initial -3> 2018-05-03 07:45:51.297356 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.9c( v 4748'5545162 (4543'5524651,4748'5545162] local-les=4615 n=1543 ec=1065 les/c/f 4615/4543/3609 4581/4614/3997) [219,198,205] r=0 lpr=0 pi=4534-4613/2 crt=4748'5545162 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.050067 0 0.000000 -2> 2018-05-03 07:45:51.297373 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.9c( v 4748'5545162 (4543'5524651,4748'5545162] local-les=4615 n=1543 ec=1065 les/c/f 4615/4543/3609 4581/4614/3997) [219,198,205] r=0 lpr=0 pi=4534-4613/2 crt=4748'5545162 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset -1> 2018-05-03 07:45:51.297708 7f81c2d4b800 5 osd.219 pg_epoch: 4748 pg[8.a0(unlocked)] enter Initial 0> 2018-05-03 07:45:51.376792 7f81c2d4b800 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, const DoutPrefixProvider*, std::set<std::basic_string<char> >*)' thread 7f81c2d4b800 time 2018-05-03 07:45:51.375136 osd/PGLog.cc: 1047: FAILED assert(oi.version == i->first) ceph version 10.2.5-37.el7cp (033f137cde8573cfc5a4662b4ed6a63b8a8d1464) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f81c3781585] 2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, 
std::allocator<std::string> >*)+0xdc7) [0x7f81c336bfa7] 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x490) [0x7f81c31c9190] 4: (OSD::load_pgs()+0x9e2) [0x7f81c30feab2] 5: (OSD::init()+0x2086) [0x7f81c3111276] 6: (main()+0x2c55) [0x7f81c3074db5] 7: (__libc_start_main()+0xf5) [0x7f81bfc5eb35] 8: (()+0x3551d7) [0x7f81c30bf1d7] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. " Version-Release number of selected component (if applicable): 10.2.5-37.el7cp How reproducible: Always Steps to Reproduce: 1. 2. 3. Actual results: FAILED assert(oi.version == i->first) when trying to start an osd which is down. Expected results: The osd should start without an assert Additional info: The issue is fixed in the upstream: http://tracker.ceph.com/issues/17916#note-21 And backported to Upstream Jewel: https://github.com/ceph/ceph/pull/14596. But this patch is not present in RHCS 10.2.5-37.
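For context, the failing check compares the version recorded in a PG log entry (`i->first`) against the version stored in the on-disk object_info (`oi.version`); a mismatch means the PG log and the object metadata disagree, and this Jewel build aborts the OSD at start-up instead of tolerating it. The following is a purely conceptual sketch of that invariant (all names are illustrative, not actual Ceph identifiers):

```python
# Conceptual sketch (NOT Ceph source code): collect log entries whose
# recorded version disagrees with the on-disk object_info version,
# i.e. the entries that would trip assert(oi.version == i->first).
def check_log_consistency(log_entries, on_disk_versions):
    """log_entries: dict mapping version -> object name, as in the
    std::map<eversion_t, hobject_t> the assert iterates over.
    on_disk_versions: dict mapping object name -> object_info version.
    Returns the (version, object) pairs that are inconsistent."""
    mismatches = []
    for version, obj in log_entries.items():
        if on_disk_versions[obj] != version:   # the failing condition
            mismatches.append((version, obj))
    return mismatches
```

In the affected build the first such mismatch aborts `OSD::init()` via the assert; per the tracker reference above, the upstream fix changes how these entries are handled so the OSD can start.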
*** This bug has been marked as a duplicate of bug 1436752 ***