Bug 1291632
| Summary: | OSDMap Leak: OSD does not delete old OSD maps in a timely fashion | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | RADOS | Assignee: | Kefu Chai <kchai> |
| Status: | CLOSED ERRATA | QA Contact: | shylesh <shmohan> |
| Severity: | high | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | high | CC: | ceph-eng-bugs, dzafman, flucifre, hnallurv, jbautist, kchai, kdreyer, sjust, skinjo |
| Version: | 1.3.0 | Target Milestone: | rc |
| Target Release: | 1.3.3 | Hardware: | x86_64 |
| OS: | Linux | Doc Type: | Bug Fix |
| Fixed In Version: | RHEL: ceph-0.94.7-5.el7cp; Ubuntu: ceph_0.94.7-3redhat1trusty | | |
Doc Text:

.OSD now deletes old OSD maps as expected
When new OSD maps are received, the OSD daemon marks the unused OSD maps as `stale` and deletes them to keep up with the changes. Previously, an attempt to delete stale OSD maps could fail for various reasons. As a consequence, certain OSD nodes were sometimes marked as `down` if it took too long to clean their OSD map caches when booting. With this update, the OSD daemon deletes old OSD maps as expected, thus fixing this bug.
Clones: 1339061 (view as bug list)
Last Closed: 2016-09-29 12:55:22 UTC
Type: Bug
Bug Blocks: 1339061, 1348597, 1372735
Description (Vikhyat Umrao, 2015-12-15 10:57:25 UTC)
Thanks, Kefu, for working actively on this bug; changing the priority and severity fields to match the upstream tracker. Please follow the upstream tracker for more updates regarding this issue: http://tracker.ceph.com/issues/13990

---

I suspected the culprit was a large `osd_pg_epoch_persisted_max_stale`, which would let `PG::last_persisted_osdmap_ref` hold a reference to an osdmap and hence prevent the OSD from removing the old osdmaps. But the log shows:

```
Dec 16 11:09:22 str-slc-04-08 ceph-osd: 2015-12-16 11:09:22.038415 7f4c57c7a700 20 osd.846 pg_epoch: 1304111 pg[3.6cc( empty local-les=1263161 n=0 ec=380 les/c 1263161/1263161 1263160/1263160/1262473) [371,1458,846] r=2 lpr=1263160 pi=1042696-1263159/17 crt=0'0 active] handle_activate_map: Not dirtying info: last_persisted is 1303964 while current is 1304111
```

So `last_persisted_osdmap_ref` was pretty new compared to the latest map. And we can tell that the `superblock.oldest_map` of the OSD is way behind the current epoch: 1263150 versus 1304111.

```
Dec 16 11:09:22 str-slc-04-08 ceph-osd: 2015-12-16 11:09:22.030825 7f4c6148d700 10 osd.846 1304111 write_superblock sb(f3b7f409-e061-4e39-b4d0-ae380e29ae7e osd.846 3b781287-d79d-4e1b-8128-54ada841c187 e1304111 [1263150,1304111] lci=[1263146,1304111])
```

And by reading the OSDMap message from the monitor:

```
Dec 16 11:09:21 str-slc-04-08 ceph-osd: 2015-12-16 11:09:21.959067 7f4c6148d700 1 -- 10.200.20.67:6826/5913 <== mon.0 10.200.20.30:6789/0 10515 ==== osd_map(1304111..1304111 src has 1303506..1304111) v3 ==== 285396+0+0 (1443583887 0 0) 0x6d53e880 con 0x1d0b0c60
```

we know that its `m->oldest_map` is 1303506. So something must be holding an OSDMapRef of e1263150, but the question is what...
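To make the pinning mechanism concrete, here is a minimal C++ sketch, with invented names rather than Ceph's actual classes, of an epoch-keyed cache of weak references like the `weak_refs` map that shows up in the gdb session later in this bug. Any live OSDMapRef pins its epoch, and the OSD cannot delete maps past the oldest pinned epoch:

```cpp
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical sketch (invented names, not the actual Ceph code) of an
// epoch-keyed OSDMap cache held through weak_ptr. The OSD may only delete
// maps older than the lowest epoch that some live OSDMapRef still pins.
struct OSDMap {
  uint32_t epoch;
};
using OSDMapRef = std::shared_ptr<OSDMap>;

class OSDMapCache {
  std::map<uint32_t, std::weak_ptr<OSDMap>> weak_refs;  // epoch -> weak ref

public:
  // Register a newly decoded map and hand out a strong reference to it.
  OSDMapRef add(uint32_t epoch) {
    auto ref = std::make_shared<OSDMap>(OSDMap{epoch});
    weak_refs[epoch] = ref;
    return ref;
  }

  // The oldest epoch still pinned by a live reference; trimming must stop
  // here. Expired entries are pruned along the way.
  uint32_t lower_trim_bound(uint32_t current_epoch) {
    for (auto it = weak_refs.begin(); it != weak_refs.end();) {
      if (it->second.expired())
        it = weak_refs.erase(it);  // nobody holds this epoch any more
      else
        return it->first;          // oldest pinned epoch blocks the trim
    }
    return current_epoch;          // nothing pinned; free to trim up to here
  }
};
```

Under this model, a single forgotten OSDMapRef to e1263150 keeps `lower_trim_bound()` stuck at 1263150 while the cluster advances to e1304111, which matches the superblock range [1263150,1304111] above.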
---

Harish, I will try. The bug has not been RCA'ed yet, so no reproduction steps are available yet. See http://tracker.ceph.com/issues/13990.

---

Hi Kefu,

Can you please let us know whether this bug has been RCA'ed? If yes, can you please share the steps to verify the bug fix, as well as any extra configuration requirements, such as the number of OSDs running?

Regards,
Harish

---

As per discussion with Kefu and Harish, moving this out of 1.3.2. Work is in progress; for more information, please follow http://tracker.ceph.com/issues/13990.

---

Posted a fix to address the "huge build-up of OSD maps" issue at https://github.com/ceph/ceph/pull/8017. Will backport it to hammer once it gets merged into master.

---

Thanks to Vikhyat, we have a cluster of 33 OSDs and 3 monitors for reproducing this issue:

1. The cluster is running RHCS 1.3.2 (hammer).

2. Add the following settings to the global section of ceph.conf, and push it to all nodes using ceph-deploy:

   ```
   osd_pg_epoch_persisted_max_stale=10
   osd_map_cache_size=20
   osd_map_max_advance=10
   osd_mon_report_interval_min=10
   osd_pg_stat_report_interval_max=10
   paxos_service_trim_min=10
   mon_min_osdmap_epochs=20
   ```

3. Create an rbd image and 1000 snapshots of it, then remove them:

   ```
   rbd create rbd-image-0 --size 1 --pool rbd
   for i in `seq 1000`; do rbd snap create rbd/rbd-image-0@rbd-snap-$i; done
   for i in `seq 1000`; do rbd snap rm rbd/rbd-image-0@rbd-snap-$i; done
   ```

4. Check the number of osdmaps on osd.0:

   ```
   [root@dell-per630-11 ceph]# find /var/lib/ceph/osd/ceph-0/current/meta/ | wc -l
   5347
   ```

5. Check osd.0's log messages with debug-ms=1:

   ```
   2016-04-22 11:08:27.049276 7f6d05f77700 1 -- 10.74.128.33:6801/1849261 --> 10.74.128.31:6810/6579 -- osd_map(4517..4517 src has 2092..4517) v3 -- ?+0 0x7434b40 con 0xaaf0940
   ```

   Its peer was claiming to hold osdmaps 2092..4517, i.e. over 2k osdmaps in cache.

6. Check the log message on the lead mon, with debug-paxos=10:

   ```
   2016-04-22 11:02:22.146800 7fdef9f0f700 10 mon.dell-per630-8@0(leader).paxosservice(osdmap 4437..4459) maybe_trim trim_to 4439 would only trim 2 < paxos_service_trim_min 10
   ```

   So the monitor was sane: it was only holding 22 osdmaps.

We will repeat this test with the hammer backport of https://github.com/ceph/ceph/pull/8017 in this cluster.
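For readers following along, the monitor-side decision in that debug-paxos line can be paraphrased as below. This is a hedged reconstruction from the log message itself, not the literal PaxosService code:

```cpp
#include <cstdint>
#include <iostream>

// Reconstruction of the trim decision implied by the "maybe_trim" log line
// above. With the service holding osdmaps 4437..4459 and trim_to = 4439,
// only 2 epochs are trimmable, which is below paxos_service_trim_min = 10,
// so the monitor skips the trim.
bool should_trim(uint32_t first_committed, uint32_t trim_to,
                 uint32_t paxos_service_trim_min) {
  if (trim_to <= first_committed)
    return false;  // nothing to trim yet
  uint32_t to_remove = trim_to - first_committed;
  if (to_remove < paxos_service_trim_min) {
    // Matches the log: "would only trim 2 < paxos_service_trim_min 10".
    std::cout << "maybe_trim trim_to " << trim_to << " would only trim "
              << to_remove << " < paxos_service_trim_min "
              << paxos_service_trim_min << "\n";
    return false;  // batch small trims to amortize their cost
  }
  return true;
}
```

With the values from the log, `should_trim(4437, 4439, 10)` returns false, so the monitor keeps its small window of maps and trims later in a larger batch; the backlog in this reproducer was entirely on the OSD side.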
---

I can reproduce the issue even with the fix. See the log from osd.24:

```
2016-04-22 14:24:21.608162 7f3792ee6700 10 osd.24 7212 write_superblock sb(444f54b1-f97f-43d8-85b7-d5a02daac39a osd.24 ae43702f-eaa6-4235-917b-3bc4c97e4144 e7212 [4816,7212] lci=[4811,7212])
```

So it is holding over 2k osdmaps, and gdb shows that osdmap 4816 is still being held by someone:

```
(gdb) p weak_refs
$2 = std::map with 28 elements = {
  [4816] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x77b13c0, second = 0x77b13c0},
  [4825] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x77adcc0, second = 0x77adcc0},
  [4829] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x4b09cc0, second = 0x4b09cc0},
  [4831] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x4b093c0, second = 0x4b093c0},
  [4834] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x4b08640, second = 0x4b08640},
  [4853] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x77b7440, second = 0x77b7440},
  [4857] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x81ca240, second = 0x81ca240},
  [4861] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x81ca480, second = 0x81ca480},
  [7312] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x77a3f80, second = 0x77a3f80},
  [7313] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x857c240, second = 0x857c240},
  [7316] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x7bdab40, second = 0x7bdab40},
  [7317] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x7bda480, second = 0x7bda480},
  [7318] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x7922b40, second = 0x7922b40},
  [7319] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x8fea400, second = 0x8fea400},
  [7320] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x8feba80, second = 0x8feba80},
  [7321] = {first = std::tr1::weak_ptr (count 15, weak 1) 0x7c00480, second = 0x7c00480},
  [7322] = {first = std::tr1::weak_ptr (count 22, weak 1) 0x7bdbd40, second = 0x7bdbd40},
  [7323] = {first = std::tr1::weak_ptr (count 3, weak 1) 0x770f600, second = 0x770f600},
  [7324] = {first = std::tr1::weak_ptr (count 19, weak 1) 0x90f4880, second = 0x90f4880},
  [7325] = {first = std::tr1::weak_ptr (count 18, weak 1) 0x90aa000, second = 0x90aa000},
  [7326] = {first = std::tr1::weak_ptr (count 11, weak 1) 0x7a0ff80, second = 0x7a0ff80},
  [7327] = {first = std::tr1::weak_ptr (count 7, weak 1) 0x8fcd440, second = 0x8fcd440},
  [7328] = {first = std::tr1::weak_ptr (count 12, weak 1) 0x8fcf600, second = 0x8fcf600},
  [7329] = {first = std::tr1::weak_ptr (count 20, weak 1) 0x8e741c0, second = 0x8e741c0},
  [7330] = {first = std::tr1::weak_ptr (count 7, weak 1) 0x927b440, second = 0x927b440},
  [7331] = {first = std::tr1::weak_ptr (count 348, weak 1) 0x9c79840, second = 0x9c79840},
  [7332] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x857c480, second = 0x857c480},
  [7333] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x96e3200, second = 0x96e3200}}
```

---

Thanks, Kefu. Looks great; now we are close. :)

Removing needinfo from Harish, as we have set up the cluster and are able to reproduce the issue.

---

Hammer backport tracker: http://tracker.ceph.com/issues/15193

Hammer backports can be taken from the tracker given above, 15193.

Master Branch patches:

- https://github.com/ceph/ceph/pull/8017
- https://github.com/ceph/ceph/pull/8990
- https://github.com/ceph/ceph/pull/9108

---

(In reply to Vikhyat Umrao from comment #36)
> Master Branch patches :
> https://github.com/ceph/ceph/pull/8017
> https://github.com/ceph/ceph/pull/8990
> https://github.com/ceph/ceph/pull/9108

^^ These are PRs, not patches, as they each contain multiple commits.

---

The change got merged into master. The hammer backport is pending review: https://github.com/ceph/ceph/pull/9090

---

I talked with Sam regarding the impact of flapping OSDs on osdmap trimming. If we have an OSD constantly flapping, min(last-epoch-clean) will be kept from moving forward, because the PGs served by that flapping OSD are not clean, so their last-epoch-clean stays behind. The monitor therefore cannot trim the osdmaps until those PGs are back to normal.

---

Reproducer steps for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1339061#c12

---

Ken, please see https://github.com/ceph/ceph/pull/9090 and the ticket at http://tracker.ceph.com/issues/16639. You need to take 5793b13492feecad399451e3a83836722b6e9abc ("osd/OpRequest: reset connection upon unregister") out of the build, but the other 3 commits are good.

---

Followed the steps from https://bugzilla.redhat.com/show_bug.cgi?id=1291632#c27, and the number of maps didn't go beyond 50. Hence marking this bug as verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html
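As a footnote to the flapping-OSD discussion above, a minimal sketch of the constraint it describes; `osdmap_trim_floor` is a hypothetical helper, not the monitor's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper capturing the constraint described above: the monitor
// may only trim OSD maps up to the minimum last_epoch_clean over all PGs, so
// one PG stuck on a flapping OSD holds the trim floor in place cluster-wide.
uint32_t osdmap_trim_floor(const std::vector<uint32_t>& last_epoch_clean,
                           uint32_t current_epoch) {
  if (last_epoch_clean.empty())
    return current_epoch;  // no PGs: nothing pins the floor
  return *std::min_element(last_epoch_clean.begin(), last_epoch_clean.end());
}
// Example: thousands of PGs clean at e7200 but one PG stuck at e4816 on a
// flapping OSD => the floor stays at 4816 until that PG is clean again.
```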