Bug 1291632
| Summary: | OSDMap Leak: OSD does not delete old OSD maps in a timely fashion | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | RADOS | Assignee: | Kefu Chai <kchai> |
| Status: | CLOSED ERRATA | QA Contact: | shylesh <shmohan> |
| Severity: | high | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | high | CC: | ceph-eng-bugs, dzafman, flucifre, hnallurv, jbautist, kchai, kdreyer, sjust, skinjo |
| Version: | 1.3.0 | Target Milestone: | rc |
| Target Release: | 1.3.3 | Hardware: | x86_64 |
| OS: | Linux | Doc Type: | Bug Fix |
| Fixed In Version: | RHEL: ceph-0.94.7-5.el7cp; Ubuntu: ceph_0.94.7-3redhat1trusty | | |
Doc Text:

.OSD now deletes old OSD maps as expected
When new OSD maps are received, the OSD daemon marks the unused OSD maps as `stale` and deletes them to keep up with the changes. Previously, an attempt to delete stale OSD maps could fail for various reasons. As a consequence, certain OSD nodes were sometimes marked as `down` if it took too long to clean their OSD map caches when booting. With this update, the OSD daemon deletes old OSD maps as expected, thus fixing this bug.
Clones: 1339061 (view as bug list)
Last Closed: 2016-09-29 12:55:22 UTC
Type: Bug
Bug Blocks: 1339061, 1348597, 1372735
Description (Vikhyat Umrao, 2015-12-15 10:57:25 UTC)
Thanks, Kefu, for working actively on this bug; changing the priority and severity fields to match the upstream tracker. Please follow the upstream tracker for more updates regarding this issue: http://tracker.ceph.com/issues/13990

---

I suspected the culprit was a large `osd_pg_epoch_persisted_max_stale`, which would let `PG::last_persisted_osdmap_ref` hold a reference to an osdmap and hence prevent the OSD from removing the old osdmaps. But the log shows:

```
Dec 16 11:09:22 str-slc-04-08 ceph-osd: 2015-12-16 11:09:22.038415 7f4c57c7a700 20 osd.846 pg_epoch: 1304111 pg[3.6cc( empty local-les=1263161 n=0 ec=380 les/c 1263161/1263161 1263160/1263160/1262473) [371,1458,846] r=2 lpr=1263160 pi=1042696-1263159/17 crt=0'0 active] handle_activate_map: Not dirtying info: last_persisted is 1303964 while current is 1304111
```

So `last_persisted_osdmap_ref` was pretty new compared to the latest map. And we can tell that the `superblock.oldest_map` of the OSD is way behind the current epoch: 1263150 versus 1304111.

```
Dec 16 11:09:22 str-slc-04-08 ceph-osd: 2015-12-16 11:09:22.030825 7f4c6148d700 10 osd.846 1304111 write_superblock sb(f3b7f409-e061-4e39-b4d0-ae380e29ae7e osd.846 3b781287-d79d-4e1b-8128-54ada841c187 e1304111 [1263150,1304111] lci=[1263146,1304111])
```

And by reading the OSDMap message from the monitor:

```
Dec 16 11:09:21 str-slc-04-08 ceph-osd: 2015-12-16 11:09:21.959067 7f4c6148d700 1 -- 10.200.20.67:6826/5913 <== mon.0 10.200.20.30:6789/0 10515 ==== osd_map(1304111..1304111 src has 1303506..1304111) v3 ==== 285396+0+0 (1443583887 0 0) 0x6d53e880 con 0x1d0b0c60
```

we know that its `m->oldest_map` is 1303506. So something must be holding an OSDMapRef of e1263150, but the question is what...
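To make the pinning mechanism concrete, here is a minimal C++ sketch, with invented names rather than Ceph's actual classes, of an epoch-keyed cache of weak references like the `weak_refs` map that shows up in the gdb session later in this bug. Any live OSDMapRef pins its epoch, and the OSD cannot delete maps past the oldest pinned epoch:

```cpp
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical sketch (invented names, not the actual Ceph code) of an
// epoch-keyed OSDMap cache held through weak_ptr. The OSD may only delete
// maps older than the lowest epoch that some live OSDMapRef still pins.
struct OSDMap {
  uint32_t epoch;
};
using OSDMapRef = std::shared_ptr<OSDMap>;

class OSDMapCache {
  std::map<uint32_t, std::weak_ptr<OSDMap>> weak_refs;  // epoch -> weak ref

public:
  // Register a newly decoded map and hand out a strong reference to it.
  OSDMapRef add(uint32_t epoch) {
    auto ref = std::make_shared<OSDMap>(OSDMap{epoch});
    weak_refs[epoch] = ref;
    return ref;
  }

  // The oldest epoch still pinned by a live reference; trimming must stop
  // here. Expired entries are pruned along the way.
  uint32_t lower_trim_bound(uint32_t current_epoch) {
    for (auto it = weak_refs.begin(); it != weak_refs.end();) {
      if (it->second.expired())
        it = weak_refs.erase(it);  // nobody holds this epoch any more
      else
        return it->first;          // oldest pinned epoch blocks the trim
    }
    return current_epoch;          // nothing pinned; free to trim up to here
  }
};
```

Under this model, a single forgotten OSDMapRef to e1263150 keeps `lower_trim_bound()` stuck at 1263150 while the cluster advances to e1304111, which matches the superblock range [1263150,1304111] above.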
---

Harish, I will try. The bug has not been RCA'ed yet, so no reproduction steps are available yet. See http://tracker.ceph.com/issues/13990.

---

Hi Kefu,

Can you please let us know whether this bug has been RCA'ed? If yes, can you please share the steps to verify the bug fix, as well as any extra configuration requirements, such as the number of OSDs running?

Regards,
Harish

---

As per discussion with Kefu and Harish, moving this out of 1.3.2. Work is in progress; for more information, please follow http://tracker.ceph.com/issues/13990.

---

Posted a fix to address the "huge build-up of OSD maps" issue at https://github.com/ceph/ceph/pull/8017. Will backport it to hammer once it gets merged into master.

---

Thanks to Vikhyat, we have a cluster of 33 OSDs and 3 monitors for reproducing this issue:

1. The cluster is running RHCS 1.3.2 (hammer).

2. Add the following settings to the global section of ceph.conf, and push it to all nodes using ceph-deploy:

   ```
   osd_pg_epoch_persisted_max_stale=10
   osd_map_cache_size=20
   osd_map_max_advance=10
   osd_mon_report_interval_min=10
   osd_pg_stat_report_interval_max=10
   paxos_service_trim_min=10
   mon_min_osdmap_epochs=20
   ```

3. Create an rbd image and 1000 snapshots of it, then remove them:

   ```
   rbd create rbd-image-0 --size 1 --pool rbd
   for i in `seq 1000`; do rbd snap create rbd/rbd-image-0@rbd-snap-$i; done
   for i in `seq 1000`; do rbd snap rm rbd/rbd-image-0@rbd-snap-$i; done
   ```

4. Check the number of osdmaps on osd.0:

   ```
   [root@dell-per630-11 ceph]# find /var/lib/ceph/osd/ceph-0/current/meta/ | wc -l
   5347
   ```

5. Check osd.0's log messages with debug-ms=1:

   ```
   2016-04-22 11:08:27.049276 7f6d05f77700 1 -- 10.74.128.33:6801/1849261 --> 10.74.128.31:6810/6579 -- osd_map(4517..4517 src has 2092..4517) v3 -- ?+0 0x7434b40 con 0xaaf0940
   ```

   Its peer was claiming to hold osdmaps 2092..4517, i.e. over 2k osdmaps in cache.

6. Check the log message on the lead mon, with debug-paxos=10:

   ```
   2016-04-22 11:02:22.146800 7fdef9f0f700 10 mon.dell-per630-8@0(leader).paxosservice(osdmap 4437..4459) maybe_trim trim_to 4439 would only trim 2 < paxos_service_trim_min 10
   ```

   So the monitor was sane: it was only holding 22 osdmaps.

We will repeat this test with the hammer backport of https://github.com/ceph/ceph/pull/8017 in this cluster.
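For readers following along, the monitor-side decision in that debug-paxos line can be paraphrased as below. This is a hedged reconstruction from the log message itself, not the literal PaxosService code:

```cpp
#include <cstdint>
#include <iostream>

// Reconstruction of the trim decision implied by the "maybe_trim" log line
// above. With the service holding osdmaps 4437..4459 and trim_to = 4439,
// only 2 epochs are trimmable, which is below paxos_service_trim_min = 10,
// so the monitor skips the trim.
bool should_trim(uint32_t first_committed, uint32_t trim_to,
                 uint32_t paxos_service_trim_min) {
  if (trim_to <= first_committed)
    return false;  // nothing to trim yet
  uint32_t to_remove = trim_to - first_committed;
  if (to_remove < paxos_service_trim_min) {
    // Matches the log: "would only trim 2 < paxos_service_trim_min 10".
    std::cout << "maybe_trim trim_to " << trim_to << " would only trim "
              << to_remove << " < paxos_service_trim_min "
              << paxos_service_trim_min << "\n";
    return false;  // batch small trims to amortize their cost
  }
  return true;
}
```

With the values from the log, `should_trim(4437, 4439, 10)` returns false, so the monitor keeps its small window of maps and trims later in a larger batch; the backlog in this reproducer was entirely on the OSD side.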
---

I can reproduce the issue even with the fix. See the log from osd.24:

```
2016-04-22 14:24:21.608162 7f3792ee6700 10 osd.24 7212 write_superblock sb(444f54b1-f97f-43d8-85b7-d5a02daac39a osd.24 ae43702f-eaa6-4235-917b-3bc4c97e4144 e7212 [4816,7212] lci=[4811,7212])
```

So it is holding over 2k osdmaps, and gdb shows that osdmap 4816 is still being held by someone:

```
(gdb) p weak_refs
$2 = std::map with 28 elements = {
  [4816] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x77b13c0, second = 0x77b13c0},
  [4825] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x77adcc0, second = 0x77adcc0},
  [4829] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x4b09cc0, second = 0x4b09cc0},
  [4831] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x4b093c0, second = 0x4b093c0},
  [4834] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x4b08640, second = 0x4b08640},
  [4853] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x77b7440, second = 0x77b7440},
  [4857] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x81ca240, second = 0x81ca240},
  [4861] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x81ca480, second = 0x81ca480},
  [7312] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x77a3f80, second = 0x77a3f80},
  [7313] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x857c240, second = 0x857c240},
  [7316] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x7bdab40, second = 0x7bdab40},
  [7317] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x7bda480, second = 0x7bda480},
  [7318] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x7922b40, second = 0x7922b40},
  [7319] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x8fea400, second = 0x8fea400},
  [7320] = {first = std::tr1::weak_ptr (count 1, weak 1) 0x8feba80, second = 0x8feba80},
  [7321] = {first = std::tr1::weak_ptr (count 15, weak 1) 0x7c00480, second = 0x7c00480},
  [7322] = {first = std::tr1::weak_ptr (count 22, weak 1) 0x7bdbd40, second = 0x7bdbd40},
  [7323] = {first = std::tr1::weak_ptr (count 3, weak 1) 0x770f600, second = 0x770f600},
  [7324] = {first = std::tr1::weak_ptr (count 19, weak 1) 0x90f4880, second = 0x90f4880},
  [7325] = {first = std::tr1::weak_ptr (count 18, weak 1) 0x90aa000, second = 0x90aa000},
  [7326] = {first = std::tr1::weak_ptr (count 11, weak 1) 0x7a0ff80, second = 0x7a0ff80},
  [7327] = {first = std::tr1::weak_ptr (count 7, weak 1) 0x8fcd440, second = 0x8fcd440},
  [7328] = {first = std::tr1::weak_ptr (count 12, weak 1) 0x8fcf600, second = 0x8fcf600},
  [7329] = {first = std::tr1::weak_ptr (count 20, weak 1) 0x8e741c0, second = 0x8e741c0},
  [7330] = {first = std::tr1::weak_ptr (count 7, weak 1) 0x927b440, second = 0x927b440},
  [7331] = {first = std::tr1::weak_ptr (count 348, weak 1) 0x9c79840, second = 0x9c79840},
  [7332] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x857c480, second = 0x857c480},
  [7333] = {first = std::tr1::weak_ptr (count 2, weak 1) 0x96e3200, second = 0x96e3200}}
```

---

Thanks, Kefu. Looks great; now we are close. :)

Removing needinfo from Harish, as we have set up the cluster and are able to reproduce the issue.

---

Hammer backport tracker: http://tracker.ceph.com/issues/15193

Hammer backports can be taken from the tracker given above, 15193.

Master Branch patches:

- https://github.com/ceph/ceph/pull/8017
- https://github.com/ceph/ceph/pull/8990
- https://github.com/ceph/ceph/pull/9108

---

(In reply to Vikhyat Umrao from comment #36)
> Master Branch patches :
> https://github.com/ceph/ceph/pull/8017
> https://github.com/ceph/ceph/pull/8990
> https://github.com/ceph/ceph/pull/9108

^^ These are PRs, not patches, as they each contain multiple commits.

---

The change got merged into master. The hammer backport is pending review: https://github.com/ceph/ceph/pull/9090

---

I talked with Sam regarding the impact of flapping OSDs on osdmap trimming. If we have an OSD constantly flapping, min(last-epoch-clean) will be kept from moving forward, because the PGs served by that flapping OSD are not clean, so their last-epoch-clean stays behind. The monitor therefore cannot trim the osdmaps until those PGs are back to normal.

---

Reproducer steps for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1339061#c12

---

Ken, please see https://github.com/ceph/ceph/pull/9090 and the ticket at http://tracker.ceph.com/issues/16639. You need to take 5793b13492feecad399451e3a83836722b6e9abc ("osd/OpRequest: reset connection upon unregister") out of the build, but the other 3 commits are good.

---

Followed the steps from https://bugzilla.redhat.com/show_bug.cgi?id=1291632#c27, and the number of maps didn't go beyond 50. Hence marking this bug as verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html
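As a footnote to the flapping-OSD discussion above, a minimal sketch of the constraint it describes; `osdmap_trim_floor` is a hypothetical helper, not the monitor's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper capturing the constraint described above: the monitor
// may only trim OSD maps up to the minimum last_epoch_clean over all PGs, so
// one PG stuck on a flapping OSD holds the trim floor in place cluster-wide.
uint32_t osdmap_trim_floor(const std::vector<uint32_t>& last_epoch_clean,
                           uint32_t current_epoch) {
  if (last_epoch_clean.empty())
    return current_epoch;  // no PGs: nothing pins the floor
  return *std::min_element(last_epoch_clean.begin(), last_epoch_clean.end());
}
// Example: thousands of PGs clean at e7200 but one PG stuck at e4816 on a
// flapping OSD => the floor stays at 4816 until that PG is clean again.
```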