Description of problem: Customers deploy Gnocchi[1] as part of the OpenStack platform to store their metrics. Gnocchi stores its metric data in Ceph. Among those data are the small incoming measures sent by other services, which are stored as small (>=16 bytes) objects in a pool using RADOS. This is what we call the backlog of measures to process. The Gnocchi metricd workers tend to consume those objects and bring their number down to zero, but if the workers are stopped or not fast enough, the backlog can grow further. The problem is that when a few hundred thousand of these objects are created, the Ceph cluster becomes unstable and starts crashing. This is well documented in the Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1454941 and we have multiple KCS articles reporting that issue.

Right now, we're trying to mitigate this in Gnocchi by processing the backlog more often so that it does not grow, but that is still only a mitigation.

This has been shown in (at least) OSP 10 and OSP 11.

[1] http://gnocchi.xyz
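For context, the incoming-measure write path boils down to plain librados writes of tiny objects. A minimal sketch of that pattern with python-rados follows; the pool name, object names, and payload here are illustrative, not the actual Gnocchi code:

import rados

# Pool and object names are made up for illustration.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('gnocchi-incoming')
try:
    # Each incoming measure becomes its own tiny RADOS object,
    # so an unprocessed backlog means hundreds of thousands of them.
    for i in range(100000):
        ioctx.write_full('measure_%d' % i, b'1496578212.0:42.0')
finally:
    ioctx.close()
    cluster.shutdown()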
The original Bugzilla is https://bugzilla.redhat.com/show_bug.cgi?id=1430588
Just my opinion: this bug should be limited to making sure that Ceph OSDs don't go down with a suicide timeout because of this problem. It should not focus on performance of Ceph filestore for small RADOS objects - this problem should go away as Ceph bluestore becomes available downstream, since we no longer store each small RADOS object in a separate file. Also, the Gnocchi implementation is being optimized upstream, and it's not clear that this problem would even be encountered with the upstream code for Gnocchi. Alex Krzos has pointed out that many users don't even know they have a Gnocchi service and don't need it, so why are we deploying it everywhere by default?

There is a serious problem in Ceph to explore, though. A Ceph OMap belongs to a single RADOS object, and therefore to a single PG. I think this is what got Gnocchi into trouble. For example, is it possible to shard OMaps across PGs in some way so that they would become more scalable? Even with Bluestore, if an OMap is restricted to a single PG, you could still hit this problem of non-scalability for omaps, right? Didn't we encounter this in RGW too? IMO you should either do that or set a limit on how big the set of keys for a single OMap can get. If the API lets users shoot themselves in the foot, they will do it.
(In reply to Ben England from comment #3)

> Just my opinion: this bug should be limited to making sure that Ceph OSDs
> don't go down with a suicide timeout because of this problem.

+1

> It should not focus on performance of Ceph filestore for small RADOS
> objects - this problem should go away as Ceph bluestore becomes available
> downstream, since we no longer store each small RADOS object in a separate
> file. Also, the Gnocchi implementation is being optimized upstream, and
> it's not clear that this problem would even be encountered with the
> upstream code for Gnocchi.

This is fixed in Gnocchi 4, but not in the earlier versions shipped with OSP 10 and 11.

> Alex Krzos has pointed out that many users don't even know they have a
> Gnocchi service and don't need it, so why are we deploying it everywhere by
> default?

It's a question we are also asking ourselves, but it is not the subject of this bug.

> There is a serious problem in Ceph to explore though. Ceph OMaps are
> associated with a single PG in a RADOS object. I think this is what got
> Gnocchi into trouble. For example, is it possible to shard OMaps across PGs
> in some way so that they would become more scalable?

I don't understand the link between OMaps and the small files. Is that question aimed at Gnocchi or at Ceph? :) Gnocchi also uses OMaps, and Gnocchi 4 will spread its usage across N OMap objects (128 by default).
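To illustrate what "spreading across N OMaps" means in practice, here is a rough python-rados sketch - this is not the actual Gnocchi 4 code, and the pool name, object naming, and hashing scheme are made up:

import hashlib
import rados

NUM_SACKS = 128  # matches the Gnocchi 4 default mentioned above

def sack_for(metric_id):
    # Hash each metric onto one of NUM_SACKS omap-bearing objects,
    # so the keys (and the resulting leveldb work) land on many PGs, not one.
    h = int(hashlib.md5(metric_id.encode()).hexdigest(), 16)
    return 'incoming-sack-%d' % (h % NUM_SACKS)

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('metrics')  # pool name is illustrative
try:
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('measure_abc123_1496578212',),
                       (b'<serialized measure>',))
        ioctx.operate_write_op(op, sack_for('abc123'))
finally:
    ioctx.close()
    cluster.shutdown()

With the keys hashed over 128 objects, the omap load is distributed across many PGs instead of being concentrated on one.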
Julien, about the last paragraph: yes, I agree that Gnocchi has been changed upstream not to use a single OMap to track all of the undigested samples, but since this is a Ceph bz, I'm pointing out that Ceph allows developers like you to get into this predicament by providing an OMap API that is not scalable. I think the OMap problem is what resulted in the suicide timeout referenced in the above bz, not the small objects. I was able to create millions of small 16-byte objects per OSD in my test cluster using rados bench without causing a suicide timeout (though I did see some latencies > 1 sec at times), and we had nowhere near that many objects/OSD in the OpenStack cluster where the suicide timeout occurred. That's also why exactly 3 OSDs died in the original problem report - I think these OSDs were the ones hosting the PG for the single OMap. I'm not aware of an upstream Ceph fix for this problem.

Will try to supply some data to quantify it more in a subsequent post. I am running a test now on an AWS-resident Ceph cluster using the RHCS 2.3 QE build, using this program:

https://github.com/bengland2/rados_object_perf/blob/master/rados-omap.py

If this reproduces the problem, I'll run the test against upstream Luminous.

[root@ip-172-31-4-35 ~]# ceph -s
    cluster 0049b838-87ec-4757-bc59-b3bacfc358dd
     health HEALTH_OK
     monmap e1: 3 mons at {ip-172-31-3-64=172.31.3.64:6789/0,ip-172-31-47-197=172.31.47.197:6789/0,ip-172-31-7-37=172.31.7.37:6789/0}
            election epoch 6, quorum 0,1,2 ip-172-31-3-64,ip-172-31-7-37,ip-172-31-47-197
      fsmap e5: 1/1/1 up {0=ip-172-31-47-197=up:active}
     osdmap e29: 6 osds: 6 up, 6 in
            flags sortbitwise,require_jewel_osds
      pgmap v230: 256 pgs, 8 pools, 3664 bytes data, 192 objects
            227 MB used, 2980 GB / 2980 GB avail
                 256 active+clean

[root@ip-172-31-4-35 ~]# ./rados-omap.py --keys-per-call 4 --keys 20000000 --direction writeread --pool-name rbd 2>&1 | tee rados-omap-writeread.log
keys: 20000000
keys-per-call: 4
direction: writeread
value-size: 0
pool-name: rbd

... still running ...
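For reference, the write phase of a test like this boils down to batched omap writes against a single object. A simplified python-rados sketch (not the actual rados-omap.py code; object name and key format are made up) would be:

import rados

KEYS = 20000000      # --keys
KEYS_PER_CALL = 4    # --keys-per-call
OBJ = 'omap-test-object'

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # --pool-name
try:
    written = 0
    while written < KEYS:
        keys = tuple('key_%012d' % (written + i) for i in range(KEYS_PER_CALL))
        vals = tuple(b'' for _ in keys)  # --value-size 0
        with rados.WriteOpCtx() as op:
            # Every batch lands in the omap of one object, i.e. one PG.
            ioctx.set_omap(op, keys, vals)
            ioctx.operate_write_op(op, OBJ)
        written += KEYS_PER_CALL
finally:
    ioctx.close()
    cluster.shutdown()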
I was able to reproduce the suicide timeout using this command:

./rados-omap.py --keys-per-call 256 --keys 80000000 \
    --direction writeread --pool-name rbd

It hung after successfully doing a read of 16 million keys, which took ~60 seconds. So the next read, of 64 million keys, was what pushed the elapsed time up high enough to trigger the timeout.

Files are at:

https://s3.amazonaws.com/ben-england/ceph/bz-1457767/ceph-logs.tgz - contains OSD logs for the cluster, including osd.2 that had a suicide timeout
https://s3.amazonaws.com/ben-england/ceph/bz-1457767/ceph-osd-tree.log - shows the OSDs that are down
https://s3.amazonaws.com/ben-england/ceph/bz-1457767/ceph-s.log - shows that the cluster was no longer healthy after running this workload

I have a sosreport from one of the nodes if you want to see that.

I think the highest priority things are to 1) avoid the suicide timeout and 2) document an individual omap's lack of scalability in the RADOS API. Is there a way to avoid the suicide timeout?
(In reply to Ben England from comment #6)
> I was able to reproduce suicide timeout using this command:

Thanks Ben for doing this testing and reporting your results. We have seen perf issues when RADOS objects have a large number of omap entries, as with our RGW bucket index object sharding. I am sure Josh will provide more input after looking into the logs, but these logs may not give many details because they were not captured at a higher debug level. Could you please reproduce this issue once again with the debug log levels below?

debug_osd = 20
debug_filestore = 20
debug_ms = 1

We may also need to collect leveldb debug logs, but we will see once Josh has analysed the debug logs.

This issue is causing a lot of problems in new OpenStack deployments where Gnocchi uses Ceph as its backend: the OpenStack BZ has 8 customer cases attached, and being in the Ceph support team I am aware of the criticality of this issue. We need to fix it, whether that requires a fix in Ceph, in Gnocchi, or in both. Today we will bring this bug up in our support and engineering sync call so Josh will have attention on this bz.

> I think the highest priority thing is to 1) avoid the suicide timeout and 2)
> document an individual omap's lack of scalability in the RADOS API. Is
> there a way to avoid the suicide timeout?

Yep. Once you have reproduced the issue with the high debug log level, you can increase the suicide timeouts for the next set of tests, but please first collect the debug logs with the default timeout values:

- osd_op_thread_suicide_timeout        Default value: 150 seconds
- filestore_op_thread_suicide_timeout  Default value: 180 seconds
Josh, I checked ceph-logs/ip-172-31-3-64/var/log/ceph/ceph-osd.2.log from the https://s3.amazonaws.com/ben-england/ceph/bz-1457767/ceph-logs.tgz tar file, and what I can see is that the OSD was always busy with leveldb compaction work, and this may be what eventually led to the heartbeat timeouts.

2017-06-04 08:08:16.275817 7fb32c143800 0 set uid:gid to 167:167 (ceph:ceph)
2017-06-04 08:08:16.275842 7fb32c143800 0 ceph version 10.2.7-23.el7cp (2c7b8aeb4d08785570f73fbbd1064134650bed23), process ceph-osd, pid 23915
[....]
2017-06-04 08:08:16.306077 7fb32c143800 1 leveldb: Recovering log #5
2017-06-04 08:08:16.306121 7fb32c143800 1 leveldb: Level-0 table #7: started
2017-06-04 08:08:16.307716 7fb32c143800 1 leveldb: Level-0 table #7: 146 bytes OK
2017-06-04 08:08:16.309870 7fb32c143800 1 leveldb: Delete type=0 #5
2017-06-04 08:08:16.309921 7fb32c143800 1 leveldb: Delete type=3 #4
[....]
2017-06-04 08:08:16.317532 7fb32c143800 0 osd.2 0 load_pgs
2017-06-04 08:08:16.317545 7fb32c143800 0 osd.2 0 load_pgs opened 0 pgs
2017-06-04 08:08:16.317551 7fb32c143800 0 osd.2 0 using 0 op queue with priority op cut off at 64.
2017-06-04 08:08:16.318207 7fb32c143800 -1 osd.2 0 log_to_monitors {default=true}
2017-06-04 08:08:16.321502 7fb32c143800 0 osd.2 0 done with init, starting boot process
[.....]
2017-06-04 08:09:01.218233 7fb3016a5700 0 -- 172.31.3.64:6801/23915 >> :/0 pipe(0x7fb3386e4000 sd=156 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fb3386f7300).accept failed to getpeername (107) Transport endpoint is not connected

^^ It has this getpeername error, but it comes back in GC:

2017-06-04 08:10:12.685642 7fb30ace0700 0 <cls> cls/rgw/cls_rgw.cc:3221: gc_iterate_entries end_key=1_01496578212.685635537
2017-06-04 08:10:12.695376 7fb30d4e5700 0 <cls> cls/rgw/cls_rgw.cc:3221: gc_iterate_entries end_key=1_01496578212.695371683
2017-06-04 08:10:12.776023 7fb30c4e3700 0 <cls> cls/rgw/cls_rgw.cc:3221: gc_iterate_entries end_key=1_01496578212.776019869

Then there is a 14-minute gap in the logs:

2017-06-04 08:24:09.878930 7fb3015a4700 0 -- 172.31.3.64:6801/23915 >> 172.31.3.64:6805/24803 pipe(0x7fb3386c5400 sd=155 :42452 s=2 pgs=5 cs=1 l=0 c=0x7fb33867b900).fault with nothing to send, going to standby
2017-06-04 09:05:53.162582 7fb300a9a700 1 leveldb: Level-0 table #10: started
2017-06-04 09:05:53.188470 7fb300a9a700 1 leveldb: Level-0 table #10: 1184013 bytes OK
2017-06-04 09:05:53.190087 7fb300a9a700 1 leveldb: Delete type=0 #8
2017-06-04 09:06:04.172991 7fb300a9a700 1 leveldb: Level-0 table #12: started
2017-06-04 09:06:04.198092 7fb300a9a700 1 leveldb: Level-0 table #12: 1102650 bytes OK
2017-06-04 09:06:04.199862 7fb300a9a700 1 leveldb: Delete type=0 #9
[.....]

For 12 hours there is only leveldb work by this OSD, nothing else:

2017-06-04 21:02:04.850114 7fb300a9a700 1 leveldb: Level-0 table #16980: started
2017-06-04 21:02:04.915907 7fb300a9a700 1 leveldb: Level-0 table #16980: 814604 bytes OK
2017-06-04 21:02:04.917426 7fb300a9a700 1 leveldb: Delete type=0 #16977
2017-06-04 21:02:25.148351 7fb303ccb700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb30bce2700' had timed out after 15
2017-06-04 21:02:25.148527 7fb302cbb700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb30bce2700' had timed out after 15
At this point anyone who wants to, such as QE, can do it, since I supplied the reproducer program. I'll try to get to it in the next couple of weeks, but frankly it's not my top priority. If I do it, I'll try to do it with a Ceph Luminous build so that developers know whether it's already fixed. Should a Ceph tracker be filed on this? Who will do that? Should I?
(In reply to Ben England from comment #6)
> I was able to reproduce suicide timeout using this command:
>
> ./rados-omap.py --keys-per-call 256 --keys 80000000 \
>     --direction writeread --pool-name rbd

This is putting millions of keys on one object, effectively writing to a single PG and creating enormous leveldb instances, for which compaction becomes very expensive. Using rocksdb will help, since it does multithreaded compaction, but this is fundamentally an issue with not distributing the data among multiple objects (and thus PGs) to leverage the whole cluster.

> I think the highest priority thing is to 1) avoid the suicide timeout and 2)
> document an individual omap's lack of scalability in the RADOS API. Is
> there a way to avoid the suicide timeout?

Unfortunately the omap interface in particular does not have a good way of enforcing size restrictions - writes are simply setting keys, and the osd can't tell whether those keys are new or overwriting existing keys without adding a bunch of overhead. We can add a warning during deep scrub though - we need to read all the keys at that point anyway.

The main thing here imo is the need for better docs for librados. The tradeoffs of object size, i/o rate, and omap vs regular data should be added.
(In reply to Josh Durgin from comment #14)
> The main thing here imo is the need for better docs for librados. The
> tradeoffs of object size, i/o rate, and omap vs regular data should be added.

Thanks Josh. I have created doc bugs:

http://tracker.ceph.com/issues/20215
https://bugzilla.redhat.com/show_bug.cgi?id=1459688
A warning for objects with too many/too large omap entries is being added here: https://github.com/ceph/ceph/pull/16332
Josh> We can add a warning during deep scrub though - we need to read all the keys at that point anyway.

This would close the bz IMHO, though I suppose backfilling would be another place to check this. The problem is that the Ceph system admin doesn't have a clue which omap is causing the problems, and there is in general no brute-force way to find out that would work in a reasonable amount of time. Is there a way to log something when an omap exceeds either a size threshold or a number-of-keys threshold? These would have to be parameters, because one size might not fit all across the different types of storage that Ceph runs on, etc. But something like 1,000,000 keys and 128 MB of data might be good defaults (i.e. early warning -- small enough that the OSD is not near a suicide timeout, big enough not to be commonly encountered). Not sure whether these parameters would work for a data lake (a really big RGW site).
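In the meantime, an admin can at least audit omap key counts from the client side with python-rados, although this is slow on large pools and large omaps. A rough sketch (the pool name and threshold are illustrative, mirroring the defaults suggested above):

import rados

KEY_WARN_THRESHOLD = 1000000   # illustrative early-warning threshold
BATCH = 1000

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # pool to audit
try:
    for obj in ioctx.list_objects():
        total, last_key = 0, ''
        while True:
            with rados.ReadOpCtx() as op:
                # Read up to BATCH omap keys, resuming after the last one seen.
                it, _ = ioctx.get_omap_keys(op, last_key, BATCH)
                ioctx.operate_read_op(op, obj.key)
                batch = [k for k, _ in it]
            if not batch:
                break
            total += len(batch)
            last_key = batch[-1]
        if total > KEY_WARN_THRESHOLD:
            print('WARNING: object %s has %d omap keys' % (obj.key, total))
finally:
    ioctx.close()
    cluster.shutdown()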
Warnings about objects with too many or too large omap entries will be in upstream 12.2.6
Hi Josh and Brad - the warning about objects with too many omap entries is already downstream in 3.1, via bug https://bugzilla.redhat.com/show_bug.cgi?id=1569192. I think we can close this one as a duplicate of 1569192.
Good point Vikhyat - closing *** This bug has been marked as a duplicate of bug 1569192 ***