Bug 1428888
| Field | Value |
| --- | --- |
| Summary | [Documentation] increasing pg count needs a warning |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | Documentation |
| Version | 2.1 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | high |
| Reporter | Tim Wilkinson <twilkins> |
| Assignee | ceph-docs <ceph-docs> |
| QA Contact | ceph-qe-bugs <ceph-qe-bugs> |
| CC | bengland, ceph-eng-bugs, dzafman, jdurgin, johfulto, kbader, kchai, kdreyer, mnelson, rsussman, twilkins |
| Target Milestone | rc |
| Target Release | 2.3 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2017-03-09 00:25:54 UTC |
Description (Tim Wilkinson, 2017-03-03 14:59:08 UTC)
This is a generic assert, so it can be caused by many things, the most common being a misconfigured cluster that has too much work for too few spindles. To diagnose what's happening in this case we'll need more info, ideally a way to log in to the system and look around. If that's not possible, ceph.log from the monitors, plus OSD logs of this happening with `debug osd = 20`, `debug filestore = 20`, and `debug ms = 1`, would be the place to start. Can you gather those logs or provide login info for the cluster?

Increasing the volumes pool from 4096 to 8192 PGs was the culprit. This caused 1/3 of the data to move, and the resulting backfill overloaded the disks enough that they took longer than the OSD op thread timeout to respond. We already have changes planned upstream to help with throttling recovery and backfill, but these won't be backportable. In the meantime the documentation should warn about how expensive increasing the number of PGs is, and suggest doing it in smaller increments if the pool already holds a lot of data. Doubling the PG count causes the least total data movement, but using smaller increments spreads the load over time better.

Understood. Is there no hope for the present cluster? Recovery got down to 0.075% misplaced objects but stopped there. Nothing is moving, but the two OSDs continue to experience the timeout if restarted.

(In reply to Tim Wilkinson from comment #5)
> Understood. Is there no hope for the present cluster? Recovery got down to
> 0.075% misplaced objects but stopped there. Nothing is moving but the two
> OSDs continue to experience the timeout if restarted.

There's nothing that should stop it from recovering eventually. You've also got some deep scrubbing going on; setting the noscrub and nodeep-scrub flags would help speed up the rebalancing a bunch, since those are also expensive operations.
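To illustrate the incremental approach suggested above, here is a small sketch that splits a pg_num increase into steps and prints the corresponding `ceph osd pool set` commands, rather than jumping from 4096 to 8192 in one move. The step size of 1024 is an illustrative assumption, not a value from the bug report; in practice you would wait for the cluster to return to active+clean between steps.

```python
# Sketch: split a pg_num increase into smaller increments so backfill
# load is spread over time instead of moving 1/3 of the data at once.
# The step size (1024) is an assumption chosen for illustration.

def pg_increase_schedule(current: int, target: int, step: int) -> list[int]:
    """Return the intermediate pg_num values from current up to target."""
    schedule = []
    value = current
    while value < target:
        value = min(value + step, target)
        schedule.append(value)
    return schedule

def commands(pool: str, schedule: list[int]) -> list[str]:
    """Emit the ceph CLI commands for each step.

    pgp_num must be raised to match pg_num, or the new placement
    groups are created but no data actually rebalances onto them.
    """
    cmds = []
    for n in schedule:
        cmds.append(f"ceph osd pool set {pool} pg_num {n}")
        cmds.append(f"ceph osd pool set {pool} pgp_num {n}")
        # In a real run: wait here until the cluster is active+clean
        # before issuing the next increment.
    return cmds

if __name__ == "__main__":
    for cmd in commands("volumes", pg_increase_schedule(4096, 8192, 1024)):
        print(cmd)
```

Each increment triggers a smaller round of data movement, so the disks never face the full backfill load at once.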
Looking at /var/log/ceph/ceph.log, the cluster did go fully active+clean at 2017-03-07 23:41:28.992018, and now that you've added back the last couple of OSDs it's rebalancing again. It'll just take some time.

Thanks. It wouldn't move forward with recovery until we dropped the metrics table.

(In reply to Tim Wilkinson from comment #7)
> Thanks. It wouldn't move forward with recovery until we dropped the metrics
> table.

Correction: we stopped the Gnocchi and Ceilometer services and dropped the metrics pool.

It turns out the OpenStack Gnocchi (telemetry) service was creating a lot of 16-byte RADOS objects in the Ceph "metrics" storage pool used by Gnocchi, about 19 million of them, with an average object size of about 50 bytes. Each of these objects takes up a 2 KB inode, so the ratio of metadata to data is about 40:1. Yes, I looked inside the per-OSD filesystems and into the PG directories where the object replicas live to confirm this. I can't prove that this was the cause of the OSD suicide timeouts, but I did see that the XFS inode slab in the kernel was 13 GB, and with other XFS-related slabs it may have come out to about 20 GB. FileStore does not do well with small RADOS objects, as Mark Nelson has documented before (hopefully BlueStore does much better). But still, Gnocchi is not making efficient use of storage. When we turned off Gnocchi and deleted its storage pool, Ceph recovered fully, where before it could not (but OpenStack Newton services apparently depend on Gnocchi now).
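The metadata overhead described above follows from simple arithmetic. The object count and sizes below are taken from the comments; the 2 KB per-inode cost is the figure quoted there, so treat the totals as rough estimates rather than measurements:

```python
# Rough arithmetic for the Gnocchi small-object overhead described above.
# Figures come from the bug comments; this is an estimate, not a measurement.

N_OBJECTS = 19_789_554    # RADOS objects in the "metrics" pool
AVG_OBJECT_BYTES = 50     # average object size quoted in the comment
INODE_BYTES = 2 * 1024    # ~2 KB filesystem inode consumed per object

# Ratio of filesystem metadata to actual payload data.
metadata_to_data = INODE_BYTES / AVG_OBJECT_BYTES

# Total inode footprint for one replica of every object, cluster-wide.
total_inode_gib = N_OBJECTS * INODE_BYTES / 2**30

print(f"metadata:data ratio ~= {metadata_to_data:.0f}:1")
print(f"inode metadata per replica ~= {total_inode_gib:.1f} GiB")
```

The ratio works out to roughly 41:1, matching the "about 40:1" in the comment, and the per-replica inode footprint is tens of GiB, consistent with the multi-GB XFS inode slab observed on the OSD nodes (each node holds only a fraction of the replicas).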
Here's what the pools looked like before we got rid of the Gnocchi pool:

```
[root@overcloud-osd-compute-2 1.102_head]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    347T     175T      171T         49.50
POOLS:
    NAME        ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd         0      0          0         49997G        0
    metrics     1      9870M      0.02      49997G        19789554
    images      2      128G       0.26      49997G        20503
    backups     3      0          0         49997G        0
    volumes     4      58695G     54.00     49997G        15140895
    vms         5      454G       0.90      49997G        117867
```

This is consistent with Tim's and my experience with OpenStack running on an external Ceph cluster; there, Gnocchi was running on non-Ceph storage and we saw no such problems. I'll file a bz on Gnocchi with Alex Krzos in Perf & Scale, but wanted the Ceph people to know what happened here.

Thanks for the further detail, Tim and Ben. It looks like the Gnocchi Ceph backend needs some work to store its data more efficiently. The overhead of an object_info_t (internal OSD metadata about an object) is enough to warrant using larger objects (or omap) for BlueStore as well. Please cc me on the bz.
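As a sanity check on the `ceph df` listing above, the per-pool %USED column can be reproduced from the USED and MAX AVAIL columns. This assumes the convention %USED = USED / (USED + MAX AVAIL), which matches every pool in the output shown; figures are taken from that listing:

```python
# Reproduce the per-pool %USED column from the ceph df output above.
# Assumption: in this Ceph version, %USED = USED / (USED + MAX AVAIL).

def pct_used(used_gb: float, max_avail_gb: float) -> float:
    """Percentage of the pool's effective capacity that is in use."""
    return 100.0 * used_gb / (used_gb + max_avail_gb)

# Figures from the ceph df listing, in GB.
print(f"volumes: {pct_used(58695, 49997):.2f}%")  # listing shows 54.00
print(f"images:  {pct_used(128, 49997):.2f}%")    # listing shows 0.26
print(f"vms:     {pct_used(454, 49997):.2f}%")    # listing shows 0.90
```

The volumes pool sitting at 54% of its effective capacity, with 15 million objects, is why the pg_num jump moved so much data at once.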