Bug 1428888
| Field | Value |
| --- | --- |
| Summary | [Documentation] increasing pg count needs a warning |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | Documentation |
| Version | 2.1 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | high |
| Reporter | Tim Wilkinson <twilkins> |
| Assignee | ceph-docs <ceph-docs> |
| QA Contact | ceph-qe-bugs <ceph-qe-bugs> |
| CC | bengland, ceph-eng-bugs, dzafman, jdurgin, johfulto, kbader, kchai, kdreyer, mnelson, rsussman, twilkins |
| Target Milestone | rc |
| Target Release | 2.3 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2017-03-09 00:25:54 UTC |
Description (Tim Wilkinson, 2017-03-03 14:59:08 UTC)
This is a generic assert, so it can be caused by many things, the most common being a misconfigured cluster that has too much work for too few spindles. To diagnose what's happening in this case we'll need more info, ideally a way to log in to the system and look around. If that's not possible, ceph.log from the monitors, plus OSD logs of this happening with `debug osd = 20`, `debug filestore = 20`, and `debug ms = 1`, would be the place to start. Can you gather those logs or provide login info for the cluster?

Increasing the volumes pool from 4096 to 8192 PGs was the culprit. This caused 1/3 of the data to move, and the resulting backfill overloaded the disks enough that they took longer than the OSD op thread timeout to respond. We already have changes planned upstream to help with throttling recovery and backfill, but these won't be backportable. In the meantime the documentation should warn about how expensive increasing the number of PGs is, and suggest doing it in smaller increments if the pool already holds a lot of data. Doubling the PG count causes the least total data movement, but using smaller increments spreads the load over time better.

Understood. Is there no hope for the present cluster? Recovery got down to 0.075% misplaced objects but stopped there. Nothing is moving, but the two OSDs continue to experience the timeout if restarted.

(In reply to Tim Wilkinson from comment #5)
> Understood. Is there no hope for the present cluster? Recovery got down to
> 0.075% misplaced objects but stopped there. Nothing is moving but the two
> OSDs continue to experience the timeout if restarted.

There's nothing that should stop it from recovering eventually. You've also got some deep scrubbing going on; setting the noscrub and nodeep-scrub flags would help speed up the rebalancing a bunch, since those are also expensive operations.
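To illustrate the incremental approach suggested above, here is a small sketch that splits a pg_num increase into steps and prints the corresponding `ceph osd pool set` commands, rather than jumping from 4096 to 8192 in one move. The step size of 1024 is an illustrative assumption, not a value from the bug report; in practice you would wait for the cluster to return to active+clean between steps.

```python
# Sketch: split a pg_num increase into smaller increments so backfill
# load is spread over time instead of moving 1/3 of the data at once.
# The step size (1024) is an assumption chosen for illustration.

def pg_increase_schedule(current: int, target: int, step: int) -> list[int]:
    """Return the intermediate pg_num values from current up to target."""
    schedule = []
    value = current
    while value < target:
        value = min(value + step, target)
        schedule.append(value)
    return schedule

def commands(pool: str, schedule: list[int]) -> list[str]:
    """Emit the ceph CLI commands for each step.

    pgp_num must be raised to match pg_num, or the new placement
    groups are created but no data actually rebalances onto them.
    """
    cmds = []
    for n in schedule:
        cmds.append(f"ceph osd pool set {pool} pg_num {n}")
        cmds.append(f"ceph osd pool set {pool} pgp_num {n}")
        # In a real run: wait here until the cluster is active+clean
        # before issuing the next increment.
    return cmds

if __name__ == "__main__":
    for cmd in commands("volumes", pg_increase_schedule(4096, 8192, 1024)):
        print(cmd)
```

Each increment triggers a smaller round of data movement, so the disks never face the full backfill load at once.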
Looking at /var/log/ceph/ceph.log, the cluster did go fully active+clean at 2017-03-07 23:41:28.992018, and now that you've added back the last couple of OSDs it's rebalancing again. It'll just take some time.

Thanks. It wouldn't move forward with recovery until we dropped the metrics table.

(In reply to Tim Wilkinson from comment #7)
> Thanks. It wouldn't move forward with recovery until we dropped the metrics
> table.

Correction: we stopped the Gnocchi and Ceilometer services and dropped the metrics pool.

It turns out the OpenStack Gnocchi (telemetry) service was creating a lot of 16-byte RADOS objects in the Ceph "metrics" storage pool used by Gnocchi, about 19 million of them, with an average object size of about 50 bytes. Each of these objects takes up a 2 KB inode, so the ratio of metadata to data is about 40:1. Yes, I looked inside the per-OSD filesystems and into the PG directories where the object replicas live to confirm this. I can't prove that this was the cause of the OSD suicide timeouts, but I did see that the XFS inode slab in the kernel was 13 GB, and with other XFS-related slabs it may have come out to about 20 GB. FileStore does not do well with small RADOS objects, as Mark Nelson has documented before (hopefully BlueStore does much better). But still, Gnocchi is not making efficient use of storage. When we turned off Gnocchi and deleted its storage pool, Ceph recovered fully, where before it could not (but OpenStack Newton services apparently depend on Gnocchi now).
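The metadata overhead described above follows from simple arithmetic. The object count and sizes below are taken from the comments; the 2 KB per-inode cost is the figure quoted there, so treat the totals as rough estimates rather than measurements:

```python
# Rough arithmetic for the Gnocchi small-object overhead described above.
# Figures come from the bug comments; this is an estimate, not a measurement.

N_OBJECTS = 19_789_554    # RADOS objects in the "metrics" pool
AVG_OBJECT_BYTES = 50     # average object size quoted in the comment
INODE_BYTES = 2 * 1024    # ~2 KB filesystem inode consumed per object

# Ratio of filesystem metadata to actual payload data.
metadata_to_data = INODE_BYTES / AVG_OBJECT_BYTES

# Total inode footprint for one replica of every object, cluster-wide.
total_inode_gib = N_OBJECTS * INODE_BYTES / 2**30

print(f"metadata:data ratio ~= {metadata_to_data:.0f}:1")
print(f"inode metadata per replica ~= {total_inode_gib:.1f} GiB")
```

The ratio works out to roughly 41:1, matching the "about 40:1" in the comment, and the per-replica inode footprint is tens of GiB, consistent with the multi-GB XFS inode slab observed on the OSD nodes (each node holds only a fraction of the replicas).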
Here's what the pools looked like before we got rid of the Gnocchi pool:

```
[root@overcloud-osd-compute-2 1.102_head]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    347T     175T      171T         49.50
POOLS:
    NAME        ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd         0      0          0         49997G        0
    metrics     1      9870M      0.02      49997G        19789554
    images      2      128G       0.26      49997G        20503
    backups     3      0          0         49997G        0
    volumes     4      58695G     54.00     49997G        15140895
    vms         5      454G       0.90      49997G        117867
```

This is consistent with Tim's and my experience with OpenStack running on an external Ceph cluster; there, Gnocchi was running on non-Ceph storage and we saw no such problems. I'll file a bz on Gnocchi with Alex Krzos in Perf & Scale, but wanted the Ceph people to know what happened here.

Thanks for the further detail, Tim and Ben. It looks like the Gnocchi Ceph backend needs some work to store its data more efficiently. The overhead of an object_info_t (internal OSD metadata about an object) is enough to warrant using larger objects (or omap) for BlueStore as well. Please cc me on the bz.
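As a sanity check on the `ceph df` listing above, the per-pool %USED column can be reproduced from the USED and MAX AVAIL columns. This assumes the convention %USED = USED / (USED + MAX AVAIL), which matches every pool in the output shown; figures are taken from that listing:

```python
# Reproduce the per-pool %USED column from the ceph df output above.
# Assumption: in this Ceph version, %USED = USED / (USED + MAX AVAIL).

def pct_used(used_gb: float, max_avail_gb: float) -> float:
    """Percentage of the pool's effective capacity that is in use."""
    return 100.0 * used_gb / (used_gb + max_avail_gb)

# Figures from the ceph df listing, in GB.
print(f"volumes: {pct_used(58695, 49997):.2f}%")  # listing shows 54.00
print(f"images:  {pct_used(128, 49997):.2f}%")    # listing shows 0.26
print(f"vms:     {pct_used(454, 49997):.2f}%")    # listing shows 0.90
```

The volumes pool sitting at 54% of its effective capacity, with 15 million objects, is why the pg_num jump moved so much data at once.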