Bug 1306842 - Adding a new OSD to a Ceph cluster with a CRUSH weight of 0 causes 'ceph df' to report invalid MAX AVAIL on pools
Summary: Adding a new OSD to a Ceph cluster with a CRUSH weight of 0 causes 'ceph df' ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 1.3.3
Assignee: Kefu Chai
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On: 1335269
Blocks: 1348597 1372735
 
Reported: 2016-02-11 21:08 UTC by Mike Hackett
Modified: 2022-07-09 08:13 UTC
CC List: 10 users

Fixed In Version: RHEL: ceph-0.94.7-5.el7cp Ubuntu: ceph_0.94.7-3redhat1trusty
Doc Type: Bug Fix
Doc Text:
."ceph df" now shows proper value of "MAX AVAIL" When adding a new OSD node to the cluster by using the `ceph-deploy` utility with the `osd_crush_initial_weight` option set to `0`, the value of the `MAX AVAIL` field in the output of the `ceph df` command was `0` for each pool instead of the proper numerical value. As a consequence, other applications using Ceph, such as OpenStack Cinder, assumed that there is no space available to provision new volumes. This bug has been fixed, and `ceph df` now shows proper value of `MAX AVAIL` as expected.
Clone Of:
Environment:
Last Closed: 2016-09-29 12:56:37 UTC
Embargoed:




Links
Ceph Project Bug Tracker 13930 (last updated 2016-02-22 15:25:14 UTC)
Red Hat Issue Tracker RHCEPH-4712 (last updated 2022-07-09 08:13:41 UTC)
Red Hat Knowledge Base (Solution) 2159331 (last updated 2016-06-30 16:45:02 UTC)
Red Hat Product Errata RHSA-2016:1972 (normal, SHIPPED_LIVE): Moderate: Red Hat Ceph Storage 1.3.3 security, bug fix, and enhancement update (last updated 2016-09-29 16:51:21 UTC)

Description Mike Hackett 2016-02-11 21:08:28 UTC
Description of problem:

When adding new OSDs to a cluster using ceph-deploy with 'osd_crush_initial_weight = 0' set, the output of 'ceph df' reports 'MAX AVAIL' as 0 instead of the proper value. Once the CRUSH weight of the new OSD is changed to 0.01, 'ceph df' displays proper numerical values again. This causes problems for OpenStack Cinder in Kilo because it assumes there is no space available to provision new volumes.

Before adding an OSD:

GLOBAL:
SIZE AVAIL RAW USED %RAW USED
589T 345T 243T 41.32
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 816M 0 102210G 376
metadata 1 120M 0 102210G 94
images 5 11990G 1.99 68140G 1536075
volumes 6 63603G 10.54 68140G 16462022
instances 8 5657G 0.94 68140G 1063602
rbench 12 260M 0 68140G 22569
scratch 13 40960 0 68140G 10

After adding an OSD:

GLOBAL:
SIZE AVAIL RAW USED %RAW USED
590T 346T 243T 41.24
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 0 816M 0 0 376
metadata 1 120M 0 0 94
images 5 11990G 1.98 0 1536075
volumes 6 63603G 10.52 0 16462022
instances 8 5657G 0.94 0 1063602
rbench 12 260M 0 0 22569
scratch 13 40960 0 0 10

MAX AVAIL shows 0 for all pools.
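
The zero CRUSH weight on the newly added OSD can be confirmed with 'ceph osd tree' (a generic sketch; the OSD ID and host name are placeholders, compare the verification output in comment 13):

    # list the CRUSH tree and look for the newly added OSD
    ceph osd tree
    # with osd_crush_initial_weight = 0, the new host and its OSDs show WEIGHT 0, e.g.:
    #   -6        0     host <new-host>
    #   12        0         osd.12    up  1.00000    1.00000
    # on affected versions this zero weight coincides with MAX AVAIL = 0 in 'ceph df'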

Version-Release number of selected component (if applicable): 
RHCS 1.3.1
RHEL 7.2


How reproducible:
Highly reproducible; I was able to reproduce it on my cluster without any issues.

Steps to Reproduce:

2 separate networks (cluster and public)
1 admin deploy node
3 MON nodes
3 OSD nodes, each with 3 OSD disks and 3 separate journals
Running RHCS 1.3.1 and RHEL 7.2

I edited my ceph.conf to include:
osd_crush_initial_weight = 0
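
For context, a minimal ceph.conf sketch of where this option can live (the section placement and comment are illustrative assumptions, not taken from the reporter's actual configuration):

    [global]
    # new OSDs are added to the CRUSH map with weight 0 and must be reweighted manually
    osd_crush_initial_weight = 0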

Action plan:

Check 'ceph df' and verify that MAX AVAIL shows the expected values.

Add 1 additional OSD node with 3 OSDs and 3 separate journals, taking the initial OSD CRUSH weight of 0 from the conf.

Check 'ceph df' and verify the issue is seen, with MAX AVAIL at 0.

If the issue is seen, change the OSD CRUSH weight to 0.01 and recheck that MAX AVAIL shows proper numerical values (see the example commands below).
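
A sketch of the check and workaround commands for the last step (osd.12 is a hypothetical ID for the newly added OSD; 'ceph osd crush reweight' is the standard command for changing a CRUSH weight):

    ceph df                               # MAX AVAIL shows 0 for every pool on affected versions
    ceph osd tree                         # the new OSD is listed with CRUSH weight 0
    ceph osd crush reweight osd.12 0.01   # give the new OSD a small non-zero weight
    ceph df                               # MAX AVAIL returns to proper numerical values

Note that raising the CRUSH weight, even to 0.01, starts moving a small amount of data onto the new OSD.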


Actual results:
MAX AVAIL shows 0 for all pools.

Expected results:
MAX AVAIL shows the proper numerical values for the pools.

Additional info:
An upstream Ceph tracker is open, as the issue is also seen on 0.94.5:
http://tracker.ceph.com/issues/14710

Comment 1 Mike Hackett 2016-02-11 21:10:41 UTC
Assigned bug to kchai per sjust's request.

Comment 2 Mike Hackett 2016-02-11 21:18:56 UTC
A better representation of the test's 'ceph df' output:

Before adding an OSD:

GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    589T      345T         243T         41.32
POOLS:
    NAME          ID     USED       %USED     MAX AVAIL     OBJECTS
    data          0        816M         0       102210G          376
    metadata      1        120M         0       102210G           94
    images        5      11990G      1.99        68140G      1536075
    volumes       6      63603G     10.54        68140G     16462022
    instances     8       5657G      0.94        68140G      1063602
    rbench        12       260M         0        68140G        22569
    scratch       13      40960         0        68140G           10

After adding an OSD:

GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    590T      346T         243T         41.24
POOLS:
    NAME          ID     USED       %USED     MAX AVAIL     OBJECTS
    data          0        816M         0             0          376
    metadata      1        120M         0             0           94
    images        5      11990G      1.98             0      1536075
    volumes       6      63603G     10.52             0     16462022
    instances     8       5657G      0.94             0      1063602
    rbench        12       260M         0             0        22569
    scratch       13      40960         0             0           10

Comment 4 Kefu Chai 2016-02-22 15:25:14 UTC
Merged into hammer: https://github.com/ceph/ceph/pull/6834
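
To check whether a given installation already carries the backport, one option is to compare the installed version against the Fixed In Version noted above (a generic sketch, not taken from the original report):

    ceph -v        # prints the installed ceph version, e.g. "ceph version 0.94.7-..." or later
    rpm -q ceph    # on RHEL, the fixed package is ceph-0.94.7-5.el7cp or newer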

Comment 8 Federico Lucifredi 2016-07-25 17:59:00 UTC
Fix is in 0.94.7.

Comment 13 Ramakrishnan Periyasamy 2016-09-09 10:31:29 UTC
Verified by adding OSDs to a cluster with the data and journal partitions both on the same disk and on different disks; the "ceph df" command shows "MAX AVAIL" data. Output captured below for reference.

[root@host1]# ceph osd tree
ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 10.79991 root default                                        
-2  2.69998     host magna105                                   
 0  0.89999         osd.0          up  1.00000          1.00000 
 1  0.89999         osd.1          up  1.00000          1.00000 
 2  0.89999         osd.2          up  1.00000          1.00000 
-3  2.69998     host magna107                                   
 3  0.89999         osd.3          up  1.00000          1.00000 
 4  0.89999         osd.4          up  1.00000          1.00000 
 5  0.89999         osd.5          up  1.00000          1.00000 
-4  2.69998     host magna108                                   
 6  0.89999         osd.6          up  1.00000          1.00000 
 7  0.89999         osd.7          up  1.00000          1.00000 
 8  0.89999         osd.8          up  1.00000          1.00000 
-5  2.69997     host magna109                                   
10  0.89999         osd.10         up  1.00000          1.00000 
 9  0.89999         osd.9          up  1.00000          1.00000 
11  0.89998         osd.11         up  1.00000          1.00000 
-6        0     host magna110                                   
12        0         osd.12         up  1.00000          1.00000 
[root@host1]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    12043G     11947G       98290M          0.80 
POOLS:
    NAME      ID     USED      %USED     MAX AVAIL     OBJECTS 
    rbd       0      7173M      0.17         3655G     1836528 
    pool1     1      2456M      0.06         3655G      628740 
    pool2     2      2556M      0.06         3655G      654449 
[root@host1]# ceph -v
ceph version 0.94.9-1.el7cp (72b3e852266cea8a99b982f7aa3dde8ca6b48bd3)

Comment 16 errata-xmlrpc 2016-09-29 12:56:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html

