Bug 1597048

Summary: ceph osd df not showing correct disk size and causing cluster to go to full state
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vikhyat Umrao <vumrao>
Component: RADOS    Assignee: Brad Hubbard <bhubbard>
Status: CLOSED NOTABUG QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 3.0    CC: bhubbard, ceph-eng-bugs, dzafman, kchai
Target Milestone: rc   
Target Release: 3.*   
Hardware: x86_64   
OS: Linux   
Last Closed: 2018-07-04 00:53:53 UTC Type: Bug

Description Vikhyat Umrao 2018-07-01 21:37:56 UTC
Description of problem:
ceph osd df is not showing the correct disk size for an OSD, and this is causing the cluster to go into a full state.

[root@storage-004 ~]# df -h /var/lib/ceph/osd/ceph-0
Filesystem                           Size  Used Avail Use% Mounted on
/dev/nvme0n1p1                       3.7T  9.8G  3.7T   1% /var/lib/ceph/osd/ceph-0

[root@storage-004 ~]# ceph -s
  cluster:
    id:     03e3321d-071f-4b28-a3f9-0256f384bdca
    health: HEALTH_ERR
            full flag(s) set
            1 full osd(s)

  services:
    mon: 3 daemons, quorum storage-004,storage-005,storage-009
    mgr: storage-009(active), standbys: storage-005, storage-004
    osd: 102 osds: 96 up, 96 in; 103 remapped pgs
         flags full
    rgw: 2 daemons active


From ceph osd df:
=======================

  0   ssd 3.63199  1.00000 10240M   9467M  772M 92.45 20.78 131 <===
                              ^^
    
  5   ssd 3.63199  1.00000  3719G   1025G 2693G 27.57  6.20 419
 10   ssd 3.63199  1.00000  3719G   1220G 2498G 32.81  7.38 458
 16   ssd 3.63199  1.00000  3719G   1114G 2604G 29.98  6.74 428
 21   ssd 3.63199  1.00000  3719G   1004G 2714G 27.02  6.07 417
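
Note that the osd.0 row is internally consistent with a 10G device: 9467M used out of 10240M is 9467 / 10240 = 92.45%, exactly the %USE value shown, so the OSD itself believes its total capacity is only 10G even though the underlying partition is 3.7T.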

From ceph osd tree:
========================

 -9        18.15994     host storage-004
  0   ssd   3.63199         osd.0            up  1.00000 1.00000
  5   ssd   3.63199         osd.5            up  1.00000 1.00000
 10   ssd   3.63199         osd.10           up  1.00000 1.00000
 16   ssd   3.63199         osd.16           up  1.00000 1.00000
 21   ssd   3.63199         osd.21           up  1.00000 1.00000



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 3

How reproducible:
Always at the customer site.

Comment 2 Brad Hubbard 2018-07-02 00:25:44 UTC
Assuming this is filestore, can we see the output of "stat -f /var/lib/ceph/osd/ceph-0", please?
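
(For reference, "stat -f" reports the filesystem block size and the total/free/available block counts for the mount point; a FileStore OSD derives its usage statistics from that statfs data, so the output would show whether the kernel's view of the filesystem matches the 10240M reported above.)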

Comment 8 Vikhyat Umrao 2018-07-18 21:55:36 UTC
Resolution - this disk was deployed as bluestore by mistake, and even as bluestore it was not deployed properly.

[root@storage-004 ~]# cat /var/lib/ceph/osd/ceph-*/type
bluestore
filestore
filestore
filestore
filestore

So only OSD.0 was bluestore.
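
A minimal sketch (assuming the default /var/lib/ceph/osd/ceph-<id> layout) to match each type line above to its OSD ID:

# Illustrative only: print every local OSD data directory together with its
# objectstore type.
for d in /var/lib/ceph/osd/ceph-*; do
    printf '%s: %s\n' "${d##*/}" "$(cat "$d/type")"
done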

[root@storage-004 ~]# blockdev --getsize64 /dev/nvme0n1p1
3995417255424
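
For reference, the partition itself is the expected size: 3995417255424 / 1024^4 is ~3.63 TiB, which matches the 3.63199 CRUSH weight and the 3.7T reported by df. A quick check:

# Convert the raw byte count to TiB (illustrative).
echo 'scale=2; 3995417255424 / 1024^4' | bc    # 3.63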

[root@storage-004 ceph-0]# ls -l
total 10421404
-rw-r--r--. 1 ceph ceph         447 Mar 21 15:44 activate.monmap
-rw-r--r--. 1 ceph ceph           3 Mar 21 15:44 active
-rw-r--r--. 1 ceph ceph 10737418240 Jul  2 16:51 block <==============
-rw-r--r--. 1 ceph ceph           2 Mar 21 15:44 bluefs
-rw-r--r--. 1 ceph ceph          37 Mar 21 15:43 ceph_fsid
-rw-r--r--. 1 ceph ceph          37 Mar 21 15:43 fsid
-rw-------. 1 ceph ceph          56 Mar 21 15:44 keyring
-rw-r--r--. 1 ceph ceph           8 Mar 21 15:44 kv_backend
-rw-r--r--. 1 ceph ceph          21 Mar 21 15:43 magic
-rw-r--r--. 1 ceph ceph           4 Mar 21 15:44 mkfs_done
-rw-r--r--. 1 ceph ceph           6 Mar 21 15:44 ready
-rw-r--r--. 1 ceph ceph           0 Jul  1 21:52 systemd
-rw-r--r--. 1 ceph ceph          10 Mar 21 15:43 type
-rw-r--r--. 1 ceph ceph           2 Mar 21 15:44 whoami 


The BlueStore block device was a regular file named "block" rather than a symlink to a block device partition of this disk, and that file was only 10G in size, hence the OSD capacity was reported as 10G (the 10240M seen in ceph osd df).
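
A quick way to spot this (illustrative, using the same ceph-0 path as above) is to check what "block" actually points to:

# On a correctly deployed BlueStore OSD, "block" is a symlink to a block device
# (partition), not a regular file.
ls -l /var/lib/ceph/osd/ceph-0/block
# If it resolves to a real block device, this prints the device size in bytes:
blockdev --getsize64 "$(readlink -f /var/lib/ceph/osd/ceph-0/block)"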

Redeploying the OSD with filestore fixed the issue.
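
A post-redeploy sanity check (illustrative): the SIZE column reported by "ceph osd df" for osd.0 should now match the partition size rather than 10240M.

ceph osd df | awk '$1 == "0"'     # osd.0 row: SIZE should be ~3719G, not 10240M
df -h /var/lib/ceph/osd/ceph-0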