
Bug 1358275

Summary: Rados df gives wrong degraded object count
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RADOS
Version: 2.0
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: unspecified
Reporter: anmol babu <anbabu>
Assignee: Josh Durgin <jdurgin>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: anbabu, ceph-eng-bugs, dzafman, kchai, kdreyer, sjust
Target Milestone: rc
Target Release: 2.1
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2016-09-21 18:44:03 UTC
Bug Blocks: 1349913

Description anmol babu 2016-07-20 12:13:38 UTC
Description of problem:
The rados df CLI command reports a degraded object count greater than the total number of objects when some of the PGs are degraded.

Version-Release number of selected component (if applicable):

Mon rpms:
rpm -qa|grep ceph
ceph-selinux-10.2.2-5.el7cp.x86_64
python-cephfs-10.2.2-5.el7cp.x86_64
ceph-common-10.2.2-5.el7cp.x86_64
ceph-base-10.2.2-5.el7cp.x86_64
libcephfs1-10.2.2-5.el7cp.x86_64
ceph-mon-10.2.2-5.el7cp.x86_64

OSD rpms:
rpm -qa|grep ceph
ceph-selinux-10.2.2-9.el7cp.x86_64
ceph-common-10.2.2-9.el7cp.x86_64
ceph-base-10.2.2-9.el7cp.x86_64
libcephfs1-10.2.2-9.el7cp.x86_64
python-cephfs-10.2.2-9.el7cp.x86_64
ceph-osd-10.2.2-9.el7cp.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Create a pool and write several objects to it (e.g. 4 objects).
2. Remove OSDs so that fewer OSDs remain than the pool's replica count requires.
3. Create another object and check the degraded count (a command sketch follows below).
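
A minimal command sketch of these steps, assuming the cluster name c1 seen in the output below and a 3-replica pool; the pool name p1, the object names, the OSD id, and the payload file are placeholders:

# 1. Create a 3-replica pool and write a few objects to it.
ceph osd pool create p1 128 128 --cluster c1
ceph osd pool set p1 size 3 --cluster c1
for i in 1 2 3 4; do rados -p p1 put obj$i /etc/hosts --cluster c1; done

# 2. Stop an OSD so fewer OSDs remain than size=3 requires (osd.2 is a placeholder).
ceph osd out osd.2 --cluster c1
systemctl stop ceph-osd@2

# 3. Write one more object, then compare the degraded and total object counts.
rados -p p1 put obj5 /etc/hosts --cluster c1
rados df --cluster c1 --format json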

Actual results:
degraded object count > total object count

Expected results:
Degraded object count should not be more than total object count.

Additional info:

rados df --cluster c1 --format json
{"pools":[{"name":"p1","id":"1","size_bytes":"114","size_kb":"1","num_objects":"4","num_object_clones":"0","num_object_copies":"12","num_objects_missing_on_primary":"0","num_objects_unfound":"0","num_objects_degraded":"8","read_ops":"5483","read_bytes":"4009984","write_ops":"8","write_bytes":"2048"},{"name":"p2","id":"2","size_bytes":"0","size_kb":"0","num_objects":"1","num_object_clones":"0","num_object_copies":"3","num_objects_missing_on_primary":"0","num_objects_unfound":"0","num_objects_degraded":"2","read_ops":"0","read_bytes":"0","write_ops":"2","write_bytes":"0"}],"total_objects":"5","total_used":"74284","total_avail":"31360428","total_space":"31434712"}

ceph -s --cluster c1
    cluster ef7329fe-01e5-4b60-8427-71112db95c9d
     health HEALTH_WARN
            256 pgs degraded
            256 pgs stuck unclean
            256 pgs undersized
            recovery 10/15 objects degraded (66.667%)
     monmap e1: 1 mons at {dhcp41-235=10.70.41.235:6789/0}
            election epoch 3, quorum 0 dhcp41-235
     osdmap e37: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v700: 256 pgs, 2 pools, 114 bytes data, 5 objects
            74284 kB used, 30625 MB / 30697 MB avail
            10/15 objects degraded (66.667%)
                 256 active+undersized+degraded

Comment 2 Ken Dreyer (Red Hat) 2016-07-20 13:40:30 UTC
The builds listed above are pretty old. Please confirm that this is still happening with the latest builds (10.2.2-24.el7cp).

Comment 3 Samuel Just 2016-07-20 14:45:13 UTC
It's actually OK for there to be more degraded objects than objects: if the pool is configured for 4 replicas but you only have 2, each object is degraded twice. This appears to have happened with 2 OSDs and pool size=3, however, so it seems like each object should have been degraded only once. Possibly a bug in constructing the stats.
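
For reference, applying that arithmetic to the numbers in the description (pool size=3 with 2 OSDs up, per the ceph -s output), the expected counts would be:

    expected degraded = num_objects x (size - replicas present)
    p1:      4 x (3 - 2) = 4        (rados df reports 8)
    p2:      1 x (3 - 2) = 1        (rados df reports 2)
    overall: 5 x (3 - 2) = 5 of 15  (ceph -s reports 10/15, 66.667%)

Every reported value is exactly double the expected one, which points at the stats construction rather than at objects actually missing two replicas each.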

I do not think this should be a 2.0 blocker.

Comment 4 Josh Durgin 2016-07-23 01:07:29 UTC
I haven't been able to reproduce this myself. One possible cause would be PGs getting mapped to a smaller acting set than expected, as can happen with older crush tunables.
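
A quick way to check that theory with standard commands (the PG id 1.0 is a placeholder; any PG in the affected pool works):

ceph osd crush show-tunables --cluster c1
ceph pg map 1.0 --cluster c1    # shows the up and acting OSD sets for the PG
ceph pg dump --cluster c1 | grep degraded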

Can you reproduce with osd debugging (debug osd = 20, debug ms = 1) enabled and post the osd logs and output of 'ceph pg dump'?
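
For completeness, a sketch of one way to enable those debug levels, either persistently or injected at runtime:

# In the [osd] section of the cluster's ceph.conf, then restart the OSDs:
[osd]
debug osd = 20
debug ms = 1

# Or inject at runtime without a restart:
ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1' --cluster c1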

Comment 5 Samuel Just 2016-09-21 18:44:03 UTC
Closing on the assumption that this is just the normal behavior, absent any other information.