
Bug 1358275

Summary: Rados df gives wrong degraded object count
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RADOS
Version: 2.0
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: unspecified
Reporter: anmol babu <anbabu>
Assignee: Josh Durgin <jdurgin>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: anbabu, ceph-eng-bugs, dzafman, kchai, kdreyer, sjust
Target Milestone: rc
Target Release: 2.1
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2016-09-21 18:44:03 UTC
Bug Blocks: 1349913

Description anmol babu 2016-07-20 12:13:38 UTC
Description of problem:
The rados df CLI command reports a degraded object count greater than the total number of objects when some of the PGs are degraded.

Version-Release number of selected component (if applicable):

Mon rpms:
rpm -qa|grep ceph
ceph-selinux-10.2.2-5.el7cp.x86_64
python-cephfs-10.2.2-5.el7cp.x86_64
ceph-common-10.2.2-5.el7cp.x86_64
ceph-base-10.2.2-5.el7cp.x86_64
libcephfs1-10.2.2-5.el7cp.x86_64
ceph-mon-10.2.2-5.el7cp.x86_64

OSD rpms:
rpm -qa|grep ceph
ceph-selinux-10.2.2-9.el7cp.x86_64
ceph-common-10.2.2-9.el7cp.x86_64
ceph-base-10.2.2-9.el7cp.x86_64
libcephfs1-10.2.2-9.el7cp.x86_64
python-cephfs-10.2.2-9.el7cp.x86_64
ceph-osd-10.2.2-9.el7cp.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Create a pool and write several objects to it (e.g. 4 objects).
2. Remove OSDs so that fewer OSDs remain than the pool's replica count requires.
3. Create another object and check the degraded count (a command sketch follows below).
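
A minimal command sketch of these steps, assuming the cluster name c1 seen in the output below and a 3-replica pool; the pool name p1, the object names, the OSD id, and the payload file are placeholders:

# 1. Create a 3-replica pool and write a few objects to it.
ceph osd pool create p1 128 128 --cluster c1
ceph osd pool set p1 size 3 --cluster c1
for i in 1 2 3 4; do rados -p p1 put obj$i /etc/hosts --cluster c1; done

# 2. Stop an OSD so fewer OSDs remain than size=3 requires (osd.2 is a placeholder).
ceph osd out osd.2 --cluster c1
systemctl stop ceph-osd@2

# 3. Write one more object, then compare the degraded and total object counts.
rados -p p1 put obj5 /etc/hosts --cluster c1
rados df --cluster c1 --format json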

Actual results:
degraded object count > total object count

Expected results:
Degraded object count should not be more than total object count.

Additional info:

rados df --cluster c1 --format json
{"pools":[{"name":"p1","id":"1","size_bytes":"114","size_kb":"1","num_objects":"4","num_object_clones":"0","num_object_copies":"12","num_objects_missing_on_primary":"0","num_objects_unfound":"0","num_objects_degraded":"8","read_ops":"5483","read_bytes":"4009984","write_ops":"8","write_bytes":"2048"},{"name":"p2","id":"2","size_bytes":"0","size_kb":"0","num_objects":"1","num_object_clones":"0","num_object_copies":"3","num_objects_missing_on_primary":"0","num_objects_unfound":"0","num_objects_degraded":"2","read_ops":"0","read_bytes":"0","write_ops":"2","write_bytes":"0"}],"total_objects":"5","total_used":"74284","total_avail":"31360428","total_space":"31434712"}

ceph -s --cluster c1
    cluster ef7329fe-01e5-4b60-8427-71112db95c9d
     health HEALTH_WARN
            256 pgs degraded
            256 pgs stuck unclean
            256 pgs undersized
            recovery 10/15 objects degraded (66.667%)
     monmap e1: 1 mons at {dhcp41-235=10.70.41.235:6789/0}
            election epoch 3, quorum 0 dhcp41-235
     osdmap e37: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v700: 256 pgs, 2 pools, 114 bytes data, 5 objects
            74284 kB used, 30625 MB / 30697 MB avail
            10/15 objects degraded (66.667%)
                 256 active+undersized+degraded

Comment 2 Ken Dreyer (Red Hat) 2016-07-20 13:40:30 UTC
The builds listed above are pretty old. Please confirm that this is still happening with the latest builds (10.2.2-24.el7cp).

Comment 3 Samuel Just 2016-07-20 14:45:13 UTC
It's actually OK for there to be more degraded objects than objects: if the pool is configured for 4 replicas but you only have 2, each object is degraded twice. This appears to have happened with 2 OSDs and pool size=3, however, so it seems like each object should have been degraded only once. Possibly a bug in constructing the stats.
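
For reference, applying that arithmetic to the numbers in the description (pool size=3 with 2 OSDs up, per the ceph -s output), the expected counts would be:

    expected degraded = num_objects x (size - replicas present)
    p1:      4 x (3 - 2) = 4        (rados df reports 8)
    p2:      1 x (3 - 2) = 1        (rados df reports 2)
    overall: 5 x (3 - 2) = 5 of 15  (ceph -s reports 10/15, 66.667%)

Every reported value is exactly double the expected one, which points at the stats construction rather than at objects actually missing two replicas each.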

I do not think this should be a 2.0 blocker.

Comment 4 Josh Durgin 2016-07-23 01:07:29 UTC
I haven't been able to reproduce this myself. One possible cause would be PGs getting mapped to a smaller acting set than expected, as can happen with older crush tunables.
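
A quick way to check that theory with standard commands (the PG id 1.0 is a placeholder; any PG in the affected pool works):

ceph osd crush show-tunables --cluster c1
ceph pg map 1.0 --cluster c1    # shows the up and acting OSD sets for the PG
ceph pg dump --cluster c1 | grep degraded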

Can you reproduce with osd debugging (debug osd = 20, debug ms = 1) enabled and post the osd logs and output of 'ceph pg dump'?
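
For completeness, a sketch of one way to enable those debug levels, either persistently or injected at runtime:

# In the [osd] section of the cluster's ceph.conf, then restart the OSDs:
[osd]
debug osd = 20
debug ms = 1

# Or inject at runtime without a restart:
ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1' --cluster c1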

Comment 5 Samuel Just 2016-09-21 18:44:03 UTC
Closing on the assumption that this is just the normal behavior, absent any other information.