Bug 1479522 - [10.2.7-31.el7cp] osd set recovery delete tests failing
Status: CLOSED NOTABUG
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 3.0
Assigned To: Vasu Kulkarni
QA Contact: ceph-qe-bugs
Depends On:
Blocks:
 
Reported: 2017-08-08 13:18 EDT by Vasu Kulkarni
Modified: 2017-08-08 17:04 EDT
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-08 17:04:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Vasu Kulkarni 2017-08-08 13:18:43 EDT
Description of problem:

I have cherry-picked the following recovery test to run on RHCeph 2.4:
https://github.com/jdurgin/ceph/commit/205c45ff09858e6fc2046c8627abf08a2a2dee20

but for some reason the test facet is failing wherever that yaml occurs. I am not sure whether the build has the relevant fixes.
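
For context, the cherry-picked change adds a yaml facet that teuthology merges into the rados:thrash jobs it applies to. A minimal sketch of the general shape of such a thrash facet (illustrative only, not the exact contents of that commit):

# Illustrative thrash facet; the actual fragment added by the cherry-pick may differ.
overrides:
  ceph:
    conf:
      osd:
        osd debug reject backfill probability: .3
tasks:
- thrashosds:
    # thrashosds options as used in the standard default thrasher facet
    timeout: 1200
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1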

Logs:

http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/teuthology.log

2017-08-07T18:14:14.803 INFO:teuthology.orchestra.run.pluto002:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph health'
2017-08-07T18:14:15.015 INFO:teuthology.misc.health.pluto002.stdout:HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
2017-08-07T18:14:15.016 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
Comment 2 Greg Farnum 2017-08-08 16:34:08 EDT
This appears to be a test configuration issue. The run keeps thrashing the cluster (marking OSDs up/down, changing pg_num, etc.) but then times out because the cluster has not gone clean within 15 minutes of the other work ceasing.

If I compare the config.yaml of an upstream run against the downstream one, they look very different. Upstream, the thrashosds segment is near the end of the config (the order in the file reflects the order the tasks are processed in); downstream, the full_sequential_finally stanza follows it, so the order of the two is inverted.
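
To make the ordering concrete, a minimal sketch of the two task orderings (illustrative only; the actual config.yaml files linked below contain the full task lists, and the nested contents of full_sequential_finally are elided here):

# Upstream-style ordering: thrashosds near the end of the task list
tasks:
- install:
- ceph:
- full_sequential_finally:
    # (finalization tasks elided)
- thrashosds:
    timeout: 1200

# Downstream ordering in the failing run: the two are inverted
tasks:
- install:
- ceph:
- thrashosds:
    timeout: 1200
- full_sequential_finally:
    # (finalization tasks elided)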

Compare http://qa-proxy.ceph.com/teuthology/jcollin-2017-08-08_02:40:19-rados-wip-jcollin-testing_08-08-2017-distro-basic-smithi/1494927/config.yaml and http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/config.yaml

So it looks to me like downstream teuthology has a broken implementation of this task. I'd look into that. :)
Comment 3 Greg Farnum 2017-08-08 16:35:18 EDT
Or possibly it's just that the config file is broken. I just realized you were pointing at an upstream config, so I don't know what the downstream fragment really looks like.
Comment 4 Vasu Kulkarni 2017-08-08 17:04:44 EDT
Thanks, Greg, for your help. The cherry-pick as applied caused the task to appear after the thrash task; I will fix the ordering and rerun.
