Bug 1479522

Summary: [10.2.7-31.el7cp] osd set recovery delete tests failing
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasu Kulkarni <vakulkar>
Component: RADOS
Assignee: Vasu Kulkarni <vakulkar>
Status: CLOSED NOTABUG
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 2.4
CC: ceph-eng-bugs, dzafman, gfarnum, kchai
Target Milestone: rc   
Target Release: 3.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-08-08 21:04:44 UTC
Type: Bug

Description Vasu Kulkarni 2017-08-08 17:18:43 UTC
Description of problem:

I cherry-picked the following recovery test to run on RHCeph 2.4:
https://github.com/jdurgin/ceph/commit/205c45ff09858e6fc2046c8627abf08a2a2dee20

but for some reason the test facet is failing wherever that yaml fragment occurs. I am not sure whether the build has the required fixes.

Logs:

http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/teuthology.log

2017-08-07T18:14:14.803 INFO:teuthology.orchestra.run.pluto002:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph health'
2017-08-07T18:14:15.015 INFO:teuthology.misc.health.pluto002.stdout:HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
2017-08-07T18:14:15.016 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0

Comment 2 Greg Farnum 2017-08-08 20:34:08 UTC
This appears to be a test configuration issue. The run keeps thrashing the cluster state (marking OSDs up/down, changing pg_num, etc.) and then times out because the cluster has not gone clean 15 minutes after the other work has ceased.

If I compare the config.yaml of an upstream run against the downstream one, they look very different. Upstream, the thrash_osds segment is near the end of the config (the order in the config is the order the tasks are processed in); downstream, the full_sequential_finally stanza follows it, i.e. the order of the two is inverted.

Compare http://qa-proxy.ceph.com/teuthology/jcollin-2017-08-08_02:40:19-rados-wip-jcollin-testing_08-08-2017-distro-basic-smithi/1494927/config.yaml and http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/config.yaml
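
To make the ordering concrete, here is a rough sketch of the upstream shape (the thrasher task is spelled thrashosds in the config). This is not copied from either config.yaml above; the task names (install, ceph, full_sequential_finally, exec, thrashosds, rados) are real teuthology/ceph-qa tasks, but the options and the exec command are placeholders:

tasks:
- install:
- ceph:
# Declared before thrashosds, so its sub-tasks only run on teardown,
# after the thrasher has already stopped and waited for the cluster
# to go clean.
- full_sequential_finally:
  - exec:
      mon.a:
      - ceph -s    # placeholder for the fragment's real final commands
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
- rados:
    clients: [client.0]
    ops: 4000
    objects: 50

In the downstream config the full_sequential_finally stanza instead comes after thrashosds, so (teuthology tasks being nested, last declared exits first) its final steps would run while the thrasher is still active, which would explain the behaviour above.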

So it looks to me like downstream teuthology has a broken implementation of this task. I'd look into that. :)

Comment 3 Greg Farnum 2017-08-08 20:35:18 UTC
Or possibly it's just that the config file is broken. I just realized you were pointing at an upstream config, so I don't know what the downstream fragment really looks like.

Comment 4 Vasu Kulkarni 2017-08-08 21:04:44 UTC
Thanks, Greg, for your help. The cherry-pick as applied caused the task to appear after the thrash task; I will fix that and rerun.