Bug 1479522 - [10.2.7-31.el7cp] osd set recovery delete tests failing
Status: CLOSED NOTABUG
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 3.0
Assigned To: Vasu Kulkarni
QA Contact: ceph-qe-bugs
Depends On:
Blocks:
 
Reported: 2017-08-08 13:18 EDT by Vasu Kulkarni
Modified: 2017-08-08 17:04 EDT
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-08 17:04:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Vasu Kulkarni 2017-08-08 13:18:43 EDT
Description of problem:

I have cherry-picked the following recovery test to run on RHCeph 2.4:
https://github.com/jdurgin/ceph/commit/205c45ff09858e6fc2046c8627abf08a2a2dee20

but for some reason the test facet is failing wherever that yaml occurs. I am not sure whether the build has the relevant fixes.
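
For context, the cherry-picked change adds a yaml facet that teuthology merges into the rados:thrash jobs it applies to. A minimal sketch of the general shape of such a thrash facet (illustrative only, not the exact contents of that commit):

# Illustrative thrash facet; the actual fragment added by the cherry-pick may differ.
overrides:
  ceph:
    conf:
      osd:
        osd debug reject backfill probability: .3
tasks:
- thrashosds:
    # thrashosds options as used in the standard default thrasher facet
    timeout: 1200
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1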

Logs:

http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/teuthology.log

2017-08-07T18:14:14.803 INFO:teuthology.orchestra.run.pluto002:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph health'
2017-08-07T18:14:15.015 INFO:teuthology.misc.health.pluto002.stdout:HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
2017-08-07T18:14:15.016 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
Comment 2 Greg Farnum 2017-08-08 16:34:08 EDT
This appears to be a test configuration issue. The run keeps thrashing the cluster (marking OSDs up/down, changing pg_num, etc.) but then times out because the cluster has not gone clean within 15 minutes of the other work ceasing.

If I compare the config.yaml of an upstream run against the downstream one, they look very different. Upstream, the thrashosds segment is near the end of the config (the order in the file reflects the order the tasks are processed in); downstream, the full_sequential_finally stanza follows it, so the order of the two is inverted.
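
To make the ordering concrete, a minimal sketch of the two task orderings (illustrative only; the actual config.yaml files linked below contain the full task lists, and the nested contents of full_sequential_finally are elided here):

# Upstream-style ordering: thrashosds near the end of the task list
tasks:
- install:
- ceph:
- full_sequential_finally:
    # (finalization tasks elided)
- thrashosds:
    timeout: 1200

# Downstream ordering in the failing run: the two are inverted
tasks:
- install:
- ceph:
- thrashosds:
    timeout: 1200
- full_sequential_finally:
    # (finalization tasks elided)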

Compare http://qa-proxy.ceph.com/teuthology/jcollin-2017-08-08_02:40:19-rados-wip-jcollin-testing_08-08-2017-distro-basic-smithi/1494927/config.yaml and http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/config.yaml

So it looks to me like downstream teuthology has a broken implementation of this task. I'd look into that. :)
Comment 3 Greg Farnum 2017-08-08 16:35:18 EDT
Or possibly it's just that the config file is broken. I just realized you were pointing at an upstream config, so I don't know what the downstream fragment really looks like.
Comment 4 Vasu Kulkarni 2017-08-08 17:04:44 EDT
Thanks, Greg, for your help. The cherry-pick as applied caused the task to appear after the thrash task; I will fix the ordering and rerun.
