Red Hat Bugzilla – Bug 1479522
[10.2.7-31.el7cp] osd set recovery delete tests failing
Last modified: 2017-08-08 17:04:44 EDT
Description of problem:
I have cherry-picked the following recovery test to run on RHCeph 2.4,
but the test facet fails wherever that yaml occurs. I am not sure whether the build has the fixes.
2017-08-07T18:14:14.803 INFO:teuthology.orchestra.run.pluto002:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph health'
2017-08-07T18:14:15.015 INFO:teuthology.misc.health.pluto002.stdout:HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
2017-08-07T18:14:15.016 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; 54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; 1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0
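For readers triaging a log like this, the health summary is easier to scan with one warning per line. The following is a minimal sketch, assuming only the layout visible in the log above (a status word followed by semicolon-separated details); `parse_health` is a hypothetical helper, not part of teuthology or ceph.

```python
def parse_health(line):
    """Split a `ceph health` summary line into (status, [detail, ...]).

    Assumes the layout seen in the log above: a status word, then
    details separated by ';'. This is an illustration, not a ceph API.
    """
    status, _, rest = line.strip().partition(" ")
    details = [d.strip() for d in rest.split(";") if d.strip()]
    return status, details


# The summary line copied from the log in this report.
line = ("HEALTH_WARN 54 pgs degraded; 4 pgs stale; 12 pgs stuck unclean; "
        "54 pgs undersized; pool rbd pg_num 204 > pgp_num 154; "
        "1/6 in osds are down; mon.a has mon_osd_down_out_interval set to 0")

status, details = parse_health(line)
print(status)
for detail in details:
    print("  " + detail)
```

Listed this way, the `pg_num 204 > pgp_num 154` and `1/6 in osds are down` entries stand out as signs that thrashing was still in progress when health was checked.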
This appears to be a test configuration issue. The test continues to thrash the cluster configuration (OSDs up/down, pg num, etc.) but then times out because the cluster has not gone clean 15 minutes after other work has ceased.
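To illustrate the failure mode, here is a rough sketch of a poll-with-deadline loop. It is not teuthology's actual implementation: `poll_until_ok`, its parameters, and the use of a `HEALTH_OK` prefix as the success condition are all assumptions made for the example. The point is that if something keeps disturbing the cluster while the loop runs, the deadline is reached without success.

```python
import time


def poll_until_ok(get_health, timeout=15 * 60, interval=10,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll get_health() until it reports HEALTH_OK or the timeout expires.

    Hypothetical helper: get_health is any callable returning a health
    summary string. Returns True on success, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_health().startswith("HEALTH_OK"):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# A cluster that is still being thrashed never reports HEALTH_OK.
fake = FakeClock()
result = poll_until_ok(lambda: "HEALTH_WARN 54 pgs degraded",
                       clock=fake, sleep=fake.sleep)
print(result, fake.now)
```

The clock and sleep functions are injected only so the timeout path can be exercised without waiting 15 real minutes.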
If I compare the config.yaml of an upstream run against the downstream, they look very different. On upstream the thrash_osds segment is near the end of the config (which indicates the order the tasks are processed in); on downstream the full_sequential_finally stanza follows it. (Their order is inverted.)
Compare http://qa-proxy.ceph.com/teuthology/jcollin-2017-08-08_02:40:19-rados-wip-jcollin-testing_08-08-2017-distro-basic-smithi/1494927/config.yaml and http://magna002.ceph.redhat.com/vasu-2017-08-07_16:34:45-rados:thrash-jewel---basic-multi/271477/config.yaml
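The ordering comparison can also be done mechanically. The sketch below assumes the config has already been loaded into a Python dict whose `tasks` key is a list of single-key mappings; the two sample dicts are made-up stand-ins that only mirror the ordering described above, not the contents of the linked files. `task_order` and `runs_before` are hypothetical helpers.

```python
def task_order(config):
    """Return the top-level task names in the order they are listed."""
    return [name for entry in config.get("tasks", []) for name in entry]


def runs_before(config, first, second):
    """True if `first` is listed before `second` (ValueError if either is missing)."""
    order = task_order(config)
    return order.index(first) < order.index(second)


# Illustrative stand-ins only: thrash_osds last (as described for upstream)
# versus full_sequential_finally following it (as described for downstream).
upstream_like = {"tasks": [{"install": None}, {"ceph": None},
                           {"full_sequential_finally": []},
                           {"thrash_osds": None}]}
downstream_like = {"tasks": [{"install": None}, {"ceph": None},
                             {"thrash_osds": None},
                             {"full_sequential_finally": []}]}

for label, cfg in (("upstream-like", upstream_like),
                   ("downstream-like", downstream_like)):
    print(label, task_order(cfg),
          runs_before(cfg, "full_sequential_finally", "thrash_osds"))
```

This only inspects top-level entries; tasks nested inside other stanzas would need a recursive walk.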
So it looks to me like downstream teuthology has a broken implementation of this task. I'd look into that. :)
Or possibly just the config file is broken. I just realized you were pointing at an upstream config, so I don't know what the downstream fragment really looks like.
Thanks for your help, Greg. The cherry-pick as applied caused the task to appear after thrash; I will fix that and rerun.