Bug 1194828

Summary: rados_bench tests pass on 7.1 and fail on 6.6
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Warren <wusui>
Component: RADOS Assignee: Samuel Just <sjust>
Status: CLOSED CURRENTRELEASE QA Contact: Warren <wusui>
Severity: high Docs Contact:
Priority: unspecified    
Version: 1.2.3 CC: ceph-eng-bugs, dzafman, icolle, kchai, kdreyer, sgraf, wusui
Target Milestone: rc   
Target Release: 1.2.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2015-10-05 22:57:01 UTC Type: Bug

Description Warren 2015-02-20 20:57:32 UTC
Description of problem:
The rados bench test fails on 6.6. It appears that 'sudo ceph osd crush tunables default' is timing out.

Version-Release number of selected component (if applicable):
6.6

How reproducible:
100% of the time

Steps to Reproduce:
1. Run teuthology using the following yaml file.
--------------------------------------------------
interactive-on-error: true
roles:
- [mon.a, osd.0, osd.1]
- [mon.b, mon.c, osd.2, osd.3]
- [client.0]
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject internal delays: 0.002
        ms inject socket failures: 2500
tasks:
- install.ship_utilities: null
- ceph:
    branch: firefly
    fs: btrfs
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    time: 1800
--------------------------------------------------

Actual results:
2015-02-20 13:38:48,619.619 INFO:teuthology.orchestra.run.magna106.stderr:2015-02-20 10:38:48.618505 7f0e63ed9700  0 librados: client.admin authentication error (110) Connection timed out
2015-02-20 13:38:48,640.640 INFO:teuthology.orchestra.run.magna106.stderr:Error connecting to cluster: TimedOut
2015-02-20 13:38:48,655.655 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/wusui/teuthology/teuthology/contextutil.py", line 28, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/wusui/ceph-qa-suite/tasks/ceph.py", line 156, in crush_setup
    args=['sudo', 'ceph', 'osd', 'crush', 'tunables', profile])
  File "/home/wusui/teuthology/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/wusui/teuthology/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/wusui/teuthology/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed on magna106 with status 1: 'sudo ceph osd crush tunables default'
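
A connection failure like the one in the log above can be checked outside teuthology by probing a monitor's TCP port directly from the client node. The following is a minimal sketch, not part of the test suite: `port_reachable` is a hypothetical helper, and it assumes the monitor listens on Ceph's default monitor port, 6789.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability
    and says nothing about Ceph authentication.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused" / "host unreachable") on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_reachable("mon-host.example", 6789)` (the hostname is a placeholder) returning False from the client node while the monitor process is running would point at a network or firewall problem rather than at Ceph itself.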



Expected results:

The teuthology run should pass.

Additional info:

The teuthology run passes on 7.1

Comment 1 Samuel Just 2015-02-24 01:24:39 UTC
This test doesn't appear to have the ceph install portion.  Can you reproduce from a clean machine on 6.6?

Comment 2 Warren 2015-02-28 02:43:26 UTC
The cause of this problem is explained in bug 1197287.  On 6.6, for some reason, the initial iptables rules looked like this:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED 
ACCEPT     icmp --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh 
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited 

This caused a 'sudo ceph osd crush tunables default' command to time out.

Running iptables -F fixed this problem.
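
The check for this condition can be sketched in Python. This is only an illustration under stated assumptions: `catch_all_rejects` is a hypothetical helper that scans the plain `iptables -L` text format shown above and reports catch-all REJECT rules per chain. It does not evaluate the earlier ACCEPT rules, because the non-verbose listing omits details such as the interface a rule is bound to, so an "ACCEPT all" line cannot be taken at face value.

```python
# Sample listing copied from the output quoted above.
SAMPLE = """\
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
"""


def catch_all_rejects(listing):
    """Map each chain name to the 1-based positions of its catch-all REJECT rules.

    A rule counts as catch-all when it rejects every protocol from any
    source to any destination.
    """
    result = {}
    chain = None
    position = 0
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            # Start of a new chain, e.g. "Chain INPUT (policy ACCEPT)".
            chain = fields[1]
            position = 0
            continue
        if fields[0] == "target":
            # Column header line.
            continue
        position += 1
        # Columns: target prot opt source destination [extra...]
        if (chain is not None
                and len(fields) >= 5
                and fields[0] == "REJECT"
                and fields[1] == "all"
                and fields[3] == "anywhere"
                and fields[4] == "anywhere"):
            result.setdefault(chain, []).append(position)
    return result
```

Calling `catch_all_rejects(SAMPLE)` flags the last INPUT rule and the only FORWARD rule. Since the only port-specific ACCEPT in the listing is for ssh, new connections to other ports would plausibly fall through to that final REJECT.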

Comment 4 Ian Colle 2015-03-02 17:24:03 UTC
Can we close this with the doc fixes to 1197287?

Comment 6 Warren 2015-03-05 01:38:59 UTC
The iptables -F workaround fixes this issue.

Comment 7 Warren 2015-03-05 01:44:02 UTC
Sorry.  The text of John's iptables instructions is not quite what I expected.  I will do what he says here, and if it works, I will verify both this bug and 1197287.

Comment 9 Warren 2015-03-06 04:15:39 UTC
I can get this to pass by cleaning the iptables.  I think for this release we can go with the documented change (which is still iffy).