Description of problem: Smoke is broken for a few ceph-ansible tests, although ceph-ansible hasn't changed in this async build. Note these tests worked fine in 10.2.7-27.

2017-07-07T14:26:18.207 INFO:teuthology.orchestra.run.clara011.stdout:
2017-07-07T14:26:18.207 INFO:teuthology.orchestra.run.clara011.stdout:ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)
2017-07-07T14:26:18.207 INFO:teuthology.orchestra.run.clara011.stdout:
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:TASK [ceph.ceph-common : is ceph running already?] *****************************
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:task path: /home/ubuntu/ceph-ansible/roles/ceph-common/tasks/facts.yml:11
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:ok: [clara010.ceph.redhat.com -> clara011.ceph.redhat.com] => {
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:    "changed": false,
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:    "cmd": [
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:        "ceph",
2017-07-07T14:26:18.208 INFO:teuthology.orchestra.run.clara011.stdout:        "--connect-timeout",
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:        "3",
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:        "--cluster",
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:        "ceph",
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:        "fsid"
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:    ],
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:    "delta": "0:00:00.167942",
2017-07-07T14:26:18.209 INFO:teuthology.orchestra.run.clara011.stdout:    "end": "2017-07-07 18:26:04.504175",
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:    "failed": false,
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:    "failed_when_result": false,
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:    "rc": 1,
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:    "start": "2017-07-07 18:26:04.336233",
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:    "warnings": []
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:}
2017-07-07T14:26:18.210 INFO:teuthology.orchestra.run.clara011.stdout:
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:STDERR:
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:2017-07-07 18:26:04.464323 7fbf1f66b700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:2017-07-07 18:26:04.465808 7fbf1f66b700 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:2017-07-07 18:26:04.465821 7fbf1f66b700 0 librados: client.admin authentication error (95) Operation not supported
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:Error connecting to cluster: Error
2017-07-07T14:26:18.211 INFO:teuthology.orchestra.run.clara011.stdout:ok: [pluto008.ceph.redhat.com -> clara011.ceph.redhat.com] => {
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:    "changed": false,
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:    "cmd": [
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:        "ceph",
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:        "--connect-timeout",
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:        "3",
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:        "--cluster",
2017-07-07T14:26:18.212 INFO:teuthology.orchestra.run.clara011.stdout:        "ceph",
2017-07-07T14:26:18.213 INFO:teuthology.orchestra.run.clara011.stdout:        "fsid"
2017-07-07T14:26:18.213 INFO:teuthology.orchestra.run.clara011.stdout:    ],
2017-07-07T14:26:18.213 INFO:teuthology.orchestra.run.clara011.stdout:    "delta": "0:00:03.107497",
2017-07-07T14:26:18.213 INFO:teuthology.orchestra.run.clara011.stdout:    "end": "2017-07-07 18:26:07.443729",
2017-07-07T14:26:18.213 INFO:teuthology.orchestra.run.clara011.stdout:    "failed": false,
2017-07-07T14:26:18.213 INFO:teuthology.orchestra.run.clara011.stdout:    "failed_when_result": false,
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:    "rc": 1,
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:    "start": "2017-07-07 18:26:04.336232",
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:    "warnings": []
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:}
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:STDERR:
2017-07-07T14:26:18.214 INFO:teuthology.orchestra.run.clara011.stdout:
2017-07-07T14:26:18.215 INFO:teuthology.orchestra.run.clara011.stdout:2017-07-07 18:26:04.464324 7fd7a3a98700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2017-07-07T14:26:18.215 INFO:teuthology.orchestra.run.clara011.stdout:2017-07-07 18:26:04.464922 7fd7a0320700 0 -- :/2038272039 >> 10.8.129.11:6789/0 pipe(0x7fd79c05dd40 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd79c05f000).fault
2017-07-07T14:26:18.215 INFO:teuthology.orchestra.run.clara011.stdout:Traceback (most recent call last):
2017-07-07T14:26:18.215 INFO:teuthology.orchestra.run.clara011.stdout:  File "/bin/ceph", line 948, in <module>
2017-07-07T14:26:18.215 INFO:teuthology.orchestra.run.clara011.stdout:    retval = main()
2017-07-07T14:26:18.215 INFO:teuthology.orchestra.run.clara011.stdout:  File "/bin/ceph", line 852, in main
2017-07-07T14:26:18.216 INFO:teuthology.orchestra.run.clara011.stdout:    prefix='get_command_descriptions')
2017-07-07T14:26:18.216 INFO:teuthology.orchestra.run.clara011.stdout:  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 1300, in json_command
2017-07-07T14:26:18.216 INFO:teuthology.orchestra.run.clara011.stdout:    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
2017-07-07T14:26:18.216 INFO:teuthology.orchestra.run.clara011.stdout:RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.

Full logs at: http://magna002.ceph.redhat.com/vasu-2017-07-07_13:17:54-smoke-jewel---basic-multi/270347/teuthology.log
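For context on the log above: the "is ceph running already?" task runs `ceph --connect-timeout 3 --cluster ceph fsid` and deliberately tolerates a non-zero exit, which is why the output shows "rc": 1 alongside "failed_when_result": false. A minimal Python sketch of that check (an illustration, not ceph-ansible's actual implementation, which is an Ansible task in facts.yml):

```python
import subprocess

def cluster_fsid(cluster="ceph", timeout_s=3):
    """Return the cluster fsid, or None if the cluster is not up yet.

    Mirrors the shape of ceph-ansible's check: a non-zero return code
    from `ceph fsid` is treated as "ceph not running already", not as
    a hard failure.
    """
    try:
        result = subprocess.run(
            ["ceph", "--connect-timeout", str(timeout_s),
             "--cluster", cluster, "fsid"],
            capture_output=True, text=True)
    except FileNotFoundError:
        return None  # no ceph binary installed on this node at all
    if result.returncode != 0:
        return None  # ceph present but the cluster is not reachable
    return result.stdout.strip()
```

On a node before deployment (no keyring, no mons reachable) this returns None, matching the rc=1 results recorded in the teuthology log.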
Steps for someone who wants to recreate:

a) Inventory

[clients]
pluto008.ceph.redhat.com devices='["/dev/sdb", "/dev/sdc", "/dev/sdd"]' monitor_interface='eno1' public_network='10.8.128.0/21'

[mons]
clara011.ceph.redhat.com devices='[]' monitor_interface='eno1' public_network='10.8.128.0/21'
clara012.ceph.redhat.com devices='[]' monitor_interface='eno1' public_network='10.8.128.0/21'
pluto009.ceph.redhat.com devices='[]' monitor_interface='eno1' public_network='10.8.128.0/21'

[osds]
clara010.ceph.redhat.com devices='["/dev/sdb", "/dev/sdc", "/dev/sdd"]' monitor_interface='eno1' public_network='10.8.128.0/21'
pluto008.ceph.redhat.com devices='["/dev/sdb", "/dev/sdc", "/dev/sdd"]' monitor_interface='eno1' public_network='10.8.128.0/21'

b) group_vars/all

ceph_conf_overrides:
  global:
    osd_default_pool_size: 2
    osd_pool_default_pg_num: 128
    osd_pool_default_pgp_num: 128
ceph_origin: distro
ceph_stable: true
ceph_stable_rh_storage: true
ceph_test: true
journal_collocation: true
journal_size: 1024
osd_auto_discovery: false

c) ansible-playbook -vv -i inven.yml site.yml
A different traceback, where it failed to start the radosgw instance: https://paste.fedoraproject.org/paste/262EwD6cXG8Io9Awl58kMQ/raw
Another instance where it failed to start mon: https://paste.fedoraproject.org/paste/3IY9D5y5bOA-Nn7U2Bk9WA/raw
Seb is out on PTO for the next two weeks - Andrew, please take a look.
Upstream PR https://github.com/ceph/ceph-ansible/pull/1666
This looks like you have incorrectly configured the network addresses to be

clara010.ceph.redhat.com devices='["/dev/sdb", "/dev/sdc", "/dev/sdd"]' monitor_interface='eno1' public_network='10.8.128.0/21'
pluto008.ceph.redhat.com devices='["/dev/sdb", "/dev/sdc", "/dev/sdd"]' monitor_interface='eno1' public_network='10.8.128.0/21'

instead of what they actually seem to be, 10.8.129.0/21. Would you please re-test with the correct config?

[ubuntu@pluto009 ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 0c:c4:7a:6e:db:58 brd ff:ff:ff:ff:ff:ff
    inet 10.8.129.109/21 brd 10.8.135.255 scope global dynamic eno1
       valid_lft 27538sec preferred_lft 27538sec
    inet6 2620:52:0:880:ec4:7aff:fe6e:db58/64 scope global noprefixroute dynamic
       valid_lft 2591572sec preferred_lft 604372sec
    inet6 fe80::ec4:7aff:fe6e:db58/64 scope link
       valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 0c:c4:7a:6e:db:59 brd ff:ff:ff:ff:ff:ff
[ubuntu@pluto009 ~]$ logout
Connection to pluto009 closed.
gmeno@magna002:~$ ssh ubuntu@clara010
Warning: Permanently added 'clara010,10.8.129.10' (ECDSA) to the list of known hosts.
Last login: Tue Jul 11 12:48:27 2017 from pluto010.ceph.redhat.com
[ubuntu@clara010 ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 0c:c4:7a:6c:69:1c brd ff:ff:ff:ff:ff:ff
    inet 10.8.129.10/21 brd 10.8.135.255 scope global dynamic eno1
       valid_lft 38060sec preferred_lft 38060sec
    inet6 2620:52:0:880:ec4:7aff:fe6c:691c/64 scope global noprefixroute dynamic
       valid_lft 2591955sec preferred_lft 604755sec
    inet6 fe80::ec4:7aff:fe6c:691c/64 scope link
       valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 0c:c4:7a:6c:69:1d brd ff:ff:ff:ff:ff:ff
Gregory, the /21 pointed out by David does correctly address the IP range in this network: 10.8.128.0/21 spans 10.8.128.0 through 10.8.135.255 (usable hosts 10.8.128.1-10.8.135.254), which includes the 10.8.129.x addresses seen on the nodes. So the inventory file is correct.
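The subnet arithmetic is easy to double-check with Python's stdlib ipaddress module: the addresses observed in `ip addr` on the nodes (10.8.129.109 on pluto009, 10.8.129.10 on clara010) fall inside 10.8.128.0/21, and 10.8.129.0/21 would not even be a valid network boundary, since a /21 network address must fall on an 8-address-block multiple of the third octet:

```python
import ipaddress

# The /21 leaves 11 host bits: the block runs 10.8.128.0 - 10.8.135.255.
net = ipaddress.ip_network("10.8.128.0/21")
print(net.network_address, "-", net.broadcast_address)

# Addresses seen on the nodes are inside the configured public_network.
for host in ("10.8.129.109", "10.8.129.10"):
    assert ipaddress.ip_address(host) in net

# 10.8.129.0/21 is rejected: 10.8.129.0 has host bits set for a /21 mask.
try:
    ipaddress.ip_network("10.8.129.0/21")
except ValueError as err:
    print(err)
```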
Guillaume, please investigate this as top priority tomorrow. Thank you.
I tried several times to reproduce your issue with a similar environment, but I couldn't:

PLAY RECAP *****************************************************************
clara010.ceph.redhat.com : ok=56 changed=14 unreachable=0 failed=0
clara011.ceph.redhat.com : ok=57 changed=16 unreachable=0 failed=0
clara012.ceph.redhat.com : ok=52 changed=16 unreachable=0 failed=0
pluto008.ceph.redhat.com : ok=93 changed=16 unreachable=0 failed=0
pluto009.ceph.redhat.com : ok=53 changed=17 unreachable=0 failed=0

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Using the repo you provided at https://paste.fedoraproject.org/paste/reZNMtQ7Dl8NUkGPq8yIwg/raw

# ceph --version
ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)

# rpm -qi ceph-ansible
Name    : ceph-ansible
Version : 2.2.11
Release : 1.el7scon

# cat group_vars/all.yml
ceph_conf_overrides:
  global:
    osd_default_pool_size: 2
    osd_pool_default_pg_num: 128
    osd_pool_default_pgp_num: 128
ceph_origin: distro
ceph_stable: true
ceph_stable_rh_storage: true
ceph_test: true
journal_collocation: true
journal_size: 1024
osd_auto_discovery: false

# cat hosts
[clients]
pluto008.ceph.redhat.com ansible_ssh_host='192.168.121.113' ansible_ssh_user='vagrant' devices='["/dev/sdb", "/dev/sdc", "/dev/sdd"]' monitor_interface='eth1' public_network='192.168.91.0/24'

[mons]
clara011.ceph.redhat.com ansible_ssh_host='192.168.121.112' ansible_ssh_user='vagrant' devices='[]' monitor_interface='eth1' public_network='192.168.91.0/24'
clara012.ceph.redhat.com ansible_ssh_host='192.168.121.38' ansible_ssh_user='vagrant' devices='[]' monitor_interface='eth1' public_network='192.168.91.0/24'
pluto009.ceph.redhat.com ansible_ssh_host='192.168.121.226' ansible_ssh_user='vagrant' devices='[]' monitor_interface='eth1' public_network='192.168.91.0/24'

[osds]
clara010.ceph.redhat.com ansible_ssh_host='192.168.121.17' ansible_ssh_user='vagrant' devices='["/dev/sda", "/dev/sdb", "/dev/sdc"]' monitor_interface='eth1' public_network='192.168.91.0/24'
pluto008.ceph.redhat.com ansible_ssh_host='192.168.121.113' ansible_ssh_user='vagrant' devices='["/dev/sda", "/dev/sdb", "/dev/sdc"]' monitor_interface='eth1' public_network='192.168.91.0/24'

You can find the playbook log attached. Are you hitting this issue for every deployment you try with all these parameters, or does it happen 'randomly'?
Thanks, Guillaume, for trying the exact steps. I will look in detail at what is causing this in the regression runs; I now suspect something stale from another test is probably causing this :(