Description of problem:
A few OSDs are skipped during creation when using ceph-disk.

Version-Release number of selected component (if applicable):
3.2.0-0.1.rc8.el7cp

How reproducible:
Intermittent

    mon: 1 daemons, quorum pluto002
    mgr: pluto002(active)
    osd: 6 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   4 pools, 32 pgs
    objects: 187 objects, 1.09KiB
    usage:   3.00GiB used, 1.31TiB / 1.31TiB avail
    pgs:     32 active+clean

Of the 6 OSDs, only 3 are up and in. There are no obvious errors in the ceph-ansible output. In the logs we can see that in cluster 'c1' all OSDs are up, while in cluster 'c2' only 3 are up.

Additional info:

Parameters:

ceph_conf_overrides:
  global:
    osd crush chooseleaf type: 0
    osd default pool size: 3
    osd pool default pg num: 8
    osd pool default pgp num: 8
ceph_origin: distro
ceph_repository: rhcs
ceph_stable: true
ceph_stable_release: luminous
ceph_stable_rh_storage: true
ceph_test: true
journal_size: 1024
osd_auto_discovery: true
osd_scenario: collocated

Logs are at:
http://magna002.ceph.redhat.com/smanjara-2018-12-05_02:10:11-rgw:multisite-ansible-luminous-distro-basic-multi/314786/teuthology.log
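For anyone triaging, a quick way to compare the two clusters and to see which disks ceph-disk actually prepared (a hedged sketch; the cluster names 'c1' and 'c2' come from the report, but the exact nodes to run this on are an assumption):

# Compare OSD state per cluster; --cluster selects /etc/ceph/<name>.conf
ceph --cluster c1 osd tree
ceph --cluster c2 osd tree

# On an affected OSD node, list what ceph-disk prepared/activated
ceph-disk list

Disks that show up as prepared but not active in the ceph-disk list output would point at an activation failure rather than a preparation failure.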
This sounds like a udev race similar to https://bugzilla.redhat.com/show_bug.cgi?id=1654011
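If it is the same udev race, one possible manual workaround is to re-trigger the block-device udev events so the ceph-disk activation rules fire again (a sketch under that assumption; not confirmed against this environment):

# Re-run udev 'add' events for block devices and wait for them to settle
udevadm trigger --action=add --subsystem-match=block
udevadm settle

# Activate any OSD partitions that were prepared but never activated
ceph-disk activate-all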
This is the error:

2018-12-05T02:55:34.263 INFO:teuthology.orchestra.run.pluto002.stdout:TASK [ceph-osd : get osd ids] **************************************************
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:task path: /home/ubuntu/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:36
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:Wednesday 05 December 2018 07:51:11 +0000 (0:00:00.131) 0:07:22.541 ****
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:changed: [pluto002.ceph.redhat.com] => {
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:    "changed": true,
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "delta": "0:00:00.004367",
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "end": "2018-12-05 07:51:12.167107",
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "rc": 0,
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:    "start": "2018-12-05 07:51:12.162740"
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:}
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:STDOUT:
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:1
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:2
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:4
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:changed: [clara011.ceph.redhat.com] => {
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "changed": true,
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "delta": "0:00:00.006334",
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "end": "2018-12-05 07:51:12.857183",
2018-12-05T02:55:34.269 INFO:teuthology.orchestra.run.pluto002.stdout:    "rc": 0,
2018-12-05T02:55:34.269 INFO:teuthology.orchestra.run.pluto002.stdout:    "start": "2018-12-05 07:51:12.850849"
2018-12-05T02:55:34.269 INFO:teuthology.orchestra.run.pluto002.stdout:}

We need an env we can access to fix this issue, thanks.
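For context, the "get osd ids" task simply lists the OSD data directories and strips everything up to the last '-'. A minimal illustration, assuming a custom cluster name 'c2' (the directory names here are illustrative, not taken from the environment):

$ ls /var/lib/ceph/osd/
c2-1  c2-2  c2-4
$ ls /var/lib/ceph/osd/ | sed 's/.*-//'
1
2
4

So the stdout above (ids 1, 2, 4) suggests only three OSD data directories exist on pluto002, i.e. the missing OSDs were never prepared or activated rather than merely mis-parsed by the task.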
(In reply to seb from comment #4)
> This is the error:
>
> 2018-12-05T02:55:34.263 INFO:teuthology.orchestra.run.pluto002.stdout:TASK
> [ceph-osd : get osd ids] **************************************************
> 2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:task
> path: /home/ubuntu/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:36
> 2018-12-05T02:55:34.264
> INFO:teuthology.orchestra.run.pluto002.stdout:Wednesday 05 December 2018
> 07:51:11 +0000 (0:00:00.131) 0:07:22.541 ****
> 2018-12-05T02:55:34.264
>
> We need an env we can access to fix this issue, thanks.

Unfortunately, the tests are run as teuthology jobs, which clean up the machines after job completion. I will try to get hold of some machines and reproduce the issue. BTW, the issue is seen only on clusters with a custom cluster name, if that helps.
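To reproduce outside teuthology, the custom cluster name can be set through ceph-ansible's cluster variable, which defaults to "ceph" (a hedged sketch; the inventory file name and the extra-vars invocation are assumptions, and site.yml is the playbook normally copied from site.yml.sample):

# 'c2' is the second cluster name from the report
ansible-playbook -i hosts site.yml \
  -e cluster=c2 \
  -e osd_auto_discovery=true \
  -e osd_scenario=collocated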
A request to add this to the Release Notes has been made. Please fill out the Doc Text using the cause, consequence, workaround, result format, and we will ensure it gets into the Release Notes.
Hi Shilpa, is there any log corresponding to this environment that you could provide? Thanks