Bug 1656686 - Intermittent failures in osd creation while using ceph-disk
Summary: Intermittent failures in osd creation while using ceph-disk
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.3
Assignee: Guillaume Abrioux
QA Contact: ceph-qe-bugs
Docs Contact: John Brier
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-06 05:05 UTC by shilpa
Modified: 2018-12-18 14:28 UTC
CC: 13 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-18 14:28:18 UTC
Embargoed:



Description shilpa 2018-12-06 05:05:07 UTC
Description of problem:
A few OSDs intermittently fail to be created when deploying with ceph-disk.

Version-Release number of selected component (if applicable):
3.2.0-0.1.rc8.el7cp

How reproducible:
Intermittent

Cluster status of the affected cluster:

   mon: 1 daemons, quorum pluto002
   mgr: pluto002(active)
   osd: 6 osds: 3 up, 3 in
   rgw: 1 daemon active

 data:
   pools:   4 pools, 32 pgs
   objects: 187 objects, 1.09KiB
   usage:   3.00GiB used, 1.31TiB / 1.31TiB avail
   pgs:     32 active+clean

Of the 6 OSDs, only 3 are up and in.

There are no obvious failure errors in the ceph-ansible run. The logs show that in cluster 'c1' all OSDs are up, while in cluster 'c2' only 3 are up.
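
Commands that may help narrow down which OSDs were skipped (cluster name 'c2' taken from this report; OSD id 3 is only an example, and the systemd unit name may differ with a custom cluster name):

    # compare what ceph-disk prepared on the node against what the cluster sees
    ceph-disk list
    ceph --cluster c2 osd tree

    # check activation logging for a specific missing OSD, e.g. id 3
    journalctl -u ceph-osd@3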


Additional info:

Parameters:

        ceph_conf_overrides:
          global:
            osd crush chooseleaf type: 0
            osd default pool size: 3
            osd pool default pg num: 8
            osd pool default pgp num: 8
        ceph_origin: distro
        ceph_repository: rhcs
        ceph_stable: true
        ceph_stable_release: luminous
        ceph_stable_rh_storage: true
        ceph_test: true
        journal_size: 1024
        osd_auto_discovery: true
        osd_scenario: collocated
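
For context, keys under ceph_conf_overrides.global are merged into the [global] section of the generated ceph.conf, so each deployed cluster should end up with roughly the following (rendered from the values above):

        [global]
        osd crush chooseleaf type = 0
        osd default pool size = 3
        osd pool default pg num = 8
        osd pool default pgp num = 8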

Logs are at: 
http://magna002.ceph.redhat.com/smanjara-2018-12-05_02:10:11-rgw:multisite-ansible-luminous-distro-basic-multi/314786/teuthology.log

Comment 3 Josh Durgin 2018-12-07 18:51:09 UTC
This sounds like udev races similar to https://bugzilla.redhat.com/show_bug.cgi?id=1654011
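
In case it helps triage, a rough way to check for that on an affected node might be (assuming the udev-race theory; the commands below are generic, not taken from this run):

    # watch block-device udev events while the play is preparing disks
    udevadm monitor --udev --subsystem-match=block

    # after the play, try to activate any prepared-but-unactivated OSD partitions
    ceph-disk activate-all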

Comment 4 seb 2018-12-10 13:13:24 UTC
This is the error:

2018-12-05T02:55:34.263 INFO:teuthology.orchestra.run.pluto002.stdout:TASK [ceph-osd : get osd ids] **************************************************
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:task path: /home/ubuntu/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:36
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:Wednesday 05 December 2018  07:51:11 +0000 (0:00:00.131)       0:07:22.541 ****
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:changed: [pluto002.ceph.redhat.com] => {
2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:    "changed": true,
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "delta": "0:00:00.004367",
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "end": "2018-12-05 07:51:12.167107",
2018-12-05T02:55:34.265 INFO:teuthology.orchestra.run.pluto002.stdout:    "rc": 0,
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:    "start": "2018-12-05 07:51:12.162740"
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:}
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:
2018-12-05T02:55:34.266 INFO:teuthology.orchestra.run.pluto002.stdout:STDOUT:
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:1
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:2
2018-12-05T02:55:34.267 INFO:teuthology.orchestra.run.pluto002.stdout:4
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:changed: [clara011.ceph.redhat.com] => {
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "changed": true,
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "delta": "0:00:00.006334",
2018-12-05T02:55:34.268 INFO:teuthology.orchestra.run.pluto002.stdout:    "end": "2018-12-05 07:51:12.857183",
2018-12-05T02:55:34.269 INFO:teuthology.orchestra.run.pluto002.stdout:    "rc": 0,
2018-12-05T02:55:34.269 INFO:teuthology.orchestra.run.pluto002.stdout:    "start": "2018-12-05 07:51:12.850849"
2018-12-05T02:55:34.269 INFO:teuthology.orchestra.run.pluto002.stdout:}
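
For reference, the "get osd ids" task just lists the OSD data directories and strips the <cluster>- prefix, so a node where every OSD was prepared should report one id per directory, roughly like this (directory names are illustrative, for a cluster named 'c2'):

    $ ls /var/lib/ceph/osd/
    c2-1  c2-2  c2-3  c2-4
    $ ls /var/lib/ceph/osd/ | sed 's/.*-//'
    1
    2
    3
    4

Here pluto002 only reports ids 1, 2 and 4, i.e. the data directory for at least one OSD was never created or mounted.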


We need an env we can access to fix this issue, thanks.

Comment 5 shilpa 2018-12-11 05:07:34 UTC
(In reply to seb from comment #4)
> This is the error:
> 
> 2018-12-05T02:55:34.263 INFO:teuthology.orchestra.run.pluto002.stdout:TASK
> [ceph-osd : get osd ids] **************************************************
> 2018-12-05T02:55:34.264 INFO:teuthology.orchestra.run.pluto002.stdout:task
> path: /home/ubuntu/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:36
> 2018-12-05T02:55:34.264
> INFO:teuthology.orchestra.run.pluto002.stdout:Wednesday 05 December 2018 
> 07:51:11 +0000 (0:00:00.131)       0:07:22.541 ****
> [...]
> 
> We need an env we can access to fix this issue, thanks.

Unfortunately, the tests are run as teuthology jobs, which clean up the machines after job completion. I will try to get hold of some machines and reproduce it. By the way, the issue is found only on clusters with a custom cluster name, if that helps.
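
For what it's worth, with a custom cluster name everything is keyed off --cluster, so a manual reproduction attempt outside teuthology might look roughly like this (device path and fsid are placeholders, not from this run):

    # prepare one collocated OSD for cluster 'c2'
    ceph-disk prepare --cluster c2 --cluster-uuid <fsid> /dev/sdX

    # udev normally triggers activation; if not, activate the data partition by hand
    ceph-disk activate /dev/sdX1

    # the data directory should then appear as /var/lib/ceph/osd/c2-<id>
    ls /var/lib/ceph/osd/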

Comment 6 John Brier 2018-12-11 19:51:07 UTC
A request to add this to the Release Notes has been made. Please fill out the Doc Text using the cause, consequence, workaround, result format and we will ensure it gets in the Release Notes.
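
A skeleton in that format would be something like (placeholder wording, not proposed Doc Text):

    Cause: <what triggers the problem, e.g. deploying OSDs with ceph-disk on a cluster with a custom cluster name>
    Consequence: <what the user sees, e.g. some OSDs are not created or not up after the playbook finishes>
    Workaround (if any): <manual steps, if known>
    Result: <behaviour once the workaround is applied or the issue is fixed>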

Comment 13 Guillaume Abrioux 2018-12-17 20:29:21 UTC
Hi Shilpa,

is there any log corresponding to this environment that you could provide?

Thanks

