Bug 1335938

Summary: Ceph-installer reports success even if the OSDs are not created successfully
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Nishanth Thomas <nthomas>
Component: ceph-installer
Assignee: Alfredo Deza <adeza>
Status: CLOSED ERRATA
QA Contact: Daniel Horák <dahorak>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2
CC: adeza, aschoen, ceph-eng-bugs, dahorak, kdreyer, mkudlej, nthomas, sankarshan, sds-qe-bugs
Target Milestone: ---
Keywords: Reopened
Target Release: 2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-ansible-1.0.5-14.el7scon
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 19:50:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  "ceph-installer task e1e52f53-3d4b-489e-84c4-fdaa88ad06a9" output (flags: none)

Description Nishanth Thomas 2016-05-13 15:02:33 UTC
Description of problem:

Ceph-installer reports that the OSDs were created successfully (the task returns success), but the OSDs are not actually created (ceph -s does not list them).

Version-Release number of selected component (if applicable):
http://puddle.ceph.redhat.com/puddles/rhscon/2/2016-04-29.2/RHSCON-2.repo

How reproducible:
not always

Steps to Reproduce:
1. Have a larger number of disks (10) on the node and create the OSDs one after another (a verification sketch follows below).
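
A minimal verification sketch for after the task reports success (assuming shell access to a monitor and to the OSD node; these are generic Ceph commands, not steps taken from this report):

  # On a monitor node: the new OSDs should appear in the cluster status and tree
  ceph -s
  ceph osd tree

  # On the OSD node: every prepared disk should show up as "ceph data, active";
  # a disk still listed as "other, unknown" was not turned into an OSD
  ceph-disk list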

Comment 3 Daniel Horák 2016-05-17 12:19:13 UTC
*** Bug 1335913 has been marked as a duplicate of this bug. ***

Comment 4 Alfredo Deza 2016-05-17 14:33:08 UTC
Upstream pull request opened: https://github.com/ceph/ceph-ansible/pull/794

Comment 5 Alfredo Deza 2016-05-18 12:59:59 UTC
Merged upstream. Pushed 52f73f30c5b1e350d4965d4d82c456d2d9c39500 to downstream.

Comment 9 Nishanth Thomas 2016-05-27 10:22:52 UTC
This issue is seen on the latest builds.

Comment 10 Ken Dreyer (Red Hat) 2016-05-31 17:18:39 UTC
Nishanth,

Would you please provide the following information?

* What versions of the products are being used?
* What are the exact steps to reproduce?
* Log output relevant to the issue and products (e.g. ansible output,
ceph-installer task information, /var/log/ceph/* logs, systemd log
output from OSDs/MONs)
* If an OSD is related to the issue, please take a look at
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
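
A hedged collection sketch for the items above (the task id, OSD id and MON host are placeholders; the unit names and paths are assumptions about a standard deployment, adjust to the actual setup):

  # ceph-installer task output for the suspect task
  ceph-installer task <TASK_ID>

  # Ceph daemon logs on the affected node
  tail -n 200 /var/log/ceph/*.log

  # systemd journal for the OSD and MON daemons
  journalctl -u ceph-osd@<OSD_ID> --no-pager
  journalctl -u ceph-mon@<MON_HOST> --no-pager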

Comment 11 Nishanth Thomas 2016-06-01 15:26:05 UTC
(In reply to Ken Dreyer (Red Hat) from comment #10)
> Nishanth,
> 
> Would you please provide the following information?
> 
> * What versions of the products are being used?

ceph-ansible-1.0.5-15.el7scon.noarch.rpm           20-May-2016 17:13    108K
ceph-installer-1.0.11-1.el7scon.noarch.rpm         18-May-2016 20:55     75K

> * What are the exact steps to reproduce?

Create a cluster with more than 8 disks per node. Also provide a custom cluster name (TestCluster10).

> * Log output relevant to the issue and products (e.g. ansible output,
> ceph-installer task information, /var/log/ceph/* logs, systemd log
> output from OSDs/MONs)

Not available, as the setup has been cleaned up.

> * If an OSD is related to the issue, please take a look at
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

Comment 12 Nishanth Thomas 2016-06-01 15:28:20 UTC
I tried to reproduce this issue a couple of times today, but without success. So I am closing this for now and will re-open it if it is found again.

Comment 13 Daniel Horák 2016-06-02 12:14:51 UTC
It seems I was able to reproduce it again.

Related packages:
  ceph-ansible-1.0.5-18.el7scon.noarch
  ceph-installer-1.0.11-1.el7scon.noarch
  
  ceph-base-10.2.1-12.el7cp.x86_64
  ceph-common-10.2.1-12.el7cp.x86_64
  ceph-osd-10.2.1-12.el7cp.x86_64
  ceph-selinux-10.2.1-12.el7cp.x86_64
  libcephfs1-10.2.1-12.el7cp.x86_64
  python-cephfs-10.2.1-12.el7cp.x86_64

Here it is visible that there is no OSD on /dev/vdd (on node1), although there should be one:
# ceph-disk list
  /dev/vda :
   /dev/vda1 other, swap
   /dev/vda2 other, xfs, mounted on /
  /dev/vdb :
   /dev/vdb2 ceph journal, for /dev/vdc1
   /dev/vdb1 ceph journal, for /dev/vde1
  /dev/vdc :
   /dev/vdc1 ceph data, active, cluster TestClusterA, osd.1, journal /dev/vdb2
  /dev/vdd other, unknown
  /dev/vde :
   /dev/vde1 ceph data, active, cluster TestClusterA, osd.0, journal /dev/vdb1
  /dev/vdf other, unknown
  /dev/vdg other, unknown

The related ceph-installer task was submitted as follows:
  2016-06-02T10:47:09.437+02:00 INFO     api.go:174 Configure] admin:670b65a9-fd32-4971-9afd-202ec4481aa6-Started configuration on node: jenkins-usm1-node1.localdomain. TaskId: e1e52f53-3d4b-489e-84c4-fdaa88ad06a9. Request Data: {"cluster_name":"TestClusterA","cluster_network":"172.16.176.0/24","devices":{"/dev/vdd":"/dev/vdb"},"fsid":"50261f74-e019-48bf-a584-af9bdfd60200","host":"jenkins-usm1-node1.localdomain","journal_size":5120,"monitors":[{"address":"172.16.176.83","host":"jenkins-usm1-mon1.localdomain"},{"address":"172.16.176.84","host":"jenkins-usm1-mon2.localdomain"},{"address":"172.16.176.85","host":"jenkins-usm1-mon3.localdomain"}],"public_network":"172.16.176.0/24","redhat_storage":true}. Route: http://localhost:8181/api/osd/configure
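
For readability, the same Request Data payload, plus a hedged sketch of how such a request could be re-submitted to the route named in the log line above (the payload and endpoint are copied from the log; the curl invocation itself is an assumption, not a step taken from this report):

cat > osd-configure.json <<'EOF'
{
  "cluster_name": "TestClusterA",
  "cluster_network": "172.16.176.0/24",
  "devices": {"/dev/vdd": "/dev/vdb"},
  "fsid": "50261f74-e019-48bf-a584-af9bdfd60200",
  "host": "jenkins-usm1-node1.localdomain",
  "journal_size": 5120,
  "monitors": [
    {"address": "172.16.176.83", "host": "jenkins-usm1-mon1.localdomain"},
    {"address": "172.16.176.84", "host": "jenkins-usm1-mon2.localdomain"},
    {"address": "172.16.176.85", "host": "jenkins-usm1-mon3.localdomain"}
  ],
  "public_network": "172.16.176.0/24",
  "redhat_storage": true
}
EOF
curl -X POST -H 'Content-Type: application/json' \
     -d @osd-configure.json http://localhost:8181/api/osd/configure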
  
I'll post the ceph-installer task log as an attachment (# ceph-installer task e1e52f53-3d4b-489e-84c4-fdaa88ad06a9).

I'll try to collect more data and post it here; also, if direct access to the affected machines would help, please let me know.

Comment 14 Daniel Horák 2016-06-02 12:15:35 UTC
Created attachment 1164043 [details]
"ceph-installer task e1e52f53-3d4b-489e-84c4-fdaa88ad06a9" output

Comment 18 Daniel Horák 2016-06-02 13:21:29 UTC
The issue described in comment 13 has a different root cause, described in the new Bug 1342117.

I'll test this bug according to the original scenario, with data disks that were not "correctly" cleaned (a sketch of inspecting and cleaning a disk follows below).
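
A minimal sketch for that scenario (the device name is an example; the wipefs/sgdisk usage is an assumption about how a disk would be inspected or cleaned, not a step documented in this report):

  # List leftover partition-table / filesystem signatures on a data disk;
  # a disk with leftovers is the "not correctly cleaned" case
  wipefs /dev/vdd
  ceph-disk list

  # Destructive: fully clean the disk before a fresh attempt
  sgdisk --zap-all /dev/vdd
  wipefs -a /dev/vdd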

Comment 19 Daniel Horák 2016-08-02 14:43:59 UTC
Tested in multiple scenarios over the last weeks; a failed OSD creation task is now properly reported.

Latest testing on USM Server/ceph-installer server (RHEL 7.2):
  ceph-ansible-1.0.5-31.el7scon.noarch
  ceph-installer-1.0.14-1.el7scon.noarch
  rhscon-ceph-0.0.39-1.el7scon.x86_64
  rhscon-core-0.0.39-1.el7scon.x86_64
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  rhscon-ui-0.0.51-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-master-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

  Ceph node (RHEL 7.2):
  ceph-base-10.2.2-32.el7cp.x86_64
  ceph-common-10.2.2-32.el7cp.x86_64
  ceph-osd-10.2.2-32.el7cp.x86_64
  ceph-selinux-10.2.2-32.el7cp.x86_64
  libcephfs1-10.2.2-32.el7cp.x86_64
  python-cephfs-10.2.2-32.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-minion-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

>> VERIFIED

Comment 21 errata-xmlrpc 2016-08-23 19:50:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754