Bug 1570584 - ceph-disk: Error: No cluster conf found in /etc/ceph with fsid xyz
Status: CLOSED DUPLICATE of bug 1418040
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Assignee: John Fulton
QA Contact: Yogev Rabl
Reported: 2018-04-23 09:41 UTC by Filip Hubík
Modified: 2018-12-10 16:34 UTC (History)

Last Closed: 2018-05-02 14:28:37 UTC


Description Filip Hubík 2018-04-23 09:41:04 UTC
Description of problem:
Ceph is not able to survive "heat stack-delete overcloud" and its re-deployment.

Steps to Reproduce:
1) Deploy the overcloud successfully using InfraRed with the deploy command below (virthost topology: 1 undercloud, 1 controller, 1 compute, 1 Ceph node):
stack@undercloud $ openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 1 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 1 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
--log-file overcloud_deployment_79.log

# ceph is operable on Ceph node, though in "WARN" state
$ ceph status
    cluster 530864f0-4322-11e8-88c6-525400a21c88
     health HEALTH_WARN
            256 pgs degraded
            256 pgs stuck degraded
            256 pgs stuck unclean
            256 pgs stuck undersized
            256 pgs undersized
            recovery 220/330 objects degraded (66.667%)
     monmap e1: 1 mons at {controller-0=172.17.3.12:6789/0}
            election epoch 2, quorum 0 controller-0
     osdmap e189: 1 osds: 1 up, 1 in
      pgmap v3335: 256 pgs, 4 pools, 377 MB data, 110 objects
            424 MB used, 39490 MB / 39915 MB avail
            220/330 objects degraded (66.667%)
                 256 active+undersized+degraded
$ ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stuck degraded; 256 pgs stuck unclean; 256 pgs stuck undersized; 256 pgs undersized; recovery 220/330 objects degraded (66.667%)

# In general, the cloud responds and can store images using Glance as expected

2) Perform overcloud delete
stack@undercloud $ heat stack-delete overcloud

3) Redeploy the overcloud the same way, using the deploy command from step 1

4) Ceph fails to start and reports:
$ ceph status
    cluster 999300c8-46d4-11e8-9c4b-525400a21c88
     health HEALTH_ERR
            256 pgs stuck inactive
            256 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {controller-0=172.17.3.13:6789/0}
            election epoch 2, quorum 0 controller-0
     osdmap e4: 0 osds: 0 up, 0 in
      pgmap v5: 256 pgs, 4 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 256 creating
$ ceph health
HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds

5) As a result, the overcloud's entire storage is unusable
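
The failure in step 4 shows up in the `osdmap` line of `ceph status` (1 OSD up and in after the first deployment, 0 after the redeployment). A rough Python sketch of a post-deployment check, assuming the `ceph status` output format shown in this report; `parse_osdmap` and `has_usable_osds` are hypothetical helper names, not part of any Ceph API:

```python
import re


def parse_osdmap(status_output):
    """Return (total, up, in) OSD counts from the osdmap line, or None if absent."""
    # Assumed line format, taken from this report: "osdmap e4: 0 osds: 0 up, 0 in"
    match = re.search(r"osdmap e\d+: (\d+) osds: (\d+) up, (\d+) in", status_output)
    if match is None:
        return None
    total, up, in_ = (int(group) for group in match.groups())
    return total, up, in_


def has_usable_osds(status_output):
    """True when at least one OSD is reported both up and in."""
    counts = parse_osdmap(status_output)
    return counts is not None and counts[1] > 0 and counts[2] > 0


if __name__ == "__main__":
    before = "     osdmap e189: 1 osds: 1 up, 1 in"
    after = "     osdmap e4: 0 osds: 0 up, 0 in"
    print(has_usable_osds(before), has_usable_osds(after))
```

In practice the input would be the captured output of `ceph status` on a node that has the cluster configuration; the sketch only parses text and does not call Ceph itself.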

Version-Release number of selected component (if applicable):
OSP7

Actual results:
$ cat messages | grep "Error: No cluster conf found in "
Apr 23 05:19:56 localhost os-collect-config: disabled\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + ceph-disk activate /dev/vdb1\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ERROR:ceph-disk:Failed to activate\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid 530864f0-4322-11e8-88c6-525400a21c88\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + true\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /Stage[main]/Timezone/File[/etc/localtime]/target: target changed '../usr/share/zoneinfo/America/New_York' to '/usr/share/zoneinfo/UTC'\u001b[0m\n\u001b[mNotice: /File[/etc/localtime]/seluser: seluser changed 'unconfined_u' to 'system_u'\u001b[0m\n\u001b[mNotice: Finished catalog run in 1.45 seconds\u001b[0m\n", "deploy_stderr": "Device \"br_isolated\" does not exist.\nDevice \"ovs_system\" does not exist.\n", "deploy_status_code": 0}
Apr 23 05:19:56 localhost os-collect-config: #033[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid 530864f0-4322-11e8-88c6-525400a21c88#033[0m
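
Note that the fsid in the error is the cluster ID from the first deployment (step 1), not the ID of the redeployed cluster (step 4), so the OSD disk apparently still carries data from the old cluster. A minimal Python sketch of that comparison, assuming the log line and `ceph status` formats shown in this report; the function names are hypothetical and are not part of ceph-disk or TripleO:

```python
import re

# UUID pattern matching the cluster fsids that appear in this report.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


def fsid_from_error(log_line):
    """Return the fsid named in a 'No cluster conf found' error, or None."""
    match = re.search(r"No cluster conf found in \S+ with fsid (" + UUID + ")", log_line)
    return match.group(1) if match else None


def fsid_from_status(status_output):
    """Return the fsid from the 'cluster <fsid>' line of `ceph status`, or None."""
    match = re.search(r"^\s*cluster (" + UUID + r")\s*$", status_output, re.MULTILINE)
    return match.group(1) if match else None


def disk_is_stale(log_line, status_output):
    """True when the fsid on the disk differs from the running cluster's fsid."""
    disk_fsid = fsid_from_error(log_line)
    cluster_fsid = fsid_from_status(status_output)
    if disk_fsid is None or cluster_fsid is None:
        return False  # not enough information to decide
    return disk_fsid != cluster_fsid


if __name__ == "__main__":
    error = ("ceph-disk: Error: No cluster conf found in /etc/ceph "
             "with fsid 530864f0-4322-11e8-88c6-525400a21c88")
    status = "    cluster 999300c8-46d4-11e8-9c4b-525400a21c88\n     health HEALTH_ERR\n"
    print(disk_is_stale(error, status))
```

With the values from this report the two fsids differ, which is consistent with the disk belonging to the deleted cluster rather than the new one.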

Expected results:
Overcloud re-deployment shouldn't affect Ceph's behaviour, let alone break Ceph storage completely.

Additional info:
ceph-node:
rpm -qa | grep ceph
ceph-mon-0.94.9-9.el7cp.x86_64
ceph-0.94.9-9.el7cp.x86_64
ceph-osd-0.94.9-9.el7cp.x86_64
ceph-common-0.94.9-9.el7cp.x86_64

undercloud:
sudo rpm -qa | grep -iE '(tripleo|instack)'
openstack-tripleo-puppet-elements-0.0.1-6.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-137.el7ost.noarch
instack-0.0.7-2.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-7.git49b57eb.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
instack-undercloud-2.1.2-41.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-11.el7ost.noarch

How reproducible:
100%

Comment 1 John Fulton 2018-05-02 14:28:37 UTC
This exact error message results from not cleaning the disks between deployments, and it is the expected and desired behavior. Your disks still held Ceph data from the previous deployment. When TripleO found that data, it refused to erase it to create a new Ceph cluster, because its default behavior is not to delete data. You need to clean the disks so that it doesn't find that data. To do this and avoid the issue, please follow the documented procedure for cleaning the disks during deployment with a pre-boot script, linked from the duplicate bug. Newer versions of OSPd include a version of Ironic that will clean the disks for you, provided that you enable it. For the version of OSPd reported in this bug, you would use a pre-boot script.

*** This bug has been marked as a duplicate of bug 1418040 ***

