Description of problem:
Ceph is not able to survive "heat stack-delete overcloud" and a subsequent re-deployment.

Steps to Reproduce:
1) Deploy the overcloud successfully using InfraRed with the following deploy command (topology on a virthost: 1 undercloud, 1 controller, 1 compute, 1 ceph node):

stack@undercloud $ openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack overcloud \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  --control-scale 1 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 1 \
  --ceph-storage-flavor ceph \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/debug.yaml \
  --log-file overcloud_deployment_79.log

# Ceph is operable on the Ceph node, though in "WARN" state:
$ ceph status
    cluster 530864f0-4322-11e8-88c6-525400a21c88
     health HEALTH_WARN
            256 pgs degraded
            256 pgs stuck degraded
            256 pgs stuck unclean
            256 pgs stuck undersized
            256 pgs undersized
            recovery 220/330 objects degraded (66.667%)
     monmap e1: 1 mons at {controller-0=172.17.3.12:6789/0}
            election epoch 2, quorum 0 controller-0
     osdmap e189: 1 osds: 1 up, 1 in
      pgmap v3335: 256 pgs, 4 pools, 377 MB data, 110 objects
            424 MB used, 39490 MB / 39915 MB avail
            220/330 objects degraded (66.667%)
                 256 active+undersized+degraded

$ ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stuck degraded; 256 pgs stuck unclean; 256 pgs stuck undersized; 256 pgs undersized; recovery 220/330 objects degraded (66.667%)

# In general the cloud is responding and is able to store images via Glance as expected.

2) Delete the overcloud:

stack@undercloud $ heat stack-delete overcloud

3) Redeploy the overcloud in the same way, using the deploy command above.

4) Ceph fails to start and reports:

$ ceph status
    cluster 999300c8-46d4-11e8-9c4b-525400a21c88
     health HEALTH_ERR
            256 pgs stuck inactive
            256 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {controller-0=172.17.3.13:6789/0}
            election epoch 2, quorum 0 controller-0
     osdmap e4: 0 osds: 0 up, 0 in
      pgmap v5: 256 pgs, 4 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 256 creating

$ ceph health
HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds

5) As a result, the whole overcloud storage is incapacitated.

Version-Release number of selected component (if applicable):
OSP7

Actual results:

$ cat messages | grep "Error: No cluster conf found in "
Apr 23 05:19:56 localhost os-collect-config: (deploy_stdout, ANSI escapes stripped):
  Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1
  Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + ceph-disk activate /dev/vdb1
  Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ERROR:ceph-disk:Failed to activate
  Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid 530864f0-4322-11e8-88c6-525400a21c88
  Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + true
  Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: executed successfully
  Notice: /Stage[main]/Timezone/File[/etc/localtime]/target: target changed '../usr/share/zoneinfo/America/New_York' to '/usr/share/zoneinfo/UTC'
  Notice: /File[/etc/localtime]/seluser: seluser changed 'unconfined_u' to 'system_u'
  Notice: Finished catalog run in 1.45 seconds
  deploy_stderr:
    Device "br_isolated" does not exist.
    Device "ovs_system" does not exist.
  deploy_status_code: 0
Apr 23 05:19:56 localhost os-collect-config: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: ceph-disk: Error: No cluster conf found in /etc/ceph with fsid 530864f0-4322-11e8-88c6-525400a21c88

Expected results:
Overcloud re-deployment shouldn't affect Ceph's behavior, much less break Ceph storage completely.

Additional info:

ceph node:
$ rpm -qa | grep ceph
ceph-mon-0.94.9-9.el7cp.x86_64
ceph-0.94.9-9.el7cp.x86_64
ceph-osd-0.94.9-9.el7cp.x86_64
ceph-common-0.94.9-9.el7cp.x86_64

undercloud:
$ sudo rpm -qa | grep -iE '(tripleo|instack)'
openstack-tripleo-puppet-elements-0.0.1-6.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-137.el7ost.noarch
instack-0.0.7-2.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-7.git49b57eb.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
instack-undercloud-2.1.2-41.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-11.el7ost.noarch

How reproducible:
100%
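Note that the fsid in the error (530864f0-...) is the cluster id of the first deployment, while the redeployed cluster has fsid 999300c8-.... A quick way to see the stale id left on disk, assuming /dev/vdb1 is the OSD data partition from the log above and /mnt as an arbitrary mount point (ceph-disk writes the cluster fsid to a ceph_fsid marker file in the OSD data directory):

$ sudo mount /dev/vdb1 /mnt
$ cat /mnt/ceph_fsid               # fsid written by the old, deleted cluster
530864f0-4322-11e8-88c6-525400a21c88
$ grep fsid /etc/ceph/ceph.conf    # fsid of the redeployed cluster
fsid = 999300c8-46d4-11e8-9c4b-525400a21c88
$ sudo umount /mnt

Since ceph-disk activate finds the old fsid on the disk but no matching cluster conf in /etc/ceph, it refuses to activate the OSD, which matches the error above.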
This exact error message results from not cleaning the disks between deployments, and it is the expected and desired behavior. Your disks still had Ceph data on them from the previous deployment; when TripleO found that data, it refused to erase it to create a new Ceph cluster, because its default behavior is to never delete data. You need to clean the disks so that data is not found. To avoid this issue, please follow the documented procedure to clean the disks during deployment using a first-boot script, linked from the duplicate bug. Newer versions of OSPd ship a version of Ironic that will clean the disks for you, provided that you enable it. For the OSPd version reported in this bug, you would use a first-boot script.

*** This bug has been marked as a duplicate of bug 1418040 ***
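For reference, the documented first-boot clean-up boils down to zapping the partition tables of the OSD disks before puppet configures Ceph. A minimal sketch, assuming (as in this virt topology) that /dev/vda is the root disk and every other disk on a Ceph node is an OSD disk; the full procedure linked from the duplicate bug should be preferred:

#!/bin/bash
# First-boot sketch: wipe the non-root disks on Ceph nodes so no stale
# fsid/partitions from a previous deployment survive.
# Assumes the root disk is /dev/vda (true for this virt setup).
if [[ $(hostname) == *ceph* ]]; then
  for DEVICE in $(lsblk -dno NAME,TYPE | awk '$2 == "disk" && $1 != "vda" {print $1}'); do
    sgdisk -Z "/dev/$DEVICE"   # zap GPT and MBR data structures
    sgdisk -g "/dev/$DEVICE"   # lay down a fresh, empty GPT
  done
fi

Such a script gets wired in as the OS::TripleO::NodeUserData resource in an extra environment file passed to the deploy command with -e, alongside the ones above.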
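On the newer OSPd releases mentioned above, the equivalent is Ironic's automated node cleaning, which wipes node disks between deployments once enabled on the undercloud. Assuming a release whose undercloud.conf exposes the clean_nodes switch:

$ grep clean_nodes undercloud.conf
clean_nodes = true
$ openstack undercloud install    # re-run to apply the setting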