Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read-only. Starting Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking.

Bug 1605930

Summary: osd failed to upgrade with "Error: No cluster conf found in /etc/ceph"
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tiffany Nguyen <tunguyen>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.1
CC: aschoen, ceph-eng-bugs, gmeno, nthomas, sankarshan, seb, tunguyen, vakulkar
Target Milestone: rc
Flags: vakulkar: automate_bug?
Target Release: 3.1
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-07 18:58:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
* ansible log
* all.yml
* osds.yml
* hosts file
* ansible log

Description Tiffany Nguyen 2018-07-20 17:39:15 UTC
Created attachment 1466874 [details]
ansible log

Description of problem:
When running rolling_update.yml, ceph-disk failed to activate the OSD with the error below:

failed: [c06-h09-6048r.rdu.openstack.engineering.redhat.com] (item=/dev/sdb) => {"changed": false, "cmd": ["ceph-disk", "activate", "/dev/sdb1"], "delta": "0:00:00.144141", "end": "2018-07-20 17:18:44.826203", "item": "/dev/sdb", "msg": "non-zero return code", "rc": 1, "start": "2018-07-20 17:18:44.682062", "stderr": "mount_activate: Failed to activate\nceph-disk: Error: No cluster conf found in /etc/ceph with fsid 9071b1aa-c5ea-451c-b1d0-06b2298c1901", "stderr_lines": ["mount_activate: Failed to activate", "ceph-disk: Error: No cluster conf found in /etc/ceph with fsid 9071b1aa-c5ea-451c-b1d0-06b2298c1901"], "stdout": "", "stdout_lines": []}
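The error means ceph-disk compared the fsid recorded on the OSD data partition against the fsid in /etc/ceph/ceph.conf and found no matching conf. A minimal sketch of checking the conf side is below; the conf contents are a stand-in written to a temp file so the snippet is self-contained, and the expected fsid is taken from the error above. In practice you would point `conf` at the real /etc/ceph/ceph.conf.

```shell
# Sketch: verify that ceph.conf carries the fsid ceph-disk is looking for.
expected_fsid=9071b1aa-c5ea-451c-b1d0-06b2298c1901   # from the ceph-disk error

# Stand-in for /etc/ceph/ceph.conf so the snippet runs standalone.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[global]
fsid = 9071b1aa-c5ea-451c-b1d0-06b2298c1901
mon_host = 10.0.0.1
EOF

# Pull the fsid value out of the conf (split on " = ").
conf_fsid=$(awk -F' *= *' '/^fsid/ {print $2}' "$conf")

if [ "$conf_fsid" = "$expected_fsid" ]; then
    echo "fsid matches"
else
    echo "fsid mismatch: ceph.conf has $conf_fsid"
fi
```

If the real ceph.conf is missing, empty, or carries a different fsid after the upgrade touched it, ceph-disk activation fails exactly as shown in the log above.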

Version-Release number of selected component (if applicable):
Upgrade from 2.5 --> 3.1
 * 2.5 (10.2.10-17.el7cp)
 * 3.1 (build http://download.eng.bos.redhat.com/composes/auto/ceph-3.1-rhel-7/RHCEPH-3.1-RHEL-7-20180712.ci.2/) 

How reproducible:
* Cluster filled to about 30% of capacity
* Running the 2.5 -> 3.1 rolling_update.yml with I/O running in parallel

Steps to Reproduce:
1. Run a Ceph 2.5 cluster filled to 30% of capacity
2. Start I/O using the COSBench tool
3. Run rolling_update.yml to upgrade to the 3.1 build
4. Monitor the ansible log: the upgrade errors out and the cluster fails to upgrade

Comment 3 Tiffany Nguyen 2018-07-20 17:49:15 UTC
Created attachment 1466978 [details]
all.yml

Comment 4 Tiffany Nguyen 2018-07-20 17:49:40 UTC
Created attachment 1466984 [details]
osds.yml

Comment 5 Tiffany Nguyen 2018-07-20 17:50:01 UTC
Created attachment 1466988 [details]
hosts file

Comment 6 Tiffany Nguyen 2018-07-20 17:57:57 UTC
fsid info:
[root@c07-h29-6018r ~]# ceph fsid
9071b1aa-c5ea-451c-b1d0-06b2298c1901

Comment 7 Tiffany Nguyen 2018-07-20 23:52:50 UTC
Created attachment 1469625 [details]
ansible log

Re-ran rolling_update.yml; attaching the new ansible log. The upgrade is still failing, with PGs stuck degraded:
 cluster:
    id:     9071b1aa-c5ea-451c-b1d0-06b2298c1901
    health: HEALTH_WARN
            1012 pgs degraded
            6 pgs recovering
            1008 pgs recovery_wait
            1012 pgs stuck degraded
            1014 pgs stuck unclean
            recovery 1225343/184368312 objects degraded (0.665%)
            noout,noscrub,nodeep-scrub flag(s) set
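With recovery still in flight, the playbook's health check keeps failing. One common approach (a sketch, not the ceph-ansible procedure itself) is to wait until the degraded count drains to zero before re-running; the count can be parsed from `ceph -s` output. The status text below is an inline copy of the output above so the snippet runs standalone; in practice you would pipe `ceph -s` directly.

```shell
# Sketch: extract the degraded-PG count from saved cluster status text.
status=$(cat <<'EOF'
    health: HEALTH_WARN
            1012 pgs degraded
            6 pgs recovering
            1008 pgs recovery_wait
EOF
)

# The degraded line looks like "1012 pgs degraded"; take the leading count.
degraded=$(echo "$status" | awk '/pgs degraded/ {print $1}')
echo "degraded pgs: $degraded"
# Recovery is done once this reaches 0 and health returns to HEALTH_OK.
```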

Comment 8 seb 2018-07-25 13:42:29 UTC
What does your ceph.conf say about this fsid?