Bug 1561456

Summary: [ceph-ansible] [ceph-container] : shrink OSD with NVMe disks - failing as OSD services are not stopped
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact: Erin Donnelly <edonnell>
Priority: unspecified
Version: 3.0
CC: adeza, agunn, aschoen, ceph-eng-bugs, edonnell, gmeno, hnallurv, jquinn, kdreyer, nthomas, sankarshan, shan, tchandra
Target Milestone: z3
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.0.32-1.el7cp Ubuntu: ceph-ansible_3.0.32-2redhat1
Doc Type: Bug Fix
Doc Text:
.The `shrink-osd` playbook supports NVMe drives
Previously, the `shrink-osd` Ansible playbook did not support shrinking OSDs backed by an NVMe drive. NVMe drive support has been added in this release.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 18:20:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1553254, 1557269, 1572368, 1600697    
Attachments:
Description: File contains contents ansible-playbook log
Flags: none

Description Vasishta 2018-03-28 11:36:46 UTC
Created attachment 1414140 [details]
File contains contents ansible-playbook log

Description of problem:
Shrinking an OSD backed by an NVMe disk fails in the task [deallocate osd(s) id when ceph-disk destroy fail] with the error "Error EBUSY: osd.<id> is still up; must be down before removal."

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.28-1.el7cp.noarch

How reproducible:
Always (3/3)

Steps to Reproduce:
1. Configure a containerized cluster with NVMe disks for the OSDs.
2. Try to shrink an OSD using the shrink-osd.yml playbook.


Actual results:
TASK [deallocate osd(s) id when ceph-disk destroy fail]
--------------------------------
"stderr_lines": [
        "Error EBUSY: osd.7 is still up; must be down before removal. "
    ], 

Expected results:
The OSD should be removed successfully.

Additional info:

Note that the preceding task, TASK [stop osd services (container)], completed with status 'ok'.

Comment 6 Sébastien Han 2018-04-12 12:14:43 UTC
Can you check if the container is still running?
Perhaps we tried to stop the wrong service.

Thanks.

Comment 7 Vasishta 2018-04-12 13:12:32 UTC
Hi Sebastien,

As I remember, the container was still running. (Unfortunately, I don't have the environment available right now.)

I think we must have tried to stop the wrong service, as I see "name": "ceph-osd@nvme0n1p" in the log (attachment). Following the naming convention, the service name should have been "ceph-osd@nvme0n1".

The logic we have in shrink-osd.yml [1] to find the service name does not seem to work for NVMe disks.


- name: stop osd services (container)
  service:
    name: "ceph-osd@{{ item.0.stdout[:-1] | regex_replace('/dev/', '') }}"

I think it would have worked if we used "item.0.stdout[:-2]", but only for NVMe disks: NVMe partitions are named like nvme0n1p1, so two trailing characters ("p1") need to be stripped, not one.
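
For concreteness, here is how the slicing behaves on the two naming schemes (device paths are illustrative):

  "/dev/sda1"[:-1]       -> "/dev/sda"       (correct)
  "/dev/nvme0n1p1"[:-1]  -> "/dev/nvme0n1p"  (wrong service name)
  "/dev/nvme0n1p1"[:-2]  -> "/dev/nvme0n1"   (correct, but only while the partition number is a single digit, and wrong for sd* disks)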

[1] https://github.com/ceph/ceph-ansible/blob/37117071ebb7ab3cf68b607b6760077a2b46a00d/infrastructure-playbooks/shrink-osd.yml#L119-L121
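
Since no single slice covers both schemes, a regex that strips an optional "p" before the trailing digits might be cleaner. A minimal sketch, assuming item.0.stdout always holds a partition path as in the existing task (illustrative only, not the actual patch that landed in v3.0.32):

- name: stop osd services (container)
  service:
    # Strip "/dev/" and the partition suffix:
    #   /dev/nvme0n1p1 -> nvme0n1   (NVMe partitions end in "p<N>")
    #   /dev/sda1      -> sda       (sd* partitions end in a bare digit)
    name: "ceph-osd@{{ item.0.stdout | regex_replace('/dev/', '') | regex_replace('p?[0-9]+$', '') }}"
    state: stopped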


Regards,
Vasishta Shastry
AQE, Ceph

Comment 9 Sébastien Han 2018-04-20 09:44:40 UTC
*** Bug 1555793 has been marked as a duplicate of this bug. ***

Comment 10 Sébastien Han 2018-04-23 21:02:22 UTC
Will be in the next release, v3.0.32.

Comment 14 Vasishta 2018-05-09 07:15:44 UTC
Working fine with ceph-ansible-3.0.32-1.el7cp.

Comment 17 errata-xmlrpc 2018-05-15 18:20:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1563