Description of problem:
=======================
unable to shrink cluster - remove MON and/or OSD from the cluster using ceph-ansible

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-1.0.5-10.el7scon.noarch

How reproducible:
=================
always

Steps to Reproduce:
1.
2.
3.

Expected results:
=================
ceph-ansible should provide a way to remove an OSD and/or MON from the cluster.

Additional info:
Federico, this defect has been re-targeted to Ceph release 3. Is product management OK with this? Please confirm. If this is not going to be in 2, then what is the alternate plan for customers who want to remove/add nodes to the cluster? Regards, Harish
There was a brief discussion on this BZ in today's program meeting, and it was decided to continue the discussion via the BZ. I am changing the target release to 2.0 until we reach a final decision.
Gregory, before we punt this to Console v3, we need confirmation that removing a node without using Ansible will not impact the rest of the install/scale-out Ansible/ceph-install stack. The question is around day-to-day operations, failed nodes, etc. - not about console-initiated operations, but about the need to ditch a node. We know RHS-C will not do this in 2.0; we need to decide whether we indeed do not need this in Ansible either, operationally. What say you?
We can't know with absolute certainty that in such a case ceph-ansible would work in the future. There is currently no support for removing a node in ceph-ansible.
Will USM break if a customer manually removes a node? We don't have this support in ceph-ansible and we don't have the time to implement it.
Tried to deploy a Ceph cluster using ceph-ansible and removed a Ceph OSD node using ceph-deploy; everything seems to be normal and the cluster is healthy.

1. Deployed a Ceph cluster on a 3-node test setup [1 mon+osd, 2 osd nodes] using ceph-ansible.
2. Removed an OSD node from the cluster using ceph-deploy [ceph-deploy purge <node>, ceph-deploy purgedata <node>].
3. Set the crushmap to use OSD-level replication instead of host-level replication [since I am now left with only 2 nodes] - this step is not needed in an environment that has more than 2 OSD nodes. See the command sketch below.
4. Ceph cluster is healthy.

So I believe that, until we have support in ceph-ansible or Console to remove a node [for faulty disks or whatever reason], we can use ceph-deploy to do it. ceph-deploy will however still be shipped in the RH Ceph Tools repo [rhel-7-server-rhceph-2-tools-rpms], although deprecated.
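For reference, here is a rough sketch of the commands behind steps 2 and 3, assuming a hypothetical OSD hostname "osd2" and a CRUSH rule that uses the default host-level chooseleaf step; adjust the names for the actual environment:

    # Step 2: remove the OSD node with ceph-deploy (hypothetical host "osd2")
    ceph-deploy purge osd2         # uninstall Ceph packages from the node
    ceph-deploy purgedata osd2     # wipe /var/lib/ceph and /etc/ceph on the node

    # Step 3: switch the CRUSH rule from host-level to OSD-level replication
    ceph osd getcrushmap -o crushmap.bin        # export the compiled CRUSH map
    crushtool -d crushmap.bin -o crushmap.txt   # decompile it to editable text
    # in crushmap.txt, change "step chooseleaf firstn 0 type host"
    #                      to "step chooseleaf firstn 0 type osd"
    crushtool -c crushmap.txt -o crushmap.new   # recompile the edited map
    ceph osd setcrushmap -i crushmap.new        # inject it into the cluster

    # Step 4: confirm cluster health
    ceph -s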
Neil, can you please check comment 10 and let us know your opinion? My guess is that this workaround will be exposed only to the Support team and not to the customer. If this is correct, then where do we document the info?
We can't use ceph-deploy for any operational actions on the cluster. This goes for both Support and the customer. How long is the manual node-removal process that we would need to document?
Harish and Neil, please note that comment 10 only answers the question in comment 6, which is what we tested and confirmed: removing a node from a running cluster will not impact the cluster in any way.

ceph-deploy was only used as a shortcut to do that.

As Neil pointed out, we may have to document the manual process to remove a node from the cluster.
(In reply to Tamil from comment #13)
> Harish and Neil, please note that comment 10 only answers the question in
> comment 6, which is what we tested and confirmed: removing a node from a
> running cluster will not impact the cluster in any way.
>
> ceph-deploy was only used as a shortcut to do that.
>
> As Neil pointed out, we may have to document the manual process to remove a
> node from the cluster.

Ah. OK, thanks for clearing that up.
(In reply to Tamil from comment #13)
> Harish and Neil, please note that comment 10 only answers the question in
> comment 6, which is what we tested and confirmed: removing a node from a
> running cluster will not impact the cluster in any way.

Thanks Tamil!

> ceph-deploy was only used as a shortcut to do that.
>
> As Neil pointed out, we may have to document the manual process to remove a
> node from the cluster.

Can you please let me know who will come up with the manual process (steps)?
Harish, I bet there is already a document upstream on how to remove a monitor or OSD from a running cluster. The docs team has to make a similar copy for downstream. Reference: http://docs.ceph.com/docs/master/rados/operations/
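For quick reference, the upstream procedure amounts to roughly the following (a sketch only, assuming a hypothetical OSD id 3 and monitor id "mon1"; the linked doc remains the authoritative source):

    # Remove an OSD from a running cluster (hypothetical OSD id 3)
    ceph osd out 3                 # mark the OSD out and let data rebalance
    systemctl stop ceph-osd@3      # on the OSD host, stop the daemon
    ceph osd crush remove osd.3    # remove the OSD from the CRUSH map
    ceph auth del osd.3            # delete its authentication key
    ceph osd rm 3                  # remove the OSD from the cluster

    # Remove a monitor (hypothetical mon id "mon1")
    systemctl stop ceph-mon@mon1   # stop the monitor daemon on its host
    ceph mon remove mon1           # remove it from the monitor map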
Ken, can you please get us the downstream documentation from comment 16?
Verified the doc, looks good!
We aren't ready to fully support this yet. This was initially targeted for 3.0, and we are working towards having it fully tested in upstream CI.
Shrinking is tracked in bz 1366807.

*** This bug has been marked as a duplicate of bug 1366807 ***