Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1707020

Summary: Scaling out with an additional compute node fails during ceph-ansible run
Product: Red Hat OpenStack
Component: documentation
Version: 15.0 (Stein)
Reporter: Marius Cornea <mcornea>
Assignee: Laura Marsh <lmarsh>
QA Contact: RHOS Documentation Team <rhos-docs>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low
Target Milestone: z1
Target Release: 15.0 (Stein)
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
CC: dbecker, dcadzow, dsavinea, fpantano, gabrioux, gcharot, gfidente, johfulto, jvisser, lmarsh, mburns, morazi, ssmolyak, tenobreg
Doc Type: If docs needed, set a value
Last Closed: 2019-10-21 19:43:57 UTC
Type: Bug

Description Marius Cornea 2019-05-06 15:59:50 UTC
Description of problem:

Scaling out with an additional compute node fails during the ceph-ansible run.

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.0-0.1.rc5.el8cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud on pre-deployed servers with 3 controllers, 2 computes, and 3 Ceph nodes.
2. Remove one compute node.
3. Blacklist the already existing compute nodes.
4. Run the overcloud deploy again to add one more compute node to the setup.
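
Step 3 above is normally done with the director's DeploymentServerBlacklist parameter in an environment file passed to the deploy command; a minimal sketch, with illustrative server names:

  parameter_defaults:
    DeploymentServerBlacklist:
      - overcloud-novacompute-0
      - overcloud-novacompute-1

Blacklisted servers are skipped during the deploy run, so only the newly added compute node is configured.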

Actual results:
The deployment fails during the ceph-ansible run; attaching the ceph-ansible playbook output.

Expected results:
No failures.

Additional info:
Attaching job artifacts.

Comment 13 John Fulton 2019-06-10 16:11:36 UTC
We already document how a user can override the relevant parameters:

 https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/index#increasing-the-restart-delay-for-large-ceph-clusters

I suspect we'll be able to reproduce much less frequently (perhaps not at all?) with the following overrides:

parameter_defaults:
  CephAnsibleExtraConfig:
    health_mon_check_retries: 10
    health_mon_check_delay: 20

Perhaps we just need a doc note in the scale-up documentation suggesting the above. As a follow-up, we can request higher defaults for ceph-ansible.

Comment 19 Francesco Pantano 2019-06-13 16:00:10 UTC
Hi Marius, we've seen that the variable names used are wrong, so I think the last attempts are not valid.
Can you try running the jobs again using the following:

  CephAnsibleExtraConfig:
    handler_health_mon_check_retries: 10
    handler_health_mon_check_delay: 20

Thanks.
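
For reference, a sketch of how such an override would typically be applied: place it in an environment file and include that file in the scale-out deploy command (the file name and the remaining deploy arguments are illustrative, not from this bug):

  $ cat > ceph-handler-tuning.yaml <<'EOF'
  parameter_defaults:
    CephAnsibleExtraConfig:
      handler_health_mon_check_retries: 10
      handler_health_mon_check_delay: 20
  EOF
  $ openstack overcloud deploy --templates \
      -e ceph-handler-tuning.yaml \
      [the environment files from the original deployment]

These two variables control how many times, and at what interval, the ceph-ansible restart handler re-checks monitor health, so raising them gives large or slow clusters more time to report healthy.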

Comment 20 Marius Cornea 2019-06-14 21:53:09 UTC
(In reply to fpantano from comment #19)
> Hi Marius, we've seen that the variable names used are wrong, so I think the
> last attempts are not valid.
> Can you try running the jobs again using the following:
> 
>   CephAnsibleExtraConfig:
>     handler_health_mon_check_retries: 10
>     handler_health_mon_check_delay: 20
> 
> Thanks.

I've had multiple runs with the new parameters in place and I wasn't able to reproduce the initially reported issue, so I think we're good.

Comment 21 Laura Marsh 2019-06-25 19:27:16 UTC
For OSP 15, put the content in the Director Installation & Usage Guide, in the "Scaling overcloud nodes" section.

Comment 28 Giulio Fidente 2019-09-11 13:40:25 UTC
We raised the timeouts in ceph-ansible itself [1], so this should be even less likely to be hit; lowering severity.

1. https://bugzilla.redhat.com/show_bug.cgi?id=1718981