Bug 1550977 - Scale out task fails due to ansible worker dying
Summary: Scale out task fails due to ansible worker dying
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 3.1
Assignee: Guillaume Abrioux
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks: 1548353
Reported: 2018-03-02 13:04 UTC by Joe Talerico
Modified: 2019-01-17 17:12 UTC
CC: 16 users

Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.beta6.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-17 17:12:42 UTC
Embargoed:


Links
Github ceph/ceph-ansible pull 2487 (closed): create keys and pools for client nodes only on first node (last updated 2020-10-26 14:08:14 UTC)
Github ceph/ceph-ansible pull 2712 (closed): rolling_update: fix facts gathering delegation (last updated 2020-10-26 14:08:14 UTC)

Description Joe Talerico 2018-03-02 13:04:33 UTC
Description of problem:
Scaling from:

parameter_defaults:
  DnsServers: ["10.16.36.29","10.11.5.19"]

  ControllerCount: 3
  CephStorageCount: 18
  R620ComputeCount: 23
  R630ComputeCount:  23
  6018RComputeCount: 1
  R930ComputeCount: 1
  1029pComputeCount: 0
  1029uComputeCount: 1
  1028rComputeCount: 1
  R730ComputeCount: 1
  ComputeCount: 0


To:
parameter_defaults:
  DnsServers: ["10.16.36.29","10.11.5.19"]

  ControllerCount: 3
  CephStorageCount: 18
  R620ComputeCount: 77
  R630ComputeCount:  46
  6018RComputeCount: 1
  R930ComputeCount: 1
  1029pComputeCount: 0
  1029uComputeCount: 1
  1028rComputeCount: 1
  R730ComputeCount: 1
  ComputeCount: 0


Resulted in:
2018-03-02 11:58:37Z [overcloud-AllNodesDeploySteps-w5a6kxikdgwc.R620ComputeDeployment_Step1]: UPDATE_COMPLETE  state changed
2018-03-02 11:58:37Z [overcloud-AllNodesDeploySteps-w5a6kxikdgwc.WorkflowTasks_Step2_Execution]: UPDATE_IN_PROGRESS  state changed
2018-03-02 11:58:37Z [overcloud-AllNodesDeploySteps-w5a6kxikdgwc.WorkflowTasks_Step2_Execution]: UPDATE_COMPLETE  The Resource WorkflowTasks_Step2_Execution requires replacement.
2018-03-02 11:58:38Z [overcloud-AllNodesDeploySteps-w5a6kxikdgwc.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS  state changed
2018-03-02 12:44:46Z [overcloud-AllNodesDeploySteps-w5a6kxikdgwc.WorkflowTasks_Step2_Execution]: CREATE_FAILED  resources.WorkflowTasks_Step2_Execution: ERROR
2018-03-02 12:44:47Z [overcloud-AllNodesDeploySteps-w5a6kxikdgwc]: UPDATE_FAILED  resources.WorkflowTasks_Step2_Execution: ERROR
2018-03-02 12:44:48Z [AllNodesDeploySteps]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.WorkflowTasks_Step2_Execution: ERROR
2018-03-02 12:44:49Z [overcloud]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.WorkflowTasks_Step2_Execution: ERROR

 Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: c5c6b59a-7a03-4993-bad8-8ae0abb2a0e0
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR
Fri Mar  2 12:45:25 UTC 2018


Looking at the ceph-ansible log, the last task that ran was:

 2018-03-02 12:05:20,644 p=27069 u=mistral |  PLAY [mons,agents,osds,mdss,rgws,nfss,restapis,rbdmirrors,clients,iscsigws,mgrs] ***
 2018-03-02 12:05:21,544 p=27069 u=mistral |  TASK [gather and delegate facts] ***********************************************

It ended with:

 2018-03-02 12:44:06,025 p=27069 u=mistral |  ok: [192.168.25.51 -> 192.168.25.168] => (item=192.168.25.168)
 2018-03-02 12:44:06,136 p=27069 u=mistral |  ok: [192.168.25.54 -> 192.168.25.169] => (item=192.168.25.169)
 2018-03-02 12:44:06,817 p=27069 u=mistral |  ok: [192.168.25.171 -> 192.168.25.165] => (item=192.168.25.165)
 2018-03-02 12:44:07,263 p=27069 u=mistral |  ERROR! A worker was found in a dead state

Comment 1 Joe Talerico 2018-03-02 13:15:41 UTC
Possibly related? https://github.com/ansible/ansible/issues/32554

Comment 2 Joe Talerico 2018-03-02 13:18:03 UTC
Not sure if we ran out of fds... I suppose I would need to increase Ansible's output verbosity to get more insight?

[stack@b04-h01-1029p ~]$ sudo sysctl fs.file-nr
fs.file-nr = 12928      0       26125814

Comment 3 Ben England 2018-03-02 13:40:12 UTC
Try ulimit -a from the account running ceph-ansible and see if any resource limits would affect it. If so, change /etc/security/limits.*. Also see:

https://bugzilla.redhat.com/show_bug.cgi?id=1459891

That bug discusses which kernel parameters limit Ceph thread creation. Note that that problem goes away when we transition to RHCS 3.0, but we are still running RHCS 2.4 in RHOSP 12.
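If a limit such as the open-files cap did turn out to be the constraint, one way to raise it for the account running ceph-ansible is Ansible's pam_limits module (which edits /etc/security/limits.conf). This is only a sketch; the user name 'mistral' and the value 65536 are illustrative assumptions, not validated settings:

- name: raise the open-files soft limit for the account running ceph-ansible
  pam_limits:
    domain: mistral      # assumed account; use whatever user Mistral runs ceph-ansible as
    limit_type: soft
    limit_item: nofile
    value: 65536         # illustrative value only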

Comment 4 Ben England 2018-03-02 13:47:07 UTC
cc'ing John Fulton, OpenStack-Ceph DFG lead.

Comment 5 Joe Talerico 2018-03-02 14:26:27 UTC
Thanks Ben. Mistral is kicking off ceph-ansible, so:

(overcloud) [stack@b04-h01-1029p ~]$ cat /proc/176989/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1029496              1029496              processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1029496              1029496              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
(overcloud) [stack@b04-h01-1029p ~]$ 


I am trying to run through the same test again to see if I can get more detail on what is causing the failure.

Comment 8 John Fulton 2018-03-13 14:25:16 UTC
- Moving this from an OSP bug to a Ceph ceph-ansible bug
- It seems, with a high number of clients, that ceph-ansible hits this issue
- Rather than throw more memory at the host running ceph-ansible, can ceph-ansible optimize the client configuration so that it can configure 89 nodes (of which only 3 are clients)

Comment 10 John Fulton 2018-03-13 16:20:54 UTC
(In reply to John Fulton from comment #8)
> - Moving this from an OSP bug to a Ceph ceph-ansible bug
> - It seems, with a high number of clients, that ceph-ansible hits this issue
> - Rather than throw more memory at the host running ceph-ansible, can
> ceph-ansible optimize the client configuration so that it can configure 89
> nodes (of which only 3 are clients)

Typo: "of which only 3 are ceph OSD servers" 

The other 86 were just in the ansible inventory under the ceph-client role: 

 https://github.com/ceph/ceph-ansible/tree/master/roles/ceph-client

Comment 13 Andrew Schoen 2018-03-15 19:27:51 UTC
The task we're failing on here is a very expensive one: https://github.com/ceph/ceph-ansible/blob/master/site.yml.sample#L57

I believe this was initially added so that we could support ansible-playbook's --limit option. The issue is that to generate a ceph.conf we need to know facts from all nodes in the cluster. In this case we're having a problem because of the number of client nodes, but facts from client nodes are not needed to generate a ceph.conf. Perhaps ceph-ansible could find a way to avoid collecting facts from client nodes (or any other nodes not needed for conf generation) in that task.
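For illustration only, a minimal sketch of that fact-gathering task restricted to non-client groups could look like the following; the group name 'clients' and the exact filter are assumptions, not the actual upstream change:

- name: gather and delegate facts
  setup:
  delegate_to: "{{ item }}"
  delegate_facts: true
  with_items: "{{ groups['all'] | difference(groups.get('clients', [])) }}"
  run_once: true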

If you don't need to update the ceph.conf on the client nodes, it looks like you can get around this by setting 'delegate_facts_host: false' and using '--skip-tags=ceph_update_config'.
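As a rough sketch of that workaround (the group_vars placement is an assumption; the variable and tag names come from the sentence above):

# group_vars/all.yml
delegate_facts_host: false
# then invoke the playbook with: ansible-playbook site.yml --skip-tags=ceph_update_config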

Comment 14 Ken Dreyer (Red Hat) 2018-03-20 14:57:15 UTC
Discussed in the stand-up call today. The OSP team has tried delegate_facts_host: false and they still hit the memory problem. Guillaume and Joe are working to reproduce this today.

Joe and Guillaume, would you please share the results of your testing?

Comment 15 Andrew Schoen 2018-03-20 15:03:36 UTC
It looks like `delegate_facts_host` does not exist in the stable-3.0 version of ceph-ansible upstream. This commit would need to be backported: https://github.com/ceph/ceph-ansible/commit/4596fbaac1322a4c670026bc018e3b5b061b072b

Comment 16 Andrew Schoen 2018-03-20 15:17:18 UTC
This commit would add `delegate_facts_host` to the site-docker.yml.sample playbook, and it has not been backported to stable-3.0 either:

https://github.com/ceph/ceph-ansible/commit/c315f81dfe440945aaa90265cd3294fdea549942

Comment 17 Andrew Schoen 2018-03-20 15:20:49 UTC
(In reply to Andrew Schoen from comment #16)
> This commit would add `delegate_facts_host` to the site-docker.yml.sample
> playbook and it is not backported to stable-3.0 either:
> 
> https://github.com/ceph/ceph-ansible/commit/c315f81dfe440945aaa90265cd3294fdea549942

I'm incorrect, that commit does exist in stable-3.0 upstream.

Comment 18 Guillaume Abrioux 2018-03-20 17:57:24 UTC
I tried to run the playbook on an admin node with only 200MB RAM and 60+ nodes in the inventory. After many tests I was only able to hit a memory issue, but not the one described in this BZ:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: OSError: [Errno 12] Cannot allocate memory
fatal: [osd17]: FAILED! => {}

MSG:

Unexpected failure during module execution.


Looks like I run out of memory before I can hit the issue reported.
I'm not sure how I can reproduce this.

Joe, if you can reproduce this error in your env, could you link the playbook run log and keep the env running so I can take a look?

Thanks!

Comment 19 Ben England 2018-03-23 12:55:31 UTC
In today's Ceph DFG meeting Guillaume told us that this PR is being tried:

https://github.com/ceph/ceph-ansible/pull/2459

This might solve the problem if it works as intended.

Question: does this change mean that a node which is only in the [clients] role will not need to deploy a container just to discover facts, put ceph.conf in place, and install RPMs? I think the cost of container deployment to potentially hundreds of compute nodes was part of Joe Talerico's concern. Thanks!

Comment 20 Ben England 2018-03-27 11:38:42 UTC
I was looking at Joe Talerico's results here

https://i.imgur.com/eppyWLW.png

with and without Guillaume's patch, wondering why compute nodes would still have to pull the docker image down with Guillaume's patch. Then I looked at roles/ceph-client/tasks/create_users_keys.yml, which shows that on compute nodes you still have to pull down the docker image to manufacture keys and create pools. What I don't get about it is this:

- aren't the keys the same on every client?
- so why can't they be manufactured on one of the clients and copied to the other ones?
- why are pools created ON THE CLIENTS?  This should only be done once.  Is it possible to do pool creation from the first client only?

If it were possible to do it this way, I think deployment would be greatly sped up, since we would avoid deploying a container on every client, as Joe suggested.
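For illustration, the kind of change being asked for here (and what the linked pull 2487, "create keys and pools for client nodes only on first node", describes) is to run the key/pool tasks once, delegated to the first client. A rough sketch only; the group name 'clients' and the pool item fields are assumptions:

- name: create pools from the first client only
  command: "ceph --cluster {{ cluster | default('ceph') }} osd pool create {{ item.name }} {{ item.pg_num }}"
  with_items: "{{ pools | default([]) }}"
  delegate_to: "{{ groups['clients'][0] }}"
  run_once: true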

Comment 22 Sébastien Han 2018-04-05 13:23:07 UTC
Will be in 3.1

Comment 23 Ken Dreyer (Red Hat) 2018-04-05 20:35:58 UTC
Would you please tag v3.1.0beta5 on master for this so OSP 13 can cross-ship this into their release?

Comment 27 Ben England 2018-05-04 12:35:33 UTC
There has been additional work on memory consumption and scalability of ceph-ansible; see https://github.com/ceph/ceph-ansible/issues/2553. I plan to test an 80-node deploy with Infrared; if we get the same result with PR 2560, which passed CI, then I think we can consider the memory issue resolved.

Comment 28 Ben England 2018-05-22 12:22:04 UTC
The memory issue is fixed, and the O(N^2) issue is fixed. Tested with 90 computes, 4 OSDs, and 3 mons using ceph-ansible-3.1.0-0.1.rc3.el7cp.noarch and ansible 2.4.3. The delegate-facts task takes 10 minutes to run, about 1/4 of the total ceph-ansible execution time. Perhaps it is collecting from one host at a time. But this is not the original issue, and it certainly isn't worse than before in this respect. We may revisit the slowness later, and we may need to test ceph-ansible more extensively at this scale (e.g. in an HCI configuration, with more roles in play). See the ceph-ansible issue comment here:

https://github.com/ceph/ceph-ansible/issues/2553#issuecomment-390020874

Comment 29 Ken Dreyer (Red Hat) 2019-01-17 17:12:42 UTC
OpenStack shipped ceph-ansible-3.1.0-0.1.rc9.el7cp first in http://access.redhat.com/errata/RHEA-2018:2086 .

RHCEPH shipped ceph-ansible-3.1.5-1.el7cp in http://access.redhat.com/errata/RHBA-2018:2819 .

