Bug 1319856 - ceph-installer executes installation of packages in sequence for different nodes
Summary: ceph-installer executes installation of packages in sequence for different nodes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Installer
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Target Release: 3.1
Assignee: Christina Meno
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1291304 1319833
 
Reported: 2016-03-21 16:36 UTC by Shubhendu Tripathi
Modified: 2019-12-12 20:48 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-15 16:44:16 UTC
Embargoed:



Description Shubhendu Tripathi 2016-03-21 16:36:42 UTC
Description of problem:
During cluster creation from USM, the installation of mon and osd packages happens in sequence across the nodes. This could be a scalability issue.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Bootstrap the storage nodes with ceph-installer and the USM agent
2. Accept the nodes in USM
3. Execute cluster creation using the nodes as MON and OSD

Actual results:
The installation of the packages happens in sequence.

Expected results:
Installation of packages on the nodes should happen in parallel.

Additional info:
In the QE setup, cluster creation with 6 fresh nodes took almost 30 minutes.

Comment 2 Shubhendu Tripathi 2016-03-21 16:38:46 UTC
@mbukatov, you may add additional details for this.

Comment 5 Alfredo Deza 2016-03-24 20:01:13 UTC
It depends on how requests are being sent to the installer. If an install request is POSTed for each host, the installs will run sequentially. If a group of hosts (say, the MONs) is sent in a single request, the process uses the default parallel value (5).

30 minutes doesn't sound right. The description doesn't mention how these hosts are being installed, how the requests are being handled, or what the output of these tasks looks like.

The ceph-installer captures start/end times for tasks and other useful information like the command being used to call Ansible.
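For illustration, a minimal sketch of the two request patterns described above (the installer URL and port and the exact "hosts" payload shape are assumptions, not taken from this bug):

    import requests

    INSTALLER = "http://ceph-installer.example.com:8181"  # hypothetical endpoint

    # Sequential: one POST per host creates one task per host, which the
    # installer works through one after another.
    for host in ["mon1", "mon2", "mon3"]:
        requests.post(INSTALLER + "/api/mon/install", json={"hosts": [host]})

    # Parallel (on the Ansible side): one POST with all hosts creates a single
    # task that runs against every host at once (default parallel value: 5).
    requests.post(INSTALLER + "/api/mon/install",
                  json={"hosts": ["mon1", "mon2", "mon3"]})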

Comment 6 Christina Meno 2016-04-08 17:52:24 UTC
would you please provide an example of the request you are making to achieve sequential package install?

Comment 7 Shubhendu Tripathi 2016-04-09 03:15:43 UTC
For each of the hosts (mon/osd) we invoke /api/mon/install or /api/osd/install respectively, one request per host, from separate threads.

This makes sure that we are able to track, and report in the UI, whether installation succeeded for each node.
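A rough sketch of that client-side pattern (the host names, installer URL, and task-id field are hypothetical; note that the threads only parallelize the HTTP calls, not the installs themselves):

    import threading
    import requests

    INSTALLER = "http://ceph-installer.example.com:8181"  # hypothetical endpoint
    task_ids = {}

    def install(role, host):
        # One request per host so the UI can track each node individually.
        r = requests.post(f"{INSTALLER}/api/{role}/install", json={"hosts": [host]})
        task_ids[host] = r.json().get("identifier")  # task-id field name is an assumption

    threads = [threading.Thread(target=install, args=(role, host))
               for role, host in [("mon", "mon1"), ("mon", "mon2"), ("osd", "osd1")]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()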

Comment 8 Alfredo Deza 2016-04-11 13:18:33 UTC
(In reply to Shubhendu Tripathi from comment #7)
> For each of the hosts (mon/osd) we invoke /api/mon/install or
> /api/osd/install respectively one by one in threads.
> 
> This makes sure that we are able to track and report in UI of successful
> installation done for each node.

Then this is not a bug. Per comment #5, if an install request is POSTed for each host, it forces the installer to go sequentially. To avoid this, the client must pass multiple hosts in a single install request (which the API allows).

Comment 9 Shubhendu Tripathi 2016-04-12 04:04:21 UTC
Alfredo, is this a design restriction, or is it due to ceph-ansible?
My understanding is that even if multiple HTTP POSTs are submitted to the server, it can create an async task for each POST and return the task IDs to the client.

These async tasks could run in parallel. This should not cause any issues between tasks, as they are executed against different hosts.

We do something similar in USM for node accept/initialize: the UI submits multiple POSTs to the server, but the async tasks run in parallel for the different hosts.
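For reference, a client could poll such a returned task ID along these lines (the /api/tasks/<id>/ path and the "ended"/"succeeded" fields are assumptions about the installer's task-status endpoint, not taken from this bug):

    import time
    import requests

    INSTALLER = "http://ceph-installer.example.com:8181"  # hypothetical endpoint

    def wait_for(task_id, interval=10):
        # Poll a task until it reports an end time, then return its success flag.
        while True:
            task = requests.get(f"{INSTALLER}/api/tasks/{task_id}/").json()
            if task.get("ended") is not None:   # assumed completion marker
                return task.get("succeeded")    # assumed success flag
            time.sleep(interval)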

@Nishanth, anything to add here?

Comment 10 Alfredo Deza 2016-04-12 13:12:56 UTC
(In reply to Shubhendu Tripathi from comment #9)
> Alfredo, is this a design restriction or due to ceph-ansible.
> My understanding is, even if multiple http POST are submitted to the server,
> it can create async tasks for each of the POST and return the task ids to
> the client.

This is correct, but the "async" process here places these requests in a queue that only one worker consumes from, so even though the client gets an immediate response because of the asynchronous nature, the tasks are completed one at a time by that single worker.

> 
> These async tasks can run in parallel. Also this should not cause any issues
> within tasks as they are executed for different hosts as such.

They cannot run in parallel today. As I mentioned, allowing parallel task execution would require more work. If the client requires parallel execution, it must pass multiple hosts when installing.
> 
> This is something similar we are doing in USM for node accept/initialize. UI
> does submit multiple POST to the server, but the async tasks are run in
> parallel for different hosts.
> 
> @Nishanth, anything to add here?

Comment 13 Alfredo Deza 2016-04-29 15:01:50 UTC
Martin: not sure why this is blocking 1319833. As I mentioned in comments #8 and #10:

Parallel installation across multiple hosts *is already allowed by the API*, which accepts multiple hosts in a single request.

See the "Install Operations" in http://docs.ceph.com/ceph-installer/docs/#install-operations

From that section:

    The install requests to the API are allowed to pass a list of multiple hosts.
    
    This process is not sequential: all hosts are operated against at once and if
    a single host fails to install the entire task will report as a failure. This
    is expected Ansible behavior and this API adheres to that.

Comment 14 Martin Bukatovic 2016-04-29 15:19:54 UTC
(In reply to Alfredo Deza from comment #13)
> Martin: not sure why this is blocking 1319833. Like I mentioned in comment
> #8 and #10:

I just noticed that this BZ was moved into the "Red Hat Storage Console" product,
and interpreted this as an acknowledgement of the point you mention, that the
issue actually is in the RHSC rather than ceph-installer itself. But based on
your comment #13, it seems that I may have misunderstood the meaning of the
ceph-installer component of RHSC. If this is the case, I'm sorry for the
confusion; feel free to revert the link to BZ 1319833 back to the "see also" state.

Comment 15 Nishanth Thomas 2016-04-29 15:25:16 UTC
From the USM integration point of view this is an issue. Suppose you pass 50 hosts in one API call and one fails; does the whole request then fail? I don't think this is the right behaviour. Also, USM should show the user which hosts failed and which did not. Based on the current task output, there is no way to figure out this information. That is the reason we send each installation request as a separate request. What is blocking you from creating separate tasks for each of these and running them in parallel?

Comment 16 Ken Dreyer (Red Hat) 2016-04-29 22:09:01 UTC
(In reply to Martin Bukatovic from comment #14)
> (In reply to Alfredo Deza from comment #13)
> > Martin: not sure why this is blocking 1319833. Like I mentioned in comment
> > #8 and #10:
> 
> I just noticed that this BZ was moved into "Red Hat Storage Console" product,
> and interpreted this as an acknowledgement of the point you mention, that the
> issue actually is in the RHSC rather then ceph-installer itself. But based on
> your comment #13, it seems that I may misunderstood the meaning of
> ceph-installer component of RHSC. If this is the case, I'm sorry for the
> confusion and feel free to revert link to BZ 1319833 back to "see also"
> state.

The move to the RH Storage Console product simply means that we are now trying to track all our installer bugs in the RH Storage Console product. This aligns with the fact that the ceph-installer RPM and its dependencies will ship in the RH Storage Console product, not the RH Ceph Storage product.

It's confusing to have "ceph-installer" BZ components in two products, and it's my understanding that we will disable the "ceph-installer" sub-component in the RH Ceph Storage product soon, so we need to be tracking all ceph-installer bugs here instead.

Comment 17 Alfredo Deza 2016-05-02 14:14:50 UTC
(In reply to Nishanth Thomas from comment #15)
> From USM integration point view this is an issue. Suppose you passed 50 host
> to API call and one fails means the request itself fail? I dont think this
> is the right behaviour.

That is not the right behavior for USM's domain logic. It is entirely valid for Ansible. This is why it is crucial to determine what behavior is needed by the client and not the installer.

I do understand that having 50 individual requests would be a problem. But that wouldn't be solved by an increase in the number of workers for the installer. For example, if we increased that number to, say, 8 workers, it would mean that the client would see five servers at a time, which would still take a very long time to complete.

> Also from usm should show the user what is failed
> what is not. Based on current output from task, there is no way to figure
> out this information. So that is reason we are sending each installation
> request as a separate request. 

This looks more like the problem we should solve in the installer: "if N hosts are used for a task and it fails, report back what host(s) failed"

> What is blocking you to create separate tasks
> for each of these and run it parallel?

This is tricky because it means we would now need to create distinct queues (as opposed to a single one): one for installs and another one for configurations. Having two queues is not that complex but it would require a good amount of effort to configure it correctly.

Once those queues are correctly set up and separated, we would then need an increased number of workers to cope with the volume of requests. The common approach here is one worker per CPU core, which will probably be 8. Even that number wouldn't help much in the case of 50 requests.

The added caveat here is that the machine's load (using a worker per core) could get high enough to cause severe usage issues. Since the console is installed on the same host, that would no doubt have repercussions for it. Is that a risk that is OK to take?
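As a rough illustration of the "distinct queues plus more workers" idea, assuming a Celery-style task queue behind the installer (an assumption about its async layer; all names below are hypothetical):

    from celery import Celery

    app = Celery("installer", broker="amqp://localhost//")

    # Route install and configure tasks to separate queues so a long install
    # cannot block configuration work.
    app.conf.task_routes = {
        "tasks.install": {"queue": "installs"},
        "tasks.configure": {"queue": "configures"},
    }

    # Workers would then be started per queue, with a concurrency of roughly
    # one process per CPU core (about 8 here), e.g.:
    #   celery -A installer worker -Q installs -c 8
    #   celery -A installer worker -Q configures -c 8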

Comment 18 Nishanth Thomas 2016-05-10 06:06:04 UTC
(In reply to Alfredo Deza from comment #17)
> (In reply to Nishanth Thomas from comment #15)
> > From USM integration point view this is an issue. Suppose you passed 50 host
> > to API call and one fails means the request itself fail? I dont think this
> > is the right behaviour.
> 
> That is not the right behavior for USM's domain logic. It is entirely valid
> for Ansible. This is why it is crucial to determine what behavior is needed
> by the client and not the installer.
> 
> I do understand that having 50 individual requests would be a problem. But
> that wouldn't be solved by an increment in the number of workers for the
> installer. For example, if we increased that number to, say, 8 workers, it
> would mean that the client would see five servers at a time which would
> still take very long to complete. 
> 

But that would still be a huge improvement compared to what we have today.

> > Also from usm should show the user what is failed
> > what is not. Based on current output from task, there is no way to figure
> > out this information. So that is reason we are sending each installation
> > request as a separate request. 
> 
> This looks more like the problem we should solve in the installer: "if N
> hosts are used for a task and it fails, report back what host(s) failed"
> 

How easy would it be for you to do this? A list of failed nodes and a list of successful nodes.

> > What is blocking you to create separate tasks
> > for each of these and run it parallel?
> 
> This is tricky because it means we would now need to create distinct queues
> (as opposed to a single one): one for installs and another one for
> configurations. Having two queues is not that complex but it would require a
> good amount of effort to configure it correctly.
> 
> Once those queues are correctly set and separated, then we need to come up
> with an increased number of workers to help
> with the amount of requests. The common approach here is using one per
> CPU/core, which will probably be 8. Even that number wouldn't help that much
> in the case of 50 requests.
> 
> The added caveat here is that a machine's load (using a worker per core)
> could get high enough to cause severe usage issues. Since the console is
> installed in the same host, it would no doubt have repercussions for that.
> Is that a risk that is OK to take?

I think we need to do some benchmarking before we decide on this. From the USM standpoint it is a good feature to explore and implement, as we have better control if we run the installation tasks separately.

Comment 19 Drew Harris 2017-06-28 20:04:00 UTC
I added a priority and severity as an experiment to see if those carry over when moving this from Console to Ceph.

Comment 22 Drew Harris 2017-06-28 20:08:41 UTC
Looks like a successful transition.

