Bug 1382492 - Ansible mux_client_request_session errors scaling cluster up by 100 nodes with forks=100
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Scott Dodson
QA Contact: Johnny Liu
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks:
 
Reported: 2016-10-06 19:42 UTC by Mike Fiedler
Modified: 2017-06-09 02:56 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-09 02:56:25 UTC
Target Upstream Version:



Description Mike Fiedler 2016-10-06 19:42:53 UTC
Description of problem:

I've hit this on 3 different install attempts with out-of-the-box (OTB) sshd tuning, so I'm opening a bug. I have logs from 2 of the attempts and will attach them to this bz.

While scaling a cluster up to add 100 nodes using the new version of Ansible (2.2.0.0-0.61.rc1.el7), ssh mux_client_request_session errors occur during node certificate configuration. The error does not occur on all nodes.

All nodes are identical from the same gold image.

2016-10-06 11:50:09,580 p=112268 u=root |  fatal: [192.1.5.246]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: mux_client_request_session: session request failed: Session open refused by peer\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}

The errors all seem to happen in openshift_node_certificates and cause node registration failures down the line (see the end of the first log).

From the first log, the errors pop in these sections and nowhere else:

openshift_node_certificates : Create openshift_generated_configs_dir if it does not exist
openshift_node_certificates : Generate the node client config
openshift_node_certificates : Generate the node server certificate
openshift_node_certificates : Create a tarball of the node config directories
openshift_node_certificates : Unarchive the tarball on the node

From the second log (a completely separate set of nodes, no overlap):

openshift_node_certificates : Create openshift_generated_configs_dir if it does not exist
openshift_node_certificates : Generate the node server certificate
openshift_node_certificates : Create a tarball of the node config directories



Version-Release number of selected component (if applicable): 3.3.0.34



How reproducible: 3 out of 3 attempts


Steps to Reproduce:
1. Install an HA cluster (3 etcd, 3 master, 1 master lb, 2 infra nodes, 3 test nodes). My install was on OpenStack.
2. Run the e2e Conformance tests to vet the cluster. Tests passed.
3. Run the openshift-ansible/byo/openshift-node/scaleup.yml playbook to add 100 new nodes to the cluster.


Actual results:


During the node certificate configuration, the ssh errors above occurred for some (not all) nodes.  Node registration later failed for those systems.

Expected results:

Successful install.

Additional info:

Also saw Ansible sftp warnings I have not seen in the past:

[WARNING]: sftp transfer mechanism failed on [192.1.5.27]. Use ANSIBLE_DEBUG=1 to see detailed information

Comment 2 Mike Fiedler 2016-10-06 19:47:24 UTC
The workaround (solution?) seems to be increasing MaxSessions in /etc/ssh/sshd_config on each node we are installing on.  I bumped it to 50 on mine (default is 10) and have had 2 successful scaleups of 100 and 200 nodes.
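
A minimal sketch of automating that workaround with Ansible itself, assuming a hypothetical "new_nodes" inventory group (the file path and the value 50 come from this comment; note sshd must be restarted for the change to take effect):

- hosts: new_nodes
  become: true
  tasks:
    - name: Raise sshd MaxSessions from the default of 10 to 50
      lineinfile:
        dest: /etc/ssh/sshd_config
        regexp: '^#?MaxSessions'
        line: 'MaxSessions 50'
      notify: restart sshd
  handlers:
    - name: restart sshd
      service:
        name: sshd
        state: restarted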

Is there something different about the node cert phase of the install?

Comment 3 Scott Dodson 2016-10-07 13:53:49 UTC
Andrew,

This is happening on tasks where we delegate_to a specific host, meaning we're slamming openshift_ca_host with 100 connections/tasks at once.

What do you think about making all plays that have delegate_to tasks run at serial: 10?
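
For illustration, the pattern being described looks roughly like this (a sketch with stand-in task bodies, not the actual role):

# With forks=100, all node hosts hit the task at once, and delegate_to
# routes every copy to the single CA host, opening up to 100 concurrent
# ssh sessions against it.
- hosts: nodes
  tasks:
    - name: Generate the node client config (stand-in body)
      command: /bin/true
      delegate_to: "{{ openshift_ca_host }}"

# The proposed mitigation caps the number of in-flight hosts per play.
- hosts: nodes
  serial: 10
  tasks:
    - name: Generate the node client config (stand-in body)
      command: /bin/true
      delegate_to: "{{ openshift_ca_host }}"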

Comment 4 Andrew Butcher 2016-10-07 14:59:47 UTC
Since nodes are the only component we'll see more than 10 of (in most cases), and delegate_to is isolated to the node certificates role (as far as nodes go), we should try breaking node certificates out of the node configuration plays and running them at serial: 10. I think that will have the smallest impact on run time.

We could also move node certificates back to with_items (all nodes) and apply the role to the first master host, but that moves logic back into the playbook, which would make it harder to maintain.
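
A sketch of that second option, assuming a hypothetical "new_nodes" inventory group (stand-in task body):

# Apply the cert tasks once, on the first master, looping over all
# nodes with with_items instead of delegating from each node's play.
- hosts: masters[0]
  tasks:
    - name: Generate certificates for each new node (stand-in body)
      command: /bin/true
      with_items: "{{ groups['new_nodes'] | default([]) }}"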

Comment 7 Mike Fiedler 2016-10-10 12:26:12 UTC
With forks=100 and MaxSessions=50

scale up a batch of 100 nodes = 30 minutes
batch of 200 nodes = 56 minutes
batch of 175 nodes = 47 minutes

Comment 10 John Barker 2016-10-20 13:51:00 UTC
Hi,
I've been directed here after Tuesday's "OpenShift(3.4)-on-OpenStack(10) Scalability Testing" call; I'm on the Ansible Core Team.

It sounds like you have this partly in hand; is there anything specific you'd like to know from Ansible Core Engineering?

I can be found as gundalow on freenode & GitHub

Comment 11 John Barker 2016-10-20 18:09:21 UTC
Some initial thoughts after speaking to people:

1) It sounds like you are delegating a task many times to a single host.

1.1) What are you actually doing in that case?

1.2) Is the machine you are delegating to the one throwing the mux_client_request_session error?

1.3) If you increase MaxSessions, does it work (ignoring the increase in runtime)? From Comment 7 it sounds like this is working.

1.4) After increasing MaxSessions, do you hit other bottlenecks on that machine?

1.5) Can the task be rewritten so it doesn't always have to delegate to a single point? It feels like a change in architecture is needed.

2) Can you provide a link to the role that's being delegated to, so we can take a look?

3) From Comment 7, could there be an issue with the responsiveness of the IaaS server? Are there any logs showing the requests arriving and where the delay is?

In our experience, performance issues generally boil down to one machine getting overloaded, e.g. forks=200 installs all pulling from a single git server.

Comment 12 John Barker 2016-10-21 11:45:21 UTC
(In reply to Mike Fiedler from comment #7)
> With forks=100 and MaxSessions=50
> 
> scale up a batch of 100 nodes = 30 minutes
> batch of 200 nodes = 56 minutes
> batch of 175 nodes = 47 minutes

From my understanding, every host (so up to 200?) generates its certs by running roles/openshift_node_certificates, which runs 8 tasks that delegate_to "openshift_ca_host".

I wonder if the slowdown is due to openshift_ca_host becoming overloaded; we know there are at least 10 simultaneous tasks running against it, since we hit the mux_client_request_session limit on that machine. We know there are between 11 and 50 connections to that machine.

Generating certificates requires entropy.


1)
It would be interesting to watch the following on openshift_ca_host during the playbook runs with different fork levels:

# Print a timestamp, available entropy, and the 1-minute load average every half second
while true; do paste <(date --rfc-3339=seconds) /proc/sys/kernel/random/entropy_avail <(cut -f1 -d ' ' /proc/loadavg); sleep 0.5; done


2) 
How are you "scaling up" and limiting batch size, with --limit?

Comment 13 Mike Fiedler 2016-10-21 11:48:55 UTC
We're going to reinstall this environment today or over the weekend.   I can do #1.   

For #2, I am not setting any sort of --limit parameter.   Recommendations for this attempt?

sdodson/jdetober - is openshift_ca_host always the first master?

Comment 14 John Barker 2016-10-21 11:55:16 UTC
I was wondering what the following actually means in practice: 
> With forks=100 and MaxSessions=50
> 
> scale up a batch of 100 nodes = 30 minutes
> batch of 200 nodes = 56 minutes
> batch of 175 nodes = 47 minutes

Comment 15 Mike Fiedler 2016-10-21 13:09:35 UTC
re: comment 14. I was just trying to document that installs with forks > 20 could succeed if MaxSessions in /etc/ssh/sshd_config was bumped above the default, whereas they were unsuccessful without modifying that parameter.

Comment 16 Brenton Leanhardt 2017-01-05 21:34:38 UTC
Are you blocked by this or is your lowered forks value OK for now?  We're thinking of closing this.

Comment 21 Scott Dodson 2017-01-24 18:24:42 UTC
Andrew,

Does the certificate generation serialization mitigate this issue?

Comment 22 Andrew Butcher 2017-01-24 20:48:30 UTC
A few of the tasks are now "run_once", but there are 4 tasks which will still generate many connections to the first master. Every task in openshift_node_certificates would need to be serialized to mitigate this.
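
For context, an illustrative sketch of the difference (task names and bodies are stand-ins):

# run_once executes a single time per play, so it opens one delegated
# connection no matter how many node hosts are in the play.
- name: Create openshift_generated_configs_dir (stand-in body)
  command: /bin/true
  run_once: true
  delegate_to: "{{ openshift_ca_host }}"

# Without run_once, the task runs once per node host, opening a
# delegated ssh session to the first master for every node.
- name: Generate the node server certificate (stand-in body)
  command: /bin/true
  delegate_to: "{{ openshift_ca_host }}"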

Comment 23 Scott Dodson 2017-06-09 02:56:25 UTC
We have no immediate plans to support forks greater than 20.

