Bug 1367948 - openshift 3.3 installer is unstable
Summary: openshift 3.3 installer is unstable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.3.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.1
Assignee: Scott Dodson
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-18 00:54 UTC by Jeremy Eder
Modified: 2016-10-31 12:28 UTC
CC List: 15 users

Fixed In Version: ansible-2.2.0.0-0.62.rc1.el7
Doc Type: Bug Fix
Doc Text:
Previously, in order to overcome performance regressions seen in Ansible 2.1, we had updated to an early Ansible 2.2 development build. We have since updated to Ansible 2.2 RC1, bringing considerable reliability improvements, especially when dealing with large numbers of hosts.
Clone Of:
Environment:
Last Closed: 2016-10-27 16:12:56 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Product Errata RHBA-2016:2122
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: OpenShift Container Platform atomic-openshift-utils bug fix update
Last Updated: 2016-10-27 20:11:30 UTC

Description Jeremy Eder 2016-08-18 00:54:37 UTC
ansible-2.2.0-0.5 RPM
openshift 3.3.0.18 puddle

This week I've had a few installs get wedged in the openshift_repos role, specifically in the steps that use with_fileglob. If I comment out those 3 or 4 steps, installs are more likely to complete successfully.
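
(For reference, a with_fileglob task of the kind in question looks roughly like this; a minimal sketch with illustrative paths, not the exact openshift_repos tasks:)

  - name: Copy local repo files into place
    copy:
      src: "{{ item }}"
      dest: /etc/yum.repos.d/
    with_fileglob:
      - "*.repo"    # globs files on the control machine, one copy task per match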

I've also seen the install hang at the [openshift_node_certificates : Check status of node certificates] step.
This one seems most likely to occur when running scaleup.yml with a bunch of

This is using openshift-ansible master from the day I run the test (I make sure I pull fresh from master ~once a day).

Another engineer says his install died twice today, once on the openshift_repos role and then again on a dnsmasq role.

Comment 1 Scott Dodson 2016-08-18 01:29:27 UTC
I've seen a lot of general hangs, almost always in the [setup] module. According to folks on #ansible slack channel this is often due to persistent connection issues and they suggested using paramiko connection plugin. I've tried that and I don't think it's improved the situation. This is also the cause of frequent hangs in CI jobs. 
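
(For anyone repeating that experiment, switching connection plugins is a small change; a sketch, with the inventory and playbook paths illustrative:)

  # ansible.cfg
  [defaults]
  transport = paramiko

  # or per run:
  ansible-playbook -c paramiko -i hosts playbooks/byo/config.yml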

The guidance on #ansible was to attempt to debug whether an ssh connection was open. Perhaps I don't know how to tell whether a persistent control connection is open, but I didn't see any active ssh sessions on the remote host. We need to figure out how to debug that, I think.
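
(For the record, OpenSSH can ask a control master directly whether it is alive. Using Ansible's default control path pattern, that is roughly the following; host name illustrative:)

  ssh -o ControlPath=~/.ansible/cp/ansible-ssh-%h-%p-%r -O check node01.example.com
  # "Master running (pid=...)" means a persistent connection is open; an error means none is.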

That said, the stalls I've seen have always been in the setup module, if you're seeing it elsewhere it may be a different problem.

Comment 5 James C. 2016-08-18 13:48:47 UTC
Scott, for the setup hang, can you use gather_subset per the docs here http://docs.ansible.com/ansible/setup_module.html to try and isolate it down to a specific set of facts which may be causing problems? I know some disk checks can cause hangs if they're in a bad state.
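
(For example, a minimal sketch of skipping the hardware facts in a play; host pattern illustrative:)

  - hosts: all
    gather_facts: no
    tasks:
      - name: Gather all facts except the hardware subset
        setup:
          gather_subset: "!hardware"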

Comment 6 Scott Dodson 2016-08-18 16:07:49 UTC
(In reply to James C. from comment #5)
> Scott, for the setup hang, can you use gather_subset per the docs here
> http://docs.ansible.com/ansible/setup_module.html to try and isolate it down
> to a specific set of facts which may be causing problems? I know some disk
> checks can cause hangs if they're in a bad state.

I'll add gather_subset = !hardware to my config and see if we can still run our playbooks. I don't think we depend on any hardware facts.
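
(In ansible.cfg terms that is:)

  [defaults]
  gather_subset = !hardware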

We're not calling setup explicitly. If I have a playbook run that's hung, what's the best way to figure out what it's doing?

Comment 7 Scott Dodson 2016-08-18 20:33:15 UTC
I believe the stalling issues I've been seeing are different from those that Mike and Jeremy have been seeing. Theirs are consistently in the openshift_repos role which makes use of with_fileglob and that's actually one of the cherry-picks we did. See https://github.com/ansible/ansible/issues/16801

I've put in a PR to disable these tasks because they shouldn't be necessary for OSE where the repos should be managed via RHSM. This code path is really only useful if one were to do an Origin install and then attempt an OSE install, I think.

https://github.com/openshift/openshift-ansible/pull/2325

We've also enabled ssh debug logs on their hosts in hopes of catching any other problems with control persist.
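
(Assuming the standard sshd knobs, that amounts to something like the following on the target hosts, followed by an sshd restart; the exact log level used is an assumption:)

  # /etc/ssh/sshd_config
  LogLevel DEBUG3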


Regarding my stalled setup tasks, I'm trying to get a somewhat reliable reproducer with all facts being gathered and then I'll try to limit fact gathering as suggested in comment #5.

Comment 8 Jeremy Eder 2016-08-19 12:28:29 UTC
(In reply to Mike Fiedler from comment #3)
> I hit a hang today scaling up from 300 -> 500 nodes.
> 
> 2016-08-17 11:53:06,627 p=9838 u=root |  RUNNING HANDLER
> [openshift_node_dnsmasq : restart NetworkManager] **************
> 2016-08-17 11:53:06,627 p=9838 u=root |  Wednesday 17 August 2016  11:53:06
> -0400 (0:00:10.812)       0:24:52.188 ******
> 2016-08-17 12:53:20,840 p=9838 u=root |   [ERROR]: User interrupted execution
> 
>  Ctrl-C and restart got past it.

I've just now hit this as well; for me it occurred when I tried to run scaleup against 150 new nodes. I hadn't hit it until that point. I'm now just continuing forward in groups of 85 nodes up to our target of 1000.

The "mix" that's currently working for me is:
ansible-2.2.0-0.2.pi.el7.noarch
openshift-ansible 522cccbc7fd119a182a44af8fb2c0959d919a093
commented out fileglob entries in roles/openshift_repos/
forks=100
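
(forks=100 refers to the ansible.cfg setting, i.e.:)

  [defaults]
  forks = 100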

Comment 13 Scott Dodson 2016-10-04 15:36:00 UTC
Our aim is to improve stability of the installer by shipping Ansible 2.2 release candidates or final builds.

Comment 15 Scott Dodson 2016-10-04 20:25:47 UTC
Please test with these two builds in the latest 3.3 puddles. We've never been able to reliably reproduce these problems, so it'd be great to get as much soak time as possible to see if the situation has improved.

ansible-2.2.0.0-0.61.rc1.el7
openshift-ansible-3.3.30-1.git.0.b260e04.el7

Comment 16 Timothy St. Clair 2016-10-04 21:05:47 UTC
Are there any known limitations to the number of forks?  

Previously we set ~100-200.

Comment 17 Scott Dodson 2016-10-04 21:57:55 UTC
There should not be. But for the sake of apples to apples, can we test with whatever was most widely used when you were hitting the issue before?

Comment 18 Jeremy Eder 2016-10-05 00:09:03 UTC
It reproduced @ 100 forks in the past.  Will try to get to this tomorrow or Thursday.

Comment 19 Mike Fiedler 2016-10-05 01:09:02 UTC
The latest Ansible available in the AOS repo is 2.2.0-0.50.prerelease.el7.

Can we get the version mentioned in comment 15 pushed to https://mirror.openshift.com/enterprise/enterprise-3.4/latest/RH7-RHAOS-3.4/x86_64/os/ and https://mirror.openshift.com/enterprise/enterprise-3.3/latest/RH7-RHAOS-3.3/x86_64/os/ ?

Comment 21 Scott Dodson 2016-10-05 01:49:30 UTC
(In reply to Mike Fiedler from comment #19)
> The latest Ansible available in the AOS repo is  2.2.0-0.50.prerelease.el7
> 
> Can we get the version mentioned in comment 15 pushed to
> https://mirror.openshift.com/enterprise/enterprise-3.4/latest/RH7-RHAOS-3.4/
> x86_64/os/ and
> https://mirror.openshift.com/enterprise/enterprise-3.3/latest/RH7-RHAOS-3.3/
> x86_64/os/ ?

Mike, ansible-2.2.0.0-0.61.rc1.el7 should now be in the 3.3 and 3.4 repos on the mirror. Sorry I forgot to push them there.

Comment 22 Johnny Liu 2016-10-09 10:53:33 UTC
After ansible was updated to ansible-2.2.0.0-0.61.rc1.el7, QE often encounter the following issue (but cannot always reproduce it), especially when multiple installers are running on one host.

TASK [openshift_node : Configure Proxy Settings] *******************************
Sunday 09 October 2016  10:14:18 +0000 (0:00:05.781)       0:18:08.612 ******** 
fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}

We never set anything about http_proxy stuff in our inventory host file.
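
(That error generally means a task dereferences a fact that was never set. A minimal sketch of the kind of guard that avoids it; the module arguments and variable names here are hypothetical, not the actual openshift_node task:)

  - name: Configure Proxy Settings
    lineinfile:
      dest: /etc/sysconfig/origin-node    # hypothetical file
      line: "HTTP_PROXY={{ openshift.common.http_proxy }}"
    when: openshift.common.http_proxy is defined    # skip hosts where the fact was never set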

Comment 24 John Barker 2016-10-18 18:32:07 UTC
(In reply to Johnny Liu from comment #22)
> After ansible was updated to ansible-2.2.0.0-0.61.rc1.el7, QE often
> encounter the following issue (but cannot always reproduce it), especially
> when multiple installers are running on one host.
> 
> TASK [openshift_node : Configure Proxy Settings]
> *******************************
> Sunday 09 October 2016  10:14:18 +0000 (0:00:05.781)       0:18:08.612
> ******** 
> fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true,
> "msg": "'dict object' has no attribute 'http_proxy'"}
> fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true,
> "msg": "'dict object' has no attribute 'http_proxy'"}
> 
> We never set anything about http_proxy stuff in our inventory host file.


The above from Johnny Liu seems to be something different from the main bug report (which is about with_fileglob).

Can you please raise a separate bug and include a link to it here so other people can find it.

So we can diagnose that issue, please include:
 * Output of ansible -v
 * Output of "ansible-playbook -vvv ..." of a failing run
 * Full ansible-playbook commandline from a failing run
 * The YAML for the "Configure Proxy Settings" task
 * Link to the full source of playbooks would be useful
 * The last version of Ansible where this worked, and the first version where it failed

Comment 25 John Barker 2016-10-18 18:38:37 UTC
(In reply to Jeremy Eder from comment #18)
> It reproduced @ 100 forks in the past.  Will try to get to this tomorrow or
> Thursday.

Hi,
I'm looking at this after the "OpenShift(3.4)-on-OpenStack(10) Scalability Testing" meeting that was today.


So I can assist (and get up to speed), can you please:

1) Provide me with links to the playbooks for

a) 
> This week I've had a few installs get wedged in the openshift_repos role, specifically in the steps that use with_fileglob. If I comment out those 3 or 4 steps, installs are more likely to complete successfully.

b) 
> I've also seen the install hang at the [openshift_node_certificates : Check status of node certificates] step.
> This one seems most likely to occur when running scaleup.yml with a bunch of
>
> This is using openshift-ansible master from the day I run the test (I make sure I pull fresh from master ~once a day).
>
> Another engineer says his install died twice today, once on the openshift_repos role and then again on a dnsmasq role.

2) Is there an update to
(In reply to Jeremy Eder from comment #18)
> It reproduced @ 100 forks in the past.  Will try to get to this tomorrow or
> Thursday.

Comment 27 Mike Fiedler 2016-10-18 18:59:45 UTC
Using 2.2.0.0-0.61.rc1.el7 and higher, I no longer experience install hangs. Running with forks > 20 does cause failures during node configuration/node cert installation. See: https://bugzilla.redhat.com/show_bug.cgi?id=1382492

Comment 28 John Barker 2016-10-18 19:08:27 UTC
Hi,
I'm looking at this after the "OpenShift(3.4)-on-OpenStack(10) Scalability Testing" meeting that was today; I was one of the Ansible Core representatives.

Can you please raise bugs in https://github.com/ansible/ansible/issues/new
for each of the issues (putting @gundalow so I get notified), then include a link here so others can follow.

If there is any confidential information, such as credentials to access the source, please email that to me jobarker

That may make it a lot easier to follow, and means we can ensure each specific issue gets looked at and resolved; otherwise there will be a lot of chatter on here from 5+ different sets of debugging.

I *think* there are 6 different bug reports in here:

BUG A)
> This week I've (Jeremy) had a few installs get wedged in the openshift_repos role, specifically in the steps that use with_fileglob. If I comment out those 3 or 4 steps, installs are more likely to complete successfully.

BUG B)
> I've (Jeremy) also seen the install hang at the [openshift_node_certificates : Check status of node certificates] step.
> This one seems most likely to occur when running scaleup.yml with a bunch of

BUG C)
> Another engineer says his install died twice today, once on the openshift_repos role 

BUG D)
> ...and then again on a dnsmasq role.


BUG E)
> I've (Scott) seen a lot of general hangs, almost always in the [setup] module.


BUG F)
> TASK [openshift_node : Configure Proxy Settings] *******************************
> Sunday 09 October 2016  10:14:18 +0000 (0:00:05.781)       0:18:08.612 ******** 
> fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
> fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}

Can you please raise individual bugs for the above, plus any I've forgotten, in the public GitHub:
https://github.com/ansible/ansible/issues/new

Please @gundalow in each of them so I'll see them.

Can they please include:
 * Output of ansible -v
 * Output of "ansible-playbook -vvv ..." of a failing run
 * Full ansible-playbook commandline from a failing run
 * The YAML for the "Configure Proxy Settings" task
 * Link to the full source of playbooks would be useful
 * If this is a regression: the last version of Ansible where it worked and the first version where it failed

In *each* of the separate, standalone bug reports.

Thanks in advance,
John Barker

Comment 29 Johnny Liu 2016-10-19 11:54:29 UTC
(In reply to John Barker from comment #28)
> BUG F)
> > TASK [openshift_node : Configure Proxy Settings] *******************************
> > Sunday 09 October 2016  10:14:18 +0000 (0:00:05.781)       0:18:08.612 ******** 
> > fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
> > fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}

Filed new bug BZ#1386654 to track this issue.

Comment 30 Mike Fiedler 2016-10-19 12:50:00 UTC
Marking this bz VERIFIED; no longer seeing the hangs reported originally, even with forks=100. Follow-up issues are tracked by the bugs linked in comment 29 and comment 27.

Comment 34 errata-xmlrpc 2016-10-27 16:12:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2122

