ansible-2.2.0-0.5 RPM
openshift 3.3.0.18 puddle

This week I've had a few installs get wedged in the openshift_repos role, specifically in the steps that use with_fileglob. If I comment out those 3 or 4 steps, installs are more likely to complete successfully.

I've also seen the install hang at the [openshift_node_certificates : Check status of node certificates] step. This one seems most likely to occur when running scaleup.yml with a bunch of

This is using openshift-ansible master from the day I run the test (I make sure I pull fresh from master ~once a day). Another engineer says his install died twice today, once on the openshift_repos role and then again on a dnsmasq role.
I've seen a lot of general hangs, almost always in the [setup] module. According to folks on the #ansible Slack channel this is often due to persistent connection issues, and they suggested using the paramiko connection plugin. I've tried that and I don't think it's improved the situation. This is also the cause of frequent hangs in CI jobs. The guidance on #ansible was to try to debug whether an ssh connection was open. Perhaps I don't know how to tell whether a persistent control connection is open or not, but I didn't see any active ssh sessions on the remote host. We need to figure out how to debug that, I think. That said, the stalls I've seen have always been in the setup module; if you're seeing it elsewhere it may be a different problem.
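For anyone else trying to debug this, one way to check for open persistent control connections is to look for ControlPersist sockets on the controller and ask ssh whether a master is still behind each one. A rough sketch, assuming the default control_path_dir of ~/.ansible/cp (adjust if your ansible.cfg overrides it):

```shell
# Sketch: look for Ansible's SSH ControlPersist sockets and check whether
# the master connection behind each one is still alive.
# ~/.ansible/cp is the default control_path_dir; adjust for your config.
CP_DIR="${HOME}/.ansible/cp"

# Any control sockets still lying around?
find "$CP_DIR" -type s 2>/dev/null || true

# Ask ssh about each one; prints "Master running (pid=...)" when alive.
# The trailing hostname is required syntactically but ignored with -S.
for sock in "$CP_DIR"/*; do
    [ -S "$sock" ] || continue
    ssh -O check -S "$sock" placeholder-host 2>&1 || true
done
```

On the controller side, `ps aux | grep '[s]sh.*ControlPersist'` will also show any ssh master processes Ansible has spawned.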
Scott, for the setup hang, can you use gather_subset per the docs here http://docs.ansible.com/ansible/setup_module.html to try to isolate which specific set of facts may be causing problems? I know some disk checks can cause hangs if they're in a bad state.
(In reply to James C. from comment #5)
> Scott, for the setup hang, can you use gather_subset per the docs here
> http://docs.ansible.com/ansible/setup_module.html to try and isolate it down
> to a specific set of facts which may be causing problems? I know some disk
> checks can cause hangs if they're in a bad state.

I'll add gather_subset = !hardware to my config and see if we can still run our playbooks. I don't think we depend on any hardware facts. We're not calling setup explicitly. If I have a playbook run that's hung, what's the best way to figure out what it's doing?
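For reference, the config change described above would look something like this in ansible.cfg (a sketch; gather_subset belongs in the [defaults] section, and !hardware skips the hardware facts that include the disk checks James mentioned):

```ini
# ansible.cfg -- skip hardware fact gathering, which includes the disk
# checks that can hang when a mount is in a bad state.
[defaults]
gather_subset = !hardware
```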
I believe the stalling issues I've been seeing are different from those that Mike and Jeremy have been seeing. Theirs are consistently in the openshift_repos role, which makes use of with_fileglob, and that's actually one of the cherry-picks we did. See https://github.com/ansible/ansible/issues/16801

I've put in a PR to disable these tasks because they shouldn't be necessary for OSE, where the repos should be managed via RHSM. This code path is really only useful if one were to do an Origin install and then attempt an OSE install, I think. https://github.com/openshift/openshift-ansible/pull/2325

We've also enabled ssh debug logs on their hosts in hopes of catching any other problems with ControlPersist.

Regarding my stalled setup tasks, I'm trying to get a somewhat reliable reproducer with all facts being gathered, and then I'll try to limit fact gathering as suggested in comment #5.
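For context, the tasks being disabled are with_fileglob loops of roughly this shape (a hypothetical sketch, not the exact openshift_repos code; the task name, module, and paths are illustrative only):

```yaml
# Hypothetical sketch of a with_fileglob task like those in the
# openshift_repos role; the fileglob loop is what hangs under the
# upstream issue linked above.
- name: Remove any repo files previously copied in
  file:
    path: "/etc/yum.repos.d/{{ item | basename }}"
    state: absent
  with_fileglob:
    - "*.repo"
```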
(In reply to Mike Fiedler from comment #3)
> I hit a hang today scaling up from 300 -> 500 nodes.
>
> 2016-08-17 11:53:06,627 p=9838 u=root | RUNNING HANDLER [openshift_node_dnsmasq : restart NetworkManager] **************
> 2016-08-17 11:53:06,627 p=9838 u=root | Wednesday 17 August 2016 11:53:06 -0400 (0:00:10.812) 0:24:52.188 ******
> 2016-08-17 12:53:20,840 p=9838 u=root | [ERROR]: User interrupted execution
>
> Ctrl-C and restart got past it.

I've just now hit this as well -- for me it occurred when I tried to run scaleup against 150 new nodes. Hadn't hit it until that point. I'm now just continuing forward in 85-node groups up to our target of 1000. The "mix" that's currently working for me is:

ansible-2.2.0-0.2.pi.el7.noarch
openshift-ansible 522cccbc7fd119a182a44af8fb2c0959d919a093
commented out fileglob entries in roles/openshift_repos/
forks=100
Our aim is to improve the stability of the installer by shipping Ansible 2.2 release candidates or final builds.
Please test with these two builds in the latest 3.3 puddles; we've never been able to reliably reproduce these problems, so it'd be great to get as much soak time as possible to see if the situation has improved.

ansible-2.2.0.0-0.61.rc1.el7
openshift-ansible-3.3.30-1.git.0.b260e04.el7
Are there any known limitations on the number of forks? Previously we set it to ~100-200.
There should not be. But for the sake of an apples-to-apples comparison, can we test with whatever fork count was most widely used when you were hitting the issue before?
It reproduced @ 100 forks in the past. Will try to get to this tomorrow or Thursday.
The latest Ansible available in the AOS repo is 2.2.0-0.50.prerelease.el7

Can we get the version mentioned in comment 15 pushed to https://mirror.openshift.com/enterprise/enterprise-3.4/latest/RH7-RHAOS-3.4/x86_64/os/ and https://mirror.openshift.com/enterprise/enterprise-3.3/latest/RH7-RHAOS-3.3/x86_64/os/ ?
(In reply to Mike Fiedler from comment #19)
> The latest Ansible available in the AOS repo is 2.2.0-0.50.prerelease.el7
>
> Can we get the version mentioned in comment 15 pushed to
> https://mirror.openshift.com/enterprise/enterprise-3.4/latest/RH7-RHAOS-3.4/x86_64/os/ and
> https://mirror.openshift.com/enterprise/enterprise-3.3/latest/RH7-RHAOS-3.3/x86_64/os/ ?

Mike, ansible-2.2.0.0-0.61.rc1.el7 should now be in the 3.3 and 3.4 repos on the mirror. Sorry, I forgot to push them there.
After Ansible was updated to ansible-2.2.0.0-0.61.rc1.el7, QE have often encountered the following issue (though we can't always reproduce it), especially when multiple installers are running on one host.

TASK [openshift_node : Configure Proxy Settings] *******************************
Sunday 09 October 2016 10:14:18 +0000 (0:00:05.781) 0:18:08.612 ********
fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}

We never set anything about http_proxy in our inventory host file.
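Not a root-cause analysis, but errors of the form "'dict object' has no attribute 'http_proxy'" generally come from dereferencing an undefined key on a dict in a template or task. A hedged sketch of the usual guard pattern (the variable names, file path, and task shape here are hypothetical, not the actual openshift_node task):

```yaml
# Hypothetical sketch: guard the dereference so an unset proxy key does
# not fail with "'dict object' has no attribute 'http_proxy'".
- name: Configure proxy settings
  lineinfile:
    dest: /etc/sysconfig/example-service
    line: "HTTP_PROXY={{ proxy_config.http_proxy | default('') }}"
  when: "'http_proxy' in proxy_config"
```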
(In reply to Johnny Liu from comment #22)
> After ansible is updated to ansible-2.2.0.0-0.61.rc1.el7, recently QE often
> encounter the following issue (but not always reproduce it), especially when
> there are multiple installers are running on one host.
>
> TASK [openshift_node : Configure Proxy Settings] *******************************
> Sunday 09 October 2016 10:14:18 +0000 (0:00:05.781) 0:18:08.612 ********
> fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
> fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
>
> We never set anything about http_proxy stuff in our inventory host file.

The above from Johnny Liu seems to be something different from the main bug report (which is about with_fileglob). Can you please raise a separate bug and include a link to that bug here so other people can find it.

So we can diagnose that issue, please include:
* Output of ansible --version
* Output of "ansible-playbook -vvv ..." of a failing run
* The full ansible-playbook command line from a failing run
* The YAML for the "Configure Proxy Settings" task
* A link to the full source of the playbooks
* If this is a regression: what version of Ansible did this last work on, and what version did it first fail on
(In reply to Jeremy Eder from comment #18)
> It reproduced @ 100 forks in the past. Will try to get to this tomorrow or
> Thursday.

Hi, I'm looking at this after the "OpenShift(3.4)-on-OpenStack(10) Scalability Testing" meeting that was today. So I can assist (and get up to speed), can you please:

1) Provide me with a link to the playbooks for

a) > This week I've had a few installs get wedged in the openshift_repos role specifically in the steps that use with_fileglob. If I comment out those 3 or 4 steps, installs more likely to complete successfully.

b) > I've also seen the install hang at the [openshift_node_certificates : Check status of node certificates] step. This one seems most likely to occur when running scaleup.yml with a bunch of This is using openshift-ansible master from the day I run the test (I make sure I pull fresh from master ~once a day). Another engineer says his install died twice today, once on the openshift_repos role and then again on a dnsmasq role.

2) Is there an update to (In reply to Jeremy Eder from comment #18)
> It reproduced @ 100 forks in the past. Will try to get to this tomorrow or
> Thursday.
Using 2.2.0.0-0.61.rc1.el7 and higher, I no longer experience install hangs. Running with forks > 20 does cause failures during node configuration/node cert installation. See: https://bugzilla.redhat.com/show_bug.cgi?id=1382492
Hi, I'm looking at this after the "OpenShift(3.4)-on-OpenStack(10) Scalability Testing" meeting that was today; I was one of the Ansible Core representatives.

Can you please raise bugs in https://github.com/ansible/ansible/issues/new for each of the issues (mentioning @gundalow so I get notified), then include a link here so others can follow. If there is any confidential information, such as credentials to access the source, please email that to me jobarker. That way we can ensure each specific issue gets looked at and resolved; there will be a lot of chatter on here otherwise from 5+ different sets of debugging.

I *think* there are 6 different bug reports in here:

BUG A) > This week I've (Jeremy) had a few installs get wedged in the openshift_repos role specifically in the steps that use with_fileglob. If I comment out those 3 or 4 steps, installs more likely to complete successfully.

BUG B) > I've (Jeremy) also seen the install hang at the [openshift_node_certificates : Check status of node certificates] step. This one seems most likely to occur when running scaleup.yml with a bunch of

BUG C) > Another engineer says his install died twice today, once on the openshift_repos role

BUG D) > ...and then again on a dnsmasq role.

BUG E) > I've (Scott) seen a lot of general hangs, almost always in the [setup] module.

BUG F) > TASK [openshift_node : Configure Proxy Settings] *******************************
> Sunday 09 October 2016 10:14:18 +0000 (0:00:05.781) 0:18:08.612 ********
> fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
> fatal: [openshift-123.lab.sjc.redhat.com]: FAILED!
=> {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"} Can you please raise individual bugs for the above, plus any I've forgotten in the public GutHub https://github.com/ansible/ansible/issues/new Please @gundalow in each of them so I'll see them Can they please include: * Output of ansible -v * Output of "ansible-playbook -vvv ..." of a failing run * Full ansible-playbook commandline from a failing run * The YAML for the "Configure Proxy Settings" task * Link to the full source of playbooks would be useful * If this is a regression - what version of Ansible did this used to work, what version did it first fail on In *each* of the separate, stand alone bugs reports. Thanks in advance, John Barker
(In reply to John Barker from comment #28)
> BUG F)
> > TASK [openshift_node : Configure Proxy Settings] *******************************
> > Sunday 09 October 2016 10:14:18 +0000 (0:00:05.781) 0:18:08.612 ********
> > fatal: [openshift-132.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}
> > fatal: [openshift-123.lab.sjc.redhat.com]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute 'http_proxy'"}

Filed new bug - BZ#1386654 to track this issue.
Marking this bz VERIFIED - no longer seeing the hangs reported originally, even with forks=100. Follow-up issues are tracked by the bugs linked in comment 29 and comment 27.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2122