Bug 1122297

Summary: oo-install can hang in response to network problems
Product: OpenShift Container Platform
Component: Installer
Version: 2.1.0
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Reporter: Luke Meyer <lmeyer>
Assignee: Brenton Leanhardt <bleanhar>
CC: bleanhar, erich, jdetiber, jialiu, jokerman, libra-bugs, libra-onpremise-devel, lmeyer, misalunk, mmasters, mmccomas, yanpzhan
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-04-26 11:44:24 UTC
Attachments:
  log-for-broker
  log-for-node

Description Luke Meyer 2014-07-22 21:07:47 UTC
Description of problem:
It seems we got a little ahead of ourselves when setting RESTART_NEEDED. https://github.com/openshift/openshift-extras/blob/853a2092ff76f1a2cfc364c12c1788a0a6ebabe5/enterprise/install-scripts/openshift.ks#L1171
https://github.com/openshift/openshift-extras/blob/853a2092ff76f1a2cfc364c12c1788a0a6ebabe5/enterprise/install-scripts/openshift.ks#L1206

Since oo-install runs the RPM install separately from configuration actions, setting this causes a restart_services run at the end of the RPM install, before any configuration of the services has occurred. This doesn't make much sense and causes oo-install to hang.
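For illustration only, a minimal sketch of the pattern (function bodies and control flow here are stand-ins, not an excerpt from openshift.sh): the flag gets set while installing RPMs, and the end-of-script logic then honors it even though nothing has been configured in that invocation.

#!/bin/bash
# Hypothetical sketch of the problematic pattern; not the actual openshift.sh code.
restart_services() { echo "OpenShift: Begin restarting services."; }

RESTART_NEEDED=false

install_rpms() {
  # ... yum installs happen here ...
  RESTART_NEEDED=true   # flag set during the RPM install phase
  echo "OpenShift: Completed installing RPMs."
}

install_rpms

# oo-install invokes the script once for the "install" step and again later for
# "configure", so honoring the flag here restarts services before any service
# has actually been configured in this run.
if [ "$RESTART_NEEDED" = true ]; then
  restart_services
fi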

How reproducible:
Both I and at least one customer have encountered this. It may or may not actually hang every time, but it does the spurious restart_services run every time.

Steps to Reproduce:
1. Run oo-install OSE 2.1 on the host that will be the broker
2. Also configure oo-install with another host to be a node to install in parallel

Actual results in /tmp/openshift-deploy.log:
OpenShift: Completed installing RPMs.
+ [[ sh = ks ]]
+ true
+ false
+ restart_services
+ echo 'OpenShift: Begin restarting services.'

(and possible hang)

Expected results:
Should not restart services at this point, should not hang.

Comment 2 Luke Meyer 2014-07-22 21:18:52 UTC
I don't know if it's relevant, but perhaps: I actually encountered this trying to set up a larger test deployment with 9 node hosts. Although the service restarts are unnecessary, it's not clear to me why they would cause oo-install to hang. That just shouldn't be possible; at worst oo-install should just fail. May need more testing and a separate bug to see if there's something about the particular way I was doing this.

Comment 4 Luke Meyer 2014-07-22 21:45:07 UTC
I ran it again with the 1 broker+1 node scenario and it still hung. So I think it's pretty consistent.

Comment 5 Luke Meyer 2014-07-23 03:43:49 UTC
Also, it ran fine with the 9-node scenario with the fix.

Comment 6 Yanping Zhang 2014-07-23 12:12:29 UTC
Used oo-install to install a 1 broker + 1 node env with https://github.com/openshift/openshift-extras/pull/420.
No matter whether the broker and node were installed in parallel or the broker was installed separately, it hung at the steps below:
OpenShift: Begin configuring host.
OpenShift: Completed configuring host.
OpenShift: Begin configuring OpenShift.
OpenShift: Waiting for MongoDB to start (04:55:28)...
OpenShift: MongoDB is ready! (04:55:38)
OpenShift: Completed configuring OpenShift.
OpenShift: Begin restarting services.
OpenShift: Completed restarting services.
OpenShift: Begin restarting services.

The /tmp/openshift-deploy.log ends with:
OpenShift: Completed restarting services.
+ [[ sh = ks ]]
+ true
+ false
+ restart_services
+ echo 'OpenShift: Begin restarting services.'
OpenShift: Begin restarting services.
+ service iptables restart
iptables: Setting chains to policy ACCEPT: filter [  OK  ]
iptables: Flushing firewall rules: [  OK  ]
iptables: Unloading modules: [  OK  ]
iptables: Applying firewall rules: [  OK  ]
+ named
+ :
+ service named restart
Stopping named: [  OK  ]
Starting named: [  OK  ]
+ node
+ false
+ node
+ false
+ service network restart
Shutting down interface eth0:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:
Determining IP information for eth0... done.
[  OK  ]
+ node
+ false
+ broker
+ :
+ service sshd restart
Stopping sshd: [  OK  ]
Starting sshd: [  OK  ]
+ service ntpd restart
Shutting down ntpd: [  OK  ]
Starting ntpd: [  OK  ]
+ node
+ false
+ node
+ false
+ activemq
+ :
+ service activemq restart
Stopping ActiveMQ Broker...
Stopped ActiveMQ Broker.
Starting ActiveMQ Broker...
+ node
+ false
+ node
+ false
+ broker
+ :
+ service httpd restart
Stopping httpd: [  OK  ]
Starting httpd: [  OK  ]
+ broker
+ :
+ service openshift-broker restart
Stopping openshift-broker: [  OK  ]
Starting openshift-broker: httpd: Could not reliably determine the server's fully qualified domain name, using broker.example.com for ServerName
[  OK  ]
+ broker
+ :
+ service openshift-console restart
Stopping openshift-console: [  OK  ]
Starting openshift-console: httpd: Could not reliably determine the server's fully qualified domain name, using broker.example.com for ServerName
[  OK  ]
+ node
+ false
+ node
+ false
+ node
+ false
+ node
+ false
+ node
+ false
+ echo 'OpenShift: Completed restarting services.'
OpenShift: Completed restarting services.
+ true
+ display_passwords
+ set +x
ActiveMQ admin password: 8aGVCOZ4zEfQ9IvbljdcPWJFg4
MongoDB admin user: admin password: wiLxPstnq9wmPB7rNrrSABuDVIs
ActiveMQ amq user user: admin password: rvp4KoOgdYJ1RL3zRljA7dFXMJY
routing plugin user: routinginfo pass: jNXLE9DT2OGLPey6iyYNH2x4P4
MongoDB key user: routinginfo: DyaNt2ZaVqzjinTZY1mT9h82gR8
MongoDB broker user: openshift password: MPOgXDum06FcWLnBwADWpQ
OpenShift user1: demo password1: VacRANiOjLXopD8CvBh6w
MCollective user: mcollective password: 0XNSg1qd6exdOfXPwstv3Q

Comment 7 Luke Meyer 2014-08-11 17:13:44 UTC
It wasn't clear to me how you tested the updated openshift.sh in oo-install. It doesn't seem like it got the change. Just so we can be clear, I've packaged it for a test environment at:

https://oo-lmeyer.rhcloud.com/ose-2.1

Can you please run oo-install from there?

Comment 9 Johnny Liu 2014-08-12 08:01:14 UTC
(In reply to Luke Meyer from comment #7)
> It wasn't clear to me how you tested the updated openshift.sh in oo-install.
> It doesn't seem like it got the change. Just so we can be clear, I've

QE noticed that the fix is in openshift.sh, but the oo-install tool does not include that fix, so we used the OO_INSTALL_KEEP_ASSETS variable to keep the temporary oo-install tarball on the local host, then replaced openshift.sh with the updated one and ran the oo-install tool to verify this bug.

> packaged it for a test environment at:
> 
> https://oo-lmeyer.rhcloud.com/ose-2.1
> 
> Can you please run oo-install from there?

Comment 10 Yanping Zhang 2014-08-12 09:10:14 UTC
Last time, in Comment 6, I set export OO_INSTALL_KEEP_ASSETS=true and ran "sh <(curl -s https://install.openshift.com/ose/)". When the install script files had been downloaded, I exited, replaced the openshift.sh in the /tmp dir and in the tgz archive with the one from pull 420, then reran "sh <(curl -s https://install.openshift.com/ose/)" to continue the remaining steps. During installation I checked that openshift.sh was the one we expected to use. I wonder why the results were not as expected.

This time I tested with https://oo-lmeyer.rhcloud.com/ose-2.1 and found it worked when installing an all-in-one env, but still had the same issue when installing a broker and a node on two instances.
Steps for reference:
1. Set up two instances. On the instance intended to be the broker, run:
sh <(curl -s https://oo-lmeyer.rhcloud.com/ose)
   Configure the other instance as a node.
2. Set up a new instance. Run sh <(curl -s https://oo-lmeyer.rhcloud.com/ose) and configure it as all-in-one.

Actual results:
1. Two instances: same status as the last test in Comment 6, hanging on "Begin restarting services." as below:
  You can watch the full log with:
  ssh root.10.53 'tail -f /tmp/openshift-deploy.log'
OpenShift: Begin configuring host.
OpenShift: Completed configuring host.
OpenShift: Begin configuring OpenShift.
OpenShift: Waiting for MongoDB to start (01:49:49)...
OpenShift: MongoDB is ready! (01:50:16)
OpenShift: Completed configuring OpenShift.
OpenShift: Begin restarting services.
OpenShift: Completed restarting services.
OpenShift: Begin restarting services.

 
2. One instance: the installation process finished without hanging. End of output:
All tasks completed.
oo-install exited; removing temporary assets.
On this all-in-one env, I created an app with jenkins and mysql-5.1, changed something, and git pushed successfully.

Comment 11 Miciah Dashiel Butler Masters 2014-08-12 09:40:32 UTC
In Comment 6 and Comment 10, it looks like the broker is fine in both cases.  Can you get the openshift-deploy.log file from the node that is hanging?

I meant to mention: I tried to reproduce this bug a couple of weeks ago, with at least two test installations, each with 1 broker and 9 nodes, as well as tests of smaller deployments, but I could not reproduce the bug.

Here are three things to keep in mind.  First, oo-install interleaves output related to the different hosts, so this output can be a little confusing.

Second, each host can take a long time to install and a long time to configure, especially with a slow network or high disk contention.  Yum can take a while to download files, SELinux configuration can be slow and appear to hang for at least a few minutes, and I noticed that quotacheck can take a long time during high I/O contention (e.g., I saw it take 35 minutes on a 5.7 GiB partition), so a single node can easily take an hour or two.

Third, because of the design of oo-install, the configuration step is run on each host in serial (although this might be changing with core changes related to HA), so an installation with 9 nodes could easily take 9-18 hours to install using oo-install.

However, I wouldn't expect oo-install's output to stop for more than an hour or so; as it finishes configuring each host, it should print some output.

How long did you leave oo-install running without getting any output?

Can we get the output of openshift.sh (which is logged to /tmp/openshift-deploy.log on each target host) to see where it is hanging?

Comment 12 Yanping Zhang 2014-08-12 10:08:39 UTC
Created attachment 926002 [details]
log-for-broker

Comment 13 Yanping Zhang 2014-08-12 10:09:17 UTC
Created attachment 926003 [details]
log-for-node

Comment 14 Yanping Zhang 2014-08-12 10:28:23 UTC
Above are logs from the 1 broker + 1 node env.
From the broker log, we can confirm that the installation on the broker did in fact complete. Maybe we could focus on the log from the node, as that log indicates the installation did not complete.
My oo-install run is still hanging there without any output, more than 1.5 hours since it began to hang, and more than 2 hours since I began installing the env.

(In reply to Miciah Dashiel Butler Masters from comment #11)
> In Comment 6 and Comment 10, it looks like the broker is fine in both cases.
> Can you get the openshift-deploy.log file from the node that is hanging?
> 
> I meant to mention: I tried to reproduce this bug a couple of weeks ago,
> with at least two test installations, each with 1 broker and 9 nodes, as
> well as tests of smaller deployments, but I could not reproduce the bug.
> 
> Here are three things to keep in mind.  First, oo-install interleaves output
> related to the different hosts, so this output can be a little confusing.
> 
> Second, each host can take a long time to install and a long time to
> configure, especially with a slow network or high disk contention.  Yum can
> take a while to download files, SELinux configuration can be slow and appear
> to hang for at least a few minutes, and I noticed that quotacheck can take a
> long time during high I/O contention (e.g., I saw it take 35 minutes on a
> 5.7 GiB partition), so a single node can easily take an hour or two.
> 
> Third, because of the design of oo-install, the configuration step is run on
> each host in serial (although this might be changing with core changes
> related to HA), so an installation with 9 nodes could easily take 9-18 hours
> to install using oo-install.
> 
> However, I wouldn't expect oo-install's output to stop for more than an hour
> or so; as it finishes configuring each host, it should print some output.
> 
> How long did you leave oo-install running without getting any output?
> 
> Can we get the output of openshift.sh (which is logged to
> /tmp/openshift-deploy.log on each target host) to see where it is hanging?

Comment 15 Luke Meyer 2014-08-12 12:43:09 UTC
I wanted to note that even with OO_INSTALL_KEEP_ASSETS in place, the "assets" kept are actually the zip file (well, now it's a tarball). The tarball is actually unpacked and overwrites changes you make in the directory with each run. This makes testing a pain for sure; you'd have to re-create the tarball from the changed directory to test this way. The other way is to run oo-install from source, which is also pretty painful on RHEL at least. Probably we should just use an oo-install test site each time.
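For reference, a rough sketch of testing a patched openshift.sh this way; the unpack directory and archive name below are assumptions and will likely differ on a real run:

# Hypothetical example; adjust paths/names to whatever OO_INSTALL_KEEP_ASSETS left behind.
export OO_INSTALL_KEEP_ASSETS=true
cd /tmp/oo-install                       # assumed unpack directory of the kept assets
cp ~/openshift.sh.patched openshift.sh   # drop in the patched deployment script
tar czf ../oo-install.tgz .              # re-create the tarball from the changed directory
# then re-run the oo-install bootstrap so it unpacks the rebuilt archive
# instead of overwriting your edits with the original one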

(In reply to Miciah Dashiel Butler Masters from comment #11)

> Third, because of the design of oo-install, the configuration step is run on
> each host in serial (although this might be changing with core changes
> related to HA), so an installation with 9 nodes could easily take 9-18 hours
> to install using oo-install.

Hey now, let's not get crazy... it *really* should not take that long. The longest part of the install is definitely the RPM install, and that happens in parallel. The configure step is serial as you say, the slowest part being selinux changes, but even so it shouldn't be more than ~5min/node.

Comment 16 Luke Meyer 2014-08-12 12:53:17 UTC
(In reply to Yanping Zhang from comment #10)

> 1.Two instances,it had same status with last test in Comment6, hang on
> "Begin restarting services." as below:
>   You can watch the full log with:
>   ssh root.10.53 'tail -f /tmp/openshift-deploy.log'
> OpenShift: Begin configuring host.
> OpenShift: Completed configuring host.
> OpenShift: Begin configuring OpenShift.
> OpenShift: Waiting for MongoDB to start (01:49:49)...
> OpenShift: MongoDB is ready! (01:50:16)
> OpenShift: Completed configuring OpenShift.
> OpenShift: Begin restarting services.
> OpenShift: Completed restarting services.
> OpenShift: Begin restarting services.

While output from previous steps may be interleaved, the configure steps being run in serial will have comprehensible output. This pretty clearly indicates that the broker host is being configured and has services restarted (which is all proper). The log clearly indicates the node is never getting the configure step.

So, I need to figure out why the broker restart services is causing oo-install to hang. It did not do this previously. Possibly sshd is being restarted and that's leading to a dead connection? I would expect a reasonable failure in that case, but maybe it's not.

Comment 17 Miciah Dashiel Butler Masters 2014-08-12 16:15:49 UTC
quotacheck is run as part of the configuration, so that part does go in serial.  If quotacheck takes 35 minutes (as I have sometimes seen it do, multiple times), and SELinux configuration takes 5 minutes, that's more than 40 minutes per node, in serial.

The node log that Yanping Zhang attached shows that openshift.sh finished running.  Luke, your idea about sshd sounds reasonable; maybe we need to make sure the node host is still accepting SSH connections, and possibly tweak the way we use SSH in oo-install.

Comment 18 Luke Meyer 2014-08-12 19:53:47 UTC
My test run today with a separate broker and node was successful. So I have nothing to debug yet.

(In reply to Miciah Dashiel Butler Masters from comment #17)
> quotacheck is run as part of the configuration, so that part does go in
> serial.  If quotacheck takes 35 minutes (as I have sometimes seen it do,

OK - I've never seen that, but I believe you. I'm always working with fairly small file systems. Does the delay correlate to the amount of data or inodes on the fs?


(In reply to Yanping Zhang from comment #14)
> Above are logs from 1 broker+1node env.
> From the broker log, we can confirm the installation on broker is completed
> in fact.Maybe could focus on log from the node, as the log indecates the
> installation not completed.

Actually it seems to show that the configure part of the install never began. So oo-install got stuck somewhere.

Where did you run oo-install itself? On the broker, the node, or externally?

Comment 19 Jason DeTiberus 2014-08-12 20:24:44 UTC
(In reply to Luke Meyer from comment #18)
> (In reply to Miciah Dashiel Butler Masters from comment #17)
> > quotacheck is run as part of the configuration, so that part does go in
> > serial.  If quotacheck takes 35 minutes (as I have sometimes seen it do,
> 
> OK - I've never seen that, but I believe you. I'm always working with fairly
> small file systems. Does the delay correlate to the amount of data or inodes
> on the fs?

I've mostly seen this when there are I/O contention issues in OS1.  Granted I would probably not see it as often if I wasn't spinning up 10 hosts at a time.

Comment 20 Yanping Zhang 2014-08-13 01:46:47 UTC
I guess there may be a connection issue between the host running oo-install and the external nodes. On my 1 broker + 1 node env, I ran oo-install itself on the broker, so I guess the broker connects fine with the oo-install host, but the node has a problem after some steps.
(In reply to Luke Meyer from comment #18)
> My test run today with a separate broker and node was successful. So I have
> nothing to debug yet.
> 
> (In reply to Miciah Dashiel Butler Masters from comment #17)
> > quotacheck is run as part of the configuration, so that part does go in
> > serial.  If quotacheck takes 35 minutes (as I have sometimes seen it do,
> 
> OK - I've never seen that, but I believe you. I'm always working with fairly
> small file systems. Does the delay correlate to the amount of data or inodes
> on the fs?
> 
> 
> (In reply to Yanping Zhang from comment #14)
> > Above are logs from 1 broker+1node env.
> > From the broker log, we can confirm the installation on broker is completed
> > in fact.Maybe could focus on log from the node, as the log indecates the
> > installation not completed.
> 
> Actually it seems to show that the configure part of the install never
> began. So oo-install got stuck somewhere.
> 
> Where did you run oo-install itself? On the broker, the node, or externally?

Comment 21 Yanping Zhang 2014-08-13 01:55:43 UTC
Additional info: for the successfully installed all-in-one env in comment 10, I also ran oo-install itself on the same host as the all-in-one installation.

Comment 22 Yanping Zhang 2014-08-13 07:51:17 UTC
Today I tested installing a 1 broker + 1 node env, running oo-install on a separate external host. This time it succeeded, though it took a long time. oo-install completed normally and the 1 broker + 1 node env was fine for performing operations on apps.

Comment 23 Luke Meyer 2014-08-14 15:34:36 UTC
I ran another 1 broker + 1 node install, this time from the broker. It ran fine (although I did observe a sizeable delay on the quotacheck this time). I'm at a loss for how to reproduce this reliably, especially with how long it takes to iterate.

Would it make sense to merge and package the existing PR, as it seems to at least improve the situation, and keep this bug open to investigate whether there is some kind of timing condition or something that we still need to fix?

Comment 24 Johnny Liu 2014-08-15 01:54:55 UTC
(In reply to Luke Meyer from comment #23)
> I ran another 1 broker + 1 node install, this time from the broker. It ran
> fine (although I did observe a sizeable delay on the quotacheck this time).
> I'm at a loss for how to reproduce this reliably, especially with how long
> it takes to iterate.
> 
> Would it make sense to merge and package the existing PR, as it seems to at
> least improve the situation, and keep this bug open to investigate whether
> there is some kind of timing condition or something that we still need to
> fix?

Agree.

Comment 25 Johnny Liu 2014-08-15 11:34:05 UTC
Adding some more info; hope it helps your debugging.
I could also reproduce this issue, the same as comment 10. I started two instances, one as the broker and the other as the node, and ran oo-install on the broker using RHN.

Broker:
# sh <(curl -s https://oo-lmeyer.rhcloud.com/ose)
<--snip-->
Install step 'install' succeeded for:
  * 10.3.8.166
  * 10.3.8.172
Running 'configure' step on 10.3.8.166 (broker.hosts.example.com)
Copying deployment scripts to target 10.3.8.166.
Executing deployment script on 10.3.8.166 (broker.hosts.example.com).
  You can watch the full log with:
  ssh root.8.166 'tail -f /tmp/openshift-deploy.log'
OpenShift: Begin configuring host.
OpenShift: Completed configuring host.
OpenShift: Begin configuring OpenShift.
OpenShift: Waiting for MongoDB to start (04:12:42)...
OpenShift: MongoDB is ready! (04:12:57)
OpenShift: Completed configuring OpenShift.
OpenShift: Begin restarting services.
OpenShift: Completed restarting services.
OpenShift: Begin restarting services.
<hang here>

The RPM package installation finished; it just hangs at the service restart.

# tailf /tmp/openshift-deploy.log 
+ set +x
ActiveMQ admin password: BFzFn3A64vJrjka3atpTJoNdM
MongoDB admin user: admin password: 8JwWtR0TctmEyF3DVI4PzJBYqBo
ActiveMQ amq user user: admin password: KO3vGeJHPhh0jccy1lK2wGoPAjc
routing plugin user: routinginfo pass: kuMqXGetc7XRmsdcirKzjsasJM
MongoDB key user: routinginfo: Fk2eHFTvFtWzpPZf4xqgmfp77M
MongoDB broker user: openshift password: dbpwd
OpenShift user1: jialiu password1: redhat
MCollective user: mcollective password: mcopwd
+ :

As seen from the log, the service restart finished successfully on the broker.

Node:
# tailf /tmp/openshift-deploy.log
+ remove_abrt_addon_python
+ grep 'Enterprise Linux Server release 6.4' /etc/redhat-release
+ broker
+ false
+ echo 'OpenShift: Completed installing RPMs.'
OpenShift: Completed installing RPMs.
+ [[ sh = ks ]]
+ false
+ false
+ :

As seen from the log, the installation on the node hangs there.

Comment 26 Luke Meyer 2014-08-15 19:59:43 UTC
I appreciate your patience, folks; after trying many different combinations of users and locations, I noticed in the above output that you're using the public IPs for both hosts (instead of "localhost" and internal IPs as I do). That should be perfectly valid, but it does indeed appear to hang when using those on our OpenStack, so at least I can reproduce it. The workaround would appear to be "don't do that", as nothing else I tried failed this way.

I don't know how I managed to miss it before, but it looks like the services are being restarted twice. That is strange but still shouldn't break anything. Unless... maybe restarting sshd twice back to back loses the connection... but only when it's via OpenStack public IP...

BTW, this has been deployed to install.openshift.com, no more need for oo-lmeyer.

Comment 27 Luke Meyer 2014-08-19 18:59:09 UTC
Having researched Net::SSH a bit further, I've come to this conclusion: it doesn't handle network outages well (and by that I mean it hangs).

This is a general problem - if you experience a network split, if the remote host gets its power cord yanked, if the routing table gets messed up, if a firewall times out your connection - whatever the reason, if the client doesn't get some kind of notice that the connection has been dropped, Net::SSH just plain hangs.

Ref.:
https://github.com/net-ssh/net-ssh/issues/105
https://github.com/net-ssh/net-ssh/pull/108 (notes some workarounds)


Simplest workaround:
--------------------

The specific manifestation we are seeing internally appears to be this: if, when installing via oo-install on an OpenStack VM, you specify installing the local host using the external (floating) IP, then during the configure step when restart_services is invoked and the network is restarted on the local host, the ssh connection to self (via external IP) is dropped and oo-install hangs at that point (even though the configure step actually worked just fine).

The specific workaround for this is: don't do that. Use 'localhost' instead of the external IP for the "ssh_host" parameter.

If an oo-install installation has already run into this problem (specifically or similar) and you would like to recover, the procedure is (approximately):
1. End the hung script with ctrl-C.
2. Check /tmp/openshift-deploy.log on the host where oo-install hung on the configure step. The end of the output should indicate whether the step succeeded or not. If it didn't succeed, you are somewhat out of luck - start over, unless you really want to tinker and troubleshoot and know what you are doing.
3. Edit the oo-install YAML config file - this stores the state of each host, but since the script didn't end cleanly, has not been correctly updated. 
4. Set all of the host states to "installed" except for the one(s) which finished the configure step; set this to "complete".
5. Run oo-install again - it should finish the remaining hosts and steps.
Obviously, this is not ideal.

Generalized solution:
---------------------

In general, this is not just a problem with some kind of OpenStack oddity, although that's the only case I've personally encountered. oo-install would ideally be robust enough to detect and react to network problems appropriately (some of the cases customers are seeing might be something entirely different). Ideally, oo-install would detect when (as in the case above) the step actually succeeded and it was just a bobble in ssh access, and be able to just continue. Otherwise recovery is manual (see above) and tricky to explain.

A generalized solution would look something like this:

1. Following the recommendation from https://github.com/net-ssh/net-ssh/pull/108, create a parallel monitoring thread in oo-install using the same connection and periodically test (with a Ruby timeout) whether the connection remains active. This allows us to at least detect when the connection has been lost.
2. Modify the deploy wrapper script to maintain a state file which records
  a. The pid of the running install script
  b. When it completes, the script's success or failure
3. Modify oo-install such that, when an established ssh connection is simply lost, it retries the connection for a time and, if successful, checks the state file from (2) to see whether the script is still running or completed successfully, in which case oo-install could continue the installation (otherwise it must report failure). A rough sketch of the wrapper side of (2) follows below.
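For (2), a minimal sketch of what such a deploy wrapper could look like, assuming it invokes openshift.sh from /tmp; the state file path, format, and names are illustrative, not the current oo-install wrapper:

#!/bin/bash
# Hypothetical wrapper sketch for step (2) above; names and layout are illustrative.
STATE_FILE=/tmp/openshift-deploy.state

sh /tmp/openshift.sh "$@" &        # run the real deployment script in the background
pid=$!
echo "pid=$pid" > "$STATE_FILE"    # record the pid so a reconnecting oo-install can find it

wait "$pid"
rc=$?
if [ "$rc" -eq 0 ]; then
  echo "result=success" >> "$STATE_FILE"
else
  echo "result=failure rc=$rc" >> "$STATE_FILE"
fi
# After an ssh reconnect, oo-install could read this file to decide whether the
# step is still running, finished successfully, or failed.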

It is a good idea to do this, but it is admittedly rather more work than just advising "don't do that". I have to de-prioritize this bug for now.