Bug 805521

Summary: Unable to route to the repo host from here, and SSH tunnel will never be established
Product: [Retired] CloudForms Cloud Engine
Reporter: James Laska <jlaska>
Component: oz
Assignee: Ian McLeod <imcleod>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Martin Kočí <mkoci>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.0.0
CC: brad, calfonso, jturner, whayutin
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-18 11:41:49 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description Flags
aeolus-debug-20120321094647.tar.gz none

Description James Laska 2012-03-21 14:06:26 UTC
Created attachment 571721
aeolus-debug-20120321094647.tar.gz

Description of problem:

Attempting to push an image to ec2 fails after the SSH tunnel to a System Engine host cannot be set up.

NOTE: This bug has proved *difficult* to isolate and reliably reproduce.  The failure was encountered multiple times while testing builds and pushes, by multiple users, on the same Cloud Engine system.

Version-Release number of selected component (if applicable):
 * aeolus-all-0.8.1-1.el6.noarch
 * aeolus-conductor-0.8.1-1.el6.noarch
 * aeolus-conductor-daemons-0.8.1-1.el6.noarch
 * aeolus-conductor-doc-0.8.1-1.el6.noarch
 * aeolus-configure-2.5.1-1.el6.noarch
 * deltacloud-core-0.5.0-5.el6.noarch
 * deltacloud-core-ec2-0.5.0-5.el6.noarch
 * deltacloud-core-rhevm-0.5.0-5.el6.noarch
 * deltacloud-core-vsphere-0.5.0-5.el6.noarch
 * imagefactory-1.0.0rc10-1.el6.noarch
 * imagefactory-jeosconf-ec2-fedora-1.0.0rc10-1.el6.noarch
 * imagefactory-jeosconf-ec2-rhel-1.0.0rc10-1.el6.noarch
 * iwhd-1.5-2.el6.x86_64
 * oz-0.8.0-5.el6.noarch
 * rubygem-aeolus-cli-0.3.1-1.el6.noarch
 * rubygem-aeolus-image-0.3.0-12.el6.noarch    
 * rubygem-deltacloud-client-0.5.0-2.el6.noarch

How reproducible:
 * Having difficulty reproducing on a clean system.
 * The bug was discovered multiple times while several people were building and pushing images on the same Cloud Engine installation. I'm unclear whether load plays a role.

Steps to Reproduce:
1. Set up System Engine, import official manifest, sync and promote content and templates
2. Set up Cloud Engine, build ec2 images using System Engine templates
3. Attempt to push image to ec2

Actual results:

 * The ec2 push fails

> 2012-03-20 10:46:18,330 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(2bd73c46) Message: Customizing guest: ec2-23-20-166-18.compute-1.amazonaws.com
> 2012-03-20 10:46:18,330 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: Installing additional repository files
> 2012-03-20 10:46:23,683 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: Unable to route to the repo host from here, and SSH tunnel will never be established
> 2012-03-20 10:46:23,683 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: (56, 'SSL read: errno -12192')
> 2012-03-20 10:46:24,818 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: Unable to route to the repo host from the guest, will attempt to establish an SSH tunnel
> 2012-03-20 10:46:24,818 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: 'ssh -i /tmp/tmp4Vvqxx -F /dev/null -o ServerAliveInterval=30 -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o UserKnownHostsFile=/dev/null -o PasswordAuthentication=no root@ec2-23-20-166-18.compute-1.amazonaws.com curl --silent --cert /etc/pki/ozrepos/Red_Hat_CloudForms_Cloud_Engine_RPMs_x86_64_6Server-client.crt --key /etc/pki/ozrepos/Red_Hat_CloudForms_Cloud_Engine_RPMs_x86_64_6Server-client.key --insecure  https://qe-blade-08.idm.lab.bos.redhat.com/pulp/repos/redhat/Stage/content/dist/rhel/server/6/6Server/x86_64/cf-ce/1.0/os/repodata/repomd.xml' failed(6): Warning: Permanently added 'ec2-23-20-166-18.compute-1.amazonaws.com,23.20.166.18' (RSA) to the list of known hosts.
> <snip>
> 2012-03-20 10:46:51,765 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(2bd73c46) Message: Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 458, in push_image
>     credentials)
>   File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 665, in push_image_snapshot_ec2
>     self.guest.do_customize(guestaddr)
>   File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1100, in do_customize
>     self._customize_repos(guestaddr)
>   File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1041, in _customize_repos
>     raise oz.OzException.OzException("Could not reach repository %s from the host or the guest, aborting" % (repo.url))
> OzException: Could not reach repository https://qe-blade-08.idm.lab.bos.redhat.com/pulp/repos/redhat/Stage/content/dist/rhel/server/6/6Server/x86_64/cf-ce/1.0/os from the host or the guest, aborting

> 2012-03-20 10:46:51,765 DEBUG imgfac.BuildJob.BuildJob thread(2bd73c46) Message: Builder (2bd73c46-0b3e-47e0-a5ff-00c76bdcee29) changed status from PUSHING to FAILED

Expected results:

 * The ec2 push is successful

Additional info:

 * See attachment aeolus-debug-20120321094647.tar.gz

 * I'm concerned about this issue because it seemed to happen when we put Cloud Engine under load

 * I have ...
   1. confirmed that the firewalls are sound and that the Cloud Engine system (qeblade31) can contact the System Engine (qe-blade-08).
   2. confirmed that, using curl with a valid certificate and key, I can access repository content (repomd.xml) hosted on System Engine (qe-blade-08) from Cloud Engine (qeblade31).
   3. confirmed that I can manually push an ec2 image and set up a tunnel.

Comment 1 jrd 2012-03-21 17:48:06 UTC
It sounds like there are multiple things going on here, as the push should be taking place at a different time than the instance trying to phone home.  But I'm not qualified to debug deeper.  Ian?

Comment 2 Ian McLeod 2012-03-21 18:48:17 UTC
This error means, pretty unequivocally, that the repo exposed by system engine is failing, at least intermittently.

We attempt to reach the repo locally and from the running EC2 guest in this function:

https://github.com/aeolusproject/oz/blob/master/oz/RedHat.py#L864

For the Factory/oz host side of things all we do here is set up a curl transaction to grab the repo metadata file.  If this is successful, that counts as the repo being reachable.  If not, it fails.

In this case it is failing.
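
As a rough illustration of the check described above (not the actual oz code), here is a minimal Python sketch. It assumes the decision flow that the log messages suggest: probe repodata/repomd.xml from the host, probe it from the guest, and abort if neither side can reach it. The names repomd_url, host_can_reach and plan_repo_access are hypothetical, and the sketch uses the standard library rather than curl.

```python
import http.client
import ssl
import urllib.request


def repomd_url(repo_url):
    """Return the metadata URL probed for a yum repo (repodata/repomd.xml)."""
    return repo_url.rstrip("/") + "/repodata/repomd.xml"


def host_can_reach(repo_url, cert=None, key=None, timeout=30):
    """Hypothetical host-side probe: True if repomd.xml can be fetched."""
    ctx = ssl.create_default_context()
    # Mirrors `curl --insecure` from the log: skip server certificate checks.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    if cert and key:
        # Client certificate and key, like curl's --cert and --key options.
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    try:
        with urllib.request.urlopen(
            repomd_url(repo_url), timeout=timeout, context=ctx
        ) as resp:
            resp.read()
        return True
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection, TLS, HTTP and malformed-URL failures alike.
        return False


def plan_repo_access(host_ok, guest_ok):
    """Assumed decision flow, inferred from the log messages in this bug."""
    if guest_ok:
        return "direct"   # the guest can fetch from the repo itself
    if host_ok:
        return "tunnel"   # the host can reach the repo, so tunnel over SSH
    raise RuntimeError(
        "Could not reach repository from the host or the guest, aborting"
    )
```

With both probes failing, as in the attached log, plan_repo_access(False, False) raises, which corresponds to the OzException in the traceback.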

If load is playing a role it may be on the System Engine SSL server side of things.

The relevant curl error is earlier in the log:

2012-03-20 10:46:23,683 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: Unable to route to the repo host from here, and SSH tunnel will never be established
2012-03-20 10:46:23,683 DEBUG oz.Guest.RHEL6RemoteGuest thread(2bd73c46) Message: (56, 'SSL read: errno -12192')

So, I'd like to ask that you first address this issue to the System Engine guys, as it seems to be an error with their component.

James, I am going to NEEDINFO you and ask that you comment on the above analysis and, if you agree, please change the component to system engine (or clone the bug to system engine).

Comment 3 James Laska 2012-03-21 20:18:48 UTC
Hi Ian ... Chris can speak more to this as he helped debug the issue while the problem was happening.  I recognize there isn't a lot to go on with this bug.  Additionally, QE is still trying to reliably reproduce this failure.  But I thought this might be important enough to at least get the foot in the bz door.

I was able to confirm that I could access the System Engine hosted repository using the appropriate key and cert.  From everything I could determine while working with SSH tunnel issues in the past, System Engine appeared to be doing things correctly.  We had no trouble building and pushing images (with the same templates) to other internal cloud providers.

Not sure why this got moved to MODIFIED, moving back to NEW as investigation is still underway.

Comment 4 James Laska 2012-03-21 20:20:00 UTC
(In reply to comment #2)
> If load is playing a role it may be on the System Engine SSL server side of
> things.

I just re-read your comment and caught the above line.  I'll move this to needinfo? on me, and attempt to determine if System Engine load could have played a role.

Comment 7 James Laska 2012-04-18 11:41:49 UTC
I'm closing this bug as INSUFFICIENT_DATA.  I've not encountered this since the original filing.  I have been unable to recreate the condition that caused the curl command (issued from oz/RedHat.py) to return:

(56, 'SSL read: errno -12192')

> # man curl 
> 56     Failure in receiving network data.

If conductor (imagefactory/oz) ever reports that it is "Unable to route to the repo host from here, and SSH tunnel will never be established", check the curl return code, which is logged shortly after this message.  If it matches the error above, please re-open this bug report.
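
To make that check easier, here is a small Python sketch that scans log lines for the message and pulls out the curl error tuple that follows it. It assumes the log format shown in this report; the function name and the window parameter are hypothetical.

```python
import re

ROUTE_MSG = "Unable to route to the repo host from here"
# Matches the curl error tuple as logged, e.g. "(56, 'SSL read: errno -12192')".
CURL_ERR = re.compile(r"Message: \((\d+), '([^']*)'\)")


def curl_errors_after_route_failure(lines, window=5):
    """Return (code, text) pairs logged within `window` lines after ROUTE_MSG."""
    found = []
    remaining = 0
    for line in lines:
        if ROUTE_MSG in line:
            # Start (or restart) looking at the lines that follow.
            remaining = window
            continue
        if remaining > 0:
            remaining -= 1
            match = CURL_ERR.search(line)
            if match:
                found.append((int(match.group(1)), match.group(2)))
                remaining = 0
    return found
```

A result containing code 56 (failure in receiving network data, per the curl man page) would match the failure reported here.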

As Ian suggested, I suspect the fault lies with the repo host (katello in this case).  Since I've been unable to recreate the failure, I cannot determine the root cause.