Bug 856167

Summary: 3.1 - [RHEV-H 6.3]Auto install RHEV-H with "management_server=$RHEV-M_IP" parameter, it failed to approve rhevh on rhevm side.
Product: Red Hat Enterprise Linux 6
Reporter: haiyang,dong <hadong>
Component: vdsm
Assignee: Juan Hernández <juan.hernandez>
Status: CLOSED ERRATA
QA Contact: Pavel Stehlik <pstehlik>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.3
CC: abaron, acathrow, alonbl, bazulay, bsarathy, chchen, chetan, cpelland, cshao, dyasny, gouyang, hadong, iheim, ilvovsky, istein, jboggs, leiwang, lpeer, mburns, ovirt-maint, Rhev-m-bugs, ycui, yeylon, ykaul, yzaslavs, zdover
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: 6.3
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: vdsm-4.9.6-40.0
Doc Type: Bug Fix
Doc Text:
Previously, performing an automated installation of a hypervisor using the "management_server" parameter without specifying a port number and without the "management_server_fingerprint" option succeeded, but the hypervisor could not be approved from the Manager administration portal. Now, port 443 is used by default if an alternate port is not provided, and management_server_fingerprint is optional. You can automatically install and approve a Hypervisor without specifying a port number or a management_server_fingerprint.
Story Points: ---
Clone Of:
: 861399 (view as bug list)
Environment:
Last Closed: 2012-12-04 19:11:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 861399, 863292    
Attachments (Description / Flags):
attached vdsm-reg.log and engine.log (flags: none)
authorized_keys (flags: none)

Description haiyang,dong 2012-09-11 11:16:08 UTC
Description of problem:
After auto-installing RHEV-H with the "management_server=$RHEV-M_IP" parameter, RHEV-H registers automatically with RHEV-M 3.1, but approving the host on the RHEV-M side fails, so it never comes up.

Version-Release number of selected component (if applicable):
rhevm-3.1.0-14.el6ev.noarch  
rhev-hypervisor6-6.3-20120910.0.rhev31.el6_3

How reproducible:
100%
 
Steps to Reproduce:
1. Auto install RHEV-H with the "management_server=$RHEV-M_IP" parameter.
2. Reboot RHEV-H and check the RHEV-M side to confirm that RHEV-H registered automatically with RHEV-M 3.1.

Actual results:
1. vdsm-config: starting
Generating RHEV agent configuration files
RHEV agent configuration files already exist.

Configuring the RHEV Manager connection.
Traceback (most recent call last):
  File "/usr/share/vdsm-reg/deployUtil.py", line 1555, in <module>
  File "/usr/share/vdsm-reg/deployUtil.py", line 1522, in main
  File "/usr/share/vdsm-reg/deployUtil.py", line 1453, in nodeCleanup
  File "/usr/share/vdsm-reg/deployUtil.py", line 1444, in _nodeBackupCerts
  File "/usr/lib64/python2.6/shutil.py", line 95, in copy2
  File "/usr/lib64/python2.6/shutil.py", line 50, in copyfile
IOError: [Errno 2] No such file or directory: '/etc/pki/vdsm/certs/cacert.pem'
No management_server_fingerprint found.
File already persisted: /etc/vdsm-reg/vdsm-reg.conf

vdsm-config: ended.
Finalizing Install and Rebooting (this may take a minute)

2. After step 2, RHEV-H registers automatically with RHEV-M 3.1, but approving the host so that it comes up fails.

Expected results:
After step 2, RHEV-H registers automatically with RHEV-M 3.1 and can be approved so that it comes up.

Additional info: 
------

Comment 2 haiyang,dong 2012-09-11 11:43:29 UTC
Created attachment 611737 [details]
attached vdsm-reg.log and engine.log

Comment 3 Mike Burns 2012-09-11 12:22:09 UTC
FWIW, this appears to happen every time you autoinstall with management server.

Tried:  

management_server=<hostname>:80
management_server=<hostname>:443
management_server=<hostname>:443 management_server_fingerprint=<fingerprint>

Workaround: after installation, log in as admin, navigate to the RHEV-M configuration screen, choose Apply, then approve the host in the RHEV-M Web UI.

Comment 4 Mike Burns 2012-09-11 12:24:48 UTC
Created attachment 611745 [details]
authorized_keys

The problem stems from the authorized_keys file for root. I don't know exactly how it's pulled down yet, but it contains HTML instead of an SSH key.

Please see attached

Comment 5 Itamar Heim 2012-09-11 13:52:22 UTC
I'm guessing no redirect in servlet for this file.
juan?

Comment 6 Juan Hernández 2012-09-13 10:04:06 UTC
My hypothesis is that Apache is smarter than JBoss in this particular case: it returns a nice HTTP error message when the host tries to connect to an HTTPS port using HTTP. When connecting directly to JBoss (without Apache as a proxy), JBoss just starts the SSL handshake, the operation fails on the host side, and the host then retries with HTTPS.

I need to verify this hypothesis and find a solution.

Comment 7 Juan Hernández 2012-09-13 11:40:06 UTC
Can you provide me with the exact kernel options used for the autoinstallation? A complete pxelinux.cfg/default entry would be great.

Comment 8 Mike Burns 2012-09-13 11:55:50 UTC
I don't have a PXE profile handy at the moment, but this will work just the same.

Boot from CD/USB. On the boot menu, highlight the install (or reinstall) option and press <TAB>, then add the following:

storage_init=/dev/sda BOOTIF=eth0 adminpw=XXXXXXX management_server=hostname:port


for adminpw, generate the password using 

openssl passwd

Comment 10 Juan Hernández 2012-09-15 09:25:04 UTC
I don't have a solution for the problem yet, but this is what I found out:

1. The "management_server" parameter has to contain the IP address and the port number, both are mandatory.

2. The "management_server_fingerprint" is also mandatory.

3. When providing these two parameters correctly the CA certificate is downloaded from the engine correctly, and saved to /etc/pki/vdsm/certs/cacert.pem. The original CA certificate is saved and persisted to a backup file in the same directory:

cp cacert.pem bkp-date-cacert.pem

Note that this is the CA certificate, not the VDSM certificate; that one should be pushed by the manager later, after the reboot.

4. The hypervisor reboots automatically.

5. During boot the vdsmd service is started, and as part of its start script it validates the VDSM certificate against the CA certificate. This validation fails because the CA certificate is new (downloaded from the engine) but the VDSM certificate is still the original one. When the validation fails, the vdsmd start script tries to replace the CA certificate with one from the /etc/pki/vdsm/certs directory. It finds the backup made earlier and uses it, so VDSM is now running with its original CA and certificate.

6. The vdsm-reg service starts after that, and it uses the /etc/pki/vdsm/certs/cacert.pem CA certificate for the SSL communication with the engine. This fails because that CA certificate is not the one downloaded from the engine.

7. The vdsm-reg service tries now to register using HTTP instead of HTTPS and it succeeds, so the host is registered in the engine, but not approved.

8. The vdsm-reg service tries to download the SSH public key using SSL and the wrong CA certificate. This fails and vdsm-reg detects it correctly.

9. The vdsm-reg service tries to download the SSH public key using HTTP but using the HTTPS port. It gets the error page from Apache and saves it to the authorized_keys file because it is not validated.

I think there are several things that could potentially need to be fixed here:

A. Change the node so that it correctly interprets the "management_server" parameter even when the port number is not provided.

B. Rethink the VDSM logic that selects one certificate from the /etc/pki/vdsm/certs directory. It is not as easy as removing it: otherwise VDSM and libvirt (both use the same certificates) will fail to start after the reboot, and then vdsm-reg won't be able to create the bridge (it uses libvirt for that).

C. Rethink why vdsm-reg doesn't download the CA certificate; it used to do so in older versions.

D. Fix deployUtil.py so that when it downloads certificates or SSH keys it validates them before writing them to files.

I am not 100% sure this analysis is correct. Suggestions are welcome.
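Point D, in particular, boils down to refusing to write unvalidated payloads to disk. A minimal sketch of such a guard (hypothetical names; not the actual deployUtil.py code):

```python
import os
import tempfile


def write_if_valid(path, data, validator):
    """Write downloaded data to path only if validator(data) approves it.

    Hypothetical helper illustrating point D above; the real fix lives in
    the gerrit patches referenced in later comments.
    """
    # Refuse to touch the target when the payload fails validation
    # (e.g. an Apache HTML error page instead of an SSH key).
    if not validator(data):
        raise ValueError("downloaded data failed validation, not writing %s" % path)
    # Write to a temp file in the same directory and rename atomically,
    # so a half-written file never replaces a good one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        os.write(fd, data.encode() if isinstance(data, str) else data)
    finally:
        os.close(fd)
    os.replace(tmp, path)
```

With a validator that rejects HTML, the Apache error page described in comment 4 would never reach authorized_keys.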

Comment 12 Juan Hernández 2012-09-17 09:36:48 UTC
Whatever the problem is and whatever solution we implement, I think it is worthwhile to validate the SSH public key before saving it to authorized_keys:

http://gerrit.ovirt.org/8018

I will prepare another patch to verify the CA certificate as well.
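As an illustration of the kind of check involved, a loose format test for an OpenSSH public key line could look like this (a hypothetical sketch, not the code in the gerrit change):

```python
import base64
import binascii


def looks_like_ssh_public_key(line):
    """Loosely check that a line resembles an authorized_keys entry."""
    # An entry is "<type> <base64-blob> [comment]"; an Apache HTML
    # error page matches none of this.
    parts = line.strip().split()
    if len(parts) < 2:
        return False
    key_type, blob = parts[0], parts[1]
    if key_type not in ("ssh-rsa", "ssh-dss"):
        return False
    try:
        raw = base64.b64decode(blob)
    except (binascii.Error, ValueError):
        return False
    # The decoded blob begins with a length-prefixed copy of the key type.
    return raw[4:4 + len(key_type)] == key_type.encode()
```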

Comment 13 Juan Hernández 2012-09-17 10:22:54 UTC
The following change adds the verification of the downloaded CA certificate:

http://gerrit.ovirt.org/8021
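In the same spirit, the CA download can at least be checked for PEM certificate framing before it is written out. A minimal sketch using the standard library (this only validates the BEGIN/END markers and the base64 payload, not the certificate chain, and is not the code added by the gerrit change):

```python
import ssl


def looks_like_pem_certificate(pem_text):
    """Return True if pem_text has valid PEM certificate framing."""
    try:
        # Raises ValueError when the BEGIN/END markers are missing
        # or the base64 body is malformed (e.g. an HTML error page).
        return len(ssl.PEM_cert_to_DER_cert(pem_text)) > 0
    except ValueError:
        return False
```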

Comment 14 Juan Hernández 2012-09-17 11:17:54 UTC
In point A of comment 10 I said that the node needs to correctly parse the "management_server" kernel parameter, but I see now that this is done in "vdsm-config", which is part of VDSM. The following patch tries to fix that:

http://gerrit.ovirt.org/8022
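The parsing fix amounts to something like the following (a hypothetical sketch of the intended behavior, not the actual vdsm-config code; IPv6 literals are ignored for simplicity):

```python
def parse_management_server(value, default_port=443):
    """Split a management_server argument "host[:port]" into (host, port).

    Defaults to port 443 when no port is given, matching the behavior
    described in the doc text.
    """
    host, sep, port = value.partition(":")
    if not sep or not port:
        return host, default_port
    return host, int(port)
```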

Comment 18 Juan Hernández 2012-09-18 17:49:22 UTC
The solution I found for this problem (in addition to the previous three patches) is to download the engine CA certificate to a different file, /etc/pki/vdsm/certs/enginecacert.pem. This file isn't touched by the VDSM start script, so it is preserved after the reboot and can be used by vdsm-reg to download the SSH key over HTTPS. The proposed change is here:

http://gerrit.ovirt.org/8038

Comment 44 haiyang,dong 2012-11-02 08:40:20 UTC
Test version:
rhev-hypervisor6-6.3-20121101.0.el6_3.noarch.rpm
vdsm-4.9.6-40.0.el6_3.x86_64
rhevm-3.1.0-23.el6ev
                                           
According to the expected results:
1. Auto-installing RHEV-H with the "management_server=$RHEV-M_IP:[portid]" parameter should succeed on the RHEV-H side.
2. RHEV-H should register in the RHEV-M Web UI, and approving it there should bring it up.

Test result:
Tried:
management_server=$RHEV-M_IP   Pass
management_server=$RHEV-M_IP:443   Pass
management_server=$RHEV-M_IP:443 management_server_fingerprint=<fingerprint>   Pass
management_server=$RHEV-M_IP management_server_fingerprint=<fingerprint>   Pass

So this bug has been fixed.

Comment 45 haiyang,dong 2012-11-02 09:05:51 UTC
In order to verify this bug, could you help change the status to "ON_QA"?

Comment 46 haiyang,dong 2012-11-02 09:27:11 UTC
According to the test results of comment 44, changing the status to "VERIFIED".

Comment 48 errata-xmlrpc 2012-12-04 19:11:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html