1448065 – Add RHV provider using a bad hostname do not fail the validation in UI.

Summary: Add RHV provider using a bad hostname do not fail the validation in UI.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Providers
Sub Component:
Version:	5.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Juan Hernández
QA Contact:	Ilanit Stein
Docs Contact:
URL:
Whiteboard:	rhev
Depends On:	1508944
Blocks:	1461860 1510206

Reported:	2017-05-04 13:07 UTC by Ilanit Stein
Modified:	2018-06-21 20:27 UTC
CC List:	11 users
Fixed In Version:	5.10.0.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1461860 1510206
Environment:
Last Closed:	2018-06-21 20:27:11 UTC
Category:	---
Cloudforms Team:	RHEVM
Target Upstream Version:
Embargoed:

Attachments
evm.log (1.34 MB, application/x-gzip) 2017-05-04 13:09 UTC, Ilanit Stein	no flags
evm.log wrong provider IP (8.44 KB, text/plain) 2017-11-02 12:08 UTC, Radim Hrazdil	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	ManageIQ manageiq-providers-ovirt pull 126	0	None	None	None	2017-11-03 10:35:25 UTC

Description Ilanit Stein 2017-05-04 13:07:04 UTC

Description of problem:
When adding a RHV provider using a bad hostname (wrong FQDN/IP address), a rotating circle is shown for a few minutes, and when the rotation stops, no error is displayed to indicate that the validation failed.

In evm.log this ERROR appears:
[----] E, [2017-05-04T08:52:23.038773 #6821:138e7ac] ERROR -- : MIQ(ems_infra_controller-create): Credential validation was not successful: Timed out connecting to server

Version-Release number of selected component (if applicable):
CFME-5.8.0.13

Additional info:
Not sure if this is a regression or not.

Comment 2 Ilanit Stein 2017-05-04 13:09:36 UTC

Created attachment 1276340
evm.log

Comment 3 Juan Hernández 2017-05-04 15:05:40 UTC

In my environment, with latest ManageIQ version, this reports the following error message in the GUI:

  Credential validation was not successful: Timed out connecting to server

It takes approx 2 minutes to get this, if there is no response from the server.

Ilanit, how long does it take in your environment? Please also check the rhevm.log file; it should contain two lines like these:

  [----] I, [2017-05-04T16:38:09.467825 #15624:2ad37d986388]  INFO -- : Ovirt::Service#resource_get: Sending URL: <https://engine42.local/ovirt-engine/api>
  [----] E, [2017-05-04T16:40:21.512956 #15624:2ad37d986388] ERROR -- : Ovirt::Service#resource_get: class = RestClient::Exceptions::OpenTimeout, message=Timed out connecting to server, URI=https://engine42.local/ovirt-engine/api

There you can check the time difference between the sending and the time out.
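The gap can also be computed directly from the two timestamps. The following is a minimal Python sketch, assuming the log timestamps keep the layout shown in the two lines above; the helper name `log_gap_seconds` is purely illustrative and is not part of ManageIQ.

```python
from datetime import datetime

# Timestamp layout used in the rhevm.log lines quoted above.
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def log_gap_seconds(sent_at: str, failed_at: str) -> float:
    """Return the seconds elapsed between two log timestamps."""
    start = datetime.strptime(sent_at, LOG_TIME_FORMAT)
    end = datetime.strptime(failed_at, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Timestamps copied from the INFO and ERROR lines above.
gap = log_gap_seconds("2017-05-04T16:38:09.467825", "2017-05-04T16:40:21.512956")
print(f"{gap:.1f} s")  # 132.0 s
```

For the two example lines this gives roughly 132 seconds, that is, a little over 2 minutes between sending the request and the time out.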

Please also check what is the result of the following command, using the same wrong host name or IP address:

  time openssl s_client -connect the_wrong_host:443

This is useful because this time out is controlled by the operating system. It may be that in your environment it is longer than usual. My hypothesis is that in your environment the GUI may not wait long enough for the error from the server.

Comment 4 Ilanit Stein 2017-05-04 15:19:55 UTC

I checked those 2 messages (INFO & ERROR) in rhevm.log, and they are 2 minutes apart.

From the CFME machine:
[root@vm-71-207 log]# time openssl s_client -connect <the_wrong_host>:443

socket: Connection timed out
connect:errno=110

real    2m7.399s
user    0m0.014s
sys     0m0.011s
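The same measurement can be scripted when openssl is not at hand. This is a minimal Python sketch, not what was run in this report: the function name `time_connect` is an assumption for illustration, and with `timeout=None` the operating system decides when to give up, as in the openssl test above.

```python
import socket
import time


def time_connect(host, port, timeout=None):
    """Try a TCP connect and return (elapsed_seconds, error_or_None).

    With timeout=None the socket blocks until the operating system
    reports success or failure; a number caps the wait in seconds.
    """
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            error = None
    except OSError as exc:  # covers timeouts and refused connections
        error = exc
    return time.monotonic() - started, error


# Example call (blocks for as long as the OS keeps retrying):
#   elapsed, error = time_connect("the_wrong_host", 443)
#   print(f"{elapsed:.1f} s", error)
```

Passing an explicit `timeout` shows how a client-side limit shorter than the OS default changes the result.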

Comment 5 Juan Hernández 2017-05-04 15:22:59 UTC

OK, thanks Ilanit, that rules out my initial hypothesis. I will check if there is any change in this area between 5.8.0 and the latest master.

Comment 6 Juan Hernández 2017-05-04 16:55:49 UTC

Tried this with the latest 'fine' version, and it worked correctly. I see the following error message in the GUI:

  Credential validation was not successful: Timed out connecting to server

Ilanit, can you give me (offline) access to the environment where you detected this?

Comment 7 Juan Hernández 2017-05-05 09:02:19 UTC

The key difference between my environment and the environment where Ilanit is performing the tests is that in Ilanit's environment access to the UI goes via the Apache web server, which acts as an HTTP proxy/balancer for the UI workers.

The Apache proxy module has a connect time out that is taken by default from the 'Timeout' directive:

https://httpd.apache.org/docs/2.4/mod/core.html#timeout

In Ilanit's environment this is explicitly set to 2 minutes:

# grep '^Timeout ' /etc/httpd/conf.d/*
/etc/httpd/conf.d/manageiq-http.conf:Timeout 120

I think that is the default setting for CFME.

Unfortunately the time out used by the provider is that of the operating system, and it is approximately 2 minutes and 8 seconds. That means that the request that the UI performs to validate the credentials will fail before receiving the response from the provider, because Apache will give up before the UI worker returns that response.

To verify that I changed Ilanit's environment to use 300 seconds. Then things work as expected.
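The interaction between the two limits can be illustrated with a small Python sketch. It only compares the numbers mentioned in this report (Apache 'Timeout 120', an OS connect time out of about 127 seconds as measured in comment 4, and the 300 seconds used for verification); the layer names and the helper `first_to_give_up` are illustrative assumptions, not ManageIQ or Apache identifiers.

```python
def first_to_give_up(timeouts):
    """Return the name of the layer with the shortest timeout (seconds)."""
    return min(timeouts, key=timeouts.get)


# Original setup: Apache stops waiting before the connect attempt fails,
# so the UI never receives the validation error.
original = {"apache_proxy": 120, "provider_connect": 127}

# Apache limit raised to 300 seconds, as in the verification above:
# the connect attempt fails first and the error can reach the UI.
raised = {"apache_proxy": 300, "provider_connect": 127}

print(first_to_give_up(original))  # apache_proxy
print(first_to_give_up(raised))    # provider_connect
```

The error message is only displayed when the provider connection is the first layer to give up.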

We can do the following to fix this issue:

1. Increase the Apache time out so that it is longer than the default TCP connection time out. This can be done with the 'Timeout' directive, or with the 'ProxyTimeout' directive, which is only used by the proxy module, not by the rest of the web server. It can also be done by adding the 'timeout' parameter to specific proxies:

# In /etc/httpd/conf.d/manageiq-balancer-ui.conf:
<Proxy balancer://evmcluster_ui/ lbmethod=bybusyness timeout=300>
...

2. Set the value of the 'open_timeout' configuration value of the RHV provider to a value lower than 2 minutes:

:ems_redhat:
  ...
  :inventory:
    :read_timeout: 1.hour
    :open_timeout: 1.minute <-- This is new
  :service:
    :read_timeout: 1.hour
    :open_timeout: 1.minute <-- This is new

I tested both approaches in Ilanit's environment, and both work.

I assume that the 2 minutes time out was set for a reason, so I'd say that alternative number 2 is better. Marcel, what do you think?

Comment 8 Juan Hernández 2017-05-05 09:04:54 UTC

We also need to make sure that the 'open_timeout' setting is honoured when using version 4 of the API and the oVirt Ruby SDK. Boris, do you know if we are honouring that setting?

Comment 9 Marcel Hild 2017-05-05 12:16:46 UTC

If I understand correctly, the default timeout is being set by the OS, so this probably impacts other providers as well.
If that's the case, we should sync those timeouts, which means it's more of an appliance and core issue.

Comment 10 Juan Hernández 2017-05-05 12:22:59 UTC

That 2 minutes and 8 seconds (approx) is the default TCP socket connection time out. You can see that in the test that Ilanit did in comment 4. The actual value depends on the kernel configuration.
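As a rough illustration of where a figure of that size comes from, the Python sketch below adds up the waits between SYN retransmissions. It assumes a Linux kernel with an initial retransmission time out of 1 second that doubles on each retry and `net.ipv4.tcp_syn_retries` set to 6, which is the usual Linux default; these values are assumptions and were not checked on the appliance in this report.

```python
def syn_connect_timeout(syn_retries=6, initial_rto=1.0):
    """Total seconds before connect() gives up when no SYN is answered.

    The kernel waits initial_rto after the first SYN and doubles the
    wait after each of the syn_retries retransmissions.
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


print(syn_connect_timeout())  # 127.0
```

Under those assumptions the total is 127 seconds, in line with the 2m7s measured in comment 4. The value in effect on a given machine can be checked with `sysctl net.ipv4.tcp_syn_retries`.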

It can indeed affect other providers, unless they explicitly control/change this time out.

Who else should we involve to discuss how to address this issue?

Comment 11 Marcel Hild 2017-05-05 12:47:37 UTC

GregT, can you have a look at this? There seems to be a mismatch between the timeout settings of the appliance's Apache and the Linux TCP socket.

Comment 12 Oved Ourfali 2017-05-15 07:56:48 UTC

Greg, any updates?

Comment 13 Gregg Tanzillo 2017-05-15 13:28:37 UTC

I just read through this and my instinct is that option 2 in comment #7 is the way to go. The Apache 2 minute http timeout is standard, afaik. It seems that 2 minutes is a pretty long time to wait for a connection to be established and setting it to a lower value is reasonable. Also, I'm not sure where the os level timeout setting is or whether we actually configure it.

Comment 14 Juan Hernández 2017-05-17 11:58:20 UTC

The pull request that fixes this issue is here:

  Reduce the default oVirt open timeout to 1 minute
  https://github.com/ManageIQ/manageiq/pull/15099

Comment 17 Radim Hrazdil 2017-11-02 12:07:15 UTC

Verified that when trying to add a provider with a wrong FQDN, an error message is displayed.

However, when trying to add a provider using a wrong IP address, it again takes 2 minutes for the rotating animation to disappear and no error message is displayed.

I will add the relevant part of evm.log in the next comment.

Comment 18 Radim Hrazdil 2017-11-02 12:08:43 UTC

Created attachment 1347006
evm.log wrong provider IP

Relevant part of evm.log when trying to add provider using wrong IP address.

Comment 19 Radim Hrazdil 2017-11-02 12:12:47 UTC

For the sake of completeness, I used cfme version 5.9.0.2 and RHV 4.1.7-6.

Comment 20 Juan Hernández 2017-11-02 12:38:30 UTC

The original fix for this worked correctly because it fixed things for version 3 of the API. But now we are using version 4 of the API, and there we aren't honouring the time-outs set in the configuration.
