Description of problem:
When adding a RHV provider with a bad hostname (wrong FQDN or IP address), a rotating circle is shown for a few minutes, and when the rotation stops, no error is displayed to indicate that the validation failed.
This ERROR appears in evm.log:
    [----] E, [2017-05-04T08:52:23.038773 #6821:138e7ac] ERROR -- : MIQ(ems_infra_controller-create): Credential validation was not successful: Timed out connecting to server
Version-Release number of selected component (if applicable):
CFME-5.8.0.13
Additional info:
Not sure if this is a regression or not.
Created attachment 1276340 [details] evm.log
In my environment, with the latest ManageIQ version, this reports the following error message in the GUI:
    Credential validation was not successful: Timed out connecting to server
It takes approximately 2 minutes to get this if there is no response from the server. Ilanit, how long does it take in your environment?
Please also check the rhevm.log file. There should be two lines like these:
    [----] I, [2017-05-04T16:38:09.467825 #15624:2ad37d986388] INFO -- : Ovirt::Service#resource_get: Sending URL: <https://engine42.local/ovirt-engine/api>
    [----] E, [2017-05-04T16:40:21.512956 #15624:2ad37d986388] ERROR -- : Ovirt::Service#resource_get: class = RestClient::Exceptions::OpenTimeout, message=Timed out connecting to server, URI=https://engine42.local/ovirt-engine/api
There you can check the time difference between sending the request and the time out.
Please also check the result of the following command, using the same wrong host name or IP address:
    time openssl s_client -connect the_wrong_host:443
This is useful because this time out is controlled by the operating system. It may be that in your environment it is longer than usual. My hypothesis is that in your environment the GUI may not wait long enough for the error from the server.
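The time difference between the two rhevm.log lines can also be computed mechanically. A minimal sketch in Python, assuming the bracketed timestamp format of the sample lines above; the helper name `seconds_between` is hypothetical and not part of ManageIQ, and the log messages are shortened here for readability:

```python
import re
from datetime import datetime

# Matches the bracketed timestamp in lines such as
# "[----] I, [2017-05-04T16:38:09.467825 #15624:2ad37d986388] INFO -- : ..."
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}) #")


def seconds_between(first_line, second_line):
    """Return the elapsed seconds between the timestamps of two log lines."""
    stamps = []
    for line in (first_line, second_line):
        match = TIMESTAMP.search(line)
        if match is None:
            raise ValueError("no timestamp found in: " + line)
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return (stamps[1] - stamps[0]).total_seconds()


# The two sample lines from this comment, with the message text shortened.
sent = "[----] I, [2017-05-04T16:38:09.467825 #15624:2ad37d986388] INFO -- : Sending URL"
timed_out = "[----] E, [2017-05-04T16:40:21.512956 #15624:2ad37d986388] ERROR -- : OpenTimeout"

print(seconds_between(sent, timed_out))
```

For the sample lines this gives roughly 132 seconds, that is, a little over 2 minutes between sending the request and the time out.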
I checked those 2 messages (INFO & ERROR) in rhevm.log, and they are 2 minutes apart.
From the CFME machine:
    [root@vm-71-207 log]# time openssl s_client -connect <the_wrong_host>:443
    socket: Connection timed out
    connect:errno=110
    real 2m7.399s
    user 0m0.014s
    sys 0m0.011s
OK, thanks Ilanit, that rules out my initial hypothesis. I will check if there is any change in this area between 5.8.0 and the latest master.
Tried this with the latest 'fine' version, and it worked correctly. I see the following error message in the GUI:
    Credential validation was not successful: Timed out connecting to server
Ilanit, can you give me (offline) access to the environment where you detected this?
The key difference between my environment and the environment where Ilanit is performing the tests is that in Ilanit's environment access to the UI goes via the Apache web server, which acts as an HTTP proxy/balancer for the UI workers. The Apache proxy module has a connect time out that is taken by default from the 'Timeout' directive:
    https://httpd.apache.org/docs/2.4/mod/core.html#timeout
In Ilanit's environment this is explicitly set to 2 minutes:
    # grep '^Timeout ' /etc/httpd/conf.d/*
    /etc/httpd/conf.d/manageiq-http.conf:Timeout 120
I think that is the default setting for CFME. Unfortunately the time out used by the provider is that of the operating system, and it is approximately 2 minutes and 8 seconds. That means that the request that the UI performs to validate the credentials will fail before receiving the response from the provider, because Apache will give up before the UI worker gets the time out error back from the provider. To verify that, I changed Ilanit's environment to use 300 seconds. Then things work as expected.
We can do the following to fix this issue:
1. Increase the value of the 'Timeout' directive so that it is longer than the default TCP connection time out. This can be done with the 'Timeout' directive, or with the 'ProxyTimeout' directive, which is only used by the proxy module, not by the rest of the application server. It can also be done by adding the 'timeout' parameter to specific proxies:
    # In /etc/httpd/conf.d/manageiq-balancer-ui.conf:
    <Proxy balancer://evmcluster_ui/ lbmethod=bybusyness timeout=300>
    ...
2. Set the 'open_timeout' configuration value of the RHV provider to a value lower than 2 minutes:
    :ems_redhat:
      ...
      :inventory:
        :read_timeout: 1.hour
        :open_timeout: 1.minute <-- This is new
      :service:
        :read_timeout: 1.hour
        :open_timeout: 1.minute <-- This is new
I tested both approaches in Ilanit's environment, and both work. I assume that the 2 minutes time out was set for a reason, so I'd say that alternative number 2 is better.
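The race between the two time outs can be illustrated with a small model. This is only a sketch under simplifying assumptions: it ignores all processing time other than the two waits, the function name `first_to_give_up` is hypothetical, and 127 seconds is the approximate operating system connect time out measured with openssl earlier in this thread.

```python
def first_to_give_up(proxy_timeout, connect_timeout):
    """Return which side ends the credential validation request first.

    proxy_timeout:   seconds Apache waits for the UI worker ('Timeout' value)
    connect_timeout: seconds the worker waits for the provider connection
    """
    if proxy_timeout <= connect_timeout:
        # Apache gives up first, so the provider error never reaches the UI.
        return "proxy"
    # The connect error comes back while Apache is still waiting.
    return "provider"


OS_CONNECT_TIMEOUT = 127  # approximate value measured in this thread

print(first_to_give_up(120, OS_CONNECT_TIMEOUT))  # original setup: proxy
print(first_to_give_up(300, OS_CONNECT_TIMEOUT))  # alternative 1: provider
print(first_to_give_up(120, 60))                  # alternative 2: provider
```

In this model the error message can only reach the GUI when the provider side fails first, which matches the behaviour seen with both alternatives.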
Marcel, what do you think?
We also need to make sure that the 'open_timeout' setting is honoured when using version 4 of the API and the oVirt Ruby SDK. Boris, do you know if we are honouring that setting?
If I understand correctly, the default timeout is being set by the OS, so this probably impacts other providers as well. If that's the case, we should sync those timeouts, which means it's more of an appliance and core issue.
That 2 minutes and 8 seconds (approx) is the default TCP socket connection time out. You can see that in the test that Ilanit did in comment 4. The actual value depends on the kernel configuration. It can indeed affect other providers, unless they explicitly control/change this time out. Who else should we involve to discuss how to address this issue?
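The roughly 2 minute figure is consistent with how Linux retransmits unanswered SYN packets. A minimal sketch of the arithmetic, assuming the common Linux defaults of net.ipv4.tcp_syn_retries = 6 and an initial retransmission time out of 1 second; neither value was checked on the appliance in this thread, and the function name is hypothetical:

```python
def connect_timeout_seconds(syn_retries=6, initial_rto=1.0):
    """Total wait before connect() fails when no SYN is ever answered.

    The initial SYN plus `syn_retries` retransmissions are sent, and the
    wait after each one doubles, starting from `initial_rto` seconds.
    This simple model ignores the kernel's upper bound on the interval,
    which is not reached with the default values used here.
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


# 1 + 2 + 4 + 8 + 16 + 32 + 64 seconds with the assumed defaults
print(connect_timeout_seconds())
```

With these assumed defaults the total is 127 seconds, in line with the 2m7s measured with openssl in comment 4. Lowering the retry count by one would roughly halve it.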
GregT, can you have a look at this? It seems to be a mismatch between the timeout settings of the appliance's Apache and the Linux TCP socket.
Greg, any updates?
I just read through this and my instinct is that option 2 in comment #7 is the way to go. The Apache 2 minute http timeout is standard, afaik. It seems that 2 minutes is a pretty long time to wait for a connection to be established and setting it to a lower value is reasonable. Also, I'm not sure where the os level timeout setting is or whether we actually configure it.
The pull request that fixes this issue is here:
    Reduce the default oVirt open timeout to 1 minute
    https://github.com/ManageIQ/manageiq/pull/15099
Verified that when trying to add a provider with a wrong FQDN, an error message is displayed. However, when trying to add a provider using a wrong IP address, it again takes 2 minutes for the rotating animation to disappear, and no error message is displayed. I will add the relevant part of evm.log in the next comment.
Created attachment 1347006 [details] evm.log wrong provider IP
Relevant part of evm.log when trying to add provider using wrong IP address.
For the sake of completeness, I used cfme version 5.9.0.2 and RHV 4.1.7-6.
The original fix for this worked correctly because it fixed things for version 3 of the API. But now we are using version 4 of the API, and there we aren't honouring the time-outs set in the configuration.