Description of problem:

Increase the value of the time parameter for the dig check used in score calculations.

When network_test in hosted-engine.conf is set to dns, it runs the DNS query below to maintain the score:

~~~
dns_cmd = [
    'dig',
    '+tries=1',
    '+time={t}'.format(t=self._timeout)
]
~~~

with self._timeout = str(options.get('timeout', 2)), so I expect the timeout there to be 2.

Also, I did not find a constant for timeout in ovirt_hosted_engine_ha/env/config_constants.py, so I guess you cannot change the timeout by adding it to hosted-engine.conf.

Therefore I would request increasing the default value of `'+time={t}'.format(t=self._timeout)` from 2 to something higher, or providing a way to configure it if there isn't one already.

The reason is that the DNS servers for some users can be at distant locations and not on the same network. Because of this, DNS queries can sometimes take longer than expected, and when they do, they time out.

When the above command is run manually in such scenarios, the DNS queries sometimes work and sometimes fail:

~~~
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40319
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;.                              IN      NS

;; Query time: 1313 msec
;; SERVER: x.x.x.x#53(x.x.x.x)
;; WHEN: Wed Apr 01 17:40:58 UTC 2020
;; MSG SIZE  rcvd: 28
============================
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 15462
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;.                              IN      NS

;; Query time: 2080 msec
;; SERVER: x.x.x.x#53(x.x.x.x)
;; WHEN: Wed Apr 01 17:43:03 UTC 2020
;; MSG SIZE  rcvd: 28
~~~

But if '+time=' is increased, they work every time.

For example, we had a user whose DNS servers are not local to the RHHI instance. They use a primary server that is remote from the site where the RHHI cluster is: there is a T1 connection between the RHHI site and a distant office, and from there another router leads to another control facility. The T1 line has a maximum of 160kbps, and if there is other activity going on, DNS query responses time out.

When network_test is set to dns, these DNS query timeouts cause the score of the host to be reduced, and eventually the hosted engine is constantly started on different nodes.

Version-Release number of selected component (if applicable):
rhvm-4.3.8.2-0.4.el7.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
vdsm-4.30.40-1.el7ev.x86_64

How reproducible:
Always happens when DNS servers are remote or there is high load on the DNS network.

Steps to Reproduce:
1.
2.
3.

Actual results:
DNS queries time out

Expected results:
DNS queries should not time out

Additional info:
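A minimal sketch of the kind of change requested above: the dig argument list is built from a timeout that the caller can override instead of relying only on the default of 2. The names `build_dns_cmd` and `DEFAULT_TIMEOUT` are hypothetical and used only for illustration; they are not taken from ovirt-hosted-engine-ha. Only the `+tries` and `+time` dig options come from the snippet in the description.

```python
# Hypothetical illustration, not the broker's actual code.
DEFAULT_TIMEOUT = 2  # seconds; the default reported in this bug


def build_dns_cmd(options):
    """Return the dig argument list for the given options dict."""
    # Fall back to the default only when no 'timeout' key is supplied.
    timeout = str(options.get('timeout', DEFAULT_TIMEOUT))
    return ['dig', '+tries=1', '+time={t}'.format(t=timeout)]


if __name__ == '__main__':
    print(build_dns_cmd({}))              # uses the default timeout
    print(build_dns_cmd({'timeout': 5}))  # uses the overridden timeout
```

With this shape, raising the default and making the value configurable are the same one-line change: either edit `DEFAULT_TIMEOUT` or pass a `timeout` entry in the options.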
AFAICT timeout parameter under network_test should do the job. It's a shared config for tcp, dns, and ping tests, but that should be all right...
Let's change the default to 5 in any case. It shouldn't hurt, and it's going to be more resilient in general.
Please provide reproduction steps.
(In reply to Nikolai Sednev from comment #4)
> Please provide reproduction steps.

Steps to Reproduce:
1. Deploy hosted-engine
2. Add another host that can run the HE vm
3. Shutdown the network of the host
4. Check /var/log/ovirt-hosted-engine-ha/broker.log on the host
(In reply to Michal Skrivanek from comment #3)
> let's change the default to 5 in any case. shouldn't hurt and it's going to
> be more resilient in general

It depends. Slow DNS might also affect other areas (or are we relying on IP addresses only?).
So increasing the timeout to 5 by default seems fine, but in case we rely on DNS (esp. for monitoring actions and such), we must ensure that the DNS is fast. Needing 2+ seconds to answer a query is simply a very slow DNS. A typical answer arrives within milliseconds:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-17.P2.el8_0.1 <<>> +tries=1 +time=1 some.server.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62989
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;some.server.net.               IN      A

;; ANSWER SECTION:
some.server.net.        3400    IN      CNAME   some-other.server.net.
some-other.server.net.  3400    IN      A       <ipv4>

;; Query time: 24 msec
;; SERVER: <ip>#53(<ip>)
;; WHEN: Thu May 28 14:38:04 CEST 2020
;; MSG SIZE  rcvd: 82

So while I am fine with increasing the timeout, I would rather prefer that to be configurable, so we could even add a warning to that specific line that DNS must not be "too slow".

Unless we do not use DNS at all in RHV for recurring tasks (which I doubt, as I expect that a DNS entry in RHV is updated in case DNS changes), I see this kind of extremely slow DNS query as an issue for the overall stability, as it might affect several monitoring timeouts in a negative way.
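To compare query times such as the ones quoted in this report, the `;; Query time: N msec` line can be extracted from saved dig output. The following is only a sketch: the helper names and the 1000 ms threshold are assumptions chosen for illustration, not values used by RHV or ovirt-hosted-engine-ha.

```python
import re

# Matches the summary line dig prints after a successful exchange.
QUERY_TIME_RE = re.compile(r';; Query time: (\d+) msec')


def query_time_ms(dig_output):
    """Return the reported query time in milliseconds, or None if absent."""
    match = QUERY_TIME_RE.search(dig_output)
    return int(match.group(1)) if match else None


def is_slow(dig_output, threshold_ms=1000):
    """True if no query time was reported (e.g. timeout) or it exceeds the threshold.

    threshold_ms is a hypothetical cut-off for this sketch.
    """
    elapsed = query_time_ms(dig_output)
    return elapsed is None or elapsed > threshold_ms


if __name__ == '__main__':
    fast = ";; Query time: 24 msec\n;; MSG SIZE  rcvd: 82\n"
    timed_out = ";; connection timed out; no servers could be reached\n"
    print(query_time_ms(fast), is_slow(fast))
    print(query_time_ms(timed_out), is_slow(timed_out))
```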
(In reply to Martin Tessun from comment #6)
> So while I am fine with increasing the timeout (I would rather prefer that
> to be configurable - so we could even add a warning to that specific line
> that DNS must not be "too slow".

This can be tracked in a separate BZ.

> In case we do not use DNS at all in RHV for recurring tasks (which I doubt,
> as I expect that a DNS entry in RHV is updated in case DNS changes). As such
> I see these kind of extremely slow DNS queries as an issue for the overall
> stability - as it might effect several monitoring timeouts in a negative way.

I'm not sure that the DNS used by the hosted engine test is also the DNS used to resolve local addresses. The connectivity check here was meant to check global connectivity, usually pointing to the gateway.
Works for me on these components:
Software Version:4.4.1.2-0.10.el8ev
rhvm-appliance-4.4-20200604.0.el8ev.x86_64
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
Linux 4.18.0-193.9.1.el8_2.x86_64 #1 SMP Sun Jun 14 15:03:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

less /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/network.py

        self._tests = {
            'ping': self._ping,
            'dns': self._dns,
            'tcp': self._tcp,
            'none': self._none,
        }

        self._addr = options.get('addr')
        self._timeout = str(options.get('timeout', 5))
        self._total = options.get('count', 5)
        self._delay = options.get('delay', 0.5)
        self._network_test = options.get('network_test', 'ping')
        if not self._network_test:
            self._network_test = 'ping'
        if self._network_test not in self._tests:
            raise Exception(
                "{t}: invalid network test".format(
                    t=self._network_test
                )
            )
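The option handling shown above can be exercised outside the broker with a small standalone sketch. `resolve_network_options` and `NETWORK_TESTS` are hypothetical names for this illustration, and the sketch raises `ValueError` where the listing raises a plain `Exception`; the defaults (timeout 5, test `ping`) mirror the listing.

```python
# Hypothetical standalone version of the option handling, for checking defaults.
NETWORK_TESTS = ('ping', 'dns', 'tcp', 'none')


def resolve_network_options(options):
    """Return the effective timeout and network test for an options dict."""
    timeout = str(options.get('timeout', 5))
    # An empty or missing network_test falls back to 'ping'.
    network_test = options.get('network_test') or 'ping'
    if network_test not in NETWORK_TESTS:
        raise ValueError("{t}: invalid network test".format(t=network_test))
    return {'timeout': timeout, 'network_test': network_test}


if __name__ == '__main__':
    print(resolve_network_options({'network_test': 'dns'}))
    print(resolve_network_options({'network_test': 'dns', 'timeout': 10}))
```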
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3246