Bug 1821487

Summary:	[RFE] Increase the time for dig check while computing score calculations.
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Siddhant Rao <sirao>
Component:	ovirt-hosted-engine-ha	Assignee:	Asaf Rachmani <arachman>
Status:	CLOSED ERRATA	QA Contact:	Nikolai Sednev <nsednev>
Severity:	medium	Docs Contact:
Priority:	low
Version:	4.3.8	CC:	arachman, lsurette, mavital, michal.skrivanek, mtessun, rdlugyhe, sbonazzo
Target Milestone:	ovirt-4.4.1	Keywords:	FutureFeature, Triaged
Target Release:	---	Flags:	lsvaty: testing_plan_complete-
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	ovirt-hosted-engine-ha-2.4.3	Doc Type:	Enhancement
Doc Text:	Previously, network tests timed out after 2 seconds. The current release increases the timeout period from 2 seconds to 5 seconds. This reduces unnecessary timeouts when the network tests require more than 2 seconds to pass.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-04 13:27:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Integration	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Siddhant Rao 2020-04-06 23:00:59 UTC

Description of problem:
Increase the value of the time parameter for the dig check used in score calculations.

When network_test in hosted-engine.conf is set to dns, it runs the DNS query below to maintain the score:

~~~
        dns_cmd = [
            'dig',
            '+tries=1',
            '+time={t}'.format(t=self._timeout)
        ]

~~~

Here self._timeout = str(options.get('timeout', 2)), so I expect the timeout there to be 2.

Also, I did not find a constant for timeout in ovirt_hosted_engine_ha/env/config_constants.py, so I guess you cannot change the timeout by adding it to hosted-engine.conf.

I would therefore request increasing the default value of `'+time={t}'.format(t=self._timeout)` from 2 to something higher, or providing a way to configure it if one does not already exist.
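
To make the data flow concrete, here is a minimal sketch of how the `timeout` option reaches the `dig` command line. The helper name `build_dns_cmd` and its `default_timeout` parameter are hypothetical and exist only for illustration. Only the `dig` arguments and the `options.get('timeout', 2)` default are taken from the snippet quoted above.

```python
def build_dns_cmd(options, default_timeout=2):
    """Build the dig argument list the way the quoted snippet does.

    `build_dns_cmd` and `default_timeout` are hypothetical names used
    for illustration; they are not part of ovirt-hosted-engine-ha.
    """
    # The timeout is stored as a string and falls back to the default
    # when the caller supplies no 'timeout' option.
    timeout = str(options.get('timeout', default_timeout))
    return [
        'dig',
        '+tries=1',                     # a single attempt, no retries
        '+time={t}'.format(t=timeout),  # per-query timeout in seconds
    ]
```

With an empty options dict this produces `+time=2`, which is the behaviour this request asks to change. Passing `{'timeout': 5}` produces `+time=5`.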

The reason is that the DNS servers for some users can be at distant locations and not on the same network. Because of this, DNS queries can sometimes take longer than expected, and when they do, they time out.

When the above command is run manually in such scenarios, the DNS queries sometimes work and sometimes fail:

~~~
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40319
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;.                              IN      NS

;; Query time: 1313 msec
;; SERVER: x.x.x.x#53(x.x.x.x)
;; WHEN: Wed Apr 01 17:40:58 UTC 2020
;; MSG SIZE  rcvd: 28

============================

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; connection timed out; no servers could be reached
============================

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> +tries=1 +time=2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 15462
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;.                              IN      NS

;; Query time: 2080 msec
;; SERVER: x.x.x.x#53(x.x.x.x)
;; WHEN: Wed Apr 01 17:43:03 UTC 2020
;; MSG SIZE  rcvd: 28
~~~

But if '+time=' is increased, they work every time.
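
When comparing several manual runs like the ones above, it can help to summarize the captured output rather than read it by eye. The following is a rough sketch, not part of ovirt-hosted-engine-ha. The function names `query_time_ms` and `summarize` are hypothetical, and the parsing relies only on the `;; Query time: N msec` and `connection timed out` lines visible in the output pasted above.

```python
import re

# Matches the ";; Query time: 1313 msec" line that dig prints on an answer.
_QUERY_TIME = re.compile(r'^;; Query time: (\d+) msec', re.MULTILINE)


def query_time_ms(dig_output):
    """Return the reported query time in ms, or None if there was no answer."""
    if 'connection timed out' in dig_output:
        return None
    match = _QUERY_TIME.search(dig_output)
    return int(match.group(1)) if match else None


def summarize(outputs):
    """Count runs with and without an answer and find the slowest answer."""
    times = [query_time_ms(text) for text in outputs]
    answered = [t for t in times if t is not None]
    return {
        'no_answer': len(times) - len(answered),
        'answered': len(answered),
        'slowest_ms': max(answered) if answered else None,
    }
```

For the six runs pasted above, such a summary would report four runs without an answer and two answered, the slowest at 2080 msec.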

For example, we had a user whose DNS servers are not local to the RHHI instance. They use a primary server that is remote from the site hosting the RHHI cluster: a T1 connection links the RHHI site to a distant office, and from there another router leads to another control facility. The T1 line has a maximum of 160kbps, and when other activity is going on, DNS query responses time out.


When network_test is set to dns, these DNS query timeouts reduce the score of the host, and eventually the hosted-engine is constantly started on different nodes.



Version-Release number of selected component (if applicable):
rhvm-4.3.8.2-0.4.el7.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
vdsm-4.30.40-1.el7ev.x86_64

How reproducible:
Always happens when DNS servers are remote or there is high load on the DNS network.

Steps to Reproduce:
1.
2.
3.

Actual results:
DNS queries time out

Expected results:
DNS queries should not time out

Additional info:

Comment 1 Michal Skrivanek 2020-04-07 07:23:10 UTC

AFAICT timeout parameter under network_test should do the job. It's a shared config for tcp, dns, and ping tests, but that should be all right...

Comment 3 Michal Skrivanek 2020-04-22 07:38:25 UTC

let's change the default to 5 in any case. shouldn't hurt and it's going to be more resilient in general

Comment 4 Nikolai Sednev 2020-04-22 14:42:57 UTC

Please provide reproduction steps.

Comment 5 Asaf Rachmani 2020-05-04 13:13:14 UTC

(In reply to Nikolai Sednev from comment #4)
> Please provide reproduction steps.

Steps to Reproduce:
1. Deploy hosted-engine
2. Add another host that can run the HE vm
3. Shutdown the network of the host
4. Check /var/log/ovirt-hosted-engine-ha/broker.log on the host

Comment 6 Martin Tessun 2020-05-28 12:44:38 UTC

(In reply to Michal Skrivanek from comment #3)
> let's change the default to 5 in any case. shouldn't hurt and it's going to
> be more resilient in general

It depends. Slow DNS might also affect other areas (or are we relying on IP addresses only?). So increasing the timeout to 5 by default seems fine, but if we rely on DNS (especially for monitoring actions and such), we must ensure that DNS is fast: needing 2+ seconds to answer a query is simply a very slow DNS.

A typical answer arrives within milliseconds:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-17.P2.el8_0.1 <<>> +tries=1 +time=1 some.server.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62989
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;some.server.net.		IN	A

;; ANSWER SECTION:
some.server.net.	3400	IN	CNAME	some-other.server.net.
some-other.server.net.	3400	IN	A	<ipv4>

;; Query time: 24 msec
;; SERVER: <ip>#53(<ip>)
;; WHEN: Thu May 28 14:38:04 CEST 2020
;; MSG SIZE  rcvd: 82


So I am fine with increasing the timeout, although I would prefer it to be configurable, so that we could even add a warning to that specific line that DNS must not be "too slow".

Unless we do not use DNS at all in RHV for recurring tasks (which I doubt, as I expect that a DNS entry in RHV is updated when DNS changes), I see this kind of extremely slow DNS query as an issue for overall stability, as it might negatively affect several monitoring timeouts.
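
As a rough illustration of the "DNS must not be too slow" idea, the sketch below times a single lookup and classifies the result. It is not part of RHV. The names `timed_lookup` and `classify` and the 1000 ms warning threshold are assumptions made for this example. It also uses the system resolver through `socket.getaddrinfo`, which may be answered from /etc/hosts or a local cache, so it is not equivalent to the `dig` test.

```python
import socket
import time


def timed_lookup(hostname, resolve=socket.getaddrinfo):
    """Resolve hostname once and return the elapsed time in milliseconds.

    Returns None when the lookup fails. `resolve` is injectable so the
    function can be exercised without network access.
    """
    start = time.perf_counter()
    try:
        resolve(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return (time.perf_counter() - start) * 1000.0


def classify(elapsed_ms, warn_ms=1000.0):
    """Label a lookup as 'failed', 'slow', or 'ok' (threshold is assumed)."""
    if elapsed_ms is None:
        return 'failed'
    return 'slow' if elapsed_ms >= warn_ms else 'ok'
```

With the assumed threshold, the 24 msec answer shown above would be labelled 'ok', while a 2080 msec answer would be labelled 'slow'.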

Comment 7 Sandro Bonazzola 2020-06-08 06:32:21 UTC

(In reply to Martin Tessun from comment #6)

> So while I am fine with increasing the timeout (I would rather prefer that
> to be configurable - so we could even add a warning to that specific line
> that DNS must not be "too slow").

This can be tracked in a separate BZ

> In case we do not use DNS at all in RHV for recurring tasks (which I doubt,
> as I expect that a DNS entry in RHV is updated in case DNS changes). As such
> I see these kind of extremely slow DNS queries as an issue for the overall
> stability - as it might affect several monitoring timeouts in a negative way.

I'm not sure that the DNS used by the hosted engine test is also the DNS used to resolve local addresses.
The connectivity check here was meant to check global connectivity, usually pointing to the gateway.

Comment 12 Nikolai Sednev 2020-06-18 15:16:12 UTC

Works for me on these components:
Software Version:4.4.1.2-0.10.el8ev
rhvm-appliance-4.4-20200604.0.el8ev.x86_64
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
Linux 4.18.0-193.9.1.el8_2.x86_64 #1 SMP Sun Jun 14 15:03:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)



less /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/network.py
 
        self._tests = {
            'ping': self._ping,
            'dns': self._dns,
            'tcp': self._tcp,
            'none': self._none,
        }

        self._addr = options.get('addr')
        self._timeout = str(options.get('timeout', 5))
        self._total = options.get('count', 5)
        self._delay = options.get('delay', 0.5)
        self._network_test = options.get('network_test', 'ping')
        if not self._network_test:
            self._network_test = 'ping'
        if self._network_test not in self._tests:
            raise Exception(
                "{t}: invalid network test".format(
                    t=self._network_test
                )
            )
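
The manual check above can also be scripted. The following is a minimal sketch that extracts the default timeout from the module source. The function names are hypothetical, and the regular expression assumes the assignment is written exactly as in the excerpt above.

```python
import re

# Matches: self._timeout = str(options.get('timeout', 5))
_TIMEOUT_DEFAULT = re.compile(
    r"self\._timeout\s*=\s*str\(options\.get\('timeout',\s*(\d+)\)\)"
)


def default_timeout(source_text):
    """Return the default timeout found in the source text, or None."""
    match = _TIMEOUT_DEFAULT.search(source_text)
    return int(match.group(1)) if match else None


def default_timeout_in_file(path):
    """Read a network.py-style file and return its default timeout."""
    with open(path) as source_file:
        return default_timeout(source_file.read())
```

Calling `default_timeout_in_file()` with the network.py path shown above should return 5 on a host carrying the fixed package, assuming the line has not been reformatted.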

Comment 18 errata-xmlrpc 2020-08-04 13:27:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246