Description of problem:

The current default network monitor, 'dns', uses 'dig +tries=1 +time=5' to test the network, and is quite sensitive to network load: if the network is sufficiently loaded, packets might be dropped, and since we use dig's default transport, UDP, the query might fail, eventually lowering the host's score and sometimes shutting down the engine virtual machine.

IMO our monitoring should be somewhat more resilient: network load that causes UDP packets to be dropped, but still lets TCP work reliably, even if more slowly, should not force a shutdown of the VM.

This happened several times recently on our CI [1], and we also got one seemingly similar report on the users mailing list [2].

[1] https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
[2] https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTEDMM4HRFBADIXEQNB4/

Version-Release number of selected component (if applicable):
Probably all versions, although the default monitor was changed to dns/dig only in 4.3.5 (bug 1659052).

How reproducible:
Probably always, although I didn't try to test this systematically.

Steps to Reproduce:
1. Deploy a hosted-engine cluster with two hosts.
2. Impose some load on the network on one of the hosts.

Actual results:
If the load is high enough, the engine VM is eventually shut down (and is expected to be started on the other host).

Expected results:
As long as the network is still reliable enough e.g. for TCP traffic, which is what most important applications use, the engine VM should not be shut down.

Additional info:
The currently proposed solution is to add '+tcp' to the dig command, as sketched below. In theory, if the load is well-distributed over the network, the scores of both/all hosts will be lowered approximately equally, thus not causing a shutdown - the HA agent only shuts down the VM if the host with the best score beats its own score by at least 800. In practice I didn't test this, and due to the way we calculate and publish the scores, I am not sure it works very well. If this is considered important enough, more testing may be needed, including more refined definitions of what a bad-enough network is, when the VM should be shut down, etc. - but I am not sure that's needed right now.
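For illustration, the difference between the current and the proposed check is roughly the following (a sketch only - the monitor's exact invocation may differ, and with no server or name given, dig queries the host's configured resolver):

  # Current check: a single UDP query with a 5-second timeout. One
  # dropped packet (query or reply) under load fails the whole check.
  dig +tries=1 +time=5

  # Proposed check: the same query forced over TCP. TCP retransmits
  # lost segments, so moderate packet loss slows the check down
  # instead of failing it outright.
  dig +tcp +tries=1 +time=5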
serval14 ~]# ethtool eno1
Settings for eno1:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

Device eno1 (1/1):
Incoming:
        Curr: 9.12 GBit/s
        Avg:  4.74 GBit/s
        Min:  2.33 kBit/s
        Max:  9.17 GBit/s
        Ttl:  2357.98 GByte
Outgoing:
        Curr: 9.15 GBit/s
        Avg:  4.52 GBit/s
        Min:  6.30 kBit/s
        Max:  9.17 GBit/s
        Ttl:  427.46 GByte

serval14 ~]# hosted-engine --vm-status

--== Host serval14 (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 346512
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : serval14
Local maintenance                  : False
stopped                            : False
crc32                              : 355af4d7
conf_on_shared_storage             : True
local_conf_timestamp               : 346513
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=346512 (Mon Aug 2 16:22:57 2021)
        host-id=1
        score=3400
        vm_conf_refresh_time=346513 (Mon Aug 2 16:22:57 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False

--== Host serval15 (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 15429
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : serval15
Local maintenance                  : False
stopped                            : False
crc32                              : 63cdfc8e
conf_on_shared_storage             : True
local_conf_timestamp               : 15429
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=15429 (Mon Aug 2 16:22:56 2021)
        host-id=2
        score=3400
        vm_conf_refresh_time=15429 (Mon Aug 2 16:22:56 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

I tested for over an hour under extensive network load: 109 parallel sessions from serval14 towards serval16, plus an additional 109 parallel sessions from serval17 towards serval14, with serval14 serving two iperf3 servers on two different ports. Throughout the test the engine remained reachable, the network was shown as a red bar at 98% load, the hosts' scores did not change, and the HE VM was not migrated to the unloaded host serval15.

I used iperf3 with bidirectional loading (sketched below) on the eno1 interface of serval14, which was running the engine and was also the SPM. eno1 is a 10Gbps fiber-optic Ethernet interface connected to the network and used by the engine's management network. The storage was a dedicated FC LUN, connected over a separate FC network. I saw no issues on the setup.

Components used:
ovirt-engine-setup-4.4.8.2-0.11.el8ev.noarch
ovirt-hosted-engine-setup-2.5.3-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.8-1.el8ev.noarch
ovirt-ansible-collection-1.5.4-1.el8ev.noarch
Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Moving to verified.
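The exact iperf3 command lines are not recorded above, but based on the description they would have looked roughly like this (the port numbers, the one-hour duration and the backgrounding are illustrative assumptions):

  # On serval14: two iperf3 servers listening on two different ports
  iperf3 -s -p 5201 &
  iperf3 -s -p 5202 &

  # On serval17: 109 parallel streams towards serval14 for one hour
  iperf3 -c serval14 -p 5201 -P 109 -t 3600

  # On serval14: an additional 109 parallel streams towards serval16,
  # which runs its own iperf3 server
  iperf3 -c serval16 -P 109 -t 3600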
This bugzilla is included in the oVirt 4.4.8 release, published on August 19th 2021. Since the problem described in this bug report should be resolved in the oVirt 4.4.8 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.