Bug 1984356

Summary: dns/dig network monitor is too sensitive to network load
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Yedidyah Bar David <didi>
Component: Broker
Assignee: Yedidyah Bar David <didi>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: medium
Docs Contact:
Priority: high
Version: ---
CC: arachman, bugs, ovirt
Target Milestone: ovirt-4.4.8
Keywords: Triaged, ZStream
Target Release: ---
Flags: pm-rhel: ovirt-4.4+
       sbonazzo: devel_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-hosted-engine-ha-2.4.8-1.el8ev
Doc Type: Bug Fix
Doc Text:
With this release, ovirt-ha-broker uses 'dig +tcp' to test the network, instead of the default, UDP, thus making the monitoring more resilient to temporary network outages or high load. Before this release, such outages or load could cause the engine virtual machine to be shut down due to the lowered score resulting from the monitoring.

doc team: Some more details follow, to clarify the above text. Feel free to amend the above text a bit, write your own text based on all of it together, or whatever you find best...

oVirt HA broker has a set of monitors to test various things on the host it is running on. Each such monitor has some weight, and based on this weight and the specific result of each test, an overall score is calculated. This score is then written to the storage shared by all hosts, and all hosts routinely check this storage, comparing their own score with the score of the others. If the host running the engine VM sees that its score is significantly lower than the score of the host with the best score - where the difference between them is more than 800 - it shuts down the engine VM. Once the VM is down, all hosts notice this, and the one with the best score will automatically start it. The network monitor has a rather big weight in this calculation, and so a few failures there can cause the score to be low enough to cause a VM shutdown. With the change in this bug, the score should be less affected by such network drops.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-19 06:23:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yedidyah Bar David 2021-07-21 10:08:19 UTC
Description of problem:

The current default network monitor, called 'dns', which uses 'dig +tries=1 +time=5' for testing the network, is quite sensitive to load on the network: if the network is sufficiently loaded, it may drop packets, and since dig uses UDP by default, the test can fail, eventually leading to a reduced score for the host and sometimes to the engine virtual machine being shut down. IMO our monitoring should be somewhat more resilient - load that causes UDP packets to be dropped, but still allows TCP to work reliably, even if more slowly, should not force a shutdown of the VM.
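For illustration, the difference between the current check and a TCP-based one looks roughly like this (a sketch only - the exact command line the broker runs after the fix is not spelled out here):

    # Current style of check: UDP by default, a single try, 5-second timeout.
    # A single dropped UDP packet under load makes this fail outright.
    dig +tries=1 +time=5
    echo $?   # non-zero (9) when no reply was received

    # Proposed style of check: the same query forced over TCP, which
    # retransmits lost segments and therefore tolerates moderate packet loss.
    dig +tcp +tries=1 +time=5
    echo $?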

This happened several times recently on our CI [1], and we also got one report that seems similar on the users mailing list [2].

[1] https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

[2] https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTEDMM4HRFBADIXEQNB4/

Version-Release number of selected component (if applicable):
All versions, probably, although the default monitor was changed to dns/dig only in 4.3.5 (bug 1659052).

How reproducible:
Always, probably, although I didn't try to test this systematically.

Steps to Reproduce:
1. Deploy a hosted-engine cluster with two hosts.
2. Impose some load on the network on one of the hosts (for example with iperf3; see the sketch after these steps).
3.
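One way to generate such load, loosely following the iperf3 setup described in comment 2 below (host name, port and stream count here are only an example):

    # On the host to be loaded (e.g. the one running the engine VM):
    iperf3 -s -p 5201

    # On another machine: many parallel streams, bidirectional, for an hour.
    # (--bidir requires a reasonably recent iperf3.)
    iperf3 -c <loaded-host> -p 5201 -P 100 --bidir -t 3600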

Actual results:
If the load is high enough, eventually the engine VM will be shut down (and is expected to be started on the other host).

Expected results:
If the network is still reliable enough, e.g. for TCP traffic (which is what most important applications use), the engine VM should not be shut down.

Additional info:
The current proposed solution is to use '+tcp'.

In theory, if the load is well-distributed over the network, the scores of both/all hosts will be lowered approximately equally, thus not causing a shutdown - the HA agent only shuts down the VM if the difference between its own score and that of the host with the best score is at least 800. In practice I didn't test this, and due to the way we calculate and publish the scores, I am not sure it works very well.
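As a rough sketch of that decision rule (the numbers are illustrative; the real check lives in the HA agent, not in a shell script):

    # 3400 is the normal full score seen in 'hosted-engine --vm-status';
    # assume failed network tests have lowered the local score.
    local_score=2500
    best_other_score=3400

    # The engine VM is shut down here only if some other host scores
    # better by at least 800 points.
    if [ $((best_other_score - local_score)) -ge 800 ]; then
        echo "engine VM would be shut down on this host"
    fi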

If this is considered an issue of significant importance, perhaps more testing is needed, including more refined definitions of what a bad-enough network is, of when the VM should be shut down, etc. - but I am not sure that's needed right now.

Comment 2 Nikolai Sednev 2021-08-02 14:01:09 UTC
serval14 ~]# ethtool eno1
Settings for eno1:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

Device eno1 (1/1):
==
Incoming:
#  Curr: 9.12 GBit/s
#  Avg: 4.74 GBit/s
#  Min: 2.33 kBit/s
#  Max: 9.17 GBit/s
#  Ttl: 2357.98 GByte
Outgoing:
#  Curr: 9.15 GBit/s
#  Avg: 4.52 GBit/s
#  Min: 6.30 kBit/s
#  Max: 9.17 GBit/s
#  Ttl: 427.46 GByte

serval14 ~]# hosted-engine --vm-status


--== Host serval14 (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 346512
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : serval14
Local maintenance                  : False
stopped                            : False
crc32                              : 355af4d7
conf_on_shared_storage             : True
local_conf_timestamp               : 346513
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=346512 (Mon Aug  2 16:22:57 2021)
        host-id=1
        score=3400
        vm_conf_refresh_time=346513 (Mon Aug  2 16:22:57 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host serval15 (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 15429
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : serval15
Local maintenance                  : False
stopped                            : False
crc32                              : 63cdfc8e
conf_on_shared_storage             : True
local_conf_timestamp               : 15429
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=15429 (Mon Aug  2 16:22:56 2021)
        host-id=2
        score=3400
        vm_conf_refresh_time=15429 (Mon Aug  2 16:22:56 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

I tested for over an hour with extensive network load: 109 parallel sessions from serval14 towards serval16, and an additional 109 parallel sessions from serval17 towards serval14, with serval14 serving two iperf3 servers on two different ports.
During the test the engine remained reachable, the network was indicated as a red bar with 98% load, the scores on the hosts did not change, and the HE VM was not migrated to the unloaded host serval15.
I used iperf3 with bidirectional load on the eno1 interface of the serval14 host, which was running the engine and was also the SPM. eno1 is a 10 Gbps fiber-optic Ethernet interface, connected to the network and used by the engine's management network.
The storage was a dedicated FC LUN, connected over a separate FC network.
I saw no issues on the setup.
Components were used as follows:
ovirt-engine-setup-4.4.8.2-0.11.el8ev.noarch
ovirt-hosted-engine-setup-2.5.3-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.8-1.el8ev.noarch
ovirt-ansible-collection-1.5.4-1.el8ev.noarch
Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Moving to verified.

Comment 3 Sandro Bonazzola 2021-08-19 06:23:09 UTC
This bugzilla is included in the oVirt 4.4.8 release, published on August 19th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.8 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.