Description of problem:

The current default network monitor, 'dns', uses 'dig +tries=1 +time=5' to test the network, and is quite sensitive to network load: if the network is sufficiently loaded, packets might be dropped, and since we use dig's default transport, UDP, the query might fail, eventually lowering the host's score and sometimes shutting down the engine virtual machine.

IMO our monitoring should be somewhat more resilient: network load that causes UDP packets to be dropped, but still lets TCP work reliably, even if more slowly, should not force a shutdown of the VM.

This happened several times recently on our CI [1], and we also got one seemingly similar report on the users mailing list [2].

[1] https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
[2] https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTEDMM4HRFBADIXEQNB4/

Version-Release number of selected component (if applicable):
Probably all versions, although the default monitor was changed to dns/dig only in 4.3.5 (bug 1659052).

How reproducible:
Probably always, although I didn't try to test this systematically.

Steps to Reproduce:
1. Deploy a hosted-engine cluster with two hosts.
2. Impose some load on the network on one of the hosts.

Actual results:
If the load is high enough, the engine VM is eventually shut down (and is expected to be started on the other host).

Expected results:
As long as the network is still reliable enough e.g. for TCP traffic, which is what most important applications use, the engine VM should not be shut down.

Additional info:
The currently proposed solution is to add '+tcp' to the dig command, as sketched below. In theory, if the load is well-distributed over the network, the scores of both/all hosts will be lowered approximately equally, thus not causing a shutdown - the HA agent only shuts down the VM if the host with the best score beats its own score by at least 800. In practice I didn't test this, and due to the way we calculate and publish the scores, I am not sure it works very well. If this is considered important enough, more testing may be needed, including more refined definitions of what a bad-enough network is, when the VM should be shut down, etc. - but I am not sure that's needed right now.
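For illustration, the difference between the current and the proposed check is roughly the following (a sketch only - the monitor's exact invocation may differ, and with no server or name given, dig queries the host's configured resolver):

  # Current check: a single UDP query with a 5-second timeout. One
  # dropped packet (query or reply) under load fails the whole check.
  dig +tries=1 +time=5

  # Proposed check: the same query forced over TCP. TCP retransmits
  # lost segments, so moderate packet loss slows the check down
  # instead of failing it outright.
  dig +tcp +tries=1 +time=5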
serval14 ~]# ethtool eno1
Settings for eno1:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

Device eno1 (1/1):
Incoming:
        Curr: 9.12 GBit/s
        Avg:  4.74 GBit/s
        Min:  2.33 kBit/s
        Max:  9.17 GBit/s
        Ttl:  2357.98 GByte
Outgoing:
        Curr: 9.15 GBit/s
        Avg:  4.52 GBit/s
        Min:  6.30 kBit/s
        Max:  9.17 GBit/s
        Ttl:  427.46 GByte

serval14 ~]# hosted-engine --vm-status

--== Host serval14 (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 346512
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : serval14
Local maintenance                  : False
stopped                            : False
crc32                              : 355af4d7
conf_on_shared_storage             : True
local_conf_timestamp               : 346513
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=346512 (Mon Aug 2 16:22:57 2021)
        host-id=1
        score=3400
        vm_conf_refresh_time=346513 (Mon Aug 2 16:22:57 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False

--== Host serval15 (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 15429
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : serval15
Local maintenance                  : False
stopped                            : False
crc32                              : 63cdfc8e
conf_on_shared_storage             : True
local_conf_timestamp               : 15429
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=15429 (Mon Aug 2 16:22:56 2021)
        host-id=2
        score=3400
        vm_conf_refresh_time=15429 (Mon Aug 2 16:22:56 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

I tested for over an hour under extensive network load: 109 parallel sessions from serval14 towards serval16, plus an additional 109 parallel sessions from serval17 towards serval14, with serval14 serving two iperf3 servers on two different ports. Throughout the test the engine remained reachable, the network was shown as a red bar at 98% load, the hosts' scores did not change, and the HE VM was not migrated to the unloaded host serval15.

I used iperf3 with bidirectional loading (sketched below) on the eno1 interface of serval14, which was running the engine and was also the SPM. eno1 is a 10Gbps fiber-optic Ethernet interface connected to the network and used by the engine's management network. The storage was a dedicated FC LUN, connected over a separate FC network. I saw no issues on the setup.

Components used:
ovirt-engine-setup-4.4.8.2-0.11.el8ev.noarch
ovirt-hosted-engine-setup-2.5.3-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.8-1.el8ev.noarch
ovirt-ansible-collection-1.5.4-1.el8ev.noarch
Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Moving to verified.
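The exact iperf3 command lines are not recorded above, but based on the description they would have looked roughly like this (the port numbers, the one-hour duration and the backgrounding are illustrative assumptions):

  # On serval14: two iperf3 servers listening on two different ports
  iperf3 -s -p 5201 &
  iperf3 -s -p 5202 &

  # On serval17: 109 parallel streams towards serval14 for one hour
  iperf3 -c serval14 -p 5201 -P 109 -t 3600

  # On serval14: an additional 109 parallel streams towards serval16,
  # which runs its own iperf3 server
  iperf3 -c serval16 -P 109 -t 3600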
This bugzilla is included in the oVirt 4.4.8 release, published on August 19th 2021. Since the problem described in this bug report should be resolved in the oVirt 4.4.8 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.