Bug 1984356
Summary: | dns/dig network monitor is too sensitive to network load | ||
---|---|---|---|
Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Yedidyah Bar David <didi> |
Component: | Broker | Assignee: | Yedidyah Bar David <didi> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Nikolai Sednev <nsednev> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | --- | CC: | arachman, bugs, ovirt |
Target Milestone: | ovirt-4.4.8 | Keywords: | Triaged, ZStream |
Target Release: | --- | Flags: | pm-rhel: ovirt-4.4+, sbonazzo: devel_ack+ |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | ovirt-hosted-engine-ha-2.4.8-1.el8ev | Doc Type: | Bug Fix |
Doc Text: |
With this release, ovirt-ha-broker uses 'dig +tcp' to test the network instead of the default, UDP, making the monitoring more resilient to temporary network outages or high load. Before this release, such outages or load could cause the engine virtual machine to be shut down because of the lowered score produced by the monitoring.
doc team: Some more details follow to clarify the above text. Feel free to amend the text above, write your own based on all of it together, or do whatever you find best...
oVirt HA broker runs a set of monitors that test various things on the host it is running on. Each monitor has a weight, and based on this weight and the result of each test an overall score is calculated. This score is written to the storage shared by all hosts, and all hosts routinely check this storage, comparing their own score with the scores of the others. If the host running the engine VM sees that its score is significantly lower than the score of the host with the best score, with a difference of more than 800, it shuts down the engine VM. Once the VM is down, all hosts notice this and the one with the best score automatically starts it. The network monitor has a rather large weight in this calculation, so a few failures there can lower the score enough to cause a VM shutdown. With the change in this bug, the score should be less affected by such network drops. (An illustrative sketch of the dig change follows the summary table below.)
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2021-08-19 06:23:09 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
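As a rough illustration of the fix described in the Doc Text above: the broker now forces its dig query over TCP instead of relying on dig's default UDP transport. The sketch below is a minimal, hedged example and not the actual ovirt-ha-broker submonitor code; the function name, option set, and timeouts are assumptions made only for illustration.

```python
# Minimal sketch of a dig-based DNS reachability probe, assuming the
# default resolver from /etc/resolv.conf. This is NOT the broker's
# actual submonitor; it only illustrates the UDP -> TCP change.
import subprocess


def dns_reachable(use_tcp=True, timeout=5):
    """Return True if 'dig' gets an answer from the default resolver.

    With use_tcp=True the query is forced over TCP ('dig +tcp'), as the
    fixed broker does; the previous behaviour used dig's default, UDP,
    which is more easily dropped under heavy network load.
    """
    cmd = ["dig", "+tries=1", "+time={}".format(timeout)]
    if use_tcp:
        cmd.append("+tcp")
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout + 2)
    except (OSError, subprocess.TimeoutExpired):
        return False
    # dig exits 0 when it received an answer, non-zero on errors/timeouts.
    return result.returncode == 0
```

A TCP query costs an extra round trip for the handshake, but retransmission is then handled by the kernel, so a single lost packet under load no longer makes the whole probe fail.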
Description
Yedidyah Bar David 2021-07-21 10:08:19 UTC
serval14 ~]# ethtool eno1
Settings for eno1:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

Device eno1 (1/1):
Incoming:
        Curr: 9.12 GBit/s
        Avg: 4.74 GBit/s
        Min: 2.33 kBit/s
        Max: 9.17 GBit/s
        Ttl: 2357.98 GByte
Outgoing:
        Curr: 9.15 GBit/s
        Avg: 4.52 GBit/s
        Min: 6.30 kBit/s
        Max: 9.17 GBit/s
        Ttl: 427.46 GByte

serval14 ~]# hosted-engine --vm-status

--== Host serval14 (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 346512
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : serval14
Local maintenance                  : False
stopped                            : False
crc32                              : 355af4d7
conf_on_shared_storage             : True
local_conf_timestamp               : 346513
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=346512 (Mon Aug 2 16:22:57 2021)
        host-id=1
        score=3400
        vm_conf_refresh_time=346513 (Mon Aug 2 16:22:57 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False

--== Host serval15 (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 15429
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : serval15
Local maintenance                  : False
stopped                            : False
crc32                              : 63cdfc8e
conf_on_shared_storage             : True
local_conf_timestamp               : 15429
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=15429 (Mon Aug 2 16:22:56 2021)
        host-id=2
        score=3400
        vm_conf_refresh_time=15429 (Mon Aug 2 16:22:56 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

I tested for over 1 hour with extensive network load: 109 parallel sessions from serval14 towards serval16 and an additional 109 parallel sessions from serval17 towards serval14, with serval14 serving 2 iperf3 servers on two different ports. During the test the engine remained reachable, the network was shown as a red bar at 98% load, the score on the hosts did not change, and the HE VM was not migrated to the unloaded host serval15.

I used iperf3 with bidirectional loading on the eno1 interface of the serval14 host, which was running the engine and was also the SPM. eno1 is a 10 Gbps fiber-optic Ethernet interface, connected to the network and used by the engine's management network. The storage was a dedicated FC LUN, connected over a separate FC network. I saw no issues on the setup.

The following components were used:
ovirt-engine-setup-4.4.8.2-0.11.el8ev.noarch
ovirt-hosted-engine-setup-2.5.3-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.8-1.el8ev.noarch
ovirt-ansible-collection-1.5.4-1.el8ev.noarch
Linux 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Moving to verified.

This bugzilla is included in the oVirt 4.4.8 release, published on August 19th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.8, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
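For completeness, the behaviour exercised by the verification above (both hosts steadily reporting a score of 3400 under load) is the scoring mechanism summarized in the Doc Text: the host running the engine VM shuts it down only when its own score falls more than 800 points below the best score recorded on the shared storage. The sketch below is a hedged illustration of that comparison, not the actual ovirt-hosted-engine-ha state-machine code; the names and data layout are assumptions.

```python
# Illustrative sketch of the score comparison described in the Doc Text;
# names and data structures are hypothetical, not the real
# ovirt-hosted-engine-ha agent code.
SHUTDOWN_SCORE_GAP = 800  # gap that triggers shutting down the engine VM


def should_shut_down_engine_vm(my_score, shared_scores):
    """Decide whether the host currently running the engine VM should stop it.

    'shared_scores' maps host id -> score as read back from the metadata
    on the shared storage, this host's own entry included.
    """
    best_score = max(shared_scores.values())
    return best_score - my_score > SHUTDOWN_SCORE_GAP


# With the verified setup above both hosts report 3400, so the gap is 0
# and the engine VM stays put even while the network monitor is stressed.
print(should_shut_down_engine_vm(3400, {1: 3400, 2: 3400}))  # False
```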