Created attachment 1792517 [details]
example

Description of problem:
I booted 3 masters and 3 workers and simulated:
no latency on master-0 (192.168.127.10)
network latency on master-1 (192.168.127.11)
packet loss on master-2 (192.168.127.12)
no latency on the workers (192.168.127.13-192.168.127.15)

The master-1 validation output is:
sufficient-network-latency-requirement-for-role: Error while attempting to validate network latency: host with IP 192.168.127.12 not found in inventory.
sufficient-packet-loss-requirement-for-role: Error while attempting to validate packet loss validation: host with IP 192.168.127.14 not found in inventory.

The issue is that 192.168.127.14 is a worker node IP.

In addition, worker-0 fails network validations although there is no latency on the worker nodes, and the validation message contains a master IP:
sufficient-packet-loss-requirement-for-role: Error while attempting to validate packet loss validation: host with IP 192.168.127.11 not found in inventory.

Network latency is calculated per role, so I expect no validation failures between workers and masters.

Version-Release number of selected component (if applicable):
Staging v1.0.22.1

How reproducible:
100%

Steps to Reproduce:
1. Boot 3 masters and 3 workers, set roles and the API and ingress VIPs
2. Set network latency on master-1 and packet loss on master-2:
   sudo tc qdisc add dev ens3 root netem delay 150ms
   sudo tc qdisc add dev ens3 root netem loss 10%
3. Wait for network latency validation failures

Actual results:
Masters with latency got the wrong worker IP in the validation message.
Workers with no latency got validation failures containing master IPs.

Expected results:
Masters with latency get the correct IP in the validation message.
Workers with no latency get no validation failures.

Additional info:
Created attachment 1792518 [details] example 2
The issue comes from this line of code (for packet validation): https://github.com/openshift/assisted-service/blob/master/internal/host/validator.go#L768 It attempts to retrieve the hostname and role from the inventory based on the IP. The error that is shown in the UI is because the DB does not have the inventory for the host yet. To fix this, I will change the logic to record this as a warning in the logs and ignore this IP. Once the host inventory is reported in the DB, the validation will be able to report it.
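To illustrate the intended behavior, here is a minimal sketch of the "warn and skip" logic described above. It is written in Python for brevity and is not the actual Go code in validator.go. The names `Peer`, `failing_peers`, `inventory_by_ip`, and the `threshold_ms` parameter are hypothetical and chosen only for this example. It assumes the validation compares a host only against peers of the same role.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List

log = logging.getLogger("host-state")


@dataclass
class Peer:
    """Hypothetical view of what the inventory knows about a host."""
    hostname: str
    role: str


def failing_peers(
    host_role: str,
    latency_ms_by_ip: Dict[str, float],
    inventory_by_ip: Dict[str, Peer],
    threshold_ms: float,
) -> List[str]:
    """Return hostnames of same-role peers whose latency exceeds threshold_ms."""
    failed = []
    for ip, latency_ms in latency_ms_by_ip.items():
        peer = inventory_by_ip.get(ip)
        if peer is None:
            # Inventory not reported yet: log a warning and skip this IP
            # instead of failing the whole validation.
            log.warning(
                "unable to determine host's role and hostname for IP: "
                "host with IP %s not found in inventory",
                ip,
            )
            continue
        if peer.role != host_role:
            # Assumption: latency is only evaluated between hosts of the same role.
            continue
        if latency_ms > threshold_ms:
            failed.append(peer.hostname)
    return sorted(failed)
```

With this shape, an IP whose inventory is missing never reaches the validation message shown in the UI. It only produces a log entry, and it is evaluated normally on a later run once its inventory is available.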
Fixed in https://github.com/openshift/assisted-service/pull/2053
Couldn't get validation messages in the Integration environment. Moving back to NEW for further investigation.

It looks like hosts are missing from the inventory. We noticed this repeated message in the logs:
level=warning msg="unable to determine host's role and hostname for IP: host with IP 192.168.127.10 not found in inventory" func="github.com/openshift/assisted-service/internal/host.(*validator).validateNetworkLatencyForRole" file="/go/src/github.com/openshift/origin/internal/host/validator.go:701" pkg=host-state time="2021-06-23T17:07:42Z"
Verified on Integration.
Covered by test: test_latency_master_and_workers
Verified on Staging OCP-Metal-v1.0.23.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759