Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1974085

Summary: [Assisted-4.8] [Staging][Network Latency] Worker host IP appear in master validation message
Product: OpenShift Container Platform
Reporter: Lital Alon <lalon>
Component: assisted-installer
Assignee: Jordi Gil <jgil>
assisted-installer sub component: assisted-service
QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA
Severity: urgent
Priority: high
CC: aos-bugs, jgil, pkliczew, rfreiman
Version: 4.8
Keywords: TestBlocker, Triaged
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Projects
Fixed In Version: OCP-Metal-v1.0.23.1
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2021-10-18 17:35:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Attachments:
Description    Flags
example        none
example 2      none

Description Lital Alon 2021-06-20 13:21:02 UTC
Created attachment 1792517: example

Description of problem:
I booted 3 masters and 3 workers and simulated the following network conditions:
no latency on master-0 (192.168.127.10)
network latency on master-1 (192.168.127.11)
packet loss on master-2 (192.168.127.12)
no latency on the workers (192.168.127.13-192.168.127.15)

The validation messages for master-1 are:
sufficient-network-latency-requirement-for-role: Error while attempting to validate network latency: host with IP 192.168.127.12 not found in inventory.
sufficient-packet-loss-requirement-for-role: Error while attempting to validate packet loss validation: host with IP 192.168.127.14 not found in inventory.

The issue is that 192.168.127.14 is a worker node IP, yet it appears in a master's validation message.
In addition, worker-0 fails network validation even though no latency was introduced on the worker nodes, and its validation message contains a master IP:
sufficient-packet-loss-requirement-for-role: Error while attempting to validate packet loss validation: host with IP 192.168.127.11 not found in inventory.

Network latency is calculated per role, so I expect no validation failures between workers and masters.
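
To illustrate the per-role expectation above, here is a minimal Go sketch. It is not the assisted-service implementation: the Role and Peer types, the slowPeersForRole function, the latency figures, and the 100 ms threshold are all hypothetical and chosen only for this example. It shows a check that considers only peers sharing the host's role, so a slow master never causes a failure on a worker.

```go
package main

import "fmt"

// Role and Peer are hypothetical types for this sketch; they are not the
// assisted-service data model.
type Role string

const (
	Master Role = "master"
	Worker Role = "worker"
)

// Peer describes one remote host as seen from the host being validated.
type Peer struct {
	IP        string
	Role      Role
	LatencyMs float64 // measured latency to this peer, in milliseconds
}

// slowPeersForRole returns the IPs of peers that share hostRole and whose
// latency exceeds thresholdMs. Peers with a different role are ignored.
func slowPeersForRole(hostRole Role, peers []Peer, thresholdMs float64) []string {
	var slow []string
	for _, p := range peers {
		if p.Role != hostRole {
			continue // a different role is outside this validation's scope
		}
		if p.LatencyMs > thresholdMs {
			slow = append(slow, p.IP)
		}
	}
	return slow
}

func main() {
	// Peers as seen from worker-0; the latency values are illustrative.
	peers := []Peer{
		{IP: "192.168.127.11", Role: Master, LatencyMs: 150},
		{IP: "192.168.127.14", Role: Worker, LatencyMs: 1},
		{IP: "192.168.127.15", Role: Worker, LatencyMs: 1},
	}
	// The threshold of 100 ms is an assumption made for this example.
	fmt.Println(len(slowPeersForRole(Worker, peers, 100))) // prints 0
}
```

With this kind of filtering, worker-0 would report no failing peers even though master-1 is slow, which matches the expected result described in this report.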


Version-Release number of selected component (if applicable):
Staging v1.0.22.1

How reproducible:
100%

Steps to Reproduce:
1. Boot 3 masters and 3 workers, then set the roles and the API and ingress VIPs.
2. Set network latency on master-1 (first command) and packet loss on master-2 (second command):
sudo tc qdisc add dev ens3 root netem delay 150ms
sudo tc qdisc add dev ens3 root netem loss 10%

3. Wait for the network latency validation failures.

Actual results:
Masters with latency got a wrong worker IP in the validation message.
Workers with no latency got validation failures containing master IPs.

Expected results:
Masters with latency get the correct worker IP in the validation message.
Workers with no latency get no validation failures.

Additional info:

Comment 1 Lital Alon 2021-06-20 13:22:35 UTC
Created attachment 1792518: example 2

Comment 2 Jordi Gil 2021-06-21 14:43:15 UTC
The issue comes from this line of code (in the packet loss validation):
https://github.com/openshift/assisted-service/blob/master/internal/host/validator.go#L768
It attempts to retrieve the hostname and role from the inventory based on the IP. The error shown in the UI appears because the DB does not yet have the inventory for that host.

To fix this, I will change the logic to record this as a warning in the logs and ignore that IP. Once the host inventory is reported in the DB, the validation will be able to report it.
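
The "warn and skip" approach described above can be sketched roughly as follows. This is a simplified illustration, not the code from validator.go: HostInfo, lookupByIP, lossyPeers, the map-based inventory, and the 0% loss limit are all hypothetical names and values introduced for this example. The point is only that a failed lookup is logged and the IP is skipped, rather than being returned as a validation error.

```go
package main

import (
	"fmt"
	"log"
	"sort"
)

// HostInfo is a hypothetical stand-in for the inventory data needed here.
type HostInfo struct {
	Hostname string
	Role     string
}

// lookupByIP resolves an IP to a host, or returns an error when the
// inventory for that host has not been reported yet.
func lookupByIP(inventory map[string]HostInfo, ip string) (HostInfo, error) {
	h, ok := inventory[ip]
	if !ok {
		return HostInfo{}, fmt.Errorf("host with IP %s not found in inventory", ip)
	}
	return h, nil
}

// lossyPeers returns the hostnames of same-role peers whose packet loss
// exceeds maxLossPct. An IP that cannot be resolved is logged as a warning
// and skipped instead of failing the whole validation.
func lossyPeers(role string, lossByIP map[string]float64, inventory map[string]HostInfo, maxLossPct float64) []string {
	var failing []string
	for ip, loss := range lossByIP {
		h, err := lookupByIP(inventory, ip)
		if err != nil {
			log.Printf("warning: unable to determine host's role and hostname for IP: %v", err)
			continue // retry on a later validation run, once the inventory exists
		}
		if h.Role == role && loss > maxLossPct {
			failing = append(failing, h.Hostname)
		}
	}
	sort.Strings(failing) // map iteration order is random; sort for stable output
	return failing
}

func main() {
	// The inventory for 192.168.127.14 has not been reported yet.
	inventory := map[string]HostInfo{
		"192.168.127.10": {Hostname: "master-0", Role: "master"},
		"192.168.127.12": {Hostname: "master-2", Role: "master"},
	}
	lossByIP := map[string]float64{
		"192.168.127.10": 0,
		"192.168.127.12": 10,
		"192.168.127.14": 0,
	}
	fmt.Println(lossyPeers("master", lossByIP, inventory, 0)) // prints [master-2]
}
```

One consequence of this design is that the validation result can change between runs as more inventories arrive, so a host that is skipped now may still be reported later.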

Comment 3 Jordi Gil 2021-06-21 21:36:49 UTC
Fixed in https://github.com/openshift/assisted-service/pull/2053

Comment 4 Lital Alon 2021-06-23 19:38:32 UTC
Couldn't get validation messages in the Integration environment; moving back to NEW for further investigation.
It looks like hosts are missing from the inventory. We noticed this message repeated in the logs:
level=warning msg="unable to determine host's role and hostname for IP: host with IP 192.168.127.10 not found in inventory" func="github.com/openshift/assisted-service/internal/host.(*validator).validateNetworkLatencyForRole" file="/go/src/github.com/openshift/origin/internal/host/validator.go:701" pkg=host-state
time="2021-06-23T17:07:42Z"

Comment 6 Lital Alon 2021-07-18 18:31:13 UTC
Verified on Integration
Covered by test: test_latency_master_and_workers

Comment 7 Lital Alon 2021-07-21 11:14:50 UTC
Verified on Staging OCP-Metal-v1.0.23.1

Comment 10 errata-xmlrpc 2021-10-18 17:35:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759