Bug 2059833

Summary: [ASH] router perf test failed with 400 services, some service can't be reached from pod: cannot resolve
Product: OpenShift Container Platform
Component: Networking
Sub component: DNS
Version: 4.10
Reporter: Qiujie Li <qili>
Assignee: aos-network-edge-staff <aos-network-edge-staff>
QA Contact: Hongan Li <hongli>
CC: mfisher, mmasters
Severity: unspecified
Priority: unspecified
Status: CLOSED WONTFIX
Type: Bug
Last Closed: 2022-12-19 20:50:39 UTC
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value

Description Qiujie Li 2022-03-02 07:12:21 UTC
Description of problem:
On an ASH cluster (OVN network type; Standard_DS4_v2 (8 vCPU, 28 GiB), the default type for ASH masters and workers), run the router-perf load test. It creates 4400 services, then creates an http-scale-client pod to load them. After the services are created successfully, some services cannot be reached from the http-scale-client pod (deployed by the perf-scale test to load the services created before) due to 'cannot resolve':
cannot resolve: http-perf-edge-40-http-scale-edge.apps.qili-ash-ovn.installer.redhat.wwtatc.com:443

From the http-scale-client pod, other services can be reached.

From outside of the pod, the problematic service http-perf-edge-40-http-scale-edge.apps.qili-ash-ovn.installer.redhat.wwtatc.com can be reached as well.

The problem resolves on its own some time later.
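The transient nature of the failure can be observed by repeatedly probing name resolution from inside the http-scale-client pod. A minimal sketch (assuming a POSIX shell and `getent` available in the pod image; the function name is hypothetical):

```shell
# probe_resolution HOST [ATTEMPTS] [DELAY]
# Try to resolve HOST a few times, printing the outcome of each attempt.
probe_resolution() {
  host=$1
  attempts=${2:-3}
  delay=${3:-1}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if getent hosts "$host" >/dev/null 2>&1; then
      echo "attempt $i: resolved"
    else
      echo "attempt $i: cannot resolve"
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
}

# Usage inside the pod, against the problematic route:
# probe_resolution http-perf-edge-40-http-scale-edge.apps.qili-ash-ovn.installer.redhat.wwtatc.com 5
```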

OpenShift release version: 4.10.0-0.nightly-2022-02-26-230022


Cluster Platform: Azure Stack Hub (ASH)


How reproducible: 
Reproduced several times on ASH.
Not reproduced on an AWS cluster (3 masters, 3 workers, OVN network) with the same build and the m5.xlarge (4 vCPU, 16 GiB) VM type.


Steps to Reproduce (in detail):
1. Install an ASH cluster (3 masters, 3 workers, OVN network type) at version 4.10 (I used the versioned-installer-ash_wwt template from Flexy)
2. git clone https://github.com/cloud-bulldozer/e2e-benchmarking.git
3. cd e2e-benchmarking/workloads/router-perf-v2
4. source env.sh, export TERMINATIONS=mix
5. run ./ingress-performance.sh


Actual results:
From the http-scale-client pod in the cluster (deployed by the perf-scale test to load the services created before), some services cannot be reached due to 'cannot resolve':

cannot resolve: http-perf-edge-40-http-scale-edge.apps.qili-ash-ovn.installer.redhat.wwtatc.com:443

From the http-scale-client pod, other services can be reached.

From outside of the pod, the problematic service can be reached.

The problem resolves on its own some time later.
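When checking manually from inside the client pod (for example via `oc rsh` plus `curl`), a name-resolution failure can be told apart from a connection failure by curl's exit code: curl documents exit code 6 as "couldn't resolve host" and 7 as "failed to connect". A small hypothetical helper to label the outcome:

```shell
# classify_curl_exit CODE
# Map a curl exit code to a short diagnosis (codes per curl's documentation).
classify_curl_exit() {
  case "$1" in
    0) echo "ok" ;;
    6) echo "cannot resolve" ;;
    7) echo "connection failed" ;;
    *) echo "other error ($1)" ;;
  esac
}

# Usage (hypothetical route host): probe, then classify the result.
# curl -sS -o /dev/null https://<route-host>:443; classify_curl_exit $?
```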

Expected results:
All services can be reached from inside the pod. 


Impact of the problem:


Additional info:
Must-gather and other debug information will be pasted in the following comments.


** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report.  You may also mark the bug private if you wish.

Comment 2 Miciah Dashiel Butler Masters 2022-03-02 15:51:15 UTC
Setting blocker- as this appears to be a scalability issue with a relatively new platform.  

Do we have any scalability data for ASH and OpenShift 4.9?  Because support for ASH is relatively new, I am assuming for now that this is not a regression, so we will investigate the failures when we are able to prioritize this BZ.

Comment 3 Qiujie Li 2022-03-03 02:45:48 UTC
> Do we have any scalability data for ASH and OpenShift 4.9?
No, 4.10 is the first release we have tested on ASH.

Another performance-related bug on ASH: https://bugzilla.redhat.com/show_bug.cgi?id=2059714

Comment 5 mfisher 2022-12-19 20:50:39 UTC
This issue is stale and has been closed because it has been open for 90 days or more with no noted activity/comments in the last 60 days. If this issue is crucial and still needs resolution, please open a new Jira issue and the engineering team will triage and prioritize accordingly.