Bug 1903687
| Summary: | [scale] 1K DV creation failed | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | guy chen <guchen> |
| Component: | Storage | Assignee: | Michael Henriksen <mhenriks> |
| Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.5.0 | CC: | alitke, cnv-qe-bugs, fdeutsch, mhenriks, ngavrilo, yadu |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-02 15:57:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
guy chen
2020-12-02 16:11:21 UTC
Guy, can you please provide some more information: what is the underlying storage? Do some of the failed pods eventually complete? Did you notice whether some of the failed pods were restarted and succeeded?

Storage is NFS on the NetApp server; at some point the failed pods do not recover and succeed. Jed is working on fixing bug 1903679, which I suspect has the same root cause. This is the fix we are testing: https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420. After we finish testing the fix I will check whether it also solves this issue.
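To see whether the stuck importer pods ever restart and complete, something along these lines could be run against the cluster during the scale test. This is only a sketch, not part of the original report; it assumes the kubernetes Python client, that the CDI importer pods carry the label app=containerized-data-importer, and that the DataVolumes were created in the default namespace.

```python
# Sketch: list CDI importer pods and show phase plus restart counts, to check
# whether "stuck" pods eventually restart and succeed.
# Assumptions: importer pods are labeled app=containerized-data-importer and
# live in the "default" namespace -- adjust for the actual test namespace.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="default",
    label_selector="app=containerized-data-importer",
)
for pod in pods.items:
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    print(f"{pod.metadata.name}: phase={pod.status.phase}, restarts={restarts}")
```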
Updates:
I have installed an internal web server in the lab and increased its session capacity, as shown below.
There is an improvement, but under heavy network activity the pods still get "stuck" at different stages.
# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
ServerLimit 40
StartServers 2
MaxClients 1000
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>
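For reference, with this worker MPM configuration the concurrency ceiling works out to ServerLimit × ThreadsPerChild = 40 × 25 = 1000 threads, which matches MaxClients 1000, so the web server should be able to serve up to roughly 1000 simultaneous import connections; MaxRequestsPerChild 0 means worker processes are never recycled.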
What sort of behavior would you expect in this situation? I believe the cluster is behaving as it should under extreme load. Even if we added rate limiting to CDI, you would always be able to load the cluster in a way that creates adverse conditions causing imports to stall or retry.

The problem is that the DV creations get "stuck" and do not continue. I would expect DV creation to continue, even if at a slow rate, and eventually complete.

Michael, the similar bug https://bugzilla.redhat.com/show_bug.cgi?id=1903679 was fixed by increasing the number of threads in the VMI controller. Should we consider something similar in CDI?

Do we know how many target PVCs were created? And pods?

247 DVs were created successfully. I didn't check the pod count; if we want to debug the system further, we can schedule a new performance lab run to reproduce the problem.

I get that 247 DVs succeeded, but in order to figure out where a bottleneck may be I need to know how many PVCs (PVCs may be created for unsuccessful DVs) and pods were created. Yeah, let's please try to reproduce the problem.

Guy, please let us know when you plan the reproduction so that Michael can observe the system behavior. Thanks.

Hey Guy, any update on this bug reproduction?

Still awaiting info, so deferring to 4.9.

@Michael Does this re-run satisfy us that this is not a bug?

@mhenriks Does the fix https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420, used to fix https://bugzilla.redhat.com/show_bug.cgi?id=1903679, also fix this? Can we move this to verified?

I have tested it and reviewed the results with Michael, and it succeeded (CNV version 4.8), so I am moving this to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104
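For a future reproduction, the PVC and pod counts requested above could be gathered with a short script along these lines. This is a sketch only, assuming the kubernetes Python client, CDI's cdi.kubevirt.io/v1beta1 API, and that the DataVolumes were created in the default namespace.

```python
# Sketch: count DataVolumes by phase and total PVCs/pods, to help locate the
# bottleneck (PVCs may exist for unsuccessful DVs).
# Assumptions: DataVolumes in the "default" namespace, CDI API group
# cdi.kubevirt.io/v1beta1 -- adjust for the actual cluster.
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()
core = client.CoreV1Api()

dvs = custom.list_namespaced_custom_object(
    group="cdi.kubevirt.io", version="v1beta1",
    namespace="default", plural="datavolumes",
)
phases = Counter(dv.get("status", {}).get("phase", "Unknown") for dv in dvs["items"])
print("DataVolumes by phase:", dict(phases))

pvcs = core.list_namespaced_persistent_volume_claim(namespace="default")
pods = core.list_namespaced_pod(namespace="default")
print("PVCs:", len(pvcs.items), "Pods:", len(pods.items))
```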