Description of problem:
I have run a 1K DV creation test with a Fedora PV. After 15 hours of running, only 247 out of 1K DVs have succeeded, so it is safe to say it does not work. Looking at the pods, I see that they are failing on "connection reset by peer".

Import pod description:

  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Normal   Pulled   81m (x29 over 4h38m)      kubelet  Container image "registry.redhat.io/container-native-virtualization/virt-cdi-importer@sha256:7867d3eb3a664af01bb48a7b266032c3a9a49871f4ac069c06d299d9ae7e8ded" already present on machine
  Warning  BackOff  2m31s (x506 over 4h29m)   kubelet  Back-off restarting failed container

Pod log:

  E1124 09:48:46.282592 1 prlimit.go:175] (0.00/100%) (1.00/100%) (2.00/100%) (3.01/100%)
  E1124 09:48:46.282603 1 prlimit.go:176] qemu-img: curl: Recv failure: Connection reset by peer
  qemu-img: error while reading sector 356992: Input/output error
  qemu-img: curl: Recv failure: Connection reset by peer
  qemu-img: error while reading sector 356864: Input/output error

Version-Release number of selected component (if applicable):

  NAME     VERSION     AVAILABLE  PROGRESSING  SINCE  STATUS
  version  4.6.0-rc.4  True       False        21d    Cluster version is 4.6.0-rc.4

  NAME                                     DISPLAY                   VERSION  REPLACES                                 PHASE
  kubevirt-hyperconverged-operator.v2.5.0  OpenShift Virtualization  2.5.0    kubevirt-hyperconverged-operator.v2.4.3  Succeeded

How reproducible:
Always

Steps to Reproduce:
1. Create 1K PVs
2. Create 1K DVs

Actual results:
Only 247 DVs were created successfully.

Expected results:
All 1K DVs are created successfully.

Additional info:
We have two directions:
- Bug 1900634 investigation - CrashLoopBackOff when creating multiple VMs from the same image.
- I have looked at the logs with Maya Rashish and we saw that the connection is reset immediately by the server. I suspect it is because we have reached Tomcat's limit (200 parallel HTTP connections), so I will install the image server in my lab so I can increase the limits and verify whether this is the issue.
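For illustration only, here is a minimal sketch of how a bulk DV creation like this could be driven; it is not the exact tooling used in this test. It assumes the cdi.kubevirt.io/v1beta1 DataVolume API, the Python kubernetes client, the "default" namespace, a hypothetical image URL, and 10Gi PVCs:

  # Minimal sketch: create N DataVolumes that import the same HTTP image.
  # Assumptions (not from this report): namespace "default", a hypothetical
  # Fedora qcow2 URL, and 10Gi PVCs; adjust to match the real environment.
  from kubernetes import client, config

  config.load_kube_config()
  api = client.CustomObjectsApi()

  IMAGE_URL = "http://image-server.example/fedora.qcow2"  # hypothetical URL

  def make_dv(name):
      return {
          "apiVersion": "cdi.kubevirt.io/v1beta1",
          "kind": "DataVolume",
          "metadata": {"name": name},
          "spec": {
              "source": {"http": {"url": IMAGE_URL}},
              "pvc": {
                  "accessModes": ["ReadWriteOnce"],
                  "resources": {"requests": {"storage": "10Gi"}},
              },
          },
      }

  for i in range(1000):
      api.create_namespaced_custom_object(
          group="cdi.kubevirt.io",
          version="v1beta1",
          namespace="default",
          plural="datavolumes",
          body=make_dv("fedora-dv-%04d" % i),
      )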
Guy, can you please provide some more information: what is the underlying storage? Do some of the failed pods eventually complete? Did you notice whether some of the failed pods were restarted and succeeded?
Storage is NFS on the NetApp server. At some point the failed pods no longer recover and succeed. Jed is working on fixing bug 1903679, which I suspect has the same root cause; this is the fix we are testing: https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420. After we finish testing the fix I will check whether it also solves this issue.
Hi Guy, any updates since comment 2?
Updates: I have installed an internal web server in the lab and increased its session capacity with the worker MPM configuration below (ServerLimit 40 x ThreadsPerChild 25 allows up to 1000 worker threads, matching MaxClients 1000). There is an improvement, but under heavy network activity the pods still get "stuck" at different stages.

  # worker MPM
  # StartServers: initial number of server processes to start
  # MaxClients: maximum number of simultaneous client connections
  # MinSpareThreads: minimum number of worker threads which are kept spare
  # MaxSpareThreads: maximum number of worker threads which are kept spare
  # ThreadsPerChild: constant number of worker threads in each server process
  # MaxRequestsPerChild: maximum number of requests a server process serves
  <IfModule mpm_worker_module>
      ServerLimit          40
      StartServers          2
      MaxClients         1000
      MinSpareThreads      25
      MaxSpareThreads      75
      ThreadsPerChild      25
      MaxRequestsPerChild   0
  </IfModule>
What sort of behavior would you expect in this situation? I believe that the cluster is behaving as it should under extreme load. Even if we added rate-limiting to CDI you'll always be able to load the cluster in a way so as to create adverse conditions that cause imports to stall or retry.
The problem is that the DV creations get "stuck" and do not continue. I would expect DV creation to continue, even at a slow rate, and eventually complete.
Michael, the similar bug https://bugzilla.redhat.com/show_bug.cgi?id=1903679 was fixed by increasing the number of threads in the VMI controller. Should we consider something similar in CDI?
Do we know how many target PVCs were created? And Pods?
247 DVs were created successfully. I didn't check the number of pods. If we want to debug the system further, we can schedule a new performance lab run to reproduce the problem.
I get that 247 DVs succeeded. But in order to figure out where a bottleneck may be I need to know how many PVCs (PVCs may be created for unsuccessful DVs) and pods were created. Yeah, let's please try to reproduce the problem.
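For reference, a minimal sketch of how those numbers could be gathered, assuming the Python kubernetes client and that the test runs in the "default" namespace (both assumptions, not details from this report):

  # Minimal sketch: count PVCs and pods per phase in the test namespace.
  from collections import Counter
  from kubernetes import client, config

  config.load_kube_config()
  v1 = client.CoreV1Api()
  namespace = "default"  # assumed test namespace

  pvc_phases = Counter(
      pvc.status.phase
      for pvc in v1.list_namespaced_persistent_volume_claim(namespace).items
  )
  pod_phases = Counter(
      pod.status.phase for pod in v1.list_namespaced_pod(namespace).items
  )

  print("PVCs:", sum(pvc_phases.values()), dict(pvc_phases))
  print("Pods:", sum(pod_phases.values()), dict(pod_phases))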
Guy, please let us know when you plan the reproduction so that Michael can observe the system behavior. Thanks.
Hey Guy, any update on reproducing this bug?
Still awaiting info so deferring to 4.9.
@Michael Does this re-run satisfy us that this is not a bug?
@mhenriks Does the fix https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420 used to fix https://bugzilla.redhat.com/show_bug.cgi?id=1903679 also fix this? Can we move this to verified?
I have tested it and reviewed the results with Michael, and it succeeded (CNV version 4.8); moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4104