Bug 1903687
| Summary: | [scale] 1K DV creation failed | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | guy chen <guchen> |
| Component: | Storage | Assignee: | Michael Henriksen <mhenriks> |
| Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.5.0 | CC: | alitke, cnv-qe-bugs, fdeutsch, mhenriks, ngavrilo, yadu |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-02 15:57:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
guy chen
2020-12-02 16:11:21 UTC
Guy, can you please provide some more information: what is the underlying storage? Do some of the failed pods eventually complete? Did you notice whether some of the failed pods were restarted and succeeded?

Storage is NFS on the NetApp server; at some point the failed pods do not recover and succeed. Jed is working on fixing bug 1903679, which I suspect has the same root cause. This is the fix we are testing: https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420. After we finish testing the fix I will check whether it also solves this issue.
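To see whether the stuck importer pods ever restart and complete, something along these lines could be run against the cluster during the scale test. This is only a sketch, not part of the original report; it assumes the kubernetes Python client, that the CDI importer pods carry the label app=containerized-data-importer, and that the DataVolumes were created in the default namespace.

```python
# Sketch: list CDI importer pods and show phase plus restart counts, to check
# whether "stuck" pods eventually restart and succeed.
# Assumptions: importer pods are labeled app=containerized-data-importer and
# live in the "default" namespace -- adjust for the actual test namespace.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="default",
    label_selector="app=containerized-data-importer",
)
for pod in pods.items:
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    print(f"{pod.metadata.name}: phase={pod.status.phase}, restarts={restarts}")
```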
Updates:
I have installed an internal web server in the lab and increased its session capacity, as shown below.
There is an improvement, but under heavy network activity the pods still get "stuck" at different stages.
# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
ServerLimit 40
StartServers 2
MaxClients 1000
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>
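For reference, with this worker MPM configuration the concurrency ceiling works out to ServerLimit × ThreadsPerChild = 40 × 25 = 1000 threads, which matches MaxClients 1000, so the web server should be able to serve up to roughly 1000 simultaneous import connections; MaxRequestsPerChild 0 means worker processes are never recycled.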
What sort of behavior would you expect in this situation? I believe the cluster is behaving as it should under extreme load. Even if we added rate limiting to CDI, you would always be able to load the cluster in a way that creates adverse conditions causing imports to stall or retry.

The problem is that the DV creations get "stuck" and do not continue. I would expect DV creation to continue, even if at a slow rate, and eventually complete.

Michael, the similar bug https://bugzilla.redhat.com/show_bug.cgi?id=1903679 was fixed by increasing the number of threads in the VMI controller. Should we consider something similar in CDI?

Do we know how many target PVCs were created? And pods?

247 DVs were created successfully. I didn't check the pod count; if we want to debug the system further, we can schedule a new performance lab run to reproduce the problem.

I get that 247 DVs succeeded, but in order to figure out where a bottleneck may be I need to know how many PVCs (PVCs may be created for unsuccessful DVs) and pods were created. Yeah, let's please try to reproduce the problem.

Guy, please let us know when you plan the reproduction so that Michael can observe the system behavior. Thanks.

Hey Guy, any update on this bug reproduction?

Still awaiting info, so deferring to 4.9.

@Michael Does this re-run satisfy us that this is not a bug?

@mhenriks Does the fix https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420, used to fix https://bugzilla.redhat.com/show_bug.cgi?id=1903679, also fix this? Can we move this to verified?

I have tested it and reviewed the results with Michael, and it succeeded (CNV version 4.8), so I am moving this to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104
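For a future reproduction, the PVC and pod counts requested above could be gathered with a short script along these lines. This is a sketch only, assuming the kubernetes Python client, CDI's cdi.kubevirt.io/v1beta1 API, and that the DataVolumes were created in the default namespace.

```python
# Sketch: count DataVolumes by phase and total PVCs/pods, to help locate the
# bottleneck (PVCs may exist for unsuccessful DVs).
# Assumptions: DataVolumes in the "default" namespace, CDI API group
# cdi.kubevirt.io/v1beta1 -- adjust for the actual cluster.
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()
core = client.CoreV1Api()

dvs = custom.list_namespaced_custom_object(
    group="cdi.kubevirt.io", version="v1beta1",
    namespace="default", plural="datavolumes",
)
phases = Counter(dv.get("status", {}).get("phase", "Unknown") for dv in dvs["items"])
print("DataVolumes by phase:", dict(phases))

pvcs = core.list_namespaced_persistent_volume_claim(namespace="default")
pods = core.list_namespaced_pod(namespace="default")
print("PVCs:", len(pvcs.items), "Pods:", len(pods.items))
```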