Description of problem:
On recent OpenShift nightlies, simple image pulls (Fedora) simply do not converge unless the memory limit on CDI pods is raised to ridiculous values (1600M), suggesting that memory throttling may be taking place on the importer pod.

Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-08-28-154013
CNV v4.14.0.rhel9-1796

How reproducible:
100%

Steps to Reproduce:
1. Create DV

Actual results:
Basically never converges

Expected results:
Success in a timely manner

Additional info:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
  name: test-dv-node-import-needs-convert
spec:
  source:
    http:
      url: http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 12Gi

Edit HCO.spec with the following to observe how the issue is alleviated:

resourceRequirements:
  storageWorkloads:
    limits:
      cpu: 750m
      memory: 1600M
    requests:
      cpu: 100m
      memory: 60M

Some inspection of the same issue on GCP clusters importing a Windows image showed high memory usage values (though not as high as the limit) - attached to the bug.

Some notes:
- Is it possible the entire image stays in the page cache?
- Note this is before qemu-img convert
- Why did OOMs/throttles not happen before, say, in 4.14.0-ec.3?
- For some images, 2x the CDI pod limits unclogs the import; limits have to go a lot higher for large images (Windows) to work, though
- cgroupsv2 is the default now (throttles instead of OOM - https://kubernetes.io/blog/2021/11/26/qos-memory-resources/)
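One way to sanity-check the throttling/page-cache theory (a hedged sketch; the importer pod name and namespace below are placeholders, not taken from this bug) is to read the cgroupsv2 memory interface files from inside the importer pod while the import is running:

$ oc exec -n <dv-namespace> <importer-pod> -- \
    sh -c 'cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max'

# "file" in memory.stat is page cache charged to the container's cgroup;
# if it climbs toward memory.max during the download, the image data is
# being held in the cache.
$ oc exec -n <dv-namespace> <importer-pod> -- \
    grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat

# On cgroupsv2 the kernel reclaims/throttles at the limit instead of
# OOM-killing, so a rising "high" counter here would line up with the stalls.
$ oc exec -n <dv-namespace> <importer-pod> -- cat /sys/fs/cgroup/memory.events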
Created attachment 1986187 [details] windows import pod metrics
Here is the simplest reproducer, not using CNV at all, just curl.
Just pull an image bigger than the mem limit and curl will hang at some point after reaching page cache == mem limit.

$ oc exec -i -n default test -- curl http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 -o /disk/image.qcow2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 43 1403M   43  614M    0     0  2468k      0  0:09:42  0:04:15  0:05:27  106k

(Notice the avg dload speed)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc-ocs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 12Gi
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
  containers:
  - image: quay.io/centos/centos:stream9
    command: ["sleep", "3600000"]
    resources:
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
    name: test
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: mypvc
      mountPath: /disk
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: simple-pvc-ocs
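A complementary experiment (a hedged sketch using the same test pod and PVC as above, not something that was run for this bug): writing the download with O_DIRECT keeps the data out of the container's page cache, so if this variant does not stall under the same 600M limit, it points at page-cache accounting rather than the transfer itself.

# curl streams to stdout; dd writes with O_DIRECT so the data bypasses the
# page cache. iflag=fullblock keeps the 1M blocks full/aligned; GNU dd is
# expected to drop O_DIRECT for a final short block, but that is an assumption.
$ oc exec -i -n default test -- sh -c \
    'curl -s http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 | dd of=/disk/image.qcow2 bs=1M iflag=fullblock oflag=direct'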
Is this bug related to https://github.com/kubevirt/containerized-data-importer/issues/2838 ?
(In reply to Dominik Holler from comment #3)
> Is this bug related to
> https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Yes.

Note that Azure/GCP have nothing to do with this; comment #2 is reproducible on a local kubevirtci env with cgroupsv2.
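For anyone correlating this with the cgroups version, a quick check (the node name is a placeholder) is to look at the filesystem type of the unified cgroup mount on the node; cgroup2fs means cgroupsv2:

$ oc debug node/<node-name> -- chroot /host stat -fc %T /sys/fs/cgroup
cgroup2fs    # cgroupsv2 (unified hierarchy); tmpfs would indicate cgroupsv1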
Peter, can you approve this as a blocker?
Alex, can you please link to a Jira tracker that covers getting this fix into OCP 4.14.0?
Verified on: OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2100.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6817