Bug 2236223 - Importer very slow to pull images, possibly mem throttled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.14.0
Assignee: Alex Kalenyuk
QA Contact: dalia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-30 17:44 UTC by Alex Kalenyuk
Modified: 2024-03-08 04:26 UTC (History)
13 users

Fixed In Version: OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2082
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 14:06:16 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
windows import pod metrics (32.45 KB, image/png)
2023-08-30 17:45 UTC, Alex Kalenyuk


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-32676 0 None None None 2023-09-06 12:33:18 UTC
Red Hat Issue Tracker OCPBUGS-18965 0 None None None 2023-09-13 18:29:09 UTC
Red Hat Product Errata RHSA-2023:6817 0 None None None 2023-11-08 14:06:27 UTC

Description Alex Kalenyuk 2023-08-30 17:44:07 UTC
Description of problem:
On recent OpenShift nightlies, simple image pulls (Fedora) will simply not converge
unless the memory limit on CDI pods is raised to unusually high values (1600M),
suggesting that memory throttling is taking place on the importer pod.

Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-08-28-154013
CNV v4.14.0.rhel9-1796

How reproducible:
100%

Steps to Reproduce:
1. Create DV

Actual results:
The import basically never converges

Expected results:
Success in a timely manner

Additional info:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
  name: test-dv-node-import-needs-convert
spec:
  source:
    http:
      url: http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 12Gi


Edit HCO.spec with 
resourceRequirements:
    storageWorkloads:
      limits:
        cpu: 750m
        memory: 1600M
      requests:
        cpu: 100m
        memory: 60M
to observe that the issue is alleviated.

Some inspection of the same issue on GCP clusters importing a Windows image
showed high memory usage values (though not as high as the limit); the metrics are attached to the bug.

Some notes:
- Is it possible the entire image stays in the page cache?
  - Note this is before qemu-img convert
  - Why did OOMs/throttles not happen before, say, in 4.14.0-ec.3?
- For some images, doubling the CDI pod limits unclogs the import; limits have to go a lot higher for large images (Windows) to work, though
- cgroup v2 is the default now (it throttles instead of OOM-killing - https://kubernetes.io/blog/2021/11/26/qos-memory-resources/)
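The page-cache hypothesis above can be checked from inside the pod: under cgroup v2 the memory charge is split into anonymous and file-backed pages in memory.stat. A minimal sketch of that check, assuming a Linux node with cgroup v2; the helper names are illustrative and not part of CDI:

```python
def parse_memory_stat(text):
    """Parse the flat "key value" format of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats

def cache_dominates(stats, threshold=0.8):
    """True when file-backed (page cache) memory dominates the charge,
    matching the 'entire image stays in the page cache' hypothesis."""
    total = stats.get("anon", 0) + stats.get("file", 0)
    return total > 0 and stats.get("file", 0) / total >= threshold

# In a throttled importer pod one would read /sys/fs/cgroup/memory.stat;
# a sample payload stands in for it here:
sample = "anon 52428800\nfile 524288000\nkernel_stack 163840"
stats = parse_memory_stat(sample)
print(cache_dominates(stats))  # True: page cache dominates the charge
```

If "file" dominates while "anon" stays small, the limit is being consumed by cached image data rather than by the importer process itself.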

Comment 1 Alex Kalenyuk 2023-08-30 17:45:40 UTC
Created attachment 1986187 [details]
windows import pod metrics

Comment 2 Alex Kalenyuk 2023-09-03 12:58:23 UTC
Here is the simplest reproducer, not using CNV at all, just curl:
pull an image bigger than the memory limit, and curl will hang at some point after the page cache reaches the memory limit.

$ oc exec -i -n default test -- curl http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 -o /disk/image.qcow2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 43 1403M   43  614M    0     0  2468k      0  0:09:42  0:04:15  0:05:27  106k
(Note the average download speed)
 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc-ocs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 12Gi
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
  containers:
  - image: quay.io/centos/centos:stream9
    command: ["sleep", "3600000"]
    resources:
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
    name: test
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: mypvc
      mountPath: /disk
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: simple-pvc-ocs
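One mitigation direction for the page-cache buildup shown above is to drop the written pages from the cache as the transfer proceeds, so the file data never accumulates against the pod's memory limit. The sketch below only illustrates that technique with Linux-only os.posix_fadvise; it is an assumption for discussion, not CDI's actual fix:

```python
import os

def write_dropping_cache(chunks, dest_path):
    """Stream an iterable of byte chunks to dest_path; after each chunk is
    flushed, tell the kernel the written pages will not be needed again so
    the page cache charged to the cgroup stays bounded (Linux-only)."""
    fd = os.open(dest_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        for chunk in chunks:
            written = os.write(fd, chunk)
            os.fsync(fd)  # pages must be clean before they can be dropped
            os.posix_fadvise(fd, offset, written, os.POSIX_FADV_DONTNEED)
            offset += written
    finally:
        os.close(fd)
    return offset
```

Feeding this the chunks of an HTTP download (in place of the plain curl above) keeps the file-backed memory charge roughly one chunk in size instead of the whole image.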

Comment 3 Dominik Holler 2023-09-05 06:16:05 UTC
Is this bug related to https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Comment 4 Alex Kalenyuk 2023-09-05 08:15:47 UTC
(In reply to Dominik Holler from comment #3)
> Is this bug related to
> https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Yes
Note Azure/GCP have nothing to do with this; comment #2 is reproducible on a local kubevirtci environment with cgroup v2

Comment 5 dalia 2023-09-06 12:30:43 UTC
Peter, can you approve the blocker?

Comment 8 dalia 2023-09-13 12:45:16 UTC
Alex, can you please link to a Jira tracker that covers getting this fix into OCP 4.14.0?

Comment 12 dalia 2023-09-28 08:52:29 UTC
Verified on: OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2100.

Comment 14 errata-xmlrpc 2023-11-08 14:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817

Comment 15 Red Hat Bugzilla 2024-03-08 04:26:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

