Bug 2236223

Summary: Importer very slow to pull images, possibly mem throttled
Product: Container Native Virtualization (CNV)    Reporter: Alex Kalenyuk <akalenyu>
Component: Storage    Assignee: Alex Kalenyuk <akalenyu>
Status: CLOSED ERRATA    QA Contact: dalia <dafrank>
Severity: urgent    Docs Contact:
Priority: urgent
Version: 4.14.0    CC: alitke, boukhano, cye, dafrank, dholler, jpeimer, ksimon, leidwang, llong, pelauter, phou, sdodson, sfroberg
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2082
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-11-08 14:06:16 UTC
Type: Bug
Attachments:
windows import pod metrics

Description Alex Kalenyuk 2023-08-30 17:44:07 UTC
Description of problem:
On recent OpenShift nightlies, simple image imports (Fedora) will simply not converge
unless the memory limit on the CDI pods is raised to unreasonably high values (1600M),
suggesting that memory throttling may be taking place on the importer pod.

Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-08-28-154013
CNV v4.14.0.rhel9-1796

How reproducible:
100%

Steps to Reproduce:
1. Create DV

Actual results:
The import basically never converges

Expected results:
Success in a timely manner

Additional info:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
  name: test-dv-node-import-needs-convert
spec:
  source:
    http:
      url: http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 12Gi


Edit HCO.spec with the following to observe that the issue is alleviated:
resourceRequirements:
  storageWorkloads:
    limits:
      cpu: 750m
      memory: 1600M
    requests:
      cpu: 100m
      memory: 60M
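As a sketch, assuming the HyperConverged CR uses the usual default name `kubevirt-hyperconverged` in the `openshift-cnv` namespace (neither is stated in this bug), the HCO.spec edit above could be applied with a merge patch:

```shell
# Hypothetical sketch: raise the CDI storage-workload resource limits
# via the HyperConverged CR. CR name and namespace are assumed defaults.
oc patch hco kubevirt-hyperconverged -n openshift-cnv --type merge -p '
spec:
  resourceRequirements:
    storageWorkloads:
      limits:
        cpu: 750m
        memory: 1600M
      requests:
        cpu: 100m
        memory: 60M
'
```

Requires a live cluster with CNV installed; shown only to make the reproduction steps concrete.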

Some inspection of the same issue on GCP clusters importing a Windows image
showed high memory usage values (though not as high as the limit) - metrics attached to the bug

Some notes:
- Is it possible the entire image stays in the page cache?
  - Note this happens before qemu-img convert
  - Why did OOMs/throttles not happen before, say, in 4.14.0-ec.3?
- For some images, doubling the CDI pod limits unclogs the import;
  for large images (Windows), the limits have to go a lot higher to work though
- cgroupsv2 is the default now (it throttles instead of OOM-killing - https://kubernetes.io/blog/2021/11/26/qos-memory-resources/)
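One way to check the page-cache theory directly (a sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup inside the container; the pod name below is illustrative, not taken from this bug):

```shell
# Sketch: inspect cgroup v2 memory accounting inside the importer pod.
# Pod name and namespace are hypothetical placeholders.
POD=importer-test-dv-node-import-needs-convert
oc exec -n default "$POD" -- sh -c '
  echo "current: $(cat /sys/fs/cgroup/memory.current)"
  echo "max:     $(cat /sys/fs/cgroup/memory.max)"
  grep -E "^(file|inactive_file|active_file) " /sys/fs/cgroup/memory.stat
'
```

If `file` accounts for most of `memory.current` and `memory.current` is pinned at `memory.max`, the limit is being consumed by page cache rather than anonymous memory, which would match throttling rather than an OOM kill.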

Comment 1 Alex Kalenyuk 2023-08-30 17:45:40 UTC
Created attachment 1986187 [details]
windows import pod metrics

Comment 2 Alex Kalenyuk 2023-09-03 12:58:23 UTC
Here is the simplest reproducer, not using CNV at all, just curl:
pull an image bigger than the memory limit, and curl will hang at some point after the page cache reaches the memory limit.

$ oc exec -i -n default test -- curl http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 -o /disk/image.qcow2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 43 1403M   43  614M    0     0  2468k      0  0:09:42  0:04:15  0:05:27  106k
(Notice avg dload)
 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc-ocs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 12Gi
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
  containers:
  - image: quay.io/centos/centos:stream9
    command: ["sleep", "3600000"]
    resources:
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
    name: test
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: mypvc
      mountPath: /disk
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: simple-pvc-ocs
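The manifests above can be applied and the hang observed with something like the following sketch (the image URL is elided in the original, so it is left as a placeholder variable; `reproducer.yaml` is an assumed filename holding the PVC and pod manifests above):

```shell
# Sketch: run the CNV-free reproducer from this comment.
oc apply -f reproducer.yaml
oc wait --for=condition=Ready pod/test -n default
# Pull a file larger than the 600M memory limit; the transfer rate
# collapses once the page cache fills up to the limit.
oc exec -i -n default test -- curl "$IMAGE_URL" -o /disk/image.qcow2
```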

Comment 3 Dominik Holler 2023-09-05 06:16:05 UTC
Is this bug related to https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Comment 4 Alex Kalenyuk 2023-09-05 08:15:47 UTC
(In reply to Dominik Holler from comment #3)
> Is this bug related to
> https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Yes
Note Azure/GCP have nothing to do with this; comment #2 is reproducible on a kubevirtci local env with cgroupsv2

Comment 5 dalia 2023-09-06 12:30:43 UTC
Peter, can you approve blocker?

Comment 8 dalia 2023-09-13 12:45:16 UTC
Alex, can you please link to a tracker in Jira that covers getting this fix into OCP 4.14.0?

Comment 12 dalia 2023-09-28 08:52:29 UTC
Verified on: OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2100.

Comment 14 errata-xmlrpc 2023-11-08 14:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817

Comment 15 Red Hat Bugzilla 2024-03-08 04:26:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days