Description of problem:
On recent OpenShift nightlies, simple image pulls (Fedora) simply do not converge unless the memory limit on CDI pods is raised to ridiculous values (1600M), suggesting that memory throttling may be taking place on the importer pod.

Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-08-28-154013
CNV v4.14.0.rhel9-1796

How reproducible:
100%

Steps to Reproduce:
1. Create DV

Actual results:
Basically never converges

Expected results:
Success in a timely manner

Additional info:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
  name: test-dv-node-import-needs-convert
spec:
  source:
    http:
      url: http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 12Gi

Edit HCO.spec with the following to observe how the issue is alleviated:

resourceRequirements:
  storageWorkloads:
    limits:
      cpu: 750m
      memory: 1600M
    requests:
      cpu: 100m
      memory: 60M

Some inspection of the same issue on GCP clusters importing a Windows image showed high memory usage values (though not as high as the limit) - attached to the bug.

Some notes:
- Is it possible the entire image stays in the page cache?
- Note this is before qemu-img convert
- Why did OOMs/throttles not happen before, say, in 4.14.0-ec.3?
- For some images, 2x the CDI pod limits unclogs the import; limits have to go a lot higher for large images (Windows) to work, though
- cgroupsv2 is the default now (throttles instead of OOM - https://kubernetes.io/blog/2021/11/26/qos-memory-resources/)
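One way to sanity-check the throttling/page-cache theory (a hedged sketch; the importer pod name and namespace below are placeholders, not taken from this bug) is to read the cgroupsv2 memory interface files from inside the importer pod while the import is running:

$ oc exec -n <dv-namespace> <importer-pod> -- \
    sh -c 'cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max'

# "file" in memory.stat is page cache charged to the container's cgroup;
# if it climbs toward memory.max during the download, the image data is
# being held in the cache.
$ oc exec -n <dv-namespace> <importer-pod> -- \
    grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat

# On cgroupsv2 the kernel reclaims/throttles at the limit instead of
# OOM-killing, so a rising "high" counter here would line up with the stalls.
$ oc exec -n <dv-namespace> <importer-pod> -- cat /sys/fs/cgroup/memory.events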
Created attachment 1986187 [details] windows import pod metrics
Here is the simplest reproducer, not using CNV at all, just curl.
Just pull an image bigger than the mem limit and curl will hang at some point after reaching page cache == mem limit.

$ oc exec -i -n default test -- curl http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 -o /disk/image.qcow2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 43 1403M   43  614M    0     0  2468k      0  0:09:42  0:04:15  0:05:27  106k

(Notice the avg dload speed)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc-ocs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 12Gi
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
  containers:
  - image: quay.io/centos/centos:stream9
    command: ["sleep", "3600000"]
    resources:
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
    name: test
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: mypvc
      mountPath: /disk
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: simple-pvc-ocs
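A complementary experiment (a hedged sketch using the same test pod and PVC as above, not something that was run for this bug): writing the download with O_DIRECT keeps the data out of the container's page cache, so if this variant does not stall under the same 600M limit, it points at page-cache accounting rather than the transfer itself.

# curl streams to stdout; dd writes with O_DIRECT so the data bypasses the
# page cache. iflag=fullblock keeps the 1M blocks full/aligned; GNU dd is
# expected to drop O_DIRECT for a final short block, but that is an assumption.
$ oc exec -i -n default test -- sh -c \
    'curl -s http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 | dd of=/disk/image.qcow2 bs=1M iflag=fullblock oflag=direct'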
Is this bug related to https://github.com/kubevirt/containerized-data-importer/issues/2838 ?
(In reply to Dominik Holler from comment #3)
> Is this bug related to
> https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Yes.

Note that Azure/GCP have nothing to do with this; comment #2 is reproducible on a local kubevirtci env with cgroupsv2.
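For anyone correlating this with the cgroups version, a quick check (the node name is a placeholder) is to look at the filesystem type of the unified cgroup mount on the node; cgroup2fs means cgroupsv2:

$ oc debug node/<node-name> -- chroot /host stat -fc %T /sys/fs/cgroup
cgroup2fs    # cgroupsv2 (unified hierarchy); tmpfs would indicate cgroupsv1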
Peter, can you approve this as a blocker?
Alex, can you please link to a Jira tracker that covers getting this fix into OCP 4.14.0?
Verified on: OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2100.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6817