Bug 2236223
| Summary: | Importer very slow to pull images, possibly mem throttled | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Alex Kalenyuk <akalenyu> |
| Component: | Storage | Assignee: | Alex Kalenyuk <akalenyu> |
| Status: | CLOSED ERRATA | QA Contact: | dalia <dafrank> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.14.0 | CC: | alitke, boukhano, cye, dafrank, dholler, jpeimer, ksimon, leidwang, llong, pelauter, phou, sdodson, sfroberg |
| Target Milestone: | --- | | |
| Target Release: | 4.14.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | OCP 4.14.0-rc.2, CNV v4.14.0.rhel9-2082 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-08 14:06:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Created attachment 1986187: windows import pod metrics

Here is the simplest reproducer, not using CNV at all, just curl

Just pull an image bigger than mem limit and curl will hang at some point after reaching page cache==mem limit

    $ oc exec -i -n default test -- curl http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2 -o /disk/image.qcow2
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
     43 1403M   43  614M    0     0  2468k      0  0:09:42  0:04:15  0:05:27  106k

(Notice avg dload)

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: simple-pvc-ocs
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 12Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: null
      labels:
        run: test
      name: test
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
      containers:
      - image: quay.io/centos/centos:stream9
        command: ["sleep", "3600000"]
        resources:
          limits:
            cpu: 750m
            memory: 600M
          requests:
            cpu: 100m
            memory: 60M
        name: test
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: mypvc
          mountPath: /disk
      volumes:
      - name: mypvc
        persistentVolumeClaim:
          claimName: simple-pvc-ocs

Is this bug related to https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

(In reply to Dominik Holler from comment #3)
> Is this bug related to
> https://github.com/kubevirt/containerized-data-importer/issues/2838 ?

Yes

Note Azure/GCP have nothing to do with this; comment #2 is reproducible on a kubevirtci local env with cgroupsv2

Peter can you approve blocker

Alex, can you please link to a tracker in jira that cover getting this fix to ocp 4.14.0.

verified on: OCP-4.14.0-rc.2 CNV-v4.14.0.rhel9-2100.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

Description of problem:
On recent OpenShift nightlies, simple image pulls (Fedora) do not converge unless the memory limit on CDI pods is raised to very high values (1600M), suggesting that memory throttling may be taking place on the importer pod.

Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-08-28-154013
CNV v4.14.0.rhel9-1796

How reproducible:
100%

Steps to Reproduce:
1. Create DV

Actual results:
Basically never converges

Expected results:
Success in a timely manner

Additional info:

    apiVersion: cdi.kubevirt.io/v1beta1
    kind: DataVolume
    metadata:
      annotations:
        cdi.kubevirt.io/storage.bind.immediate.requested: "true"
      name: test-dv-node-import-needs-convert
    spec:
      source:
        http:
          url: http://.../Fedora-Cloud-Base-35-1.2.x86_64.qcow2
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 12Gi

To observe how the issue is alleviated, edit HCO.spec with:

    resourceRequirements:
      storageWorkloads:
        limits:
          cpu: 750m
          memory: 1600M
        requests:
          cpu: 100m
          memory: 60M

Some inspection of the same issue on GCP clusters importing a Windows image showed high mem usage values (though not as high as the limit) - attached to the bug

Some notes:
- Is it possible the entire image stays on the page cache?
- Note this is before qemu-img convert
- Why did OOMs/throttles not happen before, say, in 4.14.0-ec.3?
- For some images, doubling the CDI pod limits unclogs the import; the limits have to go a lot higher for large images (Windows) to work, though
- cgroupsv2 is default now (throttles instead of OOM - https://kubernetes.io/blog/2021/11/26/qos-memory-resources/)
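
One way to check the page cache question from the notes is to compare the cgroup v2 memory counters inside the container while the download runs. The following is only a sketch, not something taken from this bug's investigation: the function name `cgroup_mem_report` is made up for illustration, and it assumes the container sees its own cgroup v2 directory at `/sys/fs/cgroup` with the standard `memory.max`, `memory.current`, `memory.stat` and `memory.events` files.

```shell
#!/bin/sh
# Hypothetical helper (not part of CDI or OpenShift): print page cache usage
# next to the memory limit for a cgroup v2 directory.
# Usage: cgroup_mem_report [CGROUP_DIR]   (defaults to /sys/fs/cgroup)
cgroup_mem_report() {
    dir="${1:-/sys/fs/cgroup}"

    # memory.max holds the hard limit in bytes, or the literal "max" if unlimited.
    limit=$(cat "$dir/memory.max")
    # memory.current is the total memory charged to the cgroup, page cache included.
    current=$(cat "$dir/memory.current")
    # The "file" line of memory.stat is the page cache charged to this cgroup.
    file_cache=$(awk '$1 == "file" {print $2}' "$dir/memory.stat")
    # memory.events counts how often the high and max boundaries were hit.
    high_events=$(awk '$1 == "high" {print $2}' "$dir/memory.events")
    max_events=$(awk '$1 == "max" {print $2}' "$dir/memory.events")

    echo "limit=$limit current=$current file=$file_cache high=$high_events max=$max_events"
}
```

If `file` climbs toward `limit` and the `max` counter keeps increasing while the curl transfer rate drops, that would be consistent with the page cache filling the memory limit. Calling `cgroup_mem_report` with no argument reads the default path; passing a directory makes it usable against a copied set of files.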