Bug 1922215
| Summary: | Thin provisioned disk stopped to get extended | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Jean-Louis Dupond <jean-louis> | ||||
| Component: | General | Assignee: | Vojtech Juranek <vjuranek> | ||||
| Status: | CLOSED WORKSFORME | QA Contact: | Avihai <aefrat> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 4.40.40 | CC: | bugs, eshenitz, sfishbai, sleviim, tnisan, vjuranek | ||||
| Target Milestone: | ovirt-4.4.6 | Keywords: | Reopened | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-04-12 14:27:40 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
Hi Shani, Can you please take a look at this ? The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again. Hi, Please supply a full engine log, and a screenshot if possible. (I'm afraid the VDSM is not enough) Also, please supply the following: - please describe the full scenario and how to reproduce - does it happens all the time? - did you try it on nfs? - might be a disk storage issue, can you share your df -h Thanks Hi Shani, Engine logs doesn't give much info, this as all logic happens afaik on VDSM side. 2021-01-29 13:12:40,810+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused. 2021-01-29 13:12:40,816+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error. 2021-01-29 13:12:42,264+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up' 2021-01-29 13:12:42,329+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up. 2021-01-29 13:13:06,686+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused' 2021-01-29 13:13:06,698+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused. 2021-01-29 13:13:06,712+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error. 2021-01-29 13:13:07,397+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up' 2021-01-29 13:13:07,403+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up. 2021-01-29 13:13:31,847+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused' 2021-01-29 13:13:31,854+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused. 2021-01-29 13:13:31,860+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error. 2021-01-29 13:13:33,260+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up' 2021-01-29 13:13:33,269+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up. 2021-01-29 13:14:43,714+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused' 2021-01-29 13:14:43,721+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused. 2021-01-29 13:14:43,728+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error. 2021-01-29 13:14:46,981+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up' 2021-01-29 13:14:46,990+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up. 2021-01-29 13:16:00,285+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused' 2021-01-29 13:16:00,293+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused. 2021-01-29 13:16:00,298+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error. 2021-01-29 13:16:01,421+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-17) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up' 2021-01-29 13:16:01,427+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-17) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up. 2021-01-29 13:16:11,439+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused' 2021-01-29 13:16:11,447+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused. 2021-01-29 13:16:11,452+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error. As you see, the last line is that the VM is pauzed due to storage space error. And never recovers anymore because the storage is never extended. On the other questions: - You should be able to simulate it sometimes by creating a disk with a thin provisioned disk on iSCSI, and try to fill it at a high rate (dd it until its full for example). But I did this like 50 times, and only happend once ... So hard to reproduce. - See above, no. - Don't think this code path applies to NFS, as there is no thin provisioned LV on NFS ... - df -h of what? On the VM? For me this looks like some race condition somewhere in the disk extension path. Libvirt triggers a message to extend the disk -> disk extension is being processed, but in the meanwhile, another message is triggered that its full again, and that one is never processed. When this happend the disk inside the VM was almost 100% full, but even then this shouldn't cause this. Seems like the VM is writing faster than VDSM can extend so we are reaching ENOSPC, fine tuning will most likely help here. Vojta, can you please share how to fine tune it? The variables which are relevant for this use case are (can be set in vdsm.conf of better via custom config in vdsm.conf.d): [irs] volume_utilization_percent = 20 volume_utilization_chunk_mb = 2048 This config means that the extend of the image will start once 20% of 2048MB chunk is full, i.e. once there is still 1.6GB free space. This is just an example which can be adjusted for specific needs (especially write speed and the delays to extend the image). Unless modified, default value value are 50% of 1GB, i.e. image extend starts once there is less than 0.5 GB of free space in the image. Closing this bug since all the helpful data was giving in comment #6. Please reopen if you think there is a real issue here. Well the issue is not that the VM gets into Paused state and then gets extended and returns to UP again. The issue here is that it gets into a paused state because of a full disk, but then never gets extended! I'll try to find a way to reproduce this. Unable to reproduce on the current version (4.4.5). |
Created attachment 1752045 [details] vdsm log Description of problem: iSCSI StorageDomain with multiple VM's on it. All QCOW and thin provisioned. Now while copying some data on a VM (70GB DISK), it as extended multiple times, which is normal. But all of a sudden, it stopped extending (when the disk was almost full), and the VM was hanging in a paused state! Had to restart the VM to get it working again. See logs in attachment.