Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1922215

Summary: Thin provisioned disk stopped to get extended
Product: [oVirt] vdsm Reporter: Jean-Louis Dupond <jean-louis>
Component: GeneralAssignee: Vojtech Juranek <vjuranek>
Status: CLOSED WORKSFORME QA Contact: Avihai <aefrat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.40.40CC: bugs, eshenitz, sfishbai, sleviim, tnisan, vjuranek
Target Milestone: ovirt-4.4.6Keywords: Reopened
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-04-12 14:27:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vdsm log none

Description Jean-Louis Dupond 2021-01-29 13:33:59 UTC
Created attachment 1752045 [details]
vdsm log

Description of problem:
iSCSI StorageDomain with multiple VM's on it.
All QCOW and thin provisioned.

Now while copying some data on a VM (70GB DISK), it as extended multiple times, which is normal. 
But all of a sudden, it stopped extending (when the disk was almost full), and the VM was hanging in a paused state!
Had to restart the VM to get it working again.

See logs in attachment.

Comment 1 Shir Fishbain 2021-02-01 15:08:47 UTC
Hi Shani, Can you please take a look at this ?

Comment 2 RHEL Program Management 2021-02-01 15:08:50 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 shani 2021-02-01 15:23:46 UTC
Hi,
Please supply a full engine log, and a screenshot if possible.
(I'm afraid the VDSM is not enough)

Also, please supply the following:
- please describe the full scenario and how to reproduce
- does it happens all the time?
- did you try it on nfs?
- might be a disk storage issue, can you share your df -h  

Thanks

Comment 4 Jean-Louis Dupond 2021-02-02 09:41:20 UTC
Hi Shani,

Engine logs doesn't give much info, this as all logic happens afaik on VDSM side.
2021-01-29 13:12:40,810+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:12:40,816+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:12:42,264+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:12:42,329+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:13:06,686+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:13:06,698+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:13:06,712+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:13:07,397+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:13:07,403+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:13:31,847+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:13:31,854+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:13:31,860+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:13:33,260+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:13:33,269+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:14:43,714+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:14:43,721+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:14:43,728+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:14:46,981+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:14:46,990+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:16:00,285+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:16:00,293+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:16:00,298+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:16:01,421+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-17) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:16:01,427+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-17) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:16:11,439+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:16:11,447+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:16:11,452+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.


As you see, the last line is that the VM is pauzed due to storage space error. And never recovers anymore because the storage is never extended.

On the other questions:
- You should be able to simulate it sometimes by creating a disk with a thin provisioned disk on iSCSI, and try to fill it at a high rate (dd it until its full for example). 
  But I did this like 50 times, and only happend once ... So hard to reproduce.
- See above, no.
- Don't think this code path applies to NFS, as there is no thin provisioned LV on NFS ...
- df -h of what? On the VM? 

For me this looks like some race condition somewhere in the disk extension path.
Libvirt triggers a message to extend the disk -> disk extension is being processed, but in the meanwhile, another message is triggered that its full again, and that one is never processed.
When this happend the disk inside the VM was almost 100% full, but even then this shouldn't cause this.

Comment 5 Tal Nisan 2021-02-02 12:25:32 UTC
Seems like the VM is writing faster than VDSM can extend so we are reaching ENOSPC, fine tuning will most likely help here.
Vojta, can you please share how to fine tune it?

Comment 6 Vojtech Juranek 2021-03-01 20:28:57 UTC
The variables which are relevant for this use case are (can be set in vdsm.conf of better via custom config in vdsm.conf.d):

[irs]
volume_utilization_percent = 20
volume_utilization_chunk_mb = 2048

This config means that the extend of the image will start once 20% of 2048MB chunk is full, i.e. once there is still 1.6GB free space.
This is just an example which can be adjusted for specific needs (especially write speed and the delays to extend the image).

Unless modified, default value value are 50% of 1GB, i.e. image extend starts once there is less than 0.5 GB of free space in the image.

Comment 7 Eyal Shenitzky 2021-03-22 15:44:06 UTC
Closing this bug since all the helpful data was giving in comment #6.

Please reopen if you think there is a real issue here.

Comment 8 Jean-Louis Dupond 2021-03-23 07:34:49 UTC
Well the issue is not that the VM gets into Paused state and then gets extended and returns to UP again.
The issue here is that it gets into a paused state because of a full disk, but then never gets extended!

I'll try to find a way to reproduce this.

Comment 9 Jean-Louis Dupond 2021-04-12 14:27:40 UTC
Unable to reproduce on the current version (4.4.5).