Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1922215

Summary:

Thin provisioned disk stopped to get extended

Product:

[oVirt] vdsm

Reporter:

Jean-Louis Dupond <jean-louis>

Component:

General

Assignee:

Vojtech Juranek <vjuranek>

Status:

CLOSED WORKSFORME

QA Contact:

Avihai <aefrat>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

4.40.40

CC:

bugs, eshenitz, sfishbai, sleviim, tnisan, vjuranek

Target Milestone:

ovirt-4.4.6

Keywords:

Reopened

Target Release:

---

Hardware:

x86_64

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-04-12 14:27:40 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
vdsm log	none

Description Jean-Louis Dupond 2021-01-29 13:33:59 UTC

Created attachment 1752045 [details]
vdsm log

Description of problem:
iSCSI StorageDomain with multiple VM's on it.
All QCOW and thin provisioned.

Now while copying some data on a VM (70GB DISK), it as extended multiple times, which is normal. 
But all of a sudden, it stopped extending (when the disk was almost full), and the VM was hanging in a paused state!
Had to restart the VM to get it working again.

See logs in attachment.

Comment 1 Shir Fishbain 2021-02-01 15:08:47 UTC

Hi Shani, Can you please take a look at this ?

Comment 2 RHEL Program Management 2021-02-01 15:08:50 UTC

The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 shani 2021-02-01 15:23:46 UTC

Hi,
Please supply a full engine log, and a screenshot if possible.
(I'm afraid the VDSM is not enough)

Also, please supply the following:
- please describe the full scenario and how to reproduce
- does it happens all the time?
- did you try it on nfs?
- might be a disk storage issue, can you share your df -h  

Thanks

Comment 4 Jean-Louis Dupond 2021-02-02 09:41:20 UTC

Hi Shani,

Engine logs doesn't give much info, this as all logic happens afaik on VDSM side.
2021-01-29 13:12:40,810+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:12:40,816+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:12:42,264+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:12:42,329+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:13:06,686+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:13:06,698+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:13:06,712+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:13:07,397+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:13:07,403+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:13:31,847+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:13:31,854+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:13:31,860+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:13:33,260+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:13:33,269+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:14:43,714+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:14:43,721+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:14:43,728+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:14:46,981+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:14:46,990+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:16:00,285+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:16:00,293+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:16:00,298+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.
2021-01-29 13:16:01,421+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-17) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Paused' --> 'Up'
2021-01-29 13:16:01,427+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-17) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM migrator001 has recovered from paused back to up.
2021-01-29 13:16:11,439+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'cbb4b44c-bf42-45b6-9f28-3146221ed8a7'(migrator001) moved from 'Up' --> 'Paused'
2021-01-29 13:16:11,447+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), VM migrator001 has been paused.
2021-01-29 13:16:11,452+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM migrator001 has been paused due to no Storage space error.


As you see, the last line is that the VM is pauzed due to storage space error. And never recovers anymore because the storage is never extended.

On the other questions:
- You should be able to simulate it sometimes by creating a disk with a thin provisioned disk on iSCSI, and try to fill it at a high rate (dd it until its full for example). 
  But I did this like 50 times, and only happend once ... So hard to reproduce.
- See above, no.
- Don't think this code path applies to NFS, as there is no thin provisioned LV on NFS ...
- df -h of what? On the VM? 

For me this looks like some race condition somewhere in the disk extension path.
Libvirt triggers a message to extend the disk -> disk extension is being processed, but in the meanwhile, another message is triggered that its full again, and that one is never processed.
When this happend the disk inside the VM was almost 100% full, but even then this shouldn't cause this.

Comment 5 Tal Nisan 2021-02-02 12:25:32 UTC

Seems like the VM is writing faster than VDSM can extend so we are reaching ENOSPC, fine tuning will most likely help here.
Vojta, can you please share how to fine tune it?

Comment 6 Vojtech Juranek 2021-03-01 20:28:57 UTC

The variables which are relevant for this use case are (can be set in vdsm.conf of better via custom config in vdsm.conf.d):

[irs]
volume_utilization_percent = 20
volume_utilization_chunk_mb = 2048

This config means that the extend of the image will start once 20% of 2048MB chunk is full, i.e. once there is still 1.6GB free space.
This is just an example which can be adjusted for specific needs (especially write speed and the delays to extend the image).

Unless modified, default value value are 50% of 1GB, i.e. image extend starts once there is less than 0.5 GB of free space in the image.

Comment 7 Eyal Shenitzky 2021-03-22 15:44:06 UTC

Closing this bug since all the helpful data was giving in comment #6.

Please reopen if you think there is a real issue here.

Comment 8 Jean-Louis Dupond 2021-03-23 07:34:49 UTC

Well the issue is not that the VM gets into Paused state and then gets extended and returns to UP again.
The issue here is that it gets into a paused state because of a full disk, but then never gets extended!

I'll try to find a way to reproduce this.

Comment 9 Jean-Louis Dupond 2021-04-12 14:27:40 UTC

Unable to reproduce on the current version (4.4.5).