Bug 1548017

Summary: [RFE] Detect write errors and resume paused vms
Product: [oVirt] vdsm
Reporter: Nir Soffer <nsoffer>
Component: Core
Assignee: Dan Kenigsberg <danken>
Status: CLOSED DEFERRED
QA Contact: Avihai <aefrat>
Severity: medium
Priority: unspecified
Version: 4.20.15
CC: bugs
Target Milestone: ---
Target Release: ---
Keywords: FutureFeature
Flags: rule-engine: planning_ack?, rule-engine: devel_ack?, rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Type: Bug
oVirt Team: Storage

Description Nir Soffer 2018-02-22 14:29:53 UTC
Description of problem:

Vdsm storage monitoring uses only read(). If reading from storage succeeds but
writing to storage fails, a VM may pause while vdsm storage monitoring still
reports the storage domain as VALID. Since the storage domain never changes
state, the paused VM is never resumed.
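The gap can be sketched as a probe that checks both directions. This is an illustrative sketch, not vdsm's actual monitoring code; the path and block size are hypothetical, and real monitoring code would use O_DIRECT with aligned buffers to bypass the page cache:

```python
import os

def probe_storage(path, block_size=4096):
    """Probe a storage path for both read and write health.

    A read-only probe (roughly what the monitor does today) succeeds
    even when writes to the underlying LUN fail; adding a write probe
    catches that case. Illustrative only, not vdsm code.
    """
    # Read probe: succeeds as long as reads work.
    with open(path, "rb") as f:
        buf = f.read(block_size)

    # Write probe: rewrite the same block so a write failure on the
    # LUN surfaces as an error here. (A real implementation would
    # use O_DIRECT with an aligned buffer; omitted for brevity.)
    with open(path, "r+b") as f:
        f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    return True
```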

Version-Release number of selected component (if applicable):
Any

How reproducible:
Unknown

Steps to Reproduce:
1. Create a storage domain with 2 LUNs
2. Create enough disks to fill the first LUN
3. Create a new disk (it should be created on the second LUN)
4. Start a VM with the new disk
5. Take the second LUN offline (can be done via sysfs)
6. Perform some I/O in the VM until the VM pauses
7. Make the second LUN available again
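Steps 5 and 7 can be done through the SCSI device's sysfs state attribute. A minimal sketch, assuming a device named sdb (the name is illustrative, and the actual write requires root on a real host, so it is only shown commented out):

```python
def scsi_state_path(device):
    """Return the sysfs path controlling a SCSI device's state.

    Writing "offline" here detaches the LUN (step 5); writing
    "running" brings it back (step 7). Device name is illustrative.
    """
    return "/sys/block/%s/device/state" % device

def set_scsi_state(device, state):
    # Requires root and a real device; not executed here.
    with open(scsi_state_path(device), "w") as f:
        f.write(state)

# set_scsi_state("sdb", "offline")  # step 5
# set_scsi_state("sdb", "running")  # step 7
```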

Actual results:
The VM remains paused.

Expected results:
The VM is resumed.

Additional info:
Not tested, but vdsm currently has no way to handle this case.

Possible solution:

1. Create a monitoring LV on every LUN when creating or extending a PV. LVM
   supports specifying a PV when creating a new LV.
2. Monitor the special monitoring LVs instead of the metadata LV, for both read
   and write.
3. Use the monitoring LV state to change the status of the storage domain, or
   of the disks depending on these PVs.

Such a change requires moving storage monitoring out of vdsm; the current code
cannot handle checking hundreds of paths.
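Step 1 of the sketch above relies on lvcreate accepting a PV after the VG name, which restricts allocation of the new LV to that device. A hedged illustration with subprocess; all VG, PV, and LV names are hypothetical:

```python
import subprocess

def lvcreate_cmd(vg, pv, lv, size="4m"):
    # lvcreate restricts allocation to the PV listed after the VG
    # name. All names here are illustrative.
    return ["lvcreate", "--name", lv, "--size", size, vg, pv]

def create_monitoring_lv(vg, pv, lv):
    # Not executed here; requires a real VG/PV and root.
    subprocess.run(lvcreate_cmd(vg, pv, lv), check=True)
```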

Comment 1 Michal Skrivanek 2020-03-18 15:50:13 UTC
This bug didn't get any attention for a while, we didn't have the capacity to make any progress. If you deeply care about it or want to work on it please assign/target accordingly

Comment 2 Michal Skrivanek 2020-03-18 15:52:47 UTC
This bug didn't get any attention for a while, we didn't have the capacity to make any progress. If you deeply care about it or want to work on it please assign/target accordingly

Comment 3 Michal Skrivanek 2020-04-01 14:48:59 UTC
ok, closing. Please reopen if still relevant/you want to work on it.

Comment 4 Michal Skrivanek 2020-04-01 14:51:57 UTC
ok, closing. Please reopen if still relevant/you want to work on it.