Bug 1678969

Summary: ResourceManager fails to lock image if the image has a problem
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: vdsm
Assignee: Vojtech Juranek <vjuranek>
Status: CLOSED DEFERRED
QA Contact: Shir Fishbain <sfishbai>
Severity: high
Docs Contact:
Priority: high
Version: 4.3.0
CC: bcholler, lsurette, mkalinin, mtessun, nsoffer, redhat-bugzilla, sfishbai, srevivo, tnisan, usurse, ycui
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-25 04:47:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1469683, 1489145, 1547336    

Description Germano Veit Michel 2019-02-20 02:16:27 UTC
Description of problem:

ResourceManager fails to lock an image if there is something wrong with it (illegal image in chain, missing leaf...)

So once the image is in bad shape (e.g. after a snapshot or LSM bug), it is impossible to safely fix it using update_volume, as we cannot lock it anymore.

Locking should be *separate* from image health checks. We should still be able to lock an image in bad shape, so it can be fixed safely. Locking should fail only if there is a locking problem.
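
A minimal sketch of the separation asked for here. This is not vdsm code: ImageLock, ChainError, check_chain and repair are hypothetical names, and a volume is modelled as a plain dict. The point is only that taking the lock never inspects the image, while the health check is a separate call that a repair flow can skip.

```python
import threading


class ChainError(Exception):
    """Hypothetical error for an unhealthy image chain."""


class ImageLock(object):
    """Hypothetical exclusive lock; acquiring it never inspects the image."""

    def __init__(self, img_id):
        self.img_id = img_id
        self._lock = threading.Lock()

    def __enter__(self):
        # Only a real locking problem (here: already held) fails the acquire.
        if not self._lock.acquire(False):
            raise RuntimeError("image %s is already locked" % self.img_id)
        return self

    def __exit__(self, *exc_info):
        self._lock.release()
        return False


def check_chain(volumes):
    """Separate health check: the chain must contain exactly one LEAF."""
    leaves = [v for v in volumes if v["voltype"] == "LEAF"]
    if len(leaves) != 1:
        raise ChainError("expected one leaf, found %d" % len(leaves))


def repair(lock, volumes, vol_id, **attrs):
    """Repair flow: take the lock, skip the health check, fix the metadata."""
    with lock:
        for vol in volumes:
            if vol["id"] == vol_id:
                vol.update(attrs)
```

With this split, a normal flow would call check_chain() after entering the lock, and a repair flow would not, so a chain with no leaf could still be locked and fixed.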

I assume this may also break some recovery flow, not just a manual update_volume via API.

Version-Release number of selected component (if applicable):
vdsm-4.30.9-5.git939304e.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Break the image chain, e.g. change the leaf volume's type to INTERNAL in the storage domain metadata so the chain has no leaf
# dd if=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 skip=8 status=none | sed 's/=LEAF/=INTERNAL/' | dd of=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 seek=8

2. Try any operation that grabs its lock (ResourceManager)
# cat input.json 
{
    "vol_info" : {
        "generation": 1,
        "img_id" : "da28450c-33c5-495c-9780-20b94f6375c3",
        "sd_id": "603fd774-06a5-4590-a152-f0ff3dd53fd0",
        "vol_id" : "70c90462-8653-4cd6-9ccb-178a2f199054"
    },
    "vol_attr" : {
        "legality": "ILLEGAL"
    },
    "job_id" : "9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52"
}
# vdsm-client -f input.json SDM update_volume
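
For scripted reproduction, the request from step 2 can be generated instead of hand-written. The sketch below only builds the same JSON structure and the same command line shown above; it does not run anything. The helper names (build_update_volume_request, write_request, client_command) are hypothetical.

```python
import json


def build_update_volume_request(sd_id, img_id, vol_id, generation, job_id,
                                **vol_attr):
    """Build the same parameter structure as input.json in step 2."""
    return {
        "vol_info": {
            "generation": generation,
            "img_id": img_id,
            "sd_id": sd_id,
            "vol_id": vol_id,
        },
        "vol_attr": vol_attr,
        "job_id": job_id,
    }


def write_request(path, request):
    """Write the request to a file that can be passed with -f."""
    with open(path, "w") as f:
        json.dump(request, f, indent=4)


def client_command(path):
    """Return the command line from step 2 as an argument list."""
    return ["vdsm-client", "-f", path, "SDM", "update_volume"]
```

For example, passing legality="ILLEGAL" as the keyword argument yields the vol_attr section used in this report.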

Actual results:
ImageIsNotLegalChain and other exceptions raised by the resource factory end up as a generic ResourceAcqusitionFailed, so the lock is never acquired.

Expected results:
The resource can be locked unless there is a locking problem.

Additional info:
2019-02-20 12:07:39,058+1000 INFO  (jsonrpc/6) [vdsm.api] START sdm_update_volume(job_id=u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52', vol_info={u'generation': 1, u'img_id': u'da28450c-33c5-495c-9780-20b94f6375c3', u'sd_id': u'603fd774-06a5-4590-a152-f0ff3dd53fd0', u'vol_id': u'70c90462-8653-4cd6-9ccb-178a2f199054'}, vol_attr={u'legality': u'ILLEGAL'}) from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:48)
2019-02-20 12:07:39,059+1000 INFO  (jsonrpc/6) [vdsm.api] FINISH sdm_update_volume return=None from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:54)
2019-02-20 12:07:39,059+1000 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call SDM.update_volume succeeded in 0.00 seconds (__init__:312)
2019-02-20 12:07:39,060+1000 INFO  (tasks/4) [storage.ThreadPool.WorkerThread] START task 646deaa2-cc60-4cc6-b843-afcbdd5a0848 (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f3865677e60>>, args=None) (threadPool:208)
2019-02-20 12:07:39,060+1000 INFO  (tasks/4) [root] Running job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52'... (jobs:183)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.Image] There is no leaf in the image da28450c-33c5-495c-9780-20b94f6375c3 (image:221)
2019-02-20 12:07:39,185+1000 WARN  (tasks/4) [storage.ResourceManager] Resource factory failed to create resource '01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3'. Canceling request. (resourceManager:544)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 540, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 222, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
ImageIsNotLegalChain: Image is not a legal chain: ('da28450c-33c5-495c-9780-20b94f6375c3',)
2019-02-20 12:07:39,185+1000 WARN  (tasks/4) [storage.ResourceManager.Request] (ResName='01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3', ReqID='05ca7896-b4a8-48b8-aade-8199751e67a4') Tried to cancel a processed request (resourceManager:188)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.guarded] Error acquiring lock <ResourceManagerLock ns=01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0, name=da28450c-33c5-495c-9780-20b94f6375c3, mode=exclusive at 0x7f38655c32d0> (guarded:96)
2019-02-20 12:07:39,186+1000 ERROR (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/update_volume.py", line 39, in _run
    with guarded.context(self._endpoint.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1001, in acquire
    res = acquireResource(self.ns, self.name, self.mode)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 476, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2019-02-20 12:07:39,194+1000 INFO  (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' will be deleted in 3600 seconds (jobs:249)

Comment 1 Germano Veit Michel 2019-02-20 02:21:08 UTC
This was discussed in https://gerrit.ovirt.org/#/c/93260/

Comment 7 Nir Soffer 2020-09-01 09:27:34 UTC
Vdsm will not work with an illegal image chain. This requires a manual
fix outside of vdsm.

This is not a bug but a feature. What you need is fsck for oVirt
storage; vdsm is not this tool currently.

I think it makes sense that vdsm will implement checking and repairing
disks, but it cannot be done via normal APIs like SDM.update_volume.
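
To illustrate what the checking half of such a tool might report, here is a read-only sketch. It is not an existing vdsm API: find_chain_problems and the dict keys "parent", "voltype" and "legality" are hypothetical, and None stands in for "no parent". It lists problems instead of raising, so a broken chain can be inspected without failing.

```python
def find_chain_problems(volumes):
    """Return a list of problem descriptions; an empty list means healthy.

    `volumes` maps a volume id to a dict with the hypothetical keys
    "parent" (None for the base volume), "voltype" and "legality".
    """
    problems = []

    leaves = [v for v, md in volumes.items() if md["voltype"] == "LEAF"]
    if len(leaves) != 1:
        problems.append("expected exactly one LEAF, found %d" % len(leaves))

    bases = [v for v, md in volumes.items() if md["parent"] is None]
    if len(bases) != 1:
        problems.append("expected exactly one base volume, found %d"
                        % len(bases))

    # Sorted iteration keeps the report order stable between runs.
    for vol_id in sorted(volumes):
        md = volumes[vol_id]
        if md["legality"] != "LEGAL":
            problems.append("volume %s is %s" % (vol_id, md["legality"]))
        if md["parent"] is not None and md["parent"] not in volumes:
            problems.append("volume %s has unknown parent %s"
                            % (vol_id, md["parent"]))

    return problems
```

Repairing would be a separate step acting on this report, which matches the point above that it should not go through normal APIs like SDM.update_volume.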

Comment 8 Nir Soffer 2020-09-08 09:22:39 UTC
Tal, why is this targeted to 4.3.0? We don't plan to change this behavior.

I suggest closing this as WONTFIX.

Comment 12 Red Hat Bugzilla 2024-01-06 04:26:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.