Description of problem:

ResourceManager fails to lock an image if there is something wrong with it (illegal image in chain, missing leaf...). Once the image is in bad shape (e.g. after a snapshot or LSM bug), it is impossible to fix it safely using update_volume, because we cannot lock it anymore.

Locking should be *separate* from image health checks. We should still be able to lock an image in bad shape, so it can be fixed safely. Locking should fail only if there is a locking problem.

I assume this may also break some recovery flows, not just a manual update_volume via the API.

Version-Release number of selected component (if applicable):
vdsm-4.30.9-5.git939304e.el7.x86_64

How reproducible:
100%

Steps to Reproduce:

1. Set an image of the chain to illegal

   # dd if=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 skip=8 status=none | sed 's/=LEAF/=INTERNAL/' | dd of=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 seek=8

2. Try any operation that grabs its lock (ResourceManager)

   # cat input.json
   {
     "vol_info" : {
       "generation": 1,
       "img_id" : "da28450c-33c5-495c-9780-20b94f6375c3",
       "sd_id": "603fd774-06a5-4590-a152-f0ff3dd53fd0",
       "vol_id" : "70c90462-8653-4cd6-9ccb-178a2f199054"
     },
     "vol_attr" : {
       "legality": "ILLEGAL"
     },
     "job_id" : "9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52"
   }

   # vdsm-client -f input.json SDM update_volume

Actual results:
ImageIsNotLegalChain and other exceptions end up being reported as ResourceAcqusitionFailed.

Expected results:
Resource can be locked unless there is a locking problem.
Additional info:

2019-02-20 12:07:39,058+1000 INFO (jsonrpc/6) [vdsm.api] START sdm_update_volume(job_id=u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52', vol_info={u'generation': 1, u'img_id': u'da28450c-33c5-495c-9780-20b94f6375c3', u'sd_id': u'603fd774-06a5-4590-a152-f0ff3dd53fd0', u'vol_id': u'70c90462-8653-4cd6-9ccb-178a2f199054'}, vol_attr={u'legality': u'ILLEGAL'}) from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:48)
2019-02-20 12:07:39,059+1000 INFO (jsonrpc/6) [vdsm.api] FINISH sdm_update_volume return=None from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:54)
2019-02-20 12:07:39,059+1000 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call SDM.update_volume succeeded in 0.00 seconds (__init__:312)
2019-02-20 12:07:39,060+1000 INFO (tasks/4) [storage.ThreadPool.WorkerThread] START task 646deaa2-cc60-4cc6-b843-afcbdd5a0848 (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f3865677e60>>, args=None) (threadPool:208)
2019-02-20 12:07:39,060+1000 INFO (tasks/4) [root] Running job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52'... (jobs:183)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.Image] There is no leaf in the image da28450c-33c5-495c-9780-20b94f6375c3 (image:221)
2019-02-20 12:07:39,185+1000 WARN (tasks/4) [storage.ResourceManager] Resource factory failed to create resource '01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3'. Canceling request. (resourceManager:544)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 540, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 222, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
ImageIsNotLegalChain: Image is not a legal chain: ('da28450c-33c5-495c-9780-20b94f6375c3',)
2019-02-20 12:07:39,185+1000 WARN (tasks/4) [storage.ResourceManager.Request] (ResName='01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3', ReqID='05ca7896-b4a8-48b8-aade-8199751e67a4') Tried to cancel a processed request (resourceManager:188)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.guarded] Error acquiring lock <ResourceManagerLock ns=01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0, name=da28450c-33c5-495c-9780-20b94f6375c3, mode=exclusive at 0x7f38655c32d0> (guarded:96)
2019-02-20 12:07:39,186+1000 ERROR (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/update_volume.py", line 39, in _run
    with guarded.context(self._endpoint.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1001, in acquire
    res = acquireResource(self.ns, self.name, self.mode)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 476, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2019-02-20 12:07:39,194+1000 INFO (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' will be deleted in 3600 seconds (jobs:249)
This was discussed in https://gerrit.ovirt.org/#/c/93260/
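The tracebacks in the report suggest that the image chain is validated inside the resource factory, so a failed health check surfaces to the caller as a failed lock. The following is only a minimal sketch of that coupling and of the separation the report asks for. It is not vdsm code: `ChainCheckError`, `LockAcquireError`, `CoupledManager`, `DecoupledManager` and `validating_factory` are hypothetical stand-ins, and only exclusive, non-blocking locks are modeled.

```python
import threading


class ChainCheckError(Exception):
    """Hypothetical stand-in for a health-check error such as ImageIsNotLegalChain."""


class LockAcquireError(Exception):
    """Hypothetical stand-in for the generic error the caller receives."""


def validating_factory(name):
    # Hypothetical factory that validates the image while creating the
    # resource. Here it always fails, to model a chain with no leaf.
    raise ChainCheckError("no leaf in image %s" % name)


def _take(locks, name):
    # Exclusive, non-blocking lock keyed by resource name.
    lock = locks.setdefault(name, threading.Lock())
    if not lock.acquire(blocking=False):
        raise LockAcquireError("%s is already locked" % name)
    return lock


class CoupledManager:
    """Validation runs as part of acquiring the lock."""

    def __init__(self, factory):
        self._factory = factory
        self._locks = {}

    def acquire(self, name):
        try:
            self._factory(name)  # health check happens before locking
        except Exception as e:
            # Any factory error is reported as a lock failure.
            raise LockAcquireError("factory failed for %s: %s" % (name, e))
        return _take(self._locks, name)


class DecoupledManager:
    """Locking only; callers validate separately if they need to."""

    def __init__(self):
        self._locks = {}

    def acquire(self, name):
        return _take(self._locks, name)
```

With the coupled variant, `CoupledManager(validating_factory).acquire("img")` raises `LockAcquireError` even though nothing else holds the lock. With the decoupled variant the same call succeeds, and it fails only when the lock is already held.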
Vdsm will not work with an illegal image chain. This requires a manual fix outside of vdsm. This is not a bug but a feature. What you need is an fsck for oVirt storage; vdsm is currently not that tool. I think it makes sense for vdsm to implement checking and repairing disks, but it cannot be done via normal APIs like SDM.update_volume.
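To make the fsck idea concrete, here is a rough, read-only sketch of the kind of check such a tool might start with. It is an illustration under assumptions, not an existing vdsm or oVirt tool: `chain_problems` and the dict keys `vol_id`, `parent_id` and `voltype` are hypothetical, and the only rule taken from this report is that a chain without a leaf is not legal. The other two checks are illustrative guesses.

```python
def chain_problems(volumes):
    """Return a list of problem descriptions; an empty list means none found.

    `volumes` is a list of dicts with the hypothetical keys "vol_id",
    "parent_id" (None for the base volume) and "voltype" ("LEAF" or
    "INTERNAL"). Nothing is modified; this only reports.
    """
    problems = []

    # The log in this report shows "There is no leaf in the image".
    leaves = sorted(v["vol_id"] for v in volumes if v["voltype"] == "LEAF")
    if not leaves:
        problems.append("no leaf volume")
    elif len(leaves) > 1:
        # Illustrative extra check, not taken from the report.
        problems.append("multiple leaf volumes: " + ", ".join(leaves))

    # Illustrative extra check: every parent must be part of the chain.
    known = {v["vol_id"] for v in volumes}
    for v in volumes:
        parent = v["parent_id"]
        if parent is not None and parent not in known:
            problems.append(
                "volume %s has unknown parent %s" % (v["vol_id"], parent))

    return problems
```

For a two-volume chain whose top volume was switched from LEAF to INTERNAL, as in the reproduction steps, this returns `["no leaf volume"]`. Repairing the chain is a separate problem that this sketch does not attempt.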
Tal, why is this targeted to 4.3.0? We don't plan to change this behavior. I suggest closing this as WONTFIX.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days