Description Germano Veit Michel
2019-02-20 02:16:27 UTC
Description of problem:
ResourceManager fails to lock an image if there is something wrong with it (illegal volume in the chain, missing leaf...).
So once the image is in bad shape (e.g. after a snapshot or LSM bug), it is impossible to safely fix it using update_volume, because the image can no longer be locked.
Locking should be *separate* from image health checks. We should still be able to lock an image in bad shape, so it can be fixed safely. Locking should fail only if there is a locking problem.
I assume this may also break some recovery flows, not just a manual update_volume via the API.
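To illustrate the separation asked for above, here is a minimal sketch in Python 3. It is not vdsm code: image_lock, check_chain, LockError and the require_healthy flag are hypothetical names, ImageIsNotLegalChain is only a stand-in for the vdsm exception of the same name, and the "exactly one LEAF" rule is a toy health check. The point is that acquiring the lock never inspects the image, so it can fail only because of a locking problem.

```python
import threading
from contextlib import contextmanager


class LockError(Exception):
    """Raised only for real locking problems (hypothetical name)."""


class ImageIsNotLegalChain(Exception):
    """Stand-in for the vdsm exception of the same name."""


_locks = {}                      # image id -> threading.Lock
_locks_guard = threading.Lock()  # protects the _locks dict


@contextmanager
def image_lock(img_id):
    """Take an exclusive lock on img_id without looking at the image."""
    with _locks_guard:
        lock = _locks.setdefault(img_id, threading.Lock())
    if not lock.acquire(blocking=False):
        raise LockError("image %s is already locked" % img_id)
    try:
        yield
    finally:
        lock.release()


def check_chain(volume_types):
    """Toy health check, run separately from locking."""
    if volume_types.count("LEAF") != 1:
        raise ImageIsNotLegalChain(volume_types)


def update_volume(img_id, volume_types, require_healthy=True):
    """Lock first; validate only if the operation needs a sane chain."""
    with image_lock(img_id):
        if require_healthy:
            check_chain(volume_types)
        return "updated"
```

With this split, a repair operation could call update_volume(img_id, chain, require_healthy=False) on a broken chain and still hold the lock while fixing it, while normal operations keep the validation.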
Version-Release number of selected component (if applicable):
vdsm-4.30.9-5.git939304e.el7.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Make the image chain illegal (here by changing the leaf volume's type to INTERNAL in the metadata, so the chain has no leaf)
# dd if=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 skip=8 status=none | sed 's/=LEAF/=INTERNAL/' | dd of=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 seek=8
2. Try any operation that grabs its lock (ResourceManager)
# cat input.json
{
    "vol_info": {
        "generation": 1,
        "img_id": "da28450c-33c5-495c-9780-20b94f6375c3",
        "sd_id": "603fd774-06a5-4590-a152-f0ff3dd53fd0",
        "vol_id": "70c90462-8653-4cd6-9ccb-178a2f199054"
    },
    "vol_attr": {
        "legality": "ILLEGAL"
    },
    "job_id": "9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52"
}
# vdsm-client -f input.json SDM update_volume
Actual results:
ImageIsNotLegalChain and other exceptions raised while the resource factory creates the resource end up as ResourceAcqusitionFailed, so the lock is never acquired.
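The failure mode can be modelled with a few lines of Python 3. This is a simplified sketch, not the vdsm implementation: create_resource and acquire_resource are hypothetical stand-ins for the factory and manager calls seen in the traceback below, and the chain is reduced to a list of volume types. The exception names are copied from the log, including the spelling of ResourceAcqusitionFailed.

```python
class ImageIsNotLegalChain(Exception):
    """Stand-in for the vdsm exception raised by getChain."""


class ResourceAcqusitionFailed(Exception):
    """Stand-in for the vdsm exception; spelling matches the log."""


def create_resource(volume_types):
    """Hypothetical factory: validates the chain while building the resource."""
    if "LEAF" not in volume_types:
        raise ImageIsNotLegalChain(volume_types)
    return {"chain": list(volume_types)}


def acquire_resource(volume_types):
    """Hypothetical manager: any factory error becomes a generic lock failure."""
    try:
        return create_resource(volume_types)
    except Exception:
        # The request is cancelled and the caller only sees that the
        # resource could not be acquired, not that the chain is broken.
        raise ResourceAcqusitionFailed()
```

In this model, acquire_resource(["INTERNAL"]) raises ResourceAcqusitionFailed even though nothing else holds the lock, which mirrors what the log shows for the broken image.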
Expected results:
Resource can be locked unless there is a locking problem.
Additional info:
2019-02-20 12:07:39,058+1000 INFO (jsonrpc/6) [vdsm.api] START sdm_update_volume(job_id=u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52', vol_info={u'generation': 1, u'img_id': u'da28450c-33c5-495c-9780-20b94f6375c3', u'sd_id': u'603fd774-06a5-4590-a152-f0ff3dd53fd0', u'vol_id': u'70c90462-8653-4cd6-9ccb-178a2f199054'}, vol_attr={u'legality': u'ILLEGAL'}) from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:48)
2019-02-20 12:07:39,059+1000 INFO (jsonrpc/6) [vdsm.api] FINISH sdm_update_volume return=None from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:54)
2019-02-20 12:07:39,059+1000 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call SDM.update_volume succeeded in 0.00 seconds (__init__:312)
2019-02-20 12:07:39,060+1000 INFO (tasks/4) [storage.ThreadPool.WorkerThread] START task 646deaa2-cc60-4cc6-b843-afcbdd5a0848 (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f3865677e60>>, args=None) (threadPool:208)
2019-02-20 12:07:39,060+1000 INFO (tasks/4) [root] Running job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52'... (jobs:183)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.Image] There is no leaf in the image da28450c-33c5-495c-9780-20b94f6375c3 (image:221)
2019-02-20 12:07:39,185+1000 WARN (tasks/4) [storage.ResourceManager] Resource factory failed to create resource '01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3'. Canceling request. (resourceManager:544)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 540, in registerResource
obj = namespaceObj.factory.createResource(name, lockType)
File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
lockType)
File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
imgUUID=resourceName)
File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 222, in getChain
raise se.ImageIsNotLegalChain(imgUUID)
ImageIsNotLegalChain: Image is not a legal chain: ('da28450c-33c5-495c-9780-20b94f6375c3',)
2019-02-20 12:07:39,185+1000 WARN (tasks/4) [storage.ResourceManager.Request] (ResName='01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3', ReqID='05ca7896-b4a8-48b8-aade-8199751e67a4') Tried to cancel a processed request (resourceManager:188)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.guarded] Error acquiring lock <ResourceManagerLock ns=01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0, name=da28450c-33c5-495c-9780-20b94f6375c3, mode=exclusive at 0x7f38655c32d0> (guarded:96)
2019-02-20 12:07:39,186+1000 ERROR (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' failed (jobs:221)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
self._run()
File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/update_volume.py", line 39, in _run
with guarded.context(self._endpoint.locks):
File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
six.reraise(*exc)
File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
lock.acquire()
File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1001, in acquire
res = acquireResource(self.ns, self.name, self.mode)
File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 476, in acquireResource
raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2019-02-20 12:07:39,194+1000 INFO (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' will be deleted in 3600 seconds (jobs:249)
Comment 1 Germano Veit Michel
2019-02-20 02:21:08 UTC
Vdsm will not work with an illegal image chain. This requires a manual
fix outside of vdsm.
This is not a bug but a feature. What you need is an fsck for oVirt
storage; vdsm is currently not that tool.
I think it makes sense for vdsm to implement checking and repairing
disks, but that cannot be done via normal APIs like SDM.update_volume.