Bug 1678969 - ResourceManager fails to lock image if the image has a problem
Summary: ResourceManager fails to lock image if the image has a problem
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Vojtech Juranek
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks: 1469683 1489145 1547336
Reported: 2019-02-20 02:16 UTC by Germano Veit Michel
Modified: 2024-01-06 04:26 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-25 04:47:46 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:



Description Germano Veit Michel 2019-02-20 02:16:27 UTC
Description of problem:

ResourceManager fails to lock an image if there is something wrong with it (illegal image in chain, missing leaf...)

So once the image is in bad shape (e.g. after a snapshot or LSM bug), it is impossible to fix it safely using update_volume, because we can no longer lock it.

Locking should be *separate* from image health checks. We should still be able to lock an image in bad shape, so it can be fixed safely. Locking should fail only if there is a locking problem.

I assume this may also break some recovery flow, not just a manual update_volume via API.
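
The separation requested above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not vdsm's actual code: `ImageLock`, `check_chain`, `ChainError`, and `repair` are hypothetical names, volumes are modelled as plain dicts, and a `threading.Lock` stands in for the ResourceManager lock.

```python
import threading
from contextlib import contextmanager


class ChainError(Exception):
    """Hypothetical stand-in for errors such as ImageIsNotLegalChain."""


def check_chain(volumes):
    """Hypothetical health check: require exactly one LEAF volume."""
    leaves = [v for v in volumes if v["voltype"] == "LEAF"]
    if len(leaves) != 1:
        raise ChainError("expected exactly one leaf, found %d" % len(leaves))


class ImageLock(object):
    """Lock an image without inspecting its chain."""

    def __init__(self):
        self._lock = threading.Lock()

    @contextmanager
    def locked(self, timeout=1.0):
        # Fails only for a locking problem (here: a timeout), never
        # because the image itself is unhealthy.
        if not self._lock.acquire(timeout=timeout):
            raise RuntimeError("could not acquire image lock")
        try:
            yield
        finally:
            self._lock.release()


def repair(lock, volumes):
    """Take the lock first, then check and fix the chain while holding it."""
    with lock.locked():
        try:
            check_chain(volumes)
        except ChainError:
            # The chain is broken, but the lock is held, so fixing is safe.
            volumes[-1]["voltype"] = "LEAF"
        check_chain(volumes)
    return volumes
```

In this shape, a broken chain surfaces as `ChainError` from the health check, while a failure to lock is reported separately, so a repair flow can hold the lock on an unhealthy image.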

Version-Release number of selected component (if applicable):
vdsm-4.30.9-5.git939304e.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Make the image chain illegal (here, by rewriting LEAF to INTERNAL in the volume metadata, so the chain has no leaf)
# dd if=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 skip=8 status=none | sed 's/=LEAF/=INTERNAL/' | dd of=/dev/603fd774-06a5-4590-a152-f0ff3dd53fd0/metadata bs=512 count=1 seek=8

2. Try any operation that grabs its lock (ResourceManager)
# cat input.json 
{
    "vol_info" : {
        "generation": 1,
        "img_id" : "da28450c-33c5-495c-9780-20b94f6375c3",
        "sd_id": "603fd774-06a5-4590-a152-f0ff3dd53fd0",
        "vol_id" : "70c90462-8653-4cd6-9ccb-178a2f199054"
    },
    "vol_attr" : {
        "legality": "ILLEGAL"
    },
    "job_id" : "9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52"
}
# vdsm-client -f input.json SDM update_volume
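
For readers less familiar with dd, the pipeline in step 1 reads the 512-byte block at index 8 of the metadata device (byte offset 8 * 512 = 4096), substitutes `=INTERNAL` for `=LEAF`, and writes the result back to the same block. A simplified Python sketch of that edit on an in-memory buffer follows; `patch_slot` is a hypothetical helper, and the `VOLTYPE=LEAF` key=value layout used in the example is an assumption about the metadata format, not something shown in this report.

```python
def patch_slot(data, slot, old, new, block_size=512):
    """Return a copy of data with old replaced by new inside one block."""
    start = slot * block_size
    block = data[start:start + block_size]
    # Replace the first occurrence only (a simplification of the sed call).
    patched = block.replace(old, new, 1)
    # Keep the block at its original size: pad with NULs or truncate.
    patched = patched.ljust(block_size, b"\0")[:block_size]
    return data[:start] + patched + data[start + block_size:]


# Build a fake 10-block device whose block 8 holds the assumed metadata.
device = (b"\0" * 512 * 8
          + b"VOLTYPE=LEAF\nEOF\n".ljust(512, b"\0")
          + b"\0" * 512)
corrupted = patch_slot(device, 8, b"=LEAF", b"=INTERNAL")
```

Only block 8 changes and the buffer length is preserved. The sketch never touches a real device.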

Actual results:
ImageIsNotLegalChain and other image health exceptions raised by the resource factory are reported as ResourceAcqusitionFailed, so the lock is never acquired.

Expected results:
Resource can be locked unless there is a locking problem.

Additional info:
2019-02-20 12:07:39,058+1000 INFO  (jsonrpc/6) [vdsm.api] START sdm_update_volume(job_id=u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52', vol_info={u'generation': 1, u'img_id': u'da28450c-33c5-495c-9780-20b94f6375c3', u'sd_id': u'603fd774-06a5-4590-a152-f0ff3dd53fd0', u'vol_id': u'70c90462-8653-4cd6-9ccb-178a2f199054'}, vol_attr={u'legality': u'ILLEGAL'}) from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:48)
2019-02-20 12:07:39,059+1000 INFO  (jsonrpc/6) [vdsm.api] FINISH sdm_update_volume return=None from=::1,58638, task_id=646deaa2-cc60-4cc6-b843-afcbdd5a0848 (api:54)
2019-02-20 12:07:39,059+1000 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call SDM.update_volume succeeded in 0.00 seconds (__init__:312)
2019-02-20 12:07:39,060+1000 INFO  (tasks/4) [storage.ThreadPool.WorkerThread] START task 646deaa2-cc60-4cc6-b843-afcbdd5a0848 (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f3865677e60>>, args=None) (threadPool:208)
2019-02-20 12:07:39,060+1000 INFO  (tasks/4) [root] Running job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52'... (jobs:183)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.Image] There is no leaf in the image da28450c-33c5-495c-9780-20b94f6375c3 (image:221)
2019-02-20 12:07:39,185+1000 WARN  (tasks/4) [storage.ResourceManager] Resource factory failed to create resource '01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3'. Canceling request. (resourceManager:544)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 540, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 222, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
ImageIsNotLegalChain: Image is not a legal chain: ('da28450c-33c5-495c-9780-20b94f6375c3',)
2019-02-20 12:07:39,185+1000 WARN  (tasks/4) [storage.ResourceManager.Request] (ResName='01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0.da28450c-33c5-495c-9780-20b94f6375c3', ReqID='05ca7896-b4a8-48b8-aade-8199751e67a4') Tried to cancel a processed request (resourceManager:188)
2019-02-20 12:07:39,185+1000 ERROR (tasks/4) [storage.guarded] Error acquiring lock <ResourceManagerLock ns=01_img_603fd774-06a5-4590-a152-f0ff3dd53fd0, name=da28450c-33c5-495c-9780-20b94f6375c3, mode=exclusive at 0x7f38655c32d0> (guarded:96)
2019-02-20 12:07:39,186+1000 ERROR (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/update_volume.py", line 39, in _run
    with guarded.context(self._endpoint.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1001, in acquire
    res = acquireResource(self.ns, self.name, self.mode)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 476, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2019-02-20 12:07:39,194+1000 INFO  (tasks/4) [root] Job u'9e71e4c7-fc1f-49c5-9d17-b79a32bf3e52' will be deleted in 3600 seconds (jobs:249)

Comment 1 Germano Veit Michel 2019-02-20 02:21:08 UTC
This was discussed in https://gerrit.ovirt.org/#/c/93260/

Comment 7 Nir Soffer 2020-09-01 09:27:34 UTC
Vdsm will not work with an illegal image chain. This requires a manual
fix outside of vdsm.

This is not a bug but a feature. What you need is fsck for oVirt
storage; vdsm is currently not that tool.

I think it makes sense that vdsm will implement checking and repairing
disks, but it cannot be done via normal APIs like SDM.update_volume.
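
A minimal sketch of what a read-only, fsck-style chain check might look like is given here for illustration only. It is not part of vdsm: `check_image_chain` is a hypothetical function, volumes are modelled as dicts with `id`, `parent`, and `voltype` keys, and a base volume is assumed to have `parent` set to `None`.

```python
def check_image_chain(volumes):
    """Return a list of problems found in a volume chain (read-only)."""
    problems = []
    by_id = {v["id"]: v for v in volumes}
    leaves = [v for v in volumes if v["voltype"] == "LEAF"]
    bases = [v for v in volumes if v["parent"] is None]

    if len(leaves) != 1:
        problems.append("expected 1 leaf, found %d" % len(leaves))
    if len(bases) != 1:
        problems.append("expected 1 base volume, found %d" % len(bases))
    for v in volumes:
        if v["parent"] is not None and v["parent"] not in by_id:
            problems.append("volume %s has unknown parent %s"
                            % (v["id"], v["parent"]))

    if len(leaves) == 1:
        # Walk from the leaf towards the base; stop on a cycle or a gap.
        seen = set()
        cur = leaves[0]
        while cur is not None and cur["id"] not in seen:
            seen.add(cur["id"])
            cur = by_id.get(cur["parent"])
        if len(seen) != len(volumes):
            problems.append("%d volume(s) not reachable from the leaf"
                            % (len(volumes) - len(seen)))
    return problems
```

The function only reports problems and changes nothing, which keeps checking separate from any repair step.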

Comment 8 Nir Soffer 2020-09-08 09:22:39 UTC
Tal, why is this targeted to 4.3.0? We don't plan to change this behavior.

I suggest closing this as WONTFIX.

Comment 12 Red Hat Bugzilla 2024-01-06 04:26:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

