Created attachment 1284258 [details]
Output of JSON that dbus service throws exception on.

Description of problem:
When running the lvmdbusd unit tests, the snapshot merge test produces JSON output that is invalid. It has multiple entries for the same LV, and it contains LV UUIDs that reference non-existent LVs. The lvm dbus service throws an exception because it looks up an LV with a UUID that doesn't exist and gets a KeyError on the hash.

Version-Release number of selected component (if applicable):
  LVM version:     2.02.171(2) (2017-05-03)
  Library version: 1.02.140 (2017-05-03)

How reproducible:
100%

Steps to Reproduce:
1. Run the lvm dbus service with some PVs present
2. Run ./lvmdbustest.py -v TestDbusService.test_snapshot_merge

Actual results:
Unit test hangs; an exception from the service appears in syslog.

Expected results:
Unit test runs with no errors

Additional info:
Service exception:

Traceback (most recent call last):
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/fetch.py", line 129, in update_thread
    need_main_thread)
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/fetch.py", line 43, in load
    cfg.db.refresh(log)
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/lvmdb.py", line 386, in refresh
    _lvs, _lvs_in_vgs, _lvs_hidden, _lvs_lookup = self._parse_lvs_json(a)
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/lvmdb.py", line 246, in _parse_lvs_json
    r = c_lvs[s['lv_uuid']]
KeyError: 'X85cFy-wzLP-v3GF-7uyf-Yk6g-cy72-mXMNQK'

That LV UUID does not exist in the JSON.
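For context, the failing line looks up a segment's lv_uuid in a hash of LVs built from the same report. A minimal sketch of a defensive version of that lookup, assuming the data shapes implied by the traceback (c_lvs keyed by lv_uuid, seg being one parsed segment record) - illustrative only, not the actual lvmdbusd code:

    class InconsistentReport(Exception):
        """The JSON referenced an LV UUID that is not present in the report."""

    def lookup_lv(c_lvs, seg):
        # c_lvs: dict mapping lv_uuid -> LV record; seg: one parsed segment.
        # Both shapes are assumed from the traceback above.
        lv = c_lvs.get(seg['lv_uuid'])
        if lv is None:
            # Surface the inconsistency explicitly so the caller can
            # re-fetch the report instead of dying with a bare KeyError.
            raise InconsistentReport(seg['lv_uuid'])
        return lv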
Please supply the actual metadata from each of the disks.

Is lvmetad in use?

Run the same command with -vvvv and attach the output.
I think the important thing here is the actual use-case.

When a user runs 'merge', the merged LV no longer exists for the user (it replaces the origin) - a standard lvm2 user command has no way to address this LV.

However - internally - lvm2 surely maintains the 'origin + merged' LV until the merge is finished.

The situation is even more 'fuzzy', as the 'merge' might be postponed and even physically started later, e.g. after a reboot & reactivation.

So is the report showing an LV which should already be hidden?
Created attachment 1284531 [details] Output of -vvvvv for fullreport
(In reply to Alasdair Kergon from comment #2)
> Please supply the actual metadata from each of the disks.

Please elaborate on what you want and how to retrieve it.

> Is lvmetad in use?

Yes

> run the same command with -vvvv and attach the output.

Done

This is not a small race window. The lvm dbus code now loops in this condition trying to get a good read from lvm and we looped 96 times before we got usable JSON. Something has changed as we've never seen this before and it's very easy to hit.
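To make that looping behaviour concrete, here is a rough sketch of a bounded retry around fetching and parsing the report. The names fetch_report and parse_report are hypothetical stand-ins, not the real lvmdbusd entry points:

    import time

    def refresh_with_retry(fetch_report, parse_report, max_attempts=200, delay=0.1):
        # fetch_report() returns the raw JSON report; parse_report() builds the
        # LV tables and raises KeyError on a dangling lv_uuid, as in the
        # traceback above. Both are assumed callables, not real lvm2 APIs.
        for attempt in range(1, max_attempts + 1):
            try:
                return parse_report(fetch_report())
            except KeyError:
                # Inconsistent report: an LV UUID was referenced but not
                # defined. Back off briefly and query lvm again.
                time.sleep(delay)
        raise RuntimeError("no consistent lvm report after %d attempts" % max_attempts)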
(In reply to Tony Asleson from comment #5)
> (In reply to Alasdair Kergon from comment #2)
> > Please supply the actual metadata from each of the disks.
>
> Please elaborate on what you want and how to retrieve it.

If the metadata is completely consistent, run vgcfgbackup <vgname>, which should write a new file /etc/lvm/backup/<vgname>, and attach that.

(Alternatively, or if that doesn't show the full state of the system at the time, just use lvmdump with -m and -a.)
Created attachment 1286206 [details]
Debug output of fullreport that corresponds to lvmdumps

Debug output -vvvvv of fullreport when reproducing and collecting output from lvmdump -a -m
Created attachment 1286207 [details] lvmdump -a -m before merge starts
Created attachment 1286208 [details] lvmdump -a -m during merge and failure
Comment on attachment 1286208 [details]
lvmdump -a -m during merge and failure

As soon as we detected this inconsistent JSON, the lvmdump command was run. However, the dump took so long that by the time it was done and we ran the full report again, the issue was resolved. Thus I cannot be certain that the dump contains data from during the bug or immediately after the merge had completed.
(In reply to Tony Asleson from comment #10)
> Thus I cannot be certain that the dump
> contains data from during the bug or immediately after the merge had
> completed.

The beauty of lvmdump is that, assuming sensible configuration, it doesn't matter - even if run after the problem has gone away, it should have collected footprints of all the intermediate stages of the operations in a form that lets us quickly set up an identical environment to replicate what's happening.
Recent regression.

Old code:

lvs -a:
  LV      VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lvol0   vg1  Owi-a-s--- 20.00m                                                     /dev/loop3(0)
  [lvol1] vg1  Swi-a-s--- 20.00m      lvol0  0.00                                    /dev/loop4(0)

new code:
  LV      VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lvol0   vg1  Owi-a-s--- 20.00m              0.00
  lvol0   vg1  Owi-a-s--- 20.00m              0.00                                   /dev/loop4(0)
Due to this commit:

commit 6fd20be629b6dceb96702c5f875a2e612e6f465d
Author: Zdenek Kabelac <zkabelac>
Date:   Mon Dec 5 14:31:25 2016 +0100

    activation: lv_info_with_seg_status API change

    Drop LV from passed API arg - it's always segment being checked.
    Also use_layer is now in full control of lv_info_with_seg_status().
    It decides which device needs to be checked to get 'the most info'.

    TODO: future version should be able to expose status from

Sometimes the status comes from a different LV from the rest of the data - it is incorrect to change it, affecting the caller!

-		lv = origin_from_cow(lv);
+		status->lv = lv = origin_from_cow(lv);

Compare:

lvs -a -o+name
  LV
  lvol0
  [lvol1]

lvs -a -o+name,attr
  LV    Attr
  lvol0 Owi-a-s---
  lvol0 Owi-a-s---

The output changes according to which columns you select!
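As an illustration of why the output depends on the selected columns, here is a rough Python analogy of the aliasing problem (hypothetical names, not lvm2 code): the status query rewrites the 'lv' field of the shared status object, so anything rendered afterwards sees the origin instead of the snapshot - and the rewrite only happens when a status column is among those selected.

    class SegStatus:
        # Stand-in for the C struct whose 'lv' member the commit started
        # overwriting; purely illustrative.
        def __init__(self, lv_name):
            self.lv = lv_name

    def query_status(status):
        origin = "lvol0"        # analogue of origin_from_cow(lv)
        status.lv = origin      # the regression: caller-visible state is changed
        return origin

    status = SegStatus("lvol1")
    query_status(status)        # only runs when a status column (e.g. Attr/Data%) is selected
    print(status.lv)            # prints "lvol0" - later columns now report the wrong LV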
Restoring some reported info for old snapshot with: https://www.redhat.com/archives/lvm-devel/2017-June/msg00027.html
Not quite there yet:

Old:
# lvs -a -ovg_name,name,snapshot_invalid vg1
  VG  LV      SnapInvalid
  vg1 lvol0   unknown
  vg1 [lvol1] snapshot invalid

New:
# lvs -a -ovg_name,name,snapshot_invalid vg1
  VG  LV      SnapInvalid
  vg1 lvol0   unknown
  vg1 [lvol1] unknown
(that was when inactive)

When active:

old:
  VG  LV      SnapInvalid
  vg1 lvol0   unknown
  vg1 [lvol1]

new:
  VG  LV      SnapInvalid
  vg1 lvol0
  vg1 [lvol1]

Now what are the most appropriate output values for both LVs in both cases - and in the case where the snapshot was actually invalid? I'm not sure that either the old or the new gets this quite right.
When the snapshot merge is 'running' - there is IMHO no snapshot anymore to be reported. Should we report 'unknown' or 'empty' field when we know it's not a snapshot?
So the latest proposal is (see the sketch below):

- Only show SnapInvalid as "snapshot invalid" when it is actually known to be a snapshot that is invalid - exactly the same as happens for the 'I' attribute.
- Show it as "unknown" for the merged origin (lvol0) when inactive.
- Show it blank for the merging snapshot (lvol1), both active and inactive.

This restores the old behaviour with one change - showing blank instead of 'snapshot invalid' for the inactive lvol1.
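A minimal sketch of those rules, using a hypothetical helper (snap_invalid_field) rather than the actual lvm2 reporting code; cases not covered by the proposal simply fall through to blank:

    def snap_invalid_field(is_snapshot, is_invalid, is_merging_origin, is_active):
        # Hypothetical helper expressing the proposal above; not lvm2 code.
        if is_snapshot and is_invalid:
            return "snapshot invalid"   # only for a known-invalid snapshot
        if is_merging_origin and not is_active:
            return "unknown"            # merged origin (lvol0) when inactive
        return ""                       # merging snapshot (lvol1): blank, active or not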
(In reply to Zdenek Kabelac from comment #18)
> Should we report 'unknown' or 'empty' field when we know it's not a
> snapshot?

empty