Bug 1458061 - Inconsistent JSON output from --fullreport
Summary: Inconsistent JSON output from --fullreport
Keywords:
Status: ASSIGNED
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.171
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Zdenek Kabelac
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks: 1460161
 
Reported: 2017-06-01 22:49 UTC by Tony Asleson
Modified: 2024-04-27 05:52 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1460161 (view as bug list)
Environment:
Last Closed:
Embargoed:
rule-engine: lvm-technical-solution?
rule-engine: lvm-test-coverage?


Attachments
Output of JSON that dbus service throws exception on. (9.15 KB, text/plain)
2017-06-01 22:49 UTC, Tony Asleson
no flags Details
Output of -vvvvv for fullreport (36.77 KB, text/plain)
2017-06-02 21:14 UTC, Tony Asleson
no flags Details
Debug output of fullreport that corresponds to lvmdumps (37.44 KB, text/plain)
2017-06-08 18:20 UTC, Tony Asleson
no flags Details
lvmdump -a -m before merge starts (1.54 MB, application/x-gzip)
2017-06-08 18:21 UTC, Tony Asleson
no flags Details
lvmdump -a -m during merge and failure (1.54 MB, application/x-gzip)
2017-06-08 18:22 UTC, Tony Asleson
no flags Details

Description Tony Asleson 2017-06-01 22:49:07 UTC
Created attachment 1284258 [details]
Output of JSON that dbus service throws exception on.

Description of problem:

When running the lvmdbusd unit tests, the snapshot merge test produces JSON output that is invalid.  It has multiple entries for the same LV, and it contains LV UUIDs that reference non-existent LVs.

The lvm dbus service throws an exception: it looks up an LV by a UUID that doesn't exist and gets a KeyError from the hash.


Version-Release number of selected component (if applicable):
 LVM version:     2.02.171(2) (2017-05-03)
 Library version: 1.02.140 (2017-05-03)

How reproducible:
100%

Steps to Reproduce:
1. Run the lvm dbus service and have some PVs present
2. Run ./lvmdbustest.py -v TestDbusService.test_snapshot_merge

Actual results:
Unit test hangs; an exception from the service appears in syslog.

Expected results:
Unit test runs with no errors


Additional info:

Comment 1 Tony Asleson 2017-06-01 22:50:21 UTC
Service exception:


Traceback (most recent call last):
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/fetch.py", line 129, in update_thread
    need_main_thread)
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/fetch.py", line 43, in load
    cfg.db.refresh(log)
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/lvmdb.py", line 386, in refresh
    _lvs, _lvs_in_vgs, _lvs_hidden, _lvs_lookup = self._parse_lvs_json(a)
  File "/home/tasleson/projects/lvm2/daemons/lvmdbusd/lvmdb.py", line 246, in _parse_lvs_json
    r = c_lvs[s['lv_uuid']]
KeyError: 'X85cFy-wzLP-v3GF-7uyf-Yk6g-cy72-mXMNQK'


That LV UUID does not exist in the JSON.
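
For illustration, a minimal sketch of the failing pattern (variable and field names here are assumptions, not the actual lvmdbusd code): the parser builds a dict keyed by lv_uuid and then indexes it with UUIDs taken from other report rows, so a dangling reference raises KeyError.  A defensive lookup would surface the inconsistency instead of crashing:

# Illustrative sketch only -- the data and names are hypothetical,
# not the real lvmdbusd code or real fullreport output.
lv_rows = [{'lv_uuid': 'aaaa-1111', 'lv_name': 'lvol0'}]
other_rows = [{'lv_uuid': 'X85c-does-not-exist'}]

c_lvs = {lv['lv_uuid']: lv for lv in lv_rows}

for s in other_rows:
    # Plain indexing (c_lvs[s['lv_uuid']]) raises KeyError on a dangling
    # reference; .get() lets the consumer report the inconsistency instead.
    lv = c_lvs.get(s['lv_uuid'])
    if lv is None:
        print("report references unknown LV uuid %s" % s['lv_uuid'])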

Comment 2 Alasdair Kergon 2017-06-01 23:10:27 UTC
Please supply the actual metadata from each of the disks.

Is lvmetad in use?

run the same command with -vvvv and attach the output.

Comment 3 Zdenek Kabelac 2017-06-02 08:00:33 UTC
I think the important thing here is the actual use-case.

When the user runs 'merge', the merged LV no longer exists for the user (it replaces the origin) - a standard lvm2 user command has no way to address this LV.

However, internally, lvm2 surely maintains the 'origin + merged' LV until the merge is finished.

The situation is even more 'fuzzy', as the 'merge' might be postponed and only physically started later, e.g. after a reboot & reactivation.

So is the report showing an LV which should already be hidden?

Comment 4 Tony Asleson 2017-06-02 21:14:32 UTC
Created attachment 1284531 [details]
Output of -vvvvv for fullreport

Comment 5 Tony Asleson 2017-06-02 21:17:07 UTC
(In reply to Alasdair Kergon from comment #2)
> Please supply the actual metadata from each of the disks.

Please elaborate on what you want and how to retrieve it.

> Is lvmetad in use?

Yes

> run the same command with -vvvv and attach the output.

Done


This is not a small race window.  The lvm dbus code now loops in this condition trying to get a good read from lvm and we looped 96 times before we got usable JSON.

Something has changed, as we've never seen this before and it's very easy to hit.
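
As a rough illustration of that retry behaviour, a minimal sketch (assumptions, not the actual lvmdbusd code: the 'lvm fullreport --reportformat json' invocation and the consistency predicate are placeholders for whatever the service really does):

import json
import subprocess

def fetch_report():
    # Assumes 'lvm fullreport --reportformat json' is available; run as root.
    out = subprocess.check_output(
        ['lvm', 'fullreport', '--reportformat', 'json'])
    return json.loads(out.decode('utf-8'))

def fetch_until_consistent(is_consistent, max_attempts=100):
    # Keep re-reading until the caller-supplied check passes; per the
    # comment above, 96 attempts were needed before usable JSON came back.
    report = None
    for _ in range(max_attempts):
        report = fetch_report()
        if is_consistent(report):
            break
    return report

# The real check would reject reports with duplicate LV rows or with
# lv_uuid references that match no LV entry, e.g.:
# report = fetch_until_consistent(lambda r: my_consistency_check(r))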

Comment 6 Alasdair Kergon 2017-06-08 14:50:59 UTC
(In reply to Tony Asleson from comment #5)
> (In reply to Alasdair Kergon from comment #2)
> > Please supply the actual metadata from each of the disks.
> 
> Please elaborate on what you want and how to retrieve it.

If the metadata is completely consistent, run

vgcfgbackup <vgname>

which should write a new file /etc/lvm/backup/<vgname>; attach that file.

(Alternatively, or if that doesn't show the full state of the system at the time, just use lvmdump with -m and -a.)

Comment 7 Tony Asleson 2017-06-08 18:20:16 UTC
Created attachment 1286206 [details]
Debug output of fullreport that corresponds to lvmdumps

Debug output -vvvvv of fullreport when reproducing and collecting output from lvmdump -a -m

Comment 8 Tony Asleson 2017-06-08 18:21:17 UTC
Created attachment 1286207 [details]
lvmdump -a -m before merge starts

Comment 9 Tony Asleson 2017-06-08 18:22:02 UTC
Created attachment 1286208 [details]
lvmdump -a -m during merge and failure

Comment 10 Tony Asleson 2017-06-08 18:27:59 UTC
Comment on attachment 1286208 [details]
lvmdump -a -m during merge and failure

As soon as we detected this inconsistent JSON, the lvmdump command was run.  However, the dump took so long that by the time it finished and we ran fullreport again, the issue had resolved itself.  Thus I cannot be certain whether the dump contains data from during the bug or from immediately after the merge had completed.

Comment 11 Alasdair Kergon 2017-06-08 19:59:52 UTC
 (In reply to Tony Asleson from comment #10)
> Thus I cannot be certain that the dump
> contains data from during the bug or immediately after the merge had
> completed.

The beauty of lvmdump is that, assuming sensible configuration, it doesn't matter - even if run after the problem has gone away, it should have collected footprints of all the intermediate stages of the operations in a form that lets us quickly set up an identical environment to replicate what's happening.

Comment 12 Alasdair Kergon 2017-06-08 22:53:55 UTC
Recent regression.

Old code:
lvs -a:

  LV      VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices       
  lvol0   vg1  Owi-a-s--- 20.00m                                                     /dev/loop3(0) 
  [lvol1] vg1  Swi-a-s--- 20.00m      lvol0  0.00                                    /dev/loop4(0) 


new code:

  LV    VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices       
  lvol0 vg1 Owi-a-s--- 20.00m             0.00                                                  
  lvol0 vg1 Owi-a-s--- 20.00m             0.00                                    /dev/loop4(0)

Comment 13 Alasdair Kergon 2017-06-09 01:12:13 UTC
Due to this commit:

commit 6fd20be629b6dceb96702c5f875a2e612e6f465d
Author: Zdenek Kabelac <zkabelac>
Date:   Mon Dec 5 14:31:25 2016 +0100

    activation: lv_info_with_seg_status API change
    
    Drop LV from passed API arg - it's always segment being checked.
    Also use_layer is now in full control of lv_info_with_seg_status().
    It decides which device needs to be checked to get 'the most info'.
    
    TODO: future version should be able to expose status from


Sometimes the status comes from a different LV than the rest of the data - it is incorrect to change it in a way that affects the caller!

-                       lv = origin_from_cow(lv);
+                       status->lv = lv = origin_from_cow(lv);

Compare:

lvs -a -o+name
  LV   
  lvol0
  [lvol1]


lvs -a -o+name,attr
  LV    Attr      
  lvol0 Owi-a-s---
  lvol0 Owi-a-s---

The output changes according to which columns you select!

Comment 15 Zdenek Kabelac 2017-06-09 19:52:45 UTC
Restoring some reported info for old snapshot with:

https://www.redhat.com/archives/lvm-devel/2017-June/msg00027.html

Comment 16 Alasdair Kergon 2017-06-11 21:10:50 UTC
Not quite there yet:

Old:

# lvs -a -ovg_name,name,snapshot_invalid vg1
  VG   LV      SnapInvalid     
  vg1  lvol0            unknown
  vg1  [lvol1] snapshot invalid

New:

# lvs -a -ovg_name,name,snapshot_invalid vg1
  VG  LV      SnapInvalid    
  vg1 lvol0           unknown
  vg1 [lvol1]         unknown

Comment 17 Alasdair Kergon 2017-06-11 21:36:59 UTC
(that was when inactive)

When active:

old:

  VG   LV      SnapInvalid    
  vg1  lvol0           unknown
  vg1  [lvol1]  

new:

  VG  LV      SnapInvalid    
  vg1 lvol0                  
  vg1 [lvol1]         


Now what are the most appropriate output values for both LVs in both cases - and in the case where the snapshot was actually invalid?  I'm not sure that either the old or the new code gets this quite right.

Comment 18 Zdenek Kabelac 2017-06-12 09:56:23 UTC
When the snapshot merge is 'running', there is IMHO no snapshot anymore to be reported.

Should we report 'unknown'  or 'empty' field when we know it's not a snapshot?

Comment 19 Alasdair Kergon 2017-06-12 16:58:24 UTC
So the latest proposal is:

  only show SnapInvalid as "snapshot invalid" when it is known actually to be a snapshot that is invalid - exactly the same as happens for the I attribute.

  show it as "unknown" for the merged origin (lvol0) when inactive

  show it blank for the merging snapshot (lvol1) both active and inactive


This restores the old behaviour with one change - to show blank instead of 'snapshot invalid' for the inactive lvol1.

Comment 20 Alasdair Kergon 2017-06-12 16:59:41 UTC
(In reply to Zdenek Kabelac from comment #18)

> Should we report 'unknown'  or 'empty' field when we know it's not a
> snapshot?

empty

