Bug 1242623
| Summary: | thin_[check\|dump] is unable to open meta data device | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Joe Thornber <thornber> |
| lvm2 sub component: | Thin Provisioning | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | agk, cmarthal, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, zkabelac |
| Version: | 7.2 | Keywords: | Regression, TestBlocker, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-25 13:39:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Corey Marthaler, 2015-07-13 18:58:34 UTC)
The thin-pool itself has the metadata device open and in use. You need to deactivate VG/POOL.

(In reply to Corey Marthaler from comment #0)

```
> Description of problem:
> [root@host-110 ~]# pvscan
>   PV /dev/vda2   VG rhel_host-110   lvm2 [7.51 GiB / 40.00 MiB free]
>   PV /dev/sda1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sdb1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sdc1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sdd1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sde1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sdf1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sdg1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   PV /dev/sdh1   VG VG              lvm2 [24.99 GiB / 24.99 GiB free]
>   Total: 9 [207.45 GiB] / in use: 9 [207.45 GiB] / in no VG: 0 [0 ]
>
> [root@host-110 ~]# lvcreate --thinpool POOL --zero n -L 500M --poolmetadatasize 4M VG
>   Logical volume "POOL" created.
>
> [root@host-110 ~]# lvs -a -o +devices
>   LV              VG  Attr       LSize   Pool Origin Data%  Meta%  Devices
>   POOL            VG  twi-a-t--- 500.00m             0.00   0.88   POOL_tdata(0)
>   [POOL_tdata]    VG  Twi-ao---- 500.00m                           /dev/sda1(1)
>   [POOL_tmeta]    VG  ewi-ao----   4.00m                           /dev/sdh1(0)
>   [lvol0_pmspare] VG  ewi-------   4.00m                           /dev/sda1(0)
>
> [root@host-110 ~]# thin_check /dev/mapper/VG-POOL_tmeta
> syscall 'open' failed: Device or resource busy
```

The thin-pool (VG/POOL) has the metadata device open and in use. You need to deactivate VG/POOL. This should probably be closed as NOTABUG, but I'll give you the benefit of the doubt and leave it open for now... Please elaborate on why you think this is a bug.

In 6.7 (and 7.1 for that matter) this worked, assuming other pool metadata operations weren't being run simultaneously. Plus, with the volume inactive, what /dev device is present to be checked?

```
# RHEL6.7
[root@mckinley-01 ~]# pvscan
  PV /dev/mapper/mpathbp1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/mapper/mpathcp1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/mapper/mpathdp1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/mapper/mpathep1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/mapper/mpathfp1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/mapper/mpathgp1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/mapper/mpathhp1   VG VG              lvm2 [249.99 GiB / 249.99 GiB free]
  PV /dev/sda2              VG vg_mckinley01   lvm2 [557.26 GiB / 0    free]
  Total: 8 [2.25 TiB] / in use: 8 [2.25 TiB] / in no VG: 0 [0 ]

[root@mckinley-01 ~]# lvcreate --thinpool POOL --zero n -L 500M --poolmetadatasize 4M VG
  Logical volume "POOL" created.

[root@mckinley-01 ~]# lvs -a -o +devices
  LV              VG  Attr       LSize   Pool Origin Data%  Meta%  Devices
  POOL            VG  twi-a-t--- 500.00m             0.00   0.88   POOL_tdata(0)
  [POOL_tdata]    VG  Twi-ao---- 500.00m                           /dev/mapper/mpathbp1(1)
  [POOL_tmeta]    VG  ewi-ao----   4.00m                           /dev/mapper/mpathhp1(0)
  [lvol0_pmspare] VG  ewi-------   4.00m                           /dev/mapper/mpathbp1(0)

[root@mckinley-01 ~]# thin_check /dev/mapper/VG-POOL_tmeta
examining superblock
examining devices tree
examining mapping tree

[root@mckinley-01 ~]# lvchange -an VG/POOL
[root@mckinley-01 ~]# thin_check /dev/mapper/VG-POOL_tmeta
Couldn't stat dev path
```

(In reply to Corey Marthaler from comment #2)

> In 6.7 (and 7.1 for that matter) this worked, assuming other pool meta data
> operations weren't being run simultaneously. Plus, with the volume inactive,
> what /dev device is present to be checked?

Shouldn't you be using lvm to initiate the check? e.g. `lvconvert --repair VG/POOL` (because yeah, only lvm knows/manages/activates the _hidden_ metadata device).

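
A rough sketch of that lvm-driven flow, assuming the VG/POOL names used in this report (the thin pool has to be inactive before lvm will attempt the metadata repair, and the exact messages printed vary by release):

```
# Sketch only: the VG/POOL names are taken from this report.
lvchange -an VG/POOL          # the pool must be inactive before a repair
lvconvert --repair VG/POOL    # lvm drives the metadata repair and swaps in the spare
lvchange -ay VG/POOL          # reactivate the pool once the repair succeeds
```
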

To access the _tmeta content, the user has to deactivate the thin pool first, 'swap' the _tmeta device with another LV, activate that LV, and then thin_check it. Neither LVM2 nor DM wants to allow parallel access to a live _tmeta device (it could only give you misleading data). So this newly enforced protection within the kernel and the thin_* tools is there to protect users from taking bad steps (i.e. exploring a live device). A future LVM2 version may support the 'snapshot' feature of the thin-pool target for accessing live _tmeta content.

'Swapping':

```
# create some tmp LV
lvcreate -L2 -n tmp vg

# swap the tmp LV with the tmeta of the inactive thin-pool
lvconvert --thinpool vg/pool --poolmetadata tmp

# activate the 'tmp' LV, which now carries the content of _tmeta
lvchange -ay vg/tmp

# thin_check it
thin_check /dev/vg/tmp

# deactivate & swap back
lvchange -an vg/tmp
lvconvert --thinpool vg/pool --poolmetadata tmp

# use the pool again
```

These steps are considered to be for 'skilled' users. For the remaining cases, 'lvconvert --repair' is the way to go, and we are slowly extending the cases which --repair can handle.

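
The same swapped-metadata approach works for thin_dump as well. A minimal sketch, assuming the vg/tmp LV from the steps above and an illustrative output path:

```
# Sketch only: vg/tmp comes from the swap steps above, and the output
# path is illustrative. With the pool inactive and its metadata swapped
# onto vg/tmp, the offline copy can be dumped and checked safely.
lvchange -ay vg/tmp
thin_dump -f xml /dev/vg/tmp > /tmp/pool_metadata.xml
thin_check /dev/vg/tmp
lvchange -an vg/tmp
```
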

This bug can be closed if we are all in agreement on the following: if a user wants to check the current state of their pool metadata device, even if they have quiesced all activity to that pool volume, no thin_check of the live device is allowed after 7.1/6.7; instead, the user needs to follow the steps in comment #4. I'll change what my tests do and file a new bug about having a better error when attempting to check a live pool metadata device.

Also, I've got a dumb question in response to comment #3... You can only run 'lvconvert --repair VG/POOL' on a non-corrupted pool metadata device, correct? If your pool metadata device is corrupt, then you're out of luck? There's no way to repair/swap in a new metadata device based on the kernel's current view of the metadata if you didn't already know to have restored the metadata to a tmp device?

(In reply to Corey Marthaler from comment #6)

> Also, I've got a dumb question in response to comment #3...
>
> You can only run 'lvconvert --repair VG/POOL' on a non corrupted
> poolmetadata device correct? If your poolmetadata device is corrupt, then
> you're out of luck?
>
> There's no way to repair/swap in a new metadata device based on the kernel's
> current view of the metadata if you didn't already know to have restored the
> metadata to a tmp device?

No idea if I'm dense or something, but your question really does seem "dumb". The entire point is to be able to repair broken ("corrupt") metadata. So I have no idea why you think 'lvconvert --repair VG/POOL' is only for perfectly healthy metadata. There is a spare metadata area (there should be anyway, by default) that is intended to allow for repairing the corrupt metadata. Once repaired, lvm2 pivots to the repaired metadata. :)

It doesn't seem to work for me unless it's healthy.

```
[root@host-115 ~]# lvcreate --thinpool POOL --zero y -L 1G --poolmetadatasize 4M VG
  Logical volume "POOL" created.

[root@host-115 ~]# lvs -a -o +devices
  LV              VG  Attr       LSize  Pool Origin Data%  Meta%  Devices
  POOL            VG  twi-a-tz--  1.00g             0.00   0.98   POOL_tdata(0)
  [POOL_tdata]    VG  Twi-ao----  1.00g                           /dev/sda1(1)
  [POOL_tmeta]    VG  ewi-ao----  4.00m                           /dev/sdh1(0)
  [lvol0_pmspare] VG  ewi-------  4.00m                           /dev/sda1(0)

[root@host-115 ~]# dd if=/dev/urandom of=/dev/mapper/VG-POOL_tmeta count=1 bs=1 skip=2
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.00100156 s, 1.0 kB/s

[root@host-115 ~]# lvchange -an VG/POOL
  WARNING: Integrity check of metadata for pool VG/POOL failed.

[root@host-115 ~]# lvconvert --yes --repair VG/POOL
bad checksum in superblock
  Repair of thin metadata volume of thin pool VG/POOL failed (status:1). Manual repair required!
```

The "corruption" case you're focusing on is a bit of a bizarre/simple case. If the metadata's superblock is modified, I'm not sure the tools can handle fixing that (probably can't). The only way to recover from a bad superblock is to write multiple copies and recover from one of the backups... filesystems do stuff like that AFAIK. Anyway, IIRC Joe has tools for breaking metadata in a more controlled and fixable way.

That raises the question: do we want to write multiple copies of the superblock, so that a repair operation would be able to find one in the event that the original was corrupted?

Just a comment from the lvm2 POV. The only thing lvm2 does in this case is run `thin_repair -i -o`. So about the only thing it repairs for now is basically clearing the NEEDS_CHECK flag and fixing some 'minor' byte loss in the metadata. There is absolutely no checking/matching between the lvm2 metadata and the kernel pool metadata, so don't expect any sort of fixes from the lvm2 side for now. AFAIK thin_repair is not yet able to deal with the loss of root btree nodes.

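
For illustration, a hedged sketch of that underlying invocation; the input and output paths below are assumptions based on this report's LV names, not something copied from lvm2's output:

```
# Sketch only: device paths are assumed from this report's LV names.
# thin_repair reads the damaged metadata and writes a cleaned-up copy to
# the spare LV, which lvm2 then swaps in as the pool's new _tmeta.
thin_repair -i /dev/mapper/VG-POOL_tmeta -o /dev/mapper/VG-lvol0_pmspare
```
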

thin_dump doesn't work either.

```
[root@host-115 ~]# lvs -a -o +devices
  LV              Attr       LSize  Pool Origin Data%  Meta%  Devices
  POOL            twi---tz--  1.00g             0.00   1.56   POOL_tdata(0)
  [POOL_tdata]    Twi-ao----  1.00g                           /dev/sdc1(1)
  [POOL_tmeta]    ewi-ao----  4.00m                           /dev/sde1(0)
  [lvol0_pmspare] ewi-------  4.00m                           /dev/sdc1(0)
  newtmeta        -wi-a-----  8.00m                           /dev/sdc1(257)
  origin          Vwi-a-tz--  1.00g POOL         0.00
  other1          Vwi-a-tz--  1.00g POOL         0.00
  other2          Vwi-a-tz--  1.00g POOL         0.00
  other3          Vwi-a-tz--  1.00g POOL         0.00
  other4          Vwi-a-tz--  1.00g POOL         0.00
  other5          Vwi-a-tz--  1.00g POOL         0.00
  snap            Vwi-a-tz-k  1.00g POOL origin  0.00

[root@host-115 ~]# thin_dump /dev/mapper/snapper_thinp-POOL_tmeta > /tmp/snapper_thinp_dump.1376.28760
syscall 'open' failed: Device or resource busy

[root@host-115 ~]# thin_dump -f human_readable /dev/mapper/snapper_thinp-POOL_tmeta
syscall 'open' failed: Device or resource busy
```

(In reply to Jonathan Earl Brassow from comment #11)

> That raises the question, do we want to write multiple copies of the
> superblock that a repair operation would be able to find in the event that
> the original was corrupted?

I'm not sure. I split the metadata off into a separate device so people could provide resilience via raid, rather than me duplicating high-level nodes in the btree (the superblock is effectively the root of all the btrees). But this doesn't protect against user error that wipes the superblock, or indeed against a kernel bug that wipes the superblock. Any thoughts, Alasdair?

Having multiple copies of the super block would be a nice feature in the future. However, this bug is currently about how all thin_* commands (even just thin_dump) no longer work on the "live" pool metadata volume when they used to in past releases.

In comment #4 Zdenek mentions that for all thin_* operations you now need to deactivate the pool volume, create a tmp volume, swap the tmp volume in for the metadata volume (a command that has now actually "corrupted" your thin pool), activate the tmp volume (which still contains the valid pool metadata), do whatever dumps/checks you'd like, and then swap that volume back in as the pool metadata volume before reactivating.

So, is comment #4 the new correct procedure? If so, then we (A) need a better error than just "syscall 'open' failed:" when this is attempted on the live metadata device, and (B) need to document this carefully everywhere. Take the man page for thin_dump: "thin_dump - dump thin provisioning metadata from device or file to standard output". It doesn't say "dump thin provisioning metadata from a former metadata device that has been deactivated and then swapped onto a tmp volume not currently associated with the origin pool volume". The same goes for the thin_check man page.

If, however, comment #4 is not the new correct procedure, then when I run thin_dump on the live metadata device (as the thin_dump man page implies I can), I should be able to see the thin metadata and not a "syscall 'open' failed" warning.

Looks like this is another version of bugs 1038387/1023828. You must never ask userland to examine metadata that is potentially changing (i.e. for an active pool). This has always been the case; the recent patch just started enforcing it using a Linux-only extension to the O_EXCL flag on the open() call.

For thin_dump and thin_delta you can use the -m flag to examine a metadata snapshot in a live pool (the snapshot is unchanging, so this is safe). Very few people use this.

As for the procedure from comment #4 that you mention: it does indeed seem convoluted. As far as the thin tools are concerned, you can run them on the metadata device so long as the pool isn't active. So I'd expect the procedure to be:

i) deactivate the pool
ii) run thin_check
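
A hedged sketch of the metadata-snapshot route mentioned above; the dm device name and the -m usage below are assumptions (lvm2-created pools usually expose the thin-pool target as a -tpool device, and the exact -m syntax depends on the thin-provisioning-tools version):

```
# Sketch only: the -tpool device name and -m usage are assumptions.
# Reserve a read-only snapshot of the pool metadata on the live pool...
dmsetup message /dev/mapper/VG-POOL-tpool 0 reserve_metadata_snap

# ...dump from the metadata snapshot instead of the live metadata...
thin_dump -m /dev/mapper/VG-POOL_tmeta

# ...and release the snapshot when finished.
dmsetup message /dev/mapper/VG-POOL-tpool 0 release_metadata_snap
```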