Bug 1199837

Summary:	lvm jump TransactionID without getting confirmation from kernel
Product:	Red Hat Enterprise Linux 7	Reporter:	Jack Waterworth <jwaterwo>
Component:	lvm2	Assignee:	Zdenek Kabelac <zkabelac>
lvm2 sub component:	LVM Metadata / lvmetad	QA Contact:	cluster-qe <cluster-qe>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	unspecified	CC:	agk, coughlan, heinzm, jbrassow, msnitzer, prajnoha, prockai, rbednar, zkabelac
Version:	7.1
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	lvm2-2.02.160-1.el7	Doc Type:	Bug Fix
Doc Text:	Cause: In some complex code paths, lvm2 failed to commit metadata, and if a further problem then occurred, the transaction_id in lvm2 metadata could diverge by 1 from the one in kernel metadata. Consequence: A mismatched transaction_id prevents further use of the thin pool and requires a manual repair operation, even though the user would not expect such a problem. Fix: lvm2 has improved its logic for updating transaction_id. Result: Common lvm2 operations should no longer leave a thin pool with an out-of-sync transaction_id.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-11-04 04:09:29 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1199940, 1243913
Bug Blocks:	1295577, 1313485

Description Jack Waterworth 2015-03-08 20:38:51 UTC

Description of problem:
After space in a thin pool was exhausted, extending the pool and rebooting the machine caused a transaction_id mismatch.

Version-Release number of selected component (if applicable):
lvm2-2.02.115-3.el7.x86_64
kernel-3.10.0-123.4.4.el7.x86_64
kernel-3.10.0-123.8.1.el7.x86_64
kernel-3.10.0-229.el7.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. Fill thinpool to 100% data usage
2. Extend thinpool
3. Reboot server

Actual results:
The thin pool cannot be activated:
[root@jack-tank ~]# vgchange -ay data
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
11 logical volume(s) in volume group "data" now active
[root@jack-tank ~]# lvs /dev/data/thin_pool
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
thin_pool data twi---tz-- 150.00g
[root@jack-tank ~]# lvchange -ay data/thin_pool
Thin pool transaction_id is 7345, while expected 7346.
[root@jack-tank ~]# lvs /dev/data/thin_pool
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
thin_pool data twi---tz-- 150.00g
[root@jack-tank ~]#

Expected results:
transaction_id should match expected id.
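
For triage, the mismatch messages in the output above can be summarized mechanically. The following is a minimal Python sketch, assuming the message format stays exactly as printed above; the helper name find_mismatches is hypothetical and not part of lvm2.

```python
import re

# Matches the message exactly as it appears in the vgchange/lvchange output above.
MISMATCH_RE = re.compile(
    r"Thin pool transaction_id is (\d+), while expected (\d+)\."
)


def find_mismatches(output):
    """Return the distinct (reported_id, expected_id) pairs found in command output."""
    pairs = []
    for line in output.splitlines():
        m = MISMATCH_RE.search(line)
        if m:
            pair = (int(m.group(1)), int(m.group(2)))
            if pair not in pairs:
                pairs.append(pair)
    return pairs


sample = """Thin pool transaction_id is 7345, while expected 7346.
Thin pool transaction_id is 7345, while expected 7346.
11 logical volume(s) in volume group "data" now active"""

for reported_id, expected_id in find_mismatches(sample):
    print(f"reported {reported_id}, expected {expected_id}, "
          f"difference {expected_id - reported_id}")
```

Collapsing the repeated lines into distinct pairs makes it easy to see that every message reports the same off-by-one difference.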

Additional info:
I'm not sure what I did to get into this state. I attempted to increase the size of the thin_pool volume by adding 50G, but this did not decrease the data%. I rebooted to see if that would clear up the discrepancy. After that I was unable to activate my volumes.

Comment 3 Zdenek Kabelac 2015-03-09 08:20:15 UTC

From the metadata archive, it looks like lvm2 didn't properly validate the removal of the data/vm_f21_server LV and jumped one step further while the thin pool was already in an overfilled state.

I'll try to write a reproducer.

(As a side note, it's also unclear why there are archived files from the vgdisplay command.)

Comment 4 Zdenek Kabelac 2015-03-09 14:21:54 UTC

So the reason for the 'vgdisplay' archiving is a missing 'backup()' call in the error path. lvm2 updates the metadata but likely fails in lvremove and omits the backup in this error path, so the next lvm command (i.e. vgdisplay) notices the missing backup and performs the archiving and backup itself. This needs a fix.
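
The pattern described in this comment (a backup that must also happen when the operation fails) can be illustrated with a toy model. This is only a sketch in Python under simplifying assumptions, not the actual lvm2 fix; the class VolumeGroup and the functions remove_lv and failing_remove are hypothetical names.

```python
class VolumeGroup:
    """Toy stand-in for a volume group; not an lvm2 data structure."""

    def __init__(self):
        self.seqno = 1          # version of the committed metadata
        self.backup_seqno = 1   # version stored in the last backup

    def commit(self):
        self.seqno += 1

    def backup(self):
        self.backup_seqno = self.seqno

    def backup_is_stale(self):
        return self.backup_seqno != self.seqno


def remove_lv(vg, kernel_remove):
    """Commit metadata, then try the removal; back up even if the removal fails."""
    vg.commit()
    try:
        kernel_remove()
    finally:
        # Runs on success and on failure, so no later command finds a stale backup.
        vg.backup()


def failing_remove():
    raise RuntimeError("thin pool is out of space")


vg = VolumeGroup()
try:
    remove_lv(vg, failing_remove)
except RuntimeError as err:
    print("removal failed:", err)
print("backup stale:", vg.backup_is_stale())
```

With the backup in the finally clause, a later read-only command would have no leftover backup work to do.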

Comment 5 Zdenek Kabelac 2016-07-01 07:06:00 UTC

Every transaction_id update is now followed by an immediate check confirming that the transaction_id really has the expected value.

While the source code is a moving target, and the possibility of a missed update cannot be fully eliminated, we should now be able to spot such a forgotten update and stop further actions thanks to 2 new extra levels of validation.
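
The update-then-verify idea can be sketched as follows. This is a minimal Python illustration, not lvm2 code: FakeThinPool is a hypothetical in-memory stand-in whose set_transaction_id method is loosely modeled on the thin-pool target message of the same name, and bump_transaction_id is an assumed helper name.

```python
class TransactionIdMismatch(Exception):
    pass


class FakeThinPool:
    """Toy stand-in for the kernel thin-pool target; not a real device-mapper API."""

    def __init__(self, transaction_id, drop_updates=False):
        self.transaction_id = transaction_id
        self.drop_updates = drop_updates   # simulate an update that never lands

    def set_transaction_id(self, current, new):
        if not self.drop_updates and self.transaction_id == current:
            self.transaction_id = new

    def status_transaction_id(self):
        return self.transaction_id


def bump_transaction_id(pool, metadata_id):
    """Advance the id by one and confirm that the pool really reports the new value."""
    new_id = metadata_id + 1
    pool.set_transaction_id(metadata_id, new_id)
    reported = pool.status_transaction_id()
    if reported != new_id:
        # Stop here instead of letting the metadata run ahead of the kernel.
        raise TransactionIdMismatch(
            f"Thin pool transaction_id is {reported}, while expected {new_id}."
        )
    return new_id


print(bump_transaction_id(FakeThinPool(7345), 7345))
try:
    bump_transaction_id(FakeThinPool(7345, drop_updates=True), 7345)
except TransactionIdMismatch as err:
    print("stopped:", err)
```

Failing at the moment of the update keeps the divergence from being discovered only at the next activation.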

That said, I haven't seen any problems with unexpected transaction_id changes for a while, so hopefully we have eliminated the primary source of trouble.

So hopefully there is no reproducer for this bug.

Comment 7 Roman Bednář 2016-09-15 14:59:44 UTC

Verified. No transaction id error occurred when trying to reproduce the bug.

# lvs -a
  LV              VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root            rhel_virt-151 -wi-ao----   6.74g                                                    
  swap            rhel_virt-151 -wi-ao---- 828.00m                                                    
  POOL            vg            twi-aotzD-  68.00m             100.00 1.76                            
  [POOL_tdata]    vg            Twi-ao----  68.00m                                                    
  [POOL_tmeta]    vg            ewi-ao----   4.00m                                                    
  [lvol0_pmspare] vg            ewi-------   4.00m                                                    
  test_lv         vg            Vwi-a-tz-- 240.00m POOL        28.33  
                                
# vgs
  VG            #PV #LV #SN Attr   VSize   VFree  
  rhel_virt-151   1   2   0 wz--n-   7.59g  40.00m
  vg              2   2   0 wz--n- 192.00m 116.00m

# lvextend -r vg/POOL -L+50M
  Ignoring --resizefs as volume vg/POOL does not have a filesystem.
  Rounding size to boundary between physical extents: 52.00 MiB.
  WARNING: Sum of all thin volume sizes (240.00 MiB) exceeds the size of thin pools and the size of whole volume group (192.00 MiB)!
  Size of logical volume vg/POOL_tdata changed from 68.00 MiB (17 extents) to 120.00 MiB (30 extents).
  Logical volume vg/POOL_tdata successfully resized.
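
The rounding reported above follows from the extent size. Below is a minimal Python sketch of the arithmetic, assuming a 4 MiB physical extent (inferred from "68.00 MiB (17 extents)" in the output) and assuming the request is rounded up to a whole number of extents; the function name round_up_to_extents is illustrative only.

```python
import math

EXTENT_MIB = 68 // 17   # 4 MiB per extent, inferred from "68.00 MiB (17 extents)"


def round_up_to_extents(size_mib, extent_mib=EXTENT_MIB):
    """Return (extent count, rounded size in MiB) for a requested size."""
    extents = math.ceil(size_mib / extent_mib)
    return extents, extents * extent_mib


added_extents, added_mib = round_up_to_extents(50)
print(added_extents, added_mib)              # 13 52
print(17 + added_extents, 68 + added_mib)    # 30 120
```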

# reboot
...
# lvs -a
  LV              VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root            rhel_virt-151 -wi-ao----   6.74g                                                    
  swap            rhel_virt-151 -wi-ao---- 828.00m                                                    
  POOL            vg            twi-aotz-- 120.00m             56.67  1.76                            
  [POOL_tdata]    vg            Twi-ao---- 120.00m                                                    
  [POOL_tmeta]    vg            ewi-ao----   4.00m                                                    
  [lvol0_pmspare] vg            ewi-------   4.00m                                                    
  test_lv         vg            Vwi-a-tz-- 240.00m POOL        28.33     
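
The Data% values before and after the extension are consistent with an unchanged amount of allocated data. The following Python sketch checks this, assuming Data% is simply allocated size divided by LV size and rounded to two decimals; data_percent is a hypothetical helper.

```python
def data_percent(used_mib, size_mib):
    """Allocated space as a percentage of the LV size, rounded to two decimals."""
    return round(100.0 * used_mib / size_mib, 2)


used_mib = 68.0   # the pool was 100.00% full at 68.00 MiB before the extension

print(data_percent(used_mib, 68.0))    # 100.0  (POOL before lvextend)
print(data_percent(used_mib, 120.0))   # 56.67  (POOL after lvextend and reboot)
print(data_percent(used_mib, 240.0))   # 28.33  (test_lv, 240.00 MiB virtual size)
```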



3.10.0-501.el7.x86_64

lvm2-2.02.165-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
lvm2-libs-2.02.165-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
lvm2-cluster-2.02.165-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
device-mapper-1.02.134-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
device-mapper-libs-1.02.134-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
device-mapper-event-1.02.134-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
device-mapper-event-libs-1.02.134-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 12:29:13 CEST 2016
cmirror-2.02.165-1.el7    BUILT: Wed Sep  7 18:04:22 CEST 2016

Comment 9 errata-xmlrpc 2016-11-04 04:09:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1445.html