Bug 1274834

Summary: check_cache 'slowness'
Product: Red Hat Enterprise Linux 7 Reporter: Corey Marthaler <cmarthal>
Component: device-mapper-persistent-dataAssignee: Joe Thornber <thornber>
Status: CLOSED ERRATA QA Contact: Jakub Krysl <jkrysl>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.2CC: agk, heinzm, jbrassow, jkrysl, lmiksik, lvm-team, msnitzer, thornber
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 13:15:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1469559    

Description Corey Marthaler 2015-10-23 15:25:54 UTC
Description of problem:

Copying comment #15 from bug 1189051

dmsetup create --table '0 2000 error'  errdev

cache_check  /dev/mapper/errdev

examining superblock
  superblock is corrupt
    incomplete io for block 0, e.res = 18446744073709551611, e.res2 = 0, offset = 0, nbytes = 4096


--- very slooooowly going here  ---

(gdb) bt
#0  0x00007f6cc456c644 in __io_getevents_0_4 (ctx=0x7f6cc4c9d000, min_nr=1, nr=3949, events=0x555ce9562820, timeout=0x0)
    at io_getevents.c:25
#1  0x00007f6cc456c67d in io_getevents_0_4 (ctx=<optimized out>, min_nr=min_nr@entry=1, nr=<optimized out>, 
    events=<optimized out>, timeout=timeout@entry=0x0) at io_getevents.c:54
#2  0x0000555ce7e3a7f0 in bcache::block_cache::wait_io (this=this@entry=0x555ce955e6c8) at block-cache/block_cache.cc:202
#3  0x0000555ce7e3add0 in bcache::block_cache::wait_all (this=<optimized out>) at block-cache/block_cache.cc:261
#4  bcache::block_cache::flush (this=this@entry=0x555ce955e6c8) at block-cache/block_cache.cc:674
#5  0x0000555ce7e3aecc in bcache::block_cache::~block_cache (this=0x555ce955e6c8, __in_chrg=<optimized out>)
    at block-cache/block_cache.cc:491
#6  0x0000555ce7eb1ef3 in persistent_data::block_manager<4096u>::~block_manager (this=0x555ce955e6c0, 
    __in_chrg=<optimized out>) at persistent-data/block.h:42
#7  boost::checked_delete<persistent_data::block_manager<4096u> > (x=0x555ce955e6c0)
    at /usr/include/boost/core/checked_delete.hpp:34
#8  boost::detail::sp_counted_impl_p<persistent_data::block_manager<4096u> >::dispose (this=<optimized out>)
    at /usr/include/boost/smart_ptr/detail/sp_counted_impl.hpp:78
#9  0x0000555ce7e3eaf8 in boost::detail::sp_counted_base::release (this=0x555ce95815d0)
    at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:146
#10 boost::detail::shared_count::~shared_count (this=0x7ffc84793ad8, __in_chrg=<optimized out>)
    at /usr/include/boost/smart_ptr/detail/shared_count.hpp:467
#11 boost::shared_ptr<persistent_data::block_manager<4096u> >::~shared_ptr (this=0x7ffc84793ad0, __in_chrg=<optimized out>)
    at /usr/include/boost/smart_ptr/shared_ptr.hpp:330
#12 (anonymous namespace)::metadata_check (fs=<synthetic pointer>, path="/dev/mapper/errdev") at caching/cache_check.cc:224
#13 (anonymous namespace)::check (fs=<synthetic pointer>, path="/dev/mapper/errdev") at caching/cache_check.cc:300
#14 (anonymous namespace)::check_with_exception_handling (fs=<synthetic pointer>, path="/dev/mapper/errdev")
    at caching/cache_check.cc:318
#15 cache_check_main (argc=<optimized out>, argv=<optimized out>) at caching/cache_check.cc:410
#16 0x0000555ce7e364d4 in base::command::run (this=0x555ce815d6c0 <caching::cache_check_cmd>, argv=0x7ffc84794798, argc=2)
    at base/application.h:26
#17 base::application::run (this=0x7ffc84794680, argc=2, argv=0x7ffc84794798) at base/application.cc:32
#18 0x0000555ce7e35431 in main (argc=2, argv=0x7ffc84794798) at main.cc:39


---  at some point it will finish ---

but this clearly relates to the size of  errored device

For 'reproducer'  with 2000 sectors - it's about 40 seconds
with  20000 sectors  it has NOT finished in 10 minutes - so it's not even linear time increase...

So - new bug for cache_check tool to address this 'slowness' should be created.


As for actual testing of this NEEDSCHECKFLAG - I'd probably suggest  to restore 'previous' table content before 'vgchange -an' -  since  cache_check  is  executed on 'deactivation' as well as on 'activation' phase.

So if you want to check 'needs' has been cleared in this test case - you would need to give it back original device - after  cache target itself switched to 'Fail' mode.


Version-Release number of selected component (if applicable):
3.10.0-325.el7.x86_64

lvm2-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-libs-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
lvm2-cluster-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-event-libs-1.02.107-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
device-mapper-persistent-data-0.5.5-1.el7    BUILT: Thu Aug 13 09:58:10 CDT 2015
cmirror-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015
sanlock-3.2.4-1.el7    BUILT: Fri Jun 19 12:48:49 CDT 2015
sanlock-lib-3.2.4-1.el7    BUILT: Fri Jun 19 12:48:49 CDT 2015
lvm2-lockd-2.02.130-5.el7    BUILT: Wed Oct 14 08:27:29 CDT 2015

Comment 3 Joe Thornber 2017-09-29 10:28:37 UTC
I just tried to reproduce and it works for me:


ejt@reti:~/work/RedHat/thin-provisioning-tools$ time sudo bin/cache_check /dev/mapper/errdev
examining superblock
  superblock is corrupt
    bad checksum in superblock

real    0m0.041s
user    0m0.008s
sys     0m0.000s

So it's quite possible I've already fixed this one, I've certainly done a fair amount of work on the libaio block cache in the last couple of years.

Could we retest this one with the latest version please?

Comment 4 Jakub Krysl 2017-09-29 13:06:44 UTC
Tested with build device-mapper-persistent-data-0.7.2-1.el7:
# dmsetup create --table '0 20000 error'  errdev                                                                                                                                              
# time cache_check /dev/mapper/errdev                                                                                                                                                         
examining superblock                                
  superblock is corrupt                             
    bad checksum in superblock                      

real    0m0.008s                                    
user    0m0.001s                                    
sys     0m0.001s 

I tried 0.7.0-0.1.rc6.el7 with the same result:
# time cache_check /dev/mapper/errdev 
examining superblock
  superblock is corrupt
    bad checksum in superblock

real    0m0.008s
user    0m0.002s
sys     0m0.004s

However in 0.6.3-1.el7 it is still broken:
# time cache_check /dev/mapper/errdev 
examining superblock
  superblock is corrupt
    incomplete io for block 0, e.res = 18446744073709551611, e.res2 = 0, offset = 0, nbytes = 4096
^C

real    0m42.958s
user    0m0.003s
sys     0m0.000s

So this got fixed somewhere between 0.7.0-0.1.rc6.el7 and 0.6.3-1.el7.

Comment 7 errata-xmlrpc 2018-04-10 13:15:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0776