Bug 471042

Summary: RHEL5 cmirror tracker: cluster pvmove performance has regressed
Product: Red Hat Enterprise Linux 5
Component: cmirror
Version: 5.3
Status: CLOSED NOTABUG
Severity: high
Priority: high
Keywords: Regression
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, bstevens, ccaulfie, ddomingo, dwysocha, edamato, heinzm, mbroz, pep, rlerch, syeghiay
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-12-08 20:34:44 UTC

Description Corey Marthaler 2008-11-11 15:24:12 UTC
Description of problem:
I don't have any hard numbers on this yet, but it appears that cluster pvmoves have gotten much slower with one of the recent builds. This may be an lvm2/lvm2-cluster bug, but I'm filing it under cmirror for now.

I'll investigate this issue further and provide more details before proposing this as a 5.3 blocker.

My segmented pvmove test case has been running for 18 hours and is still making progress, but it is only 60% done. I don't remember this test case ever taking this long before.

[root@hayes-03 ~]# lvs -a
  LV          VG            Attr   LSize  Origin Snap%  Move               Log Copy%  Convert
  LogVol00    VolGroup00    -wi-ao 72.44G                                                    
  LogVol01    VolGroup00    -wi-ao  1.94G                                                    
  [pvmove0]   mirror_sanity p-C-ao 95.51G               /dev/etherd/e1.1p5      60.07        
  seg_pvmove0 mirror_sanity -wI-a- 95.51G                                                    
  seg_pvmove1 mirror_sanity -wi-a- 95.51G                                                    
[root@hayes-03 ~]# lvs -a
  LV          VG            Attr   LSize  Origin Snap%  Move               Log Copy%  Convert
  LogVol00    VolGroup00    -wi-ao 72.44G                                                    
  LogVol01    VolGroup00    -wi-ao  1.94G                                                    
  [pvmove0]   mirror_sanity p-C-ao 95.51G               /dev/etherd/e1.1p5      60.08        
  seg_pvmove0 mirror_sanity -wI-a- 95.51G                                                    
  seg_pvmove1 mirror_sanity -wi-a- 95.51G                                                    




Version-Release number of selected component (if applicable):
2.6.18-117.el5

lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.34-5.el5    BUILT: Thu Nov  6 15:10:44 CST 2008
kmod-cmirror-0.1.21-2.el5    BUILT: Thu Nov  6 14:12:07 CST 2008

Comment 1 Corey Marthaler 2008-11-12 23:22:53 UTC
I ran the same pvmove scenario on the older cmirror-1.1.33-1.el5/kmod-cmirror-0.1.20-1.el5 builds and then again on the newer cmirror-1.1.34-5.el5/kmod-cmirror-0.1.21-2.el5 builds, and the newer builds were about 20X slower (roughly 264 minutes vs. 13 minutes)!


cmirror-1.1.33-1.el5/kmod-cmirror-0.1.20-1.el5:
============================================================
Iteration 1 of 1 started at Wed Nov 12 16:48:01 CST 2008
============================================================
SCENARIO - [segmented_pvmove]
Create a couple small segmented lvs and then pvmove them
Running lv_seg on hayes-02 to create the segmented volumes
Deactivating seg_pvmove0 mirror
Moving data from pv /dev/etherd/e1.1p5
    Finding volume group "mirror_sanity"
    Executing: /sbin/modprobe dm-log-clustered
    Archiving volume group "mirror_sanity" metadata (seqno 27).
    Creating logical volume pvmove0
    Moving 3662 extents of logical volume mirror_sanity/seg_pvmove0
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 28).
    Checking progress every 15 seconds
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 30).
Deactivating mirror seg_pvmove0... and removing
Deactivating mirror seg_pvmove1... and removing


real    13m17.521s
user    0m0.843s
sys     0m0.458s



cmirror-1.1.34-5.el5/kmod-cmirror-0.1.21-2.el5:
============================================================
Iteration 1 of 1 started at Wed Nov 12 11:04:17 CST 2008    
============================================================
SCENARIO - [segmented_pvmove]                               
Create a couple small segmented lvs and then pvmove them    
Running lv_seg on hayes-03 to create the segmented volumes  
Deactivating seg_pvmove0 mirror                             
Moving data from pv /dev/etherd/e1.1p5                      
    Finding volume group "mirror_sanity"                    
    Executing: /sbin/modprobe dm-log-clustered              
    Archiving volume group "mirror_sanity" metadata (seqno 27).
    Creating logical volume pvmove0                            
    Moving 3662 extents of logical volume mirror_sanity/seg_pvmove0
    Updating volume group metadata                                 
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 28).
    Checking progress every 15 seconds                                      
    Removing temporary pvmove LV                                            
    Writing out final volume group after pvmove                             
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 30).
Deactivating mirror seg_pvmove0... and removing                             
Deactivating mirror seg_pvmove1... and removing                             


real    264m4.870s
user    0m0.833s  
sys     0m0.495s

Comment 2 Corey Marthaler 2008-11-12 23:40:09 UTC
Here is the exact cmdline I ran:
./mirror_sanity -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.3 -R ../../var/share/resource_files/hayes.xml -e segmented_pvmove -i 1

Comment 3 Jonathan Earl Brassow 2008-11-14 19:08:05 UTC
Are all mirrors suffering in the same way, or is it just this test case?

Comment 5 Corey Marthaler 2008-11-17 22:20:12 UTC
This appears to be a regression across the board for pvmoves. 

I tried the basic pvmove cases and they also appear to be much slower (about twice as slow, roughly 63 minutes vs. 35 minutes) than with the previous builds.


cmirror-1.1.33-1.el5/kmod-cmirror-0.1.20-1.el5:

[cmarthal@silver bin]$ time ./pv_shuffle -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.3 -R ../../var/share/resource_files/hayes.xml -i 2
1 disk(s) to be used:                       
        hayes-01=/dev/etherd/e1.1           
        hayes-02=/dev/etherd/e1.1           
        hayes-03=/dev/etherd/e1.1           

on hayes-01...
dicing /dev/etherd/e1.1 into 7... 
re-reading disks on hayes-03...   
re-reading disks on hayes-01...   
re-reading disks on hayes-02...   

vgscan on all nodes...

creating lvm devices...
create PV(s) out of all avail partitions on hayes-01
create VG(s) on hayes-01                            
         pv_shuffle                                 

creating linear on hayes-01
creating stripe on hayes-01

Iteration 1 of 2 started on hayes-03 at Mon Nov 17 14:27:13 CST 2008
FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p2                     
INUSE PVS: /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7
NOT INUSE PVS: /dev/etherd/e1.1p3                                                     

Adding /dev/etherd/e1.1p2 to the volume group on hayes-03
Moving data from /dev/etherd/e1.1p4 to /dev/etherd/e1.1p2 on hayes-03
    Finding volume group "pv_shuffle"                                
    Executing: /sbin/modprobe dm-log-clustered                       
    Archiving volume group "pv_shuffle" metadata (seqno 4).          
    Creating logical volume pvmove0                                  
    Moving 427 extents of logical volume pv_shuffle/stripe           
    Updating volume group metadata                                   
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 5).
    Checking progress every 15 seconds                                  
    Removing temporary pvmove LV                                        
    Writing out final volume group after pvmove                         
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 7).
Removing /dev/etherd/e1.1p3 from the volume group on hayes-03           


Iteration 2 of 2 started on hayes-02 at Mon Nov 17 14:58:30 CST 2008
FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p3                     
INUSE PVS: /dev/etherd/e1.1p2 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7
NOT INUSE PVS: /dev/etherd/e1.1p4                                                     

Adding /dev/etherd/e1.1p3 to the volume group on hayes-02
Moving data from /dev/etherd/e1.1p7 to /dev/etherd/e1.1p3 on hayes-02
    Finding volume group "pv_shuffle"                                
    Executing: /sbin/modprobe dm-log-clustered                       
    Archiving volume group "pv_shuffle" metadata (seqno 9).          
    Creating logical volume pvmove0                                  
    Moving 1280 extents of logical volume pv_shuffle/linear          
    Updating volume group metadata                                   
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 10).
    Checking progress every 15 seconds                                   
    Removing temporary pvmove LV                                         
    Writing out final volume group after pvmove                          
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 12).
Removing /dev/etherd/e1.1p4 from the volume group on hayes-02            


real    35m3.384s
user    0m0.943s 
sys     0m0.708s 




cmirror-1.1.34-5.el5/kmod-cmirror-0.1.21-2.el5:

[cmarthal@silver bin]$ time ./pv_shuffle -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.3 -R ../../var/share/resource_files/hayes.xml -i 2                                                                     
1 disk(s) to be used:                                                                                            
        hayes-01=/dev/etherd/e1.1                                                                                
        hayes-02=/dev/etherd/e1.1                                                                                
        hayes-03=/dev/etherd/e1.1                                                                                

on hayes-01...
dicing /dev/etherd/e1.1 into 7... 
re-reading disks on hayes-03...   
re-reading disks on hayes-01...   
re-reading disks on hayes-02...   

vgscan on all nodes...

creating lvm devices...
create PV(s) out of all avail partitions on hayes-01
create VG(s) on hayes-01                            
         pv_shuffle                                 

creating linear on hayes-01
creating stripe on hayes-01

Iteration 1 of 2 started on hayes-03 at Mon Nov 17 15:10:20 CST 2008
FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p2                     
INUSE PVS: /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7
NOT INUSE PVS: /dev/etherd/e1.1p3                                                     

Adding /dev/etherd/e1.1p2 to the volume group on hayes-03
Moving data from /dev/etherd/e1.1p5 to /dev/etherd/e1.1p2 on hayes-03
    Finding volume group "pv_shuffle"                                
    Executing: /sbin/modprobe dm-log-clustered                       
    Archiving volume group "pv_shuffle" metadata (seqno 4).          
    Creating logical volume pvmove0
    Moving 427 extents of logical volume pv_shuffle/stripe
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 5).
    Checking progress every 15 seconds
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 7).
Removing /dev/etherd/e1.1p3 from the volume group on hayes-03


Iteration 2 of 2 started on hayes-03 at Mon Nov 17 15:41:40 CST 2008
FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p3
INUSE PVS: /dev/etherd/e1.1p2 /dev/etherd/e1.1p4 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7
NOT INUSE PVS: /dev/etherd/e1.1p5

Adding /dev/etherd/e1.1p3 to the volume group on hayes-03
Moving data from /dev/etherd/e1.1p2 to /dev/etherd/e1.1p3 on hayes-03
    Finding volume group "pv_shuffle"
    Executing: /sbin/modprobe dm-log-clustered
    Archiving volume group "pv_shuffle" metadata (seqno 9).
    Creating logical volume pvmove0
    Moving 427 extents of logical volume pv_shuffle/stripe
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 10).
    Checking progress every 15 seconds
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 12).
Removing /dev/etherd/e1.1p5 from the volume group on hayes-03


real    63m14.555s
user    0m0.940s
sys     0m0.715s

Comment 7 Corey Marthaler 2008-11-26 16:26:28 UTC
FWIW, I retested this with the fix for bz 464550 that Jon built for me, and the time was:

real    42m25.138s
user    0m0.923s  
sys     0m0.741s

Comment 10 Jonathan Earl Brassow 2008-12-08 04:50:50 UTC
I stole Corey's 'hayes' machines (the only ones I can reproduce this on) and did some profiling.  I used a SystemTap script to print out every request that goes to the userspace server.  At the end of each 'Exiting' line you can see the time it took to process the request, and I also print the time between calls.  You can easily see that the majority of the time is spent doing the actual copy work, not making cmirror requests.  Resync work is handed out in 4MB chunks, and the time between getting a chunk of work and marking it finished is roughly the largest gap shown, 404227us (the time spent between requests).  4MB every ~0.4s is less than 10MB/s, which for the ~95.5GB volume above works out to more than 160 minutes to complete.

Entering:      dm_clog_consult_server(DM_CLOG_GET_RESYNC_WORK) [kmirrord/20217]
Exiting :      dm_clog_consult_server()                 [kmirrord/20217] 475us

Time between calls: 3us
Entering:      dm_clog_consult_server(DM_CLOG_IN_SYNC)  [kmirrord/20217]
Exiting :      dm_clog_consult_server()                 [kmirrord/20217] 17us

Time between calls: 9us
Entering:      dm_clog_consult_server(DM_CLOG_GET_SYNC_COUNT) [kmirrord/20217]
Exiting :      dm_clog_consult_server()                 [kmirrord/20217] 61us

Time between calls: 404227us
Entering:      dm_clog_consult_server(DM_CLOG_SET_REGION_SYNC) [kmirrord/20217]
Exiting :      dm_clog_consult_server()                 [kmirrord/20217] 695us

Time between calls: 3us
Entering:      dm_clog_consult_server(DM_CLOG_CLEAR_REGION) [kmirrord/20217]
Exiting :      dm_clog_consult_server()                 [kmirrord/20217] 18us

Time between calls: 1us
Entering:      dm_clog_consult_server(DM_CLOG_FLUSH)    [kmirrord/20217]
Exiting :      dm_clog_consult_server()                 [kmirrord/20217] 702us
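
The SystemTap script itself is not attached to the bug. A minimal sketch of a script that could produce entry/exit timings like the trace above might look like the following; it assumes the clustered-log module is named dm-log-clustered (as loaded by /sbin/modprobe in the logs above) and that dm_clog_consult_server can be probed, and it omits decoding the DM_CLOG_* request name shown in the original output.

# Illustrative reconstruction only -- not necessarily the script used above.
# Assumes the clustered-log module is named dm-log-clustered and that
# dm_clog_consult_server is visible to SystemTap.
cat > clog_timing.stp <<'EOF'
global entry_us, last_exit_us

/* function entry: report the gap since the previous call returned */
probe module("dm-log-clustered").function("dm_clog_consult_server")
{
        if (last_exit_us)
                printf("Time between calls: %dus\n",
                       gettimeofday_us() - last_exit_us)
        entry_us = gettimeofday_us()
        printf("Entering:      dm_clog_consult_server() [%s/%d]\n",
               execname(), pid())
}

/* function return: report how long the request took to process */
probe module("dm-log-clustered").function("dm_clog_consult_server").return
{
        last_exit_us = gettimeofday_us()
        printf("Exiting :      dm_clog_consult_server() [%s/%d] %dus\n",
               execname(), pid(), last_exit_us - entry_us)
}
EOF
stap clog_timing.stp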

Comment 11 Jonathan Earl Brassow 2008-12-08 20:34:44 UTC
It is with great pride that I close this NOTABUG.  The problem was network-hardware related.  I'll let Corey post the confirming data from his machines if he wants to.

Comment 12 Ryan Lerch 2008-12-08 20:55:12 UTC
Removing the release note request on this one based on comment #11.

Comment 13 Corey Marthaler 2009-01-13 17:34:55 UTC
Just a follow-up to comment #11: the hardware issue referenced was the RX-ERRs sometimes seen on one or all of the cluster nodes during cmirror/pvmove operations.

Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0  4918652   1575      0      0  5307271      0      0      0 BMRU
lo        16436   0      121      0      0      0      121      0      0      0 LRU


Oddly, these errors are seen on machines in all three of my clusters (each on a different subnet), but only while doing cmirror/openais-intensive operations. I don't see any errors when those operations aren't running, and other clusters on the same networks don't see errors while running other tests.
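
One way to line the RX-ERR bursts up with cmirror activity would be to sample each node's receive-error counter while a pvmove test runs. The sketch below is hypothetical and was not part of the original report; it assumes the hayes-0X nodes are reachable over ssh and that eth0 is the cluster interface (as in the netstat output above).

# Hypothetical monitoring loop, not from the original report: sample the eth0
# receive-error counter on each cluster node every 15 seconds for one hour
# while a cmirror/pvmove test runs, so error bursts can be correlated with
# cmirror activity afterwards.
for node in hayes-01 hayes-02 hayes-03; do
    ssh "$node" 'for i in $(seq 1 240); do
        printf "%s %s rx_errors=%s\n" "$(date +%T)" "$(hostname -s)" \
               "$(cat /sys/class/net/eth0/statistics/rx_errors)"
        sleep 15
    done' > "rx_errors.$node.log" &
done
wait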