Description of problem:
With the following workload we are seeing up to 10x slower performance with Denali as compared to Corbet.

4 x Initial workload:
- Smallfile workload
- Operation = Create
- Threads = 24
- Average file size = 108 KB compressed
- Files = 10000/thread

Post initial sync workload:
- Smallfile workload
- Operation = Create
- Threads = 24
- Average file size = 108 KB compressed
- Files = 10000/thread

Version-Release number of selected component (if applicable):
- Corbet release: 3.4.0.59rhs
- Denali release: 3.6.0.5

How reproducible:
Every time when geo-rep is running.

Steps to Reproduce:
1. Create a 2x2 setup on the master and slave ends.
2. Create files with the initial workload on the master.
3. Start geo-rep.
4. Wait till the initial files get synced.
5. Create the post initial sync workload on the master.
6. Wait for it to get synced to the slave site.

Actual results:

For Corbet:
- Initial workload on master = 23:51 to 00:30 = 39 mins
- Geo-rep start = 00:39
- Initial geo-rep end = 03:38
- Initial sync = 00:39 to 03:38 = 179 mins
- Post initial sync geo-rep start = 04:12
- Post initial sync geo-rep end = 04:39
- Post initial sync duration = 04:12 to 04:39 = 27 mins

For Denali:
- Initial workload on master = 01:00 to 01:40 = 40 mins
- Geo-rep start = 04:50
- Geo-rep end = 08:48
- Initial sync = 04:50 to 08:48 = 238 mins
- Post initial sync geo-rep start = 09:06
- Post initial sync geo-rep end = 13:36
- Post initial sync duration = 09:06 to 13:36 = 270 mins

Expected results:
There should not be any perf regression with Denali.

Additional info:
1.
We saw the following logs:

[2014-05-28 09:36:09.835500] W [master(/mnt/brick/brick):278:regjob] <top>: Rsync: .gfid/78b751b3-1270-4ad5-b2ec-dcf9e2d88846 [errcode: 23]
[2014-05-28 09:36:09.836548] W [master(/mnt/brick/brick):278:regjob] <top>: Rsync: .gfid/7648b46e-b003-48e8-a694-6b3de752e747 [errcode: 23]
[2014-05-28 09:36:09.837604] W [master(/mnt/brick/brick):278:regjob] <top>: Rsync: .gfid/94ac7ea0-2cb8-4721-9efa-09bef7fdf4b3 [errcode: 23]
[2014-05-28 09:36:09.838660] W [master(/mnt/brick/brick):278:regjob] <top>: Rsync: .gfid/e81b4660-7dee-4ed6-8e5e-5f2f9f045ec8 [errcode: 23]
[2014-05-28 09:36:09.839720] W [master(/mnt/brick/brick):278:regjob] <top>: Rsync: .gfid/b49dd8a3-2ed1-469c-ab61-93efa5c86183 [errcode: 23]

2. Visualization of geo-rep is available at:
- http://perf19.perf.lab.eng.bos.redhat.com:3838/may14/georepCorbet26May_compress/
- http://perf19.perf.lab.eng.bos.redhat.com:3838/may14/georepDenali27May_compress/
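The warnings above can be tallied with a short log scan. A minimal sketch, with two of the sample lines inlined (a real run would read the geo-rep master log file, e.g. under /var/log/glusterfs/geo-replication/, instead):

```python
import re

# Two sample geo-rep master log lines, taken from the excerpt above.
LOG_LINES = [
    "[2014-05-28 09:36:09.835500] W [master(/mnt/brick/brick):278:regjob] <top>: "
    "Rsync: .gfid/78b751b3-1270-4ad5-b2ec-dcf9e2d88846 [errcode: 23]",
    "[2014-05-28 09:36:09.836548] W [master(/mnt/brick/brick):278:regjob] <top>: "
    "Rsync: .gfid/7648b46e-b003-48e8-a694-6b3de752e747 [errcode: 23]",
]

# Match the gfid path and the rsync errcode from each warning.
pattern = re.compile(r"Rsync: (\.gfid/[0-9a-f-]+) \[errcode: (\d+)\]")

failures = {}
for line in LOG_LINES:
    m = pattern.search(line)
    if m:
        gfid, code = m.group(1), int(m.group(2))
        failures.setdefault(code, []).append(gfid)

print("errcode 23 failures:", len(failures.get(23, [])))  # 2
```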
"SSH + tar" gives better results than "rsync": http://perf19.perf.lab.eng.bos.redhat.com:3838/june14/georepDenali3June_custom_build_compress_tar/
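For reference, geo-rep can be switched to the tar-over-ssh transport per session. A sketch of the commands, where mastervol and slavehost::slavevol are placeholder names; the option name use_tarssh is as documented for gsyncd, but verify it against your release:

```shell
# Switch an existing geo-rep session from rsync to tar over ssh.
# "mastervol" and "slavehost::slavevol" are placeholders for the real session.
gluster volume geo-replication mastervol slavehost::slavevol config use_tarssh true

# Confirm the setting took effect.
gluster volume geo-replication mastervol slavehost::slavevol config | grep tarssh
```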
This fixes stat on '.gfid' returning EINVAL, and as a result the rsync errcode 23 errors caused by it.

Upstream patch: http://review.gluster.org/8011
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/27398/
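For context on the errcode in those warnings: rsync exit code 23 means a partial transfer due to errors, per the rsync(1) man page. A tiny helper (illustrative only) to translate the codes seen in geo-rep logs:

```python
# A few rsync exit codes and their meanings, from the rsync(1) man page.
RSYNC_EXIT_CODES = {
    0: "Success",
    12: "Error in rsync protocol data stream",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
}

def describe_rsync_error(code):
    """Return a human-readable meaning for an rsync exit code."""
    return RSYNC_EXIT_CODES.get(code, "Unknown rsync exit code %d" % code)

print(describe_rsync_error(23))  # Partial transfer due to error
```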
- With Corbett:
  Initial sync (xsync) = 179 mins
  Post initial sync (changelog) = 27 mins

- With Denali (3.6.0.19-1):
  Initial sync (xsync) = 210 mins
  Post initial sync (changelog) = 27 mins

- With Denali (3.6.0.24-1):
  Initial sync (xsync) = 208 mins
  Post initial sync (changelog) = 24 mins
  http://perf19.perf.lab.eng.bos.redhat.com:3838/july14/georepDenaliJuly8_georep_3.6.0.24-1/

- With Denali (3.6.0.22-1 + Aravinda's patch, http://review.gluster.org/#/c/8124/):
  Initial sync (xsync) = 193 mins
  Post initial sync (changelog) = 23 mins
  http://perf19.perf.lab.eng.bos.redhat.com:3838/july14/georepDenaliJuly2_georep_3.6.0.22-1_customMasterPy/

So with Aravinda's patch we definitely see improvement, especially with xsync. With patch 8124 (http://review.gluster.org/#/c/8124), Denali performs ~14% better than Corbett for changelog sync, but still shows a ~7% perf regression for xsync as compared to Corbett's xsync.
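A quick check of how the ~7% and ~14% figures follow from the patched-Denali and Corbett timings above (the exact values come out to roughly 7.8% and 14.8%):

```python
# Timings in minutes, taken from the measurements in this comment.
corbett_xsync = 179       # Corbett initial sync (xsync)
corbett_changelog = 27    # Corbett post initial sync (changelog)
denali_xsync = 193        # Denali 3.6.0.22-1 + patch 8124, xsync
denali_changelog = 23     # Denali 3.6.0.22-1 + patch 8124, changelog

# Denali is slower than Corbett for xsync, faster for changelog.
xsync_regression = (denali_xsync - corbett_xsync) / corbett_xsync * 100
changelog_improvement = (corbett_changelog - denali_changelog) / corbett_changelog * 100

print("xsync regression vs Corbett:      ~%.1f%%" % xsync_regression)
print("changelog improvement vs Corbett: ~%.1f%%" % changelog_improvement)
```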
Downstream patches:
- https://code.engineering.redhat.com/gerrit/#/c/27398/
- https://code.engineering.redhat.com/gerrit/#/c/29133/
- https://code.engineering.redhat.com/gerrit/#/c/29095/

Upstream patches:
- http://review.gluster.org/#/c/8011/
- http://review.gluster.org/#/c/8260/
- http://review.gluster.org/#/c/8124/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html