Bug 1006172

Summary: Dist-geo-rep: Performance degradation between earlier versions and .33rhs

Product: [Red Hat Storage] Red Hat Gluster Storage
Component: geo-replication
Version: 2.1
Reporter: Amar Tumballi <amarts>
Assignee: Amar Tumballi <amarts>
QA Contact: Neependra Khare <nkhare>
CC: aavati, csaba, dshaks, grajaiya, kparthas, rhs-bugs, shaines, vagarwal, vbellur, vkoppad, vraman
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Story Points: ---
Regression: ---
Mount Type: ---
Documentation: ---
Fixed In Version: glusterfs-3.4.0.34rhs
Doc Type: Bug Fix
Last Closed: 2013-11-27 15:37:32 UTC

Description Amar Tumballi 2013-09-10 07:15:12 UTC
Description of problem:

From Neependra's Mail:

---
Hi,

To collect different performance stats I have been using build 15.
Yesterday I updated to build 22 and saw a degradation in geo-rep performance.
I re-ran the same test with build 15 and then with build 18.

Time taken to sync the same amount of data with different builds:
- With build 22 = 87 mins, 76 mins
- With build 18 = 47 mins, 40 mins
- With build 15 = 44 mins, 40 mins

For more details, look at the "Sequence of Event" section in the following:
http://perf19.perf.lab.eng.bos.redhat.com:3838/august13/georepBB_3.4.0.22rhs/
http://perf19.perf.lab.eng.bos.redhat.com:3838/august13/georepBB_3.4.0.18rhs/
http://perf19.perf.lab.eng.bos.redhat.com:3838/august13/georepBB_3.4.0.15rhsAug27/

From the above it looks like there is ~100% degradation in performance with the latest build (87 mins with build 22 vs. 44 mins with build 15).
I will try a run with builds 20 and 21 as well to narrow down the problem.

Regards,
Neependra
------

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.22rhs

How reproducible:
100%


Expected results:
Same performance as before.

Additional info:

----

I have some data points comparing builds 20 and 21:
http://perf19.perf.lab.eng.bos.redhat.com:3838/august13/georepBB_3.4.0.20rhs/
http://perf19.perf.lab.eng.bos.redhat.com:3838/august13/georepBB_3.4.0.21rhs/

I have also put the raw data at the bottom of the page. Look at the WAN throughput there (the netReads field) or in the 4th graph (Network Throughput on WAN emulator):

On build 21: between 17:20:18 and 18:31:43
On build 20: between 14:40:24 and 15:17:23

As you can see, the WAN throughput drops over time with build 21. I can think of two reasons this may happen:
1. Slow reads from the master server - this does not seem to be the problem
2. Slow processing of changelog files (a rough timing sketch follows this note)

Let me know if you want me to run any specific tests.
---
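
To help check hypothesis 2 above, here is a rough, hypothetical timing sketch in Python. It is not gsyncd code; the changelog directory path, the "CHANGELOG." file-name filter, and the NUL-counting heuristic used as a stand-in for "number of records" are all assumptions made purely for illustration.

#!/usr/bin/env python
# Rough diagnostic sketch (not gsyncd code): time how long each changelog
# file takes to scan, as a crude check on the "slow changelog processing"
# hypothesis. The directory path and the record-counting heuristic below
# are assumptions for illustration only.
import os
import time

CHANGELOG_DIR = "/path/to/brick/.glusterfs/changelogs"  # hypothetical path


def count_records(path):
    # Crude proxy for the work a changelog implies on the slave: count
    # NUL separators in the file (assumed to roughly track record count).
    with open(path, "rb") as f:
        return f.read().count(b"\0")


def main():
    for name in sorted(os.listdir(CHANGELOG_DIR)):
        if not name.startswith("CHANGELOG."):
            continue
        path = os.path.join(CHANGELOG_DIR, name)
        start = time.time()
        records = count_records(path)
        elapsed = time.time() - start
        print("%-40s %8d records  scanned in %.3fs" % (name, records, elapsed))


if __name__ == "__main__":
    main()

If per-changelog scan times stay flat while the WAN throughput drops, that would point away from hypothesis 2 and back toward the I/O path.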

Comment 1 Amar Tumballi 2013-09-10 07:17:06 UTC
>>>> On 08/28/2013 10:30 AM, Venky Shankar wrote:
>>>>> On Wed, Aug 28, 2013 at 09:43:50AM +0530, Neependra Khare wrote:
>>>>>> On 08/28/2013 02:05 AM, Amar Tumballi wrote:
>>>>> [snip]
>>>>>
>>>>>> As you can see, the WAN throughput drops over time with build 21.
>>>>>> I can think of two reasons this may happen:
>>>>>> 1. Slow reads from the master server - this does not seem to be the problem
>>>>>> 2. Slow processing of changelog files
>>>>> Well, changelog processing is not O(1): the more entries there are,
>>>>> the more entry and data operations there are on the slave.
>>>>>
>>>>> The changelog processing logic has not changed significantly between
>>>>> these builds. Let us take some sample runs with the current workload
>>>>> and check how much time it takes to process a changelog.
>>>>>
>>>> I have been working with Venky to collect more stats with another run.
>>>> With the latest run the result is still the same: build 20 is
>>>> performing better than build 21.
>>>> Venky suspects the problem is in AFR.
>>>>
>>>> In my opinion there are two ways we can proceed:
>>>> 1. Take a run without replication.
>>>> 2. Create a custom build from the latest build's patch-sets, excluding
>>>> the patches added in build 21, and take that run with AFR.
>>>>
>>>> Any other suggestions are welcome.
>>>>
>>> Avati,
>>>
>>> It is just one patch that went in between these two builds. Can you think
>>> of a reason? I am thinking that because we removed the '(need_unwind)' check
>>> in afr_writev, we may be taking more time to unwind the write calls now.
>>>
>>> Regards,
>>> Amar
>>>
>>>
>>>> Regards,
>>>> Neependra
>>>>
>>>>
>>>>>> Let me know if you want me to run any specific tests.
>>>>>>
>>>>>> Regards,
>>>>>> Neependra
>>>>>>
>>>>> Thanks,
>>>>> -venky
>>>>
>>>
>>
>> Amar
>> It was removed only in truncate and ftruncate; writev is untouched.
>>
>> Avati
>>
>
> And it turns out rsync --inplace (the way geo-rep uses rsync now) calls
> ftruncate() even if the size matches. That explains the perf drop. How
> severe is this? Are we planning to fix this by GA?
>
> Avati
>

And here's a fix: http://review.gluster.org/5737. It would be great to first get this patch tested to confirm that it indeed fixes the perf regression (since the analysis so far is only theoretical). Amar, can you please provide Neependra with a build so that we can verify this?

Avati
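
To make the ftruncate() point above concrete, here is a minimal, hypothetical sketch in Python. It is not the AFR code and not the patch at http://review.gluster.org/5737; the truncate_if_needed helper and the demo file path are invented solely to illustrate the idea that rsync --inplace finishes every transfer with an ftruncate(), so short-circuiting the call when the size already matches avoids needless work on each synced file.

#!/usr/bin/env python
# Minimal sketch (illustration only; not AFR code and not the actual fix):
# rsync --inplace ends each file transfer with ftruncate(fd, size) even when
# the destination already has that size, so a "size unchanged -> do nothing"
# guard skips a needless round of work per synced file.
import os


def truncate_if_needed(fd, size):
    # Hypothetical guard: only issue the truncate when the size actually changes.
    if os.fstat(fd).st_size == size:
        return False           # same size: nothing to do
    os.ftruncate(fd, size)     # size differs: truncation is really needed
    return True


if __name__ == "__main__":
    # Mimic what rsync --inplace does at the end of a transfer.
    fd = os.open("/tmp/georep-demo.dat", os.O_CREAT | os.O_RDWR, 0o644)
    os.write(fd, b"x" * 4096)
    print(truncate_if_needed(fd, 4096))  # False: size already matches, call skipped
    print(truncate_if_needed(fd, 1024))  # True: file shrunk to 1024 bytes
    os.close(fd)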

Comment 3 Gowrishankar Rajaiyan 2013-10-08 08:42:52 UTC
Fixed in version please.

Comment 6 errata-xmlrpc 2013-11-27 15:37:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html