Bug 237605

Summary: [cciss?/HT?] Sequential-write throughput decreases with an increase in CPUs and processes/threads in RHEL5
Product: Red Hat Enterprise Linux 5
Reporter: Masaki MAENO <maeno.masaki>
Component: kernel
Assignee: Vitaly Mayatskikh <vmayatsk>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: medium
Version: 5.0
CC: anton
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-02-10 09:28:21 UTC

Description Masaki MAENO 2007-04-24 02:03:01 UTC
Hi.

- Description of problem:
On RHEL5, sequential-write throughput on multi-CPU systems running multiple
processes/threads is worse than on RHEL4U4. The drop appears on systems with
4 or more CPUs when the number of processes is more than the number of CPUs.

Sequential-write throughput is especially bad in the following cases:
  - When HyperThreading is enabled.
    (Throughput drops by 70 - 80 % on RHEL5 with 8 CPUs [DL580G3 4CPU with HT
     or DL380G4p 2CPU with DualCore & HT], compared with RHEL4U4 with 8 CPUs
     [DL580G3 4CPU with HT or DL380G4p 2CPU with DualCore & HT] and with
     RHEL5 with 1 CPU [DL580G3 or DL380G4p].)
  - When the disk is accessed through the cciss driver.
    (Throughput drops by 10 - 20 % on RHEL5 with 4 CPUs [DL580G3 4CPU or
     DL380G4p 2CPU with DualCore], compared with RHEL5 with 4 CPUs
     [No-brand-machine 2CPU with DualCore, SATA rev2.0 disk].)

- Verification Hardware:
  - DL380G4p                       : CPU: Xeon     2.8GHz [DualCore, HyperThreading] x 2, Mem:  4GB, Disk: cciss Ultra320SCSI
  - DL580G3                        : CPU: XeonMP   3.0GHz [HyperThreading]           x 4, Mem: 12GB, Disk: cciss Ultra320SCSI
  - No-brand-machine (SystemWorks) : CPU: Xeon5160 3.0GHz [DualCore]                 x 2, Mem: 12GB, Disk: SATA rev2.0

- Version-Release number of selected component (if applicable):
  - kernel-2.6.18-8.el5
  - kernel-2.6.18-8.1.1.el5

- How reproducible:
Using benchmark software such as tiobench or IOzone, write data sequentially
from several processes/threads at the same time, with more processes/threads
than CPUs, on a system with 4 or more CPUs.

Steps to Reproduce:
tiobench example: # ./tiotest -t {1 or 2 or 4 or 8 or 16} -f {50 or 100 or 200} -b 4096 -d /mnt/otherdisk
IOzone example  : # ./iozone -pM -t{1/2/4/8} -l{1/2/4/8} -u{1/2/4/8} -i0 -i1 -i2 -r4k -s{128m,256m,512m,1024m} -F /mnt/otherdisk/temp{1,2,3,4,5,6,7,8}
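
The tiobench line above stands for a matrix of runs, one per thread count and
file size. A minimal sketch of how that matrix could be enumerated follows; the
function name tiotest_commands is hypothetical, and the flags and values are
copied from the command line in this report rather than from the tiotest
documentation. The script only prints the command lines and does not run them.

```python
from itertools import product

THREAD_COUNTS = [1, 2, 4, 8, 16]   # values passed to -t in the report
FILE_SIZES_MB = [50, 100, 200]     # values passed to -f in the report


def tiotest_commands(target_dir="/mnt/otherdisk", block_size=4096):
    """Return one tiotest argument list per (threads, file size) pair."""
    return [
        ["./tiotest", "-t", str(threads), "-f", str(size_mb),
         "-b", str(block_size), "-d", target_dir]
        for threads, size_mb in product(THREAD_COUNTS, FILE_SIZES_MB)
    ]


# Print the command lines only; nothing is executed here.
for argv in tiotest_commands():
    print(" ".join(argv))
```

Each list could be handed to subprocess.run() on a test machine where tiotest
is built in the current directory.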

Actual results:
When the number of processes/threads exceeds the number of CPUs, performance
deteriorates.
Sequential-write throughput is especially bad in the following cases:
  - When HyperThreading is enabled.
  - When the disk is accessed through the cciss driver (with 4 or more CPUs).

* tiobench Result - Sequential Write Throughput (MB/sec) - FileSize=200MB
+ DL380G4p 
- RHEL4U4 (2.6.9-42.ELsmp)
----------------------------------------------------------------------------------
   CPU | 1CPU   | 2CPU   | 4CPU (2CPU with DualCore) | 8CPU (4CPU with DC and HT)|
-------|        |        |                           |                           |
Thread |        |        |                           |                           |
----------------------------------------------------------------------------------
     1 | 57.102 | 58.67  |                    56.786 |                    57.01  |
     2 | 58.067 | 61.949 |                    62.253 |                    60.814 |
     4 | 63.437 | 67.869 |                    66.298 |                    65.184 |
     8 | 68.868 | 69.493 |                    70.449 |                    63.629 |
    16 | 72.52  | 72.455 |                    70.886 |                    68.042 |
----------------------------------------------------------------------------------

- RHEL5 (2.6.18-8.1.1.el5)
----------------------------------------------------------------------------------
   CPU | 1CPU   | 2CPU   | 4CPU (2CPU with DualCore) | 8CPU (4CPU with DC and HT)|
-------|        |        |                           |                           |
Thread |        |        |                           |                           |
----------------------------------------------------------------------------------
     1 | 62.217 | 56.871 |                    56.74  |                    57.318 |
     2 | 59.553 | 59.095 |                    58.443 |                    58.589 |
     4 | 64.76  | 61.921 |                    59.687 |                    57.429 |
     8 | 65.976 | 64.146 |                    58.11  |                    13.221 |
    16 | 66.472 | 63.652 |                    58.733 |                    17.531 |
----------------------------------------------------------------------------------

- RHEL4U4 vs RHEL5 : ((RHEL5 / RHEL4U4) - 1)
----------------------------------------------------------------------------------
   CPU | 1CPU   | 2CPU   | 4CPU (2CPU with DualCore) | 8CPU (4CPU with DC and HT)|
-------|        |        |                           |                           |
Thread |        |        |                           |                           |
----------------------------------------------------------------------------------
     1 | 0.0895 |-0.0306 |                 -0.000810 |                   0.005402|
     2 | 0.0255 |-0.0460 |                 -0.06120  |                  -0.03658 |
     4 | 0.0208 |-0.0876 |                 -0.09971  |                  -0.1189  |
     8 |-0.0419 |-0.0769 |deterioration--> -0.1751   |big deterioration -0.7922  |
    16 |-0.0833 |-0.1214 |deterioration--> -0.1714   |because of HT? -> -0.7423  |
----------------------------------------------------------------------------------
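
The comparison table is derived from the two throughput tables above using
(RHEL5 / RHEL4U4) - 1. As a small worked sketch, the 8CPU column for DL380G4p
can be recomputed as follows; the function name relative_change is
illustrative, and the numbers are copied from the tables in this report.

```python
def relative_change(rhel5_mb_s, rhel4_mb_s):
    """Return (RHEL5 / RHEL4U4) - 1; negative values mean RHEL5 is slower."""
    return rhel5_mb_s / rhel4_mb_s - 1


# DL380G4p, 8CPU column, from the two tables above: threads -> MB/sec
rhel4u4 = {1: 57.01, 2: 60.814, 4: 65.184, 8: 63.629, 16: 68.042}
rhel5 = {1: 57.318, 2: 58.589, 4: 57.429, 8: 13.221, 16: 17.531}

for threads in sorted(rhel4u4):
    change = relative_change(rhel5[threads], rhel4u4[threads])
    print(f"{threads:>2} threads: {change:+.4f}")
```

A value of about -0.79 at 8 threads means RHEL5 delivered roughly one fifth of
the RHEL4U4 throughput in that configuration.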

+ DL580G3
- DL580G3 produced almost exactly the same results as DL380G4p.


+ No-brand Machine (SystemWorks inc.)
- RHEL4U4 (2.6.9-42.ELsmp)
------------------------------------------------------
   CPU | 1CPU   | 2CPU   | 4CPU (2CPU with DualCore) |
-------|        |        |                           |
Thread |        |        |                           |
------------------------------------------------------
     1 | 55.77  | 55.348 |                    55.005 |
     2 | 51.02  | 52.744 |                    51.962 |
     4 | 51.336 | 50.487 |                    49.67  |
     8 | 51.254 | 49.034 |                    49.31  |
    16 | 50.9   | 50.294 |                    50.079 |
------------------------------------------------------

- RHEL5 (2.6.18-8.1.1.el5)
------------------------------------------------------
   CPU | 1CPU   | 2CPU   | 4CPU (2CPU with DualCore) |
-------|        |        |                           |
Thread |        |        |                           |
------------------------------------------------------
     1 | 55.32  | 54.937 |                    54.444 |
     2 | 52.547 | 48.174 |                    49.407 |
     4 | 50.381 | 48.729 |                    48.521 |
     8 | 50.503 | 48.907 |                    46.742 |
    16 | 50.958 | 48.057 |                    48.18  |
------------------------------------------------------

- RHEL4U4 vs RHEL5 : ((RHEL5 / RHEL4U4) - 1)
------------------------------------------------------
   CPU | 1CPU   | 2CPU   | 4CPU (2CPU with DualCore) |
-------|        |        |                           |
Thread |        |        |                           |
------------------------------------------------------
     1 |-0.00806|-0.00742|                  -0.01019 |
     2 | 0.0299 |-0.0866 |                  -0.04917 |
     4 |-0.0186 |-0.0348 |                  -0.02313 |
     8 |-0.0146 |-0.00259|  almost same --> -0.05207 |
    16 | 0.00113|-0.0444 |  almost same --> -0.03792 |
------------------------------------------------------

Expected results:
RHEL5 should match or exceed RHEL4U4 in sequential-write throughput on
multi-CPU systems running multiple processes/threads.
Otherwise, please tell me the recommended kernel parameter values to avoid this problem.

Additional info:
http://www.google.co.jp/search?num=100&hl=ja&client=firefox&rls=org.mozilla%3Aja%3Aofficial&as_qdr=all&q=performance+cciss+site%3A.lkml.org&btnG=Google+%E6%A4%9C%E7%B4%A2&lr=

Comment:
Please run these measurements yourselves.
I would also like to hear your opinion on this problem and how you plan to
handle it in the future.
This problem is important to every company in the NTT group.

Comment 1 Masaki MAENO 2007-05-14 11:17:49 UTC
I found that the cause of the poor sequential-write performance is the change
in the default value of /sys/block/<device>/queue/nr_requests from 8192 (RHEL4)
to 128 (RHEL5).

I confirmed that performance improved after raising nr_requests from 128 to
8192 on the hp machines with 4CPU (with DualCore) or 8CPU (2CPU with HT or
2CPU with DualCore and HT).
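
A minimal sketch of reading and changing nr_requests through sysfs is shown
below. The helper names and the sysfs_root parameter are assumptions added for
illustration (the latter so the code can be tried against a scratch directory),
and the device name in the usage note is hypothetical. Writing to the real
/sys/block tree requires root, and a value set this way is not kept across a
reboot.

```python
from pathlib import Path


def nr_requests_path(device, sysfs_root="/sys/block"):
    """Build the path to a block device's nr_requests attribute."""
    return Path(sysfs_root) / device / "queue" / "nr_requests"


def get_nr_requests(device, sysfs_root="/sys/block"):
    """Read the current nr_requests value for the device."""
    return int(nr_requests_path(device, sysfs_root).read_text().strip())


def set_nr_requests(device, value, sysfs_root="/sys/block"):
    """Write a new nr_requests value; on a real system this needs root."""
    nr_requests_path(device, sysfs_root).write_text(f"{value}\n")
```

For example, set_nr_requests("cciss!c0d0", 8192) followed by
get_nr_requests("cciss!c0d0") would apply and check the value from this
comment. The name cciss!c0d0 assumes the usual sysfs convention of replacing
the "/" in cciss/c0d0 with "!"; check the entries under /sys/block on the
machine in question.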


Comment 2 Vitaly Mayatskikh 2009-07-09 09:29:26 UTC
I ran iozone on an HP DL385 machine under both RHEL-4.4 and RHEL-5.4 (-pre) and saw RHEL4 outperform RHEL5 in the "readers", "re-readers" and "random writers" tests. In all other tests RHEL5 beats RHEL4. After I changed nr_requests from 128 to 8192, RHEL4 still outperforms RHEL5 in the "readers" (10%) and "random writes" (30%) tests. In all other tests RHEL5 is faster, by up to 5x.

Comment 3 Anton Arapov 2010-02-10 09:28:21 UTC
This bug has been closed due to long inactivity. Also, as per Comment #2, it is not something to fix: there are far too many changes between EL4 and EL5 for a particular slowdown to be considered a regression.

Thanks!