Description of problem:
Disk I/O slows dramatically after continual heavy disk usage. When transferring a large file (>3GB), Linux appears to introduce a huge amount of CPU overhead after transferring part of the file and slows the disk I/O to about one fourth of the prior transfer rate. On one system a 100GB file transfers at full speed for about 80GB (on a 160GB disk) and then the rest of the file is transferred at approximately one quarter of the original rate. On another system (an 80GB disk) a 20GB file is transferred at full speed for about 3GB and then the ENTIRE remainder is transferred at the much lower rate (in this case about one eighth the original rate). The application uses DIRECT_IO and this occurs on both SMP and non-SMP kernels. Other users have reported a similar problem when continually writing files of only 100-200MB. Occurs on RAIDs and single disks.

How reproducible:
The best way to replicate the slowing disk I/O is to compile the attached program with "gcc -o disk_test disk_test.c -D_GNU_SOURCE". The program transfers 512KB blocks and every 100 blocks (50MB) outputs a status message showing the time (in milliseconds) that the last 100 blocks took to transfer. If you do not see an increase in these times, edit disk_test.c (attached) and change the line "block_count = 5000;" to a much higher number (80% of the available disk space is likely sufficient).

Steps to Reproduce:
1. download disk_test.c
2. gcc -o disk_test disk_test.c -D_GNU_SOURCE
3. run disk_test; edit the number of blocks and recompile if the issue is not seen

Actual results:
On a single SATA drive I see a sustained transfer rate of about 33MBps for up to 80GB (disk size is 160GB, disk cache is 8MB, system has 3GB RAM, using DIRECT_IO, ext3, default scheduler) when transferring a 100GB file. The remaining 20GB is transferred at an unchanging rate of about 8MBps.
Created attachment 135412 [details] shows incremental and total transfer rate of large files
I am having a hard time reproducing this problem as described. I do see some variation in the transfer rate, but it does not drop like a step function as described. When the application starts running I see transfer rates of ~45MB/second, yet by the time the application finishes writing ~60GB that rate drops to ~30MB/second. This gradual drop over time is probably a disk geometry issue: the outermost disk cylinders contain more data than the inner cylinders, so at a constant rotational speed the transfer rate drops accordingly. Also, I see an occasional drop from ~40MB/second down to ~10MB/second when kjournald runs, but that is periodic throughout the run and has very little consequence for the speed of the application as a whole. Finally, I am guessing that the reporter's system is running into a fragmentation issue. When the disk becomes very full, the last chunks of data written to disk are likely much more fragmented than the early ones. This could cause the serious step-function performance degradation that you are seeing.

Larry Woodman
OK, here are the outputs of "vmstat 60" during the entire run of the application creating a 60GB file. As you can see, the system started writing at ~40MB/sec initially and degraded down to ~26MB/sec. The reason was determined to be disk geometry: since the amount of data differs between cylinders on the disk and it rotates at a constant rate, the actual transfer rate varies accordingly.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 621456 181356  76432    0    0     0 40211 1179  181  0  4 50 46
 1  0      0 616336 186344  76476    0    0     0 39991 1180  182  0  4 50 46
 0  1      0 611152 191432  76488    0    0     0 39996 1180  182  0  4 50 46
 0  1      0 606160 196308  76440    0    0     0 39062 1173  177  0  4 50 46
 0  1      0 601424 200924  76448    0    0     0 36190 1168  168  0  3 50 47
 0  1      0 596624 205484  76444    0    0     0 34969 1161  162  0  3 50 47
 0  1      0 592144 209960  76388    0    0     0 33803 1157  158  0  3 50 47
 0  1      0 587728 214236  76464    0    0     0 33852 1153  157  0  3 50 47
 0  1      0 583440 218356  76492    0    0     0 31791 1148  149  0  3 50 47
 0  1      0 579088 222324  76468    0    0     0 30752 1143  145  0  3 50 47
 0  1      0 574608 226212  76456    0    0     0 29131 1137  139  0  3 50 47
 0  1      0 570384 229856  76484    0    0     0 28145 1130  135  0  3 49 48
 0  1      0 566352 233352  76456    0    0     0 26227 1125  127  0  3 50 48
 0  1      0 564304 235084  76424    0    0     0 26419 1123  127  0  3 50 48
 0  0      0 564184 235264  76448    0    0     0   933 1040   26  0  0 93  7
--------------------------------------------------------------------------------
Also, you can see that the iteration time measured by the application correlates with the MB/sec reported by vmstat.
--------------------------------------------------------------------------------
[root@lovok tmp]# ./disk_test
cnt=100 +1133ms
cnt=200 +1186ms
cnt=300 +1130ms
cnt=400 +1183ms
cnt=500 +1172ms
...
...
...
cnt=106300 +1863ms
cnt=106400 +1902ms
cnt=106500 +1836ms
cnt=106600 +1964ms
cnt=106700 +1826ms
------------------------------------------------------------------------------
Since I am having trouble reproducing the "step-function" performance degradation reported in this bug, can I have Brendan Illingworth gather the same statistics on his machine and post the results here? Please start with the output of the disk_test program as well as "vmstat 60" output while it is running.

Thanks, Larry Woodman
Created attachment 135702 [details] top vmstat disk_test output
Created attachment 135703 [details] 100GB version
Created attachment 135704 [details] 50GB version
The attached disk_test_data.txt file contains output obtained while running the disk_test_100000 code (a 50GB file on a disk with 65GB free; 77% usage). In addition to the requested information contained in disk_test_data.txt, I have attached two additional files (disk_test_100000.c and disk_test.c). Both have been modified to provide incremental output every 1000 blocks; additionally, disk_test.c has been modified to create a 200,000-block (100GB) file. If the experimenter still does not see the issue, increase the file size as much as the disk will allow. I suspect you will see the issue at some file size; if not, please post the results of running df, vmstat 60, disk_test, and then df again, as I would like to see that you are indeed creating a sufficiently large file (try to use 90% or more of the free space).
OK, this should be debuggable. As you can see, the system time jumps from 4% to 83% when the disk IO rate drops from 33MB/s to 8MB/s.

--------------------------------------------------------------------------------
 0  1      0 5061676  59020 179640    0    0    1 32265 1166  363  2  4  0 94
 0  1      0 5058132  60944 179796    0    0    4 30077 1218  577  7  4  0 89
 0  1      0 5054204  62856 179840    0    0    1 30452 1218  458  2  4  0 94
 1  1      0 5047996  65540 179864    0    0   20 25045 1201  472  3 19  0 78
<><><><><><><>
 1  1      0 5047076  66148 179968    0    0    0  8283 1110  197  4 82  0 14
 1  1      0 5046372  66724 179968    0    0    0  8476 1056  142  2 83  0 15
 1  1      0 5045732  67300 179968    0    0    0  8487 1053  141  2 82  0 16
 0  1      0 5045092  67880 179968    0    0    0  8517 1055  142  2 82  0 16
 1  1      0 5044388  68476 179968    0    0    0  8697 1056  143  2 84  0 15
--------------------------------------------------------------------------------

I still can't get the same behavior myself. Can you get me several AltSysrq-W outputs when this happens so I can see what the system is doing?
It sounds like you are seeing a less severe version of what we are seeing. The indication that your transfer rate drops to 10MBps for some of the blocks is very significant, as it shows the disk throughput decreasing by 75% for at least 5 seconds (a 50MB block at 10MBps), where the throughput goes to a crawl. What activity is going on in the system at that point that would deprive the user program of 75% of the throughput? Additionally, the transfer rates in your disk_test output show the throughput dropping from (50MB/~12sec) to (50MB/~18sec), which is a 33% decrease in throughput from the beginning to the end of the file. While the linear velocity does decrease as the disk head slowly seeks inward toward the spindle, on most disks this is compensated by increasing bit density, resulting in a nearly constant data rate. Also, would it be possible to run the test with several larger files (~80% of the disk) and post the same data as you did for your first results?
I don't think I am seeing the same problem you are. When you hit this problem, the system time goes from 4% to 83% as the disk IO degradation occurs. Mine shows no such behavior. Either several AltSysrq-W outputs or a kernel profile are needed to figure out what is happening. Can you provide them?

Thanks, Larry Woodman
How large a file have you tried? How large is the disk? Have you tried this on more than one system? Have you tried different disks? The "system time" may not be highly correlated with the true cause of the problem; we should not focus on it as the only metric. It sounds like you are seeing a ~33% or greater drop in performance, which is very significant. I will post your requested information shortly.
Created attachment 135802 [details] Brendan Illingworth - Profile Dump
Can you also get me several AltSysrq-W outputs so I can see the actual callbacks?

Larry
Brendan, are you sure the kernel profile you posted is right? When you are getting 33MB/s your system is using only 4% system time, and when it drops off to 5MB/s the system time spikes up to 83%, yet the profile shows ~80% system time in both cases. Can you try it again and get me more data, including AltSysrq-W output? Thanks for your help.

Larry Woodman
The posted profiles were obtained before, during, and after running the application. I am unfamiliar with the output format; how do you deduce that the system time is ~80% from the profiles? By AltSysrq-W do you mean a screen shot saved to a file? (If not, please be more specific.) As soon as you have time, please answer the questions I previously posted, as this might help us get on the same page.
The profile showed almost no differences between the 33MB/s and 5MB/s states, yet the vmstat output showed almost all time being spent in the kernel when the performance dropped to 5MB/s. The profiles *should* be very different, especially WRT idle time, but they are not. Yes, please get the system into the 5MB/s, 83% system time state and get me several AltSysrq-W outputs when that happens. All of the data will be written to /var/log/messages.

Thanks, Larry Woodman
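If the AltSysrq-W keyboard chord is awkward on a server, the same dump can be requested through /proc. This is a sketch of the usual procedure, not a script from this bug; it assumes the magic SysRq interface is enabled and the script runs as root:

```shell
#!/bin/sh
# Writing 'w' to /proc/sysrq-trigger asks the kernel to print backtraces
# of blocked tasks to the kernel log, which syslog records in
# /var/log/messages.
SYSRQ=${SYSRQ:-/proc/sysrq-trigger}
COUNT=3        # take several snapshots around the slowdown
DELAY=1        # seconds between snapshots; ~10 is more useful in practice
i=0
while [ "$i" -lt "$COUNT" ]; do
    { echo w > "$SYSRQ"; } 2>/dev/null ||
        echo "cannot write $SYSRQ (need root?)" >&2
    i=$((i + 1))
    sleep "$DELAY"
done
echo "requested $i dump(s); check /var/log/messages"
```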
Please just run this in a shell script:

#!/bin/sh
while /bin/true; do
	echo;date
	/usr/sbin/readprofile -v | sort -nr +2 | head -15
	/usr/sbin/readprofile -r
	sleep 5
done
One last question: are there any error messages being written to the console or /var/log/messages when the slowdown occurs?

Larry
Created attachment 135886 [details] alt+sysrq+p output while in the 5MBps state
Created attachment 135887 [details] Profile Info

This was perfect timing: the first block of output in this file is from while we were cruising at 33MBps; the rest are from when the system crawls at about 5MBps.
VMSTAT AT SLOWDOWN BOUNDARY

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0 4786044 225396 248496   0    0    15  6977 1051  196  1 15 65 19
 1  1      0 4785268 225904 248532   0    0     0  3999 1047  110  1 91  0  8

MAKEPROFILE SCRIPT

cat /dev/null > readprofile.txt
while /bin/true; do
	echo; date >> readprofile.txt
	/usr/sbin/readprofile -v | sort -nr +2 | head -15 >> readprofile.txt
	/usr/sbin/readprofile -r >> readprofile.txt
	sleep 10
done

Let me know what you make of this.

-Brendan
Can you attach the /var/log/messages files after the slowdown occurs so I can look for any error messages? Thanks, Larry
I have run the application many times and never see any output in /var/log/messages. Have you had any success recreating the problem on your end using significantly larger files?
As of right now we have not been able to reproduce the behavior you are seeing, where there is a distinct 400% drop-off in disk IO performance. We are only seeing a gradual degradation like the one I showed in comment #4. We do, however, see a serious performance degradation with random writes and various block sizes. That is BZ 203634.

Larry Woodman
FYI, we added a 250GB SATA drive formatted as a single partition and used disk_test to fill it completely. While I do see the performance degradation I talked about in comments #3 and #4, I don't see the radical IO performance drop-off and the associated system time spike. Still trying, but so far we have not been able to reproduce this problem internally.

Larry
*** Bug 204989 has been marked as a duplicate of this bug. ***
Larry, as mentioned previously, it seems like you are reproducing the problem to some degree. The 30% slowdown is significant and could be only a fraction of the 400% slowdown that we see, due to system differences. Have you made any progress on identifying the cause of the slowdown? Also, I am surprised that your team hasn't been able to create a setup that elicits the problem to a much higher degree (e.g. a 400% slowdown), as we see this performance drop on six of six different machines running RHEL WS4. Please post the output of disk_test (with a 100GB and a 200GB file) on your two systems with 250GB drives. This bug does not involve random writes or varying block sizes. This performance degradation could limit the ability to gain market share, as it disqualifies RHEL for use in sustained data acquisition applications. If we need to request allocation of more engineers on this specific bug, perhaps we should attempt to escalate its severity. Once again, could you please run the 100GB-200GB version of disk_test on several machines and post the output.
Can we get the complete system info: server brand, CPU type, memory, and the specific I/O controller? I know you're using SATA, but what system/driver are you using? I could not reproduce this on my Ultra SCSI or Fibre luns.
Brendan, sorry for the delay. We have not been able to reproduce the behavior you are seeing. I do see a very smooth degradation from ~40MB/sec to ~25MB/sec as we fill up a 250GB drive. This was determined to be a disk geometry issue, simply because there is less data in the inner tracks than the outer tracks and the disk rotates at a constant rate.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 621456 181356  76432    0    0     0 40211 1179  181  0  4 50 46
*****************************************************************************
 0  1      0 564304 235084  76424    0    0     0 26419 1123  127  0  3 50 48
------------------------------------------------------------------------------

Can you tell us exactly what system you are using so we can try to reproduce the problem on that hardware?

Thanks, Larry Woodman
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.