Description of problem:

As part of the release tests for LifeKeeper (the SteelEye HA product), we perform stress acceptance tests of the operating system. These tests involve taking a heavy-duty array (MSA100/EVA/MA8000) with a large number of LUNs and running a stress test simultaneously on each of the LUNs. The stress tests are a simple tar, untar and checksum of a fixed pool of data (all residing on the LUN).

Previously, RHAS 2.1 (a sparse set of kernels up to 2.4.9-e.40 was tested) was able to withstand the full 32-LUN load. No version of RHEL 3.0 has managed to get beyond about 8 LUNs without failing.

Our failure criteria are defined by the LifeKeeper product. Either the communications fail, which on a running system means that the communications process (which is installed reniced to realtime priority and mlocked into memory) was not scheduled for a period of 15 seconds, or LifeKeeper believes it has lost contact with one of its discs (each disc is pinged every few seconds using an INQUIRY; a failure occurs if there is no response to the INQUIRY after 120 seconds). With RHEL, both of these failures have been observed.

We've tried starting out with one LUN of stress and gradually adding one LUN at a time. Under this type of load increase, the time taken for I/Os to complete rises dramatically after about 2 LUNs (measured with both iostat and the simple INQUIRY timing that LifeKeeper does). We also observe free memory dropping to less than 1 MB (the remainder all residing in the cache), system time rising to around 70% and iowait time falling to about zero. sar -b reports that the number of transactions per second remains constant at between 1 and 2 (by contrast, with RHAS 2.1 the tps rises linearly with the number of LUNs until it levels out at about 120).

Version-Release number of selected component (if applicable):

We've tried this with the three default kernels from RHEL3.0, U1 and U2, with no appreciable differences in the results.

We also tried altering various kernel tuning parameters:

  elvtune -r 4 -w 4 on all the devices
  echo 100 > /proc/sys/vm/overcommit_ratio

We also tried reducing the tag depth on the qla2340 cards all the way down to 8 in the drivers. None of these produced any appreciable effect.

How reproducible:

All the time

Additional info:

The failing systems are IBM 330s with 1 GB of RAM and two CPUs. The storage is a FASTt 200 optical SAN using a Brocade switch and qla2340 fibre cards. The CPUs are:

  cpu family : 6
  model : 11
  model name : Intel(R) Pentium(R) III CPU family 1133MHz
  stepping : 1
  cpu MHz : 1128.596
  cache size : 512 KB

We also noted that SMP kernels seemed to withstand much less stress (3-4 LUNs) than UP kernels (which could get up to 7-8 LUNs before failing).
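For anyone wanting to approximate the workload described in the report, a rough sketch follows. It is not the reporter's script, which is not attached in this thread. It assumes each LUN is mounted as a directory containing a fixed data pool in a subdirectory named pool; the function names, the pool and pool.tar names, and the use of SHA-256 for the checksum are all illustrative assumptions.

```python
import hashlib
import tarfile
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tree_digest(root: Path) -> str:
    """SHA-256 over the relative names and contents of all files under root."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(path.read_bytes())
    return h.hexdigest()


def stress_once(lun_dir: Path) -> bool:
    """One tar / untar / checksum pass, with all I/O kept on the LUN itself.

    Assumes the fixed data pool lives in lun_dir/pool (hypothetical layout).
    Returns True if the extracted copy matches the original pool.
    """
    pool = lun_dir / "pool"
    archive = lun_dir / "pool.tar"
    # tar: archive the pool onto the same LUN
    with tarfile.open(archive, "w") as tar:
        tar.add(pool, arcname="pool")
    # untar into a scratch directory on the same LUN, then compare checksums
    with tempfile.TemporaryDirectory(dir=lun_dir) as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        ok = tree_digest(Path(scratch) / "pool") == tree_digest(pool)
    archive.unlink()
    return ok


def stress_all(lun_dirs, passes=1):
    """Run the passes on every LUN directory simultaneously.

    Returns a dict mapping each directory (as a string) to True if every
    pass on that LUN verified correctly.
    """
    def worker(lun_dir):
        return all(stress_once(Path(lun_dir)) for _ in range(passes))

    with ThreadPoolExecutor(max_workers=max(1, len(lun_dirs))) as pool:
        return dict(zip(map(str, lun_dirs), pool.map(worker, lun_dirs)))
```

Calling stress_all with a growing list of mount points (one LUN, then two, and so on) would mirror the gradual load increase described in the report. One thread per LUN is used here only to keep the sketch short; the original test may well have used separate processes.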
We have seen the same problem on RHEL 3 with OCFS. In our case we had about 20 LUNs, and even up to 50.
Could be IO elevator, could be SCSI midlayer, could be the HBA driver. Assigning to both Tom Coughlan and Doug Ledford, who've done work on these kernel subsystems...
Rik, it could be any of those three, but I don't think that would explain the system time going through the roof. James, can you boot up this machine with kernel profiling enabled, load it up until it's doing this exact thing, then zero out the profile, let it run this way for a minute or so, then get the profile data and post that here? I'd like to know what part of the kernel we are spending our time in before doing any guesswork as to what the cause is.
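Once the profile has been captured, the question of where the kernel is spending its time comes down to ranking symbols by tick count. A small sketch of that step is below. It assumes the text is readprofile output in its usual three-column form (ticks, symbol name, normalized load) with a trailing "total" line; the function name top_symbols is made up for illustration.

```python
def top_symbols(profile_text, n=10):
    """Rank kernel symbols from readprofile-style output by tick count.

    Each data line is assumed to look like: <ticks> <symbol> <load>.
    Lines that do not match that shape, and the summary "total" line,
    are skipped. Returns a list of (symbol, ticks, share_of_all_ticks).
    """
    rows = []
    for line in profile_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        ticks, symbol, _load = parts
        if symbol == "total":
            continue
        rows.append((int(ticks), symbol))

    # Avoid dividing by zero when the profile is empty
    total = sum(ticks for ticks, _ in rows) or 1
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(symbol, ticks, ticks / total) for ticks, symbol in rows[:n]]
```

Feeding it the saved profile text and printing the first ten entries gives a quick view of which functions dominate the system time during the stall.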
Created attachment 100979: readprofile output
The readprofile output was taken from a 2.4.21-15.EL machine a few minutes into a tiobench. readprofile was issued prior to tiobench to clear counts.
During the readprofile/tiobench run was the system exhibiting very poor throughput and scheduling behavior?
Hmmm... what would you qualify as poor throughput in terms of tiobench values?
This indeed seems similar to bug 121434, where a similar issue has been observed on production systems with practical applications. James, could you attach your script that does the synthetic stress-testing benchmark? I could test one of my systems overnight to see if I get similar results.
Posting this comment just in case anyone on this bugzilla is not also watching bugzilla 121434. There was a set of test kernel rpms posted in bugzilla 121434 that *might* also have an impact on this issue. No guarantees, but tests and feedback appreciated.
Are there any parties still interested in this bug? Upgrading to RHEL4 solves the problem, AFAIK.
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.