From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4)
Description of problem:
Bordeaux w/ 2.4.9-13.1smp, bios a02, 32GB ram, 2GB swap.
Running copy and compare on a qlogic2200 8GB Lun with one primary partition
and ext3 filesystem mounted to /sdc directory.
Copy and compare is started with 1 stream and a count of 50 using a 650MB
iso image. The source file and both destination directories are placed in
the qlogic mounted filesystem. After about 35 successful passes, cpcmp
errors out and states "no space left on device," but df reports utilization
of filesystem at only 25%.
If one successful pass is made, then any number of passes should complete,
but this is not happening.
This same test has passed on a megaraid filesystem and an onboard qlogic
filesystem. Only a qlogic2200 add-in card with direct attached PV650F
storage is failing.
I have tried recreating partition with parted, reformatting filesystem with
mke2fs -j, and remounting filesystem.
I have not tried ext2, and have not recreated the lun with the qlogic bios
utility. I will do this next.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install bordeaux with 2.4.9-13.1 or earlier, qlogic2200 w/ PV650F
2. Create a GPT primary partition with ext3 filesystem and mount
3. Run copy and compare to use at least 1GB of space per pass, and run
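As a rough sketch, step 2 might look like the following. The device name /dev/sdc is an assumption based on the /sdc mount point mentioned above; adjust to the actual qlogic lun.

```shell
# Hypothetical commands for step 2; /dev/sdc is assumed from the /sdc mount point.
parted -s /dev/sdc mklabel gpt                  # GPT disk label
parted -s /dev/sdc mkpart primary ext2 0% 100%  # one partition spanning the disk
mke2fs -j /dev/sdc1                             # -j adds the ext3 journal
mkdir -p /sdc
mount -t ext3 /dev/sdc1 /sdc
```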
Actual Results: Failure of copy and compare due to lack of disk space,
despite df reporting only 25% utilization of filesystem.
Expected Results: No failure
Please try with anything *other* than 2.4.9-13.1. 2.4.9-13.1 is 'known bad'.
It should be noted that the 2.4.9-13.1 that Clay refers to was built by me,
based on -13 plus megaraid 1.18 patch applied and e1000 (unsigned long->u32)
fix. The same problem occurred with 2.4.9-9.x also.
small request: please add something like "dell" to the version in the future in
order to avoid confusion with Red Hat build numbers
Unfortunately, the qlogic2x00 driver is of such poor quality that it's not
realistic to find, let alone fix, bugs in it. If Dell has contacts at Qlogic,
please take it up with Qlogic.
Reproduced in RC2 and RC1. Also a useful piece of information. I tried ext2
and not only did I not get the error, but the performance of the copy and
compare was significantly improved. I'd say at least 5x faster with ext2.
Please try the 2.4.9-13.3 we sent to Dell yesterday; the error seems to be an
out-of-memory condition that could be fixed/avoided in that kernel.
Reproduced in qa1108 (2.4.9-13.3smp)
2.4.9-13.4 has another possible fix and will be available for you shortly
your copy/compare is one of the kinds of loads that ext3 is expected
to be slower on. You can test ext3 with "data=writeback" and you should
find that the performance is pretty close to what you get with ext2.
and in particular http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html
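For reference, switching the journaling mode is just a mount option. A sketch, assuming the filesystem from this report (/dev/sdc1 mounted on /sdc):

```shell
# Remount the ext3 filesystem with writeback data ordering.
# /dev/sdc1 and /sdc are assumptions from the report's setup.
umount /sdc
mount -t ext3 -o data=writeback /dev/sdc1 /sdc
# or make it persistent via /etc/fstab:
#   /dev/sdc1  /sdc  ext3  defaults,data=writeback  0 2
```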
2.4.9-13.4 is now available at
Please give it a whirl!
Reproduced with qa1108 updated to 2.4.9-13.4smp kernel.
Reproduced with qa1129 (2.4.9-17.3smp kernel), though it did take twice as long
Where can we pick up a copy of cpcmp to run?
that was quick, thanks!
To make sure that we are duplicating the problem, what is the
*exact* command line that you are running to see this?
./cpcmp.pl 9 testbig.iso ./a,./b,50,1
Testbig.iso is a single 670MB iso image file, made from a distro CD I think.
./a and ./b are placed in the directory the qlogic device is mounted to.
And I start one stream, for 50 iterations.
My lun is 8GB, so this scenario should never get above approximately 25% disk utilization.
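For readers without access to cpcmp.pl, a minimal stand-in for one stream of this workload might look like the sketch below. This is not the real script; file sizes and pass counts are scaled down here so it runs anywhere.

```shell
#!/bin/sh
# Minimal copy-and-compare sketch (hypothetical stand-in for cpcmp.pl).
# The report uses a ~670MB iso and 50 passes; this uses a small file and 5 passes.
set -e
SRC=src.img
PASSES=5
dd if=/dev/urandom of="$SRC" bs=1024 count=64 2>/dev/null
mkdir -p a b
for pass in $(seq 1 "$PASSES"); do
    cp "$SRC" a/copy.img         # write into first destination
    cp a/copy.img b/copy.img     # write into second destination
    cmp "$SRC" b/copy.img        # compare; a mismatch aborts via set -e
    rm -f a/copy.img b/copy.img  # delete, so each pass reallocates blocks
done
echo "all $PASSES passes OK"
```

Each pass allocates, compares, and then frees roughly 2x the source size, which is exactly the delete-then-rewrite pattern discussed later in this report.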
I've had one successful run so far on our bordeaux with 32GB RAM, 2GB
swap, writing on an empty 8GB ext3 partition on a drive on a PV660F
using the qla2200 driver (not qla2x00 driver) with the 2.4.9-17.3
smp kernel. I've started another one.
I wonder, is there any chance that we didn't loudly enough communicate
to you that qla2x00 is the old driver left in for comparison with the
new driver, qla2200?
Yes, there is a pretty good chance of that :) I will try the qla2200 now.
I had another run complete successfully and have started a third, so I'll
bet this is fixed for you as well.
FWIW, Kudzu should have this set up correctly, so new installs should
select the 2200 driver for ISP2200 cards. We're leaving the 2x00 driver
for the ISP2100 cards until we can confirm that Arjan's driver works for
the ISP2100 cards.
FYI, kudzu has been updated because my modules.conf had a qla2200 entry after
install. However the initrd did not contain the driver because the installer
has not been updated with qla2200 entries.
OK, folks here confirm that after that tree went out, a new fix was put in
the installer for this case, so it should work right from the installer
I've seen two successful 50-count runs. I am satisfied that this is fixed.
"I am satisfied that this is fixed." -> closing
Since this seemed fixed, we decided to step up the disk IO intensity to really
test the 2.4.9-17.4 qlogic driver, and we are getting this error again.
Started cpcmp on 5 luns simultaneously with 4 streams on each lun with:
./cpcmp.pl 9 testbig.iso ./a,./b,50,4
This gets each lun's disk usage to roughly 50-60% at peak.
After a couple of hours, 3 of the luns are reporting out of disk space even
though they are only at 50-60% used.
Same setup as previously mentioned, 4 of 5 luns failed within an hour with
This is the same testing level that you've been doing with other
controllers like the adaptec SCSI controller, the megaraid controller,
Stephen says that this could be a design effect of ext3. Instead of
having just "in use" and "free" blocks, you have "in use", "free",
and "free but not available because transactions have not yet been
committed". When your scripts get in sync and delete several iso
images at the same time, it is possible that you get so many of that
third state that is neither properly "in use" nor "free" that with
apparently plenty of space on the filesystem, you temporarily do not
have free blocks to use. It is also possible that this only happens
with qlogic and not with the other drivers because of differences in
If you try this with ext2 and it still happens, that would rule this
out, of course. Could you do that?
If you want the gory detail, ext3 has to avoid reusing disk blocks which have
been freed but where the delete has not yet been committed to disk. After a
crash, subsequent recovery can end up rolling back any uncommitted transactions,
and if we reused deleted blocks before their commit, we might overwrite them on
disk while there is still a chance that the old version might be needed after a
recovery and rollback.
I've never seen a scenario where this made an observable difference, but doing
mass deletes and rewrites on a sufficiently fast disk array might cause the
behaviour you describe simply because the deletes have not been committed. If
we can determine that this is definitely the cause of your problem, then it
should be possible to teach ext3 to recognise this situation and to force an
early commit rather than returning ENOSPC immediately.
Per conversation with Clay, ext2 doesn't fail, just ext3. Also, I've often
seen the cpcmp scripts (which simply kick off all processes at once) running
pretty synchronized, so that leads me to believe that Stephen's thoughts are on target.
Clay, please try mounting with "data=writeback" and see if this goes away.
I've also asked to have the tests run with more small files
(say, /usr/share/doc instead of a 650MB file) and see if anything changes.
"data=writeback" will not cure the problem. All that "writeback" mode does is
to remove any synchronisation of data writes with transaction commits, so that
newly-written data may not be seen on disk after a crash, and you can therefore
find stale data blocks in recently-allocated files.
Even in writeback mode, ext3 still refuses to allow such data writes to
overwrite metadata which is still valid.
However, in "data=journal" mode, the problem _will_ probably disappear, simply
because the writing of data to the journal will force transaction commits much
more frequently (of course, you'll see much worse performance in that mode for
the sort of workload we're talking about here).
Running the five-lun setup mounted with the data=journal option. It is running
fine (if a little slow) and has made it past the point where I would expect an
error. I'm letting it run over night and will update you in the morning whether
pass or fail.
One other thing to try is to simply test with a larger partition -- try, say,
a 12GB instead of 8GB disk with the same cpcmp invocation.
Also, is your lun a single disk, or is it a switched raid volume that would
be faster than a single disk?
Well, still running this morning. Still hasn't finished due to the slower
performance of running with data=journal. This is definitely past the point
where I would expect a failure, though.
Each lun is on a single 8GB disk.
I would need to reconfigure the enclosure to span across multiple disks to get a
bigger lun, which I could do but haven't tried.
One thing that still puzzles me...what happened in the qlogic driver between
17.3 and 17.4 that made the problem disappear with only one stream running?
And due to ext3's design, will there always be some threshold where the uncommitted
but deleted blocks reduce the usable disk space? To what extent can and will
ext3 be modified to account for this?
Use software RAID 0 to put two luns together to test on a larger FS.
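A sketch of that suggestion using mdadm; the device names /dev/sdc and /dev/sdd are assumptions, so substitute two of the actual 8GB luns. (Installs of this era shipped raidtools, i.e. /etc/raidtab and mkraid, rather than mdadm; the idea is the same.)

```shell
# Stripe two 8GB luns into one ~16GB md device, then format ext3 and mount.
# /dev/sdc and /dev/sdd are assumed device names.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
mke2fs -j /dev/md0          # ext3 on the striped device
mkdir -p /bigfs
mount -t ext3 /dev/md0 /bigfs
```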
17.3 and earlier: The older driver was slower. With enough physical
ram to cache several successive allocations without needing to go to
disk, and a big enough journal, it's possible that you could get enough
block deletes pending but not committed because of the slow driver.
Part of it might have been serialization due to more abuse of the
io_request_lock, I don't know.
And yes, quite fundamentally with a journaling file system, the two
phase commit will always mean that there will, for some window of time
after a file has been deleted, be blocks that are neither allocated for
existing files nor available for use for other files. That's just one
of the fundamental tradeoffs of a journaling file system. You are
trading off space for consistency. The journal size is only part of the story.
This is a degenerate case and so far Matt hasn't taken my challenge to
step up to the plate with a real-world case where this could be a problem.
In normal cases, this pressure doesn't show up very much.
Lowering to Sev 2 here, "enhancement" in Bugzilla. Workarounds include:
1) data=journal (working so far)
2) increase disk space so you've got >=2X the space the workload consumes
3) 'sync; sync; sync; sleep 10;' between massive deletes and refills to give
the journal a fighting chance to run.
4) using a larger journal
4) correction: use a *smaller* journal, which fills and commits sooner, so
freed-but-uncommitted blocks become reusable earlier
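Workaround 3 wired into the test loop might look like this; /sdc and the file names are assumptions taken from the report's setup.

```shell
# Sketch of workaround 3: force commits between mass deletes and refills.
cd /sdc                       # assumed mount point from the report
rm -f a/*.iso b/*.iso         # mass delete
sync; sync; sync; sleep 10    # give ext3 a chance to commit the deletes,
                              # turning "free but uncommitted" blocks into free ones
cp testbig.iso a/             # refill only after freed blocks are reusable
```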