Bug 25749
Summary: | zero page corruption in 2.4.* | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Christopher Blizzard <blizzard> |
Component: | kernel | Assignee: | Ben LaHaise <bcrl> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brock Organ <borgan> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.0 | CC: | bcrl, dmgrime, sct |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2001-02-13 22:19:44 UTC | Type: | --- |
Description
Christopher Blizzard
2001-02-02 19:35:10 UTC
Client side test case software (before I forget): http://polygraph.ircache.net/

I'm looking for the kernel rev where the problem appears to have been introduced. As the problem doesn't occur at exactly the same place every time, it might be a while before I can track it down.

So far, 2.4.0-test10 APPEARS to be OK.

2.4.0-test11 and 2.4.0-test12 APPEAR OK too. There is something going on between 2.4.0-test12 and 2.4.0-prerelease which seems to be affecting the performance of the application. I suspect (based on timing information tracked in the app) that the cost of mmap() has increased in situations where a process maintains a large number of active mappings (~5000+). It also appears that disk I/O has slowed significantly as well, perhaps related to the above. This behaviour is also shown in production 2.4.0. I'll start looking at the 2.4.1-testX kernels; so far, I can only reproduce regularly on 2.4.1.

It appears that the change that causes the behavior went in with 2.4.1-pre1. As noted previously, 2.4.0-prerelease and 2.4.0 both performed horribly compared to 2.4.0-test{9,10,11,12}. 2.4.1-pre1 crashes as initially described and the system becomes unusable: even simple commands such as "ls" and "sync" die with SEGV.

There are two things I'm curious about: could you try booting the kernel with the nofxsr option? Also, does the corruption still occur if you run the machine with no swap?

add dmgrime to the cc list

Tests run under both 2.4.1-pre1 AND 2.4.1:

nofxsr && noswap: performance problem as described above, no crash
nofxsr          : performance problem as described above, crash
noswap          : performance problem as described above, no crash

So it seems the crash can be prevented by disabling swap, but the performance problem persists from 2.4.0-prerelease through 2.4.1 production. I will try to dig into the "performance problem" I keep referring to; my first instinct points at something with the raw device I/O.
I suspect it has to do with concurrent raw requests to multiple physical devices. I'm going to rerun some tests with only 1 spindle; the application serializes requests per spindle, so this will rule out a concurrency race.

Can you please test again with swap after applying the following patchball: http://www.kvack.org/~blah/fix-v2.4.1-A.tar.gz

Unpack the tarball and apply the patches with

    for i in fix-v2.4.1-A/*.diff ; do patch -p1 -s -N -E -d linux/ <$i ; done

This has the kiobuf fixes from Stephen, a patch for zeropage COW based on Linus' ideas, and Jens' block fixes. I'm also curious to know which of the patches make a difference (I expect that the zeropage fix is the culprit).

-ben

Patch downloaded and applied against stock 2.4.1. The crash symptoms appear to be gone, but the "performance" issue remains. Did something change from 2.4.0-test12 to 2.4.0-prerelease that would affect the performance of an application with MANY ( >5000 ) active mmap() segments? There appears to be quite a bit of activity in mm/mmap.c, in particular the removal of "merge_segments()"; perhaps related? I'm going to try stock 2.4.1 with only the zeropage patch next to check stability. Update coming soon.

Update: Stock 2.4.1 + 05-zeropage.diff is stable; crash symptoms gone. Performance problems remain. Please see previous note regarding mm/mmap.c.

Ben, I'm assigning this bug to you directly since you are working on it.

Here's a quick update: I was pretty much out of commission last week, but I'm back now and putting together a patch based on the suggestion that the removal of segment merging in the kernel is the source of the problem. I should have it for you later on today, and will update this entry then.

This was fixed for 7.1 final.
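For anyone wanting to probe the mmap() suspicion raised in this thread (cost rising once a process holds ~5000+ active mappings), here is a minimal sketch. It is not the Polygraph workload and not anything the reporters ran. The function names `time_mmap_batches` and `count_vmas`, the batch sizes, and the use of one-page anonymous mappings are all assumptions made for illustration. It keeps every mapping alive and times each batch, so a per-batch time that climbs as the count grows would hint at a per-mapping cost. Counting lines in /proc/self/maps gives a rough view of whether the kernel merged adjacent segments.

```python
import mmap
import time

PAGE = mmap.PAGESIZE


def count_vmas():
    """Return the number of lines in /proc/self/maps, or None if unavailable.

    Each line is one mapped region as the kernel sees it, so this shows
    whether adjacent mappings were merged. Linux-specific (assumption).
    """
    try:
        with open("/proc/self/maps") as f:
            return sum(1 for _ in f)
    except OSError:
        return None


def time_mmap_batches(total=6000, batch=1000):
    """Create `total` anonymous one-page mappings and keep them all open.

    Returns a list of (mappings_alive, seconds_for_this_batch) tuples,
    one per batch. All mappings are closed before returning.
    """
    held = []
    results = []
    try:
        for start in range(0, total, batch):
            n = min(batch, total - start)
            t0 = time.perf_counter()
            for _ in range(n):
                m = mmap.mmap(-1, PAGE)  # -1 requests an anonymous mapping
                m[0] = 1                 # touch the page so it is really backed
                held.append(m)
            results.append((len(held), time.perf_counter() - t0))
    finally:
        for m in held:
            m.close()
    return results


if __name__ == "__main__":
    before = count_vmas()
    for alive, secs in time_mmap_batches():
        print(f"{alive:6d} mappings alive, last batch took {secs:.4f}s")
    print("regions in /proc/self/maps before the run:", before)
```

Timings from a script like this are only indicative: interpreter overhead is included, and a kernel that merges adjacent anonymous mappings may report far fewer regions than the number of mmap() calls made.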