From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)

Description of problem:
This failure happened during testing for another VM issue explained in bugzilla 100739. You may want to check that issue for history and a description of the test environment. I am breaking this out as requested by MKJ. We have only seen this issue on SMP boxes, but most of the testing has been on SMP boxes.

Version-Release number of selected component (if applicable):
kernel-2.4.20-19.9.3

How reproducible:
Sometimes

Steps to Reproduce:
1. Run newburn on a system with a VNC server activated for about 2 days.

Actual Results: System panic in page_referenced().

Expected Results: No panic.

Additional info:
This has been reproduced on a 2.4.20-19.9.3 based kernel. The kernel was recompiled with a 1.18h megaraid driver to support the PERC4/DC that was in some of the test cases. It has been reproduced on earlier kernel versions; those crashes are included in the tarball in a subdirectory.
Created attachment 93578 [details] Panics and partial objdumps of kernels.
Created attachment 93579 [details] JPG of oops

Excerpt from call trace:
  launder_page 0x1f2
  refill_inactive_zone 0x51e
  rebalance_dirty_zone
  rebalance_inactive_zone
  rebalance_inactive
  do_try_to_free_pages_kswapd
From "older-panics":

  EIP is at page_referenced [kernel] 0xe5 (2.4.20-9rhsmp)
                                           ^^^^^^^^^^^^^

So this is a longer-standing problem that was NOT raised as a show-stopper when we asked for a list of *all* show-stoppers before starting this exercise.

PLEASE, if you are going to use a camera to record oops messages, do it at a higher resolution than 640x480. That's screen resolution, and that means that your images are barely legible. Stopping the camera down to lowest resolution (or using a decade-old camera or a webcam or some similar junk) just makes the job harder. I know that Dell sells real digital cameras...
In regards to comment #3: The issue was regressed to see if they would occur under the latest errata. Especially since there was a code change in page_referenced() that seems like it might have fixed the issue.
So what was the previous unique bugzilla# you had reported this under?
This issue was seen during Taroon development and subsequently corrected. I'm not sure we ever saw it on x86 though ...
IIRC the solution was to make cpu_relax() include a barrier, which in x86 means making rep_nop() include a memory barrier. Could you please try that ?
@@ -517,7 +518,7 @@ struct microcode {
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
 static inline void rep_nop(void)
 {
-	__asm__ __volatile__("rep;nop");
+	__asm__ __volatile__("rep;nop" ::: "memory");
 }

 #define cpu_relax() rep_nop()
I will try this patch tonight... but shouldn't I see more spin lock failures if this was the case?
Okay, I must be missing something. I just don't see how this can help when the problem is occurring on a single physical CPU system with hyper-threading. Can you dig up exactly why this was put in Taroon? Was the processor/chipset vendor involved?
The cpu_relax() has to include a memory barrier, otherwise the compiler is under no obligation to reload the variable from memory, and the system could spin forever in this loop, not unlike what you've seen happening... On x86 it usually doesn't trigger, but on some other architectures it was immediately noticeable. In the 2.5 kernel cpu_relax() includes a barrier on all architectures.
Yes. But this problem is happening on a single physical CPU with HT on. It has a shared cache between the CPUs.
Besides, x86 is a cache-coherent architecture. Memory barriers should only be needed for ordering of cache writes, not for ensuring cache coherency. If that is the problem, then the CPU/chipset vendors need to know and fix the issue.
Please see bugzilla 100739 comment #58 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100739#c58) for information that might pertain to this issue as well.
Please see bugzilla 100739, comment #64, for a patch for this issue. The patch has been verified on 15 machines running newburn for 3 days; one would normally see 6+ failures in that time frame on those same test machines. Testing will continue while code review verifies the fix of this race condition.
OK, so I was wrong, it's all one big happy bug family... *** This bug has been marked as a duplicate of 100739 ***
Opening up per Dell request (rh).
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.