From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.9-31 i686; Nav)

Description of problem:
The netdump client (the netconsole.o module) often gets a false idle_timeout and stops during the dump of the processor memory.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up the netdump server and client
2. Make sure that the server requests a memory dump
3. Force a kernel crash

Actual Results:
The netdump client gets an idle_timeout during the transfer of the processor memory (512MB).

Expected Results:
The transfer should complete without an idle_timeout.

Additional info:
The processor memory is 512MB. The false idle_timeout occurs almost always. The kernel version is 2.4.17.

A printk inserted at the idle_timeout check in the code:

    printk("t: %llu %llu (%llu) > (%llu)\n", t1, t0, ((t1 - t0) >> 20),
           (unsigned long long)(mhz * idle_timeout));

Output:

    t: 649639796377 649639801243 (17592186044415) > (5250)

As you can see in the printout, t1 < t0, which gives a negative difference. The huge number 17592186044415 is 0xfffffffffff, which is that negative difference, interpreted as an unsigned 64-bit value, shifted 20 bits right. I cannot imagine how t1 < t0 can happen; it implies that time goes backward, which seems impossible.

As a temporary workaround I added a check to the code:

    if (t0 && t1 > t0) { ...

It fixes the problem, and the idle_timeout still seems to work (I killed the server during the transfer about 10 times). The workaround is not reliable since I don't know the cause of the problem; I just handle the symptom.
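To make the arithmetic in the report concrete, here is a minimal sketch that reproduces the printed value from the two timestamps. The helper names are hypothetical and are not taken from netconsole.c; the sketch only assumes that t0 and t1 are unsigned 64-bit values, as the %llu format suggests.

```c
#include <stdint.h>

/* Hypothetical helper: the elapsed-time expression from the printk,
 * evaluated in unsigned 64-bit arithmetic. If t1 < t0 the subtraction
 * wraps around modulo 2^64 instead of going negative. */
static uint64_t elapsed_shifted(uint64_t t0, uint64_t t1)
{
    return (t1 - t0) >> 20;
}

/* Same quantity with the reporter's workaround applied: a t1 that is
 * not later than t0 is treated as "no time elapsed". */
static uint64_t elapsed_shifted_guarded(uint64_t t0, uint64_t t1)
{
    return (t1 > t0) ? ((t1 - t0) >> 20) : 0;
}
```

With t0 = 649639801243 and t1 = 649639796377 the true difference is -4866. Wrapped to 64 bits and shifted right by 20, that leaves 44 one-bits, which is 17592186044415 (0xfffffffffff) and is far above the 5250 threshold.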
humm we haven't shipped a 2.4.17 kernel with netdump/netconsole..... 2.4.17 isn't exactly a good kernel either ;(
No, we modified the patch to be able to use 2.4.17. I forgot that this was not the original kernel, but I suspected that the kernel version could have something to do with this so I included it. If you think this is a kernel fault, please set this bug report to NOTABUG and I apologize for the unnecessary bother. I think the rdtscll() translates more or less directly to an assembly instruction (rdtsc), so I still can't understand how it can run backwards regardless of kernel version. Best Regards, Lars Ekman
well it depends which version of netdump you took ;) also rdtsc can't go backwards, but can be out of sync between cpus
The version is netdump-0.6.7.

We found the bug! I had a discussion with our HW guru, and he suggested that the global t0 variable was updated between the read of t1 and the comparison. The simple solution is to store t0 in a local variable before reading t1.

From netconsole.c (updated):

    t0temp = t0;
    rdtscll(t1);
    if (idle_timeout) {
        if (t0temp) {
            if (((t1 - t0temp) >> 20) >
                (unsigned long long)(mhz * idle_timeout)) {
                ...

With this fix it seems to work fine. I have tried it 5 times successfully (before, it never worked).
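The snapshot-before-sample pattern described in this comment can be sketched in isolation as below. This is not the netconsole.c code: read_counter(), fake_counter, idle_expired() and check_idle() are hypothetical stand-ins (read_counter() plays the role of rdtscll()), and the extra t1 < t0 guard is an addition that the fix in this comment does not contain.

```c
#include <stdint.h>

/* Shared timestamp, assumed to be updated from another context
 * (for example when a packet from the server arrives). */
static volatile uint64_t t0;

/* Hypothetical stand-in for the cycle counter read by rdtscll(). */
static uint64_t fake_counter;
static uint64_t read_counter(void) { return fake_counter; }

/* Pure check on values that can no longer change underneath us. */
static int idle_expired(uint64_t t0_snapshot, uint64_t t1,
                        uint64_t mhz, uint64_t idle_timeout)
{
    if (!idle_timeout || !t0_snapshot)
        return 0;               /* timeout disabled or timer not started */
    if (t1 < t0_snapshot)
        return 0;               /* extra guard, not part of the original fix */
    return ((t1 - t0_snapshot) >> 20) > mhz * idle_timeout;
}

static int check_idle(uint64_t mhz, uint64_t idle_timeout)
{
    uint64_t t0temp = t0;           /* 1. snapshot the shared value first */
    uint64_t t1 = read_counter();   /* 2. only then sample the counter    */
    return idle_expired(t0temp, t1, mhz, idle_timeout);
}
```

Because t0temp is copied before the counter is sampled, a later update of t0 cannot make the stored start time newer than t1 on the same CPU. The guard only matters if the two readings come from counters that are out of sync.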
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.