Red Hat Bugzilla – Bug 156608
[RHEL3 U4] The system clock gains much time when netconsole is activated.
Last modified: 2010-10-21 22:57:33 EDT
Escalated to Bugzilla from IssueTracker
Adding Jeff Moyer to the cc: list in hopes he can help explain this one.

Jeff, zap_completion_queue() is called in both the netconsole and netdump code paths AFAICT. This code, then, seemingly could wreak havoc, i.e., the jiffies bump below (and what happens if they set idle_timeout?):

    if (idle_timeout) {
        if (t0) {
            if (((t1 - t0) >> 20) > mhz_cycles * (unsigned long long)idle_timeout) {
                t0 = t1;
                printk("netdump idle timeout - rebooting in 3 seconds.\n");
                mdelay(3000);
                machine_restart(NULL);
            }
        }
    }

    /* maintain jiffies in a polling fashion, based on rdtsc. */
    {
        static unsigned long long prev_tick;

        if (t1 - prev_tick >= jiffy_cycles) {
            prev_tick += jiffy_cycles;
            jiffies++;
        }
    }

Since the code has always been like this, what am I missing?
setting back to kernel...
Jeff, looks like those two if statements need a netdump_mode check?
The user-land mhz argument sent to the netconsole module is basically ignored, unless, during module load, upon reading the tsc two successive times with an mdelay() in between, it happens to have done so when the tsc wrapped around:

    platform_timestamp(t0);
    mdelay(1);
    platform_timestamp(t1);

In other words, if t1 > 0, mhz is completely ignored. So let's put that issue out of the picture.

The question is whether netconsole should be doing anything at all with jiffies during runtime. Doing an alt-sysrq-t operation with thousands of processes, or simply repeated keyboard-generated attempts (instead of echoing to /proc/sysrq-trigger), is essentially one huge interrupt handler. I don't know what the author's intent was: to "help" jiffies along at all times, or only during a netdump operation? What would happen, say, if a 9600-baud serial console were hooked up, without netconsole registered, where a single alt-sysrq-t on a system with thousands of processes could consume several minutes?

It should also be kept in mind that alt-sysrq-t is a debug strategy, not something that should be done in the normal course of events. Furthermore, using /proc/sysrq-trigger does the operation in process context, so clock interrupts wouldn't be blocked.
Ok, the fix will be to make this simple change to zap_completion_queue():

     /* maintain jiffies in a polling fashion, based on rdtsc. */
    -{
    +if (netdump_mode) {
         static unsigned long long prev_tick;

         if (t1 - prev_tick >= jiffy_cycles) {
             prev_tick += jiffy_cycles;
             jiffies++;
         }
     }

Note that there is no way the idle_timeout check above it can cause a problem, because t0 can never be set until a netdump operation is set in motion. netconsole.c has no business mucking around with jiffies during runtime.
Should be -- the patch will be posted in conjunction with BZ #157439, which I'm waiting for IBM to test.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.7.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html