Bug 501026 - 'serial8250: too much work for irq4' message when viewing serial console on SMP full-virtualized xen domU
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.3
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: 5.6
Assigned To: Michal Novotny
QA Contact: Virtualization Bugs
Duplicates: 498033
Blocks: 557597 514499
Reported: 2009-05-15 10:55 EDT by Casey Dahlin
Modified: 2014-06-18 04:46 EDT
Fixed In Version: xen-3.0.3-118.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 17:17:05 EST


Attachments
patch that didn't pass testing (456 bytes, patch)
2009-11-23 08:46 EST, Paolo Bonzini
patch that didn't pass testing (1.49 KB, patch)
2009-11-23 08:47 EST, Paolo Bonzini
patch that didn't pass testing (2.60 KB, patch)
2009-11-23 08:49 EST, Paolo Bonzini
Implement rate limiting to qemu-dm's serial port implementation (10.67 KB, patch)
2010-09-09 13:17 EDT, Michal Novotny
Implement rate limiting to qemu-dm's serial port implementation using variable rate checks value (9.37 KB, patch)
2010-09-13 06:36 EDT, Michal Novotny

Description Casey Dahlin 2009-05-15 10:55:48 EDT
A customer is getting this message when viewing the serial console of a fully virtualized domU created in Xen with multiple CPUs. It appears to be the same as this issue:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/2133.html

A patch for it is attached here:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/3367.html
Comment 2 Chris Lalancette 2009-06-25 09:20:23 EDT
This patch also addresses the same area:

http://marc.info/?l=linux-serial&m=121863945506976&w=2

Chris Lalancette
Comment 3 Casey Dahlin 2009-09-04 10:25:26 EDT
It appears the patch isn't fixing the customer's issue.
Comment 4 Andrew Jones 2009-09-21 11:06:31 EDT
I'll see if I can reproduce this on my test machine here, so we can experiment with these patches locally and dig deeper if they fail to plug the hole.
Comment 5 Paolo Bonzini 2009-10-13 08:25:19 EDT
Was the patch in http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/3367.html ever tested by the customer?  It is not even in upstream QEMU.

I can build a test package for it if needed.
Comment 7 Paolo Bonzini 2009-11-11 07:58:35 EST
This bug report has two patches attached.  Apparently one of them was built and tested (with negative result), and the other one was not.  The problem is, it is not clear which one was tested.

I requested additional information from cdahlin about which patch was tested, but I have had no response so far.  I can prepare either a patched xen package with the QEMU patch or a patched kernel, but it would be nice to know which one hasn't been tried yet.
Comment 8 Casey Dahlin 2009-11-11 09:23:43 EST
The patch posted by clalance in comment #2 is the one that was tested, but looking back, it appears I only got the second of the two patches in the email.

I don't think I could get the patch I originally posted to apply, which is why I escalated.
Comment 10 Paolo Bonzini 2009-11-22 11:18:01 EST
PING, time is looming for 5.5
Comment 11 Issue Tracker 2009-11-22 14:17:32 EST
Event posted on 2009-11-16 02:43:39 EST by tsato@redhat.com

Hi Paolo,

NEC tested the test packages (upgraded xen and xen-libs and rebooted) on RHEL5.4 (x86_64), but the messages are still printed out.

# rpm -q xen xen-libs
 xen-3.0.3-96.el5.gc6adf02
 xen-libs-3.0.3-96.el5.gc6adf02
 xen-libs-3.0.3-96.el5.gc6adf02


--- dmesg
------------------------------------------------------------------
mtrr: type mismatch for f0000000,100000 old: uncachable new: write-combining
mtrr: type mismatch for f0000000,400000 old: uncachable new: write-combining
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
----------------------------------------------------------------------------


Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by cdahlin 
 issue 296313
Comment 13 Paolo Bonzini 2009-11-23 08:46:15 EST
Created attachment 373114
patch that didn't pass testing

Thanks. I'm attaching the backported patch for future reference.
Comment 14 Paolo Bonzini 2009-11-23 08:47:40 EST
Created attachment 373116
patch that didn't pass testing
Comment 15 Paolo Bonzini 2009-11-23 08:49:28 EST
Created attachment 373117
patch that didn't pass testing

sorry for the repeated mistake
Comment 16 Chris Lalancette 2009-11-24 17:17:55 EST
*** Bug 498033 has been marked as a duplicate of this bug. ***
Comment 17 Markus Armbruster 2009-11-25 02:38:02 EST
Note: bug 498033 (just marked as a duplicate) has some useful background information.
Comment 19 Andrew Jones 2009-12-15 07:13:17 EST
I've seen this message on https://inventory.engineering.redhat.com/view/amd-dinar-03.lab.bos.redhat.com when installing a kernel RPM with 'rpm -ivh'. I was on a -164 HVM guest and installing from an NFS mount.
Comment 25 Andrew Jones 2010-04-12 04:00:02 EDT
This also occurs on F13 FV guests.
Comment 26 Miroslav Rezanina 2010-04-14 08:36:46 EDT
I experimented a little with slowing down the serial port rate, based on the patch attached by Paolo. Even when I slowed the rate too much (causing problems in the guest), I was still able to see the error message.
Comment 27 Paolo Bonzini 2010-05-25 11:31:46 EDT
This is arguably a guest kernel bug, since there is actually no harm if "so much work" is given to irq4.  I wonder if we shouldn't close this as WONTFIX.
Comment 30 Markus Armbruster 2010-05-26 04:14:01 EDT
I wouldn't want to call this a kernel bug.

A real UART has pretty well-defined timing behavior: if you program it to a certain bit rate, you can rely on its FIFO not emptying faster than that.
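To put rough numbers on that timing guarantee, here is a small worked example. It is only an illustration and assumes the common 8N1 framing (one start bit, eight data bits, no parity, one stop bit); the function names are made up for this sketch.

```python
def bits_per_character(data_bits: int = 8, parity_bits: int = 0, stop_bits: int = 1) -> int:
    """Bits on the wire per character: one start bit plus data, parity and stop bits."""
    return 1 + data_bits + parity_bits + stop_bits


def max_chars_per_second(baud_rate: int, frame_bits: int = 10) -> float:
    """Upper bound on characters per second a real UART can move at this rate."""
    return baud_rate / frame_bits


def char_time_us(baud_rate: int, frame_bits: int = 10) -> float:
    """Minimum time, in microseconds, between two consecutive characters."""
    return frame_bits * 1_000_000 / baud_rate


if __name__ == "__main__":
    frame = bits_per_character()  # 8N1 framing: 10 bits per character
    for baud in (9600, 57600, 115200):
        print(f"{baud:>6} baud: {max_chars_per_second(baud, frame):8.0f} chars/s, "
              f"{char_time_us(baud, frame):7.1f} us per char")
```

With 8N1 framing, a real UART at 115200 can move at most 11520 characters per second, so a driver never has to handle characters arriving faster than roughly one every 87 microseconds. An emulated UART with no timing model has no such bound.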

QEMU's UART emulation does not emulate proper timing at all.  Continuous serial I/O can easily overwhelm the guest.  Linux detects this "can't happen" condition, and takes proper action to protect itself.  "Can't happen" conditions are usually a sign of a bug, so Linux reports it.
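The "can't happen" check can be pictured with a toy model. This is not the kernel code, just a simplified sketch: it assumes the 8250 interrupt handler keeps looping while the device reports pending work and gives up after a fixed number of passes (the value 256 for `PASS_LIMIT` is my recollection of the driver of that era and should be treated as an assumption). `handle_irq` and `finite_burst` are hypothetical names.

```python
PASS_LIMIT = 256  # assumed value; the real driver's limit may differ


def handle_irq(device_has_work, pass_limit: int = PASS_LIMIT) -> bool:
    """Service the device until it is idle.

    Returns True if the device went idle, False if the handler bailed out
    after too many passes (the "too much work" case).
    """
    passes = 0
    while device_has_work():
        passes += 1
        if passes > pass_limit:
            return False  # give up so the rest of the system can make progress
    return True


def finite_burst(events: int):
    """Device model that has a bounded amount of work, e.g. a FIFO being drained."""
    remaining = events

    def has_work() -> bool:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return True
        return False

    return has_work


if __name__ == "__main__":
    # A timing-bounded device runs out of work quickly.
    print("bounded device idle:", handle_irq(finite_burst(16)))
    # A device with no timing model can always claim more work.
    print("unbounded device idle:", handle_irq(lambda: True))
```

In this model a device whose work per interrupt is bounded always lets the handler finish, while a device that can always produce more work trips the limit.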

Yes, we can "fix" it in the Linux kernel.  It's really a work-around for broken hardware, where the hardware happens to be virtual.

We normally probe the hardware for flaws before we enable work-arounds.  How to do that?

What about older and non-Linux guests?  They're prone to stumble over a misbehaving UART, too.
Comment 33 Miroslav Rezanina 2010-06-11 01:32:19 EDT
Testing shows that the qemu serial driver is not able to measure the rate of incoming data properly: the proposed patch's counter only ever takes the values 0 and 1, so no limiting is ever done.

I also do not see how to determine the proper limit: the kernel is sending data too fast (the problem occurs when the guest writes to the serial port, not when it reads).
Comment 35 Michal Novotny 2010-09-09 09:22:18 EDT
Well, the issue is not about reading or writing as such; it is caused by the missing implementation of proper limiting based on the configured baud rate. I'm currently working on implementing that kind of limiting to honor the specification in this respect.

The data arrive as fast as the guest sends them, and the guest then tries to process all the resulting interrupts. From what I know, Windows guests do not have issues with this and cope just fine, unlike the 8250 serial driver in the Linux kernel, which complains with the message mentioned in this bug's summary.

What we need is basically the implementation of:

 (read_data_burst + write_data_burst * 8) <= baudrate

which means that the data transfer rate must stay at or below the baud rate for things to work properly. The multiplication by 8 is necessary because the baud rate is in bits per second (bps) rather than bytes per second, while the burst variables are in bytes, as measured (computed) from the data passing through the ioport_{read|write} functions.
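As a minimal sketch of that condition (not taken from the attached patch), the check might look like the following. It assumes both burst counters are in bytes, so the factor of 8 is applied to their sum, and like the formula above it ignores start and stop bits. The function name and the `interval_seconds` parameter are hypothetical.

```python
BITS_PER_BYTE = 8


def within_baud_budget(read_bytes: int, written_bytes: int,
                       baud_rate: int, interval_seconds: float = 1.0) -> bool:
    """Return True if the bytes moved in the interval fit within the baud rate.

    Both counters are in bytes, so they are converted to bits before being
    compared with the baud rate, which is in bits per second.
    """
    bits_moved = (read_bytes + written_bytes) * BITS_PER_BYTE
    return bits_moved <= baud_rate * interval_seconds


if __name__ == "__main__":
    print(within_baud_budget(7200, 7200, 115200))  # exactly at the limit
    print(within_baud_budget(7200, 7201, 115200))  # one byte over the limit
```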

The implementation is a little tricky, since we need a timer function that periodically checks whether the transfer rate has exceeded the baud rate (which is basically the reason for the "serial8250: too much work for irq4" message, since the data are being transmitted at a much higher rate than allowed).
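One way to picture the timer-based approach is a per-window bit budget that a periodic callback resets. This is a hedged sketch only and does not reproduce the qemu-dm patch: the class name, the `checks_per_second` parameter and the choice to refuse a transfer once the budget is spent are all assumptions made for illustration. The caller is expected to invoke `on_timer` from whatever periodic timer it has.

```python
class SerialRateLimiter:
    """Toy rate limiter: a fixed bit budget per timer window."""

    def __init__(self, baud_rate: int, checks_per_second: int = 10) -> None:
        # Bits allowed between two timer checks.
        self.budget_bits = baud_rate / checks_per_second
        self.bits_used = 0

    def on_timer(self) -> None:
        """Periodic check: a new window starts, so the budget is refilled."""
        self.bits_used = 0

    def try_transfer(self, nbytes: int) -> bool:
        """Account for nbytes; return False if they must wait for the next window."""
        bits = nbytes * 8
        if self.bits_used + bits > self.budget_bits:
            return False
        self.bits_used += bits
        return True


if __name__ == "__main__":
    limiter = SerialRateLimiter(baud_rate=9600, checks_per_second=10)
    accepted = sum(limiter.try_transfer(1) for _ in range(200))
    print("bytes accepted in one window:", accepted)
    limiter.on_timer()
    print("accepted after timer tick:", limiter.try_transfer(1))
```

At 9600 baud with ten checks per second, each window allows 960 bits, so only 120 of the 200 single-byte transfers go through before the next timer tick.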

Michal
Comment 36 Michal Novotny 2010-09-09 13:17:13 EDT
Created attachment 446305
Implement rate limiting to qemu-dm's serial port implementation

Hi,
this is my patch for this bug; it implements the rate limiting. It has been tested on a RHEL 5.5 x86_64 dom0 using two test cases:
1) installing the RPM inside the Linux HVM guest
2) doing copy & paste of 18 kB text into the HVM guest's text editor (vim)

Both cases worked fine, with no annoying "too much work for irq4" messages. Could you please ask the customer to retest?

Thanks,
Michal
Comment 37 Michal Novotny 2010-09-09 13:23:40 EDT
Hi Masaki,
I've built packages with those patches and they're on my PRC site at:

http://people.redhat.com/minovotn/xen/

Could you please pass this URL to the customer for testing?

Thanks,
Michal
Comment 38 Michal Novotny 2010-09-13 06:36:27 EDT
Created attachment 446884
Implement rate limiting to qemu-dm's serial port implementation using variable rate checks value

This is a slightly modified version of my serial port patch that uses a variable number of rate checks per second, i.e. a different approach from the one the previous version used. The code was rewritten after investigating the kernel code and testing with various baud rates, since the previous version did not work correctly at a baud rate of e.g. 57600 bps (i.e. it worked only with some guest settings).

The RPMs at http://people.redhat.com/minovotn/xen have been updated to use this version of the patch (and are now suffixed "serial").

Michal
Comment 44 Miroslav Rezanina 2010-11-11 04:03:03 EST
Fix built into xen-3.0.3-118.el5
Comment 46 Jinxin Zheng 2010-11-17 02:23:52 EST
I can reproduce this bug on xen-3.0.3-117.el5 by these steps:

1. Append the console parameter to the HVM guest's kernel command line in grub.conf:

console=ttyS0,115200n8

(this is done with guestfish).

2. Create the domain and attach to its console, assigning more than one vCPU:

$ xm create -c hvm1.cfg serial=pty vcpus=2

3. Once the guest has booted and you have logged in on its console, run the 'yes' command:

$ yes

This produces a lot of "serial8250: too much work for irq4" messages mixed into the output of the 'yes' command.

I also tried ifconfig in this step, as in comment 39, with the same result.


After upgrading to xen-3.0.3-118.el5, the reproducer above no longer causes this issue, so I'm putting this bug into VERIFIED. Thanks.
Comment 47 Jinxin Zheng 2010-11-17 02:35:06 EST
Additionally, I've tested with console speed = 9600. The results are the same as with 115200.
Comment 49 errata-xmlrpc 2011-01-13 17:17:05 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html
Comment 50 unicell 2011-12-04 10:31:15 EST
With advisory RHBA-2011:0031-1 (link in Comment 49), I can still reproduce this bug. 

And I just found out it can be fixed by backporting 16550A support from upstream QEMU or higher xen version (3.3)
Comment 51 Paolo Bonzini 2011-12-05 05:00:47 EST
> And I just found out it can be fixed by backporting 16550A support from
> upstream QEMU or higher xen version (3.3)

That's too intrusive right now, unfortunately.

The bug is much less frequent (and more importantly, it is bearable) with RHBA-2011:0031-1.
