Bug 502642 - /proc/interrupts and lspci report different values for IRQ when MSI/-X is enabled on device
/proc/interrupts and lspci report different values for IRQ when MSI/-X is ena...
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
5.5
All Linux
low Severity medium
: rc
: ---
Assigned To: Prarit Bhargava
Red Hat Kernel QE team
:
Depends On:
Blocks: 533192
  Show dependency treegraph
 
Reported: 2009-05-26 13:07 EDT by Prarit Bhargava
Modified: 2011-08-08 13:35 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-08-08 13:35:39 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
RHEL5 initial patch for this issue (4.30 KB, patch)
2009-05-28 11:23 EDT, Prarit Bhargava
no flags Details | Diff
RHEL5 patch for this issue (16.02 KB, patch)
2009-06-04 11:26 EDT, Prarit Bhargava
no flags Details | Diff

  None (edit)
Description Prarit Bhargava 2009-05-26 13:07:40 EDT
Description of problem:

tcamuso noticed this while debugging another issue.

/proc/interrupts shows

 74:         30          0          0          0       6880          0         30          0       PCI-MSI-X  eth7
 82:        163          0          0          0          0          0      53102        304       PCI-MSI-X  eth7

lspci -vv shows

Interrupt: pin A routed to IRQ 169

Version-Release number of selected component (if applicable): 2.6.18-150.el5


How reproducible: 100%


Steps to Reproduce:
1. cat /proc/interrupts
2. lspci -vv 
3. compare output
  
Actual results: lspci displays legacy IRQ value.  /proc/interrupts displays actual IRQ value.


Expected results: IRQ values should be the same


Additional info: Good test system is hp-bl495cg5-02.rhts.  Also resolving this will require TWO bugzillas, one for the kernel and one for lspci/pciutils.
Comment 1 Prarit Bhargava 2009-05-26 13:09:39 EDT
For the purposes of this BZ please note that MSI = MSI || MSI-X.

P.
Comment 2 Prarit Bhargava 2009-05-26 13:16:13 EDT
jgarzik,

gospo and I were discussing this while looking at output from a Fedora system.

I think that this is something that should be fixed -- lspci should display the *correct* IRQ information.

I do, however, have a question.

If a device has MSI enabled is it necessary for lspci to report the legacy IRQ value for that device?

I can't think of any reason knowing the legacy value of the interrupt would matter when MSI was enabled.  I suppose if you did a -bv it might be important.

So what we were thinking about was adding and exporting a new data field in the pci_dev struct (msi_interrupts?) which would be non-zero if MSI was enabled.  The new field could be a list or it could simply be a bitmap.  I haven't thought about the details yet ;)

lspci could read the new field via sysfs.  If it were zero, lspci would report the legacy interrupt value.  If it were non-zero, lspci would report all values in the list as MSI interrupts.

Thoughts?

P.
Comment 3 Prarit Bhargava 2009-05-26 13:23:41 EDT
From private email between myself and tcamuso:

I was thinking of doing something like (pseudo-code warning):

(BTW MSI = msi || msi-x enabled  ;)  )

if (msi)
   output msi IRQ values
else
   output legacy values

So, in your case if MSI was enabled lspci output would contain lines like

   Interrupt: MSI routed to IRQ 74
   Interrupt: MSI routed to IRQ 82

If it wasn't enabled (ie, booted with pci=nomsi)

   Interrupt: pin D routed to IRQ 185

Which, IMO, isn't too bad.  I'm, of course, open to other suggestions.

I could certainly do

   Interrupt: MSI routed to IRQs 74, 82

to make it a bit more readable, but I prefer putting MSI in front of the output to avoid breaking userspace tools that might grep for the Interrupt message.

P. 

(By no means am I saying the above is the correct thing to do -- just copying it here so I don't have to dig it up later)
Comment 4 Tony Camuso 2009-05-26 13:34:41 EDT
From mail I just sent to Prarit:
:)

Prarit Bhargava wrote:

>
> (BTW MSI = msi || msi-x enabled ;) )
>
> if (msi)
>    output msi IRQ values
> else
>    output legacy values
>
> So, in your case if MSI was enabled lspci output would contain lines like
>
>    Interrupt: MSI routed to IRQ 74
>    Interrupt: MSI routed to IRQ 82
>

Yes, I like this. You might want to consider mapping the Vector Table Entry
to the IRQ. This could be important for somebody troubleshooting the driver.
Something like:

    Interrupt: MSI Vector Table[0] routed to IRQ 74
    Interrupt: MSI Vector Table[1] routed to IRQ 82

IIRC, the kernel does not allow for one vector to be assigned to more than
one Vector Table Entry. Either all entries get a unique vector, or they all
get the same one.

In the case of legacy MSI, which does not have a Vector Table, there will be
only one IRQ assigned, anyway, so you can display

    Interrupt: MSI roued to IRQ 74

That will add a little more conditional code, so I don't know if you think
it would be worth it.

> If it wasn't enabled (ie, booted with pci=nomsi)
>
>    Interrupt: pin D routed to IRQ 185

This would be fine, of course.

>
> Which, IMO, isn't too bad.  I'm, of course, open to other suggestions.
>
> I could certainly do
>
>    Interrupt: MSI routed to IRQs 74, 82
>
> to make it a bit more readable, but I prefer putting MSI in front of the output to avoid breaking userspace tools that might grep for the Interrupt message.
>
I think your first suggestion is more readable, since you don't have to
go digging through a list. And your concern about grepping for the MSIs
is a very valid one.
Comment 5 Andy Gospodarek 2009-05-26 14:00:01 EDT
(In reply to comment #4)
> 
>     Interrupt: MSI Vector Table[0] routed to IRQ 74
>     Interrupt: MSI Vector Table[1] routed to IRQ 82
> 
> IIRC, the kernel does not allow for one vector to be assigned to more than
> one Vector Table Entry. Either all entries get a unique vector, or they all
> get the same one.
> 

There is a chance that multiple msi-x vectors could be assigned to single IRQs.  We had this exact problem with ixgbe hardware on some systems and code had to be added to the msi-x handler to make sure it checked more than just the first vector it found as popped when an interrupt occurred.  It would be fine with me if the output was just as you had it above and in the case of overlap it would be like this:

     Interrupt: MSI Vector Table[0] routed to IRQ 74
     Interrupt: MSI Vector Table[1] routed to IRQ 82
     Interrupt: MSI Vector Table[2] routed to IRQ 74
     Interrupt: MSI Vector Table[3] routed to IRQ 84
Comment 6 Tony Camuso 2009-05-26 14:10:25 EDT
Of course, there is nothing wrong with one vector being assigned to more than one VTE, I just thought I read somewhere that this wasn't supported. 

It is now.
:)

In any event, engineering triumphs and Andy's modification works just fine for me.
Comment 7 Tony Camuso 2009-05-26 19:04:52 EDT
I would like to comment on a comment I made earlier.The discussion is somewhat arcane, but bear with me.
:) 

> In the case of legacy MSI, which does not have a Vector Table, there will be
> only one IRQ assigned, anyway, so you can display
>
>    Interrupt: MSI roued to IRQ 74

Currently, the kernel does not assign more than one vector to a device that is capable only of legacy MSI. This is because legacy MSI interrupt vectors...

* Must be contiguous numbers
* Must be naturally aligned, such that

  Vectors  Base Vector Number bit mask
  -------  ---------------------------
      2        1
      4        3
      8        7
     16        F
     32       1F 

... max number of MSI vectors is only 32, while max number of MSI-X vectors is 2048 ...

This makes the interrupt event number map exactly to the corresponding low bits of the vector.  

I surmise the Linux community that was involved with this at the time did not think it was worth the extra work at the time. MSI-X came very soon after MSI and is the far better choice for an MSI implementation, so it was unlikely that anybody would be using legacy MSI with multiple interrupts. 

However, engineers don't often respond well to "no, you can't", so I would not be terribly surprised if an enterprising fellow went ahead and implemented multiple vectors for legacy MSI. 

All that being said, perhaps MSI as well as MSI-X should be checked for multiple vectors.

Or maybe I just need to take more of my cold medicine.
Comment 8 Prarit Bhargava 2009-05-28 10:30:55 EDT
See upstream discussion here:

http://marc.info/?l=linux-pci&m=122516331313328&w=2

I'm discussing with Matthew Wilcox to see where he was on his open/release idea for sysfs.

Due to the complexity of the code, I don't think I'll be able to easily backport this into RHEL5.  I may do something out-of-stream for RHEL5.

P.
Comment 9 Prarit Bhargava 2009-05-28 10:56:31 EDT
Adding harald and mhlavink (pciutils maintainers).

P.
Comment 10 Prarit Bhargava 2009-05-28 11:21:55 EDT
(In reply to comment #8)
>  I may do something out-of-stream for RHEL5.
> 

s/may/will/g

FYI.

P.
Comment 11 Prarit Bhargava 2009-05-28 11:23:13 EDT
Created attachment 345777 [details]
RHEL5 initial patch for this issue

Comments on the kernel-side patch?

P.
Comment 12 Jeff Garzik 2009-05-28 16:17:34 EDT
/proc/interrupts and lspci may show two different values because they are displaying different things.  SOMETIMES they are the same, but not always.

This is because the numbers obtained from the system irq vector should be considered a "cookie" value.  In the case of legacy ISA interrupts, the cookie is the same as the vector.  In the case of MSI/MSI-X, the assignment is a bit more arbitrary.

As such, the irq vector slot is not the same thing as the value being programmed into a PCI device's register.

Furthermore, the PCI device register may not reflect updates from quirks and fixups, so pdev->irq may differ from the PCI device register value in uncommon cases.

So:
1) /proc/interrupts displays a platform-specific, CPU-specific enumeration of the system's irq vectors.
2) lspci displays PCI device register values.

While sometimes the values will be the same, there is no guarantee.
Comment 13 Tony Camuso 2009-05-28 16:31:17 EDT
> While sometimes the values will be the same, there is no guarantee.  

It would be nice in the case of MSI/MSI-X, however, if lspci could show, not just the contents of the device irq field in the config header, which is meaningless in the MSI/MSI-X context, but also the vector(s) assigned to MSI/MSI-X.

Prarit's proposed patch fixes it for /proc/interrupts. 

Why not have a complementary fix in lspci -vv?
Comment 14 Prarit Bhargava 2009-05-28 17:26:33 EDT
(In reply to comment #12)
> /proc/interrupts and lspci may show two different values because they are
> displaying different things.  SOMETIMES they are the same, but not always.
> 
> This is because the numbers obtained from the system irq vector should be
> considered a "cookie" value.  In the case of legacy ISA interrupts, the cookie
> is the same as the vector.  In the case of MSI/MSI-X, the assignment is a bit
> more arbitrary.
> 
> As such, the irq vector slot is not the same thing as the value being
> programmed into a PCI device's register.

Right, we understand all of that.

> 
> Furthermore, the PCI device register may not reflect updates from quirks and
> fixups, so pdev->irq may differ from the PCI device register value in uncommon
> cases.
> 
> So:
> 1) /proc/interrupts displays a platform-specific, CPU-specific enumeration of
> the system's irq vectors.
> 2) lspci displays PCI device register values.
> 
> While sometimes the values will be the same, there is no guarantee.  

Yes, understood.  However, what I'm adding is not going to make dev->irq, or the interrupt reporting in lspci match that of MSI/-X.  I'm adding the ability to lspci to report the MSI/-X vectors.

P.
Comment 15 Jeff Garzik 2009-05-28 18:38:16 EDT
Ok :)  I wasn't sure which part of the bug I was supposed to comment on :)

The general comment about /proc/interrupts is that is gets parsed, and so we need to try to avoid breaking userland programs that depend on its format.  The patch in comment #11 seems to be a sysfs addition rather than a /proc/interrupts update, so that seems fine from that perspective.

With that out of the way, the only remaining concerns are ABI supportability moving forward.  This is visible to userspace... will it be a RHEL-only extension?  RHEL5-only?  It sounds like upstream is unknown at the moment, but there is a /possibility/ of divergence.

WRT lspci, I don't think there is any objection to getting lspci to output additional information.
Comment 16 Prarit Bhargava 2009-05-28 18:47:19 EDT
(In reply to comment #15)
> Ok :)  I wasn't sure which part of the bug I was supposed to comment on :)
> 
> The general comment about /proc/interrupts is that is gets parsed, and so we
> need to try to avoid breaking userland programs that depend on its format.  The
> patch in comment #11 seems to be a sysfs addition rather than a
> /proc/interrupts update, so that seems fine from that perspective.
> 
> With that out of the way, the only remaining concerns are ABI supportability
> moving forward.  This is visible to userspace... will it be a RHEL-only
> extension?  RHEL5-only?  It sounds like upstream is unknown at the moment, but
> there is a /possibility/ of divergence.
> 

Yeah -- I'm worried about that too.  Especially with RHEL6 around the corner.

The situation upstream is that the person who originally suggested the patch has moved on from the company he did it for.  Add to that Matthew Wilcox's suggestion of expanding sysfs, and upstream looks like it will end up being a *huge* piece of code to backport to RHEL5.

I will at least try to implement what was last suggested upstream in RHEL5.

> WRT lspci, I don't think there is any objection to getting lspci to output
> additional information.  

P.
Comment 17 Prarit Bhargava 2009-06-04 11:26:50 EDT
Created attachment 346552 [details]
RHEL5 patch for this issue

Closely matches upstream -- waiting for upstream's decision on implementation.

Note You need to log in before you can comment on or make changes to this bug.