Bug 451593

Summary: Multiple outstanding ptc.g instruction support
Product: Red Hat Enterprise Linux 5 Reporter: Luming Yu <luyu>
Component: kernelAssignee: Luming Yu <luyu>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 5.3CC: bjorn.helgaas, dchapman, dzickus, fenghua.yu, glen.foster, jane.lv, peterm
Target Milestone: ---   
Target Release: ---   
Hardware: ia64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-20 19:58:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 454962    
Attachments:
Description Flags
a back port none

Description Luming Yu 2008-06-16 02:31:33 UTC
Description of problem:
According to SDM2.2, Itanium supports multiple outstanding ptc.g instructions.
But current kernel function ia64_global_tlb_purge() uses a spinlock to serialize
ptc.g instructions issued by multiple processors. This serialization might have
scalability issue on a big SMP machine where many processors could purge TLB
in parallel.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Luming Yu 2008-06-18 14:24:58 UTC
Created attachment 309737 [details]
a back port

Comment 2 Ronald Pacheco 2008-07-08 17:21:39 UTC
Do we know if this has passed muster in RH kernel?  If so, can we get a devel_ack?

Comment 4 RHEL Program Management 2008-07-10 18:21:58 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Glen A. Foster 2008-08-20 16:24:08 UTC
HP is also interested in this fix/patch since we have 256-way servers, too.

Comment 7 Doug Chapman 2008-08-21 20:52:30 UTC
Luming,

The patch attached no longer applies to the recent RHEL5 kernels.  Can you update it with something that will apply to -103 or -104 so we can test this?  Also, if you have a kernel rpm already built can you please point me to it?

thanks,

- Doug

Comment 8 Luming Yu 2008-08-25 08:00:14 UTC
To comment#7,

The brew with the patch applies to -105 :
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1436575

Thanks,
Luming

Comment 9 Doug Chapman 2008-09-09 15:39:14 UTC
It sounds like testing this patch only makes sens on Tukwilla systems.  I don't have access to any tukwilla systems yet so I can't test it.

Comment 10 Don Zickus 2008-09-15 14:17:47 UTC
in kernel-2.6.18-115.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 13 Ryan Lerch 2008-10-24 01:08:23 UTC
This bug has been marked for inclusion in the Red Hat Enterprise Linux 5.3
Release Notes.

To aid in the development of relevant and accurate release notes, please fill
out the "Release Notes" field above with the following 4 pieces of information:

Cause:   What actions or circumstances cause this bug to present.
Consequence:  What happens when the bug presents.
Fix:   What was done to fix the bug.
Result:  What now happens when the actions or circumstances above occur. (NB:
this is not the same as 'the bug doesn't present anymore')

Comment 15 Bjorn Helgaas 2008-12-02 23:59:57 UTC
I think there are three patches in RHEL5.3s4 related to this issue, and I don't understand this one:
  linux-2.6-ia64-set-default-max_purges-1-regardless-of-pal-return.patch

This is not upstream.  It doesn't follow the coding style (missing spaces in "if").  It adds a bunch of boot-time noise (an extra line for every processor).  The extra output seems to be two run-on sentences and contains a typo ("outsanding" instead of "outstanding"):
  cpu_init: PAL max_purges is overridden to 1 PALO is required for multiple outsanding ptc.g 

The number of outstanding TLB purges is a system-wide parameter, not a per-processor parameter, so if we print anything, we should print it exactly once, not once per processor.

When there's no PALO, I don't see why we should ignore the max_purges value from PAL_VM_SUMMARY.

Comment 16 Luming Yu 2008-12-03 02:55:25 UTC
>linux-2.6-ia64-set-default-max_purges-1-regardless-of-pal-return.patch

This patch is to address the concern of regression that could be introduced by the "Multiple outstanding ptc.g" patch. 

 "Comparing with old kernel , the patch would not cause any behavior changes on old platforms as to ptc.g instruction support.
Because, by default, the patch uses field max_purges in vm_info_2 from PAL : PAL_VM_SUMMARY.(please check kernel code cpu_init and Figure 11-74. Layout  of vm_info_2 Return value in IA64 Software Developer's Manual vol. 2). On any platforms without multiple outstanding ptc.g instruction support , PAL_VM_SUMMARY should return 1 for max_purges .
The only chance that the patch could introduce regression on old platforms is that  PAL_VM_SUMMARY doesn't return 1.  Frankly speaking, we couldn't call this case "regression". Because it is broken firmware problem that we can just workaround it "with nptcg=1" or ask for hardware support."

Comment 17 Bjorn Helgaas 2008-12-03 15:40:54 UTC
I understand you don't want to suddenly start doing multiple outstanding purges on platforms in the field where that hasn't been tested.  I think you should propose that for the upstream kernel, too, so this doesn't remain a RHEL-only quirk.

I don't think there's any point in even checking the max_purges returned from PAL_VM_SUMMARY.  You might just as well do this:

    if (ia64_pal_vm_summary(NULL, &vmi) == 0) {
        max_ctx = (1U << (vmi.pal_vm_info_2_s.rid_size - 3)) - 1;
        /*
         * Multiple outstanding purges haven't been validated on
         * already-shipping platforms, so it's too risky to enable
         * them based on PAL.  We will enable them if a PALO table
         * exists.
         */
        setup_ptcg_sem(1, NPTCG_FROM_PAL);
        ...

That's much easier to understand.

If you really want to print something, I think it'd be reasonable to print it *once*.  I'm strongly opposed to the current behavior, which adds a whole bunch of useless and confusing console output on large systems where nothing is wrong.

When a PALO exists, I think we also should print the max_purges in it.

Comment 18 Luming Yu 2008-12-04 03:13:21 UTC
I agree that upstream needs similar patch.

Comment 19 Luming Yu 2008-12-04 03:18:13 UTC
the good point of the current patch is that we don't see any error messages if "vmi.pal_vm_info_2_s.max_purges == 1" that is what old platform should return from their PAL calls.

Comment 20 Bjorn Helgaas 2008-12-04 05:37:57 UTC
> the good point of the current patch is that we don't see any error
> messages if "vmi.pal_vm_info_2_s.max_purges == 1" that is what old
> platform should return from their PAL calls.

This is not true.  I see this error message on a shipping BL860c.
PAL_VM_SUMMARY returns max_purges == 0, which is legal per the spec.
There is nothing useful the user can do as a result of this message.

I think we need:
  1) A message when we parse a PALO table, saying how many outstanding
     purges we will support, and
  2) A message if there's no PALO but PAL_VM_SUMMARY returns
     max_purges > 1.  In this case, the platform claims to support
     multiple purges, but we avoid them because they haven't been tested,
     and we should tell the user we turned off this functionality.

Comment 21 Luming Yu 2008-12-04 06:07:11 UTC
You are right,I seem to miss this point : 0 is legal, and it means max_purges=1. (IA64 SPEC says so.)
Yes, we should avoid those message when PAL return 0 here. 
We probably also need to provide info to user in dmesg as you said above.
But I'm afraid we can do anything right now for 5.3.

Although there is defect in dmesg,  linux-2.6-ia64-set-default-max_purges-1-regardless-of-pal-return.patch still is good at preventing any possible functional regression..

Probably, we can polish dmesg in 5.4...

Comment 23 errata-xmlrpc 2009-01-20 19:58:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 24 Luming Yu 2009-02-01 05:15:37 UTC
*** Bug 474321 has been marked as a duplicate of this bug. ***