Bug 186325 - upstream kernel patch for global_flush_tlb missing in Red Hat kernel
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Peter Martuccelli
Brian Brock
Duplicates: 170538
Reported: 2006-03-22 17:14 EST by Terence Ripperda
Modified: 2007-11-30 17:07 EST

See Also:
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-07 20:59:57 EDT


Description Terence Ripperda 2006-03-22 17:14:48 EST
Description of problem:
There are 2 upstream kernel patches: the first attempted an optimization to
global_flush_tlb, and the second subsequently removed that optimization. Red
Hat's kernels have picked up the first of these two changes, but not the second.

the attempted optimization caused global_flush_tlb to only flush the caches and
tlbs if pages sent to change_page_attr had been restored back to the 'cached'
state. if the pages were modified to the 'uncached' state, global_flush_tlb
would skip flushing the caches and tlbs (right when it's needed the most).

the net result is that stale cache data exists for pages marked uncached and
used for dma push buffers. this stale cache data can later be flushed,
corrupting push buffer data.

the file and changes in question are linux/arch/x86_64/mm/pageattr.c, revisions
1.16 and 1.17. the problem is that df_list only contains pages added to the list
by change_page_attr when the page is being reverted. pages being marked uncached
are never added to this list. so when global_flush_tlb is called, it early
exits, due to no pages being in the df_list and as a result, does not flush the
caches/tlb.

Version-Release number of selected component (if applicable):


How reproducible:
by their very nature, caching issues vary greatly in reproducibility, depending
on system and application. the customer who first reported this could reproduce
it very easily. we worked around this by adding an extra flush in our driver,
but we're still seeing problems when running stress tests for multiple days, so
we're re-investigating our workaround.

hopefully there's enough information above about the problem. if needed, I can
try to put together a specific reproduction for you.

Comment 1 Terence Ripperda 2006-04-10 12:25:49 EDT
I just wanted to check if there were any updates to this. I was hoping there
would be an updated kernel package that we could have some of our customers test
to verify if their bugs were related to this. thanks.
Comment 3 Gunther Mayer 2006-04-11 03:45:10 EDT
I am also waiting for a Red Hat comment/test kernel for this.
Please also see my IT#90639.

Comment 4 Pancrazio `ezio' de Mauro 2006-04-11 13:37:38 EDT
Terence, exactly which patches are you referring to?

From a look at upstream, it looks like you may be talking about
http://git.kernel.org/git/?p=linux/kernel/git/stable/linux-2.6.16.y.git;a=blobdiff;h=b90e8fe9eeb00509da9cbda82aa45034e38a64c2;hp=94862e1ec032d2616ca270071e70fb523e1aa150;hb=094804c5a132f04c12dd4902ee15c64362e5c1af;f=arch/x86_64/mm/pageattr.c

@@ -220,8 +220,6 @@ void global_flush_tlb(void)
         down_read(&init_mm.mmap_sem);
         df = xchg(&df_list, NULL);
         up_read(&init_mm.mmap_sem);
-        if (!df)
-                return;
         flush_map((df && !df->next) ? df->address : 0);
         for (; df; df = next_df) {
                 next_df = df->next;

which effectively reverts one of the changes we apply in our
linux-2.6.9-x86_64-change_page_attr-flush-fix.patch

Could you confirm which patches you are referring to?
Comment 5 Terence Ripperda 2006-04-11 13:42:12 EDT
yes, that's the patch.

that early return will return if the df_list (deferred page list) is empty. but
if you look through the rest of the file, this list is only populated in
save_page, which is only called from __change_page_attr when a page is being
reverted back to cached.

the flip side of this is that no pages are added to the df_list when a
page is being converted to uncached, so the TLB/cache flush is skipped in this
case. that leaves stale data cached for a page that is now expected to be
uncached. this cached data may be flushed out to system memory at a later point
in time.

it's subtle, but leads to a lot of stability problems in graphics intensive
environments (especially stress tests).
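
to make the sequence above concrete, here is a minimal user-space sketch of the
logic, not the kernel code itself. the names model_change_page_attr,
model_global_flush_tlb, flushes_after_uncached and flush_count are hypothetical
and exist only for illustration, and the locking and the real flush_map call
are left out. it only models which pages get queued on df_list and whether the
flush runs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a node on the kernel's deferred page list. */
struct deferred_page {
    struct deferred_page *next;
    unsigned long address;
};

static struct deferred_page *df_list;  /* models df_list in pageattr.c */
static int flush_count;                /* counts simulated cache/TLB flushes */

/* Models change_page_attr(): only a revert to cached queues the page. */
static void model_change_page_attr(struct deferred_page *pg, bool make_cached)
{
    if (make_cached) {
        pg->next = df_list;
        df_list = pg;
    }
    /* Marking a page uncached queues nothing, as described above. */
}

/* Models global_flush_tlb(); early_return selects the problematic variant. */
static void model_global_flush_tlb(bool early_return)
{
    struct deferred_page *df = df_list;

    df_list = NULL;
    if (early_return && !df)
        return;        /* the two lines removed by the upstream patch */
    flush_count++;     /* stands in for flush_map() */
}

/* Returns how many flushes happen after marking one page uncached. */
static int flushes_after_uncached(bool early_return)
{
    struct deferred_page pg = { NULL, 0x1000 };

    df_list = NULL;
    flush_count = 0;
    model_change_page_attr(&pg, false);   /* cached -> uncached */
    model_global_flush_tlb(early_return);
    return flush_count;
}
```

in this model, flushes_after_uncached(true) reports no flush at all for the
uncached transition, while flushes_after_uncached(false) reports one, which is
the behavior the second upstream patch restores.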
Comment 15 Jeff Burke 2006-06-02 10:31:27 EDT
Which stress tests were used in duplicating the issue?
I would like to add them to our internal tests.

Thanks,
Jeff
Comment 16 Terence Ripperda 2006-06-20 18:46:08 EDT
Hi Jeff,

unfortunately, this was reproduced using our binary driver running a stress test
suite. we're working on a directed test that should reproduce this problem,
which we can then give you full source to. I hope to have that done within 1-2
weeks.

Thanks,
Terence
Comment 17 Ernie Petrides 2006-08-11 16:13:13 EDT
*** Bug 170538 has been marked as a duplicate of this bug. ***
Comment 18 Jason Baron 2006-08-21 16:45:24 EDT
committed in stream U5 build 42.2. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 19 RHEL Product and Program Management 2006-09-07 15:25:30 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 24 Red Hat Bugzilla 2007-03-18 18:38:20 EDT
User jparadis@redhat.com's account has been closed
Comment 26 Red Hat Bugzilla 2007-05-07 20:59:58 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html
