Description of problem:
Two upstream kernel patches are involved: the first attempted an optimization to global_flush_tlb, and the second removed it. Red Hat's kernels have picked up the first of these two changes, but not the second.

The attempted optimization caused global_flush_tlb to flush the caches and TLBs only if pages passed to change_page_attr had been restored to the 'cached' state. If the pages were changed to the 'uncached' state, global_flush_tlb would skip flushing the caches and TLBs (right when it's needed the most). The net result is that stale cache data exists for pages marked uncached and used for DMA push buffers. This stale cache data can later be flushed, corrupting push buffer data.

The file and changes in question are linux/arch/x86_64/mm/pageattr.c, revisions 1.16 and 1.17.

The problem is that df_list only contains pages added by change_page_attr when a page is being reverted. Pages being marked uncached are never added to this list. So when global_flush_tlb is called, it exits early because df_list is empty and, as a result, does not flush the caches/TLB.

Version-Release number of selected component (if applicable):

How reproducible:
By their very nature, caching issues vary greatly in reproducibility, depending on system and application. The customer who first reported this could reproduce it very easily. We worked around it by adding an extra flush in our driver, but we're still seeing problems when running stress tests for multiple days, so we're re-investigating our workaround.

Hopefully there's enough information above about the problem. If needed, I can try to put together a specific reproduction for you.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I just wanted to check whether there were any updates on this. I was hoping there would be an updated kernel package that some of our customers could test to verify whether their bugs are related to this. Thanks.
I am also waiting for a Red Hat comment/test kernel for this. Please also see my IT#90639.
Terence, exactly which patches are you referring to? From a look at upstream, it looks like you may be talking about

http://git.kernel.org/git/?p=linux/kernel/git/stable/linux-2.6.16.y.git;a=blobdiff;h=b90e8fe9eeb00509da9cbda82aa45034e38a64c2;hp=94862e1ec032d2616ca270071e70fb523e1aa150;hb=094804c5a132f04c12dd4902ee15c64362e5c1af;f=arch/x86_64/mm/pageattr.c

@@ -220,8 +220,6 @@ void global_flush_tlb(void)
 	down_read(&init_mm.mmap_sem);
 	df = xchg(&df_list, NULL);
 	up_read(&init_mm.mmap_sem);
-	if (!df)
-		return;
 	flush_map((df && !df->next) ? df->address : 0);
 	for (; df; df = next_df) {
 		next_df = df->next;

which effectively reverts one of the changes we apply in our linux-2.6.9-x86_64-change_page_attr-flush-fix.patch

Could you confirm which patches you are referring to?
Yes, that's the patch. That early return triggers whenever the df_list (deferred page list) is empty. But if you look through the rest of the file, this list is only populated in save_page, which is only called from __change_page_attr when a page is being reverted back to cached. The flip side is that no pages are added to the df_list when a page is being converted to uncached, so the TLB/cache flush is skipped in that case. That leaves stale data cached for a page that is now expected to be uncached. This cached data may be flushed out to system memory at a later point in time. It's subtle, but it leads to a lot of stability problems in graphics-intensive environments (especially stress tests).
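To make the control flow described above concrete, here is a minimal user-space sketch in C. It is not kernel code: the names `struct model`, `model_change_page_attr`, `model_global_flush_tlb`, the `make_uncached` flag and the `early_return` switch are all hypothetical and exist only for this illustration. It assumes, as described in this report, that only the revert-to-cached path queues a page on the deferred list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the deferred page list; not kernel code. */
struct deferred_page {
    struct deferred_page *next;
    unsigned long address;
};

struct model {
    struct deferred_page *df_list; /* pages queued by the revert path only */
    int flush_count;               /* number of times the "flush" actually ran */
};

/*
 * Models the behaviour described for __change_page_attr/save_page:
 * a page is queued only when it is reverted to cached. Marking a page
 * uncached queues nothing (assumption taken from this report).
 */
static void model_change_page_attr(struct model *m, struct deferred_page *pg,
                                   bool make_uncached)
{
    assert(m != NULL && pg != NULL);
    if (!make_uncached) {
        pg->next = m->df_list;
        m->df_list = pg;
    }
}

/*
 * Models global_flush_tlb. With early_return set, an empty list skips
 * the flush (the problematic behaviour); with it cleared, the flush
 * always runs (the behaviour after the upstream change).
 * Returns true if a flush was performed.
 */
static bool model_global_flush_tlb(struct model *m, bool early_return)
{
    struct deferred_page *df = m->df_list;

    m->df_list = NULL;      /* stands in for xchg(&df_list, NULL) */
    if (early_return && !df)
        return false;       /* flush skipped: caches and TLBs stay stale */
    m->flush_count++;       /* stands in for the cache/TLB flush */
    return true;
}
```

In this model, marking a page uncached and then calling `model_global_flush_tlb(&m, true)` returns false with `flush_count` unchanged, which corresponds to the skipped flush; the same call with `early_return` set to false performs the flush.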
Which stress tests were used in duplicating the issue? I would like to add them to our internal tests. Thanks, Jeff
Hi Jeff, unfortunately, this was reproduced using our binary driver running a stress test suite. We're working on a directed test that should reproduce this problem, and we can then give you its full source. I hope to have that done within 1-2 weeks. Thanks, Terence
*** Bug 170538 has been marked as a duplicate of this bug. ***
Committed in stream U5 build 42.2. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html