Bug 1377488

Summary: aarch64: out of bounds array access on NUMA systems
Product: Red Hat Enterprise Linux 7
Reporter: David Daney <ddaney>
Component: kernel-aarch64
Sub Component: Platform Enablement
Assignee: Kernel Drivers <hwkernel-mgr>
QA Contact: Jeff Bastian <jbastian>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: ctatman, jcm, jfeeney, lmiksik, mlangsdo, rrichter
Version: 7.3
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-aarch64-4.5.0-12.el7
Last Closed: 2016-11-03 22:53:12 UTC
Type: Bug
Bug Blocks: 1250216

Description David Daney 2016-09-19 21:17:06 UTC
Description of problem:

I believe the RHELSA-7.3 kernels have this bug.  With the official kernel builds the failure does not produce an OOPS, but it may instead result in unbound workqueues being assigned worker threads on the wrong node.

The wq_numa_init() function makes a private CPU to node map by calling
cpu_to_node() early in the boot process, before the non-boot CPUs are
brought online.  Since the default implementation of cpu_to_node()
returns zero for CPUs that have never been brought online, the
workqueue system's view is that *all* CPUs are on node zero.

When the unbound workqueue for a non-zero node is created, the
tsk_cpus_allowed() cpumask for its worker threads is the empty set,
because in the workqueue system's view there are no CPUs on non-zero
nodes.  The code in try_to_wake_up() that scans this empty cpumask
gets back NR_CPUS (the "no CPU found" value), uses it as an index into
the per-CPU area pointer array, and reads garbage one past the end of
the array (a small userspace sketch of this failure mode follows the
trace below).  This results in:

[    0.881970] Unable to handle kernel paging request at virtual address fffffb1008b926a4
[    1.970095] pgd = fffffc00094b0000
[    1.973530] [fffffb1008b926a4] *pgd=0000000000000000, *pud=0000000000000000, *pmd=0000000000000000
[    1.982610] Internal error: Oops: 96000004 [#1] SMP
[    1.987541] Modules linked in:
[    1.990631] CPU: 48 PID: 295 Comm: cpuhp/48 Tainted: G        W       4.8.0-rc6-preempt-vol+ #9
[    1.999435] Hardware name: Cavium ThunderX CN88XX board (DT)
[    2.005159] task: fffffe0fe89cc300 task.stack: fffffe0fe8b8c000
[    2.011158] PC is at try_to_wake_up+0x194/0x34c
[    2.015737] LR is at try_to_wake_up+0x150/0x34c
[    2.020318] pc : [<fffffc00080e7468>] lr : [<fffffc00080e7424>] pstate: 600000c5
[    2.027803] sp : fffffe0fe8b8fb10
[    2.031149] x29: fffffe0fe8b8fb10 x28: 0000000000000000
[    2.036522] x27: fffffc0008c63bc8 x26: 0000000000001000
[    2.041896] x25: fffffc0008c63c80 x24: fffffc0008bfb200
[    2.047270] x23: 00000000000000c0 x22: 0000000000000004
[    2.052642] x21: fffffe0fe89d25bc x20: 0000000000001000
[    2.058014] x19: fffffe0fe89d1d00 x18: 0000000000000000
[    2.063386] x17: 0000000000000000 x16: 0000000000000000
[    2.068760] x15: 0000000000000018 x14: 0000000000000000
[    2.074133] x13: 0000000000000000 x12: 0000000000000000
[    2.079505] x11: 0000000000000000 x10: 0000000000000000
[    2.084879] x9 : 0000000000000000 x8 : 0000000000000000
[    2.090251] x7 : 0000000000000040 x6 : 0000000000000000
[    2.095621] x5 : ffffffffffffffff x4 : 0000000000000000
[    2.100991] x3 : 0000000000000000 x2 : 0000000000000000
[    2.106364] x1 : fffffc0008be4c24 x0 : ffffff0ffffada80
[    2.111737]
[    2.113236] Process cpuhp/48 (pid: 295, stack limit = 0xfffffe0fe8b8c020)
[    2.120102] Stack: (0xfffffe0fe8b8fb10 to 0xfffffe0fe8b90000)
[    2.125914] fb00:                                   fffffe0fe8b8fb80 fffffc00080e7648
.
.
.
[    2.442859] Call trace:
[    2.445327] Exception stack(0xfffffe0fe8b8f940 to 0xfffffe0fe8b8fa70)
[    2.451843] f940: fffffe0fe89d1d00 0000040000000000 fffffe0fe8b8fb10 fffffc00080e7468
[    2.459767] f960: fffffe0fe8b8f980 fffffc00080e4958 ffffff0ff91ab200 fffffc00080e4b64
[    2.467690] f980: fffffe0fe8b8f9d0 fffffc00080e515c fffffe0fe8b8fa80 0000000000000000
[    2.475614] f9a0: fffffe0fe8b8f9d0 fffffc00080e58e4 fffffe0fe8b8fa80 0000000000000000
[    2.483540] f9c0: fffffe0fe8d10000 0000000000000040 fffffe0fe8b8fa50 fffffc00080e5ac4
[    2.491465] f9e0: ffffff0ffffada80 fffffc0008be4c24 0000000000000000 0000000000000000
[    2.499387] fa00: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000040
[    2.507309] fa20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    2.515233] fa40: 0000000000000000 0000000000000000 0000000000000000 0000000000000018
[    2.523156] fa60: 0000000000000000 0000000000000000
[    2.528089] [<fffffc00080e7468>] try_to_wake_up+0x194/0x34c
[    2.533723] [<fffffc00080e7648>] wake_up_process+0x28/0x34
[    2.539275] [<fffffc00080d3764>] create_worker+0x110/0x19c
[    2.544824] [<fffffc00080d69dc>] alloc_unbound_pwq+0x3cc/0x4b0
[    2.550724] [<fffffc00080d6bcc>] wq_update_unbound_numa+0x10c/0x1e4
[    2.557066] [<fffffc00080d7d78>] workqueue_online_cpu+0x220/0x28c
[    2.563234] [<fffffc00080bd288>] cpuhp_invoke_callback+0x6c/0x168
[    2.569398] [<fffffc00080bdf74>] cpuhp_up_callbacks+0x44/0xe4
[    2.575210] [<fffffc00080be194>] cpuhp_thread_fun+0x13c/0x148
[    2.581027] [<fffffc00080dfbac>] smpboot_thread_fn+0x19c/0x1a8
[    2.586929] [<fffffc00080dbd64>] kthread+0xdc/0xf0
[    2.591776] [<fffffc0008083380>] ret_from_fork+0x10/0x50
[    2.597147] Code: b00057e1 91304021 91005021 b8626822 (b8606821)
[    2.603464] ---[ end trace 58c0cd36b88802bc ]---
[    2.608138] Kernel panic - not syncing: Fatal exception
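
To make the mechanism described above concrete, here is a small,
self-contained C sketch of the failure mode.  It is a userspace model,
not kernel code; NR_CPUS, per_cpu_ptrs and first_cpu are illustrative
stand-ins for the real kernel data structures and cpumask helpers:

/*
 * Userspace model of the failure mode described above -- NOT kernel code.
 * Scanning an empty cpumask yields the "no CPU found" value NR_CPUS, and
 * using that value to index an NR_CPUS-sized per-CPU pointer array reads
 * one element past the end of the array.
 */
#include <stdio.h>

#define NR_CPUS 96

/* Stand-in for the kernel's per-CPU area pointer array. */
static void *per_cpu_ptrs[NR_CPUS];

/* Stand-in for a cpumask scan: first set bit, or NR_CPUS if mask is empty. */
static int first_cpu(const unsigned char mask[NR_CPUS])
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask[cpu])
            return cpu;
    return NR_CPUS;                 /* empty mask: "no CPU" sentinel */
}

int main(void)
{
    /* Workqueue's (wrong) view of a non-zero node: no CPUs at all. */
    unsigned char allowed[NR_CPUS] = { 0 };

    int cpu = first_cpu(allowed);   /* returns NR_CPUS */

    /*
     * per_cpu_ptrs[cpu] would be per_cpu_ptrs[NR_CPUS], one element past
     * the end of the array; in the kernel, the garbage pointer read there
     * leads to the paging fault shown in the trace above.  The model
     * deliberately does not perform the out-of-bounds read.
     */
    (void)per_cpu_ptrs;
    printf("selected cpu index = %d, but array has only %d entries\n",
           cpu, NR_CPUS);
    return 0;
}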

Version-Release number of selected component (if applicable):

Upstream v4.8-rc5 and others

How reproducible:

Whether an OOPS occurs is configuration dependent (it depends on the values that happen to follow the array in memory), but the out-of-bounds access happens every time the kernel is booted on a system containing NUMA nodes other than node zero.


Steps to Reproduce:
1. Boot kernel on NUMA system with 2 or more nodes.

Actual results:

With some configurations (builds), an OOPS message/stack trace; with others, worker threads for unbound workqueues are assigned to CPUs on the wrong node (possible performance impact).  Not all builds fail with an OOPS.

Expected results:

No OOPS message, and worker threads for unbound workqueues bound to CPUs on the correct NUMA node.

Additional info:

Potential fix here: https://lkml.org/lkml/2016/9/19/678

Comment 2 David Daney 2016-09-20 00:46:45 UTC
Brew with potential fix now here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11777838

Comment 3 David Daney 2016-09-20 18:50:24 UTC
(In reply to David Daney from comment #2)
> Brew with potential fix now here:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11777838

Now obsolete, don't use this fix.

Comment 4 David Daney 2016-09-20 18:51:30 UTC
New version of potential fix is here:
https://lkml.org/lkml/2016/9/20/532

Comment 5 David Daney 2016-09-20 20:19:48 UTC
These are the symptoms of the problem on kernel-4.5.0-9.el7.aarch64:


[root@localhost ~]# ps -e | grep 'kworker/u'
    6 ?        00:00:00 kworker/u192:0
  608 ?        00:00:00 kworker/u193:5
  611 ?        00:00:00 kworker/u193:6
  614 ?        00:00:00 kworker/u193:7
 1012 ?        00:00:00 kworker/u192:2
 3327 ?        00:00:00 kworker/u192:1


We can see two unbound workqueues with several worker threads per queue.

[root@localhost ~]# taskset -p 6
pid 6's current affinity mask: ffffffffffffffffffffffff
[root@localhost ~]# taskset -p 608
pid 608's current affinity mask: ffffffffffffffffffffffff

Note that both unbound workqueues have worker threads with affinity to all 96 CPUs, which is wrong.

Should be like this:

[root@localhost ~]# ps -e | grep 'kworker/u'
    6 ?        00:00:00 kworker/u192:0
    7 ?        00:00:00 kworker/u193:0
  253 ?        00:00:00 kworker/u194:0
.
.
.
[root@localhost ~]# taskset -p 6
pid 6's current affinity mask: ffffffffffffffffffffffff
[root@localhost ~]# taskset -p 7
pid 7's current affinity mask: ffffffffffff
[root@localhost ~]# taskset -p 253
pid 253's current affinity mask: ffffffffffff000000000000

The first unbound workqueue has affinity to all 96 CPUs.
The second unbound workqueue has affinity to the node 0 CPUs (48 of them).
The third unbound workqueue has affinity to the node 1 CPUs (the other 48).
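
For reference, a minimal C sketch that computes these per-node affinity
masks, assuming the layout of this machine (96 CPUs, 2 NUMA nodes, 48
CPUs per node, numbered consecutively); taskset prints the same values
without the leading zeros:

/*
 * Compute the expected per-node CPU affinity masks for a 96-CPU system
 * with 2 NUMA nodes of 48 CPUs each.  A 96-bit mask does not fit in one
 * 64-bit word, so it is kept as two.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_CPUS        96
#define CPUS_PER_NODE  48

int main(void)
{
    for (int node = 0; node < NR_CPUS / CPUS_PER_NODE; node++) {
        uint64_t words[2] = { 0, 0 };   /* [0] = CPUs 0-63, [1] = CPUs 64-95 */

        for (int cpu = node * CPUS_PER_NODE;
             cpu < (node + 1) * CPUS_PER_NODE; cpu++)
            words[cpu / 64] |= 1ULL << (cpu % 64);

        /* 96 bits = 24 hex digits; prints 000000000000ffffffffffff for
         * node 0 and ffffffffffff000000000000 for node 1. */
        printf("node %d affinity mask: %08llx%016llx\n", node,
               (unsigned long long)words[1], (unsigned long long)words[0]);
    }
    return 0;
}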

Comment 6 David Daney 2016-09-20 22:15:27 UTC
(In reply to David Daney from comment #4)
> New version of potential fix is here:
> https://lkml.org/lkml/2016/9/20/532

brew build of this patch is here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11785057

Comment 8 Jeff Bastian 2016-09-22 20:22:55 UTC
Verified on cavium-thunderx2-02.khw.lab.eng.bos.redhat.com following the test steps in comment 5.

::::::::::::
:: Before ::
::::::::::::

[root@cavium-thunderx2-02 ~]# uname -r
4.5.0-10.el7.aarch64

[root@cavium-thunderx2-02 ~]# pgrep -laf kworker/u
6 kworker/u192:0
3941 kworker/u193:1
3954 kworker/u192:1
4001 kworker/u193:0
4021 kworker/u192:2
4071 kworker/u193:2

[root@cavium-thunderx2-02 ~]# for p in $(pgrep -laf kworker/u |
    awk '{print $1}') ; do
        taskset -p $p
done
pid 6's current affinity mask: ffffffffffffffffffffffff
pid 3941's current affinity mask: ffffffffffffffffffffffff
pid 3954's current affinity mask: ffffffffffffffffffffffff
pid 4001's current affinity mask: ffffffffffffffffffffffff
pid 4021's current affinity mask: ffffffffffffffffffffffff
pid 4071's current affinity mask: ffffffffffffffffffffffff

[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffffffffffffffff
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

:::::::::::
:: After ::
:::::::::::

[root@cavium-thunderx2-02 ~]# uname -r
4.5.0-13.el7.aarch64
[root@cavium-thunderx2-02 ~]# pgrep -laf kworker/u
6 kworker/u192:0
7 kworker/u193:0
8 kworker/u194:0
532 kworker/u194:1
597 kworker/u194:2
600 kworker/u194:3
603 kworker/u194:4
606 kworker/u194:5
609 kworker/u194:6
612 kworker/u194:7
615 kworker/u194:8
619 kworker/u194:9
623 kworker/u192:1
681 kworker/u193:1
892 kworker/u193:2
1085 kworker/u193:3
1128 kworker/u193:4
[root@cavium-thunderx2-02 ~]# for p in $(pgrep -laf kworker/u |
    awk '{print $1}') ; do
        taskset -p $p
done
pid 6's current affinity mask: ffffffffffffffffffffffff
pid 7's current affinity mask: ffffffffffff
pid 8's current affinity mask: ffffffffffff000000000000
pid 532's current affinity mask: ffffffffffff000000000000
pid 597's current affinity mask: ffffffffffff000000000000
pid 600's current affinity mask: ffffffffffff000000000000
pid 603's current affinity mask: ffffffffffff000000000000
pid 606's current affinity mask: ffffffffffff000000000000
pid 609's current affinity mask: ffffffffffff000000000000
pid 612's current affinity mask: ffffffffffff000000000000
pid 615's current affinity mask: ffffffffffff000000000000
pid 619's current affinity mask: ffffffffffff000000000000
pid 623's current affinity mask: ffffffffffffffffffffffff
pid 681's current affinity mask: ffffffffffff
pid 892's current affinity mask: ffffffffffff
pid 1085's current affinity mask: ffffffffffff
pid 1128's current affinity mask: ffffffffffff

[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffffffffffffffff
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffff
000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111

[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffff000000000000
111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000

::::::::::::::::::::::::
:: hex2bin.py utility ::
::::::::::::::::::::::::

#!/usr/bin/python

import argparse

parser = argparse.ArgumentParser(description='Convert hex to binary.')
parser.add_argument('N', metavar='N', help='number in hexadecimal to convert')
parser.add_argument('-d', '--digits', type=int, default=32,
                    help='digits to print (default is 32)')
args = parser.parse_args()

# Print the mask as a zero-padded binary string, one digit per CPU.
print(bin(int(args.N, base=16))[2:].zfill(args.digits))

Comment 10 errata-xmlrpc 2016-11-03 22:53:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2145.html