Description of problem:

I believe the RHELSA-7.3 kernels have this bug, but the failure with all official kernel builds does not produce an OOPS; it may instead result in unbound workqueues being assigned worker threads on the wrong node.

The wq_numa_init() function makes a private CPU-to-node map by calling cpu_to_node() early in the boot process, before the non-boot CPUs are brought online. Since the default implementation of cpu_to_node() returns zero for CPUs that have never been brought online, the workqueue system's view is that *all* CPUs are on node zero.

When the unbound workqueue for a non-zero node is created, the tsk_cpus_allowed() for the worker threads is the empty set because, in the view of the workqueue system, there are no CPUs on non-zero nodes. The code in try_to_wake_up() using this empty cpumask ends up using the cpumask empty-set value of NR_CPUS as an index into the per-CPU area pointer array, and gets garbage because that index is one past the end of the array. This results in:

[ 0.881970] Unable to handle kernel paging request at virtual address fffffb1008b926a4
[ 1.970095] pgd = fffffc00094b0000
[ 1.973530] [fffffb1008b926a4] *pgd=0000000000000000, *pud=0000000000000000, *pmd=0000000000000000
[ 1.982610] Internal error: Oops: 96000004 [#1] SMP
[ 1.987541] Modules linked in:
[ 1.990631] CPU: 48 PID: 295 Comm: cpuhp/48 Tainted: G W 4.8.0-rc6-preempt-vol+ #9
[ 1.999435] Hardware name: Cavium ThunderX CN88XX board (DT)
[ 2.005159] task: fffffe0fe89cc300 task.stack: fffffe0fe8b8c000
[ 2.011158] PC is at try_to_wake_up+0x194/0x34c
[ 2.015737] LR is at try_to_wake_up+0x150/0x34c
[ 2.020318] pc : [<fffffc00080e7468>] lr : [<fffffc00080e7424>] pstate: 600000c5
[ 2.027803] sp : fffffe0fe8b8fb10
[ 2.031149] x29: fffffe0fe8b8fb10 x28: 0000000000000000
[ 2.036522] x27: fffffc0008c63bc8 x26: 0000000000001000
[ 2.041896] x25: fffffc0008c63c80 x24: fffffc0008bfb200
[ 2.047270] x23: 00000000000000c0 x22: 0000000000000004
[ 2.052642] x21: fffffe0fe89d25bc x20: 0000000000001000
[ 2.058014] x19: fffffe0fe89d1d00 x18: 0000000000000000
[ 2.063386] x17: 0000000000000000 x16: 0000000000000000
[ 2.068760] x15: 0000000000000018 x14: 0000000000000000
[ 2.074133] x13: 0000000000000000 x12: 0000000000000000
[ 2.079505] x11: 0000000000000000 x10: 0000000000000000
[ 2.084879] x9 : 0000000000000000 x8 : 0000000000000000
[ 2.090251] x7 : 0000000000000040 x6 : 0000000000000000
[ 2.095621] x5 : ffffffffffffffff x4 : 0000000000000000
[ 2.100991] x3 : 0000000000000000 x2 : 0000000000000000
[ 2.106364] x1 : fffffc0008be4c24 x0 : ffffff0ffffada80
[ 2.111737]
[ 2.113236] Process cpuhp/48 (pid: 295, stack limit = 0xfffffe0fe8b8c020)
[ 2.120102] Stack: (0xfffffe0fe8b8fb10 to 0xfffffe0fe8b90000)
[ 2.125914] fb00: fffffe0fe8b8fb80 fffffc00080e7648
. . .
[ 2.442859] Call trace:
[ 2.445327] Exception stack(0xfffffe0fe8b8f940 to 0xfffffe0fe8b8fa70)
[ 2.451843] f940: fffffe0fe89d1d00 0000040000000000 fffffe0fe8b8fb10 fffffc00080e7468
[ 2.459767] f960: fffffe0fe8b8f980 fffffc00080e4958 ffffff0ff91ab200 fffffc00080e4b64
[ 2.467690] f980: fffffe0fe8b8f9d0 fffffc00080e515c fffffe0fe8b8fa80 0000000000000000
[ 2.475614] f9a0: fffffe0fe8b8f9d0 fffffc00080e58e4 fffffe0fe8b8fa80 0000000000000000
[ 2.483540] f9c0: fffffe0fe8d10000 0000000000000040 fffffe0fe8b8fa50 fffffc00080e5ac4
[ 2.491465] f9e0: ffffff0ffffada80 fffffc0008be4c24 0000000000000000 0000000000000000
[ 2.499387] fa00: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000040
[ 2.507309] fa20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2.515233] fa40: 0000000000000000 0000000000000000 0000000000000000 0000000000000018
[ 2.523156] fa60: 0000000000000000 0000000000000000
[ 2.528089] [<fffffc00080e7468>] try_to_wake_up+0x194/0x34c
[ 2.533723] [<fffffc00080e7648>] wake_up_process+0x28/0x34
[ 2.539275] [<fffffc00080d3764>] create_worker+0x110/0x19c
[ 2.544824] [<fffffc00080d69dc>] alloc_unbound_pwq+0x3cc/0x4b0
[ 2.550724] [<fffffc00080d6bcc>] wq_update_unbound_numa+0x10c/0x1e4
[ 2.557066] [<fffffc00080d7d78>] workqueue_online_cpu+0x220/0x28c
[ 2.563234] [<fffffc00080bd288>] cpuhp_invoke_callback+0x6c/0x168
[ 2.569398] [<fffffc00080bdf74>] cpuhp_up_callbacks+0x44/0xe4
[ 2.575210] [<fffffc00080be194>] cpuhp_thread_fun+0x13c/0x148
[ 2.581027] [<fffffc00080dfbac>] smpboot_thread_fn+0x19c/0x1a8
[ 2.586929] [<fffffc00080dbd64>] kthread+0xdc/0xf0
[ 2.591776] [<fffffc0008083380>] ret_from_fork+0x10/0x50
[ 2.597147] Code: b00057e1 91304021 91005021 b8626822 (b8606821)
[ 2.603464] ---[ end trace 58c0cd36b88802bc ]---
[ 2.608138] Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
Upstream v4.8-rc5 and others

How reproducible:
The OOPS is config dependent: it depends on the values that follow the array in memory. The out-of-bounds access itself happens every time the kernel is booted on a system containing NUMA nodes other than node zero.

Steps to Reproduce:
1. Boot kernel on NUMA system with 2 or more nodes.

Actual results:
OOPS message/stack trace with some configurations (builds); workqueue worker threads for unbound workqueues assigned to CPUs on the wrong node (possible performance impact). Not all builds fail with an OOPS.

Expected results:
No OOPS message.

Additional info:
Potential fix here: https://lkml.org/lkml/2016/9/19/678
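To make the out-of-bounds mechanism easier to follow, here is a minimal userspace sketch of it. This is not the kernel code: NR_CPUS, first_cpu() and per_cpu_area below are stand-ins for the real cpumask/per-CPU machinery.

#!/usr/bin/python
# Sketch of the failure mode only; the names here are stand-ins,
# not kernel symbols.
NR_CPUS = 96

def first_cpu(mask):
    # Mimic cpumask_first(): lowest set bit, or NR_CPUS if the mask is empty.
    for cpu in range(NR_CPUS):
        if (mask >> cpu) & 1:
            return cpu
    return NR_CPUS

per_cpu_area = [None] * NR_CPUS  # one slot per possible CPU

# tsk_cpus_allowed() for a worker on a "CPU-less" node is the empty set:
cpu = first_cpu(0)
print(cpu)  # 96 == NR_CPUS, one past the last valid index

try:
    per_cpu_area[cpu]
except IndexError:
    # Python catches the overrun; the kernel instead reads garbage one past
    # the end of the per-CPU pointer array and oopses in try_to_wake_up().
    print("index %d is out of bounds" % cpu)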
Brew with potential fix now here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11777838
(In reply to David Daney from comment #2)
> Brew with potential fix now here:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11777838

Now obsolete; don't use this fix.
A new version of the potential fix is here: https://lkml.org/lkml/2016/9/20/532
These are the symptoms of the problem on kernel-4.5.0-9.el7.aarch64:

[root@localhost ~]# ps -e | grep 'kworker/u'
    6 ?        00:00:00 kworker/u192:0
  608 ?        00:00:00 kworker/u193:5
  611 ?        00:00:00 kworker/u193:6
  614 ?        00:00:00 kworker/u193:7
 1012 ?        00:00:00 kworker/u192:2
 3327 ?        00:00:00 kworker/u192:1

We can see two unbound workqueues with several worker threads per queue.

[root@localhost ~]# taskset -p 6
pid 6's current affinity mask: ffffffffffffffffffffffff
[root@localhost ~]# taskset -p 608
pid 608's current affinity mask: ffffffffffffffffffffffff

Note that both workqueues' workers have affinity to all 96 CPUs. It should instead look like this:

[root@localhost ~]# ps -e | grep 'kworker/u'
    6 ?        00:00:00 kworker/u192:0
    7 ?        00:00:00 kworker/u193:0
  253 ?        00:00:00 kworker/u194:0
. . .
[root@localhost ~]# taskset -p 6
pid 6's current affinity mask: ffffffffffffffffffffffff
[root@localhost ~]# taskset -p 7
pid 7's current affinity mask: ffffffffffff
[root@localhost ~]# taskset -p 253
pid 253's current affinity mask: ffffffffffff000000000000

The first unbound workqueue has affinity to all 96 CPUs.
The second unbound workqueue has affinity to the node-0 CPUs (48 of them).
The third unbound workqueue has affinity to the node-1 CPUs (the other 48).
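For reference, the expected per-node masks can be derived directly. Below is a small sketch that prints the affinity masks a fixed kernel should assign; the 96-CPU/2-node topology is an assumption matching the board above, not something read from the running system.

#!/usr/bin/python
# Assumed topology: 2 NUMA nodes x 48 CPUs, as on the ThunderX system above.
CPUS_PER_NODE = 48
NODES = 2

for node in range(NODES):
    # Node n owns 48 contiguous CPUs, i.e. 48 contiguous bits in the mask.
    mask = ((1 << CPUS_PER_NODE) - 1) << (node * CPUS_PER_NODE)
    print("node %d affinity mask: %024x" % (node, mask))

# Output:
# node 0 affinity mask: 000000000000ffffffffffff
# node 1 affinity mask: ffffffffffff000000000000

taskset prints the node-0 mask without the leading zeros, i.e. ffffffffffff, matching the "should be" output above.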
(In reply to David Daney from comment #4)
> New version of potential fix is here:
> https://lkml.org/lkml/2016/9/20/532

Brew build of this patch is here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11785057
Verified on cavium-thunderx2-02.khw.lab.eng.bos.redhat.com following the test steps in comment 5.

::::::::::::
:: Before ::
::::::::::::

[root@cavium-thunderx2-02 ~]# uname -r
4.5.0-10.el7.aarch64

[root@cavium-thunderx2-02 ~]# pgrep -laf kworker/u
6 kworker/u192:0
3941 kworker/u193:1
3954 kworker/u192:1
4001 kworker/u193:0
4021 kworker/u192:2
4071 kworker/u193:2

[root@cavium-thunderx2-02 ~]# for p in $(pgrep -laf kworker/u | awk '{print $1}') ; do taskset -p $p ; done
pid 6's current affinity mask: ffffffffffffffffffffffff
pid 3941's current affinity mask: ffffffffffffffffffffffff
pid 3954's current affinity mask: ffffffffffffffffffffffff
pid 4001's current affinity mask: ffffffffffffffffffffffff
pid 4021's current affinity mask: ffffffffffffffffffffffff
pid 4071's current affinity mask: ffffffffffffffffffffffff

[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffffffffffffffff
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

:::::::::::
:: After ::
:::::::::::

[root@cavium-thunderx2-02 ~]# uname -r
4.5.0-13.el7.aarch64

[root@cavium-thunderx2-02 ~]# pgrep -laf kworker/u
6 kworker/u192:0
7 kworker/u193:0
8 kworker/u194:0
532 kworker/u194:1
597 kworker/u194:2
600 kworker/u194:3
603 kworker/u194:4
606 kworker/u194:5
609 kworker/u194:6
612 kworker/u194:7
615 kworker/u194:8
619 kworker/u194:9
623 kworker/u192:1
681 kworker/u193:1
892 kworker/u193:2
1085 kworker/u193:3
1128 kworker/u193:4

[root@cavium-thunderx2-02 ~]# for p in $(pgrep -laf kworker/u | awk '{print $1}') ; do taskset -p $p ; done
pid 6's current affinity mask: ffffffffffffffffffffffff
pid 7's current affinity mask: ffffffffffff
pid 8's current affinity mask: ffffffffffff000000000000
pid 532's current affinity mask: ffffffffffff000000000000
pid 597's current affinity mask: ffffffffffff000000000000
pid 600's current affinity mask: ffffffffffff000000000000
pid 603's current affinity mask: ffffffffffff000000000000
pid 606's current affinity mask: ffffffffffff000000000000
pid 609's current affinity mask: ffffffffffff000000000000
pid 612's current affinity mask: ffffffffffff000000000000
pid 615's current affinity mask: ffffffffffff000000000000
pid 619's current affinity mask: ffffffffffff000000000000
pid 623's current affinity mask: ffffffffffffffffffffffff
pid 681's current affinity mask: ffffffffffff
pid 892's current affinity mask: ffffffffffff
pid 1085's current affinity mask: ffffffffffff
pid 1128's current affinity mask: ffffffffffff

[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffffffffffffffff
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffff
000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111
[root@cavium-thunderx2-02 ~]# ./hex2bin.py -d 96 ffffffffffff000000000000
111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000

::::::::::::::::::::::::
:: hex2bin.py utility ::
::::::::::::::::::::::::

#!/usr/bin/python
import argparse

parser = argparse.ArgumentParser(description='Convert hex to binary.')
parser.add_argument('N', metavar='N', help='number in hexadecimal to convert')
parser.add_argument('-d', '--digits', type=int, default=32,
                    help='digits to print (default is 32)')
args = parser.parse_args()

# Parse the hex string, strip the '0b' prefix from bin(), and left-pad
# with zeros to the requested width.
print bin(int(args.N, base=16))[2:].zfill(args.digits)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2145.html