Bug 1423432
The major part of the changes come from this patch: Backporting it and some more. --------------- %< -------------- commit 55c041cf3750894df6f304ad0489c777e27402e7 Author: Sebastian Andrzej Siewior <bigeasy> Date: Thu Jan 21 15:58:56 2016 +0100 sched: fixup migrate disable (all tasks were bound to CPU0) Includes: |From: Thomas Gleixner <tglx> | |Subject: [PATCH] sched: use tsk_cpus_allowed() instead of accessing ->cpus_allowed | |Use the future-safe accessor for struct task_struct's. | |Signed-off-by: Thomas Gleixner <tglx> |Signed-off-by: Sebastian Andrzej Siewior <bigeasy> and |From: Thomas Gleixner <tglx> |Subject: [PATCH] sched: provide a tsk_nr_cpus_allowed() helper | |tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows |us to change the representation of ->nr_cpus_allowed if required. | |Signed-off-by: Thomas Gleixner <tglx> |Signed-off-by: Sebastian Andrzej Siewior <bigeasy> and the folded changes from introduce_migrate_disable_cpu_light.patch. Signed-off-by: Sebastian Andrzej Siewior <bigeasy> --------------- >% --------------- Created attachment 1258194 [details] [PATCH 1/5] sched: Move p->nr_cpus_allowed check to select_task_rq() commit 6c1d9410f007a26d13173cf17204cfd965f49b83 From: Wanpeng Li <wanpeng.li.com> Date: Wed, 5 Nov 2014 09:14:37 +0800 Subject: sched: Move p->nr_cpus_allowed check to select_task_rq() Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq(). This change will make fair.c, rt.c, and deadline.c all start with the same logic. Suggested-and-Acked-by: Steven Rostedt <rostedt> Signed-off-by: Wanpeng Li <wanpeng.li.com> Signed-off-by: Peter Zijlstra (Intel) <peterz> Cc: "pang.xunlei" <pang.xunlei> Cc: Linus Torvalds <torvalds> Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo> Signed-off-by: Daniel Bristot de Oliveira <bristot> Created attachment 1258195 [details]
[PATCH 2/5] sched: fixup migrate disable (all tasks were bound to CPU0)
commit 55c041cf3750894df6f304ad0489c777e27402e7
Author: Sebastian Andrzej Siewior <bigeasy>
Date: Thu Jan 21 15:58:56 2016 +0100
Subject: sched: fixup migrate disable (all tasks were bound to CPU0)
Includes:
|From: Thomas Gleixner <tglx>
|
|Subject: [PATCH] sched: use tsk_cpus_allowed() instead of accessing ->cpus_allowed
|
|Use the future-safe accessor for struct task_struct's.
|
|Signed-off-by: Thomas Gleixner <tglx>
|Signed-off-by: Sebastian Andrzej Siewior <bigeasy>
and
|From: Thomas Gleixner <tglx>
|Subject: [PATCH] sched: provide a tsk_nr_cpus_allowed() helper
|
|tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows
|us to change the representation of ->nr_cpus_allowed if required.
|
|Signed-off-by: Thomas Gleixner <tglx>
|Signed-off-by: Sebastian Andrzej Siewior <bigeasy>
and the folded changes from introduce_migrate_disable_cpu_light.patch.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy>
Signed-off-by: Daniel Bristot de Oliveira <bristot>
Created attachment 1258196 [details]
[PATCH 3/5] sched: fix ->nr_cpus_allowed = 1 transcription error during migrate_disable() cleanup
commit b3b36ee61c780407737a321007a24cfe9c3668a7
Author: Mike Galbraith <umgwanakikbuti>
Date: Sat Jan 23 08:11:53 2016 +0100
Subject: sched: fix ->nr_cpus_allowed = 1 transcription error during migrate_disable() cleanup
Setting p->nr_cpus_allowed accidentally wandered into migrate_disable()
during the cleanup - kill it.
Signed-off-by: Mike Galbraith <umgwanakikbuti>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy>
Signed-off-by: Daniel Bristot de Oliveira <bristot>
Created attachment 1258197 [details]
[PATCH 4/5] kernel: softirq: unlock with irqs on
commit f848b439cd995a2016422fa0c08daff9e8bb45f7
Author: Sebastian Andrzej Siewior <bigeasy>
Date: Tue Feb 9 18:17:18 2016 +0100
Subject: kernel: softirq: unlock with irqs on
kernel: softirq: unlock with irqs on
We unlock the lock while the interrupts are off. This isn't a problem
now but will get because the migrate_disable() + enable are not
symmetrical in regard to the status of interrupts.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy>
Signed-off-by: Daniel Bristot de Oliveira <bristot>
Created attachment 1258198 [details]
[PATCH 5/5] kernel: migrate_disable() do fastpath in atomic & irqs-off
commit d09adb58ff92f87f3d2823f23c269c9ce241dd69
Author: Sebastian Andrzej Siewior <bigeasy>
Date: Tue Feb 9 18:18:01 2016 +0100
Subject: kernel: migrate_disable() do fastpath in atomic & irqs-off
With interrupts off it makes no sense to do the long path since we can't
leave the CPU anyway. Also we might end up in a recursion with lockdep.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy>
Signed-off-by: Daniel Bristot de Oliveira <bristot>
There is a problem in the updated version of the migrate_disable()/enable() implementation backported in this patch. The problem is the following: In the new behavior, when a task is attached to the rt runqueue, it is checked if it either can run in more than one CPU, or if it is with migration disable. If either check is true, the rt_rq->rt_nr_migratory counter is not increased. The counter increases otherwise. When the task is detached, the same check is done. If either check is true, the rt_rq->rt_nr_migratory counter is not decreased. The counter decreases otherwise. One important thing is that, migrate disable/enable does not touch this counter for tasks attached to the -rt rq. So suppose the following chain of events. Assumptions: Task A is the only runnable task in A Task B runs in the CPU B rt_nr_migratory is 0, so. Task B has RT priority Task A can run on all CPUS. B is running A is running Task A runs on CFS (non-rt) CPU A/TASK A CPU B/TASK B A takes the rt mutex X . A disables migration . . B rt task tries to take the rt mutex X . As it is locked by A: . A inherits the rt priority of B . A is dequeued from CFS RQ of CPU A . A is enqueued in the RT RQ of CPU A . As migration is disabled: . rt_nr_migratory in A is not increased . A enables migration A releases the rt mutex { A returns to its original priority A ask to be dequeued from RT RQ { As migration is now enabled and it can run on all CPUS { rt_nr_migratory should be decreased As it is 0, rt_nr_migratory under flows. } } The rt_nr_migratory value must be at least 0, and that is the cause of the BUG(). This variable is important because it notifies if there are more than one runnable & migratable task in the runqueue. If there are more than one tasks, the rt_rq is set as overloaded, and this tries to migrate some tasks. This rule is important to keep the scheduler working conserving, that is in a system with M CPUs, the M highest priority tasks should be running. Created attachment 1287084 [details]
Subject: [PATCH 1/3] Revert "sched: fixup migrate disable (all tasks were
This reverts commit c3d80fe97a84a8d137b93e61b2ccbca974022d84.
Signed-off-by: Daniel Bristot de Oliveira <bristot>
Created attachment 1287085 [details]
[PATCH 2/3] Revert "kernel: softirq: unlock with irqs on"
This reverts commit 93bc0e814e212229047d2b5c9690493ef8d9da2e.
Signed-off-by: Daniel Bristot de Oliveira <bristot>
Created attachment 1287086 [details]
[PATCH 3/3] Revert "kernel: migrate_disable() do fastpath in atomic & irqs-off"
This reverts commit 334af746c8f6bebc99a9687b144e14c250f5d6ec.
Signed-off-by: Daniel Bristot de Oliveira <bristot>
Returning to assigned, we have the bug reported in the BZ1441552. I checked and this bug is also present upstream, and I am working on it. -- Daniel The maintainer replied saying he will apply the patch to the next -rt release. -- Daniel We thought on backporting this as a proactive action, but since we created, no real problem occurred to justify this BZ to become a priority. So, let's close this BZ... Feel free to open it if any problem justifies it. |
Created attachment 1251865 [details] removed code, do we need it?! Description of problem: Since v4.4-x-rt stable kernel the migrate_disable/enable operations were modified. The code now is simpler than it used to be, but we need to be sure that the simplification does not depend on another part of the kernel, mainly the control of the task's cpumask. Version-Release number of selected component (if applicable): RHEL-RT kernel based on RHEL 7.3 based How reproducible: No bug on the current code. Actual results: It works. Expected results: It should work with the newer code. Additional info: The main problem so far is that a considerable part of the code was removed. See attachment.