Red Hat Bugzilla – Bug 1510602
locking: bring in upstream PREEMPT_RT rtlock patches to fix single-reader limitation
Last modified: 2018-10-30 05:42:37 EDT
Early versions of the rwsem and rwlock implementations on PREEMPT_RT imposed a single-reader limitation due to preemption problems. Upstream has lifted that limitation with the following commits:

81f6cae6196d rtmutex: add rwlock implementation based on rtmutex
b7abb5b0ad58 rtmutex: add rwsem implementation based on rtmutex

The current RHEL-RT 7.4 has an early version of the rwsem fix and does *not* contain the corresponding rwlock commit. Backport commit 81f6cae6196d so that rwlocks allow multiple read lockers.
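For reference, one quick way to check whether a given kernel-rt build already carries these backports is to search the package changelog; this is only a sketch, and the exact changelog wording the grep matches on is an assumption:

  # Look for the rtmutex-based rwlock/rwsem backports in the kernel-rt changelog
  $ rpm -q --changelog kernel-rt | grep -iE 'rwlock|rwsem' | head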
Created attachment 1349103 [details]
Shell script to trace a run of rteval

Script used to run ftrace on an rteval run looking for latency spikes
To see this problem, boot an RT kernel with SELinux enabled on a system with more than 64 cores. I've been using hp-dl580gen9-01.khw.lab.eng.bos.redhat.com.

Use the attached trace-rteval.sh script to run rteval with tracing:

$ sudo ./trace-rteval.sh --duration=2h --breakthresh=200 -z

This will kick off an rteval run with tracing set to stop if a cyclictest latency greater than 200us is seen. When the run stops, at least two reports will be generated:

latency-trace-n.txt
latency-trace-n-cpuX.txt

Note that it's possible for a latency spike to occur on more than one CPU at the same time, so there may be multiple per-cpu reports. The per-cpu reports show the CPU where a latency occurred. Search the file for the string 'hit', which is the tracemark message emitted when a threshold is hit, then search backwards for a trace showing that the function 'security_compute_av()' went through the slowpath and caused a latency spike. This is due to contention on the lock 'policy_rwlock' in security/selinux/ss/services.c, which is a reader/writer lock (rwlock).
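For convenience, the tracemark and the surrounding SELinux slowpath can be located with plain grep; the report file name below is just an example instance of the latency-trace-n-cpuX.txt pattern:

  # Find the 'hit' tracemark in a per-cpu report (example file name)
  $ grep -n 'hit' latency-trace-1-cpu12.txt

  # Check whether security_compute_av() shows up in the trace lines
  # leading up to the tracemark
  $ grep -B 200 'hit' latency-trace-1-cpu12.txt | grep -c 'security_compute_av'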
It looks like we're going to revert the initial commits that partially implement reader/writer locks (just rwsems) for 7.5, so I think we should start from scratch and bring both rwsems and rwlocks into 7.6. See https://bugzilla.redhat.com/show_bug.cgi?id=1448770 for more info on why the initial commits are being reverted.
Adding the KVM guys to the CC. We'll want to either duplicate the KVM testing setup or keep that setup around so we can test when we bring back the reader/writer patches.
Just confirming that we'll be reverting the rwsem changes for 7.5, so this bug now covers all reader/writer lock changes in our 7.6 RT kernel.
==Update==
QE hit an xfs issue with the latest kernel-rt build.

3.10.0-864.rt56.806.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
libvirt-3.9.0-14.el7.x86_64
tuned-2.9.0-1.el7.noarch

Reproduce: 2/2

However, with Scott's latest debugging test builds for QE, the xfs hang issue is gone. I asked Luis about this on IRC; it seems the latest patch is missing from the latest kernel-rt build. I'll re-test after this patch is applied.
Unfortunately we hit the xfs hang issue again.

Version: kernel-rt-3.10.0-880.rt56.823.el7.x86_64

Testing:
After finishing 24-hour cyclictest testing with scenarios (1) and (2) below, we hit the Call Trace issue during the scenario (3) cyclictest run.

(1) Single VM with 1 rt vCPU
(2) Single VM with 8 rt vCPUs
(3) Multiple VMs each with 1 rt vCPU

So this issue still exists, and it has become hard to reproduce.

Best Regards,
Pei
Created attachment 1432478 [details]
Guest call log of kernel-rt-3.10.0-880.rt56.823.el7.x86_64
(In reply to Pei Zhang from comment #10)
> Unfortunately we hit the xfs hang issue again.
>
> Version: kernel-rt-3.10.0-880.rt56.823.el7.x86_64
>
> So this issue still exists, and it has become hard to reproduce.

Pei,

Just to confirm, this was without make -j or a polling thread?
(In either host or guest.)
(In reply to Marcelo Tosatti from comment #14)
> Pei,
>
> Just to confirm, this was without make -j or a polling thread?
> (In either host or guest.)

Sorry, I forgot to reply to this needinfo. In both host and guest, we test with
# make -j$twice_housekeeping_cpus_num
(In reply to Pei Zhang from comment #10)
> Unfortunately we hit the xfs hang issue again.
>
> Version: kernel-rt-3.10.0-880.rt56.823.el7.x86_64
>
> So this issue still exists, and it has become hard to reproduce.

Pei,

Can you confirm there was no polling thread (or make -j) in either guest or host in this case?
(In reply to Marcelo Tosatti from comment #16)
> Pei,
>
> Can you confirm there was no polling thread (or make -j) in either guest or
> host in this case?

In both host and guest, we tested with
# make -j$twice_housekeeping_cpus_num
in this case.
(In reply to Pei Zhang from comment #17)
> In both host and guest, we tested with
> # make -j$twice_housekeeping_cpus_num
> in this case.

Marcelo, I'm not sure if I answered your question, so I'd like to share all the steps of the testing (using automation):

1. Install rhel7.6 host

2. Setup rt host

3. Install guest and do testing

3.1 Single VM with 1 rt vCPU

  3.1.1. Install rhel7.6 guest
  3.1.2. Setup rt guest
  3.1.3. Do cyclictest testing; (1), (2) and (3) run at the same time.
    (1) In rt host, compile the kernel with $twice_housekeeping_CPUs_num
        # make -j14
    (2) In rt guest, compile the kernel with $twice_housekeeping_vCPUs_num
        # make -j2
    (3) In rt guest, start cyclictest
        # taskset -c 1 cyclictest -m -n -q -p95 -D 1h -h60 -t 1 -a 1 -b40 --notrace

3.2 Single VM with 8 rt vCPUs

  3.2.1. Install rhel7.6 guest
  3.2.2. Setup rt guest
  3.2.3. Do cyclictest testing; (1), (2) and (3) run at the same time.
    (1) In rt host, compile the kernel with $twice_housekeeping_CPUs_num
        # make -j14
    (2) In rt guest, compile the kernel with $twice_housekeeping_vCPUs_num
        # make -j4
    (3) In rt guest, start cyclictest
        # taskset -c 1,2,3,4,5,6,7,8 cyclictest -m -n -q -p95 -D 24h -h60 -t 8 -a 1,2,3,4,5,6,7,8 -b40 --notrace

3.3 Multiple VMs each with 1 rt vCPU

  3.3.1. Install 4 rhel7.6 guests
  3.3.2. Setup these 4 rt guests
  3.3.3. Do cyclictest testing; (1), (2) and (3) run at the same time.
    (1) In rt host, compile the kernel with $twice_housekeeping_CPUs_num
        # make -j14
    (2) In each rt guest, compile the kernel with $twice_housekeeping_vCPUs_num
        # make -j2
    (3) In each rt guest, start cyclictest
        # taskset -c 1 cyclictest -m -n -q -p95 -D 1h -h60 -t 1 -a 1 -b40 --notrace

Thanks,
Pei
Additional info to Comment 18: steps (1) and (2) are repeated in a loop until the 24-hour cyclictest run finishes.
(In reply to Pei Zhang from comment #18)
> Marcelo, I'm not sure if I answered your question, so I'd like to share all
> the steps of the testing (using automation):

Pei,

Yes, you do. The problem is as follows:

On the host, the kvm-vcpus (each one of them isolated to a single pcpu) have scheduling priority FIFO:1.

This means that a kvm-vcpu will run until:

1) A higher -RT priority process is scheduled on that pcpu.
2) It sleeps.

Depending on the workload in the kvm-vcpu, the kvm-vcpu might never sleep, which prevents the worker threads on that pcpu (which have a lower priority than the kvm-vcpu process) from running. That would explain why a certain process is unable to execute for more than 600 seconds (the warning printed as "Bug 1590222 - INFO: task worker:194936 blocked for more than 600 seconds").

The tracing suggested at https://bugzilla.redhat.com/show_bug.cgi?id=1590222#c12 would allow us to know whether that is happening or not (that is, whether the kvm-vcpu process never sleeps due to the non-sleeping "make -j" workload).
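If it helps, a quick way to eyeball this on the host is to compare the scheduling class/priority of the vcpu threads with the kworkers pinned to the same pcpus; the thread-name pattern below is an assumption (libvirt normally names vcpu threads "CPU n/KVM") and may need adjusting:

  # Show scheduling class, RT priority and CPU placement of vcpu threads
  # and per-CPU kworkers
  $ ps -eLo tid,psr,class,rtprio,comm --sort=psr | grep -E 'CPU [0-9]+/KVM|kworker'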
(In reply to Marcelo Tosatti from comment #20)
> Depending on the workload in the kvm-vcpu, the kvm-vcpu might never sleep,
> which prevents the worker threads on that pcpu (which have a lower priority
> than the kvm-vcpu process) from running.

Wouldn't RT throttling kick in in that case?
(In reply to Scott Wood from comment #21)
> Wouldn't RT throttling kick in in that case?

-RT throttling is disabled for KVM-RT, and you don't want to enable it:

After some time investigating the suggestion to use -RT throttling to solve the problem of maintenance tasks that need to run on the isolated CPUs, the following facts became clear.

Benefits of -RT throttling:

1) Applications do not need to be aware of scheduling-in/scheduling-out points (schedule-out points enforced by the operating system are a generic, application-independent solution).

Disadvantages of -RT throttling:

1) The schedule-out points are not controlled.

Say you have latency spikes on the application:

latency
|
|    /\
|   /  \
|  /    \
| /      \______________
|___________________ time

for example due to internal DPDK timer handling. The schedule-out points, not being coordinated to happen at specific points in time, can happen exactly at those latency spikes, adding the amount of time reserved for non-realtime applications to run on top of the latency spike. In comparison, the DPDK application can choose when to schedule out: for example when the queues are empty and when it is not about to execute an internal timer.

2) Runqueue lock contention.

A high-resolution -RT unthrottle timer is necessary on the host side (it must be armed to fire N us after the runqueue is throttled). This timer must acquire the runqueue lock, which has a large maximum latency. From Daniel's message:

> I already have a prototype of the scheduler, and the results are very
> good so far. As the number of migration is limited to 1 per CPU (during
> a task's period), there is almost no runqueue lock contention! This
> helps to address the push/pull locking latency we see on large box (I
> can see a reduction from 70us to < 2us with the same workload - a very
> high work on the realtime1 box (with 24 cpus)).

I don't have the numbers handy, but IIRC I saw > 100us runqueue lock contention previously. Either way, those numbers are not in the acceptable range.

Given that guaranteeing "runqueue lock contention below X us" is a requirement which upstream kernel development will not take into account, I don't see any option other than suggesting that users fix this in their application.

https://lwn.net/Articles/296419/

"Ingo's suggestion was to raise the limit to ten seconds of CPU time. As he (and others) pointed out: any SCHED_FIFO application which needs to monopolize the CPU for that long has serious problems and needs to be fixed."

Unless someone has another idea to perform this at the operating system level, without using the runqueue lock, or with reduced runqueue lock contention, I'll reply to the DPDK thread which included an example of how applications should perform this.
(In reply to Marcelo Tosatti from comment #22)
> (In reply to Scott Wood from comment #21)
> > Wouldn't RT throttling kick in in that case?
>
> -RT throttling is disabled for KVM-RT, and you don't want to enable it.

I wasn't suggesting it as a solution, just trying to understand the cause of the hangs that have been reported. What is it that automatically disables RT throttling? It's enabled on my system even after installing kernel-rt-kvm.

I tried recreating the load in bug 1590222, and just reproduced a somewhat similar hang that left a bunch of processes stuck in the D state even after all loads were killed. I'll look more deeply into it tomorrow.
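For checking a particular box, the throttling state is visible directly in the sysctls; -1 means RT throttling is disabled, and 950000/1000000 (95%) is the usual default (the values here are illustrative, not taken from this system):

  # Current RT throttling configuration
  $ cat /proc/sys/kernel/sched_rt_period_us
  $ cat /proc/sys/kernel/sched_rt_runtime_us

  # Disabling RT throttling by hand
  $ sudo sysctl -w kernel.sched_rt_runtime_us=-1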
(In reply to Marcelo Tosatti from comment #22)
> Unless someone has another idea to perform this at the operating system
> level, without using the runqueue lock, or with reduced runqueue lock
> contention, I'll reply to the DPDK thread which included an example of how
> applications should perform this.

It is important to keep in mind that the RT throttling default behavior is not exactly what one would expect:

The RT_RUNTIME_SHARE sched feature is enabled by default and allows a CPU about to suffer RT throttling to borrow RT bandwidth from other CPUs. In a system where not all the CPUs are currently running RT tasks, the CPU running an RT task could keep that task running for longer than the RT throttling threshold. Depending on the load distribution in the system, this RT task could run for minutes without being throttled. So the expected behavior is only ensured if RT_RUNTIME_SHARE is explicitly disabled.

Also, Daniel Bristot introduced a sched feature called RT_RUNTIME_GREED that only applies RT throttling when there are SCHED_OTHER (non-RT) tasks waiting for CPU time, even if the RT throttling threshold has been reached. RT_RUNTIME_GREED minimizes the impact of RT throttling on performance and latency as much as possible.

More detailed notes:
- RT_RUNTIME_SHARE: https://bugzilla.redhat.com/show_bug.cgi?id=1459275#c39
- RT_RUNTIME_GREED: https://bugzilla.redhat.com/show_bug.cgi?id=1491722#c6
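As a quick reference, these sched features can be inspected and toggled through debugfs on a running kernel; this is a minimal sketch assuming debugfs is mounted at /sys/kernel/debug and that the kernel exposes RT_RUNTIME_GREED (which is specific to the RHEL-RT kernel):

  # Show the RT-throttling-related sched features currently in effect
  # (a NO_ prefix means the feature is disabled)
  $ sudo cat /sys/kernel/debug/sched_features | tr ' ' '\n' | grep RT_RUNTIME

  # Disable runtime sharing and only throttle when SCHED_OTHER tasks
  # are actually waiting
  $ echo NO_RT_RUNTIME_SHARE | sudo tee /sys/kernel/debug/sched_features
  $ echo RT_RUNTIME_GREED | sudo tee /sys/kernel/debug/sched_features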
(In reply to Scott Wood from comment #23)
> I wasn't suggesting it as a solution, just trying to understand the cause of
> the hangs that have been reported. What is it that automatically disables
> RT throttling? It's enabled on my system even after installing
> kernel-rt-kvm.

realtime-virtual-host tuned profile.

> I tried recreating the load in bug 1590222, and just reproduced a somewhat
> similar hang that left a bunch of processes stuck in the D state even after
> all loads were killed. I'll look more deeply into it tomorrow.

That would be very helpful. Do you have a reproducer?
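To confirm a host is actually running with that profile (and therefore with RT throttling turned off), something like the following can be used:

  # Check the active tuned profile
  $ tuned-adm active

  # Apply it by hand if it is not active
  $ sudo tuned-adm profile realtime-virtual-host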
*** Bug 1590222 has been marked as a duplicate of this bug. ***
==Verification==
KVM-RT regular 24-hour queuelat testing gets PASS results, so QE considers this bug fixed.

Versions:
qemu-kvm-rhev-2.12.0-18.el7.x86_64
kernel-rt-3.10.0-957.rt56.910.el7.x86_64
tuned-2.10.0-6.el7.noarch
libvirt-4.5.0-10.el7.x86_64

Testing:
(1) Single VM with 1 rt vCPU: 24-hour queuelat testing - PASS
(2) Single VM with 8 rt vCPUs: 24-hour queuelat testing - PASS
(3) Multiple VMs each with 1 rt vCPU: 24-hour queuelat testing - PASS

Moving this bug to 'VERIFIED'. Please correct me if there are any mistakes. Thanks.

Note: The xfs hang issue mentioned in the comments above still exists; the bz below is tracking it.
Bug 1590222 - INFO: task worker:194936 blocked for more than 600 seconds
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3096