Red Hat Bugzilla – Bug 509702
Implement support for CLONE_IO
Last modified: 2016-11-24 10:39:07 EST
Description of problem:
The kernel (since 2.6.25) supports a CLONE_IO flag which tells the kernel that the new thread cooperates with the current thread on I/O. This greatly increases the throughput of a thread pool issuing sequential I/O to a single file when using the CFQ scheduler.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Try to create a thread with CLONE_IO
Actual results:
No pthread API exposes the flag.
Expected results:
pthread_attr_shareio_np() or something similar.
Additional info:
See also the blocked qemu bug.
Note, the qemu bug (bug #498242) is on F12VirtTarget
What are the consequences of adding the flag? Where will there be POSIX incompatibilities if this flag is used?
I ask because if there are none and there are no other drawbacks it should be the default.
There will be no POSIX incompatibility if CLONE_IO is used by default, but there may be severe performance implications. Consider a threaded database accessing several indices (in response to different queries). Without CLONE_IO, each thread gets its own IO context and thus a "time slice" of the disk. This allows sequential clustered accesses to complete rapidly.
On the other hand, with CLONE_IO, requests from a single thread will have no special affinity to each other, and thus requests from all threads will be interspersed with each other. If the threads issue sequential or clustered requests, they will be forced to seek more than without CLONE_IO.
To avoid these regressions, I recommend having CLONE_IO as an opt-in choice for applications that know that their threads are making unrelated requests.
I talked to Chris Wright today about this.
He explained that this is meant to consolidate IO contexts so that the kernel doesn't wait for more requests to see whether consolidation of requests is possible. If all the threads use the same context, consecutive requests cause the outstanding requests to be processed.
But this is really a nice side effect. The kernel doesn't really get smarter. It doesn't notice which threads are working on the same files and regions so that requests can be consolidated. And it doesn't notice when requests can never be consolidated. Using a single IO context just hides the effects enough.
This is all a detail of the current kernel implementation. Codifying this in an interface which has to be maintained forever isn't a good idea.
It is likely not a good idea to have more than one IO context for a process. Chris explained that qemu wants to use the flag for all threads of the thread pool. And even there, there is a problem: the flag cannot be set for already running threads.
Therefore I suggest an alternative. Add a new prctl() to select this mode process-wide. This way all newly created threads will get the support. And it might even be possible to change all existing threads in a process to revert back to one IO context.
I cannot see a way to formulate all this in a useful way as a thread attribute which makes sense from this point on, even if the kernel IO and thread implementation changes. Therefore I'm closing this as WONTFIX.