So am I correct that when they use the older kernel (-194 based), that they do not see these roughly 30s delays when writing?
What might be helpful is to test something around 5.6.z and see if you have the same performance issues with it. That would help narrow down whether this is a regression and when it might have crept in.
Clearly, that won't have the fix for the original kswapd deadlock, so this would just be to help us determine when the performance issue might have started for you.
If the client is writing with 30s gaps in between, then it sounds like the VM subsystem is not attempting to write the data. The NFS client really only issues writes under three conditions:
1) a data-integrity flush -- fsync() or close(). The client needs to flush out data to ensure that it's written out to the server before it releases the inode.
2) memory pressure flush -- the VM is asking us to write out dirty pages so it can free up memory
3) kupdated -- the data has hit the expire interval, so kupdated is going to try to write it out. This generally happens every 30s...
The catch here is that if you are not doing fsyncs, and you have gobs of memory, then there's little need for the kernel to flush out dirty pages. That leaves kupdated, and since it only flushes data that's older than 30s, that may help explain why you only see I/O on that interval.
(In reply to comment #2)
> So am I correct that when they use the older kernel (-194 based), that they do
> not see these roughly 30s delays when writing?
Yep. Just the kswapd hangs.
What would be most helpful here is a self-contained testcase that demonstrates the problem. Could someone at EMC provide something along those lines?
In the absence of that, could you bisect the problem down, starting with pre-built kernels? Our support folks ought to be able to provide you some to test.
More data points ...
kernel -238 (plus patches from BZ 516490): smooth NFS write I/O
kernel -250: bursty NFS write I/O (~35 second intervals)
kernel -238 (plus patches from BZ 516490): relatively smooth local f/s I/O
kernel -250: we have one iostat which shows hardly any I/O
Perhaps it's the patches for 441730 then? If you have a reproducer, it would be nice to try to bisect that down as well...
Ok, so you've tested a -238 kernel with the patchset for 516490. I think the next step is to add the patchset for bug 441730 on top of that and test again. If that does not show the problem, then I'd be more inclined to think this is a problem at the VM layer. Otherwise, we can dive more deeply into the patchset for 441730 and see if we can discern the cause.
Kernel -245 is queued up next. Depending on how that goes, we might do -250 without the patchset from bug 441730.
Based on the last set of comments that indicate that this may be a hardware issue, I'm going to go ahead and close this as NOTABUG. If it looks like it is one after all, then please reopen it.