Description of problem:
It looks like there is a strange interaction between Xen and the dom0 IO scheduler. Sometimes one guest is doing heavy IO while another guest has a handful of processes in iowait state, yet no IO at all is done for that guest for several seconds. Needless to say, this unfairness interferes with system operation and can cause trouble. CFQ in dom0 should make things fairer between the guests, but it does not seem to work right...

Version-Release number of selected component (if applicable):
2.6.18-8.el5xen

Steps to Reproduce:
1. Run several Xen guests, some of which have data on the same hard disk.
2. Run "vmstat 1" in each guest while IO-heavy workloads run.
3. Watch one guest get starved of disk IO for seconds at a time (sometimes).

Expected results:
Both guests get fair access to the disk, with relatively low latency.
Running "iostat -x 3 sdc dm-4 dm-13" has yielded some strange information: avg-cpu: %user %nice %system %iowait %steal %idle 4.50 0.02 1.21 1.05 4.28 88.95 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 9.82 23.20 27.72 32.75 1809.70 816.19 43.43 0.84 13.89 8.03 48.55 dm-4 0.00 0.00 28.94 35.62 1689.18 472.95 33.49 0.38 5.88 7.12 45.98 dm-13 0.00 0.00 3.48 16.72 75.71 313.93 19.28 0.42 20.54 12.26 24.78 avg-cpu: %user %nice %system %iowait %steal %idle 0.66 0.00 0.33 0.00 5.81 93.19 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 32.78 37.79 46.49 73.58 4805.35 1543.81 52.88 13.23 132.37 8.38 100.60 dm-4 0.00 0.00 76.25 107.36 4647.49 1257.53 32.16 19.37 145.53 5.48 100.60 dm-13 0.00 0.00 2.68 5.69 96.32 313.04 48.96 2.34 88.48 120.32 100.60 avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.33 0.33 5.50 92.83 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 23.00 26.67 32.67 55.33 3453.33 1072.00 51.42 11.46 139.17 11.38 100.13 dm-4 0.00 0.00 52.33 72.33 3314.67 874.67 33.60 15.25 126.92 8.03 100.13 dm-13 0.00 0.00 3.33 7.00 66.67 176.00 23.48 2.58 306.84 96.90 100.13 avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.33 0.17 5.99 92.51 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 23.18 41.72 36.42 79.80 3578.81 1854.30 46.75 13.82 122.01 8.56 99.47 dm-4 0.00 0.00 57.95 109.60 3570.86 1221.19 28.60 19.06 116.20 5.94 99.47 dm-13 0.00 0.00 1.99 11.59 66.23 617.22 50.34 1.96 150.93 73.07 99.21 avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.33 0.00 4.98 93.69 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 23.26 28.57 43.85 64.12 3864.45 1132.23 46.28 12.83 117.76 9.24 99.80 dm-4 0.00 0.00 64.45 91.36 3808.64 1023.26 31.01 17.91 99.97 6.41 99.80 dm-13 0.00 0.00 0.66 4.65 10.63 143.52 29.00 2.07 546.00 187.75 99.80 avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.33 0.00 5.18 93.48 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 28.09 35.12 35.45 69.90 4099.00 1147.83 49.80 11.57 107.96 9.54 100.47 dm-4 0.00 0.00 62.21 107.02 3970.57 1182.61 30.45 18.84 121.45 5.94 100.47 dm-13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.04 0.00 0.00 100.47 avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.33 0.17 4.49 94.02 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 39.00 22.00 51.00 53.67 5730.67 1197.33 66.19 11.60 117.34 9.57 100.13 dm-4 0.00 0.00 90.67 65.67 5792.00 920.00 42.93 18.04 118.35 6.41 100.13 dm-13 0.00 0.00 0.67 5.00 10.67 210.67 39.06 1.24 305.88 176.71 100.13 avg-cpu: %user %nice %system %iowait %steal %idle 1.16 0.00 0.66 0.17 6.80 91.21 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc 42.05 16.56 44.04 47.35 5650.33 1017.22 72.96 9.54 104.91 10.88 99.47 dm-4 0.00 0.00 88.08 55.63 5843.71 839.74 46.51 15.42 104.93 6.92 99.47 dm-13 0.00 0.00 0.33 7.95 5.30 174.83 21.76 1.12 183.04 120.16 99.47 Notice how sometimes dm-13 just gets starved. The requests sit in the queue and do not get serviced anywhere nearly as fast as the requests to dm-4 get serviced.
Switching to the anticipatory scheduler seems to be a small improvement over CFQ in this case: the gross unfairness goes away, and both domains run somewhat slowly. The deadline scheduler seems to make things even worse than CFQ, with request latencies of well over a second observed.
I talked it over with Alan, and it turns out the Linux ata layer does not insert an ordered request every few requests to enforce fairness (as the SCSI layer does). This allows the NCQ implementation in the disk to starve one area of the disk because another area of the disk is busy. The fix should not be that hard: if there have not been any ordered requests (or whatever their SATA equivalent is) for the last N requests, simply mark the next request as ordered to obtain some measure of fairness. I think many SCSI adapters use 16 or 32 as the magic number; no idea what would be optimal for SATA.
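To make the proposal concrete, here is a minimal, self-contained C sketch of the counting heuristic described above. This is not libata code; the names (want_ordered, FORCE_ORDERED_EVERY, struct tagged_queue) are hypothetical, and the real change would have to live in the driver's command-issue path.

/*
 * Sketch only: count tagged commands since the last ordered/draining one,
 * and force an ordered command once the count reaches N.
 */
#include <stdbool.h>
#include <stdio.h>

#define FORCE_ORDERED_EVERY 32   /* "N"; 16 or 32 per the SCSI-adapter guess above */

struct tagged_queue {
	unsigned int since_last_ordered;  /* tagged commands issued since the
	                                     last ordered command              */
};

/* Decide whether this command should be issued as ordered.  Returns true
 * when N commands have gone by without one, which forces the drive to
 * finish everything already queued before reordering newer commands,
 * bounding how long any single request can be starved. */
static bool want_ordered(struct tagged_queue *q, bool caller_wants_ordered)
{
	if (caller_wants_ordered || q->since_last_ordered >= FORCE_ORDERED_EVERY) {
		q->since_last_ordered = 0;
		return true;
	}
	q->since_last_ordered++;
	return false;
}

int main(void)
{
	struct tagged_queue q = { 0 };

	/* Simulate a stream of 100 simple (unordered) commands. */
	for (int i = 0; i < 100; i++) {
		if (want_ordered(&q, false))
			printf("request %3d issued as ordered (drive must drain its queue)\n", i);
	}
	return 0;
}

The only per-queue state needed is a single counter; the value 32 here just mirrors the "magic number" guess above and would need tuning for SATA.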
SATA does not support ordered tags. Furthermore, we are at the mercy of non-SATA layers (SCSI, block, elevator) to send commands, or not. Any fix you feel is applicable to SATA is also applicable to any other non-ordered tagging case, such as drivers/block/sx8.
Agreed, we may be better off trying to fix this in the CFQ elevator.
Various drivers at the SCSI layer know about this issue and handle it. Libata needs to do likewise. We don't have ordered tags, but we do have the ability to stop stuffing data into a constipated drive. CFQ may be able to do limited handling of this, but it only sees a subset of the queue, so that's no help.
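For illustration, a minimal sketch (plain C, not libata code) of the "stop stuffing data into a constipated drive" idea: if the oldest outstanding tagged command has been in flight longer than some threshold, hold back new commands so the drive is forced to drain what it already holds. The names (may_issue, STARVE_LIMIT_MS, struct queue_state) and the 200 ms threshold are made up for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define STARVE_LIMIT_MS 200   /* arbitrary illustration value */

struct queue_state {
	unsigned int inflight;      /* tagged commands currently outstanding */
	uint64_t oldest_issue_ms;   /* issue time of the oldest inflight one */
};

/* Called before handing a new tagged command to the drive: returns false
 * ("hold off") when the oldest outstanding command has waited too long,
 * so new work stays in the software queue until the drive drains. */
static bool may_issue(const struct queue_state *q, uint64_t now_ms)
{
	if (q->inflight && now_ms - q->oldest_issue_ms > STARVE_LIMIT_MS)
		return false;   /* drive is constipated: let it drain first */
	return true;
}

int main(void)
{
	struct queue_state q = { .inflight = 31, .oldest_issue_ms = 1000 };

	printf("at t=1100ms: %s\n", may_issue(&q, 1100) ? "issue" : "hold");
	printf("at t=1300ms: %s\n", may_issue(&q, 1300) ? "issue" : "hold");
	return 0;
}

In a real driver this check would sit where new tagged commands are handed to the hardware, with held-back commands remaining in the software queue until the starved command completes.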
I've experienced very large I/O delays with one of our production applications as well: first when a "tar cvfj" to an NFS share was being performed, and second when about 40 GB consisting of 8 files was deleted from a local ext3 filesystem, causing over 7 seconds of delay to a write. I'm using neither Xen nor SATA; this was on an HP DL-class server using the cciss driver. Looking at sar:

# sar -f /var/log/sa/sa06 -n NFS -b | less
                   tps      rtps      wtps   bread/s   bwrtn/s
05:15:01 AM    1759.00    440.62   1318.38  23891.73  20215.73

# sar -f /var/log/sa/sa06 | less
Linux 2.6.18-8.el5 (x.om)    11/06/2007

                  CPU     %user     %nice   %system   %iowait    %steal     %idle
05:15:01 AM       all      8.29      0.01      4.28     42.54      0.00     44.88

My app log shows several large delays during which it should have been writing transaction logging to disk. SendBrdcast would usually send a heartbeat every second, but it is a single thread waiting on the disk I/O to complete.

11/06/07 05:14:14 SendBrdcast: broadcast address = x, len = 20
    >>>> 7 second outage
11/06/07 05:14:21 SendBrdcast: broadcast address = x, len = 20
    >>>> 17 second outage
11/06/07 05:14:38 TCPListen: Accepted socket from client
Re: #9 - That report is against the base OS, not Xen, and should be in a different BZ. It sounds like standard VM page pressure. Can you share "vmstat 1" output so we can see what's happening over time and whether there is a kernel problem or not? How many spindles are behind your LUN? 1759 IO/sec and 20 MB/sec read and write sound about right for a single LUN with NFS I/O.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.