Bug 328971
Summary: | disk IO in a paravirt guest starved for seconds | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Rik van Riel <riel>
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr>
Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 5.0 | CC: | bill-bugzilla.redhat.com, dshaks, duncan.lindley, k.georgiou, lwoodman, perfbz, peterm, rhod, xen-maint
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-01-14 12:36:43 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Rik van Riel
2007-10-12 04:50:00 UTC
Running "iostat -x 3 sdc dm-4 dm-13" has yielded some strange information:

avg-cpu: %user %nice %system %iowait %steal %idle
         4.50  0.02  1.21    1.05    4.28   88.95
Device: rrqm/s wrqm/s r/s   w/s   rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     9.82   23.20  27.72 32.75 1809.70 816.19  43.43    0.84     13.89  8.03   48.55
dm-4    0.00   0.00   28.94 35.62 1689.18 472.95  33.49    0.38     5.88   7.12   45.98
dm-13   0.00   0.00   3.48  16.72 75.71   313.93  19.28    0.42     20.54  12.26  24.78

avg-cpu: %user %nice %system %iowait %steal %idle
         0.66  0.00  0.33    0.00    5.81   93.19
Device: rrqm/s wrqm/s r/s   w/s    rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     32.78  37.79  46.49 73.58  4805.35 1543.81 52.88    13.23    132.37 8.38   100.60
dm-4    0.00   0.00   76.25 107.36 4647.49 1257.53 32.16    19.37    145.53 5.48   100.60
dm-13   0.00   0.00   2.68  5.69   96.32   313.04  48.96    2.34     88.48  120.32 100.60

avg-cpu: %user %nice %system %iowait %steal %idle
         1.00  0.00  0.33    0.33    5.50   92.83
Device: rrqm/s wrqm/s r/s   w/s   rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     23.00  26.67  32.67 55.33 3453.33 1072.00 51.42    11.46    139.17 11.38  100.13
dm-4    0.00   0.00   52.33 72.33 3314.67 874.67  33.60    15.25    126.92 8.03   100.13
dm-13   0.00   0.00   3.33  7.00  66.67   176.00  23.48    2.58     306.84 96.90  100.13

avg-cpu: %user %nice %system %iowait %steal %idle
         1.00  0.00  0.33    0.17    5.99   92.51
Device: rrqm/s wrqm/s r/s   w/s    rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     23.18  41.72  36.42 79.80  3578.81 1854.30 46.75    13.82    122.01 8.56   99.47
dm-4    0.00   0.00   57.95 109.60 3570.86 1221.19 28.60    19.06    116.20 5.94   99.47
dm-13   0.00   0.00   1.99  11.59  66.23   617.22  50.34    1.96     150.93 73.07  99.21

avg-cpu: %user %nice %system %iowait %steal %idle
         1.00  0.00  0.33    0.00    4.98   93.69
Device: rrqm/s wrqm/s r/s   w/s   rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     23.26  28.57  43.85 64.12 3864.45 1132.23 46.28    12.83    117.76 9.24   99.80
dm-4    0.00   0.00   64.45 91.36 3808.64 1023.26 31.01    17.91    99.97  6.41   99.80
dm-13   0.00   0.00   0.66  4.65  10.63   143.52  29.00    2.07     546.00 187.75 99.80

avg-cpu: %user %nice %system %iowait %steal %idle
         1.00  0.00  0.33    0.00    5.18   93.48
Device: rrqm/s wrqm/s r/s   w/s    rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     28.09  35.12  35.45 69.90  4099.00 1147.83 49.80    11.57    107.96 9.54   100.47
dm-4    0.00   0.00   62.21 107.02 3970.57 1182.61 30.45    18.84    121.45 5.94   100.47
dm-13   0.00   0.00   0.00  0.00   0.00    0.00    0.00     1.04     0.00   0.00   100.47

avg-cpu: %user %nice %system %iowait %steal %idle
         1.00  0.00  0.33    0.17    4.49   94.02
Device: rrqm/s wrqm/s r/s   w/s   rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     39.00  22.00  51.00 53.67 5730.67 1197.33 66.19    11.60    117.34 9.57   100.13
dm-4    0.00   0.00   90.67 65.67 5792.00 920.00  42.93    18.04    118.35 6.41   100.13
dm-13   0.00   0.00   0.67  5.00  10.67   210.67  39.06    1.24     305.88 176.71 100.13

avg-cpu: %user %nice %system %iowait %steal %idle
         1.16  0.00  0.66    0.17    6.80   91.21
Device: rrqm/s wrqm/s r/s   w/s   rsec/s  wsec/s  avgrq-sz avgqu-sz await  svctm  %util
sdc     42.05  16.56  44.04 47.35 5650.33 1017.22 72.96    9.54     104.91 10.88  99.47
dm-4    0.00   0.00   88.08 55.63 5843.71 839.74  46.51    15.42    104.93 6.92   99.47
dm-13   0.00   0.00   0.33  7.95  5.30    174.83  21.76    1.12     183.04 120.16 99.47

Notice how sometimes dm-13 just gets starved. Its requests sit in the queue and are not serviced anywhere near as fast as the requests to dm-4.

Switching to the anticipatory scheduler seems to be a small improvement over CFQ in this case: the gross unfairness goes away, and both domains run somewhat slowly. The deadline scheduler seems to make things even worse than CFQ: request latencies of well over a second have been seen.

I talked it over with Alan, and it turns out the Linux ATA layer does not throw in an ordered request every few requests to enforce fairness (like the SCSI layer does). This allows the NCQ implementation in the disk to starve one area of the disk because another area of the disk is busy.
The fix should not be that hard: if there have not been any ordered requests (whatever their SATA equivalent is) for the last N requests, simply make a request ordered to obtain some measure of fairness. I think many SCSI adapters use 16 or 32 as the magic number. No idea what would be optimal for SATA.

SATA does not support ordered tags. Furthermore, we are at the mercy of the non-SATA layers (SCSI, block, elevator) to send commands, or not. Any fix you feel is applicable to SATA is also applicable to any other non-ordered-tagging case, such as drivers/block/sx8.

Agreed, we may be better off trying to fix this in the CFQ elevator.

Various drivers at the SCSI layer know about this issue and handle it. Libata needs to do likewise. We don't have ordered tags, but we do have the ability to stop stuffing data into a constipated drive.

CFQ may be able to do limited handling of this, but it only sees a subset of the queue, so that's no help.

I've experienced very large I/O delays with one of our production applications as well: first when a "tar cvfj" to an NFS share was being performed, and second when about 40 GB consisting of 8 files was deleted from a local ext3 filesystem, causing over 7 seconds of delay to a write. I'm using neither Xen nor SATA. This was on an HP DL-class server using the cciss driver.

Looking at sar:

# sar -f /var/log/sa/sa06 -n NFS -b | less
             tps      rtps    wtps    bread/s   bwrtn/s
05:15:01 AM  1759.00  440.62  1318.38 23891.73  20215.73

# sar -f /var/log/sa/sa06 | less
Linux 2.6.18-8.el5 (x.om)   11/06/2007
             CPU  %user  %nice  %system  %iowait  %steal  %idle
05:15:01 AM  all  8.29   0.01   4.28     42.54    0.00    44.88

My app log shows several large delays during which it should be writing transaction logging to disk. SendBrdcast would usually send a heartbeat every second, but it is a single thread waiting on the disk I/O to complete.
11/06/07 05:14:14 SendBrdcast: broadcast address = x, len = 20
>>>> 7 second outage
11/06/07 05:14:21 SendBrdcast: broadcast address = x, len = 20
>>>> 17 second outage
11/06/07 05:14:38 TCPListen: Accepted socket from client

Re: #9 - This reply was on the base OS, not Xen. It should be in a different BZ.

It sounds like standard VM page pressure. Can you share a "vmstat 1" trace? Seeing what happens over time is needed to tell whether or not there is a kernel problem. How many spindles are behind your LUN? 1759 IO/sec and 20 MB/sec read and write sounds about correct for a single LUN with NFS I/O.

This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.