Bug 328971

Summary: disk IO in a paravirt guest starved for seconds
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: high
Target Milestone: ---
Target Release: ---

Reporter: Rik van Riel <riel>
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
Docs Contact:
CC: bill-bugzilla.redhat.com, dshaks, duncan.lindley, k.georgiou, lwoodman, perfbz, peterm, rhod, xen-maint

Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-14 12:36:43 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Rik van Riel 2007-10-12 04:50:00 UTC
Description of problem:

It looks like there is a strange interaction between Xen and the dom0 IO
scheduler.  Sometimes one guest is doing heavy IO while another guest has a
handful of processes sitting in iowait, yet no IO at all is completed for
several seconds.

Needless to say, this unfairness interferes with normal operation of the
starved guest and can cause trouble.  CFQ in dom0 should arbitrate fairly
between the guests, but it does not appear to be doing so.

Version-Release number of selected component (if applicable):

2.6.18-8.el5xen

Steps to Reproduce:
1. run several Xen guests, some of which have data on the same hard disk
2. run "vmstat 1" in each guest while IO-heavy workloads run
3. watch one guest get starved of disk IO for seconds at a time (intermittently)
  
Expected results:

Both guests get fair access to the disk, with relatively low latency.

Comment 1 Rik van Riel 2007-10-12 04:55:08 UTC
Running "iostat -x 3 sdc dm-4 dm-13" has yielded some strange information:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.50    0.02    1.21    1.05    4.28   88.95

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               9.82    23.20 27.72 32.75  1809.70   816.19    43.43     0.84   13.89   8.03  48.55
dm-4              0.00     0.00 28.94 35.62  1689.18   472.95    33.49     0.38    5.88   7.12  45.98
dm-13             0.00     0.00  3.48 16.72    75.71   313.93    19.28     0.42   20.54  12.26  24.78

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.66    0.00    0.33    0.00    5.81   93.19

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              32.78    37.79 46.49 73.58  4805.35  1543.81    52.88    13.23  132.37   8.38 100.60
dm-4              0.00     0.00 76.25 107.36  4647.49  1257.53    32.16    19.37  145.53   5.48 100.60
dm-13             0.00     0.00  2.68  5.69    96.32   313.04    48.96     2.34   88.48 120.32 100.60

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.33    0.33    5.50   92.83

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              23.00    26.67 32.67 55.33  3453.33  1072.00    51.42    11.46  139.17  11.38 100.13
dm-4              0.00     0.00 52.33 72.33  3314.67   874.67    33.60    15.25  126.92   8.03 100.13
dm-13             0.00     0.00  3.33  7.00    66.67   176.00    23.48     2.58  306.84  96.90 100.13

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.33    0.17    5.99   92.51

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              23.18    41.72 36.42 79.80  3578.81  1854.30    46.75    13.82  122.01   8.56  99.47
dm-4              0.00     0.00 57.95 109.60  3570.86  1221.19    28.60    19.06  116.20   5.94  99.47
dm-13             0.00     0.00  1.99 11.59    66.23   617.22    50.34     1.96  150.93  73.07  99.21

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.33    0.00    4.98   93.69

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              23.26    28.57 43.85 64.12  3864.45  1132.23    46.28    12.83  117.76   9.24  99.80
dm-4              0.00     0.00 64.45 91.36  3808.64  1023.26    31.01    17.91   99.97   6.41  99.80
dm-13             0.00     0.00  0.66  4.65    10.63   143.52    29.00     2.07  546.00 187.75  99.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.33    0.00    5.18   93.48

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              28.09    35.12 35.45 69.90  4099.00  1147.83    49.80    11.57  107.96   9.54 100.47
dm-4              0.00     0.00 62.21 107.02  3970.57  1182.61    30.45    18.84  121.45   5.94 100.47
dm-13             0.00     0.00  0.00  0.00     0.00     0.00     0.00     1.04    0.00   0.00 100.47

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.33    0.17    4.49   94.02

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              39.00    22.00 51.00 53.67  5730.67  1197.33    66.19    11.60  117.34   9.57 100.13
dm-4              0.00     0.00 90.67 65.67  5792.00   920.00    42.93    18.04  118.35   6.41 100.13
dm-13             0.00     0.00  0.67  5.00    10.67   210.67    39.06     1.24  305.88 176.71 100.13

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.16    0.00    0.66    0.17    6.80   91.21

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              42.05    16.56 44.04 47.35  5650.33  1017.22    72.96     9.54  104.91  10.88  99.47
dm-4              0.00     0.00 88.08 55.63  5843.71   839.74    46.51    15.42  104.93   6.92  99.47
dm-13             0.00     0.00  0.33  7.95     5.30   174.83    21.76     1.12  183.04 120.16  99.47

Notice how dm-13 sometimes gets starved outright.  Its requests sit in the
queue (avgqu-sz stays above 1 even in the interval where r/s and w/s drop to
zero) and are not serviced anywhere near as fast as the requests to dm-4.

Comment 2 Rik van Riel 2007-10-12 05:11:57 UTC
Switching to the anticipatory scheduler seems to be a small improvement over
CFQ in this case: the gross unfairness goes away and both domains just run
somewhat slowly.

The deadline scheduler seems to make things even worse than CFQ: I have seen
request latencies of well over a second.

Comment 3 Rik van Riel 2007-10-12 16:10:59 UTC
I talked it over with Alan, and it turns out the Linux ATA layer does not
insert an ordered request every few requests to enforce fairness (the way the
SCSI layer does).

This allows the NCQ implementation in the disk to starve requests to one area
of the disk because another area of the disk is busy.

The fix should not be that hard: if there have not been any ordered requests
(or whatever their SATA equivalent is) among the last N requests, simply make
the next request ordered to get some measure of fairness.

I think many SCSI adapters use 16 or 32 as the magic number. No idea what would
be optimal for SATA.
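
To make the proposed heuristic concrete, here is a minimal sketch of just the
counting rule, as a standalone userspace C program.  This is not libata or
block-layer code: ORDERED_INTERVAL, issue_as_ordered() and the simulation loop
are all illustrative assumptions; a real fix would have to live in libata or
the elevator.

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative sketch only -- not libata or block-layer code.
 * Rule from the comment above: once N consecutive requests have gone out
 * without an ordered one, issue the next request as ordered so that it acts
 * as a fairness barrier and already-queued requests must complete first.
 * N = 32 mirrors the "magic number" some SCSI adapters reportedly use; the
 * right value for SATA/NCQ is exactly what this bug leaves open.
 */
#define ORDERED_INTERVAL 32

static unsigned int since_last_ordered;

static bool issue_as_ordered(void)
{
    if (since_last_ordered >= ORDERED_INTERVAL) {
        since_last_ordered = 0;   /* the ordered request resets the window */
        return true;
    }
    since_last_ordered++;
    return false;
}

int main(void)
{
    /* Simulate a stream of 100 requests and show where barriers would land. */
    for (int i = 1; i <= 100; i++) {
        if (issue_as_ordered())
            printf("request %3d would be issued as an ordered request\n", i);
    }
    return 0;
}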

Comment 6 Jeff Garzik 2007-11-08 16:40:52 UTC
SATA does not support ordered tags.

Furthermore, we are at the mercy of non-SATA layers (SCSI, block, elevator) to
send commands, or not.

Any fix you feel is applicable to SATA is also applicable to any other
non-ordered tagging case, such as drivers/block/sx8.

Comment 7 Rik van Riel 2007-11-08 17:12:01 UTC
Agreed, we may be better off trying to fix this in the CFQ elevator.

Comment 8 Alan Cox 2007-11-08 17:33:19 UTC
Various drivers at the SCSI layer know about this issue and handle it.

Libata needs to do likewise. We don't have ordered tags, but we do have the
ability to stop stuffing data into a constipated drive.

CFQ may be able to do limited handling of this, but it only sees a subset of
the queue, so that's no help.
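
The ability described here -- holding back new commands from a drive that is
starving an old one -- can be sketched as a simple depth-throttling check.
The snippet below is plain userspace C and purely hypothetical: the hw_queue
structure, the 500 ms threshold and may_issue() are assumptions made for
illustration, not real libata interfaces.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch of "stop stuffing data into a constipated drive".
 * None of these types, names or thresholds come from libata; they only
 * illustrate the decision: hold back new NCQ commands while the oldest
 * outstanding command has been waiting too long.
 */
#define MAX_TAGS            32      /* typical NCQ queue depth */
#define STARVATION_LIMIT_MS 500     /* illustrative age threshold */

struct hw_queue {
    unsigned int outstanding;          /* commands currently on the drive */
    uint64_t     oldest_issue_time_ms; /* issue time of the oldest command */
};

static bool may_issue(const struct hw_queue *q, uint64_t now_ms)
{
    if (q->outstanding == 0)
        return true;                   /* idle drive: always issue */

    if (q->outstanding >= MAX_TAGS)
        return false;                  /* hardware tag limit reached */

    /* Fairness throttle: if the oldest command has waited too long, stop
     * feeding the drive new work so it has to finish what it already has. */
    if (now_ms - q->oldest_issue_time_ms > STARVATION_LIMIT_MS)
        return false;

    return true;
}

int main(void)
{
    struct hw_queue q = { .outstanding = 4, .oldest_issue_time_ms = 1000 };

    /* Oldest command is 200 ms old: keep issuing. */
    printf("t=1200ms: may issue? %s\n", may_issue(&q, 1200) ? "yes" : "no");
    /* Oldest command is 700 ms old: throttle until it completes. */
    printf("t=1700ms: may issue? %s\n", may_issue(&q, 1700) ? "yes" : "no");
    return 0;
}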


Comment 9 Duncan Lindley 2007-11-16 16:37:23 UTC
I've experienced very large IO delays with one of our production applications
as well: first while a "tar cvfj" to an NFS share was running, and again when
about 40 GB consisting of 8 files was deleted from a local ext3 filesystem,
which delayed a write by over 7 seconds.

I'm not using Xen or SATA.

This was on an HP DL-class server using the cciss driver.

Looking at sar:

# sar -f /var/log/sa/sa06 -n NFS -b | less
                tps      rtps      wtps   bread/s   bwrtn/s
05:15:01 AM   1759.00    440.62   1318.38  23891.73  20215.73

# sar -f /var/log/sa/sa06 | less
Linux 2.6.18-8.el5 (x.om)        11/06/2007
                   CPU     %user     %nice   %system   %iowait    %steal     %idle
05:15:01 AM       all      8.29      0.01      4.28     42.54      0.00     44.88

My application log shows several large delays during periods when it should
have been writing transaction logs to disk. SendBrdcast would normally send a
heartbeat every second, but it is a single thread and was stuck waiting for
the disk IO to complete.

11/06/07  05:14:14  SendBrdcast: broadcast address = x, len = 20
>>>> 7 second outage
11/06/07  05:14:21  SendBrdcast: broadcast address = x, len = 20
>>>> 17 second outage
11/06/07  05:14:38  TCPListen: Accepted socket from client 



Comment 10 John Shakshober 2007-11-16 18:04:16 UTC
Re: #9 - this report is against the base OS, not Xen, so it should go in a
separate BZ.  It sounds like standard VM page pressure.  Can you share
"vmstat 1" output so we can see what is happening over time and whether there
is a kernel problem at all?

How many spindles are behind your LUN?  1759 IO/sec and ~20 MB/sec of read and
write sounds about right for a single LUN doing NFS I/O.


Comment 11 RHEL Program Management 2008-03-11 19:39:37 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.