Bug 1753187 - [IPI][OSP] High flake rate in parallel and serial test suites
Summary: [IPI][OSP] High flake rate in parallel and serial test suites
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Martin André
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks: 1755531
 
Reported: 2019-09-18 11:25 UTC by Martin André
Modified: 2020-01-23 11:06 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1755531
Environment:
Last Closed: 2020-01-23 11:06:16 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift installer pull 2382 (closed): Bug 1753187: OpenStack: Permit choosing the volume type for instances (last updated 2020-10-19 15:06:08 UTC)
Github openshift installer pull 2394 (closed): Bug 1754070: Allow setting root volume via installconfig (last updated 2020-10-19 15:06:08 UTC)
Github openshift release pull 5456 (closed): openstack: mount the etcd storage in a ramdisk (last updated 2020-10-19 15:06:21 UTC)
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:06:39 UTC)

Description Martin André 2019-09-18 11:25:55 UTC
The parallel [1] and serial [2] test suites for OpenStack IPI show a very high flake rate.

Initial analysis shows that the API server occasionally goes down because it loses its connection to etcd.

This is what we’re seeing in the kube-apiserver logs:

I0918 07:45:45.876569       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: []
I0918 07:45:45.876643       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.mandre.shiftstack.com:2379 <nil>} {etcd-2.mandre.shiftstack.com:2379 <nil>}]
I0918 07:45:45.876734       1 healthz.go:175] [+]terminating ok
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/kube-apiserver-requestheader-reload ok
[+]poststarthook/kube-apiserver-clientCA-reload ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-kubernetes-informers-synched ok
[+]poststarthook/openshift.io-clientCA-reload ok
[+]poststarthook/openshift.io-requestheader-reload ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed
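
As an aside, the same verbose health report can be fetched on demand from a running cluster; a minimal check, assuming an admin kubeconfig:

oc get --raw '/healthz?verbose'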

And the etcd logs show:

2019-09-18 07:45:42.803024 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:6" took too long (323.228847ms) to execute
2019-09-18 07:45:42.803062 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:6" took too long (323.466363ms) to execute
2019-09-18 07:45:45.874046 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1.99994342s) to execute
2019-09-18 07:45:45.874787 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (2.00006305s) to execute

The `etcd_disk_backend_commit_duration_seconds_bucket` and `etcd_disk_wal_fsync_duration_seconds_bucket` metrics both show operations taking a long time.
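
For reference, the p99 latencies can be pulled out of these histogram metrics with standard PromQL; a minimal sketch (the 5m rate window is an arbitrary choice):

# p99 WAL fsync latency per etcd member over the last 5 minutes
histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))

# p99 backend commit latency per etcd member over the last 5 minutes
histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])))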

Comment 2 Martin André 2019-09-18 11:45:37 UTC
Here are the results of fio running on a separate machine (not under stress) with the same specs as the master nodes:

sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][97.7%][w=595KiB/s][w=265 IOPS][eta 00m:11s]
mytest: (groupid=0, jobs=1): err= 0: pid=202: Wed Sep 18 11:42:27 2019
  write: IOPS=21, BW=48.0KiB/s (50.2kB/s)(21.0MiB/459817msec); 0 zone resets
    clat (usec): min=12, max=654, avg=26.75, stdev=22.97
     lat (usec): min=13, max=656, avg=28.16, stdev=23.17
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   22], 50.00th=[   23], 60.00th=[   24],
     | 70.00th=[   24], 80.00th=[   31], 90.00th=[   43], 95.00th=[   56],
     | 99.00th=[   80], 99.50th=[   91], 99.90th=[  529], 99.95th=[  570],
     | 99.99th=[  611]
   bw (  KiB/s): min=    4, max=  624, per=100.00%, avg=55.81, stdev=146.28, samples=791
   iops        : min=    1, max=  278, avg=25.04, stdev=65.14, samples=791
  lat (usec)   : 20=35.25%, 50=56.86%, 100=7.55%, 250=0.23%, 500=0.01%
  lat (usec)   : 750=0.11%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1751, max=5099.5k, avg=45809.43, stdev=149235.52
    sync percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    8], 90.00th=[  113], 95.00th=[  317],
     | 99.00th=[  751], 99.50th=[  844], 99.90th=[ 1116], 99.95th=[ 1334],
     | 99.99th=[ 2500]
  cpu          : usr=0.04%, sys=0.14%, ctx=35818, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=48.0KiB/s (50.2kB/s), 48.0KiB/s-48.0KiB/s (50.2kB/s-50.2kB/s), io=21.0MiB (23.1MB), run=459817-459817msec

Comment 3 Sam Batschelet 2019-09-18 12:02:24 UTC
I didn't see results this bad on the instance I tried. I added the results to the doc[1] to compare against the other clouds. Below is the summary.

These data points include the p99 for fsync[1] and the average bandwidth (bw). As noted above, these numbers were taken without workload. I would also like to note that, per the etcd FAQ[2], the p99 for fsync should be below 10 ms under load; we are more than 4x over that.

GCP 4.2
p99 fsync: 2.966 ms
bw: 1506 KiB/s

AWS 4.2
p99 fsync: 2.507 ms
bw: 1655 KiB/s

Azure 4.2
p99 fsync: 10.290 ms
bw: 465 KiB/s

OpenStack 4.2
p99 fsync: 46.0 ms
bw: 394 KiB/s


[1]: https://docs.google.com/document/d/1_Volv4eo8AWiNeYU6YPWRW_E_fZa6A7cwA_gYxVwT7E/edit
[2]: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean

Comment 4 Martin André 2019-09-18 12:33:01 UTC
I re-ran the test and got much better results:

sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][98.3%][w=456KiB/s][w=203 IOPS][eta 00m:01s]
mytest: (groupid=0, jobs=1): err= 0: pid=41: Wed Sep 18 12:30:02 2019
  write: IOPS=167, BW=377KiB/s (386kB/s)(21.0MiB/59784msec); 0 zone resets
    clat (usec): min=11, max=741, avg=22.88, stdev=21.56
     lat (usec): min=13, max=742, avg=24.18, stdev=21.61
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   20], 50.00th=[   23], 60.00th=[   23],
     | 70.00th=[   24], 80.00th=[   24], 90.00th=[   30], 95.00th=[   36],
     | 99.00th=[   53], 99.50th=[   65], 99.90th=[  172], 99.95th=[  644],
     | 99.99th=[  709]
   bw (  KiB/s): min=   17, max=  619, per=99.95%, avg=375.81, stdev=205.76, samples=119
   iops        : min=    8, max=  276, avg=167.43, stdev=91.64, samples=119
  lat (usec)   : 20=41.18%, 50=57.62%, 100=1.07%, 250=0.03%, 750=0.10%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1864, max=353528, avg=5926.76, stdev=11475.29
    sync percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    6], 90.00th=[    8], 95.00th=[   16],
     | 99.00th=[   44], 99.50th=[   60], 99.90th=[  201], 99.95th=[  213],
     | 99.99th=[  326]
  cpu          : usr=0.19%, sys=1.03%, ctx=35837, majf=0, minf=11
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=377KiB/s (386kB/s), 377KiB/s-377KiB/s (386kB/s-386kB/s), io=21.0MiB (23.1MB), run=59784-59784msec


Attaching a performance volume to the instance gives me:

sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=638KiB/s][w=284 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=35: Wed Sep 18 12:25:41 2019
  write: IOPS=301, BW=677KiB/s (693kB/s)(21.0MiB/33285msec); 0 zone resets
    clat (usec): min=8, max=713, avg=17.80, stdev=20.97
     lat (usec): min=9, max=714, avg=19.08, stdev=21.02
    clat percentiles (usec):
     |  1.00th=[   10],  5.00th=[   13], 10.00th=[   13], 20.00th=[   14],
     | 30.00th=[   14], 40.00th=[   14], 50.00th=[   18], 60.00th=[   19],
     | 70.00th=[   20], 80.00th=[   20], 90.00th=[   22], 95.00th=[   28],
     | 99.00th=[   38], 99.50th=[   43], 99.90th=[  129], 99.95th=[  635],
     | 99.99th=[  709]
   bw (  KiB/s): min=  592, max=  754, per=100.00%, avg=676.38, stdev=42.22, samples=66
   iops        : min=  264, max=  336, avg=301.39, stdev=18.79, samples=66
  lat (usec)   : 10=2.20%, 20=85.37%, 50=12.19%, 100=0.10%, 250=0.03%
  lat (usec)   : 750=0.10%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1427, max=55694, avg=3290.14, stdev=1411.63
    sync percentiles (usec):
     |  1.00th=[ 1631],  5.00th=[ 1762], 10.00th=[ 1860], 20.00th=[ 1991],
     | 30.00th=[ 2114], 40.00th=[ 2376], 50.00th=[ 3654], 60.00th=[ 3916],
     | 70.00th=[ 4080], 80.00th=[ 4293], 90.00th=[ 4621], 95.00th=[ 4883],
     | 99.00th=[ 5997], 99.50th=[ 7111], 99.90th=[13042], 99.95th=[16319],
     | 99.99th=[37487]
  cpu          : usr=0.39%, sys=1.56%, ctx=35740, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=677KiB/s (693kB/s), 677KiB/s-677KiB/s (693kB/s-693kB/s), io=21.0MiB (23.1MB), run=33285-33285msec

Disk stats (read/write):
  vdb: ios=1/25573, merge=0/4, ticks=0/33610, in_queue=24027, util=40.19%

Comment 7 W. Trevor King 2019-09-23 18:16:01 UTC
Moving back to ASSIGNED so we can hang https://github.com/openshift/installer/pull/2394 on this too, since Eric says this is a 4.2 blocker without a user-facing knob.

Comment 8 W. Trevor King 2019-09-23 18:17:10 UTC
4.2 and 4.3 have split, so retargeting to 4.3, and we'll clone a 4.2 blocker back off of this.

Comment 9 egarcia 2019-09-26 13:39:43 UTC
To resolve the disk latency problem, we needed to implement a way to boot from volume so that we can attach higher-performance volumes to instances: https://bugzilla.redhat.com/show_bug.cgi?id=1754070
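
For reference, a minimal sketch of the resulting install-config.yaml knob, assuming the rootVolume field names from the linked installer PRs; the size and the "performance" volume type are only examples and must match a Cinder volume type available in the target cloud:

controlPlane:
  name: master
  replicas: 3
  platform:
    openstack:
      rootVolume:
        size: 100          # root volume size in GiB (illustrative)
        type: performance  # example Cinder volume type; must exist in the cloud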

Comment 11 Martin André 2019-11-21 12:56:24 UTC
Implementing boot from volume helped a lot but wasn't sufficient to get good enough disk performance in CI.

We switched to putting the etcd data in a ramdisk in our CI: https://github.com/openshift/release/pull/5456
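
The general idea, sketched here as a one-off tmpfs mount on a control-plane node (the actual change in the linked release PR is wired into the CI tooling and may differ; /var/lib/etcd is the default etcd data directory on the masters):

# Illustration only: back the etcd data directory with a ramdisk.
# Data does not survive a reboot, so this is acceptable only for throwaway CI clusters.
mount -t tmpfs -o size=2G tmpfs /var/lib/etcd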

Comment 13 errata-xmlrpc 2020-01-23 11:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

