The parallel [1] and serial [2] test suites for OpenStack IPI show a very high flake rate. The initial analysis shows the API is occasionally going down because it loses connection to etcd. This is what we're seeing in the kube-apiserver logs:

```
I0918 07:45:45.876569 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: []
I0918 07:45:45.876643 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.mandre.shiftstack.com:2379 <nil>} {etcd-2.mandre.shiftstack.com:2379 <nil>}]
I0918 07:45:45.876734 1 healthz.go:175] [+]terminating ok
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/kube-apiserver-requestheader-reload ok
[+]poststarthook/kube-apiserver-clientCA-reload ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-kubernetes-informers-synched ok
[+]poststarthook/openshift.io-clientCA-reload ok
[+]poststarthook/openshift.io-requestheader-reload ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed
```

And the etcd logs show:

```
2019-09-18 07:45:42.803024 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:6" took too long (323.228847ms) to execute
2019-09-18 07:45:42.803062 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:6" took too long (323.466363ms) to execute
2019-09-18 07:45:45.874046 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1.99994342s) to execute
2019-09-18 07:45:45.874787 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (2.00006305s) to execute
```

The `etcd_disk_backend_commit_duration_seconds_bucket` and `etcd_disk_wal_fsync_duration_seconds_bucket` metrics both show operations taking a long time.
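For reference, the p99 values can be pulled straight out of those histograms with PromQL's standard `histogram_quantile` function. A minimal sketch, assuming the usual in-cluster Prometheus route (the hostname below is a placeholder) and token auth:

```
# p99 WAL fsync latency over the last 5 minutes; swap in
# etcd_disk_backend_commit_duration_seconds_bucket for the commit histogram.
TOKEN=$(oc whoami -t)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://prometheus-k8s-openshift-monitoring.apps.example.com/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
```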
Here are the results of fio running on a separate machine (not under stress) with the same specs as the master nodes:

```
sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][97.7%][w=595KiB/s][w=265 IOPS][eta 00m:11s]
mytest: (groupid=0, jobs=1): err= 0: pid=202: Wed Sep 18 11:42:27 2019
  write: IOPS=21, BW=48.0KiB/s (50.2kB/s)(21.0MiB/459817msec); 0 zone resets
    clat (usec): min=12, max=654, avg=26.75, stdev=22.97
     lat (usec): min=13, max=656, avg=28.16, stdev=23.17
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   22], 50.00th=[   23], 60.00th=[   24],
     | 70.00th=[   24], 80.00th=[   31], 90.00th=[   43], 95.00th=[   56],
     | 99.00th=[   80], 99.50th=[   91], 99.90th=[  529], 99.95th=[  570],
     | 99.99th=[  611]
   bw (  KiB/s): min=    4, max=  624, per=100.00%, avg=55.81, stdev=146.28, samples=791
   iops        : min=    1, max=  278, avg=25.04, stdev=65.14, samples=791
  lat (usec)   : 20=35.25%, 50=56.86%, 100=7.55%, 250=0.23%, 500=0.01%
  lat (usec)   : 750=0.11%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1751, max=5099.5k, avg=45809.43, stdev=149235.52
    sync percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    8], 90.00th=[  113], 95.00th=[  317],
     | 99.00th=[  751], 99.50th=[  844], 99.90th=[ 1116], 99.95th=[ 1334],
     | 99.99th=[ 2500]
  cpu          : usr=0.04%, sys=0.14%, ctx=35818, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=48.0KiB/s (50.2kB/s), 48.0KiB/s-48.0KiB/s (50.2kB/s-50.2kB/s), io=21.0MiB (23.1MB), run=459817-459817msec
```
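To compare runs without eyeballing the percentile tables, the p99 fdatasync latency can also be extracted programmatically. A sketch assuming fio's JSON output schema (the `jq` path below matches fio 3.x as far as I know, with latencies reported in nanoseconds; verify locally):

```
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data \
    --size=22m --bs=2300 --name=mytest --output-format=json \
  | jq '.jobs[0].sync.lat_ns.percentile."99.000000"'
```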
I didn't see it this bad on the instance I tried. I added the results to the doc [1] to compare against the other clouds. Below is the summary. These data points include the p99 for fsync and the average bandwidth rate (bw). Also, as noted above, these numbers were taken without workload. I would also like to note that, per the etcd FAQ [2], the p99 for fsync should be below 10ms under load. We are at 4x+ that...

GCP 4.2:       p99 fsync: 2.966 ms,  bw: 1506KiB/s
AWS 4.2:       p99 fsync: 2.507 ms,  bw: 1655KiB/s
Azure 4.2:     p99 fsync: 10.290 ms, bw: 465KiB/s
OpenStack 4.2: p99 fsync: 46.0 ms,   bw: 394KiB/s

[1]: https://docs.google.com/document/d/1_Volv4eo8AWiNeYU6YPWRW_E_fZa6A7cwA_gYxVwT7E/edit
[2]: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean
I re-ran the test and got much better results:

```
sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][98.3%][w=456KiB/s][w=203 IOPS][eta 00m:01s]
mytest: (groupid=0, jobs=1): err= 0: pid=41: Wed Sep 18 12:30:02 2019
  write: IOPS=167, BW=377KiB/s (386kB/s)(21.0MiB/59784msec); 0 zone resets
    clat (usec): min=11, max=741, avg=22.88, stdev=21.56
     lat (usec): min=13, max=742, avg=24.18, stdev=21.61
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   20], 50.00th=[   23], 60.00th=[   23],
     | 70.00th=[   24], 80.00th=[   24], 90.00th=[   30], 95.00th=[   36],
     | 99.00th=[   53], 99.50th=[   65], 99.90th=[  172], 99.95th=[  644],
     | 99.99th=[  709]
   bw (  KiB/s): min=   17, max=  619, per=99.95%, avg=375.81, stdev=205.76, samples=119
   iops        : min=    8, max=  276, avg=167.43, stdev=91.64, samples=119
  lat (usec)   : 20=41.18%, 50=57.62%, 100=1.07%, 250=0.03%, 750=0.10%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1864, max=353528, avg=5926.76, stdev=11475.29
    sync percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    6], 90.00th=[    8], 95.00th=[   16],
     | 99.00th=[   44], 99.50th=[   60], 99.90th=[  201], 99.95th=[  213],
     | 99.99th=[  326]
  cpu          : usr=0.19%, sys=1.03%, ctx=35837, majf=0, minf=11
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=377KiB/s (386kB/s), 377KiB/s-377KiB/s (386kB/s-386kB/s), io=21.0MiB (23.1MB), run=59784-59784msec
```

Attaching a performance volume to the instance gives me:

```
sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=638KiB/s][w=284 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=35: Wed Sep 18 12:25:41 2019
  write: IOPS=301, BW=677KiB/s (693kB/s)(21.0MiB/33285msec); 0 zone resets
    clat (usec): min=8, max=713, avg=17.80, stdev=20.97
     lat (usec): min=9, max=714, avg=19.08, stdev=21.02
    clat percentiles (usec):
     |  1.00th=[   10],  5.00th=[   13], 10.00th=[   13], 20.00th=[   14],
     | 30.00th=[   14], 40.00th=[   14], 50.00th=[   18], 60.00th=[   19],
     | 70.00th=[   20], 80.00th=[   20], 90.00th=[   22], 95.00th=[   28],
     | 99.00th=[   38], 99.50th=[   43], 99.90th=[  129], 99.95th=[  635],
     | 99.99th=[  709]
   bw (  KiB/s): min=  592, max=  754, per=100.00%, avg=676.38, stdev=42.22, samples=66
   iops        : min=  264, max=  336, avg=301.39, stdev=18.79, samples=66
  lat (usec)   : 10=2.20%, 20=85.37%, 50=12.19%, 100=0.10%, 250=0.03%
  lat (usec)   : 750=0.10%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1427, max=55694, avg=3290.14, stdev=1411.63
    sync percentiles (usec):
     |  1.00th=[ 1631],  5.00th=[ 1762], 10.00th=[ 1860], 20.00th=[ 1991],
     | 30.00th=[ 2114], 40.00th=[ 2376], 50.00th=[ 3654], 60.00th=[ 3916],
     | 70.00th=[ 4080], 80.00th=[ 4293], 90.00th=[ 4621], 95.00th=[ 4883],
     | 99.00th=[ 5997], 99.50th=[ 7111], 99.90th=[13042], 99.95th=[16319],
     | 99.99th=[37487]
  cpu          : usr=0.39%, sys=1.56%, ctx=35740, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=677KiB/s (693kB/s), 677KiB/s-677KiB/s (693kB/s-693kB/s), io=21.0MiB (23.1MB), run=33285-33285msec

Disk stats (read/write):
  vdb: ios=1/25573, merge=0/4, ticks=0/33610, in_queue=24027, util=40.19%
```
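For anyone reproducing this, a rough sketch of attaching such a volume with the OpenStack CLI; the "high-iops" volume type, the size, and the server name are assumptions for illustration:

```
# Create a volume using a faster Cinder volume type and attach it.
openstack volume create --type high-iops --size 50 etcd-perf-test
openstack server add volume master-0 etcd-perf-test --device /dev/vdb

# On the instance: format, mount, and re-run fio against the new disk.
mkfs.xfs /dev/vdb
mount /dev/vdb /mnt/test-data
```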
Moving back to ASSIGNED so we can hang https://github.com/openshift/installer/pull/2394 on this too, since Eric says this is a 4.2 blocker without a user-facing knob.
4.2 and 4.3 have split, so retargeting to 4.3, and we'll clone a 4.2 blocker back off of this.
In order to resolve the disk latency problem, we needed to implement a way to boot from volume so that we could attach higher-performance volumes to instances: https://bugzilla.redhat.com/show_bug.cgi?id=1754070
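For reference, a minimal sketch of what the knob looks like in install-config.yaml with boot-from-volume in place; the `rootVolume` field is the machine-pool setting that feature adds, while the size and volume type values here are only examples:

```
controlPlane:
  platform:
    openstack:
      rootVolume:
        size: 100         # GiB; example value
        type: performance # must be an existing Cinder volume type
```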
Implementing boot from volume helped a lot but wasn't sufficient to get good enough disk performance in CI. We switched to putting etcd data on a ramdisk in https://github.com/openshift/release/pull/5456 for our CI.
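Conceptually, the CI change boils down to something like the following (a minimal sketch; the actual PR may wire this up differently, e.g. through the machine configuration, and the size is an example):

```
# Mount a tmpfs over the etcd data directory so writes never hit the slow
# virtual disk. Data is lost on reboot; acceptable for short-lived CI
# clusters, not for production.
mount -t tmpfs -o size=2G tmpfs /var/lib/etcd
```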
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062