The parallel [1] and serial [2] test suites for OpenStack IPI show a very high flake rate. The initial analysis shows the API is occasionally going down because it loses connection to etcd. This is what we're seeing in the kube-apiserver logs:

```
I0918 07:45:45.876569 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: []
I0918 07:45:45.876643 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.mandre.shiftstack.com:2379 <nil>} {etcd-2.mandre.shiftstack.com:2379 <nil>}]
I0918 07:45:45.876734 1 healthz.go:175] [+]terminating ok
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/kube-apiserver-requestheader-reload ok
[+]poststarthook/kube-apiserver-clientCA-reload ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-kubernetes-informers-synched ok
[+]poststarthook/openshift.io-clientCA-reload ok
[+]poststarthook/openshift.io-requestheader-reload ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed
```

And the etcd logs show:

```
2019-09-18 07:45:42.803024 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:6" took too long (323.228847ms) to execute
2019-09-18 07:45:42.803062 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:6" took too long (323.466363ms) to execute
2019-09-18 07:45:45.874046 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1.99994342s) to execute
2019-09-18 07:45:45.874787 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (2.00006305s) to execute
```

The `etcd_disk_backend_commit_duration_seconds_bucket` and `etcd_disk_wal_fsync_duration_seconds_bucket` metrics both show operations taking a long time.
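For reference, the p99 values can be pulled straight out of those histograms with PromQL's standard `histogram_quantile` function. A minimal sketch, assuming the usual in-cluster Prometheus route (the hostname below is a placeholder) and token auth:

```
# p99 WAL fsync latency over the last 5 minutes; swap in
# etcd_disk_backend_commit_duration_seconds_bucket for the commit histogram.
TOKEN=$(oc whoami -t)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://prometheus-k8s-openshift-monitoring.apps.example.com/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
```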
Here are the results of fio running on a separate machine (not under stress) with the same specs as the master nodes:

```
sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][97.7%][w=595KiB/s][w=265 IOPS][eta 00m:11s]
mytest: (groupid=0, jobs=1): err= 0: pid=202: Wed Sep 18 11:42:27 2019
  write: IOPS=21, BW=48.0KiB/s (50.2kB/s)(21.0MiB/459817msec); 0 zone resets
    clat (usec): min=12, max=654, avg=26.75, stdev=22.97
     lat (usec): min=13, max=656, avg=28.16, stdev=23.17
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   22], 50.00th=[   23], 60.00th=[   24],
     | 70.00th=[   24], 80.00th=[   31], 90.00th=[   43], 95.00th=[   56],
     | 99.00th=[   80], 99.50th=[   91], 99.90th=[  529], 99.95th=[  570],
     | 99.99th=[  611]
   bw (  KiB/s): min=    4, max=  624, per=100.00%, avg=55.81, stdev=146.28, samples=791
   iops        : min=    1, max=  278, avg=25.04, stdev=65.14, samples=791
  lat (usec)   : 20=35.25%, 50=56.86%, 100=7.55%, 250=0.23%, 500=0.01%
  lat (usec)   : 750=0.11%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1751, max=5099.5k, avg=45809.43, stdev=149235.52
    sync percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    8], 90.00th=[  113], 95.00th=[  317],
     | 99.00th=[  751], 99.50th=[  844], 99.90th=[ 1116], 99.95th=[ 1334],
     | 99.99th=[ 2500]
  cpu          : usr=0.04%, sys=0.14%, ctx=35818, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=48.0KiB/s (50.2kB/s), 48.0KiB/s-48.0KiB/s (50.2kB/s-50.2kB/s), io=21.0MiB (23.1MB), run=459817-459817msec
```
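To compare runs without eyeballing the percentile tables, the p99 fdatasync latency can also be extracted programmatically. A sketch assuming fio's JSON output schema (the `jq` path below matches fio 3.x as far as I know, with latencies reported in nanoseconds; verify locally):

```
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data \
    --size=22m --bs=2300 --name=mytest --output-format=json \
  | jq '.jobs[0].sync.lat_ns.percentile."99.000000"'
```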
I didn't see it this bad on the instance I tried. I added the results to the doc [1] to compare against the other clouds. Below is the summary. These data points include the p99 for fsync and the average bandwidth rate (bw). Also, as noted above, these numbers were taken without workload. I would also like to note that, per the etcd FAQ [2], the p99 for fsync should be below 10ms under load. We are at 4x+ that...

GCP 4.2:       p99 fsync: 2.966 ms,  bw: 1506KiB/s
AWS 4.2:       p99 fsync: 2.507 ms,  bw: 1655KiB/s
Azure 4.2:     p99 fsync: 10.290 ms, bw: 465KiB/s
OpenStack 4.2: p99 fsync: 46.0 ms,   bw: 394KiB/s

[1]: https://docs.google.com/document/d/1_Volv4eo8AWiNeYU6YPWRW_E_fZa6A7cwA_gYxVwT7E/edit
[2]: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean
I re-ran the test and got much better results:

```
sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][98.3%][w=456KiB/s][w=203 IOPS][eta 00m:01s]
mytest: (groupid=0, jobs=1): err= 0: pid=41: Wed Sep 18 12:30:02 2019
  write: IOPS=167, BW=377KiB/s (386kB/s)(21.0MiB/59784msec); 0 zone resets
    clat (usec): min=11, max=741, avg=22.88, stdev=21.56
     lat (usec): min=13, max=742, avg=24.18, stdev=21.61
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   20], 50.00th=[   23], 60.00th=[   23],
     | 70.00th=[   24], 80.00th=[   24], 90.00th=[   30], 95.00th=[   36],
     | 99.00th=[   53], 99.50th=[   65], 99.90th=[  172], 99.95th=[  644],
     | 99.99th=[  709]
   bw (  KiB/s): min=   17, max=  619, per=99.95%, avg=375.81, stdev=205.76, samples=119
   iops        : min=    8, max=  276, avg=167.43, stdev=91.64, samples=119
  lat (usec)   : 20=41.18%, 50=57.62%, 100=1.07%, 250=0.03%, 750=0.10%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1864, max=353528, avg=5926.76, stdev=11475.29
    sync percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    6], 90.00th=[    8], 95.00th=[   16],
     | 99.00th=[   44], 99.50th=[   60], 99.90th=[  201], 99.95th=[  213],
     | 99.99th=[  326]
  cpu          : usr=0.19%, sys=1.03%, ctx=35837, majf=0, minf=11
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=377KiB/s (386kB/s), 377KiB/s-377KiB/s (386kB/s-386kB/s), io=21.0MiB (23.1MB), run=59784-59784msec
```

Attaching a performance volume to the instance gives me:

```
sh-5.0# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.13
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=638KiB/s][w=284 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=35: Wed Sep 18 12:25:41 2019
  write: IOPS=301, BW=677KiB/s (693kB/s)(21.0MiB/33285msec); 0 zone resets
    clat (usec): min=8, max=713, avg=17.80, stdev=20.97
     lat (usec): min=9, max=714, avg=19.08, stdev=21.02
    clat percentiles (usec):
     |  1.00th=[   10],  5.00th=[   13], 10.00th=[   13], 20.00th=[   14],
     | 30.00th=[   14], 40.00th=[   14], 50.00th=[   18], 60.00th=[   19],
     | 70.00th=[   20], 80.00th=[   20], 90.00th=[   22], 95.00th=[   28],
     | 99.00th=[   38], 99.50th=[   43], 99.90th=[  129], 99.95th=[  635],
     | 99.99th=[  709]
   bw (  KiB/s): min=  592, max=  754, per=100.00%, avg=676.38, stdev=42.22, samples=66
   iops        : min=  264, max=  336, avg=301.39, stdev=18.79, samples=66
  lat (usec)   : 10=2.20%, 20=85.37%, 50=12.19%, 100=0.10%, 250=0.03%
  lat (usec)   : 750=0.10%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1427, max=55694, avg=3290.14, stdev=1411.63
    sync percentiles (usec):
     |  1.00th=[ 1631],  5.00th=[ 1762], 10.00th=[ 1860], 20.00th=[ 1991],
     | 30.00th=[ 2114], 40.00th=[ 2376], 50.00th=[ 3654], 60.00th=[ 3916],
     | 70.00th=[ 4080], 80.00th=[ 4293], 90.00th=[ 4621], 95.00th=[ 4883],
     | 99.00th=[ 5997], 99.50th=[ 7111], 99.90th=[13042], 99.95th=[16319],
     | 99.99th=[37487]
  cpu          : usr=0.39%, sys=1.56%, ctx=35740, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=677KiB/s (693kB/s), 677KiB/s-677KiB/s (693kB/s-693kB/s), io=21.0MiB (23.1MB), run=33285-33285msec

Disk stats (read/write):
  vdb: ios=1/25573, merge=0/4, ticks=0/33610, in_queue=24027, util=40.19%
```
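For anyone reproducing this, a rough sketch of attaching such a volume with the OpenStack CLI; the "high-iops" volume type, the size, and the server name are assumptions for illustration:

```
# Create a volume using a faster Cinder volume type and attach it.
openstack volume create --type high-iops --size 50 etcd-perf-test
openstack server add volume master-0 etcd-perf-test --device /dev/vdb

# On the instance: format, mount, and re-run fio against the new disk.
mkfs.xfs /dev/vdb
mount /dev/vdb /mnt/test-data
```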
Moving back to ASSIGNED so we can hang https://github.com/openshift/installer/pull/2394 on this too, since Eric says this is a 4.2 blocker without a user-facing knob.
4.2 and 4.3 have split, so retargeting to 4.3, and we'll clone a 4.2 blocker back off of this.
In order to resolve the disk latency problem, we needed to implement a way to boot from volume so that we could attach higher-performance volumes to instances: https://bugzilla.redhat.com/show_bug.cgi?id=1754070
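For reference, a minimal sketch of what the knob looks like in install-config.yaml with boot-from-volume in place; the `rootVolume` field is the machine-pool setting that feature adds, while the size and volume type values here are only examples:

```
controlPlane:
  platform:
    openstack:
      rootVolume:
        size: 100         # GiB; example value
        type: performance # must be an existing Cinder volume type
```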
Implementing boot from volume helped a lot but wasn't sufficient to get good enough disk performance in CI. We switched to putting etcd data on a ramdisk in https://github.com/openshift/release/pull/5456 for our CI.
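Conceptually, the CI change boils down to something like the following (a minimal sketch; the actual PR may wire this up differently, e.g. through the machine configuration, and the size is an example):

```
# Mount a tmpfs over the etcd data directory so writes never hit the slow
# virtual disk. Data is lost on reboot; acceptable for short-lived CI
# clusters, not for production.
mount -t tmpfs -o size=2G tmpfs /var/lib/etcd
```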
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062