Bug 1850057 - e2e-azure-upgrade-4.4-stable-to-4.5-ci failing with API unreachable for ~28% of the upgrade time
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard: coreos
Depends On: 1852047 1861507 1852058 1852565
Blocks:
Reported: 2020-06-23 13:32 UTC by Michal Fojtik
Modified: 2020-09-19 16:54 UTC (History)
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1852047
Environment:
Last Closed:
Target Upstream Version:


Attachments
azure-io (42.05 KB, image/png)
2020-06-23 16:02 UTC, Sam Batschelet
azure apiserver latency 4.4 to 4.5 (53.90 KB, image/png)
2020-06-30 14:01 UTC, Sam Batschelet
azure apiserver latency 4.3 to 4.4 (31.91 KB, image/png)
2020-06-30 14:02 UTC, Sam Batschelet
azure etcd latency 4.3 to 4.4 (162.77 KB, image/png)
2020-06-30 14:03 UTC, Sam Batschelet
azure etcd latency 4.4 to 4.5 (188.80 KB, image/png)
2020-06-30 14:07 UTC, Sam Batschelet


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 418 None closed Bug 1850057: etcd-pod: Use ionice -c2 -n0 2020-09-19 15:32:57 UTC
Github openshift machine-config-operator issues 1897 None open Bug 1850057: stage OS updates (nicely) while etcd is still running 2020-09-19 15:32:57 UTC
Github openshift machine-config-operator pull 1957 None closed Bug 1850057: Use bfq scheduler on control plane, idle I/O for rpm-ostreed 2020-09-19 15:32:54 UTC

Description Michal Fojtik 2020-06-23 13:32:04 UTC
Description of problem:

kube-apiserver:
Jun 23 12:01:07.819: API was unreachable during disruption for at least 21m51s of 1h19m1s (28%):

openshift-apiserver:
Jun 23 12:01:07.819: API was unreachable during disruption for at least 21m37s of 1h19m1s (27%):
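
For reference, the percentages in these two messages can be reproduced from the reported durations. The following is only a minimal sketch: the helper names `to_seconds` and `unreachable_percent` are hypothetical, and it assumes the test rounds to the nearest whole percent (which matches both reported values) rather than reflecting the actual code in openshift/origin.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a Go-style duration such as '1h19m1s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", duration))


def unreachable_percent(down: str, total: str) -> int:
    """Share of the run during which the API was unreachable, as a whole percent.

    Rounding to nearest is an assumption made for this sketch.
    """
    return round(100 * to_seconds(down) / to_seconds(total))


# kube-apiserver: 21m51s of 1h19m1s
print(unreachable_percent("21m51s", "1h19m1s"))
# openshift-apiserver: 21m37s of 1h19m1s
print(unreachable_percent("21m37s", "1h19m1s"))
```

With the durations above, 1311 s and 1297 s out of 4741 s give 28% and 27% respectively.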

Latest failed job: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1275367195391561728

There is some evidence of excessive etcd leader changes and some KAS containers crashlooping (which should not lead to disruption).

This needs to be investigated.



Comment 6 Sam Batschelet 2020-06-23 16:02:08 UTC
Created attachment 1698480 [details]
azure-io

Comment 10 Lalatendu Mohanty 2020-06-23 19:07:53 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 11 Sam Batschelet 2020-06-24 15:55:04 UTC
Update: the etcd team is now working with group-b to better understand how the apiserver failures are calculated. We are focused on ensuring that the load balancers are not causing invalid results, because the measurements are taken from outside the cluster. The validity of these values relies on accurate health reporting from the load balancers.

Comment 15 Sam Batschelet 2020-06-30 14:01:05 UTC
Created attachment 1699317 [details]
azure apiserver latency 4.4 to 4.5

Comment 16 Sam Batschelet 2020-06-30 14:02:09 UTC
Created attachment 1699318 [details]
azure apiserver latency 4.3 to 4.4

Comment 17 Sam Batschelet 2020-06-30 14:03:27 UTC
Created attachment 1699319 [details]
azure etcd latency 4.3 to 4.4

Comment 18 Sam Batschelet 2020-06-30 14:07:46 UTC
Created attachment 1699320 [details]
azure etcd latency 4.4 to 4.5

Comment 23 Micah Abbott 2020-07-07 20:02:24 UTC
As this is going to be a blocker for 4.6, we'll need to prioritize this work in the next sprint or so.

Comment 24 W. Trevor King 2020-07-07 20:11:05 UTC
I'm linking https://github.com/openshift/machine-config-operator/issues/1897 , which discusses some possible OSTree-side mitigation strategies.

Comment 25 Jonathan Lebon 2020-07-07 21:22:17 UTC
I think the main idea in https://github.com/openshift/machine-config-operator/issues/1897 is to make the MCO use the same API that Zincati does in FCOS so that we pay the IO cost earlier. The bulk of the work will therefore be adapting the MCO rather than RHCOS itself, so I'm re-assigning this back to MCO (though work is obviously needed in both; e.g. it'll need at least https://github.com/coreos/rpm-ostree/pull/2158). I left Colin as the assignee in case he wanted to tackle the MCO side.

Comment 26 Colin Walters 2020-08-20 13:37:34 UTC
Status update on this is mostly:

Still working on code and tooling to gather more data about whether the proposed changes improve things.

 - We need to synthesize a "nontrivial" OS update in CI https://github.com/coreos/coreos-assembler/pull/1635
 - It took an unexpectedly long time to land small "prep work" PRs like https://github.com/openshift/machine-config-operator/pull/1962
 - Still waiting on any kind of high level review from the MCO team on https://github.com/openshift/machine-config-operator/pull/1946
 - In trying to understand the upgrade tests I stumbled on https://github.com/openshift/origin/pull/25421 for example

Comment 30 Micah Abbott 2020-09-19 16:54:25 UTC
VERIFIED with 4.6.0-fc.5

```
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.5   True        False         19m     Cluster version is 4.6.0-fc.5

$ oc -n openshift-etcd get po
NAME                                                   READY   STATUS      RESTARTS   AGE
etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-0                3/3     Running     0          41m
etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-1                3/3     Running     0          26m
etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-2                3/3     Running     0          40m
etcd-quorum-guard-5c6f86bc54-4h6b8                     1/1     Running     0          47m
etcd-quorum-guard-5c6f86bc54-5hr9m                     1/1     Running     0          47m
etcd-quorum-guard-5c6f86bc54-ddvpg                     1/1     Running     0          47m
installer-2-ci-ln-j38n5qt-f76d1-r2ks4-master-0         0/1     Completed   0          48m
installer-2-ci-ln-j38n5qt-f76d1-r2ks4-master-1         0/1     Completed   0          46m
installer-2-ci-ln-j38n5qt-f76d1-r2ks4-master-2         0/1     Completed   0          47m
installer-3-ci-ln-j38n5qt-f76d1-r2ks4-master-0         0/1     Completed   0          41m
installer-3-ci-ln-j38n5qt-f76d1-r2ks4-master-1         0/1     Completed   0          40m
installer-3-ci-ln-j38n5qt-f76d1-r2ks4-master-2         0/1     Completed   0          40m
revision-pruner-2-ci-ln-j38n5qt-f76d1-r2ks4-master-0   0/1     Completed   0          47m
revision-pruner-2-ci-ln-j38n5qt-f76d1-r2ks4-master-1   0/1     Completed   0          46m
revision-pruner-2-ci-ln-j38n5qt-f76d1-r2ks4-master-2   0/1     Completed   0          46m
revision-pruner-3-ci-ln-j38n5qt-f76d1-r2ks4-master-0   0/1     Completed   0          40m
revision-pruner-3-ci-ln-j38n5qt-f76d1-r2ks4-master-1   0/1     Completed   0          25m
revision-pruner-3-ci-ln-j38n5qt-f76d1-r2ks4-master-2   0/1     Completed   0          40m

$ oc -n openshift-etcd describe pod/etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-0 | grep ionice
      # See https://etcd.io/docs/v3.4.0/tuning/ for why we use ionice
      exec ionice -c2 -n0 etcd \

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-j38n5qt-f76d1-r2ks4-master-0         Ready    master   51m   v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-master-1         Ready    master   51m   v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-master-2         Ready    master   51m   v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-worker-b-jg46v   Ready    worker   41m   v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-worker-c-g5rvz   Ready    worker   41m   v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-worker-d-njqvl   Ready    worker   41m   v1.19.0-rc.2+fc4c489

$ oc debug node/ci-ln-j38n5qt-f76d1-r2ks4-master-0
Starting pod/ci-ln-j38n5qt-f76d1-r2ks4-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.5
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host 
sh-4.4# lsblk     
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   128G  0 disk 
|-sda1                         8:1    0   384M  0 part /boot
|-sda2                         8:2    0   127M  0 part /boot/efi
|-sda3                         8:3    0     1M  0 part 
`-sda4                         8:4    0 127.5G  0 part 
  `-coreos-luks-root-nocrypt 253:0    0 127.5G  0 dm   /sysroot
sh-4.4# cat /sys/block/sda/queue/scheduler 
mq-deadline kyber [bfq] none
sh-4.4# 

sh-4.4# cat /etc/systemd/system/rpm-ostreed.service.d/mco-controlplane-nice.conf 
# See https://github.com/openshift/machine-config-operator/issues/1897
[Service]
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=6
sh-4.4# 
```
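
The two host-side checks in the transcript above (active I/O scheduler and the rpm-ostreed drop-in) could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `active_scheduler` and `service_settings` are hypothetical, the inputs are the exact strings captured above rather than live reads from the node, and the standard-library `configparser` is used as a stand-in that handles this simple drop-in but is not a full systemd unit-file parser.

```python
import configparser
import re


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with [brackets] in /sys/block/<dev>/queue/scheduler."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active scheduler marked in: {sysfs_text!r}")
    return match.group(1)


def service_settings(dropin_text: str) -> dict:
    """Return the [Service] key/value pairs of a simple systemd drop-in."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(dropin_text)
    return dict(parser["Service"])


# Strings copied from the verification transcript.
SCHEDULER_LINE = "mq-deadline kyber [bfq] none"
DROPIN = """\
# See https://github.com/openshift/machine-config-operator/issues/1897
[Service]
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=6
"""

print(active_scheduler(SCHEDULER_LINE))
print(service_settings(DROPIN))
```

On a real node the two inputs would come from reading `/sys/block/sda/queue/scheduler` and the drop-in file under `/etc/systemd/system/rpm-ostreed.service.d/`, as in the session above.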

