Bug 1850057
| Summary: | e2e-azure-upgrade-4.4-stable-to-4.5-ci failing with API unreachable for ~28% of the upgrade time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michal Fojtik <mfojtik> |
| Component: | Machine Config Operator | Assignee: | Colin Walters <walters> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.5 | CC: | aos-bugs, bbreard, bleanhar, imcleod, jack.ottofaro, jlebon, jligon, kewang, lmohanty, mfojtik, mharri, miabbott, nstielau, sbatsche, scuppett, sdodson, vrutkovs, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | coreos | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1852047 | Environment: | |
| Last Closed: | 2020-10-27 16:08:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1852047, 1852058, 1852565, 1861507 | | |
| Bug Blocks: | | | |
| Attachments: | | | |

Description
Michal Fojtik
2020-06-23 13:32:04 UTC
Created attachment 1698480
azure-io

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other nonstandard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Update: etcd team is now working with group-b to better understand how the apiserver failures are calculated. We are focused on ensuring the load balancers are not causing invalid results, as calculations are taken external to the cluster. The validity of these values relies on accurate health reporting from load balancers.

Created attachment 1699317
azure apiserver latency 4.4 to 4.5
Created attachment 1699318
azure apiserver latency 4.3 to 4.4
Created attachment 1699319
azure etcd latency 4.3 to 4.4
Created attachment 1699320
azure etcd latency 4.4 to 4.5

As this is going to be a blocker for 4.6, we'll need to prioritize this work in the next sprint or so. I'm linking https://github.com/openshift/machine-config-operator/issues/1897, which discusses some possible OSTree-side mitigation strategies.

I think the main idea in https://github.com/openshift/machine-config-operator/issues/1897 is to make the MCO use the same API that Zincati does in FCOS so that we pay the IO cost earlier. The bulk of the work will be about adapting the MCO rather than RHCOS itself, so I'm re-assigning back to MCO (but obviously work is needed in both; e.g. it'll need at least https://github.com/coreos/rpm-ostree/pull/2158). I left Colin as the assignee in case he wanted to tackle the MCO side.

Status update on this is mostly: still working on code and tooling to gather more data about whether the proposed changes improve things.
- We need to synthesize a "nontrivial" OS update in CI: https://github.com/coreos/coreos-assembler/pull/1635
- It took an unexpectedly long time to land small "prep work" PRs like https://github.com/openshift/machine-config-operator/pull/1962
- Still waiting on any kind of high-level review from the MCO team on https://github.com/openshift/machine-config-operator/pull/1946
- In trying to understand the upgrade tests I stumbled on https://github.com/openshift/origin/pull/25421, for example

VERIFIED with 4.6.0-fc.5
```
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-fc.5 True False 19m Cluster version is 4.6.0-fc.5
$ oc -n openshift-etcd get po
NAME READY STATUS RESTARTS AGE
etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-0 3/3 Running 0 41m
etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-1 3/3 Running 0 26m
etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-2 3/3 Running 0 40m
etcd-quorum-guard-5c6f86bc54-4h6b8 1/1 Running 0 47m
etcd-quorum-guard-5c6f86bc54-5hr9m 1/1 Running 0 47m
etcd-quorum-guard-5c6f86bc54-ddvpg 1/1 Running 0 47m
installer-2-ci-ln-j38n5qt-f76d1-r2ks4-master-0 0/1 Completed 0 48m
installer-2-ci-ln-j38n5qt-f76d1-r2ks4-master-1 0/1 Completed 0 46m
installer-2-ci-ln-j38n5qt-f76d1-r2ks4-master-2 0/1 Completed 0 47m
installer-3-ci-ln-j38n5qt-f76d1-r2ks4-master-0 0/1 Completed 0 41m
installer-3-ci-ln-j38n5qt-f76d1-r2ks4-master-1 0/1 Completed 0 40m
installer-3-ci-ln-j38n5qt-f76d1-r2ks4-master-2 0/1 Completed 0 40m
revision-pruner-2-ci-ln-j38n5qt-f76d1-r2ks4-master-0 0/1 Completed 0 47m
revision-pruner-2-ci-ln-j38n5qt-f76d1-r2ks4-master-1 0/1 Completed 0 46m
revision-pruner-2-ci-ln-j38n5qt-f76d1-r2ks4-master-2 0/1 Completed 0 46m
revision-pruner-3-ci-ln-j38n5qt-f76d1-r2ks4-master-0 0/1 Completed 0 40m
revision-pruner-3-ci-ln-j38n5qt-f76d1-r2ks4-master-1 0/1 Completed 0 25m
revision-pruner-3-ci-ln-j38n5qt-f76d1-r2ks4-master-2 0/1 Completed 0 40m
$ oc -n openshift-etcd describe pod/etcd-ci-ln-j38n5qt-f76d1-r2ks4-master-0 | grep ionice
# See https://etcd.io/docs/v3.4.0/tuning/ for why we use ionice
exec ionice -c2 -n0 etcd \
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ci-ln-j38n5qt-f76d1-r2ks4-master-0 Ready master 51m v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-master-1 Ready master 51m v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-master-2 Ready master 51m v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-worker-b-jg46v Ready worker 41m v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-worker-c-g5rvz Ready worker 41m v1.19.0-rc.2+fc4c489
ci-ln-j38n5qt-f76d1-r2ks4-worker-d-njqvl Ready worker 41m v1.19.0-rc.2+fc4c489
$ oc debug node/ci-ln-j38n5qt-f76d1-r2ks4-master-0
Starting pod/ci-ln-j38n5qt-f76d1-r2ks4-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.5
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
|-sda1 8:1 0 384M 0 part /boot
|-sda2 8:2 0 127M 0 part /boot/efi
|-sda3 8:3 0 1M 0 part
`-sda4 8:4 0 127.5G 0 part
`-coreos-luks-root-nocrypt 253:0 0 127.5G 0 dm /sysroot
sh-4.4# cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
sh-4.4#
sh-4.4# cat /etc/systemd/system/rpm-ostreed.service.d/mco-controlplane-nice.conf
# See https://github.com/openshift/machine-config-operator/issues/1897
[Service]
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=6
sh-4.4#
```
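The verification above relies on two settings: etcd is started with `ionice -c2 -n0` (best-effort class, highest priority within it), and `rpm-ostreed` gets a drop-in with `Nice=10` and a best-effort I/O priority of 6. Both are honored by the `[bfq]` scheduler shown for `sda`. The following is a minimal sketch of how that relationship could be checked programmatically. The function names and the `etcd_io_priority` parameter are hypothetical and not part of the MCO; the drop-in text is copied from the output above.

```python
import configparser

# Drop-in content copied from the verification transcript above.
DROPIN = """\
# See https://github.com/openshift/machine-config-operator/issues/1897
[Service]
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=6
"""


def parse_service_settings(text):
    """Return the [Service] section of a systemd drop-in as a dict of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as written (e.g. "Nice")
    parser.read_string(text)
    return dict(parser["Service"])


def is_deprioritized(settings, etcd_io_priority=0):
    """Check that the service yields CPU and I/O to etcd.

    Assumption: etcd runs as `ionice -c2 -n0`, i.e. best-effort class with
    priority 0. Within the best-effort class, a larger number means a
    lower priority, and a positive Nice value lowers CPU priority.
    """
    nice = int(settings.get("Nice", "0"))
    io_class = settings.get("IOSchedulingClass", "")
    io_prio = int(settings.get("IOSchedulingPriority", "0"))
    return nice > 0 and io_class == "best-effort" and io_prio > etcd_io_priority


if __name__ == "__main__":
    settings = parse_service_settings(DROPIN)
    print(settings)
    print("rpm-ostreed deprioritized relative to etcd:", is_deprioritized(settings))
```

This only inspects the configuration text; it does not confirm that the running processes actually have these priorities, which would require checking them on the node itself.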
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475