Bug 1851353
| Summary: | [sig-apps] Deployment should not disrupt a cloud load-balancer's connectivity during rollout | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Daniel Mellado <dmellado> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3.z | CC: | aos-bugs, bbennett, deads, dmace, dosmith, mfojtik, wking, wlewis |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | non-multi-arch | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-apps] Deployment should not disrupt a cloud load-balancer's connectivity during rollout |
| Last Closed: | 2020-10-27 16:09:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Daniel Mellado
2020-06-26 09:49:38 UTC
I only see 'fail.*etcdserver: request is too large' a few times over the past 14d [1]:

```
$ curl -sL 'https://search.svc.ci.openshift.org/search?search=fail.*etcdserver%3A+request+is+too+large&maxAge=336h&type=junit' | jq -r 'keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.3/1276032622354501632
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.3/1276394920189366272
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.4/1273858136133865472
```

Comparing with job failures:

```
$ curl -sL 'https://search.svc.ci.openshift.org/search?search=Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout&maxAge=336h&type=junit&name=release-openshift-' | jq -r 'keys[]' | wc -l
91
$ curl -sL 'https://search.svc.ci.openshift.org/search?search=Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout&maxAge=336h&type=junit&name=release-openshift-' | jq -r 'keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
      1 release-openshift-ocp-installer-e2e-aws-fips-4.4
      1 release-openshift-ocp-installer-e2e-azure-ovn-4.4
      1 release-openshift-ocp-installer-e2e-gcp-ovn-4.3
      1 release-openshift-ocp-installer-e2e-gcp-ovn-4.4
      1 release-openshift-ocp-installer-e2e-openstack-ppc64le-4.4
      1 release-openshift-okd-installer-e2e-aws-4.5
      1 release-openshift-origin-installer-e2e-azure-shared-vpc-4.4
      1 release-openshift-origin-installer-e2e-azure-shared-vpc-4.5
      1 release-openshift-origin-installer-e2e-gcp-4.6
      1 release-openshift-origin-installer-e2e-gcp-shared-vpc-4.4
      1 release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.4
      1 release-openshift-origin-installer-launch-aws
      2 release-openshift-ocp-installer-e2e-azure-4.4
      2 release-openshift-okd-installer-e2e-aws-4.4
      2 release-openshift-origin-installer-e2e-aws-ovn-4.6
      2 release-openshift-origin-installer-e2e-azure-shared-vpc-4.6
      2 release-openshift-origin-installer-e2e-gcp-4.3
      2 release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.4
      3 promote-release-openshift-okd-machine-os-content-e2e-aws-4.5
      3 release-openshift-ocp-installer-e2e-openstack-4.3
      3 release-openshift-okd-installer-e2e-aws-4.6
      3 release-openshift-origin-installer-e2e-aws-4.4
      4 rehearse-9652-promote-release-openshift-okd-machine-os-content-e2e-aws-4.5
      4 release-openshift-ocp-installer-e2e-aws-ovn-4.3
      4 release-openshift-ocp-installer-e2e-aws-ovn-4.4
      4 release-openshift-ocp-installer-e2e-aws-ovn-4.5
      4 release-openshift-ocp-installer-e2e-azure-4.6
      5 release-openshift-ocp-installer-e2e-azure-4.5
      6 release-openshift-origin-installer-e2e-aws-4.6
     24 release-openshift-origin-installer-e2e-aws-4.5
```

So I think you've picked a corner-case example job. It's worth figuring out what's going on with the large requests in this bug, but we probably want a new one for whatever is impacting this test case in release-openshift-origin-installer-e2e-aws-4.5.

This error is generated when the object being persisted exceeds etcd's request-size limit (I think 1.5 MB by default). What reason do we have to believe that etcd is doing anything wrong here? I think this should be reassigned to whomever we suspect is the client generating the request containing an object too big to store.

The two runs I looked at both failed to schedule the pod, but I can't figure out why or whether it should have scheduled. It appears the pods exist, but the scheduler isn't able to find a spot for all of them.
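The comment above attributes the "request is too large" error to a client persisting an oversized object rather than to etcd itself. As a minimal sketch (not from the original report), one rough way to gauge whether a particular object is near that cap is to measure its serialized size; the resource and namespace names here are placeholders, and the ~1.5 MiB figure is etcd's default `--max-request-bytes` value (1572864 bytes):

```
# Rough size check of a suspect object; JSON size is only an approximation of
# what etcd stores, but anything approaching ~1.5 MB would be consistent with
# an "etcdserver: request is too large" error.
$ oc get deployment <suspect-deployment> -n <namespace> -o json | wc -c
```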
https://github.com/openshift/kubernetes/pull/310 and upstream https://github.com/kubernetes/kubernetes/pull/93857 are handling this.

I'm adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

https://github.com/kubernetes/kubernetes/pull/93857 is already picked up in the latest k8s bump in https://github.com/openshift/kubernetes/pull/325.

https://github.com/openshift/kubernetes/pull/325 merged, so moving this to MODIFIED.

From the search https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout, we can't find the same error for this test, so moving this to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
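As a reference for the verification step above, a minimal sketch of an equivalent command-line query, mirroring the curl/jq pattern used earlier in this bug; the /search endpoint and its parameters on search.ci.openshift.org are assumed to behave like the older search.svc.ci.openshift.org commands shown in the description:

```
# Search the last 7 days of CI junit results for the affected test name;
# an empty result (no matching job runs) supports moving the bug to VERIFIED.
$ curl -sL 'https://search.ci.openshift.org/search?search=Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout&maxAge=168h&type=junit' | jq -r 'keys[]'
```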