Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1851353

Summary: [sig-apps] Deployment should not disrupt a cloud load-balancer's connectivity during rollout
Product: OpenShift Container Platform
Component: kube-controller-manager
Version: 4.3.z
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Daniel Mellado <dmellado>
Assignee: Maciej Szulik <maszulik>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, bbennett, deads, dmace, dosmith, mfojtik, wking, wlewis
Whiteboard: non-multi-arch
Doc Type: No Doc Update
Type: Bug
Environment: [sig-apps] Deployment should not disrupt a cloud load-balancer's connectivity during rollout
Last Closed: 2020-10-27 16:09:46 UTC

Description Daniel Mellado 2020-06-26 09:49:38 UTC
test:
[sig-apps] Deployment should not disrupt a cloud load-balancer's connectivity during rollout 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout

This test is failing in, for example, https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.3/1276394920189366272 with:

fail [github.com/openshift/origin/test/extended/util/client.go:681]: Jun 26 06:09:57.469: etcdserver: request is too large

```
Jun 26 06:09:54.843: INFO: >>> kubeConfig: /tmp/cluster-credentials/kubeconfig
Jun 26 06:09:54.845: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Jun 26 06:09:55.481: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Jun 26 06:09:55.838: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Jun 26 06:09:55.838: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Jun 26 06:09:55.838: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Jun 26 06:09:55.937: INFO: e2e test version: v1.16.2+3f43aac
Jun 26 06:09:55.962: INFO: kube-apiserver version: v1.16.2
Jun 26 06:09:55.962: INFO: >>> kubeConfig: /tmp/cluster-credentials/kubeconfig
Jun 26 06:09:56.012: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-apps] Deployment
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:151
STEP: Creating a kubernetes client
Jun 26 06:09:56.037: INFO: >>> kubeConfig: /tmp/cluster-credentials/kubeconfig
STEP: Building a namespace api object, basename deployment
Jun 26 06:09:56.246: INFO: About to run a Kube e2e test, ensuring namespace is privileged
goroutine 1 [running]:
runtime/debug.Stack(0x124d622, 0x0, 0x0)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/debug/stack.go:24 +0x9d
github.com/openshift/origin/test/extended/util.FatalErr(0x4df7760, 0xc0021ac280)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/client.go:680 +0x26
github.com/openshift/origin/test/extended/util.addE2EServiceAccountsToSCC(0x6223600, 0xc001668550, 0xc0005e5b80, 0x1, 0x1, 0x56a08d7, 0xa)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:679 +0x136
github.com/openshift/origin/test/extended/util.createTestingNS(0x569fb71, 0xa, 0x63a8400, 0xc001665040, 0xc00284a870, 0x7d00000005499ac0, 0x1, 0x7df8b53713e0378b)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:322 +0x24a
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*Framework).CreateNamespace(0xc00069ef00, 0x569fb71, 0xa, 0xc00284a870, 0xc00171ba68, 0xc001bec1c0, 0x34)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:399 +0x73
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*Framework).BeforeEach(0xc00069ef00)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:215 +0xa22
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0xc000433560, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113 +0x9c
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0xc000433560, 0xb7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:64 +0xcf
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*SetupNode).Run(0xc000010210, 0x61d54a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes/setup_nodes.go:15 +0x64
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc002301a40, 0x0, 0x61d54a0, 0xc000287480)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/spec/spec.go:193 +0x221
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc002301a40, 0x61d54a0, 0xc000287480)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/spec/spec.go:138 +0xf4
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc001c88a00, 0xc002301a40, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:200 +0x10f
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc001c88a00, 0x1)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:170 +0x124
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc001c88a00, 0xc000d3e868)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:66 +0x117
github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc00028f900, 0x61d3d60, 0xc0019bdbd0, 0x0, 0x0, 0xc001668120, 0x1, 0x1, 0x62c69c0, 0xc000287480, ...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/suite/suite.go:62 +0x42e
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001b60390, 0xc001b62490, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/test/ginkgo/cmd_runtest.go:59 +0x4d2
main.newRunTestCommand.func1(0xc000bd1680, 0xc001b62490, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:233 +0x11f
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc000bd1680, 0xc001b62450, 0x1, 0x1, 0xc000bd1680, 0xc001b62450)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:826 +0x465
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc000bd0c80, 0x61dbad8, 0xa5c4b00, 0xa5c4b00)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:864
main.main.func1(0xc000bd0c80, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:68 +0x93
main.main()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:69 +0x327

Jun 26 06:09:57.469: INFO: etcdserver: request is too large
[AfterEach] [sig-apps] Deployment
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/apps/deployment.go:64
[AfterEach] [sig-apps] Deployment
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Jun 26 06:09:57.470: INFO: Running AfterSuite actions on all nodes
Jun 26 06:09:57.470: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/client.go:681]: Jun 26 06:09:57.469: etcdserver: request is too large
```

Comment 1 W. Trevor King 2020-06-26 20:13:09 UTC
I only see 'fail.*etcdserver: request is too large' a few times over the past 14 days:

$ curl -sL 'https://search.svc.ci.openshift.org/search?search=fail.*etcdserver%3A+request+is+too+large&maxAge=336h&type=junit' | jq -r 'keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.3/1276032622354501632
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.3/1276394920189366272
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.4/1273858136133865472

Comparing with job failures:

$ curl -sL 'https://search.svc.ci.openshift.org/search?search=Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout&maxAge=336h&type=junit&name=release-openshift-' | jq -r 'keys[]' | wc -l
91
$ curl -sL 'https://search.svc.ci.openshift.org/search?search=Deployment+should+not+disrupt+a+cloud+load-balancer%27s+connectivity+during+rollout&maxAge=336h&type=junit&name=release-openshift-' | jq -r 'keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
      1 release-openshift-ocp-installer-e2e-aws-fips-4.4
      1 release-openshift-ocp-installer-e2e-azure-ovn-4.4
      1 release-openshift-ocp-installer-e2e-gcp-ovn-4.3
      1 release-openshift-ocp-installer-e2e-gcp-ovn-4.4
      1 release-openshift-ocp-installer-e2e-openstack-ppc64le-4.4
      1 release-openshift-okd-installer-e2e-aws-4.5
      1 release-openshift-origin-installer-e2e-azure-shared-vpc-4.4
      1 release-openshift-origin-installer-e2e-azure-shared-vpc-4.5
      1 release-openshift-origin-installer-e2e-gcp-4.6
      1 release-openshift-origin-installer-e2e-gcp-shared-vpc-4.4
      1 release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.4
      1 release-openshift-origin-installer-launch-aws
      2 release-openshift-ocp-installer-e2e-azure-4.4
      2 release-openshift-okd-installer-e2e-aws-4.4
      2 release-openshift-origin-installer-e2e-aws-ovn-4.6
      2 release-openshift-origin-installer-e2e-azure-shared-vpc-4.6
      2 release-openshift-origin-installer-e2e-gcp-4.3
      2 release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.4
      3 promote-release-openshift-okd-machine-os-content-e2e-aws-4.5
      3 release-openshift-ocp-installer-e2e-openstack-4.3
      3 release-openshift-okd-installer-e2e-aws-4.6
      3 release-openshift-origin-installer-e2e-aws-4.4
      4 rehearse-9652-promote-release-openshift-okd-machine-os-content-e2e-aws-4.5
      4 release-openshift-ocp-installer-e2e-aws-ovn-4.3
      4 release-openshift-ocp-installer-e2e-aws-ovn-4.4
      4 release-openshift-ocp-installer-e2e-aws-ovn-4.5
      4 release-openshift-ocp-installer-e2e-azure-4.6
      5 release-openshift-ocp-installer-e2e-azure-4.5
      6 release-openshift-origin-installer-e2e-aws-4.6
     24 release-openshift-origin-installer-e2e-aws-4.5

So I think you've picked a corner-case example job.  It's worth figuring out what's going on with the large requests in this bug, but we probably want a new one for whatever is impacting this test-case in release-openshift-origin-installer-e2e-aws-4.5.

Comment 3 Dan Mace 2020-07-14 18:10:26 UTC
This error is generated when the object being persisted exceeds etcd's maximum request size (I think 1.5 MB by default). What reason do we have to believe that etcd is doing anything wrong here? I think this should be reassigned to whoever we suspect is the client generating the request containing an object too big to store.
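For context, a minimal sketch of the size math behind this error, assuming etcd's default `--max-request-bytes` limit of 1.5 MiB (1572864 bytes); the payload size and the commented `kubectl` repro below are illustrative assumptions, not taken from this bug's logs:

```shell
#!/bin/sh
# etcd rejects any single write request whose serialized payload exceeds
# --max-request-bytes (default 1572864 bytes, i.e. 1.5 MiB) with
# "etcdserver: request is too large".
LIMIT=1572864

# Build a payload just over the limit (about 1.6 MiB of zero bytes).
head -c 1677722 /dev/zero > /tmp/big-payload

SIZE=$(wc -c < /tmp/big-payload)
if [ "$SIZE" -gt "$LIMIT" ]; then
  echo "payload of $SIZE bytes exceeds etcd limit of $LIMIT; write would be rejected"
fi

# Against a live cluster, the same limit can be hit by persisting an object
# built from such a file (hypothetical repro, not run here):
#   kubectl create configmap too-big --from-file=data=/tmp/big-payload
#   # Error from server: etcdserver: request is too large
```

Any client that accumulates state into a single object (large ConfigMaps, objects with many managed fields, etc.) can cross this threshold, which is why the question is which client built the oversized request rather than whether etcd misbehaved.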

Comment 5 David Eads 2020-08-07 14:39:43 UTC
The two runs I looked at both failed to schedule the pod, but I can't figure out why or whether it should have scheduled.  It appears the pods exist, but the scheduler isn't able to find a spot for all of them.

Comment 6 Maciej Szulik 2020-08-11 11:11:22 UTC
This is being handled by https://github.com/openshift/kubernetes/pull/310 and, upstream, https://github.com/kubernetes/kubernetes/pull/93857.

Comment 7 Maciej Szulik 2020-08-21 13:55:28 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 Maciej Szulik 2020-09-08 10:10:31 UTC
https://github.com/kubernetes/kubernetes/pull/93857 has already been picked up in the latest Kubernetes bump in https://github.com/openshift/kubernetes/pull/325.

Comment 10 Maciej Szulik 2020-09-10 18:45:22 UTC
https://github.com/openshift/kubernetes/pull/325 has merged, so moving this to MODIFIED.

Comment 15 errata-xmlrpc 2020-10-27 16:09:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 16 Red Hat Bugzilla 2023-09-14 06:02:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days