Bug 1731227 - e2e flake: cluster upgrade should maintain a functioning cluster
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Brad Ison
QA Contact: Jianwei Hou
URL:
Whiteboard: buildcop
Depends On:
Blocks:
 
Reported: 2019-07-18 17:17 UTC by Matthew Staebler
Modified: 2019-11-29 13:47 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-29 13:47:36 UTC
Target Upstream Version:



Description Matthew Staebler 2019-07-18 17:17:49 UTC
[sig-cluster-lifecycle] Upgrade [Feature:Upgrade] cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:k8s] [Disabled:SpecialConfig]

The watch closed while creating the namespace for the test.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4188
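
For reference, the failure mode here is the e2e framework waiting on a single watch while it brings up the test namespace; if the connection drops (for example during a kube-apiserver rollout), client-go's UntilWithoutRetry gives up rather than re-establishing the watch. A minimal sketch of that pattern, assuming current client-go signatures; the helper name and the kubeconfig loading are illustrative, not the framework's actual code:

// Sketch: wait for a namespace to become Active using a single watch.
// If the API server restarts mid-upgrade and closes the connection,
// UntilWithoutRetry returns watchtools.ErrWatchClosed
// ("watch closed before UntilWithoutRetry timeout") instead of retrying.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	watchtools "k8s.io/client-go/tools/watch"
)

func waitForNamespaceActive(client kubernetes.Interface, name string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	w, err := client.CoreV1().Namespaces().Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}

	// Single watch, no re-establishment: a dropped connection surfaces as ErrWatchClosed.
	_, err = watchtools.UntilWithoutRetry(ctx, w, func(event watch.Event) (bool, error) {
		ns, ok := event.Object.(*corev1.Namespace)
		if !ok {
			return false, nil
		}
		return ns.Status.Phase == corev1.NamespaceActive, nil
	})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := waitForNamespaceActive(client, "e2e-test-example", 30*time.Second); err != nil {
		// During an apiserver rollout this is typically watchtools.ErrWatchClosed.
		fmt.Println("wait failed:", err)
	}
}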

Comment 1 Hongkai Liu 2019-07-26 17:24:57 UTC
Similar logs:


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4684


Jul 26 11:41:08.704 E ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-154-141.ec2.internal node/ip-10-0-154-141.ec2.internal container=kube-apiserver-7 container exited with code 255 (Error): nAPI AggregationController: Processing item v1.security.openshift.io\nI0726 11:37:46.613332       1 controller.go:107] OpenAPI AggregationController: Processing item v1.user.openshift.io\nI0726 11:37:48.513362       1 controller.go:107] OpenAPI AggregationController: Processing item v1.route.openshift.io\nI0726 11:37:50.307862       1 controller.go:107] OpenAPI AggregationController: Processing item v1.build.openshift.io\nI0726 11:37:52.037284       1 controller.go:107] OpenAPI AggregationController: Processing item v1.image.openshift.io\nI0726 11:37:53.541798       1 controller.go:107] OpenAPI AggregationController: Processing item v1.packages.operators.coreos.com\nI0726 11:37:55.258885       1 controller.go:107] OpenAPI AggregationController: Processing item v1.oauth.openshift.io\nI0726 11:37:57.261110       1 controller.go:107] OpenAPI AggregationController: Processing item v1.project.openshift.io\nI0726 11:37:58.844377       1 controller.go:107] OpenAPI AggregationController: Processing item v1.apps.openshift.io\nI0726 11:38:08.687688       1 cacher.go:648] cacher (*core.Pod): 1 objects queued in incoming channel.\nI0726 11:38:08.854644       1 cacher.go:648] cacher (*core.Pod): 2 objects queued in incoming channel.\nI0726 11:38:09.912022       1 log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled\nI0726 11:38:09.913893       1 log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled\nI0726 11:38:11.217980       1 log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled\nI0726 11:38:17.508211       1 genericapiserver.go:546] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-154-141.ec2.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationStart' Received signal to terminate, becoming unready, but keeping serving\nI0726 11:38:17.508319       1 controller.go:176] Shutting down kubernetes service endpoint reconciler\n
Jul 26 11:41:08.759 I ns/openshift-kube-controller-manager pod/revision-pruner-5-ip-10-0-154-141.ec2.internal Started container pruner
Jul 26 11:41:08.773 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready" to "StaticPodsDegraded: nodes/ip-10-0-154-141.ec2.internal pods/kube-apiserver-ip-10-0-154-141.ec2.internal container=\"kube-apiserver-7\" is not ready\nStaticPodsDegraded: nodes/ip-10-0-154-141.ec2.internal pods/kube-apiserver-ip-10-0-154-141.ec2.internal container=\"kube-apiserver-7\" is terminated: \"Error\" - \"nAPI AggregationController: Processing item v1.security.openshift.io\\nI0726 11:37:46.613332       1 controller.go:107] OpenAPI AggregationController: Processing item v1.user.openshift.io\\nI0726 11:37:48.513362       1 controller.go:107] OpenAPI AggregationController: Processing item v1.route.openshift.io\\nI0726 11:37:50.307862       1 controller.go:107] OpenAPI AggregationController: Processing item v1.build.openshift.io\\nI0726 11:37:52.037284       1 controller.go:107] OpenAPI AggregationController: Processing item v1.image.openshift.io\\nI0726 11:37:53.541798       1 controller.go:107] OpenAPI AggregationController: Processing item v1.packages.operators.coreos.com\\nI0726 11:37:55.258885       1 controller.go:107] OpenAPI AggregationController: Processing item v1.oauth.openshift.io\\nI0726 11:37:57.261110       1 controller.go:107] OpenAPI AggregationController: Processing item v1.project.openshift.io\\nI0726 11:37:58.844377       1 controller.go:107] OpenAPI AggregationController: Processing item v1.apps.openshift.io\\nI0726 11:38:08.687688       1 cacher.go:648] cacher (*core.Pod): 1 objects queued in incoming channel.\\nI0726 11:38:08.854644       1 cacher.go:648] cacher (*core.Pod): 2 objects queued in incoming channel.\\nI0726 11:38:09.912022       1 log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled\\nI0726 11:38:09.913893       1 log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled\\nI0726 11:38:11.217980       1 log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled\\nI0726 11:38:17.508211       1 genericapiserver.go:546] Event(v1.ObjectReference{Kind:\\\"Pod\\\", Namespace:\\\"openshift-kube-apiserver\\\", Name:\\\"kube-apiserver-ip-10-0-154-141.ec2.internal\\\", UID:\\\"\\\", APIVersion:\\\"v1\\\", ResourceVersion:\\\"\\\", FieldPath:\\\"\\\"}): type: 'Normal' reason: 'TerminationStart' Received signal to terminate, becoming unready, but keeping serving\\nI0726 11:38:17.508319       1 controller.go:176] Shutting down kubernetes service endpoint reconciler\\n\"\nStaticPodsDegraded: nodes/ip-10-0-154-141.ec2.internal pods/kube-apiserver-ip-10-0-154-141.ec2.internal container=\"kube-apiserver-cert-syncer-7\" is not ready\nStaticPodsDegraded: nodes/ip-10-0-154-141.ec2.internal pods/kube-apiserver-ip-10-0-154-141.ec2.internal container=\"kube-apiserver-cert-syncer-7\" is terminated: \"Error\" - \"I0726 11:29:20.394711       1 certsync_controller.go:269] Starting CertSyncer\\nI0726 11:29:20.396391       1 observer_polling.go:106] Starting file observer\\nW0726 11:35:37.539772       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 29991 (30960)\\n\"\nStaticPodsDegraded: nodes/ip-10-0-154-141.ec2.internal pods/kube-apiserver-ip-10-0-154-141.ec2.internal container=\"kube-apiserver-insecure-readyz-7\" is not ready\nStaticPodsDegraded: 
nodes/ip-10-0-154-141.ec2.internal pods/kube-apiserver-ip-10-0-154-141.ec2.internal container=\"kube-apiserver-insecure-readyz-7\" is terminated: \"Error\" - \"I0726 11:29:20.704377       1 readyz.go:103] Listening on 0.0.0.0:6080\\n\"\nNodeControllerDegraded: All master node(s) are ready"

Comment 4 Abhinav Dahiya 2019-10-30 16:13:04 UTC
Most of these updates completed, but it looks like the entirety of services going down was the node update.

Comment 5 Antonio Murdaca 2019-10-31 09:02:24 UTC
(In reply to Abhinav Dahiya from comment #4)
> Most of these updates completed, but it looks like the entirety of services
> going down was the node update.

I'm assuming node-update is something related to Cloud Compute (as we have nothing like that in the MCO).

Comment 6 Alberto 2019-11-19 14:44:52 UTC
I can't see how this relates to machines. The Machine API logs all seem OK.

Timeouts seem to happen in two different scenarios, one in job.go and one in framework.go. I'm not sure how to find more meaningful logs:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4684/artifacts/e2e-aws-upgrade/e2e.log
STEP: Destroying namespace "e2e-sig-apps-replicaset-upgrade-6988" for this suite.
Jul 26 11:54:55.123: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Jul 26 11:54:57.296: INFO: namespace e2e-sig-apps-replicaset-upgrade-6988 deletion completed in 26.223852863s
Jul 26 11:54:57.296: INFO: Running AfterSuite actions on node 1
Jul 26 11:54:57.296: INFO: Dumping logs locally to: /tmp/artifacts/junit
Jul 26 11:54:57.296: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:64]: Expected
    <bool>: false
to be true

failed: (30m31s) 2019-07-26T11:54:57 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"


https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/220/artifacts/e2e-aws-upgrade/e2e.log
STEP: Destroying namespace "e2e-tests-sig-apps-deployment-upgrade-tfmm4" for this suite.
Jul 26 16:00:18.490: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Jul 26 16:00:20.168: INFO: namespace e2e-tests-sig-apps-deployment-upgrade-tfmm4 deletion completed in 7.745938894s
Jul 26 16:00:20.168: INFO: Running AfterSuite actions on node 1
Jul 26 16:00:20.168: INFO: Dumping logs locally to: /tmp/artifacts/junit
Jul 26 16:00:20.170: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:63]: Expected
    <bool>: false
to be true


https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/59/artifacts/e2e-aws-upgrade/e2e.log
Jul 18 11:29:17.503: INFO: namespace sig-storage-volume-mode-downgrade-1768 deletion completed in 8.339061161s
Jul 18 11:29:17.510: INFO: Running AfterSuite actions on all nodes
Jul 18 11:29:17.510: INFO: Running AfterSuite actions on node 1
Jul 18 11:29:17.510: INFO: Dumping logs locally to: /tmp/artifacts/junit
Jul 18 11:29:17.511: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:220]: Unexpected error:
    <*errors.errorString | 0xc00032a410>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
occurred


https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4188/artifacts/e2e-aws-upgrade/e2e.log
Jul 18 11:29:17.503: INFO: namespace sig-storage-volume-mode-downgrade-1768 deletion completed in 8.339061161s
Jul 18 11:29:17.510: INFO: Running AfterSuite actions on all nodes
Jul 18 11:29:17.510: INFO: Running AfterSuite actions on node 1
Jul 18 11:29:17.510: INFO: Dumping logs locally to: /tmp/artifacts/junit
Jul 18 11:29:17.511: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:220]: Unexpected error:
    <*errors.errorString | 0xc00032a410>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
occurred
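
The two signatures above come from different layers. The framework.go failure is client-go's UntilWithoutRetry reporting that its watch connection was closed, while the job.go failure appears to be a bare boolean assertion, which is why it reports only "Expected <bool>: false to be true" with no detail about which job pods were missing. A minimal, hypothetical sketch of that assertion pattern using Gomega (the helper name is illustrative, not the real e2e code):

package upgrade_test

import (
	"testing"

	. "github.com/onsi/gomega"
)

// allJobPodsRunning stands in for the real check that lists the job's pods
// and verifies they are all still running after the upgrade.
func allJobPodsRunning() bool {
	return false // pretend a pod was lost while its node rebooted
}

func TestJobSurvivesUpgrade(t *testing.T) {
	g := NewWithT(t)
	// On failure this prints only:
	//   Expected
	//       <bool>: false
	//   to be true
	g.Expect(allJobPodsRunning()).To(BeTrue())
}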

Comment 7 Brad Ison 2019-11-29 13:47:36 UTC
I don't really know what to make of this. This was opened in July and bounced around. It doesn't seem to have anything to do with the Cloud Compute components. The errors posted above look to me like a failure to create one of the test namespaces, possibly because the API server is being restarted, but the tests don't retry the creation.

If it's still a flake causing problems in 4.2, please re-open and move it to a more appropriate component, which I'm guessing at this point is the API server.
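
A minimal sketch of the retry described above, assuming current client-go signatures: keep attempting the namespace creation while the API server is rolling out, instead of failing on the first error. The package, helper name, and the specific set of retryable errors are illustrative, not the e2e framework's actual code:

package e2eutil // hypothetical helper package

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// createNamespaceWithRetry creates the test namespace, retrying while the
// API server is briefly unavailable (e.g. during an upgrade rollout).
func createNamespaceWithRetry(ctx context.Context, client kubernetes.Interface, name string) (*corev1.Namespace, error) {
	var ns *corev1.Namespace
	err := wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		var err error
		ns, err = client.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: name},
		}, metav1.CreateOptions{})
		if err == nil {
			return true, nil
		}
		if apierrors.IsAlreadyExists(err) {
			ns, err = client.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
			return err == nil, nil
		}
		// Treat transient API-server unavailability as retryable.
		if apierrors.IsServiceUnavailable(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) {
			return false, nil
		}
		return false, err // anything else is a real failure
	})
	return ns, err
}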

