Description of problem:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/21
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/20
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/19
... etc.

fail [k8s.io/kubernetes/test/e2e/apps/daemon_set.go:425]: Expected
    <int>: 2
to equal
    <int>: 0

Version-Release number of selected component (if applicable):
4.2 jobs

How reproducible:
Always
None of the linked jobs failed with the error mentioned above; they suffered other problems tracked in other BZs.
It actually did fail with that error if you scroll through the logs. It doesn't seem to be the DaemonSet's fault. The tests detect the number of schedulable nodes when the e2e suite starts (the next kube version does this on demand), but that number was <2 when the e2e started although there are 3 worker nodes. That suggests the e2e either doesn't wait for nodes to be schedulable, or the nodes were tainted at the time; I don't see either in the end state, so I'd assume it was transient (at startup).
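To illustrate what "detect the number of schedulable nodes when the e2e starts" means in practice, here is a minimal, hypothetical sketch (not the actual origin/kubernetes test code) of a suite-level count taken once with client-go. A worker that is still tainted or unschedulable at that instant is simply never counted, so a test that later requires at least 2 nodes can fail even though 3 workers exist. The helper name countSchedulableNodes, the kubeconfig path, and the context-aware List signature of a recent client-go are assumptions.

package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// countSchedulableNodes is a hypothetical helper: it counts nodes that are
// Ready, not marked unschedulable, and carry no NoSchedule/NoExecute taints.
func countSchedulableNodes(ctx context.Context, c kubernetes.Interface) (int, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	count := 0
	for _, n := range nodes.Items {
		if n.Spec.Unschedulable {
			continue
		}
		tainted := false
		for _, t := range n.Spec.Taints {
			if t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute {
				tainted = true
				break
			}
		}
		if tainted {
			continue
		}
		ready := false
		for _, cond := range n.Status.Conditions {
			if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
				ready = true
				break
			}
		}
		if ready {
			count++
		}
	}
	return count, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// Taken once at suite start: a worker that only becomes schedulable a few
	// seconds later is never reflected in this cached value.
	n, err := countSchedulableNodes(context.TODO(), client)
	if err != nil {
		panic(err)
	}
	fmt.Printf("schedulable nodes at suite start: %d\n", n)
}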
Fair, but I can't find this issue in the previous 2 weeks in https://search.svc.ci.openshift.org/, so I'm lowering the severity and moving this to 4.3.
I wouldn't trust search that much; it seems broken - check the dates on https://prow.k8s.io/job-history/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2. I am not sure which team this belongs to, though - I guess Node (to make sure their operator stays degraded until the nodes are schedulable, since the tests wait for it not to be degraded).
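To make the "tests wait for it not to be degraded" part concrete, below is a rough, hypothetical sketch of a pre-test gate that polls ClusterOperator conditions until none reports Degraded=True. It assumes the github.com/openshift/client-go config clientset and a context-aware List call; the 10-minute timeout and gating on all operators are illustrative choices, not existing CI behaviour.

package main

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"
)

// anyOperatorDegraded returns true if any ClusterOperator reports Degraded=True.
func anyOperatorDegraded(ctx context.Context, c configclient.Interface) (bool, error) {
	operators, err := c.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, co := range operators.Items {
		for _, cond := range co.Status.Conditions {
			if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)
	// Assumed gate: do not start the e2e suite while any operator is degraded.
	err = wait.PollImmediate(10*time.Second, 10*time.Minute, func() (bool, error) {
		degraded, err := anyOperatorDegraded(context.TODO(), client)
		if err != nil {
			return false, nil // retry on transient API errors
		}
		return !degraded, nil
	})
	if err != nil {
		panic(fmt.Errorf("cluster operators did not settle: %w", err))
	}
	fmt.Println("no ClusterOperator is degraded; safe to start e2e")
}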
This job just started running on 8/16 and this test has never passed on this platform. Not sure how it can be a release blocker in that case.
https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-serial-4.2&sort-by-flakiness=

It isn't a flake; it is just not working at all and the test appears to be fragile in design. Tomas, any chance we could pick the on-demand behavior back?

At the very least, the monitoring operator should not be able to become ready until the workers are ready as the monitoring stack runs on workers, not masters. So it is strange that the e2e tests appear to start before all the workers are ready.

Regardless, there is no "Node operator". The closest thing is the MCO.

Posting test results here for convenience:

Aug 28 11:01:53.051: INFO: >>> kubeConfig: /tmp/admin.kubeconfig
Aug 28 11:01:53.329: INFO: >>> kubeConfig: /tmp/admin.kubeconfig
Aug 28 11:01:53.331: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 28 11:01:53.450: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 28 11:01:53.593: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 28 11:01:53.593: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 28 11:01:53.593: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 28 11:01:53.650: INFO: e2e test version: v1.14.0+e791baf
Aug 28 11:01:53.689: INFO: kube-apiserver version: v1.14.0+db14b69
[BeforeEach] [Top Level]
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:73
[BeforeEach] [sig-apps] Daemon set [Serial]
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:149
STEP: Creating a kubernetes client
Aug 28 11:01:53.692: INFO: >>> kubeConfig: /tmp/admin.kubeconfig
STEP: Building a namespace api object, basename daemonsets
Aug 28 11:01:53.917: INFO: About to run a Kube e2e test, ensuring namespace is privileged
Aug 28 11:01:54.467: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-apps] Daemon set [Serial]
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:102
[It] should rollback without unnecessary restarts [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:692
Aug 28 11:01:54.768: INFO: Conformance test suite needs a cluster with at least 2 nodes.
Aug 28 11:01:54.768: INFO: Create a RollingUpdate DaemonSet
Aug 28 11:01:54.819: INFO: Check that daemon pods launch on every node of the cluster
Aug 28 11:01:54.866: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-0 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:54.866: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-1 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:54.866: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-2 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:54.910: INFO: Number of nodes with available pods: 0
Aug 28 11:01:54.910: INFO: Node ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl is running more than one daemon pod
Aug 28 11:01:55.994: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-0 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:55.994: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-1 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:55.994: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-2 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:56.038: INFO: Number of nodes with available pods: 0
Aug 28 11:01:56.038: INFO: Node ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl is running more than one daemon pod
Aug 28 11:01:57.029: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-0 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:57.029: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-1 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:57.029: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-2 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:57.072: INFO: Number of nodes with available pods: 0
Aug 28 11:01:57.072: INFO: Node ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl is running more than one daemon pod
Aug 28 11:01:58.033: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-0 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:58.033: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-1 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:58.033: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-2 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:58.076: INFO: Number of nodes with available pods: 1
Aug 28 11:01:58.076: INFO: Node ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86 is running more than one daemon pod
Aug 28 11:01:58.991: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-0 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:58.991: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-1 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:58.991: INFO: DaemonSet pods can't tolerate node ci-op-hdkv5gfr-3a8ca-8cvhp-master-2 with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Aug 28 11:01:59.047: INFO: Number of nodes with available pods: 3
Aug 28 11:01:59.047: INFO: Number of running nodes: 3, number of available pods: 3
Aug 28 11:01:59.047: INFO: Update the DaemonSet to trigger a rollout
Aug 28 11:01:59.536: INFO: Updating DaemonSet daemon-set
[AfterEach] [sig-apps] Daemon set [Serial]
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:68
STEP: Deleting DaemonSet "daemon-set"
STEP: deleting DaemonSet.extensions daemon-set in namespace e2e-daemonsets-531, will wait for the garbage collector to delete the pods
Aug 28 11:02:03.877: INFO: Deleting DaemonSet.extensions daemon-set took: 59.944832ms
Aug 28 11:02:04.077: INFO: Terminating DaemonSet.extensions daemon-set pods took: 200.340876ms
Aug 28 11:02:10.119: INFO: Number of nodes with available pods: 0
Aug 28 11:02:10.119: INFO: Number of running nodes: 0, number of available pods: 0
Aug 28 11:02:10.163: INFO: daemonset: {"kind":"DaemonSetList","apiVersion":"apps/v1","metadata":{"selfLink":"/apis/apps/v1/namespaces/e2e-daemonsets-531/daemonsets","resourceVersion":"64509"},"items":null}
Aug 28 11:02:10.205: INFO: pods: {"kind":"PodList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/namespaces/e2e-daemonsets-531/pods","resourceVersion":"64509"},"items":null}
[AfterEach] [sig-apps] Daemon set [Serial]
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:150
STEP: Collecting events from namespace "e2e-daemonsets-531".
STEP: Found 24 events.
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:54 +0000 UTC - event for daemon-set: {daemonset-controller } SuccessfulCreate: Created pod: daemon-set-bcw5z
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:54 +0000 UTC - event for daemon-set: {daemonset-controller } SuccessfulCreate: Created pod: daemon-set-wns7g
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:54 +0000 UTC - event for daemon-set: {daemonset-controller } SuccessfulCreate: Created pod: daemon-set-mtb2b
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:54 +0000 UTC - event for daemon-set-bcw5z: {default-scheduler } Scheduled: Successfully assigned e2e-daemonsets-531/daemon-set-bcw5z to ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:54 +0000 UTC - event for daemon-set-mtb2b: {default-scheduler } Scheduled: Successfully assigned e2e-daemonsets-531/daemon-set-mtb2b to ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:54 +0000 UTC - event for daemon-set-wns7g: {default-scheduler } Scheduled: Successfully assigned e2e-daemonsets-531/daemon-set-wns7g to ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus3-gv9zd
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:56 +0000 UTC - event for daemon-set-bcw5z: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl} Started: Started container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:56 +0000 UTC - event for daemon-set-bcw5z: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl} Pulled: Container image "docker.io/library/nginx:1.14-alpine" already present on machine
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:56 +0000 UTC - event for daemon-set-bcw5z: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl} Created: Created container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:56 +0000 UTC - event for daemon-set-wns7g: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus3-gv9zd} Pulled: Container image "docker.io/library/nginx:1.14-alpine" already present on machine
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:57 +0000 UTC - event for daemon-set-mtb2b: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Pulled: Container image "docker.io/library/nginx:1.14-alpine" already present on machine
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:57 +0000 UTC - event for daemon-set-mtb2b: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Created: Created container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:57 +0000 UTC - event for daemon-set-wns7g: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus3-gv9zd} Started: Started container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:57 +0000 UTC - event for daemon-set-wns7g: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus3-gv9zd} Created: Created container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:58 +0000 UTC - event for daemon-set-mtb2b: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Started: Started container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:01:59 +0000 UTC - event for daemon-set: {daemonset-controller } SuccessfulDelete: Deleted pod: daemon-set-mtb2b
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:00 +0000 UTC - event for daemon-set-mtb2b: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Killing: Stopping container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:02 +0000 UTC - event for daemon-set: {daemonset-controller } SuccessfulCreate: Created pod: daemon-set-7jts9
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:02 +0000 UTC - event for daemon-set-7jts9: {default-scheduler } Scheduled: Successfully assigned e2e-daemonsets-531/daemon-set-7jts9 to ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:04 +0000 UTC - event for daemon-set-bcw5z: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus1-rm9bl} Killing: Stopping container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:04 +0000 UTC - event for daemon-set-wns7g: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus3-gv9zd} Killing: Stopping container app
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:05 +0000 UTC - event for daemon-set-7jts9: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Pulling: Pulling image "foo:non-existent"
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:06 +0000 UTC - event for daemon-set-7jts9: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Failed: Error: ErrImagePull
Aug 28 11:02:10.458: INFO: At 2019-08-28 11:02:06 +0000 UTC - event for daemon-set-7jts9: {kubelet ci-op-hdkv5gfr-3a8ca-8cvhp-worker-centralus2-qpf86} Failed: Failed to pull image "foo:non-existent": rpc error: code = Unknown desc = Error reading manifest non-existent in docker.io/library/foo: errors: denied: requested access to the resource is denied unauthorized: authentication required
Aug 28 11:02:10.500: INFO: POD NODE PHASE GRACE CONDITIONS
Aug 28 11:02:10.500: INFO:
Aug 28 11:02:10.544: INFO: skipping dumping cluster info - cluster too large
Aug 28 11:02:10.544: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-daemonsets-531" for this suite.
Aug 28 11:02:18.724: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Aug 28 11:02:22.665: INFO: namespace e2e-daemonsets-531 deletion completed in 12.077397749s
Aug 28 11:02:22.672: INFO: Running AfterSuite actions on all nodes
Aug 28 11:02:22.673: INFO: Running AfterSuite actions on node 1

fail [k8s.io/kubernetes/test/e2e/apps/daemon_set.go:425]: Expected
    <int>: 2
to equal
    <int>: 0
canary-openshift-ocp-installer-e2e-azure-serial-4.2 is in the release-informing category in testgrid, not the release-blocking category. Deferring to 4.3 by category.
Hit the same issue in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/89 ->
[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
This test is now run against 4.2 nightly as part of azure-serial, and is now being considered a blocking failure for 4.2 (branching for 4.3). Moving this back to 4.2 and moving to urgent. Please reach out to me or nstielau if you think this should not be the case.
> This test is now run against 4.2 nightly as part of azure-serial, and is now being considered a blocking failure for 4.2

This cannot be decided one week before final freeze :-/ azure-serial is also in the release-informing category in testgrid:
https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-serial-4.2&sort-by-flakiness=

We will not block the release for this e2e test failure. It fails because the test is fragile, not because there is something functionally wrong. This is also not a Node team bug. Sending back to kube-controller-manager.

> The tests detect the number of schedulable nodes when the e2e suite starts (the next kube version does this on demand), but that number was <2 when the e2e started although there are 3 worker nodes.

Tomas, can you pick back the on-demand behaviour you mentioned?
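For reference, the "on-demand behaviour" would roughly mean the DaemonSet test recomputing the schedulable-node count at the moment it runs, instead of trusting a value cached at suite start. A hedged sketch, building on the hypothetical countSchedulableNodes helper from the earlier comment (so it is not standalone); the function name is made up and the >= 2 requirement simply mirrors the conformance message in the log:

// Hypothetical on-demand variant: ask for the schedulable-node count right
// before the test needs it, instead of reusing a value captured at suite start.
func nodeCountForDaemonSetTest(ctx context.Context, c kubernetes.Interface) (int, error) {
	n, err := countSchedulableNodes(ctx, c) // fresh node list at test time
	if err != nil {
		return 0, err
	}
	if n < 2 {
		// A real test would skip or retry here rather than assert against a
		// stale count and fail with "Expected <int>: 2 to equal <int>: 0".
		return n, fmt.Errorf("need at least 2 schedulable nodes, found %d", n)
	}
	return n, nil
}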
This test has always been release blocking. We added Azure. It fails on Azure because of "takes longer to start up". I agree fixing the test to be more resilient is good. The remaining question is "why is machine-config for Azure reporting ready before all worker pools created on install are schedulable?". That decision needs to be made for all platforms (it's either a product requirement or not). That question must be decided in 4.2, either way, because then we can use it to enforce other rules.
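To make the machine-config question inspectable, here is a hypothetical diagnostic sketch that reads the worker MachineConfigPool status with the dynamic client and prints readyMachineCount vs machineCount; one could compare that against the schedulable-node count from the earlier sketch to see whether MCO reports ready before the workers are actually schedulable. The pool name "worker", the kubeconfig path, and the context-aware Get signature are assumptions.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigpools",
	}
	// MachineConfigPools are cluster-scoped; assume the default "worker" pool.
	pool, err := dyn.Resource(gvr).Get(context.TODO(), "worker", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	total, _, _ := unstructured.NestedInt64(pool.Object, "status", "machineCount")
	ready, _, _ := unstructured.NestedInt64(pool.Object, "status", "readyMachineCount")
	// Cross-check this against the schedulable-node count from the earlier
	// sketch to see when "ready" diverges from "schedulable".
	fmt.Printf("worker pool: %d/%d machines ready\n", ready, total)
}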
I picked the PR, but it doesn't fix the issue, only limits the occurrence rate. If Ginkgo starts that e2e soon enough, it detects the 0 schedulable nodes anyway. It really looks to me like the issue at hand is starting e2e on a cluster that is not ready yet.
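A hedged sketch of what "not starting e2e on a cluster that is not ready yet" could look like: a pre-suite gate that polls until the expected number of worker nodes is schedulable. It builds on the hypothetical countSchedulableNodes helper from the earlier comment and assumes the k8s.io/apimachinery/pkg/util/wait and time imports; the 15-minute timeout and the expectedWorkers parameter are illustrative, not existing CI behaviour.

// Hypothetical pre-suite gate: block until at least expectedWorkers nodes are
// schedulable, so Ginkgo never starts the [Serial] suite against 0 workers.
func waitForSchedulableWorkers(ctx context.Context, c kubernetes.Interface, expectedWorkers int) error {
	return wait.PollImmediate(10*time.Second, 15*time.Minute, func() (bool, error) {
		n, err := countSchedulableNodes(ctx, c)
		if err != nil {
			return false, nil // treat transient API errors as "not ready yet"
		}
		return n >= expectedWorkers, nil
	})
}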
"starting e2e on a cluster that is not ready yet" doesn't fully explain why this is different on Azure. Can we work with RHCOS team to look at the startup time and what they delay might be?
> "starting e2e on a cluster that is not ready yet" doesn't fully explain why this is different on Azure. Can we work with RHCOS team to look at the startup time and what they delay might be? Sure, can you file a new BZ for MCO team? This is already linked to the e2e test fix (+ different teams).
Confirmed with the latest jobs, the issue has been fixed:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/127
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/124
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/125
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922