https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20982/pull-ci-origin-conformance-k8s/10

We *must* get these tests passing extremely soon in 3.11 and master. This is not a deferrable bug.
(In reply to Clayton Coleman from comment #0)
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20982/pull-ci-origin-conformance-k8s/10
>
> We *must* get these tests passing extremely soon in 3.11 and master.
> This is not a deferrable bug.

While I am looking into this failure, I have a question. PR 20982 (https://github.com/openshift/origin/pull/20982) was opened on Sep 14 around 12:25 AM EDT, and this test failed there. However, another PR (https://github.com/openshift/origin/pull/20989), opened on Sep 14 around 1:22 PM EDT, after PR 20982, passed these conformance tests (https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20989/test_pull_request_origin_extended_conformance_install-release-3.11/26/). Does this mean that these conformance tests are not failing consistently, but are flaky?
This may or may not be related to this failure, but I see that the registry-console pod was failing during the tests:

Sep 14 06:47:35.551: INFO: registry-console-1-deploy ci-op-kt88gw9m-6531d-ig-m-nghp Failed [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-09-14 05:26:41 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2018-09-14 05:36:46 +0000 UTC ContainersNotReady containers with unready status: [deployment]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC ContainersNotReady containers with unready status: [deployment]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-09-14 05:26:41 +0000 UTC }]
So far, my understanding is that the tests are failing on the following:

Sep 14 06:42:25.473: INFO: Node ci-op-kt88gw9m-6531d-ig-m-nghp is running more than one daemon pod

whereas the test code expects exactly one pod per node. The relevant check is:

// Ensure that exactly 1 pod is running on all nodes in nodeNames.
for _, nodeName := range nodeNames {
    if nodesToPodCount[nodeName] != 1 {
        framework.Logf("Node %s is running more than one daemon pod", nodeName)
        return false, nil
    }
}

I will keep looking. So far I do not see any functional issue with the DaemonSet itself; the tests might be failing due to an infrastructure issue (node selectors, etc.) or a check like the one above, so I am still investigating. I am also wondering if it is possible to get access to the cluster where the tests run when they fail.
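For context, here is a minimal sketch (not the exact upstream helper) of how a nodesToPodCount map of this kind is built and checked; the function and parameter names are mine, not from the test. Note that the failing branch also fires when a node has zero daemon pods, which matters for the misleading log message discussed below.

package daemoncheck

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// checkExactlyOneDaemonPodPerNode returns true only when every node in
// nodeNames runs exactly one running, non-deleted daemon pod and no daemon
// pod is running anywhere else.
func checkExactlyOneDaemonPodPerNode(pods []v1.Pod, nodeNames []string) bool {
    nodesToPodCount := make(map[string]int)
    for _, pod := range pods {
        if pod.DeletionTimestamp != nil || pod.Status.Phase != v1.PodRunning {
            continue
        }
        nodesToPodCount[pod.Spec.NodeName]++
    }
    for _, nodeName := range nodeNames {
        if nodesToPodCount[nodeName] != 1 {
            // This also fires when the count is zero, which is why the
            // "running more than one daemon pod" log message is misleading.
            fmt.Printf("node %s runs %d daemon pods, expected exactly 1\n",
                nodeName, nodesToPodCount[nodeName])
            return false
        }
    }
    // Equal lengths mean no daemon pod landed on a node outside nodeNames.
    return len(nodesToPodCount) == len(nodeNames)
}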
So far I think I know what the issue is: the tests expect the DS's pods to be scheduled on the master, but somehow the DS pod is not getting scheduled there. I also think the following message is wrong/confusing:

Node ci-op-kt88gw9m-6531d-ig-m-nghp is running more than one daemon pod

because my understanding is that no DS pod is running on the master during the tests, but since the check is "nodesToPodCount[nodeName] != 1", the message above is erroneous. Also, even though the master is set to "Unschedulable:true", my understanding is that DS pods should still be scheduled there (as on any node set "Unschedulable:true"), but I am not sure why that is not happening. I will continue looking into it, but I am still wondering if there is a way to get access to the cluster where the tests fail.
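For reference, my understanding (assuming 1.11 behavior with the TaintNodesByCondition feature enabled) is that the DaemonSet controller adds a toleration like the following to the pods it creates, which is why a DS pod should still land on a node whose spec.unschedulable is true. This is a sketch of the toleration only, not the controller code:

package daemontoleration

import v1 "k8s.io/api/core/v1"

// unschedulableToleration is the toleration the DaemonSet controller is
// expected to add so its pods tolerate the taint that corresponds to a node
// being marked Unschedulable:true.
func unschedulableToleration() v1.Toleration {
    return v1.Toleration{
        Key:      "node.kubernetes.io/unschedulable",
        Operator: v1.TolerationOpExists,
        Effect:   v1.TaintEffectNoSchedule,
    }
}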
While the tests were running, I noticed the following messages on the master:

Sep 14 06:28:24 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:24.715057 21539 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
Sep 14 06:28:24 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:24.718597 21539 cloud_request_manager.go:108] Node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
Sep 14 06:28:33 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:33.343455 21539 container_manager_linux.go:428] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Sep 14 06:28:34 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:34.719401 21539 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
Sep 14 06:28:34 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:34.722988 21539 cloud_request_manager.go:108] Node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
Sep 14 06:28:44 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:44.723177 21539 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
Sep 14 06:28:44 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:44.726840 21539 cloud_request_manager.go:108] Node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected

I am wondering whether the above kept the master from scheduling more pods.
(In reply to Avesh Agarwal from comment #6)
> I am wondering whether the above kept the master from scheduling more pods.

I think the above is normal, so never mind.
I am also looking into the code to see whether CheckNodeUnschedulablePred is handled incorrectly and is causing this.
I am able to reproduce the issue on my AWS cluster. The problem is related to node selectors: the default global node selector is being applied to the test namespaces, causing a mismatch between desired and running pods, and that is why the tests are failing. I am not sure how these tests were passing previously; maybe we were not setting a default global node selector when running them. I am not sure upstream would accept code changes for this; most likely we will have to set the default global node selector in the master config file to empty, or in other words not set it at all.
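One way to confirm this diagnosis from outside the test is to check whether the generated test namespace carries a project node selector. This is a sketch only, assuming a 1.11-era client-go (Get without a context argument) and that the project node selector surfaces as the "openshift.io/node-selector" namespace annotation; the namespace name below is a placeholder, not the real generated name:

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // Placeholder for the generated e2e test namespace name.
    ns, err := client.CoreV1().Namespaces().Get("e2e-tests-daemonsets-xxxxx", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    sel, ok := ns.Annotations["openshift.io/node-selector"]
    fmt.Printf("openshift.io/node-selector present=%v value=%q\n", ok, sel)
}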
(In reply to Avesh Agarwal from comment #9)
> I am not sure upstream would accept code changes for this; most likely we
> will have to set the default global node selector in the master config file
> to empty, or in other words not set it at all.

I think I am still seeing the issue even after clearing the default node selector in the master config, so I am still looking.
Previously, after clearing the default node selector, I had only restarted the api-server, not the controllers. After restarting the controllers pod as well, the tests started passing. So the issue is indeed related to node selectors, and is mostly caused by vendor/k8s.io/kubernetes/pkg/controller/daemon/patch_nodeselector.go, which we carry in OpenShift to make the DaemonSet controller aware of the node selectors set by the admission plugin. I am looking into why it does not work as expected with the conformance tests.
(In reply to Avesh Agarwal from comment #11)
> Previously, after clearing the default node selector, I had only restarted
> the api-server, not the controllers. After restarting the controllers pod
> as well, the tests started passing.

Adding to this: the DS controller in OpenShift (the patched one that is aware of the node selector admission plugin) is working as expected. Since the DS tests in k8s are not aware of node selectors set by the admission plugin, they fail because they expect all nodes to be available for DS pods. The only solution so far is to not set a default global node selector so these tests pass. I am not sure we would want to patch the k8s conformance tests in OpenShift.
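For illustration, the idea behind the carried patch is roughly the following; this is a sketch, not the actual patch_nodeselector.go code, and the function and parameter names are mine. The controller only counts a node as eligible when it matches the project's node selector, so with a non-empty default selector the controller and the upstream test disagree about which nodes should run daemon pods:

package projectselector

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// nodeAllowedByProjectSelector reports whether a node's labels satisfy the
// project's node selector. An empty selector matches every node, which is
// why clearing the default global node selector restores the upstream
// expectation that a daemon pod can land on every node.
func nodeAllowedByProjectSelector(node *v1.Node, projectNodeSelector map[string]string) bool {
    return labels.SelectorFromSet(labels.Set(projectNodeSelector)).Matches(labels.Set(node.Labels))
}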
A conformance test in Kubernetes is not allowed to make assumptions about the topology of a cluster, so any test that assumes masters are in the cluster could be invalid. I actually think that, as we have applied the definition of conformance, a scheduling test that assumes it can schedule on all nodes is invalid. But I'd rather just get the test working. The e2e tests for scheduling should be using the empty node selector on their namespaces - are you not seeing this?
My understanding is that a workaround was merged into Origin to get this working.
https://github.com/openshift/origin/pull/21020
https://github.com/openshift/origin/pull/21033
Sorry, automatic pick to 3.11 didn't happen. https://github.com/openshift/origin/pull/21058
https://github.com/openshift/origin/pull/21058 is merged, so moving it to modified.
Moving to VERIFIED. In the new conformance run, the Daemon set tests pass:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/21040/test_pull_request_origin_extended_conformance_install/15723/?log#log

[sig-apps] Daemon set [Serial] should retry creating failed daemon pods [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
[sig-apps] Daemon set [Serial] should run and stop complex daemon [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
[sig-apps] Daemon set [Serial] should run and stop simple daemon [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.