Bug 1628998 - OpenShift fails conformance test suite on two daemonset tests in master (and 3.11)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.0
Assignee: Avesh Agarwal
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-14 15:32 UTC by Clayton Coleman
Modified: 2018-12-21 15:23 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-21 15:23:21 UTC
Target Upstream Version:
Embargoed:


Description Clayton Coleman 2018-09-14 15:32:47 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20982/pull-ci-origin-conformance-k8s/10

We *must* get these tests passing extremely soon in 3.11 and master.

This is not a deferrable bug.

Comment 1 Avesh Agarwal 2018-09-17 01:01:50 UTC
(In reply to Clayton Coleman from comment #0)
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/
> 20982/pull-ci-origin-conformance-k8s/10
> 
> We *must* get these tests passing extremely soon in 3.11 and master.
> 
> This is not a deferrable bug.

While I am

Comment 2 Avesh Agarwal 2018-09-17 01:09:33 UTC
(In reply to Clayton Coleman from comment #0)
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/
> 20982/pull-ci-origin-conformance-k8s/10
> 
> We *must* get these tests passing extremely soon in 3.11 and master.
> 
> This is not a deferrable bug.

While I am looking into this failure, I have a question: PR 20982 (https://github.com/openshift/origin/pull/20982) was opened on Sep 14 around 12:25 AM EDT and this test failed there. However, another PR (https://github.com/openshift/origin/pull/20989), opened after PR 20982 on Sep 14 around 1:22 PM EDT, shows these conformance tests passing (https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20989/test_pull_request_origin_extended_conformance_install-release-3.11/26/).

So does this mean that these conformance tests are not failing consistently, but are only flaky?

Comment 3 Avesh Agarwal 2018-09-17 01:12:48 UTC
This may or may not be related to this failure, but I see that the registry-console pod was failing during the tests:


Sep 14 06:47:35.551: INFO: registry-console-1-deploy                          ci-op-kt88gw9m-6531d-ig-m-nghp  Failed          [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-09-14 05:26:41 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2018-09-14 05:36:46 +0000 UTC ContainersNotReady containers with unready status: [deployment]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC ContainersNotReady containers with unready status: [deployment]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-09-14 05:26:41 +0000 UTC  }]

Comment 4 Avesh Agarwal 2018-09-17 12:16:23 UTC
So far, my understanding is that the tests are failing due to the following:

Sep 14 06:42:25.473: INFO: Node ci-op-kt88gw9m-6531d-ig-m-nghp is running more than one daemon pod

The test code, however, expects exactly one pod per node.

I am seeing the following check in the test code:


                // Ensure that exactly 1 pod is running on all nodes in nodeNames.
                for _, nodeName := range nodeNames {
                        if nodesToPodCount[nodeName] != 1 {
                                framework.Logf("Node %s is running more than one daemon pod", nodeName)
                                return false, nil
                        }
                }

I will keep looking further. As such, I don't see any functional issue with the DS controller itself; the tests might be failing either due to an infrastructure issue (node selectors, etc.) or due to an issue like the one above, so I am still looking.


I am wondering whether it is possible to get access to the cluster where the tests run when they fail?

Comment 5 Avesh Agarwal 2018-09-17 14:45:17 UTC
So far I think I know what the issue is: it sounds like the test expects its DS pods to be scheduled on the master, but somehow the DS pod is not getting scheduled there. I also think the following message is wrong/confusing:

Node ci-op-kt88gw9m-6531d-ig-m-nghp is running more than one daemon pod

My understanding is that no DS pod is running on the master during the tests, but since the check is "nodesToPodCount[nodeName] != 1", the message above is printed even when zero pods are running, so it is misleading.
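
For what it's worth, here is a minimal sketch (my own illustration, not the actual upstream test code) of how that check could report the two failure cases separately, so a node with zero daemon pods is not logged as running "more than one". It reuses the nodeNames, nodesToPodCount, and framework.Logf names from the snippet above:

                // Hypothetical variant of the check above: distinguish a node
                // with no daemon pod from a node with extra daemon pods.
                for _, nodeName := range nodeNames {
                        switch count := nodesToPodCount[nodeName]; {
                        case count == 0:
                                framework.Logf("Node %s is running no daemon pods", nodeName)
                                return false, nil
                        case count > 1:
                                framework.Logf("Node %s is running %d daemon pods", nodeName, count)
                                return false, nil
                        }
                }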

Also, even though the master is set to "Unschedulable: true", my understanding is that DS pods should still be scheduled on a node even when it is marked unschedulable, but I am not sure why that is not happening here.
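
Purely as an assumption on my part (I have not verified this against the 3.11 code), the way I understand daemon pods end up on unschedulable nodes is by tolerating the corresponding taint; with k8s.io/api/core/v1 imported as v1 and a hypothetical pod variable, roughly:

        // Assumption: the DaemonSet controller ensures its pods tolerate the
        // node.kubernetes.io/unschedulable taint, so a cordoned/unschedulable
        // node (such as a master here) can still receive a daemon pod.
        unschedulableToleration := v1.Toleration{
                Key:      "node.kubernetes.io/unschedulable",
                Operator: v1.TolerationOpExists,
                Effect:   v1.TaintEffectNoSchedule,
        }
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, unschedulableToleration)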

I will continue looking into it, but I am still wondering if there is a way to get access to a cluster where the tests fail.

Comment 6 Avesh Agarwal 2018-09-17 15:52:38 UTC
While the tests were being run, I noticed the following messages on the master:

Sep 14 06:28:24 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:24.715057   21539 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
Sep 14 06:28:24 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:24.718597   21539 cloud_request_manager.go:108] Node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
Sep 14 06:28:33 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:33.343455   21539 container_manager_linux.go:428] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Sep 14 06:28:34 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:34.719401   21539 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
Sep 14 06:28:34 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:34.722988   21539 cloud_request_manager.go:108] Node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
Sep 14 06:28:44 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:44.723177   21539 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
Sep 14 06:28:44 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914 06:28:44.726840   21539 cloud_request_manager.go:108] Node addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected


I am wondering if the above caused an issue on the master that prevented more pods from being scheduled.

Comment 7 Avesh Agarwal 2018-09-17 16:38:59 UTC
(In reply to Avesh Agarwal from comment #6)
> While the tests were being run, I noticed the following messages on the master:
> 
> Sep 14 06:28:24 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:24.715057   21539 cloud_request_manager.go:89] Requesting node
> addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
> Sep 14 06:28:24 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:24.718597   21539 cloud_request_manager.go:108] Node addresses from
> cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
> Sep 14 06:28:33 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:33.343455   21539 container_manager_linux.go:428] [ContainerManager]:
> Discovered runtime cgroups name: /system.slice/docker.service
> Sep 14 06:28:34 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:34.719401   21539 cloud_request_manager.go:89] Requesting node
> addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
> Sep 14 06:28:34 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:34.722988   21539 cloud_request_manager.go:108] Node addresses from
> cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
> Sep 14 06:28:44 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:44.723177   21539 cloud_request_manager.go:89] Requesting node
> addresses from cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp"
> Sep 14 06:28:44 ci-op-kt88gw9m-6531d-ig-m-nghp origin-node: I0914
> 06:28:44.726840   21539 cloud_request_manager.go:108] Node addresses from
> cloud provider for node "ci-op-kt88gw9m-6531d-ig-m-nghp" collected
> 
> 
> I am wondering if the above caused an issue on the master that prevented more
> pods from being scheduled.

I think the above is normal, so never mind.

Comment 8 Avesh Agarwal 2018-09-17 16:58:39 UTC
I am also looking into the code to see whether CheckNodeUnschedulablePred is not being handled correctly and is causing this.

Comment 9 Avesh Agarwal 2018-09-17 20:59:33 UTC
I am able to reproduce the issue on my AWS cluster. The problem is related to node selectors: the default global node selector is being assigned to the test namespaces, causing a mismatch between desired pods and running pods, and because of this the tests are failing. I am not sure how the tests were passing previously; maybe we were not setting a default global node selector when running these tests before?

I am not sure upstream would accept code changes for this; most likely we will have to set the default global node selector in the master config file to empty, or in other words not set it at all.
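
For context, this is roughly how I understand the project node selector takes effect (a simplified sketch, not the actual admission plugin code; the openshift.io/node-selector annotation key is real, effectiveNodeSelector is a hypothetical helper, and corev1 and labels are the k8s.io/api/core/v1 and k8s.io/apimachinery/pkg/labels packages):

        // Simplified illustration: a namespace with an explicit
        // openshift.io/node-selector annotation uses that value; otherwise the
        // cluster-wide default from the master config applies. The resulting
        // selector constrains every pod created in the namespace, including
        // the conformance tests' daemon pods.
        func effectiveNodeSelector(ns *corev1.Namespace, clusterDefault string) (labels.Selector, error) {
                value, ok := ns.Annotations["openshift.io/node-selector"]
                if !ok {
                        value = clusterDefault
                }
                // An empty selector string parses to a selector matching all nodes.
                return labels.Parse(value)
        }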

Comment 10 Avesh Agarwal 2018-09-17 21:15:03 UTC
(In reply to Avesh Agarwal from comment #9)
> I am able to reproduce the issue on my AWS cluster. The problem is related
> to node selectors: the default global node selector is being assigned to
> the test namespaces, causing a mismatch between desired pods and running
> pods, and because of this the tests are failing. I am not sure how the
> tests were passing previously; maybe we were not setting a default global
> node selector when running these tests before?
> 
> I am not sure upstream would accept code changes for this; most likely we
> will have to set the default global node selector in the master config
> file to empty, or in other words not set it at all.

I think I am still seeing the issue even after clearing the default node selector in the master config, so I am still looking.

Comment 11 Avesh Agarwal 2018-09-17 22:24:21 UTC
After clearing the default node selector I had previously restarted only the api-server, not the controllers; once I restarted the controllers pod, the tests started passing. So the issue is indeed related to node selectors, and is most likely caused by vendor/k8s.io/kubernetes/pkg/controller/daemon/patch_nodeselector.go, which we carry in OpenShift to make the DS controller aware of the node-selector admission plugin. I am looking into why it is not working as expected with the conformance tests.

Comment 12 Avesh Agarwal 2018-09-17 23:56:17 UTC
(In reply to Avesh Agarwal from comment #11)
> After clearing the default node selector I had previously restarted only
> the api-server, not the controllers; once I restarted the controllers pod,
> the tests started passing. So the issue is indeed related to node
> selectors, and is most likely caused by
> vendor/k8s.io/kubernetes/pkg/controller/daemon/patch_nodeselector.go,
> which we carry in OpenShift to make the DS controller aware of the
> node-selector admission plugin. I am looking into why it is not working as
> expected with the conformance tests.

Adding to this, the DS controller in OpenShift (the one patched to be aware of the node-selector admission plugin) is working as expected. Since the DS tests in k8s are not aware of node selectors set by the admission plugin, the tests fail because they expect all nodes to be available for DS pods. The only solution so far is to not set a default global node selector so that these tests pass. I am not sure if we would want to patch the k8s conformance tests in OpenShift.
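
To make the mismatch concrete, a rough sketch (illustrative names only, not the actual controller or test code): the upstream test effectively expects a daemon pod on every node it considers eligible, whereas the node-selector-aware controller only targets nodes whose labels also satisfy the namespace's selector, so the two counts disagree whenever a default global node selector is set. Using the same corev1 and labels packages as above:

        // Nodes the patched DS controller actually targets: a node must match
        // both the DaemonSet's own node selector and the namespace's selector.
        // The upstream conformance test, by contrast, counts all nodes it
        // considers schedulable, which is a larger set when a default global
        // node selector is in effect.
        func nodesTargetedByDS(nodes []corev1.Node, dsSelector, nsSelector labels.Selector) []string {
                var targeted []string
                for _, node := range nodes {
                        nodeLabels := labels.Set(node.Labels)
                        if dsSelector.Matches(nodeLabels) && nsSelector.Matches(nodeLabels) {
                                targeted = append(targeted, node.Name)
                        }
                }
                return targeted
        }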

Comment 13 Clayton Coleman 2018-09-18 13:10:16 UTC
A conformance test in Kubernetes is not allowed to make assumptions about the topology of a cluster, so any test that assumes masters are part of the cluster could be invalid. I actually think that, as we have applied the definition of conformance, having a scheduler test that assumes it can schedule on all nodes is invalid. But I'd rather just get the test working.

The e2e tests for scheduling should be using the empty node selector on their namespaces - are you not seeing this?
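
Concretely, what I'd expect the namespace setup to do is roughly the following (a sketch, assuming an explicit empty openshift.io/node-selector annotation overrides the project-wide default, with ns being a *corev1.Namespace):

        // Sketch: give the test namespace an explicit empty node selector so
        // the cluster-wide default from the master config is not applied to it.
        if ns.Annotations == nil {
                ns.Annotations = map[string]string{}
        }
        ns.Annotations["openshift.io/node-selector"] = ""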

Comment 14 Seth Jennings 2018-09-18 15:36:27 UTC
My understanding is that a workaround was merged into Origin to get this working.

Comment 15 Seth Jennings 2018-09-18 15:37:09 UTC
https://github.com/openshift/origin/pull/21020

Comment 17 Avesh Agarwal 2018-09-19 16:13:28 UTC
https://github.com/openshift/origin/pull/21033

Comment 18 Seth Jennings 2018-09-20 13:36:14 UTC
Sorry, automatic pick to 3.11 didn't happen.

https://github.com/openshift/origin/pull/21058

Comment 19 Avesh Agarwal 2018-09-20 17:22:50 UTC
https://github.com/openshift/origin/pull/21058 is merged, so moving this to MODIFIED.

Comment 21 DeShuai Ma 2018-09-21 10:03:08 UTC
Moving to verified.
In the new conformance test run, the Daemon set tests passed.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/21040/test_pull_request_origin_extended_conformance_install/15723/?log#log


[sig-apps] Daemon set [Serial] should retry creating failed daemon pods [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

[sig-apps] Daemon set [Serial] should run and stop complex daemon [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

[sig-apps] Daemon set [Serial] should run and stop simple daemon [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

Comment 23 Luke Meyer 2018-12-21 15:23:21 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.

