Bug 1548641 - App pods should not be scheduled on the master node, whether in a fresh install or after an upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Fabian von Feilitzsch
QA Contact: liujia
URL:
Whiteboard:
Duplicates: 1556970
Depends On: 1557345
Blocks:
 
Reported: 2018-02-24 05:50 UTC by liujia
Modified: 2018-03-27 09:49 UTC
CC: 18 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1554828
Environment:
Last Closed: 2018-03-16 16:55:48 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1554828 0 high CLOSED During upgrade some masters are labeled node-role.kubernetes.io/compute=true 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2018:0489 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.9 RPM Release Advisory 2018-03-28 18:06:38 UTC

Internal Links: 1554828

Description liujia 2018-02-24 05:50:17 UTC
Description of problem:
After upgrading OCP from v3.7 to v3.9, some app pods other than the web console are scheduled on the master node.

before upgrade:
# oc get node
NAME                                STATUS                     AGE       VERSION
qe-jliu-r1-master-etcd-1            Ready,SchedulingDisabled   8m        v1.7.6+a08f5eeb62
qe-jliu-r1-node-registry-router-1   Ready                      8m        v1.7.6+a08f5eeb62

# oc get pod -o wide --all-namespaces
NAMESPACE      NAME                             READY     STATUS      RESTARTS   AGE       IP            NODE
default        docker-registry-1-x5bvp          1/1       Running     0          4m        10.129.0.4    qe-jliu-r1-node-registry-router-1
default        registry-console-1-5cl58         1/1       Running     0          3m        10.129.0.6    qe-jliu-r1-node-registry-router-1
default        router-1-kflpp                   1/1       Running     0          5m        10.240.0.46   qe-jliu-r1-node-registry-router-1
install-test   mongodb-1-gmr48                  1/1       Running     0          3m        10.129.0.9    qe-jliu-r1-node-registry-router-1
install-test   nodejs-mongodb-example-1-build   0/1       Completed   0          3m        10.129.0.8    qe-jliu-r1-node-registry-router-1
install-test   nodejs-mongodb-example-1-qnwgs   1/1       Running     0          1m        10.129.0.11   qe-jliu-r1-node-registry-router-1

after upgrade:
# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-jliu-r1-master-etcd-1            Ready     master    43m       v1.9.1+a0ce1bc657
qe-jliu-r1-node-registry-router-1   Ready     <none>    43m       v1.9.1+a0ce1bc657

# oc get pod -o wide --all-namespaces
NAMESPACE               NAME                             READY     STATUS             RESTARTS   AGE       IP            NODE
default                 docker-registry-2-qg769          1/1       Running            0          5m        10.129.0.2    qe-jliu-r1-node-registry-router-1
default                 registry-console-2-4t6zg         1/1       Running            0          14m       10.128.0.4    qe-jliu-r1-master-etcd-1
default                 router-2-47qgj                   1/1       Running            0          5m        10.240.0.46   qe-jliu-r1-node-registry-router-1
install-test            mongodb-1-k967r                  1/1       Running            0          5m        10.128.0.5    qe-jliu-r1-master-etcd-1
install-test            nodejs-mongodb-example-1-gglxw   0/1       ImagePullBackOff   0          5m        10.128.0.20   qe-jliu-r1-master-etcd-1
openshift-web-console   webconsole-54877f6577-v9vkv      1/1       Running            0          15m       10.128.0.2    qe-jliu-r1-master-etcd-1


Version-Release number of the following components:
openshift-ansible-3.9.0-0.51.0.git.0.e26400f.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Upgrade v3.7 to v3.9
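For reference, the upgrade is typically driven by the openshift-ansible upgrade playbook, along these lines (a sketch only; the exact playbook path depends on the openshift-ansible release, so treat it as an assumption):

# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml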

Actual results:
App pods were scheduled on the master node.

Expected results:
Only the web console should be scheduled on the master node.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 weiwei jiang 2018-02-24 06:02:49 UTC
FYI https://github.com/openshift/openshift-ansible/pull/6949

So this should be working as designed.

Comment 2 Scott Dodson 2018-02-25 20:25:22 UTC
What we intend to do here is:

if osm_default_node_selector is defined
  complete upgrade
else
  label all non master non infra nodes node-role.kubernetes.io/compute=true
  set default node selector = 'node-role.kubernetes.io/compute=true'
  complete upgrade

If automating the labeling proves too challenging, we'll instead block the upgrade and link to documentation that explains the scheduling changes, advising the admin how to label their nodes and then set the inventory variable to unblock the upgrade.
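Expressed as the equivalent manual steps (a rough sketch only, not the actual Ansible tasks; <node-name> stands for each non-master, non-infra node):

# oc label node <node-name> node-role.kubernetes.io/compute=true

and then in /etc/origin/master/master-config.yaml:

projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true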

Comment 3 Johnny Liu 2018-03-01 02:31:34 UTC
Some background on where this bug comes from:
https://bugzilla.redhat.com/show_bug.cgi?id=1539691#c10

According to the discussion in BZ#1539691, this issue happens not only in an upgraded environment but also in a fresh install. Masters are now schedulable, and "taint master nodes" is still being discussed, so any pod may be scheduled onto master nodes.

This fix should consider both upgrade and fresh install. Based on that, I have updated the bug summary.

Comment 4 Fabian von Feilitzsch 2018-03-02 19:40:28 UTC
Fix for the upgrade is here; still looking into install:

https://github.com/openshift/openshift-ansible/pull/7364

Comment 5 openshift-github-bot 2018-03-08 06:44:50 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/236eb827f8010271807bb30d4b9a108eab88cf03
Bug 1548641- upgrade now properly sets labels and selectors

https://github.com/openshift/openshift-ansible/commit/791a6eb30427283dd8c8d30cfb7986fd25a6a704
Merge pull request #7364 from fabianvf/bz1548641

Automatic merge from submit-queue.

Bug 1548641- upgrade now properly sets labels and selectors

https://bugzilla.redhat.com/show_bug.cgi?id=1548641

Comment 10 Johnny Liu 2018-03-09 09:13:26 UTC
For fresh install, verified this bug with openshift-ansible-3.9.4-1.git.0.a49cc04.el7.noarch, and it PASSED.

1. For nodes without the 'region=infra' label, the installer adds a 'node-role.kubernetes.io/compute=true' label; no label operation is performed on 'region=infra' nodes.
# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.10   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.11   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.13   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.15   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.16   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.7    Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.9    Ready     master    1h        v1.9.1+a0ce1bc657

# oc get nodes -l node-role.kubernetes.io/compute=true
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.13   Ready     compute   2h        v1.9.1+a0ce1bc657
192.168.100.7    Ready     compute   2h        v1.9.1+a0ce1bc657

# oc get nodes -l region=infra
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.10   Ready     <none>    2h        v1.9.1+a0ce1bc657
192.168.100.16   Ready     <none>    2h        v1.9.1+a0ce1bc657

2. If osm_default_node_selector is not defined, the following setting appears in the master config file.
projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true

If osm_default_node_selector is defined, e.g. osm_default_node_selector=role=node,region=primary, then it is set to the user-defined value.
projectConfig:
  defaultNodeSelector: role=node,region=primary
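For reference, a minimal inventory snippet for the user-defined case would look like this (the [OSEv3:vars] group name is the standard openshift-ansible inventory convention):

[OSEv3:vars]
osm_default_node_selector=role=node,region=primary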


3. All pods without a node selector definition use the 'defaultNodeSelector' setting for scheduling (see the spot check after the pod listing below).
When osm_default_node_selector is not defined:
# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.10   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.11   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.13   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.15   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.16   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.7    Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.9    Ready     master    1h        v1.9.1+a0ce1bc657

# oc get po --all-namespaces -o wide
NAMESPACE                           NAME                               READY     STATUS              RESTARTS   AGE       IP               NODE
147i2                               database-1-deploy                  0/1       ContainerCreating   0          3s        <none>           192.168.100.13
9mh7y                               hooks-1-deploy                     1/1       Running             0          21s       10.2.12.244      192.168.100.13
9mh7y                               hooks-1-q9dxj                      0/1       ContainerCreating   0          18s       <none>           192.168.100.13
default                             docker-registry-1-h2jpp            0/1       CrashLoopBackOff    4          59m       10.2.8.3         192.168.100.10
default                             docker-registry-1-nfhjr            0/1       Running             4          59m       10.2.6.3         192.168.100.16
default                             registry-console-1-hkc29           0/1       Running             2          57m       10.2.6.4         192.168.100.16
default                             router-1-chqf8                     1/1       Running             0          1h        192.168.100.10   192.168.100.10
default                             router-1-dcjv7                     1/1       Running             0          1h        192.168.100.16   192.168.100.16
install-test                        mongodb-1-lv66g                    1/1       Running             2          52m       10.2.10.5        192.168.100.7
install-test                        nodejs-mongodb-example-1-build     0/1       Completed           0          52m       10.2.10.4        192.168.100.7
install-test                        nodejs-mongodb-example-1-pm26p     0/1       Running             2          51m       10.2.10.6        192.168.100.7
kube-service-catalog                apiserver-rw7dp                    1/1       Running             0          55m       10.2.0.4         192.168.100.11
kube-service-catalog                controller-manager-z6pb7           1/1       Running             0          55m       10.2.0.5         192.168.100.11
muvlq                               hooks-1-4pv8d                      1/1       Running             0          2m        10.2.10.91       192.168.100.7
muvlq                               hooks-1-lpxln                      1/1       Terminating         0          2m        10.2.12.241      192.168.100.13
muvlq                               hooks-1-q9z7c                      1/1       Running             0          3m        10.2.12.214      192.168.100.13
muvlq                               hooks-2-deploy                     1/1       Running             0          33s       10.2.10.92       192.168.100.7
muvlq                               hooks-2-js4mm                      0/1       ContainerCreating   0          17s       <none>           192.168.100.13
muvlq                               hooks-2-kqj86                      1/1       Running             0          28s       10.2.10.93       192.168.100.7
openshift-ansible-service-broker    asb-1-bjf59                        0/1       CrashLoopBackOff    6          53m       10.2.10.3        192.168.100.7
openshift-ansible-service-broker    asb-etcd-1-8g2lx                   1/1       Running             0          53m       10.2.12.3        192.168.100.13
openshift-template-service-broker   apiserver-5khf7                    0/1       Running             0          54m       10.2.6.5         192.168.100.16
openshift-template-service-broker   apiserver-txm7g                    0/1       Running             0          54m       10.2.8.4         192.168.100.10
openshift-web-console               webconsole-74f5ddb69c-742rc        0/1       Running             0          58m       10.2.0.3         192.168.100.11
openshift-web-console               webconsole-74f5ddb69c-n5482        0/1       Running             0          58m       10.2.2.2         192.168.100.15
openshift-web-console               webconsole-74f5ddb69c-wcwcq        0/1       Running             0          58m       10.2.4.2         192.168.100.9
tmwl9                               postgresql-1-rzq26                 0/1       Running             2          6m        10.2.10.63       192.168.100.7
tmwl9                               rails-postgresql-example-2-build   1/1       Running             0          6m        10.2.12.185      192.168.100.13
ytm3h                               git-3-4t8ts                        0/1       Running             0          4m        10.2.12.205      192.168.100.13
ytm3h                               gitserver-2-wg9qj                  1/1       Running             0          6m        10.2.12.180      192.168.100.13
ytm3h                               ruby-hello-world-1-deploy          1/1       Running             0          3m        10.2.12.239      192.168.100.13
ytm3h                               ruby-hello-world-1-wfbl7           0/1       ImagePullBackOff    0          2m        10.2.12.240      192.168.100.13
ytm3h                               ruby-hello-world-2-build           0/1       Completed           0          4m        10.2.10.71       192.168.100.7
ytm3h                               ruby-hello-world-3-build           1/1       Running             0          2m        10.2.10.90       192.168.100.7
ytrf0                               git-server-2-deploy                1/1       Running             0          3m        10.2.12.233      192.168.100.13
ytrf0                               git-server-2-kng2h                 0/1       ContainerCreating   0          3m        10.2.12.235      192.168.100.13
ytrf0                               ruby-hello-world-1-build           0/1       Error               0          3m        10.2.10.76       192.168.100.7
z6o44                               pod-add-chown                      0/1       ContainerCreating   0          0s        <none>           192.168.100.13
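
As a spot check for point 3 (illustrative only; the project name comes from the listing above), a project that carries no openshift.io/node-selector annotation of its own falls back to the defaultNodeSelector from the master config:

# oc describe project install-test | grep -i "node selector"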


So the fix looks good for fresh install; what remains is the upgrade part, and its verification is still in progress.

When osm_default_node_selector="role=node,region=primary" is set:
# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.17   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.18   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.19   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.22   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.23   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.4    Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.8    Ready     compute   1h        v1.9.1+a0ce1bc657

# oc get nodes -l role=node,region=primary
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.22   Ready     compute   1h        v1.9.1+a0ce1bc657

# oc get po --all-namespaces -o wide
NAMESPACE                           NAME                             READY     STATUS      RESTARTS   AGE       IP               NODE
cl-auto-reg-ha                      mongodb-1-rrkv6                  1/1       Running     0          33m       11.0.5.30        192.168.100.22
cl-auto-reg-ha                      nodejs-mongodb-example-1-build   0/1       Completed   0          33m       11.0.5.29        192.168.100.22
cl-auto-reg-ha                      nodejs-mongodb-example-1-cxvlm   1/1       Running     0          30m       11.0.5.32        192.168.100.22
default                             docker-registry-1-cxwbd          1/1       Running     0          31m       11.0.6.6         192.168.100.17
default                             docker-registry-1-n5bvr          1/1       Running     0          31m       11.0.2.5         192.168.100.8
default                             registry-console-1-rv7wc         1/1       Running     1          1h        11.0.5.19        192.168.100.22
default                             router-1-ksrt6                   1/1       Running     0          28m       192.168.100.8    192.168.100.8
default                             router-1-r4plq                   1/1       Running     0          28m       192.168.100.18   192.168.100.18
install-test                        mongodb-1-89w8h                  1/1       Running     1          1h        11.0.5.20        192.168.100.22
install-test                        myapp-485-1-mmrfr                1/1       Running     0          34m       11.0.5.27        192.168.100.22
install-test                        nodejs-mongodb-example-4-build   0/1       Completed   0          40m       11.0.5.23        192.168.100.22
install-test                        nodejs-mongodb-example-4-nfsll   1/1       Running     0          38m       11.0.5.25        192.168.100.22
kube-service-catalog                apiserver-mnj7s                  1/1       Running     1          1h        11.0.0.7         192.168.100.4
kube-service-catalog                controller-manager-hjghk         1/1       Running     2          1h        11.0.0.5         192.168.100.4
openshift-ansible-service-broker    asb-1-n8xrz                      1/1       Running     9          1h        11.0.5.21        192.168.100.22
openshift-ansible-service-broker    asb-etcd-1-5fl2z                 1/1       Running     1          1h        11.0.5.22        192.168.100.22
openshift-template-service-broker   apiserver-8f474                  1/1       Running     1          1h        11.0.3.5         192.168.100.18
openshift-template-service-broker   apiserver-ww8hv                  1/1       Running     1          1h        11.0.6.5         192.168.100.17
openshift-template-service-broker   apiserver-xkp2m                  1/1       Running     1          1h        11.0.2.4         192.168.100.8
openshift-web-console               webconsole-74f5ddb69c-7d8ww      1/1       Running     1          1h        11.0.4.3         192.168.100.23
openshift-web-console               webconsole-74f5ddb69c-9q2bn      1/1       Running     1          1h        11.0.0.6         192.168.100.4
openshift-web-console               webconsole-74f5ddb69c-zt5hw      1/1       Running     1          1h        11.0.1.4         192.168.100.19

Comment 11 Johnny Liu 2018-03-10 01:31:17 UTC
That means a cluster that has only 'region=infra' nodes will fail to deploy app pods, because no node matches the selector for scheduling.

Comment 13 liujia 2018-03-12 08:04:34 UTC
Scenario 1: Upgrade against OCP with a default node selector configured

Version:
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

Steps:
1. HA install OCP v3.7 with osm_default_node_selector: 'region=primary' set in the inventory file.

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
  defaultNodeSelector: region=primary

# oc get node
NAME                        STATUS                     AGE       VERSION
qe-jliu-ha-master-etcd-1    Ready,SchedulingDisabled   3h        v1.7.6+a08f5eeb62
qe-jliu-ha-master-etcd-2    Ready,SchedulingDisabled   3h        v1.7.6+a08f5eeb62
qe-jliu-ha-master-etcd-3    Ready,SchedulingDisabled   3h        v1.7.6+a08f5eeb62
qe-jliu-ha-node-primary-1   Ready                      3h        v1.7.6+a08f5eeb62
qe-jliu-ha-node-primary-2   Ready                      3h        v1.7.6+a08f5eeb62
qe-jliu-ha-nrri-1           Ready                      3h        v1.7.6+a08f5eeb62
qe-jliu-ha-nrri-2           Ready                      3h        v1.7.6+a08f5eeb62

# oc get nodes -l region=infra
NAME                STATUS    AGE       VERSION
qe-jliu-ha-nrri-1   Ready     4h        v1.7.6+a08f5eeb62
qe-jliu-ha-nrri-2   Ready     4h        v1.7.6+a08f5eeb62

# oc get nodes -l region=primary
NAME                        STATUS    AGE       VERSION
qe-jliu-ha-node-primary-1   Ready     4h        v1.7.6+a08f5eeb62
qe-jliu-ha-node-primary-2   Ready     4h        v1.7.6+a08f5eeb62

2. Trigger the upgrade with the inventory file from step 1

Expected results:
Masters were schedulable and the web console was running on the master nodes.
# oc get pod -o wide -n openshift-web-console
NAME                          READY     STATUS    RESTARTS   AGE       IP         NODE
webconsole-776767c6f4-l6fqz   1/1       Running   0          41m       10.2.0.4   qe-jliu-ha-master-etcd-1
webconsole-776767c6f4-rcxcx   1/1       Running   0          41m       10.2.2.2   qe-jliu-ha-master-etcd-2
webconsole-776767c6f4-xqx9p   1/1       Running   0          41m       10.2.4.2   qe-jliu-ha-master-etcd-3

The default node selector was the same as it was before the upgrade.
# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
  defaultNodeSelector: region=primary

Unexpected results:
No compute labels should be added to any nodes when a default node selector is defined, but in fact the non-infra nodes and two of the masters received the compute label.

# oc get nodes
NAME                        STATUS    ROLES            AGE       VERSION
qe-jliu-ha-master-etcd-1    Ready     master           6h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-2    Ready     compute,master   6h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-3    Ready     compute,master   6h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-1   Ready     compute          6h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-2   Ready     compute          6h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-1           Ready     <none>           6h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-2           Ready     <none>           6h        v1.9.1+a0ce1bc657

So I am assigning the bug back.

Comment 14 liujia 2018-03-12 08:36:02 UTC
Scenario 2: Upgrade against OCP without a default node selector configured

Version:
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

Steps:
1. Non-HA containerized install of OCP v3.7 without a default node selector
2. Trigger the upgrade against the above OCP

Result:
The compute node selector was not added to the master config (defaultNodeSelector remained empty).
App pods were still scheduled on the master node rather than on the compute node.

# oc get node
NAME                               STATUS    ROLES     AGE       VERSION
qe-jliu-c-master-etcd-1            Ready     master    1h        v1.9.1+a0ce1bc657
qe-jliu-c-node-registry-router-1   Ready     compute   1h        v1.9.1+a0ce1bc657

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
  defaultNodeSelector: ''

# oc get pod -o wide --all-namespaces
NAMESPACE                           NAME                             READY     STATUS             RESTARTS   AGE       IP            NODE
default                             docker-registry-3-d28pp          1/1       Running            0          35m       10.129.0.29   qe-jliu-c-node-registry-router-1
default                             registry-console-2-rtrqh         1/1       Running            0          49m       10.128.0.7    qe-jliu-c-master-etcd-1
default                             router-2-tz7sp                   1/1       Running            0          35m       10.240.0.7    qe-jliu-c-node-registry-router-1
install-test                        mongodb-1-9xndk                  1/1       Running            0          35m       10.128.0.11   qe-jliu-c-master-etcd-1
install-test                        nodejs-mongodb-example-1-cxt9v   1/1       Running            0          35m       10.128.0.28   qe-jliu-c-master-etcd-1
kube-service-catalog                apiserver-qpsn6                  1/1       Running            0          44m       10.128.0.8    qe-jliu-c-master-etcd-1
kube-service-catalog                controller-manager-n4b5g         1/1       Running            0          44m       10.128.0.9    qe-jliu-c-master-etcd-1
openshift-template-service-broker   apiserver-5fgxf                  0/1       ImagePullBackOff   0          43m       10.129.0.34   qe-jliu-c-node-registry-router-1
openshift-template-service-broker   apiserver-87wht                  1/1       Running            1          1h        10.128.0.4    qe-jliu-c-master-etcd-1
openshift-web-console               webconsole-776767c6f4-tvbhd      1/1       Running            0          50m       10.128.0.5    qe-jliu-c-master-etcd-1

Comment 15 liujia 2018-03-12 08:42:19 UTC
@Fabian von Feilitzsch 

For the upgrade verification I covered two basic scenarios, and neither of them works. Scenario 2 (comment 14) follows the same steps as the original issue (description). Scenario 1 (comment 13) is a supplementary test for the new change. I tracked them together here; if you need them tracked separately, I will file a new bug for scenario 1.

Comment 17 Gan Huang 2018-03-12 10:08:45 UTC
No issues found for master/node scaling up.

Comment 18 Scott Dodson 2018-03-12 12:36:15 UTC
In Scenario 1 the only thing that concerns me is why we added the compute label to qe-jliu-ha-master-etcd-2 and qe-jliu-ha-master-etcd-3 if they're already masters. Our intent is to apply the compute label even when it's not used as the default node selector, as it will become required for other functionality in the future.

In Scenario 2 the thing we need to fix is setting the default node selector in the master config.
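Concretely, the upgrade should leave the master config with the compute selector whenever the user has not set one (this is the value verified later in comment 27):

projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true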

Comment 19 Fabian von Feilitzsch 2018-03-12 21:26:06 UTC
Fix for scenario 2:

https://github.com/openshift/openshift-ansible/pull/7501

Comment 20 Fabian von Feilitzsch 2018-03-12 21:27:18 UTC
@Scott, should we break scenario 1 into a separate bug and mark this one modified?

Comment 21 Junqi Zhao 2018-03-13 01:02:24 UTC
I see the default node selector is node-role.kubernetes.io/compute=true, and there is a problem; see the following:

1) Create one project named 'hosa' whose node-selector is ""
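(Presumably created with something like the following; the exact command used is not shown in this report:)

# oc adm new-project hosa --node-selector=""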
# oc get project -n hosa -o yaml
apiVersion: v1
items:
- apiVersion: project.openshift.io/v1
  kind: Project
  metadata:
    annotations:
      openshift.io/node-selector: ""
      openshift.io/sa.initialized-roles: "true"
      openshift.io/sa.scc.mcs: s0:c1,c0
      openshift.io/sa.scc.supplemental-groups: 1000000000/10000
      openshift.io/sa.scc.uid-range: 1000000000/10000
    creationTimestamp: 2018-03-12T06:41:00Z
*********************snipped**************************************

2) I have a hawkular-openshift-agent DaemonSet whose node selector is not defined, which means it should start one pod on each node. However, there is no "node-role.kubernetes.io/compute=true" label on the master node; the DS pod is effectively checked against the "node-role.kubernetes.io/compute=true" node selector, the check fails, and the hawkular-openshift-agent pod on the master node is recreated continuously (a defect is filed: https://bugzilla.redhat.com/show_bug.cgi?id=1543727). See the events (and the sketch at the end of this comment):
# oc get event
54m         54m          1         hawkular-openshift-agent-l88s4.151b51d4b370b454   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8bmg.151b4f89ec04badc   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
18m         18m          1         hawkular-openshift-agent-l8bnq.151b53d58897ac5d   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
18m         18m          1         hawkular-openshift-agent-l8cdn.151b53d37068bc5b   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8ct8.151b50c721482011   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
49m         49m          1         hawkular-openshift-agent-l8f9g.151b521a453dd798   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8gfg.151b5147fa1961db   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
31m         31m          1         hawkular-openshift-agent-l8gk9.151b531a79eff10e   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
18m         18m          1         hawkular-openshift-agent-l8hdz.151b53d2464b812a   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8hxs.151b4f15bcaf15af   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8m97.151b50156eaba408   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8mwv.151b4f3e049f205a   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8ndn.151b4f65e0b4f99a   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8nsz.151b4f34c26ea5c4   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed
1h          1h           1         hawkular-openshift-agent-l8r9w.151b4e61aaf37da9   Pod                     Warning   MatchNodeSelector   kubelet, 172.16.120.93   Predicate MatchNodeSelector failed


# oc get ds -n hosa
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hawkular-openshift-agent   2         2         1         2            1           <none>          15h


# oc get po -n hosa -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
hawkular-openshift-agent-6s6x4   1/1       Running   0          16h       10.129.0.26   172.16.120.78
hawkular-openshift-agent-j2wq4   0/1       Pending   0          0s        <none>        172.16.120.93

# oc get po -n hosa -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
hawkular-openshift-agent-6s6x4   1/1       Running   0          16h       10.129.0.26   172.16.120.78
hawkular-openshift-agent-hrbhf   0/1       Pending   0          0s        <none>        172.16.120.93


# oc get node --show-labels
NAME            STATUS    ROLES     AGE       VERSION             LABELS
172.16.120.78   Ready     compute   18h       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=431ac1fb-1463-4527-b3d1-79245dd698e1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=172.16.120.78,node-role.kubernetes.io/compute=true,registry=enabled,role=node,router=enabled
172.16.120.93   Ready     master    18h       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=431ac1fb-1463-4527-b3d1-79245dd698e1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=172.16.120.93,node-role.kubernetes.io/master=true,openshift-infra=apiserver,role=node
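
The mismatch described above can be seen directly (a sketch using the names from this comment): the cluster-wide default selector requires node-role.kubernetes.io/compute=true, while the master carries no such label, which matches the failed MatchNodeSelector events; the second command returns nothing on the master:

# grep defaultNodeSelector /etc/origin/master/master-config.yaml
# oc get node 172.16.120.93 --show-labels | grep compute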

Comment 22 DeShuai Ma 2018-03-13 07:56:06 UTC
Important: the issue of pods being recreated by the DS controller will cause high workload in the cluster. https://bugzilla.redhat.com/show_bug.cgi?id=1501514

Comment 23 Scott Dodson 2018-03-13 12:44:30 UTC
(In reply to Fabian von Feilitzsch from comment #20)
> @Scott, should we break scenario 1 into a separate bug and mark this one
> modified?

Sure, can you go ahead and create a fork of this bug that deals specifically with Scenario 1? Marking this one MODIFIED as https://github.com/openshift/openshift-ansible/pull/7501 has merged.

Comment 24 Scott Dodson 2018-03-13 12:49:04 UTC
Never mind, I created a bug for Scenario 1: https://bugzilla.redhat.com/show_bug.cgi?id=1554828

Comment 25 Fabian von Feilitzsch 2018-03-13 15:53:56 UTC
Small addendum: there was a second issue masked by the first: https://github.com/openshift/openshift-ansible/pull/7508

Comment 26 openshift-github-bot 2018-03-14 15:24:34 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/8ed2940fcabd39ecbc8ce9224a332460b9c9b75f
Bug 1548641- Correct arguments to yedit

https://github.com/openshift/openshift-ansible/commit/2c2cfbcb13000d7b4ccdf7f4368429a513969613
Merge pull request #7529 from openshift-cherrypick-robot/cherry-pick-7508-to-master

[master] Bug 1548641- Correct arguments to yedit

Comment 27 liujia 2018-03-15 09:26:48 UTC
Scenario 2: Upgrade against OCP without a default node selector configured

Version:
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

Steps:
1. HA containerized install of OCP v3.7 without a default node selector
2. Trigger the upgrade against the above OCP

Result:
The compute node selector was added to the master config.
The web console was scheduled on the master nodes.
Original and new app pods were scheduled on the compute nodes.

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
  defaultNodeSelector: node-role.kubernetes.io/compute=true

# oc get node
NAME                        STATUS    ROLES     AGE       VERSION
qe-jliu-ha-master-etcd-1    Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-2    Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-3    Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-1   Ready     compute   2h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-2   Ready     compute   2h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-1           Ready     <none>    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-2           Ready     <none>    2h        v1.9.1+a0ce1bc657


# oc get pod -o wide --all-namespaces
NAMESPACE                           NAME                             READY     STATUS      RESTARTS   AGE       IP            NODE
default                             docker-registry-3-qgnjk          1/1       Running     0          51m       10.2.6.4      qe-jliu-ha-nrri-1
default                             docker-registry-3-vblpf          1/1       Running     0          51m       10.2.6.3      qe-jliu-ha-nrri-1
default                             registry-console-1-7sfdz         1/1       Running     0          45m       10.2.10.3     qe-jliu-ha-node-primary-1
default                             router-2-7pjhm                   1/1       Running     0          51m       10.240.0.68   qe-jliu-ha-nrri-2
default                             router-2-t2sn6                   1/1       Running     0          53m       10.240.0.67   qe-jliu-ha-nrri-1
install-test                        mongodb-1-dfnnz                  1/1       Running     0          45m       10.2.10.2     qe-jliu-ha-node-primary-1
install-test                        nodejs-mongodb-example-1-xz84b   1/1       Running     0          45m       10.2.10.4     qe-jliu-ha-node-primary-1
kube-service-catalog                apiserver-n7v4d                  1/1       Running     0          1h        10.2.0.5      qe-jliu-ha-master-etcd-1
kube-service-catalog                controller-manager-9f9vj         1/1       Running     0          1h        10.2.0.6      qe-jliu-ha-master-etcd-1
openshift-ansible-service-broker    asb-etcd-2-deploy                0/1       Error       0          17m       10.2.10.5     qe-jliu-ha-node-primary-1
openshift-template-service-broker   apiserver-tlbt8                  1/1       Running     1          59m       10.2.4.2      qe-jliu-ha-nrri-2
openshift-template-service-broker   apiserver-tmsll                  1/1       Running     1          58m       10.2.6.2      qe-jliu-ha-nrri-1
openshift-web-console               webconsole-7d878975d8-hkvj5      1/1       Running     0          1h        10.2.2.2      qe-jliu-ha-master-etcd-2
openshift-web-console               webconsole-7d878975d8-vn6c5      1/1       Running     0          1h        10.2.8.2      qe-jliu-ha-master-etcd-3
openshift-web-console               webconsole-7d878975d8-z2rrd      1/1       Running     0          1h        10.2.0.4      qe-jliu-ha-master-etcd-1
test                                cakephp-mysql-example-1-build    0/1       Completed   0          7m        10.2.12.3     qe-jliu-ha-node-primary-2
test                                cakephp-mysql-example-1-mv422    1/1       Running     0          5m        10.2.12.6     qe-jliu-ha-node-primary-2
test                                mysql-1-zsrwz                    1/1       Running     0          7m        10.2.12.4     qe-jliu-ha-node-primary-2


Combined with comment 10 and comment 17, the original install and upgrade issue from the description has been fixed. The other issues raised in the comments of this bug will be verified and tracked separately.

Comment 28 Scott Dodson 2018-03-15 18:27:24 UTC
*** Bug 1556970 has been marked as a duplicate of this bug. ***

Comment 30 Anping Li 2018-03-16 13:12:02 UTC
Opened new bug 1557345 to track comment 29.

