Description of problem:
After upgrading OCP from v3.7 to v3.9, some app pods other than the web console are scheduled on the master node.

Before upgrade:
# oc get node
NAME                                STATUS                     AGE       VERSION
qe-jliu-r1-master-etcd-1            Ready,SchedulingDisabled   8m        v1.7.6+a08f5eeb62
qe-jliu-r1-node-registry-router-1   Ready                      8m        v1.7.6+a08f5eeb62

# oc get pod -o wide --all-namespaces
NAMESPACE      NAME                             READY   STATUS      RESTARTS   AGE   IP            NODE
default        docker-registry-1-x5bvp          1/1     Running     0          4m    10.129.0.4    qe-jliu-r1-node-registry-router-1
default        registry-console-1-5cl58         1/1     Running     0          3m    10.129.0.6    qe-jliu-r1-node-registry-router-1
default        router-1-kflpp                   1/1     Running     0          5m    10.240.0.46   qe-jliu-r1-node-registry-router-1
install-test   mongodb-1-gmr48                  1/1     Running     0          3m    10.129.0.9    qe-jliu-r1-node-registry-router-1
install-test   nodejs-mongodb-example-1-build   0/1     Completed   0          3m    10.129.0.8    qe-jliu-r1-node-registry-router-1
install-test   nodejs-mongodb-example-1-qnwgs   1/1     Running     0          1m    10.129.0.11   qe-jliu-r1-node-registry-router-1

After upgrade:
# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-jliu-r1-master-etcd-1            Ready     master    43m       v1.9.1+a0ce1bc657
qe-jliu-r1-node-registry-router-1   Ready     <none>    43m       v1.9.1+a0ce1bc657

# oc get pod -o wide --all-namespaces
NAMESPACE               NAME                             READY   STATUS             RESTARTS   AGE   IP            NODE
default                 docker-registry-2-qg769          1/1     Running            0          5m    10.129.0.2    qe-jliu-r1-node-registry-router-1
default                 registry-console-2-4t6zg         1/1     Running            0          14m   10.128.0.4    qe-jliu-r1-master-etcd-1
default                 router-2-47qgj                   1/1     Running            0          5m    10.240.0.46   qe-jliu-r1-node-registry-router-1
install-test            mongodb-1-k967r                  1/1     Running            0          5m    10.128.0.5    qe-jliu-r1-master-etcd-1
install-test            nodejs-mongodb-example-1-gglxw   0/1     ImagePullBackOff   0          5m    10.128.0.20   qe-jliu-r1-master-etcd-1
openshift-web-console   webconsole-54877f6577-v9vkv      1/1     Running            0          15m   10.128.0.2    qe-jliu-r1-master-etcd-1

Version-Release number of the following components:
openshift-ansible-3.9.0-0.51.0.git.0.e26400f.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Upgrade v3.7 to v3.9

Actual results:
App pods were scheduled on the master node.

Expected results:
Only the web console should be scheduled on the master node.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
FYI: https://github.com/openshift/openshift-ansible/pull/6949 — so this should be working as designed.
What we intend to do here is:

if osm_default_node_selector is defined:
    complete the upgrade
else:
    label all non-master, non-infra nodes node-role.kubernetes.io/compute=true
    set the default node selector to 'node-role.kubernetes.io/compute=true'
    complete the upgrade

If automating the labeling proves to be too challenging, then instead we'll block the upgrade and link to documentation that explains the scheduling changes, advises the admin how to label their nodes, and tells them to set the inventory variable to unblock the upgrade.
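For reference, if the upgrade ends up being blocked instead, the manual steps that documentation would describe are roughly the following (a hedged sketch; the node name is a placeholder and osm_default_node_selector is the existing inventory variable):

# label each non-master, non-infra node as a compute node
oc label node <node-name> node-role.kubernetes.io/compute=true

# then set the default node selector in the Ansible inventory and re-run the upgrade
osm_default_node_selector='node-role.kubernetes.io/compute=true'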
Some background about where this bug comes from: https://bugzilla.redhat.com/show_bug.cgi?id=1539691#c10

According to the discussion in BZ#1539691, this issue does not only happen in upgraded environments; it also happens in fresh installs. Masters are now schedulable, and "Taint master nodes" is still being discussed, which means any pod may be scheduled onto master nodes. This fix should consider both upgrade and fresh install. Based on that, I have updated the title summary.
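For context, the "Taint master nodes" approach under discussion would keep masters schedulable but repel ordinary workloads with a taint; a rough illustration of what that could look like (the taint key and effect here are illustrative assumptions, not the agreed design):

oc adm taint nodes <master-node> node-role.kubernetes.io/master=true:NoSchedule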
Fix for the upgrade path is here; looking into fresh install: https://github.com/openshift/openshift-ansible/pull/7364
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/236eb827f8010271807bb30d4b9a108eab88cf03
Bug 1548641- upgrade now properly sets labels and selectors

https://github.com/openshift/openshift-ansible/commit/791a6eb30427283dd8c8d30cfb7986fd25a6a704
Merge pull request #7364 from fabianvf/bz1548641

Automatic merge from submit-queue.

Bug 1548641- upgrade now properly sets labels and selectors
https://bugzilla.redhat.com/show_bug.cgi?id=1548641
For fresh install, verified this bug with openshift-ansible-3.9.4-1.git.0.a49cc04.el7.noarch, and it PASSED.

1. For nodes without a 'region=infra' label, the installer adds a 'node-role.kubernetes.io/compute=true' label; no label operation is performed on 'region=infra' nodes.

# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.10   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.11   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.13   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.15   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.16   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.7    Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.9    Ready     master    1h        v1.9.1+a0ce1bc657

# oc get nodes -l node-role.kubernetes.io/compute=true
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.13   Ready     compute   2h        v1.9.1+a0ce1bc657
192.168.100.7    Ready     compute   2h        v1.9.1+a0ce1bc657

# oc get nodes -l region=infra
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.10   Ready     <none>    2h        v1.9.1+a0ce1bc657
192.168.100.16   Ready     <none>    2h        v1.9.1+a0ce1bc657

2. If osm_default_node_selector is not defined, the following setting appears in the master config file:

projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true

If osm_default_node_selector is defined, e.g. osm_default_node_selector=role=node,region=primary, the user-defined value is used:

projectConfig:
  defaultNodeSelector: role=node,region=primary

3. All pods without a node selector definition use the 'defaultNodeSelector' setting for scheduling.

When no osm_default_node_selector is defined:

# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.10   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.11   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.13   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.15   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.16   Ready     <none>    1h        v1.9.1+a0ce1bc657
192.168.100.7    Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.9    Ready     master    1h        v1.9.1+a0ce1bc657

# oc get po --all-namespaces -o wide
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE  IP  NODE
147i2  database-1-deploy  0/1  ContainerCreating  0  3s  <none>  192.168.100.13
9mh7y  hooks-1-deploy  1/1  Running  0  21s  10.2.12.244  192.168.100.13
9mh7y  hooks-1-q9dxj  0/1  ContainerCreating  0  18s  <none>  192.168.100.13
default  docker-registry-1-h2jpp  0/1  CrashLoopBackOff  4  59m  10.2.8.3  192.168.100.10
default  docker-registry-1-nfhjr  0/1  Running  4  59m  10.2.6.3  192.168.100.16
default  registry-console-1-hkc29  0/1  Running  2  57m  10.2.6.4  192.168.100.16
default  router-1-chqf8  1/1  Running  0  1h  192.168.100.10  192.168.100.10
default  router-1-dcjv7  1/1  Running  0  1h  192.168.100.16  192.168.100.16
install-test  mongodb-1-lv66g  1/1  Running  2  52m  10.2.10.5  192.168.100.7
install-test  nodejs-mongodb-example-1-build  0/1  Completed  0  52m  10.2.10.4  192.168.100.7
install-test  nodejs-mongodb-example-1-pm26p  0/1  Running  2  51m  10.2.10.6  192.168.100.7
kube-service-catalog  apiserver-rw7dp  1/1  Running  0  55m  10.2.0.4  192.168.100.11
kube-service-catalog  controller-manager-z6pb7  1/1  Running  0  55m  10.2.0.5  192.168.100.11
muvlq  hooks-1-4pv8d  1/1  Running  0  2m  10.2.10.91  192.168.100.7
muvlq  hooks-1-lpxln  1/1  Terminating  0  2m  10.2.12.241  192.168.100.13
muvlq  hooks-1-q9z7c  1/1  Running  0  3m  10.2.12.214  192.168.100.13
muvlq  hooks-2-deploy  1/1  Running  0  33s  10.2.10.92  192.168.100.7
muvlq  hooks-2-js4mm  0/1  ContainerCreating  0  17s  <none>  192.168.100.13
muvlq  hooks-2-kqj86  1/1  Running  0  28s  10.2.10.93  192.168.100.7
openshift-ansible-service-broker  asb-1-bjf59  0/1  CrashLoopBackOff  6  53m  10.2.10.3  192.168.100.7
openshift-ansible-service-broker  asb-etcd-1-8g2lx  1/1  Running  0  53m  10.2.12.3  192.168.100.13
openshift-template-service-broker  apiserver-5khf7  0/1  Running  0  54m  10.2.6.5  192.168.100.16
openshift-template-service-broker  apiserver-txm7g  0/1  Running  0  54m  10.2.8.4  192.168.100.10
openshift-web-console  webconsole-74f5ddb69c-742rc  0/1  Running  0  58m  10.2.0.3  192.168.100.11
openshift-web-console  webconsole-74f5ddb69c-n5482  0/1  Running  0  58m  10.2.2.2  192.168.100.15
openshift-web-console  webconsole-74f5ddb69c-wcwcq  0/1  Running  0  58m  10.2.4.2  192.168.100.9
tmwl9  postgresql-1-rzq26  0/1  Running  2  6m  10.2.10.63  192.168.100.7
tmwl9  rails-postgresql-example-2-build  1/1  Running  0  6m  10.2.12.185  192.168.100.13
ytm3h  git-3-4t8ts  0/1  Running  0  4m  10.2.12.205  192.168.100.13
ytm3h  gitserver-2-wg9qj  1/1  Running  0  6m  10.2.12.180  192.168.100.13
ytm3h  ruby-hello-world-1-deploy  1/1  Running  0  3m  10.2.12.239  192.168.100.13
ytm3h  ruby-hello-world-1-wfbl7  0/1  ImagePullBackOff  0  2m  10.2.12.240  192.168.100.13
ytm3h  ruby-hello-world-2-build  0/1  Completed  0  4m  10.2.10.71  192.168.100.7
ytm3h  ruby-hello-world-3-build  1/1  Running  0  2m  10.2.10.90  192.168.100.7
ytrf0  git-server-2-deploy  1/1  Running  0  3m  10.2.12.233  192.168.100.13
ytrf0  git-server-2-kng2h  0/1  ContainerCreating  0  3m  10.2.12.235  192.168.100.13
ytrf0  ruby-hello-world-1-build  0/1  Error  0  3m  10.2.10.76  192.168.100.7
z6o44  pod-add-chown  0/1  ContainerCreating  0  0s  <none>  192.168.100.13

So the fix looks good for fresh install; the upgrade part is left, and its verification is still in progress.

When osm_default_node_selector="role=node,region=primary" is set:

# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.17   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.18   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.19   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.22   Ready     compute   1h        v1.9.1+a0ce1bc657
192.168.100.23   Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.4    Ready     master    1h        v1.9.1+a0ce1bc657
192.168.100.8    Ready     compute   1h        v1.9.1+a0ce1bc657

# oc get nodes -l role=node,region=primary
NAME             STATUS    ROLES     AGE       VERSION
192.168.100.22   Ready     compute   1h        v1.9.1+a0ce1bc657

# oc get po --all-namespaces -o wide
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE  IP  NODE
cl-auto-reg-ha  mongodb-1-rrkv6  1/1  Running  0  33m  11.0.5.30  192.168.100.22
cl-auto-reg-ha  nodejs-mongodb-example-1-build  0/1  Completed  0  33m  11.0.5.29  192.168.100.22
cl-auto-reg-ha  nodejs-mongodb-example-1-cxvlm  1/1  Running  0  30m  11.0.5.32  192.168.100.22
default  docker-registry-1-cxwbd  1/1  Running  0  31m  11.0.6.6  192.168.100.17
default  docker-registry-1-n5bvr  1/1  Running  0  31m  11.0.2.5  192.168.100.8
default  registry-console-1-rv7wc  1/1  Running  1  1h  11.0.5.19  192.168.100.22
default  router-1-ksrt6  1/1  Running  0  28m  192.168.100.8  192.168.100.8
default  router-1-r4plq  1/1  Running  0  28m  192.168.100.18  192.168.100.18
install-test  mongodb-1-89w8h  1/1  Running  1  1h  11.0.5.20  192.168.100.22
install-test  myapp-485-1-mmrfr  1/1  Running  0  34m  11.0.5.27  192.168.100.22
install-test  nodejs-mongodb-example-4-build  0/1  Completed  0  40m  11.0.5.23  192.168.100.22
install-test  nodejs-mongodb-example-4-nfsll  1/1  Running  0  38m  11.0.5.25  192.168.100.22
kube-service-catalog  apiserver-mnj7s  1/1  Running  1  1h  11.0.0.7  192.168.100.4
kube-service-catalog  controller-manager-hjghk  1/1  Running  2  1h  11.0.0.5  192.168.100.4
openshift-ansible-service-broker  asb-1-n8xrz  1/1  Running  9  1h  11.0.5.21  192.168.100.22
openshift-ansible-service-broker  asb-etcd-1-5fl2z  1/1  Running  1  1h  11.0.5.22  192.168.100.22
openshift-template-service-broker  apiserver-8f474  1/1  Running  1  1h  11.0.3.5  192.168.100.18
openshift-template-service-broker  apiserver-ww8hv  1/1  Running  1  1h  11.0.6.5  192.168.100.17
openshift-template-service-broker  apiserver-xkp2m  1/1  Running  1  1h  11.0.2.4  192.168.100.8
openshift-web-console  webconsole-74f5ddb69c-7d8ww  1/1  Running  1  1h  11.0.4.3  192.168.100.23
openshift-web-console  webconsole-74f5ddb69c-9q2bn  1/1  Running  1  1h  11.0.0.6  192.168.100.4
openshift-web-console  webconsole-74f5ddb69c-zt5hw  1/1  Running  1  1h  11.0.1.4  192.168.100.19
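One related note: the default selector only applies to pods in projects that do not carry their own node selector, so it can be overridden per project via the openshift.io/node-selector annotation. A hedged example (the project name and selector value are illustrative):

oc annotate namespace my-project openshift.io/node-selector='region=infra' --overwrite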
That means that if a cluster has only 'region=infra' nodes, deploying app pods will fail, because no node matches the default node selector.
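In that situation, one possible workaround (my assumption, not part of the installer fix) would be to point the default selector at the labels those nodes actually carry, e.g. in the inventory:

osm_default_node_selector='region=infra'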
Scenario 1: Upgrade against OCP with a default node selector configured

Version: openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

Steps:
1. HA install OCP v3.7 with osm_default_node_selector: 'region=primary' set in the inventory file.

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
defaultNodeSelector: region=primary

# oc get node
NAME                        STATUS                     AGE       VERSION
qe-jliu-ha-master-etcd-1    Ready,SchedulingDisabled   3h        v1.7.6+a08f5eeb62
qe-jliu-ha-master-etcd-2    Ready,SchedulingDisabled   3h        v1.7.6+a08f5eeb62
qe-jliu-ha-master-etcd-3    Ready,SchedulingDisabled   3h        v1.7.6+a08f5eeb62
qe-jliu-ha-node-primary-1   Ready                      3h        v1.7.6+a08f5eeb62
qe-jliu-ha-node-primary-2   Ready                      3h        v1.7.6+a08f5eeb62
qe-jliu-ha-nrri-1           Ready                      3h        v1.7.6+a08f5eeb62
qe-jliu-ha-nrri-2           Ready                      3h        v1.7.6+a08f5eeb62

# oc get nodes -l region=infra
NAME                STATUS    AGE       VERSION
qe-jliu-ha-nrri-1   Ready     4h        v1.7.6+a08f5eeb62
qe-jliu-ha-nrri-2   Ready     4h        v1.7.6+a08f5eeb62

# oc get nodes -l region=primary
NAME                        STATUS    AGE       VERSION
qe-jliu-ha-node-primary-1   Ready     4h        v1.7.6+a08f5eeb62
qe-jliu-ha-node-primary-2   Ready     4h        v1.7.6+a08f5eeb62

2. Trigger the upgrade with the inventory file from step 1.

Expected results:
The master was schedulable and the web console was running on the master nodes.

# oc get pod -o wide -n openshift-web-console
NAME                          READY   STATUS    RESTARTS   AGE   IP         NODE
webconsole-776767c6f4-l6fqz   1/1     Running   0          41m   10.2.0.4   qe-jliu-ha-master-etcd-1
webconsole-776767c6f4-rcxcx   1/1     Running   0          41m   10.2.2.2   qe-jliu-ha-master-etcd-2
webconsole-776767c6f4-xqx9p   1/1     Running   0          41m   10.2.4.2   qe-jliu-ha-master-etcd-3

The default node selector was the same as it was before the upgrade.

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
defaultNodeSelector: region=primary

Unexpected results:
No compute labels should be added to any node when a default node selector is defined. But in fact the non-infra nodes and two of the masters had the compute label added.

# oc get nodes
NAME                        STATUS    ROLES            AGE       VERSION
qe-jliu-ha-master-etcd-1    Ready     master           6h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-2    Ready     compute,master   6h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-3    Ready     compute,master   6h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-1   Ready     compute          6h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-2   Ready     compute          6h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-1           Ready     <none>           6h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-2           Ready     <none>           6h        v1.9.1+a0ce1bc657

So assigning the bug back.
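If those unexpected labels need to be cleared while this is investigated, removing them manually should be possible with the trailing-dash label syntax (a sketch; only the two affected masters are shown):

oc label node qe-jliu-ha-master-etcd-2 node-role.kubernetes.io/compute-
oc label node qe-jliu-ha-master-etcd-3 node-role.kubernetes.io/compute-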
Scenario 2: Upgrade against OCP without a default node selector configured

Version: openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

Steps:
1. Non-HA containerized install of OCP v3.7 without a default node selector
2. Trigger the upgrade against the above OCP

Result:
The compute node selector was not added to the master config. App pods were still scheduled on the master node, not on the compute node.

# oc get node
NAME                               STATUS    ROLES     AGE       VERSION
qe-jliu-c-master-etcd-1            Ready     master    1h        v1.9.1+a0ce1bc657
qe-jliu-c-node-registry-router-1   Ready     compute   1h        v1.9.1+a0ce1bc657

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
defaultNodeSelector: ''

# oc get pod -o wide --all-namespaces
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE  IP  NODE
default  docker-registry-3-d28pp  1/1  Running  0  35m  10.129.0.29  qe-jliu-c-node-registry-router-1
default  registry-console-2-rtrqh  1/1  Running  0  49m  10.128.0.7  qe-jliu-c-master-etcd-1
default  router-2-tz7sp  1/1  Running  0  35m  10.240.0.7  qe-jliu-c-node-registry-router-1
install-test  mongodb-1-9xndk  1/1  Running  0  35m  10.128.0.11  qe-jliu-c-master-etcd-1
install-test  nodejs-mongodb-example-1-cxt9v  1/1  Running  0  35m  10.128.0.28  qe-jliu-c-master-etcd-1
kube-service-catalog  apiserver-qpsn6  1/1  Running  0  44m  10.128.0.8  qe-jliu-c-master-etcd-1
kube-service-catalog  controller-manager-n4b5g  1/1  Running  0  44m  10.128.0.9  qe-jliu-c-master-etcd-1
openshift-template-service-broker  apiserver-5fgxf  0/1  ImagePullBackOff  0  43m  10.129.0.34  qe-jliu-c-node-registry-router-1
openshift-template-service-broker  apiserver-87wht  1/1  Running  1  1h  10.128.0.4  qe-jliu-c-master-etcd-1
openshift-web-console  webconsole-776767c6f4-tvbhd  1/1  Running  0  50m  10.128.0.5  qe-jliu-c-master-etcd-1
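Until the installer sets this, the manual equivalent would presumably be to set the selector in the master config and restart the master services (a sketch under that assumption; the service names assume a 3.9 systemd-managed master and may differ for containerized installs):

# /etc/origin/master/master-config.yaml
projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true

# restart the master API and controllers to pick up the change
systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers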
@Fabian von Feilitzsch: the upgrade verification covered two basic scenarios, and neither of them works. Scenario 2 (comment 14) follows the same steps as the original issue (description). Scenario 1 (comment 13) is a supplementary test for the new change. I tracked them here together; if you need them tracked separately, I will file a new bug for Scenario 1.
The fix https://github.com/openshift/openshift-ansible/pull/7364 leads to https://bugzilla.redhat.com/show_bug.cgi?id=1543727#c5
No issues found for master/node scaling up.
In Scenario 1, the only thing that concerns me is why we added the compute label to qe-jliu-ha-master-etcd-2 and qe-jliu-ha-master-etcd-3 if they're already masters. Our intent is to apply the compute label even when it's not used as the default node selector, as it will become required for other functionality in the future. In Scenario 2, the thing we need to fix is setting the default node selector in the master config.
Fix for scenario 2: https://github.com/openshift/openshift-ansible/pull/7501
@Scott, should we break scenario 1 into a separate bug and mark this one modified?
I see the default node selector is node-role.kubernetes.io/compute=true, and there is a problem; see the following:

1) Create one project named 'hosa' with node-selector ""

# oc get project -n hosa -o yaml
apiVersion: v1
items:
- apiVersion: project.openshift.io/v1
  kind: Project
  metadata:
    annotations:
      openshift.io/node-selector: ""
      openshift.io/sa.initialized-roles: "true"
      openshift.io/sa.scc.mcs: s0:c1,c0
      openshift.io/sa.scc.supplemental-groups: 1000000000/10000
      openshift.io/sa.scc.uid-range: 1000000000/10000
    creationTimestamp: 2018-03-12T06:41:00Z
*********************snipped**************************************

2) I have one hawkular-openshift-agent DaemonSet whose nodeSelector is not defined, which means it should start one pod on each node. There is no "node-role.kubernetes.io/compute=true" label on the master node, yet the DS pod on the master node is checked against the "node-role.kubernetes.io/compute=true" node selector, which does not match, so the hawkular-openshift-agent pod on the master node is recreated continuously (a defect is filed: https://bugzilla.redhat.com/show_bug.cgi?id=1543727). See the events:

# oc get event
54m  54m  1  hawkular-openshift-agent-l88s4.151b51d4b370b454  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8bmg.151b4f89ec04badc  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
18m  18m  1  hawkular-openshift-agent-l8bnq.151b53d58897ac5d  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
18m  18m  1  hawkular-openshift-agent-l8cdn.151b53d37068bc5b  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8ct8.151b50c721482011  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
49m  49m  1  hawkular-openshift-agent-l8f9g.151b521a453dd798  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8gfg.151b5147fa1961db  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
31m  31m  1  hawkular-openshift-agent-l8gk9.151b531a79eff10e  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
18m  18m  1  hawkular-openshift-agent-l8hdz.151b53d2464b812a  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8hxs.151b4f15bcaf15af  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8m97.151b50156eaba408  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8mwv.151b4f3e049f205a  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8ndn.151b4f65e0b4f99a  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8nsz.151b4f34c26ea5c4  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed
1h   1h   1  hawkular-openshift-agent-l8r9w.151b4e61aaf37da9  Pod  Warning  MatchNodeSelector  kubelet, 172.16.120.93  Predicate MatchNodeSelector failed

# oc get ds -n hosa
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hawkular-openshift-agent   2         2         1         2            1           <none>          15h

# oc get po -n hosa -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
hawkular-openshift-agent-6s6x4   1/1     Running   0          16h   10.129.0.26   172.16.120.78
hawkular-openshift-agent-j2wq4   0/1     Pending   0          0s    <none>        172.16.120.93

# oc get po -n hosa -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
hawkular-openshift-agent-6s6x4   1/1     Running   0          16h   10.129.0.26   172.16.120.78
hawkular-openshift-agent-hrbhf   0/1     Pending   0          0s    <none>        172.16.120.93

# oc get node --show-labels
NAME            STATUS    ROLES     AGE       VERSION             LABELS
172.16.120.78   Ready     compute   18h       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=431ac1fb-1463-4527-b3d1-79245dd698e1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=172.16.120.78,node-role.kubernetes.io/compute=true,registry=enabled,role=node,router=enabled
172.16.120.93   Ready     master    18h       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=431ac1fb-1463-4527-b3d1-79245dd698e1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=172.16.120.93,node-role.kubernetes.io/master=true,openshift-infra=apiserver,role=node
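A possible mitigation while the DaemonSet bug is handled (purely a suggestion on my side, not part of this fix) is to give the DaemonSet an explicit nodeSelector that both nodes satisfy, for example the role=node label shown in the output above, so the DS controller and the scheduler agree on the target nodes:

oc patch ds hawkular-openshift-agent -n hosa -p '{"spec":{"template":{"spec":{"nodeSelector":{"role":"node"}}}}}'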
Important: the issue of pods being recreated by the DS controller will cause a high workload in the cluster. https://bugzilla.redhat.com/show_bug.cgi?id=1501514
(In reply to Fabian von Feilitzsch from comment #20)
> @Scott, should we break scenario 1 into a separate bug and mark this one
> modified?

Sure, can you go ahead and create a fork of this bug that deals specifically with Scenario 1? Marking this one modified as https://github.com/openshift/openshift-ansible/pull/7501 has merged.
Nevermind, I created a bug for Scenario 1 https://bugzilla.redhat.com/show_bug.cgi?id=1554828
Small addendum: there was a second issue masked by the first: https://github.com/openshift/openshift-ansible/pull/7508
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/8ed2940fcabd39ecbc8ce9224a332460b9c9b75f
Bug 1548641- Correct arguments to yedit

https://github.com/openshift/openshift-ansible/commit/2c2cfbcb13000d7b4ccdf7f4368429a513969613
Merge pull request #7529 from openshift-cherrypick-robot/cherry-pick-7508-to-master

[master] Bug 1548641- Correct arguments to yedit
Scenario 2: Upgrade against OCP without a default node selector configured

Version: openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

Steps:
1. HA containerized install of OCP v3.7 without a default node selector
2. Trigger the upgrade against the above OCP

Result:
The compute node selector was added to the master config. The web console was scheduled on the master nodes. Original and new app pods were scheduled on compute nodes.

# cat /etc/origin/master/master-config.yaml|grep "defaultNodeSelector"
defaultNodeSelector: node-role.kubernetes.io/compute=true

# oc get node
NAME                        STATUS    ROLES     AGE       VERSION
qe-jliu-ha-master-etcd-1    Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-2    Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-master-etcd-3    Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-1   Ready     compute   2h        v1.9.1+a0ce1bc657
qe-jliu-ha-node-primary-2   Ready     compute   2h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-1           Ready     <none>    2h        v1.9.1+a0ce1bc657
qe-jliu-ha-nrri-2           Ready     <none>    2h        v1.9.1+a0ce1bc657

# oc get pod -o wide --all-namespaces
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE  IP  NODE
default  docker-registry-3-qgnjk  1/1  Running  0  51m  10.2.6.4  qe-jliu-ha-nrri-1
default  docker-registry-3-vblpf  1/1  Running  0  51m  10.2.6.3  qe-jliu-ha-nrri-1
default  registry-console-1-7sfdz  1/1  Running  0  45m  10.2.10.3  qe-jliu-ha-node-primary-1
default  router-2-7pjhm  1/1  Running  0  51m  10.240.0.68  qe-jliu-ha-nrri-2
default  router-2-t2sn6  1/1  Running  0  53m  10.240.0.67  qe-jliu-ha-nrri-1
install-test  mongodb-1-dfnnz  1/1  Running  0  45m  10.2.10.2  qe-jliu-ha-node-primary-1
install-test  nodejs-mongodb-example-1-xz84b  1/1  Running  0  45m  10.2.10.4  qe-jliu-ha-node-primary-1
kube-service-catalog  apiserver-n7v4d  1/1  Running  0  1h  10.2.0.5  qe-jliu-ha-master-etcd-1
kube-service-catalog  controller-manager-9f9vj  1/1  Running  0  1h  10.2.0.6  qe-jliu-ha-master-etcd-1
openshift-ansible-service-broker  asb-etcd-2-deploy  0/1  Error  0  17m  10.2.10.5  qe-jliu-ha-node-primary-1
openshift-template-service-broker  apiserver-tlbt8  1/1  Running  1  59m  10.2.4.2  qe-jliu-ha-nrri-2
openshift-template-service-broker  apiserver-tmsll  1/1  Running  1  58m  10.2.6.2  qe-jliu-ha-nrri-1
openshift-web-console  webconsole-7d878975d8-hkvj5  1/1  Running  0  1h  10.2.2.2  qe-jliu-ha-master-etcd-2
openshift-web-console  webconsole-7d878975d8-vn6c5  1/1  Running  0  1h  10.2.8.2  qe-jliu-ha-master-etcd-3
openshift-web-console  webconsole-7d878975d8-z2rrd  1/1  Running  0  1h  10.2.0.4  qe-jliu-ha-master-etcd-1
test  cakephp-mysql-example-1-build  0/1  Completed  0  7m  10.2.12.3  qe-jliu-ha-node-primary-2
test  cakephp-mysql-example-1-mv422  1/1  Running  0  5m  10.2.12.6  qe-jliu-ha-node-primary-2
test  mysql-1-zsrwz  1/1  Running  0  7m  10.2.12.4  qe-jliu-ha-node-primary-2

Combined with comment 10 and comment 17, the original install and upgrade issue from the description has been fixed. Other issues raised in this bug's comments will be verified and tracked separately.
*** Bug 1556970 has been marked as a duplicate of this bug. ***
Opened new bug 1557345 to track comment 29.