Description of problem:
See the details below.

Version-Release number of the following components:
openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch

How reproducible:
50%

Steps to Reproduce:
1. Enable Prometheus deployment when installing the cluster.

Actual results:
TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] ***
Thursday 26 July 2018  13:37:36 -0400 (0:00:00.101)       0:28:14.786 *********
ok: [host-8-248-147.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "failed": false, "results": {"cmd": "/usr/bin/oc get node --selector=node-role.kubernetes.io/infra=true --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on] ****
Thursday 26 July 2018  13:37:37 -0400 (0:00:00.944)       0:28:15.730 *********
fatal: [host-8-248-147.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "assertion": false,
    "changed": false,
    "evaluated_to": false,
    "failed": true,
    "msg": "No schedulable nodes found matching node selector for Prometheus - 'node-role.kubernetes.io/infra=true'"
}

Expected results:
Installation completes successfully.

Additional info:
After the failure, log into the cluster and run "/usr/bin/oc get node --selector=node-role.kubernetes.io/infra=true --field-selector=spec.unschedulable!=true -o json -n default"; it returns the infra node successfully. It looks like there is some delay before the infra node becomes ready: when "Retrieve list of schedulable nodes matching selector" runs, the infra node is not ready yet, and it only becomes ready several minutes later.
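For reference, a small helper I used to confirm the delayed labeling described above (not part of the original report; it assumes a working kubeconfig on the master):

# wait until at least one node carries the infra role label, then list it
until oc get node --selector=node-role.kubernetes.io/infra=true --no-headers 2>/dev/null | grep -q .; do
    echo "no labeled infra node yet, retrying in 30s"
    sleep 30
done
oc get node --selector=node-role.kubernetes.io/infra=true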
Created attachment 1470865 [details] installation log with inventory file embedded
This issue is reproduced in another of my clusters.

# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-jialiu310stg-master-etcd-nfs-1   Ready     master    12m       v1.10.0+b81c8f8
qe-jialiu310stg-node-1              Ready     compute   4m        v1.10.0+b81c8f8
qe-jialiu310stg-node-infra-1        Ready     <none>    4m        v1.10.0+b81c8f8

The infra node is still not labeled with "infra" after 4m.

# oc describe po sync-s9rbm -n openshift-node
Name:               sync-s9rbm
Namespace:          openshift-node
Node:               qe-jialiu310stg-node-infra-1/172.16.120.43
Start Time:         Thu, 26 Jul 2018 14:27:21 -0400
Labels:             app=sync
                    component=network
                    controller-revision-hash=3178364908
                    openshift.io/component=sync
                    pod-template-generation=2
                    type=infra
Annotations:        scheduler.alpha.kubernetes.io/critical-pod=
Status:             Pending
IP:                 172.16.120.43
Controlled By:      DaemonSet/sync
Containers:
  sync:
    Container ID:
    Image:         registry.access.stage.redhat.com/openshift3/ose-node:v3.10
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
<--snip-->
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type    Reason   Age   From                                   Message
  ----    ------   ----  ----                                   -------
  Normal  Pulling  2m    kubelet, qe-jialiu310stg-node-infra-1  pulling image "registry.access.stage.redhat.com/openshift3/ose-node:v3.10"

Maybe this issue is related to performance, registry bandwidth, or the ose-node image simply being too big. After 11m, the node is labeled correctly.

# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-jialiu310stg-master-etcd-nfs-1   Ready     master    19m       v1.10.0+b81c8f8
qe-jialiu310stg-node-1              Ready     compute   11m       v1.10.0+b81c8f8
qe-jialiu310stg-node-infra-1        Ready     infra     11m       v1.10.0+b81c8f8

# crictl images
registry.access.stage.redhat.com/openshift3/ose-node   v3.10   e3822f4f47155   1.28GB
The sync pod is responsible for labeling the node, so it makes sense that the node would remain unlabeled until that pod becomes available.
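A quick way to see this race by hand (a hedged check, using the daemonset and label names from the output above):

# compare how many sync pods are ready with how many are desired
oc get daemonset sync -n openshift-node \
    -o jsonpath='desired={.status.desiredNumberScheduled} ready={.status.numberReady}{"\n"}'
# list the sync pods and the nodes they run on
oc get pods -n openshift-node -l app=sync -o wide
# check whether the infra label has landed on the node yet
oc get node --selector=node-role.kubernetes.io/infra=true --show-labels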
PR to ignore the node check if the sync pods are not ready yet. https://github.com/openshift/openshift-ansible/pull/9661
Waiting until bug 1617990 is verified for master (3.11); the fix will then be backported to 3.10.
PR for 3.10: https://github.com/openshift/openshift-ansible/pull/9819
Fix has been merged to the 3.10 branch.
The latest installer rpm is openshift-ansible-3.10.41-1.git.0.fd15dd7.el7.noarch; the PR has not landed in that rpm yet, so we are waiting for a newer build.
Re-test this bug with openshift-ansible-3.10.43-1.git.0.4794155.el7.noarch, still reproduced, though the PR already landed into the PRM and take effect. TASK [openshift_node_group : Wait for the sync daemonset to become ready and available] *** Thursday 06 September 2018 14:44:12 +0800 (0:00:00.176) 0:21:27.759 **** FAILED - RETRYING: Wait for the sync daemonset to become ready and available (60 retries left). FAILED - RETRYING: Wait for the sync daemonset to become ready and available (59 retries left). FAILED - RETRYING: Wait for the sync daemonset to become ready and available (58 retries left). ok: [host-8-249-125.host.centralci.eng.rdu2.redhat.com] => {"attempts": 4, "changed": false, "failed": false, "failed_when_result": false, "results": {"cmd": "/usr/bin/oc get daemonset sync -o json -n openshift-node", "results": [{"apiVersion": "extensions/v1beta1", "kind": "DaemonSet", "metadata": {"annotations": {"image.openshift.io/triggers": "[\n {\"from\":{\"kind\":\"ImageStreamTag\",\"name\":\"node:v3.10\"},\"fieldPath\":\"spec.template.spec.containers[?(@.name==\\\"sync\\\")].image\"}\n]\n", "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apps/v1\",\"kind\":\"DaemonSet\",\"metadata\":{\"annotations\":{\"image.openshift.io/triggers\":\"[\\n {\\\"from\\\":{\\\"kind\\\":\\\"ImageStreamTag\\\",\\\"name\\\":\\\"node:v3.10\\\"},\\\"fieldPath\\\":\\\"spec.template.spec.containers[?(@.name==\\\\\\\"sync\\\\\\\")].image\\\"}\\n]\\n\",\"kubernetes.io/description\":\"This daemon set provides dynamic configuration of nodes and relabels nodes as appropriate.\\n\"},\"name\":\"sync\",\"namespace\":\"openshift-node\"},\"spec\":{\"selector\":{\"matchLabels\":{\"app\":\"sync\"}},\"template\":{\"metadata\":{\"annotations\":{\"scheduler.alpha.kubernetes.io/critical-pod\":\"\"},\"labels\":{\"app\":\"sync\",\"component\":\"network\",\"openshift.io/component\":\"sync\",\"type\":\"infra\"}},\"spec\":{\"containers\":[{\"command\":[\"/bin/bash\",\"-c\",\"#!/bin/bash\\nset -euo pipefail\\n\\n# set by the node image\\nunset KUBECONFIG\\n\\ntrap 'kill $(jobs -p); exit 0' TERM\\n\\n# track the current state of the config\\nif [[ -f /etc/origin/node/node-config.yaml ]]; then\\n md5sum /etc/origin/node/node-config.yaml \\u003e /tmp/.old\\nelse\\n touch /tmp/.old\\nfi\\n\\n# loop until BOOTSTRAP_CONFIG_NAME is set\\nwhile true; do\\n file=/etc/sysconfig/origin-node\\n if [[ -f /etc/sysconfig/atomic-openshift-node ]]; then\\n file=/etc/sysconfig/atomic-openshift-node\\n elif [[ -f /etc/sysconfig/origin-node ]]; then\\n file=/etc/sysconfig/origin-node\\n else\\n echo \\\"info: Waiting for the node sysconfig file to be created\\\" 2\\u003e\\u00261\\n sleep 15 \\u0026 wait\\n continue\\n fi\\n name=\\\"$(sed -nE 's|^BOOTSTRAP_CONFIG_NAME=([^#].+)|\\\\1|p' \\\"${file}\\\" | head -1)\\\"\\n if [[ -z \\\"${name}\\\" ]]; then\\n echo \\\"info: Waiting for BOOTSTRAP_CONFIG_NAME to be set\\\" 2\\u003e\\u00261\\n sleep 15 \\u0026 wait\\n continue\\n fi\\n # in the background check to see if the value changes and exit if so\\n pid=$BASHPID\\n (\\n while true; do\\n if ! 
updated=\\\"$(sed -nE 's|^BOOTSTRAP_CONFIG_NAME=([^#].+)|\\\\1|p' \\\"${file}\\\" | head -1)\\\"; then\\n echo \\\"error: Unable to check for bootstrap config, exiting\\\" 2\\u003e\\u00261\\n kill $pid\\n exit 1\\n fi\\n if [[ \\\"${updated}\\\" != \\\"${name}\\\" ]]; then\\n echo \\\"info: Bootstrap configuration profile name changed, exiting\\\" 2\\u003e\\u00261\\n kill $pid\\n exit 0\\n fi\\n sleep 15\\n done\\n ) \\u0026\\n break\\ndone\\n\\n# periodically refresh both node-config.yaml and relabel the node\\nwhile true; do\\n if ! oc extract \\\"configmaps/${name}\\\" -n openshift-node --to=/etc/origin/node --confirm --request-timeout=10s --config /etc/origin/node/node.kubeconfig \\\"--token=$( cat /var/run/secrets/kubernetes.io/serviceaccount/token )\\\" \\u003e /dev/null; then\\n echo \\\"error: Unable to retrieve latest config for node\\\" 2\\u003e\\u00261\\n sleep 15 \\u0026\\n wait $!\\n continue\\n fi\\n # detect whether the node-config.yaml has changed, and if so trigger a restart of the kubelet.\\n md5sum /etc/origin/node/node-config.yaml \\u003e /tmp/.new\\n if [[ \\\"$( cat /tmp/.old )\\\" != \\\"$( cat /tmp/.new )\\\" ]]; then\\n echo \\\"info: Configuration changed, restarting kubelet\\\" 2\\u003e\\u00261\\n # TODO: kubelet doesn't relabel nodes, best effort for now\\n # https://github.com/kubernetes/kubernetes/issues/59314\\n if args=\\\"$(openshift start node --write-flags --config /etc/origin/node/node-config.yaml)\\\"; then\\n labels=$(tr ' ' '\\\\n' \\u003c\\u003c\\u003c$args | sed -ne '/^--node-labels=/ { s/^--node-labels=//; p; }' | tr ',\\\\n' ' ')\\n if [[ -n \\\"${labels}\\\" ]]; then\\n echo \\\"info: Applying node labels $labels\\\" 2\\u003e\\u00261\\n if ! oc label --config=/etc/origin/node/node.kubeconfig \\\"node/${NODE_NAME}\\\" ${labels} --overwrite; then\\n echo \\\"error: Unable to apply labels, will retry in 10\\\" 2\\u003e\\u00261\\n sleep 10 \\u0026\\n wait $!\\n continue\\n fi\\n fi\\n else\\n echo \\\"error: The downloaded node configuration is invalid, retrying later\\\" 2\\u003e\\u00261\\n sleep 10 \\u0026\\n wait $!\\n continue\\n fi\\n if ! 
pkill -U 0 -f '(^|/)hyperkube kubelet '; then\\n echo \\\"error: Unable to restart Kubelet\\\" 2\\u003e\\u00261\\n sleep 10 \\u0026\\n wait $!\\n continue\\n fi\\n fi\\n cp -f /tmp/.new /tmp/.old\\n sleep 180 \\u0026\\n wait $!\\ndone\\n\"],\"env\":[{\"name\":\"NODE_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"spec.nodeName\"}}}],\"image\":\" \",\"name\":\"sync\",\"securityContext\":{\"privileged\":true,\"runAsUser\":0},\"volumeMounts\":[{\"mountPath\":\"/etc/origin/node/\",\"name\":\"host-config\"},{\"mountPath\":\"/etc/sysconfig\",\"name\":\"host-sysconfig-node\",\"readOnly\":true}]}],\"hostNetwork\":true,\"hostPID\":true,\"serviceAccountName\":\"sync\",\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"hostPath\":{\"path\":\"/etc/origin/node\"},\"name\":\"host-config\"},{\"hostPath\":{\"path\":\"/etc/sysconfig\"},\"name\":\"host-sysconfig-node\"}]}},\"updateStrategy\":{\"rollingUpdate\":{\"maxUnavailable\":\"50%\"},\"type\":\"RollingUpdate\"}}}\n", "kubernetes.io/description": "This daemon set provides dynamic configuration of nodes and relabels nodes as appropriate.\n"}, "creationTimestamp": "2018-09-06T06:44:26Z", "generation": 2, "labels": {"app": "sync", "component": "network", "openshift.io/component": "sync", "type": "infra"}, "name": "sync", "namespace": "openshift-node", "resourceVersion": "1148", "selfLink": "/apis/extensions/v1beta1/namespaces/openshift-node/daemonsets/sync", "uid": "4c0f62ed-b1a0-11e8-8411-fa163ea893ad"}, "spec": {"revisionHistoryLimit": 10, "selector": {"matchLabels": {"app": "sync"}}, "template": {"metadata": {"annotations": {"scheduler.alpha.kubernetes.io/critical-pod": ""}, "creationTimestamp": null, "labels": {"app": "sync", "component": "network", "openshift.io/component": "sync", "type": "infra"}}, "spec": {"containers": [{"command": ["/bin/bash", "-c", "#!/bin/bash\nset -euo pipefail\n\n# set by the node image\nunset KUBECONFIG\n\ntrap 'kill $(jobs -p); exit 0' TERM\n\n# track the current state of the config\nif [[ -f /etc/origin/node/node-config.yaml ]]; then\n md5sum /etc/origin/node/node-config.yaml > /tmp/.old\nelse\n touch /tmp/.old\nfi\n\n# loop until BOOTSTRAP_CONFIG_NAME is set\nwhile true; do\n file=/etc/sysconfig/origin-node\n if [[ -f /etc/sysconfig/atomic-openshift-node ]]; then\n file=/etc/sysconfig/atomic-openshift-node\n elif [[ -f /etc/sysconfig/origin-node ]]; then\n file=/etc/sysconfig/origin-node\n else\n echo \"info: Waiting for the node sysconfig file to be created\" 2>&1\n sleep 15 & wait\n continue\n fi\n name=\"$(sed -nE 's|^BOOTSTRAP_CONFIG_NAME=([^#].+)|\\1|p' \"${file}\" | head -1)\"\n if [[ -z \"${name}\" ]]; then\n echo \"info: Waiting for BOOTSTRAP_CONFIG_NAME to be set\" 2>&1\n sleep 15 & wait\n continue\n fi\n # in the background check to see if the value changes and exit if so\n pid=$BASHPID\n (\n while true; do\n if ! updated=\"$(sed -nE 's|^BOOTSTRAP_CONFIG_NAME=([^#].+)|\\1|p' \"${file}\" | head -1)\"; then\n echo \"error: Unable to check for bootstrap config, exiting\" 2>&1\n kill $pid\n exit 1\n fi\n if [[ \"${updated}\" != \"${name}\" ]]; then\n echo \"info: Bootstrap configuration profile name changed, exiting\" 2>&1\n kill $pid\n exit 0\n fi\n sleep 15\n done\n ) &\n break\ndone\n\n# periodically refresh both node-config.yaml and relabel the node\nwhile true; do\n if ! 
oc extract \"configmaps/${name}\" -n openshift-node --to=/etc/origin/node --confirm --request-timeout=10s --config /etc/origin/node/node.kubeconfig \"--token=$( cat /var/run/secrets/kubernetes.io/serviceaccount/token )\" > /dev/null; then\n echo \"error: Unable to retrieve latest config for node\" 2>&1\n sleep 15 &\n wait $!\n continue\n fi\n # detect whether the node-config.yaml has changed, and if so trigger a restart of the kubelet.\n md5sum /etc/origin/node/node-config.yaml > /tmp/.new\n if [[ \"$( cat /tmp/.old )\" != \"$( cat /tmp/.new )\" ]]; then\n echo \"info: Configuration changed, restarting kubelet\" 2>&1\n # TODO: kubelet doesn't relabel nodes, best effort for now\n # https://github.com/kubernetes/kubernetes/issues/59314\n if args=\"$(openshift start node --write-flags --config /etc/origin/node/node-config.yaml)\"; then\n labels=$(tr ' ' '\\n' <<<$args | sed -ne '/^--node-labels=/ { s/^--node-labels=//; p; }' | tr ',\\n' ' ')\n if [[ -n \"${labels}\" ]]; then\n echo \"info: Applying node labels $labels\" 2>&1\n if ! oc label --config=/etc/origin/node/node.kubeconfig \"node/${NODE_NAME}\" ${labels} --overwrite; then\n echo \"error: Unable to apply labels, will retry in 10\" 2>&1\n sleep 10 &\n wait $!\n continue\n fi\n fi\n else\n echo \"error: The downloaded node configuration is invalid, retrying later\" 2>&1\n sleep 10 &\n wait $!\n continue\n fi\n if ! pkill -U 0 -f '(^|/)hyperkube kubelet '; then\n echo \"error: Unable to restart Kubelet\" 2>&1\n sleep 10 &\n wait $!\n continue\n fi\n fi\n cp -f /tmp/.new /tmp/.old\n sleep 180 &\n wait $!\ndone\n"], "env": [{"name": "NODE_NAME", "valueFrom": {"fieldRef": {"apiVersion": "v1", "fieldPath": "spec.nodeName"}}}], "image": "registry.access.stage.redhat.com/openshift3/ose-node:v3.10", "imagePullPolicy": "IfNotPresent", "name": "sync", "resources": {}, "securityContext": {"privileged": true, "runAsUser": 0}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [{"mountPath": "/etc/origin/node/", "name": "host-config"}, {"mountPath": "/etc/sysconfig", "name": "host-sysconfig-node", "readOnly": true}]}], "dnsPolicy": "ClusterFirst", "hostNetwork": true, "hostPID": true, "restartPolicy": "Always", "schedulerName": "default-scheduler", "securityContext": {}, "serviceAccount": "sync", "serviceAccountName": "sync", "terminationGracePeriodSeconds": 1, "volumes": [{"hostPath": {"path": "/etc/origin/node", "type": ""}, "name": "host-config"}, {"hostPath": {"path": "/etc/sysconfig", "type": ""}, "name": "host-sysconfig-node"}]}}, "templateGeneration": 2, "updateStrategy": {"rollingUpdate": {"maxUnavailable": "50%"}, "type": "RollingUpdate"}}, "status": {"currentNumberScheduled": 1, "desiredNumberScheduled": 1, "numberAvailable": 1, "numberMisscheduled": 0, "numberReady": 1, "observedGeneration": 2, "updatedNumberScheduled": 1}}], "returncode": 0}, "state": "list"} <--snip--> TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] *** Thursday 06 September 2018 14:48:58 +0800 (0:00:00.181) 0:26:13.734 **** ok: [host-8-249-125.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "failed": false, "results": {"cmd": "/usr/bin/oc get node --selector=node-role.kubernetes.io/infra=true --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"} TASK [openshift_control_plane : Ensure that Prometheus has nodes to 
run on] ****
Thursday 06 September 2018  14:48:58 +0800 (0:00:00.481)       0:26:14.216 ****
fatal: [host-8-249-125.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "assertion": false,
    "changed": false,
    "evaluated_to": false,
    "failed": true,
    "msg": "No schedulable nodes found matching node selector for Prometheus - 'node-role.kubernetes.io/infra=true'"
}

After the failure, log into the cluster and check:

[root@qe-jialiu310-master-etcd-nfs-1 ~]# oc get node
NAME                             STATUS    ROLES     AGE       VERSION
qe-jialiu310-master-etcd-nfs-1   Ready     master    9m        v1.10.0+b81c8f8
qe-jialiu310-node-1              Ready     compute   5m        v1.10.0+b81c8f8
qe-jialiu310-node-infra-1        Ready     <none>    5m        v1.10.0+b81c8f8

Only the infra node is not labeled correctly. After several minutes, checking again, the labels are set properly.

[root@qe-jialiu310-master-etcd-nfs-1 ~]# oc get node
NAME                             STATUS    ROLES     AGE       VERSION
qe-jialiu310-master-etcd-nfs-1   Ready     master    23m       v1.10.0+b81c8f8
qe-jialiu310-node-1              Ready     compute   19m       v1.10.0+b81c8f8
qe-jialiu310-node-infra-1        Ready     infra     19m       v1.10.0+b81c8f8
Ok, from the log it looks like the sync check passed, but at that point in the install it is only checking whether the sync pod is running on 1 node (the master), because the other nodes haven't been added yet. So I think what we need to do is check that all the sync pods are available either immediately before the Prometheus install, or at least after the nodes have been added to the cluster.
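A minimal sketch of the kind of gate being proposed (illustrative only, not the contents of any PR): compare the sync daemonset's desired count with its ready count right before the Prometheus node-selector check runs.

desired=$(oc get daemonset sync -n openshift-node -o jsonpath='{.status.desiredNumberScheduled}')
ready=$(oc get daemonset sync -n openshift-node -o jsonpath='{.status.numberReady}')
if [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "$ready" = "$desired" ]; then
    echo "all ${ready}/${desired} sync pods ready - safe to run the Prometheus node-selector check"
else
    echo "sync daemonset not fully ready (${ready:-0}/${desired:-?}) - retry later"
fi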
New PR to move the location of the sync pod check: https://github.com/openshift/openshift-ansible/pull/9965 Johnny, are you able to test if this patch resolves the issue?
I also encounter exactly the same issue while installing 3.10 This is the log [1]. Is there any update on this issue? [1] http://pastebin.test.redhat.com/643583
Checking that sync pods have started on all nodes is not sufficient. The plan to fix this (a rough sketch of both halves follows below):
* Update the sync script to annotate the node with the md5 of the applied config
* Check that the sync DS has started on all nodes
* Check that all nodes have the annotation with a hash. Once this annotation appears, it means the node has the new config applied and has restarted.
* Wait for the API server to be back
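The sketch (illustrative only; the annotation key is a placeholder, the real key used by the sync script may differ):

# node side, inside the sync script, right after the new config has been applied
md5="$(md5sum /etc/origin/node/node-config.yaml | cut -d' ' -f1)"
oc annotate --config=/etc/origin/node/node.kubeconfig "node/${NODE_NAME}" \
    "node.openshift.io/md5sum=${md5}" --overwrite

# installer side, wait until every node reports the annotation
total=$(oc get nodes --no-headers | wc -l)
until [ "$(oc get nodes -o yaml | grep -c 'node.openshift.io/md5sum')" -ge "$total" ]; do
    sleep 15
done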
PR to 3.10: https://github.com/openshift/openshift-ansible/pull/10045

Holding the PR to make sure that it works in 3.11 first (see bug #1628208).
3.10 backport PR - https://github.com/openshift/openshift-ansible/pull/10100
Fix is available in openshift-ansible-3.10.52-1
While testing this bug with openshift-ansible-3.10.53-1.git.0.ba2c2ec.el7.noarch, I also hit https://bugzilla.redhat.com/show_bug.cgi?id=1636914. It is reproduced almost 100% of the time against the stage registry, so I am fairly confident this is related to stage registry performance.

# for i in `seq 1 30`; do oc get po -n openshift-node; sleep 60; done
NAME         READY     STATUS              RESTARTS   AGE
sync-kn94l   0/1       ContainerCreating   0          29m
sync-tvlfm   1/1       Running             0          29m
sync-tzj99   1/1       Running             0          33m

# oc describe po sync-kn94l -n openshift-node
Events:
  Type    Reason   Age   From                          Message
  ----    ------   ----  ----                          -------
  Normal  Pulling  13m   kubelet, qe-jialiu310-node-1  pulling image "registry.access.stage.redhat.com/openshift3/ose-node:v3.10"

# for i in `seq 1 30`; do oc get po -n openshift-node; sleep 60; done
sync-kn94l   1/1       Running   0         30m
sync-tvlfm   1/1       Running   0         30m
sync-tzj99   1/1       Running   0         34m

The ose-node image is about 1.3GB! That is too big, and it easily causes problems.

# crictl images
W1012 07:22:54.727619   29319 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                            TAG       IMAGE ID        SIZE
registry.access.stage.redhat.com/openshift3/ose-control-plane   v3.10     c7ed74a5e981a   804MB
registry.access.stage.redhat.com/openshift3/ose-node            v3.10     2128ae84c408a   1.32GB

Personally I think there are 3 ways to resolve this issue:
1. improve stage registry performance
2. implement some logic in the openshift-ansible installer to pre-pull the ose-node image based on the runtime in use
3. increase the timeout, just like https://github.com/openshift/openshift-ansible/pull/10363

#2 is the best option; if this is considered an edge case, #3 is also acceptable to me, so please backport the 3.11 fix for BZ#1636914 to 3.10 as well. BTW, stage testing is very important for QE; if this issue is not resolved we have no good workaround, and this would delay the advisory release.
(In reply to Johnny Liu from comment #18)
> 2. implement some logic in openshift-ansible installer to pre-pull ose-node
> image based on different runtime.

Node images are already being prepulled; ContainerCreating means the image is available and the runtime is creating a new container.

> 3. increase timeout just like
> https://github.com/openshift/openshift-ansible/pull/10363
>
> #2 is the best option, if this case is some edge cases, #3 is also
> acceptable for me, so pls backport the 3.11 fix for BZ#1636914 to 3.10 too.

Created https://github.com/openshift/openshift-ansible/pull/10393 to cherry-pick the fix to the 3.10 branch.
(In reply to Vadim Rutkovsky from comment #19)
> (In reply to Johnny Liu from comment #18)
> > 2. implement some logic in openshift-ansible installer to pre-pull ose-node
> > image based on different runtime.
>
> Node images are being prepulled, ContainerCreating means the image is
> available and runtime is creating a new container

The "Pre-pull node image" task is prepulling a docker image; it does not take the runtime into consideration. In my testing I am using the crio runtime, which means that when the sync pod is being created it still has to pull the image, and that takes a long time. And I do not think "ContainerCreating" means the image is available: in comment 18, when you run "oc describe" on the pod stuck in "ContainerCreating", the event shows it is still pulling the image.

# oc describe po sync-kn94l -n openshift-node
Events:
  Type    Reason   Age   From                          Message
  ----    ------   ----  ----                          -------
  Normal  Pulling  13m   kubelet, qe-jialiu310-node-1  pulling image "registry.access.stage.redhat.com/openshift3/ose-node:v3.10"
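For what it's worth, a runtime-aware pre-pull could be as simple as the following sketch (my own illustration, not what the installer does today; the image name is taken from the output above):

image="registry.access.stage.redhat.com/openshift3/ose-node:v3.10"
if systemctl is-active --quiet crio; then
    # crio does not share docker's image store, so pull through the CRI tooling
    crictl pull "${image}"
else
    docker pull "${image}"
fi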
(In reply to Johnny Liu from comment #20)
> (In reply to Vadim Rutkovsky from comment #19)
> > (In reply to Johnny Liu from comment #18)
> > > 2. implement some logic in openshift-ansible installer to pre-pull ose-node
> > > image based on different runtime.
> >
> > Node images are being prepulled, ContainerCreating means the image is
> > available and runtime is creating a new container
> "Pre-pull node image" task is prepulling a docker image, do not take runtime
> into consideration. In my testing, I am using crio runtime, that means when
> sync pod is creating, it have to pull the image, it will take a long time.

Ah, correct. I filed bug 1639201 to track that. Keeping this bug in POST, as more PRs would be required to fully fix this issue.
It was discussed during our team meeting that https://github.com/openshift/openshift-ansible/pull/10393 was sufficient to resolve this bug.
Because enabling the crio runtime hits https://bugzilla.redhat.com/show_bug.cgi?id=1639623, which causes the master API readiness check to time out against the stage registry, the docker runtime is used for this bug's verification.

Verified this bug with openshift-ansible-3.10.58-1.git.0.5753b33.el7.noarch, and it PASSED.

TASK [openshift_manage_node : Wait for sync DS to set annotations on all nodes] ***
Thursday 18 October 2018  12:25:09 +0800 (0:00:00.476)       0:26:06.549 ******
FAILED - RETRYING: Wait for sync DS to set annotations on all nodes (180 retries left).
<--snip-->
FAILED - RETRYING: Wait for sync DS to set annotations on all nodes (134 retries left).
ok: [host-8-249-20.host.centralci.eng.rdu2.redhat.com -> host-8-249-20.host.centralci.eng.rdu2.redhat.com]
<--snip-->

TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on] ****
Thursday 18 October 2018  12:34:52 +0800 (0:00:00.365)       0:35:49.541 ******
ok: [host-8-249-20.host.centralci.eng.rdu2.redhat.com] => {
    "changed": false,
    "failed": false,
    "msg": "All assertions passed"
}
Per comment 23, move this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2709