Description of problem: Pre-pull tasks for the control plane and node assume Docker is the container runtime
master PR - https://github.com/openshift/openshift-ansible/pull/10402
When testing https://bugzilla.redhat.com/show_bug.cgi?id=1636914, the installation failed at the "openshift_control_plane : Wait for control plane pods to appear" task when the CRI-O runtime is enabled, so we hope this fix can land ASAP.
3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10436
Fix is available in openshift-ansible-3.11.26-1
Re-tested this bug with openshift-ansible-3.11.28-1.git.0.96f8adb.el7.noarch; it needs some more fixing. "Pre-pull node image" is still using the docker runtime to pre-pull the node image:

```yaml
- name: Pre-pull node image
  docker_image:
    name: "{{ osn_image }}"
  environment:
    NO_PROXY: "{{ openshift.common.no_proxy | default('') }}"
  when: node_image.stdout_lines == []
  # 10 minutes to pull the image
  async: 600
  poll: 0
  register: image_prepull
```
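For comparison, a runtime-aware variant would branch on the configured runtime instead of always calling docker_image. A minimal sketch, assuming the standard openshift_use_crio inventory flag; the actual fix (in the PR referenced in the next comment) may differ in detail:

```yaml
# Sketch only: choose the pull mechanism based on the runtime in use.
- name: Pre-pull node image (docker)
  docker_image:
    name: "{{ osn_image }}"
  when:
    - not (openshift_use_crio | default(False) | bool)
    - node_image.stdout_lines == []
  register: image_prepull
  # 10 minutes to pull the image
  async: 600
  poll: 0

- name: Pre-pull node image (cri-o)
  command: "crictl pull {{ osn_image }}"
  when:
    - openshift_use_crio | default(False) | bool
    - node_image.stdout_lines == []
  register: image_prepull
  # 10 minutes to pull the image
  async: 600
  poll: 0
```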
Correct, all images except the node one are being pre-pulled. Created master PR https://github.com/openshift/openshift-ansible/pull/10464
Fix is available in openshift-ansible-3.11.34-1
Re-tested this bug with openshift-ansible-3.11.34-1.git.0.499a30b.el7.noarch. The PR seems to take effect, but unfortunately "Check status of node image pre-pull" fails to pre-pull ose-node:

TASK [openshift_node : Check status of node image pre-pull] ********************
Monday 29 October 2018 15:39:31 +0800 (0:00:00.117)       0:06:22.707 ********

changed: [host-8-251-110.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "274419478459.15172", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.258525", "end": "2018-10-29 03:35:49.786125", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:04.527600", "stderr": "W1029 03:34:04.559334 15179 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:04.559334 15179 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []}

changed: [host-8-250-70.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "271535462819.15098", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.420395", "end": "2018-10-29 03:35:50.226819", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:04.806424", "stderr": "W1029 03:34:04.839431 15105 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:50-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:04.839431 15105 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:50-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []}

changed: [host-8-244-143.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "454904875218.15574", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.260942", "end": "2018-10-29 03:35:49.134957", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:03.874015", "stderr": "W1029 03:34:03.905513 15581 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:03.905513 15581 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []}

I logged into the node, ran "crictl pull registry.access.stage.redhat.com/openshift3/ose-node:v3.11" manually, and it succeeded:

[root@qe-jialiu-node-infra-1 ~]# crictl images
W1029 06:21:05.577647 30848 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                  TAG     IMAGE ID        SIZE
registry.access.stage.redhat.com/openshift3/ose-pod    v3.11   751ff215ba5da   244MB
[root@qe-jialiu-node-infra-1 ~]# crictl pull registry.access.stage.redhat.com/openshift3/ose-node:v3.11
W1029 06:21:08.744570 30856 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5
[root@qe-jialiu-node-infra-1 ~]# crictl images
W1029 06:23:23.738749 30895 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                  TAG     IMAGE ID        SIZE
registry.access.stage.redhat.com/openshift3/ose-node   v3.11   074bf04571e22   1.17GB
registry.access.stage.redhat.com/openshift3/ose-pod    v3.11   751ff215ba5da   244MB

Do you know why the "rpc error" is thrown for the Ansible task?
Created attachment 1498474 [details] installation log with inventory file embedded
Sounds like we should add more retries there
@Vadim, while you are adding retries, could you also add one more task to pre-pull the "ose-control-plane" image? When testing this bug, I hit the "Wait for control plane pods to appear" timeout several times because the API static pod was not running yet. Pre-pulling the "ose-control-plane" image should mitigate this. A sketch of what such a task could look like follows.
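A purely hypothetical sketch mirroring the node pre-pull, assuming osm_image resolves to the openshift3/ose-control-plane reference (whether the eventual PR does it this way is up to Vadim):

```yaml
# Hypothetical sketch only: pre-pull the control-plane image the same
# async way as the node image; osm_image is assumed here.
- name: Pre-pull control plane image
  command: "crictl pull {{ osm_image }}"
  register: control_plane_prepull
  # 10 minutes to pull the image
  async: 600
  poll: 0
```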
Sure thing, added retries to the pulls for every image we pre-pull. Master PR - https://github.com/openshift/openshift-ansible/pull/10551
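Since the pulls run as fire-and-forget async jobs, retries typically attach to the status-check side rather than the pull task itself. A sketch of that common Ansible pattern, with illustrative values that are not necessarily the ones in the PR:

```yaml
# Sketch of the usual async-plus-retry pattern: keep re-checking the
# async pull job until it reports finished, instead of failing on the
# first poll.
- name: Check status of node image pre-pull
  async_status:
    jid: "{{ image_prepull.ansible_job_id }}"
  register: prepull_status
  until: prepull_status.finished
  retries: 60
  delay: 10
  failed_when: false
```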
The error suggests the CRI-O daemon went down at some point ("transport is closing"). Is there any way to grab logs from the system? From what I can tell, it could be an OOM kill or a panic in CRI-O itself.
Overall I agree that we should add some retries, by the way.
Johnny, any chance to grab system logs when that happens?
@Antonio, good point. I reproduced the issue again and followed your suggestion to do some checking:

[root@qe-jialiu311-node-1 ~]# journalctl -f -u crio
-- Logs begin at Tue 2018-10-30 22:31:52 EDT. --
Oct 30 22:40:57 qe-jialiu311-node-1 systemd[1]: Starting Open Container Initiative Daemon...
Oct 30 22:40:57 qe-jialiu311-node-1 systemd[1]: Started Open Container Initiative Daemon.
Oct 30 22:42:18 qe-jialiu311-node-1 systemd[1]: Stopping Open Container Initiative Daemon...
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service stop-sigterm timed out. Killing.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service: main process exited, code=killed, status=9/KILL
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Unit crio.service entered failed state.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service failed.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Starting Open Container Initiative Daemon...
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Started Open Container Initiative Daemon.

So it seems that at 22:43:48 the crio service was killed and restarted. Grepping /var/log/messages, I get the following around that moment:

Oct 30 22:42:17 qe-jialiu311-node-1 ansible-os_firewall_manage_iptables: Invoked with protocol=tcp name=Prometheus monitoring chain=OS_FIREWALL_ALLOW create_jump_rule=True action=add ip_version=ipv4 jump_rule_chain=INPUT port=9000-10000
Oct 30 22:42:18 qe-jialiu311-node-1 ansible-systemd: Invoked with no_block=False force=None name=cri-o enabled=True daemon_reload=False state=restarted user=False masked=None
Oct 30 22:42:18 qe-jialiu311-node-1 systemd: Stopping Open Container Initiative Daemon...
Oct 30 22:42:18 qe-jialiu311-node-1 ansible-async_wrapper.py: 15979 still running (590)
<--snip-->
Oct 30 22:43:46 qe-jialiu311-node-1 ansible-async_wrapper.py: 15900 still running (500)
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service stop-sigterm timed out. Killing.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service: main process exited, code=killed, status=9/KILL
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: Unit crio.service entered failed state.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service failed.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: Starting Open Container Initiative Daemon...
Going through the openshift-ansible installer log:

TASK [openshift_node : Pre-pull node image] ************************************
Wednesday 31 October 2018 10:42:02 +0800 (0:00:00.312)       0:00:58.300 *****
changed: [host-8-243-212.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "410992443836.16394", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/410992443836.16394", "started": 1}
changed: [host-8-249-227.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "647625704086.15896", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/647625704086.15896", "started": 1}
changed: [host-8-249-183.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "359232487803.15905", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/359232487803.15905", "started": 1}
<--snip-->
RUNNING HANDLER [openshift_node : restart NetworkManager] **********************
Wednesday 31 October 2018 10:42:16 +0800 (0:00:00.373)       0:01:12.106 *****
<--snip-->
RUNNING HANDLER [openshift_node : restart dnsmasq] *****************************
Wednesday 31 October 2018 10:42:17 +0800 (0:00:00.414)       0:01:12.521 *****
<--snip-->
TASK [openshift_node : Restart cri-o] ******************************************
Wednesday 31 October 2018 10:42:19 +0800 (0:00:00.130)       0:01:15.038 *****
<--snip-->
TASK [openshift_node : Restart journald] ***************************************
Wednesday 31 October 2018 10:47:13 +0800 (0:00:02.357)       0:06:09.155 *****
<--snip-->
TASK [openshift_node : Check status of node image pre-pull] ********************
Wednesday 31 October 2018 10:47:16 +0800 (0:00:00.127)       0:06:11.580 *****

So between the "Pre-pull node image" async task being triggered and its status check, the installer performs all kinds of service restarts, including the crio service itself, which makes "transport is closing" entirely plausible. @Antonio @Vadim, I no longer think retries alone would improve this; the image pre-pull should happen after the crio service has been restarted. WDYT?
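To illustrate the ordering being proposed here, a hypothetical sketch, not the shipped playbook: restart cri-o first, then kick off the async pre-pull against a stable daemon.

```yaml
# Hypothetical reordering sketch: ensure cri-o is restarted before the
# async pull starts, so the pull cannot race the restart.
- name: Restart cri-o
  systemd:
    name: cri-o
    state: restarted

- name: Pre-pull node image
  command: "crictl pull {{ osn_image }}"
  when: node_image.stdout_lines == []
  register: image_prepull
  # 10 minutes to pull the image
  async: 600
  poll: 0
```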
We can't add 'retry' to the pull itself in the Ansible scripts, as that task runs async and the result is not known until it is checked. Please file a new issue for the CRI-O service being killed. Moving this to ON_QA as the 3.11 cherrypick was merged.
Based on comment 8, moving this bug to VERIFIED. For the issue described in comment 16, I will file a new bug to track it.
Also ran some regression testing for pull/10436 and pull/10464 against both the docker and crio runtimes behind a proxy with openshift-ansible-3.11.39-1.git.0.fe42b3b.el7.noarch; the cluster was set up successfully, so these changes do not break installs behind a proxy.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024