The root cause analysis is in comment 16.

+++ This bug was initially created as a clone of Bug #1639201 +++

Description of problem:
Pre-pull tasks for control-plane and node assume Docker is used as the runtime.

--- Additional comment from Vadim Rutkovsky on 2018-10-15 06:30:32 EDT ---

master PR - https://github.com/openshift/openshift-ansible/pull/10402

--- Additional comment from Johnny Liu on 2018-10-16 05:04:12 EDT ---

When testing https://bugzilla.redhat.com/show_bug.cgi?id=1636914, the installation task failed at "openshift_control_plane : Wait for control plane pods to appear" when the crio runtime is enabled, so we hope to get this fix ASAP.

--- Additional comment from Vadim Rutkovsky on 2018-10-18 08:29:04 EDT ---

3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10436

--- Additional comment from Vadim Rutkovsky on 2018-10-19 04:52:51 EDT ---

Fix is available in openshift-ansible-3.11.26-1

--- Additional comment from Johnny Liu on 2018-10-22 02:26:36 EDT ---

Re-tested this bug with openshift-ansible-3.11.28-1.git.0.96f8adb.el7.noarch; it needs some more fixes. "Pre-pull node image" is still using the docker runtime to pre-pull the node image:

- name: Pre-pull node image
  docker_image:
    name: "{{ osn_image }}"
  environment:
    NO_PROXY: "{{ openshift.common.no_proxy | default('') }}"
  when: node_image.stdout_lines == []
  # 10 minutes to pull the image
  async: 600
  poll: 0
  register: image_prepull

--- Additional comment from Vadim Rutkovsky on 2018-10-22 05:03:12 EDT ---

Correct, all other images except the node one are being prepulled. Created master PR https://github.com/openshift/openshift-ansible/pull/10464

--- Additional comment from Vadim Rutkovsky on 2018-10-29 02:38:30 EDT ---

Fix is available in openshift-ansible-3.11.34-1

--- Additional comment from Johnny Liu on 2018-10-29 06:42:38 EDT ---

Re-tested this bug with openshift-ansible-3.11.34-1.git.0.499a30b.el7.noarch. The PR seems to have taken effect, but unfortunately "Check status of node image pre-pull" fails to pre-pull ose-node.
TASK [openshift_node : Check status of node image pre-pull] ******************** Monday 29 October 2018 15:39:31 +0800 (0:00:00.117) 0:06:22.707 ******** changed: [host-8-251-110.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "274419478459.15172", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.258525", "end": "2018-10-29 03:35:49.786125", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:04.527600", "stderr": "W1029 03:34:04.559334 15179 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:04.559334 15179 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []} changed: [host-8-250-70.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "271535462819.15098", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.420395", "end": "2018-10-29 03:35:50.226819", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:04.806424", "stderr": "W1029 03:34:04.839431 15105 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:50-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:04.839431 15105 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:50-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []} changed: [host-8-244-143.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "454904875218.15574", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.260942", "end": "2018-10-29 03:35:49.134957", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:03.874015", "stderr": "W1029 03:34:03.905513 15581 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:03.905513 15581 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []} I logged into the node, run "crictl pull 
registry.access.stage.redhat.com/openshift3/ose-node:v3.11" manually, and it succeeded.

[root@qe-jialiu-node-infra-1 ~]# crictl images
W1029 06:21:05.577647 30848 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                   TAG     IMAGE ID        SIZE
registry.access.stage.redhat.com/openshift3/ose-pod    v3.11   751ff215ba5da   244MB
[root@qe-jialiu-node-infra-1 ~]# crictl pull registry.access.stage.redhat.com/openshift3/ose-node:v3.11
W1029 06:21:08.744570 30856 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5
[root@qe-jialiu-node-infra-1 ~]# crictl images
W1029 06:23:23.738749 30895 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                   TAG     IMAGE ID        SIZE
registry.access.stage.redhat.com/openshift3/ose-node   v3.11   074bf04571e22   1.17GB
registry.access.stage.redhat.com/openshift3/ose-pod    v3.11   751ff215ba5da   244MB

Do you know why the "rpc error" is thrown for the ansible task?

--- Additional comment from Johnny Liu on 2018-10-29 06:43 EDT ---

--- Additional comment from Vadim Rutkovsky on 2018-10-29 12:10:36 EDT ---

Sounds like we should add more retries there.

--- Additional comment from Johnny Liu on 2018-10-29 23:11:21 EDT ---

@Vadim, when you are adding more retries, could you also add one more task to pre-pull the "ose-control-plane" image? When I was testing this bug, I hit the "Wait for control plane pods to appear" timeout several times, because the api static pod was not running yet. I think pre-pulling the "ose-control-plane" image would improve that.

--- Additional comment from Vadim Rutkovsky on 2018-10-30 05:20:31 EDT ---

Sure thing, added retries to the pulls for every image we prepull.

Master PR - https://github.com/openshift/openshift-ansible/pull/10551

--- Additional comment from Antonio Murdaca on 2018-10-30 08:42:36 EDT ---

The error suggests the CRI-O daemon went down at some point: "transport is closing". Is there any way to grab logs from the system? It could be an OOM or a panic in cri-o itself, from what I can tell.

--- Additional comment from Antonio Murdaca on 2018-10-30 08:44:54 EDT ---

Overall I agree that we should retry a few times, btw.

--- Additional comment from Antonio Murdaca on 2018-10-30 09:14:03 EDT ---

Johnny, any chance to grab system logs when that happens?

--- Additional comment from Johnny Liu on 2018-10-30 23:34:41 EDT ---

@Antonio, good point. I reproduced the issue again, and followed your instructions to do some checks.

[root@qe-jialiu311-node-1 ~]# journalctl -f -u crio
-- Logs begin at Tue 2018-10-30 22:31:52 EDT. --
Oct 30 22:40:57 qe-jialiu311-node-1 systemd[1]: Starting Open Container Initiative Daemon...
Oct 30 22:40:57 qe-jialiu311-node-1 systemd[1]: Started Open Container Initiative Daemon.
Oct 30 22:42:18 qe-jialiu311-node-1 systemd[1]: Stopping Open Container Initiative Daemon...
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service stop-sigterm timed out. Killing.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service: main process exited, code=killed, status=9/KILL
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Unit crio.service entered failed state.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service failed.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Starting Open Container Initiative Daemon...
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Started Open Container Initiative Daemon.

So it seems that at "22:43:48" the crio service was restarted. Grepping /var/log/messages, I get the following messages at that moment:

Oct 30 22:42:17 qe-jialiu311-node-1 ansible-os_firewall_manage_iptables: Invoked with protocol=tcp name=Prometheus monitoring chain=OS_FIREWALL_ALLOW create_jump_rule=True action=add ip_version=ipv4 jump_rule_chain=INPUT port=9000-10000
Oct 30 22:42:18 qe-jialiu311-node-1 ansible-systemd: Invoked with no_block=False force=None name=cri-o enabled=True daemon_reload=False state=restarted user=False masked=None
Oct 30 22:42:18 qe-jialiu311-node-1 systemd: Stopping Open Container Initiative Daemon...
Oct 30 22:42:18 qe-jialiu311-node-1 ansible-async_wrapper.py: 15979 still running (590)
<--snip-->
<--snip-->
<--snip-->
Oct 30 22:43:46 qe-jialiu311-node-1 ansible-async_wrapper.py: 15900 still running (500)
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service stop-sigterm timed out. Killing.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service: main process exited, code=killed, status=9/KILL
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: Unit crio.service entered failed state.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service failed.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: Starting Open Container Initiative Daemon...

Going through the openshift-ansible installer log:

TASK [openshift_node : Pre-pull node image] ************************************
Wednesday 31 October 2018 10:42:02 +0800 (0:00:00.312) 0:00:58.300 *****
changed: [host-8-243-212.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "410992443836.16394", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/410992443836.16394", "started": 1}
changed: [host-8-249-227.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "647625704086.15896", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/647625704086.15896", "started": 1}
changed: [host-8-249-183.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "359232487803.15905", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/359232487803.15905", "started": 1}
<--snip-->
RUNNING HANDLER [openshift_node : restart NetworkManager] **********************
Wednesday 31 October 2018 10:42:16 +0800 (0:00:00.373) 0:01:12.106 *****
<--snip-->
RUNNING HANDLER [openshift_node : restart dnsmasq] *****************************
Wednesday 31 October 2018 10:42:17 +0800 (0:00:00.414) 0:01:12.521 *****
<--snip-->
TASK [openshift_node : Restart cri-o] ******************************************
Wednesday 31 October 2018 10:42:19 +0800 (0:00:00.130) 0:01:15.038 *****
<--snip-->
TASK [openshift_node : Restart journald] ***************************************
Wednesday 31 October 2018 10:47:13 +0800 (0:00:02.357) 0:06:09.155 *****
<--snip-->
TASK [openshift_node : Check status of node image pre-pull] ********************
Wednesday 31 October 2018 10:47:16 +0800 (0:00:00.127) 0:06:11.580 *****

Since the "Pre-pull node image" async task is triggered, the installer does all kinds of service restarts, even of the crio service, so "transport is closing" totally makes sense. @Antonio @Vadim, now I do not think "retry some time" would improve this much. Pre-pulling the image should happen after the crio service is restarted. WDYT?
--- Additional comment from Vadim Rutkovsky on 2018-11-05 05:38:46 EST ---

We can't add 'retry' in the ansible scripts, as this task runs async and the result is not known until it's checked.

Please file a new issue for the CRI-O service being killed.

Moving this to ON_QA as the 3.11 cherrypick was merged.

--- Additional comment from Johnny Liu on 2018-11-06 23:00:19 EST ---

Based on comment 8, moving this bug to "VERIFIED". For the issue described in comment 16, I will file a new bug to track it.
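For context on the async constraint described above: the prepull is fired off with `poll: 0`, so the task returns immediately and any retries have to live on the separate status-check task rather than on the pull itself. Below is a minimal sketch of that pattern for the CRI-O case, pieced together from the `crictl pull` command and the "retries left" output quoted later in this thread; the check-task arguments, the delay value, and anything not shown in the quoted snippets are assumptions, not the literal openshift-ansible code.

```
# Minimal sketch of the fire-and-forget pull plus retried status check.
# Not the literal openshift-ansible tasks; some argument values are assumed.
- name: Pre-pull node image
  command: "crictl pull {{ osn_image }}"
  when: node_image.stdout_lines == []
  # 10 minutes to pull the image
  async: 600
  poll: 0                       # return immediately; the pull keeps running in the background
  register: image_prepull

- name: Check status of node image pre-pull
  async_status:
    jid: "{{ image_prepull.ansible_job_id }}"
  register: pull_status
  until: pull_status.finished   # retry the check until the background job completes
  retries: 20                   # matches the "20 retries left" messages in the verify log
  delay: 30                     # assumed polling interval
  failed_when: false            # the logged results show failed_when_result: false
```

The key point is that `retries`/`until` on the pull task itself would be a no-op: with `poll: 0`, Ansible only records the job id and moves on, so only the check can be retried.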
>Since "Pre-pull node image" async task is triggered, installer did all kinds of service restart, even crio service, so "transport is closing" totally make scenes. Do you have inventory + playbook logs? This doesn't seem to be happening on upstream CI at least
The attachment in the original bug (BZ#1647660) has the inventory + playbook logs. Please refer to https://bugzilla.redhat.com/attachment.cgi?id=1498474
It doesn't seem to be the ansible playbooks stopping the crio service. The last mention of crio.service is in the `Start the CRI-O service` task.

Could you collect `systemctl status crio` and `journalctl -u crio` from the affected nodes?
Interesting, I can now reproduce this on CI:

```
fatal: [ci-op-kng215iz-6d825-ig-m-h99f]: FAILED! => {"changed": false, "msg": "Error connecting: Error while fetching server API version: ('Connection aborted.', error(2, 'No such file or directory'))"}
```
Right, we _were_ restarting crio between the start of the pull and the check of its status, so if the pull took a bit longer it would fail.

This race condition is fixed in https://github.com/openshift/openshift-ansible/pull/10641
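Concretely, the corrected ordering puts the CRI-O restart ahead of the point where the background pull is started, which matches the task sequence visible in the verification log later in this thread (Restart cri-o, then Check that node image is present, then Pre-pull node image). A rough sketch of that ordering, assuming a plain systemd restart task as the ansible-systemd journal entry above suggests; the real role splits these tasks and handlers across files:

```
# Sketch only: restart CRI-O first, then start the background pull,
# so a slow pull can no longer race with the daemon restart.
- name: Restart cri-o
  systemd:
    name: cri-o
    state: restarted

- name: Pre-pull node image
  command: "crictl pull {{ osn_image }}"
  async: 600
  poll: 0
  register: image_prepull
```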
3.11 cherrypick - https://github.com/openshift/openshift-ansible/pull/10642
(In reply to Vadim Rutkovsky from comment #3)
> It doesn't seem to be the ansible playbooks stopping the crio service. The last
> mention of crio.service is in the `Start the CRI-O service` task.
>
> Could you collect `systemctl status crio` and `journalctl -u crio` from the
> affected nodes?

Going through the install log in https://bugzilla.redhat.com/attachment.cgi?id=1498474:

In line 2762, the "Pre-pull node image" task is triggered.
In line 2951, the "Restart cri-o" task is executed.
In line 3065, the "Check status of node image pre-pull" task's output says the pre-pull failed.

Obviously, the installer was restarting crio between the start of the pull and the check of its status. Glad to see you already reproduced it in comment 5.
BTW, could you also cherrypick the fix for 3.10?
(In reply to Johnny Liu from comment #8)
> BTW, could you also cherrypick the fix for 3.10?

Yep, a 3.10 cherrypick will be created once the 3.11 PR is merged.
I've found out that CRI-O 1.11.x onwards also has a bug (a flake) when pulling images that causes a panic in CRI-O, making it restart and leave bad state behind. Maybe that is related as well; the PR fixing it is in flight here: https://github.com/containers/storage/pull/234 (and it will need a cri-o revendor and a new cri-o release).
Fix is available in openshift-ansible-3.11.44-1
Verified this bug with openshift-ansible-3.11.44-1.git.0.11d174e.el7.noarch, and it PASSED. The playbook ran the tasks in the following order:

TASK [openshift_node : Restart cri-o] ******************************************
Thursday 15 November 2018 11:21:01 +0800 (0:00:00.148) 0:01:09.222 *****

TASK [openshift_node : restart NetworkManager to ensure resolv.conf is present] ***
Thursday 15 November 2018 11:21:01 +0800 (0:00:00.584) 0:01:09.807 *****

TASK [openshift_node : Check that node image is present] ***********************
Thursday 15 November 2018 11:21:02 +0800 (0:00:00.374) 0:01:10.181 *****

TASK [openshift_node : Pre-pull node image] ************************************
Thursday 15 November 2018 11:21:02 +0800 (0:00:00.368) 0:01:10.550 *****

TASK [openshift_node : Check status of node image pre-pull] ********************
Thursday 15 November 2018 11:21:47 +0800 (0:00:00.134) 0:01:55.161 *****
FAILED - RETRYING: Check status of node image pre-pull (20 retries left).
FAILED - RETRYING: Check status of node image pre-pull (20 retries left).
FAILED - RETRYING: Check status of node image pre-pull (20 retries left).
FAILED - RETRYING: Check status of node image pre-pull (19 retries left).
FAILED - RETRYING: Check status of node image pre-pull (19 retries left).
FAILED - RETRYING: Check status of node image pre-pull (19 retries left).
FAILED - RETRYING: Check status of node image pre-pull (18 retries left).
FAILED - RETRYING: Check status of node image pre-pull (18 retries left).
FAILED - RETRYING: Check status of node image pre-pull (18 retries left).
changed: [host-8-248-78.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "392270439751.19485", "attempts": 4, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:02:07.614401", "end": "2018-11-14 22:23:10.833324", "failed_when_result": false, "finished": 1, "rc": 0, "start": "2018-11-14 22:21:03.218923", "stderr": "W1114 22:21:03.249200 19492 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "stderr_lines": ["W1114 22:21:03.249200 19492 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\"."], "stdout": "Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5", "stdout_lines": ["Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5"]}
FAILED - RETRYING: Check status of node image pre-pull (17 retries left).
changed: [host-8-241-102.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "464770149400.19100", "attempts": 4, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:02:10.633968", "end": "2018-11-14 22:23:14.309121", "failed_when_result": false, "finished": 1, "rc": 0, "start": "2018-11-14 22:21:03.675153", "stderr": "W1114 22:21:03.711652 19107 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "stderr_lines": ["W1114 22:21:03.711652 19107 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\"."], "stdout": "Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5", "stdout_lines": ["Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5"]} changed: [host-8-241-190.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "41997407307.19014", "attempts": 5, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:02:19.185821", "end": "2018-11-14 22:23:22.820222", "failed_when_result": false, "finished": 1, "rc": 0, "start": "2018-11-14 22:21:03.634401", "stderr": "W1114 22:21:03.666832 19021 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "stderr_lines": ["W1114 22:21:03.666832 19021 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\"."], "stdout": "Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5", "stdout_lines": ["Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5"]}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024