Bug 1639201

Summary: Images are not prepulled when CRIO runtime is used
Product: OpenShift Container Platform Reporter: Vadim Rutkovsky <vrutkovs>
Component: InstallerAssignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: Johnny Liu <jialiu>
Severity: medium Docs Contact:
Priority: high    
Version: 3.11.0CC: amurdaca, aos-bugs, jialiu, jokerman, mmccomas, pkanthal, vrutkovs, wmeng
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: node, pod and control-plane images were not pre-pulled when crio was used Consequence: tasks timed out as they included pull time Fix: images are prepulled when docker and crio are used Result: installation succeeds
Story Points: ---
Clone Of:
: 1639623 1647288 (view as bug list) Environment:
Last Closed: 2019-01-10 09:04:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1639623, 1647288    
Attachments:
Description Flags
installation log with inventory file embedded none

Description Vadim Rutkovsky 2018-10-15 09:35:59 UTC
Description of problem:
Pre-pull tasks for control-plane and node are assuming Docker is used as a runtime

Comment 1 Vadim Rutkovsky 2018-10-15 10:30:32 UTC
master PR - https://github.com/openshift/openshift-ansible/pull/10402

Comment 2 Johnny Liu 2018-10-16 09:04:12 UTC
When testing https://bugzilla.redhat.com/show_bug.cgi?id=1636914, the installation task failed at "openshift_control_plane : Wait for control plane pods to appear" when crio runtime is enabled, so hope we could get this fix ASAP.

Comment 3 Vadim Rutkovsky 2018-10-18 12:29:04 UTC
3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10436

Comment 4 Vadim Rutkovsky 2018-10-19 08:52:51 UTC
Fix is available in openshift-ansible-3.11.26-1

Comment 5 Johnny Liu 2018-10-22 06:26:36 UTC
Re-test this bug with openshift-ansible-3.11.28-1.git.0.96f8adb.el7.noarch, need some more fix.

"Pre-pull node image" is still using docker runtime to pre-pull node image.
- name: Pre-pull node image
  docker_image:
    name: "{{ osn_image }}"
  environment:
    NO_PROXY: "{{ openshift.common.no_proxy | default('') }}"
  when: node_image.stdout_lines == []
  # 10 minutes to pull the image
  async: 600
  poll: 0
  register: image_prepull

Comment 6 Vadim Rutkovsky 2018-10-22 09:03:12 UTC
Correct, all other images except the node one are being prepulled.
Created master PR https://github.com/openshift/openshift-ansible/pull/10464

Comment 7 Vadim Rutkovsky 2018-10-29 06:38:30 UTC
Fix is available in  openshift-ansible-3.11.34-1

Comment 8 Johnny Liu 2018-10-29 10:42:38 UTC
Re-test this bug with openshift-ansible-3.11.34-1.git.0.499a30b.el7.noarch.

The PR seem like take effect, but unfortunately "Check status of node image pre-pull" fail to pre-pull the ose-node.

TASK [openshift_node : Check status of node image pre-pull] ********************
Monday 29 October 2018  15:39:31 +0800 (0:00:00.117)       0:06:22.707 ******** 
changed: [host-8-251-110.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "274419478459.15172", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.258525", "end": "2018-10-29 03:35:49.786125", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:04.527600", "stderr": "W1029 03:34:04.559334   15179 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:04.559334   15179 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []}
changed: [host-8-250-70.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "271535462819.15098", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.420395", "end": "2018-10-29 03:35:50.226819", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:04.806424", "stderr": "W1029 03:34:04.839431   15105 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:50-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:04.839431   15105 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:50-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []}
changed: [host-8-244-143.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "454904875218.15574", "attempts": 1, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:01:45.260942", "end": "2018-10-29 03:35:49.134957", "failed_when_result": false, "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2018-10-29 03:34:03.874015", "stderr": "W1029 03:34:03.905513   15581 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\"", "stderr_lines": ["W1029 03:34:03.905513   15581 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "time=\"2018-10-29T03:35:49-04:00\" level=fatal msg=\"pulling image failed: rpc error: code = Unavailable desc = transport is closing\""], "stdout": "", "stdout_lines": []}


I logged into the node, run "crictl pull registry.access.stage.redhat.com/openshift3/ose-node:v3.11" manually, and succeed.

[root@qe-jialiu-node-infra-1 ~]# crictl images
W1029 06:21:05.577647   30848 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                 TAG                 IMAGE ID            SIZE
registry.access.stage.redhat.com/openshift3/ose-pod   v3.11               751ff215ba5da       244MB
[root@qe-jialiu-node-infra-1 ~]# crictl pull registry.access.stage.redhat.com/openshift3/ose-node:v3.11
W1029 06:21:08.744570   30856 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5
[root@qe-jialiu-node-infra-1 ~]# crictl images
W1029 06:23:23.738749   30895 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
IMAGE                                                  TAG                 IMAGE ID            SIZE
registry.access.stage.redhat.com/openshift3/ose-node   v3.11               074bf04571e22       1.17GB
registry.access.stage.redhat.com/openshift3/ose-pod    v3.11               751ff215ba5da       244MB


Do you know why "rpc error" is thrown out for the ansible task.

Comment 9 Johnny Liu 2018-10-29 10:43:18 UTC
Created attachment 1498474 [details]
installation log with inventory file embedded

Comment 10 Vadim Rutkovsky 2018-10-29 16:10:36 UTC
Sounds like we should add more retries there

Comment 11 Johnny Liu 2018-10-30 03:11:21 UTC
@Vadim, When you are trying to add more retries, could you help also add one more task to pre-pull "ose-control-plane" image together?

When I was testing this bug, I hit "Wait for control plane pods to appear" timeout issue for several times, because of api static pod is not get running yet. I think to pre-pull "ose-control-plane" image would improve such issue.

Comment 12 Vadim Rutkovsky 2018-10-30 09:20:31 UTC
Sure thing, added retries to pulls for every image we prepull.
Master PR - https://github.com/openshift/openshift-ansible/pull/10551

Comment 13 Antonio Murdaca 2018-10-30 12:42:36 UTC
The error suggests the CRI-O daemon went down at some point: "transport is closing", is there any way to grab logs from the system? could be an OOM or a panic in cri-o itself from what I can tell

Comment 14 Antonio Murdaca 2018-10-30 12:44:54 UTC
Overall I agree that we should retry some time btw

Comment 15 Antonio Murdaca 2018-10-30 13:14:03 UTC
Johnny, any chance to grab system logs when that happens?

Comment 16 Johnny Liu 2018-10-31 03:34:41 UTC
@Antonio, good point.

I reproduced the issue again, and follow your instruction to do some check.
[root@qe-jialiu311-node-1 ~]# journalctl -f -u crio
-- Logs begin at Tue 2018-10-30 22:31:52 EDT. --
Oct 30 22:40:57 qe-jialiu311-node-1 systemd[1]: Starting Open Container Initiative Daemon...
Oct 30 22:40:57 qe-jialiu311-node-1 systemd[1]: Started Open Container Initiative Daemon.
Oct 30 22:42:18 qe-jialiu311-node-1 systemd[1]: Stopping Open Container Initiative Daemon...
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service stop-sigterm timed out. Killing.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service: main process exited, code=killed, status=9/KILL
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Unit crio.service entered failed state.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: crio.service failed.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Starting Open Container Initiative Daemon...
Oct 30 22:43:48 qe-jialiu311-node-1 systemd[1]: Started Open Container Initiative Daemon.


So seem like at "22:43:48", crio service seem like be restarted.

Grap /var/log/messages, get the following message at that moment:
Oct 30 22:42:17 qe-jialiu311-node-1 ansible-os_firewall_manage_iptables: Invoked with protocol=tcp name=Prometheus monitoring chain=OS_FIREWALL_ALLOW create_jump_rule=True action=add ip_version=ipv4 jump_rule_chain=INPUT port=9000-10000
Oct 30 22:42:18 qe-jialiu311-node-1 ansible-systemd: Invoked with no_block=False force=None name=cri-o enabled=True daemon_reload=False state=restarted user=False masked=None
Oct 30 22:42:18 qe-jialiu311-node-1 systemd: Stopping Open Container Initiative Daemon...
Oct 30 22:42:18 qe-jialiu311-node-1 ansible-async_wrapper.py: 15979 still running (590)
<--snip-->
<--snip-->
<--snip-->
Oct 30 22:43:46 qe-jialiu311-node-1 ansible-async_wrapper.py: 15900 still running (500)
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service stop-sigterm timed out. Killing.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service: main process exited, code=killed, status=9/KILL
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: Unit crio.service entered failed state.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: crio.service failed.
Oct 30 22:43:48 qe-jialiu311-node-1 systemd: Starting Open Container Initiative Daemon...

Go through openshift-ansible installer log:
TASK [openshift_node : Pre-pull node image] ************************************
Wednesday 31 October 2018  10:42:02 +0800 (0:00:00.312)       0:00:58.300 ***** 

changed: [host-8-243-212.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "410992443836.16394", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/410992443836.16394", "started": 1}
changed: [host-8-249-227.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "647625704086.15896", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/647625704086.15896", "started": 1}
changed: [host-8-249-183.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "359232487803.15905", "changed": true, "finished": 0, "results_file": "/root/.ansible_async/359232487803.15905", "started": 1}

<--snip-->
RUNNING HANDLER [openshift_node : restart NetworkManager] **********************
Wednesday 31 October 2018  10:42:16 +0800 (0:00:00.373)       0:01:12.106 ***** 
<--snip-->
RUNNING HANDLER [openshift_node : restart dnsmasq] *****************************
Wednesday 31 October 2018  10:42:17 +0800 (0:00:00.414)       0:01:12.521 ***** 
<--snip-->
TASK [openshift_node : Restart cri-o] ******************************************
Wednesday 31 October 2018  10:42:19 +0800 (0:00:00.130)       0:01:15.038 ***** 
<--snip-->
TASK [openshift_node : Restart journald] ***************************************
Wednesday 31 October 2018  10:47:13 +0800 (0:00:02.357)       0:06:09.155 ***** 
<--snip-->
TASK [openshift_node : Check status of node image pre-pull] ********************
Wednesday 31 October 2018  10:47:16 +0800 (0:00:00.127)       0:06:11.580 ***** 

Since "Pre-pull node image" async task is triggered, installer did all kinds of service restart, even crio service, so "transport is closing" totally make scenes. 


@Antonio @Vadim, now I do not think "retry some time" could improve this better.  
pre-pull image should happen behind crio service is restarted. WDYT.

Comment 17 Vadim Rutkovsky 2018-11-05 10:38:46 UTC
We can't add 'retry' in ansible scripts, as this task is running async and the result is not known until its checked. Please file a new issue for CRI-O service being killed.

Moving this to ON_QA as 3.11 cherrypick was merged

Comment 18 Johnny Liu 2018-11-07 04:00:19 UTC
Based on comment 8, move this bug to "VERIFIED". About issue described in  comment 16, will file new bug to track it.

Comment 19 Johnny Liu 2018-11-07 10:06:05 UTC
Also run some other regression testing for pull/10436 and pull/10464/files against both docker and crio runtime behind proxy with openshift-ansible-3.11.39-1.git.0.fe42b3b.el7.noarch, cluster is set up successfully, they would not break install behind proxy.

Comment 21 errata-xmlrpc 2019-01-10 09:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024