Bug 1647288
Summary: | Images prepull is broken due to CRIO service is restart during the prepulling process | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Johnny Liu <jialiu> |
Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.11.0 | CC: | amurdaca, aos-bugs, jialiu, jokerman, mmccomas, pkanthal, vrutkovs, wmeng |
Target Milestone: | --- | ||
Target Release: | 3.11.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: crio was restarted in some cases before verifying the image prepull has finished
Consequence: images were not prepulled
Fix: crio is restarted before image prepull begins
Result: install succeeds
|
Story Points: | --- |
Clone Of: | 1639201 | Environment: | |
Last Closed: | 2019-01-10 09:04:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1639201 | ||
Bug Blocks: | 1639623 |
Description
Johnny Liu
2018-11-07 04:04:19 UTC
>Since "Pre-pull node image" async task is triggered, installer did all kinds of service restart, even crio service, so "transport is closing" totally make scenes.
Do you have inventory + playbook logs? This doesn't seem to be happening on upstream CI at least
The attachment in original bug (BZ#1647660) has inventory + playbook logs. Pls refer to https://bugzilla.redhat.com/attachment.cgi?id=1498474 It doesn't seem to be ansible playbooks stopping crio service. The last mention of crio.service is in `Start the CRI-O service` task. Could you collect `systemctl status crio` and `journalctl -u crio` from the affected nodes? Interesting, I can now reproduce this on CI: ``` fatal: [ci-op-kng215iz-6d825-ig-m-h99f]: FAILED! => {"changed": false, "msg": "Error connecting: Error while fetching server API version: ('Connection aborted.', error(2, 'No such file or directory'))"} ``` Right, we _were_ restarting crio between the start of the pull and checking its status. So if it took a bit longer then it would fail. This race condition is fixed in https://github.com/openshift/openshift-ansible/pull/10641 3.11 cherrypick - https://github.com/openshift/openshift-ansible/pull/10642 (In reply to Vadim Rutkovsky from comment #3) > It doesn't seem to be ansible playbooks stopping crio service. The last > mention of crio.service is in `Start the CRI-O service` task. > > Could you collect `systemctl status crio` and `journalctl -u crio` from the > affected nodes? Go trough install log in https://bugzilla.redhat.com/attachment.cgi?id=1498474, In line - 2762, "Pre-pull node image" task is triggered. In line - 2951, "Restart cri-o" task is exected; In line - 3065, "Check status of node image pre-pull" task's output is saying the pre-pull failed. Obviously, installer were restarting crio between the start of the pull and checking its status. Glad to see your already reproduce it in comment 5. BTW, could you also cherrypick the fix for 3.10? (In reply to Johnny Liu from comment #8) > BTW, could you also cherrypick the fix for 3.10? Yep, 3.10 cherrypick would be created once 3.11 PR is merged I've found out that CRI-O 1.11.x onwards has also a bug (flake) when pulling images that cause a panic in CRI-O, causing itself to restart leaving bad state behind. Maybe it's also related, the PR for fixing that is in flight here https://github.com/containers/storage/pull/234 (and will need cri-o revendor and a new cri-o release) Fix is available in openshift-ansible-3.11.44-1 Verified this bug with openshift-ansible-3.11.44-1.git.0.11d174e.el7.noarch, and PASS. Playbook run the tasks in the following order: TASK [openshift_node : Restart cri-o] ****************************************** Thursday 15 November 2018 11:21:01 +0800 (0:00:00.148) 0:01:09.222 ***** TASK [openshift_node : restart NetworkManager to ensure resolv.conf is present] *** Thursday 15 November 2018 11:21:01 +0800 (0:00:00.584) 0:01:09.807 ***** TASK [openshift_node : Check that node image is present] *********************** Thursday 15 November 2018 11:21:02 +0800 (0:00:00.374) 0:01:10.181 ***** TASK [openshift_node : Pre-pull node image] ************************************ Thursday 15 November 2018 11:21:02 +0800 (0:00:00.368) 0:01:10.550 ***** TASK [openshift_node : Check status of node image pre-pull] ******************** Thursday 15 November 2018 11:21:47 +0800 (0:00:00.134) 0:01:55.161 ***** FAILED - RETRYING: Check status of node image pre-pull (20 retries left). FAILED - RETRYING: Check status of node image pre-pull (20 retries left). FAILED - RETRYING: Check status of node image pre-pull (20 retries left). FAILED - RETRYING: Check status of node image pre-pull (19 retries left). FAILED - RETRYING: Check status of node image pre-pull (19 retries left). FAILED - RETRYING: Check status of node image pre-pull (19 retries left). FAILED - RETRYING: Check status of node image pre-pull (18 retries left). FAILED - RETRYING: Check status of node image pre-pull (18 retries left). FAILED - RETRYING: Check status of node image pre-pull (18 retries left). changed: [host-8-248-78.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "392270439751.19485", "attempts": 4, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:02:07.614401", "end": "2018-11-14 22:23:10.833324", "failed_when_result": false, "finished": 1, "rc": 0, "start": "2018-11-14 22:21:03.218923", "stderr": "W1114 22:21:03.249200 19492 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "stderr_lines": ["W1114 22:21:03.249200 19492 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\"."], "stdout": "Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5", "stdout_lines": ["Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5"]} FAILED - RETRYING: Check status of node image pre-pull (17 retries left). changed: [host-8-241-102.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "464770149400.19100", "attempts": 4, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:02:10.633968", "end": "2018-11-14 22:23:14.309121", "failed_when_result": false, "finished": 1, "rc": 0, "start": "2018-11-14 22:21:03.675153", "stderr": "W1114 22:21:03.711652 19107 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "stderr_lines": ["W1114 22:21:03.711652 19107 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\"."], "stdout": "Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5", "stdout_lines": ["Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5"]} changed: [host-8-241-190.host.centralci.eng.rdu2.redhat.com] => {"ansible_job_id": "41997407307.19014", "attempts": 5, "changed": true, "cmd": ["crictl", "pull", "registry.access.stage.redhat.com/openshift3/ose-node:v3.11"], "delta": "0:02:19.185821", "end": "2018-11-14 22:23:22.820222", "failed_when_result": false, "finished": 1, "rc": 0, "start": "2018-11-14 22:21:03.634401", "stderr": "W1114 22:21:03.666832 19021 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", "stderr_lines": ["W1114 22:21:03.666832 19021 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\"."], "stdout": "Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5", "stdout_lines": ["Image is up to date for registry.access.stage.redhat.com/openshift3/ose-node@sha256:3da731d733cd4d67897d22bfdcb027b009494de667bd7a3c870557102ce10bf5"]} Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024 |