Bug 1567651

Summary: unable to install crio garbage collector
Product: OpenShift Container Platform Reporter: raffaele spazzoli <rspazzol>
Component: Installer Assignee: Vadim Rutkovsky <vrutkovs>
Installer sub component: openshift-installer QA Contact: Johnny Liu <jialiu>
Status: CLOSED NOTABUG Docs Contact:
Severity: unspecified    
Priority: unspecified CC: aos-bugs, eminguez, jokerman, mmccomas, rspazzol, smilner, wmeng
Version: 3.9.0 Keywords: Reopened
Target Milestone: --- Flags: rspazzol: needinfo-
Target Release: 3.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-28 09:24:49 UTC Type: Bug

Description raffaele spazzoli 2018-04-15 18:57:58 UTC
Description of problem:
When installing the CRI-O garbage collector (openshift_crio_enable_docker_gc: true), the installer failed with the following error:

TASK [openshift_master : Ensure that docker-gc daemonset has nodes to run on] ***************************************************************************************************************
task path: /home/rspazzol/git/casl-ansible/galaxy/openshift-ansible/roles/openshift_master/tasks/ensure_nodes_matching_selector.yml:10
Sunday 15 April 2018  14:49:31 -0400 (0:00:00.701)       0:23:03.594 ********** 
fatal: [env1-master-rp77]: FAILED! => {
    "assertion": false, 
    "changed": false, 
    "evaluated_to": false, 
    "msg": "No schedulable nodes found matching node selector for docker-gc daemonset - 'runtime=cri-o'"
}
	to retry, use: --limit @/home/rspazzol/git/casl-ansible/playbooks/openshift/end-to-end.retry


Based on the only documentation available on CRI-O:

https://docs.openshift.com/container-platform/3.9/release_notes/ocp_3_9_release_notes.html#ocp-39-crio

when the "runtime=cri-o" label is not specified, the garbage collector should be installed on every node.
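
The failing assertion can be illustrated with a minimal Python sketch. This is not the installer's code (the real check is an Ansible task in ensure_nodes_matching_selector.yml); the Node class, the matching_schedulable_nodes helper, and the node names are hypothetical. It assumes the usual Kubernetes semantics: a node matches a selector when every key/value pair in the selector appears in the node's labels, so an empty selector matches every node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    schedulable: bool = True


def matching_schedulable_nodes(nodes: List[Node], selector: Dict[str, str]) -> List[Node]:
    """Return the schedulable nodes whose labels contain every selector pair."""
    return [
        n for n in nodes
        if n.schedulable and all(n.labels.get(k) == v for k, v in selector.items())
    ]


# Hypothetical cluster: no node carries the runtime=cri-o label yet.
nodes = [Node("master-1", schedulable=False), Node("node-1"), Node("node-2")]
selector = {"runtime": "cri-o"}

# Empty result: this is the situation in which the assertion above fails.
unlabelled = matching_schedulable_nodes(nodes, selector)

# After labelling one node, the selector has somewhere to schedule the daemonset.
nodes[1].labels["runtime"] = "cri-o"
labelled = matching_schedulable_nodes(nodes, selector)

# An empty selector matches every schedulable node.
everywhere = matching_schedulable_nodes(nodes, {})
```

Under these assumptions, a default selector of 'runtime=cri-o' combined with unlabelled nodes always yields an empty match, which is consistent with the failure reported here.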






Comment 1 Scott Dodson 2018-04-16 13:49:08 UTC
I think we should make the default node selector for this "" and remove the check for nodes that match the selector as long as it's safe to run docker-gc on all hosts.

https://github.com/openshift/openshift-ansible/pull/7310 is where we added it and the docs referenced above definitely state we'll run it everywhere.

Comment 2 Vadim Rutkovsky 2018-04-18 09:29:09 UTC
It is not clear whether docker GC is safe to run on all nodes; waiting for Stephen to clarify that.

Comment 3 Steve Milner 2018-04-18 14:55:05 UTC
The daemon set runs an oc command. Adding sjenning to verify whether running the command on all nodes would be safe from the OpenShift (oc) point of view.

Comment 4 Seth Jennings 2018-04-18 17:03:05 UTC
I would not run it on nodes that are using docker as the runtime for OpenShift. The kubelet manages the GC in that case, and the two may not play nicely together.

Comment 5 Steve Milner 2018-04-18 18:53:04 UTC
Thanks Seth,

I defer to his statement. I believe if people want to use dockergc on nodes they should provide the label on said nodes, or install with the label in the inventory.

Comment 6 Vadim Rutkovsky 2018-04-24 13:23:13 UTC
Closing as NOTABUG - docker-gc should run on cri-o nodes only, so the label should be set for these nodes

Comment 7 raffaele spazzoli 2018-04-24 14:07:49 UTC
I disagree with closing this bug.
This inventory configuration should work:
openshift_use_crio: true
openshift_crio_enable_docker_gc: true

but it will fail.

Based on the doc, the cri-o label should be needed only when some nodes are CRI-O enabled and some aren't. If all the nodes are CRI-O enabled, the label should not be necessary.

By the way, the cri-o label is redundant: the installer can deduce which nodes are CRI-O enabled by looking at the openshift_use_crio: true host variable.

So, to simplify the installer, the requirement for the cri-o label should be removed. The installer should be able to see which nodes are CRI-O enabled and then label those nodes accordingly (since the label is needed to deploy the daemonset).

Comment 8 Steve Milner 2018-04-24 16:49:50 UTC
Raffaele,

I agree that this can be simplified. The implementation for this part of the installer code went the long way around based on requirements which changed a few times.

As it stands, the install will work as long as the instructions are followed so I don't see this as something which needs to be done right now ... but I do agree that it should be simplified in the ways you've touched on.

Comment 9 raffaele spazzoli 2018-04-24 18:08:22 UTC
Steve,

the only instructions that I know of are in the release notes:

"When CRI-O use is enabled, it is installed alongside docker, which currently is required to perform build and push operations to the registry. Over time, temporary docker builds can accumulate on nodes. You can optionally set the following to enable garbage collection, which adds a daemonset to clean out the builds:

openshift_crio_enable_docker_gc=true

When enabled, it will run garbage collection on all nodes by default. You can also limit the running of the daemonset on specific nodes by setting the following:

openshift_crio_docker_gc_node_selector={'runtime': 'cri-o'}

For example, the above would ensure it is only run on nodes with the runtime: cri-o label. This can be helpful if you are running CRI-O only on some nodes, and others are only running docker."

I have followed the instructions and the installer fails. As you can see, the instructions clearly say that you need to set the cri-o label only if you want to restrict the daemonset to specific nodes; it should not be needed if CRI-O is enabled on all the nodes ("When enabled, it will run garbage collection on all nodes by default."), which is what I was doing.

So I think that either the doc or the installer needs to be fixed.

Comment 10 Steve Milner 2018-04-24 19:59:58 UTC
If that's the case then you're correct that something needs to change. For now let's get a doc update noting that you must set the label on the cri-o nodes unless Vadim disagrees.

Comment 11 Vadim Rutkovsky 2018-05-02 09:46:02 UTC
(In reply to raffaele spazzoli from comment #7)
> but it will fail.

That's intended. The user should carefully read the release notes and documentation and label the nodes that have only cri-o installed and no docker.

> 
> By the way, the cri-o label is redundant: the installer can deduce which
> nodes are CRI-O enabled by looking at the openshift_use_crio: true host
> variable.
> So, to simplify the installer, the requirement for the cri-o label
> should be removed.

There are three combinations here:

1) docker only
2) crio + docker
3) crio only

The installer only installs missing packages, so if a node already has docker installed, the installer won't remove it. As a result, enabling docker GC in combinations 1) and 2) would be dangerous. Note that the installer can't remove an existing docker (it may have been installed on the node previously).

The only solution I see is for the user to manually label the nodes where it's safe to enable GC and then re-run the installer. The "openshift_crio_docker_gc_node_selector" variable and the "Ensure that docker-gc daemonset has nodes to run on" check protect nodes from unwanted GC runs.
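
The rule described in this comment can be summarised as a small Python sketch. The nodes_to_label function, the host names, and the per-host facts are hypothetical; the rule itself (label only the hosts in the "crio only" combination) is taken from the comment above, not from the installer source. The generated strings use the standard oc label node syntax.

```python
from typing import Dict, List, Tuple


def nodes_to_label(hosts: Dict[str, Tuple[bool, bool]]) -> List[str]:
    """hosts maps a host name to (uses_crio, has_docker).

    Returns the sorted names of hosts in the 'crio only' combination,
    the only one this comment treats as safe for docker GC.
    """
    return sorted(
        name
        for name, (uses_crio, has_docker) in hosts.items()
        if uses_crio and not has_docker
    )


# Hypothetical hosts, one per combination listed above.
hosts = {
    "node-a": (False, True),  # 1) docker only
    "node-b": (True, True),   # 2) crio + docker
    "node-c": (True, False),  # 3) crio only
}

safe = nodes_to_label(hosts)

# Commands a user could run by hand before re-running the installer.
commands = [f"oc label node {name} runtime=cri-o" for name in safe]
```

The sketch takes whether docker is present as an input instead of deriving it, since the comment states that the installer cannot tell a docker it installed from one that was already on the node.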

Raffaele, do you agree with closing this bug?