Bug 1657036 - Trying to enable cri-o, install fails trying to communicate to docker
Summary: Trying to enable cri-o, install fails trying to communicate to docker
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-06 21:52 UTC by David Critch
Modified: 2018-12-11 15:10 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-11 15:10:33 UTC
Target Upstream Version:
Embargoed:



Description David Critch 2018-12-06 21:52:21 UTC
Description of problem:
I've been using cri-o as the OCP runtime for a few releases now. In the last week or so, attempting to rebuild with cri-o has started to fail: the nodes fail to start because the kubelet cannot talk to a docker daemon.

Version-Release number of selected component (if applicable):
OCP: atomic-openshift-node-3.11.43-1.git.0.647ac05.el7.x86_64
openshift-ansible: commit 8ce8a45542ed29f0b325417a9aab1b673f33c2e1 (HEAD -> release-3.11, tag: openshift-ansible-3.11.52-1, origin/release-3.11)


How reproducible:
Always

Steps to Reproduce:
1. Configure inventory file for cri-o:
    openshift_use_crio=True
    openshift_use_crio_only=True
    openshift_crio_enable_docker_gc=True
    openshift_crio_docker_gc_node_selector={'runtime': 'cri-o'}
    # add runtime="cri-o" to node labels
2. Run openshift-ansible/playbooks/deploy_cluster.yml

Actual results:
- Install fails with the following message:
Failure summary:


  1. Hosts:    dc-ocp-m0.cloud.lab.eng.bos.redhat.com
     Play:     Approve any pending CSR requests from inventory nodes
     Task:     Approve node certificates when bootstrapping
     Message:  Could not find csr for nodes: dc-ocp-n0.cloud.lab.eng.bos.redhat.com, dc-ocp-m1.cloud.lab.eng.bos.redhat.com, dc-ocp-n4.cloud.lab.eng.bos.redhat.com, dc-ocp-n3.cloud.lab.eng.bos.redhat.com, dc-ocp-n2.cloud.lab.eng.bos.redhat.com, dc-ocp-n1.cloud.lab.eng.bos.redhat.com, dc-ocp-m2.cloud.lab.eng.bos.redhat.com

There are other BZs that mention this error along with hostname vs. hostname -f differences, but that doesn't seem to be the case here. The actual error, from the node:
Dec 06 21:43:01 dc-ocp-n0.cloud.lab.eng.bos.redhat.com atomic-openshift-node[55550]: E1206 21:43:01.918501   55550 kube_docker_client.go:91] failed to retrieve docker version: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Dec 06 21:43:01 dc-ocp-n0.cloud.lab.eng.bos.redhat.com atomic-openshift-node[55550]: W1206 21:43:01.918539   55550 kube_docker_client.go:92] Using empty version for docker client, this may sometimes cause compatibility issue.
Dec 06 21:43:01 dc-ocp-n0.cloud.lab.eng.bos.redhat.com atomic-openshift-node[55550]: F1206 21:43:01.918872   55550 server.go:262] failed to run Kubelet: failed to create kubelet: failed to get docker version: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?


Output of oc get nodes (master0 reports docker, the rest report cri-o):

# oc get nodes -o wide
NAME                                     STATUS     ROLES                    AGE       VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    KERNEL-VERSION              CONTAINER-RUNTIME
dc-ocp-m0.cloud.lab.eng.bos.redhat.com   Ready      master                   26m       v1.11.0+d4cacc0   10.19.138.166   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   docker://1.13.1
dc-ocp-m1.cloud.lab.eng.bos.redhat.com   NotReady   master                   26m       v1.11.0+d4cacc0   10.19.138.167   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-m2.cloud.lab.eng.bos.redhat.com   NotReady   master                   26m       v1.11.0+d4cacc0   10.19.138.168   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n0.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   21m       v1.11.0+d4cacc0   10.19.138.161   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n1.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   21m       v1.11.0+d4cacc0   10.19.138.162   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n2.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   21m       v1.11.0+d4cacc0   10.19.138.163   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n3.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   21m       v1.11.0+d4cacc0   10.19.138.164   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n4.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   21m       v1.11.0+d4cacc0   10.19.138.165   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8

Since master0 is reporting its runtime as docker, OCP is trying to start the master pods with docker:
# oc get pods -n kube-system -o wide
NAME                                                        READY     STATUS             RESTARTS   AGE       IP              NODE                                     NOMINATED NODE
master-api-dc-ocp-m0.cloud.lab.eng.bos.redhat.com           0/1       CrashLoopBackOff   9          27m       10.19.138.166   dc-ocp-m0.cloud.lab.eng.bos.redhat.com   <none>
master-api-dc-ocp-m1.cloud.lab.eng.bos.redhat.com           1/1       Running            0          27m       10.19.138.167   dc-ocp-m1.cloud.lab.eng.bos.redhat.com   <none>
master-api-dc-ocp-m2.cloud.lab.eng.bos.redhat.com           1/1       Running            0          28m       10.19.138.168   dc-ocp-m2.cloud.lab.eng.bos.redhat.com   <none>
master-controllers-dc-ocp-m0.cloud.lab.eng.bos.redhat.com   0/1       CrashLoopBackOff   9          28m       10.19.138.166   dc-ocp-m0.cloud.lab.eng.bos.redhat.com   <none>
master-controllers-dc-ocp-m1.cloud.lab.eng.bos.redhat.com   1/1       Running            0          27m       10.19.138.167   dc-ocp-m1.cloud.lab.eng.bos.redhat.com   <none>
master-controllers-dc-ocp-m2.cloud.lab.eng.bos.redhat.com   1/1       Running            0          27m       10.19.138.168   dc-ocp-m2.cloud.lab.eng.bos.redhat.com   <none>
master-etcd-dc-ocp-m0.cloud.lab.eng.bos.redhat.com          0/1       CrashLoopBackOff   9          27m       10.19.138.166   dc-ocp-m0.cloud.lab.eng.bos.redhat.com   <none>

But they are already running under crio:
# crictl ps
W1206 21:49:35.826966   14179 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
CONTAINER ID        IMAGE                                                              CREATED             STATE               NAME                ATTEMPT
6f7cbb0b5a1e0       901c817d48ccadd98b0bcd9f9d3f16738c8dbaee0e0a6d5fb85217a616493d4a   26 minutes ago      Running             sync                0
452da364d0521       e043f4037c7ff202ac1ae302bb4990d1f398f3a80f22ab02e3a13b389499f963   29 minutes ago      Running             api                 0
ff9b3adc42728       e043f4037c7ff202ac1ae302bb4990d1f398f3a80f22ab02e3a13b389499f963   29 minutes ago      Running             controllers         0
50f0e74ff4ba4       635bb36d7fc7b0199d318dcb4fde1aaadf5654b9ad4f9a4a3a1c5fe94c23339f   29 minutes ago      Running             etcd                0

So they keep crashing:

# oc logs master-api-dc-ocp-m0.cloud.lab.eng.bos.redhat.com -n kube-system | tail
I1206 21:48:41.154781       1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I1206 21:48:41.154794       1 plugins.go:84] Registered admission plugin "ResourceQuota"
I1206 21:48:41.154807       1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I1206 21:48:41.154819       1 plugins.go:84] Registered admission plugin "Priority"
I1206 21:48:41.154842       1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I1206 21:48:41.154859       1 plugins.go:84] Registered admission plugin "ServiceAccount"
I1206 21:48:41.154871       1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I1206 21:48:41.154885       1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I1206 21:48:41.154896       1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
F1206 21:48:41.155462       1 start_api.go:68] failed to create listener: failed to listen on 0.0.0.0:8443: listen tcp4 0.0.0.0:8443: bind: address already in use



Expected results:
- OCP/the kubelet should not be attempting to talk to docker in a cri-o environment.

- All nodes should be using the cri-o runtime.

Comment 1 Johnny Liu 2018-12-07 03:27:54 UTC
I think this is a side effect of fixing https://bugzilla.redhat.com/show_bug.cgi?id=1647516. QE also opened a doc bug - https://bugzilla.redhat.com/show_bug.cgi?id=1656359 - to request a doc update.

Comment 2 Ryan Howe 2018-12-07 17:03:17 UTC
More information might be needed, but here is a guess.

1. On an OpenShift install using cri-o, docker still gets installed (and should be). Docker is not used as the runtime; it is only used if container builds need to happen on that node.

2. If oc get node -o wide shows the runtime as docker, the node is likely not using a node-config that has cri-o configured as the runtime:
 https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node_group/templates/node-config.yaml.j2#L22-L31

3. Nodes now get their node-config.yaml from a configmap based on the node group they belong to. A label on the node sets the group.

  # oc get nodes --show-labels
  To see the config a node will use, run the following:
  # oc get cm -n openshift-node <GROUP_NAME> -o yaml


More than likely, the first master belongs to the wrong node group and is using a node-config that does not have cri-o set as the runtime, so it falls back to docker, which is installed.
 

dc-ocp-m0.cloud.lab.eng.bos.redhat.com   Ready      master                   26m       v1.11.0+d4cacc0   10.19.138.166   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   docker://1.13.1


Next steps:
- Confirm the node group that this host belongs to.
- Make sure that node group has its config set to use cri-o. If not, change the group this node belongs to by changing the label.
- Locally on the node, check both configs, /etc/origin/node/{bootstrap-,}node-config.yaml, and make sure they have cri-o configured (see the sketch below).
    - /etc/origin/node/node-config.yaml will be replaced by the node-sync pod based on the configmap linked to the node group this host belongs to.
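A minimal sketch of that local check (paths are from this environment; the expected keys match the cri-o bootstrap config shown in comment 3):

# grep -A1 'container-runtime' /etc/origin/node/bootstrap-node-config.yaml /etc/origin/node/node-config.yaml

With cri-o configured, both files should list under kubeletArguments:

  container-runtime:
  - remote
  container-runtime-endpoint:
  - /var/run/crio/crio.sock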

Comment 3 David Critch 2018-12-07 20:24:21 UTC
Here's some more output....

# oc get nodes --show-labels
NAME                                     STATUS     ROLES                    AGE       VERSION           LABELS
dc-ocp-m0.cloud.lab.eng.bos.redhat.com   Ready      master                   39m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-m0.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/master=true
dc-ocp-m1.cloud.lab.eng.bos.redhat.com   NotReady   master                   39m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-m1.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/master=true
dc-ocp-m2.cloud.lab.eng.bos.redhat.com   NotReady   master                   39m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-m2.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/master=true
dc-ocp-n0.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   34m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-n0.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/kubevirt=true
dc-ocp-n1.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   34m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-n1.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/kubevirt=true
dc-ocp-n2.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   34m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-n2.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/kubevirt=true
dc-ocp-n3.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   34m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-n3.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/kubevirt=true
dc-ocp-n4.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   34m       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dc-ocp-n4.cloud.lab.eng.bos.redhat.com,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/kubevirt=true


# oc get cm -n openshift-node node-config-master -o yaml
apiVersion: v1
data:
  node-config.yaml: |
    apiVersion: v1
    authConfig:
      authenticationCacheSize: 1000
      authenticationCacheTTL: 5m
      authorizationCacheSize: 1000
      authorizationCacheTTL: 5m
    dnsBindAddress: 127.0.0.1:53
    dnsDomain: cluster.local
    dnsIP: 0.0.0.0
    dnsNameservers: null
    dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
    dockerConfig:
      dockerShimRootDirectory: /var/lib/dockershim
      dockerShimSocket: /var/run/dockershim.sock
      execHandlerName: native
    enableUnidling: true
    imageConfig:
      format: registry.redhat.io/openshift3/ose-${component}:${version}
      latest: false
    iptablesSyncPeriod: 30s
    kind: NodeConfig
    kubeletArguments:
      bootstrap-kubeconfig:
      - /etc/origin/node/bootstrap.kubeconfig
      cert-dir:
      - /etc/origin/node/certificates
      enable-controller-attach-detach:
      - 'true'
      feature-gates:
      - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
      node-labels:
      - node-role.kubernetes.io/master=true
      pod-manifest-path:
      - /etc/origin/node/pods
      rotate-certificates:
      - 'true'
    masterClientConnectionOverrides:
      acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
      burst: 40
      contentType: application/vnd.kubernetes.protobuf
      qps: 20
    masterKubeConfig: node.kubeconfig
    networkConfig:
      mtu: 1450
      networkPluginName: redhat/openshift-ovs-subnet
    proxyArguments:
      cluster-cidr:
      - 10.128.0.0/14
    servingInfo:
      bindAddress: 0.0.0.0:10250
      bindNetwork: tcp4
      clientCA: client-ca.crt
    volumeConfig:
      localQuota:
        perFSGroup: null
    volumeDirectory: /var/lib/origin/openshift.local.volumes
kind: ConfigMap
metadata:
  creationTimestamp: 2018-12-07T19:41:11Z
  name: node-config-master
  namespace: openshift-node
  resourceVersion: "1249"
  selfLink: /api/v1/namespaces/openshift-node/configmaps/node-config-master
  uid: 0d064984-fa58-11e8-ac5d-beeffeed0062


# oc get cm -n openshift-node node-config-infra -o yaml
apiVersion: v1
data:
  node-config.yaml: |
    apiVersion: v1
    authConfig:
      authenticationCacheSize: 1000
      authenticationCacheTTL: 5m
      authorizationCacheSize: 1000
      authorizationCacheTTL: 5m
    dnsBindAddress: 127.0.0.1:53
    dnsDomain: cluster.local
    dnsIP: 0.0.0.0
    dnsNameservers: null
    dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
    dockerConfig:
      dockerShimRootDirectory: /var/lib/dockershim
      dockerShimSocket: /var/run/dockershim.sock
      execHandlerName: native
    enableUnidling: true
    imageConfig:
      format: registry.redhat.io/openshift3/ose-${component}:${version}
      latest: false
    iptablesSyncPeriod: 30s
    kind: NodeConfig
    kubeletArguments:
      bootstrap-kubeconfig:
      - /etc/origin/node/bootstrap.kubeconfig
      cert-dir:
      - /etc/origin/node/certificates
      enable-controller-attach-detach:
      - 'true'
      feature-gates:
      - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
      node-labels:
      - node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true
      pod-manifest-path:
      - /etc/origin/node/pods
      rotate-certificates:
      - 'true'
    masterClientConnectionOverrides:
      acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
      burst: 40
      contentType: application/vnd.kubernetes.protobuf
      qps: 20
    masterKubeConfig: node.kubeconfig
    networkConfig:
      mtu: 1450
      networkPluginName: redhat/openshift-ovs-subnet
    proxyArguments:
      cluster-cidr:
      - 10.128.0.0/14
    servingInfo:
      bindAddress: 0.0.0.0:10250
      bindNetwork: tcp4
      clientCA: client-ca.crt
    volumeConfig:
      localQuota:
        perFSGroup: null
    volumeDirectory: /var/lib/origin/openshift.local.volumes
kind: ConfigMap
metadata:
  creationTimestamp: 2018-12-07T19:41:16Z
  name: node-config-infra
  namespace: openshift-node
  resourceVersion: "1259"
  selfLink: /api/v1/namespaces/openshift-node/configmaps/node-config-infra
  uid: 0fe61bd6-fa58-11e8-ac5d-beeffeed0062

# oc get cm -n openshift-node node-config-infra-compute -o yaml
apiVersion: v1
data:
  node-config.yaml: |
    apiVersion: v1
    authConfig:
      authenticationCacheSize: 1000
      authenticationCacheTTL: 5m
      authorizationCacheSize: 1000
      authorizationCacheTTL: 5m
    dnsBindAddress: 127.0.0.1:53
    dnsDomain: cluster.local
    dnsIP: 0.0.0.0
    dnsNameservers: null
    dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
    dockerConfig:
      dockerShimRootDirectory: /var/lib/dockershim
      dockerShimSocket: /var/run/dockershim.sock
      execHandlerName: native
    enableUnidling: true
    imageConfig:
      format: registry.redhat.io/openshift3/ose-${component}:${version}
      latest: false
    iptablesSyncPeriod: 30s
    kind: NodeConfig
    kubeletArguments:
      bootstrap-kubeconfig:
      - /etc/origin/node/bootstrap.kubeconfig
      cert-dir:
      - /etc/origin/node/certificates
      enable-controller-attach-detach:
      - 'true'
      feature-gates:
      - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
      node-labels:
      - node-role.kubernetes.io/compute=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/kubevirt=true
      pod-manifest-path:
      - /etc/origin/node/pods
      rotate-certificates:
      - 'true'
    masterClientConnectionOverrides:
      acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
      burst: 40
      contentType: application/vnd.kubernetes.protobuf
      qps: 20
    masterKubeConfig: node.kubeconfig
    networkConfig:
      mtu: 1450
      networkPluginName: redhat/openshift-ovs-subnet
    proxyArguments:
      cluster-cidr:
      - 10.128.0.0/14
    servingInfo:
      bindAddress: 0.0.0.0:10250
      bindNetwork: tcp4
      clientCA: client-ca.crt
    volumeConfig:
      localQuota:
        perFSGroup: null
    volumeDirectory: /var/lib/origin/openshift.local.volumes
kind: ConfigMap
metadata:
  creationTimestamp: 2018-12-07T19:41:21Z
  name: node-config-infra-compute
  namespace: openshift-node
  resourceVersion: "1269"
  selfLink: /api/v1/namespaces/openshift-node/configmaps/node-config-infra-compute
  uid: 12b2c79f-fa58-11e8-ac5d-beeffeed0062

Here's the relevant bits from my inventory:

openshift_use_crio=True
openshift_use_crio_only=True
openshift_crio_enable_docker_gc=True
openshift_crio_docker_gc_node_selector={'runtime': 'cri-o'}
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, { 'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/compute=true', 'node-role.kubernetes.io/infra=true']},  { 'name': 'node-config-infra-compute', 'labels': ['node-role.kubernetes.io/compute=true', 'node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/kubevirt=true']}]

<...snip...>
[nodes]
dc-ocp-m0.cloud.lab.eng.bos.redhat.com runtime="cri-o"
dc-ocp-m1.cloud.lab.eng.bos.redhat.com runtime="cri-o"
dc-ocp-m2.cloud.lab.eng.bos.redhat.com runtime="cri-o"
dc-ocp-n0.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-infra-compute" runtime="cri-o"
dc-ocp-n1.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-infra-compute" runtime="cri-o"
dc-ocp-n2.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-infra-compute" runtime="cri-o"
dc-ocp-n3.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-infra-compute" runtime="cri-o"
dc-ocp-n4.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-infra-compute" runtime="cri-o"

It seems like the bootstrap node config is configured for cri-o, but the actual node config isn't:
# grep -ri kubeletArguments /etc/origin/node/* -A10
/etc/origin/node/bootstrap-node-config.yaml:kubeletArguments:
/etc/origin/node/bootstrap-node-config.yaml-  bootstrap-kubeconfig:
/etc/origin/node/bootstrap-node-config.yaml-  - /etc/origin/node/bootstrap.kubeconfig
/etc/origin/node/bootstrap-node-config.yaml-  cert-dir:
/etc/origin/node/bootstrap-node-config.yaml-  - /etc/origin/node/certificates
/etc/origin/node/bootstrap-node-config.yaml-  container-runtime:
/etc/origin/node/bootstrap-node-config.yaml-  - remote
/etc/origin/node/bootstrap-node-config.yaml-  container-runtime-endpoint:
/etc/origin/node/bootstrap-node-config.yaml-  - /var/run/crio/crio.sock
/etc/origin/node/bootstrap-node-config.yaml-  enable-controller-attach-detach:
/etc/origin/node/bootstrap-node-config.yaml-  - 'true'
--
/etc/origin/node/node-config.yaml:kubeletArguments:
/etc/origin/node/node-config.yaml-  bootstrap-kubeconfig:
/etc/origin/node/node-config.yaml-  - /etc/origin/node/bootstrap.kubeconfig
/etc/origin/node/node-config.yaml-  cert-dir:
/etc/origin/node/node-config.yaml-  - /etc/origin/node/certificates
/etc/origin/node/node-config.yaml-  enable-controller-attach-detach:
/etc/origin/node/node-config.yaml-  - 'true'
/etc/origin/node/node-config.yaml-  feature-gates:
/etc/origin/node/node-config.yaml-  - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
/etc/origin/node/node-config.yaml-  node-labels:
/etc/origin/node/node-config.yaml-  - node-role.kubernetes.io/master=true
--
/etc/origin/node/tmp/node-config.yaml:kubeletArguments:
/etc/origin/node/tmp/node-config.yaml-  bootstrap-kubeconfig:
/etc/origin/node/tmp/node-config.yaml-  - /etc/origin/node/bootstrap.kubeconfig
/etc/origin/node/tmp/node-config.yaml-  cert-dir:
/etc/origin/node/tmp/node-config.yaml-  - /etc/origin/node/certificates
/etc/origin/node/tmp/node-config.yaml-  enable-controller-attach-detach:
/etc/origin/node/tmp/node-config.yaml-  - 'true'
/etc/origin/node/tmp/node-config.yaml-  feature-gates:
/etc/origin/node/tmp/node-config.yaml-  - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
/etc/origin/node/tmp/node-config.yaml-  node-labels:
/etc/origin/node/tmp/node-config.yaml-  - node-role.kubernetes.io/master=true



I captured oc get nodes a couple of times during the ansible run to see how things changed. master0 flips from cri-o to docker at some point:

# cat oc.get_nodes
Fri Dec  7 19:42:10 UTC 2018
NAME                                     STATUS    ROLES     AGE       VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    KERNEL-VERSION              CONTAINER-RUNTIME
dc-ocp-m0.cloud.lab.eng.bos.redhat.com   Ready     master    3m        v1.11.0+d4cacc0   10.19.138.166   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-m1.cloud.lab.eng.bos.redhat.com   Ready     master    3m        v1.11.0+d4cacc0   10.19.138.167   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-m2.cloud.lab.eng.bos.redhat.com   Ready     master    3m        v1.11.0+d4cacc0   10.19.138.168   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8

Fri Dec  7 19:42:35 UTC 2018
NAME                                     STATUS     ROLES     AGE       VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    KERNEL-VERSION              CONTAINER-RUNTIME
dc-ocp-m0.cloud.lab.eng.bos.redhat.com   NotReady   master    3m        v1.11.0+d4cacc0   10.19.138.166   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-m1.cloud.lab.eng.bos.redhat.com   NotReady   master    3m        v1.11.0+d4cacc0   10.19.138.167   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-m2.cloud.lab.eng.bos.redhat.com   NotReady   master    3m        v1.11.0+d4cacc0   10.19.138.168   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
Fri Dec  7 19:44:57 UTC 2018
NAME                                     STATUS     ROLES                    AGE       VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    KERNEL-VERSION              CONTAINER-RUNTIME
dc-ocp-m0.cloud.lab.eng.bos.redhat.com   Ready      master                   5m        v1.11.0+d4cacc0   10.19.138.166   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   docker://1.13.1
dc-ocp-m1.cloud.lab.eng.bos.redhat.com   NotReady   master                   5m        v1.11.0+d4cacc0   10.19.138.167   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-m2.cloud.lab.eng.bos.redhat.com   NotReady   master                   5m        v1.11.0+d4cacc0   10.19.138.168   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n0.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   24s       v1.11.0+d4cacc0   10.19.138.161   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n1.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   23s       v1.11.0+d4cacc0   10.19.138.162   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n2.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   23s       v1.11.0+d4cacc0   10.19.138.163   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n3.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   23s       v1.11.0+d4cacc0   10.19.138.164   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8
dc-ocp-n4.cloud.lab.eng.bos.redhat.com   NotReady   compute,infra,kubevirt   23s       v1.11.0+d4cacc0   10.19.138.165   <none>        OpenShift   3.10.0-957.1.3.el7.x86_64   cri-o://1.11.8



If I omit the `openshift_use_crio_only=True` part, then all nodes end up reporting docker as the runtime, with the master api pods crashing as before since they were already started with cri-o.

Comment 4 Johnny Liu 2018-12-10 03:04:42 UTC
I am pretty sure you are hitting the issue I described in comment 1.

@Russell, this issue is caused by https://github.com/openshift/openshift-ansible/pull/10645, which fixed BZ#1647516. Do you think the installer should update user-customized node configs automatically based on the openshift_use_crio setting?

Comment 5 David Critch 2018-12-10 19:28:14 UTC
(In reply to Johnny Liu from comment #4)
> I am pretty sure you are hitting the issue what I said in comment 1.
> 
> @Russell, this issue is caused by
> https://github.com/openshift/openshift-ansible/pull/10645
>  when fixing BZ#1647516. Do you think installer should update user
> customized node config automatically based on openshift_use_crio setting?

Confirmed. I reverted to the previous commit of the node-config.yaml.j2 template, and the install proceeded as expected.

Comment 6 Scott Dodson 2018-12-11 03:26:35 UTC
I think you need to assign the nodes you intend to run cri-o on to one of the crio node groups, or, when crafting your own groups, make sure the relevant edits are applied to the kubelet config so that it uses a remote runtime and the cri-o socket is provided (a sketch of those edits follows).
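For reference, a minimal sketch of those kubelet config edits as they would appear in a cri-o-enabled node config (the keys are taken from the bootstrap-node-config.yaml output in comment 3; the surrounding YAML is omitted):

kubeletArguments:
  container-runtime:
  - remote
  container-runtime-endpoint:
  - /var/run/crio/crio.sock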

Comment 7 Russell Teague 2018-12-11 13:45:18 UTC
@Johnny,
You are correct.  This is a side effect of fixing BZ#1647516.  Previously, if a cluster was deployed and the first master had openshift_use_crio=True, ALL configmaps were created with crio settings, regardless of whether those hosts were supposed to use crio.  In order to use crio, you must specify a node group whose config has the crio edits.  We have default groups available: node-config-master-crio, node-config-infra-crio, node-config-compute-crio, node-config-master-infra-crio, node-config-all-in-one-crio.

https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_facts/defaults/main.yml#L144-L209

This is a docs issue as we've required proper use of node configs for a while.  The bug mentioned above just allowed a loophole for not using the right node config.

Please assign crio node configs to your hosts and redeploy to confirm this fixes your issue.
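For illustration, a minimal inventory sketch assigning hosts to those default crio groups via openshift_node_group_name (hostnames reused from this report; note that the default groups do not carry the custom kubevirt label used here, which is why comment 8 instead adds the crio edits to the custom groups):

[nodes]
dc-ocp-m0.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-master-crio"
dc-ocp-m1.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-master-crio"
dc-ocp-m2.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-master-crio"
dc-ocp-n0.cloud.lab.eng.bos.redhat.com openshift_node_group_name="node-config-compute-crio"
(remaining compute nodes assigned the same way)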

Comment 8 David Critch 2018-12-11 15:07:20 UTC
I can confirm that fixes my issue. I updated my inventory so the openshift_node_groups are like so:
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true'], 'edits': '{{ openshift_node_group_edits_crio }}'}, { 'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/compute=true', 'node-role.kubernetes.io/infra=true'] , 'edits': '{{ openshift_node_group_edits_crio }}'},  { 'name': 'node-config-infra-compute', 'labels': ['node-role.kubernetes.io/compute=true', 'node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/kubevirt=true'], 'edits': '{{ openshift_node_group_edits_crio }}'}]

And the deploy works from latest git release-3.11 w/o modifications. Thanks!

