Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1611830

Summary: Regression: Installation fails with CRI-O for various components
Product: OpenShift Container Platform
Component: Containers
Version: 3.10.0
Target Release: 3.10.z
Hardware: x86_64
OS: Linux
Reporter: Wolfgang Kulhanek <wkulhane>
Assignee: Giuseppe Scrivano <gscrivan>
QA Contact: DeShuai Ma <dma>
CC: amurdaca, aos-bugs, chezhang, jokerman, mmccomas, mpatel, wkulhane, zitang
Severity: urgent
Priority: unspecified
Status: CLOSED WORKSFORME
Type: Bug
Last Closed: 2018-08-21 14:52:01 UTC
Attachments:
  Ansible hosts file as requested (no flags)

Description Wolfgang Kulhanek 2018-08-02 20:32:52 UTC
Description of problem:
With 3.10 (OCP 3.10.14, using the 3.10.21 installer) the rollout of various pods fails. Affected pods are:
- asb (Ansible Service Broker)
- kibana
- Elasticsearch


Version-Release number of selected component (if applicable):
OCP 3.10.14 with 3.10.21 Installer

How reproducible:
Every time

Steps to Reproduce:
1. Install OCP with CRI-O enabled
2. Observe that the rollout of the pods listed above fails.
3. Subsequent (manual) rollouts succeed.
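
The manual re-rollout from step 3 is mentioned later in the thread (comment 10 uses "oc rollout latest asb"). A hedged sketch of the workaround for all three affected components; the deployment config names are inferred from the failing deploy pod names above, and the Elasticsearch dc name includes a per-install random suffix, so adjust to your cluster:

```
# Re-trigger the failed initial deployments (dc names inferred from the
# failing "<name>-1-deploy" pods in the report; run with sufficient access).
oc rollout latest dc/asb -n openshift-ansible-service-broker
oc rollout latest dc/logging-kibana -n openshift-logging
oc rollout latest dc/logging-es-data-master-fx351ghs -n openshift-logging
```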

Actual results:
oc get pod --all-namespaces

openshift-ansible-service-broker    asb-1-deploy                                  0/1       Error       0          20m
openshift-logging                   logging-es-data-master-fx351ghs-1-deploy      0/1       Error       0          22m
openshift-logging                   logging-kibana-1-deploy                       0/1       Error       0          23m

It is always these three pods.

Seeing this event:
19m       19m       1         logging-es-data-master-fx351ghs-1-8cp9w.154729ce51f5385d   Pod                     spec.containers{proxy}           Warning   Failed                        kubelet, infranode1.wk310g.internal   Error: container create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1713: running lstat on namespace path \"/proc/0/ns/ipc\" caused \"lstat /proc/0/ns/ipc: no such file or directory\""

The exact same configuration works without fault when using Docker as the runtime. This also happens with or without OCS.
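
The failing path in that event is telling: under Linux, /proc/<pid>/ns/ipc exists only for a live process, and PID 0 is never a real PID, so the runtime was evidently handed a PID of 0 (plausibly an already-exited infra container) when trying to join the pod's IPC namespace. That reading is an interpretation, not confirmed in the report; a minimal sketch of why that exact lstat must fail:

```shell
# PID 0 never corresponds to a real process, so the kernel exposes no
# /proc/0 directory and the lstat from the event message must fail:
stat /proc/0/ns/ipc 2>&1 | head -n 1

# A live process (this shell) does have an ipc namespace link:
readlink /proc/$$/ns/ipc   # e.g. ipc:[4026531839]
```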


Expected results:


Additional info:

Comment 1 Mrunal Patel 2018-08-02 20:49:22 UTC
How are these pods started? Are they automatically launched by the installer?

Comment 2 Wolfgang Kulhanek 2018-08-02 21:08:54 UTC
Yes. They are launched by the installer (when logging and ASB are enabled).

Comment 3 Mrunal Patel 2018-08-02 21:30:25 UTC
Can you share the playbook or the settings to enable these for the install?

Comment 4 Mrunal Patel 2018-08-02 21:33:21 UTC
I found the one for the Ansible Service Broker: ansible_service_broker_install: true
What is the one for logging?

Comment 5 Wolfgang Kulhanek 2018-08-02 21:48:31 UTC
openshift_logging_install_logging=true
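
Taken together, the installer flags discussed in comments 4 and 5 would sit in the [OSEv3:vars] section of the Ansible inventory roughly like this (a sketch; openshift_use_crio is assumed from the CRI-O install in the report, the other two variables come from the comments above):

```ini
[OSEv3:vars]
# Run the cluster on CRI-O instead of Docker (per the report)
openshift_use_crio=true
# Deploy the Ansible Service Broker (comment 4)
ansible_service_broker_install=true
# Deploy the EFK logging stack (comment 5)
openshift_logging_install_logging=true
```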

Comment 6 Antonio Murdaca 2018-08-06 14:51:34 UTC
Can you try disabling SELinux? That error shows up most of the time when SELinux complains about something (https://github.com/kubernetes-incubator/cri-o/issues/528)

Comment 7 Wolfgang Kulhanek 2018-08-06 15:10:58 UTC
Hm. That is an old bugzilla. This worked fine with 3.9 (.31/.33). It doesn't work with 3.10.14.

I can >try< to turn SELinux off for testing. But I really don't want to for production environments.

Will do so when I have an hour or two to test - probably tomorrow.

Comment 8 Wolfgang Kulhanek 2018-08-07 18:31:15 UTC
I did try an install with SELinux set to permissive. The logging-es pod came up but the asb deploy still failed.

Comment 9 Mrunal Patel 2018-08-07 18:38:21 UTC
What error did the asb deploy run into? Could you get the logs?
Also, could you gather the SELinux AVCs?

ausearch -m avc -ts recent

Comment 10 Wolfgang Kulhanek 2018-08-07 20:09:51 UTC
There are no logs since the deploy never succeeds.

Events look like this (I did run oc rollout latest asb to fix it; that's where the asb-2 events come from):

LAST SEEN   FIRST SEEN   COUNT     NAME                            KIND                    SUBOBJECT                     TYPE      REASON                        SOURCE                                     MESSAGE
1h          1h           1         asb-1-deploy.1548ad8100be7548   Pod                                                   Normal    Scheduled                     default-scheduler                          Successfully assigned asb-1-deploy to infranode1.rhte-cloud2.internal
1h          1h           1         asb.1548ad80ffddb340            DeploymentConfig                                      Normal    DeploymentCreated             deploymentconfig-controller                Created new replication controller "asb-1" for version 1
1h          1h           1         asb-1-bgrw9.1548ad8174cb3063    Pod                                                   Normal    Scheduled                     default-scheduler                          Successfully assigned asb-1-bgrw9 to infranode1.rhte-cloud2.internal
1h          1h           1         asb-1.1548ad817498cbab          ReplicationController                                 Normal    SuccessfulCreate              replication-controller                     Created pod: asb-1-bgrw9
1h          1h           1         asb-1-deploy.1548ad8169e7cb1c   Pod                     spec.containers{deployment}   Normal    Started                       kubelet, infranode1.rhte-cloud2.internal   Started container
1h          1h           1         asb-1-deploy.1548ad8168699ece   Pod                     spec.containers{deployment}   Normal    Created                       kubelet, infranode1.rhte-cloud2.internal   Created container
1h          1h           1         asb-1-deploy.1548ad815f1ba8ad   Pod                     spec.containers{deployment}   Normal    Pulled                        kubelet, infranode1.rhte-cloud2.internal   Container image "registry.access.redhat.com/openshift3/ose-deployer:v3.10.14" already present on machine
1h          1h           1         asb-1-bgrw9.1548ad81e4c4627c    Pod                     spec.containers{asb}          Normal    Pulling                       kubelet, infranode1.rhte-cloud2.internal   pulling image "registry.access.redhat.com/openshift3/ose-ansible-service-broker:v3.10"
1h          1h           1         asb-1-bgrw9.1548ad8d5815f5f4    Pod                     spec.containers{asb}          Normal    Pulled                        kubelet, infranode1.rhte-cloud2.internal   Successfully pulled image "registry.access.redhat.com/openshift3/ose-ansible-service-broker:v3.10"
1h          1h           1         asb-1-bgrw9.1548ad8e5bdb1ed6    Pod                     spec.containers{asb}          Normal    Pulled                        kubelet, infranode1.rhte-cloud2.internal   Container image "registry.access.redhat.com/openshift3/ose-ansible-service-broker:v3.10" already present on machine
1h          1h           2         asb-1-bgrw9.1548ad8d696467ca    Pod                     spec.containers{asb}          Normal    Started                       kubelet, infranode1.rhte-cloud2.internal   Started container
1h          1h           2         asb-1-bgrw9.1548ad8d67fc1b5f    Pod                     spec.containers{asb}          Normal    Created                       kubelet, infranode1.rhte-cloud2.internal   Created container
1h          1h           1         asb-1.1548ad9201e264d7          ReplicationController                                 Normal    SuccessfulDelete              replication-controller                     Deleted pod: asb-1-bgrw9
1h          1h           1         asb-1-bgrw9.1548ad920a2ae6c7    Pod                     spec.containers{asb}          Normal    Killing                       kubelet, infranode1.rhte-cloud2.internal   Killing container with id cri-o://asb:Need to kill Pod
1h          1h           1         asb.1548ad92017beea1            DeploymentConfig                                      Normal    ReplicationControllerScaled   deploymentconfig-controller                Scaled replication controller "asb-1" from 1 to 0
1h          1h           1         asb-2-deploy.1548adb1198eca60   Pod                                                   Normal    Scheduled                     default-scheduler                          Successfully assigned asb-2-deploy to infranode1.rhte-cloud2.internal
1h          1h           1         asb.1548adb118bebbaf            DeploymentConfig                                      Normal    DeploymentCreated             deploymentconfig-controller                Created new replication controller "asb-2" for version 2
1h          1h           1         asb-2-deploy.1548adb1d3e4848c   Pod                     spec.containers{deployment}   Normal    Pulled                        kubelet, infranode1.rhte-cloud2.internal   Container image "registry.access.redhat.com/openshift3/ose-deployer:v3.10.14" already present on machine
1h          1h           1         asb-2-deploy.1548adb1dc4a78d5   Pod                     spec.containers{deployment}   Normal    Created                       kubelet, infranode1.rhte-cloud2.internal   Created container
1h          1h           1         asb-2-deploy.1548adb1dd6536ae   Pod                     spec.containers{deployment}   Normal    Started                       kubelet, infranode1.rhte-cloud2.internal   Started container
1h          1h           1         asb-2.1548adb1e7e515ab          ReplicationController                                 Normal    SuccessfulCreate              replication-controller                     Created pod: asb-2-55qj4
1h          1h           1         asb-2-55qj4.1548adb1e8009562    Pod                                                   Normal    Scheduled                     default-scheduler                          Successfully assigned asb-2-55qj4 to infranode1.rhte-cloud2.internal
1h          1h           1         asb-2-55qj4.1548adb24e9faff7    Pod                     spec.containers{asb}          Normal    Started                       kubelet, infranode1.rhte-cloud2.internal   Started container
1h          1h           1         asb-2-55qj4.1548adb24d4919f6    Pod                     spec.containers{asb}          Normal    Created                       kubelet, infranode1.rhte-cloud2.internal   Created container
1h          1h           1         asb-2-55qj4.1548adb24453b5e6    Pod                     spec.containers{asb}          Normal    Pulled                        kubelet, infranode1.rhte-cloud2.internal   Container image "registry.access.redhat.com/openshift3/ose-ansible-service-broker:v3.10" already present on machine
1h          1h           1         asb-2-deploy.1548adb5f5c18b4e   Pod                     spec.containers{deployment}   Normal    Killing                       kubelet, infranode1.rhte-cloud2.internal   Killing container with id cri-o://deployment:Need to kill Pod



ausearch -m avc -ts recent
returns (infranode1 is the one that's supposed to run asb)

[root@infranode1 ~]# ausearch -m avc -ts recent
<no matches>

Comment 11 Giuseppe Scrivano 2018-08-09 11:04:18 UTC
could you please share the inventory file used for the installation?

Comment 12 Wolfgang Kulhanek 2018-08-09 14:33:52 UTC
Created attachment 1474715 [details]
Ansible hosts file as requested

Ansible hosts file as requested

Comment 13 Giuseppe Scrivano 2018-08-10 19:48:41 UTC
I cannot reproduce this problem; I tried on a fresh RHEL 7.5 system:

-bash-4.2# oc get --all-namespaces=true pods
NAMESPACE                           NAME                                      READY     STATUS    RESTARTS   AGE
default                             docker-registry-1-726hk                   1/1       Running   2          2h
default                             registry-console-1-l6njp                  1/1       Running   2          2h
default                             router-1-js98m                            1/1       Running   2          2h
kube-service-catalog                apiserver-fpcrm                           1/1       Running   2          2h
kube-service-catalog                controller-manager-jjfw9                  1/1       Running   2          2h
kube-system                         master-api-rhel7                          1/1       Running   2          2h
kube-system                         master-controllers-rhel7                  1/1       Running   2          2h
kube-system                         master-etcd-rhel7                         1/1       Running   2          2h
openshift-ansible-service-broker    asb-1-fww6m                               1/1       Running   3          2h
openshift-logging                   logging-curator-1-h7szp                   1/1       Running   2          2h
openshift-logging                   logging-es-data-master-zavcw4k3-4-bq4kq   2/2       Running   0          1m
openshift-logging                   logging-fluentd-mw9fn                     1/1       Running   2          2h
openshift-logging                   logging-kibana-1-c5g8d                    2/2       Running   4          2h
openshift-node                      sync-975nf                                1/1       Running   2          2h
openshift-sdn                       ovs-m9t4g                                 1/1       Running   2          2h
openshift-sdn                       sdn-9qrth                                 1/1       Running   2          2h
openshift-template-service-broker   apiserver-h5l5m                           1/1       Running   3          2h
openshift-web-console               webconsole-7f944b7c85-fzfw7               1/1       Running   4          2h

I am using cri-o-1.10.6-1.rhaos3.10.git56d7d9a.el7.x86_64 and atomic-openshift-3.10.27-1.git.0.1695df4.el7.x86_64.

I am running with SELinux enforcing.

It looks like the problem is somewhere else. If you have a cluster exhibiting the problem, I can try to log in and investigate the issue there.

Comment 14 Giuseppe Scrivano 2018-08-21 14:52:01 UTC
I have not gotten any update on this, so I am closing for now. Please reopen if the issue persists.