Description of problem:
Running 'oadm diagnostics NetworkCheck' on a containerized install env shows the error:
chroot: can't execute 'openshift': No such file or directory

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.4.0.24+52fd77b
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster with a containerized install
2. Run 'oadm diagnostics NetworkCheck'

Actual results:
# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'
[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
Info:  Output from the network diagnostic pod on node "qe-zzhao-34-acc-node-registry-router-1":
       chroot: can't execute 'openshift': No such file or directory
[Note] Summary of diagnostics execution (version v3.4.0.24+52fd77b):
[Note] Completed with no errors or warnings seen.

Expected results:
No error.

Additional info:
This error only happens on a containerized install env.
There is no openshift binary present in the node role after installing the OCP env with openshift-ansible. Maybe we can ask the productization team to fix this the same way they did for the master role.
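For context, a minimal sketch of the failure mode (the /host mount path and the command shown are assumptions for illustration, not the actual pod spec):

# Run from inside the busybox-based diagnostic pod; the host filesystem is
# assumed to be mounted at /host.
chroot /host openshift version
# On a containerized install there is no openshift binary on the host, so
# busybox chroot reports: can't execute 'openshift': No such file or directory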
Ravi: Can we bundle the openshift binary in the debug pod somehow?
Ben: We intentionally did not bundle the openshift binary in the debug pod. The OpenShift image is 500MB+ and pulling it on every node is expensive; instead we want to use the busybox image, which is only ~1MB, and rely on the openshift binary already on the node.
The OpenShift image is 500M but the binary is smaller... Not sure how much smaller...
(In reply to Eric Paris from comment #4)
> The OpenShift image is 500M but the binary is smaller... Not sure how much smaller...

The openshift binary is 200M+ and the oc binary is 100M+ for now, and both keep increasing. That is still too big for a debug pod IMO.
Fixed in https://github.com/openshift/openshift-ansible/pull/2804
Tried to add an option to the diagnostics cmd that pulls the openshift image for network diagnostics instead of reusing the existing binary on the node. Code branch: https://github.com/openshift/origin/compare/master...pravisankar:network-diag-pull-openshift-bin?expand=1

While testing the code, I realized that pulling the openshift/origin image is not sufficient. We need an image that has the openshift, docker, tc, ip, brctl, ovs-ctl, etc. binaries. I made it work by hacking a little, but we would need to maintain a new image that has all these required binaries. As of today, the openshift image is 540M and even the openshift binary alone is 252M. Network diagnostics with this pull-openshift-image option is significantly slower than the current method (machine has 16G RAM and 32 cpu cores). I don't see many issues with our current approach, so I prefer to trash this code.
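For the record, roughly what maintaining such an image would look like; this is a hypothetical sketch only, and the base image, package names, tag, and availability of the repos are all assumptions:

cat > Dockerfile.network-debug <<'EOF'
FROM openshift/origin
# iproute provides ip/tc, bridge-utils provides brctl; the openvswitch and
# docker client binaries would also have to come from somewhere (repos assumed).
RUN yum install -y iproute bridge-utils openvswitch docker && yum clean all
EOF
docker build -t openshift/origin-network-debug -f Dockerfile.network-debug .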
Ok, not maintaining that image seems reasonable for now. Let's hold the code in case we need it later, but for now, let's not do anything with it. Thanks! Does it make sense to add a way to pass a custom path to the debug pod?
If we don't get this fixed for 3.4 then we need to release note that containerized install has trouble with this.
Fixed in https://github.com/openshift/origin/pull/12127
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/d4744a9bb6773509f99864f7637318925799b486
Bug 1393716 - Fix network diagnostics on containerized openshift install

Added openshift/diagnostics-deployer image: it contains the openshift-network-debug script, which handles both containerized and non-containerized openshift environments and initiates network diagnostics on the node.
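Not the actual script, but a rough sketch of the branching such a wrapper needs; the /host mount path, the node container name, and the invoked command are assumptions:

#!/bin/sh
if chroot /host which openshift >/dev/null 2>&1; then
    # RPM install: the binary exists on the host, run it via chroot as before.
    chroot /host openshift "$@"
else
    # Containerized install: run the binary inside the node container instead.
    chroot /host docker exec origin-node openshift "$@"
fi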
This has been merged into ocp and is in OCP v3.5.0.19 or newer.
Verified this bug on v3.5.0.19; it has been fixed.
@zhaozhanqi, @bomeng, can you please reverify this bug once this PR https://github.com/openshift/origin/pull/12982 is merged?
Assigned this bug since the PR https://github.com/openshift/origin/pull/12982 has not been merged yet.
Tested this issue on v3.5.0.35 on an atomic host env; the network diagnostics pod cannot reach Running.

# openshift version
openshift v3.5.0.35
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# cat /etc/redhat-release
Red Hat Enterprise Linux Atomic Host release 7.3

QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From  SubObjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  8s  8s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}  Normal  Pulled  Container image "openshift3/ose" already present on machine
  8s  8s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}  Normal  Created  Created container with docker id 528da7892463; Security:[seccomp=unconfined]
  7s  7s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}  Warning  Failed  Failed to start container with docker id 528da7892463 with error: Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
  7s  7s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}    Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-kbst5" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"process_linux.go:359: container init caused \\\\\\\\\\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\"\\\\\\\"\\\\n\\\"\"}"
This has proven to be hard to reproduce. Weibin was going to try to reproduce.
Ravi is accessing and debugging the issue on: host-8-174-253.host.centralci.eng.rdu2.redhat.com
Flagged for the upcoming release since this is not a regression.
'oadm diagnostics NetworkCheck' output was not helpful and there are no relevant logs on the node, as the test namespaces/pods/services were deleted after the operation. Created custom openshift/origin and openshift/node images with debug info, and the test run showed it was failing due to https://bugzilla.redhat.com/show_bug.cgi?id=1439142. I will revisit this issue once we tackle bug#1439142.
https://bugzilla.redhat.com/show_bug.cgi?id=1439142 has merged... any update on this bug? Thanks
*** Bug 1421643 has been marked as a duplicate of this bug. ***
@zhaozhanqi Can you check whether this diagnostics issue can still be reproduced with SELinux disabled on the node (setenforce 0)?
@Ravi Sankar With setenforce 0 set on the node, the error in comment 17 is NOT seen, but the pod container still cannot reach Running and fails with:

# docker logs 0bef11792cb4
error: --deployment or OPENSHIFT_DEPLOYMENT_NAME is required

FYI, it seems the latest 3.6 atomic host containerized install always fails on my side; I will research this later. I am just using the 3.5 version here.
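For anyone retesting, a quick way to confirm whether SELinux is what blocks the secret mount (standard commands, run directly on the node):

getenforce                   # Enforcing / Permissive / Disabled
ausearch -m avc -ts recent   # look for AVC denials around the failed mount
setenforce 0                 # temporarily switch to permissive for the retest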
We are trying to solve two issues for this bug:
(1) For containerized openshift installs, expecting a specific node image name format is fragile; that is what caused the '--deployment <name> is required' error.
(2) We would like to remove the hack where we call docker directly to fetch the node container id and pid. After discussion with the container team (mrunalp), this will be rewritten using runc so that it works with other container runtimes (CRI-O, docker, etc.); see the sketch below.
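A sketch of the two approaches for item (2); the container IDs are placeholders and the runc state root directory depends on the engine, so treat this as illustrative only:

# Current hack (docker-specific): ask the docker daemon for the node container's pid.
docker inspect --format '{{.State.Pid}}' <node-container-id>

# Runtime-agnostic alternative: read the OCI runtime state; the JSON that runc
# prints includes the container's init pid, whether the engine on top is docker or CRI-O.
runc state <container-id>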
@mrunalp will you work with Ravi on this please?
*** Bug 1501797 has been marked as a duplicate of this bug. ***
CRI changes are too risky for 3.7.
In 3.10 this is no longer supported.