Bug 2066891 - ovnkube-trace fails if the container doesn't have bash shell
Summary: ovnkube-trace fails if the container doesn't have bash shell
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andreas Karis
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-22 17:05 UTC by Pablo Alonso Rodriguez
Modified: 2024-01-17 05:22 UTC (History)
1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
rbrattai: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 1205 0 None Merged [DownstreamMerge] 4.12 initial merge from upstream: 7-18-22 2022-07-23 11:57:37 UTC
Github ovn-org ovn-kubernetes pull 2971 0 None open ovnkube-trace improvements and refactor 2022-05-06 19:16:48 UTC

Description Pablo Alonso Rodriguez 2022-03-22 17:05:13 UTC
Description of problem:

ovnkube-trace requires bash while running some `oc exec` commands in the pods[1]. Many pods may not have bash at all. Good examples are:
- Alpine-based pods with ash
- Micro-containers (for example, images built `FROM scratch`) with no shell at all.

Version-Release number of selected component (if applicable):

4.7 (also master)

How reproducible:

Always

Steps to Reproduce:
1. Run ovnkube-trace where at least one of the involved pods doesn't have bash installed (see the example below).
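For example (pod names are hypothetical; flag names as documented for ovnkube-trace):
~~~
ovnkube-trace \
  -src-namespace default -src myexamplepod-7ac851df37-6dm1a \
  -dst-namespace openshift-dns -dst dns-default-qk94t \
  -tcp -dst-port 80 -loglevel 5
~~~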

Actual results:

I0321 17:08:02.422880    6390 ovnkube-trace.go:252] Reading interface index from /sys/class/net/...
I0321 17:08:02.422924    6390 ovnkube-trace.go:255] The command is cat /sys/class/net/eth0/iflink
I0321 17:08:02.498786    6390 ovnkube-trace.go:259] The command error command terminated with exit code 1 stdOut:
 stdErr: time="2022-03-21T22:08:02Z" level=error msg="exec failed: container_linux.go:367: starting container process caused: exec: \"bash\": executable file not found in $PATH"
Failed to get information from pod myexamplepod-7ac851df37-6dm1a: command terminated with exit code 1

Expected results:

ovnkube-trace to work properly

Additional info:

In general, it is bad practice to exec into application pods: not only can it be considered a disruption, but it is also risky to make assumptions about what those pods contain.

Given that the only place that seems to exec into a pod not belonging to ovn-kubernetes is here[2], and the intent seems to be to get the interface index of the pod's network interface, wouldn't it be possible to just go through the ovnkube-node container on the pod's node to get that information in a container-image-agnostic way?

[1] - https://github.com/openshift/ovn-kubernetes/blob/a74fcde51660abbc19916ce87b4928b5b8327295/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L101
[2] - https://github.com/openshift/ovn-kubernetes/blob/a74fcde51660abbc19916ce87b4928b5b8327295/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L272
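As a minimal illustration of the shell dependency (not the eventual fix), the exec could avoid bash entirely by invoking the binary directly. This still assumes the image ships cat, so it would not help `FROM scratch` images, which is why going through ovnkube-node is the more robust option (namespace and pod name below are hypothetical, taken from the log above):
~~~
# Runs without any shell in the target container, but still requires cat:
oc exec -n default myexamplepod-7ac851df37-6dm1a -- cat /sys/class/net/eth0/iflink
~~~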

Comment 1 Pablo Alonso Rodriguez 2022-03-22 17:30:38 UTC
For reference, a command like this seems to work to replace what the code in[2] does:

~~~
oc rsh -c ovnkube-node ovnkube-node-7dxzf bash -c 'ip -o link show "$(chroot /host crictl pods --state ready --namespace "^openshift-dns\$" --name "^dns-default-qk94t\$" -q | head -1 -c15)"' | sed -r -e 's/^.+if([0-9]+):.+$/\1/g'
~~~

Where:
- ovnkube-node-7dxzf is the ovnkube-node pod on the node (requiring bash here is not a problem because that image is under our control)
- chroot /host : we chroot to the host to get access to CRI-O
- crictl pods --state ready --namespace "^openshift-dns\$" --name "^dns-default-qk94t\$" -q : this gets the pod sandbox ID from CRI-O
- head -1 -c15 : we take the first 15 characters of the CRI-O pod ID, because they form the name of the host-side veth interface whose peer ifindex we want to retrieve
- ip -o link show : this prints a one-liner describing the host side of the veth pair. Note that the name is displayed as ${FIRST_15_CHARACTERS_OF_POD_ID}@if${THE_INDEX_WE_WANT_IN_POD_NETWORK_NAMESPACE}
- sed -r -e 's/^.+if([0-9]+):.+$/\1/g' : this extracts the desired ifindex from the @ifXX suffix of the name
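For illustration, here is roughly what each stage returns (the pod sandbox ID and interface indexes are hypothetical):
~~~
$ oc rsh -c ovnkube-node ovnkube-node-7dxzf chroot /host crictl pods --state ready \
    --namespace '^openshift-dns$' --name '^dns-default-qk94t$' -q
3f9a1c0de5b7a4e2...                      # 64-character pod sandbox ID

$ oc rsh -c ovnkube-node ovnkube-node-7dxzf ip -o link show 3f9a1c0de5b7a4e
7: 3f9a1c0de5b7a4e@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 ...

$ oc rsh -c ovnkube-node ovnkube-node-7dxzf ip -o link show 3f9a1c0de5b7a4e \
    | sed -r -e 's/^.+if([0-9]+):.+$/\1/g'
4                                        # the pod-side ifindex we want
~~~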

I am open to any comments on this approach.

Regards.

Comment 8 Ross Brattain 2022-08-08 17:44:19 UTC
Fails on a HyperShift hosted cluster:

F0808 17:33:08.168679   34860 ovnkube-trace.go:1049] Failed to get database URIs: cannot find ovnkube pods with container: ovnkube-master

On HyperShift, the ovnkube-node process runs as:

root        3027  0.1  0.7 748276 57316 ?        Ssl  17:07   0:02 /usr/bin/ovnkube --init-node ip-10.compute.internal --nb-address ssl:ovnkube-sbdb-clusters-hypershift-ci-13033.apps.o412a11h.qe.devcluster.openshift.com:443 --sb-address ssl:ovnkube-sbdb-clusters-hypershift-ci-13033.apps.o412a11h.qe.devcluster.openshift.com:443 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --export-ovs-metrics --disable-snat-multiple-gws

Comment 9 Andreas Karis 2022-08-18 19:16:19 UTC
I don't think that this ever worked on HyperShift. Can you check whether this works on a non-HyperShift cluster? Then we'd need an RFE or a separate bug for ovnkube-trace on HyperShift.

Comment 10 Andreas Karis 2022-08-18 19:23:58 UTC
I created https://issues.redhat.com/browse/OCPBUGS-298 to track the HyperShift use case. But indeed, ovnkube-trace was never compatible with HyperShift. The code that it fails on wasn't modified for this BZ:
~~~
func getDatabaseURIs(coreclient *corev1client.CoreV1Client, restconfig *rest.Config, ovnNamespace string) (string, string, bool, error) {
	containerName := "ovnkube-master"
	var err error

	found := false
	var podName string

	listOptions := metav1.ListOptions{}
	pods, err := coreclient.Pods(ovnNamespace).List(context.TODO(), listOptions)
	if err != nil {
		return "", "", false, err
	}
	for _, pod := range pods.Items {
		for _, container := range pod.Spec.Containers {
			if container.Name == containerName {
				found = true
				podName = pod.Name
				break
			}
		}
	}
	if !found {
		klog.V(5).Infof("Cannot find ovnkube pods with container %s", containerName)
		return "", "", false, fmt.Errorf("cannot find ovnkube pods with container: %s", containerName)
	}
	// ... (remainder of function omitted)
~~~
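One way to confirm the failure condition on a hosted cluster (a hypothetical check, listing each pod's container names in the OVN namespace):
~~~
oc get pods -n openshift-ovn-kubernetes \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'
# On HyperShift, no line includes an ovnkube-master container.
~~~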

Comment 11 Ross Brattain 2022-09-26 03:09:59 UTC
Verified on 4.12.0-0.nightly-2022-09-20-095559

Tested on containers created `FROM scratch`


test-ovnkube-trace.sh passed, with one issue: the worker-node matching fails on a three-node all-in-one (compact) cluster.

oc get nodes --show-labels | awk '!/node-role.kubernetes.io\/master=|node-role.kubernetes.io\/control-plane=/ && $1!="NAME" {print $1}'

selects no nodes when every node also carries the master/control-plane labels, as in this cluster (see the selector sketch after the listing):

NAME                                               STATUS   ROLES                         AGE     VERSION           LABELS
master-0-0.o412e1db-0.qe.lab.redhat.com   Ready    control-plane,master,worker   4d22h   v1.24.0+07c9eb7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0-0.o412e1db-0.qe.lab.redhat.com,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-0-1.o412e1db-0.qe.lab.redhat.com   Ready    control-plane,master,worker   4d22h   v1.24.0+07c9eb7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0-1.o412e1db-0.qe.lab.redhat.com,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-0-2.o412e1db-0.qe.lab.redhat.com   Ready    control-plane,master,worker   4d22h   v1.24.0+07c9eb7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0-2.o412e1db-0.qe.lab.redhat.com,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
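A test script could sidestep the awk matching with a label selector (a sketch, not what test-ovnkube-trace.sh does today; `!label` selects nodes that do NOT carry that label):
~~~
# Dedicated workers only:
oc get nodes -o name \
  -l 'node-role.kubernetes.io/worker,!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane'

# On a compact three-node cluster the above returns nothing, so fall
# back to any node carrying the worker role:
oc get nodes -o name -l node-role.kubernetes.io/worker
~~~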

