Description of problem:

ovnkube-trace requires bash when running some `oc exec` commands in the pods [1]. Many pods may not have bash at all. Good examples are:
- Alpine-based pods with ash
- Microcontainers with no shell at all

Version-Release number of selected component (if applicable): 4.7 (also master)

How reproducible: Always

Steps to Reproduce:
1. Run ovnkube-trace where at least one of the pods doesn't have bash installed

Actual results:

~~~
I0321 17:08:02.422880    6390 ovnkube-trace.go:252] Reading interface index from /sys/class/net/...
I0321 17:08:02.422924    6390 ovnkube-trace.go:255] The command is cat /sys/class/net/eth0/iflink
I0321 17:08:02.498786    6390 ovnkube-trace.go:259] The command error command terminated with exit code 1 stdOut:
 stdErr: time="2022-03-21T22:08:02Z" level=error msg="exec failed: container_linux.go:367: starting container process caused: exec: \"bash\": executable file not found in $PATH"
Failed to get information from pod myexamplepod-7ac851df37-6dm1a: command terminated with exit code 1
~~~

Expected results:

ovnkube-trace works properly regardless of the pod's container image.

Additional info:

In general, it is bad practice to exec into application pods: not only can it be considered a disruption, but it is also risky to make assumptions about what those pods contain. Given that the only place that seems to exec into a pod not belonging to ovn-kubernetes is here [2], and the intent appears to be to get the interface index of the pod network interface, wouldn't it be possible to go through the ovnkube-node container on the pod's node to get that information in a container-image-agnostic way?

[1] - https://github.com/openshift/ovn-kubernetes/blob/a74fcde51660abbc19916ce87b4928b5b8327295/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L101
[2] - https://github.com/openshift/ovn-kubernetes/blob/a74fcde51660abbc19916ce87b4928b5b8327295/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L272
For reference, a command like this seems to work to replace what the code in [2] does:

~~~
oc rsh -c ovnkube-node ovnkube-node-7dxzf bash -c 'ip -o link show "$(chroot /host crictl pods --state ready --namespace "^openshift-dns\$" --name "^dns-default-qk94t\$" -q | head -1 -c15)"' | sed -r -e 's/^.+if([0-9]+):.+$/\1/g'
~~~

Where:
- ovnkube-node-7dxzf is the ovnkube-node pod of the node (requiring bash here is not a problem because that image is under our control).
- chroot /host: we chroot to the host to get access to cri-o.
- crictl pods --state ready --namespace "^openshift-dns\$" --name "^dns-default-qk94t\$" -q: gets the pod ID from cri-o.
- head -1 -c15: we take the first 15 characters of the cri-o pod ID, because they match the name of the host-side veth pair interface whose ifindex we want to retrieve.
- ip -o link show: prints a one-liner with the information of the host-network side of the veth pair. Note that the interface name is displayed like ${FIRST_15_CHARACTERS_OF_POD_ID}@if${THE_INDEX_WE_WANT_IN_POD_NETWORK_NAMESPACE}.
- sed -r -e 's/^.+if([0-9]+):.+$/\1/g': extracts the desired ifindex from the @ifXX suffix of the name.

I am open to any comments on this approach. Regards.
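As a quick illustration of the last two steps above, the sed extraction can be checked against a made-up `ip -o link show` line (the interface name, indexes, and attributes below are invented for the example, not taken from a real cluster):

```shell
# Hypothetical host-side line as printed by `ip -o link show` for a pod's
# veth peer. The @ifNN suffix carries the peer's ifindex inside the pod
# network namespace, which is what we want to extract.
sample='11: 0123456789abcde@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP mode DEFAULT group default'

# Same sed expression as in the proposed command:
ifindex=$(printf '%s\n' "$sample" | sed -r -e 's/^.+if([0-9]+):.+$/\1/g')
echo "$ifindex"
# → 2
```

The greedy `.+` before `if` anchors the match at the last `if` in the line, i.e. the `@if` suffix, so the capture group is the peer ifindex.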
Fails on a HyperShift hosted cluster:

~~~
F0808 17:33:08.168679   34860 ovnkube-trace.go:1049] Failed to get database URIs: cannot find ovnkube pods with container: ovnkube-master
~~~

On HyperShift, ovnkube-node runs as:

~~~
root 3027 0.1 0.7 748276 57316 ? Ssl 17:07 0:02 /usr/bin/ovnkube --init-node ip-10.compute.internal --nb-address ssl:ovnkube-sbdb-clusters-hypershift-ci-13033.apps.o412a11h.qe.devcluster.openshift.com:443 --sb-address ssl:ovnkube-sbdb-clusters-hypershift-ci-13033.apps.o412a11h.qe.devcluster.openshift.com:443 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --export-ovs-metrics --disable-snat-multiple-gws
~~~
I don't think this ever worked on HyperShift. Can you check whether it works on a non-HyperShift cluster? Then we'd still need an RFE or a bug for ovnkube-trace on HyperShift.
I created https://issues.redhat.com/browse/OCPBUGS-298 to track the HyperShift use case. But indeed, ovnkube-trace was never compatible with HyperShift. The code that it fails on wasn't modified for this BZ:

~~~
func getDatabaseURIs(coreclient *corev1client.CoreV1Client, restconfig *rest.Config, ovnNamespace string) (string, string, bool, error) {
	containerName := "ovnkube-master"
	var err error

	found := false
	var podName string

	listOptions := metav1.ListOptions{}

	pods, err := coreclient.Pods(ovnNamespace).List(context.TODO(), listOptions)
	if err != nil {
		return "", "", false, err
	}
	for _, pod := range pods.Items {
		for _, container := range pod.Spec.Containers {
			if container.Name == containerName {
				found = true
				podName = pod.Name
				break
			}
		}
	}
	if !found {
		klog.V(5).Infof("Cannot find ovnkube pods with container %s", containerName)
		return "", "", false, fmt.Errorf("cannot find ovnkube pods with container: %s", containerName)
	}
~~~
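Since the hardcoded `containerName := "ovnkube-master"` is what fails on HyperShift hosted clusters (where only ovnkube-node pods run in the cluster itself), one direction for the RFE could be a fallback over candidate container names. A minimal, hypothetical sketch, not existing code; the `detect_ovnkube_container` helper is invented here, and its input is assumed to be a newline-separated list of container names as obtained from the pod specs:

```shell
# Hypothetical helper: pick which ovn-kubernetes container to target,
# preferring ovnkube-master (self-hosted) and falling back to
# ovnkube-node (HyperShift hosted cluster).
detect_ovnkube_container() {
  # $1: newline-separated container names from the ovn-kubernetes pods
  case "$1" in
    *ovnkube-master*) echo "ovnkube-master" ;;  # self-hosted control plane
    *ovnkube-node*)   echo "ovnkube-node" ;;    # HyperShift hosted cluster
    *)                echo "none" ;;
  esac
}

# Simulated HyperShift hosted cluster: no ovnkube-master container.
detected=$(detect_ovnkube_container "ovnkube-node
ovn-controller")
echo "$detected"
# → ovnkube-node
```

Whether ovnkube-trace can actually get the database URIs from ovnkube-node on HyperShift (e.g. via the `--nb-address`/`--sb-address` arguments visible in the process listing above) is a separate question for the RFE.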
Verified on 4.12.0-0.nightly-2022-09-20-095559, tested on containers created `FROM scratch`.

test-ovnkube-trace.sh passed, with one issue: worker node matching fails on a 3-node all-in-one cluster. The command

~~~
oc get nodes --show-labels | awk '!/node-role.kubernetes.io\/master=|node-role.kubernetes.io\/control-plane=/ && $1!="NAME" {print $1}'
~~~

fails on nodes with these labels:

~~~
NAME                                      STATUS   ROLES                         AGE     VERSION           LABELS
master-0-0.o412e1db-0.qe.lab.redhat.com   Ready    control-plane,master,worker   4d22h   v1.24.0+07c9eb7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0-0.o412e1db-0.qe.lab.redhat.com,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-0-1.o412e1db-0.qe.lab.redhat.com   Ready    control-plane,master,worker   4d22h   v1.24.0+07c9eb7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0-1.o412e1db-0.qe.lab.redhat.com,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-0-2.o412e1db-0.qe.lab.redhat.com   Ready    control-plane,master,worker   4d22h   v1.24.0+07c9eb7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0-2.o412e1db-0.qe.lab.redhat.com,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
~~~

Every node carries the master/control-plane labels as well as the worker label, so the exclusion filter matches nothing.
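To make the mismatch concrete, here is the filter run against a trimmed sample of the all-in-one output (node name from the report, label list shortened), followed by a possible alternative that selects on the presence of the worker role instead of the absence of the master roles. The alternative is only a suggestion, not the test script's actual fix:

```shell
# Trimmed sample of `oc get nodes --show-labels` on a 3-node all-in-one
# cluster; one node shown, labels shortened for the example.
sample='NAME STATUS ROLES AGE VERSION LABELS
master-0-0.o412e1db-0.qe.lab.redhat.com Ready control-plane,master,worker 4d22h v1.24.0+07c9eb7 node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker='

# Current filter: excludes every node, because each one also carries the
# master/control-plane labels.
old_filter=$(printf '%s\n' "$sample" | awk '!/node-role.kubernetes.io\/master=|node-role.kubernetes.io\/control-plane=/ && $1!="NAME" {print $1}')
echo "old filter: [$old_filter]"
# → old filter: []

# Possible alternative: match nodes that HAVE the worker role label,
# which also covers compact clusters where masters are schedulable.
worker_filter=$(printf '%s\n' "$sample" | awk '/node-role.kubernetes.io\/worker=/ && $1!="NAME" {print $1}')
echo "worker filter: $worker_filter"
# → worker filter: master-0-0.o412e1db-0.qe.lab.redhat.com
```

A label selector would avoid the awk parsing altogether, e.g. `oc get nodes -l node-role.kubernetes.io/worker -o name`, since `-l` and `-o name` are standard `oc get` options.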