+++ This bug was initially created as a clone of Bug #1909289 +++

Sometime in the last few releases, init containers stopped being debuggable with `oc debug pod/foo -c <init_container_name>`.

The root cause is that the logic that waits for the pod container to be running ignores init containers. The fix is to adjust the wait logic to correctly read init containers. Also, I simplified and removed some logic that was subject to exiting early on errors that might be transient (for instance, the first image pull can fail and the second can succeed) and replaced those exits with warning messages.

May need a backport for 4.6.

--- Additional comment from nstielau on 2021-01-04 20:10:14 UTC ---

Sounds like this is a regression but not a new one. Moving to blocker- to denote that we won't block the release on this.
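For readers following the root-cause description in the original report, here is a minimal sketch of a wait check that also consults init containers. It is not the actual `oc` code (which is written in Go); it only assumes the pod is available as a dict shaped like the Kubernetes pod JSON, where `status.initContainerStatuses` and `status.containerStatuses` are the real status fields. The function names `find_container_status` and `container_running` and the `warn` parameter are hypothetical, chosen for illustration.

```python
def find_container_status(pod, name):
    """Return the status entry for container `name`, or None.

    Searches init containers as well as regular containers, which is
    the step the report says the old wait logic skipped.
    """
    status = pod.get("status") or {}
    for field in ("initContainerStatuses", "containerStatuses"):
        for entry in status.get(field) or []:
            if entry.get("name") == name:
                return entry
    return None


def container_running(pod, name, warn=print):
    """Return True once the named container reports a running state.

    A waiting state (for example a failed first image pull) is treated
    as possibly transient: it produces a warning instead of an error,
    so the caller can simply poll again.
    """
    entry = find_container_status(pod, name)
    if entry is None:
        # Status not reported yet; keep waiting.
        return False
    state = entry.get("state") or {}
    if "running" in state:
        return True
    waiting = state.get("waiting") or {}
    if waiting.get("reason"):
        warn("container %s is waiting: %s" % (name, waiting["reason"]))
    return False
```

A caller would poll `container_running` on each pod update until it returns True or an overall timeout expires, rather than aborting on the first waiting state.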
Tested with the oc build from the repo; can't reproduce the issue now.

With the older oc, `oc debug -c init-container` hangs:

[root@dhcp-140-138 roottest]# oc debug po/openshift-kube-scheduler-ci-ln-jq80922-f76d1-xjrpn-master-0 -c wait-for-host-port
Starting pod/openshift-kube-scheduler-ci-ln-jq80922-f76d1-xjrpn-master-0-debug, command was: /usr/bin/timeout 30 /bin/bash -c echo -n "Waiting for port :10259 and :10251 to be released."
while [ -n "$(lsof -ni :10251)" -o -n "$(lsof -i :10259)" ]; do
  echo -n "."
  sleep 1
done

While the oc build from the repo works well:

[root@dhcp-140-138 roottest]# /root/oc debug po/openshift-kube-scheduler-ci-ln-jq80922-f76d1-xjrpn-master-0 -c wait-for-host-port
Starting pod/openshift-kube-scheduler-ci-ln-jq80922-f76d1-xjrpn-master-0-debug, command was: /usr/bin/timeout 30 /bin/bash -c echo -n "Waiting for port :10259 and :10251 to be released."
while [ -n "$(lsof -ni :10251)" -o -n "$(lsof -i :10259)" ]; do
  echo -n "."
  sleep 1
done
Pod IP: 10.0.0.5
If you don't see a command prompt, try pressing enter.
sh-4.4#
sh-4.4# ls
bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
sh-4.4# ps ax
PID TTY STAT TIME COMMAND
1 pts/0 Ss 0:00 /bin/sh
9 pts/0 R+ 0:00 ps ax
sh-4.4# exit
exit

Removing debug pod ...
Verified the bug with the payload and oc version below; I do not see any hang.

[knarra@knarra openshift-client-linux-4.6.0-0.nightly-2021-01-30-211400]$ ./oc version -o yaml
clientVersion:
  buildDate: "2021-01-30T16:34:42Z"
  compiler: gc
  gitCommit: 18d7461aca47e77cefb355339252a8d4c149188f
  gitTreeState: clean
  gitVersion: 4.6.0-202101301510.p0-18d7461
  goVersion: go1.15.5
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.6.0-0.nightly-2021-01-30-211400
releaseClientVersion: 4.6.0-0.nightly-2021-01-30-211400
serverVersion:
  buildDate: "2021-01-28T07:35:27Z"
  compiler: gc
  gitCommit: e49167aad6a08046be6ab21ff13029110c76951d
  gitTreeState: clean
  gitVersion: v1.19.0+e49167a
  goVersion: go1.15.5
  major: "1"
  minor: "19"
  platform: linux/amd64

Do not see any hang:
======================
[knarra@knarra openshift-client-linux-4.6.0-0.nightly-2021-01-30-211400]$ ./oc debug po/openshift-kube-scheduler-xiuwang-sharegcp-gs6jh-m-0.c.openshift-qe.internal -c wait-for-host-port -n openshift-kube-scheduler
Starting pod/openshift-kube-scheduler-xiuwang-sharegcp-gs6jh-m-0copenshift-q-debug, command was: /usr/bin/timeout 30 /bin/bash -c echo -n "Waiting for port :10259 and :10251 to be released."
while [ -n "$(lsof -ni :10251)" -o -n "$(lsof -i :10259)" ]; do
  echo -n "."
  sleep 1
done
Pod IP: 10.0.0.7
If you don't see a command prompt, try pressing enter.
sh-4.4# ls
bin dev home lib64 media opt root sbin sys usr
boot etc lib lost+found mnt proc run srv tmp var
sh-4.4# exit
exit

Removing debug pod ...

With the previous version of oc I see that it hangs:
==================================================
[knarra@knarra openshift-client-linux-4.6.10]$ ./oc debug po/openshift-kube-scheduler-xiuwang-sharegcp-gs6jh-m-0.c.openshift-qe.internal -c wait-for-host-port -n openshift-kube-scheduler
Starting pod/openshift-kube-scheduler-xiuwang-sharegcp-gs6jh-m-0copenshift-q-debug, command was: /usr/bin/timeout 30 /bin/bash -c echo -n "Waiting for port :10259 and :10251 to be released."
while [ -n "$(lsof -ni :10251)" -o -n "$(lsof -i :10259)" ]; do
  echo -n "."
  sleep 1
done
^C
Removing debug pod ...

Based on the above, moving the bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.6.16 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0308