Description of problem:
A second attempt to run sosreport from another debug pod on the same node fails with:

```
bash: sosreport: command not found
```

Workaround: remove the container and start with a fresh toolbox:

```
sudo podman rm 'toolbox-'
toolbox
sosreport
```

Version-Release number of selected component (if applicable):
Cluster version: 4.7.0-0.nightly-2021-01-10-070949
Kubernetes Version: v1.20.0+394a5a3
toolbox version: toolbox-0.0.8-1.rhaos4.7.el8.noarch
IMAGE=registry.redhat.io/rhel8/support-tools:latest

How reproducible:
Always

Steps to Reproduce:
1. oc debug node/master-0-2
2. chroot /host
3. toolbox
4. sosreport
5. exit
6. exit
7. oc debug node/master-0-2 (same node)
8. chroot /host
9. toolbox
10. sosreport

Actual results:
```
[root@toolbox /]# sosreport
bash: sosreport: command not found
```

Expected results:
```
[root@toolbox /]# sosreport --allow-system-changes

sosreport (version 3.9)

This command will collect diagnostic and configuration information from
this Red Hat Enterprise Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp/sos.n5azqk3d and may be provided to a Red Hat support
representative.

Any information provided to Red Hat will be treated in accordance with
the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.
```

Additional info:
```
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-2
Starting pod/master-0-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.123.100
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# toolbox
Error: error creating container storage: the container name "support-tools" is already in use by "d8bda1396275e28a891c3691e291842996008507d0c041bda47eb385857da90c". You have to remove that container to be able to reuse that name.: that name is already in use
Error: `/proc/self/exe run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest` failed: exit status 125
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'
[root@toolbox /]# sosreport --allow-system-changes

(... same sosreport 3.9 banner as in Expected results ...)

Press ENTER to continue, or CTRL-C to quit.

Please enter the case id that you are generating this report for []: 123456

Setting up archive ...
Setting up plugins ...
Running plugins. Please wait ...

  Finishing plugins              [Running: systemd]
  Finished running plugins
Creating compressed archive...

Your sosreport has been generated and saved in:
  /var/tmp/sosreport-toolbox-123456-2021-01-12-mhrpzuf.tar.xz

Size    5.87MiB
Owner   root
md5     a10ed7e6a161dd57a853d82736966945

Please send this file to your support representative.

[root@toolbox /]# ls /var/tmp/
sosreport-toolbox-123456-2021-01-12-mhrpzuf.tar.xz      sosreport-toolbox-123456-2021-01-12-nigglku.tar.xz
sosreport-toolbox-123456-2021-01-12-mhrpzuf.tar.xz.md5  sosreport-toolbox-123456-2021-01-12-nigglku.tar.xz.md5
[root@toolbox /]# exit
exit
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-2
Starting pod/master-0-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.123.100
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# toolbox
(... same "support-tools" name-in-use errors as above ...)
Container 'toolbox-' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-')
toolbox-
Container started successfully. To exit, type 'exit'.
[root@toolbox /]# sosreport
bash: sosreport: command not found
[root@toolbox /]# exit
exit
Error: exec session exited with non-zero exit code 1: OCI runtime error
sh-4.4# sudo podman rm 'toolbox-'
46bb3a49eca5484b4c56b1f3fc074dd8f0055bf686ecb301be93d83a9cb70fa4
sh-4.4# toolbox
(... same "support-tools" name-in-use errors as above ...)
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'
[root@toolbox /]# sosreport

(... same sosreport 3.9 banner as in Expected results ...)

Press ENTER to continue, or CTRL-C to quit.
```
The must-gather log can be found at http://file.emea.redhat.com/~elgerman/must-gather-sosreport.tar.gz
The toolbox package is provided by the container-tools module, which RHCOS consumes as part of our OS manifest. Moving to the container-tools component for triage.
I mentioned in BZ 1915318 that toolbox is currently maintained and packaged separately for OCP, so I'm not yet sure this bug is relevant to RHEL either.

This one is really strange, and I was able to reproduce it with OCP 4.6.7 (podman-1.9.3-3.rhaos4.6.el8.x86_64 & toolbox-0.0.8-1.rhaos4.6.el8):

```
# ./oc debug node/worker0
Starting pod/worker0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.130.20
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# bash -x /usr/bin/toolbox
+ set -eo pipefail
+ trap cleanup EXIT
+ REGISTRY=registry.redhat.io
+ IMAGE=rhel8/support-tools
+ TOOLBOX_NAME=toolbox-
+ TOOLBOXRC=/root/.toolboxrc
+ main
+ setup
+ '[' -f /root/.toolboxrc ']'
+ TOOLBOX_IMAGE=registry.redhat.io/rhel8/support-tools
+ [[ '' =~ ^(--help|-h)$ ]]
+ run
+ image_exists
+ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ image_runlabel
++ sudo podman container runlabel --display RUN registry.redhat.io/rhel8/support-tools
+ local 'runlabel=command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest'
+ container_exists
+ sudo podman inspect toolbox-
+ echo 'Spawning a container '\''toolbox-'\'' with image '\''registry.redhat.io/rhel8/support-tools'\'''
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'
+ [[ -z command: podman run ... (same runlabel command as above) ... ]]
+ echo 'Detected RUN label in the container image. Using that as the default...'
Detected RUN label in the container image. Using that as the default...
+ container_runlabel
+ sudo podman container runlabel --name toolbox- RUN registry.redhat.io/rhel8/support-tools
command: podman run -it --name toolbox- --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=toolbox- -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest
[root@worker0 /]# env
LANG=C.utf8
HOSTNAME=worker0
container=oci
PWD=/
HOME=/root
HOST=/host
NAME=toolbox-
TERM=xterm
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
IMAGE=registry.redhat.io/rhel8/support-tools:latest
_=/usr/bin/env
[root@worker0 /]# ls /root/buildinfo/
Dockerfile-rhel8-support-tools-8.3-18  Dockerfile-ubi8-8.3-227  content_manifests
[root@worker0 /]# which sosreport
/usr/sbin/sosreport
[root@worker0 /]# exit
exit
+ return
+ cleanup
+ sudo podman stop toolbox-
+ cleanup
+ sudo podman stop toolbox-
sh-4.4# bash -x /usr/bin/toolbox
(... same setup trace as above, through `sudo podman inspect toolbox-` ...)
+ echo 'Container '\''toolbox-'\'' already exists. Trying to start...'
Container 'toolbox-' already exists. Trying to start...
+ echo '(To remove the container and start with a fresh toolbox, run: sudo podman rm '\''toolbox-'\'')'
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-')
++ container_state
++ sudo podman inspect toolbox- --format '{{.State.Status}}'
+ local state=exited
+ [[ exited == configured ]]
+ [[ exited == exited ]]
+ container_start
+ sudo podman start toolbox-
toolbox-
+ echo 'Container started successfully. To exit, type '\''exit'\''.'
Container started successfully. To exit, type 'exit'.
+ container_exec
+ local cmd=
+ '[' '!' -n '' ']'
++ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ jq -re '.[].Config.Cmd[0]'
+ cmd=/usr/bin/bash
+ sudo podman exec --env LANG= --env TERM=xterm --tty --interactive toolbox- /usr/bin/bash
[root@worker0 /]# which sosreport
/usr/bin/which: no sosreport in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
[root@worker0 /]# env
LANG=C.utf8
HOSTNAME=worker0
S_COLORS=auto
container=oci
PWD=/
HOME=/root
HOST=/host
NAME=toolbox-
TERM=xterm
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
IMAGE=registry.redhat.io/rhel8/support-tools:latest
LESSOPEN=||/usr/bin/lesspipe.sh %s
_=/usr/bin/env
[root@worker0 /]# ls /root/buildinfo/
Dockerfile-openshift-ose-base-v4.0-202011210036.4385   Dockerfile-openshift-ose-cli-v4.6.0-202011261617.p0
Dockerfile-openshift-ose-base-v4.6.0-202011261617.p0   Dockerfile-openshift-ose-tools-v4.6.0-202011261617.p0
Dockerfile-rhel-els-8.2-4                              content_manifests
```

It looks like we're supposed to be in the toolbox container (which uses the support-tools image), but somehow we are seeing the contents of the debug pod/container (which uses the ose-tools image). Maybe this is a weird console interaction between oc->cri-o->toolbox->podman? I don't see this problem if I run toolbox directly on the node.
After discussing with Debarshi and other members of the Desktop team, we are going to move RHCOS-related `toolbox` BZs back to the RHCOS component.
This is a weird one. Running sosreport somehow causes part of the filesystem / mount namespace to revert to the initial debug container's namespace, though not immediately for the running shell:

```
oc debug node/ip-10-0-135-109.ec2.internal
sh-4.4# chroot /host
sh-4.4# toolbox
[root@ip-10-0-135-109 /]# sosreport
[root@ip-10-0-135-109 /]# ls -alhid /usr/sbin/
73402873 dr-xr-xr-x. 1 root root 6 Jan 15 12:26 /usr/sbin/

oc debug node/ip-10-0-135-109.ec2.internal
sh-4.4# ls -alhid /usr/sbin/
249562505 dr-xr-xr-x. 1 root root 4.0K Jan 9 10:49 /usr/sbin/
sh-4.4# chroot /host
sh-4.4# ls -alhid /usr/sbin/
155189423 drwxr-xr-x. 2 root root 12K Jan 1 1970 /usr/sbin/
sh-4.4# podman exec -ti b7d291bc3600 bash
[root@ip-10-0-135-109 /]# ls -alhid /usr/sbin/
249562505 dr-xr-xr-x. 1 root root 4.0K Jan 9 10:49 /usr/sbin/
```

Note the inode numbers: after re-entering, /usr/sbin/ inside the toolbox (249562505) is the debug container's /usr/sbin/, not the toolbox image's.

Workaround while I figure out the root issue:

```
# podman rm toolbox-root
```
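Another way to confirm the namespace flip directly, complementing the inode comparison above (a hedged sketch, not part of the original transcript):

```bash
# Each mount namespace shows up as a distinct symlink target under /proc.
# Run this from the debug pod shell, from the `chroot /host` shell, and
# from the toolbox shell, then compare the values:
readlink /proc/$$/ns/mnt
# e.g. mnt:[4026532213]
# If the re-entered toolbox shell prints the same mnt:[...] value as the
# debug pod rather than its own, it has been dropped back into the debug
# container's mount namespace, matching the /usr/sbin inode flip above.
```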
Alternative workaround:

```
# podman rm support-tools
```
Investigation will continue next sprint.
Since we have a workaround, I'm going to target this for 4.8 as a medium priority issue. If we are able to sort out the root cause + fix, we can backport it to 4.7 with a cloned BZ.
I cannot reproduce this on a standalone node so far, which points to a potential interaction with the debug pod / in-container layout.
Reported https://github.com/sosreport/sos/issues/2436 upstream to start a conversation while I continue investigating.
So this has nothing to do with sosreport; that failure is just the visible consequence of the bug. A shorter reproducer is to start a toolbox, exit it, and re-enter it: sosreport will no longer be available and /usr will have changed (see the sketch below). I can also reproduce this bug on 4.5, so this is an old bug.
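Spelled out, the shorter reproducer looks like this (a sketch of the sequence described above; the node name is a placeholder):

```bash
# Minimal reproducer, no sosreport run needed.
oc debug node/<node>   # opens an interactive debug pod shell
# inside the debug pod:
chroot /host
toolbox                # first entry: /usr/sbin/sosreport is present
exit                   # leave the toolbox shell (container stops)
toolbox                # re-entry: container is restarted and exec'd into
which sosreport        # now fails; /usr shows the debug pod's layout
```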
We think this has something to do with how `podman` handles starting/exiting/re-entering a container from within a `chroot`. Could the team have a look at this and see if they can provide additional triage?
There's no way I will find time this sprint to investigate further given the complexity of the reproducer and the fact that a workaround exists. If anyone can give a reproducer with pure Podman, not Toolbox, it will greatly assist in the investigation - Toolbox's containers are exceedingly complicated and greatly hinder our debugging efforts.
I arrived here from https://github.com/containers/toolbox/issues/919

It seems like Micah's comment 6 needs some clarification.

(In reply to Matthew Heon from comment #21)
> If anyone can give a reproducer with pure Podman, not Toolbox, it will
> greatly assist in the investigation - Toolbox's containers are exceedingly
> complicated and greatly hinder our debugging efforts.

Matt, "toolbox" here isn't the Toolbox that you think it is. :) This isn't https://github.com/containers/toolbox but https://github.com/coreos/toolbox/blob/main/rhcos-toolbox#L89, which is a small wrapper over a relatively simple 'podman run ...' call.
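To make that concrete for triage: stripped of the runlabel plumbing, the wrapper's podman-level flow (distilled from the `bash -x` trace in comment 5) is roughly the following. This is a condensed sketch, not the actual script.

```bash
# First entry: create and run the container from the image's RUN label,
# which expands to a privileged `podman run` sharing the host namespaces:
sudo podman container runlabel --name toolbox- RUN \
    registry.redhat.io/rhel8/support-tools
# i.e. effectively:
#   podman run -it --name toolbox- --privileged --ipc=host --net=host \
#       --pid=host -v /run:/run -v /var/log:/var/log -v /:/host ... \
#       registry.redhat.io/rhel8/support-tools:latest

# Later entries: the container already exists, so it is restarted and
# joined with exec -- the step after which /usr shows the wrong contents:
sudo podman start toolbox-
sudo podman exec --tty --interactive toolbox- /usr/bin/bash
```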
(In reply to Debarshi Ray from comment #23)
> It seems like Micah's comment 6 needs some clarification.

I had moved this BZ to the Desktop team because we were going to migrate from `coreos/toolbox` to `containers/toolbox` and thought this would be a use case that should be covered by `containers/toolbox`. Since we've changed plans and will continue to use `coreos/toolbox` in RHCOS, the bug was moved back to the OCP/RHCOS component.
*** Bug 2001927 has been marked as a duplicate of this bug. ***
(In reply to Simon Krenger from comment #34)
> Could we get an update on this issue?
> Is there any other information that we can provide?

Unfortunately, higher priority work has prevented additional investigation of this problem. We recently landed some changes upstream to `rhcos-toolbox`, though I don't think they will specifically address this problem. We'll need to retest this scenario with those changes in place.
*** Bug 2093037 has been marked as a duplicate of this bug. ***
I took the suggestion from comment #28 from @mheon and tried using `crun` as the runtime for `podman` via a config file in `/etc/containers/containers.conf`, but that didn't seem to help things.

However, doing some experimentation with `podman attach` and `podman start --attach` was ultimately successful:

```
sh-4.4# podman create --hostname toolbox --name "${TOOLBOX_NAME}" --privileged --net=host --pid=host --ipc=host --tty --interactive -e HOST=/host -e NAME="${TOOLBOX_NAME}" -e IMAGE="${IMAGE}" --security-opt label=disable --volume /run:/run --volume /var/log:/var/log --volume /etc/machine-id:/etc/machine-id --volume /tc/localtime:/etc/localtime --volume /:/host "${TOOLBOX_IMAGE}"
03758dba71ddfcd988ddefbffabe2c5206b97996699a31abbc9d22763e10ea34
sh-4.4# podman start --attach toolbox-root
[root@toolbox /]# sos
usage: sos <component> [options]

Available components:
  report, rep           Collect files and command output in an archive
  clean, cleaner, mask  Obfuscate sensitive networking information in a report
  collect, collector    Collect an sos report from multiple nodes simultaneously

sos: error: the following arguments are required: component
```

```
sh-4.4# podman create ... (same create command as above) ...
f65799396b3fde7e66bccf3cc278de3f16084e963a2619ad735c1c2340d9b163
sh-4.4# podman start toolbox-root
toolbox-root
sh-4.4# podman attach toolbox-root
[root@toolbox /]# sos
usage: sos <component> [options]
(... same sos usage output as above ...)
sos: error: the following arguments are required: component
```

So we should make the overdue change to update `toolbox` to use `podman start --attach` in place of `podman exec`.
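In script terms, the change amounts to replacing the exec-based re-entry with an attach-based one. A before/after sketch, using the container name from the experiments above (the actual patch landed upstream later):

```bash
# Before: restart the container, then join it with a fresh exec session --
# the path that ends up in the wrong mount namespace:
podman start toolbox-root
podman exec --tty --interactive toolbox-root /usr/bin/bash

# After: start the stopped container and attach to its original console
# instead of exec'ing a new process into it:
podman start --attach toolbox-root
```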
*** Bug 2095371 has been marked as a duplicate of this bug. ***
While it would be nice to get this into OCP 4.11, the code freeze deadline has passed, so we'll have to target it for OCP 4.12. We can easily backport the fix to 4.11.z in the near future.
This has been fixed with https://github.com/coreos/toolbox/pull/81, which was included via https://bugzilla.redhat.com/show_bug.cgi?id=2093040, but we missed updating this bug.