Bug 1915537 - [Metal] sosreport is broken on a second usage from another debug pod for the same node (BareMetal IPI)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Timothée Ravier
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 2001927 2093037 2095371 (view as bug list)
Depends On:
Blocks: 1186913 2104118
 
Reported: 2021-01-12 20:32 UTC by Elena German
Modified: 2024-03-25 17:49 UTC
CC: 24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the `podman exec` command did not work well with nested containers. Users encountered this issue when accessing a node using the `oc debug` command and then running a container with the `toolbox` command. Because of this, users were unable to reuse toolboxes on {op-system}. This fix updates the toolbox library code to account for this behavior, so users can now reuse toolboxes on {op-system}. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1915537[*BZ#1915537*])
Clone Of:
Clones: 2104118 (view as bug list)
Environment:
Last Closed: 2022-10-21 14:54:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github coreos toolbox pull 79 0 None closed Bug 1915537: Workaround "podman exec" using the wrong mount namespace 2022-10-05 14:31:18 UTC
Red Hat Knowledge Base (Solution) 6337801 0 None None None 2021-09-17 11:54:46 UTC

Description Elena German 2021-01-12 20:32:22 UTC
Description of problem:
A second attempt to run sosreport from another debug pod on the same node fails with:
"bash: sosreport: command not found"

Workaround: remove the container and start with a fresh toolbox:
     sudo podman rm 'toolbox-'
     toolbox
     sosreport


Version-Release number of selected component (if applicable):
Cluster version: 4.7.0-0.nightly-2021-01-10-070949
Kubernetes Version: v1.20.0+394a5a3

toolbox version: toolbox-0.0.8-1.rhaos4.7.el8.noarch
IMAGE=registry.redhat.io/rhel8/support-tools:latest


How reproducible:
always


Steps to Reproduce:
1. oc debug node/master-0-2
2. chroot /host
3. toolbox
4. sosreport
5. exit
6. exit
7. oc debug node/master-0-2  (same node)
8. chroot /host
9. toolbox
10. sosreport

Actual results:
[root@toolbox /]# sosreport
bash: sosreport: command not found


Expected results:
[root@toolbox /]# sosreport --allow-system-changes

sosreport (version 3.9)

This command will collect diagnostic and configuration information from
this Red Hat Enterprise Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp/sos.n5azqk3d and may be provided to a Red Hat support
representative.

Any information provided to Red Hat will be treated in accordance with
the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.


Additional info:
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-2
Starting pod/master-0-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.123.100
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# toolbox
Error: error creating container storage: the container name "support-tools" is already in use by "d8bda1396275e28a891c3691e291842996008507d0c041bda47eb385857da90c". You have to remove that container to be able to reuse that name.: that name is already in use
Error: `/proc/self/exe run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest` failed: exit status 125
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'

[root@toolbox /]# sosreport --allow-system-changes

sosreport (version 3.9)

This command will collect diagnostic and configuration information from
this Red Hat Enterprise Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp/sos.n5azqk3d and may be provided to a Red Hat support
representative.

Any information provided to Red Hat will be treated in accordance with
the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Please enter the case id that you are generating this report for []: 123456

 Setting up archive ...
 Setting up plugins ...
 Running plugins. Please wait ...

  Finishing plugins              [Running: systemd]
  Finished running plugins                                                               
Creating compressed archive...

Your sosreport has been generated and saved in:
  /var/tmp/sosreport-toolbox-123456-2021-01-12-mhrpzuf.tar.xz

 Size	5.87MiB
 Owner	root
 md5	a10ed7e6a161dd57a853d82736966945

Please send this file to your support representative.

[root@toolbox /]# ls /var/tmp/
sosreport-toolbox-123456-2021-01-12-mhrpzuf.tar.xz  sosreport-toolbox-123456-2021-01-12-mhrpzuf.tar.xz.md5  sosreport-toolbox-123456-2021-01-12-nigglku.tar.xz	sosreport-toolbox-123456-2021-01-12-nigglku.tar.xz.md5
[root@toolbox /]# exit
exit

[kni@provisionhost-0-0 ~]$ oc debug node/master-0-2
Starting pod/master-0-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.123.100
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# toolbox
Error: error creating container storage: the container name "support-tools" is already in use by "d8bda1396275e28a891c3691e291842996008507d0c041bda47eb385857da90c". You have to remove that container to be able to reuse that name.: that name is already in use
Error: `/proc/self/exe run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest` failed: exit status 125
Container 'toolbox-' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-')
toolbox-
Container started successfully. To exit, type 'exit'.
[root@toolbox /]# sosreport
bash: sosreport: command not found
[root@toolbox /]# exit
exit
Error: exec session exited with non-zero exit code 1: OCI runtime error
sh-4.4# sudo podman rm 'toolbox-'
46bb3a49eca5484b4c56b1f3fc074dd8f0055bf686ecb301be93d83a9cb70fa4
sh-4.4# toolbox
Error: error creating container storage: the container name "support-tools" is already in use by "d8bda1396275e28a891c3691e291842996008507d0c041bda47eb385857da90c". You have to remove that container to be able to reuse that name.: that name is already in use
Error: `/proc/self/exe run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest` failed: exit status 125
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'
[root@toolbox /]# sosreport

sosreport (version 3.9)

This command will collect diagnostic and configuration information from
this Red Hat Enterprise Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp/sos.3as43d85 and may be provided to a Red Hat support
representative.

Any information provided to Red Hat will be treated in accordance with
the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Comment 1 Elena German 2021-01-12 20:45:04 UTC
The must-gather log can be found at http://file.emea.redhat.com/~elgerman/must-gather-sosreport.tar.gz

Comment 2 Micah Abbott 2021-01-12 22:08:32 UTC
The toolbox package is provided by the container-tools module, which RHCOS consumes as part of our OS manifest.

Moving to the container-tools component for triage

Comment 5 Derrick Ornelas 2021-01-13 15:46:33 UTC
I mentioned in BZ 1915318 that toolbox is currently maintained and packaged separately for OCP, so I'm not yet sure this bug is relevant to RHEL either. 

This one is really strange, and I was able to reproduce it with OCP 4.6.7 (podman-1.9.3-3.rhaos4.6.el8.x86_64 & toolbox-0.0.8-1.rhaos4.6.el8)

# ./oc debug node/worker0
Starting pod/worker0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.130.20
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host

sh-4.4# bash -x /usr/bin/toolbox
+ set -eo pipefail
+ trap cleanup EXIT
+ REGISTRY=registry.redhat.io
+ IMAGE=rhel8/support-tools
+ TOOLBOX_NAME=toolbox-
+ TOOLBOXRC=/root/.toolboxrc
+ main
+ setup
+ '[' -f /root/.toolboxrc ']'
+ TOOLBOX_IMAGE=registry.redhat.io/rhel8/support-tools
+ [[ '' =~ ^(--help|-h)$ ]]
+ run
+ image_exists
+ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ image_runlabel
++ sudo podman container runlabel --display RUN registry.redhat.io/rhel8/support-tools
+ local 'runlabel=command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest'
+ container_exists
+ sudo podman inspect toolbox-
+ echo 'Spawning a container '\''toolbox-'\'' with image '\''registry.redhat.io/rhel8/support-tools'\'''
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'
+ [[ -z command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest ]]
+ echo 'Detected RUN label in the container image. Using that as the default...'
Detected RUN label in the container image. Using that as the default...
+ container_runlabel
+ sudo podman container runlabel --name toolbox- RUN registry.redhat.io/rhel8/support-tools
command: podman run -it --name toolbox- --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=toolbox- -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest

[root@worker0 /]# env
LANG=C.utf8
HOSTNAME=worker0
container=oci
PWD=/
HOME=/root
HOST=/host
NAME=toolbox-
TERM=xterm
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
IMAGE=registry.redhat.io/rhel8/support-tools:latest
_=/usr/bin/env

[root@worker0 /]# ls  /root/buildinfo/
Dockerfile-rhel8-support-tools-8.3-18  Dockerfile-ubi8-8.3-227	content_manifests

[root@worker0 /]# which sosreport
/usr/sbin/sosreport

[root@worker0 /]# exit
exit
+ return
+ cleanup
+ sudo podman stop toolbox-
+ cleanup
+ sudo podman stop toolbox-


sh-4.4# bash -x /usr/bin/toolbox
+ set -eo pipefail
+ trap cleanup EXIT
+ REGISTRY=registry.redhat.io
+ IMAGE=rhel8/support-tools
+ TOOLBOX_NAME=toolbox-
+ TOOLBOXRC=/root/.toolboxrc
+ main
+ setup
+ '[' -f /root/.toolboxrc ']'
+ TOOLBOX_IMAGE=registry.redhat.io/rhel8/support-tools
+ [[ '' =~ ^(--help|-h)$ ]]
+ run
+ image_exists
+ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ image_runlabel
++ sudo podman container runlabel --display RUN registry.redhat.io/rhel8/support-tools
+ local 'runlabel=command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest'
+ container_exists
+ sudo podman inspect toolbox-
+ echo 'Container '\''toolbox-'\'' already exists. Trying to start...'
Container 'toolbox-' already exists. Trying to start...
+ echo '(To remove the container and start with a fresh toolbox, run: sudo podman rm '\''toolbox-'\'')'
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-')
++ container_state
++ sudo podman inspect toolbox- --format '{{.State.Status}}'
+ local state=exited
+ [[ exited == configured ]]
+ [[ exited == exited ]]
+ container_start
+ sudo podman start toolbox-
toolbox-
+ echo 'Container started successfully. To exit, type '\''exit'\''.'
Container started successfully. To exit, type 'exit'.
+ container_exec
+ local cmd=
+ '[' '!' -n '' ']'
++ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ jq -re '.[].Config.Cmd[0]'
+ cmd=/usr/bin/bash
+ sudo podman exec --env LANG= --env TERM=xterm --tty --interactive toolbox- /usr/bin/bash

[root@worker0 /]# which sosreport
/usr/bin/which: no sosreport in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

[root@worker0 /]# env
LANG=C.utf8
HOSTNAME=worker0
S_COLORS=auto
container=oci
PWD=/
HOME=/root
HOST=/host
NAME=toolbox-
TERM=xterm
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
IMAGE=registry.redhat.io/rhel8/support-tools:latest
LESSOPEN=||/usr/bin/lesspipe.sh %s
_=/usr/bin/env

[root@worker0 /]# ls /root/buildinfo/
Dockerfile-openshift-ose-base-v4.0-202011210036.4385  Dockerfile-openshift-ose-cli-v4.6.0-202011261617.p0    Dockerfile-rhel-els-8.2-4
Dockerfile-openshift-ose-base-v4.6.0-202011261617.p0  Dockerfile-openshift-ose-tools-v4.6.0-202011261617.p0  content_manifests


It looks like we're supposed to be in the toolbox container (which uses the support-tools image), but somehow we are seeing the contents of the debug pod/container (which uses the ose-tools image). Maybe this is a weird console interaction between oc->cri-o->toolbox->podman? I don't see this problem if I run toolbox directly on the node.

Comment 6 Micah Abbott 2021-01-14 15:54:19 UTC
After discussing with Debarshi and other members of the Desktop team, we are going to move RHCOS related `toolbox` BZs back to the RHCOS component.

Comment 7 Timothée Ravier 2021-01-15 12:44:46 UTC
This is a weird one. Running sosreport somehow reverts part of the filesystem / mount namespace back to the initial debug container's namespace, though not immediately for the running shell:

```
oc debug node/ip-10-0-135-109.ec2.internal
sh-4.4# chroot /host
sh-4.4# toolbox
[root@ip-10-0-135-109 /]# sosreport
[root@ip-10-0-135-109 /]# ls -alhid /usr/sbin/
73402873 dr-xr-xr-x. 1 root root 6 Jan 15 12:26 /usr/sbin/

oc debug node/ip-10-0-135-109.ec2.internal
sh-4.4# ls -alhid /usr/sbin/
249562505 dr-xr-xr-x. 1 root root 4.0K Jan  9 10:49 /usr/sbin/
sh-4.4# chroot /host
sh-4.4# ls -alhid /usr/sbin/
155189423 drwxr-xr-x. 2 root root 12K Jan  1  1970 /usr/sbin/
sh-4.4# podman exec -ti b7d291bc3600 bash
[root@ip-10-0-135-109 /]# ls -alhid /usr/sbin/
249562505 dr-xr-xr-x. 1 root root 4.0K Jan  9 10:49 /usr/sbin/
```

Workaround while I figure out the root issue:

```
# podman rm toolbox-root
```

Comment 8 Timothée Ravier 2021-01-15 12:56:00 UTC
Alternative workaround:

```
# podman rm support-tools
```
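Both workarounds amount to the same thing: remove whatever stale toolbox container was left behind so the next `toolbox` run starts fresh. A minimal sketch (the helper name is illustrative; `toolbox-root` and `support-tools` are the container names seen in this bug, and failures for names that don't exist are deliberately ignored):

```shell
# Remove any stale toolbox containers so the next `toolbox` run starts fresh.
# Errors for container names that do not exist on this node are ignored.
reset_toolbox() {
    local name
    for name in toolbox-root support-tools; do
        sudo podman rm "$name" 2>/dev/null || true
    done
}
```

After running this on the node, `toolbox` spawns a fresh container and `sosreport` is available again.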

Comment 9 Timothée Ravier 2021-01-15 15:31:16 UTC
Investigation will continue next sprint.

Comment 10 Micah Abbott 2021-01-18 20:17:25 UTC
Since we have a workaround, I'm going to target this for 4.8 as a medium priority issue.  If we are able to sort out the root cause + fix, we can backport it to 4.7 with a cloned BZ.

Comment 15 Timothée Ravier 2021-03-05 14:24:26 UTC
I cannot reproduce this on a standalone node so far, which points to a potential interaction with the debug pod / in-container layout.

Comment 16 Timothée Ravier 2021-03-08 16:21:39 UTC
Reported https://github.com/sosreport/sos/issues/2436 upstream to start a conversation while I continue investigating.

Comment 17 Timothée Ravier 2021-03-08 18:00:27 UTC
So this has nothing to do with sosreport; that is just the visible consequence of the bug. A shorter reproducer is starting a toolbox, exiting it, and re-entering it: sosreport will no longer be available and /usr will have changed. I can also reproduce this in 4.5, so this is an old bug.

Comment 19 Micah Abbott 2021-06-02 15:15:27 UTC
We think this is something to do with how `podman` is handling starting/exiting/re-entering a container from within a `chroot`.

Could the team have a look at this and see if they can provide additional triage?

Comment 21 Matthew Heon 2021-06-11 18:43:10 UTC
There's no way I will find time this sprint to investigate further given the complexity of the reproducer and the fact that a workaround exists.

If anyone can give a reproducer with pure Podman, not Toolbox, it will greatly assist in the investigation - Toolbox's containers are exceedingly complicated and greatly hinder our debugging efforts.

Comment 23 Debarshi Ray 2021-11-13 00:39:48 UTC
I arrived here from https://github.com/containers/toolbox/issues/919

It seems like Micah's comment 6 needs some clarification.

(In reply to Matthew Heon from comment #21)
> If anyone can give a reproducer with pure Podman, not Toolbox, it will
> greatly assist in the investigation - Toolbox's containers are exceedingly
> complicated and greatly hinder our debugging efforts.

Matt, "toolbox" here isn't the Toolbox that you think it is. :)

This isn't https://github.com/containers/toolbox but https://github.com/coreos/toolbox/blob/main/rhcos-toolbox#L89, which is a small wrapper over a relatively simple 'podman run ...' call.
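For reference, the core of that wrapper really is one privileged, host-namespace `podman run`. A simplified sketch of that shape (the function name and defaults here are illustrative, not the actual rhcos-toolbox source; the real flags are visible in the RUN label traces above):

```shell
# Sketch of the rhcos-toolbox shape: build a single privileged podman run
# command that shares the host's IPC/net/PID namespaces and mounts / at /host.
# The command is echoed rather than executed so it can be inspected.
toolbox_run_cmd() {
    local image="registry.redhat.io/rhel8/support-tools:latest"
    local name="toolbox-${USER:-root}"
    echo podman run -it --name "$name" --privileged \
        --ipc=host --net=host --pid=host \
        -e HOST=/host -e NAME="$name" -e IMAGE="$image" \
        -v /run:/run -v /var/log:/var/log -v /:/host "$image"
}
```

Because it is a plain wrapper like this, any pure-podman reproducer reduces to replaying that single command from inside a debug pod.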

Comment 24 Micah Abbott 2021-12-03 14:08:23 UTC
(In reply to Debarshi Ray from comment #23)
 
> It seems like Micah's comment 6 needs some clarification.

I had moved this BZ to the Desktop team because we were going to migrate from `coreos/toolbox` to `containers/toolbox` and thought this would be a use case that should be covered by `containers/toolbox`. Since we've changed plans and continue to use `coreos/toolbox` in RHCOS, it was moved back to the OCP/RHCOS component.

Comment 25 Timothée Ravier 2021-12-03 14:53:47 UTC
*** Bug 2001927 has been marked as a duplicate of this bug. ***

Comment 35 Micah Abbott 2022-04-08 16:10:48 UTC
(In reply to Simon Krenger from comment #34)
> Could we get an update on this issue?
> Is there any other information that we can provide?

Unfortunately, higher priority work has prevented additional investigation of this problem.  

We recently landed some changes upstream to `rhcos-toolbox`, though I don't think they will specifically address this problem.  We'll need to retest this scenario with those changes in place.

Comment 40 Micah Abbott 2022-06-02 19:57:59 UTC
*** Bug 2093037 has been marked as a duplicate of this bug. ***

Comment 41 Micah Abbott 2022-06-02 20:02:27 UTC
I took the suggestion from comment #28 from @mheon and tried using `crun` as the runtime for `podman` via `/etc/containers/containers.conf`, but that didn't seem to help.

However, doing some experimentation with `podman attach` and `podman start --attach` was ultimately successful:

```
sh-4.4# podman create --hostname toolbox --name "${TOOLBOX_NAME}" --privileged --net=host --pid=host --ipc=host --tty --interactive -e HOST=/host -e NAME="${TOOLBOX_NAME}" -e IMAGE="${IMAGE}" --security-opt label=disable --volume /run:/run --volume /var/log:/var/log --volume /etc/machine-id:/etc/machine-id --volume /tc/localtime:/etc/localtime --volume /:/host "${TOOLBOX_IMAGE}"
03758dba71ddfcd988ddefbffabe2c5206b97996699a31abbc9d22763e10ea34
sh-4.4# podman start --attach toolbox-root
[root@toolbox /]# sos
usage: sos <component> [options]

Available components:
        report, rep                   Collect files and command output in an archive
        clean, cleaner, mask          Obfuscate sensitive networking information in a report
        collect, collector            Collect an sos report from multiple nodes simultaneously
sos: error: the following arguments are required: component
```

```
sh-4.4# podman create --hostname toolbox --name "${TOOLBOX_NAME}" --privileged --net=host --pid=host --ipc=host --tty --interactive -e HOST=/host -e NAME="${TOOLBOX_NAME}" -e IMAGE="${IMAGE}" --security-opt label=disable --volume /run:/run --volume /var/log:/var/log --volume /etc/machine-id:/etc/machine-id --volume /tc/localtime:/etc/localtime --volume /:/host "${TOOLBOX_IMAGE}"
f65799396b3fde7e66bccf3cc278de3f16084e963a2619ad735c1c2340d9b163
sh-4.4# podman start toolbox-root
toolbox-root
sh-4.4# podman attach toolbox-root
[root@toolbox /]# sos
usage: sos <component> [options]

Available components:
        report, rep                   Collect files and command output in an archive
        clean, cleaner, mask          Obfuscate sensitive networking information in a report
        collect, collector            Collect an sos report from multiple nodes simultaneously
sos: error: the following arguments are required: component
```

So we should make the overdue change: update `toolbox` to use `podman start --attach` in place of `podman exec`.
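The resulting re-entry flow can be sketched as: attach to an existing container with `podman start --attach` (which lands in the container's own mount namespace) rather than `podman exec` (which picked up the wrong namespace from a second debug pod). The function name and trimmed-down flags below are illustrative, not the actual patch:

```shell
# Enter the toolbox, reusing an existing container when present.
# `podman start --attach` re-enters the container in its own mount namespace;
# `podman exec` is avoided because it attached to the wrong namespace when
# invoked from a second debug pod (the bug tracked here).
enter_toolbox() {
    local name="$1"
    if podman container exists "$name"; then
        podman start --attach "$name"
    else
        podman run -it --name "$name" --privileged --pid=host \
            -v /:/host registry.redhat.io/rhel8/support-tools:latest
    fi
}
```

With this shape, exiting and re-entering the same container keeps /usr and sosreport intact, matching the `podman attach` experiments above.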

Comment 42 Micah Abbott 2022-06-09 18:27:42 UTC
*** Bug 2095371 has been marked as a duplicate of this bug. ***

Comment 43 Micah Abbott 2022-07-05 14:51:48 UTC
While it would be nice to get this into OCP 4.11, the code freeze deadline has passed, so we'll have to target this for OCP 4.12.

We can easily backport this to 4.11.z in the near future.

Comment 46 Timothée Ravier 2022-10-21 14:54:15 UTC
This has been fixed by https://github.com/coreos/toolbox/pull/81, which was included via https://bugzilla.redhat.com/show_bug.cgi?id=2093040, but we missed updating this bug.

