Bug 1915537
| Field | Value |
|---|---|
| Summary | [Metal] sosreport is broken on a second usage from another debug pod for the same node (BareMetal IPI) |
| Product | OpenShift Container Platform |
| Component | RHCOS |
| Version | 4.7 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Reporter | Elena German <elgerman> |
| Assignee | Timothée Ravier <travier> |
| QA Contact | Michael Nguyen <mnguyen> |
| Docs Contact | |
| CC | aaradhak, agogala, aos-bugs, bbreard, beth.white, debarshir, dornelas, dwalsh, imcleod, jdohmann, jerzhang, jligon, mheon, miabbott, mrobson, mrussell, npinaeva, nstielau, oarribas, rbrattai, skrenger, smilner, travier, tsweeney |
| Target Milestone | --- |
| Target Release | 4.12.0 |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | |
| Clones | 2104118 (view as bug list) |
| Environment | |
| Last Closed | 2022-10-21 14:54:15 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1186913, 2104118 |

Doc Text:

> * Previously, the `podman exec` command did not work well with nested containers. Users encountered this issue when accessing a node using the `oc debug` command and then running a container with the `toolbox` command. Because of this, users were unable to reuse toolboxes on {op-system}. This fix updates the toolbox library code to account for this behavior, so users can now reuse toolboxes on {op-system}. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1915537[*BZ#1915537*])
Description (Elena German, 2021-01-12 20:32:22 UTC)
The must-gather log can be found at http://file.emea.redhat.com/~elgerman/must-gather-sosreport.tar.gz

The toolbox package is provided by the container-tools module, which RHCOS consumes as part of our OS manifest. Moving to the container-tools component for triage.

I mentioned in BZ 1915318 that toolbox is currently maintained and packaged separately for OCP, so I'm not yet sure this bug is relevant to RHEL either.

This one is really strange, and I was able to reproduce it with OCP 4.6.7 (podman-1.9.3-3.rhaos4.6.el8.x86_64 & toolbox-0.0.8-1.rhaos4.6.el8):

```
# ./oc debug node/worker0
Starting pod/worker0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.130.20
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# bash -x /usr/bin/toolbox
+ set -eo pipefail
+ trap cleanup EXIT
+ REGISTRY=registry.redhat.io
+ IMAGE=rhel8/support-tools
+ TOOLBOX_NAME=toolbox-
+ TOOLBOXRC=/root/.toolboxrc
+ main
+ setup
+ '[' -f /root/.toolboxrc ']'
+ TOOLBOX_IMAGE=registry.redhat.io/rhel8/support-tools
+ [[ '' =~ ^(--help|-h)$ ]]
+ run
+ image_exists
+ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ image_runlabel
++ sudo podman container runlabel --display RUN registry.redhat.io/rhel8/support-tools
+ local 'runlabel=command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest'
+ container_exists
+ sudo podman inspect toolbox-
+ echo 'Spawning a container '\''toolbox-'\'' with image '\''registry.redhat.io/rhel8/support-tools'\'''
Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel8/support-tools'
+ [[ -z command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest ]]
+ echo 'Detected RUN label in the container image. Using that as the default...'
Detected RUN label in the container image. Using that as the default...
+ container_runlabel
+ sudo podman container runlabel --name toolbox- RUN registry.redhat.io/rhel8/support-tools
command: podman run -it --name toolbox- --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=toolbox- -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest
[root@worker0 /]# env
LANG=C.utf8
HOSTNAME=worker0
container=oci
PWD=/
HOME=/root
HOST=/host
NAME=toolbox-
TERM=xterm
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
IMAGE=registry.redhat.io/rhel8/support-tools:latest
_=/usr/bin/env
[root@worker0 /]# ls /root/buildinfo/
Dockerfile-rhel8-support-tools-8.3-18  Dockerfile-ubi8-8.3-227  content_manifests
[root@worker0 /]# which sosreport
/usr/sbin/sosreport
[root@worker0 /]# exit
exit
+ return
+ cleanup
+ sudo podman stop toolbox-
+ cleanup
+ sudo podman stop toolbox-
sh-4.4# bash -x /usr/bin/toolbox
+ set -eo pipefail
+ trap cleanup EXIT
+ REGISTRY=registry.redhat.io
+ IMAGE=rhel8/support-tools
+ TOOLBOX_NAME=toolbox-
+ TOOLBOXRC=/root/.toolboxrc
+ main
+ setup
+ '[' -f /root/.toolboxrc ']'
+ TOOLBOX_IMAGE=registry.redhat.io/rhel8/support-tools
+ [[ '' =~ ^(--help|-h)$ ]]
+ run
+ image_exists
+ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ image_runlabel
++ sudo podman container runlabel --display RUN registry.redhat.io/rhel8/support-tools
+ local 'runlabel=command: podman run -it --name support-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=support-tools -e IMAGE=registry.redhat.io/rhel8/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel8/support-tools:latest'
+ container_exists
+ sudo podman inspect toolbox-
+ echo 'Container '\''toolbox-'\'' already exists. Trying to start...'
Container 'toolbox-' already exists. Trying to start...
+ echo '(To remove the container and start with a fresh toolbox, run: sudo podman rm '\''toolbox-'\'')'
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-')
++ container_state
++ sudo podman inspect toolbox- --format '{{.State.Status}}'
+ local state=exited
+ [[ exited == configured ]]
+ [[ exited == exited ]]
+ container_start
+ sudo podman start toolbox-
toolbox-
+ echo 'Container started successfully. To exit, type '\''exit'\''.'
Container started successfully. To exit, type 'exit'.
+ container_exec
+ local cmd=
+ '[' '!' -n '' ']'
++ sudo podman inspect registry.redhat.io/rhel8/support-tools
++ jq -re '.[].Config.Cmd[0]'
+ cmd=/usr/bin/bash
+ sudo podman exec --env LANG= --env TERM=xterm --tty --interactive toolbox- /usr/bin/bash
[root@worker0 /]# which sosreport
/usr/bin/which: no sosreport in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
[root@worker0 /]# env
LANG=C.utf8
HOSTNAME=worker0
S_COLORS=auto
container=oci
PWD=/
HOME=/root
HOST=/host
NAME=toolbox-
TERM=xterm
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
IMAGE=registry.redhat.io/rhel8/support-tools:latest
LESSOPEN=||/usr/bin/lesspipe.sh %s
_=/usr/bin/env
[root@worker0 /]# ls /root/buildinfo/
Dockerfile-openshift-ose-base-v4.0-202011210036.4385  Dockerfile-openshift-ose-cli-v4.6.0-202011261617.p0  Dockerfile-rhel-els-8.2-4  Dockerfile-openshift-ose-base-v4.6.0-202011261617.p0  Dockerfile-openshift-ose-tools-v4.6.0-202011261617.p0  content_manifests
```

It looks like we're supposed to be in the toolbox container (which uses the support-tools image), but somehow we are seeing the contents of the debug pod/container (which uses the ose-tools image). Maybe this is a weird console interaction between oc->cri-o->toolbox->podman? I don't see this problem if I run toolbox directly on the node.

After discussing with Debarshi and other members of the Desktop team, we are going to move RHCOS related `toolbox` BZs back to the RHCOS component.

This is a weird one. Running sosreport somehow changes part of the filesystem / mount namespace, reverting it back to the initial debug container namespace (though not immediately for the running shell):

```
oc debug node/ip-10-0-135-109.ec2.internal
sh-4.4# chroot /host
sh-4.4# toolbox
[root@ip-10-0-135-109 /]# sosreport
[root@ip-10-0-135-109 /]# ls -alhid /usr/sbin/
73402873 dr-xr-xr-x. 1 root root 6 Jan 15 12:26 /usr/sbin/

oc debug node/ip-10-0-135-109.ec2.internal
sh-4.4# ls -alhid /usr/sbin/
249562505 dr-xr-xr-x. 1 root root 4.0K Jan 9 10:49 /usr/sbin/
sh-4.4# chroot /host
sh-4.4# ls -alhid /usr/sbin/
155189423 drwxr-xr-x. 2 root root 12K Jan 1 1970 /usr/sbin/
sh-4.4# podman exec -ti b7d291bc3600 bash
[root@ip-10-0-135-109 /]# ls -alhid /usr/sbin/
249562505 dr-xr-xr-x. 1 root root 4.0K Jan 9 10:49 /usr/sbin/
```

Workaround while I figure out the root issue:

```
# podman rm toolbox-root
```

Alternative workaround:

```
# podman rm support-tools
```

Investigation will continue next sprint. Since we have a workaround, I'm going to target this for 4.8 as a medium priority issue. If we are able to sort out the root cause and fix, we can backport it to 4.7 with a cloned BZ.

I can not reproduce this on a standalone node so far, which points to a potential interaction with the debug pod / in-container layout.

Reported https://github.com/sosreport/sos/issues/2436 upstream to start a conversation while I continue investigating.

So this has nothing to do with sosreport; that is just the visible consequence of the bug. A shorter reproducer is starting a toolbox, exiting it, and re-entering it.
sosreport will no longer be available and /usr will have changed. I can also reproduce this bug in 4.5, so this is an old bug.

We think this is something to do with how `podman` handles starting/exiting/re-entering a container from within a `chroot`. Could the team have a look at this and see if they can provide additional triage? There's no way I will find time this sprint to investigate further, given the complexity of the reproducer and the fact that a workaround exists.

If anyone can give a reproducer with pure Podman, not Toolbox, it will greatly assist in the investigation - Toolbox's containers are exceedingly complicated and greatly hinder our debugging efforts.

I arrived here from https://github.com/containers/toolbox/issues/919 It seems like Micah's comment 6 needs some clarification.

(In reply to Matthew Heon from comment #21)
> If anyone can give a reproducer with pure Podman, not Toolbox, it will
> greatly assist in the investigation - Toolbox's containers are exceedingly
> complicated and greatly hinder our debugging efforts.

Matt, "toolbox" here isn't the Toolbox that you think it is. :) This isn't https://github.com/containers/toolbox but https://github.com/coreos/toolbox/blob/main/rhcos-toolbox#L89 which is a small wrapper over a relatively simple `podman run ...` call.

(In reply to Debarshi Ray from comment #23)
> It seems like Micah's comment 6 needs some clarification.

I had moved this BZ to the Desktop team because we were going to migrate from `coreos/toolbox` to `containers/toolbox` and thought this would be a use case that should be covered by `containers/toolbox`. Since we've changed plans and continue to use `coreos/toolbox` in RHCOS, that is why it was moved back to the OCP/RHCOS component.

*** Bug 2001927 has been marked as a duplicate of this bug. ***

(In reply to Simon Krenger from comment #34)
> Could we get an update on this issue?
> Is there any other information that we can provide?

Unfortunately, higher priority work has prevented additional investigation of this problem. We recently landed some changes upstream to `rhcos-toolbox`, though I don't think they will specifically address this problem. We'll need to retest this scenario with those changes in place.

*** Bug 2093037 has been marked as a duplicate of this bug. ***

I took the suggestion from comment #28 from @mheon and tried using `crun` as the runtime for `podman` via a config file in `/etc/containers/containers.conf`, but that didn't seem to help things. However, some experimentation with `podman attach` and `podman start --attach` was ultimately successful:

```
sh-4.4# podman create --hostname toolbox --name "${TOOLBOX_NAME}" --privileged --net=host --pid=host --ipc=host --tty --interactive -e HOST=/host -e NAME="${TOOLBOX_NAME}" -e IMAGE="${IMAGE}" --security-opt label=disable --volume /run:/run --volume /var/log:/var/log --volume /etc/machine-id:/etc/machine-id --volume /etc/localtime:/etc/localtime --volume /:/host "${TOOLBOX_IMAGE}"
03758dba71ddfcd988ddefbffabe2c5206b97996699a31abbc9d22763e10ea34
sh-4.4# podman start --attach toolbox-root
[root@toolbox /]# sos
usage: sos <component> [options]

Available components:
    report, rep             Collect files and command output in an archive
    clean, cleaner, mask    Obfuscate sensitive networking information in a report
    collect, collector      Collect an sos report from multiple nodes simultaneously

sos: error: the following arguments are required: component
```

```
sh-4.4# podman create --hostname toolbox --name "${TOOLBOX_NAME}" --privileged --net=host --pid=host --ipc=host --tty --interactive -e HOST=/host -e NAME="${TOOLBOX_NAME}" -e IMAGE="${IMAGE}" --security-opt label=disable --volume /run:/run --volume /var/log:/var/log --volume /etc/machine-id:/etc/machine-id --volume /etc/localtime:/etc/localtime --volume /:/host "${TOOLBOX_IMAGE}"
f65799396b3fde7e66bccf3cc278de3f16084e963a2619ad735c1c2340d9b163
sh-4.4# podman start toolbox-root
toolbox-root
sh-4.4# podman attach toolbox-root
[root@toolbox /]# sos
usage: sos <component> [options]

Available components:
    report, rep             Collect files and command output in an archive
    clean, cleaner, mask    Obfuscate sensitive networking information in a report
    collect, collector      Collect an sos report from multiple nodes simultaneously

sos: error: the following arguments are required: component
```

So we should make the overdue change to update `toolbox` to use `podman start --attach` in place of `podman exec`.

*** Bug 2095371 has been marked as a duplicate of this bug. ***

While it would be nice to get this as part of OCP 4.11, the code freeze deadline has passed and we'll have to target this as part of OCP 4.12. We can easily backport this to 4.11.z in the near future.

This has been fixed with https://github.com/coreos/toolbox/pull/81 which was included with https://bugzilla.redhat.com/show_bug.cgi?id=2093040 but we missed updating this bug.
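The change described above (re-entering an existing toolbox with `podman start --attach` rather than `podman exec`) can be sketched as a minimal shell function. This is an illustrative sketch, not the exact code merged in coreos/toolbox PR #81: the function names and the `toolbox-root` default for `TOOLBOX_NAME` are assumptions made for the example.

```shell
# Hedged sketch of the fix, assuming a toolbox container named "toolbox-root".
# The key change: re-enter the container with `podman start --attach`, which
# re-runs its original command inside its original mount namespace, instead of
# `podman exec`, which (under `oc debug` + `chroot /host`) could land the
# shell in a stale mount namespace where sosreport had disappeared.

TOOLBOX_NAME="${TOOLBOX_NAME:-toolbox-root}"

# Query the container's state, mirroring the trace earlier in this bug.
container_state() {
    sudo podman inspect "$TOOLBOX_NAME" --format '{{.State.Status}}' 2>/dev/null
}

container_reenter() {
    # Old, buggy path (for reference only):
    #   sudo podman exec --tty --interactive "$TOOLBOX_NAME" /usr/bin/bash
    # New path per the comments above:
    sudo podman start --attach "$TOOLBOX_NAME"
}
```

In the real script, `container_reenter` would run after `podman start` (or instead of the separate start + exec pair), so the attached shell always shares the container's original namespaces.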