Bug 1299578
Summary: | sosreport output in container, not host | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Chris Evich <cevich> |
Component: | sos | Assignee: | Pavel Moravec <pmoravec> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | BaseOS QE - Apps <qe-baseos-apps> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.2 | CC: | agk, atomic-bugs, bmr, cevich, dwalsh, eslobodo, gavin, jeder, plambri, pmoravec, sbradley, vigoyal |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | 1277223 | Environment: | |
Last Closed: | 2016-11-11 07:48:58 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1277223, 1299794 |
Comment 2
Bryn M. Reeves
2016-01-19 13:25:30 UTC
Also, what is the version of the sos package in use? The image hashes are not meaningful to the sos maintainers in terms of a specific package NVR.

(In reply to Bryn M. Reeves from comment #2)
> Do you have an environment where I can test this?

Bug discovered in CI testing, so "not really". However, it's easy to reproduce on an atomic host. The steps come from the docs:

https://access.redhat.com/documentation/en/red-hat-enterprise-linux-atomic-host/version-7/getting-started-with-containers/#using_the_atomic_tools_container_image

# atomic run rhel7/rhel-tools
# sosreport
...cut...
# ls /host/var/tmp
# exit

The rhel-tools container is run with the '--rm' flag (RUN label), so when you exit the container, all data goes *poof*, including the sosreport output. What the docs indicate (and what seems to be the desirable behavior) is that the report should be written to /host/var/tmp. That way the report will persist after the container is removed (implicitly, by exiting from it).

Further, Bug 1299794 suggests that sosreport should be collecting data from /host instead of the container's filesystem. This is also likely desired behaviour, since the host's state is more useful than the container's (which will be short-lived anyway). To be clear though, this bug is more about the output files from sosreport not being stored in a safe/useful place.

(In reply to Bryn M. Reeves from comment #3)
> Also what is the version of the sos package in use? The image hashes are not
> meaningful for the sos maintainers in terms of a specific package NVR.

I'll get you that detail in a sec. This was originally reproduced against a candidate/staging image. I'll do it again using the latest os-tree and available rhel-tools image...

> Bug discovered in CI testing, so "not really". However, it's easy to
> reproduce on an atomic-host.
> The steps come from the docs:

I don't have access to an atomic host environment or a system where I can easily build one (sos is not my 'day job' - I work on the upstream project in my own time and help out with RHEL packaging and product integration when I can). Since this is entirely dependent on the particular environment of the Atomic host, we (the sos maintainers) really need some help to be able to work on this in a timely fashion.

> To be clear though, this bug is more about the output files from sosreport not
> being stored in a safe/useful place.

They are very unlikely to be separate issues.

[root@bz1299578 ~]# atomic host status
  TIMESTAMP (UTC)         VERSION   ID             OSNAME             REFSPEC
* 2015-12-03 19:40:36     7.2.1     aaf67b91fa     rhel-atomic-host   rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

[root@bz1299578 ~]# docker images
REPOSITORY                                    TAG      IMAGE ID       CREATED       VIRTUAL SIZE
registry.access.redhat.com/rhel7/rhel-tools   latest   fd2acbeb2b97   6 weeks ago   1.159 GB
...

[root@bz1299578 ~]# atomic run rhel7/rhel-tools
Using default tag: latest
fd2acbeb2b97: Download complete
6c3a84d798dc: Download complete
Status: Downloaded newer image for registry.access.redhat.com/rhel7/rhel-tools:latest
docker run -it --name rhel-tools --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=rhel-tools -e IMAGE=rhel7/rhel-tools -v /run:/run -v /var/log:/var/log -v /etc/localtime:/etc/localtime -v /:/host rhel7/rhel-tools

[root@bz1299578 /]# sosreport

sosreport (version 3.2)

This command will collect diagnostic and configuration information from this Red Hat Atomic Host system. An archive containing the collected information will be generated in /var/tmp and may be provided to a Red Hat support representative.
Any information provided to Red Hat will be treated in accordance with the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its content should be reviewed by the originating organization before being passed to any third party.

Press ENTER to continue, or CTRL-C to quit.

Please enter your first initial and last name [bz1299578.novalocal]:
Please enter the case id that you are generating this report for []:

Setting up archive ...
Setting up plugins ...
[plugin:pcp] /var/log/pcp/pmlogger/bz1299578.novalocal not found
[plugin:sar] sar: could not list /var/log/sa
Running plugins. Please wait ...

  Running 70/70: yum...
Creating compressed archive...

Your sosreport has been generated and saved in:
  /var/tmp/sosreport-bz1299578.novalocal-20160119171125.tar.xz

The checksum is: 31423813538ae9c973b704573053cc01

Please send this file to your support representative.

[root@bz1299578 /]# ls /var/tmp
sosreport-bz1299578.novalocal-20160119171125.tar.xz
sosreport-bz1299578.novalocal-20160119171125.tar.xz.md5
[root@bz1299578 /]# ls /host/var/tmp
[root@bz1299578 /]# ls /host/tmp
ks-script-BCtjUK  ks-script-CJy00F
[root@bz1299578 /]# exit
exit
[root@bz1299578 ~]# ls /var/tmp
[root@bz1299578 ~]# ls /tmp
ks-script-BCtjUK  ks-script-CJy00F

Good news is, I see we no longer use --rm for the rhel-tools container, so you can still access the data (as long as you don't rm the container):

[root@bz1299578 ~]# docker ps -a
CONTAINER ID   IMAGE              COMMAND           CREATED          STATUS                        PORTS   NAMES
a77cc864b0a8   rhel7/rhel-tools   "/usr/bin/bash"   23 minutes ago   Exited (137) 38 seconds ago           rhel-tools
[root@bz1299578 ~]# docker start -it rhel-tools
flag provided but not defined: -it
See 'docker start --help'.
[root@bz1299578 ~]# docker start rhel-tools
rhel-tools
[root@bz1299578 ~]# docker exec -it rhel-tools bash
[root@bz1299578 /]# ls /var/tmp
sosreport-bz1299578.novalocal-20160119171125.tar.xz
sosreport-bz1299578.novalocal-20160119171125.tar.xz.md5
[root@bz1299578 /]#

In any case, from the getting started guide, the expectation is that the output files should appear in the host's /var/tmp, seen as /host/var/tmp from inside the container.

Here's the problem:

# sosreport -vvv --debug
policy sysroot is '/host' (in_container=False)
set sysroot to '/' (default)
sosreport (version 3.2)
[...]

The in_container() test that we put in place for 7.1 is no longer evaluating true in the rhel-tools container. For Red Hat distros this is just:

    if ENV_CONTAINER_UUID in os.environ:
        self._in_container = True

ENV_CONTAINER_UUID is the container UUID environment variable we were told was guaranteed to be present when running in a container:

    # Container environment variables on Red Hat systems.
    ENV_CONTAINER_UUID = 'container_uuid'

We added this extra check to prevent confusion when sos is run in a non-container environment and the administrator happens to have an environment variable named 'HOST'. If there is some other reliable check we can implement, then we can easily switch over to that - otherwise we'd probably need to add an additional command line switch, as the auto detection needs to be both safe and reliable (this was tested and worked as intended on 7.1).
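The 7.1-era test quoted above is easy to exercise standalone. A minimal sketch (the free-standing function and the injectable environ parameter are mine for illustration; in sos this logic lives inside the Red Hat policy class):

```python
import os

# Container environment variable sos 3.2 keyed on (see the snippet above).
ENV_CONTAINER_UUID = 'container_uuid'


def in_container(environ=os.environ):
    """Sketch of the 7.1-era detection: true only when the container
    UUID variable is present. On 7.2 Atomic the variable is no longer
    exported, so this evaluates False inside the rhel-tools container,
    which is exactly the failure described above."""
    return ENV_CONTAINER_UUID in environ
```

For example, in_container({'container': 'docker'}) is False, mirroring the 7.2 rhel-tools environment where only the 'container' variable survives.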
Debug patch to log the policy sysroot results:

diff -up sos/sosreport.py.orig sos/sosreport.py
--- sos/sosreport.py.orig	2016-01-19 18:43:51.198111934 +0000
+++ sos/sosreport.py	2016-01-19 18:52:45.050111934 +0000
@@ -691,6 +691,10 @@ class SoSReport(object):
             msg = "default"
         host_sysroot = self.policy.host_sysroot()
+
+        self.soslog.debug("policy sysroot is '%s' (in_container=%s)"
+                          % (host_sysroot, self.policy.in_container()))
+
         # set alternate system root directory
         if self.opts.sysroot:
             msg = "cmdline"

Bryn,

Ya, there's no more UUID exposed automatically by the looks of it. Checking 'container=docker' may be the preferred way. That's the way systemd works, IIUC. I'll ask around and find out for sure. Dan mentions container_uuid in this post:

http://developerblog.redhat.com/2014/11/06/introducing-a-super-privileged-container-concept/

As it's apparently no longer being set, we may need to change to just checking for 'container=docker' - I see this set when running the rhel-tools image on Chris' test system.

hrmmm, looks like fedora doesn't set 'container=docker', and checking an env. var seems a bit error-prone to me - anyone can set/unset them. I've got a meeting with Dan and team in an hour or so, I'll ask then.

[policies/redhat] use 'container' variable for in_container() test

Signed-off-by: Bryn M. Reeves <bmr>

diff -up sos/policies/redhat.py.orig sos/policies/redhat.py
--- sos/policies/redhat.py.orig	2016-01-19 19:23:42.283111934 +0000
+++ sos/policies/redhat.py	2016-01-19 19:23:50.709111934 +0000
@@ -83,8 +83,9 @@ class RedHatPolicy(LinuxPolicy):
         """Check if sos is running in a container and perform container
         specific initialisation based on ENV_HOST_SYSROOT.
""" - if ENV_CONTAINER_UUID in os.environ: - self._in_container = True + if ENV_CONTAINER in os.environ: + if os.environ[ENV_CONTAINER] == 'docker': + self._in_container = True if ENV_HOST_SYSROOT in os.environ: self._host_sysroot = os.environ[ENV_HOST_SYSROOT] use_sysroot = self._in_container and self._host_sysroot != '/' @@ -124,7 +125,7 @@ class RedHatPolicy(LinuxPolicy): return self.host_name() # Container environment variables on Red Hat systems. -ENV_CONTAINER_UUID = 'container_uuid' +ENV_CONTAINER = 'container' ENV_HOST_SYSROOT = 'HOST' # sosreport -vvv --debug set sysroot to '/host' (policy) sosreport (version 3.2) This command will collect diagnostic and configuration information from this Red Hat Atomic Host system. An archive containing the collected information will be generated in /host/var/tmp and may be provided to a Red Hat support representative. Any information provided to Red Hat will be treated in accordance with the published support policies at: https://access.redhat.com/support/ The generated archive may contain data considered sensitive and its content should be reviewed by the originating organization before being passed to any third party. Press ENTER to continue, or CTRL-C to quit.[...] [...] Your sosreport has been generated and saved in: /host/var/tmp/sosreport-bz1299578.novalocal-20160119192702.tar.xz The checksum is: 0eadbb426b100e50d7d731d8550ffc25 Please send this file to your support representative. > hrmmm, looks like fedora doesn't set 'container=docker', and checking an env.
> var seems a bit error-prone to me, anyone can set/unset them.
I don't disagree: the original suggestion was that we should just test for an environment variable named 'HOST' containing a path. Using 'container_uuid' was agreed on as something that would be reliable for Atomic and RHEL, and safe for users in non-container environments who happen to have HOST in their environment. (Automatic sysroot has only been implemented for the Red Hat and Fedora policies so far - users of other distributions need to give a sysroot path on the command line.)
It's easy to change and easy for us to test other things, but we need whatever we use to be robust. systemd stats '/proc/1/root' and '/' to determine whether it is running in a chroot; we can add a similar check in the sos policy and only then try to use the 'HOST' or 'container' variables.
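A minimal Python sketch of that systemd-style test (the helper name and the return-False-on-error fallback are my assumptions, not sos or systemd code):

```python
import os


def in_chroot():
    """Compare the device/inode of our root with PID 1's root, as
    systemd does: in a chroot the two differ. Stat'ing /proc/1/root
    needs sufficient privilege, so if it fails we conservatively
    report False rather than guess."""
    try:
        our_root = os.stat('/')
        init_root = os.stat('/proc/1/root')
    except OSError:
        return False
    return (our_root.st_dev, our_root.st_ino) != \
        (init_root.st_dev, init_root.st_ino)
```

Note that with the rhel-tools RUN label (--pid=host --privileged), PID 1 inside the container is the host's init, so its root genuinely differs from the container's root; in an ordinary container with its own PID namespace, PID 1 shares the container's root and the check stays quiet.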
(In reply to Bryn M. Reeves from comment #15)
> whatever we use to be robust - systemd stats '/proc/1/root' and '/' to
> determine if it is running in a chroot - we can add a similar check in the
> sos policy and only then try to use the 'HOST' or 'container' variables.

That sounds better to me. Unfortunately, Dan wasn't in my meeting today, and this might be something we need a wider audience to resolve. Better to fix it the right way once than to have to come back because it broke again. I'll reach out internally and cc you...

We are not setting container_uuid. ENV container=docker is set in the RHEL base container, and maybe should be set for now by Fedora, but I am not even sure this is correct, since you could use other tools like runc and perhaps rkt in the future to run a RHEL7 image. For now, though, this is the best we have.

The particular use-case here is Atomic with docker; at the time we did this work initially we were told that 'container_uuid' was "guaranteed" to be present in this environment - that is no longer the case, so we need to find a better test. For now I am fine if that is more-or-less specific to the Atomic/docker setup - it is relatively easy for us to enable sysroot autodetection for other environments later (assuming there's something reliable to key off). The main point here, and the reason there is some time pressure, is that this is functionality we've documented for a year already and that is now broken due to the environment change - this affects Cockpit and other integration pieces, so it's something we really need to address, and ideally in a way that's going to be robust for a reasonable number of updates.

(In reply to Bryn M. Reeves from comment #18)
> The particular use-case here is Atomic with docker; at the time we did this

Agreed. Though to be clear, you mean RHEL Atomic, CentOS Atomic, Fedora Atomic, etc., ya?
I'd also suggest targeting specific "images" as well for support status, since sosreport would be dependent on the content (in the case of container=docker). E.g. some random Slackware docker image may not use/support this env-var. Dan's also correct in that docker is just the tooling, so depending on its specifics vs. other tooling will lead to a similar problem if/when other tooling comes on the scene. For example, maybe later it could be 'container=rkt' or w/e.

For reporting on the host in the Atomic context, obviously the host details need to be exposed somehow (/host or w/e). To me that suggests maybe a circular-reference check is possible. For example, the sosreport process from the POV of the host will have some key details that exactly match the sosreport process from the POV of the container - details from /proc/pid/maps, for instance (presuming address space is randomized). Anyway, it's a tricky problem. In the interest of speed, and finding an 80% fix, maybe the env. var check is the best we have for now :S

> Though to be clear, you mean RHEL Atomic, CentOS Atomic, Fedora Atomic etc. ya?

Right; "Atomic with docker on Red Hat-like distros" is probably fair (although the priority in this bug is clearly RHEL/Atomic).

> For example, maybe later it could be 'container=rkt' or w/e.

That's fine, and we can add them as they come up - this is the reason that container_uuid was appealing for us in the first place: it is supported by other container systems such as LXC and would have given us automatic enablement of this feature on those platforms. With all that said, however, I do have concerns that even with a "portable" check to key off, the requirements for sos in a super-privileged container are sufficiently special that the test portability doesn't matter - specific testing and possibly changes will be required to make sure things work correctly in each environment.

(In reply to Bryn M. Reeves from comment #8)
> used are all manifestations of the same bug - that due to the missing
> "container_uuid" that these versions check for sos thinks it is running in a

That reminds me: there is a use-case for using sosreport to gather the container's details rather than the host's. Maybe that needs to be an RFE, but in any case it might be prudent not to "fix" this problem so thoroughly that it can't be bypassed when needed.

We discussed that use case last year - it should be possible now using '--sysroot=/', although there may be some cases (command execution especially) that do not currently work as desired. Right now this isn't a supported feature, so we have a bit of time to get it right.

Hi Chris,
could you please confirm the change per https://bugzilla.redhat.com/show_bug.cgi?id=1299578#c13 correctly distinguishes between container and host, and whether this check is expected to be stable from now on? I plan to backport the change to 7.2.z batch 2, where I should hand over the errata to QE within a few days, so feedback / a response in that time frame is welcomed.

> could you please confirm the change per https://bugzilla.redhat.com
> /show_bug.cgi?id=1299578#c13 correctly distinguishes between container and
> host, and if this check is expected to be stable for now on?
That's not really the important question: using container_uuid correctly distinguished container vs. host until it was abruptly removed from the environment between 7.1 and 7.2.
What we really need to know before committing to this new environment variable is whether we can rely on this not also suddenly going away and breaking HOST auto-detection (at least within the support limits of the current Atomic product).
Agreed. I sent a message last week to atomic-devel:
https://lists.projectatomic.io/projectatomic-archives/atomic-devel/2016-January/msg00039.html
but there are no replies so far. If there's nothing better, then I think what Dan said above is (unfortunately) the case: "...this is the best we have." It's not set in the Fedora image (verified this last week), but I'm not sure about CentOS. We probably need separate bugs to ensure those changes are made (at least in the fedora images). I'll put that on my TODO list.

(In reply to Pavel Moravec from comment #23)
> I plan to backport the change to 7.2.z batch 2 where I should hand-over

Oh, that reminds me: should this bug be flagged against 7.2 then and added to the tracking bug 1287902? Because right now it's set up for 7.3.

No - it's correct as-is: a bug needs to get accepted for rhel-X.Y (7.3 in this case) before it can be requested for rhel-x.y.z (7.2.z here).

Opened Bug 1302354 for Fedora.

Any reason anyone knows of that we can't make this BZ public?

Talked with Bryn; "ok" to make this public. Waiting on PM and hoping for confirmation the 'container' env. var isn't going to vanish like the last env. var.

(In reply to Chris Evich from comment #29)
> Talked with Bryn, "ok" to make this public. Waiting on PM and hoping for
> confirmation 'container' env. var isn't going to vanish like the last env.
> var.

Hello,
have you received such confirmation that the 'container' env. name will stay stable?

(In reply to Pavel Moravec from comment #30)
> Hello,
> have you received such confirmation that the 'container' env. name will stay
> stable?

I think Dan's indication above in c17 still stands. This technology is so new and developing that it's really hard to give any guarantees. For the foreseeable future, this is probably the best standard Red Hat can stick to. We can be ready for change by implementing sufficient testing (which I have).
As soon as this bug is closed/fixed, I'll add checks to my test to verify sosreport recognizes the host type (in the welcome message) and places the output in the correct location.

container=docker is provided by the base image. I think Fedora/CentOS and RHEL all provide this; Docker does not do this by default.

According to Chris, Fedora does _not_ currently set this. It would be good to confirm that this is going to be present in Red Hat distributions before we commit to it - we look a bit stupid when our own integration keeps breaking because of silly things like disappearing environment variables.

(In reply to Bryn M. Reeves from comment #33)
> According to Chris Fedora does _not_ currently set this.

I opened this one for fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1302354

Verified this problem still exists in sos in Atomic 7.2.4, rhel-tools:7.2-23, image ID 7bd6bdb83046.

The sos version is always more helpful than any Atomic/rhel-tools version for sos bugs. That said, we have not updated anything here (the bug is still NEW and Pavel has not granted a devel_ack yet), so we would not expect anything to have changed.

Verified the problem still exists in Atomic 7.2.5, rhel7/rhel-tools:7.2-28, ID 1df53c6954ce.

FYI this has been fixed in the sos 3.3 we rebase to in RHEL 7.3. If you are willing to test it, here is the package:
http://download-node-02.eng.bos.redhat.com/brewroot/////packages/sos/3.3/2.el7/noarch/sos-3.3-2.el7.noarch.rpm

(In reply to Pavel Moravec from comment #38)
> FYI this has been fixed in sos 3.3 we rebase to in RHEL7.3. If willing to
> test it, here is the package:

(on Atomic 7.2.6)

# atomic run registry.access.redhat.com/rhel7/rhel-tools:latest
Trying to pull repository registry.access.redhat.com/rhel7/rhel-tools
...cut...
# yum upgrade http://download-node-02.eng.bos.redhat.com/brewroot/////packages/sos/3.3/2.el7/noarch/sos-3.3-2.el7.noarch.rpm
...cut...
Transaction test succeeded
Running transaction
  Updating   : sos-3.3-2.el7.noarch         1/2
  Cleanup    : sos-3.2-35.el7_2.3.noarch    2/2
...cut...

# sosreport
...cut...
Creating compressed archive...

Your sosreport has been generated and saved in:
  /var/tmp/sosreport-$HOSTNAME-20160831163602.tar.xz

# ls -la /var/tmp
(it's there)
# exit
# ls -la /var/tmp
total 12
drwxrwxrwt.  5 root root 4096 Aug 31 16:08 .
drwxr-xr-x. 24 root root 4096 Aug 29 05:49 ..
drwx------.  3 root root 4096 Aug 31 16:35 sos.1CyqMn
# ls -la /var/tmp/sos.1CyqMn/
total 178136
drwx------.  3 root root      4096 Aug 31 16:35 .
drwxrwxrwt.  5 root root      4096 Aug 31 16:08 ..
drwx------. 15 root root      4096 Aug 31 16:35 sosreport-$HOSTNAME-20160831160524
-rw-------.  1 root root 172829696 Aug 31 16:35 sosreport-$HOSTNAME-20160831160524.tar

So the tarball does appear to be getting copied onto the host, but shouldn't it be at the top level of /var/tmp and not nested beneath whatever 'sos.1CyqMn' is? Just double-checking. Otherwise we can mark this as VERIFIED. Please let me know.

Update: Ahh, I see now that sosreport-$HOSTNAME-20160831163602.tar.xz was in the container's /var/tmp. That other file, *60524, was from a prior attempt that I Ctrl-C'd because the host's root filesystem filled up and sosreport "hung". So, it appears that sosreport is _not_ copying the tarball onto the host as I would have expected. If it doesn't, and someone accidentally runs 'docker rm rhel-tools', then the sosreport data will be lost!

Please attach the sos.log from the run that left the tarball inside the container, and also the complete output of 'env' run from the same environment as the report (i.e. including HOST and whatever else is set for the container).

Manually updating the image to sos-3.3-2.el7 gives the expected results:

<bmr> 3.3-2.el7 works good:
<bmr> # sosreport -vvv --batch --debug -o general
<bmr> set sysroot to '/host' (policy)
<bmr> sosreport (version 3.3)
<bmr> [...]
<bmr> [archive:TarFileArchive] initialised empty FileCacheArchive at '/host/var/tmp/sos.Vgrh_o/sosreport-jmaster.usersys.redhat.com-20160901134453'
<bmr> [...]
<bmr> [plugin:general] added copyspec '['/host/etc/sysconfig']'
<bmr> [plugin:general] added copyspec '['/host/proc/stat']'
<bmr> [...]

Yep, looks like this was my goof. It seems likely I rm'd the rhel-tools container when the host ran out of space and forgot to re-update the package. Thanks, Bryn, for setting things right.

This bug has been fixed by the sos rebase to 3.3 [1], which includes the upstream fix. The relevant RHEL 7.3 sos errata is [2]. Therefore I am closing the bug. Please test whether it addresses the reported problem properly, and if not, reopen the BZ.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1293044
[2] https://rhn.redhat.com/errata/RHBA-2016-2380.html