The current name of /distribution/install misleads new users into thinking the task is a required part of the Beaker installation process, when it actually just reports some additional details back up to the server. Renaming the task to the more explicit /distribution/report-install would help eliminate that "gotcha" for new users. I'm filing this RFE to capture the fact that we know the current name is a problem, but setting the severity and urgency both to "low" because it's one of those quirks which is an annoying usability trap, but also a change with significant enough consequences that it's currently unclear if it will ever make sense to actually fix it.
The other issue with this idea is coming up with a migration plan. Having the task available under both names in existing installations could work, but even that approach would need documentation etc.
How do people learn about /distribution/install anyway? In what situations is /distribution/install added to the job without users not knowing about it? Schedule > Reserve does it. If it started doing /distribution/report-install, would that break anything? Do people process with scripts results of jobs started manually? When people clone their existing jobs or submit their own XMLs, they'd still be able to keep using /distribution/install, it could stay there, maybe just be updated to spit out "Please consider using /distribution/report-install, this task is not longer maintained." to the output. I assume the "Rename" here is not about drop /distribution/install completely, rather change the name of the task which gets injected to the job in certain situations and depreciate (but still keep around) /distribution/install.
A nicer solution is probably just to improve the way that the installation status is reported, which would help make it clearer that /distribution/install is not actually part of the installation at all. That is covered in the job page improvements proposal. https://beaker-project.org/dev/proposals/job-page-improvements.html Changing the name of /distribution/install is probably not worth the compatibility headaches.
(In reply to Jan Pazdziora from comment #2) Right, changing the task used by reserve workflow would be fairly uncontroversial since we don't expect anyone to be programatically checking the results of such jobs. The biggest issue would be the bkr workflow commands. Currently they have /distribution/install hardcoded. Presumably a large proportion of Beaker jobs are scheduled by one of the workflow commands and so changing the task used in those would be problematic. If users are crafting their own job XML and explicitly including /distribution/install then they probably already know what it does.
Another thing that just occurred to me... /distribution/install has a large set of package requirements for basically historical reasons, not because they are actually required, and that surprises and irritates some people. If we were going to go through the compatibility headaches to rename /distribution/install it might make sense to also drop those historical package requirements at the same time. (The only reason it has never been done before is the compatibility headaches it would cause.)
Streamlining the default dependency set sounds like a genuinely good reason to migrate to a new task for installation status reporting. Let's reconsider this one after the results reporting redesign lands in Beaker 21.
+1 for removing /distribution/install from default XML (generated by webUI Reserve page - maybe by some bkr command as well). Making /distribution/install just fail with some explanation might be an option as well.
As a purely bikeshedding matter, another potential name for this new task is: /distribution/check-install. I think that came up in some previous discussions in years past although I cannot find a reference now. Personally I think /distribution/check-install makes slightly more sense than /distribution/report-install.
(In reply to Dan Callaghan from comment #8) > As a purely bikeshedding matter, another potential name for this new task > is: /distribution/check-install. I think that came up in some previous > discussions in years past although I cannot find a reference now. Personally > I think /distribution/check-install makes slightly more sense than > /distribution/report-install. +1 for check-install (and I don't think it's bikeshedding) Is there anybody already working on this, by any chance? I can't promise anything before few weeks after 7.4, but I'd love to find time to put something together. Just to make sure, the code is this, right?: https://git.beaker-project.org/cgit/beaker-core-tasks/tree/install/default I guess reading the original code should give me enough detail on what are the requirements of the new task. Are there any pitfalls worth mentioning?
(In reply to Alois Mahdal from comment #9) > Is there anybody already working on this, by any chance? Unfortunately no. > I can't promise anything before few weeks after 7.4, but I'd love to find > time to put something together. > > Just to make sure, the code is this, right?: > > > https://git.beaker-project.org/cgit/beaker-core-tasks/tree/install/default Right. > I guess reading the original code should give me enough detail on what are > the requirements of the new task. Are there any pitfalls worth mentioning? I don't think so... but the current implementation pre-dates me. :-) If we were doing a fresh implementation now I would suggest to keep things as simple and clean as possible (maybe avoid a beakerlib dependency, unless it really helps). One other thing to consider is bug 875543 *but* I don't know if there is any good solution to that.
Hi Dan, is there a plan for a fresh implementation? When we can expect this to happen?
Dear Peter, just a heads up from me, since the Beaker team is shuffling around again. I'll have to sit together with Dan to figure out a road map and what is achievable with the small team as it is (2 associates at the moment). After that I can give you a better estimate as to when this item will most likely be worked on. Hope that helps.
Coming back to this, I'll schedule this item high on our priority list for now. Work will include: * Create a new task /distribution/check-install with a cleaned up implementation and pruned dependencies * Schedule /distribution/check-install as a default for RHEL8+
Since this bz is another case where we want to change the default behaviour on RHEL8+ (and Fedora 29+ presumably), it would be a good candidate to get into 26.0 alongside the other two bugs where we are also changing defaults.
Here is a list of all the things that /distribution/install does currently: * Compares $RECIPEID with the contents of /root/RECIPE.TXT (the latter is populated by Beaker-supplied kickstart %post) in an attempt to catch the situation where an installation failed and the harness has started running the wrong recipe. This should generally be impossible to trigger nowadays since beah and restraint always fetch the recipe by id and will refuse to run an already-completed recipe. * Populates /etc/motd with the ASCII art "This System is part of the Red Hat Test System...." including some recipe-specific info. * sed's BOOTUP=serial into /etc/sysconfig/init, which appears to have the effect of turning off ANSI colours in the boot sequence. It is not clear if this has any effect on systemd distros, it seems to be undocumented on newer RHEL. * Appends "local2.info /dev/console" to /etc/syslog.conf. Presumably this has no effect on RHEL6 and above which switched to rsyslog and thus do not have /etc/syslog.conf. Also, I can't find anything that actually logs to the local2 facility, except for the totally undocumented rhts-system-info command that I never even knew existed before today... So I suspect this is an RHTS-ism that no longer has any effect. * Uses that old chestnut hvm_detect.c to check if the recipe is on an x86_64 virtualized guest, and if so removes the string "NMI appears to be stuck" from /usr/share/rhts/failurestrings as a workaround for bug 500845. I suspect this is not necessary with any recent hypervisors but it would be worth checking. * Collects /root/install.log and /root/install.log.syslog into its own $OUTPUTFILE. I think these are log files left behind by Anaconda, but they only exist on RHEL <= 6. * Reports a score equal to the number of RPM packages installed: $(rpm -qa | wc -l). * A bunch of extra checking logic which is activated only if /root/install_kernel.log is non-empty and $TESTARGS is non-empty. As far as I can tell this is a leftover RHTS-ism which is never used in Beaker. The checking logic attempts to verify that the desired kernel version (defined in $TESTARGS) is the kernel which was booted and installed, by comparing it to $(uname -r) and poking around in various boot loader configs. * Collects a large amount of system information and reports it under the /distribution/install/Sysinfo result. Namely: - hostname - kernel version - hardware details including model, CPUs, memory, lspci - /etc/fedora-release or /etc/redhat-release - /proc/cmdline - NMI counts from /proc/interrupts - /etc/modprobe.conf - modinfo output for all loaded modules - sestatus - semodule -l - /root/anaconda-ks.cfg - /root/anaconda.coverage * Checks for any lines containing certain error strings, including AVC messages, in dmesg, /root/install.log, and /root/install_kernel.log (the latter likely never to exist in Beaker). Note that it does this in a totally separate way to the normal harness dmesg-checking and AVC-checking logic. In particular it does not obey /usr/share/rhts/falsestrings. There is bug 875543 about this.
This is my proposal for what the new task should do: (In reply to Dan Callaghan from comment #15) > * Compares $RECIPEID with the contents of /root/RECIPE.TXT This check is cheap and simple so we can continue to do this. But since it should be almost impossible to fail, I would take out the PASS result and only report a FAIL (with a more descriptive result name) in case there is a mismatch. > * Populates /etc/motd with the ASCII art "This System is part of the Red Hat > Test System...." including some recipe-specific info. This is quite useful and should be kept. > * sed's BOOTUP=serial into /etc/sysconfig/init This seems to have no value. Since it is done by this task, not in the kickstart, it only applies *after* the first reboot. And Beaker already strips ANSI sequences from the serial console logs. And anyone connecting interactively to the serial console probably has an ANSI-colour-capable terminal anyway. So we should drop this. > * Appends "local2.info /dev/console" to /etc/syslog.conf No value, drop. > * Uses that old chestnut hvm_detect.c to check if the recipe is on an x86_64 > virtualized guest, and if so removes the string "NMI appears to be stuck" > from /usr/share/rhts/failurestrings as a workaround for bug 500845. I > suspect this is not necessary with any recent hypervisors but it would be > worth checking. This one is pretty funny. Because it removes the line from /usr/share/rhts/failurestrings *after* the first CheckRecipe result is reported, the workaround doesn't actually help. If the message appears you still get a failure reported by the harness dmesg checker. Moreover, this problem no longer seems to exist on any combination of recent hypervisors (checked RHEL6.7 and RHEL 7.3 KVM) with supported guest kernels (checked RHEL5-7). The only time I have seen this is in a RHEL4 guest on the dev-kvm-guest-*.rhts.eng.bos.redhat.com machines on beaker-devel, and only in years past (probably the hypervisor was quite an old version until it got upgraded). For example: https://beaker-devel.app.eng.bos.redhat.com/recipes/18801#task145135 But you still get the /distribution/install/CheckRecipe/dmesg failure reported anyway. So there is really no point keeping this workaround, we should drop it. This lets us drop gcc from the task requirements. > * Collects /root/install.log and /root/install.log.syslog into its own > $OUTPUTFILE. I think these are log files left behind by Anaconda, but they > only exist on RHEL <= 6. There is no value in keeping /root/install.log, this is the same as the install.log which is already captured by anamon. /root/install.log.syslog is an interesting one. Anaconda in RHEL6, remarkably, ships its own implementation of a syslog daemon that listens on /dev/log and writes out to /mnt/sysimage/root/install.log.syslog. So this is capturing any non-kernel syslog messages produced during the installation. In the RHEL6 recipe I looked at it was just groupadd notices. Right now, Anamon does not capture this file, although it should. Filed bug 1602251 for that. In RHEL7+ Anaconda just runs the normal systemd journal so this is no longer relevant. Assuming we rely on anamon to capture these logs, we can drop them from this task. > * Reports a score equal to the number of RPM packages installed: $(rpm -qa | > wc -l). This seems... fine I guess? > * A bunch of extra checking logic which is activated only if > /root/install_kernel.log is non-empty and $TESTARGS is non-empty It seems safe to drop this functionality. If anyone is explicitly passing TESTARGS in the task parameters, they can continue to do so using /distribution/install. Although I doubt anyone is doing this. > * Collects a large amount of system information and reports it under the > /distribution/install/Sysinfo result. Namely: > - hostname > - kernel version > - hardware details including model, CPUs, memory, lspci > - /etc/fedora-release or /etc/redhat-release > - /proc/cmdline > - NMI counts from /proc/interrupts > - /etc/modprobe.conf > - modinfo output for all loaded modules > - sestatus > - semodule -l > - /root/anaconda-ks.cfg > - /root/anaconda.coverage I would keep anything here that is specific to the *installation* just done, but I don't think there is any need to collect anything specific to the *hardware* which cannot change across recipes. We already have /distribution/inventory which is quite good at collecting detailed hardware inventory info (and plenty of RFEs still in the backlog for improving this information too). Doing a poor man's version of it in this task seems unnecessary. This would also let us drop the task requirement on lspci. > * Checks for any lines containing certain error strings, including AVC > messages, in dmesg, /root/install.log, and /root/install_kernel.log This certainly has value, but ideally it would be unified with (or somehow handled by) the harness and its existing dmesg-checking and AVC checking logic. Doing so would solve bug 875545, and ideally let us eliminate a lot of code and complexity in this task. But I will have to dig further into how that could work. Importantly, we want to capture here not the dmesg problems and AVC messages which happened *during this task* because the harness already does that. We want to capture the problems from *before the task started*, while the installation was running.
(In reply to Dan Callaghan from comment #15) > - NMI counts from /proc/interrupts The value of this one seems to be quite questionable. On my workstation (which is certainly not full of failing hardware) this shows a count of several thousand NMIs. I wonder if something is using NMIs for "normal" purposes, or the kernel is just misreporting the numbers. If there is an NMI indicating an actual hardware problem it will be caught by the dmesg checker anyway. So I would propose to drop this in the new implementation of the task.
(In reply to Dan Callaghan from comment #15) > - modinfo output for all loaded modules Specifically, it runs: modinfo -F description -F version $x, which should print the values of the "description" and "version" fields for the module... except it seems that almost no kernel modules actually have values for those fields. So I am really unsure of the utility of this. Also, the values it is reading are purely static data pulled from the kernel modules installed on disk, it is not telling anything interesting about the installed system itself (apart from the actual list of modules loaded, which can be more easily seen with lsmod). Therefore I would propose to just replace this with lsmod.
(In reply to Dan Callaghan from comment #16) > > * Checks for any lines containing certain error strings, including AVC > > messages, in dmesg, /root/install.log, and /root/install_kernel.log > > This certainly has value, but ideally it would be unified with (or somehow > handled by) the harness and its existing dmesg-checking and AVC checking > logic. Doing so would solve bug 875545, Typo. The bug I meant was 875543.
I have dug a bit more into the current behaviour of the dmesg and AVC checking... Bug 1559156 is an example of an AVC denial which happens during boot-up, after the installation, and after auditd has started. In this case, the denial only appears in /var/log/audit/audit.log and thus the checking done by /distribution/install does not find it. It is reported by the harness instead. For example: https://beaker-devel.app.eng.bos.redhat.com/recipes/24989#task162578 However if there is an AVC denial before auditd has started it will end up in dmesg and not /var/log/audit/audit.log, and thus and the harness will not find it, /distribution/install will. For example RHEL-8.0-20180713.n.3 exhibits this denial during bootup, I can't find it in Bugzilla but it's presumably the same as bug 1563792 in Fedora: [ 36.838981] audit: type=1400 audit(1531946837.796:76): avc: denied { name_bind } for pid=636 comm="rpcbind" src=63487 scontext=system_u:system_r:rpcbind_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=udp_socket permissive=0 Only appears as a failing /distribution/install/Sysinfo result. The AVC denial is buried at the end of the result log. This might make a worthwhile RFE to the harness AVC denial checking -- also check in the kernel messages in case auditd was not running, in order to catch these kinds of denials. Effectively that is what bug 875543 is asking for. As for AVC denials which happen *during* the Anaconda installation, before rebooting... I struggled a lot to come up with some easy way to trigger an AVC denial in %pre or %post. At first it seemed I could do: cp /bin/ls /tmp ; runcon guest_u:guest_r:guest_t:s0 /tmp/ls ... But it turns out runcon is not in the Anaconda image. In the end I came up with this hackery, which can be put into a ksappend in %pre or %post: cat >/etc/systemd/system/denyme.service <<"EOF" [Unit] Description=Deny Me [Service] Type=oneshot ExecStart=/bin/cat /root/.bash_profile SELinuxContext=guest_u:guest_r:guest_t:s0 EOF systemctl daemon-reload systemctl start denyme The service does indeed fail to start because of an SELinux denial but it turns out today this is *totally invisible*, at least on RHEL7. The denial is not logged by Anaconda anywhere, it goes to the kernel messages which is captured by the systemd journal. But neither anamon nor /distribution/install ever captures or looks in this so the denial is never reported or even visible. So bug 1602251 will help a bit, although that bug will not actually find any AVCs, it is just about capturing the log in Beaker. Really bug 1602251 should be expanded to also capture the journal output on RHEL7 too, since that is the equivalent on RHEL7+. Unfortunately, I am at a bit of a loss how to produce an AVC denial during the Anaconda installation on RHEL6. The RHEL6 Anaconda environment lacks runcon but there is also no systemd so I cannot use the trick shown above. I am not sure if the denial would ever appear in /root/install.log.syslog, because that doesn't appear to include any kernel messages. So I suspect the same problem exists -- an AVC denial during Anaconda would simply be lost and never reported in Beaker. The situation for dmesg problems ("cut here" traces, BUG/Oops messages, etc) is basically the same as for AVC denials. They are easier to simulate at least: echo BUG this is a fake bug >/dev/kmsg On RHEL7 and up, if it happens during the installation, it is completely lost for the same reason above: because Beaker is not capturing or looking at the systemd journal during the installation. On RHEL6 Dmesg problems which happen on boot-up *after* the installation are found by the harness' usual checking mechanisms, just like with AVC denials which happen on boot-up as mentioned above. The /distribution/install task does not do anything for those. So in summary: /distribution/install tries to find AVC denials and kernel errors during the installation but it doesn't actually work. AVC denials and kernel errors *after* the installation are already found by the harness, unless they happened before auditd started, but bug 875543 can solve that. Therefore we can drop this checking functionality in the new version. We could certainly improve the status quo here though. Bug 1602251 is just the first step (capturing the relevant logs) and we could look at actually reporting AVC denials and dmesg problems in some way too. I've filed a second RFE for that: bug 1603077.
(In reply to Dan Callaghan from comment #15) > * Checks for any lines containing certain error strings, including AVC > messages, in dmesg, /root/install.log, and /root/install_kernel.log (the > latter likely never to exist in Beaker) There is one other interesting bit I missed here... It is looking for quite broad error strings in /root/install.log, which actually has the potential to catch some installation problems that are not caught anywhere else. However, since this file is just *package* installations and most interesting Anaconda messages go to other log files these days (particular anaconda.log) it seems to be of limited usefulness. We could probably do better by looking for problematic messages across *all* the installation logs uploaded by Anamon, which means it is better handled elsewhere in Beaker. We have open RFEs such as bug 893075 and bug 1554656 about this kind of functionality. So I still think it is okay to drop this from the new /distribution/check-install task.
One other interesting thing about the dmesg checking in /distribution/install: it actually applies some very broad patterns, much broader than the patterns used by the dmesg checker in the harness. It reports matching lines at the very bottom of the Sysinfo log under the heading "Potential issues dmesg". For example, in one recipe I saw this: ******** Potential Issues dmesg ******** [ 0.227002] ACPI Error: Method parse/execution failed \_SB.PCI0._OSC, AE_ALREADY_EXISTS (20180531/psparse-516) [ 1.395075] tpm tpm0: A TPM error (7) occurred attempting to read a pcr value [ 1.412116] tpm tpm0: A TPM error (7) occurred attempting get random [ 1.834715] tpm tpm0: A TPM error (7) occurred attempting to read a pcr value [ 0.005000] DMAR: Failed to map dmar1 [ 0.006000] DMAR: Parse DMAR table failure. [ 0.226035] ACPI BIOS Error (bug): Failure creating [\_SB.PCI0._OSC.CAPD], AE_ALREADY_EXISTS (20180531/dsfield-179) [ 0.227002] ACPI Error: Method parse/execution failed \_SB.PCI0._OSC, AE_ALREADY_EXISTS (20180531/psparse-516) [ 0.228004] acpi PNP0A08:00: _OSC failed (AE_ALREADY_EXISTS); disabling ASPM [ 8.970343] systemd[1]: Failed to insert module 'ip_tables': Operation not permitted [ 0.000000] WARNING: CPU: 0 PID: 0 at drivers/iommu/dmar.c:848 warn_invalid_dmar.part.9+0x57/0x60 The WARNING message at the very end was part of a "cut here" trace which was actually reported as a Fail result by the harness dmesg checker. But all the other "error" messages before that are just "normal" kernel spew that everyone is basically accustomed to ignoring. So in some ways this is interesting info -- maybe it does help to find kernel bugs -- and it should be easy enough to add to the new /distribution/check-install task in the same way. But I wonder if anybody ever bothers to look at the bottom of this file to see these messages? Also, it's worth noting that of the three different checks being done here (kernel errors in dmesg, AVC denials, and errors in the Anaconda logs) it is only the presence of AVC denials which cause a Fail result to be reported. The other messages are included in the Sysinfo result log but they are not considered a Fail.
Patch for /distribution/check-install: https://gerrit.beaker-project.org/#/c/beaker-core-tasks/+/6213 You can see it in action here: https://beaker-devel.app.eng.bos.redhat.com/jobs/14876 (Ignore the aborted F28, that is some segfault in dnf which crept into F28 updates yesterday apparently.) Still working on patches for the Beaker side of things...
Okay, this is all the Beaker bits of it... These two are changed unconditionally to always use the new task. As noted above, we do not expect anyone parses the results of these programmatically, so we don't need to worry about breaking their scripts because the task name changes: https://gerrit.beaker-project.org/#/c/beaker/+/6215 reserve workflow: use /distribution/check-install https://gerrit.beaker-project.org/#/c/beaker/+/6216 inventory jobs: use /distribution/check-install This is the meat. We change the bkr workflow commands to use /distribution/check-install, but only on RHEL8+ and Fedora 29+. Means some unfortunate family name parsing in the client, but we've had worse... https://gerrit.beaker-project.org/#/c/beaker/+/6217 bkr workflows: use /distribution/check-install And these ones are really just for completeness, to avoid leaving unnecessary references to the old task lying around: https://gerrit.beaker-project.org/#/c/beaker/+/6218 STABLE tagging script: use /distribution/check-install https://gerrit.beaker-project.org/#/c/beaker/+/6219 docs: describe /distribution/check-install https://gerrit.beaker-project.org/#/c/beaker/+/6220 docs: replace all incidental references to /distribution/install https://gerrit.beaker-project.org/#/c/beaker/+/6221 tests: use /distribution/check-install for completeness
I have uploaded /distribution/check-install 1.0-2 to beaker-devel, and also to stage and production. Also published it here: https://beaker-project.org/tasks/
Great work, Dan! This was much needed. I am now officially looking forward for Beaker 26. :-)
Btw the new task is already uploaded on beaker.engineering.redhat.com so you can begin using it in your jobs now if you want to test it out. (We have started doing so with our Beaker dogfood jobs and it has already revealed some undeclared dependencies...)
Beaker 26.0 has been released.