
Bug 2161533

Summary: ostree fsck should not run by default on OCP
Product: Red Hat Enterprise Linux 8
Reporter: Andreas Bleischwitz <ableisch>
Component: sos
Assignee: Pavel Moravec <pmoravec>
Status: CLOSED CURRENTRELEASE
QA Contact: Miroslav Hradílek <mhradile>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 8.4
CC: agk, dornelas, jcastillo, jjansky, mhradile, plambri, sbradley, supportability-qe, theute, walters
Target Milestone: rc
Keywords: OtherQA, Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: sos-4.5.1-3.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-04-21 18:11:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andreas Bleischwitz 2023-01-17 08:28:35 UTC
Description of problem:
When sos is started on an OpenShift node, it will by default start `ostree fsck` during data collection.
As a result, `ostree fsck` is very likely to appear in the process list in state "D", which causes confusion (as it isn't expected that such a process is started during sos collection).

Version-Release number of selected component (if applicable):
sos included in the toolbox container 

How reproducible:
always

Steps to Reproduce:
1. collect sos-report on an OpenShift node
2. find "ostree fsck" in collected process list

Actual results:
`ostree fsck` is started

Expected results:
`ostree fsck` is only started if requested (as is already the case when `sos` runs on hosts that are not OpenShift nodes)

Additional info:
Derrick O. already did some analysis:
* `ostree fsck` is only started if the "verify" option is enabled - it defaults to disabled.
* "verify" is enabled whenever an OpenShift node is detected and therefore `ostree fsck` is run.
  In addition to the `ostree fsck` command, other verification steps are added too.

# sos report -o ostree -l -vv | grep verify
[sos.report:setup] using 'ocp' preset defaults (--container-runtime crio --log-size 100 --no-report --plugopts crio.timeout=600,networking.timeout=600,networking.ethtool_namespaces=False,networking.namespaces=200 --skip-plugins cgroups --verify)
[sos.report:setup] effective options now: --container-runtime crio --list-plugins --log-size 100 --no-report --only-plugins ostree --plugopts crio.timeout=600,networking.timeout=600,networking.ethtool_namespaces=False,networking.namespaces=200 --skip-plugins cgroups -vv --verify


https://github.com/sosreport/sos/blob/4.3/sos/presets/redhat/__init__.py#L37
---
RHOCP = "ocp"
RHOCP_DESC = "OpenShift Container Platform by Red Hat"
RHOCP_OPTS = SoSOptions(all_logs=True, verify=True, plugopts=[        <<<< enable verify on OpenShift nodes
                             'networking.timeout=600',
                             'networking.ethtool_namespaces=False',
                             'networking.namespaces=200'])
---
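The effect of the preset excerpt above can be sketched as follows. This is only an illustration: the dictionaries and the `effective_options` helper are hypothetical stand-ins and not the real `SoSOptions` class, and the layering order (defaults, then preset, then command line) is an assumption made for the example.

```python
# Hypothetical stand-in for sos option handling; not the real SoSOptions class.
DEFAULTS = {"all_logs": False, "verify": False, "plugopts": []}

# Values taken from the RHOCP preset excerpt above.
RHOCP_PRESET = {
    "all_logs": True,
    "verify": True,
    "plugopts": [
        "networking.timeout=600",
        "networking.ethtool_namespaces=False",
        "networking.namespaces=200",
    ],
}


def effective_options(defaults, preset=None, cmdline=None):
    """Layer preset values over defaults, then command-line values on top.

    The precedence order is an assumption made for this sketch.
    """
    opts = dict(defaults)
    opts.update(preset or {})
    opts.update(cmdline or {})
    return opts


plain_host = effective_options(DEFAULTS)
ocp_node = effective_options(DEFAULTS, preset=RHOCP_PRESET)
print(plain_host["verify"], ocp_node["verify"])
```

On a host without the preset, "verify" keeps its default; once the preset is applied, it is switched on without the user having asked for it.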

So the 'verify' option is disabled by default, but it's being enabled when OCP is detected. Oddly, it does seem that we're setting 'verify' here _just_ to call 'ostree fsck', or maybe that's just a coincidence considering https://github.com/sosreport/sos/pull/2459 makes no mention of it. Here's the list of plugin options that are triggered by 'verify':

# egrep -R 'get_option\("verify"\)' -A2 /usr/lib/python3.6/site-packages/sos/
/usr/lib/python3.6/site-packages/sos/report/plugins/autofs.py:        if self.get_option("verify"):
/usr/lib/python3.6/site-packages/sos/report/plugins/autofs.py-            self.add_cmd_output("rpm -qV autofs")
/usr/lib/python3.6/site-packages/sos/report/plugins/autofs.py-
--
/usr/lib/python3.6/site-packages/sos/report/plugins/dpkg.py:        if self.get_option("verify"):
/usr/lib/python3.6/site-packages/sos/report/plugins/dpkg.py-            self.add_cmd_output("dpkg -V")
/usr/lib/python3.6/site-packages/sos/report/plugins/dpkg.py-            self.add_cmd_output("dpkg -C")
--
/usr/lib/python3.6/site-packages/sos/report/plugins/flatpak.py:        if self.get_option("verify"):
/usr/lib/python3.6/site-packages/sos/report/plugins/flatpak.py-            self.add_cmd_output("flatpak repair --dry-run")
/usr/lib/python3.6/site-packages/sos/report/plugins/flatpak.py-
--
/usr/lib/python3.6/site-packages/sos/report/plugins/systemd.py:        if self.get_option("verify"):
/usr/lib/python3.6/site-packages/sos/report/plugins/systemd.py-            self.add_cmd_output("journalctl --verify")
/usr/lib/python3.6/site-packages/sos/report/plugins/systemd.py-
--
/usr/lib/python3.6/site-packages/sos/report/plugins/ostree.py:        if self.get_option("verify"):
/usr/lib/python3.6/site-packages/sos/report/plugins/ostree.py-            self.add_cmd_output("ostree fsck")
/usr/lib/python3.6/site-packages/sos/report/plugins/ostree.py-
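
To make the grep output above easier to reason about, here is a small standalone sketch of the gating it shows. The command strings are copied from the output; the `VERIFY_COMMANDS` table and the `verify_commands` helper are hypothetical names for illustration and do not exist in sos.

```python
# Commands copied from the grep output above, keyed by plugin name.
VERIFY_COMMANDS = {
    "autofs": ["rpm -qV autofs"],
    "dpkg": ["dpkg -V", "dpkg -C"],
    "flatpak": ["flatpak repair --dry-run"],
    "systemd": ["journalctl --verify"],
    "ostree": ["ostree fsck"],
}


def verify_commands(enabled_plugins, verify):
    """Return the extra commands the given plugins queue when verify is on."""
    if not verify:
        return []
    cmds = []
    for name in enabled_plugins:
        cmds.extend(VERIFY_COMMANDS.get(name, []))
    return cmds


# With verify enabled, the ostree and systemd plugins add their checks.
print(verify_commands(["ostree", "systemd"], verify=True))
# With verify disabled, nothing extra is queued.
print(verify_commands(["ostree", "systemd"], verify=False))
```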

Comment 1 Pavel Moravec 2023-01-17 18:41:15 UTC
Nice investigation. Do I get it right that the autofs / dpkg / flatpak / systemd "verify commands" are not required as default behaviour on OCP?

Keep in mind that the --verify option also runs "rpm -V <pkglist>" whenever the verify_packages variable is set by a plugin, which is the case for:

$ grep " verify_packages " sos/report/plugins/*py
sos/report/plugins/block.py:    verify_packages = ('util-linux',)
sos/report/plugins/convert2rhel.py:    verify_packages = ('convert2rhel$',)
sos/report/plugins/java.py:    verify_packages = ('java.*',)
sos/report/plugins/kernel.py:    verify_packages = ('kernel$',)
sos/report/plugins/nss.py:    verify_packages = ('nss.*',)
sos/report/plugins/openssl.py:    verify_packages = ('openssl.*',)
sos/report/plugins/pam.py:    verify_packages = ('pam_.*',)
sos/report/plugins/perl.py:    verify_packages = ('perl.*',)
sos/report/plugins/rpm.py:    verify_packages = ('rpm',)
sos/report/plugins/system.py:    verify_packages = ('glibc', 'initscripts', 'zlib')
$
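
A rough sketch of how such patterns could turn into a single "rpm -V" call. It assumes the patterns are treated as regular expressions anchored at the start of the package name (Python's `re.match`); the `packages_to_verify` helper and the sample package list are made up for illustration and are not sos code.

```python
import re


def packages_to_verify(patterns, installed):
    """Return installed package names matching any pattern, in pattern order."""
    matched = []
    for pattern in patterns:
        regex = re.compile(pattern)
        for name in installed:
            # re.match anchors at the start of the name only (assumption).
            if regex.match(name) and name not in matched:
                matched.append(name)
    return matched


# Made-up package list for the example.
installed = ["kernel", "kernel-core", "nss", "nss-util", "openssl", "bash"]
selected = packages_to_verify(("kernel$", "nss.*"), installed)
command = "rpm -V " + " ".join(selected)
print(command)
```

Under that assumption, the trailing "$" in 'kernel$' keeps "kernel-core" out of the list, while 'nss.*' picks up every package whose name starts with "nss".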

I see three possible implementations:
1) remove --verify from RHOCP preset completely. This will a) stop calling the "rpm -V .." checks (is that intentional?) and b) stop calling "ostree fsck", from default behaviour
2) introduce a new ostree plugin option "fsck", disabled by default (enabled via "-k ostree.fsck=yes" / "--plugin-option ostree.fsck=yes") and call "ostree fsck" when *this* option is enabled - then "rpm -V .." checks would be still executed by default
3) do both 1) and 2), such that a) "rpm -V .." are not further run by default, b) "ostree fsck" is disabled by default, and c) the fsck is enabled by more appropriate(?) plugin option, not the generic one
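
A minimal standalone sketch of what option 3 would mean for the ostree plugin. The `OstreePluginSketch` class is a hypothetical stand-in and does not use the real sos Plugin API; only the option name "fsck" and the command "ostree fsck" come from the proposal above.

```python
class OstreePluginSketch:
    """Illustrative stand-in for the ostree plugin; not the real sos API."""

    def __init__(self, plugin_options=None, verify=False):
        # "fsck" is the proposed plugin option; it defaults to disabled.
        self.options = {"fsck": False}
        self.options.update(plugin_options or {})
        # Global --verify flag, kept only to show that it is no longer consulted.
        self.verify = verify
        self.commands = []

    def setup(self):
        # Only the dedicated plugin option triggers the repository check.
        if self.options["fsck"]:
            self.commands.append("ostree fsck")
        return self.commands


# --verify alone no longer queues the check ...
print(OstreePluginSketch(verify=True).setup())
# ... while the equivalent of "-k ostree.fsck=yes" does.
print(OstreePluginSketch({"fsck": True}).setup())
```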


Please let us know what option is preferable.

Comment 2 Andreas Bleischwitz 2023-01-18 07:29:40 UTC
Hi Pavel,

I'm the last one to decide, but as for "ostree fsck", it shouldn't be started, as it has triggered some false alarms. I don't think it will do any harm, but given that it puts quite some I/O load on the node, it will impact running workloads.
Unless there are well-known reasons (which I don't know about), we shouldn't do such verification steps during support-data collection.

IMHO option 3 would be the most versatile option.

Comment 3 Pavel Moravec 2023-01-18 07:52:20 UTC
(In reply to Pavel Moravec from comment #1)
> I see three possible implementations:
> 1) remove --verify from RHOCP preset completely. This will a) stop calling
> the "rpm -V .." checks (is that intentional?) and b) stop calling "ostree
> fsck", from default behaviour
> 2) introduce a new ostree plugin option "fsck", disabled by default (enabled
> via "-k ostree.fsck=yes" / "--plugin-option ostree.fsck=yes") and call
> "ostree fsck" when *this* option is enabled - then "rpm -V .." checks would
> be still executed by default
> 3) do both 1) and 2), such that a) "rpm -V .." are not further run by
> default, b) "ostree fsck" is disabled by default, and c) the fsck is enabled
> by more appropriate(?) plugin option, not the generic one

As I understand that a change in default behaviour needs a broader consensus, I will wait some time for Derrick's opinion. With no feedback in a few weeks, I will propose a PR for the 3rd option (which sounds the best to me as well).

Comment 4 Derrick Ornelas 2023-01-18 19:49:25 UTC
(In reply to Pavel Moravec from comment #1)
> Nice investigation. Do I get it right that the autofs / dpkg / flatpak /
> systemd "verify commands" are not required as default behaviour on OCP?
> 

Correct, they are not needed/wanted for OCP. The autofs, dpkg (not in RHEL), and flatpak packages are not installed by default on RHEL CoreOS anyway. In most situations, I don't know that 'journalctl --verify' has much value either.


> Keep in mind that the --verify option also runs "rpm -V <pkglist>" whenever
> the verify_packages variable is set by a plugin, which is the case for:
> 
> $ grep " verify_packages " sos/report/plugins/*py
> sos/report/plugins/block.py:    verify_packages = ('util-linux',)
> sos/report/plugins/convert2rhel.py:    verify_packages = ('convert2rhel$',)
> sos/report/plugins/java.py:    verify_packages = ('java.*',)
> sos/report/plugins/kernel.py:    verify_packages = ('kernel$',)
> sos/report/plugins/nss.py:    verify_packages = ('nss.*',)
> sos/report/plugins/openssl.py:    verify_packages = ('openssl.*',)
> sos/report/plugins/pam.py:    verify_packages = ('pam_.*',)
> sos/report/plugins/perl.py:    verify_packages = ('perl.*',)
> sos/report/plugins/rpm.py:    verify_packages = ('rpm',)
> sos/report/plugins/system.py:    verify_packages = ('glibc', 'initscripts',
> 'zlib')
> $
> 

I think if we had concerns about the integrity of packages on a node, then we would direct the user to include the 'rpm.rpmva=on' option instead of checking this very small subset of packages with '--verify'. 


> I see three possible implementations:
> 1) remove --verify from RHOCP preset completely. This will a) stop calling
> the "rpm -V .." checks (is that intentional?) and b) stop calling "ostree
> fsck", from default behaviour
> 2) introduce a new ostree plugin option "fsck", disabled by default (enabled
> via "-k ostree.fsck=yes" / "--plugin-option ostree.fsck=yes") and call
> "ostree fsck" when *this* option is enabled - then "rpm -V .." checks would
> be still executed by default
> 3) do both 1) and 2), such that a) "rpm -V .." are not further run by
> default, b) "ostree fsck" is disabled by default, and c) the fsck is enabled
> by more appropriate(?) plugin option, not the generic one
> 
> 
> Please let us know what option is preferable.

Thanks for compiling this list. I like option 3. 

I would like to hear from @walters on how useful and/or hazardous 'ostree fsck' might be for everyday support use. I wonder if this is generally safe, something to avoid except in extreme circumstances, or maybe somewhere in between?

Comment 5 Pavel Moravec 2023-02-27 11:16:59 UTC
I raised https://github.com/sosreport/sos/pull/3147 upstream.

Would you be able to verify the fix once we have a downstream build candidate ready (a matter of a week)?

Comment 6 Derrick Ornelas 2023-02-27 15:54:26 UTC
Yes, I can test the RHEL sos build.

Comment 21 Pavel Moravec 2023-04-21 18:11:28 UTC
This bug has been fixed by errata https://access.redhat.com/errata/RHBA-2023:1571 .

Comment 22 Red Hat Bugzilla 2023-09-19 04:32:33 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days