For bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1881994 we need kernel and other logs from the host to determine whether there are hardware errors, along with other information about the environment. Specifically, we should include at least: dmesg, /var/log/messages, /var/log/sysstat, and journalctl output.
@jdurgin the paths /var/log/messages and /var/log/sysstat do not exist when we tried looking into the node. Here's what I did. My worker node name was ip-10-0-161-11.ec2.internal:

oc debug nodes/ip-10-0-161-11.ec2.internal
chroot /host
cd /var/log

cd messages
cd: no such file or directory: messages

cd sysstat
cd: no such file or directory: sysstat

Am I missing something here, or am I looking at the wrong location?
(In reply to RAJAT SINGH from comment #4)
> @jdurgin the paths /var/log/messages and /var/log/sysstat do not
> exist when we tried looking into the node. Here's what I did.
> My worker node name was ip-10-0-161-11.ec2.internal
>
> oc debug nodes/ip-10-0-161-11.ec2.internal
> chroot /host
> cd /var/log
>
> cd messages
> cd: no such file or directory: messages
>
> cd sysstat
> cd: no such file or directory: sysstat
>
> Am I missing something here, or am I looking at the wrong location?

The presence of those varies based on distro version and configuration. Your approach of fetching all of /var/log in the PR will capture the relevant info. Thanks for checking!
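For reference, a quick way to check which of these log files a given node actually has is to list /var/log through a debug pod. This is a minimal sketch using the node name from the comment above, not a prescribed procedure:
```
# List the host's /var/log via a debug pod to see which log files this
# node actually has (contents vary by distro version and configuration).
NODE=ip-10-0-161-11.ec2.internal   # node name taken from the comment above
oc debug node/${NODE} -- chroot /host ls -l /var/log
```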
PR: https://github.com/openshift/ocs-operator/pull/893
Hi @jdurgin, when we copy the /var/log directory locally, it takes a lot of time since it is too big. We currently have a 5m timeout, but sometimes the rsync might take too long because the logs are huge and no files will be transferred. So we either have to increase the timeout or deal with no logs.
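To make the timeout discussion concrete, the kind of bounded copy being described could look roughly like the sketch below. DEBUG_POD is a placeholder for an already-running debug pod on the target node; the actual must-gather script may structure this differently.
```
# Minimal sketch only; DEBUG_POD is a placeholder for a running debug pod
# on the node whose logs we want, and the real gather script may differ.
DEBUG_POD=ip-10-0-161-11ec2internal-debug
# GNU coreutils `timeout` bounds how long the transfer may run; 5m matches
# the current limit under discussion and can be raised if needed.
timeout 5m oc rsync ${DEBUG_POD}:/host/var/log ./node-var-log/
```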
(In reply to RAJAT SINGH from comment #7)
> Hi @jdurgin, when we copy the /var/log directory locally, it takes a lot
> of time since it is too big. We currently have a 5m timeout, but sometimes
> the rsync might take too long because the logs are huge and no files will be
> transferred. So we either have to increase the timeout or deal with no logs.

The log data is important, please increase the timeout. 5m is very short, I'd suggest 24 hours as a conservative upper bound, if you must have a timeout.

The impact of lack of logs is much worse than taking the time needed to collect them.
(In reply to Josh Durgin from comment #8)
> The log data is important, please increase the timeout. 5m is very short,
> I'd suggest 24 hours as a conservative upper bound, if you must have a
> timeout.
>
> The impact of lack of logs is much worse than taking the time needed to
> collect them.

This is in contradiction to another requirement. Part of the goal of this work is a refactor to try and minimize the amount of time an OCS must-gather will run. In large OCS clusters, do we really want one debug pod per node running for up to 24 hours transferring potentially gigabytes of data? Or worse, do we want to wait up to 24 hours for a Pod on a node that fails and has gone unresponsive?
(In reply to Jose A. Rivera from comment #9)
> (In reply to Josh Durgin from comment #8)
> > The log data is important, please increase the timeout. 5m is very short,
> > I'd suggest 24 hours as a conservative upper bound, if you must have a
> > timeout.
> >
> > The impact of lack of logs is much worse than taking the time needed to
> > collect them.
>
> This is in contradiction to another requirement. Part of the goal of this
> work is a refactor to try and minimize the amount of time an OCS must-gather
> will run. In large OCS clusters, do we really want one debug pod per node
> running for up to 24 hours transferring potentially gigabytes of data? Or
> worse, do we want to wait up to 24 hours for a Pod on a node that fails and
> has gone unresponsive?

If must-gather is not gathering the necessary data, how useful is it? The exact timeout is debatable, but in customer environments (and even tests) we can have GBs of logs we need to capture. We do this with sosreports today. Compression can also be done to save space and time; this is very effective for log data.

I'd expect for large clusters you'd want to capture targeted information about a subset of nodes, not necessarily requiring logs from the whole cluster.
(In reply to Josh Durgin from comment #10)
> (In reply to Jose A. Rivera from comment #9)
> > (In reply to Josh Durgin from comment #8)
>
> If must-gather is not gathering the necessary data, how useful is it? The
> exact timeout is debatable, but in customer environments (and even tests) we
> can have GBs of logs we need to capture. We do this with sosreports today.
> Compression can also be done to save space and time; this is very effective
> for log data.
>
> I'd expect for large clusters you'd want to capture targeted information
> about a subset of nodes, not necessarily requiring logs from the whole
> cluster.

The must-gather has no configurability, it's just an image that you run. We don't collect data on every node in the OCP cluster, but we do on every node in the OCS cluster. And we only have the tools the oc command provides for us, so I don't know if that does any compression or not.

Regardless, I'm not an official stakeholder here, so someone else will have to chime in on whether this is ultimately acceptable or not.
(In reply to Jose A. Rivera from comment #11)
> Regardless, I'm not an official stakeholder here, so someone else will have
> to chime in on whether this is ultimately acceptable or not.

This needs to be escalated to product management on both sides then. If we cannot debug an issue due to lack of logging, then customer experience and support delivery will suffer badly.
@Josh Durgin, please take a look at the data that Ashish linked above and see if it makes sense to add this functionality; then we can discuss it further.
True, it's unclear: the journal file is very large, and while it might be useful, it may not be significant enough to justify its size and the time it takes to copy.
Can we revisit this and check the feasibility? Otherwise, we can close this as WONTFIX, since it has been dragging on across releases.
Sure, Mudit. I will take a look into it.
I will be working on this and will soon provide the statistics.
After some research, I see we have an option to compress the journalctl logs that we collect, which will reduce the size:
```
sh-4.4# journalctl --since "2 days ago" --root /host | wc -c
27828015

sh-4.4# journalctl --since "2 days ago" --root /host | gzip | wc -c
3283156
```
I shall come up with a PR collecting the journalctl logs soon.
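For illustration, per-node collection of a compressed journal could be wired up roughly as below. This is a minimal sketch, not necessarily what the PR implements; NODE is a placeholder, and it assumes the status messages printed by `oc debug` go to stderr so stdout carries only the journal text.
```
# Dump the last two days of the host journal via a debug pod and compress
# it locally; sketch only, the PR may do this differently.
NODE=ip-10-0-161-11.ec2.internal
oc debug node/${NODE} -- chroot /host journalctl --since "2 days ago" --no-pager \
  | gzip > journal_${NODE}.gz
```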
Can't be fixed before dev freeze and not a blocker.
(In reply to yati padia from comment #32)
> After some research, I see we have an option to compress the journalctl
> logs that we collect, which will reduce the size:
> ```
> sh-4.4# journalctl --since "2 days ago" --root /host | wc -c
> 27828015
>
> sh-4.4# journalctl --since "2 days ago" --root /host | gzip | wc -c
> 3283156
> ```
> I shall come up with a PR collecting the journalctl logs soon.

Any updates? Are the GitHub links above still relevant?
I am still working on this. We have a plan to collect the journal logs as mentioned above. Sorry for the delay; I will try to get this in for this release. As for the links above, I think those issues are closed and no longer relevant. We may reopen them if we want, but I think this bug is enough to track the issue.
Updated the bug with PR link.
PR is still under review, no plans to backport
Was there any action on the plan Bipin mentioned above? The PR seems to always take 2 days of journald logs (it would be helpful to have as much as feasible, perhaps a week, if it does not consume too much space), with no equivalent of SAR data.

> 1) Figure out the average size of journalctl logs for 1 and 2 days on an active node. Before that you might have to figure out how to collect the journalctl log by passing a timestamp.
> This will help us to understand by what size must-gather will increase if we add journalctl logs for a day or two.
> 2) Figure out what the SAR equivalent is in RHCOS. If we don't get anything we need to check with the shift-networking team for this. We must have something which will give us similar data.
> Once we know what needs to be added for this, we will have to again calculate the size for it to understand the overall increase in must-gather when we include it.
@jdurgin as per the discussion with Bipin, the plan was to get the logs for 2 days; the size for one week would be considerably larger. Also, we plan to compress the logs into gz files, which reduces the size to a great extent. As of now, we have no plans to collect the SAR data.
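As a side note, journalctl's time filtering also accepts explicit timestamps and an upper bound, so the collection window does not have to be fixed at two days. A hedged sketch (the flags are standard journalctl options; the dates are arbitrary examples, not values from the PR):
```
# Collect a specific time window of the host journal and compress it.
# The dates are arbitrary examples, not values used by the PR.
journalctl --root /host --since "2022-12-01 00:00:00" --until "2022-12-08 00:00:00" \
  --no-pager | gzip > journal_window.gz
```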
Please review the PR so that we can move ahead with this bug: https://github.com/red-hat-storage/ocs-operator/pull/1709
*** Bug 2111375 has been marked as a duplicate of this bug. ***
Oded was not able to see the kernel and journal logs. I am currently debugging the issue. Hence, moving it to the assigned state for now.
journalctl files do not exist.

SetUp:
ODF Version: 4.12.0-130
OCP Version: 4.12.0-0.nightly-2022-12-08-093940
Provider: VMware

Test Process:
1. Collect MG 4.12:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.12
2. Check files in the mg dir:
$ find -name "*jou*"
$ find -name "*kern*"
$
Bug fixed on the private image, but it is not fixed on the official image [quay.io/rhceph-dev/ocs-must-gather:latest-4.12].

$ oc adm must-gather --image=docker.io/yati1998/ocs-must-gather:testlog

[odedviner@fedora yati]$ tree | grep -i jou
│   │   ├── journal_compute-0
│   │   │   └── journal_compute-0.gz
│   │   ├── journal_compute-1
│   │   │   └── journal_compute-1.gz
│   │   ├── journal_compute-2
│   │   │   └── journal_compute-2.gz
[odedviner@fedora yati]$ tree | grep -i kern
│   │   ├── kernel_compute-0
│   │   │   └── kernel_compute-0.gz
│   │   ├── kernel_compute-1
│   │   │   └── kernel_compute-1.gz
│   │   ├── kernel_compute-2
│   │   │   └── kernel_compute-2.gz
[odedviner@fedora yati]$
journal and kernel files do not exist.

SetUp:
ODF Version: 4.12.0-130
OCP Version: 4.12.0-0.nightly-2022-12-08-093940
Provider: VMware

Deleted the old ocs-must-gather image from the master node:
$ oc debug node/control-plane-2
sh-4.4# chroot /host
sh-4.4# bash
[root@control-plane-2 /]# podman images | grep must
quay.io/rhceph-dev/ocs-must-gather   latest-4.12   1c3ac1913734   2 months ago   393 MB
[root@control-plane-2 /]# podman rmi quay.io/rhceph-dev/ocs-must-gather:latest-4.12 -f
Untagged: quay.io/rhceph-dev/ocs-must-gather:latest-4.12
Deleted: 1c3ac19137345251e1c608a6ca8603f40417c3a16c0cd9f6772341c4bf075c8f
[root@control-plane-2 /]# podman images | grep must
[root@control-plane-2 /]#

Run MG cmd:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.12

Check content:
[odedviner@fedora must-gather.local.2746887068348741777]$ tree | grep -i kern
[odedviner@fedora must-gather.local.2746887068348741777]$ tree | grep -i jou
[odedviner@fedora must-gather.local.2746887068348741777]$
Late in the cycle
Bug fixed.

SetUp:
ODF Version: 4.12.0-152
OCP Version: 4.12.0-0.nightly-2022-12-27-111646
Provider: AWS

Test Process:
1. Run mg:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.12
2. Check mg dir content:
[DIR] journal_ip-10-0-147-196.us-east-2.compute.internal/   2023-01-02 11:52   -
[DIR] journal_ip-10-0-177-31.us-east-2.compute.internal/    2023-01-02 11:52   -
[DIR] journal_ip-10-0-198-130.us-east-2.compute.internal/   2023-01-02 11:52   -
[DIR] kernel_ip-10-0-147-196.us-east-2.compute.internal/    2023-01-02 11:52   -
[DIR] kernel_ip-10-0-177-31.us-east-2.compute.internal/     2023-01-02 11:52   -
[DIR] kernel_ip-10-0-198-130.us-east-2.compute.internal/    2023-01-02 11:52   -

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-222ai3c333-t1/j-222ai3c333-t1_20230102T074505/logs/failed_testcase_ocs_logs_1672649331/test_validate_ceph_config_values_in_rook_config_override_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-b9dc9da44e281efdb293d80324c822377d06dc5b29289ebb31c8b6db37305363/ceph/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0551