For bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1881994 we need kernel and other logs from the host to determine whether there are hardware errors, along with other information about the environment. Specifically, we should include at least: dmesg, /var/log/messages, /var/log/sysstat, and journalctl output.
@jdurgin the paths /var/log/messages and /var/log/sysstat do not exist when we tried looking into the node. Here's what I did. My worker node name was ip-10-0-161-11.ec2.internal:

oc debug nodes/ip-10-0-161-11.ec2.internal
chroot /host
cd /var/log

cd messages
cd: no such file or directory: messages

cd sysstat
cd: no such file or directory: sysstat

Am I missing something here, or am I looking at the wrong location?
(In reply to RAJAT SINGH from comment #4)
> @jdurgin the paths /var/log/messages and /var/log/sysstat do not
> exist when we tried looking into the node. Here's what I did.
> My worker node name was ip-10-0-161-11.ec2.internal
>
> oc debug nodes/ip-10-0-161-11.ec2.internal
> chroot /host
> cd /var/log
>
> cd messages
> cd: no such file or directory: messages
>
> cd sysstat
> cd: no such file or directory: sysstat
>
> Am I missing something here, or am I looking at the wrong location?

The presence of those varies based on distro version and configuration. Your approach of fetching all of /var/log in the PR will capture the relevant info. Thanks for checking!
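For reference, a quick way to check which of these log files a given node actually has is to list /var/log through a debug pod. This is a minimal sketch using the node name from the comment above, not a prescribed procedure:
```
# List the host's /var/log via a debug pod to see which log files this
# node actually has (contents vary by distro version and configuration).
NODE=ip-10-0-161-11.ec2.internal   # node name taken from the comment above
oc debug node/${NODE} -- chroot /host ls -l /var/log
```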
PR: https://github.com/openshift/ocs-operator/pull/893
Hi @jdurgin, when we copy the /var/log directory locally, it takes a lot of time since it is too big. We currently have a 5m timeout, but sometimes the rsync might take too long because the logs are huge and no files will be transferred. So we either have to increase the timeout or deal with no logs.
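To make the timeout discussion concrete, the kind of bounded copy being described could look roughly like the sketch below. DEBUG_POD is a placeholder for an already-running debug pod on the target node; the actual must-gather script may structure this differently.
```
# Minimal sketch only; DEBUG_POD is a placeholder for a running debug pod
# on the node whose logs we want, and the real gather script may differ.
DEBUG_POD=ip-10-0-161-11ec2internal-debug
# GNU coreutils `timeout` bounds how long the transfer may run; 5m matches
# the current limit under discussion and can be raised if needed.
timeout 5m oc rsync ${DEBUG_POD}:/host/var/log ./node-var-log/
```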
(In reply to RAJAT SINGH from comment #7)
> Hi @jdurgin, when we copy the /var/log directory locally, it takes a lot
> of time since it is too big. We currently have a 5m timeout, but sometimes
> the rsync might take too long because the logs are huge and no files will be
> transferred. So we either have to increase the timeout or deal with no logs.

The log data is important, please increase the timeout. 5m is very short, I'd suggest 24 hours as a conservative upper bound, if you must have a timeout.

The impact of lack of logs is much worse than taking the time needed to collect them.
(In reply to Josh Durgin from comment #8)
> The log data is important, please increase the timeout. 5m is very short,
> I'd suggest 24 hours as a conservative upper bound, if you must have a
> timeout.
>
> The impact of lack of logs is much worse than taking the time needed to
> collect them.

This is in contradiction to another requirement. Part of the goal of this work is a refactor to try and minimize the amount of time an OCS must-gather will run. In large OCS clusters, do we really want one debug pod per node running for up to 24 hours transferring potentially gigabytes of data? Or worse, do we want to wait up to 24 hours for a Pod on a node that fails and has gone unresponsive?
(In reply to Jose A. Rivera from comment #9)
> (In reply to Josh Durgin from comment #8)
> > The log data is important, please increase the timeout. 5m is very short,
> > I'd suggest 24 hours as a conservative upper bound, if you must have a
> > timeout.
> >
> > The impact of lack of logs is much worse than taking the time needed to
> > collect them.
>
> This is in contradiction to another requirement. Part of the goal of this
> work is a refactor to try and minimize the amount of time an OCS must-gather
> will run. In large OCS clusters, do we really want one debug pod per node
> running for up to 24 hours transferring potentially gigabytes of data? Or
> worse, do we want to wait up to 24 hours for a Pod on a node that fails and
> has gone unresponsive?

If must-gather is not gathering the necessary data, how useful is it? The exact timeout is debatable, but in customer environments (and even tests) we can have GBs of logs we need to capture. We do this with sosreports today. Compression can also be done to save space and time; this is very effective for log data.

I'd expect for large clusters you'd want to capture targeted information about a subset of nodes, not necessarily requiring logs from the whole cluster.
(In reply to Josh Durgin from comment #10)
> (In reply to Jose A. Rivera from comment #9)
> > (In reply to Josh Durgin from comment #8)
>
> If must-gather is not gathering the necessary data, how useful is it? The
> exact timeout is debatable, but in customer environments (and even tests) we
> can have GBs of logs we need to capture. We do this with sosreports today.
> Compression can also be done to save space and time; this is very effective
> for log data.
>
> I'd expect for large clusters you'd want to capture targeted information
> about a subset of nodes, not necessarily requiring logs from the whole
> cluster.

The must-gather has no configurability, it's just an image that you run. We don't collect data on every node in the OCP cluster, but we do on every node in the OCS cluster. And we only have the tools the oc command provides for us, so I don't know if that does any compression or not.

Regardless, I'm not an official stakeholder here, so someone else will have to chime in on whether this is ultimately acceptable or not.
(In reply to Jose A. Rivera from comment #11)
> Regardless, I'm not an official stakeholder here, so someone else will have
> to chime in on whether this is ultimately acceptable or not.

This needs to be escalated to product management on both sides then. If we cannot debug an issue due to lack of logging, then customer experience and support delivery will suffer badly.
@Josh Durgin, please take a look at the data that Ashish linked above and see if it makes sense to add this functionality; then we can discuss it further.
True, it's unclear: the journal file is very large, and while it might be useful, it may not be significant enough to justify its size and the time it takes to copy.
Can we revisit this and check the feasibility? Otherwise, we can close this as WONTFIX, since it has been dragging on across releases.
Sure, Mudit. I will take a look into it.
I will be working on this and will soon provide the statistics.
After some research, I see we have an option to compress the journalctl logs that we collect, which will reduce the size:
```
sh-4.4# journalctl --since "2 days ago" --root /host | wc -c
27828015

sh-4.4# journalctl --since "2 days ago" --root /host | gzip | wc -c
3283156
```
I shall come up with a PR collecting the journalctl logs soon.
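For illustration, per-node collection of a compressed journal could be wired up roughly as below. This is a minimal sketch, not necessarily what the PR implements; NODE is a placeholder, and it assumes the status messages printed by `oc debug` go to stderr so stdout carries only the journal text.
```
# Dump the last two days of the host journal via a debug pod and compress
# it locally; sketch only, the PR may do this differently.
NODE=ip-10-0-161-11.ec2.internal
oc debug node/${NODE} -- chroot /host journalctl --since "2 days ago" --no-pager \
  | gzip > journal_${NODE}.gz
```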
Can't be fixed before dev freeze and not a blocker.
(In reply to yati padia from comment #32)
> After some research, I see we have an option to compress the journalctl
> logs that we collect, which will reduce the size:
> ```
> sh-4.4# journalctl --since "2 days ago" --root /host | wc -c
> 27828015
>
> sh-4.4# journalctl --since "2 days ago" --root /host | gzip | wc -c
> 3283156
> ```
> I shall come up with a PR collecting the journalctl logs soon.

Any updates? Are the GitHub links above still relevant?
I am still working on this. We have a plan to collect the journal logs as mentioned above. Sorry for the delay; I will try to get this in for this release. As for the links above, I think those issues are closed and no longer relevant. We may reopen them if we want, but I think this bug is enough to track the issue.
Updated the bug with PR link.
PR is still under review, no plans to backport
Was there any action on the plan Bipin mentioned above? The PR seems to always take 2 days of journald logs (it would be helpful to have as much as feasible, perhaps a week, if it does not consume too much space), with no equivalent of SAR data.

> 1) Figure out the average size of journalctl logs for 1 and 2 days on an active node. Before that you might have to figure out how to collect the journalctl log by passing a timestamp.
> This will help us to understand by what size must-gather will increase if we add journalctl logs for a day or two.
> 2) Figure out what the SAR equivalent is in RHCOS. If we don't get anything we need to check with the shift-networking team for this. We must have something which will give us similar data.
> Once we know what needs to be added for this, we will have to again calculate the size for it to understand the overall increase in must-gather when we include it.
@jdurgin as per the discussion with Bipin, the plan was to get the logs for 2 days; the size for one week would be considerably larger. Also, we plan to compress the logs into gz files, which reduces the size to a great extent. As of now, we have no plans to collect the SAR data.
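As a side note, journalctl's time filtering also accepts explicit timestamps and an upper bound, so the collection window does not have to be fixed at two days. A hedged sketch (the flags are standard journalctl options; the dates are arbitrary examples, not values from the PR):
```
# Collect a specific time window of the host journal and compress it.
# The dates are arbitrary examples, not values used by the PR.
journalctl --root /host --since "2022-12-01 00:00:00" --until "2022-12-08 00:00:00" \
  --no-pager | gzip > journal_window.gz
```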
Please review the PR so that we can move ahead with this bug: https://github.com/red-hat-storage/ocs-operator/pull/1709
*** Bug 2111375 has been marked as a duplicate of this bug. ***
Oded was not able to see the kernel and journal logs. I am currently debugging the issue. Hence, moving it to the assigned state for now.
journalctl files do not exist.

SetUp:
ODF Version: 4.12.0-130
OCP Version: 4.12.0-0.nightly-2022-12-08-093940
Provider: VMware

Test Process:
1. Collect MG 4.12:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.12
2. Check files in the mg dir:
$ find -name "*jou*"
$ find -name "*kern*"
$
Bug fixed on the private image, but it is not fixed on the official image [quay.io/rhceph-dev/ocs-must-gather:latest-4.12].

$ oc adm must-gather --image=docker.io/yati1998/ocs-must-gather:testlog

[odedviner@fedora yati]$ tree | grep -i jou
│   │   ├── journal_compute-0
│   │   │   └── journal_compute-0.gz
│   │   ├── journal_compute-1
│   │   │   └── journal_compute-1.gz
│   │   ├── journal_compute-2
│   │   │   └── journal_compute-2.gz
[odedviner@fedora yati]$ tree | grep -i kern
│   │   ├── kernel_compute-0
│   │   │   └── kernel_compute-0.gz
│   │   ├── kernel_compute-1
│   │   │   └── kernel_compute-1.gz
│   │   ├── kernel_compute-2
│   │   │   └── kernel_compute-2.gz
[odedviner@fedora yati]$
journal and kernel files do not exist.

SetUp:
ODF Version: 4.12.0-130
OCP Version: 4.12.0-0.nightly-2022-12-08-093940
Provider: VMware

Deleted the old ocs-must-gather image from the master node:
$ oc debug node/control-plane-2
sh-4.4# chroot /host
sh-4.4# bash
[root@control-plane-2 /]# podman images | grep must
quay.io/rhceph-dev/ocs-must-gather   latest-4.12   1c3ac1913734   2 months ago   393 MB
[root@control-plane-2 /]# podman rmi quay.io/rhceph-dev/ocs-must-gather:latest-4.12 -f
Untagged: quay.io/rhceph-dev/ocs-must-gather:latest-4.12
Deleted: 1c3ac19137345251e1c608a6ca8603f40417c3a16c0cd9f6772341c4bf075c8f
[root@control-plane-2 /]# podman images | grep must
[root@control-plane-2 /]#

Run MG cmd:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.12

Check content:
[odedviner@fedora must-gather.local.2746887068348741777]$ tree | grep -i kern
[odedviner@fedora must-gather.local.2746887068348741777]$ tree | grep -i jou
[odedviner@fedora must-gather.local.2746887068348741777]$
Late in the cycle
Bug fixed.

SetUp:
ODF Version: 4.12.0-152
OCP Version: 4.12.0-0.nightly-2022-12-27-111646
Provider: AWS

Test Process:
1. Run mg:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.12
2. Check mg dir content:
[DIR] journal_ip-10-0-147-196.us-east-2.compute.internal/   2023-01-02 11:52   -
[DIR] journal_ip-10-0-177-31.us-east-2.compute.internal/    2023-01-02 11:52   -
[DIR] journal_ip-10-0-198-130.us-east-2.compute.internal/   2023-01-02 11:52   -
[DIR] kernel_ip-10-0-147-196.us-east-2.compute.internal/    2023-01-02 11:52   -
[DIR] kernel_ip-10-0-177-31.us-east-2.compute.internal/     2023-01-02 11:52   -
[DIR] kernel_ip-10-0-198-130.us-east-2.compute.internal/    2023-01-02 11:52   -

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-222ai3c333-t1/j-222ai3c333-t1_20230102T074505/logs/failed_testcase_ocs_logs_1672649331/test_validate_ceph_config_values_in_rook_config_override_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-b9dc9da44e281efdb293d80324c822377d06dc5b29289ebb31c8b6db37305363/ceph/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0551