Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
On an OCS 4.4.1-rc2 cluster on VMware (converged mode), the ceph cluster was in HEALTH_ERR state (reported in Bug 1852839).

To collect OSD debug logs, the following setting was applied from the ceph side, but the change did not actually enable debug output in the OSD pod logs.

Rook issue: https://github.com/rook/rook/issues/5744

sh-4.4# ceph --debug-ms 1 config set osd debug_osd 30

The message trace shows it talking to mon.2 (which forwards to the leader) at IP 172.30.179.245; mon.a was the leader.

But no change was observed in the OSD pod logs (the last update was still from well before the debug_osd level was changed):

2020-06-30 12:21:26.222886 I | 2020-06-30 12:21:26.221 7f9438664700 -1 log_channel(cluster) log [ERR] : 1.7 scrub : stat mismatch, got 2/2 objects, 0/0 clones, 2/2 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 116/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2020-06-30 12:21:26.222944 I | 2020-06-30 12:21:26.221 7f9438664700 -1 log_channel(cluster) log [ERR] : 1.7 scrub 1 errors

The setting is reflected on the ceph side
-----------------------------------------
sh-4.4# ceph config get osd.0 debug_osd
30/30
sh-4.4# ceph config get osd.1 debug_osd
30/30

Version of all relevant components (if applicable):
----------------------------------------------------------------------
======== CSV ========
NAME                           DISPLAY                      VERSION        REPLACES   PHASE
lib-bucket-provisioner.v1.0.0  lib-bucket-provisioner       1.0.0                     Succeeded
ocs-operator.v4.4.1-465.ci     OpenShift Container Storage  4.4.1-465.ci              Succeeded

======== OCP ========
OCP  = 4.4.0-0.nightly-2020-06-29-071755
RHCS = 4.1 (ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable))

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
If one needs to collect debug logs, the current rook settings do not enable debug logging for OSDs.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
The cluster was in HEALTH_OK state in the morning. Not sure if @Josh performed any repair; will add a needinfo for an update on this.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
Not sure

Can this issue be reproduced from the UI?
----------------------------------------------------------------------
No

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
No

Steps to Reproduce (not sure when or why the cluster reached this state; we do not have the exact steps to reproduce):
----------------------------------------------------------------------
1. Create an OCP 4.4-nightly and OCS 4.4.1-rc2 cluster.
2. Run tier1 on the cluster and upload an object to NooBaa.
3. It was found that at 2020-06-30 12:21:26.222944 (well after tier1 completion), 1 PG went into an inconsistent state and ceph health changed to ERR.
4. Try to collect the OSD debug log by setting the value in the ceph toolbox (an end-to-end sketch follows the Additional info section below):
   sh-4.4# ceph --debug-ms 1 config set osd debug_osd 30

Actual results:
----------------------------------------------------------------------
OSD logging is not enabled to write to stderr as with all other daemons.

Expected results:
----------------------------------------------------------------------
OSD logging should be enabled to write to stderr as with all other daemons.

Additional info
======================
Ceph status @ Jun 30 17:26:11 UTC, >5 hrs after the PG issue was reported in the osd.3 logs:

===== ceph status =====
Tue Jun 30 17:26:11 UTC 2020
  cluster:
    id:     ca6c02dd-1942-4f11-bd38-1a03979fc85e
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum a,b,c (age 26h)
    mgr: a(active, since 30h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 26h), 6 in (since 26h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
        mds.ocs-storagecluster-cephfilesystem-c: idle
        mds.ocs-storagecluster-cephfilesystem-d: idle

  data:
    pools:   11 pools, 400 pgs
    objects: 13.49k objects, 47 GiB
    usage:   104 GiB used, 2.9 TiB / 3.0 TiB avail
    pgs:     399 active+clean
             1   active+clean+inconsistent

  io:
    client:   852 B/s rd, 170 KiB/s wr, 2 op/s rd, 3 op/s wr
----------------------------------------------------------------------
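For reference, an end-to-end sketch of the failing log-collection attempt (a hedged example: the OSD pod name and the openshift-storage namespace are assumptions, adjust to your cluster):

# From the rook-ceph-tools (toolbox) pod: raise the OSD debug level
sh-4.4# ceph config set osd debug_osd 30
# Confirm the mon config database picked it up (expected output: 30/30)
sh-4.4# ceph config get osd.0 debug_osd
# From an OCP client: watch an OSD pod's log for new debug output
$ oc -n openshift-storage logs -f rook-ceph-osd-0-<hash>

With this bug present, the config get output changes but no new debug lines appear in the pod log, because the OSD is not writing its log to stderr.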
This is a small fix that will enable much better troubleshooting of OSDs; we should fix it for 4.5.
Upstream PR in review: https://github.com/rook/rook/pull/5770
@Travis, @Sebastien, can we get the exact steps to confirm the fix from the rook side? Is there a param we need to check in the OSD deployment as well?
Neha, the OSD should be quite verbose and you should see a couple of logging flags on the "osd" container command line: https://github.com/rook/rook/pull/5770/files#diff-f029e41aa8b63c1861252ff9b8add9d9R48-R54 Thanks!
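For anyone verifying, a minimal check of the deployed OSD container (assuming the openshift-storage namespace and an OSD deployment named rook-ceph-osd-0; both are assumptions, adjust to your cluster):

$ oc -n openshift-storage get deployment rook-ceph-osd-0 -o yaml | grep -e stderr -e log-to-file

With the fix in place, this should turn up the --log-to-stderr / --err-to-stderr / --mon-cluster-log-to-stderr / --log-stderr-prefix flags and the --default-*-log-to-file=false flags in the osd container args.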
On 4.5.0-61.ci, the osd pod spec now shows the default logging flags:

- --log-to-stderr=true
- --err-to-stderr=true
- --mon-cluster-log-to-stderr=true
- '--log-stderr-prefix=debug '
- --default-log-to-file=false
- --default-mon-cluster-log-to-file=false

And we see debug messages in the osd logs:

[tdesala@localhost vmware]$ oc logs rook-ceph-osd-0-67859769d-rp549 | grep -i debug
debug 2020-08-24 15:29:05.236 7f7b008cadc0  0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-08-24 15:29:05.236 7f7b008cadc0  0 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable), process ceph-osd, pid 78753
debug 2020-08-24 15:29:05.236 7f7b008cadc0  0 pidfile_write: ignore empty --pid-file
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bdev(0x55b61061e700 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bdev(0x55b61061e700 /var/lib/ceph/osd/ceph-0/block) open size 549755813888 (0x8000000000, 512 GiB) block_size 4096 (4 KiB) rotational discard supported
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bluestore(/var/lib/ceph/osd/ceph-0) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2

Earlier we only saw INFO messages in the osd pod logs; with these changes, debug messages are also being logged.

Is the above information sufficient for verifying this BZ? Please confirm.
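As an additional runtime check (beyond the startup messages above), one could confirm that a live config change now reaches the pod log, which is the scenario that originally failed. A sketch reusing the toolbox command from the original report (pod name as in the comment above; --since only narrows the output window):

# From the toolbox pod: bump the OSD debug level at runtime
sh-4.4# ceph config set osd debug_osd 20
# The OSD pod log should now grow with verbose per-subsystem output
$ oc -n openshift-storage logs --since=1m rook-ceph-osd-0-67859769d-rp549 | grep 'debug '
# Reset afterwards
sh-4.4# ceph config rm osd debug_osd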
It is sufficient. Thanks for validating; please move it to the VERIFIED state.
Moving this BZ to the Verified state based on comments 12 and 13.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754