Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
On an OCS 4.4.1-rc2 cluster on VMware (converged mode), the ceph cluster was in HEALTH_ERR state (reported in Bug 1852839).

To collect OSD debug logs, the following setting was applied from the ceph side, but the change did not actually enable debug output in the OSD pod logs.

Rook issue: https://github.com/rook/rook/issues/5744

sh-4.4# ceph --debug-ms 1 config set osd debug_osd 30

The message trace shows it talking to mon.2 (which forwards to the leader) at IP 172.30.179.245; mon.a was the leader.

But no change was observed in the OSD pod logs (the last update was still from well before the debug_osd level was changed):

2020-06-30 12:21:26.222886 I | 2020-06-30 12:21:26.221 7f9438664700 -1 log_channel(cluster) log [ERR] : 1.7 scrub : stat mismatch, got 2/2 objects, 0/0 clones, 2/2 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 116/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2020-06-30 12:21:26.222944 I | 2020-06-30 12:21:26.221 7f9438664700 -1 log_channel(cluster) log [ERR] : 1.7 scrub 1 errors

The setting is reflected on the ceph side
-----------------------------------------
sh-4.4# ceph config get osd.0 debug_osd
30/30
sh-4.4# ceph config get osd.1 debug_osd
30/30

Version of all relevant components (if applicable):
----------------------------------------------------------------------
======== CSV ========
NAME                           DISPLAY                      VERSION        REPLACES   PHASE
lib-bucket-provisioner.v1.0.0  lib-bucket-provisioner       1.0.0                     Succeeded
ocs-operator.v4.4.1-465.ci     OpenShift Container Storage  4.4.1-465.ci              Succeeded

======== OCP ========
OCP  = 4.4.0-0.nightly-2020-06-29-071755
RHCS = 4.1 (ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable))

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
If one needs to collect debug logs, the current rook settings do not enable debug logging for OSDs.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
The cluster was in HEALTH_OK state in the morning. Not sure if @Josh performed any repair; will add a needinfo for an update on this.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
Not sure

Can this issue be reproduced from the UI?
----------------------------------------------------------------------
No

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
No

Steps to Reproduce (not sure when or why the cluster reached this state; we do not have the exact steps to reproduce):
----------------------------------------------------------------------
1. Create an OCP 4.4-nightly and OCS 4.4.1-rc2 cluster.
2. Run tier1 on the cluster and upload an object to NooBaa.
3. It was found that at 2020-06-30 12:21:26.222944 (well after tier1 completion), 1 PG went into an inconsistent state and ceph health changed to ERR.
4. Try to collect the OSD debug log by setting the value in the ceph toolbox (an end-to-end sketch follows the Additional info section below):
   sh-4.4# ceph --debug-ms 1 config set osd debug_osd 30

Actual results:
----------------------------------------------------------------------
OSD logging is not enabled to write to stderr as with all other daemons.

Expected results:
----------------------------------------------------------------------
OSD logging should be enabled to write to stderr as with all other daemons.

Additional info
======================
Ceph status @ Jun 30 17:26:11 UTC, >5 hrs after the PG issue was reported in the osd.3 logs:

===== ceph status =====
Tue Jun 30 17:26:11 UTC 2020
  cluster:
    id:     ca6c02dd-1942-4f11-bd38-1a03979fc85e
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum a,b,c (age 26h)
    mgr: a(active, since 30h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 26h), 6 in (since 26h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
        mds.ocs-storagecluster-cephfilesystem-c: idle
        mds.ocs-storagecluster-cephfilesystem-d: idle

  data:
    pools:   11 pools, 400 pgs
    objects: 13.49k objects, 47 GiB
    usage:   104 GiB used, 2.9 TiB / 3.0 TiB avail
    pgs:     399 active+clean
             1   active+clean+inconsistent

  io:
    client:   852 B/s rd, 170 KiB/s wr, 2 op/s rd, 3 op/s wr
----------------------------------------------------------------------
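For reference, an end-to-end sketch of the failing log-collection attempt (a hedged example: the OSD pod name and the openshift-storage namespace are assumptions, adjust to your cluster):

# From the rook-ceph-tools (toolbox) pod: raise the OSD debug level
sh-4.4# ceph config set osd debug_osd 30
# Confirm the mon config database picked it up (expected output: 30/30)
sh-4.4# ceph config get osd.0 debug_osd
# From an OCP client: watch an OSD pod's log for new debug output
$ oc -n openshift-storage logs -f rook-ceph-osd-0-<hash>

With this bug present, the config get output changes but no new debug lines appear in the pod log, because the OSD is not writing its log to stderr.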
This is a small fix that will enable much better troubleshooting of OSDs; we should fix it for 4.5.
Upstream PR in review: https://github.com/rook/rook/pull/5770
@Travis, @Sebastien, can we get the exact steps to confirm the fix from the rook side? Is there a param we need to check in the OSD deployment as well?
Neha, the OSD should be quite verbose and you should see a couple of logging flags on the "osd" container command line: https://github.com/rook/rook/pull/5770/files#diff-f029e41aa8b63c1861252ff9b8add9d9R48-R54 Thanks!
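For anyone verifying, a minimal check of the deployed OSD container (assuming the openshift-storage namespace and an OSD deployment named rook-ceph-osd-0; both are assumptions, adjust to your cluster):

$ oc -n openshift-storage get deployment rook-ceph-osd-0 -o yaml | grep -e stderr -e log-to-file

With the fix in place, this should turn up the --log-to-stderr / --err-to-stderr / --mon-cluster-log-to-stderr / --log-stderr-prefix flags and the --default-*-log-to-file=false flags in the osd container args.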
On 4.5.0-61.ci, the osd pod spec now shows the default logging flags:

- --log-to-stderr=true
- --err-to-stderr=true
- --mon-cluster-log-to-stderr=true
- '--log-stderr-prefix=debug '
- --default-log-to-file=false
- --default-mon-cluster-log-to-file=false

And we see debug messages in the osd logs:

[tdesala@localhost vmware]$ oc logs rook-ceph-osd-0-67859769d-rp549 | grep -i debug
debug 2020-08-24 15:29:05.236 7f7b008cadc0  0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-08-24 15:29:05.236 7f7b008cadc0  0 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable), process ceph-osd, pid 78753
debug 2020-08-24 15:29:05.236 7f7b008cadc0  0 pidfile_write: ignore empty --pid-file
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bdev(0x55b61061e700 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bdev(0x55b61061e700 /var/lib/ceph/osd/ceph-0/block) open size 549755813888 (0x8000000000, 512 GiB) block_size 4096 (4 KiB) rotational discard supported
debug 2020-08-24 15:29:05.270 7f7b008cadc0  1 bluestore(/var/lib/ceph/osd/ceph-0) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2

Earlier we only saw INFO messages in the osd pod logs; with these changes, debug messages are also being logged.

Is the above information sufficient for verifying this BZ? Please confirm.
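As an additional runtime check (beyond the startup messages above), one could confirm that a live config change now reaches the pod log, which is the scenario that originally failed. A sketch reusing the toolbox command from the original report (pod name as in the comment above; --since only narrows the output window):

# From the toolbox pod: bump the OSD debug level at runtime
sh-4.4# ceph config set osd debug_osd 20
# The OSD pod log should now grow with verbose per-subsystem output
$ oc -n openshift-storage logs --since=1m rook-ceph-osd-0-67859769d-rp549 | grep 'debug '
# Reset afterwards
sh-4.4# ceph config rm osd debug_osd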
It is sufficient. Thanks for validating; please move it to the VERIFIED state.
Moving this BZ to the Verified state based on comments 12 and 13.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754