Bug 2136336 - [cee/sd][Cephadm] ceph mgr is filling up the log messages "Detected new or changed devices" for all OSD nodes every 30 min unnecessarily
Summary: [cee/sd][Cephadm] ceph mgr is filling up the log messages "Detected new or ch...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.1
Assignee: Adam King
QA Contact: Manisha Saini
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On: 2180567
Blocks: 2192813
 
Reported: 2022-10-20 01:57 UTC by Kritik Sachdeva
Modified: 2023-08-21 07:26 UTC
CC: 11 users

Fixed In Version: ceph-17.2.6-5.el9cp
Doc Type: Bug Fix
Doc Text:
.`Cephadm` now logs device information only when an actual change occurs
Previously, `cephadm` would compare all fields reported for OSDs to check for new or changed devices. However, one of these fields included a timestamp that differs every time. Due to this, `cephadm` would log that it 'Detected new or changed devices' every time it refreshed a host's devices, regardless of whether anything actually changed. With this fix, the comparison of device information against previous information no longer takes into account the timestamp fields that are expected to change constantly. `Cephadm` now logs only when there is an actual change in the devices.
Clone Of:
Environment:
Last Closed: 2023-06-15 09:15:43 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-5469 0 None None None 2022-10-20 02:12:06 UTC
Red Hat Product Errata RHSA-2023:3623 0 None None None 2023-06-15 09:16:48 UTC

Description Kritik Sachdeva 2022-10-20 01:57:45 UTC
Description of problem:

The active MGR log is filling up with the message "Detected new or changed devices" every 30 minutes, even when there is no activity or change observed in the cluster. The requirement is to log the above message only when there is an actual change on the node.

We looked at the code snippet that generates the log message, in mgr/cephadm/inventory.py:
~~~
    def devices_changed(self, host: str, b: List[inventory.Device]) -> bool:
        a = self.devices[host]
        if len(a) != len(b):
            return True
        aj = {d.path: d.to_json() for d in a}
        bj = {d.path: d.to_json() for d in b}
        if aj != bj:
            self.mgr.log.info("Detected new or changed devices on %s" % host)
            return True
        return False
~~~

While dumping the values of both the aj and bj variables, we observed that the only difference is in the "created" field, which differs by 30 minutes.
     - This 30-minute interval is how often the above function is called, and it comes from the configuration parameter mgr/cephadm/device_cache_timeout (see the example after the dump below).

~~~
aj: {'/dev/sdX': {'ceph_device': None, 'rejected_reasons': [], 'available': True, 'path': '/dev/sdX', 'sys_api': {'human_readable_size': '10.00 GB', 'locked': 0, 'model': 'QEMU HARDDISK', 'nr_requests': '256', 'partitions': {}, 'path': '/dev/sdX', 'removable': '0', 'rev': '2.5+', 'ro': '0', 'rotational': '1', 'sas_address': '', 'sas_device_handle': '', 'scheduler_mode': 'mq-deadline', 'sectors': 0, 'sectorsize': '512', 'size': 10737418240.0, 'support_discard': '4096', 'vendor': 'QEMU'}, 'created': '2022-10-20T01:07:32.604710Z', 'lvs': [], 'human_readable_type': 'hdd', 'device_id': 'QEMU_QEMU_HARDDISK_15bbXXXX-XXXX-40e8-XXXX-946a21dXXXXXX', 'lsm_data': {}}} 
-------------------------------------------------
 bj: {'/dev/sdX': {'ceph_device': None, 'rejected_reasons': [], 'available': True, 'path': '/dev/sdX', 'sys_api': {'human_readable_size': '10.00 GB', 'locked': 0, 'model': 'QEMU HARDDISK', 'nr_requests': '256', 'partitions': {}, 'path': '/dev/sdX', 'removable': '0', 'rev': '2.5+', 'ro': '0', 'rotational': '1', 'sas_address': '', 'sas_device_handle': '', 'scheduler_mode': 'mq-deadline', 'sectors': 0, 'sectorsize': '512', 'size': 10737418240.0, 'support_discard': '4096', 'vendor': 'QEMU'}, 'created': '2022-10-20T01:38:30.355023Z', 'lvs': [], 'human_readable_type': 'hdd', 'device_id': 'QEMU_QEMU_HARDDISK_15bbXXXX-XXXX-40e8-XXXX-946a21dXXXXX', 'lsm_data': {}}}
~~~
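
The 30-minute difference between the two "created" timestamps matches the device refresh interval driven by mgr/cephadm/device_cache_timeout (1800 seconds by default). As a rough illustration, the current value can be inspected, and temporarily changed for testing, with the standard ceph config commands; the 3600-second value below is only an example:

~~~
# Show the current device cache timeout in seconds (default 1800, i.e. 30 minutes)
ceph config get mgr mgr/cephadm/device_cache_timeout

# Example only: lengthen the refresh interval to one hour for testing
ceph config set mgr mgr/cephadm/device_cache_timeout 3600
~~~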

Additionally, looking further into the code, we observed that the value of the "created" field is set only once, at the following location (a sketch of the kind of fix this suggests follows the link):
- https://github.com/ceph/ceph/blob/main/src/python-common/ceph/deployment/inventory.py#L74-L77
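
For reference, a minimal sketch of the kind of change that would address this (illustrative only, not the actual upstream patch): strip the constantly changing timestamp fields, such as "created", from the device JSON before comparing the cached and freshly reported inventories.

~~~
# Illustrative sketch only -- not the actual cephadm patch.
from typing import Any, Dict, List

TIMESTAMP_FIELDS = ('created',)  # fields expected to differ on every refresh

def _json_without_timestamps(device: Any) -> Dict[str, Any]:
    j = device.to_json()
    for field in TIMESTAMP_FIELDS:
        j.pop(field, None)
    return j

def devices_changed(self, host: str, b: List['inventory.Device']) -> bool:
    a = self.devices[host]
    if len(a) != len(b):
        return True
    aj = {d.path: _json_without_timestamps(d) for d in a}
    bj = {d.path: _json_without_timestamps(d) for d in b}
    if aj != bj:
        # Only logged when something other than a timestamp actually differs
        self.mgr.log.info("Detected new or changed devices on %s" % host)
        return True
    return False
~~~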

Version-Release number of selected component (if applicable): RHCS 5.2


How reproducible: Always


Steps to Reproduce:
1. Deploy or upgrade to an RHCS 5 environment.
2. Monitor the active MGR log for the message "Detected new or changed devices" (for example, using the commands shown below).
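
For example, the message can be watched on the cephadm cluster-log channel (per comment 6, it is emitted at info level):

~~~
# Watch the cephadm cluster log channel live
ceph -W cephadm

# Or list recent info-level entries from that channel
ceph log last 100 info cephadm
~~~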

Actual results: The "Detected new or changed devices" log message keeps filling the log even though there is no actual change.


Expected results: Log the above message only if there is an actual change on the node.

Comment 5 Matthew Secaur 2023-01-18 14:37:50 UTC
While waiting for a patch for Ceph 5.2, is there any way our customers can silence this warning message so that their logs/prometheus are not inundated with these messages?

Comment 6 Adam King 2023-01-18 17:20:57 UTC
(In reply to Matthew Secaur from comment #5)
> While waiting for a patch for Ceph 5.2, is there any way our customers can
> silence this warning message so that their logs/prometheus are not inundated
> with these messages?

These logs are info level, so I think you could just set the cephadm log-to-cluster level to warning (ceph config set mgr mgr/cephadm/log_to_cluster_level warning). You might miss a handful of info-level log messages, but typically anything important enough that we really need it to be seen would be at warning or error level.
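
For reference, the workaround from the comment above, plus the standard ceph config calls to verify or revert it once a fixed build is installed:

~~~
# Workaround: only forward cephadm messages at warning level or above to the cluster log
ceph config set mgr mgr/cephadm/log_to_cluster_level warning

# Verify the setting
ceph config get mgr mgr/cephadm/log_to_cluster_level

# Revert to the default after upgrading to a fixed build
ceph config rm mgr mgr/cephadm/log_to_cluster_level
~~~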

Comment 16 errata-xmlrpc 2023-06-15 09:15:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

