Bug 2136336

Summary: [cee/sd][Cephadm] ceph mgr is filling up the logs with "Detected new or changed devices" messages for all OSD nodes every 30 min unnecessarily
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Kritik Sachdeva <ksachdev>
Component: Cephadm Assignee: Adam King <adking>
Status: CLOSED ERRATA QA Contact: Manisha Saini <msaini>
Severity: medium Docs Contact: Akash Raj <akraj>
Priority: medium    
Version: 5.2CC: adking, akraj, cephqe-warriors, gjose, jeremy.coulombe, lithomas, lob+redhat, mmuench, msaini, msecaur, vamahaja
Target Milestone: ---   
Target Release: 6.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-17.2.6-5.el9cp Doc Type: Bug Fix
Doc Text:
.`Cephadm` now logs device information only when an actual change occurs

Previously, `cephadm` would compare all fields reported for OSDs to check for new or changed devices. However, one of these fields was a timestamp that differed every time. As a result, `cephadm` logged "Detected new or changed devices" every time it refreshed a host's devices, regardless of whether anything had actually changed.

With this fix, the comparison of device information against previous information no longer takes into account the timestamp fields that are expected to change constantly. `Cephadm` now logs only when there is an actual change in the devices.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-15 09:15:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2180567    
Bug Blocks: 2192813    

Description Kritik Sachdeva 2022-10-20 01:57:45 UTC
Description of problem:

The active MGR log is filling up with the message "Detected new or changed devices" every 30 minutes, even when there is no activity or change observed in the cluster. The requirement is to log this message only when there is an actual change on the node.

We looked up the code snippet that generates the log message:
~~~
The code is here: mgr/cephadm/inventory.py:
    def devices_changed(self, host: str, b: List[inventory.Device]) -> bool:
        a = self.devices[host]
        if len(a) != len(b):
            return True
        aj = {d.path: d.to_json() for d in a}
        bj = {d.path: d.to_json() for d in b}
        if aj != bj:
            self.mgr.log.info("Detected new or changed devices on %s" % host)
            return True
        return False
~~~
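The comparison above returns `True` whenever any serialized field differs, including fields that are expected to change on every refresh. A minimal sketch of the kind of fix applied (not the actual upstream patch; field names other than `created` and `path` are illustrative) is to strip volatile timestamp fields before comparing:

```python
from typing import Any, Dict, List

# Fields expected to change on every device refresh (assumption: only 'created')
VOLATILE_FIELDS = {'created'}

def strip_volatile(dev_json: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a device's JSON with timestamp-like fields removed."""
    return {k: v for k, v in dev_json.items() if k not in VOLATILE_FIELDS}

def devices_changed(old: List[Dict[str, Any]], new: List[Dict[str, Any]]) -> bool:
    """Compare device lists keyed by path, ignoring volatile fields."""
    if len(old) != len(new):
        return True
    aj = {d['path']: strip_volatile(d) for d in old}
    bj = {d['path']: strip_volatile(d) for d in new}
    return aj != bj

# Two snapshots differing only in 'created' now count as unchanged:
a = [{'path': '/dev/sdX', 'size': 10737418240, 'created': '2022-10-20T01:07:32Z'}]
b = [{'path': '/dev/sdX', 'size': 10737418240, 'created': '2022-10-20T01:38:30Z'}]
print(devices_changed(a, b))  # → False
```

With this approach the periodic cache refresh no longer triggers the log line unless a non-timestamp field actually differs.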

Dumping the values of both the `aj` and `bj` variables shows that the only difference is in the "created" field, which differs by 30 minutes.
     - This 30-minute interval, after which the above function is called, comes from the configuration parameter mgr/cephadm/device_cache_timeout.

~~~
aj: {'/dev/sdX': {'ceph_device': None, 'rejected_reasons': [], 'available': True, 'path': '/dev/sdX', 'sys_api': {'human_readable_size': '10.00 GB', 'locked': 0, 'model': 'QEMU HARDDISK', 'nr_requests': '256', 'partitions': {}, 'path': '/dev/sdX', 'removable': '0', 'rev': '2.5+', 'ro': '0', 'rotational': '1', 'sas_address': '', 'sas_device_handle': '', 'scheduler_mode': 'mq-deadline', 'sectors': 0, 'sectorsize': '512', 'size': 10737418240.0, 'support_discard': '4096', 'vendor': 'QEMU'}, 'created': '2022-10-20T01:07:32.604710Z', 'lvs': [], 'human_readable_type': 'hdd', 'device_id': 'QEMU_QEMU_HARDDISK_15bbXXXX-XXXX-40e8-XXXX-946a21dXXXXXX', 'lsm_data': {}}} 
-------------------------------------------------
 bj: {'/dev/sdX': {'ceph_device': None, 'rejected_reasons': [], 'available': True, 'path': '/dev/sdX', 'sys_api': {'human_readable_size': '10.00 GB', 'locked': 0, 'model': 'QEMU HARDDISK', 'nr_requests': '256', 'partitions': {}, 'path': '/dev/sdX', 'removable': '0', 'rev': '2.5+', 'ro': '0', 'rotational': '1', 'sas_address': '', 'sas_device_handle': '', 'scheduler_mode': 'mq-deadline', 'sectors': 0, 'sectorsize': '512', 'size': 10737418240.0, 'support_discard': '4096', 'vendor': 'QEMU'}, 'created': '2022-10-20T01:38:30.355023Z', 'lvs': [], 'human_readable_type': 'hdd', 'device_id': 'QEMU_QEMU_HARDDISK_15bbXXXX-XXXX-40e8-XXXX-946a21dXXXXX', 'lsm_data': {}}}
~~~
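The fact that "created" is the only differing field can be confirmed with a small recursive diff over the two dumps. The helper below is hypothetical (not part of cephadm); the dicts are trimmed versions of the dumps above:

```python
from typing import Any, Dict, List

def dict_diff(a: Dict[str, Any], b: Dict[str, Any], prefix: str = '') -> List[str]:
    """Recursively collect dotted key paths whose values differ between two dicts."""
    diffs = []
    for k in sorted(a.keys() | b.keys()):
        path = f'{prefix}{k}'
        va, vb = a.get(k), b.get(k)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.extend(dict_diff(va, vb, path + '.'))
        elif va != vb:
            diffs.append(path)
    return diffs

# Trimmed versions of the aj/bj dumps from the description
aj = {'/dev/sdX': {'path': '/dev/sdX', 'size': 10737418240.0,
                   'created': '2022-10-20T01:07:32.604710Z'}}
bj = {'/dev/sdX': {'path': '/dev/sdX', 'size': 10737418240.0,
                   'created': '2022-10-20T01:38:30.355023Z'}}
print(dict_diff(aj, bj))  # → ['/dev/sdX.created']
```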

Additionally, looking further into the code, we observed that the value of the "created" field is set only once, at the following location:
- https://github.com/ceph/ceph/blob/main/src/python-common/ceph/deployment/inventory.py#L74-L77

Version-Release number of selected component (if applicable): RHCS 5.2


How reproducible: Always


Steps to Reproduce:
1. Deploy or upgrade an RHCS 5 environment
2. Monitor the active mgr logs for the message "Detected new or changed devices"

Actual results: The "Detected new or changed devices" log message fills the logs even when there is no actual change.


Expected results: Log the above message only if there is an actual change on the node.

Comment 5 Matthew Secaur 2023-01-18 14:37:50 UTC
While waiting for a patch for Ceph 5.2, is there any way our customers can silence this warning message so that their logs/prometheus are not inundated with these messages?

Comment 6 Adam King 2023-01-18 17:20:57 UTC
(In reply to Matthew Secaur from comment #5)
> While waiting for a patch for Ceph 5.2, is there any way our customers can
> silence this warning message so that their logs/prometheus are not inundated
> with these messages?

These logs are info level, so I think you could just set the cephadm log-to-cluster level to warning (ceph config set mgr mgr/cephadm/log_to_cluster_level warning). You'd maybe miss a handful of info log messages, but typically anything important enough that it really needs to be seen would be at warning or error level.
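The workaround above, sketched as commands (a config fragment requiring a running cluster and admin access; `warning` suppresses the info-level messages until a fixed build is deployed):

```shell
# Raise the cephadm cluster-log level so info messages are no longer emitted
ceph config set mgr mgr/cephadm/log_to_cluster_level warning

# Verify the current value
ceph config get mgr mgr/cephadm/log_to_cluster_level

# Revert to the default once running a build with the fix
ceph config rm mgr mgr/cephadm/log_to_cluster_level
```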

Comment 16 errata-xmlrpc 2023-06-15 09:15:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623