.`Cephadm` now logs device information only when an actual change occurs
Previously, `cephadm` compared all fields reported for OSDs to check for new or changed devices. However, one of these fields included a timestamp that differed every time. As a result, `cephadm` logged 'Detected new or changed devices' every time it refreshed a host's devices, regardless of whether anything had actually changed.
With this fix, the comparison of current device information against the previous information no longer takes into account the timestamp fields, which are expected to change constantly. `Cephadm` now logs only when there is an actual change in the devices.
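The idea behind the fix can be illustrated with a minimal sketch. This is not the actual patch: the names `VOLATILE_FIELDS` and `strip_volatile` are hypothetical, and the sketch assumes that "created" is the only field that changes on every refresh.

```python
from typing import Any, Dict, FrozenSet

# Assumption: "created" is the only volatile field; the real fix may exclude others.
VOLATILE_FIELDS: FrozenSet[str] = frozenset({"created"})


def strip_volatile(device_json: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one device's JSON without the volatile top-level fields."""
    return {k: v for k, v in device_json.items() if k not in VOLATILE_FIELDS}


def devices_changed(old: Dict[str, Dict[str, Any]],
                    new: Dict[str, Dict[str, Any]]) -> bool:
    """Compare two {path: device_json} maps while ignoring volatile fields."""
    if old.keys() != new.keys():
        # A device path appeared or disappeared.
        return True
    return any(strip_volatile(old[p]) != strip_volatile(new[p]) for p in old)


# Two refreshes of the same device that differ only in the timestamp.
old = {"/dev/sdX": {"path": "/dev/sdX", "available": True,
                    "created": "2022-10-20T01:07:32.604710Z"}}
new = {"/dev/sdX": {"path": "/dev/sdX", "available": True,
                    "created": "2022-10-20T01:38:30.355023Z"}}
print(devices_changed(old, new))  # False
```

Comparing on a filtered copy keeps the stored device data intact, so the timestamp is still available to anything else that reads it.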
Description Kritik Sachdeva
2022-10-20 01:57:45 UTC
Description of problem:
The active MGR log message "Detected new or changed devices" is logged every 30 minutes even if there is no activity or change observed in the cluster. The requirement is to log the above message only if there is an actual change on the node.
We looked up the code snippet that generates the log message, shown below.
~~~
# The code is here: mgr/cephadm/inventory.py
def devices_changed(self, host: str, b: List[inventory.Device]) -> bool:
    a = self.devices[host]
    if len(a) != len(b):
        return True
    # Serialize every device, keyed by path, and compare all fields.
    aj = {d.path: d.to_json() for d in a}
    bj = {d.path: d.to_json() for d in b}
    if aj != bj:
        self.mgr.log.info("Detected new or changed devices on %s" % host)
        return True
    return False
~~~
While dumping the values of both the aj and bj variables, we observed that the only difference is in the field "created", whose two values are about 30 minutes apart.
- This 30-minute interval, after which the above function is called, comes from the configuration parameter mgr/cephadm/device_cache_timeout
~~~
aj: {'/dev/sdX': {'ceph_device': None, 'rejected_reasons': [], 'available': True, 'path': '/dev/sdX', 'sys_api': {'human_readable_size': '10.00 GB', 'locked': 0, 'model': 'QEMU HARDDISK', 'nr_requests': '256', 'partitions': {}, 'path': '/dev/sdX', 'removable': '0', 'rev': '2.5+', 'ro': '0', 'rotational': '1', 'sas_address': '', 'sas_device_handle': '', 'scheduler_mode': 'mq-deadline', 'sectors': 0, 'sectorsize': '512', 'size': 10737418240.0, 'support_discard': '4096', 'vendor': 'QEMU'}, 'created': '2022-10-20T01:07:32.604710Z', 'lvs': [], 'human_readable_type': 'hdd', 'device_id': 'QEMU_QEMU_HARDDISK_15bbXXXX-XXXX-40e8-XXXX-946a21dXXXXXX', 'lsm_data': {}}}
-------------------------------------------------
bj: {'/dev/sdX': {'ceph_device': None, 'rejected_reasons': [], 'available': True, 'path': '/dev/sdX', 'sys_api': {'human_readable_size': '10.00 GB', 'locked': 0, 'model': 'QEMU HARDDISK', 'nr_requests': '256', 'partitions': {}, 'path': '/dev/sdX', 'removable': '0', 'rev': '2.5+', 'ro': '0', 'rotational': '1', 'sas_address': '', 'sas_device_handle': '', 'scheduler_mode': 'mq-deadline', 'sectors': 0, 'sectorsize': '512', 'size': 10737418240.0, 'support_discard': '4096', 'vendor': 'QEMU'}, 'created': '2022-10-20T01:38:30.355023Z', 'lvs': [], 'human_readable_type': 'hdd', 'device_id': 'QEMU_QEMU_HARDDISK_15bbXXXX-XXXX-40e8-XXXX-946a21dXXXXX', 'lsm_data': {}}}
~~~
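The comparison of the two dumps can be reproduced with a short sketch. The helpers `differing_keys` and `parse_ts` are hypothetical names, and the two dictionaries are a trimmed subset of the fields dumped above, not the full device records.

```python
from datetime import datetime
from typing import Any, Dict, List

_MISSING = object()  # sentinel so a missing key is never equal to a None value


def differing_keys(a: Dict[str, Any], b: Dict[str, Any]) -> List[str]:
    """Return the sorted top-level keys whose values differ between a and b."""
    return sorted(k for k in a.keys() | b.keys()
                  if a.get(k, _MISSING) != b.get(k, _MISSING))


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp in the format seen in the dump."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


# Trimmed subset of the aj and bj entries for /dev/sdX.
aj_dev = {"path": "/dev/sdX", "available": True, "lvs": [],
          "human_readable_type": "hdd",
          "created": "2022-10-20T01:07:32.604710Z"}
bj_dev = {"path": "/dev/sdX", "available": True, "lvs": [],
          "human_readable_type": "hdd",
          "created": "2022-10-20T01:38:30.355023Z"}

changed = differing_keys(aj_dev, bj_dev)
gap = parse_ts(bj_dev["created"]) - parse_ts(aj_dev["created"])
print(changed)  # ['created']
print(gap)      # a little under 31 minutes
```

On this subset the only differing key is "created", and the gap between the two timestamps is roughly the 30-minute refresh interval.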
Additionally, we looked further into the code and observed that the value of the field "created" is set only once, at the following location in the code:
- https://github.com/ceph/ceph/blob/main/src/python-common/ceph/deployment/inventory.py#L74-L77
Version-Release number of selected component (if applicable): RHCS 5.2
How reproducible: Always
Steps to Reproduce:
1. Simply deploy or upgrade an RHCS 5 environment
2. Monitor the active mgr logs for the message "Detected new or changed devices"
Actual results: "Detected new or changed devices" log message is filling up even if there is no actual change.
Expected results: Log the above message only if there is an actual change on the node.
While waiting for a patch for Ceph 5.2, is there any way our customers can silence this warning message so that their logs/prometheus are not inundated with these messages?
(In reply to Matthew Secaur from comment #5)
> While waiting for a patch for Ceph 5.2, is there any way our customers can
> silence this warning message so that their logs/prometheus are not inundated
> with these messages?
These logs are info level, so I think you could just set the cephadm log-to-cluster level to warning (ceph config set mgr mgr/cephadm/log_to_cluster_level warning). You'd maybe miss a handful of info log messages, but typically anything important enough that we really need it to be seen would be at warning or error level.
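Why raising the level hides the message can be illustrated with Python's standard `logging` module. This is only an analogy for level thresholds, not the Ceph cluster log mechanism; the logger name `cephadm_sketch`, the `ListHandler` class, and the warning text are made up for the example.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


log = logging.getLogger("cephadm_sketch")  # hypothetical logger name
log.propagate = False                      # keep the example self-contained
handler = ListHandler()
log.addHandler(handler)

# With the threshold at WARNING, info-level records are dropped.
log.setLevel(logging.WARNING)
log.info("Detected new or changed devices on %s", "host1")
log.warning("example warning on %s", "host1")

print(handler.messages)  # ['example warning on host1']
```

The trade-off is the one described above: every info-level message is dropped, not only the noisy one.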
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:3623