2292559 – [cephadm] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Cephadm
Sub Component:
Version:	5.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	5.3z8
Assignee:	Adam King
QA Contact:	Mohit Bisht
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-06-16 00:43 UTC by Manny
Modified:	2025-02-13 19:22 UTC
CC List:	7 users
Fixed In Version:	ceph-16.2.10-269.el8cp
Doc Type:	Bug Fix
Doc Text:	.Cephadm now reads the `sys/block/{}/device/wwid` file as binary, allowing non-UTF-8 characters. Previously, Cephadm was not able to read non-UTF-8 characters in the `sys/block/{}/device/wwid` file. As a result, Cephadm would crash when refreshing node metadata if the file contained a non-UTF-8 character. With this fix, Cephadm reads the `sys/block/{}/device/wwid` file as binary and decodes it, allowing the utility to read other character types.
Clone Of:
Environment:
Last Closed:	2025-02-13 19:22:44 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHCEPH-9174	0	None	None	None	2024-06-16 00:45:54 UTC
Red Hat Product Errata	RHBA-2025:1478	0	None	None	None	2025-02-13 19:22:47 UTC

Comment 8 Raimund Sacherer 2024-07-11 07:41:41 UTC

Hi Adam, 


as it's an encoding problem, a whacky/hacky way might be to try/except and, on a decoding error, treat the bytes as Latin-1 and decode again with that encoding. Or maybe it would be cleaner to use the `errors` parameter of `.decode()`:

```
>>> b"\xb7".decode(errors="replace")
'�'
>>>
```

Which would not put the actual symbol in the string, but for the wwid, would that actually matter?

With this, we might be able to forget about possible strange characters in this bit. 
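The two ideas above can be compared side by side. This is only a sketch: the function names `decode_with_latin1_fallback` and `decode_with_replace` and the sample bytes are made up for illustration and are not taken from cephadm.

```python
def decode_with_latin1_fallback(raw: bytes) -> str:
    """Try strict UTF-8 first, then retry as Latin-1 on a decoding error."""
    try:
        return raw.decode("utf-8").strip()
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this cannot fail.
        return raw.decode("latin-1").strip()


def decode_with_replace(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for every invalid byte."""
    return raw.decode("utf-8", errors="replace").strip()


sample = b"id.234\xb7nd\n"  # hypothetical content with one non-UTF-8 byte
fallback_result = decode_with_latin1_fallback(sample)
replace_result = decode_with_replace(sample)
```

The fallback keeps a printable character for the odd byte, but it is a guess at the encoding; `errors="replace"` makes no such guess and only marks the position.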

Best regards
Raimund

Comment 10 Raimund Sacherer 2024-07-12 07:21:18 UTC

Reading this bit of the code:

```
def read_file(path_list, file_name=''):
    # type: (List[str], str) -> str
    """Returns the content of the first file found within the `path_list`

    :param path_list: list of file paths to search
    :param file_name: optional file_name to be applied to a file path
    :returns: content of the file or 'Unknown'
    """
    for path in path_list:
        if file_name:
            file_path = os.path.join(path, file_name)
        else:
            file_path = path
        if os.path.exists(file_path):
            with open(file_path, 'r') as f:
                try:
                    content = f.read().strip()
                except OSError:
                    # sysfs may populate the file, but for devices like
                    # virtio reads can fail
                    return 'Unknown'
                else:
                    return content
    return 'Unknown'
```

I think we could either:
```
            with open(file_path, 'r', errors='replace') as f:
```

or maybe add, in addition to the `OSError` exception handler, a `UnicodeDecodeError` handler and just return 'Unknown' in that case as well.

Not sure what would be preferable.
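To see how the two options behave, here is a minimal sketch on a temporary file. The helper `read_text_or_unknown` is a hypothetical, simplified stand-in for `read_file` (single path, no `file_name`), the sample bytes are made up, and `encoding='utf-8'` is pinned as an assumption so the result does not depend on the locale.

```python
import os
import tempfile


def read_text_or_unknown(file_path, errors=None):
    """Return stripped file content, or 'Unknown' if it cannot be read."""
    if not os.path.exists(file_path):
        return 'Unknown'
    try:
        # errors=None behaves like 'strict': invalid bytes raise UnicodeDecodeError.
        with open(file_path, 'r', encoding='utf-8', errors=errors) as f:
            return f.read().strip()
    except (OSError, UnicodeDecodeError):
        # UnicodeDecodeError is a ValueError, so OSError alone does not catch it.
        return 'Unknown'


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'wwid')
    with open(path, 'wb') as f:
        f.write(b'id.234\xb7nd\n')  # one byte that is not valid UTF-8
    strict_result = read_text_or_unknown(path)
    replaced_result = read_text_or_unknown(path, errors='replace')
    missing_result = read_text_or_unknown(os.path.join(tmp, 'missing'))
```

With the extra handler the unreadable value collapses to 'Unknown'; with `errors='replace'` most of the value survives and only the bad byte is substituted.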

I'll ask the CU to provide the content of the offending file so you can test it. 


Thank you
Raimund

Comment 11 Raimund Sacherer 2024-07-12 08:06:59 UTC

Or, instead of returning `unknown`, we could also just return `unicodedecodeerror`, which would then be visible in the output and at least gives a hint that a WWID exists but can't be read. So it's not clear to me which is better:

`unknown`
`unicodedecodeerror`
`id.23403867855�nd`

? :-) ... I think I would prefer the 3rd one, seeing the WWID even if it has a character replaced.

One could even narrow the scope of the change, to limit the impact if one fears that replacing characters could interfere with other parts of the code which read files:
```
    def read_file(path_list, file_name='', errors=None):  # None is the same as 'strict'
[...]
                with open(file_path, 'r', errors=errors) as f:
[...]
```

And then passing `replace` when reading these vendor files:
```
    def _dev_list(self, dev_list):
        # type: (List[str]) -> List[Dict[str, object]]
        """Return a 'pretty' name list for each device in the `dev_list`"""
        disk_list = list()

        for dev in dev_list:
            disk_model = read_file(['/sys/block/{}/device/model'.format(dev)], errors='replace').strip()
            disk_rev = read_file(['/sys/block/{}/device/rev'.format(dev)], errors='replace').strip()
            disk_wwid = read_file(['/sys/block/{}/device/wwid'.format(dev)], errors='replace').strip()
            vendor = read_file(['/sys/block/{}/device/vendor'.format(dev)], errors='replace').strip()
```


Just as an idea. As soon as CU has uploaded the files from the offending hosts, I'll make them available. 

BR
Raimund

Comment 14 Raimund Sacherer 2024-07-26 09:28:06 UTC

Hello Adam, 

do we have any news about this? We have provided a tarball with the files and the file output, did you get a chance to look it over?

Thank you
Raimund

Comment 16 Raimund Sacherer 2024-07-29 12:46:41 UTC

Hi Adam, 

we are only talking about these two lines, right?
```
            with open(file_path, 'rb') as f:
                try:
                    content = f.read().decode('utf-8', 'ignore').strip()
```

I think I can just take the cephadm binary from a cluster with the same version, modify those two lines, and have the CU execute it locally on one of the servers. Or would there be a need to actually get the rest from the GitHub repository and build it?
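For reference, a minimal sketch of what those two lines do when the file contains a byte that is not valid UTF-8. The helper name `read_binary_ignore` and the sample bytes are made up for illustration; this is a cut-down stand-in, not the cephadm function itself.

```python
import os
import tempfile


def read_binary_ignore(file_path):
    """Read a file as bytes and decode as UTF-8, dropping invalid bytes."""
    with open(file_path, 'rb') as f:
        try:
            content = f.read().decode('utf-8', 'ignore').strip()
        except OSError:
            # Same fallback as the original read_file for failing sysfs reads.
            return 'Unknown'
        else:
            return content


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'wwid')
    with open(path, 'wb') as f:
        f.write(b'id.234\xb7nd\n')  # hypothetical content, one invalid byte
    ignored_result = read_binary_ignore(path)
```

Unlike `errors='replace'`, the `'ignore'` handler leaves no marker where a byte was dropped, so the returned value is shorter than the raw content.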

BR
Raimund

Comment 17 Raimund Sacherer 2024-07-29 13:07:05 UTC

Hi Adam, 

FYI, I asked the CU to reproduce with the two commands on the host.

I'll keep you updated. 

BR
Raimund

Comment 19 Raimund Sacherer 2024-07-29 20:33:46 UTC

Just as an FYI: executing manually on the node produces the same errors. I modified cephadm and sent it to the customer to test; I will paste the results when the CU comes back.

BR

Comment 21 Raimund Sacherer 2024-08-09 09:03:16 UTC

Hello Adam, 

The CU ran the commands with my modified binary. Unfortunately, they ran the `cephadm ceph-volume` command with an error in the parameters, so it errored out; I asked them to retry with the correct parameters. However, `gather-facts` now goes through correctly:

```
[root@ceph-dashboard-test ~]# python3 ./cephadm.testbuild gather-facts
{
  "arch": "x86_64",
  "bios_date": "12/03/2020",
  "bios_version": "Hyper-V UEFI Release v4.1",
  "cpu_cores": 2,
  "cpu_count": 1,
  "cpu_load": {
    "15min": 0.09,
    "1min": 0.19,
    "5min": 0.23
  },
  "cpu_model": "Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz",
  "cpu_threads": 4,
  "flash_capacity": "0.0",
  "flash_capacity_bytes": 0,
  "flash_count": 0,
  "flash_list": [],
  "hdd_capacity": "85.9GB",
  "hdd_capacity_bytes": 85899345920,
  "hdd_count": 2,
  "hdd_list": [
    {
      "description": "Msft Virtual Disk (21.5GB)",
      "dev_name": "sdb",
      "disk_size_bytes": 21474836480,
      "model": "Virtual Disk",
      "rev": "1.0",
      "vendor": "Msft",
      "wwid": "t10.MSFT    \\032qv\\025\\215Fh\\035\\236SL"
    },
    {
      "description": "Msft Virtual Disk (64.4GB)",
      "dev_name": "sda",
      "disk_size_bytes": 64424509440,
      "model": "Virtual Disk",
      "rev": "1.0",
      "vendor": "Msft",
      "wwid": "t10.MSFT    &[jP\\200GO+eE\\222"
    }
  ],
  "hostname": "ceph-dashboard-test",
  "interfaces": {
    "eth0": {
      "driver": "hv_netvsc",
      "iftype": "physical",
      "ipv4_address": "10.215.44.99/25",
      "ipv6_address": "fe80::215:5dff:fea0:d50c/64",
      "lower_devs_list": [],
      "mtu": 1500,
      "nic_type": "ethernet",
      "operstate": "up",
      "speed": 20000,
      "upper_devs_list": []
    },
    "lo": {
      "driver": "",
      "iftype": "logical",
      "ipv4_address": "127.0.0.1/8",
      "ipv6_address": "::1/128",
      "lower_devs_list": [],
      "mtu": 65536,
      "nic_type": "loopback",
      "operstate": "unknown",
      "speed": -1,
      "upper_devs_list": []
    }
  },
  "kernel": "4.18.0-477.15.1.el8_8.x86_64",
  "kernel_parameters": {
    "net.ipv4.ip_nonlocal_bind": "0"
  },
  "kernel_security": {
    "description": "SELinux: Enabled(enforcing, targeted)",
    "type": "SELinux"
  },
  "memory_available_kb": 13016520,
  "memory_free_kb": 1453236,
  "memory_total_kb": 16138760,
  "model": "Virtual Machine (Virtual Machine)",
  "nic_count": 1,
  "operating_system": "Red Hat Enterprise Linux 8.8 (Ootpa)",
  "selinux_enabled": true,
  "subscribed": "Yes",
  "system_uptime": 6150307.68,
  "tcp6_ports_used": [
    22,
    3000,
    44321,
    9093,
    9094,
    9095,
    9100
  ],
  "tcp_ports_used": [
    22,
    44321
  ],
  "timestamp": 1723141832.6658888,
  "udp6_ports_used": [
    323,
    9094
  ],
  "udp_ports_used": [
    323
  ],
  "vendor": "Microsoft Corporation"
}
```


I only changed the two lines (opening the file as binary, and decoding while ignoring errors). So this should fix the problem perfectly.


The only question I have now is how we can distribute this fix to the CU. I think we might get a last shot at a final 5.3 z-stream release, because I have another BZ open for scrub issues where we got that approved during yesterday's program call. So it might be good to get this in. If we could get confirmation that it can go into the next (and probably last) 5.3 z-stream, then I'll ask the CU whether they would be fine with upgrading when it's available. Pretty sure we would not need a hotfix for that ...

Thanks
BR
Raimund

Comment 37 errata-xmlrpc 2025-02-13 19:22:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 security and bug fix updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:1478
