Bug 1995610 - Ceph mounts are deemed failed (even when actually successful)
Summary: Ceph mounts are deemed failed (even when actually successful)
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: Services
Version: 4.40.80.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Benny Zlotnik
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-19 13:37 UTC by Karl E. Jørgensen
Modified: 2022-05-25 13:39 UTC (History)
5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-05-24 13:09:20 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.5-




Links
System ID Private Priority Status Summary Last Updated
Github oVirt vdsm pull 142 0 None open Support Ceph mount using DNS name 2022-05-24 13:09:19 UTC
Red Hat Issue Tracker RHV-43070 0 None None None 2021-08-19 13:40:25 UTC

Description Karl E. Jørgensen 2021-08-19 13:37:02 UTC
Description of problem:


When mounting a ceph file system we get an exception:

2021-08-18 22:13:56,927+0100 INFO  (jsonrpc/4) [storage.StorageServer.MountConnection] Creating directory '/rhev/data-center/mnt/monitors.storage-01.example.com:_' (storageServer:167)
2021-08-18 22:13:56,927+0100 INFO  (jsonrpc/4) [storage.fileUtils] Creating directory: /rhev/data-center/mnt/monitors.storage-01.example.com:_ mode: None (fileUtils:201)
2021-08-18 22:13:56,927+0100 INFO  (jsonrpc/4) [storage.Mount] mounting monitors.storage-01.example.com:/ at /rhev/data-center/mnt/monitors.storage-01.example.com:_ (mount:207)
2021-08-18 22:13:57,006+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2374)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2371, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 184, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 256, in getRecord
    (self.fs_spec, self.fs_file))
FileNotFoundError: [Errno 2] Mount of `monitors.storage-01.example.com:/` at `/rhev/data-center/mnt/monitors.storage-01.example.com:_` does not exist

The storage domain is configured with:
Path: monitors.storage-01.example.com:/
VFS Type: ceph
+ various secrets etc

The strange part: the actual mount command DID succeed. I can see the file system mounted on the server, examine the contents, and see it in "df -h" etc.

# df -h
Filesystem                                           Size  Used Avail Use% Mounted on
192.168.185.1,192.168.185.5,192.168.185.6:/           15T  3.3T   12T  23% /rhev/data-center/mnt/monitors.storage-01.example.com:_


As far as I can see from the code, it expects to find "monitors.storage-01.example.com:/" in /proc/mounts - not "192.168.185.1,192.168.185.5,192.168.185.6:/". Something, somewhere, has translated the DNS name "monitors.storage-01.example.com" into the correct IP addresses.

And this translation causes the exception to be raised...
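
To illustrate the failure mode, here is a minimal sketch of the kind of check getRecord() performs against /proc/mounts (simplified, not the actual vdsm code; the function name find_mount_record is made up):

# Minimal sketch of the matching done by vdsm's mount.getRecord(),
# simplified for illustration -- not the actual vdsm source.
import errno


def find_mount_record(fs_spec, fs_file, proc_mounts="/proc/mounts"):
    """Return the (fs_spec, fs_file) pair from /proc/mounts that matches
    both the requested device and the mount point, or raise ENOENT."""
    with open(proc_mounts) as f:
        for line in f:
            rec_spec, rec_file = line.split()[:2]
            # vdsm compares the spec it mounted with
            # ("monitors.storage-01.example.com:/") against what the kernel
            # reports; mount.ceph rewrites it to the resolved monitor IPs
            # ("192.168.185.1,192.168.185.5,192.168.185.6:/"), so this
            # comparison never matches even though fs_file does.
            if rec_file == fs_file and rec_spec == fs_spec:
                return rec_spec, rec_file
    raise OSError(errno.ENOENT,
                  "Mount of `%s` at `%s` does not exist" % (fs_spec, fs_file))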

I managed to hand-fix it by modifying /usr/lib/python3.6/site-packages/vdsm/storage/mount.py:

--- mount.py	2021-08-19 14:24:01.400661984 +0100
+++ mount.py	2021-08-18 22:49:56.998536572 +0100
@@ -248,7 +248,7 @@
             fs_specs = self.fs_spec, None
 
         for record in _iterMountRecords():
-            if self.fs_file == record.fs_file and record.fs_spec in fs_specs:
+            if self.fs_file == record.fs_file:
                 return record
 
         raise OSError(errno.ENOENT,


But this is obviously not ideal - future upgrades will overwrite it etc.
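
A less invasive variant of the same idea would be to keep the fs_spec check but also accept a record whose host part resolves to the same addresses. A rough sketch of that approach (illustration only, assuming a single DNS name without an explicit port in the spec; this is not the patch submitted later):

# Rough sketch of an alternative fix: accept a /proc/mounts record whose
# host part resolves to the same addresses as the requested spec.
# Illustration only -- not the change that was actually submitted.
import socket


def ceph_specs_match(requested_spec, record_spec):
    """True if two ceph fs_specs refer to the same monitors, allowing
    mount.ceph to have replaced the DNS name with resolved IPs."""
    if requested_spec == record_spec:
        return True
    req_host, _, req_path = requested_spec.partition(":")
    rec_hosts, _, rec_path = record_spec.partition(":")
    if req_path != rec_path:
        return False
    try:
        resolved = {ai[4][0] for ai in socket.getaddrinfo(req_host, None)}
    except socket.gaierror:
        return False
    # The kernel reports the monitors as a comma-separated IP list.
    return set(rec_hosts.split(",")) <= resolved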


The really strange part: we have 6 other hypervisors in this cluster, and none of them has this problem! They all have the DNS name appearing in /proc/mounts - except this one server. I do not know why this one is different...

Comment 1 RHEL Program Management 2021-08-19 15:20:46 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Karl E. Jørgensen 2021-08-20 08:59:17 UTC
Examining the differences between this hypervisor and the others: this one is the first to be upgraded to ovirt 4.4.8 (vdsm 4.40.80-5-1); the other hypervisors are on ovirt 4.4.7 (vdsm 4.40.70.6-1).

This hypervisor also ended up with ceph-common installed (pulled in by the ovirt-host-dependencies package 4.4.8), whereas the other hypervisors do not have ceph-common installed at all. And ceph-common provides /usr/sbin/mount.ceph ...

So I have to (reluctantly) conclude that what I see is a result of the upgrade from ovirt 4.4.7 -> ovirt 4.4.8, rather than errors on our side.

Comment 3 Eyal Shenitzky 2021-08-23 14:30:27 UTC
Benny, can you please have a look?

Comment 4 Benny Zlotnik 2021-08-30 12:18:25 UTC
Hi,

Do you have ceph-common installed on the other hosts? If so, which version?
Can you provide the kernel version as well? (for the new and old hypervisors)

Comment 5 Karl E. Jørgensen 2021-08-31 08:44:41 UTC
No - the problematic host is the only one with ceph-common installed. It has version ceph-common-16.2.5-2.el8.x86_64.


I very much suspect that the mount.ceph command is responsible for translating the DNS name into IP addresses - the man page for mount.ceph confirms this...
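
For illustration (not something I ran on these hosts), resolving the monitor name yourself and joining the addresses with commas gives essentially the fs_spec that ends up in /proc/mounts:

# Resolve the monitor DNS name and join the addresses with commas -- this is
# roughly the fs_spec that mount.ceph hands to the kernel and that then
# shows up in /proc/mounts.  Hostname taken from this report.
import socket

host = "monitors.storage-01.example.com"
ips = sorted({ai[4][0] for ai in socket.getaddrinfo(host, None, socket.AF_INET)})
print(",".join(ips) + ":/")
# With the DNS records in this report, this should print something like:
# 192.168.185.1,192.168.185.5,192.168.185.6:/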

Comment 6 Karl E. Jørgensen 2021-09-01 21:47:06 UTC
As for kernel version: it varies a little.

Problematic hypervisor: 4.18.0-305.12.1.el8_4.x86_64

Other hypervisors: some variation - kernel versions are:
* 4.18.0-240.22.1.el8_3.x86_64
* 4.18.0-305.12.1.el8_4.x86_64 (yes: same as the problematic one)
* 4.18.0-305.7.1.el8_4.x86_64

As some of the hypervisors have the same kernel version as the problematic one, I doubt the kernel is the culprit.

Although some hypervisors have newer kernel versions installed, they have not necessarily been rebooted - the list above is based on the currently-running kernel (=output of "uname -r") rather than the latest installed one.

Comment 7 Benny Zlotnik 2021-10-05 11:23:53 UTC
Thanks, we need to check this with Ceph, as it sounds like the newer mount.ceph overrides the previous behavior. For now we have stopped installing ceph-common automatically, so if Managed Block Storage is not used it can simply be removed as a workaround.

Comment 8 Eyal Shenitzky 2021-10-11 11:34:35 UTC
Benny, since ceph-common is not automatically installed, can we close the bug?

Comment 9 Karl E. Jørgensen 2021-10-12 08:34:39 UTC
As far as I can see, ceph-common is still automatically installed...

I upgraded the host to the latest available, and not only is ceph-common still installed, but attempts at removing it indicate it is still required:

[root@hv-03 ~]# rpm -qa |grep ceph
python3-ceph-argparse-16.2.6-1.el8.x86_64
python3-cephfs-16.2.6-1.el8.x86_64
libcephfs2-16.2.6-1.el8.x86_64
python3-ceph-common-16.2.6-1.el8.x86_64
ceph-common-16.2.6-1.el8.x86_64

[root@hv-03 ~]# rpm --erase ceph-common
error: Failed dependencies:
	ceph-common is needed by (installed) ovirt-host-dependencies-4.4.8-1.el8.x86_64


Also: the upgrade wiped out my manual patch (see above) to /usr/lib/python3.6/site-packages/vdsm/storage/mount.py (which is expected). I had to re-apply the patch to get a working hypervisor.

Comment 14 Michal Skrivanek 2022-04-11 09:05:41 UTC
Worth a recheck in 4.5.

No update for a while and it didn't make it into 4.5; closing.

Comment 15 Karl E. Jørgensen 2022-04-11 11:00:22 UTC
I'll have to protest about the closing of this bug:

- It is still a real issue
- I have supplied a patch to fix it
- surely whether it made it into ovirt 4.5 is irrelevant as to whether it is a bug or not?

I'm happy to try out any tweaks you suggest or dig out more log info if needed - just ask...

Comment 16 Benny Zlotnik 2022-04-12 09:00:44 UTC
(In reply to Karl E. Jørgensen from comment #15)
> I'll have to protest about the closing of this bug:
> 
> - It is still a real issue
> - I have supplied a patch to fix it
> - surely whether it made it into ovirt 4.5 is irrelevant as to whether it is
> a bug or not?
> 
> I'm happy to try out any tweaks you suggest or dig out more log info if
> needed - just ask...
Hi, 

Can you post your change as a pull request to vdsm?
https://github.com/oVirt/vdsm

Comment 18 Karl E. Jørgensen 2022-04-19 22:11:12 UTC
OK - I created https://github.com/oVirt/vdsm/pull/142 for your perusal and general entertainment.

Regards

Comment 19 Arik 2022-05-24 13:09:20 UTC
Moved to GitHub: https://github.com/oVirt/vdsm/issues/202

