Description of problem:

When mounting a ceph file system we get an exception:

2021-08-18 22:13:56,927+0100 INFO (jsonrpc/4) [storage.StorageServer.MountConnection] Creating directory '/rhev/data-center/mnt/monitors.storage-01.example.com:_' (storageServer:167)
2021-08-18 22:13:56,927+0100 INFO (jsonrpc/4) [storage.fileUtils] Creating directory: /rhev/data-center/mnt/monitors.storage-01.example.com:_ mode: None (fileUtils:201)
2021-08-18 22:13:56,927+0100 INFO (jsonrpc/4) [storage.Mount] mounting monitors.storage-01.example.com:/ at /rhev/data-center/mnt/monitors.storage-01.example.com:_ (mount:207)
2021-08-18 22:13:57,006+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2374)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2371, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 184, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 256, in getRecord
    (self.fs_spec, self.fs_file))
FileNotFoundError: [Errno 2] Mount of `monitors.storage-01.example.com:/` at `/rhev/data-center/mnt/monitors.storage-01.example.com:_` does not exist

The storage domain is configured with:

Path: monitors.storage-01.example.com:/
VFS Type: ceph
+ various secrets etc.

The strange part: the actual mount command DID succeed. I can see the file system mounted on the server, examine the contents, see it in "df -h" etc:

# df -h
Filesystem                                    Size  Used Avail Use% Mounted on
192.168.185.1,192.168.185.5,192.168.185.6:/    15T  3.3T   12T  23% /rhev/data-center/mnt/monitors.storage-01.example.com:_

As far as I can see from the code, it expects to find "monitors.storage-01.example.com:/" in /proc/mounts - and not "192.168.185.1,192.168.185.5,192.168.185.6:/". Something, somewhere has translated the DNS name "monitors.storage-01.example.com" into the correct IP addresses, and this translation causes the exception to be raised...

I managed to hand-fix it by modifying /usr/lib/python3.6/site-packages/vdsm/storage/mount.py:

--- mount.py 2021-08-19 14:24:01.400661984 +0100
+++ mount.py 2021-08-18 22:49:56.998536572 +0100
@@ -248,7 +248,7 @@
         fs_specs = self.fs_spec, None
 
         for record in _iterMountRecords():
-            if self.fs_file == record.fs_file and record.fs_spec in fs_specs:
+            if self.fs_file == record.fs_file:
                 return record
 
         raise OSError(errno.ENOENT,

But this is obviously not ideal - future upgrades will overwrite it etc.

The really strange part: we have 6 other hypervisors in this cluster, and none of the other 6 have this problem! They all have the DNS name appearing in /proc/mounts - except this one server. I do not know why this one is different...
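Just as an aside: a less drastic variant of my hand-patch might keep the fs_spec check but compare resolved monitor addresses rather than raw strings. The sketch below is only an illustration of that idea, not the actual vdsm code - the helper names _ceph_spec_addrs and specs_match are made up here, and a real fix would also have to cope with monitor port suffixes and IPv6 addresses that mount.ceph can produce:

import socket


def _ceph_spec_addrs(fs_spec):
    """Split a ceph fs_spec such as 'monitors.storage-01.example.com:/'
    (or '192.168.185.1,192.168.185.5,192.168.185.6:/') into an
    (addresses, path) pair, where addresses is the frozenset of IPs the
    host part resolves to.  Resolution failures fall back to the
    literal host string."""
    hosts, _, path = fs_spec.partition(":")
    addrs = set()
    for host in hosts.split(","):
        try:
            infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
            addrs.update(info[4][0] for info in infos)
        except socket.gaierror:
            addrs.add(host)
    return frozenset(addrs), path


def specs_match(configured_spec, record_spec):
    """True if both specs name the same ceph monitors and path."""
    return _ceph_spec_addrs(configured_spec) == _ceph_spec_addrs(record_spec)

With something like this, getRecord() could accept a record whose fs_file matches and whose fs_spec resolves to the same monitor set as the configured path, instead of dropping the fs_spec comparison entirely.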
When examining the differences between this hypervisor and the others:

- This one is the first to be upgraded to ovirt 4.4.8 (vdsm 4.40.80-5-1); the other hypervisors are on ovirt 4.4.7 (vdsm 4.40.70.6-1).
- This hypervisor also ended up with ceph-common installed (pulled in by the ovirt-host-dependencies package 4.4.8), whereas the other hypervisors do not have ceph-common installed at all. And ceph-common provides /usr/sbin/mount.ceph ...

So I have to (reluctantly) conclude that what I see is a result of the upgrade from ovirt 4.4.7 -> ovirt 4.4.8, rather than errors on our side.
Benny, can you please have a look?
Hi,

Do you have ceph-common installed on the other hosts? If so, which version? Can you provide the kernel version as well (for the new and old hypervisors)?
No - the problematic host is the only one with ceph-common installed. It has version ceph-common-16.2.5-2.el8.x86_64.

I very much suspect that the mount.ceph command is responsible for translating the DNS name into IP addresses - the man page for mount.ceph confirms this...
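If it helps to double-check that theory on the problematic host, a quick comparison could look something like this (just an illustrative snippet - the hostname and the /rhev/data-center/mnt/ prefix are taken from the logs above, and the trailing ':/' test is only a crude way of picking out ceph-style entries):

import socket

MONITOR_HOST = "monitors.storage-01.example.com"

# What the hostname resolves to on this hypervisor.
resolved = sorted({info[4][0] for info in
                   socket.getaddrinfo(MONITOR_HOST, None,
                                      proto=socket.IPPROTO_TCP)})
print("resolved:", resolved)

# What the kernel reports as fs_spec for the ceph mount.
with open("/proc/mounts") as mounts:
    for line in mounts:
        fs_spec, fs_file = line.split()[:2]
        if fs_file.startswith("/rhev/data-center/mnt/") and fs_spec.endswith(":/"):
            print("in /proc/mounts:", fs_spec, "->", fs_file)

On this host /proc/mounts shows the comma-separated IP form (matching the df output above), while the other hypervisors still show the hostname.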
As for kernel version: it varies a little.

Problematic hypervisor: 4.18.0-305.12.1.el8_4.x86_64

Other hypervisors have some variation - kernel versions are:
* 4.18.0-240.22.1.el8_3.x86_64
* 4.18.0-305.12.1.el8_4.x86_64 (yes: same as the problematic one)
* 4.18.0-305.7.1.el8_4.x86_64

As some of the hypervisors have the same kernel version as the problematic one, I doubt the kernel is the culprit. Although some hypervisors have newer kernel versions installed, they have not necessarily been rebooted - the list above is based on the currently-running kernel (= output of "uname -r") rather than the latest installed one.
Thanks, we need to check this with ceph as it sounds like the newer mount.ceph overrides the previous behavior. For now, we stopped installing ceph-common automatically, so if Managed Block Storage is not used it can simply be removed as a workaround.
Benny, since ceph-common is not automatically installed, can we close the bug?
As far as I can see, ceph-common is still automatically installed... I upgraded the host to the latest available, and not only is ceph-common still installed, but attempts at removing it indicate it is still required:

[root@hv-03 ~]# rpm -qa |grep ceph
python3-ceph-argparse-16.2.6-1.el8.x86_64
python3-cephfs-16.2.6-1.el8.x86_64
libcephfs2-16.2.6-1.el8.x86_64
python3-ceph-common-16.2.6-1.el8.x86_64
ceph-common-16.2.6-1.el8.x86_64
[root@hv-03 ~]# rpm --erase ceph-common
error: Failed dependencies:
        ceph-common is needed by (installed) ovirt-host-dependencies-4.4.8-1.el8.x86_64

Also: the upgrade wiped out my manual patch (see above) to /usr/lib/python3.6/site-packages/vdsm/storage/mount.py (which is expected). I had to re-apply the patch to get a working hypervisor.
Worth rechecking in 4.5.

No update for a while and it didn't make it into 4.5 - closing.
I'll have to protest about the closing of this bug:

- It is still a real issue
- I have supplied a patch to fix it
- surely whether it made it into ovirt 4.5 is irrelevant as to whether it is a bug or not?

I'm happy to try out any tweaks you suggest or dig out more log info if needed - just ask...
(In reply to Karl E. Jørgensen from comment #15)
> I'll have to protest about the closing of this bug:
>
> - It is still a real issue
> - I have supplied a patch to fix it
> - surely whether it made it into ovirt 4.5 is irrelevant as to whether it is
> a bug or not?
>
> I'm happy to try out any tweaks you suggest or dig out more log info if
> needed - just ask...

Hi,

Can you post your change as a pull request to vdsm?
https://github.com/oVirt/vdsm
OK - I created https://github.com/oVirt/vdsm/pull/142 for your perusal and general entertainment.

Regards
Moved to GitHub: https://github.com/oVirt/vdsm/issues/202