Today RHV does not specify our_host_name in sanlock.conf, so sanlock generates a random uuid each time it starts. This makes it much harder to debug RHV sanlock issues, as in the example below. Sanlock simply does not have enough space to list the full host name next to the randomly generated uuid, which makes it impossible to correlate a sanlock uuid with the actual host, especially in environments with multiple hosts.

Example:

2017-10-31 13:07:20-0400 329 [exampo]: s1 host 1 233691 3552954 fdcadd1d-6505-44f6-9e3f-5bcabd02e8e5.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 2 6 4225585 4b2e5d32-7612-4a1c-a8dc-85290d7b80a9.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 3 10 4249851 e4330bde-1b88-4c78-af0a-aa4c526c43df.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 4 21 0 3934a00b-6a6d-47b4-a6e4-30fe9016259b.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 5 13 3620639 60706565-e7ab-4e70-9c34-480c3e616eea.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 6 12 4071186 b8897179-8739-45eb-a225-42c2e01cc2ea.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 7 20 4139635 99891d9c-3633-4741-aa55-20d3a86d60fc.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 8 8 3565781 50d01723-f71a-47f0-bdc0-2824fe26bae9.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 9 7 3990028 fb41dc71-2207-47e9-bab3-fcaf37807242.rhevhost
2017-10-31 13:07:20-0400 329 [1085]: s1 host 10 18 308 962addbd-158a-4888-bded-cfe5cfe1f4ca.localhost

These are 10 different hosts trying to acquire a lease on the same sanlock lockspace s1, but there is no way to tell which host is which: they all look the same ("rhevhost") because the part of the name that would identify each host is truncated.
The sanlock log is not produced by VDSM. Can you set it in the conf so it'll be displayed in the log?
(In reply to Yaniv Kaul from comment #1)
> The sanlock log is not produced by VDSM. Can you set it in the conf so it'll
> be displayed in the log?

Yaniv, that was the actual advice from David, and that's why I opened this bug. If this value is not set in the conf file, sanlock will generate a random value each time it starts.
If sanlock is not configured with a host name (our_host_name in sanlock.conf), it will generate a new uuid to identify the host each time sanlock is started. It is critical that the host name be unique, so sanlock doesn't just grab uname/hostname. Using a uuid to identify the host has the downside of making it more difficult to debug, because you have to hunt down which host was using which uuid (by looking in sanlock.log). If rhev/vdsm has a way of assigning globally unique names to every host, then it could put that name into sanlock.conf/our_host_name. This could make things simpler to debug if those unique names were more readable than uuids. Even if they were uuids, it could still be advantageous to make these uuid names persistent so that a host would use the same name/uuid each time it restarted. I think that Nir had some ideas in this area, but I have forgotten the details.
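For illustration, a persistent name is just a one-line setting in /etc/sanlock/sanlock.conf (the uuid below is only a placeholder, not a value from any real host):

# /etc/sanlock/sanlock.conf
# unique, persistent name for this host; read by sanlock when the daemon starts
our_host_name = 00000000-0000-0000-0000-000000000001

Sanlock reads this at daemon startup, so it has to be restarted for a new name to take effect.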
We have a host ID which is unique in RHV. Can't we use it?
Is this host ID in RHV the same number that is given to sanlock to use as the sanlock host_id? (For sanlock it must be a number between 1 and 2000.) If we're talking about the same host_id, then no, this will not work because these host_id numbers are not unique, they can be used for different hosts (even if only by mistake). The issue prompting this bz is one case where two hosts are both trying to use the same host_id number (10). A unique host name is how we detect this mistake of two hosts trying to use the same host_id.
(In reply to David Teigland from comment #6)
> Is this host ID in RHV the same number that is given to sanlock to use as
> the sanlock host_id? (For sanlock it must be a number between 1 and 2000.)
> If we're talking about the same host_id, then no, this will not work because
> these host_id numbers are not unique, they can be used for different hosts
> (even if only by mistake). The issue prompting this bz is one case where
> two hosts are both trying to use the same host_id number (10). A unique
> host name is how we detect this mistake of two hosts trying to use the same
> host_id.

No, I was referring to:

[root@lago-basic-suite-master-host-0 vdsm]# cat /etc/vdsm/vdsm.id
55B25364-93F2-43F9-B562-D46A69831E81
(In reply to Yaniv Kaul from comment #5) > We have a host ID which is unique in RHV. Can't we use it? I think so, if we build some mechanism to pipe it through. David - any objection?
Sounds good to me. In sanlock terms, that would be the "host name", i.e. a unique string ID, whereas the "host ID" is a unique integer ID (between 1-2000). sanlock will currently accept a host name from sanlock.conf or the command line (-e), but the command line method is not accessible when starting sanlock from the systemd unit file. A future version could add a library interface for setting this.
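As an illustration of the command line method mentioned above (the name is a placeholder, and this is only possible when starting the daemon by hand rather than from the unit file), it would look roughly like:

# sanlock daemon -e my-unique-host-name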
(In reply to David Teigland from comment #9)
> sanlock will currently accept a host name from sanlock.conf or the command
> line (-e), but the command line method is not accessible when starting
> sanlock from the systemd unit file. A future version could add a library
> interface for setting this.

Is this targeted for any upcoming RHEL version, or just a future RFE in the backlog?
I discussed this with David today, we think the right way is to configure sanlock with the vdsm host id and an optional host name (the same name used in engine?).

The format can be:

    vdsm-host-uuid[/truncated...name]

Vdsm will have to verify the sanlock configuration during startup. All of this can be done using the current sanlock configurator (vdsm-tool configure --module sanlock).
(In reply to Nir Soffer from comment #12)
> I discussed this with David today, we think the right way is to configure sanlock
> with the vdsm host id and an optional host name (the same name used in engine?).
>
> The format can be:
>
>     vdsm-host-uuid[/truncated...name]
>
> Vdsm will have to verify the sanlock configuration during startup. All of this can be
> done using the current sanlock configurator (vdsm-tool configure --module sanlock)

So what should be done in order to close this bug?
(In reply to Tal Nisan from comment #13)

This is not only for debugging, it is needed to avoid delays after unclean shutdown. We must configure sanlock with a unique uuid that cannot be used on any other host, see bug 1609029.

The host hardware id should be fine for this:

# cat /etc/vdsm/vdsm.id
55B25364-93F2-43F9-B562-D46A69831E81

I would not add a truncated host name, since the host name can change on the engine side, but we want a stable value that never changes in the sanlock configuration and on storage.

To locate host details using the host hardware id seen on storage, one can use the SDK search feature:

    hosts_service = system_service.hosts_service()
    hosts = hosts_service.list(
        search="hw_id=%s" % hardware_id,
        case_sensitive=False,
    )

We need to change the sanlock configurator to configure the sanlock host name:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/sanlock.py

On new installs, this must be done before sanlock has accessed any storage, during "vdsm-tool configure --force".

On upgrades, the host must be moved to maintenance, then "vdsm-tool configure --force" will stop sanlock, configure it, and start it again. If we need to configure sanlock and the host is not in maintenance, fail the upgrade. Trying to stop sanlock while it is connected to storage will either fail after a long time, or, if systemd kills sanlock, will cause a reboot.

When starting vdsm, the value of our_host_name must be checked. If the value does not match the hardware id, vdsm should refuse to start.
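For reference, a minimal self-contained sketch of that lookup with the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, CA file and hardware id are placeholders:

import ovirtsdk4 as sdk

# Hardware id as found in /etc/vdsm/vdsm.id or in a sanlock lease owner name.
hardware_id = "55B25364-93F2-43F9-B562-D46A69831E81"

# Placeholder engine address and credentials.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    ca_file="ca.pem",
)
try:
    hosts_service = connection.system_service().hosts_service()
    # Search hosts by hardware id, ignoring case.
    hosts = hosts_service.list(
        search="hw_id=%s" % hardware_id,
        case_sensitive=False,
    )
    for host in hosts:
        print(host.name, host.address)
finally:
    connection.close()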
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Hi David,

Is it possible to print the host uuid and hostname in sanlock.log at host boot? Fixing this on the RHV side requires too many changes and carries some risk in introducing such a change now.
The uuid (and truncated name) are already logged, e.g.

sanlock daemon started 3.7.0 host 789788eb-2a22-4cc9-bd54-b2786c704ddf.null-04.la

so it sounds like you are interested in seeing the entire hostname also? I could add that, would this look ok?

sanlock daemon started 3.7.0 host 789788eb-2a22-4cc9-bd54-b2786c704ddf.null-04.la (null-04.lab.msp.redhat.com)
(In reply to David Teigland from comment #26)
> sanlock daemon started 3.7.0 host
> 789788eb-2a22-4cc9-bd54-b2786c704ddf.null-04.la (null-04.lab.msp.redhat.com)

Looks good. The host uuid is not very useful today because we use a random uuid on every boot, but once we configure sanlock properly with /etc/vdsm/vdsm.id (easy) or the host uuid from engine (harder), it will be very useful.
pushed to master
https://pagure.io/sanlock/c/cff348800722f7dadf030ffe7494c2df714996e3?branch=master

If this is the only code change for this bug, then the bz should probably be reassigned to sanlock and a future RHEL8 release.
I think a separate sanlock bug would be better, since we really need to configure the sanlock host uuid properly. This can avoid delays when resuming from unclean shutdown, and will make it easier to debug issues in the cluster.

For example, if we use the uuid from /etc/vdsm/vdsm.id, when we find that host uuid owning a lease, we can locate the host in engine:

$ cat /etc/vdsm/vdsm.id
324e9b61-49aa-4220-9e52-f7ee3c88c3f3

https://engine/ovirt-engine/api/hosts/?search=hw_id=324e9b61-49aa-4220-9e52-f7ee3c88c3f3

With the random uuid changing on every boot, we don't have a good way to find the host.
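For example, a quick query from the shell could look roughly like this (the engine address and credentials are placeholders, and -k skips certificate verification only for brevity):

$ curl -k -u 'admin@internal:password' \
    'https://engine.example.com/ovirt-engine/api/hosts?search=hw_id=324e9b61-49aa-4220-9e52-f7ee3c88c3f3'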
(In reply to Nir Soffer from comment #29)
> I think a separate sanlock bug would be better, since we really need to configure
> the sanlock host uuid properly. This can avoid delays when resuming from unclean
> shutdown, and will make it easier to debug issues in the cluster.
>
> For example, if we use the uuid from /etc/vdsm/vdsm.id, when we find that host
> uuid owning a lease, we can locate the host in engine:
>
> $ cat /etc/vdsm/vdsm.id
> 324e9b61-49aa-4220-9e52-f7ee3c88c3f3
>
> https://engine/ovirt-engine/api/hosts/?search=hw_id=324e9b61-49aa-4220-9e52-f7ee3c88c3f3
>
> With the random uuid changing on every boot, we don't have a good way to find
> the host.

Hi Nir,

What are the implications of getting this change in? I do not think it should be high risk, since this lock is renewed regularly and a new uuid can be used when needed. So if it is easy to implement, we would like to see it in 4.4.z.
(In reply to David Teigland from comment #28)
> pushed to master
> https://pagure.io/sanlock/c/cff348800722f7dadf030ffe7494c2df714996e3?branch=master
>
> If this is the only code change for this bug, then the bz should probably be
> reassigned to sanlock and a future RHEL8 release.

Hi David,

Sorry for not replying earlier, though I read this before. Is this in RHEL8? Or is it waiting on some action from me? I can file a bug if needed, please let me know.
(In reply to Marina Kalinin from comment #31)
> Sorry for not replying earlier, though I read this before.
> Is this in RHEL8? Or is it waiting on some action from me?
> I can file a bug if needed, please let me know.

See comment 29. The sanlock change does not fix the issue, which is vdsm not configuring the sanlock uuid. Comment 14 explains what should be done.
There was one small sanlock change in comment 28 which logs the full hostname. It hadn't been worth doing an update with just that change, but it will be in 8.3 along with other sanlock fixes (related to the client connection issues.)
While testing recovery from unclean shutdown related to bug 1902468, I configured the sanlock host name to see how it affects the timing. I measured the time to start a host connected to 20 storage domains after unclean shutdown.

Without host name: 5m:32s
With host name:    1m:47s

For reference, after a clean shutdown activation takes 1m:05s.

Here is the tested configuration:

$ cat /etc/sanlock/sanlock.conf
# Setting for vdsm
our_host_name = d8635601-ea8b-4c5c-a624-41bd72b862d6
max_worker_threads = 40

(max_worker_threads is not available in RHEL 8.3).
Verified on rhv-4.4.4-6:
> 'our_host_name' is specified in sanlock.log (on spm host) instead of a random uuid.

Nir, is there something else I need to check?
(In reply to Evelina Shames from comment #38)
> Verified on rhv-4.4.4-6:
> > 'our_host_name' is specified in sanlock.log (on spm host) instead of a random uuid.

Sanlock must be configured on all hosts, not only on the SPM.

> Nir, is there something else I need to check?

Here is how to verify this change:

## After adding a host to engine 4.4.4 or upgrading a host, sanlock is configured

/etc/sanlock/sanlock.conf contains the our_host_name option, set to the host hardware id:

$ cat /etc/sanlock/sanlock.conf
# Configuration for vdsm
our_host_name = 095bce26-9d5b-4ae2-b3bf-dddfc9cdf541
...

$ cat /etc/vdsm/vdsm.id
095bce26-9d5b-4ae2-b3bf-dddfc9cdf541

Sanlock reports this value in the sanlock log:

2021-01-03 11:40:54 9 [964]: sanlock daemon started 3.8.2 host 095bce26-9d5b-4ae2-b3bf-dddfc9cdf541 (host3)

## vdsm-tool is-configured succeeds

$ sudo vdsm-tool is-configured
Current revision of multipath.conf detected, preserving
sanlock is configured for vdsm
abrt is already configured for vdsm
Managed volume database is already configured
lvm is configured for vdsm
libvirt is already configured for vdsm

## If our_host_name does not match the host hardware id

$ cat /etc/sanlock/sanlock.conf
# Configuration for vdsm
our_host_name = invalid-host-id
max_worker_threads = 50

vdsm-tool is-configured will fail:

$ sudo vdsm-tool is-configured --module sanlock
sanlock config file needs options: {'our_host_name': '095bce26-9d5b-4ae2-b3bf-dddfc9cdf541'}
Modules sanlock are not configured
Error:

One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [--module module-name]'.

If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)

Changing the configuration back to the right value will make vdsm-tool happy again.

## When sanlock is not configured, the vdsmd service will fail to start

$ sudo systemctl stop vdsmd
$ sudo systemctl start vdsmd
Job for vdsmd.service failed because the control process exited with error code.
See "systemctl status vdsmd.service" and "journalctl -xe" for details.

Looking in journalctl we should see the error about configuring sanlock:

Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: vdsm: Running check_is_configured
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: Error:
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: One of the modules is not configured to work with VDSM.
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: To configure the module use the following:
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: 'vdsm-tool configure [--module module-name]'.
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: If all modules are not configured try to use:
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: 'vdsm-tool configure --force'
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: (The force flag will stop the module's service and start it
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: afterwards automatically to load the new configuration.)
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]:
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: lvm is configured for vdsm
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: Managed volume database is already configured
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: Current revision of multipath.conf detected, preserving
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: abrt is already configured for vdsm
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: libvirt is already configured for vdsm
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: sanlock config file needs options: {'our_host_name': '095bce26-9d5b-4ae2-b>
Jan 03 18:54:25 host3 vdsmd_init_common.sh[32952]: Modules sanlock are not configured
Jan 03 18:54:26 host3 vdsmd_init_common.sh[32952]: vdsm: stopped during execute check_is_configured task (task returned with>

## To configure a host manually

1. Move the host to maintenance
2. Configure vdsm using vdsm-tool

$ sudo vdsm-tool configure --force --module sanlock

Checking configuration status...

sanlock config file needs options: {'our_host_name': '095bce26-9d5b-4ae2-b3bf-dddfc9cdf541'}

Running configure...
Configuring sanlock config file
Previous sanlock.conf copied to /etc/sanlock/sanlock.conf.20210103184638
Reconfiguration of sanlock is done.

Done configuring modules to VDSM.

## Configuring sanlock creates a backup of /etc/sanlock/sanlock.conf

$ ls -lh /etc/sanlock/sanlock.conf*
-rw-r--r--. 1 root root 102 Jan  3 18:46 /etc/sanlock/sanlock.conf
-rw-r--r--. 1 root root  81 Jan  3 18:46 /etc/sanlock/sanlock.conf.20210103184638

$ cat /etc/sanlock/sanlock.conf.20210103184638
# Configuration for vdsm
our_host_name = invalid-host-id
max_worker_threads = 50

## Sanlock.conf selinux labels are kept during upgrade

$ ls -lhZ /etc/sanlock/sanlock.conf
-rw-r--r--. 1 root root system_u:object_r:sanlock_conf_t:s0 102 Jan  3 18:46 /etc/sanlock/sanlock.conf
(In reply to Nir Soffer from comment #39)
> Here is how to verify this change:
> [...]

Verified on rhv-4.4.4-6 with the above steps on both SPM and HSM hosts.
Tested on:

Engine: https://rhev-red-03.rdu2.scalelab.redhat.com
40 storage domains, iscsi connectivity

Host: F02-h25-000-r620.rdu2.scalelab.redhat.com
Version:
  vdsm-4.40.40-1.el8ev.x86_64
  Rhv-release-4.4.4-7-001.noarch
  sanlock-3.8.2-1.el8.x86_64
  max_worker_threads = 50

+---------------------------------+-----------------+-----------+
| F02-h25-000-r620 - vdsm-4.40.40 |                 | Duration  |
+---------------------------------+-----------------+-----------+
| Scenario                        |                 |           |
| Clean shutdown                  | deactivate host | 17.76 s   |
|                                 | activate host   | 139 s     |
| Unclean shutdown                | activate host   | 140 s     |
+---------------------------------+-----------------+-----------+

Host: F01-h08-000-1029u.rdu2.scalelab.redhat.com
Version:
  vdsm-4.40.37-1.el8ev.x86_64
  Rhv-release-4.4.4-2-001.noarch
  sanlock-3.8.2-1.el8.x86_64

+----------------------------------+-----------------+-----------+
| F01-h08-000-1029u - vdsm-4.40.37 |                 | Duration  |
+----------------------------------+-----------------+-----------+
| Scenario                         |                 |           |
| Clean shutdown                   | deactivate host | 19.51 s   |
|                                  | activate host   | 136 s     |
| Unclean shutdown                 | activate host   | 975 s     |
+----------------------------------+-----------------+-----------+

P.S. Once we get the official sanlock build we will be able to test it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.4]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0382