Description of problem:

Randomly, the ha-broker fails to talk to sanlock on the hosts with errno 104 (connection reset by peer). This makes the broker restart, which triggers the agent restarting too. The user receives notifications about the state machine changing states. After restart everything is back to normal. See:

1. Broker fails to talk to sanlock

StatusStorageThread::ERROR::2019-01-04 16:22:42,838::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')

2. Broker shuts down due to the above error

StatusStorageThread::ERROR::2019-01-04 16:22:42,839::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker
Listener::INFO::2019-01-04 16:22:43,341::broker::61::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Server shutting down

3. Agent loses connection to broker socket

MainThread::ERROR::2019-01-04 16:22:51,806::hosted_engine::431::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 428, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in _monitoring_loop
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 127, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 81, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 736, in collect_stats
    all_stats = self._broker.get_stats_from_storage()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
    result = self._proxy.get_stats()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python2.7/httplib.py", line 1037, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 881, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 843, in send
    self.connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 52, in connect
    self.sock.connect(base64.b16decode(self.host))
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
4. Agent restarts too

MainThread::ERROR::2019-01-04 16:22:51,830::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2019-01-04 16:22:51,830::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

Observations:
- sanlock is running fine all the time
- upon restart of the broker everything is back to normal
- 100% of the hits are on the same function, _inquire_whiteboard_lock, suspicious?

The sanlock client log dump only shows the hosted-engine lease being released by the host at that time; I assume this is due to the agent restarting.

2019-01-04 16:22:43 5798061 [1398]: cmd_rem_lockspace 15,30 hosted-engine flags 0
2019-01-04 16:22:44 5798062 [1392]: s2 set killing_pids check 0 remove 1
2019-01-04 16:22:44 5798062 [1392]: s2 all pids clear
2019-01-04 16:22:44 5798062 [1392]: s2 set thread_stop
2019-01-04 16:22:44 5798062 [1392]: s2 wdmd_test_live 0 0 to disable
2019-01-04 16:22:45 5798063 [4997]: s2 delta_release begin hosted-engine:2
2019-01-04 16:22:45 5798063 [4997]: s2 delta_release done 2 7 0
2019-01-04 16:22:45 5798063 [1392]: s2 free lockspace
2019-01-04 16:22:46 5798064 [1398]: cmd_rem_lockspace 15,30 done 0

After the broker restarts, the lease is acquired again and everything goes back to normal for a few days until it happens again. It happens randomly, on hosts running the HE VM and on hosts not running it.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
sanlock-3.6.0-1.el7.x86_64
sanlock-python-3.6.0-1.el7.x86_64

How reproducible:
Unknown, but it happens regularly at the customer site.

Actual results:
Broker fails to talk to sanlock and restarts, triggering "false" events.

Expected results:
Broker not restarting
Any idea what could be causing this? If nothing comes up maybe trace the syscalls to /var/run/sanlock/sanlock.sock?
Without a reproducer (which needs to be checked) it's going to be hard to trace the calls. Since this has never been reported anywhere and appears to also affect sanlock elsewhere, I'd suspect storage flapping, though the logs look ok. I'll take a deeper dive tomorrow morning
(In reply to Ryan Barry from comment #9)
> Without a reproducer (which needs to be checked) it's going to be hard to
> trace the calls.

Hi Ryan, thanks for the quick reply ;)

Sure, it won't be the easiest thing, but I can work with the customer on it if we need to get to that point.

> Since this has never been reported anywhere and appears to also affect
> sanlock elsewhere, I'd suspect storage flapping, though the logs look ok.
> I'll take a deeper dive tomorrow morning

What do you mean by 'also affect sanlock elsewhere'? I only see this failure when the broker does _inquire_whiteboard_lock. Did you find something else too? It could also be some incorrect socket handling somewhere...
Wanted to clarify, since I wasn't sure whether "hosts not running HE" meant "hosts using sanlock on the same storage in a different way" or "part of the HE cluster, but HE is not active on this host"
(In reply to Ryan Barry from comment #11)
> Wanted to clarify, since I wasn't sure whether "hosts not running HE" meant
> "hosts using sanlock on the same storage in a different way" or "part of the
> HE cluster, but HE is not active on this host"

Ohh, right. Sorry if it wasn't clear. The "other hosts" also fail on the broker's _inquire_whiteboard_lock while talking to sanlock. It's the same thing on all of them, whether they are running the HE VM or not, but all are part of the HE cluster. There are no other errors related to sanlock; everything else seems OK. It's just the HE hosts failing on this particular function.
thoughts?
I have got another customer with the same issue. Please let us know if you need any more info from the customer's environment to identify the root cause.
Simone, any ideas?
I checked the logs and honestly I don't see anything suspect there. Also, looking at the code, _inquire_whiteboard_lock is basically a single-line function (see the sketch below): https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/broker/status_broker.py#L189 'Connection reset by peer' probably means that for some reason the sanlock client failed to communicate with the sanlock daemon, but we don't have any evidence of that in the sanlock logs. David, any hints?
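For reference, that single line boils down to something like the following (a minimal sketch, assuming the sanlock-python inq_lockspace(lockspace, host_id, path) signature; the function and constant names here are illustrative, not the actual broker code):

~~~
import sanlock

# Assumed lockspace name; it matches the "hosted-engine" lockspace seen in the
# sanlock log excerpt in the description.
LOCKSPACE_NAME = "hosted-engine"

def inquire_whiteboard_lock(host_id, lease_file):
    # Ask the sanlock daemon (over /var/run/sanlock/sanlock.sock) whether this
    # host still holds its lease in the hosted-engine lockspace. If the daemon
    # side of the socket goes away unexpectedly, the binding raises
    # SanlockException with errno 104 (ECONNRESET), as seen in the traceback.
    return sanlock.inq_lockspace(LOCKSPACE_NAME, host_id, lease_file)
~~~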
It looks like send(2) from libsanlock on the sanlock socket must have returned ECONNRESET, and the sanlock daemon may have never seen a problem. /var/log/messages from around that same time shows similar errors from other systems, so I wonder if somehow socket files in /var/run were removed, or something like that?
We have the same issue with oVirt 4.2.8.

/var/log/messages:
ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run#012    if (self._status_broker._inquire_whiteboard_lock() or#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock#012    self.host_id, self._lease_file)#012SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
Any selinux denials, Alexey, Germano? Any chance this is running on oVirt Node || RHVH?
(In reply to Ryan Barry from comment #20)
> Any selinux denials, Alexey, Germano?

Not in my ticket.

> Any chance this is running on oVirt Node || RHVH?

RHEL
We are using oVirt Node 4.2.8 (RHEL-7-6.1810.2.el7.centos).
(In reply to Ryan Barry from comment #20)
> Any selinux denials, Alexey, Germano?

There are no selinux denials.

> Any chance this is running on oVirt Node || RHVH?

oVirt Node 4.2.8
sync2jira
Hi, I have a customer who is hitting this same issue; the ha-broker suddenly fails to talk to sanlock with the same _inquire_whiteboard_lock error:

~~~
StatusStorageThread::ERROR::2020-01-17 07:04:47,775::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
~~~

Let me know if you would need some data to debug this.
I've been digging through sanlock code and trying to reproduce this with no progress. I'm going to look into adding some new debugging in the area of client connections that might be able to catch what's happening.

My best guess is that the sanlock daemon is closing the fd for the inq_lockspace request without logging anything. Some time ago I disabled all of the standard command logging related to inq_lockspace because there was too much of it (it's called very frequently.) I doubt that enabling only those messages would tell us enough.

Once I have implemented some new debugging, I'll create a test build so we can attempt to reproduce the problem with that.
I reproduced what appears to be the same problem and created a fix. I'd welcome any testing with this scratch build containing the patch below (based on rhel):
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27102386

commit 62ba5d194100f71cbed32214e0aeb76f690f4093
Author: David Teigland <teigland>
Date:   Fri Mar 6 16:03:01 2020 -0600

    sanlock: fix closing wrong client fd

    The symptoms of this bug were inq_lockspace returning ECONNRESET.
    It was caused by a previous client closing the fd of a newer client
    doing inq_lockspace (when both clients were running at roughly the
    same time.)

    First client ci1, second client ci2.
    ci1 in call_cmd_daemon() is finished, and close(fd) is called
    (and client[ci].fd is *not* set to -1).
    ci2 is a new client at about the same time and gets the same fd
    that had been used by ci1.
    ci1 being finished triggers a poll error, which results in
    client_free(ci1).  client_free looks at client[ci1].fd and finds
    it is not -1, so it calls close() on it, but this fd is now being
    used by ci2.  This breaks the sanlock daemon connection for ci2
    and the client gets ECONNRESET.

diff --git a/src/cmd.c b/src/cmd.c
index d816fd7a4bfe..b79e224ebf6a 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -3072,7 +3072,24 @@ void call_cmd_daemon(int ci, struct sm_header *h_recv, int client_maxi)
 		break;
 	};
 
+	/*
+	 * Previously just called close(fd) and did not set client[ci].fd = -1.
+	 * This meant that a new client ci could get this fd and use it.
+	 *
+	 * When a poll error occurs because this ci was finished, then
+	 * client_free(ci) would be called for this ci.  client_free would
+	 * see cl->fd was still set and call close() on it, even though that
+	 * fd was now in use by another ci.
+	 *
+	 * We could probably get by with just doing this here:
+	 * client[ci].fd = -1;
+	 * close(fd);
+	 *
+	 * and then handling the full client_free in response to
+	 * the poll error (as done previously), but I see no reason
+	 * to avoid the full client_free here.
+	 */
 	if (auto_close)
-		close(fd);
+		client_free(ci);
 }
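The fd-reuse hazard described in that commit message is generic POSIX behaviour (open/accept hand out the lowest free descriptor number) and can be illustrated outside sanlock. This is only a toy sketch of the mechanism, not sanlock code; over a unix socket the victim would see ECONNRESET rather than the EBADF shown here:

~~~
import os

# "Client 1" finishes and its fd is closed, but the bookkeeping slot still
# remembers the number (mirrors close(fd) without client[ci].fd = -1).
stale_fd = os.open("/dev/null", os.O_RDONLY)
os.close(stale_fd)                    # the number is now free for reuse

# "Client 2" connects right afterwards and the kernel hands out the lowest
# free descriptor, which is the same number client 1 had.
new_fd = os.open("/dev/null", os.O_RDONLY)
assert new_fd == stale_fd             # lowest available fd is reused

# Delayed cleanup for client 1 (client_free in sanlock) now closes "its" fd,
# but that number belongs to client 2 at this point.
os.close(stale_fd)                    # actually closes client 2's descriptor

# Client 2's next operation on its descriptor fails.
try:
    os.read(new_fd, 1)
except OSError as e:
    print("client 2 broken by the stale close:", e)
~~~

In the sanlock case the stale close lands on the daemon-side socket of the new inq_lockspace connection, so that client's request fails and the python binding surfaces it as SanlockException (104, ..., 'Connection reset by peer').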
I reproduced the bug by running the following several times:

$ sanlock status > /dev/null &
  sanlock client inq_lockspace -s lockspace2:1:/dev/sanlk/lock2:0 >> inq &
  sanlock client inq_lockspace -s lockspace1:1:/dev/sanlk/lock1:0 >> inq &
  sanlock client inq_lockspace -s lockspace0:1:/dev/sanlk/lock0:0 >> inq &
  sanlock status > /dev/null

One of the inq_lockspace commands will print the 104 (ECONNRESET) error. Including the status command was important, because it exercised the problematic code path.

The following commands/APIs exercise that code path, and to reproduce this bug I expect that one of these commands would have to be running at about the same time as the inq_lockspace: status, host_status, renewal, log_dump, get_lockspaces, get_hosts, reg_event, end_event, getconfig.

Do the vdsm logs show that a command from that set was also being run at about the same time as ECONNRESET appeared from inq_lockspace?
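For anyone retesting with the scratch build, the race above can be hammered in a loop until it fires. A minimal sketch (the lockspace specs are the ones from the reproducer above and must be adjusted to whatever lockspaces exist on the host; the check for "104" in the output is deliberately crude):

~~~
import subprocess

# Commands from the reproducer above; adjust the -s arguments for your host.
CMDS = [
    ["sanlock", "status"],
    ["sanlock", "client", "inq_lockspace", "-s", "lockspace2:1:/dev/sanlk/lock2:0"],
    ["sanlock", "client", "inq_lockspace", "-s", "lockspace1:1:/dev/sanlk/lock1:0"],
    ["sanlock", "client", "inq_lockspace", "-s", "lockspace0:1:/dev/sanlk/lock0:0"],
    ["sanlock", "status"],
]

for attempt in range(1000):
    # Launch everything at once so the commands overlap, as in the shell version.
    procs = [subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
             for c in CMDS]
    outputs = [p.communicate() for p in procs]
    if any(b"104" in out + err for out, err in outputs):
        print("hit the ECONNRESET error on attempt", attempt)
        break
else:
    print("no hit in 1000 attempts")
~~~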
David, I've checked "sanlock_client_log_dump" from the initial ticket I opened this bug with, but it does not cover the time of the error. I think that if we have any chance of finding the simultaneous commands, it would be there.

However, what you mention seems plausible to me, as a RHV/Hosted-Engine host uses a few locks, for example:

# sanlock status
daemon 917e0f73-e20b-42b6-b10a-b3e89a1b4818.host1
p -1 helper
p -1 listener
p 4980 HostedEngine
p 3185
p -1 status
s f330840e-d357-4ed4-be47-582fce2d22e0:1:/rhev/data-center/mnt/192.168.151.252\:_exports_nfs/f330840e-d357-4ed4-be47-582fce2d22e0/dom_md/ids:0
s 2af97604-e629-4308-b655-3d5a88a64024:1:/dev/2af97604-e629-4308-b655-3d5a88a64024/ids:0
s hosted-engine:1:/var/run/vdsm/storage/c64637b4-f0e8-408c-b8af-6a52946113e2/7962efd6-92de-4319-a62b-8404945e3e94/40168fc1-36c6-433c-97b6-664e209867cd:0
s c64637b4-f0e8-408c-b8af-6a52946113e2:1:/rhev/data-center/mnt/192.168.151.252\:_exports_he/c64637b4-f0e8-408c-b8af-6a52946113e2/dom_md/ids:0
r c64637b4-f0e8-408c-b8af-6a52946113e2:SDM:/rhev/data-center/mnt/192.168.151.252\:_exports_he/c64637b4-f0e8-408c-b8af-6a52946113e2/dom_md/leases:1048576:58 p 3185
r c64637b4-f0e8-408c-b8af-6a52946113e2:91d8a4cd-983b-49c4-84d2-6cb97d0f4c6f:/rhev/data-center/mnt/192.168.151.252\:_exports_he/c64637b4-f0e8-408c-b8af-6a52946113e2/images/f1ade880-0811-497a-abbf-be7d56e97255/91d8a4cd-983b-49c4-84d2-6cb97d0f4c6f.lease:0:18 p 4980

The last two entries are only present if the host is the SPM (the SDM lease) and if the HE VM is running on it, respectively. Even non-Hosted-Engine hypervisors will have the 'ids' ones, and possibly SDM too if the host is the SPM. And the more storage domains, the more locks it will have.

So if I understand correctly, inquiring for any two of these at the same time could trigger the bug....
AFAICT inq_lockspace has no problem, it's only a victim of one of the other commands I listed which use the problematic code path. If there's anything on the system using one of those commands, then inq_lockspace (or any other client connection) could be affected if they are running at the same time (it's a race). I think it's most likely that RHV is using get_lockspaces/get_hosts/host_status. I did find another location where close() could also cause this problem, but it only happens on an error path used when the system is out of memory or out of threads, which don't seem to be happening.
RHEL8 sanlock bug for this ECONNRESET problem: bug 1812185
deprecating SLA team usage, moving to Virt
Targeting 4.5 for the time being, as the sanlock fix will only be released in RHEL 8.3; if needed, please escalate to have it in 8.2.z so we can consume it earlier.
Reproduced on these components:
rhvm-4.4.3.6-0.13.el8ev.noarch
ovirt-engine-setup-4.4.3.6-0.13.el8ev.noarch
sanlock-3.8.2-1.el8.x86_64
vdsm-4.40.33-1.el8ev.x86_64
libvirt-client-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
libvirt-lock-sanlock-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
ovirt-hosted-engine-setup-2.4.7-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
Linux 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)
Created attachment 1721454 [details]
Logs from alma07, the added ha-host
Please take a look at Comment #36.
(In reply to Nikolai Sednev from comment #38)
> Please take a look at Comment #36.

There is no relevant info in this comment. What did you reproduce? How? What are the reproduction steps?
(In reply to Nir Soffer from comment #39)
> (In reply to Nikolai Sednev from comment #38)
> > Please take a look at Comment #36.
>
> There is no relevant info in this comment. What did you reproduce?
> How? What are the reproduction steps?

I reproduced exactly the same error, "MainThread::ERROR::2019-01-04 16:22:51,806::hosted_engine::431::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception", which was mentioned in the description and can be seen in the attached logs.

For reproduction I used these steps:
1. Deploy HE over iSCSI storage on one host.
2. Add an additional ha-host from the "Guide Me" option in the data center tab.
3. The additional ha-host is added in local maintenance; activate it from the hosts tab using the Manage button.
4. Check /var/log/ovirt-hosted-engine-ha/agent.log on the additionally added ha-host.
(In reply to Nikolai Sednev from comment #40)
> I reproduced exactly the same error, "MainThread::ERROR::2019-01-04
> 16:22:51,806::hosted_engine::431::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(start_monitoring) Unhandled monitoring loop exception", which
> was mentioned in the description and can be seen in the attached logs.
>
> For reproduction I used these steps:
> 1. Deploy HE over iSCSI storage on one host.
> 2. Add an additional ha-host from the "Guide Me" option in the data center
> tab.
> 3. The additional ha-host is added in local maintenance; activate it from
> the hosts tab using the Manage button.
> 4. Check /var/log/ovirt-hosted-engine-ha/agent.log on the additionally
> added ha-host.

Moving the bug to hosted-engine-ha to investigate this error.
Nir, should we require sanlock-3.8.1-1 version in VDSM to consume the fix for bz 1812185?
(In reply to Eyal Shenitzky from comment #42)
> Nir, should we require sanlock-3.8.1-1 version in VDSM to consume the fix
> for bz 1812185?

We should require 3.8.2 (the version in 8.3), see bug 1821042.

Note that comment 36 reports that this version was tested.
David, it seems that we still have an issue here. Is this the same as the original bug, or do we have a different one?
I don't see anything about sanlock to say if there might be something wrong or not.
Sanlock version bumped and now we are using RHEL-8.4. Moving to ON_QA.
I didn't reproduce the originally reported error on a freshly deployed environment over NFS storage, which ran for at least 5 days in a row. Moving to verified, as it works fine for me on these components:
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.5.0-2.el8ev.noarch
vdsm-4.40.60.5-1.el8ev.x86_64
sanlock-3.8.3-1.el8.x86_64
libvirt-lock-sanlock-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64
qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
Linux 4.18.0-304.el8.x86_64 #1 SMP Tue Apr 6 05:19:59 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Engine is ovirt-engine-setup-4.4.6.6-0.10.el8ev.noarch.

Please feel free to reopen if this still gets reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.6]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2178