Description of problem:

Randomly, the ha-broker fails to talk to sanlock on the hosts with errno 104 (connection reset by peer). This makes the broker restart, which triggers the agent restarting too. The user receives notifications about the state machine changing states. After restart everything is back to normal. See:

1. Broker fails to talk to sanlock

StatusStorageThread::ERROR::2019-01-04 16:22:42,838::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')

2. Broker shuts down due to the above error

StatusStorageThread::ERROR::2019-01-04 16:22:42,839::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker
Listener::INFO::2019-01-04 16:22:43,341::broker::61::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Server shutting down

3. Agent loses connection to broker socket

MainThread::ERROR::2019-01-04 16:22:51,806::hosted_engine::431::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 428, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in _monitoring_loop
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 127, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 81, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 736, in collect_stats
    all_stats = self._broker.get_stats_from_storage()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
    result = self._proxy.get_stats()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python2.7/httplib.py", line 1037, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 881, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 843, in send
    self.connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 52, in connect
    self.sock.connect(base64.b16decode(self.host))
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
4. Agent restarts too

MainThread::ERROR::2019-01-04 16:22:51,830::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2019-01-04 16:22:51,830::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

Observations:
- sanlock is running fine all the time
- upon restart of the broker everything is back to normal
- 100% of the hits are on the same function, _inquire_whiteboard_lock, suspicious?

The sanlock client log dump only shows the hosted-engine lease being released by the host at that time; I assume this is due to the agent restarting.

2019-01-04 16:22:43 5798061 [1398]: cmd_rem_lockspace 15,30 hosted-engine flags 0
2019-01-04 16:22:44 5798062 [1392]: s2 set killing_pids check 0 remove 1
2019-01-04 16:22:44 5798062 [1392]: s2 all pids clear
2019-01-04 16:22:44 5798062 [1392]: s2 set thread_stop
2019-01-04 16:22:44 5798062 [1392]: s2 wdmd_test_live 0 0 to disable
2019-01-04 16:22:45 5798063 [4997]: s2 delta_release begin hosted-engine:2
2019-01-04 16:22:45 5798063 [4997]: s2 delta_release done 2 7 0
2019-01-04 16:22:45 5798063 [1392]: s2 free lockspace
2019-01-04 16:22:46 5798064 [1398]: cmd_rem_lockspace 15,30 done 0

After the broker restarts, the lease is acquired again and everything goes back to normal for a few days until it happens again. It happens randomly, on hosts running the HE VM and on hosts not running it.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
sanlock-3.6.0-1.el7.x86_64
sanlock-python-3.6.0-1.el7.x86_64

How reproducible:
Unknown, but it happens regularly at the customer site.

Actual results:
Broker fails to talk to sanlock and restarts, triggering "false" events.

Expected results:
Broker not restarting
Any idea what could be causing this? If nothing comes up maybe trace the syscalls to /var/run/sanlock/sanlock.sock?
Without a reproducer (which needs to be checked) it's going to be hard to trace the calls. Since this has never been reported anywhere and appears to also affect sanlock elsewhere, I'd suspect storage flapping, though the logs look ok. I'll take a deeper dive tomorrow morning
(In reply to Ryan Barry from comment #9)
> Without a reproducer (which needs to be checked) it's going to be hard to
> trace the calls.

Hi Ryan, thanks for the quick reply ;)

Sure, it won't be the easiest thing, but I can work with the customer on it if we need to get to that point.

> Since this has never been reported anywhere and appears to also affect
> sanlock elsewhere, I'd suspect storage flapping, though the logs look ok.
> I'll take a deeper dive tomorrow morning

What do you mean by 'also affect sanlock elsewhere'? I only see this failure when the broker does _inquire_whiteboard_lock. Did you find something else too? It could also be some incorrect socket handling somewhere...
Wanted to clarify, since I wasn't sure whether "hosts not running HE" meant "hosts using sanlock on the same storage in a different way" or "part of the HE cluster, but HE is not active on this host"
(In reply to Ryan Barry from comment #11)
> Wanted to clarify, since I wasn't sure whether "hosts not running HE" meant
> "hosts using sanlock on the same storage in a different way" or "part of the
> HE cluster, but HE is not active on this host"

Ohh, right. Sorry if it wasn't clear. The "other hosts" also fail on the broker's _inquire_whiteboard_lock while talking to sanlock. It's the same thing on all of them, whether they are running the HE VM or not, but all are part of the HE cluster. There are no other errors related to sanlock; everything else seems OK. It's just the HE hosts failing on this particular function.
thoughts?
I have got another customer with the same issue. Please let us know if you need any more info from the customer's environment to identify the root cause.
Simone, any ideas?
I checked the logs and honestly I don't see anything suspect there. Also, looking at the code, _inquire_whiteboard_lock is basically a single-line function (see the sketch below): https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/broker/status_broker.py#L189 'Connection reset by peer' probably means that for some reason the sanlock client failed to communicate with the sanlock daemon, but we don't have any evidence of that in the sanlock logs. David, any hints?
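For reference, that single line boils down to something like the following (a minimal sketch, assuming the sanlock-python inq_lockspace(lockspace, host_id, path) signature; the function and constant names here are illustrative, not the actual broker code):

~~~
import sanlock

# Assumed lockspace name; it matches the "hosted-engine" lockspace seen in the
# sanlock log excerpt in the description.
LOCKSPACE_NAME = "hosted-engine"

def inquire_whiteboard_lock(host_id, lease_file):
    # Ask the sanlock daemon (over /var/run/sanlock/sanlock.sock) whether this
    # host still holds its lease in the hosted-engine lockspace. If the daemon
    # side of the socket goes away unexpectedly, the binding raises
    # SanlockException with errno 104 (ECONNRESET), as seen in the traceback.
    return sanlock.inq_lockspace(LOCKSPACE_NAME, host_id, lease_file)
~~~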
It looks like send(2) from libsanlock on the sanlock socket must have returned ECONNRESET, and the sanlock daemon may have never seen a problem. /var/log/messages from around that same time shows similar errors from other systems, so I wonder if somehow socket files in /var/run were removed, or something like that?
We have the same issue with oVirt 4.2.8.

/var/log/messages:
ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run#012    if (self._status_broker._inquire_whiteboard_lock() or#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock#012    self.host_id, self._lease_file)#012SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
Any selinux denials, Alexey, Germano? Any chance this is running on oVirt Node || RHVH?
(In reply to Ryan Barry from comment #20)
> Any selinux denials, Alexey, Germano?

Not in my ticket.

> Any chance this is running on oVirt Node || RHVH?

RHEL
We are using oVirt Node 4.2.8 (RHEL-7-6.1810.2.el7.centos).
(In reply to Ryan Barry from comment #20)
> Any selinux denials, Alexey, Germano?

There are no selinux denials.

> Any chance this is running on oVirt Node || RHVH?

oVirt Node 4.2.8
sync2jira
Hi, I have a customer who is hitting this same issue; the ha-broker suddenly fails to talk to sanlock with the same _inquire_whiteboard_lock error:

~~~
StatusStorageThread::ERROR::2020-01-17 07:04:47,775::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
~~~

Let me know if you would need some data to debug this.
I've been digging through sanlock code and trying to reproduce this with no progress. I'm going to look into adding some new debugging in the area of client connections that might be able to catch what's happening.

My best guess is that the sanlock daemon is closing the fd for the inq_lockspace request without logging anything. Some time ago I disabled all of the standard command logging related to inq_lockspace because there was too much of it (it's called very frequently.) I doubt that enabling only those messages would tell us enough.

Once I have implemented some new debugging, I'll create a test build so we can attempt to reproduce the problem with that.
I reproduced what appears to be the same problem and created a fix. I'd welcome any testing with this scratch build containing the patch below (based on rhel):
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27102386

commit 62ba5d194100f71cbed32214e0aeb76f690f4093
Author: David Teigland <teigland>
Date:   Fri Mar 6 16:03:01 2020 -0600

    sanlock: fix closing wrong client fd

    The symptoms of this bug were inq_lockspace returning ECONNRESET.
    It was caused by a previous client closing the fd of a newer client
    doing inq_lockspace (when both clients were running at roughly the
    same time.)

    First client ci1, second client ci2.
    ci1 in call_cmd_daemon() is finished, and close(fd) is called
    (and client[ci].fd is *not* set to -1).
    ci2 is a new client at about the same time and gets the same fd
    that had been used by ci1.
    ci1 being finished triggers a poll error, which results in
    client_free(ci1).  client_free looks at client[ci1].fd and finds
    it is not -1, so it calls close() on it, but this fd is now being
    used by ci2.  This breaks the sanlock daemon connection for ci2
    and the client gets ECONNRESET.

diff --git a/src/cmd.c b/src/cmd.c
index d816fd7a4bfe..b79e224ebf6a 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -3072,7 +3072,24 @@ void call_cmd_daemon(int ci, struct sm_header *h_recv, int client_maxi)
 		break;
 	};
 
+	/*
+	 * Previously just called close(fd) and did not set client[ci].fd = -1.
+	 * This meant that a new client ci could get this fd and use it.
+	 *
+	 * When a poll error occurs because this ci was finished, then
+	 * client_free(ci) would be called for this ci.  client_free would
+	 * see cl->fd was still set and call close() on it, even though that
+	 * fd was now in use by another ci.
+	 *
+	 * We could probably get by with just doing this here:
+	 * client[ci].fd = -1;
+	 * close(fd);
+	 *
+	 * and then handling the full client_free in response to
+	 * the poll error (as done previously), but I see no reason
+	 * to avoid the full client_free here.
+	 */
 	if (auto_close)
-		close(fd);
+		client_free(ci);
 }
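The fd-reuse hazard described in that commit message is generic POSIX behaviour (open/accept hand out the lowest free descriptor number) and can be illustrated outside sanlock. This is only a toy sketch of the mechanism, not sanlock code; over a unix socket the victim would see ECONNRESET rather than the EBADF shown here:

~~~
import os

# "Client 1" finishes and its fd is closed, but the bookkeeping slot still
# remembers the number (mirrors close(fd) without client[ci].fd = -1).
stale_fd = os.open("/dev/null", os.O_RDONLY)
os.close(stale_fd)                    # the number is now free for reuse

# "Client 2" connects right afterwards and the kernel hands out the lowest
# free descriptor, which is the same number client 1 had.
new_fd = os.open("/dev/null", os.O_RDONLY)
assert new_fd == stale_fd             # lowest available fd is reused

# Delayed cleanup for client 1 (client_free in sanlock) now closes "its" fd,
# but that number belongs to client 2 at this point.
os.close(stale_fd)                    # actually closes client 2's descriptor

# Client 2's next operation on its descriptor fails.
try:
    os.read(new_fd, 1)
except OSError as e:
    print("client 2 broken by the stale close:", e)
~~~

In the sanlock case the stale close lands on the daemon-side socket of the new inq_lockspace connection, so that client's request fails and the python binding surfaces it as SanlockException (104, ..., 'Connection reset by peer').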
I reproduced the bug by running the following several times:

$ sanlock status > /dev/null &
  sanlock client inq_lockspace -s lockspace2:1:/dev/sanlk/lock2:0 >> inq &
  sanlock client inq_lockspace -s lockspace1:1:/dev/sanlk/lock1:0 >> inq &
  sanlock client inq_lockspace -s lockspace0:1:/dev/sanlk/lock0:0 >> inq &
  sanlock status > /dev/null

One of the inq_lockspace commands will print the 104 (ECONNRESET) error. Including the status command was important, because it exercised the problematic code path.

The following commands/APIs exercise that code path, and to reproduce this bug I expect that one of these commands would have to be running at about the same time as the inq_lockspace: status, host_status, renewal, log_dump, get_lockspaces, get_hosts, reg_event, end_event, getconfig.

Do the vdsm logs show that a command from that set was also being run at about the same time as ECONNRESET appeared from inq_lockspace?
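For anyone retesting with the scratch build, the race above can be hammered in a loop until it fires. A minimal sketch (the lockspace specs are the ones from the reproducer above and must be adjusted to whatever lockspaces exist on the host; the check for "104" in the output is deliberately crude):

~~~
import subprocess

# Commands from the reproducer above; adjust the -s arguments for your host.
CMDS = [
    ["sanlock", "status"],
    ["sanlock", "client", "inq_lockspace", "-s", "lockspace2:1:/dev/sanlk/lock2:0"],
    ["sanlock", "client", "inq_lockspace", "-s", "lockspace1:1:/dev/sanlk/lock1:0"],
    ["sanlock", "client", "inq_lockspace", "-s", "lockspace0:1:/dev/sanlk/lock0:0"],
    ["sanlock", "status"],
]

for attempt in range(1000):
    # Launch everything at once so the commands overlap, as in the shell version.
    procs = [subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
             for c in CMDS]
    outputs = [p.communicate() for p in procs]
    if any(b"104" in out + err for out, err in outputs):
        print("hit the ECONNRESET error on attempt", attempt)
        break
else:
    print("no hit in 1000 attempts")
~~~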
David, I've checked "sanlock_client_log_dump" from the initial ticket I opened this bug with, but it does not cover the time of the error. I think that if we have any chance of finding the simultaneous commands, it would be there.

However, what you mention seems plausible to me, as a RHV/Hosted-Engine host uses a few locks, for example:

# sanlock status
daemon 917e0f73-e20b-42b6-b10a-b3e89a1b4818.host1
p -1 helper
p -1 listener
p 4980 HostedEngine
p 3185
p -1 status
s f330840e-d357-4ed4-be47-582fce2d22e0:1:/rhev/data-center/mnt/192.168.151.252\:_exports_nfs/f330840e-d357-4ed4-be47-582fce2d22e0/dom_md/ids:0
s 2af97604-e629-4308-b655-3d5a88a64024:1:/dev/2af97604-e629-4308-b655-3d5a88a64024/ids:0
s hosted-engine:1:/var/run/vdsm/storage/c64637b4-f0e8-408c-b8af-6a52946113e2/7962efd6-92de-4319-a62b-8404945e3e94/40168fc1-36c6-433c-97b6-664e209867cd:0
s c64637b4-f0e8-408c-b8af-6a52946113e2:1:/rhev/data-center/mnt/192.168.151.252\:_exports_he/c64637b4-f0e8-408c-b8af-6a52946113e2/dom_md/ids:0
r c64637b4-f0e8-408c-b8af-6a52946113e2:SDM:/rhev/data-center/mnt/192.168.151.252\:_exports_he/c64637b4-f0e8-408c-b8af-6a52946113e2/dom_md/leases:1048576:58 p 3185
r c64637b4-f0e8-408c-b8af-6a52946113e2:91d8a4cd-983b-49c4-84d2-6cb97d0f4c6f:/rhev/data-center/mnt/192.168.151.252\:_exports_he/c64637b4-f0e8-408c-b8af-6a52946113e2/images/f1ade880-0811-497a-abbf-be7d56e97255/91d8a4cd-983b-49c4-84d2-6cb97d0f4c6f.lease:0:18 p 4980

The last two entries are only present if the host is the SPM (the SDM lease) and if the HE VM is running on it, respectively. Even non-Hosted-Engine hypervisors will have the 'ids' ones, and possibly SDM too if the host is the SPM. And the more storage domains, the more locks it will have.

So if I understand correctly, inquiring for any two of these at the same time could trigger the bug....
AFAICT inq_lockspace has no problem, it's only a victim of one of the other commands I listed which use the problematic code path. If there's anything on the system using one of those commands, then inq_lockspace (or any other client connection) could be affected if they are running at the same time (it's a race). I think it's most likely that RHV is using get_lockspaces/get_hosts/host_status. I did find another location where close() could also cause this problem, but it only happens on an error path used when the system is out of memory or out of threads, which don't seem to be happening.
RHEL8 sanlock bug for this ECONNRESET problem: bug 1812185
deprecating SLA team usage, moving to Virt
Targeting 4.5 for the time being, as the sanlock fix will only be released in RHEL 8.3; if needed, please escalate to have it in 8.2.z so we can consume it earlier.
Reproduced on these components:
rhvm-4.4.3.6-0.13.el8ev.noarch
ovirt-engine-setup-4.4.3.6-0.13.el8ev.noarch
sanlock-3.8.2-1.el8.x86_64
vdsm-4.40.33-1.el8ev.x86_64
libvirt-client-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
libvirt-lock-sanlock-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
ovirt-hosted-engine-setup-2.4.7-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
Linux 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)
Created attachment 1721454 [details]
Logs from alma07, the added ha-host
Please take a look at Comment #36.
(In reply to Nikolai Sednev from comment #38)
> Please take a look at Comment #36.

There is no relevant info in this comment. What did you reproduce? How? What are the reproduction steps?
(In reply to Nir Soffer from comment #39)
> (In reply to Nikolai Sednev from comment #38)
> > Please take a look at Comment #36.
>
> There is no relevant info in this comment. What did you reproduce?
> How? What are the reproduction steps?

I reproduced exactly the same error, "MainThread::ERROR::2019-01-04 16:22:51,806::hosted_engine::431::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception", which was mentioned in the description and can be seen in the attached logs.

For reproduction I used these steps:
1. Deploy HE over iSCSI storage on one host.
2. Add an additional ha-host from the "Guide Me" option in the data center tab.
3. The additional ha-host is added in local maintenance; activate it from the hosts tab using the Manage button.
4. Check /var/log/ovirt-hosted-engine-ha/agent.log on the additionally added ha-host.
(In reply to Nikolai Sednev from comment #40)
> I reproduced exactly the same error, "MainThread::ERROR::2019-01-04
> 16:22:51,806::hosted_engine::431::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(start_monitoring) Unhandled monitoring loop exception", which
> was mentioned in the description and can be seen in the attached logs.
>
> For reproduction I used these steps:
> 1. Deploy HE over iSCSI storage on one host.
> 2. Add an additional ha-host from the "Guide Me" option in the data center
> tab.
> 3. The additional ha-host is added in local maintenance; activate it from
> the hosts tab using the Manage button.
> 4. Check /var/log/ovirt-hosted-engine-ha/agent.log on the additionally
> added ha-host.

Moving the bug to hosted-engine-ha to investigate this error.
Nir, should we require sanlock-3.8.1-1 version in VDSM to consume the fix for bz 1812185?
(In reply to Eyal Shenitzky from comment #42)
> Nir, should we require sanlock-3.8.1-1 version in VDSM to consume the fix
> for bz 1812185?

We should require 3.8.2 (the version in 8.3), see bug 1821042.

Note that comment 36 reports that this version was tested.
David, it seems that we still have an issue here. Is this the same as the original bug, or do we have a different one?
I don't see anything about sanlock to say if there might be something wrong or not.
Sanlock version bumped and now we are using RHEL-8.4. Moving to ON_QA.
I didn't reproduce the originally reported error on a freshly deployed environment over NFS storage, which ran for at least 5 days in a row. Moving to verified, as it works fine for me on these components:
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.5.0-2.el8ev.noarch
vdsm-4.40.60.5-1.el8ev.x86_64
sanlock-3.8.3-1.el8.x86_64
libvirt-lock-sanlock-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64
qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
Linux 4.18.0-304.el8.x86_64 #1 SMP Tue Apr 6 05:19:59 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Engine is ovirt-engine-setup-4.4.6.6-0.10.el8ev.noarch.

Please feel free to reopen if this still gets reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.6]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2178