Bug 1812185

Summary: fix client ECONNRESET caused by closing the wrong fd
Product: Red Hat Enterprise Linux 8
Reporter: David Teigland <teigland>
Component: sanlock
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.2
CC: agk, cluster-maint, cmarthal, nsoffer, rhandlin
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: sanlock-3.8.1-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-04 02:14:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1664159, 1821042

Description David Teigland 2020-03-10 17:54:07 UTC
Description of problem:

This sanlock bug was found and fixed in RHV bug 1664159.

commit 42f7f8f2d924eb8abe52b1c118ee89871d9112f1
Author: David Teigland <teigland>
Date:   Fri Mar 6 16:03:01 2020 -0600

    sanlock: fix closing wrong client fd
    
    The symptoms of this bug were inq_lockspace returning
    ECONNRESET.  It was caused by a previous client closing
    the fd of a newer client doing inq_lockspace (when both
    clients were running at roughly the same time.)
    
    First client ci1, second client ci2.
    
    ci1 in call_cmd_daemon() is finished, and close(fd)
    is called (and client[ci].fd is *not* set to -1).
    
    ci2 is a new client at about the same time and gets the
    same fd that had been used by ci1.
    
    ci1 being finished triggers a poll error, which results
    in client_free(ci1).  client_free looks at client[ci1].fd
    and finds it is not -1, so it calls close() on it, but
    this fd is now being used by ci2.  This breaks the sanlock
    daemon connection for ci2 and the client gets ECONNRESET.
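
To make the race easier to follow, here is a minimal standalone C sketch. It is an illustration only, not the actual sanlock patch: struct client, client_free() and the ci indexes are simplified stand-ins for the daemon's client table, and open("/dev/null") stands in for an accepted client socket. It shows the kernel reusing the fd number and client_free() then closing it out from under the newer client; the missing step called out above is resetting client[ci].fd to -1 at the point of the first close.

/*
 * Sketch of the fd-reuse race (simplified stand-in, not the sanlock code).
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define MAX_CLIENTS 8

struct client {
	int used;
	int fd;
};

static struct client client[MAX_CLIENTS];

/* cleanup path: closes whatever fd is still recorded for the slot */
static void client_free(int ci)
{
	if (client[ci].fd != -1)
		close(client[ci].fd);
	client[ci].fd = -1;
	client[ci].used = 0;
}

int main(void)
{
	int ci1 = 0, ci2 = 1;

	for (int i = 0; i < MAX_CLIENTS; i++)
		client[i].fd = -1;

	/* ci1 connects */
	client[ci1].used = 1;
	client[ci1].fd = open("/dev/null", O_RDONLY);

	/*
	 * ci1 finishes its command.  The buggy path did only close(fd),
	 * leaving client[ci1].fd holding the now-released fd number.
	 * Uncommenting the line after close() models the fix.
	 */
	close(client[ci1].fd);
	/* client[ci1].fd = -1; */

	/* ci2 connects and the kernel hands out the lowest free fd,
	 * i.e. the same number ci1 just released */
	client[ci2].used = 1;
	client[ci2].fd = open("/dev/null", O_RDONLY);

	printf("ci1 recorded fd %d, ci2 fd %d (same number: %s)\n",
	       client[ci1].fd, client[ci2].fd,
	       client[ci1].fd == client[ci2].fd ? "yes" : "no");

	/* the poll error on ci1 triggers client_free(ci1); it sees a
	 * non -1 fd and closes it, pulling the fd out from under ci2 */
	client_free(ci1);

	if (fcntl(client[ci2].fd, F_GETFD) < 0)
		printf("ci2's fd %d is now invalid: its daemon connection "
		       "would see ECONNRESET\n", client[ci2].fd);
	else
		printf("ci2's fd %d is still valid\n", client[ci2].fd);

	return 0;
}

Compiled and run as-is, the sketch reports that ci1 and ci2 hold the same fd number and that ci2's fd becomes invalid after client_free(ci1); with the commented-out assignment enabled, client_free() finds -1 and leaves ci2's connection alone.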




Comment 1 Nir Soffer 2020-05-01 22:37:29 UTC
David, this should be POST or MODIFIED, no?

Comment 5 Corey Marthaler 2020-09-15 15:05:22 UTC
Fix verified in the latest rpms. I ran the mentioned scenario in a loop and never saw the ECONNRESET error.


sanlock-3.8.2-1.el8    BUILT: Mon Aug 10 12:12:49 CDT 2020
sanlock-lib-3.8.2-1.el8    BUILT: Mon Aug 10 12:12:49 CDT 2020

kernel-4.18.0-234.el8    BUILT: Thu Aug 20 12:01:26 CDT 2020
lvm2-2.03.09-5.el8    BUILT: Wed Aug 12 15:51:50 CDT 2020
lvm2-libs-2.03.09-5.el8    BUILT: Wed Aug 12 15:51:50 CDT 2020


[root@hayes-01 ~]# systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-09-14 13:41:40 CDT; 2h 52min ago
  Process: 72782 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 72783 (sanlock)
    Tasks: 6 (limit: 1647453)
   Memory: 21.5M
   CGroup: /system.slice/sanlock.service
           ├─72783 /usr/sbin/sanlock daemon
           └─72784 /usr/sbin/sanlock daemon

Sep 14 13:41:40 hayes-01.lab.msp.redhat.com systemd[1]: Starting Shared Storage Lease Manager...
Sep 14 13:41:40 hayes-01.lab.msp.redhat.com systemd[1]: Started Shared Storage Lease Manager.

[root@hayes-01 ~]# lvs -a -o +devices
  LV    VG    Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices     
  lock0 sanlk -wi-a----- 4.00m                                                     /dev/sdd1(2)
  lock1 sanlk -wi-a----- 4.00m                                                     /dev/sdd1(3)
  lock2 sanlk -wi-a----- 4.00m                                                     /dev/sdd1(4)

[root@hayes-01 ~]# sanlock client init -s LS0:0:/dev/sanlk/lock0:0 -o 2
init
init done 0
[root@hayes-01 ~]# sanlock client init -s LS1:0:/dev/sanlk/lock1:0 -o 2
init
init done 0
[root@hayes-01 ~]# sanlock client init -s LS2:0:/dev/sanlk/lock2:0 -o 2
init
init done 0

[root@hayes-01 ~]# sanlock client add_lockspace -s LS0:1:/dev/sanlk/lock0:0 -o 2
add_lockspace_timeout 2
add_lockspace_timeout done 0
[root@hayes-01 ~]# sanlock client add_lockspace -s LS1:1:/dev/sanlk/lock1:0 -o 2
add_lockspace_timeout 2
add_lockspace_timeout done 0
[root@hayes-01 ~]# sanlock client add_lockspace -s LS2:1:/dev/sanlk/lock2:0 -o 2
add_lockspace_timeout 2
add_lockspace_timeout done 0

[root@hayes-01 ~]# sanlock status
daemon de5af774-c1b3-4202-b94e-5d0bfa9250cb.hayes-01.la
p -1 helper
p -1 listener
p -1 status
s LS2:1:/dev/sanlk/lock2:0
s LS1:1:/dev/sanlk/lock1:0
s LS0:1:/dev/sanlk/lock0:0

[root@hayes-01 ~]# sanlock status > /dev/null & sanlock client inq_lockspace -s LS2:1:/dev/sanlk/lock2:0 >>inq & sanlock client inq_lockspace -s LS1:1:/dev/sanlk/lock1:0 >> inq & sanlock client inq_lockspace -s LS0:1:/dev/sanlk/lock0:0 >>inq & sanlock status > /dev/null
[1] 89910
[2] 89911
[3] 89912
[4] 89913
[1]   Done                    sanlock status > /dev/null
[2]   Done                    sanlock client inq_lockspace -s LS2:1:/dev/sanlk/lock2:0 >> inq
[3]-  Done                    sanlock client inq_lockspace -s LS1:1:/dev/sanlk/lock1:0 >> inq
[4]+  Done                    sanlock client inq_lockspace -s LS0:1:/dev/sanlk/lock0:0 >> inq


[root@hayes-01 ~]# cat inq
inq_lockspace
inq_lockspace done -2
inq_lockspace
inq_lockspace done -2
inq_lockspace
inq_lockspace done -2
inq_lockspace
inq_lockspace done 0
inq_lockspace
inq_lockspace done 0
inq_lockspace
inq_lockspace done 0

Comment 8 errata-xmlrpc 2020-11-04 02:14:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sanlock bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4595