Bug 2058438
| Summary: | sanlock lockspace stuck in 'ADD' for days | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | sanlock | Assignee: | David Teigland <teigland> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 8.5 | CC: | abgoswam, agk, bcholler, bhanoglu, cluster-maint, cmarthal, jortialc, mgokhool, mjuricek, mkalinin, mwest, nashok, nsoffer, teigland, toneata, vjuranek |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | sanlock-3.8.4-4.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2065711 2091146 2091147 2104277 (view as bug list) | Environment: | |
| Last Closed: | 2022-11-08 10:54:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2065711, 2091146, 2091147, 2104277 | | |
| Attachments: | | | |
When a lockspace is stuck in "add", I'd like to find out what that lockspace thread is doing or waiting for. Digging through the various debugging info I've not found anything that shows this. The new lockspace thread appears to have been started but we see no messages logged from it. So, that lockspace thread must be stuck waiting on something early after starting. The steps where the thread could possibly block before logging any messages are:

- opening the device
- reading 4k from the device
- reading/writing sysfs max_sectors_kb (I notice there's a long-standing bug with permissions here that I'd not seen before)
- connecting to wdmd.sock

A sanlock core, strace, or possibly a ps showing the wchan of all threads, would probably show what that lockspace thread is waiting on (see the sketch after this comment).

(In reply to David Teigland from comment #4)
> ...
> A sanlock core, strace, or possibly a ps showing the wchan of all threads,
> would probably show what that lockspace thread is waiting on.

Yup, we are waiting for the customer to upload the coredumps. We can keep a needinfo on me in the meantime.

I haven't yet picked out anything unusual from any of the sanlock data. I don't see a sign of any loops in sanlock, but I suspect that vdsm may be hitting some kind of retry loop attempting to add a lockspace. The strace doesn't show any add_lockspace requests, so maybe that retry loop had finished at the time.
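As a concrete illustration of the "ps showing the wchan of all threads" suggestion above, commands along these lines would capture the per-thread wait channels and a core for offline analysis (a minimal sketch assuming a standard procps-ng and gdb install; adjust the output path as needed):

# show every sanlock thread with its kernel wait channel (wchan)
ps -L -o pid,tid,stat,wchan:32,comm -p "$(pidof sanlock)"

# capture a core of the running daemon without stopping it (gcore ships with gdb)
gcore -o /tmp/sanlock.core "$(pidof sanlock)"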
I found a clear memory leak in add_lockspace which amounts to about 4 KB each time add_lockspace fails, so vdsm retrying a failing add_lockspace could be consistent with the memory usage reported. This leak can be avoided by disabling the renewal history (which exists only for debugging/analysis): set renewal_history_size=0 in sanlock.conf and restart the sanlock daemon.
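For reference, a minimal sketch of that workaround (the /etc/sanlock/sanlock.conf path and the sanlock service name are the usual defaults; adjust for your installation):

# /etc/sanlock/sanlock.conf
# renewal history exists only for debugging/analysis; 0 disables it
renewal_history_size = 0

# then restart the daemon to apply it
systemctl restart sanlock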
I'd also like to have someone familiar with vdsm check if there's any sign of vdsm add_lockspace retries. That would help me know if I'm looking in the right area.
None of that answers the question about why the lockspace is stuck in the "add" state. The pstack info doesn't show a lockspace_thread that's waiting for anything unusual. One theory I have is that add_lockspace called pthread_create() as shown here, but pthread_create failed:
rv = pthread_create(&sp->thread, NULL, lockspace_thread, sp);
if (rv < 0) {
log_erros(sp, "add_lockspace create thread failed");
goto fail_del;
}
If pthread_create failed to create a lockspace_thread, I think this would explain the stuck "add" and would explain the pstack data. The pthread_create man page states that on error it returns "non zero", so the (rv < 0) error check may be missing the failure (I need to look into the pthread_create details to verify this.)
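For reference, pthread_create() returns 0 on success and a positive error number such as EAGAIN on failure (it does not return -1 and does not set errno), so a check like the one below would catch the failure. This is only an illustration of the return-value convention, not the actual upstream patch:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg)
{
    return arg;
}

int main(void)
{
    pthread_t th;
    int rv = pthread_create(&th, NULL, worker, NULL);

    /* pthread_create returns an errno-style value, never -1, so an
     * "rv < 0" test can never be true and silently misses the failure. */
    if (rv) {
        fprintf(stderr, "pthread_create failed: %d (%s)\n", rv, strerror(rv));
        return 1;
    }

    pthread_join(th, NULL);
    return 0;
}

Built with gcc -pthread, this reports the EAGAIN-style error instead of ignoring it when thread creation fails.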
From a little googling, it sounds like the thread limit on linux depends on the virtual memory limit of the process. I need to learn more about that, but the memory usage raises the question of whether the excessive memory use has prevented thread creation.
I ran sanlock under valgrind with memcheck tool in my RHV lab and did the following operations on the host:

1. Activate + Maintenance + Activate
2. Start SPM + Stop SPM
3. Run VM w/ Lease + Stop VM w/ Lease.

Does not seem to have found much more than David already found:

==994== 34,560 bytes in 8 blocks are definitely lost in loss record 12 of 15
==994==    at 0x4C3ADBB: calloc (vg_replace_malloc.c:1117)
==994==    by 0x11A06A: add_lockspace_start (lockspace.c:982)
==994==    by 0x130BC2: cmd_add_lockspace (cmd.c:1329)
==994==    by 0x132DCB: call_cmd_thread (cmd.c:2192)
==994==    by 0x11D44F: thread_pool_worker (main.c:962)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==
==994== 1,048,576 bytes in 1 blocks are definitely lost in loss record 14 of 15
==994==    at 0x4C360A5: malloc (vg_replace_malloc.c:380)
==994==    by 0x11F441: main_loop (main.c:884)
==994==    by 0x11F441: do_daemon (main.c:1804)
==994==    by 0x10F55D: main (main.c:3947)

Few warnings here for reference, but unlikely related to our problem:

==994== Conditional jump or move depends on uninitialised value(s)
==994==    at 0x11539D: read_sysfs_size (diskio.c:64)
==994==    by 0x118B1B: set_lockspace_max_sectors_kb (lockspace.c:613)
==994==    by 0x118B1B: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)
==994==
==994== Use of uninitialised value of size 8
==994==    at 0x1153C7: read_sysfs_size (diskio.c:64)
==994==    by 0x118B1B: set_lockspace_max_sectors_kb (lockspace.c:613)
==994==    by 0x118B1B: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)
==994==
==994== Use of uninitialised value of size 8
==994==    at 0x115430: read_sysfs_size (diskio.c:65)
==994==    by 0x118B1B: set_lockspace_max_sectors_kb (lockspace.c:613)
==994==    by 0x118B1B: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)
==994==
==994== Conditional jump or move depends on uninitialised value(s)
==994==    at 0x11539D: read_sysfs_size (diskio.c:64)
==994==    by 0x11549A: set_max_sectors_kb (diskio.c:126)
==994==    by 0x119300: set_lockspace_max_sectors_kb (lockspace.c:649)
==994==    by 0x119300: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)
==994==
==994== Conditional jump or move depends on uninitialised value(s)
==994==    at 0x1153C1: read_sysfs_size (diskio.c:64)
==994==    by 0x11549A: set_max_sectors_kb (diskio.c:126)
==994==    by 0x119300: set_lockspace_max_sectors_kb (lockspace.c:649)
==994==    by 0x119300: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)
==994==
==994== Use of uninitialised value of size 8
==994==    at 0x1153C7: read_sysfs_size (diskio.c:64)
==994==    by 0x11549A: set_max_sectors_kb (diskio.c:126)
==994==    by 0x119300: set_lockspace_max_sectors_kb (lockspace.c:649)
==994==    by 0x119300: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)
==994==
==994== Use of uninitialised value of size 8
==994==    at 0x115430: read_sysfs_size (diskio.c:65)
==994==    by 0x11549A: set_max_sectors_kb (diskio.c:126)
==994==    by 0x119300: set_lockspace_max_sectors_kb (lockspace.c:649)
==994==    by 0x119300: lockspace_thread (lockspace.c:742)
==994==    by 0x4E4E179: start_thread (in /usr/lib64/libpthread-2.28.so)
==994==    by 0x5DECDC2: clone (in /usr/lib64/libc-2.28.so)
==994==  Uninitialised value was created by a stack allocation
==994==    at 0x1152AA: read_sysfs_size (diskio.c:36)

Is the theory that we have a memory leak which ultimately breaks pthread_create?

btw, comment #17 used sanlock-3.8.4-1.el8.x86_64

> Is the theory that we have a memory leak which ultimately breaks
> pthread_create?

Yes, I think high memory use is the most likely cause for pthread_create to fail. sanlock fails to detect or log the error from pthread_create because it's incorrectly checking that the return value is negative. This would then cause the add lockspace operation to wait forever because it's expecting the lockspace_thread to finish starting.

I hadn't known about the sosreport from comment 12, and I still need to look at that. The large number of client connections doesn't seem right, and that might be related to the memory problem. The excessive memory usage could be from a leak, or it could also be from accumulating too many objects over time that are not being released for some reason.

(In reply to Germano Veit Michel from comment #17)
> I ran sanlock under valgrind with memcheck tool in my RHV lab and did the
> following operations on the host:
>
> 1. Activate + Maintenance + Activate
> 2. Start SPM + Stop SPM
> 3. Run VM w/ Lease + Stop VM w/ Lease.

Does this sequence involve vdsm using sanlock_inquire()?
In the past year there was a fairly significant change to the way that vdsm uses sanlock, where it now frequently calls sanlock_inquire() to verify that the expected leases are still held. Previously I don't believe it called sanlock_inquire() at all, so I wonder if this change might have exposed a new memory leak.

What you mention above seems to be the change to avoid duplicate SPM. I'm not entirely sure of all the paths that lead to a sanlock_inq, but let me repeat the test by executing VM live migrations, as I'm sure they do call that: the customer had libvirt stuck waiting on sanlock_inq to complete in order to perform live migrations (the symptom of all this in my case was that live migrations would not even start), and we saw it in the libvirt core, which pointed us to sanlock.

> Another issue is the comment 12 sosreport which shows 102 client connections
> created for sanlock_inq_lockspace() calls. (client name is set to "cmd16"
> and the last cmd was 16 which is SM_CMD_INQ_LOCKSPACE). I don't know why or
> how these incomplete client connections are hanging around from
> inq_lockspace calls. 102 of them would not likely cause a big memory
> problem, so this might be another side effect of the memory problems.

Could be attempts to migrate VMs and other things done by the customer, piling up as a consequence after the issue. The INQ will not return if the ADD on the same object is still going, right? I'll report back soon on the test above.

By the way, do you think this could be an actual leak (lost reference), or might it not have actually leaked but just be climbing in memory usage? I can do both massif and memcheck, but the first might take a while to show anything more concrete...

memcheck showed nothing after 5 migrations, will now leave it migrating in a loop for a day under massif, will report back tomorrow. This should trigger a lot of inquires, as this VM has a lease too.

(In reply to David Teigland from comment #20)
> Another issue is the comment 12 sosreport which shows 102 client connections
> created for sanlock_inq_lockspace() calls. (client name is set to "cmd16"
> and the last cmd was 16 which is SM_CMD_INQ_LOCKSPACE). I don't know why or
> how these incomplete client connections are hanging around from
> inq_lockspace calls. 102 of them would not likely cause a big memory
> problem, so this might be another side effect of the memory problems.

This is likely caused by no worker threads being available to process the request, and that also seems like it may be related to pthread_create failing to create worker threads.

968 migrations of a VM with lease (libvirt does sanlock_inq), nothing interesting.
Memory usage is low and:
n8: 19715 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
n1: 9593 0x400BDA6: _dl_new_object (in /usr/lib64/ld-2.28.so)
n1: 9593 0x40065F6: _dl_map_object_from_fd (in /usr/lib64/ld-2.28.so)
n2: 9593 0x4009504: _dl_map_object (in /usr/lib64/ld-2.28.so)
n1: 7187 0x400DDC5: openaux (in /usr/lib64/ld-2.28.so)
n1: 7187 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 7187 0x400E20E: _dl_map_object_deps (in /usr/lib64/ld-2.28.so)
n1: 7187 0x4013BB0: dl_open_worker (in /usr/lib64/ld-2.28.so)
n1: 7187 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 7187 0x40136B0: _dl_open (in /usr/lib64/ld-2.28.so)
n1: 7187 0x5E1E740: do_dlopen (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5E1F372: _dl_catch_error (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5E1E846: dlerror_run (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5E1E8D9: __libc_dlopen_mode (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5E05D0D: nss_load_library (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5E06607: __nss_lookup_function (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5DAC486: internal_getgrouplist (in /usr/lib64/libc-2.28.so)
n1: 7187 0x5DAC840: initgroups (in /usr/lib64/libc-2.28.so)
n1: 7187 0x11E969: setup_groups (main.c:1543)
n1: 7187 0x11E969: do_daemon (main.c:1727)
n0: 7187 0x10F55D: main (main.c:3947)
n1: 2406 0x4013B48: dl_open_worker (in /usr/lib64/ld-2.28.so)
n1: 2406 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 2406 0x40136B0: _dl_open (in /usr/lib64/ld-2.28.so)
n1: 2406 0x5E1E740: do_dlopen (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5E1F372: _dl_catch_error (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5E1E846: dlerror_run (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5E1E8D9: __libc_dlopen_mode (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5E05D0D: nss_load_library (in /usr/lib64/libc-2.28.so)
n2: 2406 0x5E06607: __nss_lookup_function (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5DAC486: internal_getgrouplist (in /usr/lib64/libc-2.28.so)
n1: 2406 0x5DAC840: initgroups (in /usr/lib64/libc-2.28.so)
n1: 2406 0x11E969: setup_groups (main.c:1543)
n1: 2406 0x11E969: do_daemon (main.c:1727)
n0: 2406 0x10F55D: main (main.c:3947)
n0: 0 in 1 place, below massif's threshold (1.00%)
n1: 4096 0x83B7A88: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x83B7ACC: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x83AD041: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x83AD2D9: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x83760E0: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x8376F36: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x83771DA: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x8357469: ??? (in /usr/lib64/libnss_systemd.so.2)
n1: 4096 0x5DAC1A9: compat_call.constprop.0 (in /usr/lib64/libc-2.28.so)
n1: 4096 0x5DAC5FD: internal_getgrouplist (in /usr/lib64/libc-2.28.so)
n1: 4096 0x5DAC840: initgroups (in /usr/lib64/libc-2.28.so)
n1: 4096 0x11E969: setup_groups (main.c:1543)
n1: 4096 0x11E969: do_daemon (main.c:1727)
n0: 4096 0x10F55D: main (main.c:3947)
n1: 2688 0x40112AC: _dl_check_map_versions (in /usr/lib64/ld-2.28.so)
n1: 2688 0x4013BF5: dl_open_worker (in /usr/lib64/ld-2.28.so)
n1: 2688 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 2688 0x40136B0: _dl_open (in /usr/lib64/ld-2.28.so)
n1: 2688 0x5E1E740: do_dlopen (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5E1F372: _dl_catch_error (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5E1E846: dlerror_run (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5E1E8D9: __libc_dlopen_mode (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5E05D0D: nss_load_library (in /usr/lib64/libc-2.28.so)
n2: 2688 0x5E06607: __nss_lookup_function (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5DAC486: internal_getgrouplist (in /usr/lib64/libc-2.28.so)
n1: 2688 0x5DAC840: initgroups (in /usr/lib64/libc-2.28.so)
n1: 2688 0x11E969: setup_groups (main.c:1543)
n1: 2688 0x11E969: do_daemon (main.c:1727)
n0: 2688 0x10F55D: main (main.c:3947)
n0: 0 in 1 place, below massif's threshold (1.00%)
n1: 1024 0x5DACD51: getgrnam (in /usr/lib64/libc-2.28.so)
n0: 1024 0x10E6D1: main (main.c:3920)
n1: 1024 0x5DAEA81: getpwnam (in /usr/lib64/libc-2.28.so)
n0: 1024 0x10E6E2: main (main.c:3920)
n0: 618 in 32 places, all below massif's threshold (1.00%)
n1: 400 0x400DFE6: _dl_map_object_deps (in /usr/lib64/ld-2.28.so)
n1: 400 0x4013BB0: dl_open_worker (in /usr/lib64/ld-2.28.so)
n1: 400 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 400 0x40136B0: _dl_open (in /usr/lib64/ld-2.28.so)
n1: 400 0x5E1E740: do_dlopen (in /usr/lib64/libc-2.28.so)
n1: 400 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 400 0x5E1F372: _dl_catch_error (in /usr/lib64/libc-2.28.so)
n1: 400 0x5E1E846: dlerror_run (in /usr/lib64/libc-2.28.so)
n1: 400 0x5E1E8D9: __libc_dlopen_mode (in /usr/lib64/libc-2.28.so)
n1: 400 0x5E05D0D: nss_load_library (in /usr/lib64/libc-2.28.so)
n1: 400 0x5E06607: __nss_lookup_function (in /usr/lib64/libc-2.28.so)
n1: 400 0x5DAC486: internal_getgrouplist (in /usr/lib64/libc-2.28.so)
n1: 400 0x5DAC840: initgroups (in /usr/lib64/libc-2.28.so)
n1: 400 0x11E969: setup_groups (main.c:1543)
n1: 400 0x11E969: do_daemon (main.c:1727)
n0: 400 0x10F55D: main (main.c:3947)
n1: 272 0x400E52D: _dl_map_object_deps (in /usr/lib64/ld-2.28.so)
n1: 272 0x4013BB0: dl_open_worker (in /usr/lib64/ld-2.28.so)
n1: 272 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 272 0x40136B0: _dl_open (in /usr/lib64/ld-2.28.so)
n1: 272 0x5E1E740: do_dlopen (in /usr/lib64/libc-2.28.so)
n1: 272 0x5E1F2B3: _dl_catch_exception (in /usr/lib64/libc-2.28.so)
n1: 272 0x5E1F372: _dl_catch_error (in /usr/lib64/libc-2.28.so)
n1: 272 0x5E1E846: dlerror_run (in /usr/lib64/libc-2.28.so)
n1: 272 0x5E1E8D9: __libc_dlopen_mode (in /usr/lib64/libc-2.28.so)
n1: 272 0x5E05D0D: nss_load_library (in /usr/lib64/libc-2.28.so)
n2: 272 0x5E06607: __nss_lookup_function (in /usr/lib64/libc-2.28.so)
n1: 272 0x5DAC486: internal_getgrouplist (in /usr/lib64/libc-2.28.so)
n1: 272 0x5DAC840: initgroups (in /usr/lib64/libc-2.28.so)
n1: 272 0x11E969: setup_groups (main.c:1543)
n1: 272 0x11E969: do_daemon (main.c:1727)
n0: 272 0x10F55D: main (main.c:3947)
n0: 0 in 1 place, below massif's threshold (1.00%)
I'm starting to think it may need some sort of event or failure to trigger this, maybe storage related.
Germano, does this happen only on the SPM host? If it does, it may be related to the inquire calls done on the SPM. We can eliminate this option by disabling the SPM watchdog feature:

# /etc/vdsm/vdsm.conf.d/99-local.conf
[spm]
watchdog_enable = false

See https://github.com/oVirt/vdsm/blob/5aea38cfe7e0cd21177a289aff7be423eff7518b/lib/vdsm/common/config.py.in#L416

(In reply to David Teigland from comment #20)
> Another issue is the comment 12 sosreport which shows 102 client connections
> created for sanlock_inq_lockspace() calls. (client name is set to "cmd16"
> and the last cmd was 16 which is SM_CMD_INQ_LOCKSPACE). I don't know why or
> how these incomplete client connections are hanging around from
> inq_lockspace calls. 102 of them would not likely cause a big memory
> problem, so this might be another side effect of the memory problems.

Do we log the client pid?

Vdsm calls inq_lockspace() on the SPM regularly, when the engine asks about SPM status. Based on my test setup logs this happens every 6 seconds.

$ grep 'FINISH getSpmStatus' /var/log/vdsm/vdsm.log | wc -l
1257

$ grep 'FINISH getSpmStatus' /var/log/vdsm/vdsm.log | head -1
2022-03-09 10:01:05,378+0200 INFO (jsonrpc/0) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 30, 'spmId': 2}} from=::ffff:192.168.122.10,55112, task_id=5c6c2ddb-ca3c-4541-ba9d-b7f35ea015d8 (api:54)

$ grep 'FINISH getSpmStatus' /var/log/vdsm/vdsm.log | tail -1
2022-03-09 13:31:09,520+0200 INFO (jsonrpc/4) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 30, 'spmId': 2}} from=::ffff:192.168.122.10,55112, task_id=a206a7b4-9a8e-468b-9d38-f50de7d90f65 (api:54)

I think we can decrease the interval of these calls, but this is how RHV has worked for the last 10 years, so it is unlikely to be the reason for the leak.

On one of the systems that has suffered from extreme sanlock mem usage, it would be interesting to see what the original or normal mem usage is. Perhaps vdsm logs would also show the progression of usage over time, i.e. did it slowly increase or was it sudden?

(In reply to Nir Soffer from comment #26)
> Germano, does this happen only on the SPM host?

Unfortunately not, my case had 6 hosts with the issue, and I think Nijin's also had a few.

I'll go back to the sosreports and look for clues, there must be a trigger for all this. So far, the main thing calling our attention is that all these customers have many SDs (dozens) and have been doing activity like import and create recently. But yesterday I already ran sanlock under valgrind while doing a force import and it resulted in nothing.

David, I don't think we will have data to see the progression over time, but let me try to find that out too. Most likely a couple of points in time, not much more than that. Let me see if I can find some other clues on how this is triggered.

(In reply to Germano Veit Michel from comment #29)
> (In reply to Nir Soffer from comment #26)
> > Germano, does this happen only on the SPM host?
>
> Unfortunately not, my case had 6 hosts with the issue, I think Nijin's also
> had a few.

Not only that, the 6 hosts had the same problem, the same stuck ADD on the same SD, all at the same time. Nothing was done at that time except activating that particular SD. Some time before there was a mass activation of many SDs, but all went well. 88 delta_acquire begin and 88 done, 89 add_lockspace but only 88 finished.
No signs of logging from this thread:

Thread 18 (Thread 0x7f9521fe3700 (LWP 2657909)):
#0 0x00007f964988ed98 in nanosleep () from /lib64/libc.so.6
#1 0x00007f964988ec9e in sleep () from /lib64/libc.so.6
#2 0x000055df04f4ad1c in delta_lease_acquire ()
#3 0x000055df04f51c11 in lockspace_thread ()
#4 0x00007f964a81d17a in start_thread () from /lib64/libpthread.so.0
#5 0x00007f96498c2dc3 in clone () from /lib64/libc.so.6

grep 'delta_acquire begin' sos_commands/sanlock/sanlock_client_log_dump | wc -l
88
grep 'delta_acquire done' sos_commands/sanlock/sanlock_client_log_dump | wc -l
88

No idea what it's doing, and how it can get to either of those 2 sleep() calls inside it without logging. The pstack was taken later, but still, I found it weird that there are no signs of it. Also, the log_level does not seem to be able to stop it from logging before it can call sleep(), so it's hard to make sense of the above.

There are 2 workers waiting for the lockspace thread, so I assume one thread is missing or a previous problem occurred already:

Workers
-------
Thread 5 (Thread 0x7f558d94a700 (LWP 3774651)):
#0 0x00007f964988ed98 in nanosleep () from /lib64/libc.so.6
#1 0x00007f964988ec9e in sleep () from /lib64/libc.so.6
#2 0x000055df04f53322 in add_lockspace_wait ()
#3 0x000055df04f69c35 in cmd_add_lockspace () <-------
#4 0x000055df04f6bdcc in call_cmd_thread ()
#5 0x000055df04f56450 in thread_pool_worker ()
#6 0x00007f964a81d17a in start_thread () from /lib64/libpthread.so.0
#7 0x00007f96498c2dc3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f559c167700 (LWP 3725508)):
#0 0x00007f964988ed98 in nanosleep () from /lib64/libc.so.6
#1 0x00007f964988ec9e in sleep () from /lib64/libc.so.6
#2 0x000055df04f53322 in add_lockspace_wait ()
#3 0x000055df04f69c35 in cmd_add_lockspace () <-------
#4 0x000055df04f6bdcc in call_cmd_thread ()
#5 0x000055df04f56450 in thread_pool_worker ()
#6 0x00007f964a81d17a in start_thread () from /lib64/libpthread.so.0
#7 0x00007f96498c2dc3 in clone () from /lib64/libc.so.6

Perhaps the best way forward here is to get the pthread and leak fixes into the next release (I see the patches are already in!) and wait for it to happen again. Hopefully it won't leak, so a coredump can be taken, and any pthread create issues will appear more clearly. Any chance the 2 fixes by David can go into 4.4.10?

I think I'm seeing the same growth in memory usage in my own testing, and I think it's the 'token' structures that are leaking. I haven't found the location yet, but expect to soon. I don't know why valgrind doesn't catch it.

(In reply to David Teigland from comment #31)
> I think I'm seeing the same growth in memory usage in my own testing, and I
> think it's the 'token' structures that are leaking. I haven't found the
> location yet, but expect to soon. I don't know why valgrind doesn't catch
> it.

I have a RHVH host with systemd starting sanlock under valgrind on boot already. Easy to test anything. How do you reproduce this? I mean, what commands would make it leak the most? Happy to give it a go here.

I have this old test that's pretty awkward to use, and just leave it running for a while:
https://pagure.io/sanlock/blob/master/f/tests/sanlk_load.c
with
sanlk_load rand /dev/vg/lock -i 1 -s 59 -r 4 -p 60 -m 5 -S 0
where it's using 60 LVs that were initialized with sanlk_load init.

The leak may be caused by worker pthreads exiting, which is relatively new; the commit was Dec 2020 and I'm not sure which release it was first included in. The resources from pthreads are not reclaimed unless they are joined or are created in the detached state.
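As a side note on that last point, here is a minimal sketch of creating a POSIX thread in the detached state (plain pthread API, not sanlock's actual thread-pool code): a detached thread releases its resources when it exits, without anyone having to call pthread_join().

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* do some work and exit; no join is needed for a detached thread */
    return NULL;
}

int main(void)
{
    pthread_t th;
    pthread_attr_t attr;
    int rv;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    rv = pthread_create(&th, &attr, worker, NULL);
    pthread_attr_destroy(&attr);

    if (rv) {
        fprintf(stderr, "pthread_create failed: %d\n", rv);
        return 1;
    }

    sleep(1); /* give the detached worker a moment before the process exits */
    return 0;
}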
When testing a patch that sets the detached state on worker threads, I still see unexplained memory growth (both VSZ and RSS). With a patch that disables the shrinking of worker threads, I see fairly stable memory usage (constant VSZ and some limited growth in RSS). Based on this, I think the initial fix should include disabling the shrinking of worker threads (I'd like to hear from Nir about the reasons behind shrinking worker threads, and if there are any old issues that might reappear.) In the longer term I'd like to figure out why shrinking/recreating worker threads is problematic.

I'd like to include the following fixes for this bug:

1fa8d43 sanlock: fix memory leak of lockspace renewal_history
https://pagure.io/sanlock/c/1fa8d43007b31115342002ac675c905d9f18fbd9?branch=master

5abb9d5 sanlock: fix pthread_create error check
https://pagure.io/sanlock/c/5abb9d50616d399914958b99352b8cf016e4928a?branch=master

b04ddec Revert "sanlock: Shrink thread pool when there is no work"
https://pagure.io/sanlock/c/b04ddec92daffc5c919b09def3998237ae10d4a0?branch=master

(In reply to David Teigland from comment #40)
> I'd like to hear from Nir about the reasons behind shrinking worker threads,
> and if there are any old issues that might reappear.

Shrinking the worker pool releases resources when they are not needed. It was added when we added a configurable number of threads, required for RHV when using a large number of storage domains. Without shrinking, the number of threads in the pool will reach the maximum value (50 in RHV by default), and the threads will never exit, so any resource consumed by the threads will never be released.

I think this is a safe quick fix as the amount of resources for a full thread pool is low. We can enable the shrinking again after changing the threads to run in detached mode.

Created attachment 1866687 [details]
Reproducer script
The reproducer script uses sanlock.read_lockspace() to reproduce the memory leak.
## How to run the reproducer
1. Copy to RHV host with at least one storage domain active
2. Configure the script:
# The storage domain id, visible in RHV UI
lockspace = "aecec81f-d464-4a35-9a91-6acf2ca4938c"
# Number of iterations to run.
iterations = 1000
# Number of threads to run every iteration.
max_threads = 100
# Delay between iterations.
delay = 1.0
3. Run the script
sudo python3 test-sanlock-thread-leak.py
Example run:
2022-03-18 21:48:03.654437 rss: 60.9 MiB vsize: 1.4 GiB
...
2022-03-18 21:49:03.617904 rss: 94.4 MiB vsize: 16.6 GiB
...
2022-03-18 21:50:03.813287 rss: 128.1 MiB vsize: 31.9 GiB
...
2022-03-18 21:51:04.001063 rss: 159.0 MiB vsize: 46.1 GiB
...
2022-03-18 21:52:03.068266 rss: 193.4 MiB vsize: 60.3 GiB
...
2022-03-18 21:53:03.081200 rss: 224.9 MiB vsize: 74.5 GiB
...
2022-03-18 21:54:03.555092 rss: 261.5 MiB vsize: 90.4 GiB
...
2022-03-18 21:55:02.870848 rss: 289.6 MiB vsize: 102.9 GiB
...
2022-03-18 21:56:03.273246 rss: 319.3 MiB vsize: 116.1 GiB
...
2022-03-18 21:57:03.881945 rss: 353.7 MiB vsize: 130.9 GiB
...
2022-03-18 21:58:03.070916 rss: 383.9 MiB vsize: 144.1 GiB
...
2022-03-18 21:59:04.051771 rss: 418.9 MiB vsize: 158.5 GiB
...
2022-03-18 22:00:03.303177 rss: 450.9 MiB vsize: 172.6 GiB
...
2022-03-18 22:01:03.705316 rss: 484.9 MiB vsize: 188.0 GiB
...
2022-03-18 22:02:03.629137 rss: 517.3 MiB vsize: 202.3 GiB
...
2022-03-18 22:03:04.066243 rss: 550.8 MiB vsize: 217.4 GiB
...
2022-03-18 22:04:03.349303 rss: 580.4 MiB vsize: 231.0 GiB
...
2022-03-18 22:05:03.068172 rss: 612.1 MiB vsize: 245.6 GiB
...
2022-03-18 22:06:03.334320 rss: 637.6 MiB vsize: 256.6 GiB
...
2022-03-18 22:06:17.887669 rss: 637.6 MiB vsize: 256.6 GiB
We see clear leak of 30 MiB per minute.
Running more iterations show that memory stops increasing:
2022-03-18 22:08:13.628395 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:14.714051 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:15.765413 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:16.941370 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:18.058354 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:19.137433 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:20.204378 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:21.316290 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:22.403088 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:23.496417 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:24.632546 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:25.710745 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:26.820464 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:27.904294 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:28.967350 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:30.043814 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:31.172618 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:32.291046 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:33.427515 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:34.496371 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:35.658129 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:36.805213 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:37.908271 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:39.001675 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:40.080902 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:41.178956 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:42.275738 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:43.363715 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:44.450131 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:45.536230 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:46.739119 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:47.844468 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:48.924501 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:50.037772 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:51.219408 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:52.364321 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:53.467345 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:54.557449 rss: 637.6 MiB vsize: 256.6 GiB
2022-03-18 22:08:55.675664 rss: 637.6 MiB vsize: 256.6 GiB
I could not reproduce the issue of the stuck ADD at this point, trying to deactivate and activate 4 storage domains. Maybe trying with a larger number of storage domains will reproduce the issue. Maybe we need to add 1 or 2 inaccessible NFS storage domains to reproduce.
Created attachment 1866886 [details]
Test log with the fix
I tested a build with the fix:
# rpm -q sanlock
sanlock-3.8.4-2.el8.x86_64
Interesting events:
1. Run started:
2022-03-18 23:01:12.846458 rss: 71.9 MiB vsize: 1.5 GiB
2. Memory usage became stable after 16 minutes
2022-03-18 23:17:50.896674 rss: 78.8 MiB vsize: 1.5 GiB
3. Run ended
2022-03-19 18:07:43.715644 rss: 78.9 MiB vsize: 1.5 GiB
Total run time: 19:07
Total requests: 6,433,500
Nijin helped us with a coredump:
Looks like a problem when scaling down workers indeed.
Currently at 5 threads total
Thread 5 (Thread 0x7f85e48c8700 (LWP 1266653)):
#0 0x00007fc68233ad98 in __GI___nanosleep (requested_time=requested_time@entry=0x7f85e48c77a0, remaining=remaining@entry=0x7f85e48c77a0)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x00007fc68233ac9e in __sleep (seconds=0, seconds@entry=1) at ../sysdeps/posix/sleep.c:55
#2 0x000055c7f808e322 in add_lockspace_wait (sp=0x7fc5b0042b50) at lockspace.c:1098
#3 0x000055c7f80a27b5 in cmd_add_lockspace (ca=ca@entry=0x55c7f8a7c430, cmd=cmd@entry=2) at cmd.c:1308
#4 0x000055c7f80a6da4 in call_cmd_thread (task=<optimized out>, ca=0x55c7f8a7c430) at cmd.c:2160
#5 0x000055c7f8091450 in thread_pool_worker (data=<optimized out>) at main.c:948
#6 0x00007fc6832c914a in start_thread (arg=<optimized out>) at pthread_create.c:479
#7 0x00007fc68236edc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 4 (Thread 0x7f85f30e5700 (LWP 1097946)):
#0 0x00007fc68233ad98 in __GI___nanosleep (requested_time=requested_time@entry=0x7f85f30e47a0, remaining=remaining@entry=0x7f85f30e47a0)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x00007fc68233ac9e in __sleep (seconds=0, seconds@entry=1) at ../sysdeps/posix/sleep.c:55
#2 0x000055c7f808e322 in add_lockspace_wait (sp=0x7fc58800c450) at lockspace.c:1098
#3 0x000055c7f80a27b5 in cmd_add_lockspace (ca=ca@entry=0x55c7f8a7dcf0, cmd=cmd@entry=2) at cmd.c:1308
#4 0x000055c7f80a6da4 in call_cmd_thread (task=<optimized out>, ca=0x55c7f8a7dcf0) at cmd.c:2160
#5 0x000055c7f8091450 in thread_pool_worker (data=<optimized out>) at main.c:948
#6 0x00007fc6832c914a in start_thread (arg=<optimized out>) at pthread_create.c:479
#7 0x00007fc68236edc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 3 (Thread 0x7fc67d9cb700 (LWP 6542)):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x55c7f83bd50c <resource_cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55c7f83bd520 <resource_mutex>, cond=0x55c7f83bd4e0 <resource_cond>) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x55c7f83bd4e0 <resource_cond>, mutex=mutex@entry=0x55c7f83bd520 <resource_mutex>) at pthread_cond_wait.c:655
#3 0x000055c7f809c51f in resource_thread (arg=<optimized out>) at resource.c:2480
#4 0x00007fc6832c914a in start_thread (arg=<optimized out>) at pthread_create.c:479
#5 0x00007fc68236edc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7fc67f1ce700 (LWP 6515)):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x55c7f83bc0e8 <log_cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55c7f83bc100 <log_mutex>, cond=0x55c7f83bc0c0 <log_cond>) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x55c7f83bc0c0 <log_cond>, mutex=mutex@entry=0x55c7f83bc100 <log_mutex>) at pthread_cond_wait.c:655
#3 0x000055c7f808f92b in log_thread_fn (arg=<optimized out>) at log.c:272
#4 0x00007fc6832c914a in start_thread (arg=<optimized out>) at pthread_create.c:479
#5 0x00007fc68236edc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7fc6836fc780 (LWP 6513)):
#0 0x00007fc682363a41 in __GI___poll (fds=0x55c7f8a788f0, nfds=62, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x000055c7f8092ef6 in poll (__timeout=1000, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
#2 main_loop () at main.c:776
#3 do_daemon () at main.c:1787
#4 0x000055c7f808355e in main (argc=<optimized out>, argv=<optimized out>) at main.c:3930
But num_workers at 50 and 0 free? Does not look right
(gdb) p pool
$1 = {num_workers = 50, max_workers = 50, free_workers = 0, quit = 0, work_data = {next = 0x55c7f8a7ca60, prev = 0x55c7f95ef3f0}, mutex = {__data = {__lock = 0, __count = 0,
__owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, cond = {
__data = {{__wseq = 36819258, __wseq32 = {__low = 36819258, __high = 0}}, {__g1_start = 36819256, __g1_start32 = {__low = 36819256, __high = 0}}, __g_refs = {0, 0},
__g_size = {0, 0}, __g1_orig_size = 4, __wrefs = 0, __g_signals = {0, 0}},
__size = ":\321\061\002\000\000\000\000\070\321\061\002", '\000' <repeats 20 times>, "\004", '\000' <repeats 14 times>, __align = 36819258}, quit_wait = {__data = {{
__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {__g1_start = 0, __g1_start32 = {__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 0,
__wrefs = 0, __g_signals = {0, 0}}, __size = '\000' <repeats 47 times>, __align = 0}}
It's stuck, right? It won't get past here:
https://pagure.io/sanlock/blob/b04ddec92daffc5c919b09def3998237ae10d4a0/f/src/main.c#_991
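To make the suspicion concrete, here is a schematic sketch (hypothetical code, not sanlock's actual implementation; the pool/num_workers names only mirror the structure dumped above) of how a pthread_create failure that is tested with rv < 0 could leave num_workers pinned at max_workers while free_workers stays 0:

#include <pthread.h>

struct pool {
    int num_workers;
    int max_workers;
    int free_workers;
};

static struct pool pool = { .max_workers = 50 };

static void *worker(void *arg) { return arg; }

/* Schematic only: if pthread_create fails (returning a positive errno)
 * but the caller tests rv < 0, the failure is treated as success and
 * num_workers is incremented even though no thread was created.
 * free_workers stays 0 because no real worker ever starts and marks
 * itself free. */
static void add_worker_buggy(void)
{
    pthread_t th;
    int rv = pthread_create(&th, NULL, worker, NULL);

    if (rv < 0)          /* never true: pthread_create errors are positive */
        return;

    pool.num_workers++;  /* counted even when creation failed */
}

Repeat this enough times and num_workers reaches max_workers with almost no real threads behind it, so no further workers are ever attempted.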
(In reply to Germano Veit Michel from comment #50)
> It's stuck, right? It won't get past here:
> https://pagure.io/sanlock/blob/b04ddec92daffc5c919b09def3998237ae10d4a0/f/src/main.c#_991

The core is from sanlock without the fix, but this links to the code after the fix.

(In reply to Germano Veit Michel from comment #50)
> But num_workers at 50 and 0 free? Does not look right

When sanlock memory usage reaches some limit, it fails to create any new threads, including lockspace threads (visible as a stuck add lockspace) and worker threads. sanlock's error check for a failed pthread_create is also wrong. As a result of these two things, thread_pool_add_work() will call pthread_create, which fails to create a worker, sanlock fails to see that it failed, so sanlock still does num_workers++. After enough failures, num_workers is stuck at 50 and there are only the minimum number of worker threads (2).

Yup, I was comparing the coredump with your fixes, and it does look spot on. Thank you!

Is this BZ getting fixed in the RHEL 8.6 GA release?

It is fixed in 8.6 and 8.5.z.

Repeating the zstream process, now for 8.4.

Rebuilt for 8.7 with a larger release number (4).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sanlock bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7774
Description of problem:

At this point this is more a heads-up that we have something wrong; we are still trying to get more data to file a proper bug.

Essentially, sanlock has a lockspace stuck in 'ADD' for days/weeks. We have seen this in customer cases a couple of times now. It was initially thought to be caused by incorrect unmapping of LUNs from RHV hypervisors, but it has now shown up on another ticket with storage devices that are still present on the system.

See this: on Thu Feb 24 07:46:26 CET 2022, sanlock has this lockspace in this state (only one out of many):

s 05a41a9a-1318-4e64-9df8-c1de4456efbf:1:/dev/05a41a9a-1318-4e64-9df8-c1de4456efbf/ids:0 ADD
list=add
space_id=31
io_timeout=10
sector_size=0
align_size=0
host_generation=0
renew_fail=0
space_dead=0
killing_pids=0
used_retries=0
external_used=0
used_by_orphans=0
renewal_read_extend_sec=10
corrupt_result=0
acquire_last_result=0
renewal_last_result=0
acquire_last_attempt=0
acquire_last_success=0
renewal_last_attempt=0
renewal_last_success=0

15 days earlier:

2022-02-09 15:07:31 5289912 [1960057]: cmd_add_lockspace 3,45 05a41a9a-1318-4e64-9df8-c1de4456efbf:1:/dev/05a41a9a-1318-4e64-9df8-c1de4456efbf/ids:0 flags 1 timeout 10
2022-02-09 15:07:31 5289912 [1960057]: s31 lockspace 05a41a9a-1318-4e64-9df8-c1de4456efbf:1:/dev/05a41a9a-1318-4e64-9df8-c1de4456efbf/ids:0
2022-02-09 15:07:31 5289912 [1960057]: cmd_add_lockspace 3,45 async done 0

The VG is on this LUN:

360060160a9e0460006de885b0e6ba6ff dm-27 DGC,VRAID
size=8.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 7:0:0:148  sdy  65:128 active ready running
| `- 16:0:1:148 sddj 71:16  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 7:0:3:148  sdbh 67:176 active ready running
  `- 16:0:0:148 sdci 69:96  active ready running

No I/O errors, softlockups or anything. All that was being done at that time (lockspace add time) was a SCSI bus rescan and some LVM commands (by vdsm), rescanning all devices, but with no errors.

Version-Release number of selected component (if applicable):
sanlock-3.8.4-1.el8.x86_64

How reproducible:
Unknown so far