Bug 2043969 - dlm_tool can only create about 59 lockspaces
Summary: dlm_tool can only create about 59 lockspaces
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: dlm
Version: 8.4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Alexander Aring
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-22 23:10 UTC by Reid Wahl
Modified: 2023-07-22 07:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-22 07:29:22 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
System                             ID               Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker              RHELPLAN-109267  0        None      None    None     2022-01-22 23:12:51 UTC
Red Hat Knowledge Base (Solution)  6667751          0        None      None    None     2022-01-25 18:46:37 UTC

Description Reid Wahl 2022-01-22 23:10:25 UTC
Description of problem:

When there are already 59 lockspaces (exact number might vary by a small amount), dlm_tool cannot create any more.

[root@fastvm-rhel-8-0-23 ~]# dlm_tool join ls60
Joining lockspace "ls60" permission 600 
dlm_new_lockspace ls60 error 16


This seems to affect only lockspaces created through the user-space tools. In my prior experience, gfs2 has no problem creating more lockspaces than this.

I stepped through dlm_controld under gdb and found that the new lockspace is getting created. It's present in corosync-cpgtool output and in the debugfs. Also, the "join complete" message gets logged. But dlm_controld receives an "offline" uevent and leaves the lockspace immediately after joining. It seems like this is from the do_uevent(ls, 0) call in release_lockspace(). (If the offline uevent were coming from the call in new_lockspace(), then we'd expect to see a "ping_members aborted" message.)

I think we're hitting a failure to register the misc device.

fs/dlm/user.c:
~~~
static int device_create_lockspace(struct dlm_lspace_params *params)
{
...
    error = dlm_new_lockspace(params->name, dlm_config.ci_cluster_name, params->flags,
                  DLM_USER_LVB_LEN, NULL, NULL, NULL,
                  &lockspace);
    if (error)
        return error;

    ls = dlm_find_lockspace_local(lockspace);
    if (!ls)
        return -ENOENT;

    error = dlm_device_register(ls, params->name);
    dlm_put_lockspace(ls);

    if (error)
        dlm_release_lockspace(lockspace, 0);
    else
        error = ls->ls_device.minor;

    return error;
}


static int dlm_device_register(struct dlm_ls *ls, char *name)
{
...
    ls->ls_device.minor = MISC_DYNAMIC_MINOR;

    error = misc_register(&ls->ls_device);
...
    return error;
}
~~~


I believe we've run out of dynamic minors. DYNAMIC_MINORS caps the pool at 64.

drivers/char/misc.c:
~~~
#define DYNAMIC_MINORS 64 /* like dynamic majors */
...
int misc_register(struct miscdevice *misc)
{
    bool is_dynamic = (misc->minor == MISC_DYNAMIC_MINOR);
...
    if (is_dynamic) {
        int i = find_first_zero_bit(misc_minors, DYNAMIC_MINORS);

        if (i >= DYNAMIC_MINORS) {
            err = -EBUSY;
            goto out;
        }
~~~
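
To make the exhaustion concrete, here is a minimal userspace sketch of the allocation pattern in the excerpt above. It is not kernel code: `struct minor_pool` and `minor_pool_alloc()` are hypothetical names, a 64-bit integer stands in for the `misc_minors` bitmap, and the descending numbering (the first free slot maps to minor 63) is an assumption chosen to match the device listing in this report.

```c
#include <errno.h>
#include <stdint.h>

#define DYNAMIC_MINORS 64

/* Hypothetical stand-in for the kernel's misc_minors bitmap:
 * bit i set means dynamic slot i is already taken. */
struct minor_pool {
    uint64_t used;
};

/* Take the first free slot, like find_first_zero_bit() in the excerpt.
 * Returns the minor number on success, or -EBUSY once all 64 slots
 * are in use. The "DYNAMIC_MINORS - i - 1" mapping is an assumption. */
static int minor_pool_alloc(struct minor_pool *pool)
{
    for (int i = 0; i < DYNAMIC_MINORS; i++) {
        uint64_t bit = (uint64_t)1 << i;

        if (!(pool->used & bit)) {
            pool->used |= bit;
            return DYNAMIC_MINORS - i - 1;
        }
    }
    return -EBUSY;
}
```

Starting from an empty pool, 64 calls succeed and the 65th returns -EBUSY. On Linux EBUSY is 16, which matches the "error 16" that dlm_tool prints.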


On my machine, here's the state at the time we try to add the 60th lockspace. All 64 dynamic minors (0 through 63) are taken: 59 by lockspaces and 5 by other misc devices.

[root@fastvm-rhel-8-0-23 ~]# ls -lanR /dev | awk '$9 ~ /^10:/ {print $9" " $11}' | sed 's/^\([0-9]*\):/\1 /g' | sort -nk2
10 0 ../dlm_ls59
10 1 ../dlm_ls58
10 2 ../dlm_ls57
10 3 ../dlm_ls56
10 4 ../dlm_ls55
10 5 ../dlm_ls54
10 6 ../dlm_ls53
10 7 ../dlm_ls52
10 8 ../dlm_ls51
10 9 ../dlm_ls50
...
10 58 ../dlm_ls1
10 59 ../dlm_plock
10 60 ../dlm-monitor
10 61 ../dlm-control
10 62 ../cpu_dma_latency
10 63 ../vga_arbiter
10 130 ../watchdog
...
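
The same count can be taken from text in the `/proc/misc` format, where each line is a minor number followed by a device name. The sketch below is only an illustration: `count_dynamic_minors_in_use()` is a hypothetical helper, and it assumes that every minor below 64 came from the dynamic pool, which holds for the listing above but is not guaranteed on other systems or kernel versions.

```c
#include <stdio.h>
#include <string.h>

#define DYNAMIC_MINORS 64

/* Count entries whose minor falls in 0..63.
 * `text` holds lines of "<minor> <name>", as in /proc/misc.
 * Assumption: minors below 64 are all dynamically allocated. */
static int count_dynamic_minors_in_use(const char *text)
{
    int used = 0;
    const char *line = text;

    while (line && *line) {
        int minor;
        char name[64];

        /* %d skips the leading padding that /proc/misc uses. */
        if (sscanf(line, "%d %63s", &minor, name) == 2 &&
            minor >= 0 && minor < DYNAMIC_MINORS)
            used++;

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return used;
}
```

With only the five non-lockspace devices from the listing present (dlm_plock, dlm-monitor, dlm-control, cpu_dma_latency, vga_arbiter), the helper returns 5, which leaves 64 - 5 = 59 minors for lockspace devices. That is where the "about 59" figure comes from, and why the exact limit varies with the number of other misc devices.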

-----

Version-Release number of selected component (if applicable):

kernel-4.18.0-305.3.1.el8_4.x86_64
dlm-4.1.0-1.el8.x86_64
dlm-lib-4.1.0-1.el8.x86_64

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create 59 lockspaces using `dlm_tool join ls<number>`.
2. Attempt to create a 60th lockspace.

-----

Actual results:

Joining lockspace "ls60" permission 600 
dlm_new_lockspace ls60 error 16

-----

Expected results:

Joining lockspace "ls60" permission 600 
done

-----

Additional info:

I don't know whether this will be practical to fix, as I'm not experienced with the kernel API and its conventions. However, the result in user space is a completely opaque error that tells the user nothing about why their `dlm_tool join` command failed.

If we can't fix this so that there's no seemingly arbitrary limitation on lockspace creation via `dlm_tool join`, then we should **at least** improve the error message so that the user can understand the failure.
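
As a rough illustration of the "better error message" option, a user-space caller could translate the errno before printing it. This is a sketch only: `lockspace_join_hint()` is a hypothetical helper, not part of dlm_tool or libdlm, and the EBUSY text assumes the failure comes from `misc_register()` as analyzed above. Other EBUSY causes are possible, so the wording stays tentative.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map the errno from a failed lockspace join to a
 * more descriptive hint. Accepts either a positive errno or a negative
 * kernel-style error code. */
static const char *lockspace_join_hint(int err)
{
    if (err < 0)
        err = -err;

    switch (err) {
    case EBUSY:
        /* Assumption: EBUSY here most likely means misc_register()
         * found no free dynamic minor for the lockspace device. */
        return "device busy: possibly no free misc device minor "
               "left for the lockspace device";
    default:
        /* Fall back to the standard C library description. */
        return strerror(err);
    }
}
```

A caller could then print something like `dlm_new_lockspace ls60 error 16 (device busy: possibly no free misc device minor left for the lockspace device)` instead of the bare number.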

Comment 1 Reid Wahl 2022-01-24 02:54:20 UTC
I think since there's zero known demand for this (because no one uses dlm_tool join to create a bunch of lockspaces in practice), it's best to improve the error message and possibly document this in a man page, and then move on. Since the issue is that we run out of dynamic minors, I figure it's probably not straightforward to get around that limitation, and thus not worth the effort.

Comment 2 Jonathan Earl Brassow 2023-02-28 12:59:39 UTC
unsure why assignee got changed there...  should this bug be under a different pool?

Comment 3 Andrew Price 2023-02-28 13:19:59 UTC
Yeah let's set the pool to sst_filesystems if Alex is assigned.

Comment 5 RHEL Program Management 2023-07-22 07:29:22 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

