Description of problem:
When there are already 59 lockspaces (exact number might vary by a small amount), dlm_tool cannot create any more.
[root@fastvm-rhel-8-0-23 ~]# dlm_tool join ls60
Joining lockspace "ls60" permission 600
dlm_new_lockspace ls60 error 16
This seems to be an issue only with user-space tools. Based on prior experience, gfs2 has no problem creating more lockspaces.
I stepped through dlm_controld under gdb and found that the new lockspace is getting created. It's present in corosync-cpgtool output and in the debugfs. Also, the "join complete" message gets logged. But dlm_controld receives an "offline" uevent and leaves the lockspace immediately after joining. It seems like this is from the do_uevent(ls, 0) call in release_lockspace(). (If the offline uevent were coming from the call in new_lockspace(), then we'd expect to see a "ping_members aborted" message.)
I think we're hitting a failure to register the misc device.
fs/dlm/user.c:
~~~
static int device_create_lockspace(struct dlm_lspace_params *params)
{
	...
	error = dlm_new_lockspace(params->name, dlm_config.ci_cluster_name, params->flags,
				  DLM_USER_LVB_LEN, NULL, NULL, NULL,
				  &lockspace);
	if (error)
		return error;

	ls = dlm_find_lockspace_local(lockspace);
	if (!ls)
		return -ENOENT;

	error = dlm_device_register(ls, params->name);
	dlm_put_lockspace(ls);

	if (error)
		dlm_release_lockspace(lockspace, 0);
	else
		error = ls->ls_device.minor;

	return error;
}

static int dlm_device_register(struct dlm_ls *ls, char *name)
{
	...
	ls->ls_device.minor = MISC_DYNAMIC_MINOR;
	error = misc_register(&ls->ls_device);
	...
	return error;
}
~~~
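To make the observed sequence easier to follow (the lockspace appears, then an "offline" uevent arrives right away), here is a minimal user-space sketch of the create-then-roll-back flow. It is not kernel code: `fake_new_lockspace`, `fake_release_lockspace`, and `fake_create_lockspace` are hypothetical stand-ins for the functions quoted above, and the event log only imitates the online/offline uevents, assuming the join emits "online" and the release emits "offline".

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a lockspace; not the kernel's struct dlm_ls. */
struct fake_ls {
    int joined;        /* 1 while the simulated lockspace exists */
    char events[64];   /* comma-separated log of simulated uevents */
};

/* Append one event name to the log, never overflowing the buffer. */
static void log_event(struct fake_ls *ls, const char *ev)
{
    if (ls->events[0] != '\0')
        strncat(ls->events, ",", sizeof(ls->events) - strlen(ls->events) - 1);
    strncat(ls->events, ev, sizeof(ls->events) - strlen(ls->events) - 1);
}

/* Simulated join: always succeeds and reports "online". */
static int fake_new_lockspace(struct fake_ls *ls)
{
    ls->joined = 1;
    log_event(ls, "online");
    return 0;
}

/* Simulated release: reports "offline". */
static void fake_release_lockspace(struct fake_ls *ls)
{
    ls->joined = 0;
    log_event(ls, "offline");
}

/*
 * Same shape as device_create_lockspace(): join first, then register the
 * device, and undo the join if registration fails. register_err stands in
 * for the value returned by misc_register().
 */
static int fake_create_lockspace(struct fake_ls *ls, int register_err)
{
    int error = fake_new_lockspace(ls);
    if (error)
        return error;

    error = register_err;
    if (error)
        fake_release_lockspace(ls);
    return error;
}
```

Calling `fake_create_lockspace(&ls, -EBUSY)` on a zero-initialized `struct fake_ls` leaves `ls.events` as `online,offline`, which mirrors what dlm_controld appears to see: a completed join followed immediately by a leave.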
I believe we've run out of dynamic minors. Only 64 are available (DYNAMIC_MINORS).
drivers/char/misc.c:
~~~
#define DYNAMIC_MINORS 64 /* like dynamic majors */
...
int misc_register(struct miscdevice *misc)
{
	bool is_dynamic = (misc->minor == MISC_DYNAMIC_MINOR);
	...
	if (is_dynamic) {
		int i = find_first_zero_bit(misc_minors, DYNAMIC_MINORS);
		if (i >= DYNAMIC_MINORS) {
			err = -EBUSY;
			goto out;
		}
~~~
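The exhaustion can be illustrated with a small user-space model of a fixed pool of 64 slots. This is only a sketch: `struct minor_pool`, `pool_alloc`, and `pool_alloc_many` are hypothetical names, and a plain array of flags stands in for the kernel's `misc_minors` bitmap.

```c
#include <errno.h>
#include <stdbool.h>

#define DYNAMIC_MINORS 64   /* same value as in drivers/char/misc.c */

/* One flag per dynamic minor; a simplified stand-in for the misc_minors bitmap. */
struct minor_pool {
    bool used[DYNAMIC_MINORS];
};

/* Claim the first free slot and return its index, or -EBUSY if none is free. */
static int pool_alloc(struct minor_pool *pool)
{
    for (int i = 0; i < DYNAMIC_MINORS; i++) {
        if (!pool->used[i]) {
            pool->used[i] = true;
            return i;
        }
    }
    return -EBUSY;
}

/* Try to claim `want` slots; return how many were actually claimed. */
static int pool_alloc_many(struct minor_pool *pool, int want)
{
    int got = 0;
    while (got < want && pool_alloc(pool) >= 0)
        got++;
    return got;
}
```

If some other misc devices already hold 5 slots, 59 further allocations succeed and the next call to `pool_alloc` returns `-EBUSY`. That would be consistent with the limit of roughly 59 lockspaces, and with the limit shifting slightly depending on how many other dynamic misc devices a given machine has.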
On my machine, here's the state at the time we try to add the 60th lockspace. All 64 dynamic minors (0 through 63) are already taken:
[root@fastvm-rhel-8-0-23 ~]# ls -lanR /dev | awk '$9 ~ /^10:/ {print $9" " $11}' | sed 's/^\([0-9]*\):/\1 /g' | sort -nk2
10 0 ../dlm_ls59
10 1 ../dlm_ls58
10 2 ../dlm_ls57
10 3 ../dlm_ls56
10 4 ../dlm_ls55
10 5 ../dlm_ls54
10 6 ../dlm_ls53
10 7 ../dlm_ls52
10 8 ../dlm_ls51
10 9 ../dlm_ls50
...
10 58 ../dlm_ls1
10 59 ../dlm_plock
10 60 ../dlm-monitor
10 61 ../dlm-control
10 62 ../cpu_dma_latency
10 63 ../vga_arbiter
10 130 ../watchdog
...
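
The same check can be sketched in C. This takes a different route from the shell pipeline above: rather than reading the symlinks under /dev/char, it stats the device nodes directly in one directory. `is_dynamic_misc` and `count_dynamic_misc` are hypothetical helper names, and the 0 through 63 range is an assumption based on the listing above for this kernel; other kernel versions may place dynamic minors elsewhere.

```c
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

#define MISC_MAJOR     10   /* major number of misc character devices */
#define DYNAMIC_MINORS 64   /* assumed dynamic range: minors 0..63 */

/* True if the device number is a misc device with a minor in the assumed dynamic range. */
static bool is_dynamic_misc(dev_t rdev)
{
    return major(rdev) == MISC_MAJOR && minor(rdev) < DYNAMIC_MINORS;
}

/*
 * Count character devices directly under `dir` (no recursion) whose
 * device number is in the dynamic misc range.
 * Returns -1 if the directory cannot be opened.
 */
static int count_dynamic_misc(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        char path[4096];
        struct stat st;
        int n = snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);

        if (n < 0 || (size_t)n >= sizeof(path))
            continue;   /* skip names that do not fit in the buffer */
        if (lstat(path, &st) == 0 && S_ISCHR(st.st_mode) &&
            is_dynamic_misc(st.st_rdev))
            count++;
    }
    closedir(d);
    return count;
}
```

On the machine in this report, `count_dynamic_misc("/dev")` would be expected to reach 64 once no more lockspaces can be created, assuming every dynamic misc device has a node directly under /dev.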
-----
Version-Release number of selected component (if applicable):
kernel-4.18.0-305.3.1.el8_4.x86_64
dlm-4.1.0-1.el8.x86_64
dlm-lib-4.1.0-1.el8.x86_64
-----
How reproducible:
Always
-----
Steps to Reproduce:
1. Create 59 lockspaces using `dlm_tool join ls<number>`.
2. Attempt to create a 60th lockspace.
-----
Actual results:
Joining lockspace "ls60" permission 600
dlm_new_lockspace ls60 error 16
-----
Expected results:
Joining lockspace "ls60" permission 600
done
-----
Additional info:
I don't know whether this will be practical to fix, as I'm not experienced with the kernel API and its conventions. However, the result in user space is a completely opaque error that tells the user nothing about why their `dlm_tool join` command failed.
If we can't fix this so that there's no seemingly arbitrary limitation on lockspace creation via `dlm_tool join`, then we should **at least** improve the error message so that the user can understand the failure.
I think since there's zero known demand for this (because no one uses dlm_tool join to create a bunch of lockspaces in practice), it's best to improve the error message and possibly document this in a man page, and then move on. Since the issue is that we run out of dynamic minors, I figure it's probably not straightforward to get around that limitation, and thus not worth the effort.
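As a rough illustration of the kind of message improvement suggested above, here is a minimal sketch. `format_join_error` is a hypothetical helper, not an existing dlm_tool or libdlm function, and the hint text is hedged because EBUSY could have other causes.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a more descriptive message for a failed
 * lockspace join. `err` is a positive errno value such as EBUSY.
 * Returns the value of snprintf().
 */
static int format_join_error(char *buf, size_t len, const char *name, int err)
{
    if (err == EBUSY)
        return snprintf(buf, len,
                        "dlm_new_lockspace %s error %d (%s): "
                        "the kernel may be out of dynamic misc device minors",
                        name, err, strerror(err));

    return snprintf(buf, len, "dlm_new_lockspace %s error %d (%s)",
                    name, err, strerror(err));
}
```

With `name` set to `ls60` and `err` set to `EBUSY`, the message keeps the existing `dlm_new_lockspace ls60 error` prefix and appends the errno text plus a hint, so anything that matches on the current prefix would still work.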
Comment 2 Jonathan Earl Brassow
2023-02-28 12:59:39 UTC
unsure why assignee got changed there... should this bug be under a different pool?
Yeah let's set the pool to sst_filesystems if Alex is assigned.
Comment 5 RHEL Program Management
2023-07-22 07:29:22 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.