Description of problem:

When there are already 59 lockspaces (the exact number may vary slightly), dlm_tool cannot create any more.

[root@fastvm-rhel-8-0-23 ~]# dlm_tool join ls60
Joining lockspace "ls60"
permission 600
dlm_new_lockspace ls60 error 16

This seems to be an issue only with the user-space tools; based on prior experience, gfs2 has no problem creating more lockspaces.

I stepped through dlm_controld under gdb and found that the new lockspace is getting created. It's present in corosync-cpgtool output and in debugfs, and the "join complete" message gets logged. But dlm_controld then receives an "offline" uevent and leaves the lockspace immediately after joining. This appears to come from the do_uevent(ls, 0) call in release_lockspace(). (If the offline uevent were coming from the call in new_lockspace(), we'd expect to see a "ping_members aborted" message.)

I think we're hitting a failure to register the misc device.

fs/dlm/user.c:
~~~
static int device_create_lockspace(struct dlm_lspace_params *params)
{
	...
	error = dlm_new_lockspace(params->name, dlm_config.ci_cluster_name,
				  params->flags, DLM_USER_LVB_LEN, NULL, NULL,
				  NULL, &lockspace);
	if (error)
		return error;

	ls = dlm_find_lockspace_local(lockspace);
	if (!ls)
		return -ENOENT;

	error = dlm_device_register(ls, params->name);
	dlm_put_lockspace(ls);

	if (error)
		dlm_release_lockspace(lockspace, 0);
	else
		error = ls->ls_device.minor;

	return error;
}

static int dlm_device_register(struct dlm_ls *ls, char *name)
{
	...
	ls->ls_device.minor = MISC_DYNAMIC_MINOR;
	error = misc_register(&ls->ls_device);
	...
	return error;
}
~~~

I believe we've run out of dynamic minors (DYNAMIC_MINORS); only 64 are available.

drivers/char/misc.c:
~~~
#define DYNAMIC_MINORS 64 /* like dynamic majors */
...
int misc_register(struct miscdevice *misc)
{
	bool is_dynamic = (misc->minor == MISC_DYNAMIC_MINOR);
	...
	if (is_dynamic) {
		int i = find_first_zero_bit(misc_minors, DYNAMIC_MINORS);

		if (i >= DYNAMIC_MINORS) {
			err = -EBUSY;
			goto out;
		}
~~~

On my machine, here's the state at the time we try to join ls60; all 64 dynamic minors (0-63) are already in use:

[root@fastvm-rhel-8-0-23 ~]# ls -lanR /dev | awk '$9 ~ /^10:/ {print $9" " $11}' | sed 's/^\([0-9]*\):/\1 /g' | sort -nk2
10 0 ../dlm_ls59
10 1 ../dlm_ls58
10 2 ../dlm_ls57
10 3 ../dlm_ls56
10 4 ../dlm_ls55
10 5 ../dlm_ls54
10 6 ../dlm_ls53
10 7 ../dlm_ls52
10 8 ../dlm_ls51
10 9 ../dlm_ls50
...
10 58 ../dlm_ls1
10 59 ../dlm_plock
10 60 ../dlm-monitor
10 61 ../dlm-control
10 62 ../cpu_dma_latency
10 63 ../vga_arbiter
10 130 ../watchdog
...

-----

Version-Release number of selected component (if applicable):
kernel-4.18.0-305.3.1.el8_4.x86_64
dlm-4.1.0-1.el8.x86_64
dlm-lib-4.1.0-1.el8.x86_64

-----

How reproducible:
Always

-----

Steps to Reproduce:
1. Create 59 lockspaces using `dlm_tool join ls<number>`.
2. Attempt to create a 60th lockspace.

-----

Actual results:
Joining lockspace "ls60"
permission 600
dlm_new_lockspace ls60 error 16

-----

Expected results:
Joining lockspace "ls60"
permission 600
done

-----

Additional info:
I don't know whether this will be practical to fix, as I'm not experienced with the kernel API and its conventions. However, the result in user space is a completely opaque error that tells the user nothing about why their `dlm_tool join` command failed. Error 16 is EBUSY, which here means misc_register() could not find a free dynamic minor. If we can't remove this seemingly arbitrary limit on lockspace creation via `dlm_tool join`, then we should **at least** improve the error message so that the user can understand the failure.
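For what it's worth, the exhaustion isn't specific to DLM: on this kernel, every driver that registers with MISC_DYNAMIC_MINOR competes for the same 64 slots (note cpu_dma_latency and vga_arbiter in the listing above). As a minimal sketch of the registration path that dlm_device_register() goes through, here is a hypothetical out-of-tree module (the "minors_demo" name and file are made up for illustration); loading it when all 64 dynamic minors are taken should fail with the same -EBUSY:

~~~
/* minors_demo.c - hypothetical example module, not part of DLM or the kernel.
 * It registers a misc device with a dynamic minor, the same way
 * dlm_device_register() does for /dev/dlm_<lockspace>.
 */
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>

static const struct file_operations demo_fops = {
	.owner = THIS_MODULE,
};

static struct miscdevice demo_dev = {
	.minor = MISC_DYNAMIC_MINOR,	/* ask misc_register() to pick a minor */
	.name  = "minors_demo",
	.fops  = &demo_fops,
};

static int __init demo_init(void)
{
	int err = misc_register(&demo_dev);

	/* On this kernel, -EBUSY (16) comes back once all 64 dynamic minors
	 * are in use -- the same error 16 that dlm_tool reports. */
	if (err)
		pr_err("minors_demo: misc_register failed: %d\n", err);
	return err;
}

static void __exit demo_exit(void)
{
	misc_deregister(&demo_dev);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
~~~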
Since there's no known demand for this (in practice nobody uses dlm_tool join to create that many lockspaces), I think the best course is to improve the error message, possibly document the limit in a man page, and move on. Because the root cause is running out of misc dynamic minors, I suspect working around that limit would not be straightforward and isn't worth the effort.
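As a sketch of what a friendlier message could look like on the dlm_tool side (illustrative only; report_join_error() is a made-up helper, not existing dlm_tool or libdlm code):

~~~
/* Hypothetical helper (not dlm_tool's actual code): translate the raw
 * "error 16" into something the user can act on. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

static void report_join_error(const char *lsname, int error)
{
	fprintf(stderr, "dlm_new_lockspace %s error %d (%s)\n",
		lsname, error, strerror(error));
	if (error == EBUSY)
		fprintf(stderr, "hint: the kernel may be out of misc dynamic minors "
				"for /dev/dlm_* devices; see /proc/misc\n");
}

int main(void)
{
	report_join_error("ls60", 16);	/* the failure seen in this report */
	return 0;
}
~~~

Even just appending strerror() output plus a hint about /proc/misc would make the failure much less opaque.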
Unsure why the assignee got changed there... should this bug be under a different pool?
Yeah let's set the pool to sst_filesystems if Alex is assigned.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.