Bug 1621434
| Summary: | kernel: clearcpuid=304 does not reduce XSAVE region size on Xeon Gold 6152 CPU | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Florian Weimer <fweimer> | ||||
| Component: | kernel | Assignee: | Prarit Bhargava <prarit> | ||||
| kernel sub component: | x86_64 | QA Contact: | Vilém Maršík <vmarsik> | ||||
| Status: | CLOSED WONTFIX | Docs Contact: | |||||
| Severity: | medium | ||||||
| Priority: | medium | CC: | fweimer, iloginovskikh | ||||
| Version: | 7.7 | Keywords: | Reopened | ||||
| Target Milestone: | rc | ||||||
| Target Release: | 7.8 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-02-15 07:41:54 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1707052 | ||||||
| Attachments: |
|
||||||
Compiling the test program on RHEL7.6 latest results in
[09:47 AM root@hpe-xl270gen10-01 ~]# gcc -o tst-minisigtksz tst-minisigtksz.c
tst-minisigtksz.c: In function ‘handler’:
tst-minisigtksz.c:14:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (size_t i = 0; i < sizeof (buffer); ++i)
^
tst-minisigtksz.c:14:3: note: use option -std=c99 or -std=gnu99 to compile your code
tst-minisigtksz.c: In function ‘main’:
tst-minisigtksz.c:53:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (void *p = stack_buffer; p < stack_bottom; ++p)
^
tst-minisigtksz.c:57:14: error: redefinition of ‘p’
for (void *p = stack_top; p < stack_end; ++p)
^
tst-minisigtksz.c:53:14: note: previous definition of ‘p’ was here
for (void *p = stack_buffer; p < stack_bottom; ++p)
^
tst-minisigtksz.c:57:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (void *p = stack_top; p < stack_end; ++p)
^
I could modify the test program but want to make sure I'm not doing something wrong ...
P.
(In reply to Prarit Bhargava from comment #1) > Compiling the test program on RHEL7.6 latest results in > > [09:47 AM root@hpe-xl270gen10-01 ~]# gcc -o tst-minisigtksz > tst-minisigtksz.c > tst-minisigtksz.c: In function ‘handler’: > tst-minisigtksz.c:14:3: error: ‘for’ loop initial declarations are only > allowed in C99 mode > for (size_t i = 0; i < sizeof (buffer); ++i) > ^ > tst-minisigtksz.c:14:3: note: use option -std=c99 or -std=gnu99 to compile > your code > tst-minisigtksz.c: In function ‘main’: > tst-minisigtksz.c:53:3: error: ‘for’ loop initial declarations are only > allowed in C99 mode > for (void *p = stack_buffer; p < stack_bottom; ++p) > ^ Sorry, you'll have to compile with -std=gnu99. fweimer, I just want to make sure I have the request/bug correct. You want "clearcpuid=304" to disable AVX-512 on the *HARDWARE*, not just the kernel. ie) As a result of "clearcpuid=304" you want cpuid instructions such as __cpuid_count (0xd, 0, eax, ebx, ecx, edx); to return, for example, as if AVX-512 was NOT enabled on the processor. Is that correct? P. (In reply to Prarit Bhargava from comment #3) > fweimer, I just want to make sure I have the request/bug correct. > > You want "clearcpuid=304" to disable AVX-512 on the *HARDWARE*, not just the > kernel. > > ie) As a result of "clearcpuid=304" you want cpuid instructions such as > > __cpuid_count (0xd, 0, eax, ebx, ecx, edx); > > to return, for example, as if AVX-512 was NOT enabled on the processor. I think clearcpuid=304 already does that, altering the CPUID flags, so that applications will not see that AVX-512 is available. What it does not seem to do is reduce the save area size for XSAVE. It should be much smaller because there should never be any AVX-512 registers to save. This is necessary for full compatibility with applications which have very tight (signal) stacks, particularly 32-bit applications. (In reply to Florian Weimer from comment #4) > (In reply to Prarit Bhargava from comment #3) > > fweimer, I just want to make sure I have the request/bug correct. > > > > You want "clearcpuid=304" to disable AVX-512 on the *HARDWARE*, not just the > > kernel. > > > > ie) As a result of "clearcpuid=304" you want cpuid instructions such as > > > > __cpuid_count (0xd, 0, eax, ebx, ecx, edx); > > > > to return, for example, as if AVX-512 was NOT enabled on the processor. > > I think clearcpuid=304 already does that, altering the CPUID flags, so that > applications will not see that AVX-512 is available. Well ... yes and no. An application that queries /proc/cpuinfo (lscpu for example) will see the avx* flags missing. An application that calls cpuid to read the processor flags directly will still see the avx flags because clearcpuid does not directly affect the hardware. > > What it does not seem to do is reduce the save area size for XSAVE. It > should be much smaller because there should never be any AVX-512 registers > to save. This is necessary for full compatibility with applications which > have very tight (signal) stacks, particularly 32-bit applications. I'm looking at the glibc code you pasted above: For _dl_runtime_resolve, set xsave_state_size to xsave area size + integer register save size and align it to 64 bytes. */ (cpu_features->max_cpuid >= 0xd) { unsigned int eax, ebx, ecx, edx; __cpuid_count (0xd, 0, eax, ebx, ecx, edx); <<< PRARIT: cpuid reads from HW. if (ebx != 0) { cpu_features->xsave_state_size = ALIGN_UP (ebx + STATE_SAVE_OFFSET, 64); __cpuid_count (0xd, 1, eax, ebx, ecx, edx); /* Check if XSAVEC is available. */ if ((eax & (1 << 1)) != 0) { unsigned int xstate_comp_offsets[32]; unsigned int xstate_comp_sizes[32]; unsigned int i; xstate_comp_offsets[0] = 0; <snip> and I'm wondering if the problem is that glibc is checking if the CPU supports XSAVE rather than the OS (via /proc/cpuinfo)? Or am I not "getting" the bug. :) P. Okay, then I misunderstand what the clearcpuid=304 feature was about. Before we can talk about XSAVE size, we need a way to disable AVX-512 for real. Based on the original description of the bug that added clearcpuid=, I assumed it was doing just that, but I can see that I was wrong about that. In any case, I can see now that this bug report doesn't make sense. (In reply to Florian Weimer from comment #6) > Okay, then I misunderstand what the clearcpuid=304 feature was about. > Before we can talk about XSAVE size, we need a way to disable AVX-512 for > real. Based on the original description of the bug that added clearcpuid=, > I assumed it was doing just that, but I can see that I was wrong about that. > > In any case, I can see now that this bug report doesn't make sense. Well ... yes and no. I think you have a valid case here and let's see if there is any other option. /me is "dumb" about glibc <<< just a warning :) This has been a misconception about CPU features and flags for a long time. CPU flags only indicate that a feature is PRESENT on a CPU, but not that the OS supports the feature. For example, a CPU could have smt (aka hyperthreading) but the OS could disable the non-primary threads. If a program read the appropriate cpuid, it could erroneously determine (as your code does above) that smt is enabled. It gets worse though, as it could be that the OS ignores the smt flag and just brings up the primary threads. In this case, however, I think you can make a better judgement about avx support by looking at /proc/cpuinfo. OOC why doesn't the glibc code do that? P. (In reply to Prarit Bhargava from comment #7) > This has been a misconception about CPU features and flags for a long time. > CPU flags only indicate that a feature is PRESENT on a CPU, but not that the > OS supports the feature. So what seems to happen here is that clearcpuid=304 does not simply alter the /proc/cpuinfo output, but also withdraws %zmm register support or something like that. Userspace correctly detects that with the Intel-recommended AVX-512 support check, with fails, and executing AVX-512 instructions on %zmm registers faults. So AVX-512 support is in fact removed from userspace. This matches the original description in the bug that added the kernel command line argument, but it does not quite agree with the documentation added in the patch. > In this case, however, I think you can make a better judgement about avx > support by looking at /proc/cpuinfo. OOC why doesn't the glibc code do that? Because /proc isn't always mounted. The kernel is supposed to provide this information in the auxiliary vector in hwcap flags, but for most features, we don't do this on x86 because CPUID (and later XGETBV) is the interface for obtaining this information. So the reaming question is whether the CPU actually needs the large XSAVEC area when running with clearcpuid=304 or not, and if it does not, how can we adjusted the size computations in the kernel and glibc. This could also be something that would have to be tweaked in firmware/microcode. I still think this would be a useful change for enhanced backwards compatibility. (In reply to Florian Weimer from comment #8) > (In reply to Prarit Bhargava from comment #7) > > > This has been a misconception about CPU features and flags for a long time. > > CPU flags only indicate that a feature is PRESENT on a CPU, but not that the > > OS supports the feature. > > So what seems to happen here is that clearcpuid=304 does not simply alter > the /proc/cpuinfo output, but also withdraws %zmm register support or > something like that. Userspace correctly detects that with the > Intel-recommended AVX-512 support check, with fails, and executing AVX-512 > instructions on %zmm registers faults. So AVX-512 support is in fact > removed from userspace. > Certainly that's true. I didn't mean to indicate that /proc/cpuinfo output is the only thing affected. The kernel has certainly dropped AVX-512 support at that point, however, it does not change the hardware reporting AVX-512 support. > This matches the original description in the bug that added the kernel > command line argument, but it does not quite agree with the documentation > added in the patch. > > > In this case, however, I think you can make a better judgement about avx > > support by looking at /proc/cpuinfo. OOC why doesn't the glibc code do that? > > Because /proc isn't always mounted. The kernel is supposed to provide this > information in the auxiliary vector in hwcap flags, but for most features, > we don't do this on x86 because CPUID (and later XGETBV) is the interface > for obtaining this information. :( but the problem (as noted) is that those commands return what the hardware supports and not what the kernel supports. One thought here is to modify glibc to check for /proc/cpuinfo and if available, use that information in place of CPUID. > > So the reaming question is whether the CPU actually needs the large XSAVEC > area when running with clearcpuid=304 or not, and if it does not, how can we > adjusted the size computations in the kernel and glibc. This could also be > something that would have to be tweaked in firmware/microcode. I still > think this would be a useful change for enhanced backwards compatibility. Agreed ... there has to be a better way to do this. Could you put together a version of glibc that checks for /proc/cpuinfo? Let's see if that works and we can see how much of a stack savings there is. I can push this info on either RHKL or LKML to see if anyone has a better idea. P. (In reply to Prarit Bhargava from comment #9) > > Because /proc isn't always mounted. The kernel is supposed to provide this > > information in the auxiliary vector in hwcap flags, but for most features, > > we don't do this on x86 because CPUID (and later XGETBV) is the interface > > for obtaining this information. > > :( but the problem (as noted) is that those commands return what the > hardware supports and not what the kernel supports. I think the kernel writes what it supports to XCR0, which userspace can read using XGETBV with %ecx == 0. But these XCR0 flags do not affect the XSAVE state side reported by CPUID. > One thought here is to modify glibc to check for /proc/cpuinfo and if > available, use that information in place of CPUID. Not going to happen, it does not make sense. We need to adhere to what the CPU says is its save area size. The kernel may not know about all corresponding CPU feature flags. However, it could be that the XSAVEC area size computation in glibc (and the kernel) is buggy and the XSAVE area size for the active XCR0 setting is actually *smaller* than the computed XSAVEC size. I need to get another AVX-512-enabled machine and verify if this is the case. Does the kernel use XSAVE or XSAVEC for context switches? After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened. |
Created attachment 1478271 [details] tst-minsigstksz.c (compile with -std=gnu99) One of the compatibility problems with AVX-512 is that the signal frame size and the dynamic linker trampoline stack usage is much higher than in previous CPU generations. I had hoped that passing clearcpuid=304 would address these compatibility issues by disabling AVX-512 and returning to the original size requirements. However, this is not the case. glibc sees the same size requirement with and without the kernel option. This can be verified as follows, after “debuginfo-install glibc”: (gdb) print _rtld_global_ro._dl_x86_cpu_features.xsave_state_size $1 = 2560 Furthermore, the attached test case (tst-minsigstksz.c) from the glibc upstream bug about signal stack sizes still shows modification below the configured stack: tst-minsigstksz: changed byte 1396 bytes below configured stack It would be really, really desirable if the large context save region size could be avoided with a suitable kernel option. Seen with kernel-3.10.0-862.11.6.el7.x86_64 on this CPU: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 85 model name : Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz stepping : 4 microcode : 0x200004d cpu MHz : 2100.000 cache size : 30976 KB physical id : 0 siblings : 44 core id : 0 cpu cores : 22 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 22 The glibc code which computes the save area size looks like this: For _dl_runtime_resolve, set xsave_state_size to xsave area size + integer register save size and align it to 64 bytes. */ (cpu_features->max_cpuid >= 0xd) { unsigned int eax, ebx, ecx, edx; __cpuid_count (0xd, 0, eax, ebx, ecx, edx); if (ebx != 0) { cpu_features->xsave_state_size = ALIGN_UP (ebx + STATE_SAVE_OFFSET, 64); __cpuid_count (0xd, 1, eax, ebx, ecx, edx); /* Check if XSAVEC is available. */ if ((eax & (1 << 1)) != 0) { unsigned int xstate_comp_offsets[32]; unsigned int xstate_comp_sizes[32]; unsigned int i; xstate_comp_offsets[0] = 0; xstate_comp_offsets[1] = 160; xstate_comp_offsets[2] = 576; xstate_comp_sizes[0] = 160; xstate_comp_sizes[1] = 256; for (i = 2; i < 32; i++) { if ((STATE_SAVE_MASK & (1 << i)) != 0) { __cpuid_count (0xd, i, eax, ebx, ecx, edx); xstate_comp_sizes[i] = eax; } else { ecx = 0; xstate_comp_sizes[i] = 0; } if (i > 2) { xstate_comp_offsets[i] = (xstate_comp_offsets[i - 1] + xstate_comp_sizes[i -1]); if ((ecx & (1 << 1)) != 0) xstate_comp_offsets[i] = ALIGN_UP (xstate_comp_offsets[i], 64); } } /* Use XSAVEC. */ unsigned int size = xstate_comp_offsets[31] + xstate_comp_sizes[31]; if (size) { cpu_features->xsave_state_size = ALIGN_UP (size + STATE_SAVE_OFFSET, 64); cpu_features->feature[index_XSAVEC_Usable] |= bit_XSAVEC_Usable; } } } } (From sysdeps/x86/cpu-featers.c.)