Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1621434

Summary: kernel: clearcpuid=304 does not reduce XSAVE region size on Xeon Gold 6152 CPU
Product: Red Hat Enterprise Linux 7 Reporter: Florian Weimer <fweimer>
Component: kernelAssignee: Prarit Bhargava <prarit>
kernel sub component: x86_64 QA Contact: Vilém Maršík <vmarsik>
Status: CLOSED WONTFIX Docs Contact:
Severity: medium    
Priority: medium CC: fweimer, iloginovskikh
Version: 7.7Keywords: Reopened
Target Milestone: rc   
Target Release: 7.8   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-15 07:41:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1707052    
Attachments:
Description Flags
tst-minsigstksz.c (compile with -std=gnu99) none

Description Florian Weimer 2018-08-23 15:50:57 UTC
Created attachment 1478271 [details]
tst-minsigstksz.c (compile with -std=gnu99)

One of the compatibility problems with AVX-512 is that the signal frame size and the dynamic linker trampoline stack usage is much higher than in previous CPU generations.

I had hoped that passing clearcpuid=304 would address these compatibility issues by disabling AVX-512 and returning to the original size requirements.  However, this is not the case.  glibc sees the same size requirement with and without the kernel option.  This can be verified as follows, after “debuginfo-install glibc”:

(gdb) print _rtld_global_ro._dl_x86_cpu_features.xsave_state_size
$1 = 2560

Furthermore, the attached test case (tst-minsigstksz.c) from the glibc upstream bug about signal stack sizes still shows modification below the configured stack:

tst-minsigstksz: changed byte 1396 bytes below configured stack

It would be really, really desirable if the large context save region size could be avoided with a suitable kernel option.

Seen with kernel-3.10.0-862.11.6.el7.x86_64 on this CPU:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x200004d
cpu MHz         : 2100.000
cache size      : 30976 KB
physical id     : 0
siblings        : 44
core id         : 0
cpu cores       : 22
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22

The glibc code which computes the save area size looks like this:

 For _dl_runtime_resolve, set xsave_state_size to xsave area
 size + integer register save size and align it to 64 bytes.  */
 (cpu_features->max_cpuid >= 0xd)
{
  unsigned int eax, ebx, ecx, edx;

  __cpuid_count (0xd, 0, eax, ebx, ecx, edx);
  if (ebx != 0)
    {
      cpu_features->xsave_state_size
       = ALIGN_UP (ebx + STATE_SAVE_OFFSET, 64);

      __cpuid_count (0xd, 1, eax, ebx, ecx, edx);

      /* Check if XSAVEC is available.  */
      if ((eax & (1 << 1)) != 0)
	{
	  unsigned int xstate_comp_offsets[32];
	  unsigned int xstate_comp_sizes[32];
	  unsigned int i;

	  xstate_comp_offsets[0] = 0;
	  xstate_comp_offsets[1] = 160;
	  xstate_comp_offsets[2] = 576;
	  xstate_comp_sizes[0] = 160;
	  xstate_comp_sizes[1] = 256;

	  for (i = 2; i < 32; i++)
	    {
	      if ((STATE_SAVE_MASK & (1 << i)) != 0)
		{
		  __cpuid_count (0xd, i, eax, ebx, ecx, edx);
		  xstate_comp_sizes[i] = eax;
		}
	      else
		{
		  ecx = 0;
		  xstate_comp_sizes[i] = 0;
		}

	      if (i > 2)
		{
		  xstate_comp_offsets[i]
		   = (xstate_comp_offsets[i - 1]
		      + xstate_comp_sizes[i -1]);
		  if ((ecx & (1 << 1)) != 0)
		    xstate_comp_offsets[i]
		     = ALIGN_UP (xstate_comp_offsets[i], 64);
		}
	    }

	  /* Use XSAVEC.  */
	  unsigned int size
	   = xstate_comp_offsets[31] + xstate_comp_sizes[31];
	  if (size)
	    {
	      cpu_features->xsave_state_size
	       = ALIGN_UP (size + STATE_SAVE_OFFSET, 64);
	      cpu_features->feature[index_XSAVEC_Usable]
	       |= bit_XSAVEC_Usable;
	    }
	}
    }
}

(From sysdeps/x86/cpu-featers.c.)

Comment 1 Prarit Bhargava 2018-09-04 13:51:14 UTC
Compiling the test program on RHEL7.6 latest results in

[09:47 AM root@hpe-xl270gen10-01 ~]# gcc -o tst-minisigtksz tst-minisigtksz.c 
tst-minisigtksz.c: In function ‘handler’:
tst-minisigtksz.c:14:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (size_t i = 0; i < sizeof (buffer); ++i)
   ^
tst-minisigtksz.c:14:3: note: use option -std=c99 or -std=gnu99 to compile your code
tst-minisigtksz.c: In function ‘main’:
tst-minisigtksz.c:53:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (void *p = stack_buffer; p < stack_bottom; ++p)
   ^
tst-minisigtksz.c:57:14: error: redefinition of ‘p’
   for (void *p = stack_top; p < stack_end; ++p)
              ^
tst-minisigtksz.c:53:14: note: previous definition of ‘p’ was here
   for (void *p = stack_buffer; p < stack_bottom; ++p)
              ^
tst-minisigtksz.c:57:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (void *p = stack_top; p < stack_end; ++p)
   ^

I could modify the test program but want to make sure I'm not doing something wrong ...

P.

Comment 2 Florian Weimer 2018-09-04 13:53:14 UTC
(In reply to Prarit Bhargava from comment #1)
> Compiling the test program on RHEL7.6 latest results in
> 
> [09:47 AM root@hpe-xl270gen10-01 ~]# gcc -o tst-minisigtksz
> tst-minisigtksz.c 
> tst-minisigtksz.c: In function ‘handler’:
> tst-minisigtksz.c:14:3: error: ‘for’ loop initial declarations are only
> allowed in C99 mode
>    for (size_t i = 0; i < sizeof (buffer); ++i)
>    ^
> tst-minisigtksz.c:14:3: note: use option -std=c99 or -std=gnu99 to compile
> your code
> tst-minisigtksz.c: In function ‘main’:
> tst-minisigtksz.c:53:3: error: ‘for’ loop initial declarations are only
> allowed in C99 mode
>    for (void *p = stack_buffer; p < stack_bottom; ++p)
>    ^

Sorry, you'll have to compile with -std=gnu99.

Comment 3 Prarit Bhargava 2018-09-04 14:31:38 UTC
fweimer, I just want to make sure I have the request/bug correct. 

You want "clearcpuid=304" to disable AVX-512 on the *HARDWARE*, not just the kernel.  

ie) As a result of "clearcpuid=304" you want cpuid instructions such as

 __cpuid_count (0xd, 0, eax, ebx, ecx, edx);

to return, for example, as if AVX-512 was NOT enabled on the processor.

Is that correct?

P.

Comment 4 Florian Weimer 2018-09-04 15:31:30 UTC
(In reply to Prarit Bhargava from comment #3)
> fweimer, I just want to make sure I have the request/bug correct. 
> 
> You want "clearcpuid=304" to disable AVX-512 on the *HARDWARE*, not just the
> kernel.  
> 
> ie) As a result of "clearcpuid=304" you want cpuid instructions such as
> 
>  __cpuid_count (0xd, 0, eax, ebx, ecx, edx);
> 
> to return, for example, as if AVX-512 was NOT enabled on the processor.

I think clearcpuid=304 already does that, altering the CPUID flags, so that applications will not see that AVX-512 is available.

What it does not seem to do is reduce the save area size for XSAVE.  It should be much smaller because there should never be any AVX-512 registers to save.  This is necessary for full compatibility with applications which have very tight (signal) stacks, particularly 32-bit applications.

Comment 5 Prarit Bhargava 2018-09-04 17:53:24 UTC
(In reply to Florian Weimer from comment #4)
> (In reply to Prarit Bhargava from comment #3)
> > fweimer, I just want to make sure I have the request/bug correct. 
> > 
> > You want "clearcpuid=304" to disable AVX-512 on the *HARDWARE*, not just the
> > kernel.  
> > 
> > ie) As a result of "clearcpuid=304" you want cpuid instructions such as
> > 
> >  __cpuid_count (0xd, 0, eax, ebx, ecx, edx);
> > 
> > to return, for example, as if AVX-512 was NOT enabled on the processor.
> 
> I think clearcpuid=304 already does that, altering the CPUID flags, so that
> applications will not see that AVX-512 is available.

Well ... yes and no.  An application that queries /proc/cpuinfo (lscpu for example) will see the avx* flags missing.  An application that calls cpuid to read the processor flags directly will still see the avx flags because clearcpuid does not directly affect the hardware.

> 
> What it does not seem to do is reduce the save area size for XSAVE.  It
> should be much smaller because there should never be any AVX-512 registers
> to save.  This is necessary for full compatibility with applications which
> have very tight (signal) stacks, particularly 32-bit applications.

I'm looking at the glibc code you pasted above:

 For _dl_runtime_resolve, set xsave_state_size to xsave area
 size + integer register save size and align it to 64 bytes.  */
 (cpu_features->max_cpuid >= 0xd)
{
  unsigned int eax, ebx, ecx, edx;

  __cpuid_count (0xd, 0, eax, ebx, ecx, edx);  <<< PRARIT: cpuid reads from HW.
  if (ebx != 0)
    {
      cpu_features->xsave_state_size
       = ALIGN_UP (ebx + STATE_SAVE_OFFSET, 64);

      __cpuid_count (0xd, 1, eax, ebx, ecx, edx);

      /* Check if XSAVEC is available.  */
      if ((eax & (1 << 1)) != 0)
	{
	  unsigned int xstate_comp_offsets[32];
	  unsigned int xstate_comp_sizes[32];
	  unsigned int i;

	  xstate_comp_offsets[0] = 0;
<snip>

and I'm wondering if the problem is that glibc is checking if the CPU supports XSAVE rather than the OS (via /proc/cpuinfo)?

Or am I not "getting" the bug. :)

P.

Comment 6 Florian Weimer 2018-09-04 18:59:06 UTC
Okay, then I misunderstand what the clearcpuid=304 feature was about.  Before we can talk about XSAVE size, we need a way to disable AVX-512 for real.  Based on the original description of the bug that added clearcpuid=, I assumed it was doing just that, but I can see that I was wrong about that.

In any case, I can see now that this bug report doesn't make sense.

Comment 7 Prarit Bhargava 2018-09-05 12:39:25 UTC
(In reply to Florian Weimer from comment #6)
> Okay, then I misunderstand what the clearcpuid=304 feature was about. 
> Before we can talk about XSAVE size, we need a way to disable AVX-512 for
> real.  Based on the original description of the bug that added clearcpuid=,
> I assumed it was doing just that, but I can see that I was wrong about that.
> 
> In any case, I can see now that this bug report doesn't make sense.

Well ... yes and no.

I think you have a valid case here and let's see if there is any other option.

/me is "dumb" about glibc <<< just a warning :)

This has been a misconception about CPU features and flags for a long time.  CPU flags only indicate that a feature is PRESENT on a CPU, but not that the OS supports the feature.

For example, a CPU could have smt (aka hyperthreading) but the OS could disable the non-primary threads.  If a program read the appropriate cpuid, it could erroneously determine (as your code does above) that smt is enabled.

It gets worse though, as it could be that the OS ignores the smt flag and just brings up the primary threads.

In this case, however, I think you can make a better judgement about avx support by looking at /proc/cpuinfo.  OOC why doesn't the glibc code do that?

P.

Comment 8 Florian Weimer 2018-09-05 13:25:18 UTC
(In reply to Prarit Bhargava from comment #7)

> This has been a misconception about CPU features and flags for a long time. 
> CPU flags only indicate that a feature is PRESENT on a CPU, but not that the
> OS supports the feature.

So what seems to happen here is that clearcpuid=304 does not simply alter the /proc/cpuinfo output, but also withdraws %zmm register support or something like that.  Userspace correctly detects that with the Intel-recommended AVX-512 support check, with fails, and executing AVX-512 instructions on %zmm registers faults.  So AVX-512 support is in fact removed from userspace.

This matches the original description in the bug that added the kernel command line argument, but it does not quite agree with the documentation added in the patch.
 
> In this case, however, I think you can make a better judgement about avx
> support by looking at /proc/cpuinfo.  OOC why doesn't the glibc code do that?

Because /proc isn't always mounted.  The kernel is supposed to provide this information in the auxiliary vector in hwcap flags, but for most features, we don't do this on x86 because CPUID (and later XGETBV) is the interface for obtaining this information.

So the reaming question is whether the CPU actually needs the large XSAVEC area when running with clearcpuid=304 or not, and if it does not, how can we adjusted the size computations in the kernel and glibc.  This could also be something that would have to be tweaked in firmware/microcode.  I still think this would be a useful change for enhanced backwards compatibility.

Comment 9 Prarit Bhargava 2018-09-06 14:12:51 UTC
(In reply to Florian Weimer from comment #8)
> (In reply to Prarit Bhargava from comment #7)
> 
> > This has been a misconception about CPU features and flags for a long time. 
> > CPU flags only indicate that a feature is PRESENT on a CPU, but not that the
> > OS supports the feature.
> 
> So what seems to happen here is that clearcpuid=304 does not simply alter
> the /proc/cpuinfo output, but also withdraws %zmm register support or
> something like that.  Userspace correctly detects that with the
> Intel-recommended AVX-512 support check, with fails, and executing AVX-512
> instructions on %zmm registers faults.  So AVX-512 support is in fact
> removed from userspace.
> 

Certainly that's true.  I didn't mean to indicate that /proc/cpuinfo output is the only thing affected.  The kernel has certainly dropped AVX-512 support at that point, however, it does not change the hardware reporting AVX-512 support.

> This matches the original description in the bug that added the kernel
> command line argument, but it does not quite agree with the documentation
> added in the patch.
>  
> > In this case, however, I think you can make a better judgement about avx
> > support by looking at /proc/cpuinfo.  OOC why doesn't the glibc code do that?
> 
> Because /proc isn't always mounted.  The kernel is supposed to provide this
> information in the auxiliary vector in hwcap flags, but for most features,
> we don't do this on x86 because CPUID (and later XGETBV) is the interface
> for obtaining this information.

:( but the problem (as noted) is that those commands return what the hardware supports and not what the kernel supports.

One thought here is to modify glibc to check for /proc/cpuinfo and if available, use that information in place of CPUID.

> 
> So the reaming question is whether the CPU actually needs the large XSAVEC
> area when running with clearcpuid=304 or not, and if it does not, how can we
> adjusted the size computations in the kernel and glibc.  This could also be
> something that would have to be tweaked in firmware/microcode.  I still
> think this would be a useful change for enhanced backwards compatibility.

Agreed ... there has to be a better way to do this.

Could you put together a version of glibc that checks for /proc/cpuinfo?  Let's see if that works and we can see how much of a stack savings there is.  I can push this info on either RHKL or LKML to see if anyone has a better idea.

P.

Comment 10 Florian Weimer 2018-09-11 18:40:24 UTC
(In reply to Prarit Bhargava from comment #9)
> > Because /proc isn't always mounted.  The kernel is supposed to provide this
> > information in the auxiliary vector in hwcap flags, but for most features,
> > we don't do this on x86 because CPUID (and later XGETBV) is the interface
> > for obtaining this information.
> 
> :( but the problem (as noted) is that those commands return what the
> hardware supports and not what the kernel supports.

I think the kernel writes what it supports to XCR0, which userspace can read using XGETBV with %ecx == 0.  But these XCR0 flags do not affect the XSAVE state side reported by CPUID.

> One thought here is to modify glibc to check for /proc/cpuinfo and if
> available, use that information in place of CPUID.

Not going to happen, it does not make sense.  We need to adhere to what the CPU says is its save area size.  The kernel may not know about all corresponding CPU feature flags.

However, it could be that the XSAVEC area size computation in glibc (and the kernel) is buggy and the XSAVE area size for the active XCR0 setting is actually *smaller* than the computed XSAVEC size.  I need to get another AVX-512-enabled machine and verify if this is the case.

Does the kernel use XSAVE or XSAVEC for context switches?

Comment 13 RHEL Program Management 2021-02-15 07:41:54 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.