Full backports of the following patches:

commit b97eb2bdb1ed72982a7821c3078be591051cef59
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 16 14:58:43 2015 -0700

    Preserve bound registers in _dl_runtime_resolve

    We need to add a BND prefix before the indirect branch at the end of
    _dl_runtime_resolve to preserve bound registers.

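(Illustration only, not part of the patch: with assembler MPX support the
prefix is written as the "bnd" mnemonic; without it the same byte is emitted
by hand, so the trampoline's final indirect branch is one of

    bnd jmp *%r11		# keeps %bnd0-%bnd3 across the branch
    .byte 0xf2; jmp *%r11	# raw-byte spelling of the same prefix

On hardware without MPX the 0xf2 prefix is ignored as a no-op.)
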
commit ddd85a65b6e3d6ec1e756c1f78559f99a2c943ca
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Jul 7 05:23:24 2015 -0700

    Add and use sysdeps/i386/link-defines.sym

    Define macros for fields in La_i86_regs and La_i86_retval and use them
    in dl-trampoline.S, instead of hardcoded values.

commit 14c5cbabc2d11004ab223ae5eae761ddf83ef99e
Author: Igor Zamyatin <igor.zamyatin@intel.com>
Date:   Thu Jul 9 06:50:12 2015 -0700

    Preserve bound registers for pointer pass/return

    We need to save/restore bound registers and add a BND prefix before
    branches in _dl_runtime_profile so that bound registers for pointer
    pass and return are preserved when LD_AUDIT is used.

commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Aug 25 04:33:54 2015 -0700

    Save and restore vector registers in x86-64 ld.so

    This patch adds SSE, AVX and AVX512 versions of _dl_runtime_resolve
    and _dl_runtime_profile, which save and restore the first 8 vector
    registers used for parameter passing.  elf_machine_runtime_setup
    selects the proper _dl_runtime_resolve or _dl_runtime_profile based
    on _dl_x86_cpu_features.  It avoids a race condition caused by the
    FOREIGN_CALL macros, which are only used for x86-64.

    The performance impact of saving and restoring 8 vector registers is
    negligible on Nehalem, Sandy Bridge, Ivy Bridge and Haswell when
    ld.so is optimized with SSE2.

commit fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Sep 6 08:50:55 2016 -0700

    X86-64: Add _dl_runtime_resolve_avx[512]_{opt|slow} [BZ #20508]

    There is a transition penalty when SSE instructions are mixed with 256-bit
    AVX or 512-bit AVX512 load instructions.  Since _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
    registers, there is a transition penalty when SSE instructions are used
    with lazy binding on AVX and AVX512 processors.

    To avoid the SSE transition penalty, if only the lower 128 bits of the
    first 8 vector registers are non-zero, we can preserve %xmm0 - %xmm7
    registers with the zero upper bits.

    For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
    use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
    or the upper 256 bits of ZMM registers are zero.  We can restore only the
    non-zero portion of vector registers with AVX/AVX512 load instructions
    which will zero-extend the upper bits of vector registers.

    This patch adds _dl_runtime_resolve_sse_vex which saves and restores
    XMM registers with 128-bit AVX store/load instructions.  It is used to
    preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
    _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
    and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
    that we store and load only the non-zero portion of vector registers.
    This avoids the SSE transition penalty caused by _dl_runtime_resolve_avx
    and _dl_runtime_profile_avx512 when only the lower 128 bits of vector
    registers are used.

    _dl_runtime_resolve_avx_slow is added and used for AVX processors which
    don't support XGETBV with ECX == 1.  Since there is no SSE transition
    penalty on AVX512 processors which don't support XGETBV with ECX == 1,
    _dl_runtime_resolve_avx512_slow isn't provided.

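(Sketch of the XGETBV-with-ECX == 1 test described above; the branch target
shown is illustrative, the real implementation is in dl-trampoline.h below:

    movl	$1, %ecx		# select the XINUSE bitmap
    xgetbv				# EAX = in-use state components
    andl	$bit_YMM_state, %eax	# upper halves of %ymm registers in use?
    jz	_dl_runtime_resolve_sse_vex	# zero upper bits: 128-bit moves suffice

When the YMM/ZMM in-use bits are clear, the upper register bits are
guaranteed zero, so 128-bit AVX loads/stores preserve the full vector
registers while avoiding the SSE transition penalty.)
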
commit 3403a17fea8ccef7dc5f99553a13231acf838744
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Feb 9 12:19:44 2017 -0800

    x86-64: Improve branch prediction in _dl_runtime_resolve_avx512_opt [BZ #21258]

    On Skylake server, _dl_runtime_resolve_avx512_opt is used to preserve
    the first 8 vector registers.  The code layout is

      if only %xmm0 - %xmm7 registers are used
         preserve %xmm0 - %xmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %zmm0 - %zmm7 registers

    Branch prediction always executes the fallthrough code path to preserve
    %zmm0 - %zmm7 registers speculatively, even though only %xmm0 - %xmm7
    registers are used.  This leads to lower CPU frequency on Skylake
    server.  This patch changes the fallthrough code path to preserve
    %xmm0 - %xmm7 registers instead:

      if whole %zmm0 - %zmm7 registers are used
        preserve %zmm0 - %zmm7 registers
      if only %ymm0 - %ymm7 registers are used
         preserve %ymm0 - %ymm7 registers
      preserve %xmm0 - %xmm7 registers

    Tested on Skylake server.

            [BZ #21258]
            * sysdeps/x86_64/dl-trampoline.S (_dl_runtime_resolve_opt):
            Define only if _dl_runtime_resolve is defined to
            _dl_runtime_resolve_sse_vex.
            * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_opt):
            Fallthrough to _dl_runtime_resolve_sse_vex.

Index: glibc-2.17-c758a686/nptl/sysdeps/x86_64/tcb-offsets.sym
===================================================================
--- glibc-2.17-c758a686.orig/nptl/sysdeps/x86_64/tcb-offsets.sym
+++ glibc-2.17-c758a686/nptl/sysdeps/x86_64/tcb-offsets.sym
@@ -15,7 +15,6 @@ VGETCPU_CACHE_OFFSET	offsetof (tcbhead_t
 #ifndef __ASSUME_PRIVATE_FUTEX
 PRIVATE_FUTEX		offsetof (tcbhead_t, private_futex)
 #endif
-RTLD_SAVESPACE_SSE	offsetof (tcbhead_t, rtld_savespace_sse)
 
 -- Not strictly offsets, but these values are also used in the TCB.
 TCB_CANCELSTATE_BITMASK	 CANCELSTATE_BITMASK
Index: glibc-2.17-c758a686/nptl/sysdeps/x86_64/tls.h
===================================================================
--- glibc-2.17-c758a686.orig/nptl/sysdeps/x86_64/tls.h
+++ glibc-2.17-c758a686/nptl/sysdeps/x86_64/tls.h
@@ -67,12 +67,13 @@ typedef struct
 # else
   int __unused1;
 # endif
-  int rtld_must_xmm_save;
+  int __glibc_unused1;
   /* Reservation of some values for the TM ABI.  */
   void *__private_tm[5];
   long int __unused2;
-  /* Have space for the post-AVX register size.  */
-  __128bits rtld_savespace_sse[8][4] __attribute__ ((aligned (32)));
+  /* Must be kept even if it is no longer used by glibc since programs,
+     like AddressSanitizer, depend on the size of tcbhead_t.  */
+  __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));
 
   void *__padding[8];
 } tcbhead_t;
@@ -380,41 +381,6 @@ typedef struct
 # define THREAD_GSCOPE_WAIT() \
   GL(dl_wait_lookup_done) ()
 
-
-# ifdef SHARED
-/* Defined in dl-trampoline.S.  */
-extern void _dl_x86_64_save_sse (void);
-extern void _dl_x86_64_restore_sse (void);
-
-# define RTLD_CHECK_FOREIGN_CALL \
-  (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) != 0)
-
-/* NB: Don't use the xchg operation because that would imply a lock
-   prefix which is expensive and unnecessary.  The cache line is also
-   not contested at all.  */
-#  define RTLD_ENABLE_FOREIGN_CALL \
-  int old_rtld_must_xmm_save = THREAD_GETMEM (THREAD_SELF,		      \
-					      header.rtld_must_xmm_save);     \
-  THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 1)
-
-#  define RTLD_PREPARE_FOREIGN_CALL \
-  do if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save))	      \
-    {									      \
-      _dl_x86_64_save_sse ();						      \
-      THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 0);	      \
-    }									      \
-  while (0)
-
-#  define RTLD_FINALIZE_FOREIGN_CALL \
-  do {									      \
-    if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) == 0)	      \
-      _dl_x86_64_restore_sse ();					      \
-    THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save,		      \
-		   old_rtld_must_xmm_save);				      \
-  } while (0)
-# endif
-
-
 #endif /* __ASSEMBLER__ */
 
 #endif	/* tls.h */
Index: glibc-2.17-c758a686/sysdeps/i386/Makefile
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/Makefile
+++ glibc-2.17-c758a686/sysdeps/i386/Makefile
@@ -33,6 +33,7 @@ sysdep-CFLAGS += -mpreferred-stack-bound
 else
 ifeq ($(subdir),csu)
 sysdep-CFLAGS += -mpreferred-stack-boundary=4
+gen-as-const-headers += link-defines.sym
 else
 # Likewise, any function which calls user callbacks
 uses-callbacks += -mpreferred-stack-boundary=4
Index: glibc-2.17-c758a686/sysdeps/i386/configure
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/configure
+++ glibc-2.17-c758a686/sysdeps/i386/configure
@@ -179,5 +179,32 @@ fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_cc_novzeroupper" >&5
 $as_echo "$libc_cv_cc_novzeroupper" >&6; }
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for Intel MPX support" >&5
+$as_echo_n "checking for Intel MPX support... " >&6; }
+if ${libc_cv_asm_mpx+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat > conftest.s <<\EOF
+        bndmov %bnd0,(%esp)
+EOF
+if { ac_try='${CC-cc} -c $ASFLAGS conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  libc_cv_asm_mpx=yes
+else
+  libc_cv_asm_mpx=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_mpx" >&5
+$as_echo "$libc_cv_asm_mpx" >&6; }
+if test $libc_cv_asm_mpx == yes; then
+  $as_echo "#define HAVE_MPX_SUPPORT 1" >>confdefs.h
+
+fi
+
 $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h
 
Index: glibc-2.17-c758a686/sysdeps/i386/configure.in
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/configure.in
+++ glibc-2.17-c758a686/sysdeps/i386/configure.in
@@ -53,6 +53,21 @@ LIBC_TRY_CC_OPTION([-mno-vzeroupper],
 		   [libc_cv_cc_novzeroupper=no])
 ])
 
+dnl Check whether asm supports Intel MPX
+AC_CACHE_CHECK(for Intel MPX support, libc_cv_asm_mpx, [dnl
+cat > conftest.s <<\EOF
+        bndmov %bnd0,(%esp)
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c $ASFLAGS conftest.s 1>&AS_MESSAGE_LOG_FD); then
+  libc_cv_asm_mpx=yes
+else
+  libc_cv_asm_mpx=no
+fi
+rm -f conftest*])
+if test $libc_cv_asm_mpx == yes; then
+  AC_DEFINE(HAVE_MPX_SUPPORT)
+fi
+
 dnl It is always possible to access static and hidden symbols in an
 dnl position independent way.
 AC_DEFINE(PI_STATIC_AND_HIDDEN)
Index: glibc-2.17-c758a686/sysdeps/i386/dl-trampoline.S
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/i386/dl-trampoline.S
+++ glibc-2.17-c758a686/sysdeps/i386/dl-trampoline.S
@@ -17,6 +17,13 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <sysdep.h>
+#include <link-defines.h>
+
+#ifdef HAVE_MPX_SUPPORT
+# define PRESERVE_BND_REGS_PREFIX bnd
+#else
+# define PRESERVE_BND_REGS_PREFIX .byte 0xf2
+#endif
 
 	.text
 	.globl _dl_runtime_resolve
@@ -161,24 +168,47 @@ _dl_runtime_profile:
 	    +4      free
 	   %esp     free
 	*/
-	subl $20, %esp
-	cfi_adjust_cfa_offset (20)
-	movl %eax, (%esp)
-	movl %edx, 4(%esp)
-	fstpt 8(%esp)
-	fstpt 20(%esp)
+#if LONG_DOUBLE_SIZE != 12
+# error "long double size must be 12 bytes"
+#endif
+	# Allocate space for La_i86_retval and subtract 12 free bytes.
+	subl $(LRV_SIZE - 12), %esp
+	cfi_adjust_cfa_offset (LRV_SIZE - 12)
+	movl %eax, LRV_EAX_OFFSET(%esp)
+	movl %edx, LRV_EDX_OFFSET(%esp)
+	fstpt LRV_ST0_OFFSET(%esp)
+	fstpt LRV_ST1_OFFSET(%esp)
+#ifdef HAVE_MPX_SUPPORT
+	bndmov %bnd0, LRV_BND0_OFFSET(%esp)
+	bndmov %bnd1, LRV_BND1_OFFSET(%esp)
+#else
+	.byte 0x66,0x0f,0x1b,0x44,0x24,LRV_BND0_OFFSET
+	.byte 0x66,0x0f,0x1b,0x4c,0x24,LRV_BND1_OFFSET
+#endif
 	pushl %esp
 	cfi_adjust_cfa_offset (4)
-	leal 36(%esp), %ecx
-	movl 56(%esp), %eax
-	movl 60(%esp), %edx
+	# Address of La_i86_regs area.
+	leal (LRV_SIZE + 4)(%esp), %ecx
+	# PLT2
+	movl (LRV_SIZE + 4 + LR_SIZE)(%esp), %eax
+	# PLT1
+	movl (LRV_SIZE + 4 + LR_SIZE + 4)(%esp), %edx
 	call _dl_call_pltexit
-	movl (%esp), %eax
-	movl 4(%esp), %edx
-	fldt 20(%esp)
-	fldt 8(%esp)
-	addl $60, %esp
-	cfi_adjust_cfa_offset (-60)
+	movl LRV_EAX_OFFSET(%esp), %eax
+	movl LRV_EDX_OFFSET(%esp), %edx
+	fldt LRV_ST1_OFFSET(%esp)
+	fldt LRV_ST0_OFFSET(%esp)
+#ifdef HAVE_MPX_SUPPORT
+	bndmov LRV_BND0_OFFSET(%esp), %bnd0
+	bndmov LRV_BND1_OFFSET(%esp), %bnd1
+#else
+	.byte 0x66,0x0f,0x1a,0x44,0x24,LRV_BND0_OFFSET
+	.byte 0x66,0x0f,0x1a,0x4c,0x24,LRV_BND1_OFFSET
+#endif
+	# Restore stack before return.
+	addl $(LRV_SIZE + 4 + LR_SIZE + 4), %esp
+	cfi_adjust_cfa_offset (-(LRV_SIZE + 4 + LR_SIZE + 4))
+	PRESERVE_BND_REGS_PREFIX
 	ret
 	cfi_endproc
 	.size _dl_runtime_profile, .-_dl_runtime_profile
Index: glibc-2.17-c758a686/sysdeps/i386/link-defines.sym
===================================================================
--- /dev/null
+++ glibc-2.17-c758a686/sysdeps/i386/link-defines.sym
@@ -0,0 +1,20 @@
+#include "link.h"
+#include <stddef.h>
+
+--
+LONG_DOUBLE_SIZE	sizeof (long double)
+
+LR_SIZE			sizeof (struct La_i86_regs)
+LR_EDX_OFFSET		offsetof (struct La_i86_regs, lr_edx)
+LR_ECX_OFFSET		offsetof (struct La_i86_regs, lr_ecx)
+LR_EAX_OFFSET		offsetof (struct La_i86_regs, lr_eax)
+LR_EBP_OFFSET		offsetof (struct La_i86_regs, lr_ebp)
+LR_ESP_OFFSET		offsetof (struct La_i86_regs, lr_esp)
+
+LRV_SIZE		sizeof (struct La_i86_retval)
+LRV_EAX_OFFSET		offsetof (struct La_i86_retval, lrv_eax)
+LRV_EDX_OFFSET		offsetof (struct La_i86_retval, lrv_edx)
+LRV_ST0_OFFSET		offsetof (struct La_i86_retval, lrv_st0)
+LRV_ST1_OFFSET		offsetof (struct La_i86_retval, lrv_st1)
+LRV_BND0_OFFSET		offsetof (struct La_i86_retval, lrv_bnd0)
+LRV_BND1_OFFSET		offsetof (struct La_i86_retval, lrv_bnd1)
Index: glibc-2.17-c758a686/sysdeps/x86/bits/link.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86/bits/link.h
+++ glibc-2.17-c758a686/sysdeps/x86/bits/link.h
@@ -38,6 +38,8 @@ typedef struct La_i86_retval
   uint32_t lrv_edx;
   long double lrv_st0;
   long double lrv_st1;
+  uint64_t lrv_bnd0;
+  uint64_t lrv_bnd1;
 } La_i86_retval;
 
 
Index: glibc-2.17-c758a686/sysdeps/x86/cpu-features.c
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86/cpu-features.c
+++ glibc-2.17-c758a686/sysdeps/x86/cpu-features.c
@@ -130,6 +130,20 @@ init_cpu_features (struct cpu_features *
 	      break;
 	    }
 	}
+
+      /* To avoid SSE transition penalty, use _dl_runtime_resolve_slow.
+         If XGETBV supports ECX == 1, use _dl_runtime_resolve_opt.  */
+      cpu_features->feature[index_Use_dl_runtime_resolve_slow]
+	|= bit_Use_dl_runtime_resolve_slow;
+      if (cpu_features->max_cpuid >= 0xd)
+	{
+	  unsigned int eax;
+
+	  __cpuid_count (0xd, 1, eax, ebx, ecx, edx);
+	  if ((eax & (1 << 2)) != 0)
+	    cpu_features->feature[index_Use_dl_runtime_resolve_opt]
+	      |= bit_Use_dl_runtime_resolve_opt;
+	}
     }
   /* This spells out "AuthenticAMD".  */
  else if (ebx == 0x68747541 && ecx == 0x444d4163 && edx == 0x69746e65)
Index: glibc-2.17-c758a686/sysdeps/x86/cpu-features.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86/cpu-features.h
+++ glibc-2.17-c758a686/sysdeps/x86/cpu-features.h
@@ -34,6 +34,9 @@
 #define bit_AVX512DQ_Usable		(1 << 13)
 #define bit_Prefer_MAP_32BIT_EXEC	(1 << 16)
 #define bit_Prefer_No_VZEROUPPER	(1 << 17)
+#define bit_Use_dl_runtime_resolve_opt	(1 << 20)
+#define bit_Use_dl_runtime_resolve_slow	(1 << 21)
+
 
 /* CPUID Feature flags.  */
 
@@ -95,6 +98,9 @@
 # define index_AVX512DQ_Usable		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_Prefer_MAP_32BIT_EXEC	FEATURE_INDEX_1*FEATURE_SIZE
 # define index_Prefer_No_VZEROUPPER	FEATURE_INDEX_1*FEATURE_SIZE
+# define index_Use_dl_runtime_resolve_opt FEATURE_INDEX_1*FEATURE_SIZE
+# define index_Use_dl_runtime_resolve_slow FEATURE_INDEX_1*FEATURE_SIZE
+
 
 # if defined (_LIBC) && !IS_IN (nonlib)
 #  ifdef __x86_64__
@@ -273,6 +279,8 @@ extern const struct cpu_features *__get_
 # define index_AVX512DQ_Usable		FEATURE_INDEX_1
 # define index_Prefer_MAP_32BIT_EXEC	FEATURE_INDEX_1
 # define index_Prefer_No_VZEROUPPER     FEATURE_INDEX_1
+# define index_Use_dl_runtime_resolve_opt FEATURE_INDEX_1
+# define index_Use_dl_runtime_resolve_slow FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
Index: glibc-2.17-c758a686/sysdeps/x86_64/Makefile
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/Makefile
+++ glibc-2.17-c758a686/sysdeps/x86_64/Makefile
@@ -21,6 +21,11 @@ endif
 ifeq ($(subdir),elf)
 sysdep-dl-routines += tlsdesc dl-tlsdesc
 
+tests += ifuncmain8
+modules-names += ifuncmod8
+
+$(objpfx)ifuncmain8: $(objpfx)ifuncmod8.so
+
 tests += tst-quad1 tst-quad2
 modules-names += tst-quadmod1 tst-quadmod2
 
@@ -34,18 +39,32 @@ tests-pie += $(quad-pie-test)
 $(objpfx)tst-quad1pie: $(objpfx)tst-quadmod1pie.o
 $(objpfx)tst-quad2pie: $(objpfx)tst-quadmod2pie.o
 
+tests += tst-sse tst-avx tst-avx512
+test-extras += tst-avx-aux tst-avx512-aux
+extra-test-objs += tst-avx-aux.o tst-avx512-aux.o
+
 tests += tst-audit10
-modules-names += tst-auditmod10a tst-auditmod10b
+modules-names += tst-auditmod10a tst-auditmod10b \
+		 tst-ssemod tst-avxmod tst-avx512mod
 
 $(objpfx)tst-audit10: $(objpfx)tst-auditmod10a.so
 $(objpfx)tst-audit10.out: $(objpfx)tst-auditmod10b.so
 tst-audit10-ENV = LD_AUDIT=$(objpfx)tst-auditmod10b.so
 
+$(objpfx)tst-sse: $(objpfx)tst-ssemod.so
+$(objpfx)tst-avx: $(objpfx)tst-avx-aux.o $(objpfx)tst-avxmod.so
+$(objpfx)tst-avx512: $(objpfx)tst-avx512-aux.o $(objpfx)tst-avx512mod.so
+
+CFLAGS-tst-avx-aux.c += $(AVX-CFLAGS)
+CFLAGS-tst-avxmod.c += $(AVX-CFLAGS)
+
 ifeq (yes,$(config-cflags-avx512))
 AVX512-CFLAGS = -mavx512f
 CFLAGS-tst-audit10.c += $(AVX512-CFLAGS)
 CFLAGS-tst-auditmod10a.c += $(AVX512-CFLAGS)
 CFLAGS-tst-auditmod10b.c += $(AVX512-CFLAGS)
+CFLAGS-tst-avx512-aux.c += $(AVX512-CFLAGS)
+CFLAGS-tst-avx512mod.c += $(AVX512-CFLAGS)
 endif
 endif
 
Index: glibc-2.17-c758a686/sysdeps/x86_64/dl-machine.h
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/dl-machine.h
+++ glibc-2.17-c758a686/sysdeps/x86_64/dl-machine.h
@@ -66,8 +66,15 @@ static inline int __attribute__ ((unused
 elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
 {
   Elf64_Addr *got;
-  extern void _dl_runtime_resolve (ElfW(Word)) attribute_hidden;
-  extern void _dl_runtime_profile (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_sse (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx_slow (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx_opt (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx512 (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_resolve_avx512_opt (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_profile_sse (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_profile_avx (ElfW(Word)) attribute_hidden;
+  extern void _dl_runtime_profile_avx512 (ElfW(Word)) attribute_hidden;
 
   if (l->l_info[DT_JMPREL] && lazy)
     {
@@ -95,7 +102,12 @@ elf_machine_runtime_setup (struct link_m
 	 end in this function.  */
       if (__builtin_expect (profile, 0))
 	{
-	  *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile;
+	  if (HAS_ARCH_FEATURE (AVX512F_Usable))
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_avx512;
+	  else if (HAS_ARCH_FEATURE (AVX_Usable))
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_avx;
+	  else
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_sse;
 
 	  if (GLRO(dl_profile) != NULL
 	      && _dl_name_match_p (GLRO(dl_profile), l))
@@ -104,9 +116,34 @@ elf_machine_runtime_setup (struct link_m
 	    GL(dl_profile_map) = l;
 	}
       else
-	/* This function will get called to fix up the GOT entry indicated by
-	   the offset on the stack, and then jump to the resolved address.  */
-	*(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_resolve;
+	{
+	  /* This function will get called to fix up the GOT entry
+	     indicated by the offset on the stack, and then jump to
+	     the resolved address.  */
+	  if (HAS_ARCH_FEATURE (AVX512F_Usable))
+	    {
+	      if (HAS_ARCH_FEATURE (Use_dl_runtime_resolve_opt))
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx512_opt;
+	      else
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx512;
+	    }
+	  else if (HAS_ARCH_FEATURE (AVX_Usable))
+	    {
+	      if (HAS_ARCH_FEATURE (Use_dl_runtime_resolve_opt))
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx_opt;
+	      else if (HAS_ARCH_FEATURE (Use_dl_runtime_resolve_slow))
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx_slow;
+	      else
+		*(ElfW(Addr) *) (got + 2)
+		  = (ElfW(Addr)) &_dl_runtime_resolve_avx;
+	    }
+	  else
+	    *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_resolve_sse;
+	}
     }
 
   if (l->l_info[ADDRIDX (DT_TLSDESC_GOT)] && lazy)
Index: glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.S
===================================================================
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/dl-trampoline.S
+++ glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.S
@@ -18,28 +18,52 @@
 
 #include <config.h>
 #include <sysdep.h>
+#include <cpu-features.h>
 #include <link-defines.h>
 
-#if (RTLD_SAVESPACE_SSE % 32) != 0
-# error RTLD_SAVESPACE_SSE must be aligned to 32 bytes
+#ifndef DL_STACK_ALIGNMENT
+/* Due to GCC bug:
+
+   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
+
+   __tls_get_addr may be called with 8-byte stack alignment.  Although
+   this bug has been fixed in GCC 4.9.4, 5.3 and 6, we can't assume
+   that stack will be always aligned at 16 bytes.  We use unaligned
+   16-byte move to load and store SSE registers, which has no penalty
+   on modern processors if stack is 16-byte aligned.  */
+# define DL_STACK_ALIGNMENT 8
 #endif
 
+#ifndef DL_RUNIME_UNALIGNED_VEC_SIZE
+/* The maximum size of unaligned vector load and store.  */
+# define DL_RUNIME_UNALIGNED_VEC_SIZE 16
+#endif
+
+/* True if _dl_runtime_resolve should align stack to VEC_SIZE bytes.  */
+#define DL_RUNIME_RESOLVE_REALIGN_STACK \
+  (VEC_SIZE > DL_STACK_ALIGNMENT \
+   && VEC_SIZE > DL_RUNIME_UNALIGNED_VEC_SIZE)
+
+/* Align vector register save area to 16 bytes.  */
+#define REGISTER_SAVE_VEC_OFF	0
+
 /* Area on stack to save and restore registers used for parameter
    passing when calling _dl_fixup.  */
 #ifdef __ILP32__
-/* X32 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX.  */
-# define REGISTER_SAVE_AREA	(8 * 7)
-# define REGISTER_SAVE_RAX	0
+# define REGISTER_SAVE_RAX	(REGISTER_SAVE_VEC_OFF + VEC_SIZE * 8)
+# define PRESERVE_BND_REGS_PREFIX
 #else
-/* X86-64 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX as well as BND0,
-   BND1, BND2, BND3.  */
-# define REGISTER_SAVE_AREA	(8 * 7 + 16 * 4)
 /* Align bound register save area to 16 bytes.  */
-# define REGISTER_SAVE_BND0	0
+# define REGISTER_SAVE_BND0	(REGISTER_SAVE_VEC_OFF + VEC_SIZE * 8)
 # define REGISTER_SAVE_BND1	(REGISTER_SAVE_BND0 + 16)
 # define REGISTER_SAVE_BND2	(REGISTER_SAVE_BND1 + 16)
 # define REGISTER_SAVE_BND3	(REGISTER_SAVE_BND2 + 16)
 # define REGISTER_SAVE_RAX	(REGISTER_SAVE_BND3 + 16)
+# ifdef HAVE_MPX_SUPPORT
+#  define PRESERVE_BND_REGS_PREFIX bnd
+# else
+#  define PRESERVE_BND_REGS_PREFIX .byte 0xf2
+# endif
 #endif
 #define REGISTER_SAVE_RCX	(REGISTER_SAVE_RAX + 8)
#define REGISTER_SAVE_RDX	(REGISTER_SAVE_RCX + 8)
@@ -48,376 +72,71 @@
 #define REGISTER_SAVE_R8	(REGISTER_SAVE_RDI + 8)
 #define REGISTER_SAVE_R9	(REGISTER_SAVE_R8 + 8)
 
-	.text
-	.globl _dl_runtime_resolve
-	.type _dl_runtime_resolve, @function
-	.align 16
-	cfi_startproc
-_dl_runtime_resolve:
-	cfi_adjust_cfa_offset(16) # Incorporate PLT
-	subq $REGISTER_SAVE_AREA,%rsp
-	cfi_adjust_cfa_offset(REGISTER_SAVE_AREA)
-	# Preserve registers otherwise clobbered.
-	movq %rax, REGISTER_SAVE_RAX(%rsp)
-	movq %rcx, REGISTER_SAVE_RCX(%rsp)
-	movq %rdx, REGISTER_SAVE_RDX(%rsp)
-	movq %rsi, REGISTER_SAVE_RSI(%rsp)
-	movq %rdi, REGISTER_SAVE_RDI(%rsp)
-	movq %r8, REGISTER_SAVE_R8(%rsp)
-	movq %r9, REGISTER_SAVE_R9(%rsp)
-#ifndef __ILP32__
-	# We also have to preserve bound registers.  These are nops if
-	# Intel MPX isn't available or disabled.
-# ifdef HAVE_MPX_SUPPORT
-	bndmov %bnd0, REGISTER_SAVE_BND0(%rsp)
-	bndmov %bnd1, REGISTER_SAVE_BND1(%rsp)
-	bndmov %bnd2, REGISTER_SAVE_BND2(%rsp)
-	bndmov %bnd3, REGISTER_SAVE_BND3(%rsp)
-# else
-	.byte 0x66,0x0f,0x1b,0x44,0x24,REGISTER_SAVE_BND0
-	.byte 0x66,0x0f,0x1b,0x4c,0x24,REGISTER_SAVE_BND1
-	.byte 0x66,0x0f,0x1b,0x54,0x24,REGISTER_SAVE_BND2
-	.byte 0x66,0x0f,0x1b,0x5c,0x24,REGISTER_SAVE_BND3
-# endif
-#endif
-	# Copy args pushed by PLT in register.
-	# %rdi: link_map, %rsi: reloc_index
-	movq (REGISTER_SAVE_AREA + 8)(%rsp), %rsi
-	movq REGISTER_SAVE_AREA(%rsp), %rdi
-	call _dl_fixup		# Call resolver.
-	movq %rax, %r11		# Save return value
-#ifndef __ILP32__
-	# Restore bound registers.  These are nops if Intel MPX isn't
-	# avaiable or disabled.
-# ifdef HAVE_MPX_SUPPORT
-	bndmov REGISTER_SAVE_BND3(%rsp), %bnd3
-	bndmov REGISTER_SAVE_BND2(%rsp), %bnd2
-	bndmov REGISTER_SAVE_BND1(%rsp), %bnd1
-	bndmov REGISTER_SAVE_BND0(%rsp), %bnd0
-# else
-	.byte 0x66,0x0f,0x1a,0x5c,0x24,REGISTER_SAVE_BND3
-	.byte 0x66,0x0f,0x1a,0x54,0x24,REGISTER_SAVE_BND2
-	.byte 0x66,0x0f,0x1a,0x4c,0x24,REGISTER_SAVE_BND1
-	.byte 0x66,0x0f,0x1a,0x44,0x24,REGISTER_SAVE_BND0
-# endif
+#define VEC_SIZE		64
+#define VMOVA			vmovdqa64
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
+# define VMOV			vmovdqa64
+#else
+# define VMOV			vmovdqu64
 #endif
-	# Get register content back.
-	movq REGISTER_SAVE_R9(%rsp), %r9
-	movq REGISTER_SAVE_R8(%rsp), %r8
-	movq REGISTER_SAVE_RDI(%rsp), %rdi
-	movq REGISTER_SAVE_RSI(%rsp), %rsi
-	movq REGISTER_SAVE_RDX(%rsp), %rdx
-	movq REGISTER_SAVE_RCX(%rsp), %rcx
-	movq REGISTER_SAVE_RAX(%rsp), %rax
-	# Adjust stack(PLT did 2 pushes)
-	addq $(REGISTER_SAVE_AREA + 16), %rsp
-	cfi_adjust_cfa_offset(-(REGISTER_SAVE_AREA + 16))
-	jmp *%r11		# Jump to function address.
-	cfi_endproc
-	.size _dl_runtime_resolve, .-_dl_runtime_resolve
-
-
-#ifndef PROF
-	.globl _dl_runtime_profile
-	.type _dl_runtime_profile, @function
-	.align 16
-	cfi_startproc
-
-_dl_runtime_profile:
-	cfi_adjust_cfa_offset(16) # Incorporate PLT
-	/* The La_x86_64_regs data structure pointed to by the
-	   fourth paramater must be 16-byte aligned.  This must
-	   be explicitly enforced.  We have the set up a dynamically
-	   sized stack frame.  %rbx points to the top half which
-	   has a fixed size and preserves the original stack pointer.  */
-
-	subq $32, %rsp		# Allocate the local storage.
-	cfi_adjust_cfa_offset(32)
-	movq %rbx, (%rsp)
-	cfi_rel_offset(%rbx, 0)
-
-	/* On the stack:
-		56(%rbx)	parameter #1
-		48(%rbx)	return address
-
-		40(%rbx)	reloc index
-		32(%rbx)	link_map
-
-		24(%rbx)	La_x86_64_regs pointer
-		16(%rbx)	framesize
-		 8(%rbx)	rax
-		  (%rbx)	rbx
-	*/
-
-	movq %rax, 8(%rsp)
-	movq %rsp, %rbx
-	cfi_def_cfa_register(%rbx)
-
-	/* Actively align the La_x86_64_regs structure.  */
-	andq $0xfffffffffffffff0, %rsp
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
-	/* sizeof(La_x86_64_regs).  Need extra space for 8 SSE registers
-	   to detect if any xmm0-xmm7 registers are changed by audit
-	   module.  */
-	subq $(LR_SIZE + XMM_SIZE*8), %rsp
-# else
-	subq $LR_SIZE, %rsp		# sizeof(La_x86_64_regs)
-# endif
-	movq %rsp, 24(%rbx)
-
-	/* Fill the La_x86_64_regs structure.  */
-	movq %rdx, LR_RDX_OFFSET(%rsp)
-	movq %r8,  LR_R8_OFFSET(%rsp)
-	movq %r9,  LR_R9_OFFSET(%rsp)
-	movq %rcx, LR_RCX_OFFSET(%rsp)
-	movq %rsi, LR_RSI_OFFSET(%rsp)
-	movq %rdi, LR_RDI_OFFSET(%rsp)
-	movq %rbp, LR_RBP_OFFSET(%rsp)
-
-	leaq 48(%rbx), %rax
-	movq %rax, LR_RSP_OFFSET(%rsp)
-
-	/* We always store the XMM registers even if AVX is available.
-	   This is to provide backward binary compatility for existing
-	   audit modules.  */
-	movaps %xmm0,		   (LR_XMM_OFFSET)(%rsp)
-	movaps %xmm1, (LR_XMM_OFFSET +   XMM_SIZE)(%rsp)
-	movaps %xmm2, (LR_XMM_OFFSET + XMM_SIZE*2)(%rsp)
-	movaps %xmm3, (LR_XMM_OFFSET + XMM_SIZE*3)(%rsp)
-	movaps %xmm4, (LR_XMM_OFFSET + XMM_SIZE*4)(%rsp)
-	movaps %xmm5, (LR_XMM_OFFSET + XMM_SIZE*5)(%rsp)
-	movaps %xmm6, (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp)
-	movaps %xmm7, (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp)
-
-# ifndef __ILP32__
-#  ifdef HAVE_MPX_SUPPORT
-	bndmov %bnd0, 		   (LR_BND_OFFSET)(%rsp)  # Preserve bound
-	bndmov %bnd1, (LR_BND_OFFSET +   BND_SIZE)(%rsp)  # registers. Nops if
-	bndmov %bnd2, (LR_BND_OFFSET + BND_SIZE*2)(%rsp)  # MPX not available
-	bndmov %bnd3, (LR_BND_OFFSET + BND_SIZE*3)(%rsp)  # or disabled.
-#  else
-	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET)
-	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
-	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
-	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
-#  endif
-# endif
-
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
-	.data
-L(have_avx):
-	.zero 4
-	.size L(have_avx), 4
-	.previous
-
-	cmpl	$0, L(have_avx)(%rip)
-	jne	L(defined)
-	movq	%rbx, %r11		# Save rbx
-	movl	$1, %eax
-	cpuid
-	movq	%r11,%rbx		# Restore rbx
-	xorl	%eax, %eax
-	// AVX and XSAVE supported?
-	andl	$((1 << 28) | (1 << 27)), %ecx
-	cmpl	$((1 << 28) | (1 << 27)), %ecx
-	jne	10f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	// AVX512 supported in processor?
-	movq	%rbx, %r11		# Save rbx
-	xorl	%ecx, %ecx
-	mov	$0x7, %eax
-	cpuid
-	andl	$(1 << 16), %ebx
-#  endif
-	xorl	%ecx, %ecx
-	// Get XFEATURE_ENABLED_MASK
-	xgetbv
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	test	%ebx, %ebx
-	movq	%r11, %rbx		# Restore rbx
-	je	20f
-	// Verify that XCR0[7:5] = '111b' and
-	// XCR0[2:1] = '11b' which means
-	// that zmm state is enabled
-	andl	$0xe6, %eax
-	cmpl	$0xe6, %eax
-	jne	20f
-	movl	%eax, L(have_avx)(%rip)
-L(avx512):
-#   define RESTORE_AVX
-#   define VMOV    vmovdqu64
-#   define VEC(i)  zmm##i
-#   define MORE_CODE
-#   include "dl-trampoline.h"
-#   undef VMOV
-#   undef VEC
-#   undef RESTORE_AVX
-#  endif
-20:	andl	$0x6, %eax
-10:	subl	$0x5, %eax
-	movl	%eax, L(have_avx)(%rip)
-	cmpl	$0, %eax
-
-L(defined):
-	js	L(no_avx)
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	cmpl	$0xe6, L(have_avx)(%rip)
-	je	L(avx512)
-#  endif
-
-#  define RESTORE_AVX
-#  define VMOV    vmovdqu
-#  define VEC(i)  ymm##i
-#  define MORE_CODE
-#  include "dl-trampoline.h"
-
-	.align 16
-L(no_avx):
-# endif
-
-# undef RESTORE_AVX
-# include "dl-trampoline.h"
-
-	cfi_endproc
-	.size _dl_runtime_profile, .-_dl_runtime_profile
+#define VEC(i)			zmm##i
+#define _dl_runtime_resolve	_dl_runtime_resolve_avx512
+#define _dl_runtime_profile	_dl_runtime_profile_avx512
+#define RESTORE_AVX
+#include "dl-trampoline.h"
+#undef _dl_runtime_resolve
+#undef _dl_runtime_profile
+#undef VEC
+#undef VMOV
+#undef VMOVA
+#undef VEC_SIZE
+
+#define VEC_SIZE		32
+#define VMOVA			vmovdqa
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
+# define VMOV			vmovdqa
+#else
+# define VMOV			vmovdqu
 #endif
-
-
-#ifdef SHARED
-	.globl _dl_x86_64_save_sse
-	.type _dl_x86_64_save_sse, @function
-	.align 16
-	cfi_startproc
-_dl_x86_64_save_sse:
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
-	cmpl	$0, L(have_avx)(%rip)
-	jne	L(defined_5)
-	movq	%rbx, %r11		# Save rbx
-	movl	$1, %eax
-	cpuid
-	movq	%r11,%rbx		# Restore rbx
-	xorl	%eax, %eax
-	// AVX and XSAVE supported?
-	andl	$((1 << 28) | (1 << 27)), %ecx
-	cmpl	$((1 << 28) | (1 << 27)), %ecx
-	jne	1f
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	// AVX512 supported in a processor?
-	movq	%rbx, %r11              # Save rbx
-	xorl	%ecx,%ecx
-	mov	$0x7,%eax
-	cpuid
-	andl	$(1 << 16), %ebx
-#  endif
-	xorl	%ecx, %ecx
-	// Get XFEATURE_ENABLED_MASK
-	xgetbv
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	test	%ebx, %ebx
-	movq	%r11, %rbx		# Restore rbx
-	je	2f
-	// Verify that XCR0[7:5] = '111b' and
-	// XCR0[2:1] = '11b' which means
-	// that zmm state is enabled
-	andl	$0xe6, %eax
-	movl	%eax, L(have_avx)(%rip)
-	cmpl	$0xe6, %eax
-	je	L(avx512_5)
-#  endif
-
-2:	andl	$0x6, %eax
-1:	subl	$0x5, %eax
-	movl	%eax, L(have_avx)(%rip)
-	cmpl	$0, %eax
-
-L(defined_5):
-	js	L(no_avx5)
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	cmpl	$0xe6, L(have_avx)(%rip)
-	je	L(avx512_5)
-#  endif
-
-	vmovdqa %ymm0, %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE
-	vmovdqa %ymm1, %fs:RTLD_SAVESPACE_SSE+1*YMM_SIZE
-	vmovdqa %ymm2, %fs:RTLD_SAVESPACE_SSE+2*YMM_SIZE
-	vmovdqa %ymm3, %fs:RTLD_SAVESPACE_SSE+3*YMM_SIZE
-	vmovdqa %ymm4, %fs:RTLD_SAVESPACE_SSE+4*YMM_SIZE
-	vmovdqa %ymm5, %fs:RTLD_SAVESPACE_SSE+5*YMM_SIZE
-	vmovdqa %ymm6, %fs:RTLD_SAVESPACE_SSE+6*YMM_SIZE
-	vmovdqa %ymm7, %fs:RTLD_SAVESPACE_SSE+7*YMM_SIZE
-	ret
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-L(avx512_5):
-	vmovdqu64 %zmm0, %fs:RTLD_SAVESPACE_SSE+0*ZMM_SIZE
-	vmovdqu64 %zmm1, %fs:RTLD_SAVESPACE_SSE+1*ZMM_SIZE
-	vmovdqu64 %zmm2, %fs:RTLD_SAVESPACE_SSE+2*ZMM_SIZE
-	vmovdqu64 %zmm3, %fs:RTLD_SAVESPACE_SSE+3*ZMM_SIZE
-	vmovdqu64 %zmm4, %fs:RTLD_SAVESPACE_SSE+4*ZMM_SIZE
-	vmovdqu64 %zmm5, %fs:RTLD_SAVESPACE_SSE+5*ZMM_SIZE
-	vmovdqu64 %zmm6, %fs:RTLD_SAVESPACE_SSE+6*ZMM_SIZE
-	vmovdqu64 %zmm7, %fs:RTLD_SAVESPACE_SSE+7*ZMM_SIZE
-	ret
-#  endif
-L(no_avx5):
-# endif
-	movdqa	%xmm0, %fs:RTLD_SAVESPACE_SSE+0*XMM_SIZE
-	movdqa	%xmm1, %fs:RTLD_SAVESPACE_SSE+1*XMM_SIZE
-	movdqa	%xmm2, %fs:RTLD_SAVESPACE_SSE+2*XMM_SIZE
-	movdqa	%xmm3, %fs:RTLD_SAVESPACE_SSE+3*XMM_SIZE
-	movdqa	%xmm4, %fs:RTLD_SAVESPACE_SSE+4*XMM_SIZE
-	movdqa	%xmm5, %fs:RTLD_SAVESPACE_SSE+5*XMM_SIZE
-	movdqa	%xmm6, %fs:RTLD_SAVESPACE_SSE+6*XMM_SIZE
-	movdqa	%xmm7, %fs:RTLD_SAVESPACE_SSE+7*XMM_SIZE
-	ret
-	cfi_endproc
-	.size _dl_x86_64_save_sse, .-_dl_x86_64_save_sse
-
-
-	.globl _dl_x86_64_restore_sse
-	.type _dl_x86_64_restore_sse, @function
-	.align 16
-	cfi_startproc
-_dl_x86_64_restore_sse:
-# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
-	cmpl	$0, L(have_avx)(%rip)
-	js	L(no_avx6)
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-	cmpl	$0xe6, L(have_avx)(%rip)
-	je	L(avx512_6)
-#  endif
-
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+1*YMM_SIZE, %ymm1
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+2*YMM_SIZE, %ymm2
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+3*YMM_SIZE, %ymm3
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+4*YMM_SIZE, %ymm4
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+5*YMM_SIZE, %ymm5
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+6*YMM_SIZE, %ymm6
-	vmovdqa %fs:RTLD_SAVESPACE_SSE+7*YMM_SIZE, %ymm7
-	ret
-#  ifdef HAVE_AVX512_ASM_SUPPORT
-L(avx512_6):
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+0*ZMM_SIZE, %zmm0
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+1*ZMM_SIZE, %zmm1
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+2*ZMM_SIZE, %zmm2
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+3*ZMM_SIZE, %zmm3
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+4*ZMM_SIZE, %zmm4
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+5*ZMM_SIZE, %zmm5
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+6*ZMM_SIZE, %zmm6
-	vmovdqu64 %fs:RTLD_SAVESPACE_SSE+7*ZMM_SIZE, %zmm7
-	ret
-#  endif
-L(no_avx6):
-# endif
-	movdqa	%fs:RTLD_SAVESPACE_SSE+0*XMM_SIZE, %xmm0
-	movdqa	%fs:RTLD_SAVESPACE_SSE+1*XMM_SIZE, %xmm1
-	movdqa	%fs:RTLD_SAVESPACE_SSE+2*XMM_SIZE, %xmm2
-	movdqa	%fs:RTLD_SAVESPACE_SSE+3*XMM_SIZE, %xmm3
-	movdqa	%fs:RTLD_SAVESPACE_SSE+4*XMM_SIZE, %xmm4
-	movdqa	%fs:RTLD_SAVESPACE_SSE+5*XMM_SIZE, %xmm5
-	movdqa	%fs:RTLD_SAVESPACE_SSE+6*XMM_SIZE, %xmm6
-	movdqa	%fs:RTLD_SAVESPACE_SSE+7*XMM_SIZE, %xmm7
-	ret
-	cfi_endproc
-	.size _dl_x86_64_restore_sse, .-_dl_x86_64_restore_sse
+#define VEC(i)			ymm##i
+#define _dl_runtime_resolve	_dl_runtime_resolve_avx
+#define _dl_runtime_resolve_opt	_dl_runtime_resolve_avx_opt
+#define _dl_runtime_profile	_dl_runtime_profile_avx
+#include "dl-trampoline.h"
+#undef _dl_runtime_resolve
+#undef _dl_runtime_resolve_opt
+#undef _dl_runtime_profile
+#undef VEC
+#undef VMOV
+#undef VMOVA
+#undef VEC_SIZE
+
+/* movaps/movups is 1-byte shorter.  */
+#define VEC_SIZE		16
+#define VMOVA			movaps
+#if DL_RUNIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
+# define VMOV			movaps
+#else
+# define VMOV			movups
+#endif
+#define VEC(i)			xmm##i
+#define _dl_runtime_resolve	_dl_runtime_resolve_sse
+#define _dl_runtime_profile	_dl_runtime_profile_sse
+#undef RESTORE_AVX
+#include "dl-trampoline.h"
+#undef _dl_runtime_resolve
+#undef _dl_runtime_profile
+#undef VMOV
+#undef VMOVA
+
+/* Used by _dl_runtime_resolve_avx_opt/_dl_runtime_resolve_avx512_opt
+   to preserve the full vector registers with zero upper bits.  */
+#define VMOVA			vmovdqa
+#if DL_RUNTIME_RESOLVE_REALIGN_STACK || VEC_SIZE <= DL_STACK_ALIGNMENT
+# define VMOV			vmovdqa
+#else
+# define VMOV			vmovdqu
 #endif
+#define _dl_runtime_resolve	_dl_runtime_resolve_sse_vex
+#define _dl_runtime_resolve_opt	_dl_runtime_resolve_avx512_opt
+#include "dl-trampoline.h"
Index: glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.h
00db10
===================================================================
00db10
--- glibc-2.17-c758a686.orig/sysdeps/x86_64/dl-trampoline.h
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/dl-trampoline.h
00db10
@@ -1,6 +1,5 @@
00db10
-/* Partial PLT profile trampoline to save and restore x86-64 vector
00db10
-   registers.
00db10
-   Copyright (C) 2009, 2011 Free Software Foundation, Inc.
00db10
+/* PLT trampolines.  x86-64 version.
00db10
+   Copyright (C) 2009-2015 Free Software Foundation, Inc.
00db10
    This file is part of the GNU C Library.
00db10
 
00db10
    The GNU C Library is free software; you can redistribute it and/or
00db10
@@ -17,16 +16,355 @@
00db10
    License along with the GNU C Library; if not, see
00db10
    <http://www.gnu.org/licenses/>.  */
00db10
 
00db10
-#ifdef RESTORE_AVX
00db10
+#undef REGISTER_SAVE_AREA_RAW
00db10
+#ifdef __ILP32__
00db10
+/* X32 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX as well as VEC0 to
00db10
+   VEC7.  */
00db10
+# define REGISTER_SAVE_AREA_RAW	(8 * 7 + VEC_SIZE * 8)
00db10
+#else
00db10
+/* X86-64 saves RCX, RDX, RSI, RDI, R8 and R9 plus RAX as well as
00db10
+   BND0, BND1, BND2, BND3 and VEC0 to VEC7. */
00db10
+# define REGISTER_SAVE_AREA_RAW	(8 * 7 + 16 * 4 + VEC_SIZE * 8)
00db10
+#endif
00db10
+
00db10
+#undef REGISTER_SAVE_AREA
00db10
+#undef LOCAL_STORAGE_AREA
00db10
+#undef BASE
00db10
+#if DL_RUNIME_RESOLVE_REALIGN_STACK
00db10
+# define REGISTER_SAVE_AREA	(REGISTER_SAVE_AREA_RAW + 8)
00db10
+/* Local stack area before jumping to function address: RBX.  */
00db10
+# define LOCAL_STORAGE_AREA	8
00db10
+# define BASE			rbx
00db10
+# if (REGISTER_SAVE_AREA % VEC_SIZE) != 0
00db10
+#  error REGISTER_SAVE_AREA must be multples of VEC_SIZE
00db10
+# endif
00db10
+#else
00db10
+# define REGISTER_SAVE_AREA	REGISTER_SAVE_AREA_RAW
00db10
+/* Local stack area before jumping to function address:  All saved
00db10
+   registers.  */
00db10
+# define LOCAL_STORAGE_AREA	REGISTER_SAVE_AREA
00db10
+# define BASE			rsp
00db10
+# if (REGISTER_SAVE_AREA % 16) != 8
00db10
+#  error REGISTER_SAVE_AREA must be odd multples of 8
00db10
+# endif
00db10
+#endif
00db10
+
00db10
+	.text
00db10
+#ifdef _dl_runtime_resolve_opt
00db10
+/* Use the smallest vector registers to preserve the full YMM/ZMM
00db10
+   registers to avoid SSE transition penalty.  */
00db10
+
00db10
+# if VEC_SIZE == 32
00db10
+/* Check if the upper 128 bits in %ymm0 - %ymm7 registers are non-zero
00db10
+   and preserve %xmm0 - %xmm7 registers with the zero upper bits.  Since
00db10
+   there is no SSE transition penalty on AVX512 processors which don't
00db10
+   support XGETBV with ECX == 1, _dl_runtime_resolve_avx512_slow isn't
00db10
+   provided.   */
00db10
+	.globl _dl_runtime_resolve_avx_slow
00db10
+	.hidden _dl_runtime_resolve_avx_slow
00db10
+	.type _dl_runtime_resolve_avx_slow, @function
00db10
+	.align 16
00db10
+_dl_runtime_resolve_avx_slow:
00db10
+	cfi_startproc
00db10
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
00db10
+	vorpd %ymm0, %ymm1, %ymm8
00db10
+	vorpd %ymm2, %ymm3, %ymm9
00db10
+	vorpd %ymm4, %ymm5, %ymm10
00db10
+	vorpd %ymm6, %ymm7, %ymm11
00db10
+	vorpd %ymm8, %ymm9, %ymm9
00db10
+	vorpd %ymm10, %ymm11, %ymm10
00db10
+	vpcmpeqd %xmm8, %xmm8, %xmm8
00db10
+	vorpd %ymm9, %ymm10, %ymm10
00db10
+	vptest %ymm10, %ymm8
00db10
+	# Preserve %ymm0 - %ymm7 registers if the upper 128 bits of any
00db10
+	# %ymm0 - %ymm7 registers aren't zero.
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
+	jnc _dl_runtime_resolve_avx
00db10
+	# Use vzeroupper to avoid SSE transition penalty.
00db10
+	vzeroupper
00db10
+	# Preserve %xmm0 - %xmm7 registers with the zero upper 128 bits
00db10
+	# when the upper 128 bits of %ymm0 - %ymm7 registers are zero.
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
+	jmp _dl_runtime_resolve_sse_vex
00db10
+	cfi_adjust_cfa_offset(-16) # Restore PLT adjustment
00db10
+	cfi_endproc
00db10
+	.size _dl_runtime_resolve_avx_slow, .-_dl_runtime_resolve_avx_slow
00db10
+# endif
00db10
+
00db10
+/* Use XGETBV with ECX == 1 to check which bits in vector registers are
00db10
+   non-zero and only preserve the non-zero lower bits with zero upper
00db10
+   bits.  */
00db10
+	.globl _dl_runtime_resolve_opt
00db10
+	.hidden _dl_runtime_resolve_opt
00db10
+	.type _dl_runtime_resolve_opt, @function
00db10
+	.align 16
00db10
+_dl_runtime_resolve_opt:
00db10
+	cfi_startproc
00db10
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
00db10
+	pushq %rax
00db10
+	cfi_adjust_cfa_offset(8)
00db10
+	cfi_rel_offset(%rax, 0)
00db10
+	pushq %rcx
00db10
+	cfi_adjust_cfa_offset(8)
00db10
+	cfi_rel_offset(%rcx, 0)
00db10
+	pushq %rdx
00db10
+	cfi_adjust_cfa_offset(8)
00db10
+	cfi_rel_offset(%rdx, 0)
00db10
+	movl $1, %ecx
00db10
+	xgetbv
00db10
+	movl %eax, %r11d
00db10
+	popq %rdx
00db10
+	cfi_adjust_cfa_offset(-8)
00db10
+	cfi_restore (%rdx)
00db10
+	popq %rcx
00db10
+	cfi_adjust_cfa_offset(-8)
00db10
+	cfi_restore (%rcx)
00db10
+	popq %rax
00db10
+	cfi_adjust_cfa_offset(-8)
00db10
+	cfi_restore (%rax)
00db10
+# if VEC_SIZE == 32
00db10
+	# For YMM registers, check if YMM state is in use.
00db10
+	andl $bit_YMM_state, %r11d
00db10
+	# Preserve %xmm0 - %xmm7 registers with the zero upper 128 bits if
00db10
+	# YMM state isn't in use.
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
+	jz _dl_runtime_resolve_sse_vex
00db10
+# elif VEC_SIZE == 16
00db10
+	# For ZMM registers, check if YMM state and ZMM state are in
00db10
+	# use.
00db10
+	andl $(bit_YMM_state | bit_ZMM0_15_state), %r11d
00db10
+	cmpl $bit_YMM_state, %r11d
00db10
+	# Preserve %zmm0 - %zmm7 registers if ZMM state is in use.
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
+	jg _dl_runtime_resolve_avx512
00db10
+	# Preserve %ymm0 - %ymm7 registers with the zero upper 256 bits if
00db10
+	# ZMM state isn't in use.
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
+	je _dl_runtime_resolve_avx
00db10
+	# Preserve %xmm0 - %xmm7 registers with the zero upper 384 bits if
00db10
+	# neither YMM state nor ZMM state are in use.
00db10
+# else
00db10
+#  error Unsupported VEC_SIZE!
00db10
+# endif
00db10
+	cfi_adjust_cfa_offset(-16) # Restore PLT adjustment
00db10
+	cfi_endproc
00db10
+	.size _dl_runtime_resolve_opt, .-_dl_runtime_resolve_opt
00db10
+#endif
00db10
+	.globl _dl_runtime_resolve
00db10
+	.hidden _dl_runtime_resolve
00db10
+	.type _dl_runtime_resolve, @function
00db10
+	.align 16
00db10
+	cfi_startproc
00db10
+_dl_runtime_resolve:
00db10
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
00db10
+#if DL_RUNIME_RESOLVE_REALIGN_STACK
00db10
+# if LOCAL_STORAGE_AREA != 8
00db10
+#  error LOCAL_STORAGE_AREA must be 8
00db10
+# endif
00db10
+	pushq %rbx			# push subtracts stack by 8.
00db10
+	cfi_adjust_cfa_offset(8)
00db10
+	cfi_rel_offset(%rbx, 0)
00db10
+	mov %RSP_LP, %RBX_LP
00db10
+	cfi_def_cfa_register(%rbx)
00db10
+	and $-VEC_SIZE, %RSP_LP
00db10
+#endif
00db10
+	sub $REGISTER_SAVE_AREA, %RSP_LP
00db10
+	cfi_adjust_cfa_offset(REGISTER_SAVE_AREA)
00db10
+	# Preserve registers otherwise clobbered.
00db10
+	movq %rax, REGISTER_SAVE_RAX(%rsp)
00db10
+	movq %rcx, REGISTER_SAVE_RCX(%rsp)
00db10
+	movq %rdx, REGISTER_SAVE_RDX(%rsp)
00db10
+	movq %rsi, REGISTER_SAVE_RSI(%rsp)
00db10
+	movq %rdi, REGISTER_SAVE_RDI(%rsp)
00db10
+	movq %r8, REGISTER_SAVE_R8(%rsp)
00db10
+	movq %r9, REGISTER_SAVE_R9(%rsp)
00db10
+	VMOV %VEC(0), (REGISTER_SAVE_VEC_OFF)(%rsp)
00db10
+	VMOV %VEC(1), (REGISTER_SAVE_VEC_OFF + VEC_SIZE)(%rsp)
00db10
+	VMOV %VEC(2), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 2)(%rsp)
00db10
+	VMOV %VEC(3), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 3)(%rsp)
00db10
+	VMOV %VEC(4), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 4)(%rsp)
00db10
+	VMOV %VEC(5), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 5)(%rsp)
00db10
+	VMOV %VEC(6), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 6)(%rsp)
00db10
+	VMOV %VEC(7), (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 7)(%rsp)
00db10
+#ifndef __ILP32__
00db10
+	# We also have to preserve bound registers.  These are nops if
00db10
+	# Intel MPX isn't available or disabled.
00db10
+# ifdef HAVE_MPX_SUPPORT
00db10
+	bndmov %bnd0, REGISTER_SAVE_BND0(%rsp)
00db10
+	bndmov %bnd1, REGISTER_SAVE_BND1(%rsp)
00db10
+	bndmov %bnd2, REGISTER_SAVE_BND2(%rsp)
00db10
+	bndmov %bnd3, REGISTER_SAVE_BND3(%rsp)
00db10
+# else
00db10
+#  if REGISTER_SAVE_BND0 == 0
00db10
+	.byte 0x66,0x0f,0x1b,0x04,0x24
00db10
+#  else
00db10
+	.byte 0x66,0x0f,0x1b,0x44,0x24,REGISTER_SAVE_BND0
00db10
+#  endif
00db10
+	.byte 0x66,0x0f,0x1b,0x4c,0x24,REGISTER_SAVE_BND1
00db10
+	.byte 0x66,0x0f,0x1b,0x54,0x24,REGISTER_SAVE_BND2
00db10
+	.byte 0x66,0x0f,0x1b,0x5c,0x24,REGISTER_SAVE_BND3
00db10
+# endif
00db10
+#endif
00db10
+	# Copy args pushed by PLT in register.
00db10
+	# %rdi: link_map, %rsi: reloc_index
00db10
+	mov (LOCAL_STORAGE_AREA + 8)(%BASE), %RSI_LP
00db10
+	mov LOCAL_STORAGE_AREA(%BASE), %RDI_LP
00db10
+	call _dl_fixup		# Call resolver.
00db10
+	mov %RAX_LP, %R11_LP	# Save return value
00db10
+#ifndef __ILP32__
00db10
+	# Restore bound registers.  These are nops if Intel MPX isn't
00db10
+	# avaiable or disabled.
00db10
+# ifdef HAVE_MPX_SUPPORT
00db10
+	bndmov REGISTER_SAVE_BND3(%rsp), %bnd3
00db10
+	bndmov REGISTER_SAVE_BND2(%rsp), %bnd2
00db10
+	bndmov REGISTER_SAVE_BND1(%rsp), %bnd1
00db10
+	bndmov REGISTER_SAVE_BND0(%rsp), %bnd0
00db10
+# else
00db10
+	.byte 0x66,0x0f,0x1a,0x5c,0x24,REGISTER_SAVE_BND3
00db10
+	.byte 0x66,0x0f,0x1a,0x54,0x24,REGISTER_SAVE_BND2
00db10
+	.byte 0x66,0x0f,0x1a,0x4c,0x24,REGISTER_SAVE_BND1
00db10
+#  if REGISTER_SAVE_BND0 == 0
00db10
+	.byte 0x66,0x0f,0x1a,0x04,0x24
00db10
+#  else
00db10
+	.byte 0x66,0x0f,0x1a,0x44,0x24,REGISTER_SAVE_BND0
00db10
+#  endif
00db10
+# endif
00db10
+#endif
00db10
+	# Get register content back.
00db10
+	movq REGISTER_SAVE_R9(%rsp), %r9
00db10
+	movq REGISTER_SAVE_R8(%rsp), %r8
00db10
+	movq REGISTER_SAVE_RDI(%rsp), %rdi
00db10
+	movq REGISTER_SAVE_RSI(%rsp), %rsi
00db10
+	movq REGISTER_SAVE_RDX(%rsp), %rdx
00db10
+	movq REGISTER_SAVE_RCX(%rsp), %rcx
00db10
+	movq REGISTER_SAVE_RAX(%rsp), %rax
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF)(%rsp), %VEC(0)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE)(%rsp), %VEC(1)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 2)(%rsp), %VEC(2)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 3)(%rsp), %VEC(3)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 4)(%rsp), %VEC(4)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 5)(%rsp), %VEC(5)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 6)(%rsp), %VEC(6)
00db10
+	VMOV (REGISTER_SAVE_VEC_OFF + VEC_SIZE * 7)(%rsp), %VEC(7)
00db10
+#if DL_RUNIME_RESOLVE_REALIGN_STACK
00db10
+	mov %RBX_LP, %RSP_LP
00db10
+	cfi_def_cfa_register(%rsp)
00db10
+	movq (%rsp), %rbx
00db10
+	cfi_restore(%rbx)
00db10
+#endif
00db10
+	# Adjust stack(PLT did 2 pushes)
00db10
+	add $(LOCAL_STORAGE_AREA + 16), %RSP_LP
00db10
+	cfi_adjust_cfa_offset(-(LOCAL_STORAGE_AREA + 16))
00db10
+	# Preserve bound registers.
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
+	jmp *%r11		# Jump to function address.
00db10
+	cfi_endproc
00db10
+	.size _dl_runtime_resolve, .-_dl_runtime_resolve
00db10
+
00db10
+
00db10
+/* To preserve %xmm0 - %xmm7 registers, dl-trampoline.h is included
00db10
+   twice, for _dl_runtime_resolve_sse and _dl_runtime_resolve_sse_vex.
00db10
+   But we don't need another _dl_runtime_profile for XMM registers.  */
00db10
+#if !defined PROF && defined _dl_runtime_profile
00db10
+# if (LR_VECTOR_OFFSET % VEC_SIZE) != 0
00db10
+#  error LR_VECTOR_OFFSET must be multples of VEC_SIZE
00db10
+# endif
00db10
+
00db10
+	.globl _dl_runtime_profile
00db10
+	.hidden _dl_runtime_profile
00db10
+	.type _dl_runtime_profile, @function
00db10
+	.align 16
00db10
+_dl_runtime_profile:
00db10
+	cfi_startproc
00db10
+	cfi_adjust_cfa_offset(16) # Incorporate PLT
00db10
+	/* The La_x86_64_regs data structure pointed to by the
00db10
+	   fourth parameter must be VEC_SIZE-byte aligned.  This must
00db10
+	   be explicitly enforced.  We have to set up a dynamically
00db10
+	   sized stack frame.  %rbx points to the top half which
00db10
+	   has a fixed size and preserves the original stack pointer.  */
00db10
+
00db10
+	sub $32, %RSP_LP	# Allocate the local storage.
00db10
+	cfi_adjust_cfa_offset(32)
00db10
+	movq %rbx, (%rsp)
00db10
+	cfi_rel_offset(%rbx, 0)
00db10
+
00db10
+	/* On the stack:
00db10
+		56(%rbx)	parameter #1
00db10
+		48(%rbx)	return address
00db10
+
00db10
+		40(%rbx)	reloc index
00db10
+		32(%rbx)	link_map
00db10
+
00db10
+		24(%rbx)	La_x86_64_regs pointer
00db10
+		16(%rbx)	framesize
00db10
+		 8(%rbx)	rax
00db10
+		  (%rbx)	rbx
00db10
+	*/
00db10
+
00db10
+	movq %rax, 8(%rsp)
00db10
+	mov %RSP_LP, %RBX_LP
00db10
+	cfi_def_cfa_register(%rbx)
00db10
+
00db10
+	/* Actively align the La_x86_64_regs structure.  */
00db10
+	and $-VEC_SIZE, %RSP_LP
00db10
+# if defined HAVE_AVX_SUPPORT || defined HAVE_AVX512_ASM_SUPPORT
00db10
+	/* sizeof(La_x86_64_regs).  Need extra space for 8 SSE registers
00db10
+	   to detect if any xmm0-xmm7 registers are changed by the
00db10
+	   audit module.  */
00db10
+	sub $(LR_SIZE + XMM_SIZE*8), %RSP_LP
00db10
+# else
00db10
+	sub $LR_SIZE, %RSP_LP		# sizeof(La_x86_64_regs)
00db10
+# endif
00db10
+	movq %rsp, 24(%rbx)
00db10
+
00db10
+	/* Fill the La_x86_64_regs structure.  */
00db10
+	movq %rdx, LR_RDX_OFFSET(%rsp)
00db10
+	movq %r8,  LR_R8_OFFSET(%rsp)
00db10
+	movq %r9,  LR_R9_OFFSET(%rsp)
00db10
+	movq %rcx, LR_RCX_OFFSET(%rsp)
00db10
+	movq %rsi, LR_RSI_OFFSET(%rsp)
00db10
+	movq %rdi, LR_RDI_OFFSET(%rsp)
00db10
+	movq %rbp, LR_RBP_OFFSET(%rsp)
00db10
+
00db10
+	lea 48(%rbx), %RAX_LP
00db10
+	movq %rax, LR_RSP_OFFSET(%rsp)
00db10
+
00db10
+	/* We always store the XMM registers even if AVX is available.
00db10
+	   This is to provide backward binary compatibility for existing
00db10
+	   audit modules.  */
00db10
+	movaps %xmm0,		   (LR_XMM_OFFSET)(%rsp)
00db10
+	movaps %xmm1, (LR_XMM_OFFSET +   XMM_SIZE)(%rsp)
00db10
+	movaps %xmm2, (LR_XMM_OFFSET + XMM_SIZE*2)(%rsp)
00db10
+	movaps %xmm3, (LR_XMM_OFFSET + XMM_SIZE*3)(%rsp)
00db10
+	movaps %xmm4, (LR_XMM_OFFSET + XMM_SIZE*4)(%rsp)
00db10
+	movaps %xmm5, (LR_XMM_OFFSET + XMM_SIZE*5)(%rsp)
00db10
+	movaps %xmm6, (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp)
00db10
+	movaps %xmm7, (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp)
00db10
+
00db10
+# ifndef __ILP32__
00db10
+#  ifdef HAVE_MPX_SUPPORT
00db10
+	bndmov %bnd0, 		   (LR_BND_OFFSET)(%rsp)  # Preserve bound
00db10
+	bndmov %bnd1, (LR_BND_OFFSET +   BND_SIZE)(%rsp)  # registers. Nops if
00db10
+	bndmov %bnd2, (LR_BND_OFFSET + BND_SIZE*2)(%rsp)  # MPX not available
00db10
+	bndmov %bnd3, (LR_BND_OFFSET + BND_SIZE*3)(%rsp)  # or disabled.
00db10
+#  else
00db10
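+	# The same BNDMOV stores, encoded with a 32-bit displacement
00db10
+	# since LR_BND_OFFSET need not fit in a disp8.
00db10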
+	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET)
00db10
+	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
00db10
+	.byte 0x66,0x0f,0x1b,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
00db10
+	.byte 0x66,0x0f,0x1b,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
00db10
+#  endif
00db10
+# endif
00db10
+
00db10
+# ifdef RESTORE_AVX
00db10
 	/* This is to support AVX audit modules.  */
00db10
-	VMOV %VEC(0),		      (LR_VECTOR_OFFSET)(%rsp)
00db10
-	VMOV %VEC(1), (LR_VECTOR_OFFSET +   VECTOR_SIZE)(%rsp)
00db10
-	VMOV %VEC(2), (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp)
00db10
-	VMOV %VEC(3), (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp)
00db10
-	VMOV %VEC(4), (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp)
00db10
-	VMOV %VEC(5), (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
00db10
-	VMOV %VEC(6), (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp)
00db10
-	VMOV %VEC(7), (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp)
00db10
+	VMOVA %VEC(0),		      (LR_VECTOR_OFFSET)(%rsp)
00db10
+	VMOVA %VEC(1), (LR_VECTOR_OFFSET +   VECTOR_SIZE)(%rsp)
00db10
+	VMOVA %VEC(2), (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp)
00db10
+	VMOVA %VEC(3), (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp)
00db10
+	VMOVA %VEC(4), (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp)
00db10
+	VMOVA %VEC(5), (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
00db10
+	VMOVA %VEC(6), (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp)
00db10
+	VMOVA %VEC(7), (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp)
00db10
 
00db10
 	/* Save xmm0-xmm7 registers to detect if any of them are
00db10
 	   changed by audit module.  */
00db10
@@ -38,7 +376,7 @@
00db10
 	vmovdqa %xmm5, (LR_SIZE + XMM_SIZE*5)(%rsp)
00db10
 	vmovdqa %xmm6, (LR_SIZE + XMM_SIZE*6)(%rsp)
00db10
 	vmovdqa %xmm7, (LR_SIZE + XMM_SIZE*7)(%rsp)
00db10
-#endif
00db10
+# endif
00db10
 
00db10
 	mov %RSP_LP, %RCX_LP	# La_x86_64_regs pointer to %rcx.
00db10
 	mov 48(%rbx), %RDX_LP	# Load return address if needed.
00db10
@@ -63,21 +401,7 @@
00db10
 	movaps (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp), %xmm6
00db10
 	movaps (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp), %xmm7
00db10
 
00db10
-#ifndef __ILP32__
00db10
-# ifdef HAVE_MPX_SUPPORT
00db10
-	bndmov 		    (LR_BND_OFFSET)(%rsp), %bnd0  # Restore bound
00db10
-	bndmov (LR_BND_OFFSET +   BND_SIZE)(%rsp), %bnd1  # registers.
00db10
-	bndmov (LR_BND_OFFSET + BND_SIZE*2)(%rsp), %bnd2
00db10
-	bndmov (LR_BND_OFFSET + BND_SIZE*3)(%rsp), %bnd3
00db10
-# else
00db10
-	.byte 0x66,0x0f,0x1a,0x84,0x24;.long (LR_BND_OFFSET)
00db10
-	.byte 0x66,0x0f,0x1a,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
00db10
-	.byte 0x66,0x0f,0x1a,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
00db10
-	.byte 0x66,0x0f,0x1a,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
00db10
-# endif
00db10
-#endif
00db10
-
00db10
-#ifdef RESTORE_AVX
00db10
+# ifdef RESTORE_AVX
00db10
 	/* Check if any xmm0-xmm7 registers are changed by audit
00db10
 	   module.  */
00db10
 	vpcmpeqq (LR_SIZE)(%rsp), %xmm0, %xmm8
00db10
@@ -86,7 +410,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm0, (LR_VECTOR_OFFSET)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET)(%rsp), %VEC(0)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET)(%rsp), %VEC(0)
00db10
 	vmovdqa	%xmm0, (LR_XMM_OFFSET)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE)(%rsp), %xmm1, %xmm8
00db10
@@ -95,7 +419,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm1, (LR_VECTOR_OFFSET + VECTOR_SIZE)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE)(%rsp), %VEC(1)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE)(%rsp), %VEC(1)
00db10
 	vmovdqa	%xmm1, (LR_XMM_OFFSET + XMM_SIZE)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*2)(%rsp), %xmm2, %xmm8
00db10
@@ -104,7 +428,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm2, (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp), %VEC(2)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*2)(%rsp), %VEC(2)
00db10
 	vmovdqa	%xmm2, (LR_XMM_OFFSET + XMM_SIZE*2)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*3)(%rsp), %xmm3, %xmm8
00db10
@@ -113,7 +437,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm3, (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp), %VEC(3)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*3)(%rsp), %VEC(3)
00db10
 	vmovdqa	%xmm3, (LR_XMM_OFFSET + XMM_SIZE*3)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*4)(%rsp), %xmm4, %xmm8
00db10
@@ -122,7 +446,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm4, (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp), %VEC(4)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*4)(%rsp), %VEC(4)
00db10
 	vmovdqa	%xmm4, (LR_XMM_OFFSET + XMM_SIZE*4)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*5)(%rsp), %xmm5, %xmm8
00db10
@@ -131,7 +455,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm5, (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp), %VEC(5)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp), %VEC(5)
00db10
 	vmovdqa	%xmm5, (LR_XMM_OFFSET + XMM_SIZE*5)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*6)(%rsp), %xmm6, %xmm8
00db10
@@ -140,7 +464,7 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm6, (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp), %VEC(6)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*6)(%rsp), %VEC(6)
00db10
 	vmovdqa	%xmm6, (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp)
00db10
 
00db10
 1:	vpcmpeqq (LR_SIZE + XMM_SIZE*7)(%rsp), %xmm7, %xmm8
00db10
@@ -149,13 +473,29 @@
00db10
 	je 2f
00db10
 	vmovdqa	%xmm7, (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp)
00db10
 	jmp 1f
00db10
-2:	VMOV (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp), %VEC(7)
00db10
+2:	VMOVA (LR_VECTOR_OFFSET + VECTOR_SIZE*7)(%rsp), %VEC(7)
00db10
 	vmovdqa	%xmm7, (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp)
00db10
 
00db10
 1:
00db10
-#endif
00db10
+# endif
00db10
+
00db10
+# ifndef __ILP32__
00db10
+#  ifdef HAVE_MPX_SUPPORT
00db10
+	bndmov              (LR_BND_OFFSET)(%rsp), %bnd0  # Restore bound
00db10
+	bndmov (LR_BND_OFFSET +   BND_SIZE)(%rsp), %bnd1  # registers.
00db10
+	bndmov (LR_BND_OFFSET + BND_SIZE*2)(%rsp), %bnd2
00db10
+	bndmov (LR_BND_OFFSET + BND_SIZE*3)(%rsp), %bnd3
00db10
+#  else
00db10
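+	# Hand-assembled BNDMOV loads (disp32 form) for assemblers
00db10
+	# without MPX support.
00db10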
+	.byte 0x66,0x0f,0x1a,0x84,0x24;.long (LR_BND_OFFSET)
00db10
+	.byte 0x66,0x0f,0x1a,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
00db10
+	.byte 0x66,0x0f,0x1a,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
00db10
+	.byte 0x66,0x0f,0x1a,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
00db10
+#  endif
00db10
+# endif
00db10
+
00db10
 	mov  16(%rbx), %R10_LP	# Anything in framesize?
00db10
 	test %R10_LP, %R10_LP
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
 	jns 3f
00db10
 
00db10
 	/* There's nothing in the frame size, so there
00db10
@@ -166,14 +506,15 @@
00db10
 	movq LR_RSI_OFFSET(%rsp), %rsi
00db10
 	movq LR_RDI_OFFSET(%rsp), %rdi
00db10
 
00db10
-	movq %rbx, %rsp
00db10
+	mov %RBX_LP, %RSP_LP
00db10
 	movq (%rsp), %rbx
00db10
-	cfi_restore(rbx)
00db10
+	cfi_restore(%rbx)
00db10
 	cfi_def_cfa_register(%rsp)
00db10
 
00db10
-	addq $48, %rsp		# Adjust the stack to the return value
00db10
+	add $48, %RSP_LP	# Adjust the stack to the return value
00db10
 				# (eats the reloc index and link_map)
00db10
 	cfi_adjust_cfa_offset(-48)
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
 	jmp *%r11		# Jump to function address.
00db10
 
00db10
 3:
00db10
@@ -186,13 +527,13 @@
00db10
 	   temporary buffer of the size specified by the 'framesize'
00db10
 	   returned from _dl_profile_fixup */
00db10
 
00db10
-	leaq LR_RSP_OFFSET(%rbx), %rsi	# stack
00db10
-	addq $8, %r10
00db10
-	andq $0xfffffffffffffff0, %r10
00db10
-	movq %r10, %rcx
00db10
-	subq %r10, %rsp
00db10
-	movq %rsp, %rdi
00db10
-	shrq $3, %rcx
00db10
+	lea LR_RSP_OFFSET(%rbx), %RSI_LP # stack
00db10
+	add $8, %R10_LP
00db10
+	and $-16, %R10_LP
00db10
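+	# %r10 = (framesize + 8) & -16; %rcx below counts 8-byte
00db10
+	# words for the rep movsq copy.
00db10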
+	mov %R10_LP, %RCX_LP
00db10
+	sub %R10_LP, %RSP_LP
00db10
+	mov %RSP_LP, %RDI_LP
00db10
+	shr $3, %RCX_LP
00db10
 	rep
00db10
 	movsq
00db10
 
00db10
@@ -200,23 +541,24 @@
00db10
 	movq 32(%rdi), %rsi
00db10
 	movq 40(%rdi), %rdi
00db10
 
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
 	call *%r11
00db10
 
00db10
-	mov 24(%rbx), %rsp	# Drop the copied stack content
00db10
+	mov 24(%rbx), %RSP_LP	# Drop the copied stack content
00db10
 
00db10
 	/* Now we have to prepare the La_x86_64_retval structure for the
00db10
 	   _dl_call_pltexit.  The La_x86_64_regs is being pointed by rsp now,
00db10
 	   so we just need to allocate the sizeof(La_x86_64_retval) space on
00db10
 	   the stack, since the alignment has already been taken care of. */
00db10
-#ifdef RESTORE_AVX
00db10
+# ifdef RESTORE_AVX
00db10
 	/* sizeof(La_x86_64_retval).  Need extra space for 2 SSE
00db10
 	   registers to detect if xmm0/xmm1 registers are changed
00db10
 	   by audit module.  */
00db10
-	subq $(LRV_SIZE + XMM_SIZE*2), %rsp
00db10
-#else
00db10
-	subq $LRV_SIZE, %rsp	# sizeof(La_x86_64_retval)
00db10
-#endif
00db10
-	movq %rsp, %rcx		# La_x86_64_retval argument to %rcx.
00db10
+	sub $(LRV_SIZE + XMM_SIZE*2), %RSP_LP
00db10
+# else
00db10
+	sub $LRV_SIZE, %RSP_LP	# sizeof(La_x86_64_retval)
00db10
+# endif
00db10
+	mov %RSP_LP, %RCX_LP	# La_x86_64_retval argument to %rcx.
00db10
 
00db10
 	/* Fill in the La_x86_64_retval structure.  */
00db10
 	movq %rax, LRV_RAX_OFFSET(%rcx)
00db10
@@ -225,26 +567,26 @@
00db10
 	movaps %xmm0, LRV_XMM0_OFFSET(%rcx)
00db10
 	movaps %xmm1, LRV_XMM1_OFFSET(%rcx)
00db10
 
00db10
-#ifdef RESTORE_AVX
00db10
+# ifdef RESTORE_AVX
00db10
 	/* This is to support AVX audit modules.  */
00db10
-	VMOV %VEC(0), LRV_VECTOR0_OFFSET(%rcx)
00db10
-	VMOV %VEC(1), LRV_VECTOR1_OFFSET(%rcx)
00db10
+	VMOVA %VEC(0), LRV_VECTOR0_OFFSET(%rcx)
00db10
+	VMOVA %VEC(1), LRV_VECTOR1_OFFSET(%rcx)
00db10
 
00db10
 	/* Save xmm0/xmm1 registers to detect if they are changed
00db10
 	   by audit module.  */
00db10
 	vmovdqa %xmm0,		  (LRV_SIZE)(%rcx)
00db10
 	vmovdqa %xmm1, (LRV_SIZE + XMM_SIZE)(%rcx)
00db10
-#endif
00db10
+# endif
00db10
 
00db10
-#ifndef __ILP32__
00db10
-# ifdef HAVE_MPX_SUPPORT
00db10
+# ifndef __ILP32__
00db10
+#  ifdef HAVE_MPX_SUPPORT
00db10
 	bndmov %bnd0, LRV_BND0_OFFSET(%rcx)  # Preserve returned bounds.
00db10
 	bndmov %bnd1, LRV_BND1_OFFSET(%rcx)
00db10
-# else
00db10
+#  else
00db10
 	.byte  0x66,0x0f,0x1b,0x81;.long (LRV_BND0_OFFSET)
00db10
 	.byte  0x66,0x0f,0x1b,0x89;.long (LRV_BND1_OFFSET)
00db10
+#  endif
00db10
 # endif
00db10
-#endif
00db10
 
00db10
 	fstpt LRV_ST0_OFFSET(%rcx)
00db10
 	fstpt LRV_ST1_OFFSET(%rcx)
00db10
@@ -261,49 +603,47 @@
00db10
 	movaps LRV_XMM0_OFFSET(%rsp), %xmm0
00db10
 	movaps LRV_XMM1_OFFSET(%rsp), %xmm1
00db10
 
00db10
-#ifdef RESTORE_AVX
00db10
+# ifdef RESTORE_AVX
00db10
 	/* Check if xmm0/xmm1 registers are changed by audit module.  */
00db10
 	vpcmpeqq (LRV_SIZE)(%rsp), %xmm0, %xmm2
00db10
 	vpmovmskb %xmm2, %esi
00db10
 	cmpl $0xffff, %esi
00db10
 	jne 1f
00db10
-	VMOV LRV_VECTOR0_OFFSET(%rsp), %VEC(0)
00db10
+	VMOVA LRV_VECTOR0_OFFSET(%rsp), %VEC(0)
00db10
 
00db10
 1:	vpcmpeqq (LRV_SIZE + XMM_SIZE)(%rsp), %xmm1, %xmm2
00db10
 	vpmovmskb %xmm2, %esi
00db10
 	cmpl $0xffff, %esi
00db10
 	jne 1f
00db10
-	VMOV LRV_VECTOR1_OFFSET(%rsp), %VEC(1)
00db10
+	VMOVA LRV_VECTOR1_OFFSET(%rsp), %VEC(1)
00db10
 
00db10
 1:
00db10
-#endif
00db10
+# endif
00db10
 
00db10
-#ifndef __ILP32__
00db10
-# ifdef HAVE_MPX_SUPPORT
00db10
-	bndmov LRV_BND0_OFFSET(%rcx), %bnd0  # Restore bound registers.
00db10
-	bndmov LRV_BND1_OFFSET(%rcx), %bnd1
00db10
-# else
00db10
-	.byte  0x66,0x0f,0x1a,0x81;.long (LRV_BND0_OFFSET)
00db10
-	.byte  0x66,0x0f,0x1a,0x89;.long (LRV_BND1_OFFSET)
00db10
+# ifndef __ILP32__
00db10
+#  ifdef HAVE_MPX_SUPPORT
00db10
+	bndmov LRV_BND0_OFFSET(%rsp), %bnd0  # Restore bound registers.
00db10
+	bndmov LRV_BND1_OFFSET(%rsp), %bnd1
00db10
+#  else
00db10
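+	# BNDMOV loads addressed via %rsp (ModRM 84/8C 24), matching
00db10
+	# the %rsp-relative restores above.
00db10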
+	.byte  0x66,0x0f,0x1a,0x84,0x24;.long (LRV_BND0_OFFSET)
00db10
+	.byte  0x66,0x0f,0x1a,0x8c,0x24;.long (LRV_BND1_OFFSET)
00db10
+#  endif
00db10
 # endif
00db10
-#endif
00db10
 
00db10
 	fldt LRV_ST1_OFFSET(%rsp)
00db10
 	fldt LRV_ST0_OFFSET(%rsp)
00db10
 
00db10
-	movq %rbx, %rsp
00db10
+	mov %RBX_LP, %RSP_LP
00db10
 	movq (%rsp), %rbx
00db10
-	cfi_restore(rbx)
00db10
+	cfi_restore(%rbx)
00db10
 	cfi_def_cfa_register(%rsp)
00db10
 
00db10
-	addq $48, %rsp		# Adjust the stack to the return value
00db10
+	add $48, %RSP_LP	# Adjust the stack to the return value
00db10
 				# (eats the reloc index and link_map)
00db10
 	cfi_adjust_cfa_offset(-48)
00db10
+	PRESERVE_BND_REGS_PREFIX
00db10
 	retq
00db10
 
00db10
-#ifdef MORE_CODE
00db10
-	cfi_adjust_cfa_offset(48)
00db10
-	cfi_rel_offset(%rbx, 0)
00db10
-	cfi_def_cfa_register(%rbx)
00db10
-# undef MORE_CODE
00db10
+	cfi_endproc
00db10
+	.size _dl_runtime_profile, .-_dl_runtime_profile
00db10
 #endif
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/ifuncmain8.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/ifuncmain8.c
00db10
@@ -0,0 +1,32 @@
00db10
+/* Test IFUNC selector with floating-point parameters.
00db10
+   Copyright (C) 2015 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <stdlib.h>
00db10
+
00db10
+extern float foo (float);
00db10
+
00db10
+static int
00db10
+do_test (void)
00db10
+{
00db10
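+  /* foo resolves to foo_impl in ifuncmod8.c, which returns x + 1,
00db10
+     so foo (2) must yield 3.  */
00db10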
+  if (foo (2) != 3)
00db10
+    abort ();
00db10
+  return 0;
00db10
+}
00db10
+
00db10
+#define TEST_FUNCTION do_test ()
00db10
+#include "../test-skeleton.c"
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/ifuncmod8.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/ifuncmod8.c
00db10
@@ -0,0 +1,36 @@
00db10
+/* Test IFUNC selector with floating-point parameters.
00db10
+   Copyright (C) 2015 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <emmintrin.h>
00db10
+
00db10
+void * foo_ifunc (void) __asm__ ("foo");
00db10
+__asm__(".type foo, %gnu_indirect_function");
00db10
+
00db10
+static float
00db10
+foo_impl (float x)
00db10
+{
00db10
+  return x + 1;
00db10
+}
00db10
+
00db10
+void *
00db10
+foo_ifunc (void)
00db10
+{
00db10
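+  /* Deliberately clobber %xmm0 inside the IFUNC selector; the
00db10
+     dynamic linker must preserve the caller's floating-point
00db10
+     arguments across IFUNC resolution.  */
00db10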
+  __m128i xmm = _mm_set1_epi32 (-1);
00db10
+  asm volatile ("movdqa %0, %%xmm0" : : "x" (xmm) : "xmm0" );
00db10
+  return foo_impl;
00db10
+}
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx-aux.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx-aux.c
00db10
@@ -0,0 +1,47 @@
00db10
+/* Test case for preserved AVX registers in dynamic linker, -mavx part.
00db10
+   Copyright (C) 2017 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <immintrin.h>
00db10
+#include <stdlib.h>
00db10
+#include <string.h>
00db10
+
00db10
+int
00db10
+tst_avx_aux (void)
00db10
+{
00db10
+#ifdef __AVX__
00db10
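+  /* avx_test is defined in tst-avxmod.c; it verifies each argument
00db10
+     and returns _mm256_set1_epi32 (0x12349876).  */
00db10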
+  extern __m256i avx_test (__m256i, __m256i, __m256i, __m256i,
00db10
+			   __m256i, __m256i, __m256i, __m256i);
00db10
+
00db10
+  __m256i ymm0 = _mm256_set1_epi32 (0);
00db10
+  __m256i ymm1 = _mm256_set1_epi32 (1);
00db10
+  __m256i ymm2 = _mm256_set1_epi32 (2);
00db10
+  __m256i ymm3 = _mm256_set1_epi32 (3);
00db10
+  __m256i ymm4 = _mm256_set1_epi32 (4);
00db10
+  __m256i ymm5 = _mm256_set1_epi32 (5);
00db10
+  __m256i ymm6 = _mm256_set1_epi32 (6);
00db10
+  __m256i ymm7 = _mm256_set1_epi32 (7);
00db10
+  __m256i ret = avx_test (ymm0, ymm1, ymm2, ymm3,
00db10
+			  ymm4, ymm5, ymm6, ymm7);
00db10
+  ymm0 =  _mm256_set1_epi32 (0x12349876);
00db10
+  if (memcmp (&ymm0, &ret, sizeof (ret)))
00db10
+    abort ();
00db10
+  return 0;
00db10
+#else  /* __AVX__ */
00db10
+  return 77;
00db10
+#endif  /* __AVX__ */
00db10
+}
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx.c
00db10
@@ -0,0 +1,49 @@
00db10
+/* Test case for preserved AVX registers in dynamic linker.
00db10
+   Copyright (C) 2017 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <cpuid.h>
00db10
+
00db10
+int tst_avx_aux (void);
00db10
+
00db10
+static int
00db10
+avx_enabled (void)
00db10
+{
00db10
+  unsigned int eax, ebx, ecx, edx;
00db10
+
00db10
+  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx) == 0
00db10
+      || (ecx & (bit_AVX | bit_OSXSAVE)) != (bit_AVX | bit_OSXSAVE))
00db10
+    return 0;
00db10
+
00db10
+  /* Check the OS has AVX and SSE saving enabled.  */
00db10
+  asm ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
00db10
+
00db10
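+  /* XCR0 bit 1 (SSE) and bit 2 (AVX) must both be set.  */
00db10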
+  return (eax & 6) == 6;
00db10
+}
00db10
+
00db10
+static int
00db10
+do_test (void)
00db10
+{
00db10
+  /* Run AVX test only if AVX is supported.  */
00db10
+  if (avx_enabled ())
00db10
+    return tst_avx_aux ();
00db10
+  else
00db10
+    return 77;
00db10
+}
00db10
+
00db10
+#define TEST_FUNCTION do_test ()
00db10
+#include "../../test-skeleton.c"
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512-aux.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512-aux.c
00db10
@@ -0,0 +1,48 @@
00db10
+/* Test case for preserved AVX512 registers in dynamic linker,
00db10
+   -mavx512 part.
00db10
+   Copyright (C) 2017 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <immintrin.h>
00db10
+#include <stdlib.h>
00db10
+#include <string.h>
00db10
+
00db10
+int
00db10
+tst_avx512_aux (void)
00db10
+{
00db10
+#ifdef __AVX512F__
00db10
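+  /* avx512_test is defined in tst-avx512mod.c; it verifies each
00db10
+     argument and returns _mm512_set1_epi32 (0x12349876).  */
00db10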
+  extern __m512i avx512_test (__m512i, __m512i, __m512i, __m512i,
00db10
+			      __m512i, __m512i, __m512i, __m512i);
00db10
+
00db10
+  __m512i zmm0 = _mm512_set1_epi32 (0);
00db10
+  __m512i zmm1 = _mm512_set1_epi32 (1);
00db10
+  __m512i zmm2 = _mm512_set1_epi32 (2);
00db10
+  __m512i zmm3 = _mm512_set1_epi32 (3);
00db10
+  __m512i zmm4 = _mm512_set1_epi32 (4);
00db10
+  __m512i zmm5 = _mm512_set1_epi32 (5);
00db10
+  __m512i zmm6 = _mm512_set1_epi32 (6);
00db10
+  __m512i zmm7 = _mm512_set1_epi32 (7);
00db10
+  __m512i ret = avx512_test (zmm0, zmm1, zmm2, zmm3,
00db10
+			     zmm4, zmm5, zmm6, zmm7);
00db10
+  zmm0 =  _mm512_set1_epi32 (0x12349876);
00db10
+  if (memcmp (&zmm0, &ret, sizeof (ret)))
00db10
+    abort ();
00db10
+  return 0;
00db10
+#else  /* __AVX512F__ */
00db10
+  return 77;
00db10
+#endif  /* __AVX512F__ */
00db10
+}
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512.c
00db10
@@ -0,0 +1,57 @@
00db10
+/* Test case for preserved AVX512 registers in dynamic linker.
00db10
+   Copyright (C) 2017 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <cpuid.h>
00db10
+
00db10
+int tst_avx512_aux (void);
00db10
+
00db10
+static int
00db10
+avx512_enabled (void)
00db10
+{
00db10
+#ifdef bit_AVX512F
00db10
+  unsigned int eax, ebx, ecx, edx;
00db10
+
00db10
+  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx) == 0
00db10
+      || (ecx & (bit_AVX | bit_OSXSAVE)) != (bit_AVX | bit_OSXSAVE))
00db10
+    return 0;
00db10
+
00db10
+  __cpuid_count (7, 0, eax, ebx, ecx, edx);
00db10
+  if (!(ebx & bit_AVX512F))
00db10
+    return 0;
00db10
+
00db10
+  asm ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
00db10
+
00db10
+  /* Verify that ZMM, YMM and XMM states are enabled.  */
00db10
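+  /* 0xe6 = XCR0 bits 1, 2, 5, 6 and 7: SSE, AVX, opmask,
00db10
+     ZMM_Hi256 and Hi16_ZMM state.  */
00db10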
+  return (eax & 0xe6) == 0xe6;
00db10
+#else
00db10
+  return 0;
00db10
+#endif
00db10
+}
00db10
+
00db10
+static int
00db10
+do_test (void)
00db10
+{
00db10
+  /* Run AVX512 test only if AVX512 is supported.  */
00db10
+  if (avx512_enabled ())
00db10
+    return tst_avx512_aux ();
00db10
+  else
00db10
+    return 77;
00db10
+}
00db10
+
00db10
+#define TEST_FUNCTION do_test ()
00db10
+#include "../../test-skeleton.c"
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512mod.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avx512mod.c
00db10
@@ -0,0 +1,48 @@
00db10
+/* Test case for x86-64 preserved AVX512 registers in dynamic linker.  */
00db10
+
00db10
+#ifdef __AVX512F__
00db10
+#include <stdlib.h>
00db10
+#include <string.h>
00db10
+#include <immintrin.h>
00db10
+
00db10
+__m512i
00db10
+avx512_test (__m512i x0, __m512i x1, __m512i x2, __m512i x3,
00db10
+	     __m512i x4, __m512i x5, __m512i x6, __m512i x7)
00db10
+{
00db10
+  __m512i zmm;
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (0);
00db10
+  if (memcmp (&zmm, &x0, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (1);
00db10
+  if (memcmp (&zmm, &x1, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (2);
00db10
+  if (memcmp (&zmm, &x2, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (3);
00db10
+  if (memcmp (&zmm, &x3, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (4);
00db10
+  if (memcmp (&zmm, &x4, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (5);
00db10
+  if (memcmp (&zmm, &x5, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (6);
00db10
+  if (memcmp (&zmm, &x6, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  zmm = _mm512_set1_epi32 (7);
00db10
+  if (memcmp (&zmm, &x7, sizeof (zmm)))
00db10
+    abort ();
00db10
+
00db10
+  return _mm512_set1_epi32 (0x12349876);
00db10
+}
00db10
+#endif
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-ssemod.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-ssemod.c
00db10
@@ -0,0 +1,46 @@
00db10
+/* Test case for x86-64 preserved SSE registers in dynamic linker.  */
00db10
+
00db10
+#include <stdlib.h>
00db10
+#include <string.h>
00db10
+#include <immintrin.h>
00db10
+
00db10
+__m128i
00db10
+sse_test (__m128i x0, __m128i x1, __m128i x2, __m128i x3,
00db10
+	  __m128i x4, __m128i x5, __m128i x6, __m128i x7)
00db10
+{
00db10
+  __m128i xmm;
00db10
+
00db10
+  xmm = _mm_set1_epi32 (0);
00db10
+  if (memcmp (&xmm, &x0, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (1);
00db10
+  if (memcmp (&xmm, &x1, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (2);
00db10
+  if (memcmp (&xmm, &x2, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (3);
00db10
+  if (memcmp (&xmm, &x3, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (4);
00db10
+  if (memcmp (&xmm, &x4, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (5);
00db10
+  if (memcmp (&xmm, &x5, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (6);
00db10
+  if (memcmp (&xmm, &x6, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  xmm = _mm_set1_epi32 (7);
00db10
+  if (memcmp (&xmm, &x7, sizeof (xmm)))
00db10
+    abort ();
00db10
+
00db10
+  return _mm_set1_epi32 (0x12349876);
00db10
+}
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-sse.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-sse.c
00db10
@@ -0,0 +1,46 @@
00db10
+/* Test case for preserved SSE registers in dynamic linker.
00db10
+   Copyright (C) 2017 Free Software Foundation, Inc.
00db10
+   This file is part of the GNU C Library.
00db10
+
00db10
+   The GNU C Library is free software; you can redistribute it and/or
00db10
+   modify it under the terms of the GNU Lesser General Public
00db10
+   License as published by the Free Software Foundation; either
00db10
+   version 2.1 of the License, or (at your option) any later version.
00db10
+
00db10
+   The GNU C Library is distributed in the hope that it will be useful,
00db10
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
00db10
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
00db10
+   Lesser General Public License for more details.
00db10
+
00db10
+   You should have received a copy of the GNU Lesser General Public
00db10
+   License along with the GNU C Library; if not, see
00db10
+   <http://www.gnu.org/licenses/>.  */
00db10
+
00db10
+#include <immintrin.h>
00db10
+#include <stdlib.h>
00db10
+#include <string.h>
00db10
+
00db10
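+/* sse_test is defined in tst-ssemod.c; it verifies each argument and
00db10
+   returns _mm_set1_epi32 (0x12349876).  */
00db10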
+extern __m128i sse_test (__m128i, __m128i, __m128i, __m128i,
00db10
+                        __m128i, __m128i, __m128i, __m128i);
00db10
+
00db10
+static int
00db10
+do_test (void)
00db10
+{
00db10
+  __m128i xmm0 = _mm_set1_epi32 (0);
00db10
+  __m128i xmm1 = _mm_set1_epi32 (1);
00db10
+  __m128i xmm2 = _mm_set1_epi32 (2);
00db10
+  __m128i xmm3 = _mm_set1_epi32 (3);
00db10
+  __m128i xmm4 = _mm_set1_epi32 (4);
00db10
+  __m128i xmm5 = _mm_set1_epi32 (5);
00db10
+  __m128i xmm6 = _mm_set1_epi32 (6);
00db10
+  __m128i xmm7 = _mm_set1_epi32 (7);
00db10
+  __m128i ret = sse_test (xmm0, xmm1, xmm2, xmm3,
00db10
+                         xmm4, xmm5, xmm6, xmm7);
00db10
+  xmm0 =  _mm_set1_epi32 (0x12349876);
00db10
+  if (memcmp (&xmm0, &ret, sizeof (ret)))
00db10
+    abort ();
00db10
+  return 0;
00db10
+}
00db10
+
00db10
+#define TEST_FUNCTION do_test ()
00db10
+#include "../../test-skeleton.c"
00db10
Index: glibc-2.17-c758a686/sysdeps/x86_64/tst-avxmod.c
00db10
===================================================================
00db10
--- /dev/null
00db10
+++ glibc-2.17-c758a686/sysdeps/x86_64/tst-avxmod.c
00db10
@@ -0,0 +1,49 @@
00db10
+
00db10
+/* Test case for x86-64 preserved AVX registers in dynamic linker.  */
00db10
+
00db10
+#ifdef __AVX__
00db10
+#include <stdlib.h>
00db10
+#include <string.h>
00db10
+#include <immintrin.h>
00db10
+
00db10
+__m256i
00db10
+avx_test (__m256i x0, __m256i x1, __m256i x2, __m256i x3,
00db10
+	  __m256i x4, __m256i x5, __m256i x6, __m256i x7)
00db10
+{
00db10
+  __m256i ymm;
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (0);
00db10
+  if (memcmp (&ymm, &x0, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (1);
00db10
+  if (memcmp (&ymm, &x1, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (2);
00db10
+  if (memcmp (&ymm, &x2, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (3);
00db10
+  if (memcmp (&ymm, &x3, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (4);
00db10
+  if (memcmp (&ymm, &x4, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (5);
00db10
+  if (memcmp (&ymm, &x5, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (6);
00db10
+  if (memcmp (&ymm, &x6, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  ymm = _mm256_set1_epi32 (7);
00db10
+  if (memcmp (&ymm, &x7, sizeof (ymm)))
00db10
+    abort ();
00db10
+
00db10
+  return _mm256_set1_epi32 (0x12349876);
00db10
+}
00db10
+#endif