Intel engineer Noah Goldstein has optimized the memset() function in the glibc library. The optimization yields a performance gain of about 7.5% on desktop processors of the Skylake-X and Ice Lake architectures. On server versions the gain is slightly smaller, primarily because single-core performance is lower overall.
The previous implementation of memset() relied on the rep stosb assembly instruction. Until recently this instruction was fast thanks to an in-processor optimization known as zero-over-zero writeback. However, a potential vulnerability was found in that optimization that could enable a side-channel attack. As a result, the zero-over-zero writeback optimization was disabled, which degraded the performance of rep stosb. The new version of memset() still uses the rep stosb instruction, but under stricter conditions.
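For readers unfamiliar with the instruction, here is a minimal sketch of a byte fill built on rep stosb. It is not the glibc code: the function name fill_bytes is hypothetical, the inline assembly assumes GCC or Clang on x86, and a plain byte loop is used as a fallback elsewhere.

```c
#include <stddef.h>

/* Hypothetical helper, not part of glibc: fill n bytes at dst with c. */
static void *fill_bytes(void *dst, int c, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* rep stosb stores AL to [RDI] and advances RDI, repeating RCX times.
       The ABI guarantees the direction flag is clear on function entry. */
    __asm__ volatile ("rep stosb"
                      : "+D"(d), "+c"(n)
                      : "a"(c)
                      : "memory");
#else
    /* Portable fallback for other compilers and architectures. */
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char)c;
#endif
    return dst;
}
```

A single instruction handles the whole fill, which is why its speed depends so heavily on how the processor implements it internally.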
What exactly has changed can be seen from the change to the following code comment, which describes the implementation details of memset():
Previous version of the description:
/* memset is implemented as:
   1. Use overlapping store to avoid branch.
   2. If size is less than VEC, use integer register stores.
   3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
   4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
   5. On machines ERMS feature, if size is greater or equal than
      __x86_rep_stosb_threshold then REP STOSB will be used.
   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
      4 VEC stores and store 4 * VEC at a time until done.  */
New version of the description:
/* memset is implemented as:
   1. Use overlapping store to avoid branch.
   2. If size is less than VEC, use integer register stores.
   3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
   4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
      4 VEC stores and store 4 * VEC at a time until done.
   6. On machines ERMS feature, if size is range
      [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
      then REP STOSB will be used.
   7. If size >= __x86_memset_non_temporal_threshold, use a
      non-temporal stores.  */
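The size-based selection in the new description can be sketched in C roughly as follows. This is only an illustration of the decision order, not the glibc implementation (which is written in assembly): the function name choose_memset_path, the enum, and all three constant values are assumptions, and the boundary handling (inclusive upper bounds for the 2 VEC and 4 VEC cases) is one reading of the comment. In glibc the real thresholds are determined at run time.

```c
#include <stddef.h>

/* Assumed values for illustration only; the real thresholds are
   computed by glibc at run time from CPU features and cache sizes. */
#define VEC_SIZE               32u
#define REP_STOSB_THRESHOLD    2048u
#define NON_TEMPORAL_THRESHOLD (1u << 20)

/* Hypothetical labels for the strategies named in the comment. */
enum memset_path {
    PATH_INTEGER_STORES,  /* step 2: size < VEC_SIZE                  */
    PATH_2_VEC,           /* step 3: VEC_SIZE .. 2 * VEC_SIZE         */
    PATH_4_VEC,           /* step 4: 2 * VEC_SIZE .. 4 * VEC_SIZE     */
    PATH_4_VEC_LOOP,      /* step 5: aligned loop, 4 * VEC at a time  */
    PATH_REP_STOSB,       /* step 6: ERMS only, bounded range         */
    PATH_NON_TEMPORAL     /* step 7: very large sizes                 */
};

/* Pick a strategy for a fill of n bytes; has_erms is nonzero when the
   CPU advertises the ERMS feature. */
enum memset_path choose_memset_path(size_t n, int has_erms)
{
    if (n < VEC_SIZE)
        return PATH_INTEGER_STORES;
    if (n <= 2 * VEC_SIZE)
        return PATH_2_VEC;
    if (n <= 4 * VEC_SIZE)
        return PATH_4_VEC;
    if (n >= NON_TEMPORAL_THRESHOLD)
        return PATH_NON_TEMPORAL;
    if (has_erms && n >= REP_STOSB_THRESHOLD)
        return PATH_REP_STOSB;
    return PATH_4_VEC_LOOP;
}
```

The key difference from the previous description is visible in the last two checks: REP STOSB now has an upper bound, and sizes at or above the non-temporal threshold take a separate path.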
Source: linux.org.ru
