Restartable sequences system call (v7)
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path. A locking-based
fall-back, purely implemented in user-space, is proposed here to deal
with debugger single-stepping. This fallback interacts with rseq_start()
and rseq_finish(), which force retries in response to concurrent
lock-based activity.
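The retry-then-lock control flow described above can be pictured with a small single-threaded model. This is only a sketch under stated assumptions: the model_* names, the struct layouts and the "disturb" parameter are hypothetical and are not part of the proposed ABI. In the real mechanism the commit is a single store inside an assembly block that the kernel can abort, which plain C cannot express.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-thread area; not the real struct rseq. */
struct rseq_model {
	uint32_t event_counter;	/* bumped by the simulated kernel on a disturbance */
	int32_t cpu_id;		/* CPU the thread is assumed to run on */
};

struct rseq_snapshot {
	uint32_t event_counter;
	int32_t cpu_id;
};

/* Record the state observed when the sequence starts. */
static struct rseq_snapshot model_rseq_start(const struct rseq_model *m)
{
	struct rseq_snapshot s = { m->event_counter, m->cpu_id };
	return s;
}

/* Commit the store only if nothing disturbed the thread since the start. */
static bool model_rseq_finish(const struct rseq_model *m, struct rseq_snapshot s,
			      long *target, long newval)
{
	if (m->event_counter != s.event_counter)
		return false;
	*target = newval;
	return true;
}

#define MODEL_MAX_FAST_ATTEMPTS 2	/* fall back to locking after 2 failures */

/*
 * Increment the counter of the current CPU. "disturb" is the number of
 * attempts that the simulated kernel interrupts. Returns true when the
 * fast path committed, false when the fallback was used.
 */
static bool model_percpu_inc(struct rseq_model *m, long *counters, int disturb)
{
	assert(disturb >= 0);
	for (int attempt = 0; attempt < MODEL_MAX_FAST_ATTEMPTS; attempt++) {
		struct rseq_snapshot s = model_rseq_start(m);
		long newval = counters[s.cpu_id] + 1;

		if (disturb > 0) {	/* simulated preemption, signal or migration */
			m->event_counter++;
			disturb--;
		}
		if (model_rseq_finish(m, s, &counters[s.cpu_id], newval))
			return true;
	}
	/* Fallback: a real implementation would take a lock around this update. */
	counters[m->cpu_id]++;
	return false;
}
```

With no disturbance the first attempt commits. With one disturbance the second attempt commits. With two, the model gives up on the fast path and performs the update on the fallback path.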
Here are benchmarks of counter increment in various scenarios compared
to restartable sequences:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
Counter increment speed (ns/increment)
                             1 thread    2 threads
global increment (baseline)         6          N/A
percpu rseq increment              50           52
percpu rseq spinlock               94           94
global atomic increment            48           74 (__sync_add_and_fetch_4)
global atomic CAS                  50          172 (__sync_val_compare_and_swap_4)
global pthread mutex              148          862
ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard
Counter increment speed (ns/increment)
                             1 thread    4 threads
global increment (baseline)         7          N/A
percpu rseq increment              50           50
percpu rseq spinlock               82           84
global atomic increment            44          262 (__sync_add_and_fetch_4)
global atomic CAS                  46          316 (__sync_val_compare_and_swap_4)
global pthread mutex              146         1400
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
Counter increment speed (ns/increment)
                             1 thread    8 threads
global increment (baseline)       3.0          N/A
percpu rseq increment             3.6          3.8
percpu rseq spinlock              5.6          6.2
global LOCK; inc                  8.0        166.4
global LOCK; cmpxchg             13.4        435.2
global pthread mutex             25.2       1363.6
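For reference, the "global atomic increment" and "global atomic CAS" rows correspond to operations along the following lines. This is a minimal sketch using the GCC/Clang __sync builtins named in the ARM tables; the function names are hypothetical and the actual benchmark source is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Atomic increment of a shared counter; returns the new value. */
static uint32_t counter_add_and_fetch(uint32_t *counter)
{
	assert(counter != NULL);
	return __sync_add_and_fetch(counter, 1);
}

/* Increment built on a compare-and-swap retry loop; returns the new value. */
static uint32_t counter_cas(uint32_t *counter)
{
	uint32_t old, seen;

	assert(counter != NULL);
	seen = *(volatile uint32_t *)counter;
	do {
		old = seen;
		/* Returns the value found in memory; equals old on success. */
		seen = __sync_val_compare_and_swap(counter, old, old + 1);
	} while (seen != old);
	return old + 1;
}
```

Both variants operate on a single cache line shared by all threads, which is consistent with their cost growing with the thread count in the tables above.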
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler migration set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space improves on the mechanisms currently available to read the
current CPU number. It has the following benefits over alternative
approaches:
- 35x speedup on ARM vs system call through glibc.
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction.
- 14x speedup on x86 compared to inlined "lsl" instruction.
- Unlike vdso approaches, this cpu_id value can be read from inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
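One reading that reproduces the ARM figure is to subtract the empty-loop baseline from both measurements before dividing. The sketch below applies that assumption to the Cubietruck numbers reported in this message (301.8 ns for glibc getcpu, 16.7 ns for the rseq cpu_id read, 8.4 ns for the empty loop). The helper name is hypothetical, and this is not necessarily how the quoted ratios were computed.

```c
#include <assert.h>

/*
 * Speedup of "fast" over "slow" once the loop overhead is removed.
 * Only meaningful when fast_ns is measurably above baseline_ns.
 */
static double baseline_adjusted_speedup(double slow_ns, double fast_ns,
					double baseline_ns)
{
	assert(fast_ns > baseline_ns);
	return (slow_ns - baseline_ns) / (fast_ns - baseline_ns);
}
```

For the ARM numbers this gives (301.8 - 8.4) / (16.7 - 8.4), roughly 35. On x86 the rseq read is indistinguishable from the empty loop at the reported precision, so the same calculation is not meaningful there.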
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
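For comparison, the slowest row ("getcpu system call") corresponds to entering the kernel on every query. A minimal sketch, assuming a Linux target where SYS_getcpu is defined; the wrapper name is hypothetical:

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for the current CPU number; returns -1 on failure. */
static int current_cpu_via_syscall(void)
{
	unsigned int cpu = 0;

	/* The node and cache arguments are not needed here. */
	if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
		return -1;
	return (int)cpu;
}
```

The value can be stale by the time it is used, since the thread may migrate right after the call returns. The same caveat applies to the cpu_id cache.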
- Speed
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
2855 bytes, and the data size increase of vmlinux is 1024 bytes.
* CONFIG_RSEQ=n
   text    data     bss      dec     hex filename
9964559 4256280  962560 15183399  e7ae27 vmlinux.norseq
* CONFIG_RSEQ=y
   text    data     bss      dec     hex filename
9967414 4257304  962560 15187278  e7bd4e vmlinux.rseq
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
sizeof(int32_t).
- Update man page to describe the pointer alignment requirements and
update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.
Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
defining this enumeration.
- Split resume notifier architecture implementation from the system call
wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
set the current cpu cache pointer before doing the cache update, and
set it back to NULL if the update fails. Setting it back to NULL on
error ensures that no resume notifier will trigger a SIGSEGV if a
migration happened concurrently.
Changes since v3:
- Fix __user annotations in compat code.
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.
Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
to change log.
Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", so that this system call
can be extended to cover future features such as restartable critical
sections. Generalizing this system call ensures that we can add
features similar to the cpu_id field within the same cache-line
without having to track one pointer per feature within the task
struct.
- Add a tlabi_nr parameter to the system call, so that the ABI can be
extended beyond the initial 64-byte structure by registering structures
with tlabi_nr greater than 0. The initial ABI structure is associated
with tlabi_nr 0.
- Rebased on kernel v4.5.
Changes since v6:
- Integrate "restartable sequences" v2 patchset from Paul Turner.
- Add handling of single-stepping purely in user-space, with a
fallback to locking after 2 rseq failures to ensure progress, and
by exposing a __rseq_table section to debuggers so they know where
to put breakpoints when dealing with rseq assembly blocks which
can be aborted at any point.
- Make the code and ABI generic: porting the kernel implementation
simply requires wiring up the signal handler and return to user-space
hooks, and allocating the syscall number.
- Extend testing with a fully configurable test program. See
param_spinlock_test -h for details.
- Handle rseq ENOSYS in user-space, also with a fallback
to locking.
- Modify Paul Turner's rseq ABI to only require a single TLS store on
the user-space fast-path, removing the need to populate two additional
registers. This is made possible by introducing struct rseq_cs into
the ABI to describe a critical section start_ip, post_commit_ip, and
abort_ip.
- Rebased on kernel v4.7-rc7.
Man page associated:
RSEQ(2) Linux Programmer's Manual RSEQ(2)
NAME
rseq - Restartable sequences and cpu number cache
SYNOPSIS
#include <linux/rseq.h>
int rseq(struct rseq * rseq, int flags);
DESCRIPTION
The rseq() ABI accelerates user-space operations on per-cpu
data by defining a shared data structure ABI between each user-
space thread and the kernel.
The rseq argument is a pointer to the thread-local rseq struc‐
ture to be shared between kernel and user-space. A NULL rseq
value can be used to check whether rseq is registered for the
current thread.
The layout of struct rseq is as follows:
Structure alignment
This structure needs to be aligned on multiples of 64
bits.
Structure size
This structure has a fixed size of 128 bits.
Fields
cpu_id
Cache of the CPU number on which the calling thread is
running.
event_counter
Restartable sequences event_counter field.
rseq_cs
Restartable sequences rseq_cs field. Points to a struct
rseq_cs.
The layout of struct rseq_cs is as follows:
Structure alignment
This structure needs to be aligned on multiples of 64
bits.
Structure size
This structure has a fixed size of 192 bits.
Fields
start_ip
Instruction pointer address of the first instruction of
the sequence of consecutive assembly instructions.
post_commit_ip
Instruction pointer address after the last instruction
of the sequence of consecutive assembly instructions.
abort_ip
Instruction pointer address where to move the execution
flow in case of abort of the sequence of consecutive
assembly instructions.
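As an illustration of how the three fields relate, the sketch below decides where execution would resume if a thread is interrupted at a given instruction pointer. It is a model under assumptions: the struct name, the 64-bit field widths and the half-open range [start_ip, post_commit_ip) are illustrative choices, not a statement of the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the documented fields; widths are an assumption. */
struct rseq_cs_sketch {
	uint64_t start_ip;		/* first instruction of the sequence */
	uint64_t post_commit_ip;	/* address just after the last instruction */
	uint64_t abort_ip;		/* where to resume if the sequence is aborted */
};

/* True when ip lies inside the sequence, i.e. before the commit completed. */
static bool ip_in_critical_section(const struct rseq_cs_sketch *cs, uint64_t ip)
{
	return ip >= cs->start_ip && ip < cs->post_commit_ip;
}

/* Instruction pointer to resume at after an interruption at ip. */
static uint64_t resume_ip(const struct rseq_cs_sketch *cs, uint64_t ip)
{
	if (cs != NULL && ip_in_critical_section(cs, ip))
		return cs->abort_ip;
	return ip;	/* no active sequence, or the commit already happened */
}
```

An interruption at post_commit_ip itself leaves the flow untouched in this model, because the commit instruction has already executed.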
The flags argument is currently unused and must be specified as
0.
Typically, a library or application will keep the rseq struc‐
ture in a thread-local storage variable, or other memory areas
belonging to each thread. It is recommended to perform volatile
reads of the thread-local cache to prevent the compiler from
doing load tearing. An alternative approach is to read each
field from inline assembly.
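A volatile read can be expressed with a small macro, in the spirit of the kernel's READ_ONCE(). The sketch below relies on the GCC/Clang __typeof__ extension; LOAD_ONCE and struct cpu_cache_sketch are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Force a single access of the object's own width (GCC/Clang extension). */
#define LOAD_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the registered thread-local area. */
struct cpu_cache_sketch {
	int32_t cpu_id;
};

/* Read the cached CPU number with one volatile 32-bit load. */
static int32_t read_cpu_id(struct cpu_cache_sketch *cache)
{
	assert(cache != NULL);
	return LOAD_ONCE(cache->cpu_id);
}
```

Casting at the point of use keeps the variable itself non-volatile, so only the reads that need the guarantee pay for it.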
Each thread is responsible for registering its rseq structure.
Only one rseq structure address can be registered per thread.
Once set, the rseq address is idempotent for a given thread.
In a typical usage scenario, the thread registering the rseq
structure will be performing loads and stores from/to that
structure. It is however also allowed to read that structure
from other threads. The rseq field updates performed by the
kernel provide single-copy atomicity semantics, which guarantee
that other threads performing single-copy atomic reads of the
cpu number cache will always observe a consistent value.
Memory registered as rseq structure should never be deallocated
before the thread which registered it exits: specifically, it
should not be freed, and the library containing the registered
thread-local storage should not be dlclose'd. Violating this
constraint may cause a SIGSEGV signal to be delivered to the
thread.
Unregistration of the associated rseq structure is implicitly
performed when a thread or process exits.
RETURN VALUE
A return value of 0 indicates success. On error, -1 is
returned, and errno is set appropriately.
ERRORS
EINVAL Either flags is non-zero, or rseq contains an address
which is not appropriately aligned.
ENOSYS The rseq() system call is not implemented by this ker‐
nel.
EFAULT rseq is an invalid address.
EBUSY The rseq argument contains a non-NULL address which dif‐
fers from the memory location already registered for
this thread.
ENOENT The rseq argument is NULL, but no memory location is
currently registered for this thread.
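A caller might map these error codes to a course of action as sketched below. The enum and function names are hypothetical, and the policy shown (treating ENOSYS as a cue to use a fallback, and EINVAL or EFAULT as caller bugs) is one reasonable choice rather than a requirement of the interface.

```c
#include <errno.h>

/* Hypothetical classification of an rseq() registration or query result. */
enum rseq_reg_action {
	RSEQ_REG_OK,			/* call succeeded */
	RSEQ_REG_USE_FALLBACK,		/* ENOSYS: kernel lacks rseq() */
	RSEQ_REG_OTHER_ADDRESS,		/* EBUSY: another address is registered */
	RSEQ_REG_NOT_REGISTERED,	/* ENOENT: NULL query, nothing registered */
	RSEQ_REG_CALLER_BUG		/* EINVAL, EFAULT or anything unexpected */
};

/* rc is the return value of the call, err the errno value it left behind. */
static enum rseq_reg_action classify_rseq_result(int rc, int err)
{
	if (rc == 0)
		return RSEQ_REG_OK;
	switch (err) {
	case ENOSYS:
		return RSEQ_REG_USE_FALLBACK;
	case EBUSY:
		return RSEQ_REG_OTHER_ADDRESS;
	case ENOENT:
		return RSEQ_REG_NOT_REGISTERED;
	default:
		return RSEQ_REG_CALLER_BUG;
	}
}
```

Passing errno as an argument, rather than reading it inside the function, keeps the classification easy to test without issuing the system call.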
VERSIONS
The rseq() system call was added in Linux 4.X (TODO).
CONFORMING TO
rseq() is Linux-specific.
EXAMPLE
The following code uses the rseq() system call to keep a
thread-local storage variable up to date with the current CPU
number, with a fallback on sched_getcpu(3) if the cache is not
available. For simplicity, registration is done in main(), but
multithreaded programs would need to invoke rseq() from each
program thread.
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <sched.h>
#include <stddef.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

/* A negative cpu_id means the cache is not (yet) valid. */
static __thread volatile struct rseq rseq_state = {
    .u.e.cpu_id = -1,
};

static int
sys_rseq(volatile struct rseq *rseq_abi, int flags)
{
    return syscall(__NR_rseq, rseq_abi, flags);
}

static int32_t
rseq_current_cpu_raw(void)
{
    return rseq_state.u.e.cpu_id;
}

/* Read the cached CPU number, falling back on sched_getcpu(3). */
static int32_t
rseq_current_cpu(void)
{
    int32_t cpu;

    cpu = rseq_current_cpu_raw();
    if (cpu < 0)
        cpu = sched_getcpu();
    return cpu;
}

/* Register the thread-local rseq structure for the calling thread. */
static int
rseq_init_current_thread(void)
{
    int rc;

    rc = sys_rseq(&rseq_state, 0);
    if (rc) {
        fprintf(stderr, "Error: sys_rseq(...) failed(%d): %s\n",
            errno, strerror(errno));
        return -1;
    }
    return 0;
}

int
main(int argc, char **argv)
{
    if (rseq_init_current_thread()) {
        fprintf(stderr,
            "Unable to initialize restartable sequences.\n");
        fprintf(stderr, "Using sched_getcpu() as fallback.\n");
    }
    printf("Current CPU number: %d\n", rseq_current_cpu());
    exit(EXIT_SUCCESS);
}
SEE ALSO
sched_getcpu(3)
Linux 2016-07-19 RSEQ(2)