Real-time scheduler load balancing
Audience: Realtime program authors and anyone interested in scheduling
Author: axvon
Our realtime thread scheduling load balancing philosophy is built around the following “overarching theory statement”:
“Always pull, never push.”
Whenever load balancing occurs, there is always a “positive” party, and a “negative” party.
The positive party is the one which gets load removed, and the negative party is the one that gets load added.
Within realtime scheduling, some amount of non-determinism will always be present (e.g. the branch predictor decides to evict entries from its BTB, the cache misses a few times here and there, a spinlock is spun on for a few extra cycles).
Thus, our goal is NOT to completely remove non-determinism, as that would be a somewhat infeasible plumbing job, but rather to minimize non-determinism and its harm.
Load balancing has long been a topic of great contention in realtime scheduling. Although EDF is a provably optimal algorithm on uniprocessor systems, its performance and optimality vary on multiprocessor machines.
Load balancing in a realtime operating system that supports timesharing scheduling is inherently non-deterministic. The outcome of a load balance is almost entirely dependent on the various loads present on the different nodes and logical processors.
However, we can reduce this non-determinism by making the load balancing unidirectional, or in other words, only allowing one kind of migration.
There are two main kinds of load migration:
Pushing is when the positive party actively offloads its work onto the negative party, with the negative party having no say in the matter.
Pulling is when the negative party actively takes work from the positive party, with the positive party having little say in the matter.
Within realtime scheduling, the impact of the non-determinism from pushing is clearly much more significant and harmful than pulling.
Imagine this:
CPU0 is happily running its 3 realtime tasks
CPU1 is unhappily running its 9 realtime tasks
CPU0 is making its way through a long-running compute task, with 2 more interactive tasks in the queue
CPU1 is switching more rapidly between its 9 tasks
Then, all of a sudden, CPU1 context switches, and exclaims “I can’t take this anymore!”, and it pushes 3 of its tasks onto CPU0.
Now, CPU0 has had an influx of work, but it is still running its compute task. This brings a major problem:
“The work we just moved has no guarantee that it will run.”
And we can end up in this scenario:
CPU0 is happily running one of its now 6 realtime tasks, while the newly pushed tasks sit in the queue expecting to be switched to, yet that switch never happens.
CPU1 is now a bit less frustrated about its load, but 3 tasks that were originally assigned to it are waiting in another runqueue, and it can no longer choose when those 3 tasks get to run again unless it tracks the affinity of each task it was running and moves them back.
Now, compare this to a pull-only policy:
CPU0 is happily running its 3 realtime tasks
CPU1 is unhappily running its 9 realtime tasks
CPU0 context switches, and notices that CPU1 has some extra work it can take. It decides to pull it over, and then immediately run it.
Now, although CPU0 has had work added, that work is guaranteed runtime after the migration, potentially allowing it to complete early or giving it more CPU time.
From this example, we can see how pull-only migration provides stricter, deterministic guarantees about whether or not a task will run after being migrated. This is the primary reason why we are choosing this paradigm, in addition to a slight improvement in when tasks get migrated (i.e. on the negative party’s side, an explicit call to yield the processor must be made for work to be added, as opposed to push migration, where the negative party can have work added to it at any time).
TODO
The modular realtime scheduler system of this kernel allows realtime schedulers to fail, and fails upwards. TODO
The realtime scheduler sits under the broader scheduling and multitasking capabilities of this kernel, in contrast to the primary timesharing scheduler, which is detailed to a greater extent elsewhere.
TODO
TODO
TODO
TODO
struct rt_slot_request {
    char *name;
    enum rt_slot_priority prio;
    int32_t mapped_to;
};

struct rt_scheduler_blocklist_node {
    struct list_head list;
    char *blocked_scheduler_name;
};

struct rt_scheduler_percpu_permitted {
    enum rt_scheduler_capability allowed_capabilities;
    struct list_head blocklist;
    struct spinlock lock;
};

struct rt_thread_summary {
    enum rt_scheduler_status status;
    rt_weight_t weight;
    rt_weight_t weight_ewma;
    rt_urgency_t urgency;
    rt_urgency_t urgency_ewma;
};

struct rt_thread_summary_ext {
    struct rt_scheduler_static *source;
    void *private;
};

struct rt_thread_shed_request {
    bool on;
    rt_urgency_t urgency;
    size_t threads_available;
    struct list_head threads;
};

struct rt_scheduler_ops {
    rt_scheduler_static_fn on_load;
    rt_scheduler_static_fn on_unload;
    rt_scheduler_fn init;
    rt_scheduler_fn destroy;
    rt_scheduler_fn switch_in;
    rt_scheduler_fn switch_out;
    rt_scheduler_fn on_failure;
    enum rt_scheduler_error (*migrate_twin)(struct rt_scheduler *self,
                                            struct rt_scheduler *other);
    struct thread *(*pick_thread)(struct rt_scheduler *);
    rt_scheduler_thread_fn add_thread;
    rt_scheduler_thread_fn remove_thread;
    struct rt_thread_summary_ext (*get_summary_ext)(struct rt_scheduler *);
    rt_scheduler_fn on_tick;
    void (*return_all_threads)(struct rt_scheduler *, struct list_head *);
    rt_domain_id_t (*domain_id_for_cpu)(struct core *);
};

struct rt_ext_fn {
    char *name;
    uint32_t id;
    uintptr_t (*fn)(uintptr_t, uintptr_t);
};

struct rt_scheduler_mapping {
    struct rbt_node tree_node;
    struct rt_scheduler_static *static_bptr;
    rt_domain_id_t id;
    struct cpu_mask members;
    struct cpu_mask active;
    struct rt_scheduler *rts;
    struct spinlock lock;
    void *data;
};

struct rt_scheduler_static {
    struct list_head list_internal;
    struct rbt mappings_internal;
    struct cpu_mask *active_mask_internal;
    refcount_t refcount;
    struct work teardown_work;
    enum rt_scheduler_static_state state;
    struct spinlock state_change_lock;
    char *name;
    struct rt_scheduler_ops ops;
    enum rt_scheduler_topo_level topo_level;
    enum rt_scheduler_capability capabilities;
    size_t num_slot_requests;
    struct rt_slot_request slot_requests[RT_SCHEDULER_SLOTS_PER_THREAD];
    size_t num_ext_fns;
    struct rt_ext_fn ext_fns[];
};

struct rt_scheduler {
    struct list_head list;
    struct core *owner;
    struct rt_scheduler_mapping *mapping_source;
    bool failed_internal;
    struct rt_thread_summary summary;
    struct rt_thread_shed_request shed_request;
    struct log_site *log_site;
    struct log_handle log_handle;
    size_t thread_count;
    struct spinlock lock;
};

struct rt_scheduler_percpu {
    struct scheduler *scheduler;
    struct rt_scheduler *born_with;
    struct rt_scheduler_mapping *active_mapping;
    struct rt_scheduler_percpu_permitted perms;
    struct semaphore switch_semaphore;
    struct log_site *log_site;
    struct log_handle log_handle;
    enum rt_scheduler_error switch_code;
    struct rt_scheduler_static *switch_into;
};

struct rt_global {
    struct locked_list static_list;
    struct locked_list *sch_pool;
    struct spinlock switch_lock;
};

enum rt_scheduler_topo_level {
    RT_SCHEDULER_TOPO_SMT,
    RT_SCHEDULER_TOPO_CORE,
    RT_SCHEDULER_TOPO_NUMA,
    RT_SCHEDULER_TOPO_LLC,
    RT_SCHEDULER_TOPO_PACKAGE,
    RT_SCHEDULER_TOPO_MACHINE,
    RT_SCHEDULER_TOPO_CUSTOM,
};

enum rt_scheduler_status {
    RT_SCHEDULER_STATUS_OK = 0,
    RT_SCHEDULER_STATUS_DEGRADED = 1 << 3,
    RT_SCHEDULER_STATUS_THROTTLED = 1 << 4,
    RT_SCHEDULER_STATUS_ISOLATED = 1 << 5,
};

enum rt_scheduler_static_state {
    RT_SCHEDULER_STATIC_UNLOADED,
    RT_SCHEDULER_STATIC_LOADED,
    RT_SCHEDULER_STATIC_DESTROYING,
};

enum rt_slot_priority {
    RT_SLOT_REQUIRED,
    RT_SLOT_OPTIONAL,
};

typedef size_t rt_domain_id_t;
typedef int32_t rt_weight_t;
typedef fx16_16_t rt_urgency_t;

typedef enum rt_scheduler_error (*rt_scheduler_fn)(struct rt_scheduler *);
typedef enum rt_scheduler_error (*rt_scheduler_static_fn)(struct rt_scheduler_static *);
typedef void (*rt_scheduler_thread_fn)(struct rt_scheduler *, struct thread *);

#define rt_sched_log(lvl, fmt, ...) \
    log(LOG_SITE(rt_sched), LOG_HANDLE(rt_sched), lvl, fmt, ##__VA_ARGS__)
#define rt_sched_err(fmt, ...)   rt_sched_log(LOG_ERROR, fmt, ##__VA_ARGS__)
#define rt_sched_warn(fmt, ...)  rt_sched_log(LOG_WARN, fmt, ##__VA_ARGS__)
#define rt_sched_info(fmt, ...)  rt_sched_log(LOG_INFO, fmt, ##__VA_ARGS__)
#define rt_sched_debug(fmt, ...) rt_sched_log(LOG_DEBUG, fmt, ##__VA_ARGS__)
#define rt_sched_trace(fmt, ...) rt_sched_log(LOG_TRACE, fmt, ##__VA_ARGS__)

#define RT_SCHEDULER_SLOTS_PER_THREAD 8