Real-time scheduler load balancing
Audience: Realtime program authors and anyone interested in scheduling
Author: axvon
Our realtime thread scheduling load balancing philosophy is built around the following “overarching theory statement”:
“Always pull, never push.”
Whenever load balancing occurs, there is always a “positive” party, and a “negative” party.
The positive party is the one which gets load removed, and the negative party is the one that gets load added.
Within realtime scheduling, some amount of non-determinism will always be present (e.g. the branch predictor decides to evict entries from its BTB, the cache misses a few times here and there, a spinlock is spun on for a few extra cycles).
Thus, our goal is NOT to completely remove non-determinism, as that would be a somewhat infeasible plumbing job, but rather to minimize non-determinism and its harm.
Load balancing has long been a topic of great contention in realtime scheduling. Although EDF is a provably optimal algorithm on uniprocessor systems, its performance and optimality vary on multiprocessor machines.
Load balancing in a realtime operating system that supports timesharing scheduling is inherently non-deterministic. The outcome of a load balance is almost entirely dependent on the various loads present on the different nodes and logical processors.
However, we can reduce this non-determinism by making the load balancing unidirectional, or in other words, only allowing one kind of migration.
There are two main kinds of load migration:
Pushing is when the positive party actively offloads its work onto the negative party, with the negative party having no say in the matter.
Pulling is when the negative party actively takes work from the positive party, with the positive party having little say in the matter.
Within realtime scheduling, the impact of the non-determinism from pushing is clearly much more significant and harmful than pulling.
Imagine this:
CPU0 is happily running its 3 realtime tasks
CPU1 is unhappily running its 9 realtime tasks
CPU0 is making its way through a long-running compute task, with 2 more interactive tasks in the queue
CPU1 is switching more rapidly between its 9 tasks
Then, all of a sudden, CPU1 context switches, and exclaims “I can’t take this anymore!”, and it pushes 3 of its tasks onto CPU0.
Now, CPU0 has had an influx of work, but it is still running its compute task. This brings a major problem:
“The work we just moved has no guarantee that it will run.”
And we can end up in this scenario:
CPU0 is happily running one of its now 6 realtime tasks, while the newly pushed tasks sit in the queue expecting to be switched to, yet that switch never happens.
CPU1 is now a bit less frustrated about its load, but 3 tasks that were originally assigned to it are waiting in another runqueue, and it can no longer choose when those 3 tasks get to run again unless it tracks the affinity of each task it was running and moves them back.
Now, compare this to a pull-only policy:
CPU0 is happily running its 3 realtime tasks
CPU1 is unhappily running its 9 realtime tasks
CPU0 context switches, and notices that CPU1 has some extra work it can take. It decides to pull it over, and then immediately run it.
Now, although CPU0 has had work added, that work is guaranteed runtime after the migration, potentially allowing it to complete early or giving it more CPU time.
From this example, we can see how pull-only migration provides stricter, deterministic guarantees about whether or not a task will run after being migrated. This is the primary reason why we are choosing this paradigm, in addition to a slight improvement in when tasks get migrated (i.e. on the negative party’s side, an explicit call to yield the processor must be made for work to be added, as opposed to push migration, where the negative party can have work added to it at any time).
TODO
The modular realtime scheduler system of this kernel allows realtime schedulers to fail, and fails upwards. TODO
The realtime scheduler sits under the broader scheduling and multitasking capabilities of this kernel, in contrast to the primary timesharing scheduler, which is detailed to a greater extent elsewhere.
TODO
TODO
TODO
TODO
struct rt_slot_request {
    char *name;
    enum rt_slot_priority prio;
    int32_t mapped_to;
};

struct rt_scheduler_blocklist_node {
    struct list_head list;
    char *blocked_scheduler_name;
};

struct rt_scheduler_percpu_permitted {
    enum rt_scheduler_capability allowed_capabilities;
    struct list_head blocklist;
    struct spinlock lock;
};

struct rt_thread_summary {
    enum rt_scheduler_status status;
    rt_weight_t weight;
    rt_weight_t weight_ewma;
    rt_urgency_t urgency;
    rt_urgency_t urgency_ewma;
};

struct rt_thread_summary_ext {
    struct rt_scheduler_static *source;
    void *private;
};

struct rt_thread_shed_request {
    bool on;
    rt_urgency_t urgency;
    size_t threads_available;
    struct list_head threads;
};

struct rt_scheduler_ops {
    rt_scheduler_static_fn on_load;
    rt_scheduler_static_fn on_unload;
    rt_scheduler_fn init;
    rt_scheduler_fn destroy;
    rt_scheduler_fn switch_in;
    rt_scheduler_fn switch_out;
    rt_scheduler_fn on_failure;
    enum rt_scheduler_error (*migrate_twin)(struct rt_scheduler *self,
                                            struct rt_scheduler *other);
    struct thread *(*pick_thread)(struct rt_scheduler *);
    rt_scheduler_thread_fn add_thread;
    rt_scheduler_thread_fn remove_thread;
    struct rt_thread_summary_ext (*get_summary_ext)(struct rt_scheduler *);
    rt_scheduler_fn on_tick;
    void (*return_all_threads)(struct rt_scheduler *, struct list_head *);
    rt_domain_id_t (*domain_id_for_cpu)(struct core *);
};

struct rt_ext_fn {
    char *name;
    uint32_t id;
    uintptr_t (*fn)(uintptr_t, uintptr_t);
};

struct rt_scheduler_mapping {
    struct rbt_node tree_node;
    struct rt_scheduler_static *static_bptr;
    rt_domain_id_t id;
    struct cpu_mask members;
    struct cpu_mask active;
    struct rt_scheduler *rts;
    struct spinlock lock;
    void *data;
};

struct rt_scheduler_static {
    struct list_head list_internal;
    struct rbt mappings_internal;
    struct cpu_mask *active_mask_internal;
    refcount_t refcount;
    struct work teardown_work;
    enum rt_scheduler_static_state state;
    struct spinlock state_change_lock;
    char *name;
    struct rt_scheduler_ops ops;
    enum rt_scheduler_topo_level topo_level;
    enum rt_scheduler_capability capabilities;
    size_t num_slot_requests;
    struct rt_slot_request slot_requests[RT_SCHEDULER_SLOTS_PER_THREAD];
    size_t num_ext_fns;
    struct rt_ext_fn ext_fns[];
};

struct rt_scheduler {
    struct list_head list;
    struct core *owner;
    struct rt_scheduler_mapping *mapping_source;
    bool failed_internal;
    struct rt_thread_summary summary;
    struct rt_thread_shed_request shed_request;
    struct log_site *log_site;
    struct log_handle log_handle;
    size_t thread_count;
    struct spinlock lock;
};

struct rt_scheduler_percpu {
    struct scheduler *scheduler;
    struct rt_scheduler *born_with;
    struct rt_scheduler_mapping *active_mapping;
    struct rt_scheduler_percpu_permitted perms;
    struct semaphore switch_semaphore;
    struct log_site *log_site;
    struct log_handle log_handle;
    enum rt_scheduler_error switch_code;
    struct rt_scheduler_static *switch_into;
};

struct rt_global {
    struct locked_list static_list;
    struct locked_list *sch_pool;
    struct spinlock switch_lock;
};

enum rt_scheduler_topo_level {
    RT_SCHEDULER_TOPO_SMT,
    RT_SCHEDULER_TOPO_CORE,
    RT_SCHEDULER_TOPO_NUMA,
    RT_SCHEDULER_TOPO_LLC,
    RT_SCHEDULER_TOPO_PACKAGE,
    RT_SCHEDULER_TOPO_MACHINE,
    RT_SCHEDULER_TOPO_CUSTOM,
};

enum rt_scheduler_status {
    RT_SCHEDULER_STATUS_OK = 0,
    RT_SCHEDULER_STATUS_DEGRADED = 1 << 3,
    RT_SCHEDULER_STATUS_THROTTLED = 1 << 4,
    RT_SCHEDULER_STATUS_ISOLATED = 1 << 5,
};

enum rt_scheduler_static_state {
    RT_SCHEDULER_STATIC_UNLOADED,
    RT_SCHEDULER_STATIC_LOADED,
    RT_SCHEDULER_STATIC_DESTROYING,
};

enum rt_slot_priority {
    RT_SLOT_REQUIRED,
    RT_SLOT_OPTIONAL,
};

typedef size_t rt_domain_id_t;
typedef int32_t rt_weight_t;
typedef fx16_16_t rt_urgency_t;

typedef enum rt_scheduler_error (*rt_scheduler_fn)(struct rt_scheduler *);
typedef enum rt_scheduler_error (*rt_scheduler_static_fn)(struct rt_scheduler_static *);
typedef void (*rt_scheduler_thread_fn)(struct rt_scheduler *, struct thread *);

#define rt_sched_log(lvl, fmt, ...) \
    log(LOG_SITE(rt_sched), LOG_HANDLE(rt_sched), lvl, fmt, ##__VA_ARGS__)
#define rt_sched_err(fmt, ...)   rt_sched_log(LOG_ERROR, fmt, ##__VA_ARGS__)
#define rt_sched_warn(fmt, ...)  rt_sched_log(LOG_WARN, fmt, ##__VA_ARGS__)
#define rt_sched_info(fmt, ...)  rt_sched_log(LOG_INFO, fmt, ##__VA_ARGS__)
#define rt_sched_debug(fmt, ...) rt_sched_log(LOG_DEBUG, fmt, ##__VA_ARGS__)
#define rt_sched_trace(fmt, ...) rt_sched_log(LOG_TRACE, fmt, ##__VA_ARGS__)

#define RT_SCHEDULER_SLOTS_PER_THREAD 8