As shown in the figure below, a power model can be attributed by the path through which it was trained. A training pipeline is an abstraction of power model training that applies a set of learning methods to different combinations of energy source labels, available metrics, and idle/control-plane power isolation choices for each specific group of nodes/machines.
A training pipeline starts by reading the Kepler-exported metrics from a Prometheus query (prom) and finally submits an archived model to the model database (model DB).
From Kepler queries, the default extractor generates a dataframe with the following columns.
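The extraction step can be sketched roughly as follows, using a mocked Prometheus matrix-style query response. The helper name and row layout here are illustrative assumptions, not Kepler's actual extractor API; only the metric name `kepler_container_joules_total` and the Prometheus response shape come from their real counterparts.

```python
# Hypothetical sketch of the extraction step: flattening a Prometheus
# "matrix" query result into one row per series sample.
# All function and column names are illustrative, not Kepler's API.

def extract_rows(prom_matrix):
    """Flatten a Prometheus matrix result into a list of row dicts."""
    rows = []
    for series in prom_matrix:
        labels = series["metric"]
        for ts, value in series["values"]:
            rows.append({
                "timestamp": ts,
                "container": labels.get("container_name", ""),
                "metric": labels["__name__"],
                "value": float(value),
            })
    return rows

# Example with a mocked response for a Kepler-exported metric:
mock = [{
    "metric": {"__name__": "kepler_container_joules_total",
               "container_name": "web"},
    "values": [[1700000000, "125.0"], [1700000003, "130.5"]],
}]
print(extract_rows(mock)[0]["value"])  # 125.0
```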
Labeling energy source
energy source or source refers to the source (power meter) that provides an energy number. Each source provides one or more energy components. The currently supported sources are shown below.
|Energy/power source|Energy/power components|
|---|---|
|rapl|package, core, uncore, dram|
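For reference, a rapl package-energy reading can be taken on Linux through the standard powercap sysfs interface. This is a minimal sketch, not Kepler's code: the sysfs path is the standard intel-rapl location, but the helper names are ours, and the file is often absent on virtual machines.

```python
# Minimal sketch of reading the RAPL package energy counter on Linux.
# The powercap path is standard for intel-rapl; helpers are illustrative.
import os
from typing import Optional

RAPL_PKG = "/sys/class/powercap/intel-rapl:0/energy_uj"

def parse_energy_uj(raw: str) -> float:
    """Convert a raw energy_uj reading (microjoules) to joules."""
    return int(raw.strip()) / 1_000_000

def read_package_joules(path: str = RAPL_PKG) -> Optional[float]:
    # RAPL may not be exposed (e.g. on a virtual private cloud).
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_energy_uj(f.read())

print(parse_energy_uj("1234567\n"))  # 1.234567
```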
Idle power/Control plane power
isolate is a mechanism to separate the power portion that is consumed by the node in an idle state, or the portion that is consumed by the operating system and control plane processes. These portions of power are greater than zero even when the workload's metric utilization is zero. We call the models that are trained after isolating these power portions
DynPower models, while the models that are trained without isolation are called AbsPower models. A
DynPower model is used to estimate
container power and
process power, while an
AbsPower model is used to estimate node power.
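The distinction can be illustrated with a toy numeric sketch. All numbers here are invented for the example: an AbsPower-style estimate includes the idle baseline, while a DynPower-style estimate covers only the portion that scales with workload utilization.

```python
# Toy illustration (invented numbers, not measured data) of the split:
# AbsPower predicts total power including the idle baseline, while
# DynPower predicts only the workload-proportional portion.
idle_watts = 40.0  # power drawn at zero workload utilization

def abs_power(util):
    """AbsPower-style estimate: idle baseline plus dynamic part."""
    return idle_watts + 50.0 * util

def dyn_power(util):
    """DynPower-style estimate: idle baseline isolated away."""
    return 50.0 * util

print(abs_power(0.5))  # 65.0
print(dyn_power(0.5))  # 25.0
```

At every utilization level the two estimates differ by exactly the isolated idle portion.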
There are two commonly available
isolators: ProfileIsolator and MinIdleIsolator.
ProfileIsolator relies on profiled background power (profiles) and removes resource usage by system processes from the training data, while MinIdleIsolator assumes the minimum observed power to be the idle power and includes resource usage by system processes in the training data.
The pipeline with ProfileIsolator is applied first if a profile matching the training
node_type is available. Otherwise, the other pipeline is applied.
(check how profiles are generated here)
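The MinIdleIsolator idea can be sketched in a few lines: take the minimum observed node power as the idle baseline and subtract it from the power labels before training. The function and variable names below are illustrative, not Kepler's implementation.

```python
# Hedged sketch of the MinIdleIsolator idea: treat the minimum observed
# power as idle power and subtract it from the labels before training.
# Names are illustrative, not taken from Kepler's source.

def min_idle_isolate(power_labels):
    """Return (idle_power, dynamic_labels) using min power as idle."""
    idle = min(power_labels)
    return idle, [p - idle for p in power_labels]

idle, dyn = min_idle_isolate([42.0, 55.0, 48.0, 61.0])
print(idle)  # 42.0
print(dyn)   # [0.0, 13.0, 6.0, 19.0]
```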
feature group is an abstraction that groups available features based on the origin of the resource utilization metrics. In some node environments, certain origins can be inaccessible, such as hardware counters on a virtual private cloud. A model is trained for each of the groups defined below.
|Group Name|Features|Kepler Metric Source(s)|
|---|---|---|
|CounterIRQCombined|COUNTER_FEATURES, IRQ_FEATURES|BPF and Hardware Counter|
|Basic|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, KUBELET_FEATURES|All except IRQ and node information|
|WorkloadOnly|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, IRQ_FEATURES, KUBELET_FEATURES, ACCELERATOR_FEATURES|All except node information|
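A pipeline might pick the richest usable group for a node roughly as follows. This selection helper is an illustrative assumption, not Kepler's implementation, and the BPFOnly fallback entry is included only to make the example complete.

```python
# Illustrative helper (not Kepler's API) for picking the feature group
# with the most metric origins that are all available on a node,
# e.g. hardware counters are often missing on virtual machines.
GROUP_REQUIREMENTS = {
    "WorkloadOnly": {"counter", "cgroup", "bpf", "irq", "kubelet",
                     "accelerator"},
    "CounterIRQCombined": {"counter", "irq"},
    "BPFOnly": {"bpf"},  # assumed fallback group for illustration
}

def choose_feature_group(available_origins):
    # Prefer groups needing more origins (richer feature sets first).
    for name, needed in sorted(GROUP_REQUIREMENTS.items(),
                               key=lambda kv: -len(kv[1])):
        if needed <= available_origins:
            return name
    return None

print(choose_feature_group({"bpf", "irq"}))  # BPFOnly
```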
node information refers to the value of the kepler_node_info metric.
trainer is an abstraction that defines the learning method applied to each feature group for each given power labeling source.
Trainer class has 9 abstract methods. The training steps are:

1. load the previous checkpoint model via the implemented (i) load_local_checkpoint or (ii) load_remote_checkpoint. If the checkpoint cannot be loaded, initialize the model by calling the implemented (iii) init_model
2. load and apply the scaler to the input data
3. (iv) train and save the checkpoint via (v) save_checkpoint
4. check whether to archive the model and push it to the database via (vi) should_archive. If yes,

    4.1. get trainer-specific basic metadata via (vii) get_basic_metadata

    4.2. fill in the required metadata and save it as a metadata file (metadata.json)

    4.3. save the model via (viii) save_model

    4.4. if the (ix) get_weight_dict function is implemented (only for linear-regression-based trainers), the weight dict will be saved in the file named weight.json

    4.5. archive the model folder. The model name will be in the format

    4.6. push the archived model and weight.json (if available) to the database
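Assuming the nine abstract methods are the ones named in the steps above, the Trainer contract could be sketched as follows. The signatures are illustrative, not copied from Kepler's source.

```python
# Sketch of the Trainer contract, under the assumption that the nine
# abstract methods are those named in the training steps; signatures
# are illustrative, not Kepler's actual API.
from abc import ABC, abstractmethod

class Trainer(ABC):
    @abstractmethod
    def load_local_checkpoint(self): ...        # (i)
    @abstractmethod
    def load_remote_checkpoint(self): ...       # (ii)
    @abstractmethod
    def init_model(self): ...                   # (iii)
    @abstractmethod
    def train(self, X, y): ...                  # (iv)
    @abstractmethod
    def save_checkpoint(self): ...              # (v)
    @abstractmethod
    def should_archive(self) -> bool: ...       # (vi)
    @abstractmethod
    def get_basic_metadata(self) -> dict: ...   # (vii)
    @abstractmethod
    def save_model(self): ...                   # (viii)
    @abstractmethod
    def get_weight_dict(self) -> dict: ...      # (ix)
```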
If the trainer is based on scikit-learn, consider implementing only the
init_model method of
The intermediate checkpoints and the model output will be saved locally in the folder
MODEL_PATH/<PowerSource>/<ModelOutputType>/<FeatureGroup>. The default
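Composing that local output folder is straightforward; in this sketch, MODEL_PATH is a placeholder string rather than Kepler's default value, and the example segment values are illustrative.

```python
# Minimal sketch of composing the local save folder from the pieces
# named above; "MODEL_PATH" here is a placeholder, not the default.
import os

def model_output_dir(model_path, power_source, output_type, feature_group):
    return os.path.join(model_path, power_source, output_type, feature_group)

print(model_output_dir("MODEL_PATH", "rapl", "AbsPower", "BPFOnly"))
# → MODEL_PATH/rapl/AbsPower/BPFOnly on POSIX systems
```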
Kepler forms multiple groups of machines (nodes) based on their benchmark performance and trains a model separately for each group. The identified group is exported as
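The grouping idea can be illustrated with a toy bucketing function. The thresholds and the score-to-group mapping below are invented for the example; the real profiling is more involved.

```python
# Toy illustration of grouping machines by a benchmark score so that
# each bucket gets its own trained model. Thresholds are invented.

def node_type(benchmark_score, thresholds=(100, 200)):
    """Map a benchmark score to a small integer group id."""
    for i, t in enumerate(thresholds):
        if benchmark_score < t:
            return i
    return len(thresholds)

print([node_type(s) for s in (50, 150, 250)])  # [0, 1, 2]
```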