License: CC BY 4.0
arXiv:2403.07120v1 [cs.DC] 11 Mar 2024

Comparing Task Graph Scheduling Algorithms: An Adversarial Approach

Jared Coleman (jaredcol@usc.edu, ORCID 0000-0003-1227-2962) and Bhaskar Krishnamachari (bkrishna@usc.edu), University of Southern California, 3650 McClintock Ave, Los Angeles, California, USA 90089
(2024)
Abstract.

Scheduling a task graph representing an application over a heterogeneous network of computers is a fundamental problem in distributed computing. It is known to be not only NP-hard but also not polynomial-time approximable within a constant factor. As a result, many heuristic algorithms have been proposed over the past few decades. Yet it remains largely unclear how these algorithms compare to each other in terms of the quality of schedules they produce. We identify gaps in the traditional benchmarking approach to comparing task scheduling algorithms and propose a simulated annealing-based adversarial analysis approach called PISA to help address them. We also introduce SAGA, a new open-source library for comparing task scheduling algorithms. We use SAGA to benchmark 15 algorithms on 16 datasets and PISA to compare the algorithms in a pairwise manner. Algorithms that appear to perform similarly on benchmarking datasets are shown to perform very differently on adversarially chosen problem instances. Interestingly, the results indicate that this is true even when the adversarial search is constrained to selecting among well-structured, application-specific problem instances. This work represents an important step towards a more general understanding of the performance boundaries between task scheduling algorithms on different families of problem instances.

scheduling, simulated annealing, task graph, networks, benchmarking, makespan, adversarial analysis
copyright: ACM; journal year: 2024; doi: XXXXXXX.XXXXXXX; conference: ACM Symposium on Principles of Distributed Computing, June 17–21, 2024, Nantes, France; price: 15.00; isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Task graph scheduling is a fundamental problem in distributed computing. Essentially, the goal is to assign computational tasks to different compute nodes in such a way that minimizes or maximizes some performance metric (e.g., total execution time, energy consumption, throughput, etc.). In this paper, we focus on the task scheduling problem for heterogeneous task graphs and compute networks with the objective of minimizing makespan (total execution time) under the related machines model.¹ As this problem is NP-hard (Garey and Johnson, 1979) and also not polynomial-time approximable within a constant factor (Bazzi and Norouzi-Fard, 2015), many heuristic algorithms have been proposed. Despite the overwhelming number of algorithms now available in the literature, open-source implementations of them are scarce. Even worse, the datasets on which they were evaluated are often not available, and the implementations that do exist are not compatible with each other (different programming languages, frameworks, problem instance formats, etc.). In order to address this technological shortcoming and enable this study, we built SAGA, a Python framework for running, evaluating, and comparing task scheduling algorithms.

¹In the related machines model, if the same task executes faster on some compute node $n_1$ than on node $n_2$, then $n_1$ must execute all tasks faster than $n_2$ ($n_1$ is strictly faster than $n_2$). Observe that this model cannot describe multi-modal distributed systems, where certain classes of tasks (e.g., GPU-heavy tasks) might run better or worse on different types of machines (e.g., those with or without GPUs). The related machines model as it pertains to the task scheduling problem we study in this paper is described further in Section 1.1.

SAGA’s modularity and extensibility make it easy to benchmark algorithms on various datasets (SAGA currently includes datasets of randomized problem instances and datasets based on real-world scientific workflow and IoT/edge computing applications). The underlying theme throughout this paper, however, is that the traditional benchmarking approach to comparing task scheduling algorithms is insufficient. Benchmarking is only as useful as the underlying datasets are representative and, in practice, peculiarities of heuristic algorithms for task scheduling make it difficult to tell just what broader family of problem instances a dataset is really representative of. For this reason, benchmarking results for task scheduling algorithms can be misleading. In this paper, we show examples of, and provide methods for the automatic discovery of, in-family problem instances (that is, ones that are similar to other problem instances in the dataset) on which algorithms that appear to perform well on the original dataset perform very poorly. In fact, for every scheduling algorithm we consider in this paper (15 total), our proposed simulated annealing-based adversarial instance finder (PISA) finds problem instances on which it performs at least twice as badly as another of the algorithms. For most of the algorithms (10 of 15), it finds a problem instance on which the algorithm performs at least five times worse than another algorithm!

The main contributions of this paper are as follows:

  1. Introduces SAGA (Authors, 2023b), an open-source, modular, and extensible Python package with implementations of many well-known task scheduling algorithms and tools for benchmarking and adversarial analysis.

  2. Reports benchmarking results of 15 well-known scheduling algorithms on 16 datasets.

  3. Proposes PISA, a novel simulated annealing-based adversarial analysis method for comparing algorithms that identifies problem instances for which a given algorithm maximally underperforms another.

  4. Identifies gaps in the traditional benchmarking approach: PISA finds problem instances where algorithms that appear to do well on benchmarking datasets perform poorly on adversarially chosen datasets. We also show that this is true even when PISA is restricted to searching over well-structured, application-specific problem instances.

The rest of the paper is organized as follows. We formally define the problem in Section 1.1. In Section 2, we provide a brief history of the task scheduling problem and related work. In Section 3, we introduce SAGA, a Python library for running, evaluating, and comparing task scheduling algorithms. Then, in Section 4, we present the results of benchmarking 15 algorithms on 16 datasets. In Section 5, we present the main contribution of this paper: PISA, a simulated annealing-based adversarial analysis method for comparing task scheduling algorithms. In Section 5.1 we present the results of comparing the 15 algorithms in a pairwise manner using this method. In Section 6, we demonstrate how our proposed technique can be tailored to application-specific scenarios where certain properties of the task graph and/or network are known ahead of time. We conclude the paper in Section 7 with a short discussion on the implications of this work and directions for future research.

1.1. Problem Definition

Let us denote the task graph as $G=(T,D)$, where $T$ is the set of tasks and $D$ contains the directed edges or dependencies between these tasks. An edge $(t,t') \in D$ implies that the output from task $t$ is required input for task $t'$. Thus, task $t'$ cannot start executing until it has received the output of task $t$. This is often referred to as a precedence constraint. For a given task $t \in T$, its compute cost is represented by $c(t) \in \mathbb{R}^+$, and the size of the data exchanged between two dependent tasks $(t,t') \in D$ is $c(t,t') \in \mathbb{R}^+$. Let $N=(V,E)$ denote the compute node network, where $N$ is a complete undirected graph, $V$ is the set of nodes, and $E$ is the set of edges. The compute speed of a node $v \in V$ is $s(v) \in \mathbb{R}^+$ and the communication strength between nodes $(v,v') \in E$ is $s(v,v') \in \mathbb{R}^+$. Under the related machines model (Graham, 1969), the execution time of a task $t \in T$ on a node $v \in V$ is $\frac{c(t)}{s(v)}$, and the data communication time between tasks $(t,t') \in D$ from node $v$ to node $v'$ (i.e., $t$ executes on $v$ and $t'$ executes on $v'$) is $\frac{c(t,t')}{s(v,v')}$.
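For concreteness, a problem instance can be represented with networkx graphs; the following is a minimal sketch (the attribute names cost, data, speed, and strength are our own, not a prescribed format):

```python
import networkx as nx

# Task graph G = (T, D): nodes are tasks with compute cost c(t),
# edges are dependencies with data size c(t, t').
task_graph = nx.DiGraph()
task_graph.add_node("t1", cost=1.0)
task_graph.add_node("t2", cost=2.0)
task_graph.add_edge("t1", "t2", data=0.5)  # t2 needs t1's output

# Network N = (V, E): a complete undirected graph with node speeds s(v)
# and link strengths s(v, v'). Self-edges with infinite strength model
# free same-node data transfers.
network = nx.Graph()
network.add_node("v1", speed=1.0)
network.add_node("v2", speed=2.0)
network.add_edge("v1", "v2", strength=0.5)
network.add_edge("v1", "v1", strength=float("inf"))
network.add_edge("v2", "v2", strength=float("inf"))

# Related machines model: execution and communication times.
exec_time = task_graph.nodes["t2"]["cost"] / network.nodes["v2"]["speed"]
comm_time = task_graph.edges["t1", "t2"]["data"] / network.edges["v1", "v2"]["strength"]
```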

The goal is to schedule the tasks on the compute nodes so as to minimize the makespan (total execution time) of the task graph. Let $\mathcal{A}$ denote a task scheduling algorithm. Given a problem instance $(N,G)$, which represents a network/task graph pair, let $S_{\mathcal{A},N,G}$ denote the schedule produced by $\mathcal{A}$ for $(N,G)$. A schedule is a mapping from each task to a triple $(v,r,e)$, where $v$ is the node on which the task is scheduled, $r$ is the start time, and $e$ is the end time. A valid schedule must satisfy the following properties:

  • All tasks must be scheduled: for all $t \in T$, $S_{\mathcal{A},N,G}(t) = (v,r,e)$ must exist such that $v \in V$ and $0 \leq r \leq e$.

  • All tasks must have valid start and end times:

    $\forall t \in T,\; S_{\mathcal{A},N,G}(t) = (v,r,e) \implies e - r = \frac{c(t)}{s(v)}$

  • Only one task can be scheduled on a node at a time (i.e., their start/end times cannot overlap):

    $\forall t, t' \in T,\; t \neq t',\; S_{\mathcal{A},N,G}(t) = (v,r,e) \land S_{\mathcal{A},N,G}(t') = (v,r',e') \implies e \leq r' \lor e' \leq r$

  • A task cannot start executing until all of its dependencies have finished executing and their outputs have been received at the node on which the task is scheduled:

    $\forall (t,t') \in D,\; S_{\mathcal{A},N,G}(t) = (v,r,e) \land S_{\mathcal{A},N,G}(t') = (v',r',e') \implies e + \frac{c(t,t')}{s(v,v')} \leq r'$
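These constraints are straightforward to check mechanically. Below is a minimal validity-checker sketch under the networkx representation from the example above (our own illustration, not SAGA's implementation); it assumes self-edges with infinite strength so that same-node transfers cost nothing:

```python
def is_valid_schedule(task_graph, network, schedule, tol=1e-9):
    """Check the four validity constraints for schedule: task -> (node, start, end)."""
    # 1. Every task is scheduled on a real node with 0 <= start <= end.
    for t in task_graph.nodes:
        if t not in schedule:
            return False
        v, r, e = schedule[t]
        if v not in network.nodes or not 0 <= r <= e:
            return False
    # 2. Durations match the related machines model: e - r = c(t) / s(v).
    for t, (v, r, e) in schedule.items():
        if abs((e - r) - task_graph.nodes[t]["cost"] / network.nodes[v]["speed"]) > tol:
            return False
    # 3. Tasks scheduled on the same node must not overlap in time.
    intervals_by_node = {}
    for t, (v, r, e) in schedule.items():
        intervals_by_node.setdefault(v, []).append((r, e))
    for intervals in intervals_by_node.values():
        intervals.sort()
        for (_, e1), (r2, _) in zip(intervals, intervals[1:]):
            if r2 < e1 - tol:
                return False
    # 4. Precedence: t' starts only after t's output arrives at t''s node.
    for t, t_prime in task_graph.edges:
        v, _, e = schedule[t]
        v_prime, r_prime, _ = schedule[t_prime]
        comm = task_graph.edges[t, t_prime]["data"] / network.edges[v, v_prime]["strength"]
        if e + comm > r_prime + tol:
            return False
    return True
```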
Figure 1. Example problem instance and schedule: (a) task graph, (b) network, (c) schedule.

Figure 1 depicts an example problem instance (task graph and network) and solution (schedule). We define the makespan of the schedule $S_{\mathcal{A},N,G}$ as the time at which the last task finishes executing:

$M_{\mathcal{A}(N,G)} = \max_{t \in T \,\mid\, S_{\mathcal{A},N,G}(t) = (v,r,e)} e$

Because the problem of minimizing makespan is NP-hard for this model (Garey and Johnson, 1979), many heuristic algorithms have been proposed. Traditionally, these heuristic algorithms are evaluated on a set of problem instances and compared against other algorithms based on their makespan ratio, which for a given problem instance $(N,G)$ is the makespan of the schedule produced by the algorithm divided by the minimum makespan of the schedules produced by the baseline algorithms. The makespan ratio of an algorithm $\mathcal{A}$ against a set of baseline algorithms $\mathcal{A}_1, \mathcal{A}_2, \ldots$ for a problem instance $(N,G)$ can be written

$\frac{M_{\mathcal{A}(N,G)}}{\min\left(M_{\mathcal{A}_1(N,G)},\, M_{\mathcal{A}_2(N,G)},\, M_{\mathcal{A}_3(N,G)},\, \ldots\right)}.$

Makespan ratio is a commonly used concept in the task scheduling literature (Topcuoglu et al., 1999; Wang and Sinnen, 2018). It is common to collect the makespan ratios of an algorithm (against a set of baseline algorithms) for a dataset of problem instances. System designers use this technique, called benchmarking, to decide which algorithm(s) best support(s) their application. In this paper, we highlight the shortcomings of this approach by discussing examples of (and providing methods for the automatic discovery of) in-family problem instances on which algorithms that appear to perform well on the original dataset perform very poorly.
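Both quantities are easy to compute from the schedule mapping defined above; a small sketch (function names are ours):

```python
def makespan(schedule):
    """Makespan: the latest end time over all scheduled tasks."""
    return max(end for _, _, end in schedule.values())

def makespan_ratio(schedule, baseline_schedules):
    """Ratio of a schedule's makespan to the best (minimum) baseline makespan."""
    return makespan(schedule) / min(makespan(s) for s in baseline_schedules)
```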

2. Related Work

A comprehensive 1979 survey by Graham et al. classified multiple variants of the task scheduling problem and proposed a structured approach to analyzing their complexity classes (Graham et al., 1979). Though the variant of task scheduling discussed in this paper was not directly addressed in their study, the established reduction framework paved the way for its NP-completeness determination. The survey addressed many facets of task graph scheduling, such as processor heterogeneity, precedence constraints, and makespan minimization, but did not consider inter-processor communication. This gap was addressed a decade later, in 1989, by Hwang et al., who introduced the concept of task graph scheduling with inter-processor communication (albeit only considering homogeneous processors). They proposed the ETF (Earliest Task First) algorithm and proved that it produces schedules with makespan at most $(2 - 1/n)\,\omega_{\mathrm{opt}}^{(i)} + C$, where $\omega_{\mathrm{opt}}^{(i)}$ is the optimal makespan without considering inter-processor communication delays and $C$ is the worst-case communication requirement over a terminal chain of tasks (one which determines the makespan).

Over time, as distributed computing came into the mainstream, many heuristic scheduling algorithms emerged. Many of the most popular fall under the list scheduling paradigm, originally proposed by Graham. In general, list scheduling algorithms involve two steps:

  1. Compute a priority $p(t)$ for each task $t$ such that tasks have higher priority than their dependent tasks.

  2. Greedily schedule tasks in order of their computed priority (from highest to lowest) on the node that minimizes some predefined cost function.

In other words, list scheduling algorithms first decide on a valid topological sort of the task graph, and then use it to schedule tasks greedily according to some objective function (ETF, mentioned above, is a list scheduling algorithm). Two of the most well-known heuristic algorithms for task graph scheduling, HEFT (Heterogeneous Earliest Finish Time) and CPoP (Critical Path on Processor) (Topcuoglu et al., 1999), are list scheduling algorithms. Many other heuristic algorithms have been proposed and experimentally evaluated on different datasets of task graphs and networks (see benchmarking/comparison/survey papers (Braun et al., 1999a, 2001; Canon et al., 2008; Kwok and Ahmad, 1998; Wang and Sinnen, 2018)). Other paradigms studied in the literature are cluster-scheduling (Wang and Sinnen, 2018) and meta-heuristic (Houssein et al., 2021) algorithms. Cluster scheduling involves dividing the task graph into groups of tasks to execute on a single node. Meta-heuristic approaches (e.g., simulated annealing or genetic algorithms) have been shown to work well in some situations, but generally take longer to run and can be difficult to tune (Braun et al., 1999b).
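To make the paradigm concrete, the following is a minimal list-scheduling sketch (our own illustration of the general pattern, not any particular published algorithm): it uses a topological order as the priority and greedily assigns each task to the node that minimizes its finish time. It reuses the networkx representation from Section 1.1, including self-edges with infinite strength for free same-node transfers.

```python
import networkx as nx

def list_schedule(task_graph, network):
    """Greedy list scheduling: topological priority + earliest-finish-time assignment."""
    schedule = {}                                # task -> (node, start, end)
    node_free = {v: 0.0 for v in network.nodes}  # earliest time each node is idle
    for t in nx.topological_sort(task_graph):    # any topological order is a valid priority
        best_v, best_start, best_end = None, None, None
        for v in network.nodes:
            start = node_free[v]
            # The task may start only after every input has arrived at node v.
            for dep in task_graph.predecessors(t):
                dep_node, _, dep_end = schedule[dep]
                comm = task_graph.edges[dep, t]["data"] / network.edges[dep_node, v]["strength"]
                start = max(start, dep_end + comm)
            end = start + task_graph.nodes[t]["cost"] / network.nodes[v]["speed"]
            if best_end is None or end < best_end:
                best_v, best_start, best_end = v, start, end
        schedule[t] = (best_v, best_start, best_end)
        node_free[best_v] = best_end
    return schedule
```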

In this paper, we propose a simulated annealing-based (Kirkpatrick et al., 1983) approach to finding problem instances on which an algorithm performs maximally poorly compared to a given baseline algorithm. This approach is inspired by the recent work of Namyar et al. who use simulated annealing and other techniques to find adversarial instances for heuristics that solve convex optimization problems (Namyar et al., 2022). To the best of our knowledge, this is the first work to propose an automated method for adversarial analysis of task scheduling algorithms.

3. The SAGA Framework

We built SAGA to overcome the notable lack of publicly available datasets and task scheduling algorithm implementations. SAGA is a Python library for running, evaluating, and comparing task scheduling algorithms. It currently contains implementations of 17 algorithms using a common interface. It also includes interfaces for generating, saving, and loading datasets for benchmarking. Finally, it includes an implementation of the main contribution of this paper: PISA, a simulated annealing-based adversarial analysis method for comparing algorithms. For more information on SAGA, we refer the reader to the technical report (Authors, 2023b).
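To give a flavor of the common interface, here is a hypothetical usage sketch (the module, class, and method names are illustrative assumptions based on the description above, not a documented API):

```python
# Hypothetical sketch: module/class/method names are assumptions, not a documented API.
from saga.schedulers import CpopScheduler, HeftScheduler

# Every scheduler exposes the same interface, so algorithms are interchangeable.
heft_schedule = HeftScheduler().schedule(network, task_graph)
cpop_schedule = CpopScheduler().schedule(network, task_graph)
```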

Table 1 lists the 17 algorithms currently implemented in SAGA.

Table 1. Schedulers implemented in SAGA
Abbreviation | Algorithm | Reference
BIL | Best Imaginary Level | (Oh and Ha, 1996)
BruteForce | Brute Force | -
CPoP | Critical Path on Processor | (Topcuoglu et al., 1999)
Duplex | Duplex | (Braun et al., 2001)
ETF | Earliest Task First | (Hwang et al., 1989)
FastestNode | Fastest Node | -
FCP | Fast Critical Path | (Radulescu and van Gemund, 2000)
FLB | Fast Load Balancing | (Radulescu and van Gemund, 2000)
GDL | Generalized Dynamic Level | (Sih and Lee, 1993b)
HEFT | Heterogeneous Earliest Finish Time | (Topcuoglu et al., 1999)
MaxMin | MaxMin | (Braun et al., 2001)
MCT | Minimum Completion Time | (Armstrong et al., 1998)
MET | Minimum Execution Time | (Armstrong et al., 1998)
MinMin | MinMin | (Braun et al., 2001)
OLB | Opportunistic Load Balancing | (Armstrong et al., 1998)
SMT | SMT-driven Binary Search | -
WBA | Workflow-Based Application | (Blythe et al., 2005)

To orient the reader, we provide a brief description of each of these algorithms along with their scheduling complexity, the model they were designed for, performance guarantees (if any), datasets they have been evaluated on, and other algorithms they have been evaluated against in Appendix A.1. Because the BruteForce and SMT scheduling algorithms take much longer to run (exponential time) than the other algorithms, they are not included in the benchmarking or adversarial analysis results reported in this paper.

SAGA also includes a set of tools for generating, saving, and loading datasets. Table 2 lists the 16 dataset generators currently included in SAGA.

Table 2. Datasets available in SAGA
Name | Task Graph | Network
in_trees | randomly weighted in-trees | randomly weighted
out_trees | randomly weighted out-trees | randomly weighted
chains | randomly weighted parallel chains | randomly weighted
blast | synthetic Blast workflows (Hazekamp and Thain, 2017) | Chameleon cloud inspired (Keahey et al., 2020)
bwa | synthetic BWA workflows (Hazekamp and Thain, 2017) | Chameleon cloud inspired
cycles | synthetic Cycles workflows (da Silva et al., 2019b) | Chameleon cloud inspired
epigenomics | synthetic Epigenomics workflows (Juve et al., 2013) | Chameleon cloud inspired
genome | synthetic 1000genome workflows (da Silva et al., 2019a) | Chameleon cloud inspired
montage | synthetic Montage workflows (Rynge et al., 2014) | Chameleon cloud inspired
seismology | synthetic Seismology workflows (Filgueira et al., 2016) | Chameleon cloud inspired
soykb | synthetic SoyKB workflows (Liu et al., 2016) | Chameleon cloud inspired
srasearch | synthetic SRASearch workflows (Rynge, 2017) | Chameleon cloud inspired
etl | RIoTBench ETL application (Shukla et al., 2017) | Edge/Fog/Cloud (Varshney et al., 2022)
predict | RIoTBench PREDICT application (Shukla et al., 2017) | Edge/Fog/Cloud
stats | RIoTBench STATS application (Shukla et al., 2017) | Edge/Fog/Cloud
train | RIoTBench TRAIN application (Shukla et al., 2017) | Edge/Fog/Cloud

The in_trees, out_trees, and chains datasets each consist of 1000 randomly generated network/task graph pairs following a common methodology used in the literature (Cordeiro et al., 2010). In-trees and out-trees are generated with between 2 and 4 levels (chosen uniformly at random), a branching factor of either 2 or 3 (chosen uniformly at random), and node/edge weights drawn from a clipped Gaussian distribution (mean: 1, standard deviation: 1/3, min: 0, max: 2). Parallel-chains task graphs are generated with between 2 and 5 parallel chains (chosen uniformly at random) of length between 2 and 5 (chosen uniformly at random) and node/edge weights drawn from the same clipped Gaussian distribution. Randomly weighted networks are complete graphs with between 3 and 5 nodes (chosen uniformly at random) and node/edge weights drawn from the same clipped Gaussian distribution.
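The clipped Gaussian weights used by these generators can be sampled as follows (a minimal sketch; the function name is ours):

```python
import random

def clipped_gaussian(mean=1.0, std=1/3, lo=0.0, hi=2.0):
    """Draw from a Gaussian with the given mean/std and clip the result to [lo, hi]."""
    return min(hi, max(lo, random.gauss(mean, std)))
```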

The scientific workflow datasets blast, bwa, cycles, epigenomics, genome, montage, seismology, soykb, and srasearch each contain 100 problem instances. The task graphs are synthetically generated using the WfCommons Synthetic Workflow Generator (Coleman et al., 2023) and are based on real-world scientific workflows. The Chameleon-cloud-inspired networks are generated by fitting a distribution to the machine speed data from the execution traces of real workflows on Chameleon that are available in WfCommons (detailed information from a real execution of the application, including task start/end times, CPU usages/requirements, data I/O sizes, etc.) and then sampling from that distribution to generate random networks. Because Chameleon uses a shared filesystem for data transfer, the communication cost can be absorbed into the computation cost, and thus the communication strength between nodes is considered to be infinite.

The IoT/edge-inspired etl, predict, stats, and train datasets each contain 1000 problem instances. The task graphs and networks are generated using the approach described in (Varshney et al., 2022). The task graph structure is based on real-world IoT data streaming applications and the node weights are generated using a clipped Gaussian distribution (mean: 35, standard deviation: 25/3, min: 10, max: 60). The input size of the application is generated using a clipped Gaussian distribution (mean: 1000, standard deviation: 500/3, min: 500, max: 1500) and the edge weights are determined by the known input/output ratios of the tasks. The Edge/Fog/Cloud networks are generated by constructing a complete graph with three types of nodes: edge nodes with CPU speed 1, fog nodes with CPU speed 6, and cloud nodes with CPU speed 50. The communication strength between edge and fog nodes is 60 and between fog and cloud/fog nodes is 100. The number of edge, fog, and cloud nodes is between 75 and 125, 3 and 7, and 1 and 10, respectively (chosen uniformly at random). In order to generate a complete graph (a requirement for many of the scheduling algorithms), the communication strength between edge and cloud nodes is set to 60 and the communication strength between cloud nodes is set to infinity (i.e., no communication delay).
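A sketch of this Edge/Fog/Cloud network construction follows (our own illustration; the text does not specify the strength of edge-to-edge links, so we assume 60 as for the other edge links):

```python
import itertools
import random
import networkx as nx

def edge_fog_cloud_network():
    """Build a complete graph with the edge/fog/cloud speeds and link strengths above."""
    net = nx.Graph()
    counts = {"edge": random.randint(75, 125), "fog": random.randint(3, 7),
              "cloud": random.randint(1, 10)}
    speeds = {"edge": 1, "fog": 6, "cloud": 50}
    for kind, n in counts.items():
        for i in range(n):
            net.add_node(f"{kind}{i}", speed=speeds[kind], kind=kind)
    for u, v in itertools.combinations(net.nodes, 2):
        kinds = {net.nodes[u]["kind"], net.nodes[v]["kind"]}
        if kinds == {"cloud"}:
            strength = float("inf")  # cloud-cloud: no communication delay
        elif "fog" in kinds and kinds != {"edge", "fog"}:
            strength = 100           # fog-cloud and fog-fog links
        else:
            strength = 60            # edge-fog, edge-cloud (and assumed edge-edge) links
        net.add_edge(u, v, strength=strength)
    return net
```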

Many other schedulers and datasets have been proposed and used in the literature and can be easily integrated into SAGA in the future. SAGA is open-source (Authors, 2023a) and designed to be modular and extensible. We encourage the community to contribute new algorithms, datasets, and experimentation tools.

4. Benchmarking Results

Figure 2 shows the results of benchmarking 15 algorithms on 16 datasets.

Figure 2. Makespan ratios of 15 algorithms evaluated on 16 datasets. Gradients depict performance on different problem instances in each dataset.

Different scheduling algorithms perform better or worse depending on the dataset, and algorithms that weren't designed for fully heterogeneous task graphs and networks (e.g., ETF, FastestNode) tend to perform poorly. Many of the algorithms, though, perform similarly across the datasets. While these experiments provide valuable information about the performance of each algorithm on each dataset, they provide much less information about the algorithms themselves.

Consider the illustrative scenario in Figure 3.

Figure 3. Comparison of scheduling algorithms on slightly modified networks: (a) task graph, (b) original network, (c) modified network, (d) HEFT schedule on original network, (e) CPoP schedule on original network, (f) HEFT schedule on modified network, (g) CPoP schedule on modified network.

A simplistic task graph, as depicted in Figure 3(a), coupled with a minor alteration to the initial network (Figures 3(b) and 3(c)), namely a reduction in the strength of node 3's communication links, causes HEFT to perform worse than CPoP (Figures 3(d)-3(g)). This example underscores the shortcoming of traditional benchmarking: it provides little insight into the conditions under which an algorithm performs well or poorly. Observe that a structurally equivalent instance of this problem, with all node/edge weights scaled to lie between 0 and 1, could have been generated by the Parallel Chains dataset generator in SAGA. So while the benchmarking results in Figure 2 indicate that HEFT performs slightly better than CPoP on this dataset, there are, in fact, Parallel Chains instances on which CPoP performs significantly better than HEFT.

5. Adversarial Analysis

We propose a novel adversarial analysis method for comparing task scheduling algorithms called PISA (Problem-instance Identification using Simulated Annealing). The goal of this method is to identify problem instances on which one algorithm outperforms another. More formally, the goal is to find a problem instance $(N,G)$ that maximizes the makespan ratio of algorithm $\mathcal{A}$ against algorithm $\mathcal{B}$:

$\max_{N,G} \frac{M_{\mathcal{A}(N,G)}}{M_{\mathcal{B}(N,G)}}$

In doing so, we hope to fill in some of the gaps we know exist in the benchmarking results. We propose a simulated annealing-based approach for finding such problem instances. Simulated annealing is a meta-heuristic that is often used to find the global optimum of a function that has many local optima. In the context of our problem, the optimizer starts with an initial problem instance (N,G)𝑁𝐺(N,G)( italic_N , italic_G ) and then randomly perturbs the problem instance by changing the network or task graph in some way. If the perturbed problem instance has a higher makespan ratio than the current problem instance, the optimizer accepts the perturbation and continues the search from the new problem instance. If the perturbed problem instance has a lower makespan ratio than the current problem instance, the optimizer accepts the perturbation with some probability. This allows the optimizer to escape local optima and (potentially) find the global optimum. Over time, the probability of accepting perturbations that decrease the makespan ratio decreases, allowing the optimizer to settle into a high-makespan ratio state. The pseudocode for our proposed simulated annealing process is shown in Algorithm 1.

Algorithm 1 PISA (Problem-instance Identification using Simulated Annealing)

procedure PISA($N$, $G$, $T_{max}$, $T_{min}$, $I_{max}$, $\alpha$, $\mathcal{A}$, $\mathcal{B}$)
    ▷ $(N,G)$ is the initial problem instance
    ▷ $T_{max}$ is the initial (maximum) temperature; $T_{min}$ is the minimum temperature
    ▷ $I_{max}$ is the maximum number of iterations; $\alpha$ is the cooling rate
    ▷ $\mathcal{A}$ is the scheduler; $\mathcal{B}$ is the baseline scheduler
    $N_{best} \leftarrow N$;  $G_{best} \leftarrow G$;  $M_{best} \leftarrow M_{\mathcal{A}(N,G)} / M_{\mathcal{B}(N,G)}$
    $T \leftarrow T_{max}$;  $i \leftarrow 0$
    while $T > T_{min}$ and $i < I_{max}$ do
        $N', G' \leftarrow$ Perturb($N_{best}, G_{best}$)
        $M' \leftarrow M_{\mathcal{A}(N',G')} / M_{\mathcal{B}(N',G')}$
        if $M' > M_{best}$ then
            $N_{best} \leftarrow N'$;  $G_{best} \leftarrow G'$;  $M_{best} \leftarrow M'$
        else
            $p \leftarrow \exp\left(-\frac{M'/M_{best}}{T}\right)$
            $r \leftarrow$ Random($0, 1$)  ▷ uniform random number between 0 and 1
            if $r < p$ then
                $N_{best} \leftarrow N'$;  $G_{best} \leftarrow G'$;  $M_{best} \leftarrow M'$
            end if
        end if
        $T \leftarrow T \cdot \alpha$;  $i \leftarrow i + 1$
    end while
    return $N_{best}$, $G_{best}$, $M_{best}$
end procedure

The Perturb function is responsible for perturbing the problem instance by changing the network or task graph in some way. In our implementation, the Perturb function randomly selects (with equal probability) one of the following perturbations:

  1. Change Network Node Weight: Select a node $v \in V$ uniformly at random and change its weight by a uniformly random amount between $-1/10$ and $1/10$, with a minimum weight of $0$ and a maximum weight of $1$.

  2. Change Network Edge Weight: Same as Change Network Node Weight, but for edges (not including self-edges).

  3. Change Task Weight: Same as Change Network Node Weight, but for tasks.

  4. Change Dependency Weight: Same as Change Network Edge Weight, but for dependencies.

  5. Add Dependency: Select a task $t \in T$ uniformly at random and add a dependency from $t$ to a uniformly random task $t' \in T$ such that $(t,t') \notin D$ and doing so does not create a cycle.

  6. Remove Dependency: Select a dependency $(t,t') \in D$ uniformly at random and remove it.
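A minimal sketch of such a Perturb function (our own illustration under the networkx representation from Section 1.1; SAGA's implementation may differ, and the weight given to a newly added dependency is our own assumption):

```python
import random
import networkx as nx

def perturb(network, task_graph):
    """Apply one uniformly chosen perturbation to a copy of the problem instance."""
    network, task_graph = network.copy(), task_graph.copy()
    choice = random.choice(["node_w", "edge_w", "task_w", "dep_w", "add_dep", "rm_dep"])
    delta = random.uniform(-0.1, 0.1)  # weight changes are clipped to [0, 1]
    if choice == "node_w":
        v = random.choice(list(network.nodes))
        network.nodes[v]["speed"] = min(1.0, max(0.0, network.nodes[v]["speed"] + delta))
    elif choice == "edge_w":
        u, v = random.choice([e for e in network.edges if e[0] != e[1]])  # no self-edges
        network.edges[u, v]["strength"] = min(1.0, max(0.0, network.edges[u, v]["strength"] + delta))
    elif choice == "task_w":
        t = random.choice(list(task_graph.nodes))
        task_graph.nodes[t]["cost"] = min(1.0, max(0.0, task_graph.nodes[t]["cost"] + delta))
    elif choice == "dep_w" and task_graph.edges:
        t, tp = random.choice(list(task_graph.edges))
        task_graph.edges[t, tp]["data"] = min(1.0, max(0.0, task_graph.edges[t, tp]["data"] + delta))
    elif choice == "add_dep":
        candidates = [(t, tp) for t in task_graph for tp in task_graph
                      if t != tp and not task_graph.has_edge(t, tp)
                      and not nx.has_path(task_graph, tp, t)]  # adding t->tp must not create a cycle
        if candidates:
            t, tp = random.choice(candidates)
            task_graph.add_edge(t, tp, data=random.random())  # new weight: assumed uniform
    elif choice == "rm_dep" and task_graph.edges:
        task_graph.remove_edge(*random.choice(list(task_graph.edges)))
    return network, task_graph
```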

Some of the algorithms we evaluate were designed only for homogeneous compute nodes and/or communication links. In these cases, we restrict the perturbations to change only the aspects of the network that are relevant to the algorithm. For ETF, FCP, and FLB, we set all node weights to 1 initially and do not allow them to be changed. For BIL, GDL, FCP, and FLB, we set all communication link weights to 1 initially and do not allow them to be changed.

For every pair of schedulers, we run the simulated annealing algorithm 5 times with different randomly generated initial problem instances. The initial problem instance $(N,G)$ is such that $N$ is a complete graph with between 3 and 5 nodes (chosen uniformly at random) and node/edge weights between 0 and 1 (generated uniformly at random; self-edges have weight $\infty$) and $G$ is a simple chain task graph with between 3 and 5 tasks (chosen uniformly at random) and task/dependency weights between 0 and 1 (generated uniformly at random). In our implementation, we set $T_{max} = 10$, $T_{min} = 0.1$, $I_{max} = 1000$, and $\alpha = 0.99$.
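Putting the pieces together, a minimal Python rendering of Algorithm 1 with these parameters (a sketch only, relying on the perturb and makespan helpers sketched earlier; schedule_fn and baseline_fn stand for the two schedulers being compared):

```python
import math
import random

def pisa(network, task_graph, schedule_fn, baseline_fn,
         t_max=10.0, t_min=0.1, i_max=1000, alpha=0.99):
    """Simulated annealing search for a high makespan-ratio problem instance."""
    def ratio(net, tg):
        return makespan(schedule_fn(tg, net)) / makespan(baseline_fn(tg, net))

    best_net, best_tg = network, task_graph
    best_ratio = ratio(best_net, best_tg)
    temp, i = t_max, 0
    while temp > t_min and i < i_max:
        cand_net, cand_tg = perturb(best_net, best_tg)
        cand_ratio = ratio(cand_net, cand_tg)
        # Accept improvements outright; accept worse moves with probability
        # exp(-(M'/M_best)/T), as in Algorithm 1.
        if cand_ratio > best_ratio or random.random() < math.exp(-(cand_ratio / best_ratio) / temp):
            best_net, best_tg, best_ratio = cand_net, cand_tg, cand_ratio
        temp *= alpha
        i += 1
    return best_net, best_tg, best_ratio
```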

5.1. Results

Figure 4 shows the PISA results for each pair of schedulers.

Figure 4. Heatmap of makespan ratios for all 15 algorithms compared to each other. The cell value (and color) for row $i$ and column $j$ indicates the makespan ratio for the worst-case instance found by PISA for scheduler $j$ against baseline scheduler $i$.

The benefit of our approach is clear from just a quick glance at Figure 4 (there is a lot more red!). A closer look reveals even more interesting results. First, for every scheduling algorithm, PISA finds problem instances on which it performs at least twice as badly as another of the algorithms. In fact, for most of the algorithms (10 of 15), it finds a problem instance on which the algorithm performs at least five times worse than another algorithm! Furthermore, we note that for nearly every pair of algorithms $\mathcal{A}, \mathcal{B}$, PISA identifies both a problem instance where $\mathcal{A}$ outperforms $\mathcal{B}$ and one where $\mathcal{B}$ outperforms $\mathcal{A}$. Finally, some of the cells in Figure 4 have values greater than 1000. In these cases, PISA identified a problem instance where one algorithm drastically outperforms the other, and it is likely that there exist problem instances where the scheduler performs arbitrarily poorly compared to the baseline scheduler.

Readers can find a closer look at one example comparing HEFT and CPoP (two well-known task scheduling algorithms (Topcuoglu et al., 1999)) in Appendix A.2 that demonstrates how PISA can be used to discover the reasons why an algorithm performs well on some problem instances and poorly on others.

6. Application-Specific PISA

In Section 5, we introduced PISA as an effective method for finding problem instances where an algorithm performs much worse than benchmarking results suggest. The results, however, depend greatly on the initial problem instance and the implementation of the Perturb function. These two things define the space of problem instances the algorithm searches over and also affect which problem instances are more or less likely to be explored. The implementation we chose in Section 5 kept problem instances relatively small (between three and five tasks and compute nodes) but allowed arbitrary task graph structure and CCRs (communication-to-computation ratios). This let us explore how the structure of problem instances affects schedules in order to find patterns where certain algorithms outperform others. In many more realistic scenarios, though, application developers have a better idea of what their task graph and/or compute network will look like. PISA can easily be restricted to searching over a space of realistic problem instances by adjusting the Perturb implementation and the initial problem instance. In this section, we report results on experiments with application-specific Perturb implementations. Again, these results support the main hypothesis of this paper: PISA reveals performance boundaries that a traditional benchmarking approach does not.

6.1. Experimental Setup

One of the largest communities of researchers that depend on efficient task scheduling algorithms is that of scientists who use scientific workflows for simulation, experimentation, and more. Scientific workflows are typically big-data-scale task graphs that are scheduled to run on cloud or supercomputing platforms. These researchers typically have little to no control over how their workflows are scheduled, instead relying on Workflow Management Systems (WFMS) like Pegasus (Deelman et al., 2015), Makeflow (Albrecht et al., 2012), and Nextflow (Di Tommaso et al., 2017) (to name just a few examples) to handle the technical execution details. For this reason, it is especially important for WFMS developers/maintainers (who choose which scheduling algorithms their system uses) to understand the performance boundaries between different algorithms for the different types of scientific workflows and computing systems their clients use.

SAGA currently has datasets based on nine real-world scientific workflows (blast, bwa, cycles, epigenomics, 1000genome, montage, seismology, soykb, and srasearch). These applications come from a very wide range of scientific domains, from astronomy (montage builds mosaics of astronomical imagery) to biology (1000genome performs human genome reconstruction) to agriculture (cycles is a multi-crop, multi-year agroecosystem model). For each workflow, the runtime of each task, input/output sizes in bytes, and speedup factor (compute speed) for each machine are available from public execution trace information (https://github.com/wfcommons/pegasus-instances, https://github.com/wfcommons/makeflow-instances). The inter-node communication rate, however, is not available. We set communication rates to be homogeneous so that the average CCR, or $\frac{\text{average data size}}{\text{communication strength}}$, is $\frac{1}{5}$, $\frac{1}{2}$, $1$, $2$, or $5$ (resulting in five different experiments for each workflow).
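Concretely, the homogeneous communication strength implied by a target average CCR follows directly from the definition (a small sketch; names are ours):

```python
def communication_strength(avg_data_size, target_ccr):
    """Solve CCR = avg_data_size / strength for the homogeneous link strength."""
    return avg_data_size / target_ccr

# For example, with an average data size of 1e6 bytes and a target CCR of 1/5:
strength = communication_strength(1e6, 1 / 5)  # -> 5e6
```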

Because these scientific workflows are much larger than the workflows used in the experiments from Section 5, we evaluate a smaller subset of the schedulers available in SAGA: FastestNode, HEFT, CPoP, MaxMin, MinMin, and WBA. Performing these experiments for the rest of the schedulers remains a task for future work. For generating a benchmarking dataset, we use the WfCommons Synthetic Workflow Generator (Coleman et al., 2023) to generate random in-family task graphs and create random networks using a best-fit distribution computed from the real execution trace data. We also use this method to generate initial problem instances for PISA. Then, we adapt PISA’s Perturb implementation as follows:

  • Change Network Node Weight: Same as described in Section 5, except the weight is scaled between the minimum and maximum node speeds observed in the real execution trace data.

  • Change Network Edge Weight: This option is removed since network edge weights are homogeneous and fixed to ensure the average CCR for the given application is a specific value.

  • Change Task Weight: Same as described in Section 5, except the weight is scaled between the minimum and maximum task runtime observed in the real execution trace data.

  • Change Dependency Weight: Same as described in Section 5, except the weight is scaled between the minimum and maximum task I/O size observed in the real execution trace data.

  • Add/Remove Dependency: This option is removed so that the task graph structure remains representative of the real application.

These adjustments to the Perturb implementation allow us to explore application-specific problem instances more realistically.

6.2. Results

In this section, we present and discuss some results on two of the workflows analyzed: srasearch and blast. Due to space constraints, full results can be found in Appendix A.3. Srasearch (workflow structure depicted in Figure 5(a)) is a toolkit for interacting with data in the INSDC Sequence Read Archives, and blast (workflow structure depicted in Figure 5(b)) is a toolkit for finding regions of similarity between biological sequences.

Figure 5. Example workflow structures for the srasearch (a) and blast (b) scientific workflows.

Observe in Figure 5 that while the number of tasks may vary, the structure of both workflows is very rigid. Our new Perturb implementation, though, guarantees that the search space contains only task graphs with the appropriate structure. Figures 6 and 7 show the benchmarking and PISA results for srasearch and blast, respectively. First, observe that the benchmarking approach suggests all algorithms (except FastestNode) perform very well on the srasearch applications with a CCR of 1/5. Using PISA, however, we are able to identify problem instances where WBA performs thousands of times worse than FastestNode! We are also able to find instances where MinMin performs almost twice as badly as CPoP. Even among the "good" algorithms, though, we see interesting behavior. Observe the results of HEFT and MaxMin: we are able to find both a problem instance where HEFT performs approximately 20% worse than MaxMin and an instance where MaxMin performs approximately 11% worse than HEFT.

Figure 6. Benchmarking and PISA results for srasearch workflows with different average CCRs (results for a CCR of 5 can be found in Appendix A.3). For each CCR, the top row shows benchmarking results, where gradients indicate makespan ratios on different problem instances in the dataset. All other cells report the makespan ratio of the worst-case problem instance discovered by PISA.

The results for srasearch seem to imply that CPoP, which appears to perform consistently well for all CCRs tested (the worst case found has a makespan ratio of 1.15), would be the scheduling algorithm of choice for a WFMS designer. Workflow Management Systems do not support just one type of scientific workflow, though, and CPoP's effectiveness for srasearch workflows may not extend to others. This is true for blast (see results in Figure 7), where CPoP performs generally poorly for all CCRs tested and in one case yields a schedule with 1000 times the makespan of the schedule WBA produces!

Figure 7. Benchmarking and PISA results for blast workflows with different average CCRs (results for a CCR of 5 can be found in Appendix A.3). For each CCR, the top row shows benchmarking results, where gradients indicate makespan ratios on different problem instances in the dataset. All other cells report the makespan ratio of the worst-case problem instance discovered by PISA.

These results are evidence that the traditional benchmarking approach to comparing algorithms is insufficient even for highly regular, application-specific problem instances. They also have implications for the design of Workflow Management Systems (and other task scheduling systems in IoT, edge computing environments, etc.). It may be reasonable for a WFMS to run a set of scheduling algorithms that best covers the different types of client scientific workflows and computing systems. For example, a WFMS designer might run PISA and choose the three algorithms with the combined minimum maximum makespan ratio, as sketched below. Different methodologies for constructing and comparing such hybrid algorithms are an interesting topic for future work.
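Given PISA's pairwise worst-case results, this selection could be sketched as follows (illustrative only; ratios[a][b] is assumed to hold the worst makespan ratio of algorithm a against baseline b found by PISA):

```python
from itertools import combinations

def best_hybrid(ratios, k=3):
    """Pick the k algorithms whose combined worst-case makespan ratio is smallest.

    ratios[a][b]: worst makespan ratio of algorithm a against baseline b found by PISA.
    A candidate set is scored by its worst case over all baselines, assuming the
    best-performing member is chosen for each adversarial instance.
    """
    algorithms = list(ratios)

    def score(subset):
        return max(min(ratios[a][b] for a in subset) for b in algorithms)

    return min(combinations(algorithms, k), key=score)
```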

7. Conclusion

In this paper, we presented SAGA, a Python framework for running, evaluating, and comparing task scheduling algorithms. We evaluated 15 of the scheduling algorithms implemented in SAGA on 16 datasets and demonstrated how our proposed adversarial analysis method, PISA, provides useful information that traditional benchmarking does not. We also explored how PISA can be used in application-specific scenarios where system designers have some idea of what the target task graphs and compute networks look like ahead of time. We showed that, even in this restricted case, PISA successfully identifies performance boundaries between task scheduling algorithms that the traditional benchmarking approach does not reveal.

There are many directions for future work. First, we plan to extend SAGA to include more algorithms and datasets. Another logical next step is to extend SAGA and the adversarial analysis method to support other problem variants. Due to our particular interest in task scheduling for dynamic environments, we plan to add support for stochastic problem instances (with stochastic task costs, data sizes, computation speeds, and communication costs). It would also be interesting to explore other meta-heuristics for adversarial analysis (e.g., genetic algorithms) and other performance metrics (e.g., throughput, energy consumption, cost, etc.). Finally, the application-specific scenario we explored in Section 6 suggests that an exploration into different methodologies for constructing and comparing hybrid scheduling algorithms using PISA might be fruitful.

References

  • Albrecht et al. (2012) Michael Albrecht, Patrick Donnelly, Peter Bui, and Douglas Thain. 2012. Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (Scottsdale, Arizona, USA) (SWEET ’12). Association for Computing Machinery, New York, NY, USA, Article 1, 13 pages. https://doi.org/10.1145/2443416.2443417
  • Armstrong et al. (1998) R. Armstrong, D. Hensgen, and T. Kidd. 1998. The relative performance of various mapping algorithms is independent of sizable variances in run-time predictions. In Proceedings Seventh Heterogeneous Computing Workshop (HCW’98). 79–87. https://doi.org/10.1109/HCW.1998.666547
  • Authors (2023a) Anonymous Authors. 2023a. Scheduling Algorithms Gathered. Github. https://anonymous.4open.science/r/saga-1F6D/README.md
  • Authors (2023b) Anonymous Authors. 2023b. Scheduling Algorithms Gathered: A Framework for Implementing, Evaluating, and Comparing Task Graph Scheduling Algorithms. Technical Report. Anonymous Institution.
  • Bazzi and Norouzi-Fard (2015) Abbas Bazzi and Ashkan Norouzi-Fard. 2015. Towards Tight Lower Bounds for Scheduling Problems. In Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings (Lecture Notes in Computer Science, Vol. 9294), Nikhil Bansal and Irene Finocchi (Eds.). Springer, 118–129. https://doi.org/10.1007/978-3-662-48350-3_11
  • Blythe et al. (2005) James Blythe, S. Jain, Ewa Deelman, Yolanda Gil, Karan Vahi, Anirban Mandal, and Ken Kennedy. 2005. Task scheduling strategies for workflow-based applications in grids. In 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 9-12 May, 2005, Cardiff, UK. IEEE Computer Society, 759–767. https://doi.org/10.1109/CCGRID.2005.1558639
  • Braun et al. (1999a) T.D. Braun, H.J. Siegal, N. Beck, L.L. Boloni, M. Maheswaran, A.I. Reuther, J.P. Robertson, M.D. Theys, Bin Yao, D. Hensgen, and R.F. Freund. 1999a. A comparison study of static mapping heuristics for a class of meta-tasks on heterogeneous computing systems. In Proceedings. Eighth Heterogeneous Computing Workshop (HCW’99). 15–29. https://doi.org/10.1109/HCW.1999.765093
  • Braun et al. (1999b) T.D. Braun, H.J. Siegal, N. Beck, L.L. Boloni, M. Maheswaran, A.I. Reuther, J.P. Robertson, M.D. Theys, Bin Yao, D. Hensgen, and R.F. Freund. 1999b. A comparison study of static mapping heuristics for a class of meta-tasks on heterogeneous computing systems. In Proceedings. Eighth Heterogeneous Computing Workshop (HCW’99). 15–29. https://doi.org/10.1109/HCW.1999.765093
  • Braun et al. (2001) Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau Bölöni, Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys, Bin Yao, Debra A. Hensgen, and Richard F. Freund. 2001. A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems. J. Parallel Distributed Comput. 61, 6 (2001), 810–837. https://doi.org/10.1006/jpdc.2000.1714
  • Canon et al. (2008) Louis-Claude Canon, Emmanuel Jeannot, Rizos Sakellariou, and Wei Zheng. 2008. Comparative Evaluation Of The Robustness Of DAG Scheduling Heuristics. In Grid Computing - Achievements and Prospects: CoreGRID Integration Workshop 2008, Hersonissos, Crete, Greece, April 2-4, 2008, Sergei Gorlatch, Paraskevi Fragopoulou, and Thierry Priol (Eds.). Springer, 73–84. https://doi.org/10.1007/978-0-387-09457-1_7
  • Coleman et al. (2023) Tainã Coleman, Henri Casanova, and Rafael Ferreira da Silva. 2023. Automated generation of scientific workflow generators with WfChef. Future Gener. Comput. Syst. 147 (2023), 16–29. https://doi.org/10.1016/j.future.2023.04.031
  • Cordeiro et al. (2010) Daniel Cordeiro, Grégory Mounié, Swann Perarnau, Denis Trystram, Jean-Marc Vincent, and Frédéric Wagner. 2010. Random graph generation for scheduling simulations. ICST. https://doi.org/10.4108/ICST.SIMUTOOLS2010.8667
  • da Silva et al. (2019a) Rafael Ferreira da Silva, Rosa Filgueira, Ewa Deelman, Erola Pairo-Castineira, Ian Michael Overton, and Malcolm P. Atkinson. 2019a. Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows. Future Gener. Comput. Syst. 95 (2019), 615–628. https://doi.org/10.1016/j.future.2019.01.015
  • da Silva et al. (2019b) Rafael Ferreira da Silva, Rajiv Mayani, Yuning Shi, Armen R. Kemanian, Mats Rynge, and Ewa Deelman. 2019b. Empowering Agroecosystem Modeling with HTC Scientific Workflows: The Cycles Model Use Case. In 2019 IEEE International Conference on Big Data (IEEE BigData), Los Angeles, CA, USA, December 9-12, 2019, Chaitanya K. Baru, Jun Huan, Latifur Khan, Xiaohua Hu, Ronay Ak, Yuanyuan Tian, Roger S. Barga, Carlo Zaniolo, Kisung Lee, and Yanfang (Fanny) Ye (Eds.). IEEE, 4545–4552. https://doi.org/10.1109/BigData47090.2019.9006107
  • Deelman et al. (2015) Ewa Deelman, Karan Vahi, Gideon Juve, Mats Rynge, Scott Callaghan, Philip J. Maechling, Rajiv Mayani, Weiwei Chen, Rafael Ferreira da Silva, Miron Livny, and Kent Wenger. 2015. Pegasus, a workflow management system for science automation. Future Generation Computer Systems 46 (2015), 17–35. https://doi.org/10.1016/j.future.2014.10.008
  • Di Tommaso et al. (2017) Paolo Di Tommaso, Maria Chatzou, Evan W. Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 4 (01 Apr 2017), 316–319. https://doi.org/10.1038/nbt.3820
  • El-Rewini and Lewis (1990) Hesham El-Rewini and T. G. Lewis. 1990. Scheduling parallel program tasks onto arbitrary target machines. J. Parallel and Distrib. Comput. 9, 2 (1990), 138–153. https://doi.org/10.1016/0743-7315(90)90042-N
  • Filgueira et al. (2016) Rosa Filgueira, Rafael Ferreira da Silva, Amrey Krause, Ewa Deelman, and Malcolm P. Atkinson. 2016. Asterism: Pegasus and Dispel4py Hybrid Workflows for Data-Intensive Science. In Seventh International Workshop on Data-Intensive Computing in the Clouds, DataCloud@SC 2016, Salt Lake, UT, USA, November 14, 2016. IEEE Computer Society, 1–8. https://doi.org/10.1109/DataCloud.2016.004
  • Garey and Johnson (1979) M. R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.
  • Graham et al. (1979) R.L. Graham, E.L. Lawler, J.K. Lenstra, and A.H.G. Rinnooy Kan. 1979. Optimization and Approximation in Deterministic Sequencing and Scheduling: a Survey. In Discrete Optimization II, P.L. Hammer, E.L. Johnson, and B.H. Korte (Eds.). Annals of Discrete Mathematics, Vol. 5. Elsevier, 287–326. https://doi.org/10.1016/S0167-5060(08)70356-X
  • Graham (1969) R. L. Graham. 1969. Bounds on Multiprocessing Timing Anomalies. SIAM J. Appl. Math. 17, 2 (1969), 416–429. https://doi.org/10.1137/0117039
  • Hazekamp and Thain (2017) Nick Hazekamp and Douglas Thain. 2017. Makeflow Examples Repository. Github. http://github.com/cooperative-computing-lab/makeflow-examples
  • Houssein et al. (2021) Essam H. Houssein, Ahmed G. Gad, Yaser Maher Wazery, and Ponnuthurai Nagaratnam Suganthan. 2021. Task Scheduling in Cloud Computing based on Meta-heuristics: Review, Taxonomy, Open Challenges, and Future Trends. Swarm Evol. Comput. 62 (2021), 100841. https://doi.org/10.1016/j.swevo.2021.100841
  • Hwang et al. (1989) Jing-Jang Hwang, Yuan-Chieh Chow, Frank D. Anger, and Chung-Yee Lee. 1989. Scheduling Precedence Graphs in Systems with Interprocessor Communication Times. SIAM J. Comput. 18, 2 (1989), 244–257. https://doi.org/10.1137/0218016
  • Juve et al. (2013) Gideon Juve, Ann L. Chervenak, Ewa Deelman, Shishir Bharathi, Gaurang Mehta, and Karan Vahi. 2013. Characterizing and profiling scientific workflows. Future Gener. Comput. Syst. 29, 3 (2013), 682–692. https://doi.org/10.1016/j.future.2012.08.015
  • Keahey et al. (2020) Kate Keahey, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, Jacob Colleran, Haryadi S. Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs. 2020. Lessons Learned from the Chameleon Testbed. In 2020 USENIX Annual Technical Conference, USENIX ATC 2020, July 15-17, 2020, Ada Gavrilovska and Erez Zadok (Eds.). USENIX Association, 219–233. https://www.usenix.org/conference/atc20/presentation/keahey
  • Kirkpatrick et al. (1983) Scott Kirkpatrick, C. Daniel Gelatt Jr., and Mario P. Vecchi. 1983. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680. https://doi.org/10.1126/science.220.4598.671
  • Kwok and Ahmad (1998) Y.-K. Kwok and I. Ahmad. 1998. Benchmarking the task graph scheduling algorithms. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing. 531–537. https://doi.org/10.1109/IPPS.1998.669967
  • Lee et al. (1988) Chung-Yee Lee, Jing-Jang Hwang, Yuan-Chieh Chow, and Frank D. Anger. 1988. Multiprocessor scheduling with interprocessor communication delays. Operations Research Letters 7, 3 (1988), 141–147. https://doi.org/10.1016/0167-6377(88)90080-6
  • Liu et al. (2016) Yang Liu, Saad M. Khan, Juexin Wang, Mats Rynge, Yuanxun Zhang, Shuai Zeng, Shiyuan Chen, João V. Maldonado dos Santos, Babu Valliyodan, Prasad Calyam, Nirav C. Merchant, Henry T. Nguyen, Dong Xu, and Trupti Joshi. 2016. PGen: large-scale genomic variations analysis workflow and browser in SoyKB. BMC Bioinform. 17, S-13 (2016), 337. https://doi.org/10.1186/s12859-016-1227-y
  • Namyar et al. (2022) Pooria Namyar, Behnaz Arzani, Ryan Beckett, Santiago Segarra, and Srikanth Kandula. 2022. Minding the gap between Fast Heuristics and their Optimal Counterparts. In Hot Topics in Networking. ACM. https://www.microsoft.com/en-us/research/publication/minding-the-gap-between-fast-heuristics-and-their-optimal-counterparts/
  • Oh and Ha (1996) Hyunok Oh and Soonhoi Ha. 1996. A Static Scheduling Heuristic for Heterogeneous Processors. In Euro-Par ’96 Parallel Processing, Second International Euro-Par Conference, Lyon, France, August 26-29, 1996, Proceedings, Volume II (Lecture Notes in Computer Science, Vol. 1124), Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert (Eds.). Springer, 573–577. https://doi.org/10.1007/BFb0024750
  • Radulescu and van Gemund (2000) Andrei Radulescu and Arjan J. C. van Gemund. 2000. Fast and Effective Task Scheduling in Heterogeneous Systems. In 9th Heterogeneous Computing Workshop, HCW 2000, Cancun, Mexico, May 1, 2000. IEEE Computer Society, 229–238. https://doi.org/10.1109/HCW.2000.843747
  • Rynge (2017) Mats Rynge. 2017. SRA Search Pegasus Workflow. Github. https://github.com/pegasus-isi/sra-search-pegasus-workflow
  • Rynge et al. (2014) M. Rynge, G. Juve, J. Kinney, J. Good, B. Berriman, A. Merrihew, and E. Deelman. 2014. Producing an Infrared Multiwavelength Galactic Plane Atlas Using Montage, Pegasus, and Amazon Web Services. In Astronomical Data Analysis Software and Systems XXIII (Astronomical Society of the Pacific Conference Series, Vol. 485), N. Manset and P. Forshay (Eds.). 211.
  • Shukla et al. (2017) Anshu Shukla, Shilpa Chaturvedi, and Yogesh Simmhan. 2017. RIoTBench: A Real-time IoT Benchmark for Distributed Stream Processing Platforms. CoRR abs/1701.08530 (2017). arXiv:1701.08530 http://arxiv.org/abs/1701.08530
  • Sih and Lee (1993a) G.C. Sih and E.A. Lee. 1993a. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Transactions on Parallel and Distributed Systems 4, 2 (1993), 175–187. https://doi.org/10.1109/71.207593
  • Sih and Lee (1993b) Gilbert C. Sih and Edward A. Lee. 1993b. A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures. IEEE Trans. Parallel Distributed Syst. 4, 2 (1993), 175–187. https://doi.org/10.1109/71.207593
  • Topcuoglu et al. (1999) Haluk Topcuoglu, Salim Hariri, and Min-You Wu. 1999. Task Scheduling Algorithms for Heterogeneous Processors. In 8th Heterogeneous Computing Workshop, HCW 1999, San Juan, Puerto Rico, April 12, 1999. IEEE Computer Society, 3–14. https://doi.org/10.1109/HCW.1999.765092
  • Varshney et al. (2022) Prateeksha Varshney, Shriram Ramesh, Shayal Chhabra, Aakash Khochare, and Yogesh Simmhan. 2022. Resilient Execution of Data-triggered Applications on Edge, Fog and Cloud Resources. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 473–483. https://doi.org/10.1109/CCGrid54584.2022.00057
  • Wang and Sinnen (2018) Huijun Wang and Oliver Sinnen. 2018. List-Scheduling versus Cluster-Scheduling. IEEE Trans. Parallel Distributed Syst. 29, 8 (2018), 1736–1749. https://doi.org/10.1109/TPDS.2018.2808959

Appendix A Appendix

A.1. Scheduling Algorithm Descriptions

To orient the reader, we provide a brief description of each of the algorithms implemented in SAGA and presented in Section 3.

BIL (Best Imaginary Level) is a list scheduling algorithm designed for the unrelated machines model. In this model, the runtime of a task on a node need not be a function of task cost and node speed; rather, it can be arbitrary. This model is even more general than the related machines model we study in this paper. BIL's scheduling complexity is $O(|T|^2 \cdot |V| \log |V|)$ and it was proven to be optimal for linear graphs. The authors report a makespan speedup over the GDL (Generalized Dynamic Level) scheduler on randomly generated problem instances. The exact process for generating random problem instances is not described, except that CCRs (communication-to-computation ratios, i.e., average communication time over average execution time) of $1/2$, $1$, and $2$ were used.

CPoP (Critical Path on Processor) was proposed in the same paper as HEFT (Heterogeneous Earliest Finish Time). Both are list scheduling algorithms with scheduling complexity $O(|T|^2 |V|)$; no formal performance bounds are known for either. HEFT first prioritizes tasks by their upward rank, which is essentially the length of the longest chain of tasks starting at that task (considering average execution and communication times). It then greedily schedules tasks in this order, assigning each to the node that minimizes its completion time given previously scheduled tasks. CPoP is similar but uses a slightly different priority metric. The biggest difference between the two is that CPoP always schedules critical-path tasks (those on the longest chain in the task graph) to execute on the fastest node. Both algorithms were evaluated on random graphs (the generation process is described in the paper) and on real task graphs for Gaussian Elimination and FFT applications. They were compared against three other scheduling algorithms: Mapping Heuristic (similar to HEFT without insertion) (El-Rewini and Lewis, 1990), Dynamic-Level Scheduling (Sih and Lee, 1993a), and Levelized Min Time (we could not find the original paper that proposes this algorithm). Schedule Length Ratios (makespan scaled by the sum of minimum computation costs of all tasks), speedups, and schedule generation times are reported.
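To make the greedy assignment phase concrete, here is a minimal sketch in the spirit of HEFT under the related machines model. It assumes tasks are already sorted by decreasing upward rank and omits the insertion-based slot search of the full algorithm; the dict-based problem representation (`cost`, `speed`, `size`, `bandwidth`) is our own illustration, not SAGA's API.

```python
def heft_assign(ranked_tasks, preds, cost, speed, size, bandwidth):
    """Map each task to the node that minimizes its finish time."""
    node_free = {n: 0.0 for n in speed}   # earliest idle time on each node
    sched = {}                            # task -> (node, start, finish)
    for t in ranked_tasks:
        best_node, best_start, best_finish = None, None, float("inf")
        for n in speed:
            # Data from every predecessor must arrive before t can start;
            # communication is free when both tasks run on the same node.
            ready = 0.0
            for p in preds.get(t, []):
                pn, _, pf = sched[p]
                delay = 0.0 if pn == n else size[(p, t)] / bandwidth[(pn, n)]
                ready = max(ready, pf + delay)
            start = max(ready, node_free[n])
            finish = start + cost[t] / speed[n]
            if finish < best_finish:
                best_node, best_start, best_finish = n, start, finish
        sched[t] = (best_node, best_start, best_finish)
        node_free[best_node] = best_finish
    return sched
```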

The MinMin, MaxMin, and Duplex list scheduling algorithms may have been proposed many times independently, but exceptionally clear definitions can be found in a paper that compares them to OLB, MET, MCT, and a few meta-heuristic algorithms (e.g., genetic algorithms and simulated annealing) (Braun et al., 2001). MinMin schedules tasks by iteratively selecting the task with the smallest minimum completion time (given previously scheduled tasks) and assigning it to the corresponding node. MaxMin, on the other hand, iteratively selects the task with the largest minimum completion time and assigns it to the corresponding node. Duplex simply runs both MinMin and MaxMin and returns the schedule with the smaller makespan. In the paper, the authors evaluate these algorithms on independent, non-communicating tasks with uniformly random costs and heterogeneous compute nodes with uniformly random speeds. MinMin (and therefore also Duplex) is shown to generate schedules with low makespans compared to the other algorithms, while relatively high makespans are reported for MaxMin.
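Below is a compact sketch of all three heuristics in the independent-task setting of (Braun et al., 2001), restricted to the related machines model so that a task's completion time on a node is the node's ready time plus task cost over node speed; all names here are illustrative.

```python
def min_min(costs, speeds, use_max=False):
    """costs: task -> work units; speeds: node -> speed. Returns task -> node."""
    free = {n: 0.0 for n in speeds}          # earliest idle time per node
    assignment, remaining = {}, set(costs)
    while remaining:
        # Minimum completion time (and the node achieving it) per task.
        best = {}
        for t in remaining:
            n = min(speeds, key=lambda m: free[m] + costs[t] / speeds[m])
            best[t] = (free[n] + costs[t] / speeds[n], n)
        # MinMin commits the task with the smallest MCT; MaxMin the largest.
        pick = (max if use_max else min)(remaining, key=lambda t: best[t][0])
        mct, node = best[pick]
        assignment[pick], free[node] = node, mct
        remaining.remove(pick)
    return assignment


def duplex(costs, speeds):
    """Run both variants and return the assignment with the smaller makespan."""
    def makespan(assign):
        load = {n: 0.0 for n in speeds}
        for t, n in assign.items():
            load[n] += costs[t] / speeds[n]
        return max(load.values())
    a, b = min_min(costs, speeds), min_min(costs, speeds, use_max=True)
    return a if makespan(a) <= makespan(b) else b
```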

ETF (Earliest Task First) is one of the few algorithms we discuss in this paper that has formal bounds. It is a list scheduling algorithm with runtime $O(|T||V|^2)$ and was designed for heterogeneous task graphs but homogeneous compute networks. It works by iteratively scheduling the task with the earliest possible start time given previously scheduled tasks (usually; details omitted here for simplicity can be found in the original paper). Note how this differs from HEFT and CPoP, which schedule according to the earliest possible completion time of a task. This important difference is what allows the authors to prove the formal bound $\omega_{\text{ETF}} \leq (2 - 1/n)\,\omega_{\text{opt}}^{(i)} + C$, where $\omega_{\text{opt}}^{(i)}$ is the optimal schedule makespan without communication and $C$ is the total communication requirement over some terminal chain of tasks.
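ETF's selection rule is easy to express in code. The sketch below assumes unit-speed nodes and a single link bandwidth `bw`, matching the homogeneous-network setting ETF was designed for, and omits the tie-breaking rules of the original paper; the representation is our own.

```python
def etf(tasks, preds, cost, nodes, size, bw):
    """tasks: all tasks; preds: task -> predecessors; cost: task -> time
    (unit-speed nodes); size: (pred, task) -> data size; bw: link bandwidth."""
    free = {n: 0.0 for n in nodes}           # earliest idle time per node
    sched, done = {}, set()                  # task -> (node, start, finish)
    while len(done) < len(tasks):
        ready = [t for t in tasks if t not in done
                 and all(p in done for p in preds.get(t, []))]
        best = None                          # (start time, task, node)
        for t in ready:
            for n in nodes:
                # Latest arrival of predecessor data on node n.
                arrive = max((sched[p][2] + (0.0 if sched[p][0] == n
                                             else size[(p, t)] / bw)
                              for p in preds.get(t, [])), default=0.0)
                start = max(arrive, free[n])
                if best is None or start < best[0]:
                    best = (start, t, n)
        start, t, n = best
        sched[t] = (n, start, start + cost[t])
        free[n] = start + cost[t]
        done.add(t)
    return sched
```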

FCP (Fast Critical Path) and FLB (Fast Load Balancing) both have a runtime of $O(|T| \log |V| + |D|)$ and were designed for heterogeneous task graphs and heterogeneous node speeds but homogeneous communication strengths. The algorithms were evaluated on three types of task graphs with different structures based on real applications (LU decomposition, a Laplace equation solver, and a stencil algorithm) with varied CCRs and uniformly random task costs. Both algorithms were shown to perform well compared to HEFT and ERT (Lee et al., 1988) despite their lower schedule generation times.

GDL (Generalized Dynamic Level), also called DLS (Dynamic Level Scheduling), is a variation on list scheduling in which task priorities are updated each time a task is scheduled. Due to this added computation in each iteration, the complexity of DLS is $O(|V|^3 |T|)$ (a factor of $|V|$ greater than HEFT and CPoP). GDL was originally designed for the very general unrelated machines model and was shown to outperform HDLFET (Sih and Lee, 1993b) on randomly generated problem instances (though the method used for generating random task graphs is not well described) and on four real digital signal processing applications (two sound synthesis algorithms, a telephone channel simulator, and a quadrature amplitude modulation transmitter).

MCT (Minimum Completion Time) and MET (Minimum Execution Time) are very simple algorithms originally designed for the unrelated machines model. MET simply schedules each task on the machine with the smallest execution time for that task (regardless of the task's start/end time). MCT assigns tasks in arbitrary order to the node with the smallest completion time given previously scheduled tasks (essentially HEFT without insertion or its priority function). MET and MCT have scheduling complexities of $O(|T||V|)$ and $O(|T|^2 |V|)$, respectively. The algorithms were evaluated on task graphs with 125 or 500 tasks, each task having one of five possible execution times, and were shown to outperform a very naive baseline algorithm that does not use expected execution times for scheduling.
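Both heuristics fit in a few lines. The sketch below specializes them to the related machines model (our simplification; both were stated for unrelated machines, where per-task, per-node execution times are arbitrary), which is why MET degenerates to always picking the fastest node.

```python
def met(costs, speeds):
    """Schedule each task on its fastest node, ignoring queueing; under
    related machines this is the single fastest node for every task."""
    fastest = max(speeds, key=speeds.get)
    return {t: fastest for t in costs}


def mct(tasks, costs, speeds):
    """Assign tasks in the given (arbitrary) order to whichever node
    completes them earliest, given earlier assignments."""
    free = {n: 0.0 for n in speeds}          # earliest idle time per node
    assignment = {}
    for t in tasks:
        node = min(speeds, key=lambda n: free[n] + costs[t] / speeds[n])
        assignment[t] = node
        free[node] += costs[t] / speeds[node]
    return assignment
```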

OLB (Opportunistic Load Balancing) has a runtime of just $O(|T|)$ and was designed for independent tasks under the unrelated machines model. Probably useful only as a baseline for understanding the performance of other algorithms, OLB schedules tasks in arbitrary order on the earliest available compute node. Its performance has been shown to be significantly worse than that of MET, MCT, and LBA (Armstrong et al., 1998).

WBA (Workflow Based Application) is a scheduling algorithm developed for managing scientific workflows in cloud environments and was designed for the fully heterogeneous model discussed in this paper. We observe that its scheduling complexity is at most $O(|T||D||V|)$ (the authors do not report this, however, and a more efficient implementation might be possible). WBA operates by randomly assigning tasks to nodes, guided by a distribution that, in each iteration, favors choices that least increase the schedule makespan.
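As an illustration of this random, makespan-guided selection, here is a hypothetical sketch of the node-selection step. The particular weighting (inversely proportional to the makespan increase) and the helper `makespan_if` are our own stand-ins; the original paper defines its own distribution over candidate nodes.

```python
import random


def wba_pick_node(task, nodes, makespan_if):
    """makespan_if(task, node) -> makespan of the partial schedule if
    `task` were placed on `node` (a hypothetical helper)."""
    result = {n: makespan_if(task, n) for n in nodes}
    lo = min(result.values())
    # Weight nodes so that choices that least increase the makespan are
    # most likely to be drawn (illustrative weighting, not WBA's exact one).
    weights = [1.0 / (1.0 + result[n] - lo) for n in nodes]
    return random.choices(list(nodes), weights=weights, k=1)[0]
```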

FastestNode is a simple baseline algorithm that schedules all tasks to execute serially on the fastest compute node. BruteForce is a naive algorithm that tries every possible schedule and returns the one with the smallest makespan. SMT uses an SMT (satisfiability modulo theories) solver and binary search to find a $(1+\epsilon)$-OPT schedule.

A.2. Case Study: HEFT vs. CPoP

The HEFT and CPoP algorithms were proposed in the same paper by Topcuoglu et al. (1999) and have remained two of the most popular task scheduling algorithms in the decades since. Both are list scheduling algorithms that use a priority list to determine the order in which tasks are greedily scheduled. In HEFT, each task is scheduled on the node that minimizes its finish time, given previous scheduling decisions. CPoP is similar, but commits to scheduling all critical-path tasks (those on the longest path in the task graph) on the fastest node. (This holds for the related machines model; in general, CPoP schedules critical-path tasks on the node that minimizes the sum of the critical-path tasks' execution times.) The two algorithms' priority functions differ slightly, but both are based on a task's distance (in terms of average compute and communication time) from the start and/or end of the task graph. In HEFT, a sink task's (one with no successors) priority is its average execution time over all nodes in the network. A non-sink task's priority is its average execution time over all nodes plus the maximum, over its successors, of the average communication time to that successor plus the successor's priority. In other words, it is the average time it takes to execute the task on any node in the network plus the maximum amount of time it takes (in an ideal scenario) to execute the rest of the task graph. In CPoP, task priorities are based on the distance both to the end and from the start of the task graph. This difference is what causes the two algorithms to perform differently on certain problem instances. Observe in Figure 8 a problem instance identified by PISA where HEFT performs approximately $1.55$ times worse than CPoP.
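The upward-rank recursion described above can be written directly. This sketch assumes precomputed per-task average execution times (`avg_exec`) and per-edge average communication times (`avg_comm`); the names are ours.

```python
from functools import lru_cache


def make_upward_rank(succs, avg_exec, avg_comm):
    """succs: task -> successor tasks; avg_exec: task -> mean execution
    time over nodes; avg_comm: (task, succ) -> mean communication time."""
    @lru_cache(maxsize=None)
    def rank(t):
        # Sink tasks (no successors) get just their average execution time.
        return avg_exec[t] + max(
            (avg_comm[(t, s)] + rank(s) for s in succs.get(t, ())),
            default=0.0,
        )
    return rank
```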

Figure 8. Problem instance where HEFT performs $\approx 1.55$ times worse than CPoP: (a) task graph, (b) network, (c) HEFT schedule, (d) CPoP schedule.

In this instance, the algorithms differ primarily in whether task $A$ or task $C$ has higher priority. In both algorithms, task $B$ has the highest priority and is thus scheduled first on node $2$ (the fastest node). For CPoP, task $C$ must have the higher priority because it is on the critical path $B \rightarrow C$. For HEFT, though, task $A$ has higher priority than $C$ because it is further from the end of the task graph. As a result, CPoP schedules $C$ on node $2$, which allows task $A$ to be scheduled and executed in parallel on node $3$ (the second-fastest node). HEFT, on the other hand, schedules all tasks on node $2$ and thus does not benefit from parallel execution. CPoP succeeds in this instance because it prioritizes tasks on the critical path, keeping high-cost tasks/communications on the same node and allowing low-cost tasks/communications to execute in parallel on other nodes. HEFT greedily schedules high-cost tasks on fast nodes without as much consideration of how doing so might affect the rest of the task graph (especially the tasks on the critical path).

This does not mean that CPoP is better than HEFT, though! Observe in Figure 9 a problem instance identified by PISA where CPoP performs approximately $1.88$ times worse than HEFT.

Figure 9. Problem instance where CPoP performs $\approx 1.88$ times worse than HEFT: (a) task graph, (b) network, (c) HEFT schedule, (d) CPoP schedule.

In this instance, the algorithms' priorities are actually the same for all tasks; in fact, they schedule both $A$ and $B$ identically. What trips up CPoP on this instance is exactly what allows it to succeed on the previous one: committing to scheduling all critical-path tasks on the fastest node. Here, the critical path is $A \rightarrow C$. Thus, CPoP schedules task $C$ on node $2$ (the fastest node) even though it would have finished much sooner on node $1$ (due to the communication cost incurred by its dependency on task $A$). HEFT does not have this problem because it does not commit to scheduling all critical-path tasks on the fastest node.

In conclusion, CPoP succeeds when the task graph has a critical path with low-cost tasks that “mess up” HEFT's priority function, and fails when scheduling a task on the fastest node incurs a large communication penalty. It is easy to see that these bad-case scenarios can be modified to make HEFT and CPoP perform arbitrarily worse than each other (e.g., by increasing the number of parallel source tasks in the first example, or by changing the network weight between nodes $2$ and $3$ to $0$ in the second example). This example highlights one of the benefits of our proposed approach to comparing task scheduling algorithms: it allows us to identify patterns where the algorithms under-/over-perform and then extrapolate those patterns to other problem instances.

A.3. Application-Specific PISA Results

Here we include the Section 6 benchmarking and PISA results for all evaluated scientific workflows. For each CCR, the top row shows benchmarking results, where gradients indicate makespan ratios on different problem instances in the dataset. Every other cell shows the problem instance with the highest makespan ratio discovered by PISA.

Figure 10. Results for the blast scientific workflow.
Figure 11. Results for the srasearch scientific workflow.
Figure 12. Results for the bwa scientific workflow.
Figure 13. Results for the epigenomics scientific workflow.
Figure 14. Results for the 1000genome scientific workflow.
Figure 15. Results for the montage scientific workflow.
Figure 16. Results for the seismology scientific workflow.
Figure 17. Results for the soykb scientific workflow.