License: CC BY 4.0
arXiv:2403.07120v1 [cs.DC] 11 Mar 2024

Comparing Task Graph Scheduling Algorithms: An Adversarial Approach

Jared Coleman (jaredcol@usc.edu, ORCID 0000-0003-1227-2962) and Bhaskar Krishnamachari (bkrishna@usc.edu), University of Southern California, 3650 McClintock Ave, Los Angeles, California, USA 90089
(2024)
Abstract.

Scheduling a task graph representing an application over a heterogeneous network of computers is a fundamental problem in distributed computing. It is known to be not only NP-hard but also not polynomial-time approximable within a constant factor. As a result, many heuristic algorithms have been proposed over the past few decades. Yet it remains largely unclear how these algorithms compare to each other in terms of the quality of schedules they produce. We identify gaps in the traditional benchmarking approach to comparing task scheduling algorithms and propose a simulated annealing-based adversarial analysis approach called PISA to help address them. We also introduce SAGA, a new open-source library for comparing task scheduling algorithms. We use SAGA to benchmark 15 algorithms on 16 datasets and PISA to compare the algorithms in a pairwise manner. Algorithms that appear to perform similarly on benchmarking datasets are shown to perform very differently on adversarially chosen problem instances. Interestingly, the results indicate that this is true even when the adversarial search is constrained to selecting among well-structured, application-specific problem instances. This work represents an important step towards a more general understanding of the performance boundaries between task scheduling algorithms on different families of problem instances.

scheduling, simulated annealing, task graph, networks, benchmarking, makespan, adversarial analysis
copyright: ACM; journal year: 2024; doi: XXXXXXX.XXXXXXX; conference: ACM Symposium on Principles of Distributed Computing, June 17–21, 2024, Nantes, France; price: 15.00; isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Task graph scheduling is a fundamental problem in distributed computing. Essentially, the goal is to assign computational tasks to different compute nodes in such a way that minimizes or maximizes some performance metric (e.g., total execution time, energy consumption, throughput, etc.). In this paper, we focus on the task scheduling problem for heterogeneous task graphs and compute networks with the objective of minimizing makespan (total execution time) under the related machines model.¹ As this problem is NP-hard (Garey and Johnson, 1979) and also not polynomial-time approximable within a constant factor (Bazzi and Norouzi-Fard, 2015), many heuristic algorithms have been proposed. Despite the overwhelming number of algorithms now available in the literature, open-source implementations of them are scarce. Even worse, the datasets on which they were evaluated are often not available, and the implementations that do exist are not compatible with each other (different programming languages, frameworks, problem instance formats, etc.). In order to address this technological shortcoming and enable this study, we built SAGA, a Python framework for running, evaluating, and comparing task scheduling algorithms.

¹In the related machines model, if the same task executes faster on some compute node $n_1$ than on node $n_2$, then $n_1$ must execute all tasks faster than $n_2$ ($n_1$ is strictly faster than $n_2$). Observe that this model cannot describe multi-modal distributed systems, where certain classes of tasks (e.g., GPU-heavy tasks) might run better or worse on different types of machines (e.g., those with or without GPUs). The related machines model as it pertains to the task scheduling problem we study in this paper is described further in Section 1.1.

SAGA’s modularity and extensibility make it easy to benchmark algorithms on various datasets (SAGA currently includes datasets of randomized problem instances and datasets based on real-world scientific workflow and IoT/edge computing applications). The underlying theme throughout this paper, however, is that the traditional benchmarking approach to comparing task scheduling algorithms is insufficient. Benchmarking is only as useful as the underlying datasets are representative and, in practice, peculiarities of heuristic algorithms for task scheduling make it difficult to tell just what broader family of problem instances a dataset is really representative of. For this reason, benchmarking results for task scheduling algorithms can be misleading. In this paper, we show examples of, and provide methods for the automatic discovery of, in-family problem instances (that is, ones that are similar to other problem instances in the dataset) on which algorithms that appear to perform well on the original dataset perform very poorly. In fact, for every scheduling algorithm we consider in this paper (15 total), our proposed simulated annealing-based adversarial instance finder (PISA) finds problem instances on which it performs at least twice as badly as another of the algorithms. For most of the algorithms (10 of 15), it finds a problem instance on which the algorithm performs at least five times worse than another algorithm!

The main contributions of this paper are as follows:

  1. Introduces SAGA (Authors, 2023b), an open-source, modular, and extensible Python package with implementations of many well-known task scheduling algorithms and tools for benchmarking and adversarial analysis.

  2. Reports benchmarking results of 15 well-known scheduling algorithms on 16 datasets.

  3. Proposes PISA, a novel simulated annealing-based adversarial analysis method for comparing algorithms that identifies problem instances for which a given algorithm maximally underperforms another.

  4. Identifies gaps in the traditional benchmarking approach: PISA finds problem instances where algorithms that appear to do well on benchmarking datasets perform poorly on adversarially chosen datasets. We also show that this is true even when PISA is restricted to searching over well-structured, application-specific problem instances.

The rest of the paper is organized as follows. We formally define the problem in Section 1.1. In Section 2, we provide a brief history of the task scheduling problem and related work. In Section 3, we introduce SAGA, a Python library for running, evaluating, and comparing task scheduling algorithms. Then, in Section 4, we present the results of benchmarking 15 algorithms on 16 datasets. In Section 5, we present the main contribution of this paper: PISA, a simulated annealing-based adversarial analysis method for comparing task scheduling algorithms. In Section 5.1 we present the results of comparing the 15 algorithms in a pairwise manner using this method. In Section 6, we demonstrate how our proposed technique can be tailored to application-specific scenarios where certain properties of the task graph and/or network are known ahead of time. We conclude the paper in Section 7 with a short discussion on the implications of this work and directions for future research.

1.1. Problem Definition

Let us denote the task graph as $G=(T,D)$, where $T$ is the set of tasks and $D$ contains the directed edges or dependencies between these tasks. An edge $(t,t') \in D$ implies that the output from task $t$ is required input for task $t'$. Thus, task $t'$ cannot start executing until it has received the output of task $t$. This is often referred to as a precedence constraint. For a given task $t \in T$, its compute cost is represented by $c(t) \in \mathbb{R}^+$, and the size of the data exchanged between two dependent tasks $(t,t') \in D$ is $c(t,t') \in \mathbb{R}^+$. Let $N=(V,E)$ denote the compute node network, where $N$ is a complete undirected graph, $V$ is the set of nodes, and $E$ is the set of edges. The compute speed of a node $v \in V$ is $s(v) \in \mathbb{R}^+$ and the communication strength between nodes $(v,v') \in E$ is $s(v,v') \in \mathbb{R}^+$. Under the related machines model (Graham, 1969), the execution time of a task $t \in T$ on a node $v \in V$ is $\frac{c(t)}{s(v)}$, and the data communication time between tasks $(t,t') \in D$ from node $v$ to node $v'$ (i.e., $t$ executes on $v$ and $t'$ executes on $v'$) is $\frac{c(t,t')}{s(v,v')}$.
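For concreteness, a problem instance can be represented with networkx graphs; the following is a minimal sketch (the attribute names cost, data, speed, and strength are our own, not a prescribed format):

```python
import networkx as nx

# Task graph G = (T, D): nodes are tasks with compute cost c(t),
# edges are dependencies with data size c(t, t').
task_graph = nx.DiGraph()
task_graph.add_node("t1", cost=1.0)
task_graph.add_node("t2", cost=2.0)
task_graph.add_edge("t1", "t2", data=0.5)  # t2 needs t1's output

# Network N = (V, E): a complete undirected graph with node speeds s(v)
# and link strengths s(v, v'). Self-edges with infinite strength model
# free same-node data transfers.
network = nx.Graph()
network.add_node("v1", speed=1.0)
network.add_node("v2", speed=2.0)
network.add_edge("v1", "v2", strength=0.5)
network.add_edge("v1", "v1", strength=float("inf"))
network.add_edge("v2", "v2", strength=float("inf"))

# Related machines model: execution and communication times.
exec_time = task_graph.nodes["t2"]["cost"] / network.nodes["v2"]["speed"]
comm_time = task_graph.edges["t1", "t2"]["data"] / network.edges["v1", "v2"]["strength"]
```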

The goal is to schedule the tasks on the compute nodes so as to minimize the makespan (total execution time) of the task graph. Let $\mathcal{A}$ denote a task scheduling algorithm. Given a problem instance $(N,G)$, which represents a network/task graph pair, let $S_{\mathcal{A},N,G}$ denote the schedule produced by $\mathcal{A}$ for $(N,G)$. A schedule is a mapping from each task to a triple $(v,r,e)$, where $v$ is the node on which the task is scheduled, $r$ is the start time, and $e$ is the end time. A valid schedule must satisfy the following properties:

  • All tasks must be scheduled: for all $t \in T$, $S_{\mathcal{A},N,G}(t) = (v,r,e)$ must exist such that $v \in V$ and $0 \leq r \leq e$.

  • All tasks must have valid start and end times:

    $\forall t \in T,\; S_{\mathcal{A},N,G}(t) = (v,r,e) \implies e - r = \frac{c(t)}{s(v)}$

  • Only one task can be scheduled on a node at a time (i.e., their start/end times cannot overlap):

    $\forall t, t' \in T,\; t \neq t',\; S_{\mathcal{A},N,G}(t) = (v,r,e) \land S_{\mathcal{A},N,G}(t') = (v,r',e') \implies e \leq r' \lor e' \leq r$

  • A task cannot start executing until all of its dependencies have finished executing and their outputs have been received at the node on which the task is scheduled:

    $\forall (t,t') \in D,\; S_{\mathcal{A},N,G}(t) = (v,r,e) \land S_{\mathcal{A},N,G}(t') = (v',r',e') \implies e + \frac{c(t,t')}{s(v,v')} \leq r'$
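These constraints are straightforward to check mechanically. Below is a minimal validity-checker sketch under the networkx representation from the example above (our own illustration, not SAGA's implementation); it assumes self-edges with infinite strength so that same-node transfers cost nothing:

```python
def is_valid_schedule(task_graph, network, schedule, tol=1e-9):
    """Check the four validity constraints for schedule: task -> (node, start, end)."""
    # 1. Every task is scheduled on a real node with 0 <= start <= end.
    for t in task_graph.nodes:
        if t not in schedule:
            return False
        v, r, e = schedule[t]
        if v not in network.nodes or not 0 <= r <= e:
            return False
    # 2. Durations match the related machines model: e - r = c(t) / s(v).
    for t, (v, r, e) in schedule.items():
        if abs((e - r) - task_graph.nodes[t]["cost"] / network.nodes[v]["speed"]) > tol:
            return False
    # 3. Tasks scheduled on the same node must not overlap in time.
    intervals_by_node = {}
    for t, (v, r, e) in schedule.items():
        intervals_by_node.setdefault(v, []).append((r, e))
    for intervals in intervals_by_node.values():
        intervals.sort()
        for (_, e1), (r2, _) in zip(intervals, intervals[1:]):
            if r2 < e1 - tol:
                return False
    # 4. Precedence: t' starts only after t's output arrives at t''s node.
    for t, t_prime in task_graph.edges:
        v, _, e = schedule[t]
        v_prime, r_prime, _ = schedule[t_prime]
        comm = task_graph.edges[t, t_prime]["data"] / network.edges[v, v_prime]["strength"]
        if e + comm > r_prime + tol:
            return False
    return True
```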
Figure 1. Example problem instance and schedule: (a) task graph, (b) network, (c) schedule.

Figure 1 depicts an example problem instance (task graph and network) and solution (schedule). We define the makespan of the schedule $S_{\mathcal{A},N,G}$ as the time at which the last task finishes executing:

$M_{\mathcal{A}(N,G)} = \max_{t \in T \,\mid\, S_{\mathcal{A},N,G}(t) = (v,r,e)} e$

Because the problem of minimizing makespan is NP-hard for this model (Garey and Johnson, 1979), many heuristic algorithms have been proposed. Traditionally, these heuristic algorithms are evaluated on a set of problem instances and compared against other algorithms based on their makespan ratio, which for a given problem instance $(N,G)$ is the makespan of the schedule produced by the algorithm divided by the minimum makespan of the schedules produced by the baseline algorithms. The makespan ratio of an algorithm $\mathcal{A}$ against a set of baseline algorithms $\mathcal{A}_1, \mathcal{A}_2, \ldots$ for a problem instance $(N,G)$ can be written

$\frac{M_{\mathcal{A}(N,G)}}{\min\left(M_{\mathcal{A}_1(N,G)},\, M_{\mathcal{A}_2(N,G)},\, M_{\mathcal{A}_3(N,G)},\, \ldots\right)}.$

Makespan ratio is a commonly used concept in the task scheduling literature (Topcuoglu et al., 1999; Wang and Sinnen, 2018). It is common to collect the makespan ratios of an algorithm (against a set of baseline algorithms) for a dataset of problem instances. System designers use this technique, called benchmarking, to decide which algorithm(s) best support(s) their application. In this paper, we highlight the shortcomings of this approach by discussing examples of (and providing methods for the automatic discovery of) in-family problem instances on which algorithms that appear to perform well on the original dataset perform very poorly.
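Both quantities are easy to compute from the schedule mapping defined above; a small sketch (function names are ours):

```python
def makespan(schedule):
    """Makespan: the latest end time over all scheduled tasks."""
    return max(end for _, _, end in schedule.values())

def makespan_ratio(schedule, baseline_schedules):
    """Ratio of a schedule's makespan to the best (minimum) baseline makespan."""
    return makespan(schedule) / min(makespan(s) for s in baseline_schedules)
```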

2. Related Work

A comprehensive 1979 survey by Graham et al. classified multiple variants of the task scheduling problem and proposed a structured approach to analyzing their complexity classes (Graham et al., 1979). Though the variant of task scheduling discussed in this paper was not directly addressed in their study, the established reduction framework paved the way for its NP-completeness determination. The survey addressed many facets of task graph scheduling, such as processor heterogeneity, precedence constraints, and makespan minimization, but did not consider inter-processor communication. This gap was addressed a decade later, in 1989, by Hwang et al., who introduced the concept of task graph scheduling with inter-processor communication (albeit only considering homogeneous processors). They proposed the ETF (Earliest Task First) algorithm and proved that it produces schedules with makespan at most $(2 - 1/n)\,\omega_{\mathrm{opt}}^{(i)} + C$, where $\omega_{\mathrm{opt}}^{(i)}$ is the optimal makespan without considering inter-processor communication delays and $C$ is the worst-case communication requirement over a terminal chain of tasks (one which determines the makespan).

Over time, as distributed computing came into the mainstream, many heuristic scheduling algorithms emerged. Many of the most popular fall under the list scheduling paradigm, originally proposed by Graham. In general, list scheduling algorithms involve two steps:

  1. Compute a priority $p(t)$ for each task $t$ such that tasks have higher priority than their dependent tasks.

  2. Greedily schedule tasks in order of their computed priority (from highest to lowest) on the node that minimizes some predefined cost function.

In other words, list scheduling algorithms first decide on a valid topological sort of the task graph, and then use it to schedule tasks greedily according to some objective function (ETF, mentioned above, is a list scheduling algorithm). Two of the most well-known heuristic algorithms for task graph scheduling, HEFT (Heterogeneous Earliest Finish Time) and CPoP (Critical Path on Processor) (Topcuoglu et al., 1999), are list scheduling algorithms. Many other heuristic algorithms have been proposed and experimentally evaluated on different datasets of task graphs and networks (see benchmarking/comparison/survey papers (Braun et al., 1999a, 2001; Canon et al., 2008; Kwok and Ahmad, 1998; Wang and Sinnen, 2018)). Other paradigms studied in the literature are cluster-scheduling (Wang and Sinnen, 2018) and meta-heuristic (Houssein et al., 2021) algorithms. Cluster scheduling involves dividing the task graph into groups of tasks to execute on a single node. Meta-heuristic approaches (e.g., simulated annealing or genetic algorithms) have been shown to work well in some situations, but generally take longer to run and can be difficult to tune (Braun et al., 1999b).
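To make the paradigm concrete, the following is a minimal list-scheduling sketch (our own illustration of the general pattern, not any particular published algorithm): it uses a topological order as the priority and greedily assigns each task to the node that minimizes its finish time. It reuses the networkx representation from Section 1.1, including self-edges with infinite strength for free same-node transfers.

```python
import networkx as nx

def list_schedule(task_graph, network):
    """Greedy list scheduling: topological priority + earliest-finish-time assignment."""
    schedule = {}                                # task -> (node, start, end)
    node_free = {v: 0.0 for v in network.nodes}  # earliest time each node is idle
    for t in nx.topological_sort(task_graph):    # any topological order is a valid priority
        best_v, best_start, best_end = None, None, None
        for v in network.nodes:
            start = node_free[v]
            # The task may start only after every input has arrived at node v.
            for dep in task_graph.predecessors(t):
                dep_node, _, dep_end = schedule[dep]
                comm = task_graph.edges[dep, t]["data"] / network.edges[dep_node, v]["strength"]
                start = max(start, dep_end + comm)
            end = start + task_graph.nodes[t]["cost"] / network.nodes[v]["speed"]
            if best_end is None or end < best_end:
                best_v, best_start, best_end = v, start, end
        schedule[t] = (best_v, best_start, best_end)
        node_free[best_v] = best_end
    return schedule
```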

In this paper, we propose a simulated annealing-based (Kirkpatrick et al., 1983) approach to finding problem instances on which an algorithm performs maximally poorly compared to a given baseline algorithm. This approach is inspired by the recent work of Namyar et al. who use simulated annealing and other techniques to find adversarial instances for heuristics that solve convex optimization problems (Namyar et al., 2022). To the best of our knowledge, this is the first work to propose an automated method for adversarial analysis of task scheduling algorithms.

3. The SAGA Framework

We built SAGA to overcome the notable lack of publicly available datasets and task scheduling algorithm implementations. SAGA is a Python library for running, evaluating, and comparing task scheduling algorithms. It currently contains implementations of 17 algorithms using a common interface. It also includes interfaces for generating, saving, and loading datasets for benchmarking. Finally, it includes an implementation of the main contribution of this paper: PISA, a simulated annealing-based adversarial analysis method for comparing algorithms. For more information on SAGA, we refer the reader to the technical report (Authors, 2023b).
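To give a flavor of the common interface, here is a hypothetical usage sketch (the module, class, and method names are illustrative assumptions based on the description above, not a documented API):

```python
# Hypothetical sketch: module/class/method names are assumptions, not a documented API.
from saga.schedulers import CpopScheduler, HeftScheduler

# Every scheduler exposes the same interface, so algorithms are interchangeable.
heft_schedule = HeftScheduler().schedule(network, task_graph)
cpop_schedule = CpopScheduler().schedule(network, task_graph)
```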

Table 1 lists the 17 algorithms currently implemented in SAGA.

Table 1. Schedulers implemented in SAGA
Abbreviation | Algorithm | Reference
BIL | Best Imaginary Level | (Oh and Ha, 1996)
BruteForce | Brute Force | -
CPoP | Critical Path on Processor | (Topcuoglu et al., 1999)
Duplex | Duplex | (Braun et al., 2001)
ETF | Earliest Task First | (Hwang et al., 1989)
FastestNode | Fastest Node | -
FCP | Fast Critical Path | (Radulescu and van Gemund, 2000)
FLB | Fast Load Balancing | (Radulescu and van Gemund, 2000)
GDL | Generalized Dynamic Level | (Sih and Lee, 1993b)
HEFT | Heterogeneous Earliest Finish Time | (Topcuoglu et al., 1999)
MaxMin | MaxMin | (Braun et al., 2001)
MCT | Minimum Completion Time | (Armstrong et al., 1998)
MET | Minimum Execution Time | (Armstrong et al., 1998)
MinMin | MinMin | (Braun et al., 2001)
OLB | Opportunistic Load Balancing | (Armstrong et al., 1998)
SMT | SMT-driven Binary Search | -
WBA | Workflow-Based Application | (Blythe et al., 2005)

To orient the reader, we provide a brief description of each of these algorithms along with their scheduling complexity, the model they were designed for, performance guarantees (if any), datasets they have been evaluated on, and other algorithms they have been evaluated against in Appendix A.1. Because the BruteForce and SMT scheduling algorithms take much longer to run (exponential time) than the other algorithms, they are not included in the benchmarking or adversarial analysis results reported in this paper.

SAGA also includes a set of tools for generating, saving, and loading datasets. Table 2 lists the 16 dataset generators currently included in SAGA.

Table 2. Datasets available in SAGA
Name | Task Graph | Network
in_trees | randomly weighted in-trees | randomly weighted
out_trees | randomly weighted out-trees | randomly weighted
chains | randomly weighted parallel chains | randomly weighted
blast | synthetic Blast workflows (Hazekamp and Thain, 2017) | Chameleon cloud inspired (Keahey et al., 2020)
bwa | synthetic BWA workflows (Hazekamp and Thain, 2017) | Chameleon cloud inspired
cycles | synthetic Cycles workflows (da Silva et al., 2019b) | Chameleon cloud inspired
epigenomics | synthetic Epigenomics workflows (Juve et al., 2013) | Chameleon cloud inspired
genome | synthetic 1000genome workflows (da Silva et al., 2019a) | Chameleon cloud inspired
montage | synthetic Montage workflows (Rynge et al., 2014) | Chameleon cloud inspired
seismology | synthetic Seismology workflows (Filgueira et al., 2016) | Chameleon cloud inspired
soykb | synthetic SoyKB workflows (Liu et al., 2016) | Chameleon cloud inspired
srasearch | synthetic SRASearch workflows (Rynge, 2017) | Chameleon cloud inspired
etl | RIoTBench ETL application (Shukla et al., 2017) | Edge/Fog/Cloud (Varshney et al., 2022)
predict | RIoTBench PREDICT application (Shukla et al., 2017) | Edge/Fog/Cloud
stats | RIoTBench STATS application (Shukla et al., 2017) | Edge/Fog/Cloud
train | RIoTBench TRAIN application (Shukla et al., 2017) | Edge/Fog/Cloud

The in_trees, out_trees, and chains datasets each consist of 1000 randomly generated network/task graph pairs following a common methodology used in the literature (Cordeiro et al., 2010). In-trees and out-trees are generated with between 2 and 4 levels (chosen uniformly at random), a branching factor of either 2 or 3 (chosen uniformly at random), and node/edge weights drawn from a clipped Gaussian distribution (mean: 1, standard deviation: 1/3, min: 0, max: 2). Parallel-chains task graphs are generated with between 2 and 5 parallel chains (chosen uniformly at random) of length between 2 and 5 (chosen uniformly at random) and node/edge weights drawn from the same clipped Gaussian distribution. Randomly weighted networks are complete graphs with between 3 and 5 nodes (chosen uniformly at random) and node/edge weights drawn from the same clipped Gaussian distribution.
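The clipped Gaussian weights used by these generators can be sampled as follows (a minimal sketch; the function name is ours):

```python
import random

def clipped_gaussian(mean=1.0, std=1/3, lo=0.0, hi=2.0):
    """Draw from a Gaussian with the given mean/std and clip the result to [lo, hi]."""
    return min(hi, max(lo, random.gauss(mean, std)))
```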

The scientific workflow datasets blast, bwa, cycles, epigenomics, genome, montage, seismology, soykb, and srasearch each contain 100 problem instances. The task graphs are synthetically generated using the WfCommons Synthetic Workflow Generator (Coleman et al., 2023) and are based on real-world scientific workflows. The Chameleon-cloud-inspired networks are generated by fitting a distribution to the machine speed data from the execution traces of real workflows on Chameleon that are available in WfCommons (detailed information from a real execution of the application, including task start/end times, CPU usages/requirements, data I/O sizes, etc.) and then sampling from that distribution to generate random networks. Because Chameleon uses a shared filesystem for data transfer, the communication cost can be absorbed into the computation cost, and thus the communication strength between nodes is considered to be infinite.

The IoT/edge-inspired etl, predict, stats, and train datasets each contain 1000 problem instances. The task graphs and networks are generated using the approach described in (Varshney et al., 2022). The task graph structure is based on real-world IoT data streaming applications and the node weights are generated using a clipped Gaussian distribution (mean: 35, standard deviation: 25/3, min: 10, max: 60). The input size of the application is generated using a clipped Gaussian distribution (mean: 1000, standard deviation: 500/3, min: 500, max: 1500) and the edge weights are determined by the known input/output ratios of the tasks. The Edge/Fog/Cloud networks are generated by constructing a complete graph with three types of nodes: edge nodes with CPU speed 1, fog nodes with CPU speed 6, and cloud nodes with CPU speed 50. The communication strength between edge and fog nodes is 60 and between fog and cloud/fog nodes is 100. The number of edge, fog, and cloud nodes is between 75 and 125, 3 and 7, and 1 and 10, respectively (chosen uniformly at random). In order to generate a complete graph (a requirement for many of the scheduling algorithms), the communication strength between edge and cloud nodes is set to 60 and the communication strength between cloud nodes is set to infinity (i.e., no communication delay).
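A sketch of this Edge/Fog/Cloud network construction follows (our own illustration; the text does not specify the strength of edge-to-edge links, so we assume 60 as for the other edge links):

```python
import itertools
import random
import networkx as nx

def edge_fog_cloud_network():
    """Build a complete graph with the edge/fog/cloud speeds and link strengths above."""
    net = nx.Graph()
    counts = {"edge": random.randint(75, 125), "fog": random.randint(3, 7),
              "cloud": random.randint(1, 10)}
    speeds = {"edge": 1, "fog": 6, "cloud": 50}
    for kind, n in counts.items():
        for i in range(n):
            net.add_node(f"{kind}{i}", speed=speeds[kind], kind=kind)
    for u, v in itertools.combinations(net.nodes, 2):
        kinds = {net.nodes[u]["kind"], net.nodes[v]["kind"]}
        if kinds == {"cloud"}:
            strength = float("inf")  # cloud-cloud: no communication delay
        elif "fog" in kinds and kinds != {"edge", "fog"}:
            strength = 100           # fog-cloud and fog-fog links
        else:
            strength = 60            # edge-fog, edge-cloud (and assumed edge-edge) links
        net.add_edge(u, v, strength=strength)
    return net
```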

Many other schedulers and datasets have been proposed and used in the literature and can be easily integrated into SAGA in the future. SAGA is open-source (Authors, 2023a) and designed to be modular and extensible. We encourage the community to contribute new algorithms, datasets, and experimentation tools.

4. Benchmarking Results

Figure 2 shows the results of benchmarking 15 algorithms on 16 datasets.

Figure 2. Makespan ratios of 15 algorithms evaluated on 16 datasets. Gradients depict performance on different problem instances in each dataset.

Different scheduling algorithms perform better or worse depending on the dataset, and algorithms that weren't designed for fully heterogeneous task graphs and networks (e.g., ETF, FastestNode) tend to perform poorly. Many of the algorithms, though, perform similarly across the datasets. While these experiments provide valuable information about the performance of each algorithm on each dataset, they provide much less information about the algorithms themselves.

Consider the illustrative scenario in Figure 3.

Figure 3. Comparison of scheduling algorithms on slightly modified networks: (a) task graph, (b) original network, (c) modified network, (d) HEFT schedule on original network, (e) CPoP schedule on original network, (f) HEFT schedule on modified network, (g) CPoP schedule on modified network.

A simplistic task graph, as depicted in Figure 3(a), coupled with a minor alteration to the initial network (Figures 3(b) and 3(c)), namely a reduction in the strength of node 3's communication links, causes HEFT to perform worse than CPoP (Figures 3(d)-3(g)). This example underscores the shortcoming of traditional benchmarking: it provides little insight into the conditions under which an algorithm performs well or poorly. Observe that a structurally equivalent instance of this problem, with all node/edge weights scaled to lie between 0 and 1, could have been generated by the Parallel Chains dataset generator in SAGA. So while the benchmarking results in Figure 2 indicate that HEFT performs slightly better than CPoP on this dataset, there are, in fact, Parallel Chains instances on which CPoP performs significantly better than HEFT.

5. Adversarial Analysis

We propose a novel adversarial analysis method for comparing task scheduling algorithms called PISA (Problem-instance Identification using Simulated Annealing). The goal of this method is to identify problem instances on which one algorithm outperforms another. More formally, the goal is to find a problem instance $(N,G)$ that maximizes the makespan ratio of algorithm $\mathcal{A}$ against algorithm $\mathcal{B}$:

$\max_{N,G} \frac{M_{\mathcal{A}(N,G)}}{M_{\mathcal{B}(N,G)}}$

In doing so, we hope to fill in some of the gaps we know exist in the benchmarking results. We propose a simulated annealing-based approach for finding such problem instances. Simulated annealing is a meta-heuristic that is often used to find the global optimum of a function that has many local optima. In the context of our problem, the optimizer starts with an initial problem instance (N,G)𝑁𝐺(N,G)( italic_N , italic_G ) and then randomly perturbs the problem instance by changing the network or task graph in some way. If the perturbed problem instance has a higher makespan ratio than the current problem instance, the optimizer accepts the perturbation and continues the search from the new problem instance. If the perturbed problem instance has a lower makespan ratio than the current problem instance, the optimizer accepts the perturbation with some probability. This allows the optimizer to escape local optima and (potentially) find the global optimum. Over time, the probability of accepting perturbations that decrease the makespan ratio decreases, allowing the optimizer to settle into a high-makespan ratio state. The pseudocode for our proposed simulated annealing process is shown in Algorithm 1.

Algorithm 1 PISA (Problem-instance Identification using Simulated Annealing)

procedure PISA($N$, $G$, $T_{max}$, $T_{min}$, $I_{max}$, $\alpha$, $\mathcal{A}$, $\mathcal{B}$)
    ▷ $(N,G)$ is the initial problem instance
    ▷ $T_{max}$ is the initial (maximum) temperature; $T_{min}$ is the minimum temperature
    ▷ $I_{max}$ is the maximum number of iterations; $\alpha$ is the cooling rate
    ▷ $\mathcal{A}$ is the scheduler; $\mathcal{B}$ is the baseline scheduler
    $N_{best} \leftarrow N$;  $G_{best} \leftarrow G$;  $M_{best} \leftarrow M_{\mathcal{A}(N,G)} / M_{\mathcal{B}(N,G)}$
    $T \leftarrow T_{max}$;  $i \leftarrow 0$
    while $T > T_{min}$ and $i < I_{max}$ do
        $N', G' \leftarrow$ Perturb($N_{best}, G_{best}$)
        $M' \leftarrow M_{\mathcal{A}(N',G')} / M_{\mathcal{B}(N',G')}$
        if $M' > M_{best}$ then
            $N_{best} \leftarrow N'$;  $G_{best} \leftarrow G'$;  $M_{best} \leftarrow M'$
        else
            $p \leftarrow \exp\left(-\frac{M'/M_{best}}{T}\right)$
            $r \leftarrow$ Random($0, 1$)  ▷ uniform random number between 0 and 1
            if $r < p$ then
                $N_{best} \leftarrow N'$;  $G_{best} \leftarrow G'$;  $M_{best} \leftarrow M'$
            end if
        end if
        $T \leftarrow T \cdot \alpha$;  $i \leftarrow i + 1$
    end while
    return $N_{best}$, $G_{best}$, $M_{best}$
end procedure

The Perturb function is responsible for perturbing the problem instance by changing the network or task graph in some way. In our implementation, the Perturb function randomly selects (with equal probability) one of the following perturbations:

  1. Change Network Node Weight: Select a node $v \in V$ uniformly at random and change its weight by a uniformly random amount between $-1/10$ and $1/10$, with a minimum weight of $0$ and a maximum weight of $1$.

  2. Change Network Edge Weight: Same as Change Network Node Weight, but for edges (not including self-edges).

  3. Change Task Weight: Same as Change Network Node Weight, but for tasks.

  4. Change Dependency Weight: Same as Change Network Edge Weight, but for dependencies.

  5. Add Dependency: Select a task $t \in T$ uniformly at random and add a dependency from $t$ to a uniformly random task $t' \in T$ such that $(t,t') \notin D$ and doing so does not create a cycle.

  6. Remove Dependency: Select a dependency $(t,t') \in D$ uniformly at random and remove it.
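A minimal sketch of such a Perturb function (our own illustration under the networkx representation from Section 1.1; SAGA's implementation may differ, and the weight given to a newly added dependency is our own assumption):

```python
import random
import networkx as nx

def perturb(network, task_graph):
    """Apply one uniformly chosen perturbation to a copy of the problem instance."""
    network, task_graph = network.copy(), task_graph.copy()
    choice = random.choice(["node_w", "edge_w", "task_w", "dep_w", "add_dep", "rm_dep"])
    delta = random.uniform(-0.1, 0.1)  # weight changes are clipped to [0, 1]
    if choice == "node_w":
        v = random.choice(list(network.nodes))
        network.nodes[v]["speed"] = min(1.0, max(0.0, network.nodes[v]["speed"] + delta))
    elif choice == "edge_w":
        u, v = random.choice([e for e in network.edges if e[0] != e[1]])  # no self-edges
        network.edges[u, v]["strength"] = min(1.0, max(0.0, network.edges[u, v]["strength"] + delta))
    elif choice == "task_w":
        t = random.choice(list(task_graph.nodes))
        task_graph.nodes[t]["cost"] = min(1.0, max(0.0, task_graph.nodes[t]["cost"] + delta))
    elif choice == "dep_w" and task_graph.edges:
        t, tp = random.choice(list(task_graph.edges))
        task_graph.edges[t, tp]["data"] = min(1.0, max(0.0, task_graph.edges[t, tp]["data"] + delta))
    elif choice == "add_dep":
        candidates = [(t, tp) for t in task_graph for tp in task_graph
                      if t != tp and not task_graph.has_edge(t, tp)
                      and not nx.has_path(task_graph, tp, t)]  # adding t->tp must not create a cycle
        if candidates:
            t, tp = random.choice(candidates)
            task_graph.add_edge(t, tp, data=random.random())  # new weight: assumed uniform
    elif choice == "rm_dep" and task_graph.edges:
        task_graph.remove_edge(*random.choice(list(task_graph.edges)))
    return network, task_graph
```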

Some of the algorithms we evaluate were designed only for homogeneous compute nodes and/or communication links. In these cases, we restrict the perturbations to change only the aspects of the network that are relevant to the algorithm. For ETF, FCP, and FLB, we set all node weights to 1 initially and do not allow them to be changed. For BIL, GDL, FCP, and FLB, we set all communication link weights to 1 initially and do not allow them to be changed.

For every pair of schedulers, we run the simulated annealing algorithm 5 times with different randomly generated initial problem instances. The initial problem instance $(N,G)$ is such that $N$ is a complete graph with between 3 and 5 nodes (chosen uniformly at random) and node/edge weights between 0 and 1 (generated uniformly at random; self-edges have weight $\infty$) and $G$ is a simple chain task graph with between 3 and 5 tasks (chosen uniformly at random) and task/dependency weights between 0 and 1 (generated uniformly at random). In our implementation, we set $T_{max} = 10$, $T_{min} = 0.1$, $I_{max} = 1000$, and $\alpha = 0.99$.
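Putting the pieces together, a minimal Python rendering of Algorithm 1 with these parameters (a sketch only, relying on the perturb and makespan helpers sketched earlier; schedule_fn and baseline_fn stand for the two schedulers being compared):

```python
import math
import random

def pisa(network, task_graph, schedule_fn, baseline_fn,
         t_max=10.0, t_min=0.1, i_max=1000, alpha=0.99):
    """Simulated annealing search for a high makespan-ratio problem instance."""
    def ratio(net, tg):
        return makespan(schedule_fn(tg, net)) / makespan(baseline_fn(tg, net))

    best_net, best_tg = network, task_graph
    best_ratio = ratio(best_net, best_tg)
    temp, i = t_max, 0
    while temp > t_min and i < i_max:
        cand_net, cand_tg = perturb(best_net, best_tg)
        cand_ratio = ratio(cand_net, cand_tg)
        # Accept improvements outright; accept worse moves with probability
        # exp(-(M'/M_best)/T), as in Algorithm 1.
        if cand_ratio > best_ratio or random.random() < math.exp(-(cand_ratio / best_ratio) / temp):
            best_net, best_tg, best_ratio = cand_net, cand_tg, cand_ratio
        temp *= alpha
        i += 1
    return best_net, best_tg, best_ratio
```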

5.1. Results

Figure 4 shows the PISA results for each pair of schedulers.

Figure 4. Heatmap of makespan ratios for all 15 algorithms compared to each other. The cell value (and color) for row $i$ and column $j$ indicates the makespan ratio for the worst-case instance found by PISA for scheduler $j$ against baseline scheduler $i$.

The benefit of our approach is clear from just a quick glance at Figure 4 (there is a lot more red!). A closer look reveals even more interesting results. First, for every scheduling algorithm, PISA finds problem instances on which it performs at least twice as badly as another of the algorithms. In fact, for most of the algorithms (10 of 15), it finds a problem instance on which the algorithm performs at least five times worse than another algorithm! Furthermore, we note that for nearly every pair of algorithms $\mathcal{A}, \mathcal{B}$, PISA identifies both a problem instance where $\mathcal{A}$ outperforms $\mathcal{B}$ and one where $\mathcal{B}$ outperforms $\mathcal{A}$. Finally, some of the cells in Figure 4 have values greater than 1000. In these cases, PISA identified a problem instance where one algorithm drastically outperforms the other, and it is likely that there exist problem instances where the scheduler performs arbitrarily poorly compared to the baseline scheduler.

Readers can find a closer look at one example comparing HEFT and CPoP (two well-known task scheduling algorithms (Topcuoglu et al., 1999)) in Appendix A.2 that demonstrates how PISA can be used to discover the reasons why an algorithm performs well on some problem instances and poorly on others.

6. Application-Specific PISA

In Section 5, we introduced PISA as an effective method for finding problem instances where an algorithm performs much worse than benchmarking results suggest. The results, however, depend greatly on the initial problem instance and the implementation of the Perturb function. These two things define the space of problem instances the algorithm searches over and also affect which problem instances are more or less likely to be explored. The implementation we chose in Section 5 kept problem instances relatively small (between three and five tasks and compute nodes) but allowed arbitrary task graph structure and CCRs (communication-to-computation ratios). This let us explore how the structure of problem instances affects schedules in order to find patterns where certain algorithms outperform others. In many more realistic scenarios, though, application developers have a better idea of what their task graph and/or compute network will look like. PISA can easily be restricted to searching over a space of realistic problem instances by adjusting the Perturb implementation and the initial problem instance. In this section, we report results on experiments with application-specific Perturb implementations. Again, these results support the main hypothesis of this paper: PISA reveals performance boundaries that a traditional benchmarking approach does not.

6.1. Experimental Setup

One of the largest communities of researchers that depend on efficient task scheduling algorithms is that of scientists who use scientific workflows for simulation, experimentation, and more. Scientific workflows are typically big-data-scale task graphs that are scheduled to run on cloud or supercomputing platforms. These researchers typically have little to no control over how their workflows are scheduled, instead relying on Workflow Management Systems (WFMS) like Pegasus (Deelman et al., 2015), Makeflow (Albrecht et al., 2012), and Nextflow (Di Tommaso et al., 2017) (to name just a few examples) to handle the technical execution details. For this reason, it is especially important for WFMS developers/maintainers (who choose which scheduling algorithms their system uses) to understand the performance boundaries between different algorithms for the different types of scientific workflows and computing systems their clients use.

SAGA currently has datasets based on nine real-world scientific workflows (blast, bwa, cycles, epigenomics, 1000genome, montage, seismology, soykb, and srasearch). These applications come from a very wide range of scientific domains, from astronomy (montage builds mosaics of astronomical imagery) to biology (1000genome performs human genome reconstruction) to agriculture (cycles is a multi-crop, multi-year agroecosystem model). For each workflow, the runtime of each task, input/output sizes in bytes, and speedup factor (compute speed) for each machine are available from public execution trace information (https://github.com/wfcommons/pegasus-instances, https://github.com/wfcommons/makeflow-instances). The inter-node communication rate, however, is not available. We set communication rates to be homogeneous so that the average CCR, or $\frac{\text{average data size}}{\text{communication strength}}$, is $\frac{1}{5}$, $\frac{1}{2}$, $1$, $2$, or $5$ (resulting in five different experiments for each workflow).
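Concretely, the homogeneous communication strength implied by a target average CCR follows directly from the definition (a small sketch; names are ours):

```python
def communication_strength(avg_data_size, target_ccr):
    """Solve CCR = avg_data_size / strength for the homogeneous link strength."""
    return avg_data_size / target_ccr

# For example, with an average data size of 1e6 bytes and a target CCR of 1/5:
strength = communication_strength(1e6, 1 / 5)  # -> 5e6
```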

Because these scientific workflows are much larger than the workflows used in the experiments from Section 5, we evaluate a smaller subset of the schedulers available in SAGA: FastestNode, HEFT, CPoP, MaxMin, MinMin, and WBA. Performing these experiments for the rest of the schedulers remains a task for future work. For generating a benchmarking dataset, we use the WfCommons Synthetic Workflow Generator (Coleman et al., 2023) to generate random in-family task graphs and create random networks using a best-fit distribution computed from the real execution trace data. We also use this method to generate initial problem instances for PISA. Then, we adapt PISA’s Perturb implementation as follows:

  • Change Network Node Weight: Same as described in Section 5, except the weight is scaled between the minimum and maximum node speeds observed in the real execution trace data.

  • Change Network Edge Weight: This option is removed since network edge weights are homogeneous and fixed to ensure the average CCR for the given application is a specific value.

  • Change Task Weight: Same as described in Section 5, except the weight is scaled between the minimum and maximum task runtime observed in the real execution trace data.

  • Change Dependency Weight: Same as described in Section 5, except the weight is scaled between the minimum and maximum task I/O size observed in the real execution trace data.

  • Add/Remove Dependency: This option is removed so that the task graph structure remains representative of the real application.

These adjustments to the Perturb implementation allow us to explore application-specific problem instances more realistically.

6.2. Results

In this section, we present and discuss some results on two of the workflows analyzed: srasearch and blast. Due to space constraints, full results can be found in Appendix A.3. Srasearch (workflow structure depicted in Figure 5(a)) is a toolkit for interacting with data in the INSDC Sequence Read Archives, and blast (workflow structure depicted in Figure 5(b)) is a toolkit for finding regions of similarity between biological sequences.

Figure 5. Example workflow structures for the srasearch (a) and blast (b) scientific workflows.

Observe in Figure 5 that while the number of tasks may vary, the structure of both workflows is very rigid. Our new Perturb implementation, though, guarantees that the search space contains only task graphs with the appropriate structure. Figures 6 and 7 show the benchmarking and PISA results for srasearch and blast, respectively. First, observe that the benchmarking approach suggests all algorithms (except FastestNode) perform very well on the srasearch applications with a CCR of 1/5. Using PISA, however, we are able to identify problem instances where WBA performs thousands of times worse than FastestNode! We are also able to find instances where MinMin performs almost twice as badly as CPoP. Even among the "good" algorithms, though, we see interesting behavior. Observe the results of HEFT and MaxMin: we are able to find both a problem instance where HEFT performs approximately 20% worse than MaxMin and an instance where MaxMin performs approximately 11% worse than HEFT.

Figure 6. Benchmarking and PISA results for srasearch workflows with different average CCRs (results for a CCR of 5 can be found in Appendix A.3). For each CCR, the top row shows benchmarking results, where gradients indicate makespan ratios on different problem instances in the dataset. All other cells report the makespan ratio of the worst-case problem instance discovered by PISA.

The results for srasearch seem to imply that CPoP, which appears to perform consistently well for all CCRs tested (the worst case found has a makespan ratio of 1.15), would be the scheduling algorithm of choice for a WFMS designer. Workflow Management Systems do not support just one type of scientific workflow, though, and CPoP's effectiveness for srasearch workflows may not extend to others. This is true for blast (see results in Figure 7), where CPoP performs generally poorly for all CCRs tested and in one case yields a schedule with 1000 times the makespan of the schedule WBA produces!

Figure 7. Benchmarking and PISA results for blast workflows with different average CCRs (results for a CCR of 5 can be found in Appendix A.3). For each CCR, the top row shows benchmarking results, where gradients indicate makespan ratios on different problem instances in the dataset. All other cells report the makespan ratio of the worst-case problem instance discovered by PISA.

These results are evidence that the traditional benchmarking approach to comparing algorithms is insufficient even for highly regular, application-specific problem instances. They also have implications for the design of Workflow Management Systems (and other task scheduling systems in IoT, edge computing environments, etc.). It may be reasonable for a WFMS to run a set of scheduling algorithms that best covers the different types of client scientific workflows and computing systems. For example, a WFMS designer might run PISA and choose the three algorithms with the combined minimum maximum makespan ratio, as sketched below. Different methodologies for constructing and comparing such hybrid algorithms are an interesting topic for future work.
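Given PISA's pairwise worst-case results, this selection could be sketched as follows (illustrative only; ratios[a][b] is assumed to hold the worst makespan ratio of algorithm a against baseline b found by PISA):

```python
from itertools import combinations

def best_hybrid(ratios, k=3):
    """Pick the k algorithms whose combined worst-case makespan ratio is smallest.

    ratios[a][b]: worst makespan ratio of algorithm a against baseline b found by PISA.
    A candidate set is scored by its worst case over all baselines, assuming the
    best-performing member is chosen for each adversarial instance.
    """
    algorithms = list(ratios)

    def score(subset):
        return max(min(ratios[a][b] for a in subset) for b in algorithms)

    return min(combinations(algorithms, k), key=score)
```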

7. Conclusion

In this paper, we presented SAGA, a Python framework for running, evaluating, and comparing task scheduling algorithms. We evaluated 15 of the scheduling algorithms implemented in SAGA on 16 datasets and demonstrated how our proposed adversarial analysis method, PISA, provides useful information that traditional benchmarking does not. We also explored how PISA can be used in application-specific scenarios where system designers have some idea of what the target task graphs and compute networks look like ahead of time. We showed that, even in this restricted case, PISA successfully identifies performance boundaries between task scheduling algorithms that the traditional benchmarking approach does not reveal.

There are many directions for future work. First, we plan to extend SAGA to include more algorithms and datasets. Another logical next step is to extend SAGA and the adversarial analysis method to support other problem variants. Due to our particular interest in task scheduling for dynamic environments, we plan to add support for stochastic problem instances (with stochastic task costs, data sizes, computation speeds, and communication costs). It would also be interesting to explore other meta-heuristics for adversarial analysis (e.g., genetic algorithms) and other performance metrics (e.g., throughput, energy consumption, cost, etc.). Finally, the application-specific scenario we explored in Section 6 suggests that an exploration into different methodologies for constructing and comparing hybrid scheduling algorithms using PISA might be fruitful.

References

  • Albrecht et al. (2012) Michael Albrecht, Patrick Donnelly, Peter Bui, and Douglas Thain. 2012. Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (Scottsdale, Arizona, USA) (SWEET ’12). Association for Computing Machinery, New York, NY, USA, Article 1, 13 pages. https://doi.org/10.1145/2443416.2443417
  • Armstrong et al. (1998) R. Armstrong, D. Hensgen, and T. Kidd. 1998. The relative performance of various mapping algorithms is independent of sizable variances in run-time predictions. In Proceedings Seventh Heterogeneous Computing Workshop (HCW’98). 79–87. https://doi.org/10.1109/HCW.1998.666547
  • Authors (2023a) Anonymous Authors. 2023a. Scheduling Algorithms Gathered. Github. https://anonymous.4open.science/r/saga-1F6D/README.md
  • Authors (2023b) Anonymous Authors. 2023b. Scheduling Algorithms Gathered: A Framework for Implementing, Evaluating, and Comparing Task Graph Scheduling Algorithms. Technical Report. Anonymous Institution.
  • Bazzi and Norouzi-Fard (2015) Abbas Bazzi and Ashkan Norouzi-Fard. 2015. Towards Tight Lower Bounds for Scheduling Problems. In Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings (Lecture Notes in Computer Science, Vol. 9294), Nikhil Bansal and Irene Finocchi (Eds.). Springer, 118–129. https://doi.org/10.1007/978-3-662-48350-3_11
  • Blythe et al. (2005) James Blythe, S. Jain, Ewa Deelman, Yolanda Gil, Karan Vahi, Anirban Mandal, and Ken Kennedy. 2005. Task scheduling strategies for workflow-based applications in grids. In 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 9-12 May, 2005, Cardiff, UK. IEEE Computer Society, 759–767. https://doi.org/10.1109/CCGRID.2005.1558639
  • Braun et al. (1999a) T.D. Braun, H.J. Siegal, N. Beck, L.L. Boloni, M. Maheswaran, A.I. Reuther, J.P. Robertson, M.D. Theys, Bin Yao, D. Hensgen, and R.F. Freund. 1999a. A comparison study of static mapping heuristics for a class of meta-tasks on heterogeneous computing systems. In Proceedings. Eighth Heterogeneous Computing Workshop (HCW’99). 15–29. https://doi.org/10.1109/HCW.1999.765093
  • Braun et al. (1999b) T.D. Braun, H.J. Siegal, N. Beck, L.L. Boloni, M. Maheswaran, A.I. Reuther, J.P. Robertson, M.D. Theys, Bin Yao, D. Hensgen, and R.F. Freund. 1999b. A comparison study of static mapping heuristics for a class of meta-tasks on heterogeneous computing systems. In Proceedings. Eighth Heterogeneous Computing Workshop (HCW’99). 15–29. https://doi.org/10.1109/HCW.1999.765093
  • Braun et al. (2001) Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau Bölöni, Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys, Bin Yao, Debra A. Hensgen, and Richard F. Freund. 2001. A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems. J. Parallel Distributed Comput. 61, 6 (2001), 810–837. https://doi.org/10.1006/jpdc.2000.1714
  • Canon et al. (2008) Louis-Claude Canon, Emmanuel Jeannot, Rizos Sakellariou, and Wei Zheng. 2008. Comparative Evaluation Of The Robustness Of DAG Scheduling Heuristics. In Grid Computing - Achievements and Prospects: CoreGRID Integration Workshop 2008, Hersonissos, Crete, Greece, April 2-4, 2008, Sergei Gorlatch, Paraskevi Fragopoulou, and Thierry Priol (Eds.). Springer, 73–84. https://doi.org/10.1007/978-0-387-09457-1_7
  • Coleman et al. (2023) Tainã Coleman, Henri Casanova, and Rafael Ferreira da Silva. 2023. Automated generation of scientific workflow generators with WfChef. Future Gener. Comput. Syst. 147 (2023), 16–29. https://doi.org/10.1016/j.future.2023.04.031
  • Cordeiro et al. (2010) Daniel Cordeiro, Grégory Mounié, Swann Perarnau, Denis Trystram, Jean-Marc Vincent, and Frédéric Wagner. 2010. Random graph generation for scheduling simulations. ICST. https://doi.org/10.4108/ICST.SIMUTOOLS2010.8667
  • da Silva et al. (2019a) Rafael Ferreira da Silva, Rosa Filgueira, Ewa Deelman, Erola Pairo-Castineira, Ian Michael Overton, and Malcolm P. Atkinson. 2019a. Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows. Future Gener. Comput. Syst. 95 (2019), 615–628. https://doi.org/10.1016/j.future.2019.01.015
  • da Silva et al. (2019b) Rafael Ferreira da Silva, Rajiv Mayani, Yuning Shi, Armen R. Kemanian, Mats Rynge, and Ewa Deelman. 2019b. Empowering Agroecosystem Modeling with HTC Scientific Workflows: The Cycles Model Use Case. In 2019 IEEE International Conference on Big Data (IEEE BigData), Los Angeles, CA, USA, December 9-12, 2019, Chaitanya K. Baru, Jun Huan, Latifur Khan, Xiaohua Hu, Ronay Ak, Yuanyuan Tian, Roger S. Barga, Carlo Zaniolo, Kisung Lee, and Yanfang (Fanny) Ye (Eds.). IEEE, 4545–4552. https://doi.org/10.1109/BigData47090.2019.9006107
  • Deelman et al. (2015) Ewa Deelman, Karan Vahi, Gideon Juve, Mats Rynge, Scott Callaghan, Philip J. Maechling, Rajiv Mayani, Weiwei Chen, Rafael Ferreira da Silva, Miron Livny, and Kent Wenger. 2015. Pegasus, a workflow management system for science automation. Future Generation Computer Systems 46 (2015), 17–35. https://doi.org/10.1016/j.future.2014.10.008
  • Di Tommaso et al. (2017) Paolo Di Tommaso, Maria Chatzou, Evan W. Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 4 (01 Apr 2017), 316–319. https://doi.org/10.1038/nbt.3820
  • El-Rewini and Lewis (1990) Hesham El-Rewini and T. G. Lewis. 1990. Scheduling parallel program tasks onto arbitrary target machines. J. Parallel and Distrib. Comput. 9, 2 (1990), 138–153. https://doi.org/10.1016/0743-7315(90)90042-N
  • Filgueira et al. (2016) Rosa Filgueira, Rafael Ferreira da Silva, Amrey Krause, Ewa Deelman, and Malcolm P. Atkinson. 2016. Asterism: Pegasus and Dispel4py Hybrid Workflows for Data-Intensive Science. In Seventh International Workshop on Data-Intensive Computing in the Clouds, DataCloud@SC 2016, Salt Lake, UT, USA, November 14, 2016. IEEE Computer Society, 1–8. https://doi.org/10.1109/DataCloud.2016.004
  • Garey and Johnson (1979) M. R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.
  • Graham et al. (1979) R.L. Graham, E.L. Lawler, J.K. Lenstra, and A.H.G. Rinnooy Kan. 1979. Optimization and Approximation in Deterministic Sequencing and Scheduling: a Survey. In Discrete Optimization II, P.L. Hammer, E.L. Johnson, and B.H. Korte (Eds.). Annals of Discrete Mathematics, Vol. 5. Elsevier, 287–326. https://doi.org/10.1016/S0167-5060(08)70356-X
  • Graham (1969) R. L. Graham. 1969. Bounds on Multiprocessing Timing Anomalies. SIAM J. Appl. Math. 17, 2 (1969), 416–429. https://doi.org/10.1137/0117039
  • Hazekamp and Thain (2017) Nick Hazekamp and Douglas Thain. 2017. Makeflow Examples Repository. Github. http://github.com/cooperative-computing-lab/makeflow-examples
  • Houssein et al. (2021) Essam H. Houssein, Ahmed G. Gad, Yaser Maher Wazery, and Ponnuthurai Nagaratnam Suganthan. 2021. Task Scheduling in Cloud Computing based on Meta-heuristics: Review, Taxonomy, Open Challenges, and Future Trends. Swarm Evol. Comput. 62 (2021), 100841. https://doi.org/10.1016/j.swevo.2021.100841
  • Hwang et al. (1989) Jing-Jang Hwang, Yuan-Chieh Chow, Frank D. Anger, and Chung-Yee Lee. 1989. Scheduling Precedence Graphs in Systems with Interprocessor Communication Times. SIAM J. Comput. 18, 2 (1989), 244–257. https://doi.org/10.1137/0218016
  • Juve et al. (2013) Gideon Juve, Ann L. Chervenak, Ewa Deelman, Shishir Bharathi, Gaurang Mehta, and Karan Vahi. 2013. Characterizing and profiling scientific workflows. Future Gener. Comput. Syst. 29, 3 (2013), 682–692. https://doi.org/10.1016/j.future.2012.08.015
  • Keahey et al. (2020) Kate Keahey, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, Jacob Colleran, Haryadi S. Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs. 2020. Lessons Learned from the Chameleon Testbed. In 2020 USENIX Annual Technical Conference, USENIX ATC 2020, July 15-17, 2020, Ada Gavrilovska and Erez Zadok (Eds.). USENIX Association, 219–233. https://www.usenix.org/conference/atc20/presentation/keahey
  • Kirkpatrick et al. (1983) Scott Kirkpatrick, C. Daniel Gelatt Jr., and Mario P. Vecchi. 1983. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680. https://doi.org/10.1126/science.220.4598.671
  • Kwok and Ahmad (1998) Y.-K. Kwok and I. Ahmad. 1998. Benchmarking the task graph scheduling algorithms. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing. 531–537. https://doi.org/10.1109/IPPS.1998.669967
  • Lee et al. (1988) Chung-Yee Lee, Jing-Jang Hwang, Yuan-Chieh Chow, and Frank D. Anger. 1988. Multiprocessor scheduling with interprocessor communication delays. Operations Research Letters 7, 3 (1988), 141–147. https://doi.org/10.1016/0167-6377(88)90080-6
  • Liu et al. (2016) Yang Liu, Saad M. Khan, Juexin Wang, Mats Rynge, Yuanxun Zhang, Shuai Zeng, Shiyuan Chen, João V. Maldonado dos Santos, Babu Valliyodan, Prasad Calyam, Nirav C. Merchant, Henry T. Nguyen, Dong Xu, and Trupti Joshi. 2016. PGen: large-scale genomic variations analysis workflow and browser in SoyKB. BMC Bioinform. 17, S-13 (2016), 337. https://doi.org/10.1186/s12859-016-1227-y
  • Namyar et al. (2022) Pooria Namyar, Behnaz Arzani, Ryan Beckett, Santiago Segarra, and Srikanth Kandula. 2022. Minding the gap between Fast Heuristics and their Optimal Counterparts. In Hot Topics in Networking. ACM. https://www.microsoft.com/en-us/research/publication/minding-the-gap-between-fast-heuristics-and-their-optimal-counterparts/
  • Oh and Ha (1996) Hyunok Oh and Soonhoi Ha. 1996. A Static Scheduling Heuristic for Heterogeneous Processors. In Euro-Par ’96 Parallel Processing, Second International Euro-Par Conference, Lyon, France, August 26-29, 1996, Proceedings, Volume II (Lecture Notes in Computer Science, Vol. 1124), Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert (Eds.). Springer, 573–577. https://doi.org/10.1007/BFb0024750
  • Radulescu and van Gemund (2000) Andrei Radulescu and Arjan J. C. van Gemund. 2000. Fast and Effective Task Scheduling in Heterogeneous Systems. In 9th Heterogeneous Computing Workshop, HCW 2000, Cancun, Mexico, May 1, 2000. IEEE Computer Society, 229–238. https://doi.org/10.1109/HCW.2000.843747
  • Rynge (2017) Mats Rynge. 2017. SRA Search Pegasus Workflow. Github. https://github.com/pegasus-isi/sra-search-pegasus-workflow
  • Rynge et al. (2014) M. Rynge, G. Juve, J. Kinney, J. Good, B. Berriman, A. Merrihew, and E. Deelman. 2014. Producing an Infrared Multiwavelength Galactic Plane Atlas Using Montage, Pegasus, and Amazon Web Services. In Astronomical Data Analysis Software and Systems XXIII (Astronomical Society of the Pacific Conference Series, Vol. 485), N. Manset and P. Forshay (Eds.). 211.
  • Shukla et al. (2017) Anshu Shukla, Shilpa Chaturvedi, and Yogesh Simmhan. 2017. RIoTBench: A Real-time IoT Benchmark for Distributed Stream Processing Platforms. CoRR abs/1701.08530 (2017). arXiv:1701.08530 http://arxiv.org/abs/1701.08530
  • Sih and Lee (1993a) G.C. Sih and E.A. Lee. 1993a. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Transactions on Parallel and Distributed Systems 4, 2 (1993), 175–187. https://doi.org/10.1109/71.207593
  • Sih and Lee (1993b) Gilbert C. Sih and Edward A. Lee. 1993b. A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures. IEEE Trans. Parallel Distributed Syst. 4, 2 (1993), 175–187. https://doi.org/10.1109/71.207593
  • Topcuoglu et al. (1999) Haluk Topcuoglu, Salim Hariri, and Min-You Wu. 1999. Task Scheduling Algorithms for Heterogeneous Processors. In 8th Heterogeneous Computing Workshop, HCW 1999, San Juan, Puerto Rico, April 12, 1999. IEEE Computer Society, 3–14. https://doi.org/10.1109/HCW.1999.765092
  • Varshney et al. (2022) Prateeksha Varshney, Shriram Ramesh, Shayal Chhabra, Aakash Khochare, and Yogesh Simmhan. 2022. Resilient Execution of Data-triggered Applications on Edge, Fog and Cloud Resources. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 473–483. https://doi.org/10.1109/CCGrid54584.2022.00057
  • Wang and Sinnen (2018) Huijun Wang and Oliver Sinnen. 2018. List-Scheduling versus Cluster-Scheduling. IEEE Trans. Parallel Distributed Syst. 29, 8 (2018), 1736–1749. https://doi.org/10.1109/TPDS.2018.2808959

Appendix A Appendix

A.1. Scheduling Algorithm Descriptions

To orient the reader, we provide a brief description of each of the algorithms implemented in SAGA and presented in Section 3.

BIL (Best Imaginary Level) is a list scheduling algorithm designed for the unrelated machines model. In this model, the runtime of a task on a node need not be a function of task cost and node speed; rather, it can be arbitrary. This model is even more general than the related machines model we study in this paper. BIL's scheduling complexity is $O(|T|^2 \cdot |V| \log |V|)$ and it was proven to be optimal for linear graphs. The authors report a makespan speedup over the GDL (Generalized Dynamic Level) scheduler on randomly generated problem instances. The exact process for generating random problem instances is not described, except that CCRs (communication-to-computation ratios, i.e., average communication time over average execution time) of $1/2$, $1$, and $2$ were used.

CPoP (Critical Path on Processor) was proposed in the same paper as HEFT (Heterogeneous Earliest Finish Time). Both are list scheduling algorithms with scheduling complexity $O(|T|^2 |V|)$; no formal performance bounds are known for either. HEFT first prioritizes tasks by their upward rank, which is essentially the length of the longest chain of tasks starting at that task (considering average execution and communication times). It then greedily schedules tasks in this order, assigning each to the node that minimizes its completion time given previously scheduled tasks. CPoP is similar but uses a slightly different priority metric. The biggest difference between the two is that CPoP always schedules critical-path tasks (those on the longest chain in the task graph) to execute on the fastest node. Both algorithms were evaluated on random graphs (the generation process is described in the paper) and on real task graphs for Gaussian Elimination and FFT applications. They were compared against three other scheduling algorithms: Mapping Heuristic (similar to HEFT without insertion) (El-Rewini and Lewis, 1990), Dynamic-Level Scheduling (Sih and Lee, 1993a), and Levelized Min Time (we could not find the original paper that proposes this algorithm). Schedule Length Ratios (makespan scaled by the sum of minimum computation costs of all tasks), speedups, and schedule generation times are reported.
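To make the greedy assignment phase concrete, here is a minimal sketch in the spirit of HEFT under the related machines model. It assumes tasks are already sorted by decreasing upward rank and omits the insertion-based slot search of the full algorithm; the dict-based problem representation (`cost`, `speed`, `size`, `bandwidth`) is our own illustration, not SAGA's API.

```python
def heft_assign(ranked_tasks, preds, cost, speed, size, bandwidth):
    """Map each task to the node that minimizes its finish time."""
    node_free = {n: 0.0 for n in speed}   # earliest idle time on each node
    sched = {}                            # task -> (node, start, finish)
    for t in ranked_tasks:
        best_node, best_start, best_finish = None, None, float("inf")
        for n in speed:
            # Data from every predecessor must arrive before t can start;
            # communication is free when both tasks run on the same node.
            ready = 0.0
            for p in preds.get(t, []):
                pn, _, pf = sched[p]
                delay = 0.0 if pn == n else size[(p, t)] / bandwidth[(pn, n)]
                ready = max(ready, pf + delay)
            start = max(ready, node_free[n])
            finish = start + cost[t] / speed[n]
            if finish < best_finish:
                best_node, best_start, best_finish = n, start, finish
        sched[t] = (best_node, best_start, best_finish)
        node_free[best_node] = best_finish
    return sched
```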

The MinMin, MaxMin, and Duplex list scheduling algorithms may have been proposed many times independently, but exceptionally clear definitions can be found in a paper that compares them to OLB, MET, MCT, and a few meta-heuristic algorithms (e.g., genetic algorithms and simulated annealing) (Braun et al., 2001). MinMin schedules tasks by iteratively selecting the task with the smallest minimum completion time (given previously scheduled tasks) and assigning it to the corresponding node. MaxMin, on the other hand, iteratively selects the task with the largest minimum completion time and assigns it to the corresponding node. Duplex simply runs both MinMin and MaxMin and returns the schedule with the smaller makespan. In the paper, the authors evaluate these algorithms on independent, non-communicating tasks with uniformly random costs and heterogeneous compute nodes with uniformly random speeds. MinMin (and therefore also Duplex) is shown to generate schedules with low makespans compared to the other algorithms, while relatively high makespans are reported for MaxMin.
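Below is a compact sketch of all three heuristics in the independent-task setting of (Braun et al., 2001), restricted to the related machines model so that a task's completion time on a node is the node's ready time plus task cost over node speed; all names here are illustrative.

```python
def min_min(costs, speeds, use_max=False):
    """costs: task -> work units; speeds: node -> speed. Returns task -> node."""
    free = {n: 0.0 for n in speeds}          # earliest idle time per node
    assignment, remaining = {}, set(costs)
    while remaining:
        # Minimum completion time (and the node achieving it) per task.
        best = {}
        for t in remaining:
            n = min(speeds, key=lambda m: free[m] + costs[t] / speeds[m])
            best[t] = (free[n] + costs[t] / speeds[n], n)
        # MinMin commits the task with the smallest MCT; MaxMin the largest.
        pick = (max if use_max else min)(remaining, key=lambda t: best[t][0])
        mct, node = best[pick]
        assignment[pick], free[node] = node, mct
        remaining.remove(pick)
    return assignment


def duplex(costs, speeds):
    """Run both variants and return the assignment with the smaller makespan."""
    def makespan(assign):
        load = {n: 0.0 for n in speeds}
        for t, n in assign.items():
            load[n] += costs[t] / speeds[n]
        return max(load.values())
    a, b = min_min(costs, speeds), min_min(costs, speeds, use_max=True)
    return a if makespan(a) <= makespan(b) else b
```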

ETF (Earliest Task First) is one of the few algorithms we discuss in this paper that has formal bounds. It is a list scheduling algorithm with runtime $O(|T||V|^2)$ and was designed for heterogeneous task graphs but homogeneous compute networks. It works by iteratively scheduling the task with the earliest possible start time given previously scheduled tasks (usually; details omitted here for simplicity can be found in the original paper). Note how this differs from HEFT and CPoP, which schedule according to the earliest possible completion time of a task. This important difference is what allows the authors to prove the formal bound $\omega_{\text{ETF}} \leq (2 - 1/n)\,\omega_{\text{opt}}^{(i)} + C$, where $\omega_{\text{opt}}^{(i)}$ is the optimal schedule makespan without communication and $C$ is the total communication requirement over some terminal chain of tasks.
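ETF's selection rule is easy to express in code. The sketch below assumes unit-speed nodes and a single link bandwidth `bw`, matching the homogeneous-network setting ETF was designed for, and omits the tie-breaking rules of the original paper; the representation is our own.

```python
def etf(tasks, preds, cost, nodes, size, bw):
    """tasks: all tasks; preds: task -> predecessors; cost: task -> time
    (unit-speed nodes); size: (pred, task) -> data size; bw: link bandwidth."""
    free = {n: 0.0 for n in nodes}           # earliest idle time per node
    sched, done = {}, set()                  # task -> (node, start, finish)
    while len(done) < len(tasks):
        ready = [t for t in tasks if t not in done
                 and all(p in done for p in preds.get(t, []))]
        best = None                          # (start time, task, node)
        for t in ready:
            for n in nodes:
                # Latest arrival of predecessor data on node n.
                arrive = max((sched[p][2] + (0.0 if sched[p][0] == n
                                             else size[(p, t)] / bw)
                              for p in preds.get(t, [])), default=0.0)
                start = max(arrive, free[n])
                if best is None or start < best[0]:
                    best = (start, t, n)
        start, t, n = best
        sched[t] = (n, start, start + cost[t])
        free[n] = start + cost[t]
        done.add(t)
    return sched
```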

FCP (Fast Critical Path) and FLB (Fast Load Balancing) both have a runtime of $O(|T| \log |V| + |D|)$ and were designed for heterogeneous task graphs and heterogeneous node speeds but homogeneous communication strengths. The algorithms were evaluated on three types of task graphs with different structures based on real applications (LU decomposition, a Laplace equation solver, and a stencil algorithm) with varied CCRs and uniformly random task costs. Both algorithms were shown to perform well compared to HEFT and ERT (Lee et al., 1988) despite their lower schedule generation times.

GDL (Generalized Dynamic Level), also called DLS (Dynamic Level Scheduling), is a variation on list scheduling in which task priorities are updated each time a task is scheduled. Due to this added computation in each iteration, the complexity of DLS is $O(|V|^3 |T|)$ (a factor of $|V|$ greater than HEFT and CPoP). GDL was originally designed for the very general unrelated machines model and was shown to outperform HDLFET (Sih and Lee, 1993b) on randomly generated problem instances (though the method used for generating random task graphs is not well described) and on four real digital signal processing applications (two sound synthesis algorithms, a telephone channel simulator, and a quadrature amplitude modulation transmitter).

MCT (Minimum Completion Time) and MET (Minimum Execution Time) are very simple algorithms originally designed for the unrelated machines model. MET simply schedules each task on the machine with the smallest execution time for that task (regardless of the task's start/end time). MCT assigns tasks in arbitrary order to the node with the smallest completion time given previously scheduled tasks (essentially HEFT without insertion or its priority function). MET and MCT have scheduling complexities of $O(|T||V|)$ and $O(|T|^2 |V|)$, respectively. The algorithms were evaluated on task graphs with 125 or 500 tasks, each task having one of five possible execution times, and were shown to outperform a very naive baseline algorithm that does not use expected execution times for scheduling.
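Both heuristics fit in a few lines. The sketch below specializes them to the related machines model (our simplification; both were stated for unrelated machines, where per-task, per-node execution times are arbitrary), which is why MET degenerates to always picking the fastest node.

```python
def met(costs, speeds):
    """Schedule each task on its fastest node, ignoring queueing; under
    related machines this is the single fastest node for every task."""
    fastest = max(speeds, key=speeds.get)
    return {t: fastest for t in costs}


def mct(tasks, costs, speeds):
    """Assign tasks in the given (arbitrary) order to whichever node
    completes them earliest, given earlier assignments."""
    free = {n: 0.0 for n in speeds}          # earliest idle time per node
    assignment = {}
    for t in tasks:
        node = min(speeds, key=lambda n: free[n] + costs[t] / speeds[n])
        assignment[t] = node
        free[node] += costs[t] / speeds[node]
    return assignment
```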

OLB (Opportunistic Load Balancing) has a runtime of just $O(|T|)$ and was designed for independent tasks under the unrelated machines model. Probably useful only as a baseline for understanding the performance of other algorithms, OLB schedules tasks in arbitrary order on the earliest available compute node. Its performance has been shown to be significantly worse than that of MET, MCT, and LBA (Armstrong et al., 1998).

WBA (Workflow Based Application) is a scheduling algorithm developed for managing scientific workflows in cloud environments and was designed for the fully heterogeneous model discussed in this paper. We observe that its scheduling complexity is at most $O(|T||D||V|)$ (the authors do not report this, however, and a more efficient implementation might be possible). WBA operates by randomly assigning tasks to nodes, guided by a distribution that, in each iteration, favors choices that least increase the schedule makespan.
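As an illustration of this random, makespan-guided selection, here is a hypothetical sketch of the node-selection step. The particular weighting (inversely proportional to the makespan increase) and the helper `makespan_if` are our own stand-ins; the original paper defines its own distribution over candidate nodes.

```python
import random


def wba_pick_node(task, nodes, makespan_if):
    """makespan_if(task, node) -> makespan of the partial schedule if
    `task` were placed on `node` (a hypothetical helper)."""
    result = {n: makespan_if(task, n) for n in nodes}
    lo = min(result.values())
    # Weight nodes so that choices that least increase the makespan are
    # most likely to be drawn (illustrative weighting, not WBA's exact one).
    weights = [1.0 / (1.0 + result[n] - lo) for n in nodes]
    return random.choices(list(nodes), weights=weights, k=1)[0]
```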

FastestNode is a simple baseline algorithm that schedules all tasks to execute serially on the fastest compute node. BruteForce is a naive algorithm that tries every possible schedule and returns the one with the smallest makespan. SMT uses an SMT (satisfiability modulo theories) solver and binary search to find a $(1+\epsilon)$-OPT schedule.

A.2. Case Study: HEFT vs. CPoP

The HEFT and CPoP algorithms were proposed in the same paper by Topcuoglu et al. (1999) and have remained two of the most popular task scheduling algorithms in the decades since. Both are list scheduling algorithms that use a priority list to determine the order in which tasks are greedily scheduled. In HEFT, each task is scheduled on the node that minimizes its finish time, given previous scheduling decisions. CPoP is similar, but commits to scheduling all critical-path tasks (those on the longest path in the task graph) on the fastest node. (This holds for the related machines model; in general, CPoP schedules critical-path tasks on the node that minimizes the sum of the critical-path tasks' execution times.) The two algorithms' priority functions differ slightly, but both are based on a task's distance (in terms of average compute and communication time) from the start and/or end of the task graph. In HEFT, a sink task's (one with no successors) priority is its average execution time over all nodes in the network. A non-sink task's priority is its average execution time over all nodes plus the maximum, over its successors, of the average communication time to that successor plus the successor's priority. In other words, it is the average time it takes to execute the task on any node in the network plus the maximum amount of time it takes (in an ideal scenario) to execute the rest of the task graph. In CPoP, task priorities are based on the distance both to the end and from the start of the task graph. This difference is what causes the two algorithms to perform differently on certain problem instances. Observe in Figure 8 a problem instance identified by PISA where HEFT performs approximately $1.55$ times worse than CPoP.
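The upward-rank recursion described above can be written directly. This sketch assumes precomputed per-task average execution times (`avg_exec`) and per-edge average communication times (`avg_comm`); the names are ours.

```python
from functools import lru_cache


def make_upward_rank(succs, avg_exec, avg_comm):
    """succs: task -> successor tasks; avg_exec: task -> mean execution
    time over nodes; avg_comm: (task, succ) -> mean communication time."""
    @lru_cache(maxsize=None)
    def rank(t):
        # Sink tasks (no successors) get just their average execution time.
        return avg_exec[t] + max(
            (avg_comm[(t, s)] + rank(s) for s in succs.get(t, ())),
            default=0.0,
        )
    return rank
```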

Figure 8. Problem instance where HEFT performs $\approx 1.55$ times worse than CPoP: (a) task graph, (b) network, (c) HEFT schedule, (d) CPoP schedule.

In this instance, the algorithms differ primarily in whether task $A$ or task $C$ has higher priority. In both algorithms, task $B$ has the highest priority and is thus scheduled first on node $2$ (the fastest node). For CPoP, task $C$ must have the higher priority because it is on the critical path $B \rightarrow C$. For HEFT, though, task $A$ has higher priority than $C$ because it is further from the end of the task graph. As a result, CPoP schedules $C$ on node $2$, which allows task $A$ to be scheduled and executed in parallel on node $3$ (the second-fastest node). HEFT, on the other hand, schedules all tasks on node $2$ and thus does not benefit from parallel execution. CPoP succeeds in this instance because it prioritizes tasks on the critical path, keeping high-cost tasks/communications on the same node and allowing low-cost tasks/communications to execute in parallel on other nodes. HEFT greedily schedules high-cost tasks on fast nodes without as much consideration of how doing so might affect the rest of the task graph (especially the tasks on the critical path).

This does not mean that CPoP is better than HEFT, though! Observe in Figure 9 a problem instance identified by PISA where CPoP performs approximately $1.88$ times worse than HEFT.

Figure 9. Problem instance where CPoP performs $\approx 1.88$ times worse than HEFT: (a) task graph, (b) network, (c) HEFT schedule, (d) CPoP schedule.

In this instance, the algorithms' priorities are actually the same for all tasks; in fact, they schedule both $A$ and $B$ identically. What trips up CPoP on this instance is exactly what allows it to succeed on the previous one: committing to scheduling all critical-path tasks on the fastest node. Here, the critical path is $A \rightarrow C$. Thus, CPoP schedules task $C$ on node $2$ (the fastest node) even though it would have finished much sooner on node $1$ (due to the communication cost incurred by its dependency on task $A$). HEFT does not have this problem because it does not commit to scheduling all critical-path tasks on the fastest node.

In conclusion, CPoP succeeds when the task graph has a critical path with low-cost tasks that “mess up” HEFT's priority function, and fails when scheduling a task on the fastest node incurs a large communication penalty. It is easy to see that these bad-case scenarios can be modified to make HEFT and CPoP perform arbitrarily worse than each other (e.g., by increasing the number of parallel source tasks in the first example, or by changing the network weight between nodes $2$ and $3$ to $0$ in the second example). This example highlights one of the benefits of our proposed approach to comparing task scheduling algorithms: it allows us to identify patterns where the algorithms under-/over-perform and then extrapolate those patterns to other problem instances.

A.3. Application-Specific PISA Results

Here we include the Section 6 benchmarking and PISA results for all evaluated scientific workflows. For each CCR, the top row shows benchmarking results, where gradients indicate makespan ratios on different problem instances in the dataset. Every other cell shows the problem instance with the highest makespan ratio discovered by PISA.

Figure 10. Results for the blast scientific workflow.
Figure 11. Results for the srasearch scientific workflow.
Figure 12. Results for the bwa scientific workflow.
Figure 13. Results for the epigenomics scientific workflow.
Figure 14. Results for the 1000genome scientific workflow.
Figure 15. Results for the montage scientific workflow.
Figure 16. Results for the seismology scientific workflow.
Figure 17. Results for the soykb scientific workflow.