How to build a DAG-based Task Scheduling tool for Multiprocessor systems using Python

Ramses Alexander Coraspe Valdez

Published in

ITNEXT

14 min read · Jun 7, 2022

Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

Much of the success of data-driven companies of every size, from startups to large corporations, rests on sound operational practices and on how they keep their data up to date. These companies deal daily with the variety, velocity and volume of their data, and in most cases their strategies depend on those features. Some of the aims of the data team in this type of company are:

Design and deploy cost effective and scalable data architectures
Get insights from their data
Keep the business and operations up and running

To achieve these aims, the data team relies on tools. Most of them extract, transform and load data into other destinations or data sources, visualize data, and convert data into information. It is very common to see ETL tools and task scheduling, job scheduling or workflow scheduling tools in these teams. It is worth mentioning that the terms task scheduling, job scheduling, workflow scheduling, task orchestration, job orchestration and workflow orchestration refer to the same concept. What distinguishes them in some cases is the purpose of the tool and its architecture. Some of these tools only orchestrate ETL processes and specify when they are going to be executed, using a simple pipeline architecture. Others use a DAG architecture and let you specify both when the DAG is executed and how its tasks (vertices) are orchestrated in the correct order.

The advantage of this last architecture is that it can use all the computing power of the machine where the DAG is being executed, giving priority to running some tasks (vertices) of the DAG in parallel.
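To make the idea of running some vertices in parallel concrete, here is a minimal sketch. It assumes Python 3.9+ and the standard library's `graphlib` and `concurrent.futures` modules; it is not pyDag's implementation. The `run_dag` helper, the task names and the `work` callable are all hypothetical and exist only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter


def run_dag(dependencies, work, max_workers=4):
    """Run work(task) for every task, respecting the dependencies.

    dependencies maps each task to the set of tasks it depends on.
    Returns the tasks in the order in which they finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose predecessors are all done.
            for task in sorter.get_ready():
                running[pool.submit(work, task)] = task
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the task
                task = running.pop(future)
                finished.append(task)
                sorter.done(task)  # unlocks the task's successors
    return finished


# Hypothetical pipeline: "extract_a" and "extract_b" are independent,
# so they may run in parallel; "load" has to wait for "transform".
example = {
    "transform": {"extract_a", "extract_b"},
    "load": {"transform"},
}
order = run_dag(example, work=lambda task: None)
```

A thread pool suits I/O-bound tasks such as database queries or API calls. For CPU-bound work, `ProcessPoolExecutor` could be swapped in so tasks run on separate cores, provided `work` is a picklable top-level function rather than a lambda.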

A graph is a collection of vertices (tasks) and edges (connections or dependencies between vertices). A directed acyclic graph, or DAG, is a directed graph with no cycles.
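The definition above can be sketched in a few lines of Python. The hypothetical `topological_order` function below takes a list of directed edges and uses Kahn's algorithm, a standard technique chosen here for illustration and not necessarily the one pyDag uses. It returns the vertices in a valid execution order, or reports that the graph has a cycle and is therefore not a DAG.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices in dependency order using Kahn's algorithm.

    edges is a list of (source, target) pairs meaning "target depends on
    source". Raises ValueError if the graph contains a cycle.
    """
    successors = {}
    in_degree = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])
        in_degree[target] = in_degree.get(target, 0) + 1
        in_degree.setdefault(source, 0)

    # Start from the vertices that have no incoming edges.
    queue = deque(v for v, degree in in_degree.items() if degree == 0)
    order = []
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for nxt in successors[vertex]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    # Vertices on a cycle never reach in-degree zero, so they are left out.
    if len(order) != len(in_degree):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Illustrative diamond-shaped DAG: B and C both depend on A, D on both.
dag_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(topological_order(dag_edges))  # prints ['A', 'B', 'C', 'D']
```

Adding an edge such as `("D", "A")` to `dag_edges` closes a cycle, and the function raises `ValueError` instead of returning an order.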

A pipeline is a kind of DAG but with limitations where each…
