ITNEXT

ITNEXT is a platform for IT developers & software engineers to share knowledge, connect, collaborate, learn and experience next-gen technologies.

How to build a DAG-based Task Scheduling tool for Multiprocessor systems using Python

Ramses Alexander Coraspe Valdez
Published in ITNEXT
14 min read · Jun 7, 2022

Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

PyDag

Much of the success of data-driven companies, from startups to large corporations, rests on sound operational practices and on how they keep their data up to date. They deal daily with the variety, velocity and volume of their data, and in most cases their strategies depend on those features. Some of the aims of the data team in this type of company are:

  • Design and deploy cost effective and scalable data architectures
  • Get insights from their data
  • Keep the business and operations up and running

To achieve these aims, the data team relies on tools. Most of them extract, transform and load data into other destinations, visualize data, or convert data into information. It is very common to see ETL tools and task scheduling, job scheduling or workflow scheduling tools in these teams. It is worth mentioning that the terms task scheduling, job scheduling, workflow scheduling, task orchestration, job orchestration and workflow orchestration refer to the same concept. What can distinguish them in some cases is the purpose of the tool and its architecture. Some of these tools only orchestrate ETL processes and specify when they run, using a simple pipeline architecture. Others use a DAG architecture and let you specify both when the DAG is executed and how its tasks (vertices) are orchestrated in the correct order.

The advantage of this last architecture is that it can use all the computing capacity of the machine where the DAG is executed, since tasks (vertices) of the DAG that do not depend on each other can run in parallel.
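To make the idea of running independent tasks in parallel concrete, here is a minimal sketch. It is not pyDag's implementation: it assumes Python 3.9+ and uses the standard library's `graphlib.TopologicalSorter` with a thread pool. The names `run_dag`, `task_fn` and the four-task workflow are hypothetical and exist only for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(dependencies, task_fn, max_workers=4):
    """Run every task once all of its predecessors have finished.

    dependencies maps each task name to the set of tasks it depends on.
    Returns the task names in the order they completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> task name
        while sorter.is_active():
            # Submit every task whose dependencies are already satisfied.
            for task in sorter.get_ready():
                running[pool.submit(task_fn, task)] = task
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the task
                task = running.pop(future)
                finished.append(task)
                sorter.done(task)  # unlocks the successors of this task
    return finished


# Hypothetical workflow: "transform_a" and "transform_b" both depend only
# on "extract", so the scheduler may run them at the same time.
example = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

order = run_dag(example, task_fn=lambda name: name.upper())
```

In this sketch `extract` always completes first and `load` always completes last, while the two transforms may finish in either order. For CPU-bound tasks on a multiprocessor machine, a process pool could replace the thread pool, provided the task function is picklable.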

A graph is a collection of vertices (tasks) and edges (connections or dependencies between vertices). A directed acyclic graph, or DAG, is a directed graph with no cycles.
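The definition can be checked in code. The sketch below assumes the graph is stored as a plain dictionary that maps each vertex to the vertices it points to, and uses Kahn's algorithm to return an execution order or report a cycle. The function name `topological_order` and the sample graphs are hypothetical and are not taken from pyDag.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices in dependency order, or raise ValueError on a cycle.

    edges maps each vertex to the list of vertices it points to.
    """
    # Collect every vertex, including those that only appear as targets.
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(targets)

    # Count the incoming edges of each vertex.
    in_degree = {v: 0 for v in vertices}
    for targets in edges.values():
        for target in targets:
            in_degree[target] += 1

    # Kahn's algorithm: repeatedly remove vertices with no incoming edges.
    queue = deque(sorted(v for v in vertices if in_degree[v] == 0))
    order = []
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for target in edges.get(vertex, ()):
            in_degree[target] -= 1
            if in_degree[target] == 0:
                queue.append(target)

    # Vertices left unvisited can only be part of a cycle.
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}

print(topological_order(dag))  # prints ['A', 'B', 'C', 'D']
```

Calling `topological_order(cyclic)` raises `ValueError`, because every vertex in that graph has an incoming edge and none can be scheduled first.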

DAG

A pipeline is a kind of DAG but with limitations where each…

Written by Ramses Alexander Coraspe Valdez

Very passionate about data engineering and technology. I love to design, create, test and write about ideas. I hope you like my articles.