Intro to GKE setup of airflow with helm and terraform | Ved Prakash | DSC Europe 23

00:31:28
https://www.youtube.com/watch?v=zOg8DB68gks

Summary

TL;DR: Ved Prakash, a Staff Data Engineer at GitLab, presents how to set up Apache Airflow on Google Kubernetes Engine (GKE) using Helm and Terraform. He explains GitLab's mission and core values, emphasizing collaboration and transparency. The talk covers the benefits of GKE as a managed Kubernetes solution for deploying scalable applications, along with the advantages of Apache Airflow for workflow orchestration. He highlights Terraform's role as infrastructure as code for managing resources and Helm's role in streamlining application management. The discussion also addresses best practices and considerations when implementing such systems, showing how GitLab's data platform team manages its pipelines efficiently.

Takeaways

  • 🚀 Introduction to GKE: A managed Kubernetes service simplifying app deployment.
  • 🔧 Apache Airflow: Open-source platform for scheduling and monitoring workflows.
  • 📦 Terraform: Infrastructure as code tool for managing resources efficiently.
  • ⚙️ Helm: A package manager for Kubernetes, enhancing configuration management.
  • 📈 Scalability: GKE allows automatic scaling based on demand for data workloads.
  • 🔒 Security: Implementation of VPC peering to secure access to data sources.
  • 📊 Best Practices: Emphasis on private cluster and IAM controls for security.
  • 🛠️ CI/CD Pipeline: GitLab's pipeline validates infrastructure changes efficiently.
  • 💡 Performance: Transition to Helm resolved performance issues experienced with Docker.
  • 📚 Integration: Seamless connection between GKE, Helm, Terraform, and Airflow.

Timeline

  • 00:00:00 - 00:05:00

    Ved Prakash, Staff Data Engineer at GitLab, presents on setting up Apache Airflow with Helm and Terraform on Google Kubernetes Engine (GKE) as part of the data platform team. He shares insights from his experience at GitLab, emphasizing the company's mission of collaboration and efficiency.

  • 00:05:00 - 00:10:00

    GitLab is introduced as a comprehensive DevSecOps platform offering CI/CD pipelines, promoting security and efficient code management. The presentation outlines the agenda, focusing on the integration of GKE, Apache Airflow, Helm, and Terraform, and their significance in data operations.

  • 00:10:00 - 00:15:00

    GKE is described as a managed Kubernetes service that simplifies application deployment and management, highlighting its features such as reliability, automatic scaling, and security compliance. Prakash explains how GKE supports their data pipeline requirements effectively.

  • 00:15:00 - 00:20:00

    Apache Airflow, an open-source platform, is covered next, with key features like dynamic workflow execution, scalability, and a rich UI. Prakash favours Airflow for its flexibility and dynamic task management capabilities in scheduling and workflow automation (a sketch of this dynamic task pattern follows the timeline).

  • 00:20:00 - 00:25:00

    Terraform is introduced as an infrastructure-as-code tool allowing users to provision GKE resources efficiently. Its key features include multicloud provisioning, declarative configuration, and state management, facilitating collaborative infrastructure management.

  • 00:25:00 - 00:31:28

    The presentation concludes by linking how GKE, Helm, Terraform, and Airflow work together in GitLab's data operations, emphasizing the ease of deployment, scaling, and the CI/CD pipeline's robustness in validating and managing infrastructure changes. Best practices and considerations for security and efficiency are also highlighted.
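
To make the dynamic-task point from the timeline concrete, below is a minimal, hypothetical sketch of the pattern described in the talk: one DAG file reads a YAML list of tables and creates one task per entry. The file name tables.yml, the extract_table helper, and the DAG id are illustrative assumptions, not GitLab's actual code.

```python
# Hedged sketch: a single Airflow DAG that fans out into one task per table
# listed in a YAML config, mirroring the "add a table to a YAML" workflow
# described in the talk. All names here are illustrative assumptions.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_table(table_name: str) -> None:
    # Placeholder for the real extract/load step (e.g. Postgres -> GCS -> Snowflake).
    print(f"extracting {table_name}")


# Hypothetical config file; adding a table here adds a task to the DAG.
with open("tables.yml") as f:
    tables = yaml.safe_load(f)["tables"]

with DAG(
    dag_id="postgres_export",          # illustrative DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for table in tables:
        PythonOperator(
            task_id=f"extract_{table}",
            python_callable=extract_table,
            op_kwargs={"table_name": table},
        )
```

With a layout like this, adding a table is a one-line YAML change; once the merge request lands, git-sync picks it up and the new task appears in the scheduler, which matches the roughly two-minute turnaround mentioned later in the talk.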

Video Q&A

  • What is the purpose of using GKE?

    GKE helps simplify application deployment, scaling, and management on Google Cloud, providing a robust platform for container orchestration.

  • Why is Helm used in the setup?

    Helm is a package manager for Kubernetes that simplifies deployment and management of applications, allowing for consistent configurations across environments.

  • What advantages does a self-managed Airflow deployment offer over Cloud Composer?

    A self-managed deployment allows for better load balancing and management of resources without being limited to the constraints of a managed service like Cloud Composer.

  • What is the significance of Terraform in this setup?

    Terraform is used for infrastructure as code, enabling the definition and provisioning of infrastructure through declarative configuration, ensuring control and version management.

  • How does Airflow integrate with GKE?

    Airflow is deployed on GKE using Helm, allowing workflows to be orchestrated directly within the Kubernetes environment (see the KubernetesPodOperator sketch after this Q&A list).

  • What were some challenges faced when transitioning to Helm?

    Upgrading from the Airflow 1.x series to 2.x with plain Docker images caused performance problems such as stale objects and hanging DAGs; moving to the official Helm chart resolved them.

  • What best practices were highlighted in the presentation?

    Some best practices include private cluster configurations, VPC peering, least-privilege IAM controls, monitoring with tools such as Prometheus and Grafana, and effective resource allocation.

  • What are some key features of Apache Airflow?

    Airflow is designed for programmatic orchestration of workflows, featuring dynamic task execution, rich UI, and extensive scalability.

  • How does GitLab's CI/CD pipeline enhance deployment?

    The CI/CD pipeline validates changes and ensures that Terraform scripts will not break the infrastructure upon deployment.

  • What are the primary use cases for using GKE and Airflow together?

    Use cases include data pipeline orchestration, continuous integration, and scalable application deployment.
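
As a follow-up to the GKE integration answer above, here is a hedged sketch of how an Airflow DAG can hand each task to GKE through the KubernetesPodOperator, the mechanism the talk credits for on-demand scaling. The namespace, container image, and node-pool label are placeholders, not GitLab's actual configuration.

```python
# Hedged sketch: each task runs in its own GKE pod; the pod is deleted when
# the work finishes, so the node pool can scale back down. The image,
# namespace, and node-pool label below are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="gke_pod_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    export_task = KubernetesPodOperator(
        task_id="export_to_gcs",
        name="export-to-gcs",
        namespace="data-prod",                                 # hypothetical namespace
        image="europe-docker.pkg.dev/example/export:latest",   # placeholder image
        cmds=["python", "export.py"],
        node_selector={"cloud.google.com/gke-nodepool": "data-pipeline-pool"},
        get_logs=True,
        is_delete_operator_pod=True,  # clean up the pod once the task completes
    )
```

Because every task is its own pod, the cluster can scale node pools out while tasks run and back toward zero afterwards, which is what keeps the setup cost-efficient in the way the talk describes.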

Transcript (en)
  • 00:00:10
    we are starting this block uh with Mr Ved
  • 00:00:13
    Prakash staff data engineer at GitLab
  • 00:00:16
    who will talk to us um uh in a
  • 00:00:19
    presentation entitled intro to uh GKE
  • 00:00:22
    setup of airflow with Helm and Terraform
  • 00:00:25
    uh I would like you to give an excited
  • 00:00:27
    welcome to uh Mr Ved
  • 00:00:31
    please the stage is yours
  • 00:00:44
    please hi good morning everyone excited
  • 00:00:49
    and let's talk about some tech so it's
  • 00:00:52
    mostly data so it's me Ved Prakash I work as
  • 00:00:55
    a staff data engineer at GitLab I've
  • 00:00:56
    been there for close to three years and
  • 00:01:00
    it's pretty exciting over there so
  • 00:01:02
    wanted to share some lights of what my
  • 00:01:03
    learning and what we have how we have
  • 00:01:05
    set it up the airflow in our system in
  • 00:01:09
    data platform team with you all the
  • 00:01:11
    wider audience so that's
  • 00:01:14
    me so just a little bit about gitlab as
  • 00:01:17
    a company so our mission everyone can
  • 00:01:20
    contribute so just as we discussed
  • 00:01:24
    before we can contribute you can
  • 00:01:26
    everyone can where everyone across the
  • 00:01:28
    team across the globe can contribute to
  • 00:01:30
    gitlab to make the product better
  • 00:01:33
    efficient and
  • 00:01:35
    smart and gitlab lives with the credit
  • 00:01:37
    values and this is not just the words so
  • 00:01:40
    it comes from the collaboration results
  • 00:01:42
    efficiency diversity iteration and
  • 00:01:45
    transparency of this every letter itself
  • 00:01:50
    is a big handbook page in our gitlab
  • 00:01:54
    terms where you can read more about this
  • 00:01:57
    but we literally live with this values
  • 00:01:59
    and and it shows in our day-to-day work
  • 00:02:02
    every OK hour or every goals what we do
  • 00:02:04
    or every project we deliver it's through
  • 00:02:07
    this value it's transparent we do in
  • 00:02:09
    small iteration our development and it's
  • 00:02:11
    fast and we deliver it and we iterate it
  • 00:02:13
    faster so it's a boring solution but we iterate
  • 00:02:16
    and make it smarter and smarter but we
  • 00:02:19
    meet the results at the
  • 00:02:21
    end so coming to what gitlab is doing
  • 00:02:24
    the gitlab is a DevSecOps platform
  • 00:02:25
    delivered as a single application to
  • 00:02:27
    help you iterate faster and in together
  • 00:02:31
    so it's a Dev secops platform where
  • 00:02:33
    everything can be done in terms of cicd
  • 00:02:37
    pipeline in terms of security of your
  • 00:02:39
    pipeline and your get repository and all
  • 00:02:42
    the
  • 00:02:43
    things now moving to the next to the
  • 00:02:47
    agenda that what we're going to discuss
  • 00:02:48
    today in this small talk time of 20
  • 00:02:52
    minutes so we will be discussing about
  • 00:02:54
    introduction to Google Kubernetes Engine
  • 00:02:56
    Apache airflow Helm and
  • 00:02:58
    terraform how all of this components
  • 00:03:00
    fits together and how the installation
  • 00:03:03
    is done within gitlab data platform
  • 00:03:05
    teams and what's the win of data
  • 00:03:06
    platform on this and little bit on best
  • 00:03:08
    practices and
  • 00:03:11
    considerations so to start with so
  • 00:03:14
    introduction to gke what is gke
  • 00:03:17
    basically most of you are aware because
  • 00:03:19
    of it but just to reiterate it's a
  • 00:03:21
    Google kubernetes engine is a managed
  • 00:03:23
    Kubernetes service that simplifies containerized
  • 00:03:26
    application deployment scaling
  • 00:03:28
    Management on Google Cloud offering a
  • 00:03:30
    robust and efficient platform for
  • 00:03:32
    container
  • 00:03:34
    orchestration what are the key features
  • 00:03:37
    so why we choose so what's the reason of
  • 00:03:39
    using GKE the major one was that it is a managed
  • 00:03:42
    kubernetes so we don't have to manually
  • 00:03:44
    manage the whole kubernetes
  • 00:03:45
    infrastructure it's managed by Google so
  • 00:03:47
    it's more reliable automatic scaling
  • 00:03:50
    which was very much needed for our
  • 00:03:52
    airflow the data pipeline what we have
  • 00:03:54
    built we need a system which is
  • 00:03:56
    automatically scaled on demand whenever
  • 00:03:59
    and to handle all kinds of load what you
  • 00:04:02
    want security and compliance is one
  • 00:04:04
    major thing so when you set up an
  • 00:04:05
    infrastructure for data pipeline it
  • 00:04:07
    takes a lot of time to just cater
  • 00:04:10
    around the security and compliance that
  • 00:04:12
    is meeting the Securities the firewall
  • 00:04:14
    proof is it exposed to the public
  • 00:04:15
    internet are we having a data leak
  • 00:04:17
    happening like this so gke helps a lot
  • 00:04:20
    in that and that's why and the
  • 00:04:22
    integration at the last the key features
  • 00:04:24
    integration helps a lot in terms of
  • 00:04:26
    integrating
  • 00:04:27
    our CL all our data sources which is
  • 00:04:30
    sitting on our Google Cloud platform
  • 00:04:33
    so it can integrate with our other we do
  • 00:04:36
    a VPC peering where we interact with
  • 00:04:38
    production databases so we don't have to
  • 00:04:40
    go over the internet and all it just
  • 00:04:42
    works with the VP our firewall and all
  • 00:04:44
    just we do VPC instead of exposing of
  • 00:04:48
    net uh n over the network to say pass
  • 00:04:50
    the data on and few use cases it's a
  • 00:04:53
    microservice deployment continuous
  • 00:04:55
    integration cicd and a scalable
  • 00:04:58
    application so microservice
  • 00:04:59
    deployment is a service which you have
  • 00:05:00
    web servers U so what do you say
  • 00:05:04
    middleware and database everything it
  • 00:05:07
    can it can be used for
  • 00:05:09
    that coming to the next what's Apache
  • 00:05:12
    airflow Apache airflow is an open source
  • 00:05:14
    platform designed to programmatically
  • 00:05:16
    author schedule and monitor workflows so
  • 00:05:19
    this is a common tool used for
  • 00:05:20
    scheduling we have seen lots
  • 00:05:22
    of tools for doing scheduling like
  • 00:05:25
    Control-M, Skybot and all but airflow is
  • 00:05:28
    one of the favorite for me because it
  • 00:05:30
    helps a lot it's open source first thing
  • 00:05:32
    and it's a it can be scaled to whatever
  • 00:05:34
    design we want so what's the key
  • 00:05:37
    features again for this so it comes with
  • 00:05:39
    a directed acyclic graph that's called DAGs
  • 00:05:43
    extensibility the way it's helped to
  • 00:05:45
    extends whole airflow pipeline Dynamic
  • 00:05:48
    workflow execution so you we have one
  • 00:05:52
    dag which can create like 200 tasks
  • 00:05:54
    dynamically on demand so whenever we
  • 00:05:57
    just want to add one additional data uh
  • 00:05:59
    table we just add to a yaml and airflow
  • 00:06:02
    takes care of it by creating a
  • 00:06:03
    task for that rich UI which I like it
  • 00:06:07
    and the logging capability where you can
  • 00:06:08
    log it you can see the logs in the UI
  • 00:06:10
    but also you can see it back in the
  • 00:06:12
    server scalability so you can have a
  • 00:06:15
    multiple web server to handle your loads
  • 00:06:18
    incoming loads coming in from uh using
  • 00:06:21
    the load balancer you can scale your web
  • 00:06:23
    server you can scale your scheduler you can
  • 00:06:25
    scale your database so that helps in
  • 00:06:28
    this
  • 00:06:30
    the pro top use cases which came to my
  • 00:06:32
    mind was Data pipeline orchestration
  • 00:06:34
    that's what we generally use it for ETL
  • 00:06:37
    extract transform load process not doing
  • 00:06:39
    the ETL just scheduling it workflow
  • 00:06:43
    Automation in diverse
  • 00:06:47
    industry what's the terraform then so we
  • 00:06:50
    talked about gke we talked about airf
  • 00:06:52
    flow so what does the terraform do terraform
  • 00:06:54
    is infrastructure as code for GKE and
  • 00:06:57
    it's open source infrastructure tool
  • 00:06:58
    that enables user to Define and
  • 00:07:00
    provision infrastructure using a
  • 00:07:02
    declarative configuration
  • 00:07:04
    language the key features as you all
  • 00:07:07
    know is infrastructure as code so that's
  • 00:07:08
    good for versioning control and all
  • 00:07:11
    multicloud provisioning so it can be
  • 00:07:12
    work on Google Azure AWS declarative
  • 00:07:16
    configuration language so it's very
  • 00:07:18
    simply configured easy to configure and
  • 00:07:21
    make it easily readable for anyone to
  • 00:07:23
    apply it plan and apply workflow this is
  • 00:07:25
    the best part you do a terraform change
  • 00:07:28
    you can validate it you can plan it
  • 00:07:30
    that what's the plan how will this go
  • 00:07:32
    and Implement in the system will it
  • 00:07:35
    destroy something what it will change
  • 00:07:37
    what it will add and it's easily
  • 00:07:39
    readable and that helps you a lot to
  • 00:07:41
    judge that shall I go and apply this
  • 00:07:42
    changes to the production or not you can
  • 00:07:45
    figure it out if something is wrong if
  • 00:07:47
    your state if your file has changed the
  • 00:07:50
    target server has changed manually it
  • 00:07:52
    will throw you an error that something
  • 00:07:54
    has gone wrong your plan is not looking
  • 00:07:55
    what you have done the modification so
  • 00:07:57
    it gives you a more control and and this
  • 00:07:59
    is the best way I feel to handle our
  • 00:08:02
    infrastructure and the State Management
  • 00:08:04
    so this is something which helps like
  • 00:08:06
    you have a team of five developers and
  • 00:08:09
    everyone doing a terraform chain how do
  • 00:08:11
    you keep the things in sync that I have
  • 00:08:14
    done a change added a node pool to the
  • 00:08:16
    cluster and you are adding another one
  • 00:08:18
    then how do you do state state file is
  • 00:08:20
    the one which helps us to know pull up
  • 00:08:23
    the configuration of the GK from the
  • 00:08:27
    remote uh our GCS bucket that's where we
  • 00:08:29
    store it and to know that what is the
  • 00:08:31
    current state of the GK looks like was
  • 00:08:33
    last one and then what the changes will
  • 00:08:36
    do will it modify will it destroy or
  • 00:08:39
    will it uh create a new add something
  • 00:08:42
    new to it so that's the State Management
  • 00:08:44
    that's one of the more useful key
  • 00:08:46
    features where you can manage from a
  • 00:08:48
    point where you can roll back where you
  • 00:08:50
    can control
  • 00:08:52
    it the use cases coming to the use cases
  • 00:08:54
    it's a provisioning servers Network
  • 00:08:57
    infrastructures application deployments
  • 00:08:59
    so it's used for everything but we
  • 00:09:03
    have extended it little bit so we can
  • 00:09:04
    install airflow using terraform just
  • 00:09:07
    without having Helm but we used Helm a
  • 00:09:11
    package manager for kubernetes because
  • 00:09:12
    it makes our life easier it's a package
  • 00:09:16
    manager for kubernetes application
  • 00:09:18
    simplifying the deployment and
  • 00:09:19
    management of containerized
  • 00:09:22
    application so what Helm helps us in
  • 00:09:25
    that we have a consistent deployment
  • 00:09:27
    across all the environments so if we have
  • 00:09:29
    five GKE clusters one for test one for
  • 00:09:31
    Dev we can use the same feature same
  • 00:09:34
    package everywhere so we don't have to
  • 00:09:37
    have a different so we keep our
  • 00:09:38
    environments in sync that nothing is different
  • 00:09:42
    in the test or nothing is different in
  • 00:09:43
    the
  • 00:09:44
    prod it is the simplified
  • 00:09:46
    configuration so you have a basic Helm
  • 00:09:49
    configuration which comes with the
  • 00:09:50
    airflow and then on top of it we modify
  • 00:09:53
    our configuration
  • 00:09:55
    like having a different secret keys and
  • 00:09:57
    all uh Fernet keys so that we keep our
  • 00:10:00
    web server secure from the outside world
  • 00:10:02
    uh not using the default setup so
  • 00:10:04
    there's lots of we use Postgres database
  • 00:10:06
    in the back and so all those things is
  • 00:10:08
    being useful over here is the helm
  • 00:10:10
    package we override it with those values
  • 00:10:13
    with the config management and the
  • 00:10:15
    dependency management of the packages so
  • 00:10:16
    you need to install the database first
  • 00:10:18
    then the sidecars first then
  • 00:10:22
    the scheduler and then the web server
  • 00:10:23
    so this all taken care by the helm
  • 00:10:25
    rather than us if I had to do it using
  • 00:10:28
    terraform just using kubectl I would
  • 00:10:31
    have to ensure that everything goes one
  • 00:10:32
    on the pieces otherwise it will just go
  • 00:10:34
    haywire it will not work as expected so use
  • 00:10:38
    cases is microservice deployment in this
  • 00:10:40
    case we are using it for
  • 00:10:43
    airflow so integration of airflow with GKE
  • 00:10:47
    using Helm and Terraform why use GKE for
  • 00:10:51
    airflow why not anything else
  • 00:10:54
    there's lots of benefits for this the
  • 00:10:58
    major one is that it's scalable so it
  • 00:10:59
    can scale to your demand you can have
  • 00:11:01
    a different for different workloads you
  • 00:11:04
    will have different node pools dedicated
  • 00:11:06
    so you can handle a data science model
  • 00:11:08
    in a different node pool with a bigger
  • 00:11:10
    machines with a bigger gpus in the same
  • 00:11:12
    cluster and you can have a small
  • 00:11:13
    pipeline also running in the same
  • 00:11:15
    cluster on the smaller node with a node
  • 00:11:17
    pool with a smaller machine so you don't
  • 00:11:20
    pay way high for everything uh like
  • 00:11:24
    running on everything on the bigger
  • 00:11:26
    machine so that's the one on this then
  • 00:11:30
    the next one for this uh why for the
  • 00:11:32
    Apache and GKE is on the scalability uh
  • 00:11:36
    is done then the security it's secure we
  • 00:11:39
    host our air flow on the VPC private VPC
  • 00:11:42
    and that's where we keep it safe from
  • 00:11:44
    the outside world our air flow
  • 00:11:47
    instance the third Point comes to this
  • 00:11:50
    for the airflow on the
  • 00:11:53
    GK this uh why we are using it uh why it
  • 00:11:56
    should be used uh for this is
  • 00:12:00
    ease of Maintenance and ease of uh
  • 00:12:02
    maintaining the whole infrastructure
  • 00:12:04
    within the GK
  • 00:12:06
    platform what this slide's gone a little haywire
  • 00:12:10
    sorry about
  • 00:12:14
    that so why Helm charts for airflow so
  • 00:12:17
    that's the question so as we asked why GKE
  • 00:12:20
    for air flow and why should we use Helm
  • 00:12:22
    charts for airflow the ease is that it's
  • 00:12:24
    a standardized packaging it will help
  • 00:12:27
    you the consistent deployment across all
  • 00:12:29
    your different environments so that's one
  • 00:12:30
    of the major thing uh to have main
  • 00:12:33
    benefit for Helm is to keep the
  • 00:12:35
    consistent deployments simplified
  • 00:12:37
    configuration so you have a simple
  • 00:12:39
    configuration so you have a basic config
  • 00:12:41
    and then you override so you can
  • 00:12:42
    override your configuration based on
  • 00:12:44
    your requirement for a test environment
  • 00:12:46
    you can have a different setup you can
  • 00:12:48
    have a different config management for
  • 00:12:50
    production you can have a different for
  • 00:12:52
    your performance test environment like
  • 00:12:54
    that so that's the thing Version Control
  • 00:12:57
    and roll back that's another benefit of
  • 00:13:00
    Helm charts that it helps you to
  • 00:13:02
    configure your control your version and
  • 00:13:04
    a version in the Git repository and
  • 00:13:06
    that's easily
  • 00:13:08
    managed and the reusability of course so
  • 00:13:12
    it's a sharable configuration you can
  • 00:13:13
    configure it with you can share it with
  • 00:13:15
    all your team members and anyone can
  • 00:13:17
    contribute to that to help it so that's
  • 00:13:20
    the another thing why to use Helm for
  • 00:13:22
    the airflow we have been using
  • 00:13:24
    before Helm normal deployment but this
  • 00:13:28
    was when the Apache officially
  • 00:13:30
    released the helm chart it made our life
  • 00:13:32
    much easier because the upgrading to the
  • 00:13:34
    air flow and all it was much more
  • 00:13:35
    smoother now rather than manually doing
  • 00:13:37
    the upgrade earlier and why Terraform
  • 00:13:41
    module for GKE so with the terraform module
  • 00:13:43
    you can have everything configured just
  • 00:13:47
    outside not doing a manually so in case
  • 00:13:50
    something goes wrong like accidentally
  • 00:13:53
    you drop the cluster you have a
  • 00:13:55
    terraform module which can just bring
  • 00:13:56
    back the cluster in like 30 minutes with
  • 00:13:59
    all the big machines and all the setups
  • 00:14:01
    you run the helm charts on top of it and
  • 00:14:03
    the whole machine is back for nearly
  • 00:14:05
    like a 45 minutes of time with all the
  • 00:14:07
    configuration and all the things in
  • 00:14:10
    place moving to
  • 00:14:12
    next now let's connect the dots between
  • 00:14:15
    the gke plus Helm and the terraform plus
  • 00:14:17
    air flow so we provision the
  • 00:14:20
    infrastructure with terraform so that
  • 00:14:22
    sets the base terraform sets the
  • 00:14:25
    foundation it bootstraps the kubernetes
  • 00:14:26
    Clusters with all necessary
  • 00:14:29
    firewall rules and all the things in
  • 00:14:31
    place all the VPC pairing and all done
  • 00:14:34
    through the terraform and then Kubernetes
  • 00:14:37
    cluster orchestration with GKE so gke
  • 00:14:39
    ensures the seamless operations between
  • 00:14:43
    the parts of the web server the
  • 00:14:45
    scheduler the PgBouncers the git
  • 00:14:49
    repository and git-sync sidecars as well
  • 00:14:53
    so it that's the done by the terraform
  • 00:14:56
    and then the kubernetes plays the role
  • 00:14:57
    of that and and then we do the package
  • 00:14:59
    management with Helm so Helm charts
  • 00:15:01
    defines the airflow configuration as I
  • 00:15:03
    said
  • 00:15:05
    it makes the server like spinning up the
  • 00:15:08
    server becomes much more easier if
  • 00:15:10
    anything goes wrong you can easily track
  • 00:15:11
    it down and when you do the helm chart
  • 00:15:14
    deploying on GK this is the favorite
  • 00:15:17
    part for me it's a smooth deployment you
  • 00:15:20
    just run it apply and then just sit back
  • 00:15:23
    and relax to see that it's going good or
  • 00:15:25
    it's what's uh happening and if anything
  • 00:15:28
    breaks you get the proper error
  • 00:15:31
    so something is there on the
  • 00:15:35
    screen yeah and last comes so when you
  • 00:15:38
    deploy the helm charts your airflow is
  • 00:15:40
    up and running and that's the last part
  • 00:15:42
    integrated workflow and that is the
  • 00:15:44
    orchestration with the airf flow so
  • 00:15:46
    integrated approach the integration with
  • 00:15:49
    with the air flow helps you with all
  • 00:15:51
    your job dat workloads and everything
  • 00:15:55
    and with this at the end you have a air
  • 00:15:59
    flow up and running which has all your
  • 00:16:00
    active DAGs with the git sync we are
  • 00:16:03
    continuously syncing our git
  • 00:16:04
    repositories with the kubernetes server
  • 00:16:07
    over there with the airf flow so any
  • 00:16:09
    changes we do to our Git repository
  • 00:16:12
    our analytics data repository after
  • 00:16:14
    testing is done when it's merged to the
  • 00:16:16
    master it instantly like in a gap of 2
  • 00:16:18
    minutes it syncs with the production and
  • 00:16:21
    you can see that dag in the production
  • 00:16:23
    over there you can run it instantly by
  • 00:16:26
    default it's off but you can run it so
  • 00:16:28
    with this all the four things in place
  • 00:16:31
    you have an infrastructure in place with
  • 00:16:34
    the combination which is safe secure uh
  • 00:16:37
    reliable scalable to the core where you
  • 00:16:39
    can scale it to the amount and it's a
  • 00:16:41
    very cost optimal solution so when we
  • 00:16:44
    run a data pipeline sometimes become
  • 00:16:46
    very costly process if you use like a
  • 00:16:49
    standard VMs and you don't have it if you
  • 00:16:51
    need to scale it then you manually go
  • 00:16:53
    and add another VM to it and then you
  • 00:16:54
    have to shut it down but gke takes care
  • 00:16:56
    of Auto scaling up down everything for
  • 00:16:59
    that if you need to horizontally scale
  • 00:17:01
    you can scale as much as you want on
  • 00:17:03
    demand and then you scale down to zero
  • 00:17:05
    so we don't run all our node
  • 00:17:08
    pools 24 hours we just run the standard
  • 00:17:12
    node pools for airf flow for like 24
  • 00:17:14
    hours because that's where the air flow
  • 00:17:15
    is hosted but remaining all node pools
  • 00:17:18
    are just scalable from 0 to 5 0 to 10
  • 00:17:20
    based on the demand we do handle our
  • 00:17:23
    data loads for data science data
  • 00:17:25
    scientists we run data models our data
  • 00:17:28
    pipelines the DBT the heavy data
  • 00:17:30
    pipelines which exports like close to
  • 00:17:33
    100 GB of data every day and these all
  • 00:17:35
    happens using air flow and this is the
  • 00:17:39
    benefit of the GK that it helps a lot to
  • 00:17:41
    scale
  • 00:17:42
    up and the good thing with this all of
  • 00:17:46
    this things coming together is that the
  • 00:17:48
    gitlab cicd pipeline the cicd pipeline
  • 00:17:51
    is so robust that when you modify your
  • 00:17:54
    yaml file
  • 00:17:56
    for uh terraform or for Helm it
  • 00:17:59
    validates it does the proper validation
  • 00:18:01
    and the CI pipeline passes it
  • 00:18:03
    revalidates that your plan will work
  • 00:18:05
    perfectly or not and if it will it will
  • 00:18:07
    give you output what the plan looks like
  • 00:18:09
    and based on that you can take a call
  • 00:18:11
    should I deploy it to production or not
  • 00:18:13
    so that's the whole connection of the
  • 00:18:16
    dots between the GK Helm uh terraform
  • 00:18:19
    and airflow moving to
  • 00:18:22
    next so how it is done within gitlab so
  • 00:18:25
    GKE cluster is provisioned through
  • 00:18:27
    terraform using Helm charts and we have the
  • 00:18:30
    air flow installed through the helm
  • 00:18:32
    chart so we have two name space within
  • 00:18:34
    the same cluster prod and
  • 00:18:37
    testing we have seven node pools
  • 00:18:39
    different machine type for each load as
  • 00:18:41
    I said so we have for bigger machines
  • 00:18:44
    with big gpus and CPUs for data science
  • 00:18:47
    models we have big machines for heavy
  • 00:18:50
    data pipeline where we export like in
  • 00:18:52
    one where we run like hundreds of jobs
  • 00:18:57
    within within within an hour so at that
  • 00:18:59
    point of time we need a bigger machine
  • 00:19:01
    with a scalable capability so it can
  • 00:19:02
    scale as much as you want remote state
  • 00:19:05
    for any change required for the GK
  • 00:19:07
    cluster so we maintain our as said
  • 00:19:11
    earlier we maintain our state file in
  • 00:19:13
    the GCS which is properly secured and
  • 00:19:15
    save not open to the public and that
  • 00:19:17
    helps us to maintain our Version Control
  • 00:19:19
    of the whole of the GK infrastructure as
  • 00:19:22
    well gitlab cicd pipeline to validate
  • 00:19:25
    the changes done to the terraform script
  • 00:19:27
    this ensures the changes will not
  • 00:19:29
    break the terraform apply so this is the
  • 00:19:31
    most critical part which helps a lot in
  • 00:19:34
    order
  • 00:19:35
    to not destroy your whole GK setup of
  • 00:19:39
    the air flow because this is the bread
  • 00:19:41
    and butter for the data Engineers if
  • 00:19:43
    that air flow goes Haywire the whole
  • 00:19:45
    data pipeline goes haywire and then you
  • 00:19:47
    have lots of questions to be
  • 00:19:50
    answered air flow installed using Helm
  • 00:19:53
    charts so we use airflow version
  • 00:19:55
    2.5.3 using Helm charts for airflow
  • 00:19:59
    which will bootstrap an airflow
  • 00:20:00
    deployment on kubernetes cluster using
  • 00:20:02
    Helm package managers we use cloud SQL
  • 00:20:06
    Postgres instances so that we have
  • 00:20:08
    configured our basic config to use cloud
  • 00:20:11
    SQL Postgres and we have a sidecar used
  • 00:20:14
    within the kubernetes pod which do the
  • 00:20:17
    git sync with analytics repository so
  • 00:20:19
    that's where this is the one which takes
  • 00:20:21
    2 minutes of time to sync the git
  • 00:20:23
    repository keeps pinging the git master branch
  • 00:20:27
    to see that any new commits has happened
  • 00:20:29
    and if something has happened it will
  • 00:20:30
    pull it and you can see that dag
  • 00:20:31
    immediately into the air
  • 00:20:34
    flow and we have modified our web
  • 00:20:36
    server's secret key and a Fernet key these
  • 00:20:38
    are the default Keys which comes with
  • 00:20:40
    the airflow installation but it's good
  • 00:20:42
    to have your own if you're are using it
  • 00:20:44
    because this keeps your servers safe and
  • 00:20:46
    secure because with default one people
  • 00:20:49
    can decode it and they can get an access
  • 00:20:52
    to the airflow
  • 00:20:56
    UI so how this benefit the data platform
  • 00:20:59
    team managing the data pipeline so we
  • 00:21:02
    currently have 88 active DAGs with these
  • 00:21:05
    88 dags we are able to run 1200 plus
  • 00:21:08
    task every 24 hours so this is a huge
  • 00:21:11
    number in terms that there is a dag
  • 00:21:13
    which runs like 300 tasks and it runs
  • 00:21:16
    for one and a half hour and these 300
  • 00:21:18
    tasks is nothing but polling Postgres
  • 00:21:20
    database with 300 tables and this is
  • 00:21:24
    done with the help of node
  • 00:21:26
    pools and the airf flow pools as well so
  • 00:21:29
    slots within the air flow we use airflow
  • 00:21:32
    slots to say that use a concurrency you
  • 00:21:35
    can run 10 task or 15 task within this
  • 00:21:37
    particular airflow dag in a concurrency
  • 00:21:40
    and if you set that to too high then
  • 00:21:43
    your node pool should be able to support
  • 00:21:44
    it so that comes the GK where it scales
  • 00:21:48
    horizontally to have like six seven
  • 00:21:50
    clusters to handle that load and that's
  • 00:21:53
    where one task itself creates like one
  • 00:21:56
    dag itself creates like 200 task and
  • 00:21:59
    that is through yaml file so we have
  • 00:22:01
    just 88 python files and that 88 python
  • 00:22:05
    files is responsible to create more than
  • 00:22:07
    600 to 700 task in the
  • 00:22:10
    airl improving the workflow task
  • 00:22:13
    dynamism with air flow since as I said
  • 00:22:15
    it's a configurable so once you do a
  • 00:22:18
    modification and it's a makes life very
  • 00:22:20
    easy because as a data Engineers we are
  • 00:22:24
    always overwhelmed with lots of work
  • 00:22:26
    lots of our own pipeline so someone
  • 00:22:28
    needs to add a new table to Postgres a
  • 00:22:30
    person who is requesting can himself add
  • 00:22:32
    it and add it to the yaml file and we
  • 00:22:35
    can just see validate it it looks good
  • 00:22:37
    and we do the merge request and that
  • 00:22:39
    file that table is present in the
  • 00:22:41
    pipeline in after 2 minutes so it
  • 00:22:43
    improves enhances the cicd timeline and
  • 00:22:46
    the deployment P timeline for adding a
  • 00:22:48
    new table so kubernetes pod operators to
  • 00:22:52
    schedule Dynamic workload so airf flow
  • 00:22:55
    comes with the kubernetes Pod operators
  • 00:22:57
    and that is the one which we use to
  • 00:22:59
    schedule our workloads in whole of the
  • 00:23:01
    GKE setup so every task spun through air
  • 00:23:06
    flow creates a pod in the gke and
  • 00:23:09
    that pod itself is responsible to do
  • 00:23:11
    that whatever that task was supposed to
  • 00:23:13
    do so right from establishing the
  • 00:23:15
    connections to the database pulling the
  • 00:23:17
    data pushing it to GCS loading to
  • 00:23:19
    Snowflake and once all is done the pod
  • 00:23:21
    gets Auto destroyed and that's and then
  • 00:23:23
    the machine starts scaling back so
  • 00:23:25
    kubernetes pod operator is like a lifesaver
  • 00:23:28
    for us because it helps us to scale the
  • 00:23:30
    whole
  • 00:23:32
    environment on demand node provisioning
  • 00:23:34
    with terraform so with terraform if a
  • 00:23:37
    team comes with saying that I need a
  • 00:23:39
    machine a data science team mostly I
  • 00:23:42
    need a machine with a bigger spec now
  • 00:23:44
    and if we were going through UI and we
  • 00:23:47
    don't have didn't have terraform the
  • 00:23:49
    problem would be that how do you manage
  • 00:23:53
    this uh you have to again go through how
  • 00:23:55
    do you Version Control so with Terraform
  • 00:23:57
    it's easy easy okay you need a machine
  • 00:23:59
    let's go modify add the spec set it to this
  • 00:24:02
    go from highmem-4 to highmem-8 like that
  • 00:24:06
    merge it's validated within half an hour
  • 00:24:08
    and it's boom ready to deployed in
  • 00:24:11
    production you have new machines present
  • 00:24:13
    to do minimal downtime typically under
  • 00:24:15
    45 minutes in the event of Disaster
  • 00:24:18
    Recovery scenario so that's what I was
  • 00:24:20
    telling in the start with this all the
  • 00:24:22
    four setup if we destroy the whole of
  • 00:24:25
    the kubernetes everything and everything
  • 00:24:28
    just goes wipe off we can spin it back
  • 00:24:30
    in 45 minutes so with this it makes like
  • 00:24:34
    the system is in control so in a case of
  • 00:24:37
    worst to the worst we are back in 45
  • 00:24:39
    minutes to resume the pipeline and which
  • 00:24:40
    helps our data pipeline to be more
  • 00:24:43
    secure and reliable for the end
  • 00:24:47
    users the best practices and
  • 00:24:49
    consideration so security best practice
  • 00:24:51
    private cluster configuration that's
  • 00:24:53
    what I would recommend VPC pairing so if
  • 00:24:56
    you have a another database another
  • 00:24:58
    sources which is in the same Google
  • 00:24:59
    Cloud you try to do VPC pairing rather
  • 00:25:02
    than exposing it over the internet and
  • 00:25:05
    fetching data identity and access management
  • 00:25:07
    control which is something like to give
  • 00:25:09
    the least privileges to that so that
  • 00:25:12
    only people with minimal access work
  • 00:25:15
    with the minimal access over there node
  • 00:25:17
    pool isolation for different data load
  • 00:25:19
    however different node pool that will
  • 00:25:21
    help you for different type of work just
  • 00:25:23
    have a different node so this will keep
  • 00:25:25
    your cost down and you will not be
  • 00:25:27
    running a bigger machine for a smaller
  • 00:25:28
    task and a smaller machine for a bigger
  • 00:25:30
    task securing Secrets keep so everything
  • 00:25:33
    is in the vault so use Vault to secure your
  • 00:25:37
    Secrets a scalability consideration
  • 00:25:40
    horizontal pod autoscaling so that's one
  • 00:25:43
    which is recommended database is scaling
  • 00:25:45
    as well so you can scale your database
  • 00:25:48
    uh as much as you want task parallelism
  • 00:25:50
    so concurrency within the airflow helps
  • 00:25:52
    you a lot to how do you run your task in
  • 00:25:56
    parallel with each other at one point of
  • 00:25:58
    time resource request limits persistent
  • 00:26:01
    the storage consideration this is
  • 00:26:03
    something where you have like the as GK
  • 00:26:06
    you should have a PVC volume attached
  • 00:26:08
    to it where you can store all your logs
  • 00:26:10
    so in case gke goes off and when you
  • 00:26:13
    restore it back you still have access to
  • 00:26:15
    those logs to see for gke node pools
  • 00:26:19
    for different types of load for
  • 00:26:21
    different streams of team to work with
  • 00:26:24
    monitoring and logging strategies
  • 00:26:26
    leverage kubernetes monitoring
  • 00:26:28
    Solutions Prometheus and grafana which
  • 00:26:31
    is very useful alerting and notification
  • 00:26:34
    channels we use a slack for all our
  • 00:26:36
    airflow failures happening through and
  • 00:26:38
    we monitor our web server through
  • 00:26:41
    Prometheus and grafana and it goes off
  • 00:26:44
    by any time we get an instant alert and
  • 00:26:47
    we use airflow metrics to monitor it and
  • 00:26:51
    this is me you can just find me and look
  • 00:26:55
    for me and just reach out to me and
  • 00:26:57
    there's some additional resources so we
  • 00:26:59
    have gitlab handbook for information
  • 00:27:01
    about node pools and name spaces how we
  • 00:27:03
    have set it up airflow infrastructure
  • 00:27:05
    how it is present over there and our
  • 00:27:07
    gitlab data analytics repo or our data
  • 00:27:09
    bags where you can see all the dag bags
  • 00:27:11
    all the dags how we have done it and
  • 00:27:14
    thank you and any question
  • 00:27:19
    please thank you very much um uh Ved now
  • 00:27:23
    we have a couple of minutes for
  • 00:27:26
    questions please question here and
  • 00:27:33
    there hi
  • 00:27:35
    uh thanks for your uh
  • 00:27:39
    presentation uh I was wondering when uh
  • 00:27:43
    choosing Helm have you considered
  • 00:27:45
    Kustomize as an
  • 00:27:48
    option no question I we just looked at
  • 00:27:52
    Helm because in our organization we were
  • 00:27:53
    using a lot of Helm config management uh
  • 00:27:56
    not for our infrastructure but our the
  • 00:27:57
    SRE team they use Helm config
  • 00:27:59
    management a lot so we took the
  • 00:28:01
    knowledge from them and embed it in our
  • 00:28:04
    so that to keep it consistent uh
  • 00:28:07
    across the organization so we haven't
  • 00:28:09
    looked do that but yeah I it's uh if
  • 00:28:13
    chance were there we would have
  • 00:28:15
    evaluated it again but Helm was the
  • 00:28:17
    choice because the Apache airflow we
  • 00:28:19
    were waiting for it to release a Helm
  • 00:28:20
    chart and when it released with the
  • 00:28:22
    series 2 it was good because it was
  • 00:28:24
    helpful for version management and all
  • 00:28:26
    so very easy for for the upgrade of air
  • 00:28:28
    flow I see and uh sort of related to
  • 00:28:31
    that do you see any advantage having
  • 00:28:34
    your own airflow deployment compared to
  • 00:28:37
    maybe Cloud composer which could be kind
  • 00:28:39
    of simpler yes we see an advantage we
  • 00:28:43
    can manage our own we see advantage in
  • 00:28:45
    terms we want to uh our load balancing
  • 00:28:50
    and our whatever the demand what we have
  • 00:28:53
    in terms of the data and the node pools
  • 00:28:55
    and the Machine size it's much more more
  • 00:28:57
    easier for us to manage it over here on
  • 00:28:59
    this side Cloud composer is good but
  • 00:29:03
    it's a lot of pain for me or to think
  • 00:29:07
    that to sync my G repository every time
  • 00:29:09
    I do a modification so the cicd when it
  • 00:29:11
    works with the GK with the gitlab that
  • 00:29:13
    components helps a lot in our data
  • 00:29:15
    pipeline to make it much more scalable
  • 00:29:17
    so that's where we look at the leverage
  • 00:29:19
    at how gitlab works with the GK more
  • 00:29:22
    efficiently with all our cicd pipelines
  • 00:29:25
    set up in place so that's where so yeah
  • 00:29:28
    Cloud composer was an option when we
  • 00:29:29
    looking at manage service and then we
  • 00:29:32
    still stuck with this because we were
  • 00:29:33
    able to manage all our loads over here
  • 00:29:36
    so thank you all right we have space for
  • 00:29:39
    one more
  • 00:29:41
    question yes
  • 00:29:43
    please thank you hi uh thank you for
  • 00:29:46
    your time um the question was regarding
  • 00:29:48
    the docker itself um used with the
  • 00:29:51
    kubernetes what's the key uh difference
  • 00:29:54
    between using this approach and simply
  • 00:29:56
    using um officially available airflow
  • 00:29:59
    Docker image with the nodes and the
  • 00:30:00
    kubernetes yes so we were using Docker
  • 00:30:03
    before so we moved to helm with the
  • 00:30:05
    series 2 we were till airflow 1.5 we
  • 00:30:08
    were using Docker images but when we
  • 00:30:10
    were trying to do an upgrade from 1.5 to
  • 00:30:12
    two Docker image was creating a lots of
  • 00:30:16
    what do you say stale objects our
  • 00:30:18
    hanging DAGs it was not a performance
  • 00:30:20
    oriented for us it was working good for
  • 00:30:22
    other people but we saw that couple of
  • 00:30:24
    blogs saying that the DAGs are hanging
  • 00:30:26
    and there is task which is wa waiting
  • 00:30:28
    for to be executed for 24 hours the
  • 00:30:30
    docker images didn't work so good for us
  • 00:30:32
    but when we moved to helm it was working
  • 00:30:35
    good I'm not sure what was the
  • 00:30:37
    difference but at that point of time for
  • 00:30:39
    me the performance was the key to move
  • 00:30:41
    towards the helm but I had a problem
  • 00:30:42
    with till 1.5 it was super good it was
  • 00:30:46
    very efficient it was with the series 2
  • 00:30:48
    it didn't work good for
  • 00:30:49
    us great thank you so basically in
  • 00:30:52
    future it might come back as a possible
  • 00:30:54
    solution yes yes it is it is still an
  • 00:30:57
    option we use Docker for our local
  • 00:30:59
    testing so we still use the docker
  • 00:31:01
    images for our local testing for this
  • 00:31:03
    thank you all right uh Ved thank you
  • 00:31:07
    very much for this insightful
  • 00:31:09
    presentation I would like to remind
  • 00:31:11
    everyone that uh you can network and
  • 00:31:13
    connect during lunch perhaps for future
  • 00:31:16
    questions and this is a certificate of
  • 00:31:18
    appreciation from the organizing team
  • 00:31:19
    thank you thank you very
  • 00:31:21
    much thank
  • 00:31:26
    you
Tags
  • GitLab
  • Airflow
  • GKE
  • Terraform
  • Helm
  • DevSecOps
  • Data Pipeline
  • CI/CD
  • Infrastructure as Code
  • Kubernetes