00:00:10
We are starting this block with Mr. VD Prakash, Staff Data Engineer at GitLab, who will talk to us in a presentation entitled "Intro to GKE Setup for Airflow with Helm and Terraform." I would like you to give a warm welcome to Mr. VD Prakash. Please, the stage is yours.
00:00:44
Hi, good morning everyone. I'm excited, so let's talk about some tech, mostly data. I'm VD Prakash, and I work as a Staff Data Engineer at GitLab. I've been there for close to three years, and it's pretty exciting over there, so I wanted to share some of my learnings and how we have set up Airflow in our data platform team with you all, the wider audience. So that's me.
00:01:14
Just a little bit about GitLab as a company. Our mission is "everyone can contribute": everyone across the team, across the globe, can contribute to GitLab to make the product better, more efficient, and smarter.

GitLab lives by its CREDIT values, and these are not just words: Collaboration, Results, Efficiency, Diversity, Iteration, and Transparency. Every one of these letters has its own big handbook page on the GitLab site where you can read more, but we literally live by these values, and it shows in our day-to-day work. Every OKR, every goal, every project we deliver goes through these values: it's transparent, we develop in small iterations, and we deliver fast and iterate faster. We start with a boring solution, then iterate and make it smarter and smarter, but we meet the results at the end.
00:02:21
Coming to what GitLab is doing: GitLab is a DevSecOps platform delivered as a single application to help you iterate faster, together. It's a DevSecOps platform where everything can be done in terms of CI/CD pipelines, the security of your pipelines and your Git repository, and all the related things.
00:02:43
Now, moving to the agenda for this short 20-minute talk: we will discuss an introduction to Google Kubernetes Engine, Apache Airflow, Helm, and Terraform; how all of these components fit together; how the installation is done within the GitLab data platform team; what the wins for the data platform team are; and a little bit on best practices and considerations.
00:03:11
To start with, an introduction to GKE. What is GKE? Most of you are probably aware of it, but just to reiterate: Google Kubernetes Engine is a managed Kubernetes service that simplifies containerized application deployment, scaling, and management on Google Cloud, offering a robust and efficient platform for container orchestration.
00:03:34
What are the key features, and why did we choose GKE? The major reason is that it's managed Kubernetes: we don't have to manually manage the whole Kubernetes infrastructure, it's managed by Google, so it's more reliable. Automatic scaling was very much needed for the Airflow data pipelines we have built; we need a system that automatically scales on demand to handle all kinds of load. Security and compliance is another major thing: when you set up infrastructure for a data pipeline, a lot of time goes just into security and compliance questions, like whether the firewall rules are in place, whether anything is exposed to the public internet, whether a data leak could happen, and GKE helps a lot with that. And finally, integration: it helps a lot in integrating all our data sources sitting elsewhere on Google Cloud Platform. We do VPC peering to interact with our production databases, so we don't have to go over the public internet; it just works within our VPC and firewall rules instead of exposing anything over the network to pass the data. A few use cases are microservice deployments, continuous integration (CI/CD), and scalable applications; a microservice deployment is a service where you have web servers, middleware, and databases, and GKE can be used for all of that.
00:05:09
Coming to the next topic: what is Apache Airflow? Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It's a common tool used for scheduling; we have seen lots of standard scheduling tools like Control-M, Skybot, and so on, but Airflow is one of my favorites because it helps a lot: it's open source, first of all, and it can be scaled to whatever design we want. What are the key features? It comes with directed acyclic graphs, called DAGs. Extensibility, in the way it lets you extend the whole Airflow pipeline. Dynamic workflow execution: we have one DAG that can create around 200 tasks dynamically on demand, so whenever we want to add one additional table we just add it to a YAML file and Airflow takes care of creating a task for it. A rich UI, which I like, and logging capability, where you can see the logs in the UI but also back on the server. And scalability: you can have multiple web servers to handle incoming load behind a load balancer, and you can scale your web server, your scheduler, and your database.
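As a rough sketch of that dynamic workflow execution, where one DAG fans out into tasks driven by a YAML file, something like the following works with Airflow 2.x. The config path, the YAML structure, and the extract_table helper are illustrative assumptions, not GitLab's actual pipeline code.

```python
# Illustrative sketch only -- file path, YAML layout, and helper are hypothetical.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_table(table_name: str, **context):
    # Placeholder for the real extraction logic (e.g. Postgres -> GCS -> Snowflake).
    print(f"extracting {table_name}")


# One YAML file decides how many tasks the DAG fans out into.
with open("/opt/airflow/config/tables.yml") as f:  # hypothetical config path
    tables = yaml.safe_load(f)["tables"]

with DAG(
    dag_id="postgres_extract",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in tables:
        PythonOperator(
            task_id=f"extract_{table}",
            python_callable=extract_table,
            op_kwargs={"table_name": table},
        )
```

Adding one more table name to the YAML file is then enough for a new task to appear in the DAG on the next parse.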
00:06:30
The top use cases that come to mind are data pipeline orchestration, which is what we generally use it for; ETL (extract, transform, load) processes, where Airflow is not doing the ETL itself, just scheduling it; and workflow automation in diverse industries.
00:06:47
So what is Terraform, then? We've talked about GKE and Airflow, so what does Terraform do? Terraform is our infrastructure as code for GKE: it's an open-source infrastructure tool that enables users to define and provision infrastructure using a declarative configuration language.
00:07:04
The key features, as you all know: infrastructure as code, which is good for version control; multi-cloud provisioning, so it works on Google Cloud, Azure, and AWS; a declarative configuration language, so it's simple to configure and easily readable for anyone; and the plan-and-apply workflow, which is the best part. You make a Terraform change, you can validate it, and you can plan it: what's the plan, how will this land in the system, will it destroy something, what will it change, what will it add? It's easily readable, and that helps you a lot in judging whether to apply the changes to production or not. You can also figure out if something is wrong: if the target infrastructure has been changed manually, it will throw an error that something has gone wrong and your plan doesn't match your modification. So it gives you more control, and this is the best way, I feel, to handle our infrastructure. Then there is state management. This is what helps when you have a team of five developers all making Terraform changes: how do you keep things in sync when I have added a node pool to the cluster and you are adding another one? The state file is what lets us pull the configuration of the GKE cluster from the remote, a GCS bucket in our case, to know what the current state of GKE looked like after the last apply, and then what the changes will do: will they modify, destroy, or add something new? That's state management, one of the more useful key features, giving you a point from which you can manage, roll back, and control it.
00:08:52
Coming to the use cases: provisioning servers, network infrastructure, and application deployments, so it's used for pretty much everything. We have extended it a little bit: we could install Airflow using Terraform alone, without Helm, but we use Helm, a package manager for Kubernetes, because it makes our life easier. It's a package manager for Kubernetes applications, simplifying the deployment and management of containerized applications.
00:09:22
So what does Helm help us with? We get consistent deployment across all environments: if we have five GKE clusters, one for test, one for dev, and so on, we can use the same package everywhere, so we keep our environments in sync and nothing is different in test versus production. Then there is simplified configuration: you have the basic Helm configuration that comes with Airflow, and on top of it we modify our configuration, like having our own secret keys and Fernet keys so that we keep our web server secure from the outside world instead of using the default setup. We use a Postgres database in the back end, and all of those things are handled here through the Helm package; we override it with those values through the config management. And there is dependency management of the packages: you need to install the database first, then the sidecars, then the scheduler, and then the web server, and this is all taken care of by Helm. If I had to do it with Terraform or plain kubectl, I would have to ensure that every piece goes in one by one, otherwise it would just go haywire and not work as expected. The use case here is microservice deployment, and in our case we are using it for Airflow.
00:10:43
Now, the integration of Airflow with GKE using Helm and Terraform. Why use GKE for Airflow and not anything else? There are lots of benefits. The major one is scalability: it can scale to your demand, and you can have different dedicated node pools for different workloads. You can run a data science model in a node pool with bigger machines and bigger GPUs, and in the same cluster you can have a small pipeline running on a node pool with smaller machines, so you don't pay a high price for running everything on the bigger machines. The next point, beyond scalability, is security: we host our Airflow instance on a private VPC, and that's how we keep it safe from the outside world. The third point for running Airflow on GKE is ease of maintenance, the ease of maintaining the whole infrastructure within the GKE platform. (This slide has gone a little haywire, sorry about that.)
00:12:14
So why Helm charts for Airflow? Just as we asked why GKE for Airflow, why should we use Helm charts for Airflow? The first reason is standardized packaging: it gives you consistent deployments across all your different environments, which is one of the main benefits of Helm. Then simplified configuration: you have a basic config and then you override it based on your requirements, so you can have a different setup for a test environment, a different config for production, a different one for your performance test environment, and so on. Version control and rollback is another benefit of Helm charts: it helps you keep your configuration versioned in the Git repository, and that's easily managed. And reusability, of course: it's a shareable configuration, you can share it with all your team members, and anyone can contribute to improve it. That's another reason to use Helm for Airflow. We had been using a normal deployment before Helm, but when Apache officially released the Helm chart it made our life much easier, because upgrading Airflow became much smoother than doing the upgrades manually as before. And why a Terraform module for GKE? With the Terraform module you have everything configured in code rather than done manually, so in case something goes wrong, like accidentally dropping the cluster, you have a Terraform module that can bring the cluster back in about 30 minutes with all the big machines and all the setup; you run the Helm charts on top of it, and the whole system is back in roughly 45 minutes with all the configuration and everything in place. Moving to the next slide.
00:14:12
Now let's connect the dots between GKE, Helm, Terraform, and Airflow. First, we provision the infrastructure with Terraform; that sets the foundation. It bootstraps the Kubernetes cluster with all the necessary firewall rules in place, and all the VPC peering is done through Terraform. Then comes cluster orchestration with GKE: GKE ensures seamless operation between the parts, the web server, the scheduler, the PgBouncers, the Git repository, and the git-sync sidecars as well; Terraform provisions it, and then Kubernetes plays that role. Then we do the package management with Helm: the Helm chart defines the Airflow configuration, as I said, so spinning up the server becomes much easier, and if anything goes wrong you can easily track it down. And when you deploy the Helm chart on GKE, this is my favorite part, it's a smooth deployment: you just run the apply, sit back and relax to see that it's going well, and if anything breaks you get a proper error. (Something is showing on the screen.)
00:15:35
And last, when you deploy the Helm charts, your Airflow is up and running, and that's the final piece: the integrated workflow, the orchestration with Airflow. This integrated approach helps you with all your data workloads, and at the end you have Airflow up and running with all your active DAGs. With git-sync we are continuously syncing our Git repository, our analytics data repository, with the Kubernetes server running Airflow, so for any change we make, once testing is done and it's merged to master, it syncs to production almost instantly, within a gap of about 2 minutes, and you can see that DAG in production and run it right away (by default it's off, but you can turn it on). With all four things in place you have an infrastructure that is safe, secure, reliable, and scalable to the core, and it's a very cost-optimal solution. Running a data pipeline can become a very costly process if you use standard VMs: if you need to scale, you manually go and add another VM and then have to shut it down later, but GKE takes care of auto-scaling up and down. If you need to scale horizontally you can scale as much as you want on demand, and then scale back down to zero. We don't run all our node pools 24 hours a day; we only run the standard node pool for Airflow around the clock, because that's where Airflow is hosted, but all the remaining node pools scale from 0 to 5 or 0 to 10 based on demand. We handle our data loads for the data scientists, we run data models, our data pipelines, dbt, and the heavy data pipelines that export close to 100 GB of data every day, and all of this happens using Airflow; this is the benefit of GKE, that it helps a lot to scale up.
00:17:42
And the good thing about all of this coming together is the GitLab CI/CD pipeline. The CI/CD pipeline is so robust that when you modify your configuration files for Terraform or for Helm, it does the proper validation, the CI pipeline revalidates whether your plan will work, and it gives you output showing what the plan looks like; based on that you can take the call on whether to deploy it to production or not. So that's the whole connecting of the dots between GKE, Helm, Terraform, and Airflow. Moving on.
00:18:22
So how is it done within GitLab? The GKE cluster is provisioned through Terraform, and Airflow is installed through the Helm chart. We have two namespaces within the same cluster, prod and testing. We have seven node pools with a different machine type for each workload, as I said: bigger machines with big GPUs and CPUs for the data science models, and big machines for the heavy data pipelines, where in one case we run hundreds of jobs within an hour, so at that point we need a bigger machine with scaling capability so it can scale as much as needed. We use remote state for any change required to the GKE cluster: as mentioned earlier, we maintain our state file in GCS, properly secured and not open to the public, and that helps us maintain version control of the whole GKE infrastructure as well. And the GitLab CI/CD pipeline validates the changes made to the Terraform scripts; it ensures the changes will not break the terraform apply. This is the most critical part, because it helps us not destroy the whole GKE setup for Airflow. Airflow is the bread and butter for the data engineers: if Airflow goes haywire, the whole data pipeline goes haywire, and then you have a lot of questions to answer.
00:19:50
Airflow is installed using Helm charts: we use Airflow version 2.5.3, installed with the Helm chart for Airflow, which bootstraps an Airflow deployment on the Kubernetes cluster using the Helm package manager. We use Cloud SQL Postgres instances, so we have configured the base config to use Cloud SQL Postgres, and we have a sidecar within the Kubernetes pod which does the git sync with the analytics repository; this is the piece that takes about 2 minutes to sync, continuously polling the Git master branch to see whether any new commits have happened, and if so, it pulls them and you can see that DAG immediately in Airflow. We have also changed the web server secret key and the Fernet key: defaults come with the Airflow installation, but it's good to have your own, because this keeps your servers safe and secure; with the default ones, people could decode values and get access to the Airflow UI.
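For reference, non-default keys like these can be generated with a small Python snippet and then fed into the Helm values or a Kubernetes secret; the exact chart value names depend on the Airflow Helm chart version, so this is only a sketch.

```python
# Sketch: generate non-default keys to feed into the Helm values or a
# Kubernetes secret. How they are wired in depends on the chart version.
import secrets

from cryptography.fernet import Fernet  # pip install cryptography

# Fernet key: Airflow uses it to encrypt connection passwords and variables.
fernet_key = Fernet.generate_key().decode()

# Webserver secret key: used to sign the Airflow UI's session cookies.
webserver_secret_key = secrets.token_hex(16)

print("fernet key:", fernet_key)
print("webserver secret key:", webserver_secret_key)
```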
00:20:56
So how does this benefit the data platform team in managing the data pipeline? We currently have 88 active DAGs, and with these 88 DAGs we are able to run 1,200+ tasks every 24 hours. That's a huge number, considering there is one DAG that runs around 300 tasks for about one and a half hours, and those 300 tasks are nothing but polling a Postgres database with 300 tables; this is done with the help of node pools and Airflow pools as well. With slots in the Airflow pools we control concurrency, saying you can run 10 or 15 tasks concurrently within a particular DAG, and if you set that too high, your node pool has to be able to support it. That's where GKE comes in, scaling horizontally to something like six or seven nodes to handle that load. One DAG by itself creates around 200 tasks, and that is driven by a YAML file, so we have just 88 Python files, and those 88 Python files are responsible for creating more than 600 to 700 tasks in Airflow.
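A hedged sketch of how that pool and concurrency control is expressed in DAG code follows; the pool name and the 300-table fan-out are illustrative, and the pool itself would be created separately (for example via the Airflow UI or the `airflow pools set` CLI) with the desired number of slots.

```python
# Sketch only: pool name, task count, and callables are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="postgres_extract_limited",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_tasks=10,  # at most 10 tasks of this DAG run concurrently
) as dag:
    for i in range(300):
        PythonOperator(
            task_id=f"extract_table_{i}",
            python_callable=lambda: None,  # placeholder for the real extract
            pool="postgres_extract_pool",  # shared pool caps slots across DAGs
            pool_slots=1,
        )
```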
00:22:10
This improves workflow and task dynamism with Airflow. As I said, it's configuration-driven, and that makes life very easy, because as data engineers we are always overwhelmed with lots of work and lots of our own pipelines. If someone needs to add a new table to the Postgres extraction, the person requesting it can add it to the YAML file themselves; we just validate that it looks good, do the merge request, and that table is present in the pipeline after about 2 minutes. So it improves the CI/CD and deployment timeline for adding a new table.
00:22:48
We use Kubernetes pod operators to schedule dynamic workloads. Airflow comes with the KubernetesPodOperator, and that is what we use to schedule our workloads across the whole GKE setup. Every task spun up through Airflow creates a pod in GKE, and that pod itself is responsible for doing whatever that task was supposed to do, right from establishing the connection to the database, pulling the data, pushing it to GCS, and loading it into Snowflake; once it's all done, the pod gets auto-destroyed and the machines start scaling back down. The KubernetesPodOperator is a lifesaver for us because it helps us scale the whole environment.
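A rough sketch of such a KubernetesPodOperator task is shown below; the image, namespace, and resource figures are placeholders, and the import path varies slightly between versions of the cncf.kubernetes provider.

```python
# Sketch only: image, namespace, and resource figures are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    dag_id="extract_via_pod",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = KubernetesPodOperator(
        task_id="extract_orders",
        name="extract-orders",
        namespace="data-prod",  # hypothetical namespace
        image="registry.example.com/extractor:latest",  # hypothetical image
        arguments=["--table", "orders"],
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "1Gi"},
            limits={"cpu": "1", "memory": "2Gi"},
        ),
        is_delete_operator_pod=True,  # pod is cleaned up when the task finishes
        get_logs=True,  # pod logs are streamed into the Airflow task log
    )
```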
00:23:32
Next, on-demand node provisioning with Terraform. If a team comes saying "I need a machine with a bigger spec now", usually the data science team, and we were going through the UI without Terraform, the problem would be how to manage and version-control that. With Terraform it's easy: you need a machine, so you modify the spec, say going from a highmem-4 to a highmem-8 machine type, merge it, it's validated, and within half an hour it's ready and deployed, with the new machines present in production. It also means minimal downtime, typically under 45 minutes, in a disaster recovery scenario. That's what I was saying at the start: with all four pieces set up, if the whole Kubernetes setup is destroyed and everything gets wiped, we can spin it back up in about 45 minutes. So the system stays under control; in the worst of the worst cases we are back in 45 minutes to resume the pipeline, which makes our data pipeline more secure and reliable for the end users.
00:24:47
Now, best practices and considerations. For security best practices: use a private cluster configuration, that's what I would recommend, and VPC peering; if you have another database or other sources in the same Google Cloud, try to do VPC peering rather than exposing them over the internet to fetch the data. Use identity and access management controls to grant least privilege, so that people work only with the minimal access they need. Use node pool isolation for different data loads: a different node pool for each type of work keeps your cost down, so you are not running a bigger machine for a smaller task or a smaller machine for a bigger task. And secure your secrets: keep everything in a secrets vault and use it to protect them.

On scalability considerations: horizontal pod autoscaling is recommended; database scaling as well, so you can scale your database as much as you need; task parallelism, since concurrency within Airflow helps a lot with how your tasks run in parallel at any point in time; and resource requests and limits. On persistent storage: with GKE you should have a persistent volume (PVC) attached where you store all your logs, so in case GKE goes down and you restore it, you still have access to those logs, and keep GKE node pools for different types of load and for the different streams of teams to work with. For monitoring and logging strategies: leverage Kubernetes monitoring solutions like Prometheus and Grafana, which are very useful, plus alerting and notification channels. We use Slack for all our Airflow failures, we monitor our web server through Prometheus and Grafana, so if it goes down at any time we get an instant alert, and we use Airflow metrics to monitor it.
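As an illustration of that kind of Slack alerting (not GitLab's actual setup), a minimal failure callback posting to a Slack incoming webhook could look like the following; the SLACK_WEBHOOK_URL environment variable is an assumption, and the official Slack provider's notifiers could be used instead.

```python
# Minimal sketch of Slack alerting on task failure; the webhook environment
# variable is hypothetical, and the Slack provider package is an alternative.
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_slack_on_failure(context):
    """Called by Airflow with the task context when a task fails."""
    ti = context["task_instance"]
    message = (
        f":red_circle: Airflow task failed: {ti.dag_id}.{ti.task_id} "
        f"(run {context['run_id']})\nLog: {ti.log_url}"
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)


with DAG(
    dag_id="example_with_alerts",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_slack_on_failure},
) as dag:
    # Deliberately failing task to show the callback firing.
    PythonOperator(task_id="might_fail", python_callable=lambda: 1 / 0)
```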
00:26:51
And this is me; you can find me and reach out to me. There are some additional resources: the GitLab handbook has information about the node pools and namespaces and how we have set up the Airflow infrastructure, and our GitLab data analytics repo holds our DAG bags, where you can see all the DAGs and how we have built them. Thank you, and any questions?
00:27:19
Thank you very much, VD. Now we have a couple of minutes for questions. There is a question here.

Hi, thanks for your presentation. I was wondering, when choosing Helm, have you considered Kustomize as an option?
00:27:48
No; we just looked at Helm, because in our organization we were already using a lot of Helm config management, not for our infrastructure, but our SRE team uses Helm config management a lot, so we took the knowledge from them and embedded it in ours to keep things consistent across the organization. So we haven't looked at that, but if the chance were there we would evaluate it again. Helm was the choice because we were waiting for Apache Airflow to release a Helm chart, and when it was released with the series 2, it was good because it helped with version management and made upgrading Airflow very easy.
00:28:28
I see. And, sort of related to that, do you see any advantage in having your own Airflow deployment compared to, say, Cloud Composer, which could be kind of simpler?
00:28:39
Yes, we see an advantage in managing our own: in terms of load balancing and whatever demand we have for the data, the node pools, and the machine sizes, it's much easier for us to manage it on this side. Cloud Composer is good, but it's a lot of pain for me to think about syncing my Git repository every time I make a modification, whereas the CI/CD, when it works with GKE and GitLab, helps a lot in making our data pipeline much more scalable. So we leverage how GitLab works with GKE more efficiently, with all our CI/CD pipelines set up in place. Cloud Composer was an option when we were looking at managed services, but we stuck with this because we were able to manage all our loads here. Thank you.

All right, we have space for one more question. Yes, please.
00:29:43
Thank you. Hi, thank you for your time. My question was regarding Docker itself used with Kubernetes: what's the key difference between this approach and simply using the officially available Airflow Docker image with the nodes and Kubernetes?
00:30:00
Yes, we were using Docker before; we moved to Helm with the series 2 releases. Up to Airflow 1.5 we were using Docker images, but when we were trying to do the upgrade from 1.5 to 2, the Docker image was creating a lot of, what do you say, stale objects and hanging DAGs. It was not performant for us; it was working fine for other people, but we saw a couple of blogs saying that DAGs were hanging, and we had tasks waiting to be executed for 24 hours. The Docker images didn't work so well for us, but when we moved to Helm it worked well. I'm not sure what the difference was, but at that point performance was the key reason for me to move towards Helm. Up to 1.5 it was super good, very efficient; with series 2 it didn't work well for us.
00:30:49
Great, thank you. So basically, in the future it might come back as a possible solution? Yes, yes, it is still an option; we use Docker for our local testing, so we still use the Docker images for local testing.
00:31:03
Thank you. All right, VD, thank you very much for this insightful presentation. I would like to remind everyone that you can network and connect during lunch, perhaps for further questions, and this is a certificate of appreciation from the organizing team. Thank you. Thank you very much. Thank you.