Intro to GKE setup of airflow with helm and terraform | Ved Prakash | DSC Europe 23

00:31:28
https://www.youtube.com/watch?v=zOg8DB68gks

Summary

TL;DR: Ved Prakash, a Staff Data Engineer at GitLab, presents how to set up Apache Airflow on Google Kubernetes Engine (GKE) using Helm and Terraform. He explains GitLab's mission and core values, emphasizing collaboration and transparency. The talk covers the benefits of GKE as a managed Kubernetes solution for deploying scalable applications, along with the advantages of Apache Airflow for workflow orchestration. He highlights Terraform's role as infrastructure as code for managing resources and Helm's role in streamlining application management. The discussion also addresses best practices and considerations when implementing such systems, showing how GitLab's data platform team manages its pipelines efficiently.

Takeaways

  • 🚀 Introduction to GKE: A managed Kubernetes service simplifying app deployment.
  • 🔧 Apache Airflow: Open-source platform for scheduling and monitoring workflows.
  • 📦 Terraform: Infrastructure as code tool for managing resources efficiently.
  • ⚙️ Helm: A package manager for Kubernetes, enhancing configuration management.
  • 📈 Scalability: GKE allows automatic scaling based on demand for data workloads.
  • 🔒 Security: Implementation of VPC peering to secure access to data sources.
  • 📊 Best Practices: Emphasis on private cluster and IAM controls for security.
  • 🛠️ CI/CD Pipeline: GitLab's pipeline validates infrastructure changes efficiently.
  • 💡 Performance: Transition to Helm resolved performance issues experienced with Docker.
  • 📚 Integration: Seamless connection between GKE, Helm, Terraform, and Airflow.

Timeline

  • 00:00:00 - 00:05:00

    Ved Prakash, Staff Data Engineer at GitLab, presents on setting up Apache Airflow with Helm and Terraform on Google Kubernetes Engine (GKE) as part of the data platform team. He shares insights from his experience at GitLab, emphasizing the company's mission of collaboration and efficiency.

  • 00:05:00 - 00:10:00

    GitLab is introduced as a comprehensive DevSecOps platform offering CI/CD pipelines, promoting security and efficient code management. The presentation outlines the agenda, focusing on the integration of GKE, Apache Airflow, Helm, and Terraform, and their significance in data operations.

  • 00:10:00 - 00:15:00

    GKE is described as a managed Kubernetes service that simplifies application deployment and management, highlighting its features such as reliability, automatic scaling, and security compliance. Prakash explains how GKE supports their data pipeline requirements effectively.

  • 00:15:00 - 00:20:00

    Apache Airflow, an open-source platform, is covered next, with key features like dynamic workflow execution, scalability, and a rich UI. Prakash favours Airflow for its flexibility and dynamic task management capabilities in scheduling and workflow automation (a sketch of this dynamic task pattern follows the timeline).

  • 00:20:00 - 00:25:00

    Terraform is introduced as an infrastructure-as-code tool allowing users to provision GKE resources efficiently. Its key features include multicloud provisioning, declarative configuration, and state management, facilitating collaborative infrastructure management.

  • 00:25:00 - 00:31:28

    The presentation concludes by linking how GKE, Helm, Terraform, and Airflow work together in GitLab's data operations, emphasizing the ease of deployment, scaling, and the CI/CD pipeline's robustness in validating and managing infrastructure changes. Best practices and considerations for security and efficiency are also highlighted.
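
To make the dynamic-task point from the timeline concrete, below is a minimal, hypothetical sketch of the pattern described in the talk: one DAG file reads a YAML list of tables and creates one task per entry. The file name tables.yml, the extract_table helper, and the DAG id are illustrative assumptions, not GitLab's actual code.

```python
# Hedged sketch: a single Airflow DAG that fans out into one task per table
# listed in a YAML config, mirroring the "add a table to a YAML" workflow
# described in the talk. All names here are illustrative assumptions.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_table(table_name: str) -> None:
    # Placeholder for the real extract/load step (e.g. Postgres -> GCS -> Snowflake).
    print(f"extracting {table_name}")


# Hypothetical config file; adding a table here adds a task to the DAG.
with open("tables.yml") as f:
    tables = yaml.safe_load(f)["tables"]

with DAG(
    dag_id="postgres_export",          # illustrative DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for table in tables:
        PythonOperator(
            task_id=f"extract_{table}",
            python_callable=extract_table,
            op_kwargs={"table_name": table},
        )
```

With a layout like this, adding a table is a one-line YAML change; once the merge request lands, git-sync picks it up and the new task appears in the scheduler, which matches the roughly two-minute turnaround mentioned later in the talk.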

Video Q&A

  • What is the purpose of using GKE?

    GKE helps simplify application deployment, scaling, and management on Google Cloud, providing a robust platform for container orchestration.

  • Why is Helm used in the setup?

    Helm is a package manager for Kubernetes that simplifies deployment and management of applications, allowing for consistent configurations across environments.

  • What advantages does a self-managed Airflow deployment offer over Cloud Composer?

    A self-managed deployment allows for better load balancing and management of resources without being limited to the constraints of a managed service like Cloud Composer.

  • What is the significance of Terraform in this setup?

    Terraform is used for infrastructure as code, enabling the definition and provisioning of infrastructure through declarative configuration, ensuring control and version management.

  • How does Airflow integrate with GKE?

    Airflow is deployed on GKE using Helm, allowing workflows to be orchestrated directly within the Kubernetes environment (see the KubernetesPodOperator sketch after this Q&A list).

  • What were some challenges faced when transitioning to Helm?

    Upgrading from the Airflow 1.x series to 2.x with plain Docker images caused performance problems such as stale objects and hanging DAGs; moving to the official Helm chart resolved them.

  • What best practices were highlighted in the presentation?

    Some best practices include private cluster configurations, VPC peering, least-privilege IAM controls, monitoring with tools such as Prometheus and Grafana, and effective resource allocation.

  • What are some key features of Apache Airflow?

    Airflow is designed for programmatic orchestration of workflows, featuring dynamic task execution, rich UI, and extensive scalability.

  • How does GitLab's CI/CD pipeline enhance deployment?

    The CI/CD pipeline validates changes and ensures that Terraform scripts will not break the infrastructure upon deployment.

  • What are the primary use cases for using GKE and Airflow together?

    Use cases include data pipeline orchestration, continuous integration, and scalable application deployment.
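
As a follow-up to the GKE integration answer above, here is a hedged sketch of how an Airflow DAG can hand each task to GKE through the KubernetesPodOperator, the mechanism the talk credits for on-demand scaling. The namespace, container image, and node-pool label are placeholders, not GitLab's actual configuration.

```python
# Hedged sketch: each task runs in its own GKE pod; the pod is deleted when
# the work finishes, so the node pool can scale back down. The image,
# namespace, and node-pool label below are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="gke_pod_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    export_task = KubernetesPodOperator(
        task_id="export_to_gcs",
        name="export-to-gcs",
        namespace="data-prod",                                 # hypothetical namespace
        image="europe-docker.pkg.dev/example/export:latest",   # placeholder image
        cmds=["python", "export.py"],
        node_selector={"cloud.google.com/gke-nodepool": "data-pipeline-pool"},
        get_logs=True,
        is_delete_operator_pod=True,  # clean up the pod once the task completes
    )
```

Because every task is its own pod, the cluster can scale node pools out while tasks run and back toward zero afterwards, which is what keeps the setup cost-efficient in the way the talk describes.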

Transcript (en)
  • 00:00:10
    we are starting this block uh with Mr Ved
  • 00:00:13
    Prakash staff data engineer at GitLab
  • 00:00:16
    who will talk to us um uh in a
  • 00:00:19
    presentation entitled intro to uh GKE
  • 00:00:22
    setup of airflow with Helm and Terraform
  • 00:00:25
    uh I would like you to give an excited
  • 00:00:27
    welcome to uh Mr Ved
  • 00:00:31
    please the stage is yours
  • 00:00:44
    please hi good morning everyone excited
  • 00:00:49
    and let's talk about some tech so it's
  • 00:00:52
    mostly data so it's me Ved Prakash I work as
  • 00:00:55
    a staff data engineer at GitLab I've
  • 00:00:56
    been there for close to three years and
  • 00:01:00
    it's pretty exciting over there so
  • 00:01:02
    wanted to share some lights of what my
  • 00:01:03
    learning and what we have how we have
  • 00:01:05
    set it up the airflow in our system in
  • 00:01:09
    data platform team with you all the
  • 00:01:11
    wider audience so that's
  • 00:01:14
    me so just a little bit about gitlab as
  • 00:01:17
    a company so our mission everyone can
  • 00:01:20
    contribute so just as we discussed
  • 00:01:24
    before we can contribute you can
  • 00:01:26
    everyone can where everyone across the
  • 00:01:28
    team across the globe can contribute to
  • 00:01:30
    gitlab to make the product better
  • 00:01:33
    efficient and
  • 00:01:35
    smart and gitlab lives with the credit
  • 00:01:37
    values and this is not just the words so
  • 00:01:40
    it comes from the collaboration results
  • 00:01:42
    efficiency diversity iteration and
  • 00:01:45
    transparency of this every letter itself
  • 00:01:50
    is a big handbook page in our gitlab
  • 00:01:54
    terms where you can read more about this
  • 00:01:57
    but we literally live with this values
  • 00:01:59
    and and it shows in our day-to-day work
  • 00:02:02
    every OK hour or every goals what we do
  • 00:02:04
    or every project we deliver it's through
  • 00:02:07
    this value it's transparent we do in
  • 00:02:09
    small iteration our development and it's
  • 00:02:11
    fast and we deliver it and we iterate it
  • 00:02:13
    faster so it's a boring solution but we iterate
  • 00:02:16
    and make it smarter and smarter but we
  • 00:02:19
    meet the results at the
  • 00:02:21
    end so coming to what gitlab is doing
  • 00:02:24
    the gitlab is a DevSecOps platform
  • 00:02:25
    delivered as a single application to
  • 00:02:27
    help you iterate faster and in together
  • 00:02:31
    so it's a Dev secops platform where
  • 00:02:33
    everything can be done in terms of cicd
  • 00:02:37
    pipeline in terms of security of your
  • 00:02:39
    pipeline and your get repository and all
  • 00:02:42
    the
  • 00:02:43
    things now moving to the next to the
  • 00:02:47
    agenda that what we're going to discuss
  • 00:02:48
    today in this small talk time of 20
  • 00:02:52
    minutes so we will be discussing about
  • 00:02:54
    introduction to Google Kubernetes Engine
  • 00:02:56
    Apache airflow Helm and
  • 00:02:58
    terraform how all of this components
  • 00:03:00
    fits together and how the installation
  • 00:03:03
    is done within gitlab data platform
  • 00:03:05
    teams and what's the win of data
  • 00:03:06
    platform on this and little bit on best
  • 00:03:08
    practices and
  • 00:03:11
    considerations so to start with so
  • 00:03:14
    introduction to gke what is gke
  • 00:03:17
    basically most of you are aware because
  • 00:03:19
    of it but just to reiterate it's a
  • 00:03:21
    Google kubernetes engine is a managed
  • 00:03:23
    Kubernetes service that simplifies containerized
  • 00:03:26
    application deployment scaling
  • 00:03:28
    Management on Google Cloud offering a
  • 00:03:30
    robust and efficient platform for
  • 00:03:32
    container
  • 00:03:34
    orchestration what are the key features
  • 00:03:37
    so why we choose so what's the reason of
  • 00:03:39
    using GKE the major one was that it is a managed
  • 00:03:42
    kubernetes so we don't have to manually
  • 00:03:44
    manage the whole kubernetes
  • 00:03:45
    infrastructure it's managed by Google so
  • 00:03:47
    it's more reliable automatic scaling
  • 00:03:50
    which was very much needed for our
  • 00:03:52
    airflow the data pipeline what we have
  • 00:03:54
    built we need a system which is
  • 00:03:56
    automatically scaled on demand whenever
  • 00:03:59
    and to handle all kinds of load what you
  • 00:04:02
    want security and compliance is one
  • 00:04:04
    major thing so when you set up an
  • 00:04:05
    infrastructure for data pipeline it
  • 00:04:07
    takes a lot of time to just cater
  • 00:04:10
    around the security and compliance that
  • 00:04:12
    is meeting the Securities the firewall
  • 00:04:14
    proof is it exposed to the public
  • 00:04:15
    internet are we having a data leak
  • 00:04:17
    happening like this so gke helps a lot
  • 00:04:20
    in that and that's why and the
  • 00:04:22
    integration at the last the key features
  • 00:04:24
    integration helps a lot in terms of
  • 00:04:26
    integrating
  • 00:04:27
    our CL all our data sources which is
  • 00:04:30
    sitting on our Google Cloud platform
  • 00:04:33
    so it can integrate with our other we do
  • 00:04:36
    a VPC peering where we interact with
  • 00:04:38
    production databases so we don't have to
  • 00:04:40
    go over the internet and all it just
  • 00:04:42
    works with the VP our firewall and all
  • 00:04:44
    just we do VPC instead of exposing of
  • 00:04:48
    net uh n over the network to say pass
  • 00:04:50
    the data on and few use cases it's a
  • 00:04:53
    microservice deployment continuous
  • 00:04:55
    integration cicd and a scalable
  • 00:04:58
    application so microservice
  • 00:04:59
    deployment is a service which you have
  • 00:05:00
    web servers U so what do you say
  • 00:05:04
    middleware and database everything it
  • 00:05:07
    can it can be used for
  • 00:05:09
    that coming to the next what's Apache
  • 00:05:12
    airflow Apache airflow is an open source
  • 00:05:14
    platform designed to programmatically
  • 00:05:16
    author schedule and monitor workflows so
  • 00:05:19
    this is a common tool used for
  • 00:05:20
    scheduling we have seen lots
  • 00:05:22
    of tools for doing scheduling like
  • 00:05:25
    Control-M, Skybot and all but airflow is
  • 00:05:28
    one of the favorite for me because it
  • 00:05:30
    helps a lot it's open source first thing
  • 00:05:32
    and it's a it can be scaled to whatever
  • 00:05:34
    design we want so what's the key
  • 00:05:37
    features again for this so it comes with
  • 00:05:39
    a directed acyclic graph that's called DAGs
  • 00:05:43
    extensibility the way it's helped to
  • 00:05:45
    extends whole airflow pipeline Dynamic
  • 00:05:48
    workflow execution so you we have one
  • 00:05:52
    dag which can create like 200 tasks
  • 00:05:54
    dynamically on demand so whenever we
  • 00:05:57
    just want to add one additional data uh
  • 00:05:59
    table we just add to a yaml and airflow
  • 00:06:02
    takes care of it by creating a
  • 00:06:03
    task for that rich UI which I like it
  • 00:06:07
    and the logging capability where you can
  • 00:06:08
    log it you can see the logs in the UI
  • 00:06:10
    but also you can see it back in the
  • 00:06:12
    server scalability so you can have a
  • 00:06:15
    multiple web server to handle your loads
  • 00:06:18
    incoming loads coming in from uh using
  • 00:06:21
    the load balancer you can scale your web
  • 00:06:23
    server you can scale your scheduler you can
  • 00:06:25
    scale your database so that helps in
  • 00:06:28
    this
  • 00:06:30
    the pro top use cases which came to my
  • 00:06:32
    mind was Data pipeline orchestration
  • 00:06:34
    that's what we generally use it for ETL
  • 00:06:37
    extract transform load process not doing
  • 00:06:39
    the ETL just scheduling it workflow
  • 00:06:43
    Automation in diverse
  • 00:06:47
    industry what's the terraform then so we
  • 00:06:50
    talked about gke we talked about airf
  • 00:06:52
    flow so what does the terraform do terraform
  • 00:06:54
    is infrastructure as code for GKE and
  • 00:06:57
    it's open source infrastructure tool
  • 00:06:58
    that enables user to Define and
  • 00:07:00
    provision infrastructure using a
  • 00:07:02
    declarative configuration
  • 00:07:04
    language the key features as you all
  • 00:07:07
    know is infrastructure as code so that's
  • 00:07:08
    good for versioning control and all
  • 00:07:11
    multicloud provisioning so it can be
  • 00:07:12
    work on Google Azure AWS declarative
  • 00:07:16
    configuration language so it's very
  • 00:07:18
    simply configured easy to configure and
  • 00:07:21
    make it easily readable for anyone to
  • 00:07:23
    apply it plan and apply workflow this is
  • 00:07:25
    the best part you do a terraform change
  • 00:07:28
    you can validate it you can plan it
  • 00:07:30
    that what's the plan how will this go
  • 00:07:32
    and Implement in the system will it
  • 00:07:35
    destroy something what it will change
  • 00:07:37
    what it will add and it's easily
  • 00:07:39
    readable and that helps you a lot to
  • 00:07:41
    judge that shall I go and apply this
  • 00:07:42
    changes to the production or not you can
  • 00:07:45
    figure it out if something is wrong if
  • 00:07:47
    your state if your file has changed the
  • 00:07:50
    target server has changed manually it
  • 00:07:52
    will throw you an error that something
  • 00:07:54
    has gone wrong your plan is not looking
  • 00:07:55
    what you have done the modification so
  • 00:07:57
    it gives you a more control and and this
  • 00:07:59
    is the best way I feel to handle our
  • 00:08:02
    infrastructure and the State Management
  • 00:08:04
    so this is something which helps like
  • 00:08:06
    you have a team of five developers and
  • 00:08:09
    everyone doing a terraform chain how do
  • 00:08:11
    you keep the things in sync that I have
  • 00:08:14
    done a change added a node pool to the
  • 00:08:16
    cluster and you are adding another one
  • 00:08:18
    then how do you do state state file is
  • 00:08:20
    the one which helps us to know pull up
  • 00:08:23
    the configuration of the GK from the
  • 00:08:27
    remote uh our GCS bucket that's where we
  • 00:08:29
    store it and to know that what is the
  • 00:08:31
    current state of the GK looks like was
  • 00:08:33
    last one and then what the changes will
  • 00:08:36
    do will it modify will it destroy or
  • 00:08:39
    will it uh create a new add something
  • 00:08:42
    new to it so that's the State Management
  • 00:08:44
    that's one of the more useful key
  • 00:08:46
    features where you can manage from a
  • 00:08:48
    point where you can roll back where you
  • 00:08:50
    can control
  • 00:08:52
    it the use cases coming to the use cases
  • 00:08:54
    it's a provisioning servers Network
  • 00:08:57
    infrastructures application deployments
  • 00:08:59
    so it's used for everything but we
  • 00:09:03
    have extended it little bit so we can
  • 00:09:04
    install airflow using terraform just
  • 00:09:07
    without having Helm but we used Helm a
  • 00:09:11
    package manager for kubernetes because
  • 00:09:12
    it makes our life easier it's a package
  • 00:09:16
    manager for kubernetes application
  • 00:09:18
    simplifying the deployment and
  • 00:09:19
    management of containerized
  • 00:09:22
    application so what Helm helps us in
  • 00:09:25
    that we have a consistent deployment
  • 00:09:27
    across all the environments so if we have
  • 00:09:29
    five GKE clusters one for test one for
  • 00:09:31
    Dev we can use the same feature same
  • 00:09:34
    package everywhere so we don't have to
  • 00:09:37
    have a different so we keep our
  • 00:09:38
    environments in sync that nothing is different
  • 00:09:42
    in the test or nothing is different in
  • 00:09:43
    the
  • 00:09:44
    prod it is the simplified
  • 00:09:46
    configuration so you have a basic Helm
  • 00:09:49
    configuration which comes with the
  • 00:09:50
    airflow and then on top of it we modify
  • 00:09:53
    our configuration
  • 00:09:55
    like having a different secret keys and
  • 00:09:57
    all uh Fernet keys so that we keep our
  • 00:10:00
    web server secure from the outside world
  • 00:10:02
    uh not using the default setup so
  • 00:10:04
    there's lots of we use Postgres database
  • 00:10:06
    in the back and so all those things is
  • 00:10:08
    being useful over here is the helm
  • 00:10:10
    package we override it with those values
  • 00:10:13
    with the config management and the
  • 00:10:15
    dependency management of the packages so
  • 00:10:16
    you need to install the database first
  • 00:10:18
    then the sidecars first then
  • 00:10:22
    the scheduler and then the web server
  • 00:10:23
    so this all taken care by the helm
  • 00:10:25
    rather than us if I had to do it using
  • 00:10:28
    terraform just using kubectl I would
  • 00:10:31
    have to ensure that everything goes one
  • 00:10:32
    on the pieces otherwise it will just go
  • 00:10:34
    haywire it will not work as expected so use
  • 00:10:38
    cases is microservice deployment in this
  • 00:10:40
    case we are using it for
  • 00:10:43
    airflow so integration of airflow with GKE
  • 00:10:47
    using Helm and Terraform why use GKE for
  • 00:10:51
    airflow why not anything else
  • 00:10:54
    there's lots of benefits for this the
  • 00:10:58
    major one is that it's scalable so it
  • 00:10:59
    can scale to your demand you can have
  • 00:11:01
    a different for different workloads you
  • 00:11:04
    will have different node pools dedicated
  • 00:11:06
    so you can handle a data science model
  • 00:11:08
    in a different node pool with a bigger
  • 00:11:10
    machines with a bigger gpus in the same
  • 00:11:12
    cluster and you can have a small
  • 00:11:13
    pipeline also running in the same
  • 00:11:15
    cluster on the smaller node with a node
  • 00:11:17
    pool with a smaller machine so you don't
  • 00:11:20
    pay way high for everything uh like
  • 00:11:24
    running on everything on the bigger
  • 00:11:26
    machine so that's the one on this then
  • 00:11:30
    the next one for this uh why for the
  • 00:11:32
    Apache and GKE is on the scalability uh
  • 00:11:36
    is done then the security it's secure we
  • 00:11:39
    host our air flow on the VPC private VPC
  • 00:11:42
    and that's where we keep it safe from
  • 00:11:44
    the outside world our air flow
  • 00:11:47
    instance the third Point comes to this
  • 00:11:50
    for the airflow on the
  • 00:11:53
    GK this uh why we are using it uh why it
  • 00:11:56
    should be used uh for this is
  • 00:12:00
    ease of Maintenance and ease of uh
  • 00:12:02
    maintaining the whole infrastructure
  • 00:12:04
    within the GK
  • 00:12:06
    platform what this slide's gone a little haywire
  • 00:12:10
    sorry about
  • 00:12:14
    that so why Helm charts for airflow so
  • 00:12:17
    that's the question so as we asked why GKE
  • 00:12:20
    for air flow and why should we use Helm
  • 00:12:22
    charts for airflow the ease is that it's
  • 00:12:24
    a standardized packaging it will help
  • 00:12:27
    you the consistent deployment across all
  • 00:12:29
    your different environments so that's one
  • 00:12:30
    of the major thing uh to have main
  • 00:12:33
    benefit for Helm is to keep the
  • 00:12:35
    consistent deployments simplified
  • 00:12:37
    configuration so you have a simple
  • 00:12:39
    configuration so you have a basic config
  • 00:12:41
    and then you override so you can
  • 00:12:42
    override your configuration based on
  • 00:12:44
    your requirement for a test environment
  • 00:12:46
    you can have a different setup you can
  • 00:12:48
    have a different config management for
  • 00:12:50
    production you can have a different for
  • 00:12:52
    your performance test environment like
  • 00:12:54
    that so that's the thing Version Control
  • 00:12:57
    and roll back that's another benefit of
  • 00:13:00
    Helm charts that it helps you to
  • 00:13:02
    configure your control your version and
  • 00:13:04
    a version in the Git repository and
  • 00:13:06
    that's easily
  • 00:13:08
    managed and the reusability of course so
  • 00:13:12
    it's a sharable configuration you can
  • 00:13:13
    configure it with you can share it with
  • 00:13:15
    all your team members and anyone can
  • 00:13:17
    contribute to that to help it so that's
  • 00:13:20
    the another thing why to use Helm for
  • 00:13:22
    the airflow we have been using
  • 00:13:24
    before Helm normal deployment but this
  • 00:13:28
    was when the Apache officially
  • 00:13:30
    released the helm chart it made our life
  • 00:13:32
    much easier because the upgrading to the
  • 00:13:34
    air flow and all it was much more
  • 00:13:35
    smoother now rather than manually doing
  • 00:13:37
    the upgrade earlier and why Terraform
  • 00:13:41
    module for GKE so with the terraform module
  • 00:13:43
    you can have everything configured just
  • 00:13:47
    outside not doing a manually so in case
  • 00:13:50
    something goes wrong like accidentally
  • 00:13:53
    you drop the cluster you have a
  • 00:13:55
    terraform module which can just bring
  • 00:13:56
    back the cluster in like 30 minutes with
  • 00:13:59
    all the big machines and all the setups
  • 00:14:01
    you run the helm charts on top of it and
  • 00:14:03
    the whole machine is back for nearly
  • 00:14:05
    like a 45 minutes of time with all the
  • 00:14:07
    configuration and all the things in
  • 00:14:10
    place moving to
  • 00:14:12
    next now let's connect the dots between
  • 00:14:15
    the gke plus Helm and the terraform plus
  • 00:14:17
    air flow so we provision the
  • 00:14:20
    infrastructure with terraform so that
  • 00:14:22
    sets the base terraform sets the
  • 00:14:25
    foundation it bootstraps the kubernetes
  • 00:14:26
    Clusters with all necessary
  • 00:14:29
    firewall rules and all the things in
  • 00:14:31
    place all the VPC pairing and all done
  • 00:14:34
    through the terraform and then Kubernetes
  • 00:14:37
    cluster orchestration with GKE so gke
  • 00:14:39
    ensures the seamless operations between
  • 00:14:43
    the parts of the web server the
  • 00:14:45
    scheduler the PgBouncers the git
  • 00:14:49
    repository and git-sync sidecars as well
  • 00:14:53
    so it that's the done by the terraform
  • 00:14:56
    and then the kubernetes plays the role
  • 00:14:57
    of that and and then we do the package
  • 00:14:59
    management with Helm so Helm charts
  • 00:15:01
    defines the airflow configuration as I
  • 00:15:03
    said
  • 00:15:05
    it makes the server like spinning up the
  • 00:15:08
    server becomes much more easier if
  • 00:15:10
    anything goes wrong you can easily track
  • 00:15:11
    it down and when you do the helm chart
  • 00:15:14
    deploying on GK this is the favorite
  • 00:15:17
    part for me it's a smooth deployment you
  • 00:15:20
    just run it apply and then just sit back
  • 00:15:23
    and relax to see that it's going good or
  • 00:15:25
    it's what's uh happening and if anything
  • 00:15:28
    breaks you get the proper error
  • 00:15:31
    so something is there on the
  • 00:15:35
    screen yeah and last comes so when you
  • 00:15:38
    deploy the helm charts your airflow is
  • 00:15:40
    up and running and that's the last part
  • 00:15:42
    integrated workflow and that is the
  • 00:15:44
    orchestration with the airf flow so
  • 00:15:46
    integrated approach the integration with
  • 00:15:49
    with the air flow helps you with all
  • 00:15:51
    your job dat workloads and everything
  • 00:15:55
    and with this at the end you have a air
  • 00:15:59
    flow up and running which has all your
  • 00:16:00
    active DAGs with the git sync we are
  • 00:16:03
    continuously syncing our git
  • 00:16:04
    repositories with the kubernetes server
  • 00:16:07
    over there with the airf flow so any
  • 00:16:09
    changes we do to our Git repository
  • 00:16:12
    our analytics data repository after
  • 00:16:14
    testing is done when it's merged to the
  • 00:16:16
    master it instantly like in a gap of 2
  • 00:16:18
    minutes it syncs with the production and
  • 00:16:21
    you can see that dag in the production
  • 00:16:23
    over there you can run it instantly by
  • 00:16:26
    default it's off but you can run it so
  • 00:16:28
    with this all the four things in place
  • 00:16:31
    you have an infrastructure in place with
  • 00:16:34
    the combination which is safe secure uh
  • 00:16:37
    reliable scalable to the core where you
  • 00:16:39
    can scale it to the amount and it's a
  • 00:16:41
    very cost optimal solution so when we
  • 00:16:44
    run a data pipeline sometimes become
  • 00:16:46
    very costly process if you use like a
  • 00:16:49
    standard VMs and you don't have it if you
  • 00:16:51
    need to scale it then you manually go
  • 00:16:53
    and add another VM to it and then you
  • 00:16:54
    have to shut it down but gke takes care
  • 00:16:56
    of Auto scaling up down everything for
  • 00:16:59
    that if you need to horizontally scale
  • 00:17:01
    you can scale as much as you want on
  • 00:17:03
    demand and then you scale down to zero
  • 00:17:05
    so we don't run all our node
  • 00:17:08
    pools 24 hours we just run the standard
  • 00:17:12
    node pools for airf flow for like 24
  • 00:17:14
    hours because that's where the air flow
  • 00:17:15
    is hosted but remaining all node pools
  • 00:17:18
    are just scalable from 0 to 5 0 to 10
  • 00:17:20
    based on the demand we do handle our
  • 00:17:23
    data loads for data science data
  • 00:17:25
    scientists we run data models our data
  • 00:17:28
    pipelines the DBT the heavy data
  • 00:17:30
    pipelines which exports like close to
  • 00:17:33
    100 GB of data every day and these all
  • 00:17:35
    happens using air flow and this is the
  • 00:17:39
    benefit of the GK that it helps a lot to
  • 00:17:41
    scale
  • 00:17:42
    up and the good thing with this all of
  • 00:17:46
    this things coming together is that the
  • 00:17:48
    gitlab cicd pipeline the cicd pipeline
  • 00:17:51
    is so robust that when you modify your
  • 00:17:54
    yaml file
  • 00:17:56
    for uh terraform or for Helm it
  • 00:17:59
    validates it does the proper validation
  • 00:18:01
    and the CI pipeline passes it
  • 00:18:03
    revalidates that your plan will work
  • 00:18:05
    perfectly or not and if it will it will
  • 00:18:07
    give you output what the plan looks like
  • 00:18:09
    and based on that you can take a call
  • 00:18:11
    should I deploy it to production or not
  • 00:18:13
    so that's the whole connection of the
  • 00:18:16
    dots between the GK Helm uh terraform
  • 00:18:19
    and airflow moving to
  • 00:18:22
    next so how it is done within gitlab so
  • 00:18:25
    GKE cluster is provisioned through
  • 00:18:27
    terraform using Helm charts and we have the
  • 00:18:30
    air flow installed through the helm
  • 00:18:32
    chart so we have two name space within
  • 00:18:34
    the same cluster prod and
  • 00:18:37
    testing we have seven node pools
  • 00:18:39
    different machine type for each load as
  • 00:18:41
    I said so we have for bigger machines
  • 00:18:44
    with big gpus and CPUs for data science
  • 00:18:47
    models we have big machines for heavy
  • 00:18:50
    data pipeline where we export like in
  • 00:18:52
    one where we run like hundreds of jobs
  • 00:18:57
    within within within an hour so at that
  • 00:18:59
    point of time we need a bigger machine
  • 00:19:01
    with a scalable capability so it can
  • 00:19:02
    scale as much as you want remote state
  • 00:19:05
    for any change required for the GK
  • 00:19:07
    cluster so we maintain our as said
  • 00:19:11
    earlier we maintain our state file in
  • 00:19:13
    the GCS which is properly secured and
  • 00:19:15
    save not open to the public and that
  • 00:19:17
    helps us to maintain our Version Control
  • 00:19:19
    of the whole of the GK infrastructure as
  • 00:19:22
    well gitlab cicd pipeline to validate
  • 00:19:25
    the changes done to the terraform script
  • 00:19:27
    this ensures the changes will not
  • 00:19:29
    break the terraform apply so this is the
  • 00:19:31
    most critical part which helps a lot in
  • 00:19:34
    order
  • 00:19:35
    to not destroy your whole GK setup of
  • 00:19:39
    the air flow because this is the bread
  • 00:19:41
    and butter for the data Engineers if
  • 00:19:43
    that air flow goes Haywire the whole
  • 00:19:45
    data pipeline goes haywire and then you
  • 00:19:47
    have lots of questions to be
  • 00:19:50
    answered air flow installed using Helm
  • 00:19:53
    charts so we use airflow version
  • 00:19:55
    2.5.3 using Helm charts for airflow
  • 00:19:59
    which will bootstrap an airflow
  • 00:20:00
    deployment on kubernetes cluster using
  • 00:20:02
    Helm package managers we use cloud SQL
  • 00:20:06
    Postgres instances so that we have
  • 00:20:08
    configured our basic config to use cloud
  • 00:20:11
    SQL Postgres and we have a sidecar used
  • 00:20:14
    within the kubernetes pod which do the
  • 00:20:17
    git sync with analytics repository so
  • 00:20:19
    that's where this is the one which takes
  • 00:20:21
    2 minutes of time to sync the git
  • 00:20:23
    repository keeps pinging the git master branch
  • 00:20:27
    to see that any new commits has happened
  • 00:20:29
    and if something has happened it will
  • 00:20:30
    pull it and you can see that dag
  • 00:20:31
    immediately into the air
  • 00:20:34
    flow and we have modified our web
  • 00:20:36
    server's secret key and a Fernet key these
  • 00:20:38
    are the default Keys which comes with
  • 00:20:40
    the airflow installation but it's good
  • 00:20:42
    to have your own if you're are using it
  • 00:20:44
    because this keeps your servers safe and
  • 00:20:46
    secure because with default one people
  • 00:20:49
    can decode it and they can get an access
  • 00:20:52
    to the airflow
  • 00:20:56
    UI so how this benefit the data platform
  • 00:20:59
    team managing the data pipeline so we
  • 00:21:02
    currently have 88 active DAGs with these
  • 00:21:05
    88 dags we are able to run 1200 plus
  • 00:21:08
    task every 24 hours so this is a huge
  • 00:21:11
    number in terms that there is a dag
  • 00:21:13
    which runs like 300 tasks and it runs
  • 00:21:16
    for one and a half hour and these 300
  • 00:21:18
    tasks is nothing but polling Postgres
  • 00:21:20
    database with 300 tables and this is
  • 00:21:24
    done with the help of node
  • 00:21:26
    pools and the airf flow pools as well so
  • 00:21:29
    slots within the air flow we use airflow
  • 00:21:32
    slots to say that use a concurrency you
  • 00:21:35
    can run 10 task or 15 task within this
  • 00:21:37
    particular airflow dag in a concurrency
  • 00:21:40
    and if you set that to too high then
  • 00:21:43
    your node pool should be able to support
  • 00:21:44
    it so that comes the GK where it scales
  • 00:21:48
    horizontally to have like six seven
  • 00:21:50
    clusters to handle that load and that's
  • 00:21:53
    where one task itself creates like one
  • 00:21:56
    dag itself creates like 200 task and
  • 00:21:59
    that is through yaml file so we have
  • 00:22:01
    just 88 python files and that 88 python
  • 00:22:05
    files is responsible to create more than
  • 00:22:07
    600 to 700 task in the
  • 00:22:10
    airl improving the workflow task
  • 00:22:13
    dynamism with air flow since as I said
  • 00:22:15
    it's a configurable so once you do a
  • 00:22:18
    modification and it's a makes life very
  • 00:22:20
    easy because as a data Engineers we are
  • 00:22:24
    always overwhelmed with lots of work
  • 00:22:26
    lots of our own pipeline so someone
  • 00:22:28
    needs to add a new table to Postgres a
  • 00:22:30
    person who is requesting can himself add
  • 00:22:32
    it and add it to the yaml file and we
  • 00:22:35
    can just see validate it it looks good
  • 00:22:37
    and we do the merge request and that
  • 00:22:39
    file that table is present in the
  • 00:22:41
    pipeline in after 2 minutes so it
  • 00:22:43
    improves enhances the cicd timeline and
  • 00:22:46
    the deployment P timeline for adding a
  • 00:22:48
    new table so kubernetes pod operators to
  • 00:22:52
    schedule Dynamic workload so airf flow
  • 00:22:55
    comes with the kubernetes Pod operators
  • 00:22:57
    and that is the one which we use to
  • 00:22:59
    schedule our workloads in whole of the
  • 00:23:01
    GKE setup so every task spun through air
  • 00:23:06
    flow creates a pod in the gke and
  • 00:23:09
    that pod itself is responsible to do
  • 00:23:11
    that whatever that task was supposed to
  • 00:23:13
    do so right from establishing the
  • 00:23:15
    connections to the database pulling the
  • 00:23:17
    data pushing it to GCS loading to
  • 00:23:19
    Snowflake and once all is done the pod
  • 00:23:21
    gets Auto destroyed and that's and then
  • 00:23:23
    the machine starts scaling back so
  • 00:23:25
    kubernetes pod operator is like a lifesaver
  • 00:23:28
    for us because it helps us to scale the
  • 00:23:30
    whole
  • 00:23:32
    environment on demand node provisioning
  • 00:23:34
    with terraform so with terraform if a
  • 00:23:37
    team comes with saying that I need a
  • 00:23:39
    machine a data science team mostly I
  • 00:23:42
    need a machine with a bigger spec now
  • 00:23:44
    and if we were going through UI and we
  • 00:23:47
    don't have didn't have terraform the
  • 00:23:49
    problem would be that how do you manage
  • 00:23:53
    this uh you have to again go through how
  • 00:23:55
    do you Version Control so with Terraform
  • 00:23:57
    it's easy easy okay you need a machine
  • 00:23:59
    let's go modify add the spec set it to this
  • 00:24:02
    go from highmem-4 to highmem-8 like that
  • 00:24:06
    merge it's validated within half an hour
  • 00:24:08
    and it's boom ready to deployed in
  • 00:24:11
    production you have new machines present
  • 00:24:13
    to do minimal downtime typically under
  • 00:24:15
    45 minutes in the event of Disaster
  • 00:24:18
    Recovery scenario so that's what I was
  • 00:24:20
    telling in the start with this all the
  • 00:24:22
    four setup if we destroy the whole of
  • 00:24:25
    the kubernetes everything and everything
  • 00:24:28
    just goes wipe off we can spin it back
  • 00:24:30
    in 45 minutes so with this it makes like
  • 00:24:34
    the system is in control so in a case of
  • 00:24:37
    worst to the worst we are back in 45
  • 00:24:39
    minutes to resume the pipeline and which
  • 00:24:40
    helps our data pipeline to be more
  • 00:24:43
    secure and reliable for the end
  • 00:24:47
    users the best practices and
  • 00:24:49
    consideration so security best practice
  • 00:24:51
    private cluster configuration that's
  • 00:24:53
    what I would recommend VPC pairing so if
  • 00:24:56
    you have a another database another
  • 00:24:58
    sources which is in the same Google
  • 00:24:59
    Cloud you try to do VPC pairing rather
  • 00:25:02
    than exposing it over the internet and
  • 00:25:05
    fetching data identity and access management
  • 00:25:07
    control which is something like to give
  • 00:25:09
    the least privileges to that so that
  • 00:25:12
    only people with minimal access work
  • 00:25:15
    with the minimal access over there node
  • 00:25:17
    pool isolation for different data load
  • 00:25:19
    however different node pool that will
  • 00:25:21
    help you for different type of work just
  • 00:25:23
    have a different node so this will keep
  • 00:25:25
    your cost down and you will not be
  • 00:25:27
    running a bigger machine for a smaller
  • 00:25:28
    task and a smaller machine for a bigger
  • 00:25:30
    task securing Secrets keep so everything
  • 00:25:33
    is in the vault so use Vault to secure your
  • 00:25:37
    Secrets a scalability consideration
  • 00:25:40
    horizontal pod autoscaling so that's one
  • 00:25:43
    which is recommended database is scaling
  • 00:25:45
    as well so you can scale your database
  • 00:25:48
    uh as much as you want task parallelism
  • 00:25:50
    so concurrency within the airflow helps
  • 00:25:52
    you a lot to how do you run your task in
  • 00:25:56
    parallel with each other at one point of
  • 00:25:58
    time resource request limits persistent
  • 00:26:01
    the storage consideration this is
  • 00:26:03
    something where you have like the as GK
  • 00:26:06
    you should have a PVC volume attached
  • 00:26:08
    to it where you can store all your logs
  • 00:26:10
    so in case gke goes off and when you
  • 00:26:13
    restore it back you still have access to
  • 00:26:15
    those logs to see for gke node pools
  • 00:26:19
    for different types of load for
  • 00:26:21
    different streams of team to work with
  • 00:26:24
    monitoring and logging strategies
  • 00:26:26
    leverage kubernetes monitoring
  • 00:26:28
    Solutions Prometheus and grafana which
  • 00:26:31
    is very useful alerting and notification
  • 00:26:34
    channels we use a slack for all our
  • 00:26:36
    airflow failures happening through and
  • 00:26:38
    we monitor our web server through
  • 00:26:41
    Prometheus and grafana and it goes off
  • 00:26:44
    by any time we get an instant alert and
  • 00:26:47
    we use airflow metrics to monitor it and
  • 00:26:51
    this is me you can just find me and look
  • 00:26:55
    for me and just reach out to me and
  • 00:26:57
    there's some additional resources so we
  • 00:26:59
    have gitlab handbook for information
  • 00:27:01
    about node pools and name spaces how we
  • 00:27:03
    have set it up airflow infrastructure
  • 00:27:05
    how it is present over there and our
  • 00:27:07
    gitlab data analytics repo or our data
  • 00:27:09
    bags where you can see all the dag bags
  • 00:27:11
    all the dags how we have done it and
  • 00:27:14
    thank you and any question
  • 00:27:19
    please thank you very much um uh Ved now
  • 00:27:23
    we have a couple of minutes for
  • 00:27:26
    questions please question here and
  • 00:27:33
    there hi
  • 00:27:35
    uh thanks for your uh
  • 00:27:39
    presentation uh I was wondering when uh
  • 00:27:43
    choosing Helm have you considered
  • 00:27:45
    Kustomize as an
  • 00:27:48
    option no question I we just looked at
  • 00:27:52
    Helm because in our organization we were
  • 00:27:53
    using a lot of Helm config management uh
  • 00:27:56
    not for our infrastructure but our the
  • 00:27:57
    SRE team they use Helm config
  • 00:27:59
    management a lot so we took the
  • 00:28:01
    knowledge from them and embed it in our
  • 00:28:04
    so that to keep it consistent uh
  • 00:28:07
    across the organization so we haven't
  • 00:28:09
    looked do that but yeah I it's uh if
  • 00:28:13
    chance were there we would have
  • 00:28:15
    evaluated it again but Helm was the
  • 00:28:17
    choice because the Apache airflow we
  • 00:28:19
    were waiting for it to release a Helm
  • 00:28:20
    chart and when it released with the
  • 00:28:22
    series 2 it was good because it was
  • 00:28:24
    helpful for version management and all
  • 00:28:26
    so very easy for for the upgrade of air
  • 00:28:28
    flow I see and uh sort of related to
  • 00:28:31
    that do you see any advantage having
  • 00:28:34
    your own airflow deployment compared to
  • 00:28:37
    maybe Cloud composer which could be kind
  • 00:28:39
    of simpler yes we see an advantage we
  • 00:28:43
    can manage our own we see advantage in
  • 00:28:45
    terms we want to uh our load balancing
  • 00:28:50
    and our whatever the demand what we have
  • 00:28:53
    in terms of the data and the node pools
  • 00:28:55
    and the Machine size it's much more more
  • 00:28:57
    easier for us to manage it over here on
  • 00:28:59
    this side Cloud composer is good but
  • 00:29:03
    it's a lot of pain for me or to think
  • 00:29:07
    that to sync my G repository every time
  • 00:29:09
    I do a modification so the cicd when it
  • 00:29:11
    works with the GK with the gitlab that
  • 00:29:13
    components helps a lot in our data
  • 00:29:15
    pipeline to make it much more scalable
  • 00:29:17
    so that's where we look at the leverage
  • 00:29:19
    at how gitlab works with the GK more
  • 00:29:22
    efficiently with all our cicd pipelines
  • 00:29:25
    set up in place so that's where so yeah
  • 00:29:28
    Cloud composer was an option when we
  • 00:29:29
    looking at manage service and then we
  • 00:29:32
    still stuck with this because we were
  • 00:29:33
    able to manage all our loads over here
  • 00:29:36
    so thank you all right we have space for
  • 00:29:39
    one more
  • 00:29:41
    question yes
  • 00:29:43
    please thank you hi uh thank you for
  • 00:29:46
    your time um the question was regarding
  • 00:29:48
    the docker itself um used with the
  • 00:29:51
    kubernetes what's the key uh difference
  • 00:29:54
    between using this approach and simply
  • 00:29:56
    using um officially available airflow
  • 00:29:59
    Docker image with the nodes and the
  • 00:30:00
    kubernetes yes so we were using Docker
  • 00:30:03
    before so we moved to helm with the
  • 00:30:05
    series 2 we were till airflow 1.5 we
  • 00:30:08
    were using Docker images but when we
  • 00:30:10
    were trying to do an upgrade from 1.5 to
  • 00:30:12
    two Docker image was creating a lots of
  • 00:30:16
    what do you say stale objects our
  • 00:30:18
    hanging DAGs it was not a performance
  • 00:30:20
    oriented for us it was working good for
  • 00:30:22
    other people but we saw that couple of
  • 00:30:24
    blogs saying that the DAGs are hanging
  • 00:30:26
    and there is task which is wa waiting
  • 00:30:28
    for to be executed for 24 hours the
  • 00:30:30
    docker images didn't work so good for us
  • 00:30:32
    but when we moved to helm it was working
  • 00:30:35
    good I'm not sure what was the
  • 00:30:37
    difference but at that point of time for
  • 00:30:39
    me the performance was the key to move
  • 00:30:41
    towards the helm but I had a problem
  • 00:30:42
    with till 1.5 it was super good it was
  • 00:30:46
    very efficient it was with the series 2
  • 00:30:48
    it didn't work good for
  • 00:30:49
    us great thank you so basically in
  • 00:30:52
    future it might come back as a possible
  • 00:30:54
    solution yes yes it is it is still an
  • 00:30:57
    option we use Docker for our local
  • 00:30:59
    testing so we still use the docker
  • 00:31:01
    images for our local testing for this
  • 00:31:03
    thank you all right uh Ved thank you
  • 00:31:07
    very much for this insightful
  • 00:31:09
    presentation I would like to remind
  • 00:31:11
    everyone that uh you can network and
  • 00:31:13
    connect during lunch perhaps for future
  • 00:31:16
    questions and this is a certificate of
  • 00:31:18
    appreciation from the organizing team
  • 00:31:19
    thank you thank you very
  • 00:31:21
    much thank
  • 00:31:26
    you
Tags
  • GitLab
  • Airflow
  • GKE
  • Terraform
  • Helm
  • DevSecOps
  • Data Pipeline
  • CI/CD
  • Infrastructure as Code
  • Kubernetes