AWS Glue Tutorial for Beginners [NEW 2024 - FULL COURSE]

00:53:03
https://www.youtube.com/watch?v=ZvJSaioPYyo

Summary

TL;DR: This video tutorial by Johnny Chivers provides an updated overview of AWS Glue, focusing on its features and functionalities for ETL processes. It covers the AWS Glue Data Catalog, the creation of databases and tables, and the building of ETL jobs using both visual and code-based methods. The tutorial includes practical steps for setting up permissions, uploading data to S3, and utilizing crawlers to automate table creation. Additionally, it discusses data quality monitoring and scheduling options for ETL jobs, making it a comprehensive guide for both beginners and those looking to refresh their knowledge of AWS Glue.

Key Takeaways

  • 🔍 AWS Glue is a fully managed ETL service.
  • 📚 The Glue Data Catalog stores metadata about data sources.
  • ⚙️ ETL jobs can be created visually or through code (see the PySpark sketch after this list).
  • 🤖 Crawlers automate the discovery of data and table creation.
  • 📂 Partitions improve data organization and query efficiency.
  • ✅ Data quality features help monitor and enforce data standards.
  • 🖥️ Glue DataBrew allows for visual data preparation without coding.
  • ⏰ ETL jobs can be scheduled using triggers in AWS Glue.
  • 🔗 AWS Glue integrates with various AWS services for data processing.
  • 📈 The tutorial is suitable for both beginners and experienced users.
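
For the code-based route mentioned in the list above, a Glue job is ultimately just a script run on the Glue engine. Below is a minimal PySpark sketch of the kind of job built later in the video, reading the raw customers table from the Data Catalog and writing it back to S3 as Parquet; the database, table, and bucket names are assumptions based on the tutorial setup, so adjust them to your own environment.

  # Minimal AWS Glue (PySpark) job sketch -- names are assumptions from the tutorial setup
  import sys
  from awsglue.utils import getResolvedOptions
  from pyspark.context import SparkContext
  from awsglue.context import GlueContext
  from awsglue.job import Job

  args = getResolvedOptions(sys.argv, ["JOB_NAME"])
  glue_context = GlueContext(SparkContext())
  job = Job(glue_context)
  job.init(args["JOB_NAME"], args)

  # Extract: read the raw customers table registered in the Glue Data Catalog
  customers = glue_context.create_dynamic_frame.from_catalog(
      database="raw_data", table_name="customers_raw"
  )

  # Load: write the records back to S3 as Parquet
  glue_context.write_dynamic_frame.from_options(
      frame=customers,
      connection_type="s3",
      connection_options={"path": "s3://your-bucket/processed-data/customers_processed/"},
      format="parquet",
  )

  job.commit()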

Timeline

  • 00:00:00 - 00:05:00

    Introduction to AWS Glue, highlighting the need for an updated tutorial due to changes in the AWS console and new features in Glue since the last video.

  • 00:05:00 - 00:10:00

    Overview of AWS Glue as a fully managed ETL service, explaining its components like the Glue Data Catalog and the flexible scheduler, and the importance of understanding its functionality for data engineering.

  • 00:10:00 - 00:15:00

    Instructions on setting up the AWS environment, including downloading necessary files from GitHub and using CloudFormation scripts to create required resources like S3 buckets and IAM roles.

  • 00:15:00 - 00:20:00

    Detailed steps on creating folders in the S3 bucket for organizing raw and processed data, and uploading CSV files for customers and orders, maintaining the folder structure as specified.

  • 00:20:00 - 00:25:00

    Explanation of the Glue Data Catalog as a persistent metastore for metadata, emphasizing that it does not store physical data but rather references to data locations and schemas.

  • 00:25:00 - 00:30:00

    Demonstration of creating a database in the Glue Data Catalog and manually adding a table for customers, including defining the schema and data types for the table. (A boto3 sketch of the equivalent catalog calls follows this timeline.)

  • 00:30:00 - 00:35:00

    Introduction to Glue Crawlers, which automate the process of populating the Glue Data Catalog with table definitions, and running a crawler to create the orders table from existing data.

  • 00:35:00 - 00:40:00

    Discussion on the concept of partitions in AWS Glue, explaining how they can optimize data storage and querying in S3 by organizing data into logical folders based on specific criteria.

  • 00:40:00 - 00:45:00

    Overview of Glue connections for securely storing connection properties to various data stores, and the importance of using these connections in ETL scripts to avoid hardcoding sensitive information.

  • 00:45:00 - 00:53:03

    Introduction to AWS Glue ETL jobs, explaining the visual ETL process and how to create a job that extracts, transforms, and loads data, including setting up the job parameters and defining the data source and target.
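
As referenced in the 00:25:00 - 00:30:00 entry above, the database and manual table creation shown in the console can also be done programmatically. Here is a hedged boto3 sketch of those Data Catalog calls; the region, names, column list, and S3 paths are assumptions taken from the tutorial setup rather than the exact values used on screen.

  # Sketch: creating the Data Catalog database and customers table with boto3
  # (names, columns, and paths below are assumptions from the tutorial setup)
  import boto3

  glue = boto3.client("glue", region_name="eu-west-1")  # the video works in the Ireland region

  # Database that groups the raw tables
  glue.create_database(DatabaseInput={"Name": "raw_data"})

  # Manually defined customers table pointing at the CSV files already uploaded to S3
  glue.create_table(
      DatabaseName="raw_data",
      TableInput={
          "Name": "customers_raw",
          "TableType": "EXTERNAL_TABLE",
          "StorageDescriptor": {
              "Columns": [
                  {"Name": "customerid", "Type": "int"},
                  {"Name": "firstname", "Type": "string"},
                  {"Name": "lastname", "Type": "string"},
                  {"Name": "fullname", "Type": "string"},
              ],
              "Location": "s3://your-bucket/raw-data/customers/",
              "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
              "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
              "SerdeInfo": {
                  "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                  "Parameters": {"field.delim": ","},
              },
          },
      },
  )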

Video Q&A

  • What is AWS Glue?

    AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data preparation for analytics.

  • What is the Glue Data Catalog?

    The Glue Data Catalog is a persistent metadata repository that stores information about data sources and targets.

  • How do I create an ETL job in AWS Glue?

    You can create an ETL job in AWS Glue using the visual ETL interface or by writing code.

  • What are crawlers in AWS Glue?

    Crawlers are programs that automatically discover and catalog data in AWS Glue (see the sketch after this Q&A list).

  • What is the purpose of partitions in AWS Glue?

    Partitions help organize data in S3, allowing for more efficient querying and processing.

  • How can I monitor data quality in AWS Glue?

    AWS Glue provides data quality features that allow you to define and monitor data quality rules.

  • What is Glue DataBrew?

    Glue DataBrew is a visual data preparation tool that allows users to clean and normalize data without coding.

  • How do I schedule ETL jobs in AWS Glue?

    You can schedule ETL jobs using triggers in AWS Glue, which can be defined based on time or events (see the sketch after this Q&A list).

  • What is the difference between Glue Data Catalog and Glue DataBrew?

    The Glue Data Catalog is for metadata management, while Glue DataBrew is for visual data preparation.

  • Can I use AWS Glue with other AWS services?

    Yes, AWS Glue can integrate with various AWS services like S3, RDS, and Athena.
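
To make the crawler and scheduling answers above concrete, here is a hedged boto3 sketch that creates an on-demand crawler for the orders data and a time-based trigger for an ETL job; the role, database, bucket, job, and schedule values are assumptions modelled on the tutorial setup.

  # Sketch of the crawler and trigger answers above (all names are assumptions)
  import boto3

  glue = boto3.client("glue", region_name="eu-west-1")

  # Crawler that discovers the orders CSVs and adds a table to the raw_data database
  glue.create_crawler(
      Name="aws-glue-tutorial-crawler",
      Role="AWSGlueCourse",  # IAM role assumed to come from the CloudFormation template
      DatabaseName="raw_data",
      Targets={"S3Targets": [{"Path": "s3://your-bucket/raw-data/orders/"}]},
  )
  glue.start_crawler(Name="aws-glue-tutorial-crawler")

  # Scheduled trigger that runs the ETL job every day at 06:00 UTC
  glue.create_trigger(
      Name="daily-processed-customers-trigger",
      Type="SCHEDULED",
      Schedule="cron(0 6 * * ? *)",
      Actions=[{"JobName": "processed-customers-job"}],
      StartOnCreation=True,
  )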

Transcript (English)
  • 00:00:00
    hi folks welcome back to the channel for
  • 00:00:03
    those of you that are new here I'm
  • 00:00:04
    Johnny chivers and in today's video
  • 00:00:06
    we're going to take a look at AWS glue
  • 00:00:10
    there's already a video of AWS glue on
  • 00:00:13
    the channel however that was filmed over
  • 00:00:15
    3 years ago now and has been released
  • 00:00:17
    for about 2 and 1 half years the AWS
  • 00:00:20
    console has since updated feels
  • 00:00:22
    completely different and in fact there's
  • 00:00:25
    new features within the glue service
  • 00:00:27
    itself that aren't covered in that
  • 00:00:29
    original video
  • 00:00:30
    a lot of you have requested an update to
  • 00:00:32
    that video so we're going to do it today
  • 00:00:34
    it'll follow the same format we'll go
  • 00:00:36
    through what is glue we'll take a look
  • 00:00:38
    at the glue data catalog and we'll
  • 00:00:39
    start building ETL jobs as well
  • 00:00:42
    everything you need for this tutorial is
  • 00:00:45
    located on the GitHub Link in the
  • 00:00:47
    description below that includes the data
  • 00:00:50
    and the slides that I'm going to use I'd
  • 00:00:52
    really appreciate a like And subscribe
  • 00:00:54
    to the channel cuz that helps me out
  • 00:00:56
    with that all being said let's jump onto
  • 00:00:58
    the computer take a look at what AWS
  • 00:01:01
    glue is then do some setup work using
  • 00:01:04
    the code located in that repo on GitHub
  • 00:01:06
    so again download that repo highly
  • 00:01:08
    important join me on the computer and
  • 00:01:10
    we'll get started okay folks uh just to
  • 00:01:13
    remind you again I will make these
  • 00:01:15
    slides available to download on the
  • 00:01:16
    GitHub if you want to download them
  • 00:01:18
    annotate and follow along first thing we
  • 00:01:21
    need to cover is what is AWS glue so
  • 00:01:25
    it's a fully managed ETL service that's
  • 00:01:27
    kind of the core value proposition fully
  • 00:01:30
    managed ETL service that means the AWS
  • 00:01:33
    are going to do the heavy lifting for
  • 00:01:34
    you when it comes to infrastructure it's
  • 00:01:37
    a spark or actually a python ETL engine
  • 00:01:40
    so it has both spark and python usually
  • 00:01:42
    you would only see spark mentioned in
  • 00:01:43
    the documents or a lot of people would
  • 00:01:45
    talk about spark because it's big data
  • 00:01:47
    but you can do python as well it
  • 00:01:49
    consists of a central metadata
  • 00:01:52
    repository called the glue data catalog
  • 00:01:54
    we will cover this in depth and set up
  • 00:01:56
    some databases and tables during this
  • 00:01:58
    tutorial but this is a key
  • 00:01:59
    component that we will cover and it also
  • 00:02:01
    has a flexible scheduler again we'll
  • 00:02:03
    cover this in the tutorial I'll show you
  • 00:02:05
    how it works it's not the only way to
  • 00:02:07
    schedule things inside glue but it is
  • 00:02:09
    certainly good to know that you don't
  • 00:02:11
    have to leave the glue environment to do
  • 00:02:12
    this on the right hand side we have this
  • 00:02:14
    nice nice little diagram um provided by
  • 00:02:17
    AWS that kind of shows you what glue
  • 00:02:19
    does I'm actually going to start down
  • 00:02:21
    here because this is kind of the
  • 00:02:22
    fundamentals and what people think when
  • 00:02:24
    we say glue you have a data source you
  • 00:02:26
    can extract that data source you can
  • 00:02:28
    transform that through a script this
  • 00:02:29
    runs on the serverless ETL engine in
  • 00:02:32
    python or in spark and then you load a
  • 00:02:34
    data Target it has numerous data sources
  • 00:02:37
    you can use numerous data targets that's
  • 00:02:39
    where the glue data catalog comes in you
  • 00:02:41
    store references to your data source or
  • 00:02:44
    your data Target in the glue data
  • 00:02:45
    catalog and then to populate the glue
  • 00:02:47
    data catalog you can use things like
  • 00:02:49
    crawlers or the Management console we
  • 00:02:51
    will do both in the tutorial and then
  • 00:02:53
    once you have things set up in the glue
  • 00:02:55
    data catalog the scripts the AWS glue jobs
  • 00:02:58
    or ETL Pipelines you can use a schedule
  • 00:03:01
    or event to run them and again we'll
  • 00:03:03
    cover that in the tutorial so I always
  • 00:03:07
    think this is useful to put in why use
  • 00:03:10
    AWS glue I know for the purposes of this
  • 00:03:12
    video we are just going to do the
  • 00:03:14
    tutorial but there are many different
  • 00:03:16
    options for ETL in AWS and as you learn
  • 00:03:19
    these as a data engineer or an architect
  • 00:03:21
    or as a developer it's good to know when
  • 00:03:23
    to use the correct tool for the correct
  • 00:03:25
    job and if you're doing this as part of
  • 00:03:27
    the AWS data engineer certification I'll
  • 00:03:30
    also leave a link to that video in the
  • 00:03:32
    description below it's good to know this
  • 00:03:33
    as well so AWS glue offers a fully
  • 00:03:36
    managed serverless ETL tool so again
  • 00:03:38
    fully serverless ETL tool this removes
  • 00:03:42
    the overhead or the barrier to entry when
  • 00:03:44
    there is a requirement for the ETL
  • 00:03:46
    service in AWS and what that means is
  • 00:03:48
    that you don't have to manage the
  • 00:03:50
    underlying infrastructure to use spark
  • 00:03:54
    you need to spin up clusters clustered
  • 00:03:56
    compute in a highly parallelized
  • 00:03:58
    environment using AWS glue you don't
  • 00:04:00
    need to think about any of that it just
  • 00:04:03
    runs it for you on managed services in
  • 00:04:05
    behind the scenes so again there is
  • 00:04:07
    nothing for you to provision it is
  • 00:04:09
    serverless AWS are doing the heavy
  • 00:04:12
    lifting you need to write the code and
  • 00:04:14
    it will do the
  • 00:04:16
    ETL okay so what we'll do now is jump on
  • 00:04:18
    the console we'll get hands on with a
  • 00:04:20
    little bit of setup work I have written
  • 00:04:22
    some cloud formation scripts for this to
  • 00:04:24
    help us set up permissions and some of
  • 00:04:26
    the data repositories that we'll need
  • 00:04:28
    throughout this tutorial follow along
  • 00:04:30
    make sure you have already downloaded
  • 00:04:32
    that GitHub if not I'll show you to do
  • 00:04:34
    it again in the middle of the setup work
  • 00:04:36
    but download that GitHub we will be
  • 00:04:38
    using it throughout this
  • 00:04:40
    tutorial so throughout this course as I
  • 00:04:43
    referenced in the intro we will be using
  • 00:04:45
    this GitHub for this section of setup
  • 00:04:47
    we'll be using the code located in the
  • 00:04:50
    code file which is located here in setup
  • 00:04:52
    hyphen code. yaml and we'll also then
  • 00:04:55
    once it's completed successfully by
  • 00:04:57
    uploading the data from the data file
  • 00:04:59
    the easiest thing to do as I mentioned
  • 00:05:00
    was just take a copy of this down into
  • 00:05:03
    your local machine so to do that you can
  • 00:05:04
    just go code and then download as zip
  • 00:05:07
    and then unzip that file once it's
  • 00:05:09
    downloaded and use the contents
  • 00:05:10
    throughout the instructions on what
  • 00:05:12
    we're doing is also located here but
  • 00:05:14
    don't worry I'm going to guide you step
  • 00:05:17
    by step that being said we need to jump
  • 00:05:20
    on to the AWS console so please log in
  • 00:05:24
    and you'll be greeted with the homepage
  • 00:05:26
    hopefully you are familiar with the AWS
  • 00:05:28
    console if not and this is your first
  • 00:05:31
    time using the AWS console don't worry
  • 00:05:33
    I'll uh point out things as we go along
  • 00:05:35
    so the first thing we're going to do is
  • 00:05:37
    go to CloudFormation this is where we
  • 00:05:39
    can use our infrastructure as code to
  • 00:05:42
    spin up the resources we need for the
  • 00:05:43
    tutorial the idea behind this is that I
  • 00:05:46
    have created the templates for us and that
  • 00:05:48
    means we won't have to do as much usual
  • 00:05:50
    manual intervention clicking through the
  • 00:05:52
    console and doing all the stuff in terms
  • 00:05:54
    of permissions that you see in many
  • 00:05:55
    other tutorials so we want to go create
  • 00:05:57
    stack create new resource we're going to
  • 00:06:00
    upload a template you want to choose
  • 00:06:02
    file and you want to go to the unzipped
  • 00:06:05
    version of the folder that you've just
  • 00:06:07
    downloaded from the GitHub I repeat the
  • 00:06:08
    unzipped version go to the code folder
  • 00:06:12
    and then the setup hyphen
  • 00:06:15
    code. yaml you want to open that then
  • 00:06:18
    you want to go next you need to give the
  • 00:06:20
    stack a name so this can be anything
  • 00:06:22
    that you want so I'm just going to call
  • 00:06:23
    this Johnny hyphen chivers well can't
  • 00:06:27
    spell my own name chivers hyphen glue
  • 00:06:31
    course uh 2024 that will do that next
  • 00:06:35
    thing you need to do is give the bucket
  • 00:06:37
    a name so this is the S3 bucket for the
  • 00:06:39
    course bear in mind with AWS every
  • 00:06:43
    bucket has to be unique globally so you
  • 00:06:45
    will not be able to use the same name as
  • 00:06:47
    me for the bucket but I'm going to call
  • 00:06:49
    mine AWS glue course hyphen Johnny
  • 00:06:54
    chivers hopefully that is unique you
  • 00:06:57
    need again I'll repeat you need a unique
  • 00:06:59
    name so if you don't have one it will
  • 00:07:00
    tell you but that's carry on for now um
  • 00:07:03
    use your name it might be the easiest
  • 00:07:05
    thing with some random numbers or digits
  • 00:07:07
    accept and acknowledge at the bottom
  • 00:07:09
    we're going to run this using the root
  • 00:07:11
    permissions that I'm currently logged in
  • 00:07:12
    with and then hit submit so that will
  • 00:07:15
    take a few minutes to go off and running
  • 00:07:18
    you can hit the refresh icon here and
  • 00:07:20
    see what the actual setup code is doing
  • 00:07:23
    code itself as I mentioned uh in the
  • 00:07:25
    header of the code so if you just go and
  • 00:07:27
    have a look at the file it's going going
  • 00:07:29
    to create a few things for us one is an
  • 00:07:31
    S3 bucket next one is an IAM role for
  • 00:07:35
    glue we'll be using this throughout to
  • 00:07:37
    actually have permissions to do anything
  • 00:07:38
    then the last thing is an Athena working
  • 00:07:40
    group which isn't technically part of
  • 00:07:42
    this course but we may use it to look at
  • 00:07:44
    some of the data so it's good to have it
  • 00:07:45
    around as well this shouldn't take too
  • 00:07:48
    long to complete in fact it's completed
  • 00:07:50
    already so that took about 30 seconds
  • 00:07:53
    let's check that a few things are there
  • 00:07:55
    as they should be so if you go to
  • 00:07:56
    outputs you'll see that we have the S3
  • 00:07:58
    bucket created so copy that name under
  • 00:08:00
    value so I put copy the value let's go
  • 00:08:03
    to S3 so I'm going to go to S3 I have
  • 00:08:06
    other buckets in this account you might
  • 00:08:08
    only have one bucket don't worry just
  • 00:08:10
    paste it in make sure the bucket has
  • 00:08:11
    been created I should mention as well
  • 00:08:13
    I'm working in the Ireland region for
  • 00:08:15
    this as well so that is the bucket if we
  • 00:08:18
    go to IAM click on IAM click in and
  • 00:08:21
    we have the name of the IAM role that
  • 00:08:23
    we're looking for called AWS glue course
  • 00:08:27
    so in here if we go to roles on left
  • 00:08:29
    hand side search for the role you
  • 00:08:32
    can see that we have the role if we load
  • 00:08:33
    the role up then we should have a policy
  • 00:08:36
    attached to the role that's great we do
  • 00:08:38
    and inside that policy is all the good
  • 00:08:40
    stuff that we have created during the
  • 00:08:42
    setup process so that's the setup in
  • 00:08:44
    terms of running the template the next
  • 00:08:46
    thing we need to do is actually upload
  • 00:08:48
    some data and create some folders inside
  • 00:08:51
    the S3 bucket itself so let's go back to
  • 00:08:54
    that S3 bucket that we were just on that
  • 00:08:56
    we know was successfully created then
  • 00:08:58
    search for the bucket that we were just
  • 00:09:00
    on so AWS glue course Johnny chivers was
  • 00:09:02
    my bucket if we go back on to the GitHub
  • 00:09:06
    and we go back into the main part of the
  • 00:09:08
    readme file you can see that we're going
  • 00:09:10
    to upload the data we ignore that for a
  • 00:09:12
    little second but what we need to do is
  • 00:09:13
    create a few things in the folder
  • 00:09:16
    structure denoted so if we just copy 0.1
  • 00:09:19
    raw data I'll show you how to do this
  • 00:09:21
    you want to go to create folder you want
  • 00:09:23
    to call this folder raw data and then
  • 00:09:25
    you just want to create the folder like
  • 00:09:28
    so next we have to do process data so
  • 00:09:31
    this will be quite quick again same
  • 00:09:32
    process in create folder paste it and
  • 00:09:36
    then create folder next we have to do is
  • 00:09:39
    script location we'll be needing these
  • 00:09:41
    folders throughout the tutorial so we'll
  • 00:09:43
    just have to create them once and then
  • 00:09:45
    that's it done script
  • 00:09:47
    location oh I hit the wrong button back
  • 00:09:50
    in create folder folder name script
  • 00:09:53
    location create
  • 00:09:55
    folder temp directory we'll also need a
  • 00:09:58
    temp directory create folder folder
  • 00:10:01
    name temp directory create folder and
  • 00:10:04
    then we will need an Athena folder
  • 00:10:07
    called Athena so we'll just go into
  • 00:10:09
    create folder again this is the last one
  • 00:10:11
    and create Athena perfect back onto the
  • 00:10:15
    GitHub you'll see that in the Raw data
  • 00:10:16
    folder we have to upload the customers
  • 00:10:19
    and orders data keeping the folder
  • 00:10:22
    structure as the noted inside the GitHub
  • 00:10:24
    so you can see here we'll have a folder
  • 00:10:25
    called raw data then we're going to
  • 00:10:27
    upload our customers and orders folders
  • 00:10:29
    that have have CSV files inside them
  • 00:10:31
    this is quite simple so what the easiest
  • 00:10:33
    way to do this is to go to the actual
  • 00:10:36
    raw data folder itself then what I find
  • 00:10:39
    the next bit to do the easiest way is to
  • 00:10:42
    go upload then when we're here the easiest
  • 00:10:45
    thing for me to do or the way I find it
  • 00:10:47
    easiest is just put this folder down
  • 00:10:49
    into little minimize folder go get the
  • 00:10:53
    location where you have unzipped and
  • 00:10:54
    again unzipped that data so for me it's
  • 00:10:57
    going to be in Jonathan
  • 00:11:00
    in here then I've got sitting in here in
  • 00:11:03
    here in data so that's just where I have
  • 00:11:06
    these unzipped locations so go find the
  • 00:11:08
    customers and orders in the GitHub unzip
  • 00:11:11
    version click and drag it across again
  • 00:11:13
    that is just the easiest thing to do you
  • 00:11:15
    can see here then you get your customers
  • 00:11:16
    folder your orders folder and your
  • 00:11:19
    two csvs as well click upload this will
  • 00:11:22
    take a few seconds there's not that much
  • 00:11:24
    data there so just sit tight once it's
  • 00:11:26
    been successfully done you can close and
  • 00:11:28
    you can see that you have customers with
  • 00:11:30
    a customer CSV and you have orders with
  • 00:11:34
    a order CSV so we've successfully I'm
  • 00:11:37
    going to blow this back up big because
  • 00:11:38
    we don't need it that way anymore
  • 00:11:39
    created this raw data folder with our
  • 00:11:41
    customers folder and CSV and our orders
  • 00:11:44
    folder and our orders CSV as well so
  • 00:11:47
    that's the setup work complete for the
  • 00:11:49
    tutorial make sure you follow along with
  • 00:11:51
    this it's pretty simple run the run the
  • 00:11:52
    cloud formation template then create
  • 00:11:55
    these folder locations in the S3 bucket
  • 00:11:57
    that was created and then upload the
  • 00:11:59
    data maintaining the folder structure
  • 00:12:01
    just as I have shown the AWS glue data
  • 00:12:06
    catalog so the AWS glue data catalog is
  • 00:12:09
    a persistent metastore so let's say that
  • 00:12:12
    again is a persistent metastore well
  • 00:12:15
    what does that actually mean well it
  • 00:12:17
    stores metadata so what is metadata well
  • 00:12:20
    on the right hand side we have different
  • 00:12:22
    descriptions of metadata we have
  • 00:12:24
    location schema data types and data
  • 00:12:27
    classification so this is the important
  • 00:12:29
    but it's a managed service that lets you
  • 00:12:31
    store annotate and share metadata again
  • 00:12:33
    metadata these are the things on the
  • 00:12:35
    right hand side which can be used to
  • 00:12:37
    query and transform data what I find or
  • 00:12:40
    what can be the difficult concept to get
  • 00:12:42
    your head around is that when you
  • 00:12:44
    register things in the glue data catalog
  • 00:12:47
    I.E data sources or data targets they do
  • 00:12:50
    not move from their existing location so
  • 00:12:53
    if I have a database let's say I have a
  • 00:12:55
    database that's just a MySQL database
  • 00:12:57
    and it's on AWS and I have data in that
  • 00:13:00
    database when I register it with a glue
  • 00:13:02
    data catalog it will store the location
  • 00:13:05
    the schema the data types and even the
  • 00:13:07
    classification of that data in the glue
  • 00:13:09
    data catalog what it does not do is move
  • 00:13:12
    the data to AWS glue it keeps it inside
  • 00:13:16
    that database it stays where it is this
  • 00:13:19
    is just a pointer and information on how
  • 00:13:22
    to access that data and that's the key
  • 00:13:24
    thing it becomes a metastore a
  • 00:13:26
    collection of information that you can
  • 00:13:29
    use to perform ETL on different data
  • 00:13:32
    sources and bring them together inside
  • 00:13:34
    AWS and you need to remember then that
  • 00:13:37
    there's one AWS glue data catalog per AWS
  • 00:13:39
    region so you get one per region you can
  • 00:13:42
    use IAM identity and access
  • 00:13:45
    management policies to control um
  • 00:13:47
    access to them and you can also use data
  • 00:13:49
    governance because you can annotate the
  • 00:13:51
    glue data catalog so again it's a glue
  • 00:13:53
    data catalog it's where you store
  • 00:13:55
    metadata it does not store physical data
  • 00:13:58
    this is just reference information
  • 00:14:01
    required to get to the data you have
  • 00:14:04
    stored in
  • 00:14:06
    AWS AWS glue databases so a database is
  • 00:14:11
    a set of associated data catalog table
  • 00:14:14
    definitions organized into a logical group
  • 00:14:16
    so what I've done here in this little
  • 00:14:17
    graphic is draw a database and I put some
  • 00:14:19
    tables inside it so this is where we
  • 00:14:21
    name something and then associate the
  • 00:14:23
    tables to it the way you would do it
  • 00:14:24
    normally in an RDBMS database
  • 00:14:28
    for example we'll jump on to the
  • 00:14:30
    console in just a second and we can look
  • 00:14:32
    at actually setting up a database and
  • 00:14:34
    we'll do tables as well but we'll
  • 00:14:36
    actually look at how this works in
  • 00:14:38
    action okay back on the AWS console if
  • 00:14:41
    this is your first time don't worry what
  • 00:14:44
    we're going to do is navigate to AWS
  • 00:14:46
    glue so type in AWS glue in the search
  • 00:14:48
    bar and click AWS glue once there you'll
  • 00:14:53
    notice a few different things looking
  • 00:14:54
    around on the console there's quite a
  • 00:14:56
    lot going on the left hand side is going
  • 00:14:59
    going to be your best friend so I'm
  • 00:15:00
    going to minimize that down in case
  • 00:15:01
    you've arrived with out it out there's a
  • 00:15:03
    hamburger menu to expand it and then
  • 00:15:05
    down the left you can see we have a
  • 00:15:07
    getting started ETL jobs data catalog
  • 00:15:09
    tables data connections and workflows
  • 00:15:12
    then it breaks down into two sections
  • 00:15:13
    one is data catalog where you have
  • 00:15:15
    databases and tables and a couple of
  • 00:15:18
    other things including crawlers
  • 00:15:19
    connections then on top of this you have
  • 00:15:22
    data integration and ETL so ETL jobs
  • 00:15:25
    data classification modes we'll be
  • 00:15:26
    looking at some of this as the tutorial
  • 00:15:28
    goes on importantly there is Legacy
  • 00:15:31
    pages so if you're familiar with old
  • 00:15:33
    glue and it's layout these are the
  • 00:15:35
    Legacy Pages down the left hand side
  • 00:15:37
    glue's done a lot of work in how it
  • 00:15:39
    looks on the UI in the last kind of two
  • 00:15:41
    years if you look at my previous
  • 00:15:42
    tutorial compared to this tutorial it's
  • 00:15:44
    a completely different look that's why
  • 00:15:46
    I've redone this video but if you're
  • 00:15:47
    looking from anything from the old
  • 00:15:49
    tutorial or you haven't been on glue in
  • 00:15:51
    a few years you can find your legacy
  • 00:15:53
    Pages down here okay so what we're going
  • 00:15:56
    to do next then is create a database for
  • 00:15:58
    this tutorial for for the purposes of
  • 00:16:00
    the rest of the demonstration you need
  • 00:16:02
    to go to databases which is on the left
  • 00:16:04
    hand side and we need to create a
  • 00:16:06
    database for this we need to add a
  • 00:16:10
    database so let's call the first one
  • 00:16:12
    let's call this one raw underscore data um you
  • 00:16:15
    can have a location for that so let's
  • 00:16:17
    get a location let's be proper about
  • 00:16:20
    this this isn't required but is good
  • 00:16:22
    practice if you can S3 we need to go
  • 00:16:24
    find that bucket again that I have just
  • 00:16:26
    created for the glue course I'm going to
  • 00:16:28
    leave this open in a tab because we'll
  • 00:16:30
    be referencing it quite a lot we want
  • 00:16:32
    this raw data folder and we want to copy
  • 00:16:35
    the URI put it back in like this and we
  • 00:16:38
    can create the database
  • 00:16:41
    itself that's how we create a database
  • 00:16:44
    AWS glue tables so a glue table is the
  • 00:16:48
    metadata definition that represents your
  • 00:16:51
    data the data resides in its original
  • 00:16:53
    store so there it is again I'm going to
  • 00:16:55
    keep saying this the data resides in its
  • 00:16:57
    original store this is just a
  • 00:16:59
    representation of the schema this as I
  • 00:17:02
    said is the most difficult thing that
  • 00:17:04
    people come to grasp with glue it is
  • 00:17:07
    just a representation of the data it's
  • 00:17:09
    just pointing to where the data is when
  • 00:17:11
    I was at reinvent last year a few people
  • 00:17:13
    came up to me and said hey Johnny your
  • 00:17:15
    videos are really useful because no one
  • 00:17:16
    else kind of mentions how this actually
  • 00:17:18
    works and it's quite simple that there's
  • 00:17:20
    connections there's information stored
  • 00:17:22
    in the glue data catalog that represents
  • 00:17:24
    your data wherever it is actually
  • 00:17:26
    repository wherever that Repository
  • 00:17:28
    actually resides but it is not the data
  • 00:17:31
    itself so again make sure you understand
  • 00:17:33
    this make sure you read this and
  • 00:17:35
    understand this fully before you
  • 00:17:37
    progress with AWS glue AWS glue crawlers
  • 00:17:41
    so it's a little program that AWS have
  • 00:17:43
    created that lets you find information
  • 00:17:46
    about data that you have stored in AWS
  • 00:17:49
    or or on databases and then it populates
  • 00:17:52
    or tries to populate the AWS glue data
  • 00:17:55
    catalog for you with information and the
  • 00:17:57
    idea behind this is that it lifts the
  • 00:17:59
    burden of you having to manually create
  • 00:18:01
    the tables that you have lying around or
  • 00:18:03
    you have on AWS already I will show you
  • 00:18:06
    how to use both the AWS glue crawler
  • 00:18:09
    and I will show you how to manually add
  • 00:18:10
    the table so you don't have to use the
  • 00:18:12
    crawler it's just a handy little program
  • 00:18:15
    that helps you or minimizes the burden
  • 00:18:18
    on trying to find all the tables that
  • 00:18:19
    you have you point it to where the
  • 00:18:21
    data resides and it comes up with the
  • 00:18:23
    schemas it comes up with the table
  • 00:18:24
    names for you but you do have the
  • 00:18:28
    ability to manually create the
  • 00:18:30
    tables as well the choice is yours I
  • 00:18:32
    always use a blend of both there is the
  • 00:18:34
    right tool for the right job and we'll
  • 00:18:36
    show you how to do both in this tutorial
  • 00:18:38
    okay again I've just went back out to
  • 00:18:40
    the AWS glue console to make it simple
  • 00:18:42
    to find the tables so hamburger menu on
  • 00:18:44
    the left hand side if it's not already
  • 00:18:46
    expanded and go to tables that's as
  • 00:18:50
    simple as that going to minimize this
  • 00:18:52
    little header and banner here for this
  • 00:18:56
    part of the demo or the tutorial what
  • 00:18:58
    we're going to do is add a table
  • 00:19:00
    manually so looking down inside the
  • 00:19:02
    GitHub you can see that we have two
  • 00:19:04
    tables one is orders one is customers
  • 00:19:06
    we'll do the customers because it has
  • 00:19:08
    the smaller schema I would never really
  • 00:19:10
    add a table manually unless I'm just
  • 00:19:12
    playing around with data I would usually
  • 00:19:13
    do it through code but it's good to see
  • 00:19:15
    this in action so let's go back to
  • 00:19:18
    tables and let's go add table at the top
  • 00:19:20
    first thing we need to do is give the
  • 00:19:22
    table a name so I'm going to call this
  • 00:19:24
    customers actually I'm going to keep it
  • 00:19:25
    all smalls because that's just best
  • 00:19:27
    practice customers
  • 00:19:29
    and this is raw data so I'm just going
  • 00:19:30
    to call it customers raw next thing we
  • 00:19:33
    need to do is create or select a
  • 00:19:34
    database well we did that previously so
  • 00:19:36
    that's raw data you give this a
  • 00:19:38
    description this is us this is the
  • 00:19:43
    customers uh data I'll do right we want
  • 00:19:47
    to keep this table as it is standard AWS
  • 00:19:50
    glue table as default we're going to say
  • 00:19:53
    it's an S3 we're going to say the data
  • 00:19:55
    is in our account and it's going to say well
  • 00:19:57
    where's the data so we'll go browse
  • 00:19:59
    we'll go find that bucket that we
  • 00:20:00
    created through the um cloud formation
  • 00:20:03
    template we're going to go raw and it's
  • 00:20:05
    in here inside our customers we're going
  • 00:20:08
    to choose just click off that um in
  • 00:20:10
    order to get rid of the warning uh we
  • 00:20:12
    are in CSV data comma delimited and go
  • 00:20:16
    next it's going to then ask us a few
  • 00:20:19
    questions about the schema itself so
  • 00:20:23
    we're going to add the schema doing it
  • 00:20:25
    here so we'll just go add column number
  • 00:20:28
    one well if we go back onto the GitHub
  • 00:20:30
    we know it's customer ID so we're just
  • 00:20:31
    going to copy and we are going to go
  • 00:20:34
    back in and say it's customer ID it's
  • 00:20:36
    not a string type it is an INT type so
  • 00:20:38
    we'll just go int and hit save then
  • 00:20:42
    we'll want to add the next column well
  • 00:20:44
    we know that it's first name so we're
  • 00:20:46
    just going to go first name back in and
  • 00:20:49
    paste again I'm going to keep these as
  • 00:20:51
    small so just take out that capital F
  • 00:20:54
    and we didn't do it for the customer ID
  • 00:20:55
    which isn't great so back in there take
  • 00:20:58
    that c down and make it a small then we
  • 00:21:01
    need last name and we just copy that in
  • 00:21:05
    and we add a column in as it stands make
  • 00:21:08
    sure that's number three and that is
  • 00:21:10
    last name and save and then we have full
  • 00:21:13
    name as the last one so full name save
  • 00:21:18
    so that's the columns added we have one
  • 00:21:20
    int and three strings then we want to go
  • 00:21:23
    next and you can see here it as asked um
  • 00:21:27
    also by the T we have it in raw data the
  • 00:21:31
    customer data name and it is in
  • 00:21:34
    CSV then we want to go next and again if
  • 00:21:37
    you needed to re-edit the schema just
  • 00:21:39
    click that edit schema
  • 00:21:42
    function then you want to go next then
  • 00:21:45
    you want to go
  • 00:21:47
    create and it's off and it is creating
  • 00:21:50
    the table for us let's go and add the
  • 00:21:54
    second table the orders table through a
  • 00:21:57
    crawler so let's go down to the left
  • 00:22:00
    hand side here and go to crawlers again
  • 00:22:02
    if this Hamburg or this menu's
  • 00:22:04
    disappeared hit the hamburger go to
  • 00:22:06
    crawlers and then we're going to create
  • 00:22:08
    a crawler for the purpose of this I'm
  • 00:22:10
    just going to call this AWS glue
  • 00:22:13
    tutorial for the
  • 00:22:16
    crawler like this and we're just going
  • 00:22:19
    to hit next we haven't mapped the data
  • 00:22:22
    already naturally then we're going to go
  • 00:22:25
    add data source with have an S3 data
  • 00:22:27
    source it's in this account we're going
  • 00:22:30
    to go browse we're going to find that
  • 00:22:32
    bucket again that we've been using for
  • 00:22:33
    this course it's in raw data um it is
  • 00:22:37
    the orders table that we're going to map
  • 00:22:40
    this time again that might go red so
  • 00:22:42
    just click off it crawl all sub
  • 00:22:45
    folders yep that's correct and then we
  • 00:22:48
    want to go add S3 Source then when we've
  • 00:22:52
    got that we want to highlight that
  • 00:22:54
    source and go
  • 00:22:55
    next an existing IAM role so as part of
  • 00:22:59
    this tutorial we created an IAM role if
  • 00:23:02
    you need to know the name of the IAM
  • 00:23:04
    role it's in the setup code and then you
  • 00:23:06
    just go down to the role and it says
  • 00:23:08
    AWS glue course that's the one we're
  • 00:23:10
    looking for the one we checked existed
  • 00:23:11
    after we ran the template back in here
  • 00:23:14
    sorry and then just paste it in and find
  • 00:23:16
    it click off and that will be you we
  • 00:23:19
    don't need Lake Formation do not check
  • 00:23:20
    that box our Target database is raw data
  • 00:23:24
    we don't need a prefix on the table name
  • 00:23:27
    for now that's fine and then we're just
  • 00:23:29
    going to schedule on demand click next
  • 00:23:33
    hit create crawler this will take a
  • 00:23:35
    little second and then you want to run
  • 00:23:36
    the crawler by clicking this button this
  • 00:23:39
    will go off and run itself to map the
  • 00:23:41
    data for this this will probably take
  • 00:23:44
    one or two minutes in total so I'll
  • 00:23:47
    pause the video here and we can pick it
  • 00:23:49
    up once it is done and has created our
  • 00:23:52
    orders table for us okay after exactly 1
  • 00:23:56
    minute for me so I did say take between
  • 00:23:58
    one and two you can see that it's
  • 00:24:00
    completed successfully and we've had one
  • 00:24:02
    table change you can click here to see
  • 00:24:04
    what that is we've added the table
  • 00:24:06
    orders perfect well that's great because
  • 00:24:08
    that's what we were looking for if we
  • 00:24:10
    then go into our databases if we go into
  • 00:24:13
    raw data you can see we have customers
  • 00:24:15
    raw and then we also have orders raw as
  • 00:24:18
    a table or orders as our table because
  • 00:24:20
    we added this by a crawler and I didn't
  • 00:24:21
    put a prefix or suffix in it's used
  • 00:24:24
    the folder's name that's completely fine
  • 00:24:26
    we're allowed to have the name that we
  • 00:24:27
    want but more importantly let's just
  • 00:24:29
    check that it picked this up correctly
  • 00:24:31
    so if we click on that link it should
  • 00:24:34
    take us into the orders and we should
  • 00:24:35
    see the CSV perfect let's click on the
  • 00:24:37
    orders table itself did it get all the
  • 00:24:40
    different columns it's got 16 columns in
  • 00:24:42
    total so if we go back here and we go to
  • 00:24:45
    data you can see that we start with
  • 00:24:46
    sales order ID and we finish with line
  • 00:24:49
    total so let's do that little visual
  • 00:24:51
    check to make sure that those things
  • 00:24:52
    line up yep sales order ID and we
  • 00:24:55
    finish with line total so we've managed
  • 00:24:57
    to pick up the entire through a program
  • 00:24:59
    called a crawler that AWS has written that
  • 00:25:02
    has stopped us having to manually enter
  • 00:25:03
    this information or enter it through
  • 00:25:05
    code such the way that I did the manual
  • 00:25:07
    setup for the customers table so it's
  • 00:25:09
    taken that burden off us if you have a
  • 00:25:11
    lot of data and you just want a way to
  • 00:25:13
    look at the um tables and get them
  • 00:25:16
    loaded very very quickly so that's a
  • 00:25:18
    crawler let's move on to the next
  • 00:25:21
    section a brief word on partitions in AWS
  • 00:25:25
    this is important to know uh our ETL job
  • 00:25:27
    when we get there will create a
  • 00:25:28
    single partition in AWS but you can play
  • 00:25:30
    around and create more so a partition is
  • 00:25:32
    folders where data is stored in S3 which
  • 00:25:34
    are physical entities mapped to
  • 00:25:36
    partitions which are logical entities or
  • 00:25:39
    columns in the glue table what does
  • 00:25:42
    that actually mean okay so we could have
  • 00:25:45
    a table called sales and we could have a
  • 00:25:47
    seale de and AWS glue or indeed big data
  • 00:25:51
    processing Frameworks in in general give
  • 00:25:53
    you the ability to create partitions and
  • 00:25:56
    with these partitions we can split
  • 00:25:58
    things into year month and day for that
  • 00:26:01
    sales date and what these actually
  • 00:26:03
    become are logical folders or physical
  • 00:26:05
    folders on S3 so we can actually split
  • 00:26:08
    the data on S3 so rather than having a
  • 00:26:11
    just a column in the table you have a
  • 00:26:13
    folder and what this means is when you
  • 00:26:15
    search for something like I want to find
  • 00:26:17
    the sales date that is February the 2nd
  • 00:26:20
    2019 the inner workings of AWS glue know
  • 00:26:25
    that actually it can exclude the folders
  • 00:26:27
    that don't have that on it so it knows
  • 00:26:29
    actually because these days aren't
  • 00:26:30
    correct I can just go look at this one
  • 00:26:33
    here out of the four it's a way to speed
  • 00:26:35
    up queries it's a way to speed up
  • 00:26:36
    writing as well and updates depending
  • 00:26:39
    exactly on what you're doing you need to
  • 00:26:41
    understand partitions to use AWS glue
  • 00:26:43
    again we will create a partition I will
  • 00:26:45
    show you how to do that when it comes to
  • 00:26:46
    the ETL job if you're not quite
  • 00:26:48
    physically grasping that n hopefully by
  • 00:26:51
    the time we do the ETL job it will all
  • 00:26:53
    sink
  • 00:26:54
    in AWS glue connections so a glue
  • 00:26:58
    connections is a data catalog object
  • 00:27:00
    that contains the properties that are
  • 00:27:02
    required to connect to a particular data
  • 00:27:03
    store typical of many Cloud providers or
  • 00:27:06
    any ETL tool you can store a connection
  • 00:27:08
    so if you have a database wherever that
  • 00:27:10
    may live on premise or in AWS you can
  • 00:27:13
    store the connection string the password
  • 00:27:14
    and the username inside AWS glue that
  • 00:27:17
    means when you need to actually
  • 00:27:18
    reference it in an ETL script you're
  • 00:27:20
    actually just entering the object
  • 00:27:21
    information you don't store the
  • 00:27:23
    passwords or anything in the script
  • 00:27:25
    pretty standard practice these days when
  • 00:27:27
    it was built with a AWS glue it wasn't
  • 00:27:29
    you would expect it to be there it is
  • 00:27:31
    there please use AWS glue connections
  • 00:27:34
    when you need them so let's take a
  • 00:27:36
    little look at connections down the left
  • 00:27:38
    hand side you'll see there is
  • 00:27:39
    connections click on Connection in
  • 00:27:42
    connectors you can go to the marketplace
  • 00:27:44
    or if you know your connector
  • 00:27:46
    already exists you can go create
  • 00:27:48
    connection and you can see here a list
  • 00:27:50
    of connectors that you can use with AWS
  • 00:27:52
    glue these include AWS connectors or
  • 00:27:55
    second party or first party connectors
  • 00:27:57
    as well such as Salesforce uh Snowflake
  • 00:27:59
    there's SAP sitting around as well on to the
  • 00:28:02
    fun bit AWS glue ETL so AWS glue ETL
  • 00:28:06
    supports extracting data from various
  • 00:28:08
    sources and that's important so various
  • 00:28:10
    sources there is the likes of S3 RDS on
  • 00:28:14
    premise databases transforming it to meet
  • 00:28:16
    your business needs and then loading it
  • 00:28:17
    into a destination of your choice we
  • 00:28:21
    will be doing this in the tutorial I
  • 00:28:23
    will show you how to do this visually
  • 00:28:25
    you will not need to code I will show
  • 00:28:26
    you the script as well but but if you're
  • 00:28:28
    not a good coder or you can't code don't
  • 00:28:30
    worry you'll be able to follow along
  • 00:28:32
    with the tutorial AWS glue ETL engine
  • 00:28:34
    you need to know this is an Apache Spark
  • 00:28:36
    engine distributed for Big Data
  • 00:28:38
    workloads across worker nodes it also
  • 00:28:40
    supports python but more typically you
  • 00:28:42
    would use the Spark engine you need to
  • 00:28:45
    understand this as well that the AWS
  • 00:28:47
    glue dpus one dpu is equivalent to four
  • 00:28:51
    CPUs and 16 gig of memory when you're
  • 00:28:54
    provisioning jobs and when we jump into
  • 00:28:56
    the ETL job in just a little second you
  • 00:28:58
    will see that I provisioned two dpus for
  • 00:29:00
    that job this means that I will have eight
  • 00:29:03
    CPUs and 32 GB of memory available if
  • 00:29:07
    you do not have enough dpus for your job
  • 00:29:09
    it will crash and fail if you have too
  • 00:29:12
    many dpus for your job you're paying for
  • 00:29:14
    compute resource you are not using it
  • 00:29:16
    is a bit of an art to tune your glue
  • 00:29:17
    jobs there are handy features on the
  • 00:29:19
    console to show you when you're under
  • 00:29:21
    provisioning and over provisioning dpus
  • 00:29:23
    this is the charging mechanism so
  • 00:29:25
    depending on the on the region you're in
  • 00:29:27
    it will be priced at how many dpus per
  • 00:29:30
    minute that you use is the cost that you
  • 00:29:32
    pay so it's important to size these
  • 00:29:34
    correctly and just before we jump on and
  • 00:29:36
    create an ETL job it's good to know
  • 00:29:39
    about these we won't use them but you
  • 00:29:40
    can put bookmarks in which basically
  • 00:29:42
    means when you have new data arrive the
  • 00:29:45
    ETL job that you're creating will not
  • 00:29:47
    reprocess the old data so you're just
  • 00:29:48
    doing a Delta load of the data really
  • 00:29:51
    handy if you're getting like hourly loads
  • 00:29:52
    or daily loads into an S3 bucket and you
  • 00:29:55
    just want to process the new data this
  • 00:29:57
    means AWS is automatically tracking that
  • 00:29:59
    for you and says hey I'm not going to
  • 00:30:02
    process the previous run of data I'm
  • 00:30:04
    just going to process the new data that
  • 00:30:06
    has landed it bookmarks it and will not
  • 00:30:08
    reprocess it it again okay so back on
  • 00:30:12
    the console the first thing we're going
  • 00:30:13
    to do is create a database so we can
  • 00:30:15
    actually store new tables we create
  • 00:30:17
    during this visual ETL process so on the
  • 00:30:21
    left hand side you want to go databases
  • 00:30:22
    and you want to go add database and
  • 00:30:25
    we're going to call this database
  • 00:30:26
    processed
  • 00:30:28
    underscore
  • 00:30:29
    data two C's there it shouldn't be
  • 00:30:33
    process underscore data we need the S3
  • 00:30:36
    location that we already set up um when
  • 00:30:38
    we were running the setup script so if
  • 00:30:40
    we go in and we find that bucket again
  • 00:30:42
    I've minimized the tabs there in between
  • 00:30:44
    different parts of the lesson I have a
  • 00:30:46
    processed data area of the bucket I want
  • 00:30:48
    to uh hit the tick and take the URI
  • 00:30:51
    again URI not URL that's really
  • 00:30:53
    important and we want to paste that down
  • 00:30:55
    here and we need to enter a little bit
  • 00:30:57
    of of a description so I'm just going to
  • 00:30:59
    say this is the database to
  • 00:31:04
    hold the tables for processed
  • 00:31:11
    data
  • 00:31:12
    processed data and then you want to
  • 00:31:15
    create that database so that gives us a
  • 00:31:16
    location to store some of the processed
  • 00:31:19
    data now down the left hand side you can
  • 00:31:21
    see there is a data integration and ETL
  • 00:31:24
    section there's lots of different things
  • 00:31:26
    here but we're going to be working on
  • 00:31:27
    the ETL jobs and for the purposes of
  • 00:31:29
    this tutorial we're going to do visual
  • 00:31:31
    ETL um that means if you can't code or
  • 00:31:33
    you don't want to code you don't need to
  • 00:31:35
    know it for the purposes of this demo
  • 00:31:37
    click on that visual ETL and then hit
  • 00:31:39
    create job from a blank graph this is
  • 00:31:43
    your UI or your GUI that helps you build
  • 00:31:46
    um glue jobs using nodes rather than
  • 00:31:49
    coding don't worry it still creates a
  • 00:31:51
    code script under the script tab in
  • 00:31:53
    behind so you can actually see what it's
  • 00:31:55
    doing but for the purposes of this we'll
  • 00:31:57
    do visual ETL first thing you need to do
  • 00:32:00
    is give it a name so let's call this
  • 00:32:03
    processed customers
  • 00:32:06
    job and let's hit enter you'll see then
  • 00:32:09
    when this goes completely red here we do
  • 00:32:11
    have a few things we need to fill in so
  • 00:32:13
    IAM role I've created one during the
  • 00:32:16
    setup course or script rather for called
  • 00:32:18
    AWS glue course so select that scrolling
  • 00:32:22
    down scrolling down into advanced
  • 00:32:23
    properties we will have to pick a name
  • 00:32:26
    or a few locations that will be named to
  • 00:32:28
    store um different aspects here so the
  • 00:32:31
    first thing is we have an area in the
  • 00:32:32
    bucket to store the scripts so again if
  • 00:32:34
    we go into the bucket itself I have
  • 00:32:36
    created a script location so choose that
  • 00:32:40
    you also need a place to put your logs
  • 00:32:42
    so again let's view that oh no not view
  • 00:32:44
    that sorry that's browse that then go
  • 00:32:46
    find the bucket again so m is AWS Johnny
  • 00:32:48
    CH of course uh inside the temp
  • 00:32:51
    directory would be fantastic and choose
  • 00:32:53
    that and then we'll just put forward SL
  • 00:32:56
    logs and then scrolling down scrolling
  • 00:32:58
    down scrolling down inside temporary
  • 00:33:00
    path let browse that again go find that
  • 00:33:03
    S3 bucket I've given all the permissions
  • 00:33:05
    required to access this bucket um during
  • 00:33:08
    the script setup so you won't have to
  • 00:33:10
    add anything extra by using those
  • 00:33:12
    locations oh then we want to Temporary
  • 00:33:13
    directory and choose that
  • 00:33:16
    directory that's everything there under
  • 00:33:18
    advanced settings so we can just scroll
  • 00:33:20
    back up and minimize that and then to
  • 00:33:21
    keep the cost down the dpus the number
  • 00:33:23
    of workers we want is two that's perfect
  • 00:33:26
    everything else can be left as it is you
  • 00:33:28
    can give it a description if you want
  • 00:33:30
    I'm not going to bother and then hit
  • 00:33:32
    save on the top right so that's the job
  • 00:33:34
    saved and you'll see that there's no red
  • 00:33:36
    um warnings anymore this plus icon is
  • 00:33:39
    where you can add your nodes and the
  • 00:33:41
    first thing we need is a source and our
  • 00:33:42
    source is going to be the AWS glue data
  • 00:33:44
    catalog we're going to get that
  • 00:33:46
    customers table that we have already um
  • 00:33:48
    set up using raw data and then we will
  • 00:33:51
    have under tables customers raw select
  • 00:33:55
    that you'll see here that there's a data
  • 00:33:57
    preview section ready to go depending
  • 00:34:00
    on whether you've been into ETL or not
  • 00:34:02
    this can take up to 2 or 3 minutes to
  • 00:34:03
    start you'll see data preview
  • 00:34:06
    processing and then eventually you will
  • 00:34:07
    get here to the point that your data
  • 00:34:09
    preview is ready so you can actually see
  • 00:34:11
    the data as it currently sits and you
  • 00:34:13
    have the output schema as well for this
  • 00:34:16
    data what we're going to do is take it
  • 00:34:18
    and we're going to store it as parquet
  • 00:34:20
    format in S3 registering a new table in
  • 00:34:23
    the glue data catalog what we're also
  • 00:34:25
    going to do is add a transformation to
  • 00:34:28
    the table and in this case we're going
  • 00:34:29
    to add a process time so inside nodes
  • 00:34:32
    and go to transformation there are lots
  • 00:34:34
    and lots of different Transformations
  • 00:34:35
    that you can do with a node in this one
  • 00:34:38
    we're going to add a current time
  • 00:34:39
    stamp as our processed date time stamp
  • 00:34:43
    so click that node and it will
  • 00:34:44
    automatically join up if it does not
  • 00:34:47
    automatically join it will look like
  • 00:34:50
    this on your on your canvas as it
  • 00:34:53
    currently sits you just want to click on
  • 00:34:55
    and then choose the parent node and
  • 00:34:56
    select the AWS glue data catalog and
  • 00:34:58
    you can see there that it's ready to go
  • 00:35:01
    you need to give this then an output
  • 00:35:03
    column so I'm just going to call this
  • 00:35:04
    processed time stamp keep things nice
  • 00:35:07
    and simple and we'll just leave it as
  • 00:35:09
    default so you can see here then with
  • 00:35:11
    the output schema it's picking up the
  • 00:35:12
    new addition that we have as processed time
  • 00:35:15
    stamp then we need a Target so into
  • 00:35:18
    Target and we're going to use S3 as our
  • 00:35:20
    Target so perfect we've selected S3 as
  • 00:35:23
    our Target you can see here it's already
  • 00:35:25
    got the format parquet selected we're going
  • 00:35:28
    to do Snappy we do need to pick a
  • 00:35:30
    location to save this in S3 so again
  • 00:35:33
    let's go into that bucket that we've
  • 00:35:34
    created for the purposes of this
  • 00:35:36
    course inside that process data sorry
  • 00:35:40
    select the process data I do that all
  • 00:35:42
    the time hit choose and then we want to
  • 00:35:44
    call this
  • 00:35:46
    customers uh underscore processed and
  • 00:35:50
    then we want a forward slash on the end
  • 00:35:52
    of that as well so that's the location
  • 00:35:53
    we're going to save the data to we do
  • 00:35:56
    want to create a new table in the data
  • 00:35:58
    catalog and on subsequent runs update
  • 00:36:01
    the schema to add new partitions that's
  • 00:36:04
    exactly what we want to do our database
  • 00:36:06
    is going to be processed data uh
  • 00:36:09
    processed um sorry processed data for
  • 00:36:11
    the database um we have to give the
  • 00:36:13
table a name so let's just call this
  • 00:36:15
then customers_processed so that's
  • 00:36:17
    the name of the table we're going to
  • 00:36:19
    create and then we want to add a
  • 00:36:21
    partition key as well and our partition
  • 00:36:23
key will be on the processed date time
  • 00:36:26
    stamp then we want to save everything
  • 00:36:28
    that we've just done and that is our ETL
  • 00:36:31
    job we're going to take that data we're
  • 00:36:32
    going to add a time stamp we're going to
  • 00:36:34
transform it into Parquet we're going to
  • 00:36:36
    save it to S3 we're then going to create
  • 00:36:38
    a glue data catalog table called um
  • 00:36:41
customers_processed that's going to sit in
  • 00:36:43
the processed data database and also
  • 00:36:47
    we're going to create a partition key
  • 00:36:49
called a date timestamp.
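For reference, the visual job you have just built generates a PySpark script behind the scenes. A minimal hand-written sketch of the same extract, transform, and load flow might look like the following, assuming placeholder names for the bucket, databases, and tables from this course's setup (adjust them to match your own):

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import current_timestamp

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the raw customers table registered in the Glue Data Catalog
customers = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data", table_name="customers_raw"
)

# Transform: add a processed_timestamp column with the current time
with_ts = customers.toDF().withColumn("processed_timestamp", current_timestamp())

# Load: write Snappy-compressed Parquet to S3, partitioned by the new column,
# and create/update the customers_processed table in the Data Catalog
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://aws-glue-course-bucket/processed_data/customers_processed/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["processed_timestamp"],
)
sink.setCatalogInfo(
    catalogDatabase="processed_data", catalogTableName="customers_processed"
)
sink.setFormat("glueparquet", compression="snappy")
sink.writeFrame(DynamicFrame.fromDF(with_ts, glue_context, "customers_processed"))

job.commit()
```

The script the visual editor generates will differ in detail, but the source, transform, and target nodes map onto these three sections.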
  • 00:36:51
okay perfect so let's run this and
  • 00:36:55
    let's save this then when you're
  • 00:36:57
    actually ready to run you can go to runs
  • 00:36:59
    here or you can click run there whatever
  • 00:37:01
    one you want to do then let's run job
  • 00:37:04
    that'll kick off the job it has to spin
  • 00:37:06
    up a few containers and a few other
  • 00:37:07
    things in behind the scenes this might
  • 00:37:09
    take a few minutes to process in total
  • 00:37:11
    cuz it's the first go you just want to
  • 00:37:12
    hit that refresh to see what's happening
  • 00:37:14
    you can also see down here what's going
  • 00:37:16
    on with the job itself you can see the
  • 00:37:18
    input arguments and everything else on
  • 00:37:20
    the UI I'm going to pause this here and
  • 00:37:23
    let it do its thing and then once it's
  • 00:37:25
    processed successfully we'll pick it
  • 00:37:26
back up you can see it's off and running
  • 00:37:28
    14 16 seconds in
  • 00:37:31
    already okay you can see after 1 minute
  • 00:37:34
4 seconds precisely it succeeded, so that's fantastic.
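If you would rather kick the job off outside the console, a small boto3 sketch like the one below can start a run and poll its state; the job name here is just a placeholder for whatever you called your visual ETL job:

```python
import time

import boto3

glue = boto3.client("glue")
JOB_NAME = "customers-processed-etl"  # placeholder for your job's name

# Start the run; the first run is slower because Glue has to provision
# Spark capacity behind the scenes
run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]

# Poll until the run reaches a terminal state
while True:
    state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(15)
```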
  • 00:37:37
let's go have a look at
  • 00:37:40
    the glue data catalog so if we go into
  • 00:37:42
databases we go through processed data
  • 00:37:44
    you'll see that we actually have some
  • 00:37:46
    table data there we can click on the
  • 00:37:48
    table and we can see that we have our
  • 00:37:50
partition or full name, customer ID and
  • 00:37:53
    last name if we click on partitions we
  • 00:37:55
    should be able to see that we have a
  • 00:37:56
    partition for when I just ran that and
  • 00:37:59
    also if we go back into the schema um we
  • 00:38:03
    should also be able to go to location by
  • 00:38:05
    clicking this URL here that'll take us
  • 00:38:08
    in that is where our data is currently
  • 00:38:10
sitting as a Parquet file excellent.
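As a quick sanity check, you can also read the new table and its partitions back with boto3; a rough sketch, assuming the database and table names used in this demo, might look like this:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names from this demo -- adjust to your own catalog
DATABASE = "processed_data"
TABLE = "customers_processed"

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
print(table["StorageDescriptor"]["Location"])  # S3 path holding the Parquet files
print([col["Name"] for col in table["StorageDescriptor"]["Columns"]])

# One partition per processed_timestamp value written by the job
partitions = glue.get_partitions(DatabaseName=DATABASE, TableName=TABLE)
for partition in partitions["Partitions"]:
    print(partition["Values"], partition["StorageDescriptor"]["Location"])
```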
  • 00:38:14
as part of this demo as well I also set
  • 00:38:16
up Athena so we could look at data so if
  • 00:38:18
you notice there will be a workgroup
  • 00:38:20
selected for Athena so inside Athena if you
  • 00:38:23
    go there you'll see that there's a
  • 00:38:25
workgroup called AWS glue
  • 00:38:28
course Athena workgroup so you want to
  • 00:38:30
    select that on the top right so once
  • 00:38:32
    you've selected that working group
  • 00:38:34
that sets the location to save our Athena
  • 00:38:36
    results you can have a look at the data
  • 00:38:38
    as well by going to process data then we
  • 00:38:40
    should under tables have our table and
  • 00:38:43
    the simplest way is just to click those
  • 00:38:44
    three dots there and go preview table
  • 00:38:47
this will query our data table and bring
  • 00:38:48
us back the data in Parquet format you can
  • 00:38:50
see there that the processed timestamp is
  • 00:38:53
also there for us.
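The same preview can be run programmatically with boto3; in the sketch below the database, table, and workgroup names are placeholders for whatever your CloudFormation stack created:

```python
import boto3

athena = boto3.client("athena")

# Equivalent of "Preview table" in the Athena console: a SELECT with a LIMIT
response = athena.start_query_execution(
    QueryString='SELECT * FROM "customers_processed" LIMIT 10;',
    QueryExecutionContext={"Database": "processed_data"},
    WorkGroup="aws-glue-course-athena-workgroup",  # placeholder workgroup name
)

# You would then poll get_query_execution and fetch rows with get_query_results
print(response["QueryExecutionId"])
```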
  • 00:38:56
that's how you do it with the processed data for customers you
  • 00:38:59
    can do exactly the same with the order
  • 00:39:01
    data and you would do the same thing
  • 00:39:03
    again where you go in and you start with
  • 00:39:05
    an ETL job and you would say visual ETL
  • 00:39:08
    and start creating the same thing except
  • 00:39:10
this time with the glue data catalog you
  • 00:39:11
    would select it and you would have raw
  • 00:39:14
    data and your table this time oops sorry
  • 00:39:17
    raw data and the table this time would
  • 00:39:19
    be orders so you can do the same thing
  • 00:39:21
    with orders as you did before um with
  • 00:39:24
    customers I won't do it for the purposes
  • 00:39:26
of this demo that's a bit of homework
  • 00:39:27
for you but go ahead and do exactly the
  • 00:39:29
    same thing for the orders data and you
  • 00:39:32
    will be off and running creating your
  • 00:39:33
    second ETL job AWS glue data quality is
  • 00:39:37
    a relatively new feature in fact it was
  • 00:39:39
marked with new when I'm filming this
  • 00:39:41
    video so you will see it marked with new
  • 00:39:42
on the console this helps you monitor
  • 00:39:45
the quality of your data via the data quality
  • 00:39:47
definition language using an open source
  • 00:39:50
project called Deequ we can actually go
  • 00:39:52
    into the console in a little second
  • 00:39:53
you'll see in action it will formulate
  • 00:39:55
some rules for us show us this DQDL
  • 00:39:58
    language in operation and then we can
  • 00:40:00
    alter that language as well or remove
  • 00:40:02
    rules to suit our data quality this is
  • 00:40:05
    so when we're using glue we can perform
  • 00:40:07
    data quality checks on data that's
  • 00:40:08
    coming in or data that we actually
  • 00:40:10
    create through ETL and then apply that
  • 00:40:13
governance through code let's get started and
  • 00:40:15
    have a little look in the console of how
  • 00:40:17
    AWS glue data quality
  • 00:40:20
    works we are going to take a quick look
  • 00:40:24
    at glue data quality it's a new feature
  • 00:40:26
and it sits in the glue data catalog and
  • 00:40:29
    you want to go to tables so let's pick a
  • 00:40:31
    table let's pick the orders table it's
  • 00:40:33
    kind of the most detailed so hopefully
  • 00:40:35
    we'll get something and you can see up
  • 00:40:37
    here they've got data quality new it's
  • 00:40:39
    actually still under new and has a video
  • 00:40:42
so watch the video if you want
  • 00:40:44
    but what we want to do is actually get
  • 00:40:47
    some uh data quality rules for this
  • 00:40:50
    table so if we go run history we'll see
  • 00:40:53
    that there's probably nothing on this
  • 00:40:55
table whatsoever is there any
  • 00:40:58
    recommendation runs no we haven't right
  • 00:41:01
    so we want to go recommended
  • 00:41:03
    rules and we want to click recommended
  • 00:41:06
rules choose the IAM role we've
  • 00:41:08
    got one for it so this is where glue is
  • 00:41:11
going to go off and create rules for us
  • 00:41:13
    that it recommends um that's it off and
  • 00:41:15
    running so this is using a bit of AI bit
  • 00:41:17
    of machine learning looking at our data
  • 00:41:19
    and coming up with the rules for us so
  • 00:41:22
I'll pause the video this will probably take
  • 00:41:24
    a few minutes and then we'll see some of
  • 00:41:25
the rules hopefully that glue decides
  • 00:41:27
    should be applied to our order table in
  • 00:41:29
    terms of data
  • 00:41:32
    quality okay and after a few minutes you
  • 00:41:34
    can see that it has a successful run if
  • 00:41:37
    you click into the run you will see the
  • 00:41:40
    recommended rules that it has um created
  • 00:41:43
or recommended for this data set you
  • 00:41:46
    can copy these rules and then what you
  • 00:41:48
    want to do is go back into the tab
  • 00:41:51
    before you want to go to the rules
  • 00:41:54
    themselves uh by clicking these and then
  • 00:41:58
    you want to hit copy then we want to go
  • 00:42:00
    back out to the table I'm sure there's a
  • 00:42:02
    quicker way to do this than what I'm
  • 00:42:03
doing we want to go to glue data quality you
  • 00:42:06
    want to scroll down and you'll see that
  • 00:42:08
    you have the create data quality rules
  • 00:42:10
    you click in and then you can just paste
  • 00:42:12
    in what we lifted there so I'm just
  • 00:42:14
    going to go control V and you want to
  • 00:42:16
    get rid of that and you want to get rid
  • 00:42:18
    of that one at the bottom and then you
  • 00:42:20
    can start to mess around with this so
  • 00:42:22
you can see that you know your row count
  • 00:42:23
loaded should be between
  • 00:42:25
    these two you can adjust that if you
  • 00:42:26
    want
  • 00:42:27
    it says make sure that everything has a
  • 00:42:29
    sales order in it looking at the
  • 00:42:31
    standard deviation this and this again
  • 00:42:33
    let's just say we wanted this to be
  • 00:42:34
    bigger or smaller so we could just go
  • 00:42:36
    100 and we know that our maximum is
  • 00:42:38
    going to be 10,000 you could go 10,000
  • 00:42:41
    and this is how you can start to build
  • 00:42:43
    up those rule sets you save the rule set
  • 00:42:45
    you give the rule set a name so I'm just
  • 00:42:46
    going to call this
  • 00:42:49
    dq1 and then it will apply this rule set
  • 00:42:51
    to the data in the table it will then
  • 00:42:53
    start to alert you when these rules are
  • 00:42:55
    broken again you can see that like
  • 00:42:57
    column values you can add all the values
  • 00:42:59
    into the columns and using this Library
  • 00:43:01
    you can start to build those rules to
  • 00:43:03
    ensure the data quality of your data AWS
  • 00:43:06
    glue scheduling um we will take a look
  • 00:43:09
    at this in just a second but you need to
  • 00:43:10
    know about AWS triggers this is how you
  • 00:43:12
    initiate an ETL job or actually a
  • 00:43:15
    crawler job I'll show you both on the
  • 00:43:17
    console and then this can be defined on
  • 00:43:19
    a schedule a time or event it's up to
  • 00:43:21
    you it can be done on demand as well
  • 00:43:23
we'll see the options but a trigger, i.e. to
  • 00:43:26
    trigger something is how we start the
  • 00:43:28
    workflows or how we start an ETL job and
  • 00:43:30
    then we have AWS glue workflows this
  • 00:43:32
lets us create complex extract
  • 00:43:35
    transform and load activity so we can
  • 00:43:37
    run a crawler we can run our ETL script
  • 00:43:39
    then we can run another ETL script when
  • 00:43:42
    we're on the console right now I will
  • 00:43:43
    show you how to create a trigger I will
  • 00:43:45
    show you how to create a workflow these
  • 00:43:47
    are really useful if you're only using
  • 00:43:49
    AWS glue if you need to use other things
  • 00:43:52
    outside of glue like EMR or Athena then
  • 00:43:54
    you'll have to think of something else
  • 00:43:56
like managed workflows for Apache Airflow that's
  • 00:43:59
    what I commonly use but if you're only
  • 00:44:01
    in the AWS glue environment and you only
  • 00:44:03
need that tool set then AWS glue
  • 00:44:05
workflows is a totally viable tool for
  • 00:44:07
    scheduling ETL jobs just a brief mention
  • 00:44:11
um now we've come out of workflows and I
  • 00:44:12
know I keep mentioning it Airflow is
  • 00:44:14
    one that you can use step functions is
  • 00:44:15
one you can use and EventBridge these
  • 00:44:17
    are three very common patterns I see for
  • 00:44:19
scheduling glue jobs bear them in mind
  • 00:44:22
    but workflows works perfectly well if
  • 00:44:24
    you are only doing things inside AWS
  • 00:44:27
    glue okay let's look at orchestration
  • 00:44:30
    within the glue console um if you're
  • 00:44:33
    just using glue this is fantastic for
  • 00:44:35
    orchestration so if you're just running
  • 00:44:36
    glue jobs the catalog doing crawlers
  • 00:44:40
    cool this is a place for you to come do
  • 00:44:42
    some orchestration if you're knitting
  • 00:44:44
    other AWS Services together my
  • 00:44:47
    recommendation is to go look at other
  • 00:44:48
    orchestration particularly when it comes
  • 00:44:50
to the managed workflows for Apache Airflow
  • 00:44:53
    that's my go-to at the moment when I'm
  • 00:44:54
integrating EMR, Athena, and glue all together
  • 00:44:57
    but if you're just using AWS glue this
  • 00:44:59
    is a great tool so go into workflows and
  • 00:45:02
    you want to go to add a workflow I'm
  • 00:45:04
    just going to call this test
  • 00:45:06
    workflow and create the workflow itself
  • 00:45:10
    you can click into the workflow and then
  • 00:45:11
    we need to start adding things that are
  • 00:45:14
done we want to hit add trigger you can see
  • 00:45:17
    that we don't currently have any
  • 00:45:19
    triggers so we can create triggers from
  • 00:45:20
    the triggers page if you want so if you
  • 00:45:22
    hit add trigger we can say that this
  • 00:45:24
    trigger can be crawler
  • 00:45:27
    um
  • 00:45:30
glue tutorial then we're just going to
  • 00:45:34
    do it on demand um we're going to hit
  • 00:45:36
    next it's going to ask you what our
  • 00:45:38
    Target resource so we'll hit resource
  • 00:45:40
    type we'll hit crawler select a crawler
  • 00:45:42
    AWS glue tutorial and hit add go next
  • 00:45:46
    and then hit create so there's a crawler
  • 00:45:49
    then we want to go back into our
  • 00:45:52
    workflow we can click on our workflow we
  • 00:45:54
    can hit add trigger we should have our
  • 00:45:56
    crawler sitting to run on demand and
  • 00:45:58
    we'll hit add and you can see here that
  • 00:46:00
    we've added the trigger to the workflow
  • 00:46:02
    if you just go then uh if we just look
  • 00:46:05
    at this and hit run workflow it will
  • 00:46:08
    start to run the workflow that workflow
  • 00:46:11
    is going to go and crawl our data we can
  • 00:46:14
    then build dependencies across here so
  • 00:46:16
we can put in the next node to be
  • 00:46:17
actually run the AWS glue job once the
  • 00:46:20
crawler completes and you can start to
  • 00:46:22
    build more complicated workflows to do
  • 00:46:24
that you would click on the crawler like
  • 00:46:27
    this add a trigger add a new one so give
  • 00:46:29
    this a name so let's just call this uh
  • 00:46:32
    glue job
  • 00:46:33
    ETL this is an event start after the
  • 00:46:36
    event and hit add then we would want to
  • 00:46:39
    go in select the crawler or select the
  • 00:46:42
    job that we want to start so in this
  • 00:46:43
    case it would be the process job and hit
  • 00:46:45
    add and then you can start to build out
  • 00:46:48
    those different things as you go so now
  • 00:46:51
once the crawler is run we're going to do
  • 00:46:52
    the job and again you can actually hit
  • 00:46:55
    run workflow for that to go and kick off
  • 00:46:57
    as well and if you had many jobs or many
  • 00:46:59
    crawlers you could start to build a more
  • 00:47:01
    complex workflow for scheduling as I
  • 00:47:04
    said this is just a little taster of
  • 00:47:07
    what you can do with it if you're
  • 00:47:08
working within glue only it's totally valid
  • 00:47:11
to use this as your orchestrator.
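The same workflow and triggers can be defined with boto3 as well; here is a rough sketch of the crawler-then-job pattern, with every name a placeholder for the ones used in this demo:

```python
import boto3

glue = boto3.client("glue")

# Workflow container for the orchestration
glue.create_workflow(Name="test-workflow")

# On-demand trigger that starts the crawler when the workflow is run
glue.create_trigger(
    Name="crawler-glue-tutorial",
    WorkflowName="test-workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "aws-glue-tutorial"}],
)

# Conditional trigger that starts the ETL job once the crawler succeeds
glue.create_trigger(
    Name="glue-job-etl",
    WorkflowName="test-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,  # activate it so it can fire during workflow runs
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "aws-glue-tutorial",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "customers-processed-etl"}],
)

# Equivalent of hitting "Run workflow" in the console
glue.start_workflow_run(Name="test-workflow")
```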
  • 00:47:13
AWS glue DataBrew I leave this in cuz I
  • 00:47:15
    covered it last time you'll notice when
  • 00:47:17
I jump onto the console it actually now
  • 00:47:19
    sits separately away from AWS glue it's
  • 00:47:22
    its own service this is for visual data
  • 00:47:24
    preparation that makes it easier for non
  • 00:47:26
engineers data analysts or someone who just
  • 00:47:28
    wants to quickly look at data in quite
  • 00:47:31
    frankly what looks like an Excel uh
  • 00:47:32
    workbook to clean and normalize data I
  • 00:47:35
    am always hesitant to put this into
  • 00:47:37
    production I'm hesitant for data
  • 00:47:38
    Engineers to use it I use it to look at
  • 00:47:40
    data quickly and manipulate it before I
  • 00:47:42
    go write some data engineering code and
  • 00:47:45
put it through more concrete CI/CD type
  • 00:47:49
    processes for devops but it is a good
  • 00:47:52
tool or a Swiss Army knife to have in
  • 00:47:54
    your arsenal when you need a look at
  • 00:47:55
    data or you want to give the option for
  • 00:47:57
    people who look at data that can't code
  • 00:47:59
let's look at this in action I will use
  • 00:48:01
the sample data to help us see
  • 00:48:05
    how this works when we jump on to the
  • 00:48:07
    AWS console now interestingly with the
  • 00:48:10
    new um AWS console they've actually
  • 00:48:13
moved glue DataBrew out of it so you
  • 00:48:16
    actually have to type in AWS glue data
  • 00:48:18
    Brew so glue data usually does it and it
  • 00:48:20
    comes up as a completely separate
  • 00:48:22
service page so we'll check this out
  • 00:48:24
    there's a video again about how it works
  • 00:48:25
so feel free to watch that
  • 00:48:27
so we want to create a sample
  • 00:48:30
    project we're going to do UN resolution
  • 00:48:32
    votes um you want to select create a new
  • 00:48:35
IAM role we'll call this one just AWS
  • 00:48:38
    course glue and then it'll add the rest
  • 00:48:41
    of it um that means we can delete it
  • 00:48:43
    once we're done so then you want to hit
  • 00:48:45
create project this will run off in the
  • 00:48:46
background create the IAM role for you
  • 00:48:48
load up the data into the DataBrew
  • 00:48:53
UI for us and we can start to play around
  • 00:48:55
    with it if you haven't been here before
  • 00:48:57
    it does look a little bit like Excel or
  • 00:48:59
    smart sheets um and that's really the
  • 00:49:02
idea it's an area for non data
  • 00:49:05
    Engineers maybe or like data Engineers
  • 00:49:07
that want to get to grips with data
  • 00:49:09
    quickly but but don't want to do it
  • 00:49:12
    through code
  • 00:49:13
um yeah I would generally use it um
  • 00:49:17
    if I wanted to look at something quickly
  • 00:49:19
    I would let non-technical users use it
  • 00:49:21
    as well I'm always just a little bit
  • 00:49:24
    apprehensive about going to like create
  • 00:49:26
the job out of it and letting them
  • 00:49:28
    run glue code if there's a valid
  • 00:49:30
    business reason sure but what we don't
  • 00:49:31
    want is getting away from the best
  • 00:49:33
    practices of cicd and maintaining those
  • 00:49:36
    devops pipelines when it comes to our
  • 00:49:38
    data engineering best practice but again
  • 00:49:42
    there's nothing wrong with it um
  • 00:49:43
    provided that it's done in a safe and
  • 00:49:45
    agile and secure way you can see that
  • 00:49:47
    this is loading up 51% I'm just going to
  • 00:49:49
    pause the video until it is
  • 00:49:52
    done okay so after a few minutes you can
  • 00:49:55
    see that it loads up rows and starts to
  • 00:49:58
    give a look at what we have you can see
  • 00:50:00
    at the top of each one it tells you
  • 00:50:02
    information like the number of distinct
  • 00:50:03
    values the kind of range that's in there
  • 00:50:06
and the percentages of each of the
  • 00:50:08
    values and it's kind of using it well it
  • 00:50:10
is using AI and ML behind the scenes to
  • 00:50:12
    populate a lot of this data you can look
  • 00:50:14
    at the schema as well by clicking schema
  • 00:50:16
    and it'll tell you all the different
  • 00:50:18
    columns that you have and you get a
  • 00:50:20
    profile of the data if you wanted on top
  • 00:50:22
    of it but I'm just going to go back to
  • 00:50:23
    grid view if you were interested then
  • 00:50:25
you can start to filter, sort the columns
  • 00:50:27
format the columns add in things as well
  • 00:50:29
    so let's just do a quick filter you want
  • 00:50:31
    to add in a filter and you want to do it
  • 00:50:33
    by condition and then you want to do
  • 00:50:36
    where it contains and then you can
  • 00:50:38
    select the source column so let's just
  • 00:50:39
    say ours is resolution and we want to
  • 00:50:43
    make sure that the column contains
  • 00:50:46
    contains let's go contains and a value
  • 00:50:48
    of let's say we only wanted 66 oh that's
  • 00:50:51
    55 66 you apply that and it will apply
  • 00:50:55
to the column that value or that
  • 00:50:58
    condition and you can see it's off
  • 00:50:59
running and if it has 66 in it it's
  • 00:51:01
    going to keep that and that's how it
  • 00:51:03
    does it stores this as a recipe so you
  • 00:51:05
    can see here that's the first step you
  • 00:51:07
    can then add another step so let's say
  • 00:51:09
    we wanted
  • 00:51:11
to I'm just going to dedupe we want to
  • 00:51:14
remove duplicate values in columns
  • 00:51:16
    you can say well what columns um let's
  • 00:51:18
say we wanted to dedupe on the first column
  • 00:51:20
    which is assembly session all rules and
  • 00:51:23
apply so it's going to dedupe the column
  • 00:51:25
    here based on that idea it's going to
  • 00:51:27
    select the first one it sees brings it
  • 00:51:29
down to fewer rows or a smaller number
  • 00:51:31
of rows once it's deduped and you can see
  • 00:51:33
    here then that you have your recipe you
  • 00:51:35
    can publish that recipe and or
  • 00:51:37
    alternatively you can import a new one
  • 00:51:39
    or download it and then once you have
  • 00:51:41
this recipe you can apply it to
  • 00:51:43
data sets over and over again.
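To make the two recipe steps concrete, here is what they amount to, sketched in pandas rather than through the DataBrew API; the file and column names are assumptions based on the sample dataset:

```python
import pandas as pd

# Placeholder export of the UN resolution votes sample dataset
df = pd.read_csv("un_resolution_votes.csv")

# Step 1: filter by condition -- keep rows where the resolution column contains "66"
df = df[df["resolution"].astype(str).str.contains("66", na=False)]

# Step 2: remove duplicate values on the first column (assembly session),
# keeping the first occurrence seen
df = df.drop_duplicates(subset=[df.columns[0]], keep="first")

print(len(df))
```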
  • 00:51:45
the best thing to do is sit and play around with these
  • 00:51:46
    different functions if you're interested
  • 00:51:49
    again for me as I said glue data Brew
  • 00:51:52
    really get to grips with it for users
  • 00:51:54
    that don't code don't write the code I
  • 00:51:57
kind of give it to my non-technical
  • 00:51:58
    users when it makes sense I use it to
  • 00:52:01
    look at data really quickly I'm always
  • 00:52:03
    just a little bit apprehensive about
  • 00:52:05
    anyone putting it into a production
  • 00:52:07
    pipeline for data engineering sure
  • 00:52:09
    business users who want to play around
  • 00:52:10
with data create jobs out of that data
  • 00:52:12
    or bigger data sets and use it in their
  • 00:52:14
day-to-day business cool but for those refined
  • 00:52:16
CI/CD pipelines with devops processes I'm
  • 00:52:20
    not such a fan of using glue data Brew
  • 00:52:22
    for those
  • 00:52:23
    purposes okay folks that concludes the
  • 00:52:26
tutorial on AWS glue we've kind of taken
  • 00:52:28
    a look at what AWS glue is we spent some
  • 00:52:30
    time looking at the AWS glue data
  • 00:52:32
    catalog we then created ETL jobs we've
  • 00:52:35
    learned how to schedule those ETL jobs
  • 00:52:37
    and we've also looked at glue data
  • 00:52:39
    quality and glue data brew as well one
  • 00:52:42
final reminder really a like and
  • 00:52:44
    subscribe to this channel it helps me
  • 00:52:46
    out in the description below I've also
  • 00:52:48
    left a link to the exams for the AWS
  • 00:52:51
    data engineering certification if you're
  • 00:52:53
    interested in taking that there's also a
  • 00:52:56
    YouTube tutorial on this channel for
  • 00:52:58
    that certification and until next time
  • 00:53:00
    folks thanks for watching
Tags
  • AWS Glue
  • ETL
  • Data Catalog
  • Crawlers
  • Data Quality
  • DataBrew
  • Scheduling
  • AWS Services
  • Data Engineering
  • Tutorial