AWS Glue Tutorial for Beginners [NEW 2024 - FULL COURSE]
Summary
TL;DR: This video tutorial by Johnny Chivers provides an updated overview of AWS Glue, focusing on its features and functionality for ETL processes. It covers the AWS Glue Data Catalog, the creation of databases and tables, and the building of ETL jobs using both visual and code-based methods. The tutorial includes practical steps for setting up permissions, uploading data to S3, and using crawlers to automate table creation. It also discusses data quality monitoring and scheduling options for ETL jobs, making it a comprehensive guide both for beginners and for those looking to refresh their knowledge of AWS Glue.
Key takeaways
- 🔍 AWS Glue is a fully managed ETL service.
- 📚 The Glue Data Catalog stores metadata about data sources.
- ⚙️ ETL jobs can be created visually or through code.
- 🤖 Crawlers automate the discovery of data and table creation.
- 📂 Partitions improve data organization and query efficiency.
- ✅ Data quality features help monitor and enforce data standards.
- 🖥️ Glue DataBrew allows for visual data preparation without coding.
- ⏰ ETL jobs can be scheduled using triggers in AWS Glue.
- 🔗 AWS Glue integrates with various AWS services for data processing.
- 📈 The tutorial is suitable for both beginners and experienced users.
Timeline
- 00:00:00 - 00:05:00
Introduction to AWS Glue, highlighting the need for an updated tutorial due to changes in the AWS console and new features in Glue since the last video.
- 00:05:00 - 00:10:00
Overview of AWS Glue as a fully managed ETL service, explaining its components like the Glue Data Catalog and the flexible scheduler, and the importance of understanding its functionality for data engineering.
- 00:10:00 - 00:15:00
Instructions on setting up the AWS environment, including downloading necessary files from GitHub and using CloudFormation scripts to create required resources like S3 buckets and IAM roles.
- 00:15:00 - 00:20:00
Detailed steps on creating folders in the S3 bucket for organizing raw and processed data, and uploading CSV files for customers and orders, maintaining the folder structure as specified.
- 00:20:00 - 00:25:00
Explanation of the Glue Data Catalog as a persistent metastore for metadata, emphasizing that it does not store physical data but rather references to data locations and schemas.
- 00:25:00 - 00:30:00
Demonstration of creating a database in the Glue Data Catalog and manually adding a table for customers, including defining the schema and data types for the table.
- 00:30:00 - 00:35:00
Introduction to Glue Crawlers, which automate the process of populating the Glue Data Catalog with table definitions, and running a crawler to create the orders table from existing data.
- 00:35:00 - 00:40:00
Discussion on the concept of partitions in AWS Glue, explaining how they can optimize data storage and querying in S3 by organizing data into logical folders based on specific criteria.
- 00:40:00 - 00:45:00
Overview of Glue connections for securely storing connection properties to various data stores, and the importance of using these connections in ETL scripts to avoid hardcoding sensitive information.
- 00:45:00 - 00:53:03
Introduction to AWS Glue ETL jobs, explaining the visual ETL process and how to create a job that extracts, transforms, and loads data, including setting up the job parameters and defining the data source and target.
Video Q&A
What is AWS Glue?
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data preparation for analytics.
What is the Glue Data Catalog?
The Glue Data Catalog is a persistent metadata repository that stores information about data sources and targets.
How do I create an ETL job in AWS Glue?
You can create an ETL job in AWS Glue using the visual ETL interface or by writing code.
What are crawlers in AWS Glue?
Crawlers are programs that automatically discover and catalog data in AWS Glue.
What is the purpose of partitions in AWS Glue?
Partitions help organize data in S3, allowing for more efficient querying and processing.
How can I monitor data quality in AWS Glue?
AWS Glue provides data quality features that allow you to define and monitor data quality rules.
What is Glue DataBrew?
Glue DataBrew is a visual data preparation tool that allows users to clean and normalize data without coding.
How do I schedule ETL jobs in AWS Glue?
You can schedule ETL jobs using triggers in AWS Glue, which can be defined based on time or events.
What is the difference between Glue Data Catalog and Glue DataBrew?
The Glue Data Catalog is for metadata management, while Glue DataBrew is for visual data preparation.
Can I use AWS Glue with other AWS services?
Yes, AWS Glue can integrate with various AWS services like S3, RDS, and Athena.
- 00:00:00 Hi folks, welcome back to the channel. For those of you that are new here, I'm Johnny Chivers, and in today's video we're going to take a look at AWS Glue. There's already a video on AWS Glue on the channel, however that was filmed over three years ago now and has been out for about two and a half years. The AWS console has since updated and feels completely different, and in fact there are new features within the Glue service itself that aren't covered in that original video. A lot of you have requested an update to that video, so we're going to do it today. It'll follow the same format: we'll go through what Glue is, we'll take a look at the Glue Data Catalog, and we'll start building ETL jobs as well. Everything you need for this tutorial is located at the GitHub link in the description below; that includes the data and the slides that I'm going to use. I'd really appreciate a like and subscribe to the channel because that helps me out. With all that being said, let's jump onto the computer, take a look at what AWS Glue is, then do some setup work using the code located in that repo on GitHub. So again, download that repo (highly important), join me on the computer, and we'll get started.
- 00:01:13 Okay folks, just to remind you again, I will make these slides available to download on the GitHub if you want to download them, annotate them, and follow along. The first thing we need to cover is: what is AWS Glue? It's a fully managed ETL service; that's the core value proposition. Fully managed means AWS will do the heavy lifting for you when it comes to infrastructure. It's a Spark, or actually also a Python, ETL engine, so it has both Spark and Python. Usually you would only see Spark mentioned in the documents, and a lot of people talk about Spark because it's big data, but you can use Python as well. It consists of a central metadata repository called the Glue Data Catalog; we will cover this in depth and set up some databases and tables during this tutorial, because it is a key component. It also has a flexible scheduler; again, we'll cover this in the tutorial and I'll show you how it works. It's not the only way to schedule things inside Glue, but it's certainly good to know that you don't have to leave the Glue environment to do it.
On the right-hand side we have a nice little diagram provided by AWS that shows what Glue does. I'm actually going to start at the bottom, because this is the fundamentals and what people think of when we say Glue: you have a data source, you can extract from that data source, you can transform the data through a script that runs on the serverless ETL engine in Python or in Spark, and then you load a data target. There are numerous data sources and numerous data targets you can use, and that's where the Glue Data Catalog comes in: you store references to your data source or your data target in the Glue Data Catalog. To populate the Glue Data Catalog you can use things like crawlers or the management console; we will do both in the tutorial. Then, once you have things set up in the Glue Data Catalog, and you have your scripts, AWS Glue jobs, or ETL pipelines, you can use a schedule or an event to run them; again, we'll cover that in the tutorial.
- 00:03:07 I always think it's useful to put in: why use AWS Glue? I know that for the purposes of this video we are just going to do the tutorial, but there are many different options for ETL in AWS, and as you learn these as a data engineer, an architect, or a developer, it's good to know when to use the correct tool for the correct job. And if you're doing this as part of the AWS data engineer certification (I'll also leave a link to that video in the description below), it's good to know this as well. AWS Glue offers a fully managed, serverless ETL tool. This removes the barrier to entry when there is a requirement for an ETL service in AWS, and what that means is that you don't have to manage the underlying infrastructure. To use Spark you would normally need to spin up clustered compute in a highly parallelized environment; using AWS Glue you don't need to think about any of that. It just runs it for you on managed services behind the scenes. So again, there is nothing for you to provision; it is serverless; AWS is doing the heavy lifting. You need to write the code and it will do the ETL.
- 00:04:16 Okay, what we'll do now is jump onto the console and get hands-on with a little bit of setup work. I have written some CloudFormation scripts for this to help us set up permissions and some of the data repositories that we'll need throughout this tutorial. Follow along, and make sure you have already downloaded that GitHub repo; if not, I'll show you how to do it again in the middle of the setup work. We will be using it throughout this tutorial.
So, throughout this course, as I referenced in the intro, we will be using this GitHub repo. For this section of setup we'll be using the code located in the code folder, in setup-code.yaml, and then, once that has completed successfully, we'll be uploading the data from the data folder. The easiest thing to do, as I mentioned, is just to take a copy of this down onto your local machine: go to Code, then Download as ZIP, unzip that file once it's downloaded, and use the contents throughout. The instructions for what we're doing are also located there, but don't worry, I'm going to guide you step by step. That being said, we need to jump onto the AWS console, so please log in and you'll be greeted with the homepage. Hopefully you are familiar with the AWS console; if not, and this is your first time using it, don't worry, I'll point things out as we go along.
- 00:05:35 The first thing we're going to do is go to CloudFormation. This is where we can use our infrastructure-as-code to spin up the resources we need for the tutorial. The idea behind this is that I have created the templates for us, which means we won't have to do as much of the usual manual intervention, clicking through the console and doing all the permissions work that you see in many other tutorials. We want to go Create stack, with new resources, and we're going to upload a template: choose file, and go to the unzipped version of the folder that you've just downloaded from the GitHub (I repeat, the unzipped version), go to the code folder, and then setup-code.yaml; open that, then go Next. You need to give the stack a name; this can be anything you want, so I'm just going to call this johnny-chivers-glue-course-2024 (well, I can't spell my own name). The next thing you need to do is give the bucket a name; this is the S3 bucket for the course. Bear in mind that with AWS every bucket name has to be globally unique, so you will not be able to use the same name as me. I'm going to call mine aws-glue-course-johnny-chivers; hopefully that is unique. Again, I'll repeat: you need a unique name, and if yours isn't, it will tell you. Using your own name with some random numbers or digits might be the easiest thing. Accept and acknowledge at the bottom (we're going to run this using the root permissions that I'm currently logged in with) and then hit Submit. That will take a few minutes to go off and run; you can hit the refresh icon and see what the setup code is doing. As I mentioned in the header of the code (just go and have a look at the file), it's going to create a few things for us: one is an S3 bucket; the next is an IAM role for Glue, which we'll be using so we actually have permission to do anything; and the last thing is an Athena workgroup, which isn't technically part of this course, but we may use it to look at some of the data, so it's good to have it around as well.
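If you'd rather script this step than click through the console, a minimal boto3 sketch of the same stack creation might look like the following. The stack and bucket names mirror the video, but the BucketName parameter key is an assumption; check the template's Parameters section for the real one.

    import boto3

    # Create the course stack from the repo's setup-code.yaml (Ireland region,
    # as used in the video). Assumes valid AWS credentials are configured.
    cfn = boto3.client("cloudformation", region_name="eu-west-1")

    with open("code/setup-code.yaml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="johnny-chivers-glue-course-2024",
        TemplateBody=template_body,
        # The bucket name must be globally unique; the parameter key here is
        # a hypothetical name, so match it to the template's Parameters section.
        Parameters=[{"ParameterKey": "BucketName",
                     "ParameterValue": "aws-glue-course-your-name-123"}],
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # template creates an IAM role
    )

    # Block until creation finishes (took about 30 seconds in the video).
    cfn.get_waiter("stack_create_complete").wait(
        StackName="johnny-chivers-glue-course-2024")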
- 00:07:48 This shouldn't take too long to complete; in fact, it's completed already, so that took about 30 seconds. Let's check that a few things are there as they should be. If you go to Outputs you'll see that we have the S3 bucket created, so copy the name under Value. Let's go to S3 (I have other buckets in this account; you might only have one bucket, don't worry), paste it in, and make sure the bucket has been created. I should mention as well that I'm working in the Ireland region for this. So that is the bucket. If we go to IAM, we have the name of the IAM role that we're looking for, called AWS Glue course. In here, if we go to Roles on the left-hand side and search for the role, you can see that we have it; if we load the role up we should have a policy attached to it (great, we do), and inside that policy is all the good stuff that we created during the setup process. So that's the setup in terms of running the template. The next thing we need to do is actually upload some data and create some folders inside the S3 bucket itself.
- 00:08:54 So let's go back to that S3 bucket that we were just on, the one we know was successfully created; search for it (aws-glue-course-johnny-chivers was my bucket). If we go back onto the GitHub, into the main part of the readme file, you can see that we're going to upload the data. Ignore that for a second; what we need to do first is create a few things in the folder structure denoted there. So if we just copy the first one, raw data, I'll show you how to do this: go to Create folder, call the folder raw data, and then create the folder, like so. Next we have to do processed data; this will be quite quick, same process again: Create folder, paste it, and create the folder. Next is script location; we'll be needing these folders throughout the tutorial, so we just have to create them once and then that's it done: script location. (Oh, I hit the wrong button; back in, Create folder, folder name script location, create folder.) We'll also need a temp directory: Create folder, folder name temp directory, create folder. And then we will need an Athena folder, called athena, so go into Create folder again (this is the last one) and create athena. Perfect.
Back on the GitHub, you'll see that into the raw data folder we have to upload the customers and orders data, keeping the folder structure as denoted inside the GitHub. You can see here we'll have a folder called raw data, and then we're going to upload our customers and orders folders, which have CSV files inside them. This is quite simple. The easiest way to do it is to go into the actual raw data folder itself, then go Upload. Once here, the way I find easiest is to minimize this window a little and go get the location where you unzipped (and again, unzipped) that data; for me it's sitting in my user folder, under data; that's just where I have these unzipped locations. So find the customers and orders folders in the unzipped GitHub version and click and drag them across; that is just the easiest thing to do. You can see then that you get your customers folder, your orders folder, and your two CSVs as well. Click Upload; this will take a few seconds, there's not that much data there, so just sit tight. Once it's been done successfully you can close, and you can see that you have customers with a customers CSV and orders with an orders CSV. So we've successfully created (I'm going to blow this back up big, because we don't need it minimized anymore) this raw data folder, with our customers folder and CSV and our orders folder and orders CSV as well. That's the setup work complete for the tutorial. Make sure you follow along with this; it's pretty simple: run the CloudFormation template, then create these folder locations in the S3 bucket that was created, and then upload the data, maintaining the folder structure just as I have shown.
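For anyone who prefers to script this rather than drag and drop, here is a hedged boto3 sketch of the same folder setup and upload. The exact folder and file names below are assumptions, so match them to the repo's readme.

    import boto3

    s3 = boto3.client("s3")
    bucket = "aws-glue-course-your-name-123"  # your unique bucket name

    # S3 has no real folders; zero-byte keys ending in "/" act as placeholders,
    # which is what the console's Create folder button does.
    for folder in ["raw_data/", "processed_data/", "script_location/",
                   "temp_directory/", "athena/"]:
        s3.put_object(Bucket=bucket, Key=folder)

    # Upload the CSVs, preserving the raw_data/customers and raw_data/orders
    # structure from the repo (local paths assume the unzipped download).
    s3.upload_file("data/customers/customers.csv", bucket,
                   "raw_data/customers/customers.csv")
    s3.upload_file("data/orders/orders.csv", bucket,
                   "raw_data/orders/orders.csv")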
- 00:12:01 The AWS Glue Data Catalog. The Glue Data Catalog is a persistent metastore. Let's say that again: it is a persistent metastore. Well, what does that actually mean? It stores metadata. So what is metadata? On the right-hand side we have different descriptions of metadata: location, schema, data types, and data classification. This is the important bit: it's a managed service that lets you store, annotate, and share metadata (again, the things on the right-hand side), which can be used to query and transform data. What I find can be the difficult concept to get your head around is that when you register things in the Glue Data Catalog, i.e. data sources or data targets, they do not move from their existing location. If I have a database, let's say just a MySQL database, and it's on AWS and I have data in it, when I register it with the Glue Data Catalog it will store the location, the schema, the data types, and even the classification of that data in the Glue Data Catalog. What it does not do is move the data to AWS Glue; the data stays inside that database, where it is. The catalog entry is just a pointer, plus information on how to access that data. That's the key thing: it becomes a metastore, a collection of information that you can use to perform ETL on different data sources and bring them together inside AWS. You need to remember that there's one AWS Glue Data Catalog per AWS region, so you get one per region. You can use IAM (Identity and Access Management) policies to control access to them, and you can also apply data governance, because you can annotate the Glue Data Catalog. So again: it's the Glue Data Catalog, it's where you store metadata, and it does not store physical data; it's just the reference information required to get to the data you have stored in AWS.
- 00:14:06 AWS Glue databases. A database is a set of associated Data Catalog table definitions organized into a logical group. What I've done in this little graphic is draw a database and put some tables inside it. This is where we name something and then associate tables to it, the way you would do it normally in an RDBMS database, for example. We'll jump onto the console in just a second, where we can look at actually setting up a database (we'll do tables as well), and we'll see how this works in action.
- 00:14:38 Okay, back on the AWS console. If this is your first time, don't worry. What we're going to do is navigate to AWS Glue, so type AWS Glue into the search bar and click AWS Glue. Once there, you'll notice a few different things; looking around the console, there's quite a lot going on. The left-hand side is going to be your best friend, so I'm going to minimize it (in case you've arrived without it open, there's a hamburger menu to expand it). Down the left you can see we have Getting started, ETL jobs, Data Catalog tables, Data connections, and Workflows. It breaks down into two sections: one is Data Catalog, where you have databases and tables and a couple of other things, including crawlers and connections; on top of this you have Data Integration and ETL, so ETL jobs and data classification. We'll be looking at some of this as the tutorial goes on. Importantly, there are legacy pages. If you're familiar with old Glue and its layout, these are the legacy pages down the left-hand side. Glue has done a lot of work on how it looks in the UI in the last couple of years; if you compare my previous tutorial to this one, it's a completely different look, which is why I've redone this video. But if you're coming from the old tutorial, or you haven't been on Glue in a few years, you can find your legacy pages down here.
- 00:15:56 So what we're going to do next is create a database for this tutorial, for the purposes of the rest of the demonstration. You need to go to Databases, which is on the left-hand side, and we need to add a database. Let's call this first one raw_data. You can give it a location, so let's do that; let's be proper about this (it isn't required, but it's good practice if you can). For S3 we need to go find that bucket again, the one I just created for the Glue course; I'm going to leave it open in a tab because we'll be referencing it quite a lot. We want the raw data folder, and we want to copy the URI, paste it back in like this, and we can create the database itself. That's how we create a database.
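For reference, a minimal boto3 equivalent of those clicks might look like this; the S3 URI is illustrative, so use the one copied from your own bucket.

    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    # Create the raw_data database in the Glue Data Catalog, pointing at the
    # raw data folder. LocationUri is optional but good practice, as above.
    glue.create_database(
        DatabaseInput={
            "Name": "raw_data",
            "Description": "Database for the raw course data",
            "LocationUri": "s3://aws-glue-course-your-name-123/raw_data/",
        }
    )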
- 00:16:44 AWS Glue tables. A Glue table is the metadata definition that represents your data; the data resides in its original store. There it is again, and I'm going to keep saying this: the data resides in its original store; the table is just a representation of the schema. This, as I said, is the most difficult thing for people coming to grips with Glue: it is just a representation of the data, just pointing to where the data is. When I was at re:Invent last year, a few people came up to me and said, hey Johnny, your videos are really useful because nobody else really mentions how this actually works. And it's quite simple: there are connections, there is information stored in the Glue Data Catalog that represents your data, wherever that repository actually resides, but it is not the data itself. So again, make sure you understand this; make sure you read this and understand it fully before you progress with AWS Glue.
- 00:17:41 AWS Glue crawlers. A crawler is a little program that AWS has created that finds information about data you have stored in AWS or in databases, and then populates, or tries to populate, the AWS Glue Data Catalog for you with that information. The idea behind it is that it lifts the burden of having to manually create the tables for data you have lying around or already have on AWS. I will show you how to use the AWS Glue crawler, and I will also show you how to manually add a table, so you don't have to use the crawler; it's just a handy little program that helps you, or minimizes the burden of trying to find all the tables that you have. You point it at where the data resides, and it comes up with the schemas and the table names for you. But you do have the ability to manually create the tables as well; the choice is yours. I always use a blend of both; there is the right tool for the right job, and we'll show you how to do both in this tutorial.
- 00:18:38 Okay, I've just gone back out to the AWS Glue console to make it simple to find the tables: hamburger menu on the left-hand side if it's not already expanded, and go to Tables; it's as simple as that. I'm going to minimize this little header and banner here. For this part of the demo, or the tutorial, what we're going to do is add a table manually. Looking inside the GitHub, you can see that we have two tables: one is orders, one is customers. We'll do customers because it has the smaller schema. I would never really add a table manually unless I'm just playing around with data (I would usually do it through code), but it's good to see this in action. So let's go back to Tables and go Add table at the top. The first thing we need to do is give the table a name; I'm going to call this customers, and actually I'm going to keep it all lowercase because that's just best practice, and since this is raw data I'm going to call it customers_raw. The next thing we need to do is create or select a database; well, we did that previously, so that's raw_data. You can give it a description: "this is the customers data" will do. We want to keep this as a standard AWS Glue table (the default); we're going to say the source is S3, the data is in our account, and for where the data is, go Browse, find that bucket we created through the CloudFormation template, go into raw data, and choose our customers folder inside it; then just click off the field to get rid of the warning. We are in CSV data, comma delimited, and go Next.
It then asks us a few questions about the schema itself, so we're going to add the schema here. We'll just go Add column number one: if we go back onto the GitHub we know it's customer ID, so copy that, go back in, and enter it; it's not a string type, it is an int type, so we'll just go int and hit Save. Then we want to add the next column; we know that it's first name, so go Add column and paste again. I'm going to keep these all lowercase, so take out that capital F; and we didn't do that for customer ID, which isn't great, so back in there, take that C down and make it lowercase. Then we need last name: copy that in, add a column, make sure it's number three, enter lastname, and save. And then we have full name as the last one: fullname, save. So that's the columns added; we have one int and three strings. Then we go Next, and you can see a summary of what it has asked us: we have it in raw_data, the customers data name, and it is in CSV. Then go Next again (if you need to re-edit the schema, just click that Edit schema button), then Next, then Create, and it's off creating the table for us.
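As a point of comparison, doing the same "through code" (as I'd usually do) could look roughly like this boto3 sketch. The SerDe and format settings are the standard Hive ones for comma-delimited CSV, and the lowercase column names are assumptions based on the video.

    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    glue.create_table(
        DatabaseName="raw_data",
        TableInput={
            "Name": "customers_raw",
            "Description": "This is the customers data",
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                # One int and three strings, matching the GitHub schema.
                "Columns": [
                    {"Name": "customerid", "Type": "int"},
                    {"Name": "firstname",  "Type": "string"},
                    {"Name": "lastname",   "Type": "string"},
                    {"Name": "fullname",   "Type": "string"},
                ],
                "Location": "s3://aws-glue-course-your-name-123/raw_data/customers/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    )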
- 00:21:50 Let's go and add the second table, the orders table, through a crawler. Go down the left-hand side and go to Crawlers (again, if the menu's disappeared, hit the hamburger), and then we're going to create a crawler. For the purposes of this I'm just going to call it aws glue tutorial, like this, and we'll hit Next. We haven't mapped the data already, naturally. Then we're going to go Add data source: we have an S3 data source, it's in this account, and we'll go Browse and find that bucket again, the one we've been using for this course. It's in raw data, and it is the orders folder that we're going to map this time. Again, that field might go red, so just click off it. Crawl all subfolders? Yep, that's correct. Then we want to go Add S3 source, and once we've got that, highlight the source and go Next. An existing IAM role: as part of this tutorial we created an IAM role; if you need to know its name, it's in the setup code, and if you go down to the role it says AWS Glue course. That's the one we're looking for, the one we checked existed after we ran the template; so back in here (sorry), just paste it in, find it, and click off, and that will be you. We don't need Lake Formation, so do not check that box. Our target database is raw_data; we don't need a prefix on the table name for now, that's fine; and then we're just going to schedule it on demand. Click Next, hit Create crawler. This will take a second, and then you want to run the crawler by clicking the Run button. It will go off and run itself to map the data. This will probably take one or two minutes in total, so I'll pause the video here and we can pick it up once it's done and has created our orders table for us.
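The same crawler can be created and kicked off through the API; a hedged sketch, using the names from this section:

    import time
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    # Create the crawler: point it at the orders folder, write tables into
    # raw_data. Role and bucket names follow the video's setup; adjust to yours.
    glue.create_crawler(
        Name="aws-glue-tutorial",
        Role="AWSGlueCourse",  # illustrative; use the role the template created
        DatabaseName="raw_data",
        Targets={"S3Targets": [
            {"Path": "s3://aws-glue-course-your-name-123/raw_data/orders/"}]},
        # Omitting Schedule leaves the crawler on demand, as in the video.
    )

    # Run it and poll until it returns to READY (about a minute in the video).
    glue.start_crawler(Name="aws-glue-tutorial")
    while glue.get_crawler(Name="aws-glue-tutorial")["Crawler"]["State"] != "READY":
        time.sleep(15)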
- 00:23:52 Okay, after exactly one minute for me (I did say between one and two), you can see that it's completed successfully and we've had one table change. You can click here to see what that is: we've added the table orders. Perfect; that's exactly what we were looking for. If we then go into our databases, into raw_data, you can see we have customers_raw and then we also have orders as our table; because we added this via a crawler and I didn't put a prefix or suffix in, it used the folder's name. That's completely fine; we're allowed to have the name that we want. More importantly, let's just check that it picked this up correctly. If we click on that location link it should take us into the orders folder, and we should see the CSV; perfect. Let's click on the orders table itself: did it get all the columns? It's got 16 columns in total, so if we go back to the data, you can see that we start with sales order ID and finish with line total. Let's do that little visual check to make sure those line up: yep, sales order ID, and we finish with line total. So we've managed to pick up the entire schema through a program called a crawler that AWS has written, which has stopped us having to manually enter this information, or enter it through code the way I did the manual setup for the customers table. It's taken that burden off us, which helps if you have a lot of data and you just want a way to look at the tables and get them loaded very quickly. So that's a crawler; let's move on to the next section.
- 00:25:21 A brief word on partitions in AWS; this is important to know. Our ETL job, when we get there, will create a single partition, but you can play around and create more. A partition is this: folders where data is stored in S3, which are physical entities, are mapped to partitions, which are logical entities, i.e. columns in the Glue table. What does that actually mean? Okay, so we could have a table called sales, and we could have a sale date. AWS Glue, or indeed big data processing frameworks in general, give you the ability to create partitions, and with these partitions we can split things into year, month, and day for that sale date. What these actually become are logical folders mapped to physical folders on S3; so we actually split the data on S3, and rather than just having a column in the table, you have a folder. What this means is that when you search for something, like "I want to find the sales with a date of February the 2nd, 2019", the inner workings of AWS Glue know that they can exclude the folders that don't contain that date: it knows those dates aren't correct, so it can just go and look at this one folder out of the four. It's a way to speed up queries, and a way to speed up writes and updates as well, depending on exactly what you're doing. You need to understand partitions to use AWS Glue. Again, we will create a partition, and I will show you how when it comes to the ETL job; if you're not quite grasping it now, hopefully by the time we do the ETL job it will all sink in.
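To make the year/month/day example concrete, here is a small PySpark sketch. It assumes an existing SparkSession and a DataFrame df with a sale_date column (illustrative, not the course data); the write produces exactly the kind of folder layout described above.

    from pyspark.sql import functions as F

    # Derive the logical partition columns from the sale date.
    sales = (df
        .withColumn("year",  F.year("sale_date"))
        .withColumn("month", F.month("sale_date"))
        .withColumn("day",   F.dayofmonth("sale_date")))

    # partitionBy turns those logical columns into physical S3 folders.
    sales.write.partitionBy("year", "month", "day").parquet("s3://my-bucket/sales/")

    # Resulting layout; a query filtered to 2019-02-02 reads one folder
    # and skips the rest:
    #   s3://my-bucket/sales/year=2019/month=2/day=1/part-....parquet
    #   s3://my-bucket/sales/year=2019/month=2/day=2/part-....parquet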
- 00:26:54 AWS Glue connections. A Glue connection is a Data Catalog object that contains the properties required to connect to a particular data store. Typical of many cloud providers or any ETL tool, you can store a connection: if you have a database, wherever it may live, on premises or in AWS, you can store the connection string, the password, and the username inside AWS Glue. That means when you need to reference it in an ETL script, you're actually just entering the connection object's name; you don't store the passwords or anything in the script. That's pretty standard practice these days, and when AWS Glue was built you would expect it to be there; it is there, so please use AWS Glue connections when you need them. Let's take a little look at connections: down the left-hand side you'll see Connections; click on it. In Connectors you can go to the marketplace, or if you know your connector already exists you can go Create connection, and you can see a list of connectors that you can use with AWS Glue. These include AWS connectors plus first- and second-party connectors as well, such as Salesforce and Snowflake; SAP is sitting in there too. Now, on to the fun bit.
- 00:28:02 AWS Glue ETL. AWS Glue ETL supports extracting data from various sources (and that's important: various sources, the likes of S3, RDS, and on-premises databases), transforming it to meet your business needs, and then loading it into a destination of your choice. We will be doing this in the tutorial. I will show you how to do it visually, so you will not need to code; I will show you the script as well, but if you're not a good coder, or you can't code, don't worry, you'll be able to follow along.
The AWS Glue ETL engine, you need to know, is an Apache Spark engine, distributed for big data workloads across worker nodes. It also supports Python, but more typically you would use the Spark engine. You also need to understand AWS Glue DPUs: one DPU is equivalent to 4 vCPUs and 16 GB of memory when you're provisioning jobs. When we jump into the ETL job in a second you will see that I provision two DPUs for the job, which means I will have 8 vCPUs and 32 GB of memory available. If you do not have enough DPUs for your job, it will crash and fail; if you have too many DPUs, you're paying for compute resource you're not using. It is a bit of an art to tune your Glue jobs, and there are handy features on the console to show you when you're under-provisioning and over-provisioning DPUs. This is also the charging mechanism: depending on the region you're in, the cost you pay is priced on how many DPUs you use per minute, so it's important to size these correctly.
And just before we jump on and create an ETL job, it's good to know about bookmarks. We won't use them, but you can put bookmarks in, which basically means that when new data arrives, the ETL job you're creating will not reprocess the old data; you're just doing a delta load. This is really handy if you're getting hourly or daily loads into an S3 bucket and you just want to process the new data: AWS automatically tracks it for you and says, hey, I'm not going to process the previous run of data, I'm just going to process the new data that has landed. It bookmarks it and will not reprocess it again.
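The DPU arithmetic is worth pinning down. A tiny sketch; the per-DPU-hour price is an assumed illustrative rate, since Glue pricing varies by region.

    # Capacity scales linearly with DPUs; cost scales with DPUs x runtime.
    VCPUS_PER_DPU = 4
    GB_PER_DPU = 16
    PRICE_PER_DPU_HOUR = 0.44  # assumed illustrative rate in USD, not a quote

    def job_capacity(dpus: int) -> tuple[int, int]:
        """Total vCPUs and memory (GB) available to a job."""
        return dpus * VCPUS_PER_DPU, dpus * GB_PER_DPU

    def job_cost(dpus: int, runtime_minutes: float) -> float:
        """Approximate cost: DPUs x runtime, billed pro rata per hour."""
        return dpus * (runtime_minutes / 60) * PRICE_PER_DPU_HOUR

    print(job_capacity(2))             # (8, 32): the 8 vCPUs / 32 GB above
    print(round(job_cost(2, 10), 4))   # e.g. a 10-minute run on 2 DPUs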
- 00:30:08 Okay, back on the console, the first thing we're going to do is create a database, so we can actually store the new tables we create during this visual ETL process. On the left-hand side go to Databases, then Add database, and we're going to call this database processed_data (oops, a typo there; it should be processed_data). We need the S3 location that we already set up when running the setup script, so go and find that bucket again (I've minimized the tabs between the different parts of the lesson). I have a processed data area of the bucket; I want to hit the tick and take the URI (again, URI, not URL; that's really important) and paste it in down here. We also need to enter a little bit of a description, so I'm just going to say: "this is the database to hold the tables for processed data". Then you want to create the database, and that gives us a location to store some of the processed data.
- 00:31:21 Now, down the left-hand side you can see there is a Data Integration and ETL section. There are lots of different things here, but we're going to be working in ETL jobs, and for the purposes of this tutorial we're going to do Visual ETL; that means if you can't code, or you don't want to code, you don't need to for this demo. Click on Visual ETL and then hit create job from a blank graph. This is your UI, or GUI, that helps you build Glue jobs using nodes rather than coding; don't worry, it still creates a code script under the Script tab behind the scenes, so you can see what it's doing, but for the purposes of this we'll do visual ETL. The first thing you need to do is give it a name, so let's call this processed customers job and hit Enter. You'll see when this goes red that we have a few things to fill in. IAM role: I created one during the setup script called AWS Glue course, so select that. Scrolling down into Advanced properties, we have to pick a few named locations to store different things. The first is an area in the bucket to store the scripts; if we go into the bucket itself, I created a script location folder, so choose that. You also need a place to put your logs, so again (browse, that is, not view; sorry), go find the bucket; mine is the aws-glue-course-johnny-chivers one; inside the temp directory would be fantastic; choose that and then just append /logs. Then, scrolling down further, inside Temporary path, browse again, go find that S3 bucket and its temp directory, and choose that directory. (I've granted all the permissions required to access this bucket during the script setup, so you won't have to add anything extra by using those locations.) That's everything under Advanced settings, so we can scroll back up and collapse it. Then, to keep the cost down, the DPUs, the number of workers, we want is two; that's perfect. Everything else can be left as it is; you can give it a description if you want, but I'm not going to bother. Then hit Save on the top right. That's the job saved, and you'll see there are no red warnings anymore.
- 00:33:39 The plus icon is where you can add your nodes. The first thing we need is a source, and our source is going to be the AWS Glue Data Catalog. We're going to get that customers table that we have already set up: choose raw_data, and then under tables, customers_raw; select that. You'll see here that there's a data preview section ready to go; depending on whether you've been into ETL before or not, this can take up to two or three minutes to start. You'll see "data preview processing", and eventually you'll get to the point where your data preview is ready, so you can actually see the data as it currently sits, and you have the output schema for this data as well.
What we're going to do is take it and store it in Parquet format in S3, registering a new table in the Glue Data Catalog. We're also going to add a transformation to the table, and in this case we're going to add a processing time. So inside the nodes, go to Transforms; there are lots and lots of different transformations you can do with a node, and in this one we're going to add a current timestamp as our processed date timestamp. Click that node and it will automatically join up; if it does not automatically join, it will look like this on your canvas, and you just want to click on it, choose the parent node, and select the AWS Glue Data Catalog source; you can see there that it's ready to go. You need to give it an output column, so I'm just going to call this processed timestamp, keeping things nice and simple, and we'll leave the rest as default. You can see with the output schema that it's picking up the new addition we have, the processed timestamp.
Then we need a target, so go into Targets, and we're going to use S3 as our target. Perfect, we've selected S3. You can see it's already got the Parquet format selected; we're going to do Snappy compression. We do need to pick a location to save this to in S3, so again let's go into that bucket we created for the purposes of this course, inside the processed data folder (select processed data; I mix that up all the time), hit Choose, and then we want to append customers_processed, with a forward slash on the end as well. That's the location we're going to save the data to. We do want to create a new table in the Data Catalog and, on subsequent runs, update the schema and add new partitions; that's exactly what we want to do. Our database is going to be processed_data, and we have to give the table a name, so let's call it customers_processed; that's the name of the table we're going to create. Then we want to add a partition key as well, and our partition key will be the processed timestamp. Then we want to save everything that we've just done, and that is our ETL job: we're going to take that data, add a timestamp, transform it into Parquet, save it to S3, then create a Glue Data Catalog table called customers_processed that's going to sit in the processed_data database, and we're also going to create a partition key on the processed timestamp.
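As mentioned, the visual editor generates a script under the Script tab. It won't be byte-for-byte what Glue produces, but a hand-written PySpark sketch of the same extract/transform/load looks roughly like this; the names follow the video, and the getSink pattern is the one Glue's catalog-updating scripts typically use.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Standard Glue job boilerplate.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Extract: the customers_raw table registered in the Glue Data Catalog.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="raw_data", table_name="customers_raw")

    # Transform: add the processed timestamp column.
    df = source.toDF().withColumn("processed_timestamp", F.current_timestamp())

    # Load: Snappy Parquet to S3, partitioned on the new column, creating and
    # updating customers_processed in the processed_data database.
    sink = glueContext.getSink(
        path="s3://aws-glue-course-your-name-123/processed_data/customers_processed/",
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=["processed_timestamp"],
        compression="snappy",
        enableUpdateCatalog=True,
    )
    sink.setCatalogInfo(catalogDatabase="processed_data",
                        catalogTableName="customers_processed")
    sink.setFormat("glueparquet")
    sink.writeFrame(DynamicFrame.fromDF(df, glueContext, "customers_processed"))

    job.commit()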
- 00:36:51 Okay, perfect. Let's save this, and then when you're actually ready you can go to Runs up here, or click Run there, whichever you want, and run the job. That'll kick it off. It has to spin up a few containers and a few other things behind the scenes, so it might take a few minutes to process in total because it's the first go. Just hit refresh to see what's happening, and you can also see down here what's going on with the job itself: the input arguments and everything else on the UI. I'm going to pause here and let it do its thing, and once it's processed successfully we'll pick it back up; you can see it's off and running, 14, 16 seconds in already.
Okay, you can see that after 1 minute 4 seconds precisely it succeeded, which is fantastic. Let's go have a look at the Glue Data Catalog. If we go into databases, into processed_data, you'll see that we actually have a table there. We can click on the table and see that we have our partition column plus full name, customer ID, and last name. If we click on Partitions we should be able to see that we have a partition for when I just ran the job, and if we go back into the schema we can also go to the location by clicking the URL there; that takes us to where our data is currently sitting as a Parquet file. Excellent.
As part of this demo I also set up Athena so we could look at the data; you'll notice there's a workgroup selected for Athena. Inside Athena you'll see there's a workgroup called AWS Glue course Athena workgroup, so select that on the top right; once you've selected that workgroup, there's a location to save our Athena results. You can have a look at the data by going to processed_data; under tables we should have our table, and the simplest way is just to click those three dots and go Preview table. This will query our data table and bring back the data in Parquet format, and you can see that the processed timestamp is there for us too.
That's how you do it with the processed data for customers. You can do exactly the same with the orders data: you would go in and start a visual ETL job, the same thing again, except this time with the Glue Data Catalog source you would select raw_data and the table this time would be orders. So you can do the same thing with orders as you did with customers. I won't do it for the purposes of this demo; that's a bit of homework for you. Go ahead and do exactly the same thing for the orders data and you'll be off and running, creating your second ETL job.
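If you'd rather run that Athena preview programmatically, here's a hedged boto3 sketch; the workgroup name follows the video's CloudFormation setup, so adjust it to what yours is actually called.

    import time
    import boto3

    athena = boto3.client("athena", region_name="eu-west-1")

    # Preview the new table using the course workgroup (which already has a
    # results location configured).
    qid = athena.start_query_execution(
        QueryString='SELECT * FROM "processed_data"."customers_processed" LIMIT 10;',
        WorkGroup="aws-glue-course-athena-workgroup",
    )["QueryExecutionId"]

    # Wait for the query to finish, then print the rows (the Parquet data
    # plus the processed_timestamp column).
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])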
- 00:39:33 AWS Glue data quality is a relatively new feature; in fact it was marked "new" when I was filming this video, so you will see it marked as new on the console. This helps you monitor the quality of your data via the Data Quality Definition Language (DQDL), using an open-source project called Deequ. We'll go into the console in a second and you'll see it in action: it will formulate some rules for us, show us this DQDL language in operation, and then we can alter that language, or remove rules, to suit our data quality needs. The point is that when we're using Glue we can perform data quality checks on data that's coming in, or on data that we create through ETL, and then apply that governance through code. Let's get started and have a little look in the console at how AWS Glue data quality works.
We're going to take a quick look at Glue data quality. It's a new feature, and it's in the Glue Data Catalog: you want to go to Tables. Let's pick a table; let's pick the orders table, since it's the most detailed, so hopefully we'll get something. You can see up here they've got Data quality marked as new (it's actually still under "new" and has a video, so watch that if you want). What we want to do is get some data quality rules for this table. If we go to Run history we'll see that there's probably nothing on this table whatsoever; are there any recommendation runs? No, there aren't. Right, so we want to go to Recommended rules, and click Recommend rules; oh, and choose the IAM role, we've got one for it. This is where Glue is going to go off and create rules for us that it recommends; that's it off and running. It's using a bit of AI, a bit of machine learning, looking at our data and coming up with rules for us. I'll pause the video; this will probably take a few minutes, and then we'll see some of the rules that Glue decides should be applied to our orders table in terms of data quality.
- 00:41:32quality okay and after a few minutes you
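You can kick off the same recommendation run programmatically. Here is a minimal sketch with boto3, where the database name and IAM role ARN are assumptions for this tutorial's setup:

```python
# Hypothetical rule-recommendation run against the orders table; the
# database name and role ARN are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_data_quality_rule_recommendation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "raw_data",  # assumed catalog database
            "TableName": "orders",
        }
    },
    Role="arn:aws:iam::123456789012:role/GlueTutorialRole",  # assumed role
)

# Poll the run; once it succeeds, the response includes the recommended DQDL
result = glue.get_data_quality_rule_recommendation_run(RunId=run["RunId"])
print(result["Status"])
```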
- 00:41:34 Okay, after a few minutes you can see that it has a successful run. If you click into the run you will see the rules that it has created, or rather recommended, for this dataset. You can copy these rules: go back into the previous tab, go to the rules themselves by clicking them, and hit Copy. Then we want to go back out to the table (I'm sure there's a quicker way to do this than what I'm doing): go to Data quality, scroll down, and you'll see Create data quality rules. Click in and you can just paste in what we lifted, so I'm going to press Ctrl+V, and you want to get rid of that one and the one at the bottom. Then you can start to mess around with this. You can see, for example, that your row count should be between these two values, and you can adjust that if you want. It says to make sure that everything has a sales order in it, and it looks at the standard deviation of this column and this one. Let's say we wanted a range to be bigger or smaller: we could just go 100, and since we know our maximum is going to be 10,000, we could go 10,000. This is how you start to build up those rule sets. You save the rule set and give it a name (I'm just going to call this dq1), and then it will apply this rule set to the data in the table and start to alert you when these rules are broken. Again, you can see rule types like column values, where you can list all the allowed values for a column, and using this library you can start to build the rules that ensure the quality of your data.
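A saved rule set is just DQDL text, and you can create one against a catalog table with boto3 as well. A minimal sketch, where the column names, thresholds, and database name are assumptions for this tutorial's data:

```python
# Hypothetical DQDL rule set for the orders table; the column names and
# thresholds are illustrative assumptions, not the console's exact output.
import boto3

ruleset = """
Rules = [
    RowCount between 100 and 10000,
    IsComplete "sales_order_id",
    ColumnValues "quantity" between 1 and 10000
]
"""

glue = boto3.client("glue", region_name="us-east-1")

glue.create_data_quality_ruleset(
    Name="dq1",
    Ruleset=ruleset,
    TargetTable={
        "DatabaseName": "raw_data",  # assumed catalog database
        "TableName": "orders",
    },
)
```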
- 00:43:06 Next up is AWS Glue scheduling. We will take a look at this in just a second, but first you need to know about AWS Glue triggers: this is how you initiate an ETL job, or indeed a crawler, and I'll show you both on the console. A trigger can be defined on a schedule, fired by an event, or run on demand; it's up to you, and we'll see the options. To trigger something is how we start a workflow or an ETL job. Then we have AWS Glue workflows, which let us create complex extract, transform, and load activities: we can run a crawler, then run our ETL script, then run another ETL script. When we're on the console in a moment, I will show you how to create a trigger and how to create a workflow. These are really useful if you're only using AWS Glue; if you need to use other things outside of Glue, like EMR or Athena, then you'll have to consider something else, such as Managed Workflows for Apache Airflow, which is what I commonly use. But if you're only in the AWS Glue environment and you only need that tool set, then AWS Glue workflows is a perfectly viable tool for scheduling ETL jobs (see the trigger sketch below).
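Creating triggers programmatically looks like the following minimal sketch with boto3; the trigger, crawler, and job names, as well as the cron schedule, are assumptions for illustration:

```python
# Hypothetical triggers for this tutorial's crawler and ETL job; the
# names and the cron expression are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A scheduled trigger: run the crawler every day at 12:00 UTC
glue.create_trigger(
    Name="daily-crawler-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 12 * * ? *)",
    Actions=[{"CrawlerName": "aws-glue-tutorial"}],  # assumed crawler name
    StartOnCreation=True,
)

# An on-demand trigger: the ETL job starts only when you fire it yourself
glue.create_trigger(
    Name="process-job-on-demand",
    Type="ON_DEMAND",
    Actions=[{"JobName": "process-customers"}],  # assumed job name
)
```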
- 00:44:11 Just a brief mention now that we've come out of workflows, and I know I keep mentioning it: Airflow is one option you can use, Step Functions is another, and EventBridge is a third. These are three very common patterns I see for scheduling Glue jobs, so bear them in mind, but workflows work perfectly well if you are only doing things inside AWS Glue.
- 00:44:27 Okay, let's look at orchestration within the Glue console. If you're just using Glue, this is fantastic for orchestration: if you're just running Glue jobs, the Data Catalog, and crawlers, this is a good place to come and do some orchestration. If you're knitting other AWS services together, my recommendation is to look at other orchestration options, particularly Managed Workflows for Apache Airflow; that's my go-to at the moment when I'm integrating EMR, Athena, and Glue all together. But if you're just using AWS Glue, this is a great tool. So go into Workflows and add a workflow; I'm just going to call this test workflow and create the workflow itself. You can click into the workflow, and then we need to start adding things, so we want to hit Add trigger. You can see that we don't currently have any triggers (you can also create triggers from the Triggers page if you want). If you hit Add trigger, we can name this trigger crawler glue tutorial, and we're just going to run it on demand. Hit Next, and it asks for our target resource: for the resource type choose Crawler, select the AWS glue tutorial crawler, hit Add, go Next, and then hit Create. So there's a crawler trigger. Then we want to go back into our workflow, click on it, and hit Add trigger; we should have our crawler trigger set to run on demand, so we'll hit Add, and you can see here that we've added the trigger to the workflow.
- 00:46:02 If we just look at this and hit Run workflow, it will start to run the workflow, and that workflow is going to go and crawl our data. We can then build dependencies across here: we can make the next node actually run the Glue job once the crawler completes, and you can start to build more complicated workflows. To do that, you would click on the crawler trigger like this, add a trigger, and add a new one. Give it a name; let's just call this glue job ETL. This one is an event trigger that starts after the event, so hit Add. Then we want to go in and select the crawler, or in this case the job, that we want to start, which is the process job, and hit Add, and then you can build out those different pieces as you go. So now, once the crawler has run, we're going to run the job, and again you can hit Run workflow for that to kick off as well. If you had many jobs or many crawlers, you could start to build a more complex workflow for scheduling. As I said, this is just a little taster of what you can do with it; if you're working within Glue only, it's totally valid to use this as your orchestrator (a programmatic sketch of the same pattern follows below).
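To make the dependency idea concrete, here is a minimal sketch of the same pattern built with boto3: an on-demand trigger starts the crawler, and a conditional trigger starts the job once the crawler succeeds. All names are illustrative assumptions:

```python
# Hypothetical workflow: crawl first, then run the ETL job on success.
# Workflow, crawler, and job names are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_workflow(Name="test-workflow")

# On-demand trigger that starts the crawler inside the workflow
glue.create_trigger(
    Name="crawler-glue-tutorial",
    WorkflowName="test-workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "aws-glue-tutorial"}],
)

# Conditional trigger: run the process job once the crawler succeeds
glue.create_trigger(
    Name="glue-job-etl",
    WorkflowName="test-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "aws-glue-tutorial",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "process-customers"}],
)

# Kick the whole thing off, just like the Run workflow button
glue.start_workflow_run(Name="test-workflow")
```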
- 00:47:13 AWS Glue DataBrew: I've left this in because I covered it last time. You'll notice when I jump onto the console that it actually sits separately from AWS Glue; it's its own service. It's for visual data preparation, which makes it easier for non-engineers, data analysts, or anyone who just wants to quickly look at data in what, quite frankly, looks like an Excel workbook, to clean and normalize data. I am always hesitant to put this into production, and hesitant for data engineers to use it; I use it to look at data quickly and manipulate it before I go and write some data engineering code and put it through more concrete CI/CD-type processes for DevOps. But it is a good tool, a Swiss Army knife to have in your arsenal, for when you need to look at data or want to give an option to people who look at data but can't code. Let's look at this in action; I will use the sample data to help us see how this works when we jump onto the AWS console.
- 00:48:07 Now, interestingly, with the new AWS console they've actually moved Glue DataBrew out, so you have to type in AWS Glue DataBrew and it comes up as a completely separate service page. There's a video again about how it works, so feel free to watch that. We want to create a sample project, and we're going to use the UN resolution votes dataset. You want to select Create new IAM role; we'll call this one AWS-course-glue and it'll add the rest of the name for us, which means we can delete it once we're done. Then you want to hit Create project. This will run off in the background, create the IAM role for you, and load the data into the DataBrew UI for us, and we can start to play around with it. If you haven't been here before, it does look a little bit like Excel or Smartsheet, and that's really the idea: it's an area for non data engineers, or for data engineers who want to get to grips with data quickly but don't want to do it through code.
- 00:49:13 I would generally use it if I wanted to look at something quickly, and I would let non-technical users use it as well. I'm always just a little bit apprehensive about creating jobs out of it and letting people run Glue code that way; if there's a valid business reason, sure, but what we don't want is to get away from the best practices of CI/CD and maintaining those DevOps pipelines for our data engineering. But again, there's nothing wrong with it, provided it's done in a safe, agile, and secure way. You can see that this is loading up at 51%, so I'm just going to pause the video until it is done.
- 00:49:52 Okay, so after a few minutes you can see that it loads up the rows and starts to give us a look at what we have. At the top of each column it tells you information like the number of distinct values, the range that's in there, and the percentages of each of the values; it is using AI/ML behind the scenes to populate a lot of this. You can look at the schema as well by clicking Schema, which tells you all the different columns that you have, and you can get a profile of the data on top of that if you want, but I'm just going to go back to grid view. If you're interested, you can then start to filter the columns, sort the columns, format the columns, and add things in as well. Let's just do a quick filter: add a filter, do it by condition, and choose where it contains; then you can select the source column. Let's say ours is resolution and we want to make sure the column contains a value of, let's say, 66. You apply that, it applies that condition to the column, and you can see it's off and running: if a row has 66 in it, it's going to be kept.
- 00:51:03 And that's how it works: it stores this as a recipe, and you can see here that's the first step. You can then add another step; let's say we wanted to dedupe. We want to remove duplicate values in columns, and you can choose which columns; let's say we wanted to dedupe the first column, which is assembly session, then apply to all rows. It's going to dedupe based on that column, selecting the first value it sees and bringing the data down to a smaller, deduplicated set of rows. You can see here that you now have your recipe; you can publish that recipe, or alternatively import a new one or download it, and once you have the recipe you can apply it to datasets over and over again (a sketch of doing that programmatically follows below).
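As a rough illustration of reusing a published recipe, you can apply it to a dataset as a DataBrew recipe job via boto3. A minimal sketch, where the dataset, recipe, role, and bucket names are all assumptions:

```python
# Hypothetical recipe job that applies a published recipe to a dataset.
# Dataset, recipe, role, and bucket names are illustrative assumptions.
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")

databrew.create_recipe_job(
    Name="un-votes-cleanup-job",
    DatasetName="un-resolution-votes",  # assumed dataset name
    RecipeReference={
        "Name": "un-votes-cleanup",     # assumed published recipe
        "RecipeVersion": "1.0",
    },
    RoleArn="arn:aws:iam::123456789012:role/AWS-course-glue",  # assumed role
    Outputs=[
        {
            "Format": "CSV",
            "Location": {
                "Bucket": "your-glue-tutorial-bucket",  # assumed bucket
                "Key": "databrew-output/",
            },
        }
    ],
)

# Run it whenever you need the cleaned output refreshed
databrew.start_job_run(Name="un-votes-cleanup-job")
```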
- 00:51:45 The best thing to do is to sit and play around with these different functions if you're interested. Again, for me, as I said: Glue DataBrew is really there for users that don't write code. I give it to my non-technical users when it makes sense, and I use it to look at data really quickly, but I'm always just a little bit apprehensive about anyone putting it into a production pipeline for data engineering. Sure, business users who want to play around with data, create jobs out of it or out of bigger datasets, and use it in their day-to-day business: cool. But for those refined CI/CD pipelines with DevOps processes, I'm not such a fan of using Glue DataBrew.
- 00:52:26tutorial on AWS glue we've kind of took
- 00:52:28a look at what AWS glue is we spent some
- 00:52:30time looking at the AWS glue data
- 00:52:32catalog we then created ETL jobs we've
- 00:52:35learned how to schedule those ETL jobs
- 00:52:37and we've also looked at glue data
- 00:52:39quality and glue data brew as well one
- 00:52:42final reminder really like a like And
- 00:52:44subscribe to this channel it helps me
- 00:52:46out in the description below I've also
- 00:52:48left a link to the exams for the AWS
- 00:52:51data engineering certification if you're
- 00:52:53interested in taking that there's also a
- 00:52:56YouTube tutorial on this channel for
- 00:52:58that certification and until next time
- 00:53:00folks thanks for watching
- AWS Glue
- ETL
- Data Catalog
- Crawlers
- Data Quality
- DataBrew
- Scheduling
- AWS Services
- Data Engineering
- Tutorial