AWS Glue Tutorial for Beginners [NEW 2024 - FULL COURSE]
Summary
TL;DR: This video tutorial by Johnny Chivers provides an updated overview of AWS Glue, focusing on its features and functionality for ETL processes. It covers the AWS Glue Data Catalog, the creation of databases and tables, and the building of ETL jobs using both visual and code-based methods. The tutorial includes practical steps for setting up permissions, uploading data to S3, and using crawlers to automate table creation. It also discusses data quality monitoring and scheduling options for ETL jobs, making it a comprehensive guide both for beginners and for those looking to refresh their knowledge of AWS Glue.
Key takeaways
- 🔍 AWS Glue is a fully managed ETL service.
- 📚 The Glue Data Catalog stores metadata about data sources.
- ⚙️ ETL jobs can be created visually or through code.
- 🤖 Crawlers automate the discovery of data and table creation.
- 📂 Partitions improve data organization and query efficiency.
- ✅ Data quality features help monitor and enforce data standards.
- 🖥️ Glue DataBrew allows for visual data preparation without coding.
- ⏰ ETL jobs can be scheduled using triggers in AWS Glue.
- 🔗 AWS Glue integrates with various AWS services for data processing.
- 📈 The tutorial is suitable for both beginners and experienced users.
Timeline
- 00:00:00 - 00:05:00
Introduction to AWS Glue, highlighting the need for an updated tutorial due to changes in the AWS console and new features in Glue since the last video.
- 00:05:00 - 00:10:00
Overview of AWS Glue as a fully managed ETL service, explaining its components like the Glue Data Catalog and the flexible scheduler, and the importance of understanding its functionality for data engineering.
- 00:10:00 - 00:15:00
Instructions on setting up the AWS environment, including downloading necessary files from GitHub and using CloudFormation scripts to create required resources like S3 buckets and IAM roles.
- 00:15:00 - 00:20:00
Detailed steps on creating folders in the S3 bucket for organizing raw and processed data, and uploading CSV files for customers and orders, maintaining the folder structure as specified.
- 00:20:00 - 00:25:00
Explanation of the Glue Data Catalog as a persistent metastore for metadata, emphasizing that it does not store physical data but rather references to data locations and schemas.
- 00:25:00 - 00:30:00
Demonstration of creating a database in the Glue Data Catalog and manually adding a table for customers, including defining the schema and data types for the table.
- 00:30:00 - 00:35:00
Introduction to Glue Crawlers, which automate the process of populating the Glue Data Catalog with table definitions, and running a crawler to create the orders table from existing data.
- 00:35:00 - 00:40:00
Discussion on the concept of partitions in AWS Glue, explaining how they can optimize data storage and querying in S3 by organizing data into logical folders based on specific criteria.
- 00:40:00 - 00:45:00
Overview of Glue connections for securely storing connection properties to various data stores, and the importance of using these connections in ETL scripts to avoid hardcoding sensitive information.
- 00:45:00 - 00:53:03
Introduction to AWS Glue ETL jobs, explaining the visual ETL process and how to create a job that extracts, transforms, and loads data, including setting up the job parameters and defining the data source and target.
Video Q&A
What is AWS Glue?
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data preparation for analytics.
What is the Glue Data Catalog?
The Glue Data Catalog is a persistent metadata repository that stores information about data sources and targets.
How do I create an ETL job in AWS Glue?
You can create an ETL job in AWS Glue using the visual ETL interface or by writing code.
What are crawlers in AWS Glue?
Crawlers are programs that automatically discover and catalog data in AWS Glue.
What is the purpose of partitions in AWS Glue?
Partitions help organize data in S3, allowing for more efficient querying and processing.
How can I monitor data quality in AWS Glue?
AWS Glue provides data quality features that allow you to define and monitor data quality rules.
What is Glue DataBrew?
Glue DataBrew is a visual data preparation tool that allows users to clean and normalize data without coding.
How do I schedule ETL jobs in AWS Glue?
You can schedule ETL jobs using triggers in AWS Glue, which can be defined based on time or events.
What is the difference between Glue Data Catalog and Glue DataBrew?
The Glue Data Catalog is for metadata management, while Glue DataBrew is for visual data preparation.
Can I use AWS Glue with other AWS services?
Yes, AWS Glue can integrate with various AWS services like S3, RDS, and Athena.
- 00:00:00 Hi folks, welcome back to the channel. For those of you that are new here, I'm Johnny Chivers, and in today's video we're going to take a look at AWS Glue. There's already a video on AWS Glue on the channel, however that was filmed over three years ago now and has been out for about two and a half years. The AWS console has since updated and feels completely different, and in fact there are new features within the Glue service itself that aren't covered in that original video. A lot of you have requested an update to that video, so we're going to do it today. It'll follow the same format: we'll go through what Glue is, we'll take a look at the Glue Data Catalog, and we'll start building ETL jobs as well. Everything you need for this tutorial is located at the GitHub link in the description below; that includes the data and the slides that I'm going to use. I'd really appreciate a like and subscribe to the channel because that helps me out. With all that being said, let's jump onto the computer, take a look at what AWS Glue is, then do some setup work using the code located in that repo on GitHub. So again, download that repo (highly important), join me on the computer, and we'll get started.
- 00:01:13 Okay folks, just to remind you again, I will make these slides available to download on the GitHub if you want to download them, annotate them, and follow along. The first thing we need to cover is: what is AWS Glue? It's a fully managed ETL service; that's the core value proposition. Fully managed means AWS will do the heavy lifting for you when it comes to infrastructure. It's a Spark, or actually also a Python, ETL engine, so it has both Spark and Python. Usually you would only see Spark mentioned in the documents, and a lot of people talk about Spark because it's big data, but you can use Python as well. It consists of a central metadata repository called the Glue Data Catalog; we will cover this in depth and set up some databases and tables during this tutorial, because it is a key component. It also has a flexible scheduler; again, we'll cover this in the tutorial and I'll show you how it works. It's not the only way to schedule things inside Glue, but it's certainly good to know that you don't have to leave the Glue environment to do it.
On the right-hand side we have a nice little diagram provided by AWS that shows what Glue does. I'm actually going to start at the bottom, because this is the fundamentals and what people think of when we say Glue: you have a data source, you can extract from that data source, you can transform the data through a script that runs on the serverless ETL engine in Python or in Spark, and then you load a data target. There are numerous data sources and numerous data targets you can use, and that's where the Glue Data Catalog comes in: you store references to your data source or your data target in the Glue Data Catalog. To populate the Glue Data Catalog you can use things like crawlers or the management console; we will do both in the tutorial. Then, once you have things set up in the Glue Data Catalog, and you have your scripts, AWS Glue jobs, or ETL pipelines, you can use a schedule or an event to run them; again, we'll cover that in the tutorial.
- 00:03:07 I always think it's useful to put in: why use AWS Glue? I know that for the purposes of this video we are just going to do the tutorial, but there are many different options for ETL in AWS, and as you learn these as a data engineer, an architect, or a developer, it's good to know when to use the correct tool for the correct job. And if you're doing this as part of the AWS data engineer certification (I'll also leave a link to that video in the description below), it's good to know this as well. AWS Glue offers a fully managed, serverless ETL tool. This removes the barrier to entry when there is a requirement for an ETL service in AWS, and what that means is that you don't have to manage the underlying infrastructure. To use Spark you would normally need to spin up clustered compute in a highly parallelized environment; using AWS Glue you don't need to think about any of that. It just runs it for you on managed services behind the scenes. So again, there is nothing for you to provision; it is serverless; AWS is doing the heavy lifting. You need to write the code and it will do the ETL.
- 00:04:16 Okay, what we'll do now is jump onto the console and get hands-on with a little bit of setup work. I have written some CloudFormation scripts for this to help us set up permissions and some of the data repositories that we'll need throughout this tutorial. Follow along, and make sure you have already downloaded that GitHub repo; if not, I'll show you how to do it again in the middle of the setup work. We will be using it throughout this tutorial.
So, throughout this course, as I referenced in the intro, we will be using this GitHub repo. For this section of setup we'll be using the code located in the code folder, in setup-code.yaml, and then, once that has completed successfully, we'll be uploading the data from the data folder. The easiest thing to do, as I mentioned, is just to take a copy of this down onto your local machine: go to Code, then Download as ZIP, unzip that file once it's downloaded, and use the contents throughout. The instructions for what we're doing are also located there, but don't worry, I'm going to guide you step by step. That being said, we need to jump onto the AWS console, so please log in and you'll be greeted with the homepage. Hopefully you are familiar with the AWS console; if not, and this is your first time using it, don't worry, I'll point things out as we go along.
- 00:05:35 The first thing we're going to do is go to CloudFormation. This is where we can use our infrastructure-as-code to spin up the resources we need for the tutorial. The idea behind this is that I have created the templates for us, which means we won't have to do as much of the usual manual intervention, clicking through the console and doing all the permissions work that you see in many other tutorials. We want to go Create stack, with new resources, and we're going to upload a template: choose file, and go to the unzipped version of the folder that you've just downloaded from the GitHub (I repeat, the unzipped version), go to the code folder, and then setup-code.yaml; open that, then go Next. You need to give the stack a name; this can be anything you want, so I'm just going to call this johnny-chivers-glue-course-2024 (well, I can't spell my own name). The next thing you need to do is give the bucket a name; this is the S3 bucket for the course. Bear in mind that with AWS every bucket name has to be globally unique, so you will not be able to use the same name as me. I'm going to call mine aws-glue-course-johnny-chivers; hopefully that is unique. Again, I'll repeat: you need a unique name, and if yours isn't, it will tell you. Using your own name with some random numbers or digits might be the easiest thing. Accept and acknowledge at the bottom (we're going to run this using the root permissions that I'm currently logged in with) and then hit Submit. That will take a few minutes to go off and run; you can hit the refresh icon and see what the setup code is doing. As I mentioned in the header of the code (just go and have a look at the file), it's going to create a few things for us: one is an S3 bucket; the next is an IAM role for Glue, which we'll be using so we actually have permission to do anything; and the last thing is an Athena workgroup, which isn't technically part of this course, but we may use it to look at some of the data, so it's good to have it around as well.
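If you'd rather script this step than click through the console, a minimal boto3 sketch of the same stack creation might look like the following. The stack and bucket names mirror the video, but the BucketName parameter key is an assumption; check the template's Parameters section for the real one.

    import boto3

    # Create the course stack from the repo's setup-code.yaml (Ireland region,
    # as used in the video). Assumes valid AWS credentials are configured.
    cfn = boto3.client("cloudformation", region_name="eu-west-1")

    with open("code/setup-code.yaml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="johnny-chivers-glue-course-2024",
        TemplateBody=template_body,
        # The bucket name must be globally unique; the parameter key here is
        # a hypothetical name, so match it to the template's Parameters section.
        Parameters=[{"ParameterKey": "BucketName",
                     "ParameterValue": "aws-glue-course-your-name-123"}],
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # template creates an IAM role
    )

    # Block until creation finishes (took about 30 seconds in the video).
    cfn.get_waiter("stack_create_complete").wait(
        StackName="johnny-chivers-glue-course-2024")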
- 00:07:48 This shouldn't take too long to complete; in fact, it's completed already, so that took about 30 seconds. Let's check that a few things are there as they should be. If you go to Outputs you'll see that we have the S3 bucket created, so copy the name under Value. Let's go to S3 (I have other buckets in this account; you might only have one bucket, don't worry), paste it in, and make sure the bucket has been created. I should mention as well that I'm working in the Ireland region for this. So that is the bucket. If we go to IAM, we have the name of the IAM role that we're looking for, called AWS Glue course. In here, if we go to Roles on the left-hand side and search for the role, you can see that we have it; if we load the role up we should have a policy attached to it (great, we do), and inside that policy is all the good stuff that we created during the setup process. So that's the setup in terms of running the template. The next thing we need to do is actually upload some data and create some folders inside the S3 bucket itself.
- 00:08:54 So let's go back to that S3 bucket that we were just on, the one we know was successfully created; search for it (aws-glue-course-johnny-chivers was my bucket). If we go back onto the GitHub, into the main part of the readme file, you can see that we're going to upload the data. Ignore that for a second; what we need to do first is create a few things in the folder structure denoted there. So if we just copy the first one, raw data, I'll show you how to do this: go to Create folder, call the folder raw data, and then create the folder, like so. Next we have to do processed data; this will be quite quick, same process again: Create folder, paste it, and create the folder. Next is script location; we'll be needing these folders throughout the tutorial, so we just have to create them once and then that's it done: script location. (Oh, I hit the wrong button; back in, Create folder, folder name script location, create folder.) We'll also need a temp directory: Create folder, folder name temp directory, create folder. And then we will need an Athena folder, called athena, so go into Create folder again (this is the last one) and create athena. Perfect.
Back on the GitHub, you'll see that into the raw data folder we have to upload the customers and orders data, keeping the folder structure as denoted inside the GitHub. You can see here we'll have a folder called raw data, and then we're going to upload our customers and orders folders, which have CSV files inside them. This is quite simple. The easiest way to do it is to go into the actual raw data folder itself, then go Upload. Once here, the way I find easiest is to minimize this window a little and go get the location where you unzipped (and again, unzipped) that data; for me it's sitting in my user folder, under data; that's just where I have these unzipped locations. So find the customers and orders folders in the unzipped GitHub version and click and drag them across; that is just the easiest thing to do. You can see then that you get your customers folder, your orders folder, and your two CSVs as well. Click Upload; this will take a few seconds, there's not that much data there, so just sit tight. Once it's been done successfully you can close, and you can see that you have customers with a customers CSV and orders with an orders CSV. So we've successfully created (I'm going to blow this back up big, because we don't need it minimized anymore) this raw data folder, with our customers folder and CSV and our orders folder and orders CSV as well. That's the setup work complete for the tutorial. Make sure you follow along with this; it's pretty simple: run the CloudFormation template, then create these folder locations in the S3 bucket that was created, and then upload the data, maintaining the folder structure just as I have shown.
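For anyone who prefers to script this rather than drag and drop, here is a hedged boto3 sketch of the same folder setup and upload. The exact folder and file names below are assumptions, so match them to the repo's readme.

    import boto3

    s3 = boto3.client("s3")
    bucket = "aws-glue-course-your-name-123"  # your unique bucket name

    # S3 has no real folders; zero-byte keys ending in "/" act as placeholders,
    # which is what the console's Create folder button does.
    for folder in ["raw_data/", "processed_data/", "script_location/",
                   "temp_directory/", "athena/"]:
        s3.put_object(Bucket=bucket, Key=folder)

    # Upload the CSVs, preserving the raw_data/customers and raw_data/orders
    # structure from the repo (local paths assume the unzipped download).
    s3.upload_file("data/customers/customers.csv", bucket,
                   "raw_data/customers/customers.csv")
    s3.upload_file("data/orders/orders.csv", bucket,
                   "raw_data/orders/orders.csv")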
- 00:12:01 The AWS Glue Data Catalog. The Glue Data Catalog is a persistent metastore. Let's say that again: it is a persistent metastore. Well, what does that actually mean? It stores metadata. So what is metadata? On the right-hand side we have different descriptions of metadata: location, schema, data types, and data classification. This is the important bit: it's a managed service that lets you store, annotate, and share metadata (again, the things on the right-hand side), which can be used to query and transform data. What I find can be the difficult concept to get your head around is that when you register things in the Glue Data Catalog, i.e. data sources or data targets, they do not move from their existing location. If I have a database, let's say just a MySQL database, and it's on AWS and I have data in it, when I register it with the Glue Data Catalog it will store the location, the schema, the data types, and even the classification of that data in the Glue Data Catalog. What it does not do is move the data to AWS Glue; the data stays inside that database, where it is. The catalog entry is just a pointer, plus information on how to access that data. That's the key thing: it becomes a metastore, a collection of information that you can use to perform ETL on different data sources and bring them together inside AWS. You need to remember that there's one AWS Glue Data Catalog per AWS region, so you get one per region. You can use IAM (Identity and Access Management) policies to control access to them, and you can also apply data governance, because you can annotate the Glue Data Catalog. So again: it's the Glue Data Catalog, it's where you store metadata, and it does not store physical data; it's just the reference information required to get to the data you have stored in AWS.
- 00:14:06 AWS Glue databases. A database is a set of associated Data Catalog table definitions organized into a logical group. What I've done in this little graphic is draw a database and put some tables inside it. This is where we name something and then associate tables to it, the way you would do it normally in an RDBMS database, for example. We'll jump onto the console in just a second, where we can look at actually setting up a database (we'll do tables as well), and we'll see how this works in action.
- 00:14:38 Okay, back on the AWS console. If this is your first time, don't worry. What we're going to do is navigate to AWS Glue, so type AWS Glue into the search bar and click AWS Glue. Once there, you'll notice a few different things; looking around the console, there's quite a lot going on. The left-hand side is going to be your best friend, so I'm going to minimize it (in case you've arrived without it open, there's a hamburger menu to expand it). Down the left you can see we have Getting started, ETL jobs, Data Catalog tables, Data connections, and Workflows. It breaks down into two sections: one is Data Catalog, where you have databases and tables and a couple of other things, including crawlers and connections; on top of this you have Data Integration and ETL, so ETL jobs and data classification. We'll be looking at some of this as the tutorial goes on. Importantly, there are legacy pages. If you're familiar with old Glue and its layout, these are the legacy pages down the left-hand side. Glue has done a lot of work on how it looks in the UI in the last couple of years; if you compare my previous tutorial to this one, it's a completely different look, which is why I've redone this video. But if you're coming from the old tutorial, or you haven't been on Glue in a few years, you can find your legacy pages down here.
- 00:15:56 So what we're going to do next is create a database for this tutorial, for the purposes of the rest of the demonstration. You need to go to Databases, which is on the left-hand side, and we need to add a database. Let's call this first one raw_data. You can give it a location, so let's do that; let's be proper about this (it isn't required, but it's good practice if you can). For S3 we need to go find that bucket again, the one I just created for the Glue course; I'm going to leave it open in a tab because we'll be referencing it quite a lot. We want the raw data folder, and we want to copy the URI, paste it back in like this, and we can create the database itself. That's how we create a database.
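For reference, a minimal boto3 equivalent of those clicks might look like this; the S3 URI is illustrative, so use the one copied from your own bucket.

    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    # Create the raw_data database in the Glue Data Catalog, pointing at the
    # raw data folder. LocationUri is optional but good practice, as above.
    glue.create_database(
        DatabaseInput={
            "Name": "raw_data",
            "Description": "Database for the raw course data",
            "LocationUri": "s3://aws-glue-course-your-name-123/raw_data/",
        }
    )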
- 00:16:44 AWS Glue tables. A Glue table is the metadata definition that represents your data; the data resides in its original store. There it is again, and I'm going to keep saying this: the data resides in its original store; the table is just a representation of the schema. This, as I said, is the most difficult thing for people coming to grips with Glue: it is just a representation of the data, just pointing to where the data is. When I was at re:Invent last year, a few people came up to me and said, hey Johnny, your videos are really useful because nobody else really mentions how this actually works. And it's quite simple: there are connections, there is information stored in the Glue Data Catalog that represents your data, wherever that repository actually resides, but it is not the data itself. So again, make sure you understand this; make sure you read this and understand it fully before you progress with AWS Glue.
- 00:17:41 AWS Glue crawlers. A crawler is a little program that AWS has created that finds information about data you have stored in AWS or in databases, and then populates, or tries to populate, the AWS Glue Data Catalog for you with that information. The idea behind it is that it lifts the burden of having to manually create the tables for data you have lying around or already have on AWS. I will show you how to use the AWS Glue crawler, and I will also show you how to manually add a table, so you don't have to use the crawler; it's just a handy little program that helps you, or minimizes the burden of trying to find all the tables that you have. You point it at where the data resides, and it comes up with the schemas and the table names for you. But you do have the ability to manually create the tables as well; the choice is yours. I always use a blend of both; there is the right tool for the right job, and we'll show you how to do both in this tutorial.
- 00:18:38 Okay, I've just gone back out to the AWS Glue console to make it simple to find the tables: hamburger menu on the left-hand side if it's not already expanded, and go to Tables; it's as simple as that. I'm going to minimize this little header and banner here. For this part of the demo, or the tutorial, what we're going to do is add a table manually. Looking inside the GitHub, you can see that we have two tables: one is orders, one is customers. We'll do customers because it has the smaller schema. I would never really add a table manually unless I'm just playing around with data (I would usually do it through code), but it's good to see this in action. So let's go back to Tables and go Add table at the top. The first thing we need to do is give the table a name; I'm going to call this customers, and actually I'm going to keep it all lowercase because that's just best practice, and since this is raw data I'm going to call it customers_raw. The next thing we need to do is create or select a database; well, we did that previously, so that's raw_data. You can give it a description: "this is the customers data" will do. We want to keep this as a standard AWS Glue table (the default); we're going to say the source is S3, the data is in our account, and for where the data is, go Browse, find that bucket we created through the CloudFormation template, go into raw data, and choose our customers folder inside it; then just click off the field to get rid of the warning. We are in CSV data, comma delimited, and go Next.
It then asks us a few questions about the schema itself, so we're going to add the schema here. We'll just go Add column number one: if we go back onto the GitHub we know it's customer ID, so copy that, go back in, and enter it; it's not a string type, it is an int type, so we'll just go int and hit Save. Then we want to add the next column; we know that it's first name, so go Add column and paste again. I'm going to keep these all lowercase, so take out that capital F; and we didn't do that for customer ID, which isn't great, so back in there, take that C down and make it lowercase. Then we need last name: copy that in, add a column, make sure it's number three, enter lastname, and save. And then we have full name as the last one: fullname, save. So that's the columns added; we have one int and three strings. Then we go Next, and you can see a summary of what it has asked us: we have it in raw_data, the customers data name, and it is in CSV. Then go Next again (if you need to re-edit the schema, just click that Edit schema button), then Next, then Create, and it's off creating the table for us.
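As a point of comparison, doing the same "through code" (as I'd usually do) could look roughly like this boto3 sketch. The SerDe and format settings are the standard Hive ones for comma-delimited CSV, and the lowercase column names are assumptions based on the video.

    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    glue.create_table(
        DatabaseName="raw_data",
        TableInput={
            "Name": "customers_raw",
            "Description": "This is the customers data",
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                # One int and three strings, matching the GitHub schema.
                "Columns": [
                    {"Name": "customerid", "Type": "int"},
                    {"Name": "firstname",  "Type": "string"},
                    {"Name": "lastname",   "Type": "string"},
                    {"Name": "fullname",   "Type": "string"},
                ],
                "Location": "s3://aws-glue-course-your-name-123/raw_data/customers/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    )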
- 00:21:50 Let's go and add the second table, the orders table, through a crawler. Go down the left-hand side and go to Crawlers (again, if the menu's disappeared, hit the hamburger), and then we're going to create a crawler. For the purposes of this I'm just going to call it aws glue tutorial, like this, and we'll hit Next. We haven't mapped the data already, naturally. Then we're going to go Add data source: we have an S3 data source, it's in this account, and we'll go Browse and find that bucket again, the one we've been using for this course. It's in raw data, and it is the orders folder that we're going to map this time. Again, that field might go red, so just click off it. Crawl all subfolders? Yep, that's correct. Then we want to go Add S3 source, and once we've got that, highlight the source and go Next. An existing IAM role: as part of this tutorial we created an IAM role; if you need to know its name, it's in the setup code, and if you go down to the role it says AWS Glue course. That's the one we're looking for, the one we checked existed after we ran the template; so back in here (sorry), just paste it in, find it, and click off, and that will be you. We don't need Lake Formation, so do not check that box. Our target database is raw_data; we don't need a prefix on the table name for now, that's fine; and then we're just going to schedule it on demand. Click Next, hit Create crawler. This will take a second, and then you want to run the crawler by clicking the Run button. It will go off and run itself to map the data. This will probably take one or two minutes in total, so I'll pause the video here and we can pick it up once it's done and has created our orders table for us.
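The same crawler can be created and kicked off through the API; a hedged sketch, using the names from this section:

    import time
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    # Create the crawler: point it at the orders folder, write tables into
    # raw_data. Role and bucket names follow the video's setup; adjust to yours.
    glue.create_crawler(
        Name="aws-glue-tutorial",
        Role="AWSGlueCourse",  # illustrative; use the role the template created
        DatabaseName="raw_data",
        Targets={"S3Targets": [
            {"Path": "s3://aws-glue-course-your-name-123/raw_data/orders/"}]},
        # Omitting Schedule leaves the crawler on demand, as in the video.
    )

    # Run it and poll until it returns to READY (about a minute in the video).
    glue.start_crawler(Name="aws-glue-tutorial")
    while glue.get_crawler(Name="aws-glue-tutorial")["Crawler"]["State"] != "READY":
        time.sleep(15)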
- 00:23:52 Okay, after exactly one minute for me (I did say between one and two), you can see that it's completed successfully and we've had one table change. You can click here to see what that is: we've added the table orders. Perfect; that's exactly what we were looking for. If we then go into our databases, into raw_data, you can see we have customers_raw and then we also have orders as our table; because we added this via a crawler and I didn't put a prefix or suffix in, it used the folder's name. That's completely fine; we're allowed to have the name that we want. More importantly, let's just check that it picked this up correctly. If we click on that location link it should take us into the orders folder, and we should see the CSV; perfect. Let's click on the orders table itself: did it get all the columns? It's got 16 columns in total, so if we go back to the data, you can see that we start with sales order ID and finish with line total. Let's do that little visual check to make sure those line up: yep, sales order ID, and we finish with line total. So we've managed to pick up the entire schema through a program called a crawler that AWS has written, which has stopped us having to manually enter this information, or enter it through code the way I did the manual setup for the customers table. It's taken that burden off us, which helps if you have a lot of data and you just want a way to look at the tables and get them loaded very quickly. So that's a crawler; let's move on to the next section.
- 00:25:21 A brief word on partitions in AWS; this is important to know. Our ETL job, when we get there, will create a single partition, but you can play around and create more. A partition is this: folders where data is stored in S3, which are physical entities, are mapped to partitions, which are logical entities, i.e. columns in the Glue table. What does that actually mean? Okay, so we could have a table called sales, and we could have a sale date. AWS Glue, or indeed big data processing frameworks in general, give you the ability to create partitions, and with these partitions we can split things into year, month, and day for that sale date. What these actually become are logical folders mapped to physical folders on S3; so we actually split the data on S3, and rather than just having a column in the table, you have a folder. What this means is that when you search for something, like "I want to find the sales with a date of February the 2nd, 2019", the inner workings of AWS Glue know that they can exclude the folders that don't contain that date: it knows those dates aren't correct, so it can just go and look at this one folder out of the four. It's a way to speed up queries, and a way to speed up writes and updates as well, depending on exactly what you're doing. You need to understand partitions to use AWS Glue. Again, we will create a partition, and I will show you how when it comes to the ETL job; if you're not quite grasping it now, hopefully by the time we do the ETL job it will all sink in.
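To make the year/month/day example concrete, here is a small PySpark sketch. It assumes an existing SparkSession and a DataFrame df with a sale_date column (illustrative, not the course data); the write produces exactly the kind of folder layout described above.

    from pyspark.sql import functions as F

    # Derive the logical partition columns from the sale date.
    sales = (df
        .withColumn("year",  F.year("sale_date"))
        .withColumn("month", F.month("sale_date"))
        .withColumn("day",   F.dayofmonth("sale_date")))

    # partitionBy turns those logical columns into physical S3 folders.
    sales.write.partitionBy("year", "month", "day").parquet("s3://my-bucket/sales/")

    # Resulting layout; a query filtered to 2019-02-02 reads one folder
    # and skips the rest:
    #   s3://my-bucket/sales/year=2019/month=2/day=1/part-....parquet
    #   s3://my-bucket/sales/year=2019/month=2/day=2/part-....parquet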
- 00:26:54 AWS Glue connections. A Glue connection is a Data Catalog object that contains the properties required to connect to a particular data store. Typical of many cloud providers or any ETL tool, you can store a connection: if you have a database, wherever it may live, on premises or in AWS, you can store the connection string, the password, and the username inside AWS Glue. That means when you need to reference it in an ETL script, you're actually just entering the connection object's name; you don't store the passwords or anything in the script. That's pretty standard practice these days, and when AWS Glue was built you would expect it to be there; it is there, so please use AWS Glue connections when you need them. Let's take a little look at connections: down the left-hand side you'll see Connections; click on it. In Connectors you can go to the marketplace, or if you know your connector already exists you can go Create connection, and you can see a list of connectors that you can use with AWS Glue. These include AWS connectors plus first- and second-party connectors as well, such as Salesforce and Snowflake; SAP is sitting in there too. Now, on to the fun bit.
- 00:28:02 AWS Glue ETL. AWS Glue ETL supports extracting data from various sources (and that's important: various sources, the likes of S3, RDS, and on-premises databases), transforming it to meet your business needs, and then loading it into a destination of your choice. We will be doing this in the tutorial. I will show you how to do it visually, so you will not need to code; I will show you the script as well, but if you're not a good coder, or you can't code, don't worry, you'll be able to follow along.
The AWS Glue ETL engine, you need to know, is an Apache Spark engine, distributed for big data workloads across worker nodes. It also supports Python, but more typically you would use the Spark engine. You also need to understand AWS Glue DPUs: one DPU is equivalent to 4 vCPUs and 16 GB of memory when you're provisioning jobs. When we jump into the ETL job in a second you will see that I provision two DPUs for the job, which means I will have 8 vCPUs and 32 GB of memory available. If you do not have enough DPUs for your job, it will crash and fail; if you have too many DPUs, you're paying for compute resource you're not using. It is a bit of an art to tune your Glue jobs, and there are handy features on the console to show you when you're under-provisioning and over-provisioning DPUs. This is also the charging mechanism: depending on the region you're in, the cost you pay is priced on how many DPUs you use per minute, so it's important to size these correctly.
And just before we jump on and create an ETL job, it's good to know about bookmarks. We won't use them, but you can put bookmarks in, which basically means that when new data arrives, the ETL job you're creating will not reprocess the old data; you're just doing a delta load. This is really handy if you're getting hourly or daily loads into an S3 bucket and you just want to process the new data: AWS automatically tracks it for you and says, hey, I'm not going to process the previous run of data, I'm just going to process the new data that has landed. It bookmarks it and will not reprocess it again.
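The DPU arithmetic is worth pinning down. A tiny sketch; the per-DPU-hour price is an assumed illustrative rate, since Glue pricing varies by region.

    # Capacity scales linearly with DPUs; cost scales with DPUs x runtime.
    VCPUS_PER_DPU = 4
    GB_PER_DPU = 16
    PRICE_PER_DPU_HOUR = 0.44  # assumed illustrative rate in USD, not a quote

    def job_capacity(dpus: int) -> tuple[int, int]:
        """Total vCPUs and memory (GB) available to a job."""
        return dpus * VCPUS_PER_DPU, dpus * GB_PER_DPU

    def job_cost(dpus: int, runtime_minutes: float) -> float:
        """Approximate cost: DPUs x runtime, billed pro rata per hour."""
        return dpus * (runtime_minutes / 60) * PRICE_PER_DPU_HOUR

    print(job_capacity(2))             # (8, 32): the 8 vCPUs / 32 GB above
    print(round(job_cost(2, 10), 4))   # e.g. a 10-minute run on 2 DPUs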
- 00:30:08 Okay, back on the console, the first thing we're going to do is create a database, so we can actually store the new tables we create during this visual ETL process. On the left-hand side go to Databases, then Add database, and we're going to call this database processed_data (oops, a typo there; it should be processed_data). We need the S3 location that we already set up when running the setup script, so go and find that bucket again (I've minimized the tabs between the different parts of the lesson). I have a processed data area of the bucket; I want to hit the tick and take the URI (again, URI, not URL; that's really important) and paste it in down here. We also need to enter a little bit of a description, so I'm just going to say: "this is the database to hold the tables for processed data". Then you want to create the database, and that gives us a location to store some of the processed data.
- 00:31:21 Now, down the left-hand side you can see there is a Data Integration and ETL section. There are lots of different things here, but we're going to be working in ETL jobs, and for the purposes of this tutorial we're going to do Visual ETL; that means if you can't code, or you don't want to code, you don't need to for this demo. Click on Visual ETL and then hit create job from a blank graph. This is your UI, or GUI, that helps you build Glue jobs using nodes rather than coding; don't worry, it still creates a code script under the Script tab behind the scenes, so you can see what it's doing, but for the purposes of this we'll do visual ETL. The first thing you need to do is give it a name, so let's call this processed customers job and hit Enter. You'll see when this goes red that we have a few things to fill in. IAM role: I created one during the setup script called AWS Glue course, so select that. Scrolling down into Advanced properties, we have to pick a few named locations to store different things. The first is an area in the bucket to store the scripts; if we go into the bucket itself, I created a script location folder, so choose that. You also need a place to put your logs, so again (browse, that is, not view; sorry), go find the bucket; mine is the aws-glue-course-johnny-chivers one; inside the temp directory would be fantastic; choose that and then just append /logs. Then, scrolling down further, inside Temporary path, browse again, go find that S3 bucket and its temp directory, and choose that directory. (I've granted all the permissions required to access this bucket during the script setup, so you won't have to add anything extra by using those locations.) That's everything under Advanced settings, so we can scroll back up and collapse it. Then, to keep the cost down, the DPUs, the number of workers, we want is two; that's perfect. Everything else can be left as it is; you can give it a description if you want, but I'm not going to bother. Then hit Save on the top right. That's the job saved, and you'll see there are no red warnings anymore.
- 00:33:39 The plus icon is where you can add your nodes. The first thing we need is a source, and our source is going to be the AWS Glue Data Catalog. We're going to get that customers table that we have already set up: choose raw_data, and then under tables, customers_raw; select that. You'll see here that there's a data preview section ready to go; depending on whether you've been into ETL before or not, this can take up to two or three minutes to start. You'll see "data preview processing", and eventually you'll get to the point where your data preview is ready, so you can actually see the data as it currently sits, and you have the output schema for this data as well.
What we're going to do is take it and store it in Parquet format in S3, registering a new table in the Glue Data Catalog. We're also going to add a transformation to the table, and in this case we're going to add a processing time. So inside the nodes, go to Transforms; there are lots and lots of different transformations you can do with a node, and in this one we're going to add a current timestamp as our processed date timestamp. Click that node and it will automatically join up; if it does not automatically join, it will look like this on your canvas, and you just want to click on it, choose the parent node, and select the AWS Glue Data Catalog source; you can see there that it's ready to go. You need to give it an output column, so I'm just going to call this processed timestamp, keeping things nice and simple, and we'll leave the rest as default. You can see with the output schema that it's picking up the new addition we have, the processed timestamp.
Then we need a target, so go into Targets, and we're going to use S3 as our target. Perfect, we've selected S3. You can see it's already got the Parquet format selected; we're going to do Snappy compression. We do need to pick a location to save this to in S3, so again let's go into that bucket we created for the purposes of this course, inside the processed data folder (select processed data; I mix that up all the time), hit Choose, and then we want to append customers_processed, with a forward slash on the end as well. That's the location we're going to save the data to. We do want to create a new table in the Data Catalog and, on subsequent runs, update the schema and add new partitions; that's exactly what we want to do. Our database is going to be processed_data, and we have to give the table a name, so let's call it customers_processed; that's the name of the table we're going to create. Then we want to add a partition key as well, and our partition key will be the processed timestamp. Then we want to save everything that we've just done, and that is our ETL job: we're going to take that data, add a timestamp, transform it into Parquet, save it to S3, then create a Glue Data Catalog table called customers_processed that's going to sit in the processed_data database, and we're also going to create a partition key on the processed timestamp.
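As mentioned, the visual editor generates a script under the Script tab. It won't be byte-for-byte what Glue produces, but a hand-written PySpark sketch of the same extract/transform/load looks roughly like this; the names follow the video, and the getSink pattern is the one Glue's catalog-updating scripts typically use.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Standard Glue job boilerplate.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Extract: the customers_raw table registered in the Glue Data Catalog.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="raw_data", table_name="customers_raw")

    # Transform: add the processed timestamp column.
    df = source.toDF().withColumn("processed_timestamp", F.current_timestamp())

    # Load: Snappy Parquet to S3, partitioned on the new column, creating and
    # updating customers_processed in the processed_data database.
    sink = glueContext.getSink(
        path="s3://aws-glue-course-your-name-123/processed_data/customers_processed/",
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=["processed_timestamp"],
        compression="snappy",
        enableUpdateCatalog=True,
    )
    sink.setCatalogInfo(catalogDatabase="processed_data",
                        catalogTableName="customers_processed")
    sink.setFormat("glueparquet")
    sink.writeFrame(DynamicFrame.fromDF(df, glueContext, "customers_processed"))

    job.commit()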
- 00:36:51 Okay, perfect. Let's save this, and then when you're actually ready you can go to Runs up here, or click Run there, whichever you want, and run the job. That'll kick it off. It has to spin up a few containers and a few other things behind the scenes, so it might take a few minutes to process in total because it's the first go. Just hit refresh to see what's happening, and you can also see down here what's going on with the job itself: the input arguments and everything else on the UI. I'm going to pause here and let it do its thing, and once it's processed successfully we'll pick it back up; you can see it's off and running, 14, 16 seconds in already.
Okay, you can see that after 1 minute 4 seconds precisely it succeeded, which is fantastic. Let's go have a look at the Glue Data Catalog. If we go into databases, into processed_data, you'll see that we actually have a table there. We can click on the table and see that we have our partition column plus full name, customer ID, and last name. If we click on Partitions we should be able to see that we have a partition for when I just ran the job, and if we go back into the schema we can also go to the location by clicking the URL there; that takes us to where our data is currently sitting as a Parquet file. Excellent.
As part of this demo I also set up Athena so we could look at the data; you'll notice there's a workgroup selected for Athena. Inside Athena you'll see there's a workgroup called AWS Glue course Athena workgroup, so select that on the top right; once you've selected that workgroup, there's a location to save our Athena results. You can have a look at the data by going to processed_data; under tables we should have our table, and the simplest way is just to click those three dots and go Preview table. This will query our data table and bring back the data in Parquet format, and you can see that the processed timestamp is there for us too.
That's how you do it with the processed data for customers. You can do exactly the same with the orders data: you would go in and start a visual ETL job, the same thing again, except this time with the Glue Data Catalog source you would select raw_data and the table this time would be orders. So you can do the same thing with orders as you did with customers. I won't do it for the purposes of this demo; that's a bit of homework for you. Go ahead and do exactly the same thing for the orders data and you'll be off and running, creating your second ETL job.
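If you'd rather run that Athena preview programmatically, here's a hedged boto3 sketch; the workgroup name follows the video's CloudFormation setup, so adjust it to what yours is actually called.

    import time
    import boto3

    athena = boto3.client("athena", region_name="eu-west-1")

    # Preview the new table using the course workgroup (which already has a
    # results location configured).
    qid = athena.start_query_execution(
        QueryString='SELECT * FROM "processed_data"."customers_processed" LIMIT 10;',
        WorkGroup="aws-glue-course-athena-workgroup",
    )["QueryExecutionId"]

    # Wait for the query to finish, then print the rows (the Parquet data
    # plus the processed_timestamp column).
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])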
- 00:39:33 AWS Glue data quality is a relatively new feature; in fact it was marked "new" when I was filming this video, so you will see it marked as new on the console. This helps you monitor the quality of your data via the Data Quality Definition Language (DQDL), using an open-source project called Deequ. We'll go into the console in a second and you'll see it in action: it will formulate some rules for us, show us this DQDL language in operation, and then we can alter that language, or remove rules, to suit our data quality needs. The point is that when we're using Glue we can perform data quality checks on data that's coming in, or on data that we create through ETL, and then apply that governance through code. Let's get started and have a little look in the console at how AWS Glue data quality works.
We're going to take a quick look at Glue data quality. It's a new feature, and it's in the Glue Data Catalog: you want to go to Tables. Let's pick a table; let's pick the orders table, since it's the most detailed, so hopefully we'll get something. You can see up here they've got Data quality marked as new (it's actually still under "new" and has a video, so watch that if you want). What we want to do is get some data quality rules for this table. If we go to Run history we'll see that there's probably nothing on this table whatsoever; are there any recommendation runs? No, there aren't. Right, so we want to go to Recommended rules, and click Recommend rules; oh, and choose the IAM role, we've got one for it. This is where Glue is going to go off and create rules for us that it recommends; that's it off and running. It's using a bit of AI, a bit of machine learning, looking at our data and coming up with rules for us. I'll pause the video; this will probably take a few minutes, and then we'll see some of the rules that Glue decides should be applied to our orders table in terms of data quality.
- 00:41:32quality okay and after a few minutes you
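You can kick off the same recommendation run programmatically. Here is a minimal sketch with boto3, where the database name and IAM role ARN are assumptions for this tutorial's setup:

```python
# Hypothetical rule-recommendation run against the orders table; the
# database name and role ARN are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_data_quality_rule_recommendation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "raw_data",  # assumed catalog database
            "TableName": "orders",
        }
    },
    Role="arn:aws:iam::123456789012:role/GlueTutorialRole",  # assumed role
)

# Poll the run; once it succeeds, the response includes the recommended DQDL
result = glue.get_data_quality_rule_recommendation_run(RunId=run["RunId"])
print(result["Status"])
```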
- 00:41:34 Okay, after a few minutes you can see that it has a successful run. If you click into the run you will see the rules that it has created, or rather recommended, for this dataset. You can copy these rules: go back into the previous tab, go to the rules themselves by clicking them, and hit Copy. Then we want to go back out to the table (I'm sure there's a quicker way to do this than what I'm doing): go to Data quality, scroll down, and you'll see Create data quality rules. Click in and you can just paste in what we lifted, so I'm going to press Ctrl+V, and you want to get rid of that one and the one at the bottom. Then you can start to mess around with this. You can see, for example, that your row count should be between these two values, and you can adjust that if you want. It says to make sure that everything has a sales order in it, and it looks at the standard deviation of this column and this one. Let's say we wanted a range to be bigger or smaller: we could just go 100, and since we know our maximum is going to be 10,000, we could go 10,000. This is how you start to build up those rule sets. You save the rule set and give it a name (I'm just going to call this dq1), and then it will apply this rule set to the data in the table and start to alert you when these rules are broken. Again, you can see rule types like column values, where you can list all the allowed values for a column, and using this library you can start to build the rules that ensure the quality of your data.
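A saved rule set is just DQDL text, and you can create one against a catalog table with boto3 as well. A minimal sketch, where the column names, thresholds, and database name are assumptions for this tutorial's data:

```python
# Hypothetical DQDL rule set for the orders table; the column names and
# thresholds are illustrative assumptions, not the console's exact output.
import boto3

ruleset = """
Rules = [
    RowCount between 100 and 10000,
    IsComplete "sales_order_id",
    ColumnValues "quantity" between 1 and 10000
]
"""

glue = boto3.client("glue", region_name="us-east-1")

glue.create_data_quality_ruleset(
    Name="dq1",
    Ruleset=ruleset,
    TargetTable={
        "DatabaseName": "raw_data",  # assumed catalog database
        "TableName": "orders",
    },
)
```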
- 00:43:06 Next up is AWS Glue scheduling. We will take a look at this in just a second, but first you need to know about AWS Glue triggers: this is how you initiate an ETL job, or indeed a crawler, and I'll show you both on the console. A trigger can be defined on a schedule, fired by an event, or run on demand; it's up to you, and we'll see the options. To trigger something is how we start a workflow or an ETL job. Then we have AWS Glue workflows, which let us create complex extract, transform, and load activities: we can run a crawler, then run our ETL script, then run another ETL script. When we're on the console in a moment, I will show you how to create a trigger and how to create a workflow. These are really useful if you're only using AWS Glue; if you need to use other things outside of Glue, like EMR or Athena, then you'll have to consider something else, such as Managed Workflows for Apache Airflow, which is what I commonly use. But if you're only in the AWS Glue environment and you only need that tool set, then AWS Glue workflows is a perfectly viable tool for scheduling ETL jobs (see the trigger sketch below).
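Creating triggers programmatically looks like the following minimal sketch with boto3; the trigger, crawler, and job names, as well as the cron schedule, are assumptions for illustration:

```python
# Hypothetical triggers for this tutorial's crawler and ETL job; the
# names and the cron expression are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A scheduled trigger: run the crawler every day at 12:00 UTC
glue.create_trigger(
    Name="daily-crawler-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 12 * * ? *)",
    Actions=[{"CrawlerName": "aws-glue-tutorial"}],  # assumed crawler name
    StartOnCreation=True,
)

# An on-demand trigger: the ETL job starts only when you fire it yourself
glue.create_trigger(
    Name="process-job-on-demand",
    Type="ON_DEMAND",
    Actions=[{"JobName": "process-customers"}],  # assumed job name
)
```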
- 00:44:11 Just a brief mention now that we've come out of workflows, and I know I keep mentioning it: Airflow is one option you can use, Step Functions is another, and EventBridge is a third. These are three very common patterns I see for scheduling Glue jobs, so bear them in mind, but workflows work perfectly well if you are only doing things inside AWS Glue.
- 00:44:27 Okay, let's look at orchestration within the Glue console. If you're just using Glue, this is fantastic for orchestration: if you're just running Glue jobs, the Data Catalog, and crawlers, this is a good place to come and do some orchestration. If you're knitting other AWS services together, my recommendation is to look at other orchestration options, particularly Managed Workflows for Apache Airflow; that's my go-to at the moment when I'm integrating EMR, Athena, and Glue all together. But if you're just using AWS Glue, this is a great tool. So go into Workflows and add a workflow; I'm just going to call this test workflow and create the workflow itself. You can click into the workflow, and then we need to start adding things, so we want to hit Add trigger. You can see that we don't currently have any triggers (you can also create triggers from the Triggers page if you want). If you hit Add trigger, we can name this trigger crawler glue tutorial, and we're just going to run it on demand. Hit Next, and it asks for our target resource: for the resource type choose Crawler, select the AWS glue tutorial crawler, hit Add, go Next, and then hit Create. So there's a crawler trigger. Then we want to go back into our workflow, click on it, and hit Add trigger; we should have our crawler trigger set to run on demand, so we'll hit Add, and you can see here that we've added the trigger to the workflow.
- 00:46:02 If we just look at this and hit Run workflow, it will start to run the workflow, and that workflow is going to go and crawl our data. We can then build dependencies across here: we can make the next node actually run the Glue job once the crawler completes, and you can start to build more complicated workflows. To do that, you would click on the crawler trigger like this, add a trigger, and add a new one. Give it a name; let's just call this glue job ETL. This one is an event trigger that starts after the event, so hit Add. Then we want to go in and select the crawler, or in this case the job, that we want to start, which is the process job, and hit Add, and then you can build out those different pieces as you go. So now, once the crawler has run, we're going to run the job, and again you can hit Run workflow for that to kick off as well. If you had many jobs or many crawlers, you could start to build a more complex workflow for scheduling. As I said, this is just a little taster of what you can do with it; if you're working within Glue only, it's totally valid to use this as your orchestrator (a programmatic sketch of the same pattern follows below).
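To make the dependency idea concrete, here is a minimal sketch of the same pattern built with boto3: an on-demand trigger starts the crawler, and a conditional trigger starts the job once the crawler succeeds. All names are illustrative assumptions:

```python
# Hypothetical workflow: crawl first, then run the ETL job on success.
# Workflow, crawler, and job names are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_workflow(Name="test-workflow")

# On-demand trigger that starts the crawler inside the workflow
glue.create_trigger(
    Name="crawler-glue-tutorial",
    WorkflowName="test-workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "aws-glue-tutorial"}],
)

# Conditional trigger: run the process job once the crawler succeeds
glue.create_trigger(
    Name="glue-job-etl",
    WorkflowName="test-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "aws-glue-tutorial",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "process-customers"}],
)

# Kick the whole thing off, just like the Run workflow button
glue.start_workflow_run(Name="test-workflow")
```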
- 00:47:13 AWS Glue DataBrew: I've left this in because I covered it last time. You'll notice when I jump onto the console that it actually sits separately from AWS Glue; it's its own service. It's for visual data preparation, which makes it easier for non-engineers, data analysts, or anyone who just wants to quickly look at data in what, quite frankly, looks like an Excel workbook, to clean and normalize data. I am always hesitant to put this into production, and hesitant for data engineers to use it; I use it to look at data quickly and manipulate it before I go and write some data engineering code and put it through more concrete CI/CD-type processes for DevOps. But it is a good tool, a Swiss Army knife to have in your arsenal, for when you need to look at data or want to give an option to people who look at data but can't code. Let's look at this in action; I will use the sample data to help us see how this works when we jump onto the AWS console.
- 00:48:07 Now, interestingly, with the new AWS console they've actually moved Glue DataBrew out, so you have to type in AWS Glue DataBrew and it comes up as a completely separate service page. There's a video again about how it works, so feel free to watch that. We want to create a sample project, and we're going to use the UN resolution votes dataset. You want to select Create new IAM role; we'll call this one AWS-course-glue and it'll add the rest of the name for us, which means we can delete it once we're done. Then you want to hit Create project. This will run off in the background, create the IAM role for you, and load the data into the DataBrew UI for us, and we can start to play around with it. If you haven't been here before, it does look a little bit like Excel or Smartsheet, and that's really the idea: it's an area for non data engineers, or for data engineers who want to get to grips with data quickly but don't want to do it through code.
- 00:49:13 I would generally use it if I wanted to look at something quickly, and I would let non-technical users use it as well. I'm always just a little bit apprehensive about creating jobs out of it and letting people run Glue code that way; if there's a valid business reason, sure, but what we don't want is to get away from the best practices of CI/CD and maintaining those DevOps pipelines for our data engineering. But again, there's nothing wrong with it, provided it's done in a safe, agile, and secure way. You can see that this is loading up at 51%, so I'm just going to pause the video until it is done.
- 00:49:52 Okay, so after a few minutes you can see that it loads up the rows and starts to give us a look at what we have. At the top of each column it tells you information like the number of distinct values, the range that's in there, and the percentages of each of the values; it is using AI/ML behind the scenes to populate a lot of this. You can look at the schema as well by clicking Schema, which tells you all the different columns that you have, and you can get a profile of the data on top of that if you want, but I'm just going to go back to grid view. If you're interested, you can then start to filter the columns, sort the columns, format the columns, and add things in as well. Let's just do a quick filter: add a filter, do it by condition, and choose where it contains; then you can select the source column. Let's say ours is resolution and we want to make sure the column contains a value of, let's say, 66. You apply that, it applies that condition to the column, and you can see it's off and running: if a row has 66 in it, it's going to be kept.
- 00:51:03 And that's how it works: it stores this as a recipe, and you can see here that's the first step. You can then add another step; let's say we wanted to dedupe. We want to remove duplicate values in columns, and you can choose which columns; let's say we wanted to dedupe the first column, which is assembly session, then apply to all rows. It's going to dedupe based on that column, selecting the first value it sees and bringing the data down to a smaller, deduplicated set of rows. You can see here that you now have your recipe; you can publish that recipe, or alternatively import a new one or download it, and once you have the recipe you can apply it to datasets over and over again (a sketch of doing that programmatically follows below).
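As a rough illustration of reusing a published recipe, you can apply it to a dataset as a DataBrew recipe job via boto3. A minimal sketch, where the dataset, recipe, role, and bucket names are all assumptions:

```python
# Hypothetical recipe job that applies a published recipe to a dataset.
# Dataset, recipe, role, and bucket names are illustrative assumptions.
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")

databrew.create_recipe_job(
    Name="un-votes-cleanup-job",
    DatasetName="un-resolution-votes",  # assumed dataset name
    RecipeReference={
        "Name": "un-votes-cleanup",     # assumed published recipe
        "RecipeVersion": "1.0",
    },
    RoleArn="arn:aws:iam::123456789012:role/AWS-course-glue",  # assumed role
    Outputs=[
        {
            "Format": "CSV",
            "Location": {
                "Bucket": "your-glue-tutorial-bucket",  # assumed bucket
                "Key": "databrew-output/",
            },
        }
    ],
)

# Run it whenever you need the cleaned output refreshed
databrew.start_job_run(Name="un-votes-cleanup-job")
```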
- 00:51:45 The best thing to do is to sit and play around with these different functions if you're interested. Again, for me, as I said: Glue DataBrew is really there for users that don't write code. I give it to my non-technical users when it makes sense, and I use it to look at data really quickly, but I'm always just a little bit apprehensive about anyone putting it into a production pipeline for data engineering. Sure, business users who want to play around with data, create jobs out of it or out of bigger datasets, and use it in their day-to-day business: cool. But for those refined CI/CD pipelines with DevOps processes, I'm not such a fan of using Glue DataBrew.
- 00:52:26tutorial on AWS glue we've kind of took
- 00:52:28a look at what AWS glue is we spent some
- 00:52:30time looking at the AWS glue data
- 00:52:32catalog we then created ETL jobs we've
- 00:52:35learned how to schedule those ETL jobs
- 00:52:37and we've also looked at glue data
- 00:52:39quality and glue data brew as well one
- 00:52:42final reminder really like a like And
- 00:52:44subscribe to this channel it helps me
- 00:52:46out in the description below I've also
- 00:52:48left a link to the exams for the AWS
- 00:52:51data engineering certification if you're
- 00:52:53interested in taking that there's also a
- 00:52:56YouTube tutorial on this channel for
- 00:52:58that certification and until next time
- 00:53:00folks thanks for watching
- AWS Glue
- ETL
- Data Catalog
- Crawlers
- Data Quality
- DataBrew
- Scheduling
- AWS Services
- Data Engineering
- Tutorial