AWS Data Engineering Interview
Summary
TLDR: ASA, with five years of experience, works primarily on data engineering with AWS services. She has worked in both the banking and insurance sectors, applying skills in ETL, data migration, and AWS services such as Redshift, S3, and Lambda. The interview delves into a specific insurance use case in which ASA used a range of AWS tools to manage data pipelines, from data migration with DMS to data warehousing in Redshift. The discussion spans data partitioning in S3, schema management, reconciliation, and securing PII data. ASA also covers orchestrating tasks with AWS Step Functions, logging and notifications with SNS, and handling schema changes and data security with Python scripts. The conversation touches on her SQL skills, how she generates reports, and how she works with databases such as DynamoDB and Elasticsearch, emphasizing her understanding of their respective functionalities. The use case highlights ASA's ability to deliver data solutions efficiently while ensuring data validity, security, and performance.
Takeaways
- ASA has 5 years of experience in data engineering.
- Skilled in AWS technologies like Redshift, S3, and Lambda.
- Works on transforming and migrating data in insurance use cases.
- Implements security measures for PII data using AWS services.
- Utilizes ETL processes and data warehousing effectively.
- Uses SNS for notifications and error tracking in data pipelines.
- Implements schema drift solutions using custom Python scripts.
- Proficient in SQL for querying and managing data.
- Combines DynamoDB and Elasticsearch for diverse data needs.
- Ensures data lineage and auditing through AWS services and DAGs.
Timeline
- 00:00:00 - 00:05:00
ASA introduces herself, highlighting five years of experience in data-centric roles, focusing on ETL projects, AWS services, data modeling, and Python/SQL scripting. She describes a project involving insurance data, detailing the data transfer from Oracle to AWS Redshift and the data warehousing process.
- 00:05:00 - 00:10:00
ASA elaborates on the architecture of her insurance data project. Data from Oracle is migrated to AWS S3, then to Redshift for data warehousing. She discusses using AWS services like Redshift Spectrum for data processing and mentions employing CDC to handle transactional data from Oracle.
- 00:10:00 - 00:15:00
The discussion covers data load management using DMS and CDC, and the retention policies for logs and raw data stored in S3. ASA explains how data is captured incrementally, with raw data stored in S3 and moved to archival storage after 90 days.
- 00:15:00 - 00:20:00
ASA explains partition management in S3, ensuring daily updates are retained while past data can still be reconstructed if needed. She also describes reconciliation processes and SNS notification mechanisms for data validation and process integrity.
- 00:20:00 - 00:25:00
The talk continues with data verification using SNS triggers and SQL queries for KPI validation. ASA describes daily job scheduling, how data discrepancies are handled, and her team's approach to communicating potential process failures to consumers.
- 00:25:00 - 00:30:00
ASA outlines the use of SNS notifications to keep consumers informed of process statuses. Additionally, she explains securing S3 bucket data with IAM and ACLs, and the importance of abstraction layers and data masking when dealing with PII in e-commerce data pipelines.
- 00:30:00 - 00:35:00
ASA discusses securely handling PII data by creating different schema access levels and using Python to separate PII from non-PII data in an e-commerce setting. She explains transforming CSV logs to Parquet with Python for optimized data storage and analytics.
- 00:35:00 - 00:40:00
ASA elaborates on handling schema evolution by comparing column sets between the source and the Redshift target. She mentions the potential use of Glue for schema management and outlines data pipeline transformations and cataloging options such as AWS Glue and EMR.
- 00:40:00 - 00:45:00
The focus shifts to using AWS Glue and EMR for ETL processes, considering instance selection and use-case-based EMR configurations. ASA suggests EMR Serverless might differ from Glue in terms of catalog features and explains using the Data Catalog for schema tracking.
- 00:45:00 - 00:50:00
A discussion on DynamoDB and Elasticsearch explores their application in storing analytics data. ASA explains scenarios where Elasticsearch offers advantages such as complex queries and full-text search, and contrasts this with DynamoDB's high throughput and its overwrite behaviour on duplicate keys.
- 00:50:00 - 00:55:12
ASA discusses designing auditing and logging for ad-hoc data pipelines in S3 using AWS services such as Lambda for triggering and CloudWatch for monitoring. She explains data lineage tracking through RDDs in Spark and describes using services like CloudTrail for infrastructure-level auditing.
Video Q&A
What AWS services were discussed in the interview?
The interview discussed services like AWS DMS, Redshift, S3, Lambda, Glue, Step Functions, Athena, DynamoDB, and Elasticsearch.
What programming languages does ASA primarily use?
ASA primarily uses Python and SQL.
What is the role of SNS notifications in ASA's data pipeline?
SNS notifications are used to trigger alerts based on job success or failure, and to notify users when there are discrepancies or issues.
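As a rough illustration of this pattern (not ASA's actual code), a step at the end of the pipeline can publish the job status to an SNS topic that consumers subscribe to; the topic ARN and message fields below are placeholders:

```python
import boto3

sns = boto3.client("sns")

def notify(job_name: str, status: str, details: str = "") -> None:
    """Publish a pipeline status message to an SNS topic (topic ARN is a placeholder)."""
    sns.publish(
        TopicArn="arn:aws:sns:ap-south-1:123456789012:pipeline-status",  # hypothetical topic
        Subject=f"[{status}] {job_name}",
        Message=f"Job {job_name} finished with status {status}. {details}",
    )

# Example: notify("insurance_daily_load", "FAILURE", "KPI mismatch on policy_count")
```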
How does ASA handle schema changes in the data pipeline?
ASA uses Python to manage schema changes by comparing columns in source and target, and sends notifications to users when new columns are detected.
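A minimal sketch of this set-difference idea, assuming pandas for reading the source file's header and a query against the target's information_schema for its columns (connection details, file path, and table name are placeholders):

```python
import pandas as pd
import psycopg2  # assumes Redshift is reachable over the PostgreSQL protocol

def detect_new_columns(csv_path: str, table: str, conn) -> set:
    # Columns present in today's source file (header only)
    source_cols = set(pd.read_csv(csv_path, nrows=0).columns)

    # Columns currently defined on the Redshift target table
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            (table,),
        )
        target_cols = {row[0] for row in cur.fetchall()}

    # Anything in the source but not in the target is a schema-drift candidate
    return source_cols - target_cols

# new_cols = detect_new_columns("today.csv", "ods_policy", psycopg2.connect(...))
# For each new column: run ALTER TABLE ... ADD COLUMN <col> VARCHAR, then notify users (e.g. via SNS).
```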
What approach does ASA use for data partitioning in S3?
ASA partitions data based on date into S3 folders for organized and timely access.
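In practice this usually means writing each day's files under a date-based key prefix and then registering only the partition that is needed on the external table; the bucket, prefix, and table names below are hypothetical:

```python
from datetime import date
import boto3

s3 = boto3.client("s3")

def upload_daily_file(local_path: str, bucket: str = "insurance-raw-zone") -> str:
    # Partition by load date so each day's data lands in its own "folder"
    key = f"raw/policies/load_date={date.today():%Y-%m-%d}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)
    return key

# An external table (e.g. Redshift Spectrum) can then register just the latest partition:
#   ALTER TABLE spectrum.policies ADD PARTITION (load_date='2024-01-15')
#   LOCATION 's3://insurance-raw-zone/raw/policies/load_date=2024-01-15/';
```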
How is reconciliation handled in ASA's data process?
Reconciliation is done after production load by matching KPIs and sending notifications if there are discrepancies.
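One way to express this kind of KPI reconciliation is a SQL comparison between the ODS (source replica) layer and the fact layer, executed from Python; the schema, table, and column names here are illustrative only:

```python
RECON_SQL = """
SELECT src.policy_count, tgt.policy_count, src.premium_amount, tgt.premium_amount
FROM (SELECT COUNT(DISTINCT policy_id) AS policy_count,
             SUM(premium)              AS premium_amount
      FROM ods.policies) src
CROSS JOIN
     (SELECT COUNT(DISTINCT policy_id) AS policy_count,
             SUM(premium)              AS premium_amount
      FROM dw.fact_policy) tgt;
"""

def reconcile(conn) -> bool:
    with conn.cursor() as cur:
        cur.execute(RECON_SQL)
        src_count, tgt_count, src_premium, tgt_premium = cur.fetchone()
    ok = (src_count == tgt_count) and (src_premium == tgt_premium)
    if not ok:
        # Publish a discrepancy alert, e.g. with the SNS helper sketched earlier
        pass
    return ok
```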
What kind of triggers are used to invoke AWS Lambda?
Lambdas are invoked using put events from S3 when new data is uploaded.
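A minimal Lambda handler for this trigger pattern could look like the following; the event shape follows the standard S3 notification format, while the Glue job name and arguments are assumptions:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put-event notifications carry the bucket and object key of the new file
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hand the heavy processing to a Glue job so it is not bound by Lambda's 15-minute limit
        glue.start_job_run(
            JobName="process-adhoc-drop",  # hypothetical job name
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
    return {"status": "started"}
```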
How is data security handled for PII data?
Data security includes using abstraction layers, data masking, and separating PII from non-PII data into different storage locations with restricted access.
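A simplified sketch of the separation step described, using pandas to write PII and non-PII columns to different locations; the column list, key column, and S3 paths are assumptions, and writing directly to s3:// paths assumes s3fs is installed:

```python
import pandas as pd

PII_COLUMNS = ["customer_name", "email", "phone"]  # hypothetical PII column list

def split_pii(csv_path: str) -> None:
    df = pd.read_csv(csv_path)
    pii_cols = [c for c in PII_COLUMNS if c in df.columns]

    # Restricted copy keeps the raw PII values; access to this prefix is locked down
    df[["customer_id"] + pii_cols].to_parquet(
        "s3://ecom-restricted-zone/customers_pii.parquet", index=False
    )

    # General copy drops the PII columns and can be consumed by all users
    df.drop(columns=pii_cols).to_parquet(
        "s3://ecom-general-zone/customers.parquet", index=False
    )
```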
What kind of database management techniques does ASA use with DynamoDB and ElasticSearch?
ASA uses DynamoDB for high-throughput key-value access and Elasticsearch for complex queries, full-text search, and real-time analytics.
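For context on the overwrite behaviour ASA contrasts, DynamoDB's put_item replaces any existing item with the same key, and a condition expression is one way to guard against that; the table name and attributes below are placeholders, loosely following the employee-ID example from the interview:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("mandates")  # hypothetical table keyed on employee_id

# put_item silently replaces any existing item that has the same key
table.put_item(Item={"employee_id": "E100", "employee_name": "A. Kumar", "status": "ACTIVE"})

# A condition expression prevents an accidental overwrite
try:
    table.put_item(
        Item={"employee_id": "E100", "employee_name": "A. Kumar", "status": "PENDING"},
        ConditionExpression="attribute_not_exists(employee_id)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists; not overwriting")
```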
How does ASA ensure data lineage and logging in their workflows?
Data lineage is tracked through Spark RDD execution plans (DAGs), with AWS services like CloudWatch and CloudTrail providing logging and operational visibility.
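The RDD lineage mentioned here can be inspected directly in Spark: toDebugString prints the chain of transformations (the DAG) that Spark would replay to recompute a lost partition. A tiny PySpark illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

rdd = (
    spark.sparkContext.parallelize(range(100))
    .map(lambda x: (x % 10, x))
    .reduceByKey(lambda a, b: a + b)
)

# Prints the lineage graph: reduceByKey <- map <- parallelize
print(rdd.toDebugString().decode("utf-8"))
```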
- 00:00:05connection established okay so hi AA
- 00:00:09welcome to this round right so can you
- 00:00:11please introduce yourself and maybe the
- 00:00:13project that you have worked upon so
- 00:00:15yeah you can start yeah okay yeah thank
- 00:00:18you Nisha for giving me this opportunity
- 00:00:20to interview with you I'll start with my
- 00:00:22brief introduction so myself ASA I have
- 00:00:25close to five years of experience
- 00:00:26working in data now I have worked on
- 00:00:28banking as well as insurance to I
- 00:00:30started my journey with exential there I
- 00:00:32was working in ETL projects ETL data
- 00:00:34migration and implementation project and
- 00:00:36in quantify I was working in creating
- 00:00:38Inn Data Solutions using uh various AWS
- 00:00:41services like U red shift S3 uh DMS and
- 00:00:45I do have a good understanding on data
- 00:00:47modeling Concepts and data warehousing
- 00:00:49other than that like my go-to
- 00:00:51programming languages Python and SQL
- 00:00:53scripting so that's a brief about me so
- 00:00:56can you also mention like the
- 00:00:58architecture like what was the source
- 00:00:59location and where you were putting the
- 00:01:01data what type of Transformations you
- 00:01:02are doing so just to give an
- 00:01:05idea yeah okay so uh in one of my use
- 00:01:09case I'll discuss about one of my use
- 00:01:10case that was related to Insurance data
- 00:01:12so basically uh my architecture was like
- 00:01:15uh Source was the Oracle database where
- 00:01:16all the transaction were uh transactions
- 00:01:18were happening so we were using DMS
- 00:01:20service to get the data migrated to the
- 00:01:23uh S3 storage where we were keeping all
- 00:01:25the raw data then we were using uh like
- 00:01:29um uh we are leveraging uh red shift for
- 00:01:32our data warehousing solution where we
- 00:01:34have all the data models like Dimensions
- 00:01:36facts as well as there is one layer to
- 00:01:38it there is ODS layer which is basically
- 00:01:40we are keeping a exact replica of what
- 00:01:42data is there on the source so that it
- 00:01:45can act as a source of Truth for us when
- 00:01:46we are starting modeling our data so we
- 00:01:49can uh basically leverage that and uh
- 00:01:51that can uh act as a source for us uh in
- 00:01:55between we are uh basically creating
- 00:01:56stor procedure and external tables using
- 00:01:59red shift and of spectrum capabilities
- 00:02:01to get the CDC data incremental data and
- 00:02:03we have written the code in SQL so there
- 00:02:05is a couple of uh like merge statements
- 00:02:07and other uh quality related checks that
- 00:02:09we are doing so this was end to end
- 00:02:11architecture and we were dumping the
- 00:02:12data into our dimensions and uh dims and
- 00:02:15facts basically which is there in red
- 00:02:17shift which was uh later on consumed by
- 00:02:20the client as per the
- 00:02:22requirement so is uh the DMS that you
- 00:02:25are using right so is how it is
- 00:02:27basically how much data load that you
- 00:02:29have on the Oracle system like what is
- 00:02:32the what are the different tables and
- 00:02:33what is the data load that it is
- 00:02:35handling yeah so there are a lot of
- 00:02:37different tables uh in Insurance domain
- 00:02:40there are different tables related to
- 00:02:42like uh based on different granularities
- 00:02:43policy related risk there are coverages
- 00:02:46there are claim related data so there
- 00:02:48are uh like a lot of uh Source data so
- 00:02:53we have enabled CDC basically we are
- 00:02:54just getting the incremental data like
- 00:02:56the change that is happening what DMS is
- 00:02:58doing we have enabled the CDC in our
- 00:03:00source site so whenever there is a
- 00:03:02transaction it will go on and uh
- 00:03:04basically uh write it in a log uh what
- 00:03:07we call is a write-ahead log so whenever
- 00:03:09the transaction is done that log will be
- 00:03:11captured there and DMS is taking all the
- 00:03:14logs from the client side and it will
- 00:03:16just uh migrate it to the S3 it will
- 00:03:19have the raw files as well like all the
- 00:03:21raw files we'll be receiving so since we
- 00:03:24are just dealing with the change data
- 00:03:25capture it's not uh in a like huge
- 00:03:28volume but when we are doing history
- 00:03:30load then uh yeah it is taking a lot of
- 00:03:33time DM is taking a lot of time to get
- 00:03:35the data from the source so sometimes it
- 00:03:38also take like more than two hours so
- 00:03:40like the raw data right as I understand
- 00:03:42it is the exact replica of your Oracle
- 00:03:44system right so do you have set any
- 00:03:46retention at that part or will it be
- 00:03:48always in the S3 standard
- 00:03:50layer uh so like uh firstly the logs
- 00:03:53there is a retention period like
- 00:03:55previously there was a retention period
- 00:03:56of 2 to 3 days but if there is any
- 00:03:58failure which has not been resolved in 2
- 00:04:003 days so uh then we have changed it to
- 00:04:027 days so logs will be there uh till
- 00:04:05like 7 Days retention in S3 we what we
- 00:04:08are doing is like after 90 days like we
- 00:04:10are moving it to uh other basically we
- 00:04:12are archiving that
- 00:04:14data so rtion so say you have a
- 00:04:18requirement right where you don't
- 00:04:19suppose you have a location say uh in
- 00:04:22the S3 bucket right and you are writing
- 00:04:24the data the next time you run a load
- 00:04:27you don't want that location uh so you
- 00:04:29want that location to be overd right but
- 00:04:32in this case you don't want to lose the
- 00:04:34previous data so what capability of S3
- 00:04:36would you use so that you can't you
- 00:04:37won't delete the previous version of the
- 00:04:40data uh so basically everyday data is
- 00:04:44going in a partition date so in S3 we do
- 00:04:46have date date folders uh so let's say
- 00:04:50if there are 100 transaction that has
- 00:04:52happened today so it will be in today's
- 00:04:54date and uh for tomorrow it will be in
- 00:04:56different date so whenever we are uh
- 00:04:58creating our external table we are
- 00:05:00adding the uh today's date partition to
- 00:05:02it and we just we are just reading uh
- 00:05:04the latest data from there and then uh
- 00:05:07like store procedure is running and it
- 00:05:08is dumping the data from is3 to uh our
- 00:05:11red shift ODS layer so it's like if I am
- 00:05:15running my job today I'll have if there
- 00:05:17is any case where failure has been uh
- 00:05:20like there was a failure so we can add
- 00:05:21multiple date partition as well to
- 00:05:23access different uh dates data but for
- 00:05:26now the use case is like like every day
- 00:05:28it will drop the previous date partition
- 00:05:30and add a new one and uh we are just
- 00:05:32able to access the one day data only
- 00:05:35okay so are you doing any reconciliation
- 00:05:37process as part of this
- 00:05:40job uh yeah we are doing reconciliation
- 00:05:43process once we are uh done with like
- 00:05:45the production load for all the use
- 00:05:47cases there will be SNS notifications
- 00:05:49that will get triggered based on the
- 00:05:52success and failures so mostly what we
- 00:05:54are doing is we are just uh based on
- 00:05:57some transformation we are checking
- 00:05:58whatever data is there in in our
- 00:05:59dimensions and facts uh and as per the
- 00:06:02requirement whatever data is there on
- 00:06:04the source so we have written SQL
- 00:06:06queries uh as part of Recon so it will
- 00:06:09basically match all the kpis and if
- 00:06:11there is any difference it will trigger
- 00:06:13a notification if there is no difference
- 00:06:15as well then then also like we have kept
- 00:06:17it that way that uh it will uh daily
- 00:06:19match some kpis for example like the
- 00:06:21policy count uh the premium amount so
- 00:06:24these kpis it will be matching and we
- 00:06:26are sending the triggers so how you
- 00:06:28scheduling the trigger ciliation process
- 00:06:30and at what time like is it a daily run
- 00:06:32that you are doing or a weekly run so
- 00:06:33how it is
- 00:06:34scheduled it's a daily run so for now
- 00:06:37like we have scheduled it like uh in the
- 00:06:39morning around 9:00 because we have we
- 00:06:42are starting all our use cases run after
- 00:06:4412: after midnight so uh as for the load
- 00:06:48that we were getting everything will be
- 00:06:50like all the uh use cases jobs will uh
- 00:06:53will be successful by 7 in the morning
- 00:06:56so based on that like we are running it
- 00:06:58uh at like 9 in the like 9 in the
- 00:07:01morning yeah and say as part of your
- 00:07:03reconciliation process right you find
- 00:07:05some discrepancy in your data like maybe
- 00:07:07there is some corrupt record or
- 00:07:08something so how you would deal with
- 00:07:10that and how you would like reprocess it
- 00:07:13again yeah okay so if there is any
- 00:07:16mismatch in the data so what we are
- 00:07:18doing is we are going back to the code
- 00:07:20firstly we'll check the logs like there
- 00:07:22can be different type of uh failures if
- 00:07:24it's a job failure or is it a uh kpi
- 00:07:28mismatch so in case of uh there is any
- 00:07:31failure we are going to the step
- 00:07:32function that we are using for
- 00:07:33orchestrating all our tasks so there we
- 00:07:36do have enabled uh like Cloud watch logs
- 00:07:39we can go to the ECS task and we can
- 00:07:42check the cloudwatch logs and based on
- 00:07:44the error message that it is showing
- 00:07:46we'll go and fix in the code uh in the
- 00:07:48jobs basically and if it's a KPI
- 00:07:50mismatch related data then we know like
- 00:07:53uh from which fact we are getting this
- 00:07:55data and we'll have to basically
- 00:07:56backtrack the source of that particular
- 00:07:58kpi uh like what all the transformation
- 00:08:01we are applying to that uh particular uh
- 00:08:04like that metric and from where it is
- 00:08:06coming and what can be done like uh we
- 00:08:08will be backtracking it at every uh
- 00:08:11Point like every
- 00:08:13stage so how you are currently notifying
- 00:08:16your consumer so say you are producing
- 00:08:17the data and there are some consumers
- 00:08:19right and because of your process you
- 00:08:21identify maybe there is some SL sometime
- 00:08:23that is breached or maybe there is some
- 00:08:26freshness issue or something so how that
- 00:08:28will be communicated to the
- 00:08:30consumers uh so basically uh we have
- 00:08:33created views on top of our facts and
- 00:08:37the access has been given so it's uh
- 00:08:39like country related data that we are
- 00:08:41getting so there are some users that are
- 00:08:43accessing the uh like let's say Hong
- 00:08:46Kong data there are some users that are
- 00:08:47accessing Vietnam data So based on the
- 00:08:50uh like authent like we will be giving
- 00:08:53the them the uh permission to view the
- 00:08:55data so only the uh related use cases uh
- 00:09:00basically if there are some users that
- 00:09:02want they want the data uh to get viewed
- 00:09:05or something like that so they'll have
- 00:09:07to ask for Access and we need to add
- 00:09:09their name as well but suppose they
- 00:09:11already have the access but somehow
- 00:09:13because of your job fail or something
- 00:09:15you are not able to publish today's date
- 00:09:17data right and now when they check on
- 00:09:19their side they won't be able to get the
- 00:09:20latest data so as a proactive measure
- 00:09:22you can also notify them right as part
- 00:09:25of your process so how can you is it
- 00:09:27currently that You' have implemented if
- 00:09:29note how could you do that yes yes us uh
- 00:09:32SNS notification will get triggered
- 00:09:34after the run is complete and it's going
- 00:09:36like we have included the users as well
- 00:09:39who are consuming the data so they'll
- 00:09:40get a notification in case of there is
- 00:09:42any failure or there is any kpi mismatch
- 00:09:44they'll directly get the notification
- 00:09:46from
- 00:09:47there uh so will how will you implement
- 00:09:49it as part of your current
- 00:09:51workflow yeah uh so uh like uh we are
- 00:09:55orchestrating everything in Step
- 00:09:56functions there is one uh SNS uh trigger
- 00:09:59that we are uh basically we have written
- 00:10:01the Lambda as well where we are
- 00:10:03comparing all the data and uh that we
- 00:10:06will be invoking and it will uh based on
- 00:10:10some conditions it will trigger the
- 00:10:11notification we when we are creating the
- 00:10:13topics we are sending that them to the
- 00:10:15users they will subscribe to it based on
- 00:10:17the PS up method they will be receiving
- 00:10:20notifications later and in case of any
- 00:10:22failure or production issue then based
- 00:10:24on that like uh we'll have to estimate
- 00:10:26it and we'll create a like us a story
- 00:10:28for that like produ issue we'll have to
- 00:10:30log in jir and based on the we'll start
- 00:10:33working on it okay so in S3 bucket right
- 00:10:37can we have like two buckets with the
- 00:10:39same name in AWS
- 00:10:41S3 uh no uh it should be unique uh so
- 00:10:46but is it like is it not region specific
- 00:10:48or is it region specific d three service
- 00:10:50it's Global it's Global and every name
- 00:10:53should be unique is there any
- 00:10:55requirement for having such Global name
- 00:10:58space
- 00:11:00uh require uh sorry uh I did not
- 00:11:03understand your question so point is
- 00:11:04like is uh what is the use case like why
- 00:11:06ad decided to have the global name
- 00:11:08spaces even if we can specify the region
- 00:11:11in an S3 bucket okay I check I don't so
- 00:11:13say you have an S3 bucket right and like
- 00:11:15how can you secure your S3 bucket like
- 00:11:17what different uh things you can
- 00:11:19Implement by which you can secure your
- 00:11:21S3
- 00:11:23bucket uh so basically we can provide
- 00:11:25permissions accordingly like uh there
- 00:11:28can be an uh access uh list we can add
- 00:11:31and uh we can allow or deny some
- 00:11:33resources or we can allow or uh deny
- 00:11:35based on the IAM
- 00:11:37permissions so that we can uh
- 00:11:41do okay any other thing that you can
- 00:11:43Implement I can think of right now okay
- 00:11:47so say you are working for say
- 00:11:48e-commerce company right hello say
- 00:11:52you're working for an e-commerce company
- 00:11:54and uh maybe handling the Sher data
- 00:11:55product right so all the customer
- 00:11:57information and everything you are
- 00:11:58dealing with those type of data uh so
- 00:12:01Cher data is basically in pii data right
- 00:12:03and it's your responsibility as part of
- 00:12:05the team to secure the pii information
- 00:12:08right so what are the diff so how you
- 00:12:10will Design uh this pipelines or maybe
- 00:12:12how you will secure this pii information
- 00:12:14so maybe you can list down some factors
- 00:12:16and maybe some ways uh so first thing is
- 00:12:19like uh this data like are we receiving
- 00:12:21the piia columns as well and uh what is
- 00:12:24the like use case like can we uh can we
- 00:12:28access that data uh if yes then we can
- 00:12:31uh do one thing we can create an
- 00:12:32abstraction layer on top of that where
- 00:12:34we can just uh basically uh mask that
- 00:12:37data so what we are doing is in some
- 00:12:40cases we do have two uh two different
- 00:12:42schemas one is restricted one and one is
- 00:12:45the general one so in the general one we
- 00:12:46will be keeping all the uh we will be
- 00:12:49doing all the masking on the columns
- 00:12:51that uh that comes under pi and in the
- 00:12:54restricted column we will be showing
- 00:12:55their values as well so what we are
- 00:12:57doing is uh we do have have data stored
- 00:12:59in our facts when we are creating the
- 00:13:01views on top of that then only in the
- 00:13:03view uh view query we are uh just have
- 00:13:07written a case statement where we are
- 00:13:09checking if that column is present or
- 00:13:11not if the value is there then we are
- 00:13:12just masking that value other than data
- 00:13:15masking like is is there any other thing
- 00:13:17that you will Implement so that say your
- 00:13:20data is present in the S3 bucket and
- 00:13:21currently uh you have the access right
- 00:13:24and you want to prevent any unauthorized
- 00:13:26access to your data so data masking is
- 00:13:29one of the technique that you will
- 00:13:30surely Implement before dumping it to
- 00:13:31the Target sources so like any other
- 00:13:34technique that you can Implement to
- 00:13:35secure your data maybe we can remove
- 00:13:38those uh columns from there and we can
- 00:13:40keep it in different uh uh like
- 00:13:42different file or different table and uh
- 00:13:45we can move it to different folder uh
- 00:13:48which do not have access to the normal
- 00:13:50users we can restrict the access to that
- 00:13:53we can write python code we can read the
- 00:13:56data we can uh basically separate the
- 00:13:58The pii Columns and the uh original
- 00:14:01columns that we want to use and we can
- 00:14:03create two different files out of it and
- 00:14:06one file will go to the restricted
- 00:14:08folder and one file can go uh to the
- 00:14:10normal folder that basically can be
- 00:14:12consumed by all the users okay so when
- 00:14:15you're explaining the architecture right
- 00:14:17so what was the format of the data that
- 00:14:19you have in the
- 00:14:20roer so it's basically firstly uh all
- 00:14:24the logs that we are receiving it's in
- 00:14:26CSV format and later on using python we
- 00:14:29are taking the chunks uh like chunks of
- 00:14:31data we are uh basically converting that
- 00:14:34uh that into Parquet format so that we can
- 00:14:36access it later so how you transforming
- 00:14:39from rad to uh transform is there any
- 00:14:41Services of AWS you're
- 00:14:44using so uh firstly there is one raw
- 00:14:47Zone which we are calling uh which will
- 00:14:50have all the like all the logs so let's
- 00:14:53say if there are 100 transaction that
- 00:14:54has happened so it will have 10 uh like
- 00:14:57100 CSV files inside it so now what we
- 00:15:00are doing is we have written a python
- 00:15:02code and the python code is running it
- 00:15:04will merge all the those 100 uh files
- 00:15:07because uh there can be a case like for
- 00:15:09every transaction it is creating a file
- 00:15:11so there can be like thousands or uh 10
- 00:15:14thousands of files we have we are
- 00:15:16receiving in one day but we want to
- 00:15:17process everything so we are merging all
- 00:15:19those data and then we are in the python
- 00:15:22code only like we are creating U like we
- 00:15:24are converting it into Parquet and we are
- 00:15:26keeping that data into our aggregate
- 00:15:28zone so aggregate Zone uh we have in
- 00:15:32place which will have the exact same
- 00:15:33data but it it's uh in uh like we are
- 00:15:35it's compressed it's in Parquet format that
- 00:15:38we can leverage later for the
- 00:15:40optimization techniques that we can use
- 00:15:42and uh it's acting uh like uh acting as
- 00:15:46a middle layer which can be later on
- 00:15:48consumed uh by the user so on top of
- 00:15:50that we are creating external tables and
- 00:15:52we are dumping the data into ODS so in
- 00:15:55this whole pipeline L how you are
- 00:15:56managing the schema right because CSV
- 00:15:59and uh don't come with the schemas right
- 00:16:02so how you managing your schemas schema
- 00:16:04drift or schema Evolution as part of
- 00:16:06this pipeline if there is any Evolution
- 00:16:08happening right so how you will
- 00:16:09incorporate that in a
- 00:16:11pipeline yeah so CSV does not provide us
- 00:16:15the uh feature for like schema evolution
- 00:16:18so what we are doing is uh we have
- 00:16:20written a python code and uh like
- 00:16:23whenever there is a new file we are
- 00:16:25basically creating a data frame out of
- 00:16:26it and uh we are checking all The
- 00:16:28Columns we are creating a set of it and
- 00:16:30we are connecting to our Redshift which
- 00:16:32is our Target and we are taking some
- 00:16:35data from that as well again doing the
- 00:16:38same process checking all the columns
- 00:16:40and creating a set out of it and then we
- 00:16:42have two sets like one set has all the
- 00:16:44columns from The Source One set has all
- 00:16:46the columns from the target we are
- 00:16:47taking a set difference and if there is
- 00:16:50uh some columns that is like we are
- 00:16:52getting as as an output of set
- 00:16:54difference then we know like these are
- 00:16:56all the columns that are uh that that is
- 00:16:58not there in the red shift in our Target
- 00:17:01and these are all the new columns so
- 00:17:03there like we are doing two things uh we
- 00:17:05are firstly like uh based on the columns
- 00:17:08which we are getting we are creating
- 00:17:09alter statements so we are adding new
- 00:17:12columns to our Target and we are just
- 00:17:14keeping it as we care because it can uh
- 00:17:16have any type of data and later on like
- 00:17:19we can go on and manually change that
- 00:17:21like as per the uh like the data
- 00:17:23requirement if it's uh we need to keep
- 00:17:25it integer or like any other data type
- 00:17:28so this will be the one thing and second
- 00:17:30thing we are sending notifications to
- 00:17:31your user so that we can get their
- 00:17:33confirmation like if those columns were
- 00:17:35needed and uh are they part of as part
- 00:17:38of any new requirement or things like
- 00:17:40that so can't we use like Glue catalog
- 00:17:42service here to manage the schema like
- 00:17:45instead of managing like connecting to
- 00:17:47source and Target and comparing it so
- 00:17:49can't we use any managed catalog service
- 00:17:51Like
- 00:17:52Glue yes yes we can definitely use that
- 00:17:55but since it was already uh like flow is
- 00:17:58like we are getting the data from DMS
- 00:18:00and we are processing all we are doing
- 00:18:02all the processing writing all the SQL
- 00:18:05scripts and dumping the data so we are
- 00:18:07not using glue for that that's why we
- 00:18:09have not used but definitely we can uh
- 00:18:12yeah change that to uh we can use some
- 00:18:14other ews Service as well okay so say
- 00:18:17you're using glue right for maybe EMR
- 00:18:20service right so we have to specify the
- 00:18:22cluster configurations right like what
- 00:18:24type of instances we want to use so um
- 00:18:27we have like different instance typee in
- 00:18:29AWS service right so uh on basis of use
- 00:18:32case how we will specify like what type
- 00:18:34of instance we have to use from AWS
- 00:18:38offering so uh like if we are using AWS
- 00:18:41Glue we don't need to worry about the
- 00:18:44infrastructure management because it's a
- 00:18:46server serverless and it manages uh as
- 00:18:49per the load so if there's more uh like
- 00:18:52more data coming in our way and the
- 00:18:54workload is uh higher than expected like
- 00:18:56it will spin up all the cluster
- 00:18:59accordingly uh so I have used uh AWS
- 00:19:02glue not uh EMR I can talk in terms of
- 00:19:05that like if firstly uh Whenever there
- 00:19:08is any use case we'll have to go on and
- 00:19:10uh analyze the data data volume data
- 00:19:13types and uh we will be doing all the
- 00:19:15analysis on that basis only we can
- 00:19:18either uh use some uh like EMR in EMR we
- 00:19:22can get uh all the flexibility to
- 00:19:24provision our own clusters so there it's
- 00:19:27very much necessary that we know what uh
- 00:19:29data volume we're getting and as for
- 00:19:31that like how can it be handled how many
- 00:19:33nodes how many basically how many
- 00:19:35executors do we need what is the memory
- 00:19:37what is the core that we need as per the
- 00:19:39requirement and the data we are dealing
- 00:19:41with So currently EMR also comes with a
- 00:19:44new offering of EMR serverless right so
- 00:19:46in this case you also can use EMR Ser
- 00:19:48serverless for your load so what what
- 00:19:51can be the use case for using EMR
- 00:19:53serverless over top of over glue yeah so
- 00:19:56if it's uh like similar like
- 00:19:58architecture wise like it's serverless
- 00:20:00and it also gives like uh manages all
- 00:20:04the infrastructure itself so maybe the
- 00:20:06data cataloging part that we have in AWS
- 00:20:09glue we don't have in am EMR I'm not uh
- 00:20:12wellers with the EMR uh like
- 00:20:14functionalities and features but I'll
- 00:20:16talk about uh AWS Glue it does have uh
- 00:20:20data brew as well where we can uh like
- 00:20:22perform some analytics that is one plus
- 00:20:24Point uh it does have AWS crawlers as
- 00:20:26well that can help to crawl the data
- 00:20:28from some source to Etha or some other
- 00:20:31source so those are the functionalities
- 00:20:33maybe that uh EMR is not like it's not
- 00:20:36there in the EMR okay so say you are
- 00:20:39using glue service right for your detail
- 00:20:41Pipeline and you are using glue crawler
- 00:20:44and say your data is present in the S3
- 00:20:45bucket right so how so say one day you
- 00:20:48use the glue crawler and schema is up to
- 00:20:50date in your glue pipelines right uh so
- 00:20:53maybe from some few days uh afterwards
- 00:20:55your schema has changed right so how
- 00:20:57will glue come to know that schema has
- 00:21:00changed so how you will implement this
- 00:21:02type of workflow so that your ETL
- 00:21:04pipelines would be aware of the recent
- 00:21:06schema
- 00:21:07changes so we are using AWS uh like data
- 00:21:11catalog part here and data catalog will
- 00:21:13basically store all the metadata
- 00:21:15regarding the data files and everything
- 00:21:17so if there is any change in the data so
- 00:21:19AWS catalog will capture that change as
- 00:21:22part of schema evaluation and when we
- 00:21:24are uh moving the data to like crawling
- 00:21:26the data from the S3 or some other
- 00:21:29source uh it will go on and check the
- 00:21:32metadata from data catalog and it will
- 00:21:34come to know about the changes in the
- 00:21:36schema so say you want to like have a
- 00:21:38pipeline implemented as uh in the
- 00:21:41similar flow so say you want to have an
- 00:21:43Automation in place uh in this case like
- 00:21:45whenever there is a schema changes or
- 00:21:48something or maybe you have a glue job
- 00:21:50which will basically maybe per day basis
- 00:21:53it will scan the data and in case there
- 00:21:55is any drift in the schema you will be
- 00:21:57get notified ified so you want to
- 00:21:59automate such a pipeline so how you will
- 00:22:01Design such a pipeline in glue uh so in
- 00:22:03glue what we can do is uh in that case
- 00:22:06like if there is any schema changes we
- 00:22:09need to
- 00:22:09notify uh maybe we can add some triggers
- 00:22:13or uh we can send some notifications
- 00:22:15using teams
- 00:22:17or maybe on some other channel like
- 00:22:20slack Channel teams uh what according to
- 00:22:22the functionality that it provides uh
- 00:22:26yeah I'm not uh much like I have to uh
- 00:22:28think about that use case but my
- 00:22:31Approach will be like if I am using glue
- 00:22:33uh somewhere like I will get previous
- 00:22:36date uh data right uh that has because
- 00:22:40uh the the schema was the previous
- 00:22:42schema that we were referring to and uh
- 00:22:45the new schema that we are getting we
- 00:22:47can also load that also in our AWS glue
- 00:22:49so maybe let's say like I have like
- 00:22:51today's the date that we have received
- 00:22:54the new schema and two days back the
- 00:22:56date was uh there like uh based on the
- 00:22:58partition date I can read two partitions
- 00:23:01now and from those two partitions I can
- 00:23:03read the data and check all the columns
- 00:23:05and uh basically get the difference in
- 00:23:07the columns whatever the difference is
- 00:23:08there and what is the data type changes
- 00:23:11basically uh based on that like if there
- 00:23:14is any difference that uh there will we
- 00:23:17can uh set some flags and based on those
- 00:23:19flags or we can uh invoke some Lambda or
- 00:23:22basically we can invoke SNS to trigger
- 00:23:25the notifications so something like that
- 00:23:28we can Implement okay so you mentioned
- 00:23:31in your resume like you have also used
- 00:23:33DynamoDB and Elasticsearch as a database
- 00:23:35uh so what was the use case right and
- 00:23:37why we have like using two different
- 00:23:39databases so what were the different use
- 00:23:40cases that you were using this
- 00:23:43for yeah so uh basically dynamodb and
- 00:23:47elastic search we are using for same use
- 00:23:48case only what is what was happening is
- 00:23:51we were getting uh aay related data so
- 00:23:53whenever like uh we need to uh set a
- 00:23:56mandate so that like uh the premium or
- 00:23:59anything like that like it can be
- 00:24:00deducted automatically from the account
- 00:24:02so those sort of data we were getting in
- 00:24:04terms of CSV and txt file from the users
- 00:24:07themselves and using infoworks for the
- 00:24:09data injection part like we were using
- 00:24:11DMS ews service in this use case we are
- 00:24:14using infoworks which basically is a
- 00:24:15data injection tool there like we can
- 00:24:17write the same workflows and uh we can
- 00:24:19create a pipeline uh what it was doing
- 00:24:22was like it was reading the data from
- 00:24:24that file and uh it was uh converting it
- 00:24:27into Parquet and again the so like the
- 00:24:29target was to uh basically convert in
- 00:24:31Parquet and uh place it on S3 itself which
- 00:24:35was our um raw Zone there as well and
- 00:24:38later on like what we were doing is uh
- 00:24:40we were using AWS glue jobs to read the
- 00:24:42data from there and we were it's uh like
- 00:24:44we were creating the Json structure
- 00:24:46there like using StructType and
- 00:24:47StructField and the target was DynamoDB
- 00:24:51because um what we like Dynamo DB and uh
- 00:24:54elastic search why we were using two
- 00:24:56different uh Services because uh firstly
- 00:24:59there are some functionalities that
- 00:25:01elastic search provides like it can
- 00:25:02provides us like we can uh write more
- 00:25:05complex queries it's more uh based on
- 00:25:07real-time analytics and also like uh it
- 00:25:10can uh it provides a uh full text search
- 00:25:13so if we want to uh search in a complex
- 00:25:16uh uh there are some partition Keys also
- 00:25:19like in Dynamo DB it's also uh very fast
- 00:25:22it's it has a very high throughput and
- 00:25:25uh it does store data in uh no SQL
- 00:25:28unstructured or maybe semi-structured
- 00:25:30data but it comes with the cost
- 00:25:33definitely our use case was like like if
- 00:25:35there is any change in the document
- 00:25:38whatever we are storing and based on the
- 00:25:40same partition key if I want to update
- 00:25:42that it will override that particular uh
- 00:25:44data which was there in DynamoDB it
- 00:25:47does not index it and keep both the data
- 00:25:49fields so let's say if there is any like
- 00:25:51my partition uh like we need to uh
- 00:25:54basically specify the partitions and
- 00:25:56sort key there so let's say if I'm
- 00:25:58keeping my like employee ID employee
- 00:26:00name those two keys as part of partition
- 00:26:03keys and uh on those two keys I have
- 00:26:05some updates now I want to update that
- 00:26:08system like there is some changes okay
- 00:26:10so if I'll go on and hit the uh updating
- 00:26:13Dynamo DB it will override that uh data
- 00:26:15in elastic search it will index that
- 00:26:18like it will keep both the data like we
- 00:26:20can uh it doesn't have any uh if you
- 00:26:22want to uh basically prevent it from uh
- 00:26:26having the duplicates we'll have to IND
- 00:26:27that data and we can query that as well
- 00:26:29like based on uh if you want the um if
- 00:26:33you want the latest data we can just
- 00:26:35query it accordingly because it has its
- 00:26:37like already indexed documents so that
- 00:26:39was one use case which were uh because
- 00:26:41of which we are having the same data at
- 00:26:44both places but the consumers are
- 00:26:46different like the users are different
- 00:26:48so can you explain me the difference in
- 00:26:50the database sharding and the
- 00:26:51partitioning uh database sharding and
- 00:26:54partitioning partitioning is uh uh in my
- 00:26:58understanding like where we are getting
- 00:26:59like we have a large data set and we
- 00:27:01want to process that data set uh in uh
- 00:27:04like very quick time so we'll be taking
- 00:27:07the chunks of it and uh we'll be keeping
- 00:27:10those chunks in different different
- 00:27:11partitions so that is partitioning to
- 00:27:13achieve parallelism where we can uh par
- 00:27:16parall we can process all those data and
- 00:27:19sharding uh like I have I'm aware on the
- 00:27:23AWS Kinesis I guess in sharding like we
- 00:27:25are also getting the data like from
- 00:27:26streaming
- 00:27:28I'm not sure if uh I'm correct here but
- 00:27:31uh I have read about shards which
- 00:27:34basically capture the streaming data and
- 00:27:36we do specify the capacity of that shard
- 00:27:39uh which will be there so it can like
- 00:27:41maybe um in one shard we can have 100
- 00:27:44records or things like that that's
- 00:27:46mostly using for the uh streaming
- 00:27:48related data okay so like what is the
- 00:27:52secondary index Concept in the Dynamo DB
- 00:27:55so like we have the secondary local
- 00:27:57secondary index as well as the global
- 00:27:58secondary index so what is the use case
- 00:28:00of implementing the secondary index and
- 00:28:02what is the cons of using those
- 00:28:04secondary index I'm not uh much aware on
- 00:28:08that actually I'll have to okay look
- 00:28:10into it so say consider one scenario
- 00:28:13right so you are working for some
- 00:28:14business unit and uh so they are putting
- 00:28:17your data in the S3 bucket so they don't
- 00:28:20have a particular schedule when they are
- 00:28:22dumping it is like on an ad hoc basis on
- 00:28:23a daily basis right and they have
- 00:28:26mentioned an SLA for 30 minutes so
- 00:28:28whenever data LS to the S3 bucket you
- 00:28:30need to process that data in in the 30
- 00:28:33minute window right so they have also
- 00:28:35given some business logic which you need
- 00:28:37to perform on your data and right and
- 00:28:40again they also want all the auditing
- 00:28:42and logging in place so that they can
- 00:28:44also track the data lineage so what
- 00:28:47would be your approach of handling such
- 00:28:48a pipeline uh for Designing such a
- 00:28:50pipeline by meeting the constraints that
- 00:28:53they have mentioned so first constraint
- 00:28:55is uh that it's on an ad hoc basis we are
- 00:28:58not getting any notifications right uh
- 00:29:01what I can think of is uh what we can do
- 00:29:04is uh so whenever we can uh activate we
- 00:29:07can write a Lambda function and uh we
- 00:29:10can invoke it based on the put uh put
- 00:29:12event in S3 so if there is any file that
- 00:29:16has been placed in S3 it will
- 00:29:18automatically go on and basically invoke
- 00:29:20the Lambda function and Lambda function
- 00:29:22can take care of the uh processing part
- 00:29:25there um that's one thing and in case
- 00:29:29there are like the data size data volume
- 00:29:31is huge so what can we do in that case
- 00:29:35um basic uh what I'm thinking is maybe
- 00:29:37we can uh do the same thing like till
- 00:29:40the first part what we can do is uh
- 00:29:43whenever there's a put object it can
- 00:29:45invoke the Lambda and in the Lambda we
- 00:29:47can invoke AWS glue jobs maybe and uh it
- 00:29:50will take the data from S3 and in the
- 00:29:52AWS glue jobs it will start processing
- 00:29:56uh so that was the ad hoc part and what
- 00:29:59was the second uh what were uh the other
- 00:30:02constraint uh so for other constraint is
- 00:30:04they also want to have the logging in
- 00:30:06place as well as the auditing mechanism
- 00:30:08so that they can also see the data
- 00:30:12lineage okay so for that like uh uh data
- 00:30:16lineage okay so uh in AWS glue if we are
- 00:30:20using that uh so it does uh have like uh
- 00:30:25we we are leveraging spark right so it
- 00:30:28does store the data in the lower uh like
- 00:30:31low level in rdds which does have the uh
- 00:30:33lineage information so maybe we can
- 00:30:35leverage
- 00:30:36that or in case and for the logging part
- 00:30:39we can
- 00:30:41active yeah so for any logging part or
- 00:30:43monitoring purpose we can uh activate uh
- 00:30:46AWS cloudwatch events there and in case
- 00:30:49of any failure in case of anything we
- 00:30:51want to uh basically see we can go on
- 00:30:54and uh check on the logs uh yeah so can
- 00:30:58you explain like how rdds will give you
- 00:30:59the data
- 00:31:01lineage yeah so uh rdds uh basically how
- 00:31:07uh okay so whenever we are calling an
- 00:31:09action basically uh it's a in sparkk
- 00:31:12It's a lazy evaluation it follows a lazy
- 00:31:14evaluation technique so if there is any
- 00:31:16action that has been called it will go
- 00:31:17on and from the start it will start uh
- 00:31:19take all the uh steps that will that is
- 00:31:22there it will create a execution plan uh
- 00:31:25and in that execution plan will have all
- 00:31:27the Transformations and everything that
- 00:31:28needs to be done to get to the output
- 00:31:31and it it's in certain order and so it
- 00:31:34will create dags for it so dag will have
- 00:31:37a particular tasks in certain order that
- 00:31:39needs to be executed that is basically
- 00:31:41the lineage information that rdd has uh
- 00:31:45so in that sense like we can uh if there
- 00:31:47is any node failure or anything like
- 00:31:48that happens uh so it can go on and uh
- 00:31:52Trace back and using that dag it can
- 00:31:55recompute that node
- 00:31:57so it does have a lineage information
- 00:31:59with it so say you have in this uh in
- 00:32:02this use case only you have the
- 00:32:03requirement to have all the different
- 00:32:05event so say you are saying I am
- 00:32:07triggering a Lambda right and then using
- 00:32:09Lambda I am again triggering the glue
- 00:32:10service right so all this event you want
- 00:32:13to track as part of your audit right
- 00:32:15what is the originating Source what is
- 00:32:17uh what is the transformation source so
- 00:32:19in this scenario like for this
- 00:32:21implementation how you will uh enable
- 00:32:23the auditing part uh I guess cloud trail
- 00:32:26is a
- 00:32:27uh AWS service that can help in auditing
- 00:32:31part uh but that is again on the entire
- 00:32:34infra level right that cloud trail yeah
- 00:32:37okay okay in that case like uh we can
- 00:32:41maybe leverage not AWS glue we can use
- 00:32:45airflow as well we can write uh we can
- 00:32:47create dags in airflow and uh we can
- 00:32:49create tasks and uh we
- 00:32:52can uh
- 00:32:53basically use their uh uh we can
- 00:32:57uh set the dependencies between them and
- 00:33:00based on that like it can leverage act
- 00:33:02as a leanage information in case of any
- 00:33:04failures okay uh so how you are
- 00:33:07currently deploying right your pipeline
- 00:33:09so what what is the is there any devops
- 00:33:11tool that you are using so you are also
- 00:33:13using quite a good Services of AWS right
- 00:33:15so how you deploying this
- 00:33:18infrastructure okay so uh basically uh
- 00:33:22we do have now a new framework with us
- 00:33:25with which is basically uh has been
- 00:33:27created by our team uh in quantify uh so
- 00:33:31to deploy what we are doing is we do
- 00:33:33have a repository get repository so all
- 00:33:35the changes that we are doing uh in
- 00:33:37terms of our uh Transformations and all
- 00:33:40the SQL related changes on in the
- 00:33:42scripts we will create a get uh request
- 00:33:45like basically purle request to merge
- 00:33:47the data so someone will approve that
- 00:33:49and will be able to merge the data and
- 00:33:52post that we do have a Jenkins pipeline
- 00:33:54that we are using for the deployment
- 00:33:56part so Jenkins pipeline is pointing to that
- 00:33:58git repository so if there is any change
- 00:34:00we'll need to again deploy the Jenkins
- 00:34:03it will go on and uh choose the stack
- 00:34:05and whatever the resources it will go on
- 00:34:07and allocate that so this is the cicd
- 00:34:10tool that we are using for now so what
- 00:34:13what stack it is deploying
- 00:34:16like so like it there is a code that has
- 00:34:19been written by the platform team uh to
- 00:34:22uh deploy all the resources that are
- 00:34:25needed like uh and provide all the uh
- 00:34:28permissions and accesses so if in case
- 00:34:31like we are using uh AWS S3 so all the
- 00:34:34resources that are using in
- 00:34:36communicating with S3 they need the
- 00:34:37permission to communicate with S3 so all
- 00:34:40those things there is a terraform code I
- 00:34:42guess like that has been actually taken
- 00:34:44care by the platform team so they are
- 00:34:46the ones who are deploying so it is a
- 00:34:48separate team who is managing the INF
- 00:34:51yes yes okay so uh you are also using
- 00:34:53Lambda right you have used Lambda so you
- 00:34:56already are aware that Lambda is a
- 00:34:57limitation right where it can't have the
- 00:35:00processing more than 15 minute right so
- 00:35:02say you have a use case right where you
- 00:35:04uh wanted to have sequential Lambda
- 00:35:06trigger right so maybe once one Lambda
- 00:35:09completes maybe before the timeout uh it
- 00:35:12will trigger another Lambda somehow and
- 00:35:14the processing would continue from that
- 00:35:16execution state only uh so can you think
- 00:35:19like how can we Implement such a use
- 00:35:21case uh let me think uh so in case we
- 00:35:26need to use Lambda only or processing
- 00:35:29the data that is huge more than expected
- 00:35:31which cannot be completed in 15 minutes
- 00:35:33of time span so uh yeah we can uh
- 00:35:36trigger another Lambda inside a Lambda
- 00:35:39that we can do and uh maybe what we can
- 00:35:42do is uh whatever the processing like
- 00:35:44based on the time limit so let's say
- 00:35:46there are three or four transformation
- 00:35:48that we need to do and as part of one
- 00:35:51Lambda time St like time out uh we can
- 00:35:54just do two Transformations so we can
- 00:35:57read the file we can perform those two
- 00:35:58transformation and we can uh get a
- 00:36:00output file and we can place it again on
- 00:36:02some other S3 Zone uh we can create some
- 00:36:05intermediate Zone in S3 we can do that
- 00:36:08and uh that can be passed as a argument
- 00:36:11in like as a variable in Lambda like the
- 00:36:14another Lambda that will be invoked
- 00:36:15inside this Lambda only so maybe that we
- 00:36:18can do or uh we can uh do one more thing
- 00:36:22instead of like creating the f file we
- 00:36:24can uh dump the data like whatever the
- 00:36:26process processing we have done we can
- 00:36:27dump the data into our red shift and uh
- 00:36:30from that point of state only it will
- 00:36:33fetch the recent data that has been that
- 00:36:35that is there in red shift and it can
- 00:36:36continue working on it okay so say you
- 00:36:40using Athena right so for some ad hoc
- 00:36:42analysis and something right uh so see
- 00:36:44there is uh the currently the definition
- 00:36:47that you have mentioned there is no
- 00:36:48partition right maybe on the next day uh
- 00:36:51the data has been changed right and now
- 00:36:53there is partition so will Athena be able to
- 00:36:55detect the partition automatically or do
- 00:36:57you need to maybe have some commands
- 00:37:00perform by which it will be able to
- 00:37:01identify the
- 00:37:04partitions uh so as far as I remember
- 00:37:07like in Athena we need to give the
- 00:37:08partitions as well because that is how
- 00:37:11the query like it's able to access uh
- 00:37:14give the query result faster because
- 00:37:16it's working on partitions uh so as far
- 00:37:20as remember like I we need to give some
- 00:37:22partitions because based on that create
- 00:37:24an external table without partitions
- 00:37:27will Athena give us an error okay I'm I'm
- 00:37:29not aware on that I'll have to look into
- 00:37:31it actually so does Athena also physically
- 00:37:34load the data because it is also
- 00:37:35maintaining the catalog and meta store
- 00:37:37right so will it like entirely load the
- 00:37:39data and then it will do the
- 00:37:41processing uh no no it's not uh loading
- 00:37:44the data uh it's just referring to the
- 00:37:46data that is present on maybe S3 so that
- 00:37:51not loading the data so can we have an
- 00:37:53Athena service say in one region uh which
- 00:37:56is say maybe doing the analysis on the
- 00:37:59data in some other S3 bucket in another
- 00:38:01region so can we have such a cross cross
- 00:38:03region communication here or is it a
- 00:38:05mandate to have the Athena in the same
- 00:38:07region as the S3
- 00:38:10bucket I'm not I'm not aware on that
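For reference on the Athena partition question above: a plain external table does not pick up new partition folders in S3 automatically; they are usually registered with MSCK REPAIR TABLE (for Hive-style key=value prefixes) or an explicit ALTER TABLE ... ADD PARTITION, unless partition projection is configured. A minimal sketch using boto3, with hypothetical database, table, and result-location names:

```python
import boto3

athena = boto3.client("athena")

def register_new_partitions(database: str = "insurance_raw", table: str = "policies") -> str:
    # MSCK REPAIR TABLE scans the table's S3 location for Hive-style partition
    # folders (e.g. load_date=2024-01-15/) and adds any missing ones to the catalog.
    resp = athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table};",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
    )
    return resp["QueryExecutionId"]
```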
- 00:38:13okay uh so can you please share your
- 00:38:15screen like I have in SLE questions yeah
- 00:38:17let me know if my screen is visible yeah
- 00:38:19yes so yeah your screen is visible can
- 00:38:21you see on the chat maybe so let me
- 00:38:24explain the question first right so here
- 00:38:26will have like the three different
- 00:38:27tables so consider this as for the
- 00:38:29e-commerce data right where we have the
- 00:38:31orders data and then also we have the
- 00:38:33products information right and then we
- 00:38:35have the order detail so order detail is
- 00:38:37like a more granular where you have will
- 00:38:39have the like the uh for a single order
- 00:38:42ID right you will have the different
- 00:38:44products right so order detail is like a
- 00:38:47more granular where you have all the
- 00:38:48product ID information how many quantity
- 00:38:51of that unit you have purchased what is
- 00:38:52the unit price right so this order
- 00:38:55details is at the detail Lev level right
- 00:38:57and this orders table is on a very high
- 00:38:59level like what is the order ID what is
- 00:39:01the order date what is the total amount
- 00:39:03right so that orders table is is the
- 00:39:06Gran is a more at high level of
- 00:39:08granularity right then this product
- 00:39:10table is containing the information
- 00:39:12right uh this is the product ID
- 00:39:13corresponding to this name and this
- 00:39:15category so you you have this three
- 00:39:17different uh tables right now what you
- 00:39:20need to find out in the SQL query is the
- 00:39:22top selling product right so what is the
- 00:39:25so maybe you generating some report for
- 00:39:27some customer or maybe some stakeholders
- 00:39:29right and you want to show them right
- 00:39:31this is the top selling product right
- 00:39:33what whatever we are offered and what is
- 00:39:35the criteria for finding the top selling
- 00:39:38product is basically characterized by
- 00:39:40the highest
- 00:39:41revenue uh what is the criteria again
- 00:39:43for the highest revenue so we have to uh
- 00:39:45do the segregation based on each
- 00:39:47category for the last quarter so say you
- 00:39:50are building a quarterly report right
- 00:39:52and in the quarterly report within each
- 00:39:54category what is the top selling product
- 00:39:57which will be characterized by the
- 00:39:58highest revenue so you need to create a
- 00:40:00SQL query for
- 00:40:03this let me know in case of any question
- 00:40:05yeah
- 00:40:06okay yeah so highest revenue is it based
- 00:40:09on like the total amount which we have
- 00:40:11here so you can ignore this total amount
- 00:40:14right I think it might not be accurate
- 00:40:15so consider this the order details and
- 00:40:18the quantity and unit price for the for
- 00:40:20the revenue okay okay so uh firstly what
- 00:40:26I'm thinking firstly we need the data
- 00:40:29just for the last quarter so uh let's
- 00:40:31say last three months uh okay so that we
- 00:40:34can filter out based on the order date
- 00:40:36okay uh just writing some approach here
- 00:40:41so last 3 months data we can filter out
- 00:40:45from uh order
- 00:40:47table now uh we need to join orders and
- 00:40:51Order detail together okay to get the um
- 00:40:56we can join these two based on order
- 00:41:00ID and we can get quantity and uh unit
- 00:41:04price okay and we'll try we'll multiply
- 00:41:08uh the number of units that we have with
- 00:41:10the unit price so to get the total
- 00:41:13amount if we are ignoring uh this
- 00:41:15particular total amount here so we need
- 00:41:18to multiply this
- 00:41:20quantity uh with unit price and getting
- 00:41:24the total
- 00:41:25amount
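A sketch of the step just described, under the assumed names above: restrict orders to roughly the last quarter and derive per-line revenue as quantity × unit_price. The date filter is written in PostgreSQL-style interval syntax, which varies by engine:

```sql
-- Line-item revenue for orders placed in (approximately) the last quarter
SELECT
    o.order_id,
    od.product_id,
    od.quantity * od.unit_price AS line_revenue
FROM orders o
JOIN order_details od
    ON od.order_id = o.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months';
```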
- 00:41:26now what uh we can do is now if we have
- 00:41:31the total amount here and
- 00:41:34uh we have the order ID okay based on
- 00:41:37the order ID here okay uh order ID so
- 00:41:44what type of join will you perform in
- 00:41:45these two
- 00:41:47tables we need all uh like all the
- 00:41:50orders then we can uh use a left
- 00:41:53join I'm assuming like it will have
- 00:41:56information for all the orders because
- 00:41:58it's on the granular level of orders so
- 00:42:02even if we do an inner join I don't think
- 00:42:04there will be any loss of data
- 00:42:07okay yeah okay so now I have my total
- 00:42:12amount I have filtered order date uh now
- 00:42:16I need to group my data um group my data
- 00:42:19based on product and get the sum of the
- 00:42:22total amount get the revenue basically
- 00:42:28and within each category for the last
- 00:42:30quarter so category okay now category
- 00:42:33also we need so we need to uh basically
- 00:42:37join this
- 00:42:39product like the product table as well and
- 00:42:42based on product ID and we need uh
- 00:42:46category from here we need category from
- 00:42:48here now we have all the data we can
- 00:42:51group it on um product
- 00:42:54ID but is product ID unique right uh okay
- 00:42:58so the product ID is unique so it will
- 00:43:01automatically have different categories
- 00:43:03right uh we are considering these uh
- 00:43:06different products
- 00:43:08correct yeah okay so we can just group
- 00:43:12it on product
- 00:43:14ID and we can take the total sum of uh
- 00:43:20like sum of this
- 00:43:22amount uh total amount uh so can you
- 00:43:26explain like why we are doing the
- 00:43:27grouping on the product ID uh so we need
- 00:43:30top selling product so top selling
- 00:43:32product will be the one that has
- 00:43:34generated the highest revenue so if I
- 00:43:38want the top selling product I have this
- 00:43:40product ID table and I have this product
- 00:43:43name and category considering like this is the
- 00:43:45unique one so if I am doing the uh group
- 00:43:48or grouping on product ID so whatever
- 00:43:50the orders are there for this particular
- 00:43:52product ID 101 I am uh basically
- 00:43:56adding all the revenues generated from
- 00:43:58all the orders for this particular
- 00:43:59product okay and I'm getting the total
- 00:44:02revenue for that particular product
- 00:44:04likewise I am doing it for all the
- 00:44:06products now so if I'm grouping it on
- 00:44:08product ID I will have all the products
- 00:44:11and all the revenue generated from uh
- 00:44:14like uh from all the orders for these
- 00:44:17products uh so now what you have in your
- 00:44:20select query in this
- 00:44:22case okay so in my select query
- 00:44:27uh in my select query I will have these
- 00:44:30two uh quantity and unit price and
- 00:44:35uh I will I can uh like I can use
- 00:44:39product ID only because I'm grouping it
- 00:44:41by product ID but like in the question
- 00:44:44we have can you read out the question
- 00:44:46again what would be the top the top
- 00:44:49selling product yeah okay so we need
- 00:44:52product name and product
- 00:44:54category what will uh what is the like
- 00:44:57all the columns we need in the output
- 00:44:59can you uh just so we would need the
- 00:45:01category and then we would need the
- 00:45:03product name and the total revenue
- 00:45:05highest revenue that is
- 00:45:07there
- 00:45:09category product
- 00:45:11name and uh Revenue
- 00:45:15okay okay so can I uh like uh
- 00:45:18instead of grouping it on product ID I
- 00:45:20can group it on product name and
- 00:45:22category also okay because uh the
- 00:45:25combination of these is unique okay and
- 00:45:28uh in the select query also like I can
- 00:45:30use the same
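Extending the sketch with the join to products and the grouping being described here, summing revenue per product name and category over the last quarter (same assumed names and date filter as before):

```sql
-- Total revenue per product within each category for the last quarter
SELECT
    p.category,
    p.product_name,
    SUM(od.quantity * od.unit_price) AS revenue
FROM orders o
JOIN order_details od ON od.order_id  = o.order_id
JOIN products      p  ON p.product_id = od.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY p.category, p.product_name;
```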
- 00:45:32okay will it give the final result because I need
- 00:45:35the... no because I just need the uh top
- 00:45:39selling product so maybe uh I need to
- 00:45:43order
- 00:45:44it uh by this Revenue okay in descending
- 00:45:49order so now I have uh all the highest
- 00:45:52revenue first and then the lowest one in
- 00:45:56the uh bottom uh then I can just like
- 00:46:00maybe I'm thinking to limit it to one
- 00:46:02then it will give me the just one output
- 00:46:04with the
- 00:46:05highest but we require for each category
- 00:46:08uh we just we don't need a single row we
- 00:46:11need for each category the top selling
- 00:46:13product okay okay okay okay so in that
- 00:46:16case like uh we'll have to go with the
- 00:46:18uh window function
- 00:46:20here so uh yeah so what would be your window
- 00:46:24condition huh uh so in partition by uh we
- 00:46:30will be writing this product uh name and
- 00:46:33category uh sorry uh like yeah
- 00:46:37um these two or maybe just the category
- 00:46:39one we can keep in partition by so we
- 00:46:42can partition the data based on all the
- 00:46:44categories and uh we can order the data
- 00:46:48on revenue and uh like this Revenue in
- 00:46:52descending order and then uh we can just
- 00:46:55uh ask it to maybe uh give it a name and
- 00:47:00uh we can just take the data or like we
- 00:47:03can filter it out post that we can just
- 00:47:05uh uh something like that like if I am
- 00:47:09using just a minute
- 00:47:13so I'm ranking my data let's say name
- 00:47:16and category I already have this as
- 00:47:21revenue
- 00:47:23descending and you can just give it
- 00:47:28some rank and later on like
- 00:47:31uh or maybe use qualify for this but
- 00:47:35that is only supported in some SQL dialects not
- 00:47:38in all databases yeah yeah yes yes uh so
- 00:47:42in that case like uh alternatively like
- 00:47:45you can keep it as a subquery and later
- 00:47:48on we can just use rank equals one so for every
- 00:47:53category whatever the highest revenue is we are just selecting that
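One way to write the subquery-plus-rank approach being described, using the assumed names from earlier. In dialects that support QUALIFY (for example Snowflake or BigQuery), the outer query could be replaced with a QUALIFY clause; note also that RANK() keeps more than one row per category when revenues tie, which is exactly the case discussed next:

```sql
-- Top-selling product per category: rank by revenue within each category,
-- then keep only rank 1 in the outer query
SELECT category, product_name, revenue
FROM (
    SELECT
        p.category,
        p.product_name,
        SUM(od.quantity * od.unit_price) AS revenue,
        RANK() OVER (
            PARTITION BY p.category
            ORDER BY SUM(od.quantity * od.unit_price) DESC
        ) AS rnk
    FROM orders o
    JOIN order_details od ON od.order_id  = o.order_id
    JOIN products      p  ON p.product_id = od.product_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months'
    GROUP BY p.category, p.product_name
) ranked
WHERE rnk = 1;
```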
- 00:47:55uh so can
- 00:47:57you go to the product table you scroll
- 00:47:59to product table right so in the product
- 00:48:02table right say you have a category called
- 00:48:03business card right so you have
- 00:48:05different product names A and B A and D
- 00:48:08sorry right so say uh as per this data
- 00:48:11I'm just assuming right so maybe you
- 00:48:13have A as the highest selling
- 00:48:14product so you don't need two separate
- 00:48:17rows for business card just a single row
- 00:48:18for business card and A and the highest
- 00:48:20revenue so will your SQL query give me
- 00:48:23the single Row for a single category
- 00:48:26no in that case I don't uh I was
- 00:48:28assuming that the product name and
- 00:48:30category both are different both will be
- 00:48:32unique now if uh this is the case like
- 00:48:36then I can group it on category only
- 00:48:38then uh it will sum up all the revenues
- 00:48:41uh from the C and D product name and if
- 00:48:47that is the highest like whatever the
- 00:48:49highest one I guess uh it will give me
- 00:48:52the highest one so in
- 00:48:54case both the different product have the
- 00:48:56same Revenue so what would be the output
- 00:48:59in the
- 00:49:01rank uh it will give me the same like
- 00:49:03both the uh because uh okay so in case
- 00:49:08if uh okay uh because I am using
- 00:49:11Rank and uh if
- 00:49:15it's if it's the same it will have the
- 00:49:18same uh rank only so will it give me
- 00:49:22like the two different Row in the
- 00:49:24output yes uh yeah I guess so okay uh
- 00:49:29because from the next one it will uh
- 00:49:33like skip a rank if you're using the rank function
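To make the tie behaviour concrete, a small illustration over a hypothetical aggregated result; product_revenue is a stand-in name for the grouped output sketched earlier, with two products tied on revenue in the same category:

```sql
-- With revenues 500 (A), 500 (D), 300 (B) in one category:
--   RANK()       -> 1, 1, 3   (ties share a rank, the next rank is skipped)
--   DENSE_RANK() -> 1, 1, 2   (ties share a rank, no gap)
--   ROW_NUMBER() -> 1, 2, 3   (always distinct; arbitrary among ties unless a tie-breaker is added)
SELECT
    product_name,
    revenue,
    RANK()       OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dense_rnk,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
FROM product_revenue;
```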
- 00:49:36so say we want to add another
- 00:49:39clause right say we want to add another
- 00:49:40clause here uh say we just need a single
- 00:49:44row right for the category and we can
- 00:49:46add another condition so in that case
- 00:49:48when there is a same revenue for
- 00:49:50different products uh we can go ahead
- 00:49:52with the alphabetical ordering for the
- 00:49:54product name so whatever uh product
- 00:49:56name comes first as per the alphabetical
- 00:49:58ordering we want to include that row so
- 00:50:01how you will include this condition in
- 00:50:02your
- 00:50:04query if I want to include uh only
- 00:50:08product name that
- 00:50:10comes uh okay only in this case if A and
- 00:50:13C has like the same revenue uh like A
- 00:50:17and D has the same revenue based on the
- 00:50:19uh category business card uh I just want
- 00:50:21to display A right uh okay understood that uh
- 00:50:28I can compare uh the product name and uh
- 00:50:32because product name will have a and c
- 00:50:34and which one uh like either one is
- 00:50:37smaller it will uh it should give me the
- 00:50:40smaller one only how will you compare
- 00:50:43these two how will
- 00:50:45I okay so now I have two columns what I
- 00:50:50can do is I can uh use a row number or
- 00:50:53uh row number function based on the
- 00:50:55category and I can order it on product
- 00:50:59name and uh like product name uh
- 00:51:02ascending order if uh what I like I'm
- 00:51:05just think uh so if I am grouping it
- 00:51:09again on
- 00:51:11category
- 00:51:13um yeah so if I'm grouping it again on
- 00:51:16category and uh I am just
- 00:51:18ordering by uh my product
- 00:51:22name
- 00:51:24uh
- 00:51:26so I am just giving
- 00:51:30this just a
- 00:51:34minute if I'm using maybe the row number
- 00:51:38function and I'm partitioning by
- 00:51:41category and I'm ordering by product
- 00:51:44name so whatever the ascending order
- 00:51:46will be it will have uh row number as one
- 00:51:50and uh the second one like D will have
- 00:51:52row number as two uh then I can filter
- 00:51:55it out based on row number here as well
- 00:51:57like uh row number one so you will apply
- 00:51:59this condition after all the above logic
- 00:52:02right so you will add another window
- 00:52:04function uh to incorporate this
- 00:52:07logic okay so order by revenue in
- 00:52:10descending then I can uh here only
- 00:52:14um I can
- 00:52:16use this condition like I can order it
- 00:52:20by like firstly it will order it by the
- 00:52:22revenue and then uh it will order it uh
- 00:52:25on product name okay again in category
- 00:52:28itself okay so we won't need this second
- 00:52:31window function will this work the
- 00:52:33condition no
- 00:52:35no yeah I guess uh yeah I'm thinking
- 00:52:39like it should work
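Putting the whole discussion together, a final sketch under the same assumptions: a single ROW_NUMBER() window whose ORDER BY sorts by revenue descending first and product_name ascending as the alphabetical tie-breaker, so exactly one row survives per category:

```sql
-- Final sketch: top-selling product per category for the last quarter,
-- with revenue ties broken alphabetically by product name
SELECT category, product_name, revenue
FROM (
    SELECT
        p.category,
        p.product_name,
        SUM(od.quantity * od.unit_price) AS revenue,
        ROW_NUMBER() OVER (
            PARTITION BY p.category
            ORDER BY SUM(od.quantity * od.unit_price) DESC,
                     p.product_name ASC
        ) AS rn
    FROM orders o
    JOIN order_details od ON od.order_id  = o.order_id
    JOIN products      p  ON p.product_id = od.product_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months'
    GROUP BY p.category, p.product_name
) ranked
WHERE rn = 1;
```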
- 00:52:41okay yeah yeah so I'm good from my
- 00:52:45side like do you have any questions for
- 00:52:47me yeah so uh
- 00:52:49based like do you have any feedback or
- 00:52:53anything yeah so you were very good at
- 00:52:55the SQL right so the
- 00:52:57thinking yeah so thinking uh you are
- 00:53:00able to think logically right so you
- 00:53:02understand the data uh the only thing
- 00:53:04that is important is the Assumption so
- 00:53:06don't assume anything you have the data
- 00:53:08set right the products uh it's there the
- 00:53:10different category and the different
- 00:53:11product name right so for the same
- 00:53:13category we have two different product
- 00:53:15names so it's not unique right so always
- 00:53:17if you are making an assumption just
- 00:53:19clarify it from the interviewer if the
- 00:53:21Assumption you are taking is right or
- 00:53:22wrong so that is one part otherwise your
- 00:53:24logic part is very good you are able to
- 00:53:27Think Through the problem and able to
- 00:53:30demonstrate your SQL capabilities right
- 00:53:33uh on the engineering side right I think
- 00:53:35uh main other thing you need to also
- 00:53:38focus is on the data security part so
- 00:53:40like how because when we are using
- 00:53:43cloud right cloud is not always secure
- 00:53:45so in cloud we have to make our services
- 00:53:47secure so what are the different things
- 00:53:49that we can incorporate like uh the
- 00:53:51question that I asked right how you will
- 00:53:53make your maybe S3 bucket secure or
- 00:53:55maybe the PII data so all those things
- 00:53:57there are different things that we can
- 00:53:59implement so there is like encryption
- 00:54:01then also we have the server side logging
- 00:54:03then also we have the different IAM roles
- 00:54:05so we can leverage IAM roles and policies
- 00:54:08uh and also we have to follow the least
- 00:54:10privilege access policies so sometimes I
- 00:54:12have seen different codes where people
- 00:54:14just put star like give all the
- 00:54:16permissions for different services so
- 00:54:18uh AWS doesn't recommend this so
- 00:54:20always have the least privilege possible
- 00:54:22in the permissions and other
- 00:54:24stuff uh so that is again very much
- 00:54:26important and uh so yeah I find like you are
- 00:54:30good with the designing and other things
- 00:54:32but security is again another thing that
- 00:54:34we should also focus on as a data engineer
- 00:54:36so yeah that is yes definitely I'll look
- 00:54:39into it yeah and just maybe a little bit
- 00:54:42more on the uh different services so you
- 00:54:43are also good with the DynamoDB and
- 00:54:45ElasticSearch uh why we are making
- 00:54:48use of the different services so there can be a
- 00:54:50question like why we are replicating the
- 00:54:51data between the two different databases
- 00:54:53but you are clear on your use case like
- 00:54:55why there is a need so it's always
- 00:54:57use case and implementation specific and you
- 00:54:59are clear with that point right so I
- 00:55:01think I'm good yeah just focus on those
- 00:55:04little uh little details yeah on the
- 00:55:06infrastructure part I think otherwise
- 00:55:11it's
- AWS
- Data Engineering
- ETL
- AWS Lambda
- Data Security
- DynamoDB
- ElasticSearch
- Data Reconciliation
- SQL
- S3 Partitioning