AWS Data Engineering Interview
Summary
TLDR: ASA, with five years of experience, works primarily on data engineering with AWS services. She has worked in both the banking and insurance sectors, applying skills in ETL, data migration, and AWS services such as Redshift, S3, and Lambda. The interview delves into a specific insurance use case in which ASA used a range of AWS tools to manage data pipelines, from data migration with DMS to data warehousing in Redshift. The discussion spans data partitioning in S3, schema management, reconciliation, and securing PII data. ASA also covers orchestrating tasks with AWS Step Functions, logging and notifications with SNS, and handling schema changes and data security with Python scripts. The conversation touches on her SQL skills, how she generates reports, and how she works with databases such as DynamoDB and Elasticsearch, emphasizing her understanding of their respective functionalities. The use case highlights ASA's ability to deliver data solutions efficiently while ensuring data validity, security, and performance.
Takeaways
- ASA has 5 years of experience in data engineering.
- Skilled in AWS technologies like Redshift, S3, and Lambda.
- Works on transforming and migrating data in insurance use cases.
- Implements security measures for PII data using AWS services.
- Utilizes ETL processes and data warehousing effectively.
- Uses SNS for notifications and error tracking in data pipelines.
- Implements schema drift solutions using custom Python scripts.
- Proficient in SQL for querying and managing data.
- Combines DynamoDB and Elasticsearch for diverse data needs.
- Ensures data lineage and auditing through AWS services and DAGs.
Timeline
- 00:00:00 - 00:05:00
ASA introduces herself, highlighting five years of experience in data-centric roles, focusing on ETL projects, AWS services, data modeling, and Python/SQL scripting. She describes a project involving insurance data, detailing the data transfer from Oracle to AWS Redshift and the data warehousing process.
- 00:05:00 - 00:10:00
ASA elaborates on the architecture of her insurance data project. Data from Oracle is migrated to AWS S3, then to Redshift for data warehousing. She discusses using AWS services like Redshift Spectrum for data processing and mentions employing CDC to handle transactional data from Oracle.
- 00:10:00 - 00:15:00
The discussion covers data load management using DMS and CDC, and the retention policies for logs and raw data stored in S3. ASA explains how data is captured incrementally, with raw data stored in S3 and moved to archival storage after 90 days.
- 00:15:00 - 00:20:00
ASA explains partition management in S3, ensuring daily updates are retained while past data can still be reconstructed if needed. She also describes reconciliation processes and SNS notification mechanisms for data validation and process integrity.
- 00:20:00 - 00:25:00
The talk continues with data verification using SNS triggers and SQL queries for KPI validation. ASA describes daily job scheduling, how data discrepancies are handled, and her team's approach to communicating potential process failures to consumers.
- 00:25:00 - 00:30:00
ASA outlines the use of SNS notifications to keep consumers informed of process statuses. Additionally, she explains securing S3 bucket data with IAM and ACLs, and the importance of abstraction layers and data masking when dealing with PII in e-commerce data pipelines.
- 00:30:00 - 00:35:00
ASA discusses securely handling PII data by creating different schema access levels and using Python to separate PII from non-PII data in an e-commerce setting. She explains transforming CSV logs to Parquet with Python for optimized data storage and analytics.
- 00:35:00 - 00:40:00
ASA elaborates on handling schema evolution by comparing column sets between the source and the Redshift target. She mentions the potential use of Glue for schema management and outlines data pipeline transformations and cataloging options such as AWS Glue and EMR.
- 00:40:00 - 00:45:00
The focus shifts to using AWS Glue and EMR for ETL processes, considering instance selection and use-case-based EMR configurations. ASA suggests EMR Serverless might differ from Glue in terms of catalog features and explains using the Data Catalog for schema tracking.
- 00:45:00 - 00:50:00
A discussion on DynamoDB and Elasticsearch explores their application in storing analytics data. ASA explains scenarios where Elasticsearch offers advantages such as complex queries and full-text search, and contrasts this with DynamoDB's high throughput and its overwrite behaviour on duplicate keys.
- 00:50:00 - 00:55:12
ASA discusses designing auditing and logging for ad-hoc data pipelines in S3 using AWS services such as Lambda for triggering and CloudWatch for monitoring. She explains data lineage tracking through RDDs in Spark and describes using services like CloudTrail for infrastructure-level auditing.
Video Q&A
What AWS services were discussed in the interview?
The interview discussed services like AWS DMS, Redshift, S3, Lambda, Glue, Step Functions, Athena, DynamoDB, and Elasticsearch.
What programming languages does ASA primarily use?
ASA primarily uses Python and SQL.
What is the role of SNS notifications in ASA's data pipeline?
SNS notifications are used to trigger alerts based on job success or failure, and to notify users when there are discrepancies or issues.
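As a rough illustration of this pattern (not ASA's actual code), a step at the end of the pipeline can publish the job status to an SNS topic that consumers subscribe to; the topic ARN and message fields below are placeholders:

```python
import boto3

sns = boto3.client("sns")

def notify(job_name: str, status: str, details: str = "") -> None:
    """Publish a pipeline status message to an SNS topic (topic ARN is a placeholder)."""
    sns.publish(
        TopicArn="arn:aws:sns:ap-south-1:123456789012:pipeline-status",  # hypothetical topic
        Subject=f"[{status}] {job_name}",
        Message=f"Job {job_name} finished with status {status}. {details}",
    )

# Example: notify("insurance_daily_load", "FAILURE", "KPI mismatch on policy_count")
```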
How does ASA handle schema changes in the data pipeline?
ASA uses Python to manage schema changes by comparing columns in source and target, and sends notifications to users when new columns are detected.
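A minimal sketch of this set-difference idea, assuming pandas for reading the source file's header and a query against the target's information_schema for its columns (connection details, file path, and table name are placeholders):

```python
import pandas as pd
import psycopg2  # assumes Redshift is reachable over the PostgreSQL protocol

def detect_new_columns(csv_path: str, table: str, conn) -> set:
    # Columns present in today's source file (header only)
    source_cols = set(pd.read_csv(csv_path, nrows=0).columns)

    # Columns currently defined on the Redshift target table
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            (table,),
        )
        target_cols = {row[0] for row in cur.fetchall()}

    # Anything in the source but not in the target is a schema-drift candidate
    return source_cols - target_cols

# new_cols = detect_new_columns("today.csv", "ods_policy", psycopg2.connect(...))
# For each new column: run ALTER TABLE ... ADD COLUMN <col> VARCHAR, then notify users (e.g. via SNS).
```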
What approach does ASA use for data partitioning in S3?
ASA partitions data based on date into S3 folders for organized and timely access.
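In practice this usually means writing each day's files under a date-based key prefix and then registering only the partition that is needed on the external table; the bucket, prefix, and table names below are hypothetical:

```python
from datetime import date
import boto3

s3 = boto3.client("s3")

def upload_daily_file(local_path: str, bucket: str = "insurance-raw-zone") -> str:
    # Partition by load date so each day's data lands in its own "folder"
    key = f"raw/policies/load_date={date.today():%Y-%m-%d}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)
    return key

# An external table (e.g. Redshift Spectrum) can then register just the latest partition:
#   ALTER TABLE spectrum.policies ADD PARTITION (load_date='2024-01-15')
#   LOCATION 's3://insurance-raw-zone/raw/policies/load_date=2024-01-15/';
```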
How is reconciliation handled in ASA's data process?
Reconciliation is done after production load by matching KPIs and sending notifications if there are discrepancies.
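One way to express this kind of KPI reconciliation is a SQL comparison between the ODS (source replica) layer and the fact layer, executed from Python; the schema, table, and column names here are illustrative only:

```python
RECON_SQL = """
SELECT src.policy_count, tgt.policy_count, src.premium_amount, tgt.premium_amount
FROM (SELECT COUNT(DISTINCT policy_id) AS policy_count,
             SUM(premium)              AS premium_amount
      FROM ods.policies) src
CROSS JOIN
     (SELECT COUNT(DISTINCT policy_id) AS policy_count,
             SUM(premium)              AS premium_amount
      FROM dw.fact_policy) tgt;
"""

def reconcile(conn) -> bool:
    with conn.cursor() as cur:
        cur.execute(RECON_SQL)
        src_count, tgt_count, src_premium, tgt_premium = cur.fetchone()
    ok = (src_count == tgt_count) and (src_premium == tgt_premium)
    if not ok:
        # Publish a discrepancy alert, e.g. with the SNS helper sketched earlier
        pass
    return ok
```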
What kind of triggers are used to invoke AWS Lambda?
Lambdas are invoked using put events from S3 when new data is uploaded.
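A minimal Lambda handler for this trigger pattern could look like the following; the event shape follows the standard S3 notification format, while the Glue job name and arguments are assumptions:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put-event notifications carry the bucket and object key of the new file
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hand the heavy processing to a Glue job so it is not bound by Lambda's 15-minute limit
        glue.start_job_run(
            JobName="process-adhoc-drop",  # hypothetical job name
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
    return {"status": "started"}
```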
How is data security handled for PII data?
Data security includes using abstraction layers, data masking, and separating PII from non-PII data into different storage locations with restricted access.
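A simplified sketch of the separation step described, using pandas to write PII and non-PII columns to different locations; the column list, key column, and S3 paths are assumptions, and writing directly to s3:// paths assumes s3fs is installed:

```python
import pandas as pd

PII_COLUMNS = ["customer_name", "email", "phone"]  # hypothetical PII column list

def split_pii(csv_path: str) -> None:
    df = pd.read_csv(csv_path)
    pii_cols = [c for c in PII_COLUMNS if c in df.columns]

    # Restricted copy keeps the raw PII values; access to this prefix is locked down
    df[["customer_id"] + pii_cols].to_parquet(
        "s3://ecom-restricted-zone/customers_pii.parquet", index=False
    )

    # General copy drops the PII columns and can be consumed by all users
    df.drop(columns=pii_cols).to_parquet(
        "s3://ecom-general-zone/customers.parquet", index=False
    )
```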
What kind of database management techniques does ASA use with DynamoDB and ElasticSearch?
ASA uses DynamoDB for high-throughput key-value access and Elasticsearch for complex queries, full-text search, and real-time analytics.
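For context on the overwrite behaviour ASA contrasts, DynamoDB's put_item replaces any existing item with the same key, and a condition expression is one way to guard against that; the table name and attributes below are placeholders, loosely following the employee-ID example from the interview:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("mandates")  # hypothetical table keyed on employee_id

# put_item silently replaces any existing item that has the same key
table.put_item(Item={"employee_id": "E100", "employee_name": "A. Kumar", "status": "ACTIVE"})

# A condition expression prevents an accidental overwrite
try:
    table.put_item(
        Item={"employee_id": "E100", "employee_name": "A. Kumar", "status": "PENDING"},
        ConditionExpression="attribute_not_exists(employee_id)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists; not overwriting")
```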
How does ASA ensure data lineage and logging in their workflows?
Data lineage is tracked through Spark RDD execution plans (DAGs), with AWS services like CloudWatch and CloudTrail providing logging and operational visibility.
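The RDD lineage mentioned here can be inspected directly in Spark: toDebugString prints the chain of transformations (the DAG) that Spark would replay to recompute a lost partition. A tiny PySpark illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

rdd = (
    spark.sparkContext.parallelize(range(100))
    .map(lambda x: (x % 10, x))
    .reduceByKey(lambda a, b: a + b)
)

# Prints the lineage graph: reduceByKey <- map <- parallelize
print(rdd.toDebugString().decode("utf-8"))
```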
- 00:00:05connection established okay so hi AA
- 00:00:09welcome to this round right so can you
- 00:00:11please introduce yourself and maybe the
- 00:00:13project that you have worked upon so
- 00:00:15yeah you can start yeah okay yeah thank
- 00:00:18you Nisha for giving me this opportunity
- 00:00:20to interview with you I'll start with my
- 00:00:22brief introduction so myself ASA I have
- 00:00:25close to five years of experience
- 00:00:26working in data now I have worked on
- 00:00:28banking as well as insurance to I
- 00:00:30started my journey with exential there I
- 00:00:32was working in ETL projects ETL data
- 00:00:34migration and implementation project and
- 00:00:36in quantify I was working in creating
- 00:00:38Inn Data Solutions using uh various AWS
- 00:00:41services like U red shift S3 uh DMS and
- 00:00:45I do have a good understanding on data
- 00:00:47modeling Concepts and data warehousing
- 00:00:49other than that like my go-to
- 00:00:51programming languages Python and SQL
- 00:00:53scripting so that's a brief about me so
- 00:00:56can you also mention like the
- 00:00:58architecture like what was the source
- 00:00:59location and where you were putting the
- 00:01:01data what type of Transformations you
- 00:01:02are doing so just to give an
- 00:01:05idea yeah okay so uh in one of my use
- 00:01:09case I'll discuss about one of my use
- 00:01:10case that was related to Insurance data
- 00:01:12so basically uh my architecture was like
- 00:01:15uh Source was the Oracle database where
- 00:01:16all the transaction were uh transactions
- 00:01:18were happening so we were using DMS
- 00:01:20service to get the data migrated to the
- 00:01:23uh S3 storage where we were keeping all
- 00:01:25the raw data then we were using uh like
- 00:01:29um uh we are leveraging uh red shift for
- 00:01:32our data warehousing solution where we
- 00:01:34have all the data models like Dimensions
- 00:01:36facts as well as there is one layer to
- 00:01:38it there is ODS layer which is basically
- 00:01:40we are keeping a exact replica of what
- 00:01:42data is there on the source so that it
- 00:01:45can act as a source of Truth for us when
- 00:01:46we are starting modeling our data so we
- 00:01:49can uh basically leverage that and uh
- 00:01:51that can uh act as a source for us uh in
- 00:01:55between we are uh basically creating
- 00:01:56stor procedure and external tables using
- 00:01:59red shift and of spectrum capabilities
- 00:02:01to get the CDC data incremental data and
- 00:02:03we have written the code in SQL so there
- 00:02:05is a couple of uh like merge statements
- 00:02:07and other uh quality related checks that
- 00:02:09we are doing so this was end to end
- 00:02:11architecture and we were dumping the
- 00:02:12data into our dimensions and uh dims and
- 00:02:15facts basically which is there in red
- 00:02:17shift which was uh later on consumed by
- 00:02:20the client as per the
- 00:02:22requirement so is uh the DMS that you
- 00:02:25are using right so is how it is
- 00:02:27basically how much data load that you
- 00:02:29have on the Oracle system like what is
- 00:02:32the what are the different tables and
- 00:02:33what is the data load that it is
- 00:02:35handling yeah so there are a lot of
- 00:02:37different tables uh in Insurance domain
- 00:02:40there are different tables related to
- 00:02:42like uh based on different granularities
- 00:02:43policy related risk there are coverages
- 00:02:46there are claim related data so there
- 00:02:48are uh like a lot of uh Source data so
- 00:02:53we have enabled CDC basically we are
- 00:02:54just getting the incremental data like
- 00:02:56the change that is happening what DMS is
- 00:02:58doing we have enabled the CDC in our
- 00:03:00source site so whenever there is a
- 00:03:02transaction it will go on and uh
- 00:03:04basically uh write it in a log uh what
- 00:03:07we call is a write-ahead log so whenever
- 00:03:09the transaction is done that log will be
- 00:03:11captured there and DMS is taking all the
- 00:03:14logs from the client side and it will
- 00:03:16just uh migrate it to the S3 it will
- 00:03:19have the raw files as well like all the
- 00:03:21raw files we'll be receiving so since we
- 00:03:24are just dealing with the change data
- 00:03:25capture it's not uh in a like huge
- 00:03:28volume but when we are doing history
- 00:03:30load then uh yeah it is taking a lot of
- 00:03:33time DM is taking a lot of time to get
- 00:03:35the data from the source so sometimes it
- 00:03:38also take like more than two hours so
- 00:03:40like the raw data right as I understand
- 00:03:42it is the exact replica of your Oracle
- 00:03:44system right so do you have set any
- 00:03:46retention at that part or will it be
- 00:03:48always in the S3 standard
- 00:03:50layer uh so like uh firstly the logs
- 00:03:53there is a retention period like
- 00:03:55previously there was a retention period
- 00:03:56of 2 to 3 days but if there is any
- 00:03:58failure which has not been resolved in 2
- 00:04:003 days so uh then we have changed it to
- 00:04:027 days so logs will be there uh till
- 00:04:05like 7 Days retention in S3 we what we
- 00:04:08are doing is like after 90 days like we
- 00:04:10are moving it to uh other basically we
- 00:04:12are archiving that
- 00:04:14data so rtion so say you have a
- 00:04:18requirement right where you don't
- 00:04:19suppose you have a location say uh in
- 00:04:22the S3 bucket right and you are writing
- 00:04:24the data the next time you run a load
- 00:04:27you don't want that location uh so you
- 00:04:29want that location to be overd right but
- 00:04:32in this case you don't want to lose the
- 00:04:34previous data so what capability of S3
- 00:04:36would you use so that you can't you
- 00:04:37won't delete the previous version of the
- 00:04:40data uh so basically everyday data is
- 00:04:44going in a partition date so in S3 we do
- 00:04:46have date date folders uh so let's say
- 00:04:50if there are 100 transaction that has
- 00:04:52happened today so it will be in today's
- 00:04:54date and uh for tomorrow it will be in
- 00:04:56different date so whenever we are uh
- 00:04:58creating our external table we are
- 00:05:00adding the uh today's date partition to
- 00:05:02it and we just we are just reading uh
- 00:05:04the latest data from there and then uh
- 00:05:07like store procedure is running and it
- 00:05:08is dumping the data from is3 to uh our
- 00:05:11red shift ODS layer so it's like if I am
- 00:05:15running my job today I'll have if there
- 00:05:17is any case where failure has been uh
- 00:05:20like there was a failure so we can add
- 00:05:21multiple date partition as well to
- 00:05:23access different uh dates data but for
- 00:05:26now the use case is like like every day
- 00:05:28it will drop the previous date partition
- 00:05:30and add a new one and uh we are just
- 00:05:32able to access the one day data only
- 00:05:35okay so are you doing any reconciliation
- 00:05:37process as part of this
- 00:05:40job uh yeah we are doing reconciliation
- 00:05:43process once we are uh done with like
- 00:05:45the production load for all the use
- 00:05:47cases there will be SNS notifications
- 00:05:49that will get triggered based on the
- 00:05:52success and failures so mostly what we
- 00:05:54are doing is we are just uh based on
- 00:05:57some transformation we are checking
- 00:05:58whatever data is there in in our
- 00:05:59dimensions and facts uh and as per the
- 00:06:02requirement whatever data is there on
- 00:06:04the source so we have written SQL
- 00:06:06queries uh as part of Recon so it will
- 00:06:09basically match all the kpis and if
- 00:06:11there is any difference it will trigger
- 00:06:13a notification if there is no difference
- 00:06:15as well then then also like we have kept
- 00:06:17it that way that uh it will uh daily
- 00:06:19match some kpis for example like the
- 00:06:21policy count uh the premium amount so
- 00:06:24these kpis it will be matching and we
- 00:06:26are sending the triggers so how you
- 00:06:28scheduling the trigger ciliation process
- 00:06:30and at what time like is it a daily run
- 00:06:32that you are doing or a weekly run so
- 00:06:33how it is
- 00:06:34scheduled it's a daily run so for now
- 00:06:37like we have scheduled it like uh in the
- 00:06:39morning around 9:00 because we have we
- 00:06:42are starting all our use cases run after
- 00:06:4412: after midnight so uh as for the load
- 00:06:48that we were getting everything will be
- 00:06:50like all the uh use cases jobs will uh
- 00:06:53will be successful by 7 in the morning
- 00:06:56so based on that like we are running it
- 00:06:58uh at like 9 in the like 9 in the
- 00:07:01morning yeah and say as part of your
- 00:07:03reconciliation process right you find
- 00:07:05some discrepancy in your data like maybe
- 00:07:07there is some corrupt record or
- 00:07:08something so how you would deal with
- 00:07:10that and how you would like reprocess it
- 00:07:13again yeah okay so if there is any
- 00:07:16mismatch in the data so what we are
- 00:07:18doing is we are going back to the code
- 00:07:20firstly we'll check the logs like there
- 00:07:22can be different type of uh failures if
- 00:07:24it's a job failure or is it a uh kpi
- 00:07:28mismatch so in case of uh there is any
- 00:07:31failure we are going to the step
- 00:07:32function that we are using for
- 00:07:33orchestrating all our tasks so there we
- 00:07:36do have enabled uh like Cloud watch logs
- 00:07:39we can go to the ECS task and we can
- 00:07:42check the cloudwatch logs and based on
- 00:07:44the error message that it is showing
- 00:07:46we'll go and fix in the code uh in the
- 00:07:48jobs basically and if it's a KPI
- 00:07:50mismatch related data then we know like
- 00:07:53uh from which fact we are getting this
- 00:07:55data and we'll have to basically
- 00:07:56backtrack the source of that particular
- 00:07:58kpi uh like what all the transformation
- 00:08:01we are applying to that uh particular uh
- 00:08:04like that metric and from where it is
- 00:08:06coming and what can be done like uh we
- 00:08:08will be backtracking it at every uh
- 00:08:11Point like every
- 00:08:13stage so how you are currently notifying
- 00:08:16your consumer so say you are producing
- 00:08:17the data and there are some consumers
- 00:08:19right and because of your process you
- 00:08:21identify maybe there is some SL sometime
- 00:08:23that is breached or maybe there is some
- 00:08:26freshness issue or something so how that
- 00:08:28will be communicated to the
- 00:08:30consumers uh so basically uh we have
- 00:08:33created views on top of our facts and
- 00:08:37the access has been given so it's uh
- 00:08:39like country related data that we are
- 00:08:41getting so there are some users that are
- 00:08:43accessing the uh like let's say Hong
- 00:08:46Kong data there are some users that are
- 00:08:47accessing Vietnam data So based on the
- 00:08:50uh like authent like we will be giving
- 00:08:53the them the uh permission to view the
- 00:08:55data so only the uh related use cases uh
- 00:09:00basically if there are some users that
- 00:09:02want they want the data uh to get viewed
- 00:09:05or something like that so they'll have
- 00:09:07to ask for Access and we need to add
- 00:09:09their name as well but suppose they
- 00:09:11already have the access but somehow
- 00:09:13because of your job fail or something
- 00:09:15you are not able to publish today's date
- 00:09:17data right and now when they check on
- 00:09:19their side they won't be able to get the
- 00:09:20latest data so as a proactive measure
- 00:09:22you can also notify them right as part
- 00:09:25of your process so how can you is it
- 00:09:27currently that You' have implemented if
- 00:09:29note how could you do that yes yes us uh
- 00:09:32SNS notification will get triggered
- 00:09:34after the run is complete and it's going
- 00:09:36like we have included the users as well
- 00:09:39who are consuming the data so they'll
- 00:09:40get a notification in case of there is
- 00:09:42any failure or there is any kpi mismatch
- 00:09:44they'll directly get the notification
- 00:09:46from
- 00:09:47there uh so will how will you implement
- 00:09:49it as part of your current
- 00:09:51workflow yeah uh so uh like uh we are
- 00:09:55orchestrating everything in Step
- 00:09:56functions there is one uh SNS uh trigger
- 00:09:59that we are uh basically we have written
- 00:10:01the Lambda as well where we are
- 00:10:03comparing all the data and uh that we
- 00:10:06will be invoking and it will uh based on
- 00:10:10some conditions it will trigger the
- 00:10:11notification we when we are creating the
- 00:10:13topics we are sending that them to the
- 00:10:15users they will subscribe to it based on
- 00:10:17the PS up method they will be receiving
- 00:10:20notifications later and in case of any
- 00:10:22failure or production issue then based
- 00:10:24on that like uh we'll have to estimate
- 00:10:26it and we'll create a like us a story
- 00:10:28for that like produ issue we'll have to
- 00:10:30log in jir and based on the we'll start
- 00:10:33working on it okay so in S3 bucket right
- 00:10:37can we have like two buckets with the
- 00:10:39same name in AWS
- 00:10:41S3 uh no uh it should be unique uh so
- 00:10:46but is it like is it not region specific
- 00:10:48or is it region specific d three service
- 00:10:50it's Global it's Global and every name
- 00:10:53should be unique is there any
- 00:10:55requirement for having such Global name
- 00:10:58space
- 00:11:00uh require uh sorry uh I did not
- 00:11:03understand your question so point is
- 00:11:04like is uh what is the use case like why
- 00:11:06ad decided to have the global name
- 00:11:08spaces even if we can specify the region
- 00:11:11in an S3 bucket okay I check I don't so
- 00:11:13say you have an S3 bucket right and like
- 00:11:15how can you secure your S3 bucket like
- 00:11:17what different uh things you can
- 00:11:19Implement by which you can secure your
- 00:11:21S3
- 00:11:23bucket uh so basically we can provide
- 00:11:25permissions accordingly like uh there
- 00:11:28can be an uh access uh list we can add
- 00:11:31and uh we can allow or deny some
- 00:11:33resources or we can allow or uh deny
- 00:11:35based on the IAM
- 00:11:37permissions so that we can uh
- 00:11:41do okay any other thing that you can
- 00:11:43Implement I can think of right now okay
- 00:11:47so say you are working for say
- 00:11:48e-commerce company right hello say
- 00:11:52you're working for an e-commerce company
- 00:11:54and uh maybe handling the Sher data
- 00:11:55product right so all the customer
- 00:11:57information and everything you are
- 00:11:58dealing with those type of data uh so
- 00:12:01Cher data is basically in pii data right
- 00:12:03and it's your responsibility as part of
- 00:12:05the team to secure the pii information
- 00:12:08right so what are the diff so how you
- 00:12:10will Design uh this pipelines or maybe
- 00:12:12how you will secure this pii information
- 00:12:14so maybe you can list down some factors
- 00:12:16and maybe some ways uh so first thing is
- 00:12:19like uh this data like are we receiving
- 00:12:21the piia columns as well and uh what is
- 00:12:24the like use case like can we uh can we
- 00:12:28access that data uh if yes then we can
- 00:12:31uh do one thing we can create an
- 00:12:32abstraction layer on top of that where
- 00:12:34we can just uh basically uh mask that
- 00:12:37data so what we are doing is in some
- 00:12:40cases we do have two uh two different
- 00:12:42schemas one is restricted one and one is
- 00:12:45the general one so in the general one we
- 00:12:46will be keeping all the uh we will be
- 00:12:49doing all the masking on the columns
- 00:12:51that uh that comes under pi and in the
- 00:12:54restricted column we will be showing
- 00:12:55their values as well so what we are
- 00:12:57doing is uh we do have have data stored
- 00:12:59in our facts when we are creating the
- 00:13:01views on top of that then only in the
- 00:13:03view uh view query we are uh just have
- 00:13:07written a case statement where we are
- 00:13:09checking if that column is present or
- 00:13:11not if the value is there then we are
- 00:13:12just masking that value other than data
- 00:13:15masking like is is there any other thing
- 00:13:17that you will Implement so that say your
- 00:13:20data is present in the S3 bucket and
- 00:13:21currently uh you have the access right
- 00:13:24and you want to prevent any unauthorized
- 00:13:26access to your data so data masking is
- 00:13:29one of the technique that you will
- 00:13:30surely Implement before dumping it to
- 00:13:31the Target sources so like any other
- 00:13:34technique that you can Implement to
- 00:13:35secure your data maybe we can remove
- 00:13:38those uh columns from there and we can
- 00:13:40keep it in different uh uh like
- 00:13:42different file or different table and uh
- 00:13:45we can move it to different folder uh
- 00:13:48which do not have access to the normal
- 00:13:50users we can restrict the access to that
- 00:13:53we can write python code we can read the
- 00:13:56data we can uh basically separate the
- 00:13:58The pii Columns and the uh original
- 00:14:01columns that we want to use and we can
- 00:14:03create two different files out of it and
- 00:14:06one file will go to the restricted
- 00:14:08folder and one file can go uh to the
- 00:14:10normal folder that basically can be
- 00:14:12consumed by all the users okay so when
- 00:14:15you're explaining the architecture right
- 00:14:17so what was the format of the data that
- 00:14:19you have in the
- 00:14:20roer so it's basically firstly uh all
- 00:14:24the logs that we are receiving it's in
- 00:14:26CSV format and later on using python we
- 00:14:29are taking the chunks uh like chunks of
- 00:14:31data we are uh basically converting that
- 00:14:34uh that into Parquet format so that we can
- 00:14:36access it later so how you transforming
- 00:14:39from rad to uh transform is there any
- 00:14:41Services of AWS you're
- 00:14:44using so uh firstly there is one raw
- 00:14:47Zone which we are calling uh which will
- 00:14:50have all the like all the logs so let's
- 00:14:53say if there are 100 transaction that
- 00:14:54has happened so it will have 10 uh like
- 00:14:57100 CSV files inside it so now what we
- 00:15:00are doing is we have written a python
- 00:15:02code and the python code is running it
- 00:15:04will merge all the those 100 uh files
- 00:15:07because uh there can be a case like for
- 00:15:09every transaction it is creating a file
- 00:15:11so there can be like thousands or uh 10
- 00:15:14thousands of files we have we are
- 00:15:16receiving in one day but we want to
- 00:15:17process everything so we are merging all
- 00:15:19those data and then we are in the python
- 00:15:22code only like we are creating U like we
- 00:15:24are converting it into Parquet and we are
- 00:15:26keeping that data into our aggregate
- 00:15:28zone so aggregate Zone uh we have in
- 00:15:32place which will have the exact same
- 00:15:33data but it it's uh in uh like we are
- 00:15:35it's compressed it's in Parquet format that
- 00:15:38we can leverage later for the
- 00:15:40optimization techniques that we can use
- 00:15:42and uh it's acting uh like uh acting as
- 00:15:46a middle layer which can be later on
- 00:15:48consumed uh by the user so on top of
- 00:15:50that we are creating external tables and
- 00:15:52we are dumping the data into ODS so in
- 00:15:55this whole pipeline L how you are
- 00:15:56managing the schema right because CSV
- 00:15:59and uh don't come with the schemas right
- 00:16:02so how you managing your schemas schema
- 00:16:04drift or schema Evolution as part of
- 00:16:06this pipeline if there is any Evolution
- 00:16:08happening right so how you will
- 00:16:09incorporate that in a
- 00:16:11pipeline yeah so CSV does not provide us
- 00:16:15the uh feature for like schema evolution
- 00:16:18so what we are doing is uh we have
- 00:16:20written a python code and uh like
- 00:16:23whenever there is a new file we are
- 00:16:25basically creating a data frame out of
- 00:16:26it and uh we are checking all The
- 00:16:28Columns we are creating a set of it and
- 00:16:30we are connecting to our Redshift which
- 00:16:32is our Target and we are taking some
- 00:16:35data from that as well again doing the
- 00:16:38same process checking all the columns
- 00:16:40and creating a set out of it and then we
- 00:16:42have two sets like one set has all the
- 00:16:44columns from The Source One set has all
- 00:16:46the columns from the target we are
- 00:16:47taking a set difference and if there is
- 00:16:50uh some columns that is like we are
- 00:16:52getting as as an output of set
- 00:16:54difference then we know like these are
- 00:16:56all the columns that are uh that that is
- 00:16:58not there in the red shift in our Target
- 00:17:01and these are all the new columns so
- 00:17:03there like we are doing two things uh we
- 00:17:05are firstly like uh based on the columns
- 00:17:08which we are getting we are creating
- 00:17:09alter statements so we are adding new
- 00:17:12columns to our Target and we are just
- 00:17:14keeping it as we care because it can uh
- 00:17:16have any type of data and later on like
- 00:17:19we can go on and manually change that
- 00:17:21like as per the uh like the data
- 00:17:23requirement if it's uh we need to keep
- 00:17:25it integer or like any other data type
- 00:17:28so this will be the one thing and second
- 00:17:30thing we are sending notifications to
- 00:17:31your user so that we can get their
- 00:17:33confirmation like if those columns were
- 00:17:35needed and uh are they part of as part
- 00:17:38of any new requirement or things like
- 00:17:40that so can't we use like Glue catalog
- 00:17:42service here to manage the schema like
- 00:17:45instead of managing like connecting to
- 00:17:47source and Target and comparing it so
- 00:17:49can't we use any managed catalog service
- 00:17:51Like
- 00:17:52Glue yes yes we can definitely use that
- 00:17:55but since it was already uh like flow is
- 00:17:58like we are getting the data from DMS
- 00:18:00and we are processing all we are doing
- 00:18:02all the processing writing all the SQL
- 00:18:05scripts and dumping the data so we are
- 00:18:07not using glue for that that's why we
- 00:18:09have not used but definitely we can uh
- 00:18:12yeah change that to uh we can use some
- 00:18:14other ews Service as well okay so say
- 00:18:17you're using glue right for maybe EMR
- 00:18:20service right so we have to specify the
- 00:18:22cluster configurations right like what
- 00:18:24type of instances we want to use so um
- 00:18:27we have like different instance typee in
- 00:18:29AWS service right so uh on basis of use
- 00:18:32case how we will specify like what type
- 00:18:34of instance we have to use from AWS
- 00:18:38offering so uh like if we are using AWS
- 00:18:41Glue we don't need to worry about the
- 00:18:44infrastructure management because it's a
- 00:18:46server serverless and it manages uh as
- 00:18:49per the load so if there's more uh like
- 00:18:52more data coming in our way and the
- 00:18:54workload is uh higher than expected like
- 00:18:56it will spin up all the cluster
- 00:18:59accordingly uh so I have used uh AWS
- 00:19:02glue not uh EMR I can talk in terms of
- 00:19:05that like if firstly uh Whenever there
- 00:19:08is any use case we'll have to go on and
- 00:19:10uh analyze the data data volume data
- 00:19:13types and uh we will be doing all the
- 00:19:15analysis on that basis only we can
- 00:19:18either uh use some uh like EMR in EMR we
- 00:19:22can get uh all the flexibility to
- 00:19:24provision our own clusters so there it's
- 00:19:27very much necessary that we know what uh
- 00:19:29data volume we're getting and as for
- 00:19:31that like how can it be handled how many
- 00:19:33nodes how many basically how many
- 00:19:35executors do we need what is the memory
- 00:19:37what is the core that we need as per the
- 00:19:39requirement and the data we are dealing
- 00:19:41with So currently EMR also comes with a
- 00:19:44new offering of EMR serverless right so
- 00:19:46in this case you also can use EMR Ser
- 00:19:48serverless for your load so what what
- 00:19:51can be the use case for using EMR
- 00:19:53serverless over top of over glue yeah so
- 00:19:56if it's uh like similar like
- 00:19:58architecture wise like it's serverless
- 00:20:00and it also gives like uh manages all
- 00:20:04the infrastructure itself so maybe the
- 00:20:06data cataloging part that we have in AWS
- 00:20:09glue we don't have in am EMR I'm not uh
- 00:20:12wellers with the EMR uh like
- 00:20:14functionalities and features but I'll
- 00:20:16talk about uh AWS Glue it does have uh
- 00:20:20data brew as well where we can uh like
- 00:20:22perform some analytics that is one plus
- 00:20:24Point uh it does have AWS crawlers as
- 00:20:26well that can help to crawl the data
- 00:20:28from some source to Etha or some other
- 00:20:31source so those are the functionalities
- 00:20:33maybe that uh EMR is not like it's not
- 00:20:36there in the EMR okay so say you are
- 00:20:39using glue service right for your detail
- 00:20:41Pipeline and you are using glue crawler
- 00:20:44and say your data is present in the S3
- 00:20:45bucket right so how so say one day you
- 00:20:48use the glue crawler and schema is up to
- 00:20:50date in your glue pipelines right uh so
- 00:20:53maybe from some few days uh afterwards
- 00:20:55your schema has changed right so how
- 00:20:57will glue come to know that schema has
- 00:21:00changed so how you will implement this
- 00:21:02type of workflow so that your ETL
- 00:21:04pipelines would be aware of the recent
- 00:21:06schema
- 00:21:07changes so we are using AWS uh like data
- 00:21:11catalog part here and data catalog will
- 00:21:13basically store all the metadata
- 00:21:15regarding the data files and everything
- 00:21:17so if there is any change in the data so
- 00:21:19AWS catalog will capture that change as
- 00:21:22part of schema evaluation and when we
- 00:21:24are uh moving the data to like crawling
- 00:21:26the data from the S3 or some other
- 00:21:29source uh it will go on and check the
- 00:21:32metadata from data catalog and it will
- 00:21:34come to know about the changes in the
- 00:21:36schema so say you want to like have a
- 00:21:38pipeline implemented as uh in the
- 00:21:41similar flow so say you want to have an
- 00:21:43Automation in place uh in this case like
- 00:21:45whenever there is a schema changes or
- 00:21:48something or maybe you have a glue job
- 00:21:50which will basically maybe per day basis
- 00:21:53it will scan the data and in case there
- 00:21:55is any drift in the schema you will be
- 00:21:57get notified ified so you want to
- 00:21:59automate such a pipeline so how you will
- 00:22:01Design such a pipeline in glue uh so in
- 00:22:03glue what we can do is uh in that case
- 00:22:06like if there is any schema changes we
- 00:22:09need to
- 00:22:09notify uh maybe we can add some triggers
- 00:22:13or uh we can send some notifications
- 00:22:15using teams
- 00:22:17or maybe on some other channel like
- 00:22:20slack Channel teams uh what according to
- 00:22:22the functionality that it provides uh
- 00:22:26yeah I'm not uh much like I have to uh
- 00:22:28think about that use case but my
- 00:22:31Approach will be like if I am using glue
- 00:22:33uh somewhere like I will get previous
- 00:22:36date uh data right uh that has because
- 00:22:40uh the the schema was the previous
- 00:22:42schema that we were referring to and uh
- 00:22:45the new schema that we are getting we
- 00:22:47can also load that also in our AWS glue
- 00:22:49so maybe let's say like I have like
- 00:22:51today's the date that we have received
- 00:22:54the new schema and two days back the
- 00:22:56date was uh there like uh based on the
- 00:22:58partition date I can read two partitions
- 00:23:01now and from those two partitions I can
- 00:23:03read the data and check all the columns
- 00:23:05and uh basically get the difference in
- 00:23:07the columns whatever the difference is
- 00:23:08there and what is the data type changes
- 00:23:11basically uh based on that like if there
- 00:23:14is any difference that uh there will we
- 00:23:17can uh set some flags and based on those
- 00:23:19flags or we can uh invoke some Lambda or
- 00:23:22basically we can invoke SNS to trigger
- 00:23:25the notifications so something like that
- 00:23:28we can Implement okay so you mentioned
- 00:23:31in your resume like you have also used
- 00:23:33DynamoDB and Elasticsearch as a database
- 00:23:35uh so what was the use case right and
- 00:23:37why we have like using two different
- 00:23:39databases so what were the different use
- 00:23:40cases that you were using this
- 00:23:43for yeah so uh basically dynamodb and
- 00:23:47elastic search we are using for same use
- 00:23:48case only what is what was happening is
- 00:23:51we were getting uh aay related data so
- 00:23:53whenever like uh we need to uh set a
- 00:23:56mandate so that like uh the premium or
- 00:23:59anything like that like it can be
- 00:24:00deducted automatically from the account
- 00:24:02so those sort of data we were getting in
- 00:24:04terms of CSV and txt file from the users
- 00:24:07themselves and using infoworks for the
- 00:24:09data injection part like we were using
- 00:24:11DMS ews service in this use case we are
- 00:24:14using infoworks which basically is a
- 00:24:15data injection tool there like we can
- 00:24:17write the same workflows and uh we can
- 00:24:19create a pipeline uh what it was doing
- 00:24:22was like it was reading the data from
- 00:24:24that file and uh it was uh converting it
- 00:24:27into Parquet and again the so like the
- 00:24:29target was to uh basically convert in
- 00:24:31Parquet and uh place it on S3 itself which
- 00:24:35was our um raw Zone there as well and
- 00:24:38later on like what we were doing is uh
- 00:24:40we were using AWS glue jobs to read the
- 00:24:42data from there and we were it's uh like
- 00:24:44we were creating the Json structure
- 00:24:46there like using StructType and
- 00:24:47StructField and the target was DynamoDB
- 00:24:51because um what we like Dynamo DB and uh
- 00:24:54elastic search why we were using two
- 00:24:56different uh Services because uh firstly
- 00:24:59there are some functionalities that
- 00:25:01elastic search provides like it can
- 00:25:02provides us like we can uh write more
- 00:25:05complex queries it's more uh based on
- 00:25:07real-time analytics and also like uh it
- 00:25:10can uh it provides a uh full text search
- 00:25:13so if we want to uh search in a complex
- 00:25:16uh uh there are some partition Keys also
- 00:25:19like in Dynamo DB it's also uh very fast
- 00:25:22it's it has a very high throughput and
- 00:25:25uh it does store data in uh no SQL
- 00:25:28unstructured or maybe semi-structured
- 00:25:30data but it comes with the cost
- 00:25:33definitely our use case was like like if
- 00:25:35there is any change in the document
- 00:25:38whatever we are storing and based on the
- 00:25:40same partition key if I want to update
- 00:25:42that it will override that particular uh
- 00:25:44data which was there in DynamoDB it
- 00:25:47does not index it and keep both the data
- 00:25:49fields so let's say if there is any like
- 00:25:51my partition uh like we need to uh
- 00:25:54basically specify the partitions and
- 00:25:56sort key there so let's say if I'm
- 00:25:58keeping my like employee ID employee
- 00:26:00name those two keys as part of partition
- 00:26:03keys and uh on those two keys I have
- 00:26:05some updates now I want to update that
- 00:26:08system like there is some changes okay
- 00:26:10so if I'll go on and hit the uh updating
- 00:26:13Dynamo DB it will override that uh data
- 00:26:15in elastic search it will index that
- 00:26:18like it will keep both the data like we
- 00:26:20can uh it doesn't have any uh if you
- 00:26:22want to uh basically prevent it from uh
- 00:26:26having the duplicates we'll have to IND
- 00:26:27that data and we can query that as well
- 00:26:29like based on uh if you want the um if
- 00:26:33you want the latest data we can just
- 00:26:35query it accordingly because it has its
- 00:26:37like already indexed documents so that
- 00:26:39was one use case which were uh because
- 00:26:41of which we are having the same data at
- 00:26:44both places but the consumers are
- 00:26:46different like the users are different
- 00:26:48so can you explain me the difference in
- 00:26:50the database sharding and the
- 00:26:51partitioning uh database sharding and
- 00:26:54partitioning partitioning is uh uh in my
- 00:26:58understanding like where we are getting
- 00:26:59like we have a large data set and we
- 00:27:01want to process that data set uh in uh
- 00:27:04like very quick time so we'll be taking
- 00:27:07the chunks of it and uh we'll be keeping
- 00:27:10those chunks in different different
- 00:27:11partitions so that is partitioning to
- 00:27:13achieve parallelism where we can uh par
- 00:27:16parall we can process all those data and
- 00:27:19sharding uh like I have I'm aware on the
- 00:27:23AWS Kinesis I guess in sharding like we
- 00:27:25are also getting the data like from
- 00:27:26streaming
- 00:27:28I'm not sure if uh I'm correct here but
- 00:27:31uh I have read about shards which
- 00:27:34basically capture the streaming data and
- 00:27:36we do specify the capacity of that shard
- 00:27:39uh which will be there so it can like
- 00:27:41maybe um in one shard we can have 100
- 00:27:44records or things like that that's
- 00:27:46mostly using for the uh streaming
- 00:27:48related data okay so like what is the
- 00:27:52secondary index Concept in the Dynamo DB
- 00:27:55so like we have the secondary local
- 00:27:57secondary index as well as the global
- 00:27:58secondary index so what is the use case
- 00:28:00of implementing the secondary index and
- 00:28:02what is the cons of using those
- 00:28:04secondary index I'm not uh much aware on
- 00:28:08that actually I'll have to okay look
- 00:28:10into it so say consider one scenario
- 00:28:13right so you are working for some
- 00:28:14business unit and uh so they are putting
- 00:28:17your data in the S3 bucket so they don't
- 00:28:20have a particular schedule when they are
- 00:28:22dumping it is like on an ad hoc basis on
- 00:28:23a daily basis right and they have
- 00:28:26mentioned an SLA for 30 minutes so
- 00:28:28whenever data LS to the S3 bucket you
- 00:28:30need to process that data in in the 30
- 00:28:33minute window right so they have also
- 00:28:35given some business logic which you need
- 00:28:37to perform on your data and right and
- 00:28:40again they also want all the auditing
- 00:28:42and logging in place so that they can
- 00:28:44also track the data lineage so what
- 00:28:47would be your approach of handling such
- 00:28:48a pipeline uh for Designing such a
- 00:28:50pipeline by meeting the constraints that
- 00:28:53they have mentioned so first constraint
- 00:28:55is uh that it's on an ad hoc basis we are
- 00:28:58not getting any notifications right uh
- 00:29:01what I can think of is uh what we can do
- 00:29:04is uh so whenever we can uh activate we
- 00:29:07can write a Lambda function and uh we
- 00:29:10can invoke it based on the put uh put
- 00:29:12event in S3 so if there is any file that
- 00:29:16has been placed in S3 it will
- 00:29:18automatically go on and basically invoke
- 00:29:20the Lambda function and Lambda function
- 00:29:22can take care of the uh processing part
- 00:29:25there um that's one thing and in case
- 00:29:29there are like the data size data volume
- 00:29:31is huge so what can we do in that case
- 00:29:35um basic uh what I'm thinking is maybe
- 00:29:37we can uh do the same thing like till
- 00:29:40the first part what we can do is uh
- 00:29:43whenever there's a put object it can
- 00:29:45invoke the Lambda and in the Lambda we
- 00:29:47can invoke AWS glue jobs maybe and uh it
- 00:29:50will take the data from S3 and in the
- 00:29:52AWS glue jobs it will start processing
- 00:29:56uh so that was the ad hoc part and what
- 00:29:59was the second uh what were uh the other
- 00:30:02constraint uh so for other constraint is
- 00:30:04they also want to have the logging in
- 00:30:06place as well as the auditing mechanism
- 00:30:08so that they can also see the data
- 00:30:12lineage okay so for that like uh uh data
- 00:30:16lineage okay so uh in AWS glue if we are
- 00:30:20using that uh so it does uh have like uh
- 00:30:25we we are leveraging spark right so it
- 00:30:28does store the data in the lower uh like
- 00:30:31low level in rdds which does have the uh
- 00:30:33lineage information so maybe we can
- 00:30:35leverage
- 00:30:36that or in case and for the logging part
- 00:30:39we can
- 00:30:41active yeah so for any logging part or
- 00:30:43monitoring purpose we can uh activate uh
- 00:30:46AWS cloudwatch events there and in case
- 00:30:49of any failure in case of anything we
- 00:30:51want to uh basically see we can go on
- 00:30:54and uh check on the logs uh yeah so can
- 00:30:58you explain like how rdds will give you
- 00:30:59the data
- 00:31:01lineage yeah so uh rdds uh basically how
- 00:31:07uh okay so whenever we are calling an
- 00:31:09action basically uh it's a in sparkk
- 00:31:12It's a lazy evaluation it follows a lazy
- 00:31:14evaluation technique so if there is any
- 00:31:16action that has been called it will go
- 00:31:17on and from the start it will start uh
- 00:31:19take all the uh steps that will that is
- 00:31:22there it will create a execution plan uh
- 00:31:25and in that execution plan will have all
- 00:31:27the Transformations and everything that
- 00:31:28needs to be done to get to the output
- 00:31:31and it it's in certain order and so it
- 00:31:34will create dags for it so dag will have
- 00:31:37a particular tasks in certain order that
- 00:31:39needs to be executed that is basically
- 00:31:41the lineage information that rdd has uh
- 00:31:45so in that sense like we can uh if there
- 00:31:47is any node failure or anything like
- 00:31:48that happens uh so it can go on and uh
- 00:31:52Trace back and using that dag it can
- 00:31:55recompute that node
- 00:31:57so it does have a lineage information
- 00:31:59with it so say you have in this uh in
- 00:32:02this use case only you have the
- 00:32:03requirement to have all the different
- 00:32:05event so say you are saying I am
- 00:32:07triggering a Lambda right and then using
- 00:32:09Lambda I am again triggering the glue
- 00:32:10service right so all this event you want
- 00:32:13to track as part of your audit right
- 00:32:15what is the originating Source what is
- 00:32:17uh what is the transformation source so
- 00:32:19in this scenario like for this
- 00:32:21implementation how you will uh enable
- 00:32:23the auditing part uh I guess cloud trail
- 00:32:26is a
- 00:32:27uh AWS service that can help in auditing
- 00:32:31part uh but that is again on the entire
- 00:32:34infra level right that cloud trail yeah
- 00:32:37okay okay in that case like uh we can
- 00:32:41maybe leverage not AWS glue we can use
- 00:32:45airflow as well we can write uh we can
- 00:32:47create dags in airflow and uh we can
- 00:32:49create tasks and uh we
- 00:32:52can uh
- 00:32:53basically use their uh uh we can
- 00:32:57uh set the dependencies between them and
- 00:33:00based on that like it can leverage act
- 00:33:02as a leanage information in case of any
- 00:33:04failures okay uh so how you are
- 00:33:07currently deploying right your pipeline
- 00:33:09so what what is the is there any devops
- 00:33:11tool that you are using so you are also
- 00:33:13using quite a good Services of AWS right
- 00:33:15so how you deploying this
- 00:33:18infrastructure okay so uh basically uh
- 00:33:22we do have now a new framework with us
- 00:33:25with which is basically uh has been
- 00:33:27created by our team uh in quantify uh so
- 00:33:31to deploy what we are doing is we do
- 00:33:33have a repository get repository so all
- 00:33:35the changes that we are doing uh in
- 00:33:37terms of our uh Transformations and all
- 00:33:40the SQL related changes on in the
- 00:33:42scripts we will create a get uh request
- 00:33:45like basically purle request to merge
- 00:33:47the data so someone will approve that
- 00:33:49and will be able to merge the data and
- 00:33:52post that we do have a Jenkins pipeline
- 00:33:54that we are using for the deployment
- 00:33:56part so Jenkins pipeline is pointing to that
- 00:33:58git repository so if there is any change
- 00:34:00we'll need to again deploy the Jenkins
- 00:34:03it will go on and uh choose the stack
- 00:34:05and whatever the resources it will go on
- 00:34:07and allocate that so this is the cicd
- 00:34:10tool that we are using for now so what
- 00:34:13what stack it is deploying
- 00:34:16like so like it there is a code that has
- 00:34:19been written by the platform team uh to
- 00:34:22uh deploy all the resources that are
- 00:34:25needed like uh and provide all the uh
- 00:34:28permissions and accesses so if in case
- 00:34:31like we are using uh AWS S3 so all the
- 00:34:34resources that are using in
- 00:34:36communicating with S3 they need the
- 00:34:37permission to communicate with S3 so all
- 00:34:40those things there is a terraform code I
- 00:34:42guess like that has been actually taken
- 00:34:44care by the platform team so they are
- 00:34:46the ones who are deploying so it is a
- 00:34:48separate team who is managing the INF
- 00:34:51yes yes okay so uh you are also using
- 00:34:53Lambda right you have used Lambda so you
- 00:34:56already are aware that Lambda is a
- 00:34:57limitation right where it can't have the
- 00:35:00processing more than 15 minute right so
- 00:35:02say you have a use case right where you
- 00:35:04uh wanted to have sequential Lambda
- 00:35:06trigger right so maybe once one Lambda
- 00:35:09completes maybe before the timeout uh it
- 00:35:12will trigger another Lambda somehow and
- 00:35:14the processing would continue from that
- 00:35:16execution state only uh so can you think
- 00:35:19like how can we Implement such a use
- 00:35:21case uh let me think uh so in case we
- 00:35:26need to use Lambda only or processing
- 00:35:29the data that is huge more than expected
- 00:35:31which cannot be completed in 15 minutes
- 00:35:33of time span so uh yeah we can uh
- 00:35:36trigger another Lambda inside a Lambda
- 00:35:39that we can do and uh maybe what we can
- 00:35:42do is uh whatever the processing like
- 00:35:44based on the time limit so let's say
- 00:35:46there are three or four transformation
- 00:35:48that we need to do and as part of one
- 00:35:51Lambda time St like time out uh we can
- 00:35:54just do two Transformations so we can
- 00:35:57read the file we can perform those two
- 00:35:58transformation and we can uh get a
- 00:36:00output file and we can place it again on
- 00:36:02some other S3 Zone uh we can create some
- 00:36:05intermediate Zone in S3 we can do that
- 00:36:08and uh that can be passed as a argument
- 00:36:11in like as a variable in Lambda like the
- 00:36:14another Lambda that will be invoked
- 00:36:15inside this Lambda only so maybe that we
- 00:36:18can do or uh we can uh do one more thing
- 00:36:22instead of like creating the f file we
- 00:36:24can uh dump the data like whatever the
- 00:36:26process processing we have done we can
- 00:36:27dump the data into our red shift and uh
- 00:36:30from that point of state only it will
- 00:36:33fetch the recent data that has been that
- 00:36:35that is there in red shift and it can
- 00:36:36continue working on it okay so say you
- 00:36:40using Athena right so for some ad hoc
- 00:36:42analysis and something right uh so see
- 00:36:44there is uh the currently the definition
- 00:36:47that you have mentioned there is no
- 00:36:48partition right maybe on the next day uh
- 00:36:51the data has been changed right and now
- 00:36:53there is partition so will Athena be able to
- 00:36:55detect the partition automatically or do
- 00:36:57you need to maybe have some commands
- 00:37:00perform by which it will be able to
- 00:37:01identify the
- 00:37:04partitions uh so as far as I remember
- 00:37:07like in Athena we need to give the
- 00:37:08partitions as well because that is how
- 00:37:11the query like it's able to access uh
- 00:37:14give the query result faster because
- 00:37:16it's working on partitions uh so as far
- 00:37:20as remember like I we need to give some
- 00:37:22partitions because based on that create
- 00:37:24an external table without partitions
- 00:37:27will Athena give us an error okay I'm I'm
- 00:37:29not aware on that I'll have to look into
- 00:37:31it actually so does Athena also physically
- 00:37:34load the data because it is also
- 00:37:35maintaining the catalog and meta store
- 00:37:37right so will it like entirely load the
- 00:37:39data and then it will do the
- 00:37:41processing uh no no it's not uh loading
- 00:37:44the data uh it's just referring to the
- 00:37:46data that is present on maybe S3 so that
- 00:37:51not loading the data so can we have an
- 00:37:53Athena service say in one region uh which
- 00:37:56is say maybe doing the analysis on the
- 00:37:59data in some other S3 bucket in another
- 00:38:01region so can we have such a cross cross
- 00:38:03region communication here or is it a
- 00:38:05mandate to have the Athena in the same
- 00:38:07region as the S3
- 00:38:10bucket I'm not I'm not aware on that
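For reference on the Athena partition question above: a plain external table does not pick up new partition folders in S3 automatically; they are usually registered with MSCK REPAIR TABLE (for Hive-style key=value prefixes) or an explicit ALTER TABLE ... ADD PARTITION, unless partition projection is configured. A minimal sketch using boto3, with hypothetical database, table, and result-location names:

```python
import boto3

athena = boto3.client("athena")

def register_new_partitions(database: str = "insurance_raw", table: str = "policies") -> str:
    # MSCK REPAIR TABLE scans the table's S3 location for Hive-style partition
    # folders (e.g. load_date=2024-01-15/) and adds any missing ones to the catalog.
    resp = athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table};",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
    )
    return resp["QueryExecutionId"]
```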
- 00:38:13okay uh so can you please share your
- 00:38:15screen like I have in SLE questions yeah
- 00:38:17let me know if my screen is visible yeah
- 00:38:19yes so yeah your screen is visible can
- 00:38:21you see on the chat maybe so let me
- 00:38:24explain the question first right so here
- 00:38:26will have like the three different
- 00:38:27tables so consider this as for the
- 00:38:29e-commerce data right where we have the
- 00:38:31orders data and then also we have the
- 00:38:33products information right and then we
- 00:38:35have the order detail so order detail is
- 00:38:37like a more granular where you have will
- 00:38:39have the like the uh for a single order
- 00:38:42ID right you will have the different
- 00:38:44products right so order detail is like a
- 00:38:47more granular where you have all the
- 00:38:48product ID information how many quantity
- 00:38:51of that unit you have purchased what is
- 00:38:52the unit price right so this order
- 00:38:55details is at the detail Lev level right
- 00:38:57and this orders table is on a very high
- 00:38:59level like what is the order ID what is
- 00:39:01the order date what is the total amount
- 00:39:03right so that orders table is is the
- 00:39:06Gran is a more at high level of
- 00:39:08granularity right then this product
- 00:39:10table is containing the information
- 00:39:12right uh this is the product ID
- 00:39:13corresponding to this name and this
- 00:39:15category so you you have this three
- 00:39:17different uh tables right now what you
- 00:39:20need to find out in the SQL query is the
- 00:39:22top selling product right so what is the
- 00:39:25so maybe you generating some report for
- 00:39:27some customer or maybe some stakeholders
- 00:39:29right and you want to show them right
- 00:39:31this is the top selling product right
- 00:39:33what whatever we are offered and what is
- 00:39:35the criteria for finding the top selling
- 00:39:38product is basically characterized by
- 00:39:40the highest
- 00:39:41revenue uh what is the criteria again
- 00:39:43for the highest revenue so we have to uh
- 00:39:45do the segregation based on each
- 00:39:47category for the last quarter so say you
- 00:39:50are building a quarterly report right
- 00:39:52and in the quarterly report within each
- 00:39:54category what is the top selling product
- 00:39:57which will be characterized by the
- 00:39:58highest revenue so you need to create a
- 00:40:00SQL query for
- 00:40:03this let me know in case of any question
- 00:40:05yeah
- 00:40:06okay yeah so highest revenue is it based
- 00:40:09on like the total amount which we have
- 00:40:11here so you can ignore this total amount
- 00:40:14right I think it might not be accurate
- 00:40:15so consider this the order details and
- 00:40:18the quantity and unit price for the for
- 00:40:20the revenue okay okay so uh firstly what
- 00:40:26I'm thinking firstly we need the data
- 00:40:29just for the last quarter so uh let's
- 00:40:31say last three months uh okay so that we
- 00:40:34can filter out based on the order date
- 00:40:36okay uh just writing some approach here
- 00:40:41so last 3 months data we can filter out
- 00:40:45from uh order
- 00:40:47table now uh we need to join orders and
- 00:40:51Order detail together okay to get the um
- 00:40:56we can join these two based on order
- 00:41:00ID and we can get quantity and uh unit
- 00:41:04price okay and we'll try we'll multiply
- 00:41:08uh the number of units that we have with
- 00:41:10the unit price so to get the total
- 00:41:13amount if we are ignoring uh this
- 00:41:15particular total amount here so we need
- 00:41:18to multiply this
- 00:41:20quantity uh with unit price and getting
- 00:41:24the total
- 00:41:25amount
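A sketch of the step just described, under the assumed names above: restrict orders to roughly the last quarter and derive per-line revenue as quantity × unit_price. The date filter is written in PostgreSQL-style interval syntax, which varies by engine:

```sql
-- Line-item revenue for orders placed in (approximately) the last quarter
SELECT
    o.order_id,
    od.product_id,
    od.quantity * od.unit_price AS line_revenue
FROM orders o
JOIN order_details od
    ON od.order_id = o.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months';
```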
- 00:41:26now what uh we can do is now if we have
- 00:41:31the total amount here and
- 00:41:34uh we have the order ID okay based on
- 00:41:37the order ID here okay uh order ID so
- 00:41:44what type of join will you perform in
- 00:41:45these two
- 00:41:47tables we need all uh like all the
- 00:41:50orders then we can uh use a left
- 00:41:53join I'm assuming like it will have
- 00:41:56information for all the orders because
- 00:41:58it's on the granular level of orders so
- 00:42:02even if we do an inner join I don't think
- 00:42:04there will be any loss of data
- 00:42:07okay yeah okay so now I have my total
- 00:42:12amount I have filtered order date uh now
- 00:42:16I need to group my data um group my data
- 00:42:19based on product and get the sum of the
- 00:42:22total amount get the revenue basically
- 00:42:28and within each category for the last
- 00:42:30quarter so category okay now category
- 00:42:33also we need so we need to uh basically
- 00:42:37join this
- 00:42:39product like the product table as well and
- 00:42:42based on product ID and we need uh
- 00:42:46category from here we need category from
- 00:42:48here now we have all the data we can
- 00:42:51group it on um product
- 00:42:54ID but is product ID unique right uh okay
- 00:42:58so the product ID is unique so it will
- 00:43:01automatically have different categories
- 00:43:03right uh we are considering these uh
- 00:43:06different products
- 00:43:08correct yeah okay so we can just group
- 00:43:12it on product
- 00:43:14ID and we can take the total sum of uh
- 00:43:20like sum of this
- 00:43:22amount uh total amount uh so can you
- 00:43:26explain like why we are doing the
- 00:43:27grouping on the product ID uh so we need
- 00:43:30top selling product so top selling
- 00:43:32product will be the one that has
- 00:43:34generated the highest revenue so if I
- 00:43:38want the top selling product I have this
- 00:43:40product ID table and I have this product
- 00:43:43name and category considering like this is the
- 00:43:45unique one so if I am doing the uh group
- 00:43:48or grouping on product ID so whatever
- 00:43:50the orders are there for this particular
- 00:43:52product ID 101 I am uh basically
- 00:43:56adding all the revenues generated from
- 00:43:58all the orders for this particular
- 00:43:59product okay and I'm getting the total
- 00:44:02revenue for that particular product
- 00:44:04likewise I am doing it for all the
- 00:44:06products now so if I'm grouping it on
- 00:44:08product ID I will have all the products
- 00:44:11and all the revenue generated from uh
- 00:44:14like uh from all the orders for these
- 00:44:17products uh so now what you have in your
- 00:44:20select query in this
- 00:44:22case okay so in my select query
- 00:44:27uh in my select query I will have these
- 00:44:30two uh quantity and unit price and
- 00:44:35uh I will I can uh like I can use
- 00:44:39product ID only because I'm grouping it
- 00:44:41by product ID but like in the question
- 00:44:44we have can you read out the question
- 00:44:46again what would be the top the top
- 00:44:49selling product yeah okay so we need
- 00:44:52product name and product
- 00:44:54category what will uh what is the like
- 00:44:57all the columns we need in the output
- 00:44:59can you uh just so we would need the
- 00:45:01category and then we would need the
- 00:45:03product name and the total revenue
- 00:45:05highest revenue that is
- 00:45:07there
- 00:45:09category product
- 00:45:11name and uh Revenue
- 00:45:15okay okay so can I uh like uh
- 00:45:18instead of grouping it on product ID I
- 00:45:20can group it on product name and
- 00:45:22category also okay because uh the
- 00:45:25combination of these is unique okay and
- 00:45:28uh in the select query also like I can
- 00:45:30use the same
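Extending the sketch with the join to products and the grouping being described here, summing revenue per product name and category over the last quarter (same assumed names and date filter as before):

```sql
-- Total revenue per product within each category for the last quarter
SELECT
    p.category,
    p.product_name,
    SUM(od.quantity * od.unit_price) AS revenue
FROM orders o
JOIN order_details od ON od.order_id  = o.order_id
JOIN products      p  ON p.product_id = od.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY p.category, p.product_name;
```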
- 00:45:32okay will it give the final result because I need
- 00:45:35the... no because I just need the uh top
- 00:45:39selling product so maybe uh I need to
- 00:45:43order
- 00:45:44it uh by this Revenue okay in descending
- 00:45:49order so now I have uh all the highest
- 00:45:52revenue first and then the lowest one in
- 00:45:56the uh bottom uh then I can just like
- 00:46:00maybe I'm thinking to limit it to one
- 00:46:02then it will give me the just one output
- 00:46:04with the
- 00:46:05highest but we require for each category
- 00:46:08uh we just we don't need a single row we
- 00:46:11need for each category the top selling
- 00:46:13product okay okay okay okay so in that
- 00:46:16case like uh we'll have to go with the
- 00:46:18uh window function
- 00:46:20here so uh yeah so what would be your window
- 00:46:24condition huh uh so in partition by uh we
- 00:46:30will be writing this product uh name and
- 00:46:33category uh sorry uh like yeah
- 00:46:37um these two or maybe just the category
- 00:46:39one we can keep in partition by so we
- 00:46:42can partition the data based on all the
- 00:46:44categories and uh we can order the data
- 00:46:48on revenue and uh like this Revenue in
- 00:46:52descending order and then uh we can just
- 00:46:55uh ask it to maybe uh give it a name and
- 00:47:00uh we can just take the data or like we
- 00:47:03can filter it out post that we can just
- 00:47:05uh uh something like that like if I am
- 00:47:09using just a minute
- 00:47:13so I'm ranking my data let's say name
- 00:47:16and category I already have this as
- 00:47:21revenue
- 00:47:23descending and you can just give it
- 00:47:28some rank and later on like
- 00:47:31uh or maybe use qualify for this but
- 00:47:35that is only supported in some SQL dialects not
- 00:47:38in all databases yeah yeah yes yes uh so
- 00:47:42in that case like uh alternatively like
- 00:47:45you can keep it as a subquery and later
- 00:47:48on we can just use rank equals one so for every
- 00:47:53category whatever the highest revenue is we are just selecting that
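One way to write the subquery-plus-rank approach being described, using the assumed names from earlier. In dialects that support QUALIFY (for example Snowflake or BigQuery), the outer query could be replaced with a QUALIFY clause; note also that RANK() keeps more than one row per category when revenues tie, which is exactly the case discussed next:

```sql
-- Top-selling product per category: rank by revenue within each category,
-- then keep only rank 1 in the outer query
SELECT category, product_name, revenue
FROM (
    SELECT
        p.category,
        p.product_name,
        SUM(od.quantity * od.unit_price) AS revenue,
        RANK() OVER (
            PARTITION BY p.category
            ORDER BY SUM(od.quantity * od.unit_price) DESC
        ) AS rnk
    FROM orders o
    JOIN order_details od ON od.order_id  = o.order_id
    JOIN products      p  ON p.product_id = od.product_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months'
    GROUP BY p.category, p.product_name
) ranked
WHERE rnk = 1;
```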
- 00:47:55uh so can
- 00:47:57you go to the product table you scroll
- 00:47:59to product table right so in the product
- 00:48:02table right say you have a category called
- 00:48:03business card right so you have
- 00:48:05different product names A and B A and D
- 00:48:08sorry right so say uh as per this data
- 00:48:11I'm just assuming right so maybe you
- 00:48:13have A as the highest selling
- 00:48:14product so you don't need two separate
- 00:48:17rows for business card just a single row
- 00:48:18for business card and A and the highest
- 00:48:20revenue so will your SQL query give me
- 00:48:23the single Row for a single category
- 00:48:26no in that case I don't uh I was
- 00:48:28assuming that the product name and
- 00:48:30category both are different both will be
- 00:48:32unique now if uh this is the case like
- 00:48:36then I can group it on category only
- 00:48:38then uh it will sum up all the revenues
- 00:48:41uh from the C and D product name and if
- 00:48:47that is the highest like whatever the
- 00:48:49highest one I guess uh it will give me
- 00:48:52the highest one so in
- 00:48:54case both the different product have the
- 00:48:56same Revenue so what would be the output
- 00:48:59in the
- 00:49:01rank uh it will give me the same like
- 00:49:03both the uh because uh okay so in case
- 00:49:08if uh okay uh because I am using
- 00:49:11Rank and uh if
- 00:49:15it's if it's the same it will have the
- 00:49:18same uh rank only so will it give me
- 00:49:22like the two different Row in the
- 00:49:24output yes uh yeah I guess so okay uh
- 00:49:29because from the next one it will uh
- 00:49:33like skip a rank if you're using the rank function
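To make the tie behaviour concrete, a small illustration over a hypothetical aggregated result; product_revenue is a stand-in name for the grouped output sketched earlier, with two products tied on revenue in the same category:

```sql
-- With revenues 500 (A), 500 (D), 300 (B) in one category:
--   RANK()       -> 1, 1, 3   (ties share a rank, the next rank is skipped)
--   DENSE_RANK() -> 1, 1, 2   (ties share a rank, no gap)
--   ROW_NUMBER() -> 1, 2, 3   (always distinct; arbitrary among ties unless a tie-breaker is added)
SELECT
    product_name,
    revenue,
    RANK()       OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dense_rnk,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
FROM product_revenue;
```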
- 00:49:36so say we want to add another
- 00:49:39clause right say we want to add another
- 00:49:40clause here uh say we just need a single
- 00:49:44row right for the category and we can
- 00:49:46add another condition so in that case
- 00:49:48when there is a same revenue for
- 00:49:50different products uh we can go ahead
- 00:49:52with the alphabetical ordering for the
- 00:49:54product name so whatever uh product
- 00:49:56name comes first as per the alphabetical
- 00:49:58ordering we want to include that row so
- 00:50:01how you will include this condition in
- 00:50:02your
- 00:50:04query if I want to include uh only
- 00:50:08product name that
- 00:50:10comes uh okay only in this case if A and
- 00:50:13C has like the same revenue uh like A
- 00:50:17and D has the same revenue based on the
- 00:50:19uh category business card uh I just want
- 00:50:21to display A right uh okay understood that uh
- 00:50:28I can compare uh the product name and uh
- 00:50:32because product name will have a and c
- 00:50:34and which one uh like either one is
- 00:50:37smaller it will uh it should give me the
- 00:50:40smaller one only how will you compare
- 00:50:43these two how will
- 00:50:45I okay so now I have two columns what I
- 00:50:50can do is I can uh use a row number or
- 00:50:53uh row number function based on the
- 00:50:55category and I can order it on product
- 00:50:59name and uh like product name uh
- 00:51:02ascending order if uh what I like I'm
- 00:51:05just think uh so if I am grouping it
- 00:51:09again on
- 00:51:11category
- 00:51:13um yeah so if I'm grouping it again on
- 00:51:16category and uh I am just
- 00:51:18ordering by uh my product
- 00:51:22name
- 00:51:24uh
- 00:51:26so I am just giving
- 00:51:30this just a
- 00:51:34minute if I'm using maybe the row number
- 00:51:38function and I'm partitioning by
- 00:51:41category and I'm ordering by product
- 00:51:44name so whatever the ascending order
- 00:51:46will be it will have uh row number as one
- 00:51:50and uh the second one like D will have
- 00:51:52row number as two uh then I can filter
- 00:51:55it out based on row number here as well
- 00:51:57like uh row number one so you will apply
- 00:51:59this condition after all the above logic
- 00:52:02right so you will add another window
- 00:52:04function uh to incorporate this
- 00:52:07logic okay so order by revenue in
- 00:52:10descending then I can uh here only
- 00:52:14um I can
- 00:52:16use this condition like I can order it
- 00:52:20by like firstly it will order it by the
- 00:52:22revenue and then uh it will order it uh
- 00:52:25on product name okay again in category
- 00:52:28itself okay so we won't need this second
- 00:52:31window function will this work the
- 00:52:33condition no
- 00:52:35no yeah I guess uh yeah I'm thinking
- 00:52:39like it should work
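Putting the whole discussion together, a final sketch under the same assumptions: a single ROW_NUMBER() window whose ORDER BY sorts by revenue descending first and product_name ascending as the alphabetical tie-breaker, so exactly one row survives per category:

```sql
-- Final sketch: top-selling product per category for the last quarter,
-- with revenue ties broken alphabetically by product name
SELECT category, product_name, revenue
FROM (
    SELECT
        p.category,
        p.product_name,
        SUM(od.quantity * od.unit_price) AS revenue,
        ROW_NUMBER() OVER (
            PARTITION BY p.category
            ORDER BY SUM(od.quantity * od.unit_price) DESC,
                     p.product_name ASC
        ) AS rn
    FROM orders o
    JOIN order_details od ON od.order_id  = o.order_id
    JOIN products      p  ON p.product_id = od.product_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 months'
    GROUP BY p.category, p.product_name
) ranked
WHERE rn = 1;
```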
- 00:52:41okay yeah yeah so I'm good from my
- 00:52:45side like do you have any questions for
- 00:52:47me yeah so uh
- 00:52:49based like do you have any feedback or
- 00:52:53anything yeah so you were very good at
- 00:52:55the SQL right so the
- 00:52:57thinking yeah so thinking uh you are
- 00:53:00able to think logically right so you
- 00:53:02understand the data uh the only thing
- 00:53:04that is important is the Assumption so
- 00:53:06don't assume anything you have the data
- 00:53:08set right the products uh it's there the
- 00:53:10different category and the different
- 00:53:11product name right so for the same
- 00:53:13category we have two different product
- 00:53:15names so it's not unique right so always
- 00:53:17if you are making an assumption just
- 00:53:19clarify it from the interviewer if the
- 00:53:21Assumption you are taking is right or
- 00:53:22wrong so that is one part otherwise your
- 00:53:24logic part is very good you are able to
- 00:53:27Think Through the problem and able to
- 00:53:30demonstrate your SQL capabilities right
- 00:53:33uh on the engineering side right I think
- 00:53:35uh main other thing you need to also
- 00:53:38focus is on the data security part so
- 00:53:40like how because when we are using
- 00:53:43cloud right cloud is not always secure
- 00:53:45so in cloud we have to make our services
- 00:53:47secure so what are the different things
- 00:53:49that we can incorporate like uh the
- 00:53:51question that I asked right how you will
- 00:53:53make your maybe S3 bucket secure or
- 00:53:55maybe the PII data so all those things
- 00:53:57there are different things that we can
- 00:53:59implement so there is like encryption
- 00:54:01then also we have the server side logging
- 00:54:03then also we have the different IAM roles
- 00:54:05so we can leverage IAM roles and policies
- 00:54:08uh and also we have to follow the least
- 00:54:10privilege access policies so sometimes I
- 00:54:12have seen different codes where people
- 00:54:14just put star like give all the
- 00:54:16permissions for different services so
- 00:54:18uh AWS doesn't recommend this so
- 00:54:20always have the least privilege possible
- 00:54:22in the permissions and other
- 00:54:24stuff uh so that is again very much
- 00:54:26important and uh so yeah I find like you are
- 00:54:30good with the designing and other things
- 00:54:32but security is again another thing that
- 00:54:34we should also focus on as a data engineer
- 00:54:36so yeah that is yes definitely I'll look
- 00:54:39into it yeah and just maybe a little bit
- 00:54:42more on the uh different services so you
- 00:54:43are also good with the DynamoDB and
- 00:54:45ElasticSearch uh why we are making
- 00:54:48use of the different services so there can be a
- 00:54:50question like why we are replicating the
- 00:54:51data between the two different databases
- 00:54:53but you are clear on your use case like
- 00:54:55why there is a need so it's always
- 00:54:57use case and implementation specific and you
- 00:54:59are clear with that point right so I
- 00:55:01think I'm good yeah just focus on those
- 00:55:04little uh little details yeah on the
- 00:55:06infrastructure part I think otherwise
- 00:55:11it's
- AWS
- Data Engineering
- ETL
- AWS Lambda
- Data Security
- DynamoDB
- ElasticSearch
- Data Reconciliation
- SQL
- S3 Partitioning