AWS Data Engineering Interview

00:55:12
https://www.youtube.com/watch?v=gCuAzrU5-2w

Summary

TL;DR: ASA, with five years of experience, works primarily on data engineering with AWS services. She has worked in both the banking and insurance sectors, applying skills in ETL, data migration, and AWS services such as Redshift, S3, and Lambda. The interview centers on an insurance data use case in which ASA used a range of AWS tools to manage data pipelines, from data migration with DMS to data warehousing in Redshift. The discussion covers date-based partitioning in S3, schema management, reconciliation, and security for PII data. ASA also explains orchestrating tasks with AWS Step Functions, logging and notifications with SNS, and handling schema changes and data security with Python scripts. The conversation touches on her SQL skills, how she generates reports, and how she uses databases like DynamoDB and Elasticsearch for different access patterns. The use case highlights ASA's ability to deliver data solutions efficiently while ensuring data validity, security, and performance.

Takeaways

  • πŸ‘¨β€πŸ’» ASA has 5 years of experience in data engineering.
  • πŸ“Š Skilled in AWS technologies like Redshift, S3, and Lambda.
  • πŸ”„ Works on transforming and migrating data in insurance use cases.
  • πŸ”’ Implements security measures for PII data using AWS services.
  • πŸ“š Utilizes ETL processes and data warehousing effectively.
  • 🀝 Uses SNS for notification and error tracking in data pipelines.
  • πŸ’‘ Implements schema drift solutions using custom scripts and Python.
  • πŸ“ˆ Proficient in using SQL for querying and managing data.
  • πŸ’» Combines use of DynamoDB and ElasticSearch for diverse data needs.
  • πŸ” Ensures data lineage and auditing through AWS and DAGs.

Timeline

  • 00:00:00 - 00:05:00

    ASA introduces herself, highlighting five years of experience in data-centric roles, focusing on ETL projects, AWS services, data modeling, and Python/SQL scripting. She describes a project involving insurance data, detailing the data transfer from Oracle to AWS Redshift and the data warehousing process.

  • 00:05:00 - 00:10:00

    ASA elaborates on the architecture of her insurance data project. Data from Oracle is migrated to AWS S3 and then to Redshift for data warehousing. She discusses using AWS services like Redshift Spectrum and mentions employing CDC to handle transactional data from Oracle.

  • 00:10:00 - 00:15:00

    The discussion covers data load management using DMS and CDC, along with the retention policies for logs and raw data stored in S3. ASA explains how data is captured incrementally, with raw data stored in S3 and moved to archival storage after 90 days.
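
    The 90-day archival step described here maps naturally onto an S3 lifecycle rule. A minimal boto3 sketch, with a hypothetical bucket name and prefix (not taken from the interview):

      import boto3

      s3 = boto3.client("s3")

      # Hypothetical bucket/prefix standing in for the raw zone.
      s3.put_bucket_lifecycle_configuration(
          Bucket="insurance-raw-zone",
          LifecycleConfiguration={
              "Rules": [
                  {
                      "ID": "archive-raw-after-90-days",
                      "Filter": {"Prefix": "raw/"},
                      "Status": "Enabled",
                      # Move raw CDC files to an archival storage class after 90 days.
                      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                  }
              ]
          },
      )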

  • 00:15:00 - 00:20:00

    ASA explains partition management in S3, ensuring daily updates are retained while past data can be reconstructed if needed. She also describes reconciliation processes and SNS notification mechanisms for data validation and process integrity.

  • 00:20:00 - 00:25:00

    The talk continues with data verification using SNS triggers and SQL queries for KPI validation. ASA describes daily job scheduling, how data discrepancies are handled, and her team's method for keeping consumers informed about potential process failures.

  • 00:25:00 - 00:30:00

    ASA outlines the use of SNS notifications to communicate process statuses to consumers. She also explains securing S3 bucket data using IAM and ACLs, and the importance of abstraction layers and data masking when dealing with PII in e-commerce data pipelines.
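
    As one example of the bucket-level controls mentioned here, a hedged boto3 sketch that attaches a policy denying non-TLS access (the bucket name is hypothetical; the real setup would also rely on IAM role policies and ACLs):

      import json
      import boto3

      s3 = boto3.client("s3")
      bucket = "insurance-curated-zone"  # hypothetical bucket name

      policy = {
          "Version": "2012-10-17",
          "Statement": [
              {
                  # Deny any request that is not made over TLS.
                  "Sid": "DenyInsecureTransport",
                  "Effect": "Deny",
                  "Principal": "*",
                  "Action": "s3:*",
                  "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                  "Condition": {"Bool": {"aws:SecureTransport": "false"}},
              }
          ],
      }
      s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))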

  • 00:30:00 - 00:35:00

    ASA discusses securely handling PII data by creating different schema access levels and using Python to separate PII from non-PII data in e-commerce settings. She explains transforming CSV logs to Parquet using Python for optimized storage and analytics.
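
    A rough illustration of the chunked CSV-to-Parquet conversion she describes, using pandas and pyarrow; the paths are local stand-ins for the S3 raw and aggregate zones:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      source_csv = "raw_zone/transactions_2024-01-01.csv"               # assumed path
      target_parquet = "aggregate_zone/transactions_2024-01-01.parquet"  # assumed path

      writer = None
      # Read the CSV in chunks so very large daily log files fit in memory.
      # Reading everything as strings keeps the schema identical across chunks.
      for chunk in pd.read_csv(source_csv, chunksize=100_000, dtype=str):
          table = pa.Table.from_pandas(chunk, preserve_index=False)
          if writer is None:
              writer = pq.ParquetWriter(target_parquet, table.schema, compression="snappy")
          writer.write_table(table)
      if writer is not None:
          writer.close()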

  • 00:35:00 - 00:40:00

    Handling schema evolution is explained through set comparisons between the source and the Redshift target schemas. ASA mentions the potential use of Glue for schema management and outlines data pipeline transformations and cataloging options such as AWS Glue and EMR.

  • 00:40:00 - 00:45:00

    The focus shifts to using AWS Glue and EMR for ETL processes, considering instance selection and use-case-based EMR configurations. ASA suggests EMR Serverless differs from Glue mainly in catalog features and explains using the Data Catalog for schema tracking.
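
    For the sizing discussion, a hypothetical boto3 call showing where Glue worker type and count are declared when a job is created (the job name, role ARN, and script path are made up):

      import boto3

      glue = boto3.client("glue")

      glue.create_job(
          Name="insurance-cdc-transform",
          Role="arn:aws:iam::123456789012:role/GlueJobRole",
          Command={
              "Name": "glueetl",
              "ScriptLocation": "s3://insurance-etl-scripts/cdc_transform.py",
              "PythonVersion": "3",
          },
          GlueVersion="4.0",
          # Size the serverless workers to the expected data volume instead of
          # hand-picking EMR instance types.
          WorkerType="G.1X",
          NumberOfWorkers=10,
      )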

  • 00:45:00 - 00:50:00

    A discussion of DynamoDB and Elasticsearch explores their use for storing analytics data. ASA explains scenarios where Elasticsearch offers advantages such as complex queries and full-text search, contrasting this with DynamoDB's high throughput and overwrite-on-update behavior.

  • 00:50:00 - 00:55:12

    ASA discusses designing audits and logs for ad hoc data pipelines in S3 using AWS services like Lambda for triggering and CloudWatch for monitoring. She explains data lineage tracking through RDDs in Spark and describes using services like CloudTrail for infrastructure-level auditing.



Video Q&A

  • What AWS services were discussed in the interview?

    The interview discussed services like AWS DMS, Redshift, S3, Lambda, Glue, DynamoDB, and Elasticsearch.

  • What programming languages does ASA primarily use?

    ASA primarily uses Python and SQL.

  • What is the role of SNS notifications in ASA's data pipeline?

    SNS notifications are used to trigger alerts based on job success or failure, and to notify users when there are discrepancies or issues.
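
    A minimal boto3 sketch of the kind of SNS publish such a pipeline might make; the topic ARN, region, and message fields are assumptions rather than details from the interview:

      import boto3

      sns = boto3.client("sns")

      def notify(job_name: str, status: str, detail: str) -> None:
          """Publish a job status message to an SNS topic that consumers subscribe to."""
          sns.publish(
              TopicArn="arn:aws:sns:ap-south-1:123456789012:etl-job-status",  # assumed ARN
              Subject=f"{job_name} {status}",
              Message=f"Job: {job_name}\nStatus: {status}\nDetail: {detail}",
          )

      # Example: alert consumers when reconciliation finds a KPI mismatch.
      notify("daily-recon", "FAILED", "policy_count differs between source and Redshift")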

  • How does ASA handle schema changes in the data pipeline?

    ASA uses Python to manage schema changes by comparing columns in source and target, and sends notifications to users when new columns are detected.
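
    A simplified Python version of the set-difference check described above; the file path, table name, and target column list are illustrative (the target columns would really be read from the Redshift catalog):

      import pandas as pd

      def detect_new_columns(source_df: pd.DataFrame, target_columns: set) -> list:
          """Columns present in the incoming file but missing from the Redshift target."""
          return sorted(set(source_df.columns) - target_columns)

      def alter_statements(table: str, new_columns: list) -> list:
          # New columns are added as VARCHAR first; types are tightened manually later.
          return [f"ALTER TABLE {table} ADD COLUMN {col} VARCHAR(256);" for col in new_columns]

      df = pd.read_csv("raw_zone/policies_2024-01-01.csv")               # hypothetical file
      target_cols = {"policy_id", "premium_amount", "effective_date"}    # from information_schema
      for stmt in alter_statements("ods.policies", detect_new_columns(df, target_cols)):
          print(stmt)  # in the real pipeline these would be executed and an SNS alert sent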

  • What approach does ASA use for data partitioning in S3?

    ASA partitions data based on date into S3 folders for organized and timely access.
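
    An illustrative sketch of that date-partitioned layout plus the Redshift Spectrum partition registration it enables (bucket, table, and file names are hypothetical):

      from datetime import date

      import boto3

      s3 = boto3.client("s3")
      today = date.today().isoformat()
      bucket = "insurance-raw-zone"  # assumed bucket

      # One folder per load date; the local file here is just a placeholder.
      key = f"raw/policies/load_date={today}/part-0000.csv"
      s3.upload_file("policies.csv", bucket, key)

      # Register only today's folder on the external table so the downstream
      # stored procedure reads just the latest day's data.
      add_partition_sql = f"""
      ALTER TABLE spectrum.policies
      ADD IF NOT EXISTS PARTITION (load_date = '{today}')
      LOCATION 's3://{bucket}/raw/policies/load_date={today}/';
      """
      print(add_partition_sql)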

  • How is reconciliation handled in ASA's data process?

    Reconciliation is done after production load by matching KPIs and sending notifications if there are discrepancies.
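
    A small Python sketch of the KPI comparison idea; in practice both dictionaries would be filled from SQL queries against the source and the Redshift facts, and a mismatch would publish an SNS notification:

      def reconcile(source_kpis: dict, target_kpis: dict, tolerance: float = 0.0) -> dict:
          """Return the KPIs (e.g. policy_count, premium_amount) that differ."""
          mismatches = {}
          for kpi, src_value in source_kpis.items():
              tgt_value = target_kpis.get(kpi)
              if tgt_value is None or abs(src_value - tgt_value) > tolerance:
                  mismatches[kpi] = (src_value, tgt_value)
          return mismatches

      # Illustrative values only.
      source = {"policy_count": 10_240, "premium_amount": 1_845_300.50}
      target = {"policy_count": 10_240, "premium_amount": 1_845_112.00}
      diff = reconcile(source, target)
      if diff:
          print("KPI mismatch:", diff)  # here the pipeline would send the SNS alert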

  • What kind of triggers are used to invoke AWS Lambda?

    Lambdas are invoked using put events from S3 when new data is uploaded.
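
    A minimal Lambda handler for an S3 ObjectCreated (put) trigger; the downstream Glue job name and argument key are assumptions about what the processing step might look like:

      import boto3

      glue = boto3.client("glue")

      def handler(event, context):
          """Entry point wired to S3 put-event notifications; starts processing per object."""
          for record in event.get("Records", []):
              bucket = record["s3"]["bucket"]["name"]
              key = record["s3"]["object"]["key"]
              glue.start_job_run(
                  JobName="process-adhoc-drop",                        # hypothetical job
                  Arguments={"--input_path": f"s3://{bucket}/{key}"},
              )
          return {"status": "started"}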

  • How is data security handled for PII data?

    Data security includes using abstraction layers, data masking, and separating PII from non-PII data into different storage locations with restricted access.
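
    A hedged pandas sketch of splitting PII and non-PII columns into restricted and general locations; the column names and prefixes are hypothetical, and writing straight to s3:// paths assumes s3fs is installed:

      import pandas as pd

      PII_COLUMNS = ["customer_name", "email", "phone", "address"]  # assumed classification

      df = pd.read_csv("raw_zone/customers_2024-01-01.csv")
      pii_present = [c for c in PII_COLUMNS if c in df.columns]

      # Keep an ID in both files so the halves can be re-joined by approved roles.
      pii_df = df[["customer_id"] + pii_present]
      general_df = df.drop(columns=pii_present)

      # Restricted prefix readable only by approved roles; general prefix open to analysts.
      pii_df.to_parquet("s3://customer-data/restricted/customers_2024-01-01.parquet")
      general_df.to_parquet("s3://customer-data/general/customers_2024-01-01.parquet")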

  • What kind of database management techniques does ASA use with DynamoDB and Elasticsearch?

    ASA uses DynamoDB for high throughput and Elasticsearch for complex queries, full-text search, and real-time analytics.
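
    A rough sketch contrasting the two write paths described in the answer; the table, index, and endpoint names are made up, and the Elasticsearch keyword argument differs between client versions:

      import boto3
      from elasticsearch import Elasticsearch

      # DynamoDB: writing an item with an existing partition/sort key simply
      # overwrites it, which suits high-throughput key-based access.
      table = boto3.resource("dynamodb").Table("customer-mandates")  # hypothetical table
      table.put_item(
          Item={
              "customer_id": "C-1001",   # partition key
              "mandate_id": "M-77",      # sort key
              "premium_amount": 1250,
              "status": "ACTIVE",
          }
      )

      # Elasticsearch keeps every indexed document searchable with full-text and
      # aggregation queries; index/id control updates. (The keyword is `document`
      # in the 8.x client, `body` in 7.x.)
      es = Elasticsearch("http://localhost:9200")  # assumed endpoint
      es.index(
          index="customer-mandates",
          id="C-1001-M-77",
          document={"customer_id": "C-1001", "status": "ACTIVE"},
      )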

  • How does ASA ensure data lineage and logging in their workflows?

    Data lineage is tracked through Spark RDD execution plans (DAGs), while AWS services like CloudTrail provide infrastructure-level audit logs for operational visibility.

Subtitles
  • 00:00:05
    connection established okay so hi AA
  • 00:00:09
    welcome to this round right so can you
  • 00:00:11
    please introduce yourself and maybe the
  • 00:00:13
    project that you have worked upon so
  • 00:00:15
    yeah you can start yeah okay yeah thank
  • 00:00:18
    you Nisha for giving me this opportunity
  • 00:00:20
    to interview with you I'll start with my
  • 00:00:22
    brief introduction so myself ASA I have
  • 00:00:25
    close to five years of experience
  • 00:00:26
    working in data now I have worked on
  • 00:00:28
    banking as well as insurance to I
  • 00:00:30
    started my journey with exential there I
  • 00:00:32
    was working in ETL projects ETL data
  • 00:00:34
    migration and implementation project and
  • 00:00:36
    in quantify I was working in creating
  • 00:00:38
    Inn Data Solutions using uh various AWS
  • 00:00:41
    services like U red shift S3 uh DMS and
  • 00:00:45
    I do have a good understanding on data
  • 00:00:47
    modeling Concepts and data warehousing
  • 00:00:49
    other than that like my go-to
  • 00:00:51
    programming languages Python and SQL
  • 00:00:53
    scripting so that's a brief about me so
  • 00:00:56
    can you also mention like the
  • 00:00:58
    architecture like what was the source
  • 00:00:59
    location and where you were putting the
  • 00:01:01
    data what type of Transformations you
  • 00:01:02
    are doing so just to give an
  • 00:01:05
    idea yeah okay so uh in one of my use
  • 00:01:09
    case I'll discuss about one of my use
  • 00:01:10
    case that was related to Insurance data
  • 00:01:12
    so basically uh my architecture was like
  • 00:01:15
    uh Source was the Oracle database where
  • 00:01:16
    all the transaction were uh transactions
  • 00:01:18
    were happening so we were using DMS
  • 00:01:20
    service to get the data migrated to the
  • 00:01:23
    uh S3 storage where we were keeping all
  • 00:01:25
    the raw data then we were using uh like
  • 00:01:29
    um uh we are leveraging uh red shift for
  • 00:01:32
    our data warehousing solution where we
  • 00:01:34
    have all the data models like Dimensions
  • 00:01:36
    facts as well as there is one layer to
  • 00:01:38
    it there is ODS layer which is basically
  • 00:01:40
    we are keeping a exact replica of what
  • 00:01:42
    data is there on the source so that it
  • 00:01:45
    can act as a source of Truth for us when
  • 00:01:46
    we are starting modeling our data so we
  • 00:01:49
    can uh basically leverage that and uh
  • 00:01:51
    that can uh act as a source for us uh in
  • 00:01:55
    between we are uh basically creating
  • 00:01:56
    stored procedures and external tables using
  • 00:01:59
    Redshift and its Spectrum capabilities
  • 00:02:01
    to get the CDC data incremental data and
  • 00:02:03
    we have written the code in SQL so there
  • 00:02:05
    is a couple of uh like merge statements
  • 00:02:07
    and other uh quality related checks that
  • 00:02:09
    we are doing so this was end to end
  • 00:02:11
    architecture and we were dumping the
  • 00:02:12
    data into our dimensions and uh dims and
  • 00:02:15
    facts basically which is there in red
  • 00:02:17
    shift which was uh later on consumed by
  • 00:02:20
    the client as per the
  • 00:02:22
    requirement so is uh the DMS that you
  • 00:02:25
    are using right so is how it is
  • 00:02:27
    basically how much data load that you
  • 00:02:29
    have on the Oracle system like what is
  • 00:02:32
    the what are the different tables and
  • 00:02:33
    what is the data load that it is
  • 00:02:35
    handling yeah so there are a lot of
  • 00:02:37
    different tables uh in Insurance domain
  • 00:02:40
    there are different tables related to
  • 00:02:42
    like uh based on different granularities
  • 00:02:43
    policy related risk there are coverages
  • 00:02:46
    there are claim related data so there
  • 00:02:48
    are uh like a lot of uh Source data so
  • 00:02:53
    we have enabled CDC basically we are
  • 00:02:54
    just getting the incremental data like
  • 00:02:56
    the change that is happening what DMS is
  • 00:02:58
    doing we have enabled the CDC in our
  • 00:03:00
    source site so whenever there is a
  • 00:03:02
    transaction it will go on and uh
  • 00:03:04
    basically uh write it in a log uh what
  • 00:03:07
    we call write-ahead logs so whenever
  • 00:03:09
    the transaction is done that log will be
  • 00:03:11
    captured there and DMS is taking all the
  • 00:03:14
    logs from the client side and it will
  • 00:03:16
    just uh migrate it to the S3 it will
  • 00:03:19
    have the raw files as well like all the
  • 00:03:21
    raw files we'll be receiving so since we
  • 00:03:24
    are just dealing with the change data
  • 00:03:25
    capture it's not uh in a like huge
  • 00:03:28
    volume but when we are doing history
  • 00:03:30
    load then uh yeah it is taking a lot of
  • 00:03:33
    time DM is taking a lot of time to get
  • 00:03:35
    the data from the source so sometimes it
  • 00:03:38
    also take like more than two hours so
  • 00:03:40
    like the row data right as I understand
  • 00:03:42
    it is the exact replica of your Oracle
  • 00:03:44
    system right so do you have set any
  • 00:03:46
    retention at that part or will it be
  • 00:03:48
    always in the S3 standard
  • 00:03:50
    layer uh so like uh firstly the logs
  • 00:03:53
    there is a retention period like
  • 00:03:55
    previously there was a retention period
  • 00:03:56
    of 2 to 3 days but if there is any
  • 00:03:58
    failure which has not been resolved in 2
  • 00:04:00
    3 days so uh then we have changed it to
  • 00:04:02
    7 days so logs will be there uh till
  • 00:04:05
    like 7 Days retention in S3 we what we
  • 00:04:08
    are doing is like after 90 days like we
  • 00:04:10
    are moving it to uh other basically we
  • 00:04:12
    are archiving that
  • 00:04:14
    data so retention so say you have a
  • 00:04:18
    requirement right where you don't
  • 00:04:19
    suppose you have a location say uh in
  • 00:04:22
    the S3 bucket right and you are writing
  • 00:04:24
    the data the next time you run a load
  • 00:04:27
    you don't want that location uh so you
  • 00:04:29
    want that location to be overwritten right but
  • 00:04:32
    in this case you don't want to lose the
  • 00:04:34
    previous data so what capability of S3
  • 00:04:36
    would you use so that you can't you
  • 00:04:37
    won't delete the previous version of the
  • 00:04:40
    data uh so basically everyday data is
  • 00:04:44
    going in a partition date so in S3 we do
  • 00:04:46
    have date date folders uh so let's say
  • 00:04:50
    if there are 100 transaction that has
  • 00:04:52
    happened today so it will be in today's
  • 00:04:54
    date and uh for tomorrow it will be in
  • 00:04:56
    different date so whenever we are uh
  • 00:04:58
    creating our external table we are
  • 00:05:00
    adding the uh today's date partition to
  • 00:05:02
    it and we just we are just reading uh
  • 00:05:04
    the latest data from there and then uh
  • 00:05:07
    like store procedure is running and it
  • 00:05:08
    is dumping the data from S3 to uh our
  • 00:05:11
    Redshift ODS layer so it's like if I am
  • 00:05:15
    running my job today I'll have if there
  • 00:05:17
    is any case where failure has been uh
  • 00:05:20
    like there was a failure so we can add
  • 00:05:21
    multiple date partition as well to
  • 00:05:23
    access different uh dates data but for
  • 00:05:26
    now the use case is like like every day
  • 00:05:28
    it will drop the previous date partition
  • 00:05:30
    and add a new one and uh we are just
  • 00:05:32
    able to access the one day data only
  • 00:05:35
    okay so are you doing any reconciliation
  • 00:05:37
    process as part of this
  • 00:05:40
    job uh yeah we are doing reconciliation
  • 00:05:43
    process once we are uh done with like
  • 00:05:45
    the production load for all the use
  • 00:05:47
    cases there will be SNS notifications
  • 00:05:49
    that will get triggered based on the
  • 00:05:52
    success and failures so mostly what we
  • 00:05:54
    are doing is we are just uh based on
  • 00:05:57
    some transformation we are checking
  • 00:05:58
    whatever data is there in in our
  • 00:05:59
    dimensions and facts uh and as per the
  • 00:06:02
    requirement whatever data is there on
  • 00:06:04
    the source so we have written SQL
  • 00:06:06
    queries uh as part of Recon so it will
  • 00:06:09
    basically match all the kpis and if
  • 00:06:11
    there is any difference it will trigger
  • 00:06:13
    a notification if there is no difference
  • 00:06:15
    as well then then also like we have kept
  • 00:06:17
    it that way that uh it will uh daily
  • 00:06:19
    match some kpis for example like the
  • 00:06:21
    policy count uh the premium amount so
  • 00:06:24
    these kpis it will be matching and we
  • 00:06:26
    are sending the triggers so how you
  • 00:06:28
    scheduling the reconciliation process
  • 00:06:30
    and at what time like is it a daily run
  • 00:06:32
    that you are doing or a weekly run so
  • 00:06:33
    how it is
  • 00:06:34
    scheduled it's a daily run so for now
  • 00:06:37
    like we have scheduled it like uh in the
  • 00:06:39
    morning around 9:00 because we have we
  • 00:06:42
    are starting all our use cases run after
  • 00:06:44
    12: after midnight so uh as for the load
  • 00:06:48
    that we were getting everything will be
  • 00:06:50
    like all the uh use cases jobs will uh
  • 00:06:53
    will be successful by 7 in the morning
  • 00:06:56
    so based on that like we are running it
  • 00:06:58
    uh at like 9 in the like 9 in the
  • 00:07:01
    morning yeah and say as part of your
  • 00:07:03
    reconciliation process right you find
  • 00:07:05
    some discrepancy in your data like maybe
  • 00:07:07
    there is some corrupt record or
  • 00:07:08
    something so how you would deal with
  • 00:07:10
    that and how you would like reprocess it
  • 00:07:13
    again yeah okay so if there is any
  • 00:07:16
    mismatch in the data so what we are
  • 00:07:18
    doing is we are going back to the code
  • 00:07:20
    firstly we'll check the logs like there
  • 00:07:22
    can be different type of uh failures if
  • 00:07:24
    it's a job failure or is it a uh kpi
  • 00:07:28
    mismatch so in case of uh there is any
  • 00:07:31
    failure we are going to the step
  • 00:07:32
    function that we are using for
  • 00:07:33
    orchestrating all our tasks so there we
  • 00:07:36
    do have enabled uh like Cloud watch logs
  • 00:07:39
    we can go to the ECS task and we can
  • 00:07:42
    check the cloudwatch logs and based on
  • 00:07:44
    the error message that it is showing
  • 00:07:46
    we'll go and fix in the code uh in the
  • 00:07:48
    jobs basically and if it's a KPI
  • 00:07:50
    mismatch related data then we know like
  • 00:07:53
    uh from which fact we are getting this
  • 00:07:55
    data and we'll have to basically
  • 00:07:56
    backtrack the source of that particular
  • 00:07:58
    kpi uh like what all the transformation
  • 00:08:01
    we are applying to that uh particular uh
  • 00:08:04
    like that metric and from where it is
  • 00:08:06
    coming and what can be done like uh we
  • 00:08:08
    will be backtracking it at every uh
  • 00:08:11
    Point like every
  • 00:08:13
    stage so how you are currently notifying
  • 00:08:16
    your consumer so say you are producing
  • 00:08:17
    the data and there are some consumers
  • 00:08:19
    right and because of your process you
  • 00:08:21
    identify maybe there is some SLA sometime
  • 00:08:23
    that is breached or maybe there is some
  • 00:08:26
    freshness issue or something so how that
  • 00:08:28
    will be communicated to the
  • 00:08:30
    consumers uh so basically uh we have
  • 00:08:33
    created views on top of our facts and
  • 00:08:37
    the access has been given so it's uh
  • 00:08:39
    like country related data that we are
  • 00:08:41
    getting so there are some users that are
  • 00:08:43
    accessing the uh like let's say Hong
  • 00:08:46
    Kong data there are some users that are
  • 00:08:47
    accessing Vietnam data So based on the
  • 00:08:50
    uh like authent like we will be giving
  • 00:08:53
    the them the uh permission to view the
  • 00:08:55
    data so only the uh related use cases uh
  • 00:09:00
    basically if there are some users that
  • 00:09:02
    want they want the data uh to get viewed
  • 00:09:05
    or something like that so they'll have
  • 00:09:07
    to ask for Access and we need to add
  • 00:09:09
    their name as well but suppose they
  • 00:09:11
    already have the access but somehow
  • 00:09:13
    because of your job fail or something
  • 00:09:15
    you are not able to publish today's date
  • 00:09:17
    data right and now when they check on
  • 00:09:19
    their side they won't be able to get the
  • 00:09:20
    latest data so as a proactive measure
  • 00:09:22
    you can also notify them right as part
  • 00:09:25
    of your process so how can you is it
  • 00:09:27
    currently that You' have implemented if
  • 00:09:29
    note how could you do that yes yes us uh
  • 00:09:32
    SNS notification will get triggered
  • 00:09:34
    after the run is complete and it's going
  • 00:09:36
    like we have included the users as well
  • 00:09:39
    who are consuming the data so they'll
  • 00:09:40
    get a notification in case of there is
  • 00:09:42
    any failure or there is any kpi mismatch
  • 00:09:44
    they'll directly get the notification
  • 00:09:46
    from
  • 00:09:47
    there uh so will how will you implement
  • 00:09:49
    it as part of your current
  • 00:09:51
    workflow yeah uh so uh like uh we are
  • 00:09:55
    orchestrating everything in Step
  • 00:09:56
    functions there is one uh SNS uh trigger
  • 00:09:59
    that we are uh basically we have written
  • 00:10:01
    the Lambda as well where we are
  • 00:10:03
    comparing all the data and uh that we
  • 00:10:06
    will be invoking and it will uh based on
  • 00:10:10
    some conditions it will trigger the
  • 00:10:11
    notification we when we are creating the
  • 00:10:13
    topics we are sending that them to the
  • 00:10:15
    users they will subscribe to it based on
  • 00:10:17
    the PS up method they will be receiving
  • 00:10:20
    notifications later and in case of any
  • 00:10:22
    failure or production issue then based
  • 00:10:24
    on that like uh we'll have to estimate
  • 00:10:26
    it and we'll create a like us a story
  • 00:10:28
    for that like produ issue we'll have to
  • 00:10:30
    log in jir and based on the we'll start
  • 00:10:33
    working on it okay so in S3 bucket right
  • 00:10:37
    can we have like two buckets with the
  • 00:10:39
    same name in AWS
  • 00:10:41
    S3 uh no uh it should be unique uh so
  • 00:10:46
    but is it like is it not region specific
  • 00:10:48
    or is it region specific S3 service
  • 00:10:50
    it's Global it's Global and every name
  • 00:10:53
    should be unique is there any
  • 00:10:55
    requirement for having such Global name
  • 00:10:58
    space
  • 00:11:00
    uh require uh sorry uh I did not
  • 00:11:03
    understand your question so point is
  • 00:11:04
    like is uh what is the use case like why
  • 00:11:06
    ad decided to have the global name
  • 00:11:08
    spaces even if we can specify the region
  • 00:11:11
    in an S3 bucket okay I check I don't so
  • 00:11:13
    say you have an S3 bucket right and like
  • 00:11:15
    how can you secure your S3 bucket like
  • 00:11:17
    what different uh things you can
  • 00:11:19
    Implement by which you can secure your
  • 00:11:21
    S3
  • 00:11:23
    bucket uh so basically we can provide
  • 00:11:25
    permissions accordingly like uh there
  • 00:11:28
    can be an uh access uh list we can add
  • 00:11:31
    and uh we can allow or deny some
  • 00:11:33
    resources or we can allow or uh deny
  • 00:11:35
    based on the IAM
  • 00:11:37
    permissions so that we can uh
  • 00:11:41
    do okay any other thing that you can
  • 00:11:43
    Implement I can think of right now okay
  • 00:11:47
    so say you are working for say
  • 00:11:48
    e-commerce company right hello say
  • 00:11:52
    you're working for an e-commerce company
  • 00:11:54
    and uh maybe handling the customer data
  • 00:11:55
    product right so all the customer
  • 00:11:57
    information and everything you are
  • 00:11:58
    dealing with those type of data uh so
  • 00:12:01
    customer data is basically PII data right
  • 00:12:03
    and it's your responsibility as part of
  • 00:12:05
    the team to secure the pii information
  • 00:12:08
    right so what are the diff so how you
  • 00:12:10
    will Design uh this pipelines or maybe
  • 00:12:12
    how you will secure this pii information
  • 00:12:14
    so maybe you can list down some factors
  • 00:12:16
    and maybe some ways uh so first thing is
  • 00:12:19
    like uh this data like are we receiving
  • 00:12:21
    the PII columns as well and uh what is
  • 00:12:24
    the like use case like can we uh can we
  • 00:12:28
    access that data uh if yes then we can
  • 00:12:31
    uh do one thing we can create an
  • 00:12:32
    abstraction layer on top of that where
  • 00:12:34
    we can just uh basically uh mask that
  • 00:12:37
    data so what we are doing is in some
  • 00:12:40
    cases we do have two uh two different
  • 00:12:42
    schemas one is restricted one and one is
  • 00:12:45
    the general one so in the general one we
  • 00:12:46
    will be keeping all the uh we will be
  • 00:12:49
    doing all the masking on the columns
  • 00:12:51
    that uh that comes under pi and in the
  • 00:12:54
    restricted column we will be showing
  • 00:12:55
    their values as well so what we are
  • 00:12:57
    doing is uh we do have have data stored
  • 00:12:59
    in our facts when we are creating the
  • 00:13:01
    views on top of that then only in the
  • 00:13:03
    view uh view query we are uh just have
  • 00:13:07
    written a case statement where we are
  • 00:13:09
    checking if that column is present or
  • 00:13:11
    not if the value is there then we are
  • 00:13:12
    just masking that value other than data
  • 00:13:15
    masking like is is there any other thing
  • 00:13:17
    that you will Implement so that say your
  • 00:13:20
    data is present in the S3 bucket and
  • 00:13:21
    currently uh you have the access right
  • 00:13:24
    and you want to prevent any unauthorized
  • 00:13:26
    access to your data so data masking is
  • 00:13:29
    one of the technique that you will
  • 00:13:30
    surely Implement before dumping it to
  • 00:13:31
    the Target sources so like any other
  • 00:13:34
    technique that you can Implement to
  • 00:13:35
    secure your data maybe we can remove
  • 00:13:38
    those uh columns from there and we can
  • 00:13:40
    keep it in different uh uh like
  • 00:13:42
    different file or different table and uh
  • 00:13:45
    we can move it to different folder uh
  • 00:13:48
    which do not have access to the normal
  • 00:13:50
    users we can restrict the access to that
  • 00:13:53
    we can write python code we can read the
  • 00:13:56
    data we can uh basically separate the
  • 00:13:58
    The pii Columns and the uh original
  • 00:14:01
    columns that we want to use and we can
  • 00:14:03
    create two different files out of it and
  • 00:14:06
    one file will go to the restricted
  • 00:14:08
    folder and one file can go uh to the
  • 00:14:10
    normal folder that basically can be
  • 00:14:12
    consumed by all the users okay so when
  • 00:14:15
    you're explaining the architecture right
  • 00:14:17
    so what was the format of the data that
  • 00:14:19
    you have in the
  • 00:14:20
    raw zone so it's basically firstly uh all
  • 00:14:24
    the logs that we are receiving it's in
  • 00:14:26
    CSV format and later on using python we
  • 00:14:29
    are taking the chunks uh like chunks of
  • 00:14:31
    data we are uh basically converting that
  • 00:14:34
    uh that into Parquet format so that we can
  • 00:14:36
    access it later so how you transforming
  • 00:14:39
    from raw to uh transformed is there any
  • 00:14:41
    Services of AWS you're
  • 00:14:44
    using so uh firstly there is one raw
  • 00:14:47
    Zone which we are calling uh which will
  • 00:14:50
    have all the like all the logs so let's
  • 00:14:53
    say if there are 100 transaction that
  • 00:14:54
    has happened so it will have 10 uh like
  • 00:14:57
    100 CSV files inside it so now what we
  • 00:15:00
    are doing is we have written a python
  • 00:15:02
    code and the python code is running it
  • 00:15:04
    will merge all the those 100 uh files
  • 00:15:07
    because uh there can be a case like for
  • 00:15:09
    every transaction it is creating a file
  • 00:15:11
    so there can be like thousands or uh 10
  • 00:15:14
    thousands of files we have we are
  • 00:15:16
    receiving in one day but we want to
  • 00:15:17
    process everything so we are merging all
  • 00:15:19
    those data and then we are in the python
  • 00:15:22
    code only like we are creating U like we
  • 00:15:24
    are converting it into Parquet and we are
  • 00:15:26
    keeping that data into our aggregate
  • 00:15:28
    zone so aggregate zone uh we have in
  • 00:15:32
    place which will have the exact same
  • 00:15:33
    data but it it's uh in uh like we are
  • 00:15:35
    it's compressed it's in Parquet format that
  • 00:15:38
    we can leverage later for the
  • 00:15:40
    optimization techniques that we can use
  • 00:15:42
    and uh it's acting uh like uh acting as
  • 00:15:46
    a middle layer which can be later on
  • 00:15:48
    consumed uh by the user so on top of
  • 00:15:50
    that we are creating external tables and
  • 00:15:52
    we are dumping the data into ODS so in
  • 00:15:55
    this whole pipeline L how you are
  • 00:15:56
    managing the schema right because CSV
  • 00:15:59
    and uh don't come with the schemas right
  • 00:16:02
    so how you managing your schemas schema
  • 00:16:04
    drift or schema Evolution as part of
  • 00:16:06
    this pipeline if there is any Evolution
  • 00:16:08
    happening right so how you will
  • 00:16:09
    incorporate that in a
  • 00:16:11
    pipeline yeah so CSV does not provide us
  • 00:16:15
    the uh feature for like schema evolution
  • 00:16:18
    so what we are doing is uh we have
  • 00:16:20
    written a python code and uh like
  • 00:16:23
    whenever there is a new file we are
  • 00:16:25
    basically creating a data frame out of
  • 00:16:26
    it and uh we are checking all The
  • 00:16:28
    Columns we are creating a set of it and
  • 00:16:30
    we are connecting to our Redshift which
  • 00:16:32
    is our Target and we are taking some
  • 00:16:35
    data from that as well again doing the
  • 00:16:38
    same process checking all the columns
  • 00:16:40
    and creating a set out of it and then we
  • 00:16:42
    have two sets like one set has all the
  • 00:16:44
    columns from The Source One set has all
  • 00:16:46
    the columns from the target we are
  • 00:16:47
    taking a set difference and if there is
  • 00:16:50
    uh some columns that is like we are
  • 00:16:52
    getting as as an output of set
  • 00:16:54
    difference then we know like these are
  • 00:16:56
    all the columns that are uh that that is
  • 00:16:58
    not there in the red shift in our Target
  • 00:17:01
    and these are all the new columns so
  • 00:17:03
    there like we are doing two things uh we
  • 00:17:05
    are firstly like uh based on the columns
  • 00:17:08
    which we are getting we are creating
  • 00:17:09
    alter statements so we are adding new
  • 00:17:12
    columns to our Target and we are just
  • 00:17:14
    keeping it as varchar because it can uh
  • 00:17:16
    have any type of data and later on like
  • 00:17:19
    we can go on and manually change that
  • 00:17:21
    like as per the uh like the data
  • 00:17:23
    requirement if it's uh we need to keep
  • 00:17:25
    it integer or like any other data type
  • 00:17:28
    so this will be the one thing and second
  • 00:17:30
    thing we are sending notifications to
  • 00:17:31
    your user so that we can get their
  • 00:17:33
    confirmation like if those columns were
  • 00:17:35
    needed and uh are they part of as part
  • 00:17:38
    of any new requirement or things like
  • 00:17:40
    that so can't we use like Glue catalog
  • 00:17:42
    service here to manage the schema like
  • 00:17:45
    instead of managing like connecting to
  • 00:17:47
    source and Target and comparing it so
  • 00:17:49
    can't we use any managed catalog service
  • 00:17:51
    Like
  • 00:17:52
    Glue yes yes we can definitely use that
  • 00:17:55
    but since it was already uh like flow is
  • 00:17:58
    like we are getting the data from DMS
  • 00:18:00
    and we are processing all we are doing
  • 00:18:02
    all the processing writing all the SQL
  • 00:18:05
    scripts and dumping the data so we are
  • 00:18:07
    not using glue for that that's why we
  • 00:18:09
    have not used but definitely we can uh
  • 00:18:12
    yeah change that to uh we can use some
  • 00:18:14
    other AWS service as well okay so say
  • 00:18:17
    you're using glue right for maybe EMR
  • 00:18:20
    service right so we have to specify the
  • 00:18:22
    cluster configurations right like what
  • 00:18:24
    type of instances we want to use so um
  • 00:18:27
    we have like different instance typee in
  • 00:18:29
    AWS service right so uh on basis of use
  • 00:18:32
    case how we will specify like what type
  • 00:18:34
    of instance we have to use from AWS
  • 00:18:38
    offering so uh like if we are using AWS
  • 00:18:41
    Glue we don't need to worry about the
  • 00:18:44
    infrastructure management because it's a
  • 00:18:46
    server serverless and it manages uh as
  • 00:18:49
    per the load so if there's more uh like
  • 00:18:52
    more data coming in our way and the
  • 00:18:54
    workload is uh higher than expected like
  • 00:18:56
    it will spin up all the cluster
  • 00:18:59
    accordingly uh so I have used uh AWS
  • 00:19:02
    glue not uh EMR I can talk in terms of
  • 00:19:05
    that like if firstly uh Whenever there
  • 00:19:08
    is any use case we'll have to go on and
  • 00:19:10
    uh analyze the data data volume data
  • 00:19:13
    types and uh we will be doing all the
  • 00:19:15
    analysis on that basis only we can
  • 00:19:18
    either uh use some uh like EMR in EMR we
  • 00:19:22
    can get uh all the flexibility to
  • 00:19:24
    provision our own clusters so there it's
  • 00:19:27
    very much necessary that we know what uh
  • 00:19:29
    data volume we're getting and as for
  • 00:19:31
    that like how can it be handled how many
  • 00:19:33
    nodes how many basically how many
  • 00:19:35
    executors do we need what is the memory
  • 00:19:37
    what is the core that we need as per the
  • 00:19:39
    requirement and the data we are dealing
  • 00:19:41
    with So currently EMR also comes with a
  • 00:19:44
    new offering of EMR serverless right so
  • 00:19:46
    in this case you also can use EMR
  • 00:19:48
    serverless for your load so what what
  • 00:19:51
    can be the use case for using EMR
  • 00:19:53
    serverless over top of over glue yeah so
  • 00:19:56
    if it's uh like similar like
  • 00:19:58
    architecture wise like it's serverless
  • 00:20:00
    and it also gives like uh manages all
  • 00:20:04
    the infrastructure itself so maybe the
  • 00:20:06
    data cataloging part that we have in AWS
  • 00:20:09
    glue we don't have in EMR I'm not uh
  • 00:20:12
    well-versed with the EMR uh like
  • 00:20:14
    functionalities and features but I'll
  • 00:20:16
    talk about uh AWS Glue it does have uh
  • 00:20:20
    data brew as well where we can uh like
  • 00:20:22
    perform some analytics that is one plus
  • 00:20:24
    Point uh it does have AWS crawlers as
  • 00:20:26
    well that can help to crawl the data
  • 00:20:28
    from some source to Athena or some other
  • 00:20:31
    source so those are the functionalities
  • 00:20:33
    maybe that uh EMR is not like it's not
  • 00:20:36
    there in the EMR okay so say you are
  • 00:20:39
    using glue service right for your detail
  • 00:20:41
    Pipeline and you are using glue crawler
  • 00:20:44
    and say your data is present in the S3
  • 00:20:45
    bucket right so how so say one day you
  • 00:20:48
    use the glue crawler and schema is up to
  • 00:20:50
    date in your glue pipelines right uh so
  • 00:20:53
    maybe from some few days uh afterwards
  • 00:20:55
    your schema has changed right so how
  • 00:20:57
    will glue come to know that schema has
  • 00:21:00
    changed so how you will implement this
  • 00:21:02
    type of workflow so that your ETL
  • 00:21:04
    pipelines would be aware of the recent
  • 00:21:06
    schema
  • 00:21:07
    changes so we are using AWS uh like data
  • 00:21:11
    catalog part here and data catalog will
  • 00:21:13
    basically store all the metadata
  • 00:21:15
    regarding the data files and everything
  • 00:21:17
    so if there is any change in the data so
  • 00:21:19
    AWS catalog will capture that change as
  • 00:21:22
    part of schema evaluation and when we
  • 00:21:24
    are uh moving the data to like crawling
  • 00:21:26
    the data from the S3 or some other
  • 00:21:29
    source uh it will go on and check the
  • 00:21:32
    metadata from data catalog and it will
  • 00:21:34
    come to know about the changes in the
  • 00:21:36
    schema so say you want to like have a
  • 00:21:38
    pipeline implemented as uh in the
  • 00:21:41
    similar flow so say you want to have an
  • 00:21:43
    Automation in place uh in this case like
  • 00:21:45
    whenever there is a schema changes or
  • 00:21:48
    something or maybe you have a glue job
  • 00:21:50
    which will basically maybe per day basis
  • 00:21:53
    it will scan the data and in case there
  • 00:21:55
    is any drift in the schema you will be
  • 00:21:57
    get notified so you want to
  • 00:21:59
    automate such a pipeline so how you will
  • 00:22:01
    Design such a pipeline in glue uh so in
  • 00:22:03
    glue what we can do is uh in that case
  • 00:22:06
    like if there is any schema changes we
  • 00:22:09
    need to
  • 00:22:09
    notify uh maybe we can add some triggers
  • 00:22:13
    or uh we can send some notifications
  • 00:22:15
    using teams
  • 00:22:17
    or maybe on some other channel like
  • 00:22:20
    slack Channel teams uh what according to
  • 00:22:22
    the functionality that it provides uh
  • 00:22:26
    yeah I'm not uh much like I have to uh
  • 00:22:28
    think about that use case but my
  • 00:22:31
    Approach will be like if I am using glue
  • 00:22:33
    uh somewhere like I will get previous
  • 00:22:36
    date uh data right uh that has because
  • 00:22:40
    uh the the schema was the previous
  • 00:22:42
    schema that we were referring to and uh
  • 00:22:45
    the new schema that we are getting we
  • 00:22:47
    can also load that also in our AWS glue
  • 00:22:49
    so maybe let's say like I have like
  • 00:22:51
    today's the date that we have received
  • 00:22:54
    the new schema and two days back the
  • 00:22:56
    date was uh there like uh based on the
  • 00:22:58
    partition date I can read two partitions
  • 00:23:01
    now and from those two partitions I can
  • 00:23:03
    read the data and check all the columns
  • 00:23:05
    and uh basically get the difference in
  • 00:23:07
    the columns whatever the difference is
  • 00:23:08
    there and what is the data type changes
  • 00:23:11
    basically uh based on that like if there
  • 00:23:14
    is any difference that uh there will we
  • 00:23:17
    can uh set some flags and based on those
  • 00:23:19
    flags or we can uh invoke some Lambda or
  • 00:23:22
    basically we can invoke SNS to trigger
  • 00:23:25
    the notifications so something like that
  • 00:23:28
    we can Implement okay so you mentioned
  • 00:23:31
    in your resume like you have also used
  • 00:23:33
    DynamoDB and Elasticsearch as a database
  • 00:23:35
    uh so what was the use case right and
  • 00:23:37
    why we have like using two different
  • 00:23:39
    databases so what were the different use
  • 00:23:40
    cases that you were using this
  • 00:23:43
    for yeah so uh basically dynamodb and
  • 00:23:47
    elastic search we are using for same use
  • 00:23:48
    case only what is what was happening is
  • 00:23:51
    we were getting uh aay related data so
  • 00:23:53
    whenever like uh we need to uh set a
  • 00:23:56
    mandate so that like uh the premium or
  • 00:23:59
    anything like that like it can be
  • 00:24:00
    deducted automatically from the account
  • 00:24:02
    so those sort of data we were getting in
  • 00:24:04
    terms of CSV and txt file from the users
  • 00:24:07
    themselves and using infoworks for the
  • 00:24:09
    data injection part like we were using
  • 00:24:11
    DMS ews service in this use case we are
  • 00:24:14
    using infoworks which basically is a
  • 00:24:15
    data injection tool there like we can
  • 00:24:17
    write the same workflows and uh we can
  • 00:24:19
    create a pipeline uh what it was doing
  • 00:24:22
    was like it was reading the data from
  • 00:24:24
    that file and uh it was uh converting it
  • 00:24:27
    into Parquet and again the so like the
  • 00:24:29
    target was to uh basically convert in
  • 00:24:31
    Parquet and uh place it on S3 itself which
  • 00:24:35
    was our um raw Zone there as well and
  • 00:24:38
    later on like what we were doing is uh
  • 00:24:40
    we were using AWS glue jobs to read the
  • 00:24:42
    data from there and we were it's uh like
  • 00:24:44
    we were creating the Json structure
  • 00:24:46
    there like using StructType and
  • 00:24:47
    StructField and the target was DynamoDB
  • 00:24:51
    because um what we like Dynamo DB and uh
  • 00:24:54
    elastic search why we were using two
  • 00:24:56
    different uh Services because uh firstly
  • 00:24:59
    there are some functionalities that
  • 00:25:01
    elastic search provides like it can
  • 00:25:02
    provides us like we can uh write more
  • 00:25:05
    complex queries it's more uh based on
  • 00:25:07
    real-time analytics and also like uh it
  • 00:25:10
    can uh it provides a uh full text search
  • 00:25:13
    so if we want to uh search in a complex
  • 00:25:16
    uh uh there are some partition Keys also
  • 00:25:19
    like in Dynamo DB it's also uh very fast
  • 00:25:22
    it's it has a very high throughput and
  • 00:25:25
    uh it does store data in uh no SQL
  • 00:25:28
    unstructured or maybe semi-structured
  • 00:25:30
    data but it comes with the cost
  • 00:25:33
    definitely our use case was like like if
  • 00:25:35
    there is any change in the document
  • 00:25:38
    whatever we are storing and based on the
  • 00:25:40
    same partition key if I want to update
  • 00:25:42
    that it will override that particular uh
  • 00:25:44
    data which was there in DynamoDB it
  • 00:25:47
    does not index it and keep both the data
  • 00:25:49
    fields so let's say if there is any like
  • 00:25:51
    my partition uh like we need to uh
  • 00:25:54
    basically specify the partitions and
  • 00:25:56
    sort key there so let's say if I'm
  • 00:25:58
    keeping my like employee ID employee
  • 00:26:00
    name those two keys as part of partition
  • 00:26:03
    keys and uh on those two keys I have
  • 00:26:05
    some updates now I want to update that
  • 00:26:08
    system like there is some changes okay
  • 00:26:10
    so if I'll go on and hit the uh updating
  • 00:26:13
    Dynamo DB it will override that uh data
  • 00:26:15
    in elastic search it will index that
  • 00:26:18
    like it will keep both the data like we
  • 00:26:20
    can uh it doesn't have any uh if you
  • 00:26:22
    want to uh basically prevent it from uh
  • 00:26:26
    having the duplicates we'll have to index
  • 00:26:27
    that data and we can query that as well
  • 00:26:29
    like based on uh if you want the um if
  • 00:26:33
    you want the latest data we can just
  • 00:26:35
    query it accordingly because it has its
  • 00:26:37
    like already indexed documents so that
  • 00:26:39
    was one use case which were uh because
  • 00:26:41
    of which we are having the same data at
  • 00:26:44
    both places but the consumers are
  • 00:26:46
    different like the users are different
  • 00:26:48
    so can you explain me the difference in
  • 00:26:50
    the database sharding and the
  • 00:26:51
    partitioning uh database sharding and
  • 00:26:54
    partitioning partitioning is uh uh in my
  • 00:26:58
    understanding like where we are getting
  • 00:26:59
    like we have a large data set and we
  • 00:27:01
    want to process that data set uh in uh
  • 00:27:04
    like very quick time so we'll be taking
  • 00:27:07
    the chunks of it and uh we'll be keeping
  • 00:27:10
    those chunks in different different
  • 00:27:11
    partitions so that is partitioning to
  • 00:27:13
    achieve parallelism where we can uh par
  • 00:27:16
    parall we can process all those data and
  • 00:27:19
    sharding uh like I have I'm aware on the
  • 00:27:23
    AWS Kinesis I guess in sharding like we
  • 00:27:25
    are also getting the data like from
  • 00:27:26
    streaming
  • 00:27:28
    I'm not sure if uh I'm correct here but
  • 00:27:31
    uh I have read about shards which
  • 00:27:34
    basically capture the streaming data and
  • 00:27:36
    we do specify the capacity of that shard
  • 00:27:39
    uh which will be there so it can like
  • 00:27:41
    maybe um in one shard we can have 100
  • 00:27:44
    records or things like that that's
  • 00:27:46
    mostly using for the uh streaming
  • 00:27:48
    related data okay so like what is the
  • 00:27:52
    secondary index Concept in the Dynamo DB
  • 00:27:55
    so like we have the secondary local
  • 00:27:57
    secondary index as well as the global
  • 00:27:58
    secondary index so what is the use case
  • 00:28:00
    of implementing the secondary index and
  • 00:28:02
    what is the cons of using those
  • 00:28:04
    secondary index I'm not uh much aware on
  • 00:28:08
    that actually I'll have to okay look
  • 00:28:10
    into it so say consider one scenario
  • 00:28:13
    right so you are working for some
  • 00:28:14
    business unit and uh so they are putting
  • 00:28:17
    your data in the S3 bucket so they don't
  • 00:28:20
    have a particular schedule when they are
  • 00:28:22
    dumping it is like on an ad hoc basis on
  • 00:28:23
    a daily basis right and they have
  • 00:28:26
    mentioned an SLA for 30 minutes so
  • 00:28:28
    whenever data lands in the S3 bucket you
  • 00:28:30
    need to process that data in in the 30
  • 00:28:33
    minute window right so they have also
  • 00:28:35
    given some business logic which you need
  • 00:28:37
    to perform on your data and right and
  • 00:28:40
    again they also want all the auditing
  • 00:28:42
    and logging in place so that they can
  • 00:28:44
    also track the data lineage so what
  • 00:28:47
    would be your approach of handling such
  • 00:28:48
    a pipeline uh for Designing such a
  • 00:28:50
    pipeline by meeting the constraints that
  • 00:28:53
    they have mentioned so first constraint
  • 00:28:55
    is uh that it's on an ad hoc basis we are
  • 00:28:58
    not getting any notifications right uh
  • 00:29:01
    what I can think of is uh what we can do
  • 00:29:04
    is uh so whenever we can uh activate we
  • 00:29:07
    can write a Lambda function and uh we
  • 00:29:10
    can invoke it based on the put uh put
  • 00:29:12
    event in S3 so if there is any file that
  • 00:29:16
    has been placed in S3 it will
  • 00:29:18
    automatically go on and basically invoke
  • 00:29:20
    the Lambda function and Lambda function
  • 00:29:22
    can take care of the uh processing part
  • 00:29:25
    there um that's one thing and in case
  • 00:29:29
    there are like the data size data volume
  • 00:29:31
    is huge so what can we do in that case
  • 00:29:35
    um basic uh what I'm thinking is maybe
  • 00:29:37
    we can uh do the same thing like till
  • 00:29:40
    the first part what we can do is uh
  • 00:29:43
    whenever there's a put object it can
  • 00:29:45
    invoke the Lambda and in the Lambda we
  • 00:29:47
    can invoke AWS glue jobs maybe and uh it
  • 00:29:50
    will take the data from S3 and in the
  • 00:29:52
    AWS glue jobs it will start processing
  • 00:29:56
    uh so that was the ad hoc part and what
  • 00:29:59
    was the second uh what were uh the other
  • 00:30:02
    constraint uh so for other constraint is
  • 00:30:04
    they also want to have the logging in
  • 00:30:06
    place as well as the auditing mechanism
  • 00:30:08
    so that they can also see the data
  • 00:30:12
    lineage okay so for that like uh uh data
  • 00:30:16
    lineage okay so uh in AWS glue if we are
  • 00:30:20
    using that uh so it does uh have like uh
  • 00:30:25
    we we are leveraging spark right so it
  • 00:30:28
    does store the data in the lower uh like
  • 00:30:31
    low level in rdds which does have the uh
  • 00:30:33
    lineage information so maybe we can
  • 00:30:35
    leverage
  • 00:30:36
    that or in case and for the logging part
  • 00:30:39
    we can
  • 00:30:41
    active yeah so for any logging part or
  • 00:30:43
    monitoring purpose we can uh activate uh
  • 00:30:46
    AWS cloudwatch events there and in case
  • 00:30:49
    of any failure in case of anything we
  • 00:30:51
    want to uh basically see we can go on
  • 00:30:54
    and uh check on the logs uh yeah so can
  • 00:30:58
    you explain like how rdds will give you
  • 00:30:59
    the data
  • 00:31:01
    lineage yeah so uh rdds uh basically how
  • 00:31:07
    uh okay so whenever we are calling an
  • 00:31:09
    action basically uh it's a in sparkk
  • 00:31:12
    It's a lazy evaluation it follows a lazy
  • 00:31:14
    evaluation technique so if there is any
  • 00:31:16
    action that has been called it will go
  • 00:31:17
    on and from the start it will start uh
  • 00:31:19
    take all the uh steps that will that is
  • 00:31:22
    there it will create a execution plan uh
  • 00:31:25
    and in that execution plan will have all
  • 00:31:27
    the Transformations and everything that
  • 00:31:28
    needs to be done to get to the output
  • 00:31:31
    and it it's in certain order and so it
  • 00:31:34
    will create dags for it so dag will have
  • 00:31:37
    a particular tasks in certain order that
  • 00:31:39
    needs to be executed that is basically
  • 00:31:41
    the lineage information that rdd has uh
  • 00:31:45
    so in that sense like we can uh if there
  • 00:31:47
    is any node failure or anything like
  • 00:31:48
    that happens uh so it can go on and uh
  • 00:31:52
    Trace back and using that dag it can
  • 00:31:55
    recompute that node
  • 00:31:57
    so it does have a lineage information
  • 00:31:59
    with it so say you have in this uh in
  • 00:32:02
    this use case only you have the
  • 00:32:03
    requirement to have all the different
  • 00:32:05
    event so say you are saying I am
  • 00:32:07
    triggering a Lambda right and then using
  • 00:32:09
    Lambda I am again triggering the glue
  • 00:32:10
    service right so all this event you want
  • 00:32:13
    to track as part of your audit right
  • 00:32:15
    what is the originating Source what is
  • 00:32:17
    uh what is the transformation source so
  • 00:32:19
    in this scenario like for this
  • 00:32:21
    implementation how you will uh enable
  • 00:32:23
    the auditing part uh I guess cloud trail
  • 00:32:26
    is a
  • 00:32:27
    uh AWS service that can help in auditing
  • 00:32:31
    part uh but that is again on the entire
  • 00:32:34
    infra level right that cloud trail yeah
  • 00:32:37
    okay okay in that case like uh we can
  • 00:32:41
    maybe leverage not AWS glue we can use
  • 00:32:45
    airflow as well we can write uh we can
  • 00:32:47
    create dags in airflow and uh we can
  • 00:32:49
    create tasks and uh we
  • 00:32:52
    can uh
  • 00:32:53
    basically use their uh uh we can
  • 00:32:57
    uh set the dependencies between them and
  • 00:33:00
    based on that like it can leverage act
  • 00:33:02
    as lineage information in case of any
  • 00:33:04
    failures okay uh so how you are
  • 00:33:07
    currently deploying right your pipeline
  • 00:33:09
    so what what is the is there any devops
  • 00:33:11
    tool that you are using so you are also
  • 00:33:13
    using quite a good Services of AWS right
  • 00:33:15
    so how you deploying this
  • 00:33:18
    infrastructure okay so uh basically uh
  • 00:33:22
    we do have now a new framework with us
  • 00:33:25
    with which is basically uh has been
  • 00:33:27
    created by our team uh in quantify uh so
  • 00:33:31
    to deploy what we are doing is we do
  • 00:33:33
    have a repository get repository so all
  • 00:33:35
    the changes that we are doing uh in
  • 00:33:37
    terms of our uh Transformations and all
  • 00:33:40
    the SQL related changes on in the
  • 00:33:42
    scripts we will create a get uh request
  • 00:33:45
    like basically pull request to merge
  • 00:33:47
    the data so someone will approve that
  • 00:33:49
    and will be able to merge the data and
  • 00:33:52
    post that we do have a Jenkins pipeline
  • 00:33:54
    that we are using for the deployment
  • 00:33:56
    part so the Jenkins pipeline is pointing to that
  • 00:33:58
    git repository so if there is any change
  • 00:34:00
    we'll need to again deploy the Jenkins
  • 00:34:03
    it will go on and uh choose the stack
  • 00:34:05
    and whatever the resources it will go on
  • 00:34:07
    and allocate that so this is the cicd
  • 00:34:10
    tool that we are using for now so what
  • 00:34:13
    what stack it is deploying
  • 00:34:16
    like so like it there is a code that has
  • 00:34:19
    been written by the platform team uh to
  • 00:34:22
    uh deploy all the resources that are
  • 00:34:25
    needed like uh and provide all the uh
  • 00:34:28
    permissions and accesses so if in case
  • 00:34:31
    like we are using uh AWS S3 so all the
  • 00:34:34
    resources that are using in
  • 00:34:36
    communicating with S3 they need the
  • 00:34:37
    permission to communicate with S3 so all
  • 00:34:40
    those things there is a terraform code I
  • 00:34:42
    guess like that has been actually taken
  • 00:34:44
    care by the platform team so they are
  • 00:34:46
    the ones who are deploying so it is a
  • 00:34:48
    separate team who is managing the infra
  • 00:34:51
    yes yes okay so uh you are also using
  • 00:34:53
    Lambda right you have used Lambda so you
  • 00:34:56
    already are aware that Lambda is a
  • 00:34:57
    limitation right where it can't have the
  • 00:35:00
    processing more than 15 minute right so
  • 00:35:02
    say you have a use case right where you
  • 00:35:04
    uh wanted to have sequential Lambda
  • 00:35:06
    trigger right so maybe once one Lambda
  • 00:35:09
    completes maybe before the timeout uh it
  • 00:35:12
    will trigger another Lambda somehow and
  • 00:35:14
    the processing would continue from that
  • 00:35:16
    execution state only uh so can you think
  • 00:35:19
    like how can we Implement such a use
  • 00:35:21
    case uh let me think uh so in case we
  • 00:35:26
    need to use Lambda only or processing
  • 00:35:29
    the data that is huge more than expected
  • 00:35:31
    which cannot be completed in 15 minutes
  • 00:35:33
    of time span so uh yeah we can uh
  • 00:35:36
    trigger another Lambda inside a Lambda
  • 00:35:39
    that we can do and uh maybe what we can
  • 00:35:42
    do is uh whatever the processing like
  • 00:35:44
    based on the time limit so let's say
  • 00:35:46
    there are three or four transformation
  • 00:35:48
    that we need to do and as part of one
  • 00:35:51
    Lambda timeout uh we can
  • 00:35:54
    just do two Transformations so we can
  • 00:35:57
    read the file we can perform those two
  • 00:35:58
    transformation and we can uh get a
  • 00:36:00
    output file and we can place it again on
  • 00:36:02
    some other S3 Zone uh we can create some
  • 00:36:05
    intermediate Zone in S3 we can do that
  • 00:36:08
    and uh that can be passed as a argument
  • 00:36:11
    in like as a variable in Lambda like the
  • 00:36:14
    another Lambda that will be invoked
  • 00:36:15
    inside this Lambda only so maybe that we
  • 00:36:18
    can do or uh we can uh do one more thing
  • 00:36:22
    instead of like creating the f file we
  • 00:36:24
    can uh dump the data like whatever the
  • 00:36:26
    process processing we have done we can
  • 00:36:27
    dump the data into our red shift and uh
  • 00:36:30
    from that point of state only it will
  • 00:36:33
    fetch the recent data that has been that
  • 00:36:35
    that is there in red shift and it can
  • 00:36:36
    continue working on it okay so say you
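One way to make the second option, resuming from state kept in Redshift, concrete is a small checkpoint table that the follow-up Lambda reads before it continues. A minimal sketch in Redshift-flavoured SQL; the table, columns, and pipeline name are hypothetical and not from the original use case:

    -- Hypothetical checkpoint table written by the first Lambda before it times out
    CREATE TABLE IF NOT EXISTS etl_checkpoint (
        pipeline_name  VARCHAR(100),
        last_s3_key    VARCHAR(1024),  -- intermediate object written to the S3 staging zone
        last_step      INT,            -- e.g. 2 = first two transformations completed
        updated_at     TIMESTAMP       -- set by the Lambda when it writes the row
    );

    -- The follow-up Lambda reads the latest state and resumes from there
    SELECT last_s3_key, last_step
    FROM etl_checkpoint
    WHERE pipeline_name = 'policy_transform'
    ORDER BY updated_at DESC
    LIMIT 1;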
• 00:36:40
Okay. Say you are using Athena for some ad hoc analysis. In the table definition you have right now there are no partitions; maybe the next day the data changes and now there are partitions. Will Athena be able to detect the partitions automatically, or do you need to run some commands so that it can identify them?

As far as I remember, in Athena we need to declare the partitions, because that is how the query returns results faster, it works on the partitions. So as far as I remember we do need to register the partitions.

And will creating an external table without partitions give us an error?

I'm not aware of that, I'll have to look into it.
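In practice, Athena does not pick up new partitions automatically for a conventionally partitioned external table (partition projection aside); they have to be registered in the catalog, typically with one of the statements below. Creating an external table without partitions does not raise an error; Athena simply scans the whole table location on every query. The table name, partition key, and S3 path here are made up:

    -- Register a single new partition explicitly
    ALTER TABLE insurance_raw
    ADD IF NOT EXISTS PARTITION (load_date = '2024-01-15')
    LOCATION 's3://example-bucket/raw/load_date=2024-01-15/';

    -- Or scan the table's S3 location and add any Hive-style partitions found there
    MSCK REPAIR TABLE insurance_raw;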
• 00:37:34
Does Athena physically load the data? It is also maintaining a catalog and a metastore, so will it load the data entirely and then do the processing?

No, it is not loading the data; it just refers to the data where it already sits, in S3. It is not loading it anywhere.

So can we have the Athena service in one region doing the analysis on data in an S3 bucket in another region? Can we have that kind of cross-region communication, or is it mandatory for Athena to be in the same region as the S3 bucket?

I'm not aware of that.
• 00:38:15
Okay. Can you please share your screen? I have a SQL question.

Let me know if my screen is visible.

Yes, your screen is visible. Can you see the chat? Let me explain the question first. Here we have three different tables; consider this e-commerce data. We have the orders data, we have the products information, and we have the order details. Order details is the more granular one: for a single order ID you will have the different products, the product ID, how many units were purchased, and the unit price. The orders table is at a much higher level of granularity: the order ID, the order date, the total amount. And the products table holds the product ID with its corresponding name and category. So you have these three tables, and what you need to find with a SQL query is the top-selling product. Say you are generating a report for a customer or for some stakeholders and you want to show them the top-selling product out of whatever we offer, and the criterion for the top-selling product is the highest revenue.

What is the criterion again for the highest revenue?

We have to segregate it by category, for the last quarter. Say you are building a quarterly report: within each category, what is the top-selling product, characterized by the highest revenue? You need to write a SQL query for this. Let me know if you have any questions.

Okay. Is the highest revenue based on the total amount we have here?

You can ignore that total amount column, it might not be accurate. Use the order details, the quantity and the unit price, for the revenue.
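Based on the columns described above, the three tables could be sketched roughly as follows; the exact names and types are assumptions, and only the columns mentioned in the discussion are included:

    -- Hypothetical shapes of the three tables in the exercise
    CREATE TABLE orders (
        order_id      INT,
        order_date    DATE,
        total_amount  DECIMAL(12, 2)   -- ignored for this exercise
    );

    CREATE TABLE order_details (
        order_id    INT,
        product_id  INT,
        quantity    INT,
        unit_price  DECIMAL(12, 2)
    );

    CREATE TABLE products (
        product_id    INT,
        product_name  VARCHAR(100),
        category      VARCHAR(100)
    );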
• 00:40:26
Okay. So first, I'm thinking we only need the data for the last quarter, say the last three months, which we can filter on the order date. Just writing the approach down: filter the last three months of data from the orders table. Then we need to join orders and order details together on the order ID to get the quantity and the unit price, and multiply the number of units by the unit price to get the amount, since we are ignoring the existing total amount column.

What type of join will you perform between these two tables?

We need all the orders, so we could use a left join. But I'm assuming order details will have rows for every order, because it is at the granular level of the orders, so even with an inner join I don't think there will be any loss of data.
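A sketch of that first step, reusing the hypothetical table names above; the "last quarter" filter is approximated as the last three months, and the date arithmetic will vary slightly by SQL engine:

    -- Revenue per order line for roughly the last quarter
    WITH line_revenue AS (
        SELECT
            od.product_id,
            od.quantity * od.unit_price AS revenue
        FROM orders o
        JOIN order_details od
            ON od.order_id = o.order_id   -- inner join; assumes every order has detail rows
        WHERE o.order_date >= CURRENT_DATE - INTERVAL '3' MONTH
    )
    SELECT * FROM line_revenue;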
• 00:42:12
Okay. So now I have my amount and I have filtered on the order date. Next I need to group the data by product and take the sum of that amount, which basically gives the revenue. And since it has to be within each category for the last quarter, we also need the category, so we join the products table as well, on product ID, and take the category from there. Now we have all the data and we can group it on product ID. The product ID is unique, right? So it will automatically map to a single category; we are treating these as different products, correct?

Yeah.

Okay, so we can just group it on product ID and take the total sum of that amount.

Can you explain why we are doing the grouping on the product ID?

We need the top-selling product, and the top-selling product is the one that has generated the highest revenue. I have the product ID, with the product name and category attached to it, and the product ID is the unique one. So if I group on product ID, then for a particular product, say product ID 101, I am adding up the revenue generated by every order of that product, and I get the total revenue for that product. Doing the same for every product, grouping on product ID gives me all the products and the revenue generated from all of their orders.
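Extending the earlier sketch with the products join and the grouping just described (same hypothetical names):

    -- Total revenue per product, with its category, for roughly the last quarter
    SELECT
        p.product_id,
        p.product_name,
        p.category,
        SUM(od.quantity * od.unit_price) AS revenue
    FROM orders o
    JOIN order_details od ON od.order_id = o.order_id
    JOIN products p       ON p.product_id = od.product_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '3' MONTH
    GROUP BY p.product_id, p.product_name, p.category;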
• 00:44:22
So what will you have in your select query in this case?

Okay, in my select query I will have these two, the quantity and the unit price, and I can use just the product ID, because I'm grouping by product ID. But in the question we have — can you read out the question again?

What would be the top-selling product.

Okay, so we need the product name and the product category. What are all the columns we need in the output?

We would need the category, then the product name, and the total revenue, the highest revenue that is there.

Category, product name, and revenue. Okay. So instead of grouping on product ID I can group on product name and category, because the combination of the two is unique, and in the select I can use the same columns. Will it give the final — no, because I just need the top-selling product, so maybe I need to order it by this revenue in descending order; then I have the highest revenue first and the lowest at the bottom, and I'm thinking to limit it to one, so it gives me just the one row with the highest revenue.

But we require it for each category. We don't need a single row; we need the top-selling product for each category.

Okay, okay. In that case we'll have to go with a window function here.

So what would your window condition be?

In the PARTITION BY we would put the product name and category — sorry, maybe just the category in the PARTITION BY, so we partition the data by category, and we order the data by this revenue in descending order. Then we can give it a name and filter it out afterwards. Something like: I'm ranking my data, let's say over name and category, I already have this as revenue, descending, and we give it a rank; later on — or maybe use QUALIFY for this, but that is only supported in some SQL engines, not in all databases.

Yeah, yes.

So in that case we can keep it as a subquery and afterwards just take rank one, so for every category, whichever product has the highest revenue, we are selecting that.
• 00:48:02
Can you scroll to the products table? In the products table, say you have a category called Business Card, and within it you have different product names, A and D. As per this data — I'm just assuming — say A is the highest-selling product. You don't need two separate rows for Business Card, just a single row with Business Card, A, and the highest revenue. So will your SQL query give me a single row for a single category?

No. In that case — I was assuming that the combination of product name and category would be unique. If this is the case, then I can group it on category only; then it will sum up all the revenues from the A and D product names and, whichever is the highest, I guess it will give me the highest one.

And in case the two different products have the same revenue, what would the output of the rank be?

It will give me the same — because I am using RANK, if the revenue is the same they will have the same rank.

So will it give me two different rows in the output?

Yes, I guess so, because the next one will skip a rank when you are using the RANK function.
• 00:49:39
So say we want to add another clause. We just need a single row per category, so we add another condition: when there is the same revenue for different products, we go with alphabetical ordering of the product name, and whichever product name comes first alphabetically, we want to include that row. How will you include this condition in your query?

Okay, so only in the case where A and D have the same revenue within the category Business Card, I just want to display A. I can compare the product names, and whichever one is smaller should be the one returned. How will I compare the two... Okay, so now I have the two rows. What I can do is use a ROW_NUMBER function partitioned by the category and ordered by the product name ascending — just a minute — so if I use a ROW_NUMBER function, partition by category, and order by product name, then whatever comes first alphabetically gets row number one and the second one, D, gets row number two, and then I can filter on row number one here as well.

So you will apply this condition after all the above logic, and add another window function to incorporate it?

Or — I already order by revenue in descending order, so I can add the condition right there: first it orders by the revenue, and then it orders by the product name, within the category itself.

Okay, so we won't need the second window function. Will the condition work like this?

Yeah, I'm thinking it should work.
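The final adjustment discussed is just an extra key in the window's ORDER BY; switching from RANK to ROW_NUMBER also guarantees exactly one row per category. Revising the ranked step of the sketch above (names still hypothetical):

    -- Ranked step revised to break revenue ties alphabetically
    SELECT
        category,
        product_name,
        revenue,
        ROW_NUMBER() OVER (
            PARTITION BY category
            ORDER BY revenue DESC, product_name ASC
        ) AS rn
    FROM product_revenue;
    -- ...and the outer query then keeps only the rows where rn = 1, as before.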
• 00:52:45
Okay, I'm good from my side. Do you have any questions for me?

Yes — do you have any feedback?

Yes. You were very good at the SQL. You are able to think logically and you understand the data. The only important thing is the assumptions: don't assume anything. You have the data set in front of you — the products table is there, with the different categories and the different product names, and for the same category there are different product names, so it is not unique. If you are making an assumption, always clarify with the interviewer whether that assumption is right or wrong. That is one part; otherwise your logic is very good, you are able to think through the problem and demonstrate your SQL capabilities.

On the engineering side, the main other thing you need to focus on is the data security part. When we use the cloud, the cloud is not secure by default; we have to make our services secure. So think about the different things we can put in place — the question I asked about how you would secure your S3 bucket, or the PII data. There is encryption, there is server-side logging, and there are the different IAM roles, so we can leverage IAM roles and policies, and we have to follow least-privilege access. Sometimes I have seen code where people just put a star and grant all permissions for a service; AWS does not recommend that, so always keep the permissions as granular as possible. That is again very important. Otherwise I found you good on the design side, but security is another thing we should focus on as data engineers.

Yes, definitely, I'll look into it.

And maybe a little bit more on the different services. You are also good with DynamoDB and ElasticSearch and why the two are used together — there can be a question like why we are replicating the data between two different databases, but you are clear on your use case and why there is a need; it is always use-case and implementation specific, and you are clear on that point. So I think I'm good. Just focus on those little details, and on the infrastructure part; otherwise it's good.
Tags
  • AWS
  • Data Engineering
  • ETL
  • AWS Lambda
  • Data Security
  • DynamoDB
  • ElasticSearch
  • Data Reconciliation
  • SQL
  • S3 Partitioning