AI-Powered Data Discovery for Modern Data Governance – Peggy Tsai, BigID

00:27:36
https://www.youtube.com/watch?v=DvJ0desvO2Y

Summary

TL;DR: In her presentation, Peggy Tsai from BigID discusses the importance of AI-powered data discovery in establishing a modern data governance program. Drawing on her experience in financial services, she identifies key challenges in data governance, particularly in locating data and assessing its quality. Tsai emphasizes the need for a comprehensive approach that integrates data catalogs, quality monitoring, and automated discovery to manage both structured and unstructured data effectively. She introduces BigID's capabilities for data classification, clustering, correlation, and compliance with privacy regulations, illustrating their application through industry use cases. The presentation underscores the significance of understanding unstructured data and leveraging machine learning to strengthen data governance capabilities.

Key Takeaways

  • 🌊 Data governance challenges resemble an iceberg, with unstructured data hidden beneath the surface.
  • 📊 A comprehensive data catalog is essential for understanding data assets and their locations.
  • 🤖 Automation in data discovery reduces manual tasks and enhances efficiency.
  • 🔍 Machine learning aids in the classification and labeling of data, saving time.
  • 📈 Data quality monitoring provides actionable insights for data governance teams.
  • 🗂️ Clustering helps identify duplicate data, optimizing storage and management.
  • 🔒 Compliance with privacy regulations is critical for organizations handling sensitive data.
  • 🌐 A modern data governance program integrates structured and unstructured data for holistic management.
  • 📅 Continuous monitoring ensures that new data is accounted for in governance processes.
  • 💡 Understanding unstructured data is key to mitigating risks in data governance.

Timeline

  • 00:00:00 - 00:05:00

    Peggy Tsai introduces herself and her background in data governance, particularly in the financial services industry. She highlights the challenges organizations face in data governance, such as locating data, understanding its quality, and managing unstructured data, which often poses greater risks than known data sources.

  • 00:05:00 - 00:10:00

    Tsai compares data sources to an iceberg, where only a small portion is visible above the water. She emphasizes the importance of identifying and managing unstructured and dark data, which can pose a significant risk to organizations, and discusses the need for a cohesive data governance program that integrates all data sources, along with traditional methods of data discovery.

  • 00:10:00 - 00:15:00

    The presentation outlines the role of a Chief Data Officer in managing data governance programs, emphasizing the need for a comprehensive data dashboard to consolidate data quality, cataloging, and remediation issues. Tsai discusses the importance of data dictionaries and the challenges of manual data tagging, which can be resource-intensive.

  • 00:15:00 - 00:20:00

    Tsai introduces the concept of automated data discovery, which can alleviate the manual tasks associated with data governance. She emphasizes the need for continuous monitoring of data and the importance of linking logical assets to physical data locations to enhance data governance processes.

  • 00:20:00 - 00:27:36

    The presentation concludes with a discussion of BigID's data discovery solutions, highlighting the importance of extensible data coverage, automated classification, and machine learning in managing data governance. Tsai shares use cases from organizations that have successfully implemented these solutions to comply with privacy regulations and improve their data governance programs.

Video Q&A

  • What is the main focus of Peggy Tsai's presentation?

    The main focus is on AI-powered data discovery for building a modern data governance program.

  • What challenges in data governance does Peggy highlight?

    She highlights challenges such as finding data, understanding data quality, and managing unstructured data.

  • What is the significance of a data catalog in data governance?

    A data catalog provides a single view of all data assets, helping organizations understand their data landscape.

  • How does automation benefit data governance?

    Automation helps reduce manual tasks, enabling continuous monitoring and efficient data management.

  • What role does machine learning play in data classification?

    Machine learning automates the classification of data, improving efficiency and accuracy in identifying sensitive information.

  • Can you give an example of a use case mentioned in the presentation?

    One use case involves a global athletic brand needing to comply with GDPR by classifying personal information across multiple data sources.

  • What industries can benefit from BigID's data discovery solutions?

    Industries such as financial services, healthcare, retail, and insurance can benefit from these solutions.

  • What is the importance of understanding unstructured data?

    Unstructured data often poses greater risks and compliance challenges, making it crucial to identify and manage.

  • What is clustering in the context of data governance?

    Clustering helps identify duplicate data and optimize storage during data migrations.

  • What is the ultimate goal of a modern data governance program?

    The goal is to know your data, where it is, and its quality to ensure effective governance.

Transcript (English)
[00:00:11] Hello, good morning. My name is Peggy Tsai from BigID. Today I will be talking about AI-powered data discovery for building a modern data governance program.

[00:00:22] Before I joined BigID a year ago, I was in your position, helping to run and operationalize data management programs. I worked mainly in the financial services industry. My most recent role was at Morgan Stanley, where I helped operationalize their data governance program for wealth management. I also worked on other regulatory and business initiatives that focused on understanding critical data elements, ensuring data remediation, and monitoring data quality. I also recently co-authored the AI Book, which was published in May 2020.

[00:00:59] A lot of what I'll be talking about today is based on my experience as a data steward working in various financial institutions, and on the complexities I personally felt when building out and broadening our data governance program.
[00:01:16] When we talk to customers today, and based on my own experience, the main challenges with data governance are finding your data, knowing where your data is, and being able to understand the quality of that data.

[00:01:30] In many organizations I like to picture the data sources as an iceberg. The part of the iceberg above the water is normally much smaller than what's actually beneath the water. Above the water are the known data sources in your organization, the ones that are inventoried and cataloged. That's mostly the structured data, and it's what your technology and data management teams are working with and able to understand, because it's all inventoried and cataloged.

[00:02:05] But what about the data that resides underneath the water? That's where a lot of the risk lies, because the data that's not known often poses a larger risk to your organization. This is most often unstructured data, dark data, data that hasn't been labeled, classified, or fully understood.

[00:02:31] The reason it's difficult to understand this data is simply its massive quantity and the sheer amount of resources it takes to find and identify it. But there's really an opportunity here, not just a challenge: the opportunity to break down the silos of the data, whether it sits within one part of the organization or simply hasn't been used, and to bring it all together into one cohesive data governance program, managing it for all business purposes.

[00:03:12] The way technology teams traditionally find data today is by writing code or a query to extract and find that data. That's very simple when the data is known and in a structured format, but the challenges and the risk lie where the data is in document form or hasn't been fully classified yet.
[00:03:47] When we talk to customers about finding their critical or sensitive data across all their data sources, they rely on traditional approaches. In addition to querying data in a straightforward manner, there are other techniques used in the marketplace today to find data: patterns, for example, a credit card number has a specific pattern in its sequence of numbers, and regular expressions, another way to find things like locations, or, for an address, the city, state, and zip code.
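A minimal Python sketch of the pattern-based scanning described above; the regular expressions and the sample value are illustrative assumptions, not BigID's classifiers:

```python
import re

# Illustrative patterns only: real classifiers are more robust
# (checksums, surrounding context, international formats).
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_value(value: str) -> list[str]:
    """Return the names of all patterns that match a single value."""
    return [name for name, rx in PATTERNS.items() if rx.search(value)]

if __name__ == "__main__":
    sample = "Ship to 10001-1234, card 4111 1111 1111 1111, contact ann@example.com"
    print(scan_value(sample))  # ['credit_card', 'us_zip_code', 'email']
```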
[00:04:37] Another approach, familiar to me as a former data steward, is building out a data dictionary that helps find the data. This involves manually tagging the data and putting labels on it. These are very time-intensive, resource-intensive activities, all to attach the metadata that describes your actual data. There are various levels of categories with regard to the risk level, the usage level, and the actual content level, and these are largely manual approaches to tagging the data.
[00:05:20] Now, when I think about a data governance program, which is usually led by a chief data officer, and many organizations have a chief data officer today, the way they run their program, whether they're looking at data quality, business glossary assets, or data issues, they normally have to go to different applications to see this information. It's disjointed, and it leads to an inability to really make decisions and understand the health of the entire data organization. What's lacking today is what I call a control center: a comprehensive data dashboard that brings the catalog, data quality, and data remediation issues together into one singular place, so that the chief data officer and the data teams can gain the right insights into their data and take the right actions.
[00:06:29] Now I'm going to talk about some parts of a data governance program that I have personally been involved in. First, the data catalog. People sometimes call this a business dictionary, a data dictionary, or a precursor to an inventory. It's really a single place, a listing of all the data assets an organization has across its landscape. This can include the business name and the definition, and it also identifies the actual location of the data. That's really important for a data governance team: you know not only what you have in terms of your logical data assets, but also where each of them resides.
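For illustration, a minimal sketch of what a single catalog entry could capture, using assumed field names rather than any specific product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalLocation:
    """One place a logical asset is instantiated."""
    data_source: str   # e.g. "crm_postgres"
    table: str
    column: str

@dataclass
class CatalogEntry:
    """A logical data asset as a simple data catalog might describe it."""
    business_name: str
    definition: str
    sensitivity: str = "unclassified"   # e.g. "public", "confidential", "pii"
    locations: list[PhysicalLocation] = field(default_factory=list)

entry = CatalogEntry(
    business_name="Customer Email Address",
    definition="Primary email used to contact a customer.",
    sensitivity="pii",
    locations=[PhysicalLocation("crm_postgres", "customers", "email")],
)
```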
[00:07:24] This single view of data that a lot of organizations are looking to build involves collaboration. Mainly these are people with the title of data steward, who work to collaborate on, integrate, and enrich these logical assets and bring more value to the data. Because they're enriching the data, they're providing more context on how the data is actually used in the business sense, and all of that information and knowledge should be collected and documented in a single place.

[00:08:02] Another big component of a data governance program is data quality, and sometimes people see this as the most important component. The reason is that it's very measurable; it's the most visual way of seeing the progress of a data governance program. Whether it's trends or seeing the data quality score increase, decrease, or change, it gives the chief data officer and the data team something actionable to work on. So data quality, and being able to see it holistically, is very important.
[00:08:38] To begin with, in any data governance program, before you actually find the data and know your data, we focus a lot on data discovery. I think it's an important concept for having a strong data management program, because you really need to know where all your assets are in your organization. One of the biggest pain points chief data officers have shared with us, the kind of thing that keeps them up at night, is: what new data is being created or ingested in my organization that I'm just not aware of? Referencing back to the iceberg, what is the data below the waterline that I'm not sure I have, that I'm not sure is part of my governance processes? They want a complete understanding of the coverage of the data within the organization.
[00:09:41] Secondly, automated data discovery gives you the ability to automate a lot of the manual tasks that go on today within a governance team. For example, teams not only have to know what their data is, they have to know where it is, linking their logical assets to the physical objects: which tables and columns something like an email address is saved in. Your customer database, your sales and marketing systems, and your financial database may all hold this information, and you need to be able to document all the instantiations of this data. You don't want to do this manually, yet oftentimes it is a joint manual effort across teams, and this is where automation can really help modernize and reduce a lot of these activities.
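A minimal sketch of that linking step, assuming a SQLite database purely for illustration: sample each column, test values against a simple email pattern, and record which physical table and column pairs appear to hold the logical "email address" asset. The pattern and threshold are assumptions, not BigID's method:

```python
import re
import sqlite3

EMAIL_RX = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def find_email_columns(db_path: str, sample_rows: int = 100, min_hit_rate: float = 0.5):
    """Return (table, column) pairs whose sampled values look like email addresses."""
    hits = []
    with sqlite3.connect(db_path) as conn:
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for table in tables:
            columns = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
            for column in columns:
                rows = conn.execute(
                    f"SELECT {column} FROM {table} LIMIT ?", (sample_rows,)).fetchall()
                values = [str(r[0]) for r in rows if r[0] is not None]
                if not values:
                    continue
                rate = sum(bool(EMAIL_RX.search(v)) for v in values) / len(values)
                if rate >= min_hit_rate:
                    hits.append((table, column))
    return hits

# Usage: find_email_columns("crm.db") might return [("customers", "email"), ("leads", "contact")]
```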
[00:10:37] And by bringing in automation, you're able to put a continuous monitoring process in place. You don't have to rely on a person, a resource, to do the checking, the curation, and that type of validation on new data that may have popped up in your organization. Being able to show an auditor that you have an automated, continuous monitoring process certainly helps to increase the overall maturity of your program.
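A minimal sketch of that kind of continuous monitoring, assuming you can list the sources that exist and the sources already cataloged; the function and source names are illustrative:

```python
def discover_sources() -> set[str]:
    """Placeholder: in practice, enumerate databases, buckets, file shares, etc."""
    return {"crm_postgres", "sales_s3_bucket", "finance_warehouse", "new_hr_share"}

def cataloged_sources() -> set[str]:
    """Placeholder: read the list of sources already registered in the catalog."""
    return {"crm_postgres", "sales_s3_bucket", "finance_warehouse"}

def check_for_uncataloged() -> set[str]:
    """Return data sources that exist but are not yet governed or cataloged."""
    unknown = discover_sources() - cataloged_sources()
    for source in sorted(unknown):
        print(f"ALERT: uncataloged data source detected: {source}")
    return unknown

if __name__ == "__main__":
    # Run on a schedule (e.g. daily via cron) so new sources never go unnoticed.
    check_for_uncataloged()   # -> {'new_hr_share'}
```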
[00:11:10] Now I'm going to spend a slide or two on BigID data discovery specifically, the product itself, and use it as a comparison for how we do data discovery. First of all, the most important thing is extensible data coverage, being able to connect to all data sources. You don't want to build a governance program that's very siloed and only focuses on one or two data sources, because it won't be leverageable and scalable across your entire organization. Your Hadoop clusters, your SAP and Workday systems, Google Cloud if you're using it: all of the possible data sources need to be connected, and the results built and produced into what we call a catalog.

[00:12:07] A catalog is really our ability not only to collect your metadata, your technical, business, and operational metadata, but also to bring your structured and unstructured data together into one view.
[00:12:26] Classification is really important, because I remember as a data steward spending manual time looking through each of the data values and identifying the sensitivity level. Being able to automate the sensitivity level, the risk level, and the actual content, and to classify and label all of that through machine learning, is such a time saver today, and it's much more efficient. Leveraging machine learning to do this helps complete the classification of all the data in your organization, structured and unstructured, including documents. That's really important because it provides faster time to value within your organization: the data can be consumed faster and earlier in the data life cycle by your analytics and data science teams. So we see really big benefits in terms of classification.
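A minimal sketch of machine-learning text classification for sensitivity labels, using scikit-learn on a toy training set; the labels and examples are illustrative assumptions, not the models described in the talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples; a real program would train on reviewed samples.
texts = [
    "patient diagnosis and treatment notes",
    "employee salary and bank account details",
    "cafeteria menu for next week",
    "quarterly press release draft",
    "customer home address and phone number",
    "public holiday schedule",
]
labels = ["sensitive", "sensitive", "public", "public", "sensitive", "public"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

for doc in ["wire transfer and account number attached", "office picnic agenda"]:
    print(doc, "->", model.predict([doc])[0])
```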
[00:13:36] Cluster analysis is one of our patented methodologies. It leverages machine learning to understand groupings of your data that are probably duplicates or very similar. Right now we see it mostly in the unstructured world, where you want to know how many copies of a document or an Excel file you have saved. That matters when you're trying to save storage space and cut down on how many duplicates you have, or when you're doing a data lake or cloud migration and you really want to keep only the golden copies of your data. Being able to do that analysis in a smart way is where cluster analysis comes in.
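A minimal sketch of grouping likely-duplicate documents by text similarity; the TF-IDF representation, the 0.6 threshold, and the sample titles are assumptions for illustration, not the patented method mentioned in the talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Q3 sales report final version",
    "Q3 sales report final version (copy)",
    "Q3 sales report FINAL version v2",
    "Employee onboarding checklist",
    "Onboarding checklist for new employees",
]

vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)          # pairwise similarity matrix

THRESHOLD = 0.6                           # assumed cut-off for "near duplicate"
cluster_of = list(range(len(docs)))       # start with each doc in its own cluster
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] >= THRESHOLD:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if c == old else c for c in cluster_of]

for cluster, doc in sorted(zip(cluster_of, docs)):
    print(cluster, doc)                   # same number = likely duplicate group
```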
[00:14:33] And lastly, correlation. This is really critical for compliance with privacy regulations like GDPR and, in the United States, the California Consumer Privacy Act. It helps because you not only have to identify what counts as personal information within your data organization; there's also the concept of personally identifiable information. Information ranging from your health and medical records to your cookie settings and IP records is identifiable information that may not necessarily be tied to an individual in the database where it's saved. It's probably linked to information saved in other databases, but it still describes a single person or entity, and it therefore needs to be produced when a data subject access request comes in. That's where it gets tricky for technology teams to find this information directly, because these are indirect attributes tied to a person. Correlation has been really key in helping many organizations be compliant with privacy protection laws, and in finding and correlating all related information that's tied to a person.
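A minimal sketch of correlating indirect attributes back to a single data subject for an access request, using two toy record sets joined on an assumed shared key; the table and field names are illustrative:

```python
# Records keyed directly by an identifier we know (email).
crm = [
    {"customer_id": "C-17", "email": "ann@example.com", "name": "Ann Lee"},
]

# Records with no name or email, only indirect attributes (cookie id, IP).
web_analytics = [
    {"cookie_id": "ck-9", "ip": "203.0.113.7", "customer_id": "C-17"},
    {"cookie_id": "ck-4", "ip": "198.51.100.2", "customer_id": "C-88"},
]

def records_for_subject(email: str) -> list[dict]:
    """Collect every record, direct or indirect, tied to one data subject."""
    direct = [r for r in crm if r["email"] == email]
    ids = {r["customer_id"] for r in direct}
    indirect = [r for r in web_analytics if r["customer_id"] in ids]
    return direct + indirect

print(records_for_subject("ann@example.com"))
# direct CRM record plus the cookie/IP record correlated through customer_id
```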
[00:16:01] When chief data officers and data management programs have this foundation, a single source of truth built on automation, machine learning, and smarter insights, then on top of it we can build very specific capabilities for privacy and security, and, when it comes to governance, smarter data quality and data retention rules. Because we're already reading through the files, we know when a file was last updated or opened, and we can compare that to the relevant policy. Things that are already being done today around a catalog and data quality rules can run more efficiently and scale better with the AI discovery foundation.
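A minimal sketch of a retention check of that kind, comparing each file's last-modified time to an assumed per-folder policy; the paths and limits are assumptions:

```python
import time
from pathlib import Path

RETENTION_DAYS = {"logs": 90, "contracts": 7 * 365}   # assumed policy, per top folder

def files_past_retention(root: str) -> list[Path]:
    """List files older than the retention period for their top-level folder."""
    now = time.time()
    stale = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        category = path.relative_to(root).parts[0]
        max_days = RETENTION_DAYS.get(category)
        if max_days is None:
            continue
        age_days = (now - path.stat().st_mtime) / 86400
        if age_days > max_days:
            stale.append(path)
    return stale

# Usage: files_past_retention("/data/shared") -> candidates for archival or deletion review
```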
[00:17:02] Just to sum up a bit: machine-learning, AI-led discovery leverages machine learning to identify content faster and smarter and to label it using natural language processing. Enriching the data is automated through these classifiers. And you're able to take action on your data in a smarter way, because you know which clusters are duplicates and you know that everything correlated or related is tied back to a person or entity. So it's really about intelligent classification.

[00:17:46] And what are the business benefits we've seen from our customers and the many people we've spoken to? In terms of a catalog, many organizations have to document their processes and their information. Instead of doing it manually, a catalog that is automatically collected and updated reduces the work that has to go into that documentation process. You also want a catalog that holds more than structured information: you want to see all your data assets holistically in one place. If that's possible, then seeing the impact one change can have on other applications and documents becomes much easier, because you have a full understanding of the landscape.

[00:18:50] Also, within your catalog you're not only seeing the classifiers, but really that single dashboard view I spoke about, where chief data officers can see their assets, see the classifications of what the data is, and see the profiling statistics within data quality, to understand the completeness of the data: is this a data value with missing information that I should be alerted to and take further action on? That's the value of this type of information.
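A minimal sketch of the completeness profiling mentioned here, computing per-column missing-value rates with pandas and flagging anything above an assumed threshold; the sample data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C-1", "C-2", "C-3", "C-4"],
    "email":       ["a@x.com", None, "c@x.com", None],
    "postal_code": ["10001", "94105", None, "60601"],
})

ALERT_THRESHOLD = 0.25   # assumed: alert when more than 25% of values are missing

def completeness_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness with a flag for columns that need attention."""
    missing_rate = frame.isna().mean()
    return pd.DataFrame({
        "completeness": 1 - missing_rate,
        "needs_attention": missing_rate > ALERT_THRESHOLD,
    })

print(completeness_report(df))
# 'email' would be flagged here (50% missing); 'postal_code' (25%) would not.
```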
[00:19:25] And again, classification: one of the biggest pain points as a data steward is not really being able to understand the data. Taking the time to review the data values and then confirm them with a subject matter expert takes time, and imagine doing that for every single data set and data element; it's really time consuming. Being able to leverage AI to your benefit, and to have these labels applied earlier in life cycle management so they can be consumed by your data science and analytics teams or any other business user, means everyone is clear on exactly what this data is and what it should be used for.

[00:20:14] I spoke briefly about clustering, and it has certainly been used in our cloud migration use cases. Many organizations are beginning, or are well along on, their journey to the cloud, and it's really not a simple lift-and-shift activity. It takes careful analysis to prioritize the data sets that should be migrated over and then, within each data set, to figure out exactly which pieces should go into the cloud. Before that journey can be successful, you first need a very complete inventory of what you have, where it is, and exactly where it's being migrated. You also want to take the opportunity to clean up your data: see all the duplicates, reduce the redundancies, and fix any data quality issues before you do the migration. These are the kinds of activities that need to be done, but how can you do them in a smart way? A technique like clustering can help you reduce the risk and make sure you're adhering to your organization's data policies while you're doing the cloud migration. And then, once the data is migrated to the cloud, you need to connect that data to your other data sources and structures and make sure everything stays in alignment. That's another consideration: the maintenance part of your cloud journey, even after the migration is done.
[00:22:00] Now I'm going to talk about some specific use cases we've seen. Many of the organizations we've worked with started with the need to run a privacy or some other regulatory-driven initiative. The first use case is a global athletic brand, a retail company with over 150 billion dollars in revenue, 73,000 employees, and over 1,200 data sources, which needed to comply with GDPR and local privacy laws. They needed to look across their entire landscape and identify what is sensitive: to classify and label specifically the personal information tied to a customer or employee, as well as personally identifiable information. I see that need to classify across use cases, whether in privacy, security, or data governance, and in other organizations that run dedicated classification projects. From there, the results go into a catalog that can display the classification, bring together the different product lines in the retail company, and help align and manage not only their products but also their customer and employee data, making sure they stay in compliance with privacy requirements.

[00:23:40] Over the last few years this company has also looked into expanding their data governance program, because they've seen the benefits of the data catalog and how it's helped them identify sensitive, private data. They're expanding the definition of sensitivity to other types of critical data, around their products and the shipping of their products, other critical and confidential data they want to make sure is also monitored, and being able to classify and tag that has been quite important for them.
[00:24:21] The second use case is much larger in scale: a global marketing and digital brand portfolio company. Within this portfolio they have assets in finance, healthcare, and retail, so it's really a multi-billion-dollar company with multiple brands that have to be kept siloed, while the company still needs to understand its privacy initiatives. Because of their status as a global company, there are multiple regulations they have to comply with. How do they juggle and manage the sometimes different and distinct regulations in each local region that they have to be mindful of? They need to comply with HIPAA in healthcare, find their personal information, and bring together their structured and unstructured data, with multiple data sources across the globe to keep track of. In terms of federation, they need a singular view, but also a local view in which the local offices can manage their own data.
[00:25:44] So, to summarize: a modern data governance program starts with data discovery, to know your data, which leads to knowing where your data is and then to knowing the quality of your data. All three of those basic components start with data discovery. I'm advocating for expanding beyond structured sources, because it's really in unstructured data where we see a lot of the risk, and where compliance teams are raising their hands and saying they have issues getting a complete, holistic view of their data, identifying it through classification, and taking action on it based on clustering and correlation. This is where we see machine learning and AI driving these analyses, helping chief data officers modernize, scale, and grow their data governance capabilities much faster. We see a lot of traction in this area, and we welcome the chance to talk with anyone interested in learning more about BigID, data discovery, or the use cases we've seen benefit financial services, insurance, healthcare, and retail companies. Thank you, everyone, for your time today, and be sure to visit BigID at our virtual booth. Thank you.
Tags
  • Data Governance
  • AI
  • Data Discovery
  • Data Quality
  • Machine Learning
  • Data Catalog
  • Unstructured Data
  • Compliance
  • BigID
  • Data Management