AI-Powered Data Discovery for Modern Data Governance – Peggy Tsai, BigID

00:27:36
https://www.youtube.com/watch?v=DvJ0desvO2Y

Summary

TL;DR: In her presentation, Peggy Tsai from BigID discusses the importance of AI-powered data discovery in establishing a modern data governance program. Drawing on her experience in financial services, she identifies key challenges in data governance, particularly in locating data and assessing its quality. Tsai emphasizes the need for a comprehensive approach that integrates data catalogs, quality monitoring, and automated discovery to manage both structured and unstructured data effectively. She introduces BigID's capabilities for data classification, clustering, correlation, and compliance with privacy regulations, illustrating their application through industry use cases. The presentation underscores the significance of understanding unstructured data and leveraging machine learning to strengthen data governance capabilities.

Key Takeaways

  • 🌊 Data governance challenges resemble an iceberg, with unstructured data hidden beneath the surface.
  • 📊 A comprehensive data catalog is essential for understanding data assets and their locations.
  • 🤖 Automation in data discovery reduces manual tasks and enhances efficiency.
  • 🔍 Machine learning aids in the classification and labeling of data, saving time.
  • 📈 Data quality monitoring provides actionable insights for data governance teams.
  • 🗂️ Clustering helps identify duplicate data, optimizing storage and management.
  • 🔒 Compliance with privacy regulations is critical for organizations handling sensitive data.
  • 🌐 A modern data governance program integrates structured and unstructured data for holistic management.
  • 📅 Continuous monitoring ensures that new data is accounted for in governance processes.
  • 💡 Understanding unstructured data is key to mitigating risks in data governance.

Timeline

  • 00:00:00 - 00:05:00

    Peggy Tsai introduces herself and her background in data governance, particularly in the financial services industry. She highlights the challenges organizations face in data governance, such as locating data, understanding its quality, and managing unstructured data, which often poses greater risks than known data sources.

  • 00:05:00 - 00:10:00

    Tsai compares data sources to an iceberg, where only a small portion is visible above the water. She emphasizes the importance of identifying and managing unstructured and dark data, which can pose a significant risk to organizations, and discusses the need for a cohesive data governance program that integrates all data sources, along with traditional methods of data discovery.

  • 00:10:00 - 00:15:00

    The presentation outlines the role of a Chief Data Officer in managing data governance programs, emphasizing the need for a comprehensive data dashboard to consolidate data quality, cataloging, and remediation issues. Tsai discusses the importance of data dictionaries and the challenges of manual data tagging, which can be resource-intensive.

  • 00:15:00 - 00:20:00

    Tsai introduces the concept of automated data discovery, which can alleviate the manual tasks associated with data governance. She emphasizes the need for continuous monitoring of data and the importance of linking logical assets to physical data locations to enhance data governance processes.

  • 00:20:00 - 00:27:36

    The presentation concludes with a discussion of BigID's data discovery solutions, highlighting the importance of extensible data coverage, automated classification, and machine learning in managing data governance. Tsai shares use cases from organizations that have successfully implemented these solutions to comply with privacy regulations and improve their data governance programs.

Video Q&A

  • What is the main focus of Peggy Tsai's presentation?

    The main focus is on AI-powered data discovery for building a modern data governance program.

  • What challenges in data governance does Peggy highlight?

    She highlights challenges such as finding data, understanding data quality, and managing unstructured data.

  • What is the significance of a data catalog in data governance?

    A data catalog provides a single view of all data assets, helping organizations understand their data landscape.

  • How does automation benefit data governance?

    Automation helps reduce manual tasks, enabling continuous monitoring and efficient data management.

  • What role does machine learning play in data classification?

    Machine learning automates the classification of data, improving efficiency and accuracy in identifying sensitive information.

  • Can you give an example of a use case mentioned in the presentation?

    One use case involves a global athletic brand needing to comply with GDPR by classifying personal information across multiple data sources.

  • What industries can benefit from BigID's data discovery solutions?

    Industries such as financial services, healthcare, retail, and insurance can benefit from these solutions.

  • What is the importance of understanding unstructured data?

    Unstructured data often poses greater risks and compliance challenges, making it crucial to identify and manage.

  • What is clustering in the context of data governance?

    Clustering helps identify duplicate data and optimize storage during data migrations.

  • What is the ultimate goal of a modern data governance program?

    The goal is to know your data, where it is, and its quality to ensure effective governance.

Transcript (English)
[00:00:11] Hello, good morning. My name is Peggy Tsai from BigID. Today I will be talking about AI-powered data discovery for building a modern data governance program.

[00:00:22] Before I joined BigID a year ago, I was in your position, helping to run and operationalize data management programs. I worked mainly in the financial services industry. My most recent role was at Morgan Stanley, where I helped operationalize their data governance program for wealth management. I also worked on other regulatory and business initiatives that focused on understanding critical data elements, ensuring data remediation, and monitoring data quality. I also recently co-authored the AI Book, which was published in May 2020.

[00:00:59] A lot of what I'll be talking about today is based on my experience as a data steward working in various financial institutions, and on the complexities I personally felt when building out and broadening our data governance program.
[00:01:16] When we talk to customers today, and based on my own experience, the main challenges with data governance are finding your data, knowing where your data is, and being able to understand the quality of that data.

[00:01:30] In many organizations I like to picture the data sources as an iceberg. The part of the iceberg above the water is normally much smaller than what's actually beneath the water. Above the water are the known data sources in your organization, the ones that are inventoried and cataloged. That's mostly the structured data, and it's what your technology and data management teams are working with and able to understand, because it's all inventoried and cataloged.

[00:02:05] But what about the data that resides underneath the water? That's where a lot of the risk lies, because the data that's not known often poses a larger risk to your organization. This is most often unstructured data, dark data, data that hasn't been labeled, classified, or fully understood.

[00:02:31] The reason it's difficult to understand this data is simply its massive quantity and the sheer amount of resources it takes to find and identify it. But there's really an opportunity here, not just a challenge: the opportunity to break down the silos of the data, whether it sits within one part of the organization or simply hasn't been used, and to bring it all together into one cohesive data governance program, managing it for all business purposes.

[00:03:12] The way technology teams traditionally find data today is by writing code or a query to extract and find that data. That's very simple when the data is known and in a structured format, but the challenges and the risk lie where the data is in document form or hasn't been fully classified yet.
[00:03:47] When we talk to customers about finding their critical or sensitive data across all their data sources, they rely on traditional approaches. In addition to querying data in a straightforward manner, there are other techniques used in the marketplace today to find data: patterns, for example, a credit card number has a specific pattern in its sequence of numbers, and regular expressions, another way to find things like locations, or, for an address, the city, state, and zip code.
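A minimal Python sketch of the pattern-based scanning described above; the regular expressions and the sample value are illustrative assumptions, not BigID's classifiers:

```python
import re

# Illustrative patterns only: real classifiers are more robust
# (checksums, surrounding context, international formats).
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_value(value: str) -> list[str]:
    """Return the names of all patterns that match a single value."""
    return [name for name, rx in PATTERNS.items() if rx.search(value)]

if __name__ == "__main__":
    sample = "Ship to 10001-1234, card 4111 1111 1111 1111, contact ann@example.com"
    print(scan_value(sample))  # ['credit_card', 'us_zip_code', 'email']
```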
[00:04:37] Another approach, familiar to me as a former data steward, is building out a data dictionary that helps find the data. This involves manually tagging the data and putting labels on it. These are very time-intensive, resource-intensive activities, all to attach the metadata that describes your actual data. There are various levels of categories with regard to the risk level, the usage level, and the actual content level, and these are largely manual approaches to tagging the data.
[00:05:20] Now, when I think about a data governance program, which is usually led by a chief data officer, and many organizations have a chief data officer today, the way they run their program, whether they're looking at data quality, business glossary assets, or data issues, they normally have to go to different applications to see this information. It's disjointed, and it leads to an inability to really make decisions and understand the health of the entire data organization. What's lacking today is what I call a control center: a comprehensive data dashboard that brings the catalog, data quality, and data remediation issues together into one singular place, so that the chief data officer and the data teams can gain the right insights into their data and take the right actions.
[00:06:29] Now I'm going to talk about some parts of a data governance program that I have personally been involved in. First, the data catalog. People sometimes call this a business dictionary, a data dictionary, or a precursor to an inventory. It's really a single place, a listing of all the data assets an organization has across its landscape. This can include the business name and the definition, and it also identifies the actual location of the data. That's really important for a data governance team: you know not only what you have in terms of your logical data assets, but also where each of them resides.
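For illustration, a minimal sketch of what a single catalog entry could capture, using assumed field names rather than any specific product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalLocation:
    """One place a logical asset is instantiated."""
    data_source: str   # e.g. "crm_postgres"
    table: str
    column: str

@dataclass
class CatalogEntry:
    """A logical data asset as a simple data catalog might describe it."""
    business_name: str
    definition: str
    sensitivity: str = "unclassified"   # e.g. "public", "confidential", "pii"
    locations: list[PhysicalLocation] = field(default_factory=list)

entry = CatalogEntry(
    business_name="Customer Email Address",
    definition="Primary email used to contact a customer.",
    sensitivity="pii",
    locations=[PhysicalLocation("crm_postgres", "customers", "email")],
)
```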
[00:07:24] This single view of data that a lot of organizations are looking to build involves collaboration. Mainly these are people with the title of data steward, who work to collaborate on, integrate, and enrich these logical assets and bring more value to the data. Because they're enriching the data, they're providing more context on how the data is actually used in the business sense, and all of that information and knowledge should be collected and documented in a single place.

[00:08:02] Another big component of a data governance program is data quality, and sometimes people see this as the most important component. The reason is that it's very measurable; it's the most visual way of seeing the progress of a data governance program. Whether it's trends or seeing the data quality score increase, decrease, or change, it gives the chief data officer and the data team something actionable to work on. So data quality, and being able to see it holistically, is very important.
[00:08:38] To begin with, in any data governance program, before you actually find the data and know your data, we focus a lot on data discovery. I think it's an important concept for having a strong data management program, because you really need to know where all your assets are in your organization. One of the biggest pain points chief data officers have shared with us, the kind of thing that keeps them up at night, is: what new data is being created or ingested in my organization that I'm just not aware of? Referencing back to the iceberg, what is the data below the waterline that I'm not sure I have, that I'm not sure is part of my governance processes? They want a complete understanding of the coverage of the data within the organization.
[00:09:41] Secondly, automated data discovery gives you the ability to automate a lot of the manual tasks that go on today within a governance team. For example, teams not only have to know what their data is, they have to know where it is, linking their logical assets to the physical objects: which tables and columns something like an email address is saved in. Your customer database, your sales and marketing systems, and your financial database may all hold this information, and you need to be able to document all the instantiations of this data. You don't want to do this manually, yet oftentimes it is a joint manual effort across teams, and this is where automation can really help modernize and reduce a lot of these activities.
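A minimal sketch of that linking step, assuming a SQLite database purely for illustration: sample each column, test values against a simple email pattern, and record which physical table and column pairs appear to hold the logical "email address" asset. The pattern and threshold are assumptions, not BigID's method:

```python
import re
import sqlite3

EMAIL_RX = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def find_email_columns(db_path: str, sample_rows: int = 100, min_hit_rate: float = 0.5):
    """Return (table, column) pairs whose sampled values look like email addresses."""
    hits = []
    with sqlite3.connect(db_path) as conn:
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for table in tables:
            columns = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
            for column in columns:
                rows = conn.execute(
                    f"SELECT {column} FROM {table} LIMIT ?", (sample_rows,)).fetchall()
                values = [str(r[0]) for r in rows if r[0] is not None]
                if not values:
                    continue
                rate = sum(bool(EMAIL_RX.search(v)) for v in values) / len(values)
                if rate >= min_hit_rate:
                    hits.append((table, column))
    return hits

# Usage: find_email_columns("crm.db") might return [("customers", "email"), ("leads", "contact")]
```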
[00:10:37] And by bringing in automation, you're able to put a continuous monitoring process in place. You don't have to rely on a person, a resource, to do the checking, the curation, and that type of validation on new data that may have popped up in your organization. Being able to show an auditor that you have an automated, continuous monitoring process certainly helps to increase the overall maturity of your program.
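A minimal sketch of that kind of continuous monitoring, assuming you can list the sources that exist and the sources already cataloged; the function and source names are illustrative:

```python
def discover_sources() -> set[str]:
    """Placeholder: in practice, enumerate databases, buckets, file shares, etc."""
    return {"crm_postgres", "sales_s3_bucket", "finance_warehouse", "new_hr_share"}

def cataloged_sources() -> set[str]:
    """Placeholder: read the list of sources already registered in the catalog."""
    return {"crm_postgres", "sales_s3_bucket", "finance_warehouse"}

def check_for_uncataloged() -> set[str]:
    """Return data sources that exist but are not yet governed or cataloged."""
    unknown = discover_sources() - cataloged_sources()
    for source in sorted(unknown):
        print(f"ALERT: uncataloged data source detected: {source}")
    return unknown

if __name__ == "__main__":
    # Run on a schedule (e.g. daily via cron) so new sources never go unnoticed.
    check_for_uncataloged()   # -> {'new_hr_share'}
```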
[00:11:10] Now I'm going to spend a slide or two on BigID data discovery specifically, the product itself, and use it as a comparison for how we do data discovery. First of all, the most important thing is extensible data coverage, being able to connect to all data sources. You don't want to build a governance program that's very siloed and only focuses on one or two data sources, because it won't be leverageable and scalable across your entire organization. Your Hadoop clusters, your SAP and Workday systems, Google Cloud if you're using it: all of the possible data sources need to be connected, and the results built and produced into what we call a catalog.

[00:12:07] A catalog is really our ability not only to collect your metadata, your technical, business, and operational metadata, but also to bring your structured and unstructured data together into one view.
[00:12:26] Classification is really important, because I remember as a data steward spending manual time looking through each of the data values and identifying the sensitivity level. Being able to automate the sensitivity level, the risk level, and the actual content, and to classify and label all of that through machine learning, is such a time saver today, and it's much more efficient. Leveraging machine learning to do this helps complete the classification of all the data in your organization, structured and unstructured, including documents. That's really important because it provides faster time to value within your organization: the data can be consumed faster and earlier in the data life cycle by your analytics and data science teams. So we see really big benefits in terms of classification.
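A minimal sketch of machine-learning text classification for sensitivity labels, using scikit-learn on a toy training set; the labels and examples are illustrative assumptions, not the models described in the talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples; a real program would train on reviewed samples.
texts = [
    "patient diagnosis and treatment notes",
    "employee salary and bank account details",
    "cafeteria menu for next week",
    "quarterly press release draft",
    "customer home address and phone number",
    "public holiday schedule",
]
labels = ["sensitive", "sensitive", "public", "public", "sensitive", "public"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

for doc in ["wire transfer and account number attached", "office picnic agenda"]:
    print(doc, "->", model.predict([doc])[0])
```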
[00:13:36] Cluster analysis is one of our patented methodologies. It leverages machine learning to understand groupings of your data that are probably duplicates or very similar. Right now we see it mostly in the unstructured world, where you want to know how many copies of a document or an Excel file you have saved. That matters when you're trying to save storage space and cut down on how many duplicates you have, or when you're doing a data lake or cloud migration and you really want to keep only the golden copies of your data. Being able to do that analysis in a smart way is where cluster analysis comes in.
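A minimal sketch of grouping likely-duplicate documents by text similarity; the TF-IDF representation, the 0.6 threshold, and the sample titles are assumptions for illustration, not the patented method mentioned in the talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Q3 sales report final version",
    "Q3 sales report final version (copy)",
    "Q3 sales report FINAL version v2",
    "Employee onboarding checklist",
    "Onboarding checklist for new employees",
]

vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)          # pairwise similarity matrix

THRESHOLD = 0.6                           # assumed cut-off for "near duplicate"
cluster_of = list(range(len(docs)))       # start with each doc in its own cluster
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] >= THRESHOLD:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if c == old else c for c in cluster_of]

for cluster, doc in sorted(zip(cluster_of, docs)):
    print(cluster, doc)                   # same number = likely duplicate group
```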
[00:14:33] And lastly, correlation. This is really critical for compliance with privacy regulations like GDPR and, in the United States, the California Consumer Privacy Act. It helps because you not only have to identify what counts as personal information within your data organization; there's also the concept of personally identifiable information. Information ranging from your health and medical records to your cookie settings and IP records is identifiable information that may not necessarily be tied to an individual in the database where it's saved. It's probably linked to information saved in other databases, but it still describes a single person or entity, and it therefore needs to be produced when a data subject access request comes in. That's where it gets tricky for technology teams to find this information directly, because these are indirect attributes tied to a person. Correlation has been really key in helping many organizations be compliant with privacy protection laws, and in finding and correlating all related information that's tied to a person.
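A minimal sketch of correlating indirect attributes back to a single data subject for an access request, using two toy record sets joined on an assumed shared key; the table and field names are illustrative:

```python
# Records keyed directly by an identifier we know (email).
crm = [
    {"customer_id": "C-17", "email": "ann@example.com", "name": "Ann Lee"},
]

# Records with no name or email, only indirect attributes (cookie id, IP).
web_analytics = [
    {"cookie_id": "ck-9", "ip": "203.0.113.7", "customer_id": "C-17"},
    {"cookie_id": "ck-4", "ip": "198.51.100.2", "customer_id": "C-88"},
]

def records_for_subject(email: str) -> list[dict]:
    """Collect every record, direct or indirect, tied to one data subject."""
    direct = [r for r in crm if r["email"] == email]
    ids = {r["customer_id"] for r in direct}
    indirect = [r for r in web_analytics if r["customer_id"] in ids]
    return direct + indirect

print(records_for_subject("ann@example.com"))
# direct CRM record plus the cookie/IP record correlated through customer_id
```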
[00:16:01] When chief data officers and data management programs have this foundation, a single source of truth built on automation, machine learning, and smarter insights, then on top of it we can build very specific capabilities for privacy and security, and, when it comes to governance, smarter data quality and data retention rules. Because we're already reading through the files, we know when a file was last updated or opened, and we can compare that to the relevant policy. Things that are already being done today around a catalog and data quality rules can run more efficiently and scale better with the AI discovery foundation.
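A minimal sketch of a retention check of that kind, comparing each file's last-modified time to an assumed per-folder policy; the paths and limits are assumptions:

```python
import time
from pathlib import Path

RETENTION_DAYS = {"logs": 90, "contracts": 7 * 365}   # assumed policy, per top folder

def files_past_retention(root: str) -> list[Path]:
    """List files older than the retention period for their top-level folder."""
    now = time.time()
    stale = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        category = path.relative_to(root).parts[0]
        max_days = RETENTION_DAYS.get(category)
        if max_days is None:
            continue
        age_days = (now - path.stat().st_mtime) / 86400
        if age_days > max_days:
            stale.append(path)
    return stale

# Usage: files_past_retention("/data/shared") -> candidates for archival or deletion review
```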
[00:17:02] Just to sum up a bit: machine-learning, AI-led discovery leverages machine learning to identify content faster and smarter and to label it using natural language processing. Enriching the data is automated through these classifiers. And you're able to take action on your data in a smarter way, because you know which clusters are duplicates and you know that everything correlated or related is tied back to a person or entity. So it's really about intelligent classification.

[00:17:46] And what are the business benefits we've seen from our customers and the many people we've spoken to? In terms of a catalog, many organizations have to document their processes and their information. Instead of doing it manually, a catalog that is automatically collected and updated reduces the work that has to go into that documentation process. You also want a catalog that holds more than structured information: you want to see all your data assets holistically in one place. If that's possible, then seeing the impact one change can have on other applications and documents becomes much easier, because you have a full understanding of the landscape.

[00:18:50] Also, within your catalog you're not only seeing the classifiers, but really that single dashboard view I spoke about, where chief data officers can see their assets, see the classifications of what the data is, and see the profiling statistics within data quality, to understand the completeness of the data: is this a data value with missing information that I should be alerted to and take further action on? That's the value of this type of information.
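A minimal sketch of the completeness profiling mentioned here, computing per-column missing-value rates with pandas and flagging anything above an assumed threshold; the sample data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C-1", "C-2", "C-3", "C-4"],
    "email":       ["a@x.com", None, "c@x.com", None],
    "postal_code": ["10001", "94105", None, "60601"],
})

ALERT_THRESHOLD = 0.25   # assumed: alert when more than 25% of values are missing

def completeness_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness with a flag for columns that need attention."""
    missing_rate = frame.isna().mean()
    return pd.DataFrame({
        "completeness": 1 - missing_rate,
        "needs_attention": missing_rate > ALERT_THRESHOLD,
    })

print(completeness_report(df))
# 'email' would be flagged here (50% missing); 'postal_code' (25%) would not.
```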
[00:19:25] And again, classification: one of the biggest pain points as a data steward is not really being able to understand the data. Taking the time to review the data values and then confirm them with a subject matter expert takes time, and imagine doing that for every single data set and data element; it's really time consuming. Being able to leverage AI to your benefit, and to have these labels applied earlier in life cycle management so they can be consumed by your data science and analytics teams or any other business user, means everyone is clear on exactly what this data is and what it should be used for.

[00:20:14] I spoke briefly about clustering, and it has certainly been used in our cloud migration use cases. Many organizations are beginning, or are well along on, their journey to the cloud, and it's really not a simple lift-and-shift activity. It takes careful analysis to prioritize the data sets that should be migrated over and then, within each data set, to figure out exactly which pieces should go into the cloud. Before that journey can be successful, you first need a very complete inventory of what you have, where it is, and exactly where it's being migrated. You also want to take the opportunity to clean up your data: see all the duplicates, reduce the redundancies, and fix any data quality issues before you do the migration. These are the kinds of activities that need to be done, but how can you do them in a smart way? A technique like clustering can help you reduce the risk and make sure you're adhering to your organization's data policies while you're doing the cloud migration. And then, once the data is migrated to the cloud, you need to connect that data to your other data sources and structures and make sure everything stays in alignment. That's another consideration: the maintenance part of your cloud journey, even after the migration is done.
[00:22:00] Now I'm going to talk about some specific use cases we've seen. Many of the organizations we've worked with started with the need to run a privacy or some other regulatory-driven initiative. The first use case is a global athletic brand, a retail company with over 150 billion dollars in revenue, 73,000 employees, and over 1,200 data sources, which needed to comply with GDPR and local privacy laws. They needed to look across their entire landscape and identify what is sensitive: to classify and label specifically the personal information tied to a customer or employee, as well as personally identifiable information. I see that need to classify across use cases, whether in privacy, security, or data governance, and in other organizations that run dedicated classification projects. From there, the results go into a catalog that can display the classification, bring together the different product lines in the retail company, and help align and manage not only their products but also their customer and employee data, making sure they stay in compliance with privacy requirements.

[00:23:40] Over the last few years this company has also looked into expanding their data governance program, because they've seen the benefits of the data catalog and how it's helped them identify sensitive, private data. They're expanding the definition of sensitivity to other types of critical data, around their products and the shipping of their products, other critical and confidential data they want to make sure is also monitored, and being able to classify and tag that has been quite important for them.
[00:24:21] The second use case is much larger in scale: a global marketing and digital brand portfolio company. Within this portfolio they have assets in finance, healthcare, and retail, so it's really a multi-billion-dollar company with multiple brands that have to be kept siloed, while the company still needs to understand its privacy initiatives. Because of their status as a global company, there are multiple regulations they have to comply with. How do they juggle and manage the sometimes different and distinct regulations in each local region that they have to be mindful of? They need to comply with HIPAA in healthcare, find their personal information, and bring together their structured and unstructured data, with multiple data sources across the globe to keep track of. In terms of federation, they need a singular view, but also a local view in which the local offices can manage their own data.
[00:25:44] So, to summarize: a modern data governance program starts with data discovery, to know your data, which leads to knowing where your data is and then to knowing the quality of your data. All three of those basic components start with data discovery. I'm advocating for expanding beyond structured sources, because it's really in unstructured data where we see a lot of the risk, and where compliance teams are raising their hands and saying they have issues getting a complete, holistic view of their data, identifying it through classification, and taking action on it based on clustering and correlation. This is where we see machine learning and AI driving these analyses, helping chief data officers modernize, scale, and grow their data governance capabilities much faster. We see a lot of traction in this area, and we welcome the chance to talk with anyone interested in learning more about BigID, data discovery, or the use cases we've seen benefit financial services, insurance, healthcare, and retail companies. Thank you, everyone, for your time today, and be sure to visit BigID at our virtual booth. Thank you.
Tags
  • Data Governance
  • AI
  • Data Discovery
  • Data Quality
  • Machine Learning
  • Data Catalog
  • Unstructured Data
  • Compliance
  • BigID
  • Data Management