Data governance in the AI era

00:45:30
https://www.youtube.com/watch?v=3A855rN_9pE

Summary

TL;DR: The session on data governance emphasizes the vital role of quality data in successfully leveraging AI. Speakers Cynthia Gums from Ford and Steve Jared from Orange share insights on the challenges companies face today, such as dark data and governance complexity. Solutions like Dataplex from Google Cloud are highlighted, showcasing capabilities such as automated cataloging and intelligent data management that help organizations better manage their data. The session also covers both organizations' evolving approaches to data discovery and governance through user-centric interfaces and AI enhancements.

Key Takeaways

  • 🌐 Data governance is critical in the age of AI.
  • 🔍 66% of organizations report having dark data.
  • 🔑 Dataplex automates data governance at scale.
  • 📊 Quality data is vital for effective AI outputs.
  • 🚀 Ford and Orange share their data governance journeys.
  • 🛠️ Automated cataloging enhances data discovery.
  • 🌍 Data democracy improves data accessibility.
  • ⚙️ Governance rules simplify compliance management.
  • 🤖 AI aids in metadata enrichment and quality checks.
  • 📈 Continuous user feedback improves data platforms.
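
The completeness and timeliness checks mentioned in the takeaways (and in Ford's data quality dashboards later in the session) can be illustrated with a minimal sketch. This is not Dataplex's actual API; the row layout, column names, and thresholds below are hypothetical, purely to show the shape of such rule evaluations:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical record batch: each row is a dict; None marks a missing value.
rows = [
    {"vin": "ABC123", "updated_at": datetime.now(timezone.utc)},
    {"vin": None,     "updated_at": datetime.now(timezone.utc) - timedelta(days=3)},
]

def completeness(rows, column):
    """Fraction of rows whose `column` is populated."""
    filled = sum(1 for r in rows if r.get(column) is not None)
    return filled / len(rows)

def timeliness(rows, column, max_age):
    """Fraction of rows whose `column` timestamp is within `max_age`."""
    cutoff = datetime.now(timezone.utc) - max_age
    fresh = sum(1 for r in rows if r[column] >= cutoff)
    return fresh / len(rows)

print(completeness(rows, "vin"))                          # 0.5
print(timeliness(rows, "updated_at", timedelta(days=1)))  # 0.5
```

A real system would evaluate rules like these on a schedule and surface the scores on a dashboard, as Ford describes.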

Timeline

  • 00:00:00 - 00:05:00

    The session on data governance in the age of AI features speakers from Ford and Orange, discussing the importance of data governance as AI technology evolves. The agenda includes an introduction, case studies, and updates on data governance tools.

  • 00:05:00 - 00:10:00

    Data is essential for AI, but many organizations struggle with 'dark data' and data quality issues. A significant percentage of organizations report that much of their data is not utilized, leading to challenges in data governance due to the complexity of data landscapes.

  • 00:10:00 - 00:15:00

    Google Cloud's Dataplex is introduced as a solution for automating data governance and management. It integrates with various services to provide unified metadata, centralized security, and intelligent data management features, helping organizations build trust in their data.

  • 00:15:00 - 00:20:00

    Cynthia from Ford discusses her role in data discovery and classification, emphasizing the importance of data governance in creating a single source of truth. Ford's data platform, powered by Google Cloud, aims to organize data sources and improve data accessibility.

  • 00:20:00 - 00:25:00

    Ford's data governance strategy includes using Dataplex for capturing metadata and implementing data lineage to understand data origins and life cycles. They are also working on automating metadata enrichment to enhance data discovery.

  • 00:25:00 - 00:30:00

    Cynthia highlights the challenges of user experience in data discovery, noting the need for tailored interfaces for different user personas. Ford has developed a custom data discovery hub to improve user engagement and data accessibility.

  • 00:30:00 - 00:35:00

    Steve from Orange shares their journey in data governance, emphasizing the need to break down data silos and improve data accessibility across their operations in 26 countries. They aim to create a data democracy to enhance data utilization.

  • 00:35:00 - 00:40:00

    Orange's approach includes using policy as code for data governance, enabling better management of data quality and access control. They have established a centralized team to define architecture and support data product development across regions.

  • 00:40:00 - 00:45:30

    The session concludes with updates on new features in Dataplex, including automated cataloging, lineage tracking, and governance rules, aimed at enhancing data discovery and governance across organizations.
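
Orange's "policy as code" approach from the 00:35:00 segment can be sketched minimally: access policies are declared as data, evaluated programmatically, and therefore versioned, reviewed, and tested like any other code artifact. The policy format and field names below are hypothetical, not Orange's actual implementation:

```python
# Minimal policy-as-code sketch: policies are plain data, checked in code.
POLICIES = [
    {"dataset": "customer_billing",  "allowed_roles": {"finance", "data_steward"}},
    {"dataset": "network_telemetry", "allowed_roles": {"network_ops"}},
]

def is_allowed(dataset, role):
    """Grant access only if some policy lists the role for the dataset."""
    return any(
        p["dataset"] == dataset and role in p["allowed_roles"]
        for p in POLICIES
    )

print(is_allowed("customer_billing", "finance"))   # True
print(is_allowed("network_telemetry", "finance"))  # False
```

Because the policies live in a repository rather than in ad-hoc console settings, a central team can define the architecture once and regional teams can propose changes through normal code review, which matches the centralized-team model Orange describes.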

Video Q&A

  • What is data governance?

    Data governance refers to the management of data availability, usability, integrity, and security in an organization.

  • What are some challenges in data governance?

    Challenges include dark data (data that is not discovered or used), data quality issues, and the complexity of enterprise data landscapes.

  • What is Dataplex?

    Dataplex is a Google Cloud service for automating data governance and management at scale.

  • How does Ford use Dataplex?

    Ford utilizes Dataplex for capturing metadata, enhancing data discovery, and maintaining data quality across its platform.

  • What innovations are being implemented in data discovery?

    Innovations include automated cataloging, natural language querying, and enhanced metadata management.

  • What is the importance of metadata?

    Metadata provides context and enables effective data search, discovery, and governance.

  • How does Orange approach data governance?

    Orange adopts a data democracy approach, utilizing policy as code to manage access and maintain data quality across its operations.

  • What role does AI play in data governance?

    AI assists in automating data classification, anomaly detection, and enhancing the overall data discovery process.

  • What are governance rules in Dataplex?

    Governance rules allow organizations to define and enforce data governance policies at scale based on existing metadata.

  • What are the next steps for data discovery at Ford?

    Ford is piloting a data marketplace experience to enhance user access and streamline data requests.
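
As an illustration of the "governance rules based on existing metadata" idea from the Q&A above, here is a minimal sketch, not Dataplex's actual API: a rule scans cataloged column metadata and auto-tags likely-sensitive columns for restricted access. The catalog layout, tag names, and patterns are hypothetical:

```python
import re

# Hypothetical catalog entries: column name plus its cataloged metadata tags.
catalog = [
    {"column": "customer_email", "tags": []},
    {"column": "part_number",    "tags": ["public"]},
    {"column": "driver_ssn",     "tags": []},
]

# A "governance rule": any column whose name matches a sensitive pattern
# and lacks an explicit "public" tag gets auto-tagged as restricted.
SENSITIVE = re.compile(r"(email|ssn|phone|address)", re.IGNORECASE)

def apply_rule(entries):
    for e in entries:
        if SENSITIVE.search(e["column"]) and "public" not in e["tags"]:
            e["tags"].append("restricted")
    return entries

flagged = [e["column"] for e in apply_rule(catalog) if "restricted" in e["tags"]]
print(flagged)  # ['customer_email', 'driver_ssn']
```

The point of evaluating rules against metadata rather than hand-labeling each table is scale: one rule covers every current and future column the catalog knows about.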

Transcript

  • 00:00:00
    [Music]
  • 00:00:11
    good morning everyone thank you so much
  • 00:00:13
    for coming to this breakout session on
  • 00:00:16
    data governance in the age of
  • 00:00:20
    AI we're very excited to have with us
  • 00:00:23
    two customer speakers today we have
  • 00:00:26
    Cynthia gums who is the manager of
  • 00:00:29
    global data insights and analytics at
  • 00:00:31
    Ford Driving key initiatives around the
  • 00:00:34
    new data Factory at Ford we also have
  • 00:00:37
    Steve Jared who is the chief AI officer
  • 00:00:41
    at Orange leading Ai and data strategy
  • 00:00:45
    for orange across 26 countries and my
  • 00:00:49
    name is louan I'm a product manager for
  • 00:00:51
    data Plex here at Google Cloud so we're
  • 00:00:55
    very excited to be sharing with you how
  • 00:00:58
    we think about data governance in this
  • 00:01:00
    age of AI and what do our Journeys each
  • 00:01:04
    look
  • 00:01:06
    like so here's our agenda for today
  • 00:01:09
    we're going to start with an
  • 00:01:10
    introduction and a product overview
  • 00:01:13
    followed by case studies from Ford and
  • 00:01:16
    orange and then we'll talk about what's
  • 00:01:19
    new what's upcoming in
  • 00:01:24
    datax so as all of us have experienced
  • 00:01:27
    recently generative AI is really this
  • 00:01:30
    paradigm shift that is
  • 00:01:33
    revolutionizing how we operate as
  • 00:01:36
    businesses whether it is generating
  • 00:01:38
    creative content whether it is working
  • 00:01:40
    with complex data whether it's improving
  • 00:01:43
    your customer experiences or even
  • 00:01:46
    training your own large language models
  • 00:01:48
    for Enterprise use cases the impact of
  • 00:01:51
    AI is really profound and all
  • 00:01:57
    encompassing at the same time we know
  • 00:02:00
    that data is the fuel that feeds into
  • 00:02:03
    the engine of AI it is really the
  • 00:02:06
    critical foundation for training and
  • 00:02:09
    grounding your models and in return this
  • 00:02:13
    rapid growth in AI Innovation is really
  • 00:02:17
    creating an accelerating demand for data
  • 00:02:21
    that is well governed high quality and
  • 00:02:24
    easy to
  • 00:02:27
    discover so given the strong need and
  • 00:02:30
    Northstar Vision what are the challenges
  • 00:02:33
    that companies are actually facing in
  • 00:02:37
    reality well first and foremost we have
  • 00:02:40
    the challenge of dark data which I'm
  • 00:02:43
    sure most of you could resonate with in
  • 00:02:46
    fact what we know that
  • 00:02:49
    66% of organizations have reported that
  • 00:02:52
    at least half of their data is dark
  • 00:02:56
    which means that it is data that is not
  • 00:02:58
    even discovered or used in the first
  • 00:03:03
    place and even if you're able to
  • 00:03:05
    discover and use that data there's still
  • 00:03:08
    a lot of questions about data quality
  • 00:03:11
    whether this data is valid whether this
  • 00:03:14
    data is
  • 00:03:16
    trustworthy we learned from our survey
  • 00:03:18
    that only
  • 00:03:20
    44% of data leaders are fully confident
  • 00:03:24
    in the quality of their organization's
  • 00:03:26
    data and as we all know local quality
  • 00:03:30
    data would only result in low quality
  • 00:03:33
    output which you really cannot trust for
  • 00:03:36
    any inside generation or
  • 00:03:41
    decision-making now the reason why
  • 00:03:44
    managing and governing your data it's so
  • 00:03:47
    difficult it's due to the complexity of
  • 00:03:50
    the Enterprise data
  • 00:03:52
    landscape as you can see here on the
  • 00:03:54
    diagram data is really coming in from
  • 00:03:57
    various different sources the are stored
  • 00:04:00
    and processed in different Services
  • 00:04:04
    whether it's data warehouse data lakes
  • 00:04:06
    or
  • 00:04:07
    databases they reside in different
  • 00:04:10
    formats and they're used by different
  • 00:04:13
    personas across different
  • 00:04:16
    workflows now please raise your hand if
  • 00:04:19
    this complex situation ever seemed
  • 00:04:21
    familiar to
  • 00:04:23
    you yes I see a lot of hands
  • 00:04:26
    raised definitely that's what we see all
  • 00:04:29
    the time as well so this challenge is
  • 00:04:33
    exactly what's keeping us busy here at
  • 00:04:35
    Google cloud and as you may have learned
  • 00:04:38
    from this conference so far we're really
  • 00:04:40
    evolving bigquery into a unified data
  • 00:04:44
    and AI governance platform a unified
  • 00:04:47
    data and AI platform and this platform
  • 00:04:49
    is designed with data governance as a
  • 00:04:52
    central builting consideration that is
  • 00:04:55
    contextual and pervasive across the
  • 00:04:58
    different layers of the text
  • 00:05:01
    deck now at the heart of providing this
  • 00:05:05
    unified data and AI governance is datax
  • 00:05:09
    which is our native offering for
  • 00:05:11
    automating data governance and data
  • 00:05:13
    management at
  • 00:05:15
    scale there are several key value
  • 00:05:17
    propositions of datax we have seen truly
  • 00:05:21
    resonating with our customers based on
  • 00:05:25
    interactions first and foremost dataplex
  • 00:05:27
    deeply integrates with various products
  • 00:05:30
    and services to really provide this
  • 00:05:33
    UniFi metadata across distributed data
  • 00:05:36
    and based on this you're able to perform
  • 00:05:38
    search across different projects across
  • 00:05:42
    different regions and across different
  • 00:05:44
    data
  • 00:05:46
    silos and based on that you're also able
  • 00:05:49
    to further enrich and organize your data
  • 00:05:52
    as needed so that's number one number
  • 00:05:56
    two on top of this wealth of metadata
  • 00:05:58
    datax offers centralized security and
  • 00:06:02
    governance
  • 00:06:04
    features this really allows you to
  • 00:06:06
    easily manage your data governance
  • 00:06:09
    policies based on understanding of the
  • 00:06:12
    metadata
  • 00:06:13
    context and last but not least datax has
  • 00:06:16
    a rich set of features around
  • 00:06:19
    intelligent data management from
  • 00:06:21
    tracking data lineage to assessing data
  • 00:06:25
    profile and to automating data quality
  • 00:06:28
    checks so really helping you build
  • 00:06:30
    better trust in your data and helping
  • 00:06:33
    you optimize data related
  • 00:06:38
    Roi now since GA launch back in 2022
  • 00:06:42
    datax has been widely adopted by
  • 00:06:46
    customers across different geographies
  • 00:06:49
    and different industry verticals as of
  • 00:06:52
    now over
  • 00:06:53
    95% of the top data analytics customers
  • 00:06:57
    at Google Cloud are all already using
  • 00:07:00
    data Plex for managing and governing
  • 00:07:02
    their data at
  • 00:07:04
    scale So today we're very excited to be
  • 00:07:07
    hearing from two of them Ford and orange
  • 00:07:11
    so please join me in first welcoming
  • 00:07:14
    Cynthia from Ford to talk about her
  • 00:07:17
    journey with data
  • 00:07:20
    [Applause]
  • 00:07:28
    governance
  • 00:07:33
    all right good day how's everybody doing
  • 00:07:38
    today so my name is Cynthia gums and I'm
  • 00:07:40
    responsible for data Discovery and
  • 00:07:42
    classification at Ford Motor Company
  • 00:07:45
    welcome to my TED
  • 00:07:48
    talk now I'm just
  • 00:07:53
    kidding and
  • 00:07:57
    so while this may not be a t talk I
  • 00:08:01
    promise you that this is an important
  • 00:08:02
    topic and I'm really honored to have the
  • 00:08:05
    opportunity to speak with you today
  • 00:08:07
    about it
  • 00:08:09
    okay so before we start o that's loud
  • 00:08:14
    before we start I'm curious about who's
  • 00:08:16
    in the crowd today so how many of you
  • 00:08:20
    consider yourselves a data governance
  • 00:08:22
    professional you can just wave your hand
  • 00:08:25
    all right there's a lot of you out there
  • 00:08:27
    and if you didn't raise your hand how
  • 00:08:29
    many of you have an appreciation for
  • 00:08:30
    what data governance
  • 00:08:33
    is okay so most of you awesome so I as
  • 00:08:37
    it and data professionals I'm sure
  • 00:08:39
    you've had the
  • 00:08:40
    opportunity to explain to nonata or
  • 00:08:43
    non-te people what you do for a living
  • 00:08:46
    right so when you tell someone I work in
  • 00:08:49
    it Information Technology they look at
  • 00:08:52
    you and they think oh you must be doing
  • 00:08:54
    tech support and then they start calling
  • 00:08:56
    you to help them with their Wi-Fi right
  • 00:08:59
    then you you tell them well I work in
  • 00:09:00
    the data analytics department and then
  • 00:09:03
    they think oh she just creates charts
  • 00:09:05
    and things all day she doesn't really do
  • 00:09:08
    a whole lot of anything because that's
  • 00:09:09
    what data analytics is right then you
  • 00:09:12
    tell them well I'm in data
  • 00:09:14
    governance and now they think you have
  • 00:09:17
    the magic key to give everybody access
  • 00:09:19
    to data data governance is so much more
  • 00:09:21
    than that and then for me when I say I
  • 00:09:23
    work in data Discovery and
  • 00:09:25
    classification they have no clue what
  • 00:09:27
    I'm talking about so then I have to
  • 00:09:29
    explain to them I am responsible for
  • 00:09:33
    making sure people can find the data
  • 00:09:35
    they need to solve business problems and
  • 00:09:39
    then I might give an analogy about a
  • 00:09:41
    shopping experience or being in a
  • 00:09:44
    library and then they get it okay so
  • 00:09:47
    data governance is a broad topic today
  • 00:09:50
    I'm just going to scratch the surface of
  • 00:09:51
    it but I'm really going to dig a little
  • 00:09:53
    bit deeper into Data Discovery so our
  • 00:09:56
    Ford data platform is powered by Google
  • 00:09:59
    cloud and we believe that data is
  • 00:10:02
    awesome valuable and worthy of respect
  • 00:10:07
    our platform objective is to establish a
  • 00:10:10
    single source of truth of data to enable
  • 00:10:12
    data fusion and
  • 00:10:15
    responsible data
  • 00:10:17
    usage our platform helps data Engineers
  • 00:10:21
    organize many many data sources across
  • 00:10:24
    every aspect of our business we have a
  • 00:10:27
    very complex environment
  • 00:10:29
    and we have a number of key capabilities
  • 00:10:32
    that allow us to meet our objective data
  • 00:10:35
    governance is a foundational capability
  • 00:10:37
    for our
  • 00:10:40
    platform as a part of our migration to
  • 00:10:43
    Google Cloud we have enabled datax data
  • 00:10:46
    catalog to capture Technical and
  • 00:10:48
    business metadata about our projects and
  • 00:10:51
    data sets each of our governance teams
  • 00:10:55
    benefits from data capabilities but the
  • 00:10:58
    main one is tag templates for the
  • 00:11:00
    collection of business metadata we also
  • 00:11:03
    benefit from dataplex apis to help
  • 00:11:07
    expose that metadata in our custom user
  • 00:11:09
    interfaces as well as our backend
  • 00:11:13
    processes additionally our data quality
  • 00:11:15
    team recently launched data lineage and
  • 00:11:18
    data lineage allows us to understand
  • 00:11:20
    where the data came from and its life
  • 00:11:22
    cycle we also use lineage to understand
  • 00:11:26
    how to troubleshoot the data how to
  • 00:11:28
    identify dependencies and also for data
  • 00:11:33
    Discovery so as I previously previously
  • 00:11:36
    mentioned we use Tag templates as the
  • 00:11:38
    foundation for our data catalog
  • 00:11:41
    experience and this metadata is
  • 00:11:43
    collected as a part of our endtoend data
  • 00:11:47
    process so when you onboard the data
  • 00:11:50
    when you create your project all the way
  • 00:11:52
    through access enablement we're
  • 00:11:54
    collecting the data along the way we
  • 00:11:56
    collect the metadata at every level of
  • 00:11:59
    our project hierarchy starting with
  • 00:12:01
    projects then data sets tables views and
  • 00:12:04
    columns we have tag templates for each
  • 00:12:06
    of those levels today we're collecting
  • 00:12:08
    roughly 70 Business metadata tags that
  • 00:12:12
    might not sound like a lot but it's it's
  • 00:12:15
    important data that we need to help
  • 00:12:16
    people discover the data and actually
  • 00:12:18
    dataplex lets you capture thousands of
  • 00:12:22
    tags most of this metadata is captured
  • 00:12:25
    manually so you can imagine that it's
  • 00:12:27
    kind of tedious time consuming and
  • 00:12:29
    resource
  • 00:12:31
    intensive but we're working with um
  • 00:12:35
    we're experimenting with Gen to see if
  • 00:12:37
    we can do some Automation in the
  • 00:12:39
    metadata enrichment space we have custom
  • 00:12:42
    user interfaces to allow our users to
  • 00:12:45
    input and extract the metadata from the
  • 00:12:48
    data
  • 00:12:49
    catalog now you might be wondering well
  • 00:12:51
    why do you need a custom user interface
  • 00:12:54
    dataplex has an interface via the
  • 00:12:56
    console however there's many reasons
  • 00:13:00
    why we decided to go this route but the
  • 00:13:01
    main one is we need to control the way
  • 00:13:05
    that metadata is input and the way it is
  • 00:13:08
    exposed to our users right now today in
  • 00:13:11
    dataplex which is perfect for a platform
  • 00:13:13
    team it shows you everything staging
  • 00:13:17
    tables temp tables that's great for the
  • 00:13:19
    platform team but your end users don't
  • 00:13:21
    want to see that and so we try to modify
  • 00:13:23
    the experience so that we can curate it
  • 00:13:25
    for
  • 00:13:27
    excellence okay
  • 00:13:30
    another reason why we created a custom
  • 00:13:33
    user experience is because we have so
  • 00:13:35
    many different personas that we're
  • 00:13:38
    trying to satisfy so last year we
  • 00:13:41
    launched our custom data Discovery
  • 00:13:43
    experience this was an exciting time for
  • 00:13:46
    us because we've been on quite a journey
  • 00:13:47
    trying to solve the data Discovery
  • 00:13:51
    challenge this experience makes use of
  • 00:13:54
    the data Plex
  • 00:13:55
    apis and we call it our data Discovery
  • 00:13:58
    hub
  • 00:14:00
    this interface currently has over 9,000
  • 00:14:03
    users in
  • 00:14:04
    growing our users are able to search the
  • 00:14:08
    data catalog and when the results come
  • 00:14:11
    back they're also able to see all the
  • 00:14:13
    metadata related to that data asset they
  • 00:14:16
    can also see the table and view schemas
  • 00:14:19
    as well we've enabled an export
  • 00:14:22
    capability that allows users to export
  • 00:14:25
    their search results the table schemas
  • 00:14:28
    and the metadata if they need to do any
  • 00:14:30
    offline analysis now I need to be clear
  • 00:14:33
    this is not an export of the data this
  • 00:14:36
    is export of the metadata right the
  • 00:14:38
    security folks are probably looking for
  • 00:14:40
    me
  • 00:14:41
    now and so our user interface also
  • 00:14:45
    creates or provides linkages to other
  • 00:14:48
    tools that we have we have our data
  • 00:14:51
    Activation Portal which allows users to
  • 00:14:54
    request access to the data that they've
  • 00:14:56
    discovered then we also have our data
  • 00:14:59
    quality dashboards so they can see the
  • 00:15:01
    completeness the timeliness and the
  • 00:15:03
    other measures for data
  • 00:15:09
    quality all
  • 00:15:13
    right all right so we are
  • 00:15:16
    currently uh oh sorry guys all right so
  • 00:15:20
    one of the key challenges for our data
  • 00:15:24
    Discovery activity is that um the user
  • 00:15:28
    experience is really important and our
  • 00:15:30
    challenge was that we have all these
  • 00:15:32
    different user personas we have data
  • 00:15:35
    analysts data stewards data scientists
  • 00:15:38
    software analysts data engineers and
  • 00:15:42
    then you have your business users all
  • 00:15:43
    these people have a need to discover
  • 00:15:45
    data but their needs are a little bit
  • 00:15:47
    different and so if I return to a
  • 00:15:49
    business user this big long list of
  • 00:15:53
    tables and columns they're not going to
  • 00:15:54
    know what to do with that right and so
  • 00:15:57
    to solve this problem we've engaged
  • 00:15:59
    project product designers to help us go
  • 00:16:02
    through this full effort of
  • 00:16:05
    understanding our personas what do they
  • 00:16:07
    want what are they thinking about when
  • 00:16:09
    they search for data how could we return
  • 00:16:11
    the results such that they have an
  • 00:16:13
    appreciation for what those um what the
  • 00:16:16
    data is and what it can do for them
  • 00:16:19
    we've done user interviews sessions
  • 00:16:21
    where we sit down with them and watch
  • 00:16:23
    them use the tool we've done surveys and
  • 00:16:25
    all of that has resulted in excellent
  • 00:16:27
    feedback that we're using to inform how
  • 00:16:30
    we modify our product another challenge
  • 00:16:33
    is ensuring that we have high quality
  • 00:16:35
    relevant
  • 00:16:37
    metadata if your columns don't have
  • 00:16:39
    descriptions how can someone search and
  • 00:16:42
    find that your column has the data that
  • 00:16:43
    they need to solve their business
  • 00:16:46
    problem however I already mentioned that
  • 00:16:48
    populating metadata is time consuming
  • 00:16:50
    and resource intensive and so how do you
  • 00:16:53
    solve that problem well one thing that
  • 00:16:55
    we've done is we created a tool that
  • 00:16:58
    allows data teams to update their
  • 00:17:02
    descriptions in bulk using their data
  • 00:17:05
    dictionaries as input and so now they
  • 00:17:07
    already have a data dictionary they can
  • 00:17:10
    upload that it goes through terraform
  • 00:17:12
    and it updates their descriptions behind
  • 00:17:13
    the scenes so that was a great
  • 00:17:16
    enhancement that we
  • 00:17:18
    enabled so what's next for data
  • 00:17:21
    Discovery at
  • 00:17:22
    Ford we are currently piloting a data
  • 00:17:25
    Marketplace experience which allows
  • 00:17:28
    users to search find and access data
  • 00:17:32
    well request access to data in the same
  • 00:17:34
    user experience today it's separate
  • 00:17:37
    experiences but we're merging them right
  • 00:17:40
    and so this was a highly requested item
  • 00:17:42
    from our users that we're trying to
  • 00:17:45
    satisfy so far the pilot is going really
  • 00:17:48
    well and we're getting excellent
  • 00:17:49
    feedback that will inform our our
  • 00:17:53
    product and now we're also looking to to
  • 00:17:56
    launch some additional enhancements to
  • 00:17:57
    the experience so pre previously it was
  • 00:17:59
    just a simple search experience but now
  • 00:18:01
    it's search that will have Best Bets
  • 00:18:05
    enabled and that's what we're calling it
  • 00:18:07
    Best Bets Best Bets means um it's a
  • 00:18:11
    keyword focused activity where we have
  • 00:18:14
    popular data sets keywords that describe
  • 00:18:16
    them when users go in to do their search
  • 00:18:18
    if they hit one of those keywords those
  • 00:18:21
    items will bubble to the top of the
  • 00:18:22
    search we're also allowing our users to
  • 00:18:25
    add comments and ratings for the data so
  • 00:18:29
    if you already have access to the data
  • 00:18:30
    you can find it in the catalog and say
  • 00:18:33
    this data set was awesome it helped me
  • 00:18:34
    do X Y or
  • 00:18:36
    Z try it out right and then if they
  • 00:18:38
    thumbs it up that also influences where
  • 00:18:41
    it shows up in the search results the
  • 00:18:44
    next one is around data
  • 00:18:47
    collections so it's kind of like a
  • 00:18:49
    storefront for the data at Ford we have
  • 00:18:51
    a number of different subject areas and
  • 00:18:54
    they want to see their data together
  • 00:18:56
    right they don't want to see it mixed up
  • 00:18:58
    with every every body else's data and so
  • 00:19:00
    we're looking to create these
  • 00:19:02
    collections or storefronts so those
  • 00:19:04
    different subject areas can say you want
  • 00:19:07
    to use my data go to this place in the
  • 00:19:09
    data catalog to find it
  • 00:19:12
    okay so we're also experimenting with
  • 00:19:15
    Gen I'm sure you've been hearing a lot
  • 00:19:17
    about gen during the conference this
  • 00:19:19
    week and so we're looking at using gen
  • 00:19:22
    to assist with metadata enrichment
  • 00:19:25
    automating it as well as helping to
  • 00:19:28
    gener generate business descriptions for
  • 00:19:30
    the data I already mentioned that it's
  • 00:19:32
    time you know time consuming to do so
  • 00:19:34
    manually but how cool would it be if you
  • 00:19:37
    could just throw gen at a table and say
  • 00:19:40
    what's in this table and it figures it
  • 00:19:42
    out because they can tell what's in the
  • 00:19:44
    table from the data and it will describe
  • 00:19:45
    it and you move on now you might have to
  • 00:19:48
    verify that the descriptions are decent
  • 00:19:50
    but once you get that confidence I think
  • 00:19:52
    it'll be a powerful data Discovery
  • 00:19:56
    experience all right and then we're also
  • 00:19:58
    looking at doing an
  • 00:19:59
    llm and we're training it on
  • 00:20:02
    documentation about the data as well as
  • 00:20:05
    the data catalog
  • 00:20:06
    itself now you're taking data Discovery
  • 00:20:09
    to another level right because you can
  • 00:20:11
    do natural language queries ask it
  • 00:20:14
    questions and it's going to combine that
  • 00:20:16
    documentation with that metadata and
  • 00:20:19
    give you a really good answer about what
  • 00:20:20
    data is available to solve your business
  • 00:20:23
    problems lastly we're really looking
  • 00:20:26
    forward to implementing data plexes
  • 00:20:29
    business terms and glossery this will
  • 00:20:32
    allow us to give common understanding to
  • 00:20:35
    those key terms that are the same across
  • 00:20:37
    the company so if you have 50 different
  • 00:20:41
    data sources and they all mention part
  • 00:20:42
    number do we have to Define part number
  • 00:20:45
    50 times in 50 different ways no but the
  • 00:20:48
    B business terms will allow us to have
  • 00:20:50
    that be consistent across all the
  • 00:20:53
    data so we've been on this
  • 00:20:56
    journey for a little over two years
  • 00:20:59
    years and it's been challenging but
  • 00:21:02
    rewarding at the same time especially
  • 00:21:04
    with this data Marketplace launch and so
  • 00:21:07
    I want to show appreciation to Google my
  • 00:21:10
    team and also all the teams involved for
  • 00:21:13
    their engagement and collaboration
  • 00:21:15
    because it's it's been fun I was um in
  • 00:21:18
    another session and the gentleman said
  • 00:21:20
    data governance was
  • 00:21:22
    boring really we're having so much fun
  • 00:21:26
    trying to figure out how to solve this
  • 00:21:28
    data Lex problem or not dataplex problem
  • 00:21:30
    but data data Discovery problem so I'll
  • 00:21:33
    ask you all who here thinks that data
  • 00:21:36
    Discovery is
  • 00:21:39
    easy I don't see a single hand raised
  • 00:21:42
    and thank you for validating us it's not
  • 00:21:46
    easy my manager's in here too it's not
  • 00:21:49
    easy all right so with that I'm going to
  • 00:21:53
    hand it over to Steve from Orange and
  • 00:21:55
    he's going to tell us how they've
  • 00:21:58
    enabled data Discovery with
  • 00:22:09
    gcp thanks Cynthia and we're on a very
  • 00:22:12
    very similar Journey so orange we're one
  • 00:22:16
    of the largest telecoms providers in the
  • 00:22:18
    world we also sell a lot of IT services
  • 00:22:21
    across 26 countries so we were
  • 00:22:23
    originally France Telecom and then we
  • 00:22:26
    acquired operations in many other
  • 00:22:28
    countries ranging from Belgium and Spain
  • 00:22:31
    and Poland as well as uh African
  • 00:22:34
    countries ranging from Sagal and Ivory
  • 00:22:37
    Coast to to the Democratic Republic of
  • 00:22:39
    Congo so we have an enormously diverse
  • 00:22:42
    set of challenges with almost 300
  • 00:22:45
    million customers across those countries
  • 00:22:48
    and the company is really a very proud
  • 00:22:51
    technological company a lot of the
  • 00:22:53
    reasons why you have power saving modes
  • 00:22:56
    in 5G today is because Orange cared
  • 00:22:58
    about the impact uh of these
  • 00:23:01
    Technologies on the environment um for
  • 00:23:03
    for decades and we're also at the
  • 00:23:05
    Forefront of a lot of AI work I'm I'm
  • 00:23:08
    really lucky to have a dedicated AI
  • 00:23:11
    research team that came from France
  • 00:23:14
    Telecom labs and so in my central team
  • 00:23:17
    we have not only all the data
  • 00:23:18
    engineering data science and ml
  • 00:23:20
    engineering but also as I mentioned the
  • 00:23:22
    pure research team and we're really
  • 00:23:24
    focused in three domains uh and we use
  • 00:23:27
    superpower interchangeably with AI so
  • 00:23:30
    we say that we're trying to superpower
  • 00:23:32
    our employees daily lives we're trying
  • 00:23:35
    to superpower all of our networks and
  • 00:23:37
    we're trying to superpower our customer
  • 00:23:39
    experiences and as Cynthia was
  • 00:23:41
    describing there's many challenges in
  • 00:23:44
    providing these kinds of services at
  • 00:23:46
    scale so before we started to use Google
  • 00:23:50
    Cloud we had data in organizational
  • 00:23:55
    silos that were mapped to the physical
  • 00:23:58
    infrastructure so each of these
  • 00:23:59
    teams within the countries like the
  • 00:24:01
    network team the finance team they had
  • 00:24:03
    built and maintained their own data
  • 00:24:05
    infrastructures that led to these silos
  • 00:24:07
    that map to these cultural and
  • 00:24:10
    uh and operational silos that we faced
  • 00:24:13
    and across these 26 countries the
  • 00:24:17
    data infrastructure that we had was
  • 00:24:18
    incredibly
  • 00:24:19
    heterogeneous and mostly had been um
  • 00:24:22
    self-integrated uh and managed and so
  • 00:24:25
    that the level of complexity of
  • 00:24:26
    maintaining that infrastructure and the
  • 00:24:29
    skills necessary for the teams to manage
  • 00:24:31
    that infrastructure were extremely
  • 00:24:33
    complex and a lot of the time
  • 00:24:35
    taken by the data engineering
  • 00:24:37
    teams was just spent keeping the
  • 00:24:39
    lights on on our
  • 00:24:41
    infrastructure and data governance was
  • 00:24:43
    also managed uh very very manually uh
  • 00:24:47
    with these very basic systems and we had
  • 00:24:49
    many different security and Regulatory
  • 00:24:52
    risks that we had to mitigate um through
  • 00:24:54
    these systems and as Cynthia was saying
  • 00:24:57
    a lot of our Executives didn't really
  • 00:25:00
    see data governance as strategic they
  • 00:25:04
    saw it as something that was just like a
  • 00:25:06
    Regulatory Compliance requirement they
  • 00:25:08
    didn't see it as
  • 00:25:10
    enabling our ability to reach AI
  • 00:25:13
    at scale and so that was really
  • 00:25:15
    preventing us from taking advantage of
  • 00:25:17
    these enormous volumes of data that we
  • 00:25:18
    generate across the business to generate
  • 00:25:21
    value from
  • 00:25:22
    that so we established a few years ago
  • 00:25:26
    now this vision of a data democracy
  • 00:25:29
    where we make this data widely available
  • 00:25:32
    within each country by breaking these
  • 00:25:34
    silos that we have between the
  • 00:25:37
    organizations by having very rich data
  • 00:25:40
    Discovery and to do this at scale we use
  • 00:25:43
    policy as code to not only enforce
  • 00:25:45
    access control but also things um like
  • 00:25:48
    all of the data processes that we have
  • 00:25:50
    for maintaining quality through the
  • 00:25:52
    pipeline and that allowed us to really
  • 00:25:55
    use standard CI/CD techniques and tooling
  • 00:25:58
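Policy as code, as described here, treats access rules and pipeline checks as versioned data that CI/CD tooling can lint and test before they are enforced. A minimal illustrative sketch, assuming a hypothetical rule shape, role names, and datasets (not Orange's actual schema):

```python
# Illustrative policy-as-code sketch: access rules live in version control
# as plain data, so a CI job can validate them before they are enforced.
# Role names, dataset names, and the rule shape are all hypothetical.

POLICIES = [
    {"role": "analyst", "dataset": "network_telemetry",
     "deny_columns": ["subscriber_id", "imsi"]},
    {"role": "data_engineer", "dataset": "network_telemetry",
     "deny_columns": []},
]

def is_allowed(role: str, dataset: str, column: str) -> bool:
    """Return True if `role` may read `column` of `dataset` under POLICIES."""
    for policy in POLICIES:
        if policy["role"] == role and policy["dataset"] == dataset:
            return column not in policy["deny_columns"]
    return False  # default-deny: no matching policy means no access

def validate_policies(policies) -> list:
    """A check a CI pipeline could run: every policy must be fully specified."""
    errors = []
    for i, p in enumerate(policies):
        for key in ("role", "dataset", "deny_columns"):
            if key not in p:
                errors.append(f"policy {i} missing '{key}'")
    return errors
```

Because the policies are plain data under version control, a pull request that changes them can be reviewed and tested like any other code change.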
    to dramatically improve the way that we
  • 00:25:59
    manage our data and also using AI itself
  • 00:26:03
    to identify anomalies in the pipelines
  • 00:26:06
    has been very very useful and so now our
  • 00:26:10
    CEOs also because of the tsunami of AI
  • 00:26:13
    they're really seeing data uh itself as
  • 00:26:15
    being really foundational and crucial to
  • 00:26:17
    the business and the thing that we did
  • 00:26:19
    to encourage that was we set two things
  • 00:26:23
    one was we set a a uniform way to
  • 00:26:25
    measure value on use cases across all 26
  • 00:26:28
    countries and that's widely available on
  • 00:26:31
    a dashboard that every CEO and employee
  • 00:26:35
    can see so they can see which use cases
  • 00:26:37
    are generating a lot of value but also
  • 00:26:40
    we have other operational kpis that
  • 00:26:43
    relate to our data migrations and data
  • 00:26:45
    quality that are also public and so what
  • 00:26:48
    that did was it created a
  • 00:26:50
    competitive dynamic between our
  • 00:26:54
    CEOs which was really effective and it
  • 00:26:56
    also led individual people in
  • 00:26:58
    the company to see which countries were
  • 00:27:00
    being really successful at different
  • 00:27:02
    parts of their Journey towards AI at
  • 00:27:04
    scale and encourage them to learn from
  • 00:27:06
    one another and so we took inspiration
  • 00:27:09
    from data mesh to then build a set of
  • 00:27:11
    data products but our approach to data
  • 00:27:14
    products is that we have a centralized
  • 00:27:16
    team my team that defines architecture
  • 00:27:20
    with Partners like Google and that
  • 00:27:22
    allows us to have uniform infrastructure
  • 00:27:25
    that's provided to each of these teams
  • 00:27:26
    that's generating data
  • 00:27:28
    and then they're responsible for
  • 00:27:31
    maintaining the freshness and the
  • 00:27:32
    documentation and the quality of that
  • 00:27:34
    data and the level of
  • 00:27:37
    automation that we're providing uh
  • 00:27:38
    enables this really rich um set of
  • 00:27:41
    outputs and value that's getting
  • 00:27:42
    generated by the use cases in these
  • 00:27:44
    countries so this is what it looks like
  • 00:27:47
    there's really three pillars the first
  • 00:27:50
    is the data products that relate to data
  • 00:27:52
    management and data quality and also
  • 00:27:55
    role-based access with policy as code so
  • 00:27:59
    one of the things that's really powerful
  • 00:28:01
    uh with Dataplex and BigQuery in
  • 00:28:03
    Partnership is that you can have very
  • 00:28:06
    clear role-based access to data but also
  • 00:28:08
    on a column level uh which is really
  • 00:28:10
    powerful because there's certain users
  • 00:28:13
    that we have that we don't want them to
  • 00:28:14
    be able to see the really sensitive
  • 00:28:16
    data but we want to enable them to use
  • 00:28:18
    that data for data
  • 00:28:21
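Column-level access of the kind described, where certain roles cannot see sensitive columns but can still operate on the rest of the row, can be illustrated with a small masking sketch. The column and role names here are hypothetical, not actual Dataplex or BigQuery policy syntax:

```python
# Sketch of column-level access: every role sees the same table, but
# sensitive columns are masked unless the role is cleared, so uncleared
# users can still run operational queries on the remaining columns.

SENSITIVE_COLUMNS = {"msisdn", "email"}     # hypothetical sensitive fields
CLEARED_ROLES = {"privacy_officer"}         # hypothetical cleared role

def read_row(role: str, row: dict) -> dict:
    """Return `row` with sensitive columns masked unless `role` is cleared."""
    if role in CLEARED_ROLES:
        return dict(row)
    return {col: ("***" if col in SENSITIVE_COLUMNS else val)
            for col, val in row.items()}
```

An uncleared operations role would see `{"msisdn": "***", "country": "FR"}` for a row whose country field is `"FR"`, which is enough for aggregate work without exposing the identifier.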
    operations the second part is the
  • 00:28:23
    self-service platform that we built
  • 00:28:25
    based on GitLab that leverages this great
  • 00:28:28
    interaction between BigQuery, Dataplex, and
  • 00:28:30
    Vertex AI so for me it's this Golden
  • 00:28:34
    Triangle of the ability to use the best
  • 00:28:38
    we think the best data infrastructure in
  • 00:28:39
    the world on BigQuery with Vertex AI where
  • 00:28:42
    we get not only the best of the Google
  • 00:28:45
    first-party models and
  • 00:28:46
    tools also through the model Garden we
  • 00:28:49
    get leading state-of-the-art open-weight
  • 00:28:53
    models as well as state-of-the-art open
  • 00:28:56
    source tooling for managing the model
  • 00:28:58
    life cycle and I've never seen a faster pace of
  • 00:29:03
    innovation in my entire career than what
  • 00:29:05
    we see in open source tooling as well as
  • 00:29:08
    open source and open-weight LLMs so
  • 00:29:11
    being able to manage that as policy is
  • 00:29:15
    code between all of the business
  • 00:29:17
    decision makers whether they be the
  • 00:29:20
    data governors our engineers and
  • 00:29:23
    the business owners has allowed us to
  • 00:29:24
    really operate this at a much higher
  • 00:29:26
    scale than was possible before
  • 00:29:29
    and then lastly we Federate all of that
  • 00:29:32
    with governance to harmonize not only
  • 00:29:34
    access to the data and the documentation
  • 00:29:36
    but also the catalogs and the forms of
  • 00:29:39
    Discovery so let me talk next about what
  • 00:29:41
    we want to do
  • 00:29:43
    next we've built this early
  • 00:29:46
    implementation of data mesh on top of
  • 00:29:49
    the Google platform and what we've seen
  • 00:29:52
    is that the dataplex tool is incredibly
  • 00:29:54
    useful for us for things like data
  • 00:29:57
    catalog
  • 00:29:58
    auto data quality and data loss
  • 00:30:01
    prevention and the other thing that's
  • 00:30:03
    been great about working with the Dataplex
  • 00:30:05
    team in addition to the fact that
  • 00:30:07
    they've been extremely reactive to
  • 00:30:09
    understanding our challenges and
  • 00:30:11
    providing us great interaction with the
  • 00:30:14
    roadmap and influencing the
  • 00:30:16
    engineering team but also they've been
  • 00:30:18
    really good about providing open apis to
  • 00:30:21
    our other partners to allow them to have
  • 00:30:23
    Rich synchronization between their
  • 00:30:26
    tooling environment and Dataplex itself
  • 00:30:28
    so for example uh in the case of Collibra we
  • 00:30:31
    use Collibra today across the company to
  • 00:30:34
    manage data that's not on gcp that's on
  • 00:30:36
    a lot of existing um
  • 00:30:38
    infrastructure um that hasn't been
  • 00:30:40
    migrated yet so having the ability to
  • 00:30:42
    have really rich interaction between our
  • 00:30:45
    existing data governance infrastructure
  • 00:30:47
    and what we're building in Dataplex has been
  • 00:30:49
    extremely
  • 00:30:51
    powerful so where we want to go is we
  • 00:30:54
    have this vision of using AI itself
  • 00:30:58
    um to have a Marketplace for both the
  • 00:31:01
    data that's within BigQuery but also
  • 00:31:03
    within Vertex AI and so the idea that we
  • 00:31:05
    can use natural language as a way for
  • 00:31:08
    anyone in the company within this
  • 00:31:10
    Marketplace to query what data is
  • 00:31:12
    available to have very very quick
  • 00:31:15
    business intelligence visualizations of
  • 00:31:17
    that data and be able to answer really
  • 00:31:20
    direct simple questions and have a
  • 00:31:22
    dialogue with the data even before they
  • 00:31:24
    engage a data scientist or a data
  • 00:31:26
    engineer uh really unlocks an enormous
  • 00:31:29
    amount of value in the company and and
  • 00:31:32
    we think that it's a fundamental shift
  • 00:31:34
    in the technological interaction with
  • 00:31:36
    computers where you can use natural
  • 00:31:38
    language at the level of
  • 00:31:41
    power that previously you needed to be a
  • 00:31:43
    programmer to achieve and and then
  • 00:31:46
    lastly using AI to detect anomalies in
  • 00:31:49
    our pipelines uh to help us fill uh
  • 00:31:52
    where we have gaps in our data and
  • 00:31:54
    otherwise to make sure that the data
  • 00:31:56
    that we're generating is of high quality
  • 00:31:58
    because it's clear that without
  • 00:32:00
    extremely high quality data we're not
  • 00:32:02
    going to have high quality outputs from
  • 00:32:04
    our AI systems and then that applies not
  • 00:32:07
    only to systems where we're just doing
  • 00:32:09
    inference on existing large language
  • 00:32:11
    models but it's also very true when we
  • 00:32:14
    try to fine-tune models for them
  • 00:32:17
    to be much smaller and to operate
  • 00:32:20
    much faster so again having extremely
  • 00:32:23
    high quality data where we're managing the
  • 00:32:25
    lineage of that data and that's really
  • 00:32:28
    easily accessible to the teams that
  • 00:32:29
    are working on the AI fine-tuning uh has
  • 00:32:32
    been really transformative and so this
  • 00:32:34
    data democracy for us is all about
  • 00:32:37
    having this data easily accessible in
  • 00:32:40
    extremely high quality that's well
  • 00:32:42
    documented including by having
  • 00:32:44
    generative AI fill gaps in
  • 00:32:47
    documentation and identify missing
  • 00:32:50
    elements and having that integrated
  • 00:32:53
    extremely well into the workflow of our
  • 00:32:55
    employees and we think that this data
  • 00:32:58
    democracy will unlock an enormous
  • 00:33:00
    amount of value across the company
  • 00:33:03
    because the amount of data that we're
  • 00:33:04
    generating today that's been very hard
  • 00:33:06
    to manage in the past now with this more
  • 00:33:09
    uniform infrastructure that's not only
  • 00:33:11
    available for us on public Cloud but
  • 00:33:14
    also we've been working very closely
  • 00:33:16
    with Google over the last few years to
  • 00:33:18
    have on premise data infrastructure
  • 00:33:20
    which we announced today so we have a
  • 00:33:22
    GDC Edge uh infrastructure that we can
  • 00:33:25
    deploy in our own data centers in each
  • 00:33:27
    country that also has uh data management
  • 00:33:31
    and AI capability so it gives us this
  • 00:33:33
    really rich environment between hybrid
  • 00:33:36
    between on-prem and public Cloud because
  • 00:33:39
    we have to respond not only to very
  • 00:33:41
    varying regulatory requirements across
  • 00:33:43
    our countries that change often
  • 00:33:46
    unpredictably um but also we have
  • 00:33:48
    commercial constraints because the
  • 00:33:50
    amount of data that's coming off our
  • 00:33:51
    network is enormous so for example just
  • 00:33:55
    the network telemetry data which
  • 00:33:57
    is the data we use to operate the
  • 00:33:59
    network it's over a petabyte a day so having
  • 00:34:02
    something sophisticated on premise to
  • 00:34:04
    allow us to filter that data before we
  • 00:34:07
    send to public cloud and to do that in a
  • 00:34:09
    way that maintains quality and this
  • 00:34:11
    policy-as-code mechanism is
  • 00:34:13
    extremely
  • 00:34:15
    transformative so let me bring Lou back
  • 00:34:17
    up to talk about what's on the road
  • 00:34:23
    map
  • 00:34:26
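The on-premise filtering Steve describes, cutting a petabyte a day of network telemetry down before it is sent to the public cloud while preserving the useful signal, amounts to a filter-and-aggregate pass at the edge. A minimal sketch, with field names and thresholds that are purely illustrative:

```python
# Sketch of edge filtering: forward the interesting records (errors,
# anomalous latency) as-is and fold routine records into an aggregate,
# so the volume shipped to the cloud drops without losing the signal.

def filter_telemetry(records, latency_threshold_ms=500):
    """Split records into forwarded raw records and a summary aggregate."""
    forwarded, normal_count, latency_sum = [], 0, 0
    for rec in records:
        if rec["status"] == "error" or rec["latency_ms"] > latency_threshold_ms:
            forwarded.append(rec)        # interesting: send unchanged
        else:
            normal_count += 1            # routine: keep only the aggregate
            latency_sum += rec["latency_ms"]
    summary = {
        "normal_count": normal_count,
        "avg_latency_ms": latency_sum / normal_count if normal_count else 0,
    }
    return forwarded, summary
```

In a real deployment the quality of this pass matters as much as the volume reduction: the aggregate must stay faithful enough that downstream analytics and model training are not biased by what was dropped.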
    thanks all right thank you so much Steve
  • 00:34:29
    and Cynthia for sharing your use cases
  • 00:34:32
    and perspective those are really really
  • 00:34:34
    wonderful insights and I think we can
  • 00:34:37
    all resonate with just how critical it
  • 00:34:40
    is to have this platform with
  • 00:34:42
    self-served Discovery and well-governed
  • 00:34:45
    data it's not easy but that's what we're
  • 00:34:48
    here for so next let's take a look at
  • 00:34:51
    what are the new launches we're very
  • 00:34:53
    excited to announce this
  • 00:34:56
    time first and foremost everything
  • 00:34:58
    in Dataplex starts from having this unified
  • 00:35:01
    metadata across distributed data and
  • 00:35:05
    that's exactly where automated
  • 00:35:07
    cataloging comes in we have worked very
  • 00:35:10
    closely with various gcp services and
  • 00:35:13
    products in order to ingest that
  • 00:35:16
    metadata to harvest metadata and index
  • 00:35:19
    metadata for search and based on this
  • 00:35:23
    you will be able to discover your assets
  • 00:35:25
    across analytics data lakes databases AI
  • 00:35:30
    and bi
  • 00:35:32
    Services you're also able to enrich and
  • 00:35:35
    organize this data to track lineage to
  • 00:35:38
    enforce governance policies and really
  • 00:35:41
    having this solid foundation for data to
  • 00:35:44
    AI
  • 00:35:46
    governance already Dataplex supports
  • 00:35:49
    a rich set of data sources such as
  • 00:35:52
    BigQuery and Pub/Sub and today we're super
  • 00:35:55
    excited to announce a host of new
  • 00:35:57
    Integrations as you can see
  • 00:36:00
    here first are the Vertex AI related
  • 00:36:02
    launches we're very excited to be
  • 00:36:04
    announcing the GA of automated
  • 00:36:07
    cataloging for Vertex AI models and data
  • 00:36:10
    sets and also the preview of automated
  • 00:36:13
    cataloging for vertex AI features with
  • 00:36:16
    those Integrations in place as soon as
  • 00:36:18
    you create a new artifact in vertex AI
  • 00:36:21
    they will be made searchable in Dataplex in
  • 00:36:24
    near real time and this is really
  • 00:36:27
    critical because we truly believe that
  • 00:36:30
    data and AI should be managed and
  • 00:36:33
    governed in a consistent and coherent
  • 00:36:36
    way next are operational databases
  • 00:36:39
    including the GA of Bigtable
  • 00:36:42
    integration spanner integration as well
  • 00:36:45
    as the preview of automated metadata
  • 00:36:48
    cataloging from cloud SQL and it's super
  • 00:36:51
    important to have this coverage for
  • 00:36:53
    operational databases as well to really
  • 00:36:56
    provide end-to-end
  • 00:36:59
    visibility next we're also actively
  • 00:37:01
    working on looker integration and it's a
  • 00:37:04
    launch that's coming soon so please stay
  • 00:37:06
    tuned and with all of those launches in
  • 00:37:09
    place our goal is to really provide a
  • 00:37:12
    powerful metadata Foundation to you to
  • 00:37:15
    enable automated metadata Discovery
  • 00:37:18
    management and
  • 00:37:21
    governance next is lineage so Dataplex
  • 00:37:25
    already provides the ability for you to
  • 00:37:28
    automatically track and visualize
  • 00:37:30
    lineage as your data artifacts flow
  • 00:37:33
    through your distributed data
  • 00:37:35
    landscape now this capability also works
  • 00:37:38
    nicely with other Dataplex features such as
  • 00:37:41
    data quality checks where as soon as a
  • 00:37:44
    data quality issue is discovered you
  • 00:37:47
    will be able to trace upstream and
  • 00:37:49
    downstream in order to understand what
  • 00:37:51
    is the root cause and impact of a
  • 00:37:54
    particular data quality
  • 00:37:56
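The upstream and downstream tracing just described is, at its core, a transitive traversal of a lineage graph: walk upstream from a failing asset to find candidate root causes, and downstream to find what it impacts. A toy sketch with hypothetical asset names:

```python
# Toy lineage graph: each asset maps to the assets it is derived from.
UPSTREAM = {
    "report": ["sales_clean"],
    "sales_clean": ["sales_raw"],
    "sales_raw": [],
    "dashboard": ["report"],
}

def walk(asset, edges):
    """Transitively collect all assets reachable from `asset` via `edges`."""
    seen, stack = set(), list(edges.get(asset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

# Downstream edges are just the upstream edges reversed.
DOWNSTREAM = {}
for child, parents in UPSTREAM.items():
    for parent in parents:
        DOWNSTREAM.setdefault(parent, []).append(child)
```

If a quality check fails on `sales_clean`, walking `UPSTREAM` points at `sales_raw` as the candidate root cause, while walking `DOWNSTREAM` shows that `report` and `dashboard` are impacted.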
    breach now with lineage parsing there's
  • 00:37:59
    already native integration with services
  • 00:38:02
    like BigQuery Dataproc and Composer and
  • 00:38:06
    we also have the Dataplex API and OpenLineage
  • 00:38:09
    integration to really provide that
  • 00:38:12
    extensibility and today we're really
  • 00:38:14
    excited to announce the lineage support
  • 00:38:17
    for vertex AI pipelines really allowing
  • 00:38:20
    for this end-to-end traceability from data
  • 00:38:23
    processing to data analytics to machine
  • 00:38:26
    learning training and deployment and
  • 00:38:29
    providing you with this endtoend picture
  • 00:38:32
    that is critical for data to AI
  • 00:38:34
    governance and
  • 00:38:37
    compliance now at the same time in
  • 00:38:39
    addition to extending the type of data
  • 00:38:41
    sources being covered we're also
  • 00:38:43
    enhancing the granularity of lineage
  • 00:38:45
    tracking we're very excited to introduce
  • 00:38:48
    the preview for column level lineage in
  • 00:38:53
    BigQuery oh hey thank
  • 00:38:56
    you
  • 00:38:59
    so ever since introducing table lineage
  • 00:39:01
    in BigQuery as well as other services we
  • 00:39:04
    have seen strong customer enthusiasm
  • 00:39:07
    and adoption thanks to all of you and we're
  • 00:39:09
    also getting very strong demand for the
  • 00:39:12
    next level granularity which is what
  • 00:39:15
    we're very excited to bring to you today
  • 00:39:17
    so now you're able to perform root cause
  • 00:39:19
    analysis and impact analysis at the
  • 00:39:22
    column level in addition to at the table
  • 00:39:25
    level and imagine when you have a column
  • 00:39:28
    that's identified to contain personal
  • 00:39:30
    identifiable information this is where
  • 00:39:32
    column level lineage really shines right
  • 00:39:35
    where you're able to then control its
  • 00:39:37
    propagation and then be able to comply
  • 00:39:39
    with different
  • 00:39:42
    regulations now there's also more ease
  • 00:39:45
    of use features that we're launching
  • 00:39:48
    together with this so for example
  • 00:39:50
    there's the ability to help you pull up
  • 00:39:52
    all the upstreams and all the
  • 00:39:54
    downstreams of a particular node in the
  • 00:39:56
    lineage graph
  • 00:39:57
    there's also the ability to filter by
  • 00:40:00
    different transformation types to make
  • 00:40:02
    lineage graph more consumable and
  • 00:40:04
    there's also the ability to export
  • 00:40:06
    lineage for offline analysis so all of
  • 00:40:09
    this is to enhance our user experience
  • 00:40:13
    and to make it easier to work with
  • 00:40:15
    Dataplex
  • 00:40:17
    lineage next are two Gemini-powered
  • 00:40:21
    launches from Dataplex so first of all
  • 00:40:24
    we know that searching over metadata is
  • 00:40:27
    a really critical experience with Dataplex
  • 00:40:30
    and it's really at the core of what we
  • 00:40:32
    do here at dataplex now in addition to
  • 00:40:35
    doing keyword search with dataplex
  • 00:40:37
    you're able to just ask us a question in
  • 00:40:40
    natural language and Dataplex will be able
  • 00:40:43
    to interpret your intent and be able to
  • 00:40:46
    retrieve the most relevant search
  • 00:40:49
    results this can really go a long
  • 00:40:51
    way to lower this entry barrier as we
  • 00:40:53
    have discussed earlier and to really
  • 00:40:55
    democratize the experience of data
  • 00:40:58
    Discovery to your entire
  • 00:41:01
    organization now once the data is
  • 00:41:03
    discovered there's another really
  • 00:41:05
    exciting Gemini-powered feature from
  • 00:41:07
    Dataplex to help which is Data
  • 00:41:10
    Insights now a lot of us working with
  • 00:41:13
    data must have experienced the cold start
  • 00:41:16
    problem now which is once you find a
  • 00:41:19
    valuable data asset you're sometimes not
  • 00:41:21
    sure what is the best SQL queries to
  • 00:41:24
    write in order to really extract that
  • 00:41:26
    meaningful insight from the data so
  • 00:41:29
    that's exactly where Data Insights is here
  • 00:41:32
    to help it would automatically generate
  • 00:41:35
    and suggest SQL queries as well as a
  • 00:41:38
    list of questions you can ask of a table
  • 00:41:41
    in natural language and it will provide
  • 00:41:44
    validated SQL queries to you as well that
  • 00:41:47
    is ready to run in BigQuery
  • 00:41:50
    Studio so this could really help give
  • 00:41:53
    you a jump start into your analysis
  • 00:41:55
    journey and to really help
  • 00:41:57
    accelerate time to Insight for all of
  • 00:42:00
    us next is data governance our favorite
  • 00:42:04
    topic so as we know metadata is the core
  • 00:42:07
    of everything we do here at Dataplex
  • 00:42:10
    right so we're constantly thinking in
  • 00:42:13
    addition to help you better discover and
  • 00:42:15
    better understand this data can we also
  • 00:42:18
    make metadata more actionable to help
  • 00:42:20
    you drive actions in terms of
  • 00:42:24
    data governance
  • 00:42:25
    operations so this is exactly the
  • 00:42:27
    motivation for governance rules where we
  • 00:42:30
    start from the metadata you already have
  • 00:42:33
    in Dataplex whether it's technical metadata
  • 00:42:35
    or business metadata and then you will
  • 00:42:39
    be able to Define and enforce governance
  • 00:42:42
    policies at scale with the help of
  • 00:42:44
    dataplex
  • 00:42:45
    so here's how it works first of all you
  • 00:42:48
    start by writing a search query in
  • 00:42:50
    dataplex to identify all the entries and
  • 00:42:53
    Fields that are relevant for a
  • 00:42:56
    particular governance policy to be
  • 00:42:58
    applied and then you can Define your
  • 00:43:00
    policy in the form of governance rules
  • 00:43:03
    with the help of Dataplex and then
  • 00:43:05
    Dataplex will help you apply and enforce
  • 00:43:08
    this policy across your distributed data
  • 00:43:11
    landscape with proper monitoring
  • 00:43:13
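The three-step flow just described, a search query to select matching entries, a rule that attaches a policy to them, and an enforcement pass with monitoring, can be sketched in miniature. The catalog entries, tag names, and rule here are hypothetical, not the actual Dataplex governance-rule syntax:

```python
# Miniature governance-rules flow: search the catalog for matching
# entries, define a rule as a predicate, then enforce it and report
# which assets are still out of compliance.

CATALOG = [
    {"name": "customers", "tags": {"pii"}, "encrypted": True},
    {"name": "orders", "tags": set(), "encrypted": False},
    {"name": "support_tickets", "tags": {"pii"}, "encrypted": False},
]

def search(catalog, tag):
    """Step 1: find all catalog entries carrying `tag`."""
    return [e for e in catalog if tag in e["tags"]]

def enforce(entries, predicate):
    """Steps 2-3: apply a rule predicate; return names of violating assets."""
    return [e["name"] for e in entries if not predicate(e)]

pii_rule = lambda e: e["encrypted"]  # example rule: PII tables must be encrypted
violations = enforce(search(CATALOG, "pii"), pii_rule)
```

Because the rule is defined once against a search rather than per table, adding a newly tagged table to the catalog automatically brings it under the same policy at the next enforcement pass.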
    included so in summary what we're
  • 00:43:16
    providing here is a single pane of glass
  • 00:43:18
    for you to indicate and enforce your
  • 00:43:21
    governance intent at scale across
  • 00:43:23
    different types of data no matter where
  • 00:43:25
    they're stored
  • 00:43:27
    now as you can imagine the possibility
  • 00:43:30
    of governance rules is really endless
  • 00:43:33
    right the rules could be about access
  • 00:43:34
    control could be about data life cycle
  • 00:43:36
    management could also be about running
  • 00:43:38
    data quality checks and many
  • 00:43:40
    more so today to start this journey
  • 00:43:43
    we're very excited to announce the
  • 00:43:45
    initial launch of governance rules
  • 00:43:47
    starting from fine-grained access control
  • 00:43:49
    across BigQuery and GCS so that instead
  • 00:43:53
    of having to configure governance
  • 00:43:55
    policies one table at a time or one
  • 00:43:57
    column at a time you can now leverage
  • 00:44:00
    Dataplex to apply them automatically
  • 00:44:02
    for you at scale and this would work
  • 00:44:05
    across BigQuery and Google Cloud Storage
  • 00:44:09
    assets as
  • 00:44:10
    described so the goal here is to really
  • 00:44:12
    help you streamline the governance
  • 00:44:14
    operation and to really minimize any
  • 00:44:16
    potential risk to your security
  • 00:44:20
    posture last but not least we're very
  • 00:44:22
    excited to announce the latest key
  • 00:44:24
    launches driven by the partnership
  • 00:44:27
    between dataplex and
  • 00:44:30
    Collibra specifically this is the preview
  • 00:44:33
    of metadata sync from dataplex to
  • 00:44:35
    Collibra including technical metadata
  • 00:44:38
    business metadata as well as table level
  • 00:44:41
    lineage from
  • 00:44:42
    BigQuery so for this joint effort our goal
  • 00:44:45
    here is to really provide this unified
  • 00:44:49
    data Discovery experience spanning
  • 00:44:51
    multicloud and hybrid Cloud
  • 00:44:54
    environments and this is only the
  • 00:44:56
    beginning there are more exciting and
  • 00:44:59
    deeper Integrations that are being
  • 00:45:01
    planned and being worked on and
  • 00:45:04
    ultimately our goal is to provide the
  • 00:45:07
    flexibility of options and to be able to
  • 00:45:09
    help you combine the best of both
  • 00:45:12
    worlds for our
  • 00:45:14
    customers so with that thank you so much
  • 00:45:17
    for joining our session today thank you
  • 00:45:19
    so much Cynthia and Steve for the
  • 00:45:21
    wonderful insights you have shared thank
  • 00:45:23
    you everybody for
  • 00:45:25
    coming
Tags
  • Data Governance
  • AI
  • Data Discovery
  • Data Quality
  • Google Cloud
  • Dataplex
  • Ford
  • Orange
  • Metadata
  • Data Management