Data + AI Summit 2025 - Keynote Recap
Summary
TLDR: This video summarizes the key announcements from the Data and AI Summit held in San Francisco, including the launch of new products such as Lakebase, which introduces OLTP with separation of storage and compute, and Agent Bricks, which makes building AI agents easier. Spark 4.0 now ships with ANSI SQL mode as standard, along with a range of new features that have been open-sourced. A free edition of Databricks was also introduced, letting students and individuals learn and experiment with the tooling. In addition, there are updates to Unity Catalog and Genie that improve data governance and interaction. The video gives an in-depth look at all the changes and innovations happening across the Databricks ecosystem.
Takeaways
- 🚀 Lakebase introduces OLTP with separation of storage and compute.
- 🤖 Agent Bricks makes building AI agents easier.
- 📊 Spark 4.0 now ships with ANSI SQL mode as standard.
- 🎓 A free edition of Databricks for students and individuals.
- 📁 Unity Catalog improves data governance and discovery.
- 💬 Genie gains new features such as data sampling and suggested queries.
- 🔄 Lakebridge helps migrate workloads from other systems into Databricks.
- 🔧 Lakeflow is a new ETL framework.
- 🌐 AI Gateway manages APIs for AI models.
- 📈 MLflow 3.0 introduces version management for prompts.
Timeline
- 00:00:00 - 00:05:00
This video introduces the latest update from Advancing Spark, covering the key announcements from the Data and AI Summit in San Francisco. The presenter explains that he will condense almost 7 hours of announcements into the essential points you need to know.
- 00:05:00 - 00:10:00
The presenter walks through the new vision for the Databricks platform, including elements such as AI, SQL, and the data marketplace. Emphasis is placed on Unity Catalog, which now supports Iceberg, letting users choose between Delta and Iceberg for managing their data.
- 00:10:00 - 00:15:00
The biggest announcement is Lakebase, which marks Databricks entering the OLTP (Online Transaction Processing) market. It separates storage and compute, which differs from the traditional OLTP approach, and is built on Postgres.
- 00:15:00 - 00:20:00
Lakebase lets users spin up working databases quickly and efficiently, with the ability to handle millions of concurrent queries. This is a big step in the world of data computing.
- 00:20:00 - 00:25:00
Agent Bricks is introduced as a tool for building agent-based systems more easily. It lets users create agents that automate workflows without requiring deep technical knowledge.
- 00:25:00 - 00:30:00
Spark 4.0 is released with many new features, including improved SQL syntax and stricter error handling. This is a major change that users accustomed to Spark's old behavior need to watch out for.
- 00:30:00 - 00:35:00
Declarative Pipelines, previously known as Delta Live Tables, has been open-sourced and integrated into Lakeflow, a suite of tools for data ingestion and processing.
- 00:35:00 - 00:40:00
Lakehouse Apps is now generally available, letting users build user interfaces for interacting with AI. This is an important step toward making AI more accessible to everyday users.
- 00:40:00 - 00:45:11
Genie, the data-interaction tool that lets users converse with their data, is also now generally available, with a range of new features that improve the user experience.
Video Q&A
What is Lakebase?
Lakebase is a new Databricks product that introduces OLTP with separation of storage and compute.
What is Agent Bricks?
Agent Bricks is a tool for building AI agents in an easier, low-code way.
What's new in Spark 4.0?
Spark 4.0 now has ANSI SQL mode as standard and a range of new features that have been open-sourced.
What is the Databricks Free Edition?
The Databricks Free Edition is a fully featured version that students and individuals can use to learn and experiment.
What is Unity Catalog?
Unity Catalog is a data governance tool that lets users access and manage their data more effectively.
What's new in Genie?
Genie now has new features such as data sampling and query suggestions.
What is Lakebridge?
Lakebridge is a migration tool that helps move workloads from other systems into Databricks.
What is Lakeflow?
Lakeflow is a new ETL framework that brings together a range of data management tools.
What is AI Gateway?
AI Gateway is a tool for managing and optimizing APIs for AI models.
What is MLflow 3.0?
MLflow 3.0 is a new release that introduces version management for prompts in AI workflows.
- 00:00:02 Hello Spark fans. Welcome back to Advancing Spark, brought to you by Advancing Analytics, the only people who actually understand what's going on in the world of Databricks currently. Last week was the Data and AI Summit over in San Francisco, and me and the team were over there for so many hours of keynotes and announcements and things going on in the wacky world that is Databricks. So what I thought I would do is take almost 7 hours' worth of announcements and crush them right down, to just tell you what I think is important, what I think of it, and how to understand it. Now, that is still going to take us a little while, so strap yourselves in; this is not going to be a short video. I've got a lot of marketing, I've taken some screenshots of slides, there's some dodgy quality in the things I've snipped, but that's fine. If it's your first time around here, well, welcome. Don't forget to like and subscribe. And yeah, just buckle up: we'll talk about a ton of stuff that is happening in the world of data intelligence on Databricks.
- 00:01:05 So let's go and have a look. This is the big opening slide we saw them keep going back to. Every year we see a new vision of what the platform looks like, and it's a similar idea: a bit of AI, a bit of SQL. The data marketplace is a new thing, Apps is kind of a new thing in this circle, Lakeflow we'll talk about more in a minute, and there's AI/BI. Loads of stuff going on, all underpinned by Unity Catalog, with a special new entrant down there: Unity Catalog is underpinned not just by Delta but also by Iceberg. One of the big announcements this year was that Iceberg is now fully managed and available in Unity Catalog. You can just create a table and decide: ah, this one's going to be a Delta table, this one's going to be an Iceberg table. They both work together; no one really cares what's under the hood.
- 00:01:53 There's full parity of support for both of those two, so very, very cool in terms of what we're seeing there. But there's a ton of new things. Lakeflow itself is new, but we'll get on to that. So I want to take a while just talking through the new concepts: what I thought when I first heard about each one, and what I think about it now, which is different. Word of warning: there are a lot of products called either lake-something or something-bricks, so you're going to have to get used to remembering all these different names, because there are a lot.
- 00:02:25 Number one: Lakebase. Probably the most surprising announcement we actually had this year. Essentially, Databricks is entering the OLTP market. Now, if you're not from a database background, OLTP is online transaction processing, and you've also got the idea of OLAP, online analytical processing. OLAP looks after big chunky set-based queries: I'm looking to aggregate over millions, billions of rows and get an aggregate answer. OLTP is about lots of small concurrent singleton reads and writes. Very different types of technology; most of the time it's still a database, you still use SQL, you still work with it in the same way. But the tools that have grown and evolved in the big data ecosystem, the big crunchy parallel distributed compute aka Spark, have always been about analytics, about doing things at scale to many, many rows at the same time. So Databricks saying "hey, we now do OLTP" was always a bit of a weird one. When I first heard about it I was like: why? Who's that for? Is that just to tick a box, saying other tools have got a database in there, so we've now got a database? So I honestly held my hand up, and I will happily eat my words. What makes the difference? If they were just ticking a box and going "hey, we've got managed Postgres, you can now go and build a database, and it works like a database, looks like a database", that would not be exciting, and I'd struggle to see why they were trying to do anything different.
- 00:03:56 There's actually a lot of stuff behind what's going on. So firstly, this new Lakebase: yes, it is managed OLTP, it is Postgres working inside the thing, but it's Postgres with separation of storage and compute, and you don't get that with OLTP. For OLTP to be fast, to get millisecond latency, you have to have the data stored in a very highly indexed way so you can do that very fast record return that gives you the low latency OLTP needs. So an announcement saying "look, we've got a database, it's now part of Databricks, but we've separated storage and compute" is nuts.
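To make that OLTP/OLAP distinction concrete, here is a minimal illustrative sketch (the table and column names are made up); the two workloads speak the same SQL but have completely different shapes:

    # OLTP: many tiny concurrent point reads/writes at millisecond latency
    oltp_query = "SELECT balance FROM accounts WHERE account_id = 42"

    # OLAP: one big set-based scan and aggregation over millions of rows
    olap_query = """
        SELECT region, SUM(amount) AS total
        FROM transactions
        GROUP BY region
    """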
- 00:04:34 So there are a few things behind it, a few things they were saying "this is what we need to get this to actually work". One is open source: of course, it's Databricks, they're all about open-sourcing things, so it's built on Postgres, the biggest open-source database there is. That just makes sense. The separated storage and compute, yes, is nuts. Then there's getting huge QPS, queries per second, which is concurrency. Most of the time, if you're talking about analytics, you're going to have tens, maybe hundreds, of users running queries, nothing crazy. If you're talking about a website, about e-commerce, about an actual application at scale, you're talking about millions of queries per second. Concurrency is king in the world of any kind of OLTP system; there's just a lot going on.
- 00:05:23 Now, the other thing is AI. In this world of agents, in the direction we're actually going, we're expecting all this stuff to be so much more ephemeral than it used to be. Someone writes a query that spins off an agent; the agent creates a database that it uses to fulfill that request, and then it trashes it. That's a weird idea. We're used to provisioning compute, and maybe the compute scales up and down, but the idea of spinning up a whole separate database to fulfill the work and then trashing it isn't something we've ever done in the application world, because we've never really separated storage and compute in the application world. So yeah, it's just different. It's way more different than I actually thought it was. (Dropping my clicker.)
- 00:06:06 So why is it different? It makes a little more sense if you roll back to a couple of weeks before the summit, when we got the announcement that Databricks had acquired a company called Neon. Now, Neon are well known; they're a startup doing serverless Postgres that champions separation of storage and compute. That's what they do, so this shouldn't come as that much of a surprise. Databricks have obviously just acquired Neon, but they'd been working on Lakebase for a good year or so, and they were working with Neon along the way; they've said they'd been partnering very heavily, and now they've acquired them. So Lakebase isn't Neon, but it's heavily informed by it, and I'm sure we'll see them come closer together in the future. That just makes a little more sense of where they got the idea from, where it's come from, where the technical expertise has come from. So it's more novel than I thought when I first heard the announcement. It's not "oh hey, we're doing OLTP"; it's "oh hey, we're doing OLTP, but very, very differently."
- 00:07:05 Now, how many people will go "oh yeah, that's different enough, I'm going to take my existing e-commerce system and run it on top of Lakebase"? I don't know. I mean, there's a maturity piece out there: this has only just been announced, before people have actually used it in anger to run their entire business on. There'll be a little bit of maturing, a little bit of migration; there'll be an adoption period, right? It's very new, but the idea is novel enough that, actually, yeah, it's interesting.
- 00:07:31 So here's one of my comically clipped slides that's terribly low-res, but it shows the whole idea: there's the object store, actually in the lake; compute can just be ephemeral; I can spin up a new database that's looking at existing data and spin instances out, each with their own controls around them. It's the buffer store and concurrency management sitting in the middle that's essentially doing some of the magic enabling this to actually work at that level of scale. The actual technical detail of how that works, I don't know; stick around, I'm sure there are much deeper videos we're going to do about Lakebase in the future. But yeah, that's probably the biggest announcement, the biggest "oh, that's different."
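Since Lakebase is Postgres under the hood, a standard Postgres driver should be all you need to talk to it. A minimal sketch only; every connection detail below is a placeholder invented for illustration, not a real Lakebase endpoint:

    import psycopg2

    conn = psycopg2.connect(
        host="my-lakebase-instance.cloud.databricks.com",  # placeholder
        dbname="databricks_postgres",                      # placeholder
        user="me@example.com",                             # placeholder
        password="<oauth-token>",                          # placeholder
        sslmode="require",
    )
    with conn, conn.cursor() as cur:
        # Classic OLTP shape: a singleton point read
        cur.execute("SELECT status FROM orders WHERE order_id = %s", (42,))
        print(cur.fetchone())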
- 00:08:12 That was number one, the first nice gentle announcement just to ease us into things. Number two is a little crazier. We got this thing which I'm so happy they finally decided the name for: Agent Bricks. Yes, it's another combination of lake and bricks. Obviously the world has gone agent-based; the world is agentic these days. Two years ago everyone was mad for ChatGPT and the idea of LLMs, last year it was all about vector databases and RAG, and this year has heavily been defined by everything being an agent. I'm automating all of my workflows: build an agent to go and automate them. You're migrating something: you build an agent. Everything has gone agentic. But there's been a barrier to entry: you can't just pick up and use an agent, you have to go and understand AutoGen and all the other libraries that actually build these things up. It's quite a high technical barrier to entry. It's not crazy, but yeah, you've got to learn some stuff.
- 00:09:03 Now, Agent Bricks is going to tackle that. Agent Bricks is trying to say: how do we do kind-of-low-code for building out an agent? The idea is: how do we build an agentic system that answers a certain question? Essentially, you write a prompt that says "this is what good looks like". Then you've got the LLM judge, and that was a whole thing that came out around the question of how you evaluate a large language model. With machine learning, if I say I'm going to predict this figure, I can then find out what the figure actually was, compare the two, and say scientifically, mathematically, this is how accurate that model was; we can put an actual percentage figure on the accuracy of traditional machine learning. When a large language model brings back some text, I can go "that looks right" or "that doesn't look right; that's factually incorrect", but I can't put a number on it, I can't quantify it as much. So how do you do that at scale? How do you make it repeatable? The answer is: get a large language model to judge the output of another large language model, the whole LLMs-as-judges idea. And that's actually grown: there's technology around it, there are ratings, there are now frameworks built into MLflow that let you do it.
- 00:10:14 So the whole idea is that we can type some sentences to tell the judge what it's looking for, and then under the hood Agent Bricks goes and builds an agent that satisfies that, using it as acceptance criteria, and then it comes back with "right, I've got some options that meet it; which one do you want to go with?" Essentially, it's an agentic workflow to build agents. The robots are now building robots; this is where we've got to. That's essentially what Agent Bricks is. Now, to get this to work they've had to collapse it down a bit; it can't just do anything in the world you can think of. Essentially, there are some out-of-the-box, pre-canned solutions. So: information extraction, where here's a load of documents sat in a Databricks volume and I want to build an agent that will go and figure some of that stuff out; a quick RAG lookup to answer some questions; the Q&A-style things; and multi-agent supervisor, which I was not expecting so early. One of the gaps people have had with Databricks Genie, the whole chat-over-your-data thing, is that because you build them scoped around domains of your data, you can't really have a user come and just ask a question across anything. You need to build an agent that sits on top of a load of Genie APIs and decides which Genie room to ask a question to. Or: use a multi-agent supervisor, put that on top of Genie, and give it the ability to ask a question to a load of Genie rooms or a separate agent. That's just an out-of-the-box thing as part of Agent Bricks, which is crazy. So there's loads of really cool stuff sat underneath there in terms of how it works.
- 00:11:36 There are obviously lots of cool demo videos out there, but this is the kind of workflow: describing things, essentially writing a prompt so it understands how to actually do it; kicking that off; looking at the response; evaluating it and saying "that's not good enough, that's good enough"; letting it iterate over that; and then actually coming out the other side with a production agent we can go and work with.
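The LLM-as-judge loop described above is easy to sketch. This is a minimal, library-agnostic illustration of the pattern, not the Agent Bricks internals; call_llm is a made-up stub so the sketch runs as-is, where a real version would call a model-serving endpoint:

    JUDGE_PROMPT = """You are grading an agent's answer.
    Acceptance criteria: {criteria}
    Question: {question}
    Answer: {answer}
    Reply PASS or FAIL, then one sentence of reasoning."""

    def call_llm(prompt: str) -> str:
        # Stub standing in for a real model call; returns a canned verdict
        return "PASS - the answer satisfies the stated criteria."

    def judge(question: str, answer: str, criteria: str) -> bool:
        # One LLM grades another LLM's output against plain-English criteria
        verdict = call_llm(JUDGE_PROMPT.format(
            criteria=criteria, question=question, answer=answer))
        return verdict.strip().upper().startswith("PASS")

    # Iterate candidate agents/answers until one passes the judge
    print(judge("What drove Q3 churn?",
                "Churn rose 4%, driven by the March pricing change.",
                "Must name a specific driver and quantify the change"))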
- 00:11:59 So we're expecting huge amounts of uptake of Agent Bricks, just because everyone's trying to build agents right now, but not everyone can build an agent; unless it's one of these agents, in which case you can actually get a fairly decent production-grade agent out of it. It's better than things like AutoML. AutoML does a similar thing for traditional machine learning, where we say "I've got this problem, suggest a load of models", but you'd nearly always need to take that model and give it to a data scientist to actually turn into a production-grade thing. Some of these agents are production-grade when you build them, which is madness, but very cool.
- 00:12:35 All right, let's take a step back from all the AI and hype. We've got a release of Spark 4.0, which has been hotly anticipated, with loads of stuff going into it. It's an interesting one if you've been working inside Databricks for a long time, because it's essentially a load of Databricks features that are now open source as part of Spark. If we look at the big wall of what's gone into it, there are loads of things we've seen before: the SQL pipe syntax, the improvements to SQL UDFs, the VARIANT data type, and DataFrame.plot, so you can do some native plotting rather than having to call a separate library. Loads of things we've seen inside Databricks runtimes over the past 6 to 9 months have now been bundled up and put inside the main Spark 4.0 release. So it's a huge release, just open-sourcing all of this good stuff.
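As a taste of one of those features, here is a minimal sketch of the SQL pipe syntax called from PySpark (a toy example of my own; as I understand the Spark 4.0 pipe operators, each |> step transforms the result of the previous one):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.range(10).createOrReplaceTempView("numbers")

    # Pipe syntax reads top-to-bottom instead of inside-out
    spark.sql("""
        FROM numbers
        |> WHERE id % 2 = 0
        |> AGGREGATE COUNT(*) AS even_count
    """).show()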
- 00:13:21 Now, there's one other thing in there, and this is going to catch out so many people: as of Spark 4.0, ANSI SQL mode is standard. So if you're using a runtime that has Spark 4.0 baked into it, it is going to use ANSI SQL mode. Generally that's good: SQL behaves more like SQL in other database systems, which is generally what people want. However, Spark has always had this attitude of "I'm just going to run, and if something happens, I'll ignore it; I'll get to the end, I'm not going to fail." So if you had a divide-by-zero error, it would just return a null and carry on going. If you had a data type collision, it would just return a null and keep going. In ANSI standard mode, those now throw errors. So if you've got a lot of pipelines that were nulling out data, whether you weren't really aware of it or you were doing it deliberately and just letting it run through, those are now going to error. So be careful when you adopt Spark 4.0: there is a behavioral breaking change around how it executes SQL because of ANSI standard mode. But otherwise, there's loads and loads of good stuff in there.
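A quick sketch of that breaking change, toggling the ANSI flag explicitly so both behaviors are visible (in Spark 4.0 the flag simply defaults to true):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT 1 / 0 AS x").show()      # legacy mode: returns NULL

    spark.conf.set("spark.sql.ansi.enabled", "true")
    # ANSI mode: the same query now raises a DIVIDE_BY_ZERO error
    # spark.sql("SELECT 1 / 0 AS x").show()

    # try_divide opts back into null-on-error explicitly, per expression
    spark.sql("SELECT try_divide(1, 0) AS x").show()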
- 00:14:24 I would say check it out. But that's not the biggest thing that has happened in the world of open-source Spark. The biggest thing is Declarative Pipelines for Spark, previously known as Delta Live Tables. DLT, Delta Live Tables, is an ETL framework that's been baked into Databricks for ages. You can define some tables in PySpark or in SQL, and it will just wrap them in dependency management, restartability, full telemetry and logging, and data quality expectations: all the stuff you would expect a good data engineer to actually build into any of their pipelines, it just does automatically. So DLT is a fairly robust, now fairly mature processing framework, and they've just open-sourced it. In open-sourcing it, they have changed the name: you will see loads of people talking about Declarative Pipelines for Spark, and that is the new name for Delta Live Tables. You will not see DLT referred to in any of the docs or any of the libraries anymore. It is now Declarative Pipelines.
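For anyone who hasn't used it, the developer experience looks roughly like this. A minimal sketch in the DLT decorator style as it exists inside Databricks today (import dlt; the naming may differ in the open-source release, and the source path here is a placeholder):

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders landed from cloud storage")
    def bronze_orders():
        # `spark` is provided by the pipeline runtime; path is a placeholder
        return spark.read.format("json").load("/Volumes/raw/orders")

    @dlt.table(comment="Cleaned orders")
    @dlt.expect_or_drop("valid_amount", "amount >= 0")  # data quality expectation
    def silver_orders():
        # Reading bronze_orders is what wires up the dependency graph
        return dlt.read("bronze_orders").withColumn(
            "ingested_at", F.current_timestamp())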
- 00:15:37 Now, there is a slight difference: you've got the open-source version of Declarative Pipelines, but there's also the question of how it's implemented inside Databricks. The answer is this new product group called Lakeflow. They announced Lakeflow last year and said "this is what's coming, we're going to build all these different pieces", and essentially DLT as implemented inside Databricks is part of Lakeflow. Other things like Lakeflow Connect are part of Lakeflow too. Essentially, how you go about doing ETL inside Databricks is now under this umbrella of tools known as Lakeflow. So, two things: Declarative Pipelines is now open source in Spark, and all the other components inside Databricks have been rebranded and relaunched with a load of new stuff as Databricks Lakeflow, which is now generally available. There are three different aspects to it, most of which we've kind of seen before.
- 00:16:27 So: Lakeflow Connect, part of which went GA earlier this year, is the low-code ability to go and get data from somewhere: get data from Salesforce, from Workday, from a bunch of fairly common data sources, click some buttons, and just have it automatically start doing incremental CDC of that data into my lake. Great. Declarative Pipelines sits in the middle: that's Delta Live Tables, but with a few updates, more stuff going into it, and some other cool stuff on top we'll look at in a minute. And then Lakeflow Jobs is essentially a rebranded, revamped version of Databricks Workflows. So this suite of tools, data acquisition, data transformation, and then data orchestration, sits under that umbrella of Lakeflow.
- 00:17:13 If you've not seen Lakeflow Connect, it's this kind of thing: a nice, clean, easy GUI to go off and get some data; this is the source, pull it in. Now, they've added a load more compatible sources as part of last week's release and announcements, but there still need to be more. So we're going to see this running incremental addition of "we've added this data source, this data source, this data source", and gradually you'll be able to do 100% of the sources you need from within Lakeflow. Not yet: there's a bunch of sources it can currently do out of the box, some of which are GA, some of which are still in preview. So yeah, lots of stuff going on there.
- 00:17:51 There are also changes to how DLT, now known as Declarative Pipelines, actually works. You've got a multi-file editor view. If you've been using DLT for a long time, it used to be pretty awful, in that you'd write a notebook, then have to go to an entirely different screen to test and run it, then go back to your notebook; the dev experience was a bit mad. Now we've got an opinionated project structure saying "actually, this is how you should structure your Python and SQL code into your layers", which is a fairly decent way of building out a declarative job. You've got your editor in there with part of your result set, and the DLT-style execution graph all baked in. So basically the IDE for Declarative Pipelines has changed in line with these announcements, and there's just more you can do in there. Lots of nice stuff; essentially it's going to be nicer to work in that environment.
- 00:18:43 The big announcement around this area is this thing, Lakeflow Designer. This is something they demoed in the keynote, and everyone came out kind of gobsmacked, open-mouthed, going "oh, that's going to change things." On the face of it, this is just a low-code editor for building out declarative pipelines: I can drag and drop and say get some data from that source, transform it, write it over there. That's fairly straightforward. But they've gone a step further. The thing they demoed at the keynote last week essentially says: what if we did that, but slapped a natural-language editor on top of it? What if we had something where we just tell it what to do and it goes and builds it for us? That's terrifying, absolutely terrifying, in terms of how it changes the workflow of what we do. We are in the era of vibe engineering, my friends: you're going to be able to build out pipelines component by component just by saying what you want it to do, and it will go and build it out.
- 00:19:46 Now, I've had interesting conversations about the "are we going to use this forever?" question. Well, probably not: it would take me longer to ask that question of each and every single table I'm going to load than it would to define a generic workflow and then do that a thousand times. But this really, really lowers the technical barrier, so almost anyone can go and build these things. I'm just slightly terrified that in a year or two we're going to have a mad spaghetti mess of pipelines, each built slightly differently depending on the syntax and context of how the question was asked. So how we use it is going to be interesting, but the fact that we can use it, with so much power and so much use behind it, is pretty crazy and cool. So yeah: Lakeflow Designer, low-code and agentic pipeline building on top of Declarative Pipelines; it still builds everything using Declarative Pipelines for Spark.
- 00:20:42 All right, next up we've got Lakebridge, previously known as BladeBridge, a company that Databricks acquired, which is an AI-fueled migration tool. This is for when I've got, say, an old SQL Server, or Oracle, or Synapse, and I'm trying to take workloads out of that and put them into Databricks. It is a lift-and-shift migrator. Now, I need to be really specific there: it's a lift-and-shift migrator. So if I had a thousand stored procs inside Snowflake and I wanted to get them into Databricks, it would pick them up, evaluate them, translate all of that code into Spark SQL-compatible code, and then create it in Databricks as a thousand separate Spark SQL notebooks. It's not going to refactor; it's not going to change it into a metadata-driven framework; it's not necessarily going to modernize any of my code. It's just going to take my code, make it compatible, and get it onto the new platform. It's about getting onto the new platform by any means possible. So that might not be the end state: it might be that you do a two-phase migration, where first you lift and shift to get it onto Databricks so you can turn off the other system, and then you refactor into the proper end state, the way you want it to work eventually. But yeah, there's a massive use case for Lakebridge in just getting stuff onto the same platform, so you can get it inside Unity Catalog and then have a single control plane across everything.
- 00:22:04 There are a few steps to go through. It'll do a scan and make a report, telling you all the migration stats: this is what's going to work, this is what's not going to work. It will do the conversion for you, spitting code out from a whole bunch of different potential languages and sources into Databricks SQL code. And then it'll do a load of testing: source-to-target comparisons, validation, a "yes, this is successfully migrated." A really, really cool tool that's now available as this thing called Lakebridge.
- 00:22:33 Moving on, we've got a few GA announcements. Lakehouse Apps has gone generally available. Obviously, part of the whole story of where Databricks has been going over the past year is: yes, you build AI, but people need something to interact with it. You built an agent, great, but where do they go to type things? You need some kind of user interface, and that's essentially what Lakehouse Apps is. It's the thing that lets us use a bunch of common tooling and frameworks such as Streamlit and Gradio, those kinds of things, and now also JavaScript, so if you want to build a React web app front end, you can do that in Lakehouse Apps. And Lakehouse Apps itself, sorry, Databricks Apps, is now generally available, so you can go and do that. A big piece of the puzzle kind of fixed there.
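For a flavor of what those app front ends look like, here is a minimal Streamlit sketch of the kind you can deploy as a Databricks App; the response is a stub, where a real app would call your deployed agent or model-serving endpoint:

    import streamlit as st

    st.title("Ask the data agent")
    question = st.text_input("Your question")
    if question:
        # Stub response; swap in a call to your serving endpoint here
        st.write(f"(placeholder) You asked: {question}")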
- 00:23:20 Then there's another one on the data clean rooms side. Clean Rooms, if you're not familiar with it, is essentially a third-party data collaboration idea: I can have two Databricks workspaces that each Delta Share data into this no-man's-land in the middle, run some scripts on it in a very controlled way, and get some outputs. The important part is that neither collaborator can see the other person's data. We can take two massively sensitive sets of data from both sides, put them together, get the outputs, and view the aggregate-level data without the other person ever seeing the low-level transactional data. I don't have to give away my sensitive information in order to collaborate with people. That's Clean Rooms itself. That went GA back in January, I think, this year, but it only allowed two collaborators, and that wasn't the dream; that wasn't the story we were sold. So, as part of last week's announcements, you can have up to nine different collaborators from different clouds: we can have Azure, AWS, and GCP all collaborating in the same clean room with the same level of security and controls around it. Massively more powerful in terms of the kinds of things you can do.
- 00:24:28 And previously, you could never run the code if you wrote the code: I could submit my notebook, and the other collaborator would have to execute it; you were never allowed to self-run your own code. They've now added the ability to do that if it's approved by the other collaborator. So if the other person trusts you, they go, "yeah, you can run your own code, just go and do it", and you get a faster dev cycle, but only if that's the way you want to work. Otherwise, it works the normal way, where a person cannot both write and run the code. So yeah, some nice updates to Clean Rooms just to make it a little bit more accessible, more functional, more powerful.
- 00:25:03 Now, the next thing that's gone GA is one of my favorite things: Databricks Genie, or AI/BI Genie. That's the talk-with-your-data tool, the chatbot that writes SQL for you and interacts with your data, and it has gone generally available. You can go and use it in anger, use it in production, put it out in the world. Most people I know are already using it in production and would be surprised that it's only now gone GA. But there's a bunch of things in there, a load of new features that kind of snuck out; I didn't really see them in the announcements, so I've scraped this off some of their documentation.
- 00:25:35 Firstly, they've added data sampling. Say I was writing a filter statement, and, to use the example they've got on the website, I asked about a country and got Genie to generate the SQL for me: "tell me all the results from Florida." It would write it out as WHERE country = 'Florida'. But what it doesn't know, because Genie normally never sees your data, is that I've actually stored all that data as state codes. Florida's not a country; the state should be 'FL', not 'Florida'. It didn't know that. But I can now turn this thing on and say "go do some data sampling", and it will build a value dictionary, so it knows exactly what categories it has to choose from, and it will use that to make better, more intelligent decisions in the SQL it generates. So, data sampling. But be aware: you are going to be sending your data through to the large language model, and you need to accept that if you're going to use data sampling. That's new thing number one.
- 00:26:47can't really articulate why. So, this
- 00:26:50whole thing of saying send it for
- 00:26:51review, give it like a medium, I'm not
- 00:26:54really sure, give it an explanation, and
- 00:26:56there's a whole review workflow. So,
- 00:26:57it's just making the idea, the act of
- 00:27:00looking after a genie space just a
- 00:27:02little bit more reactive, a little bit
- 00:27:03more collaborative, a little bit of back
- 00:27:04and forth. Oh, actually, they flagged it
- 00:27:07for a review because they didn't quite
- 00:27:08understand what it was doing. It's
- 00:27:09actually doing the right thing. It's an
- 00:27:10education issue. Or they flagged it for
- 00:27:11review. No, they're right to that's
- 00:27:13wrong. Let's go and fix it. Just just a
- 00:27:15better feedback mechanism baked into
- 00:27:17Genie itself.
- 00:27:18 We've also got suggested queries, driven not by the users within Genie itself, but by how people are using those tables in Unity Catalog. Looking at popular queries and the most recent queries, it'll come back and go "hey, look at the smart ways people are actually using this data elsewhere; why don't you add these as the sample queries inside the Genie room?", just to give people a bit more inspiration, to tweak it a little, to nudge it in the right direction. So: automatically generated query suggestions inside Genie, coming soon. We're going to see that appearing, which is cool.
- 00:27:54 But it gets crazier. There's more stuff around clarification. We've had clarifications in the past: it comes back and asks a question, and if we clarify, that's just used within that chat. But now there's this whole idea of going further. I can say "yes, that was right", and it's going to look at that and go: "well, if that was right, why don't I actually remember that there's this measure, this filter, this whole way of working? Why don't I remember this next time? Do you want to curate this into a set of metrics, a set of dimensions, that we build up over time?" Again, that's coming into Genie, per some of the announcements and blogs they've put out: a better way to build up better semantic definitions inside Genie based on user interactions, and again it can have a whole approval workflow, with people able to reject it and curate it over time. Interesting stuff happening.
- 00:28:52 But not as crazy as this whole new deep research mode. Say I ask something complex: "how do we optimize our marketing funnel?" That's a complex question, and it's going to go "well, I'd break it down, I'd run all these different queries; I get data from here, I look at it, I analyze it, I derive data from there." It's basically performing a research function: it runs all those queries, compiles the results, interprets the results, and then builds them into basically a white paper, essentially "here are all the ways we could actually work." Now, this is very different to how Genie has worked previously. Genie has just been a SQL generator that never sees the data; it just runs the code and you see the results. This is it running the queries, interpreting the results, and actually writing recommendations, building that out for you as something you can then take away and use. So deep research mode is huge. Not out yet, but announced as something that is coming; go find it on the Genie pages. Yeah, a bunch of stuff going on in the world of Genie. We've got to crack on.
- 00:29:56 There are so many other things we need to talk about. One of the big announcements, and an interesting one, is Databricks Free Edition. Now, that was met with some confusion, because there has historically always been Databricks Community Edition. That was a very cut-down version of Databricks that was great if you were learning PySpark or Scala: you could essentially log on, have a single-box, single-core little Spark cluster, write some queries on it, and it was great if you were just learning how PySpark works. It wasn't great if you were learning Databricks, because it didn't have all the other features. You couldn't go and use DLT, now Declarative Pipelines; you couldn't actually play around with any of the AI stuff. It was purely a little Spark playground to help you learn Spark.
- 00:30:38 So last week, Databricks announced Free Edition, which is a completely free edition of fully featured Databricks. Now, it still only has a tiny sliver of compute; they're not giving away their entire product for free, but it is much more fully featured than Community Edition. So if you're learning, if you're a student, if you're trying to do some sandboxing in your spare time to get ahead and understand these things, you can use Free Edition to go and do that. Now, it's not meant for businesses. It's not meant for a data influencer running a boot camp who wants it to cover all their costs; that's not what it's for. This is for students, for personal use, to be able to teach yourself. Maybe you're learning and following some training: absolutely, as a student you can use this Free Edition to do your learning. If you're doing any of the Databricks certifications, you can go and use it for your learning too. Loads of good stuff in there. So that is Databricks Free Edition: subtly different from Community Edition because it's actually got all the features in there, which is cool. That's available now.
- 00:31:40 Now, the other thing that was announced addresses one of the biggest complaints there's always been about Databricks: "yeah, but we can't show it to the business users; we can't let our execs log into it." So even with AI/BI dashboards, even with Genie, with all the new features trying to serve that data consumer role and engage directly with the business, we've always had this complaint from our clients: "I need to pick it up and put it somewhere else, because I can't show that to my exec. It looks too technical, it's too intimidating, there's too much stuff in there, it's too busy on the screen. It feels intimidating and it feels not welcoming."
- 00:32:19 That's where this thing comes in. Brand spanking new is Databricks One, essentially a rounder, nicer, softer, more business-facing portal into Databricks. Think about this as your reporting portal: you've built all your code, you've built all your reports, and other users can log in through Databricks One and just see a nice, clean, shiny version, accessing the things they've been given access to. So I can go in and see things organized by domain, view my dashboards, do my cross-filtering and so on. They're still Lakeview dashboards, still dashboards I built inside fully fledged Databricks; this is just a way for my business users to come in and interact with these things. I've got the whole idea of working via my different domains, I can see different ways of organizing my data, and I can do my data discovery and actually find things inside Unity Catalog, using it as a discovery tool. Just far better ways of interacting with our users.
- 00:33:15 We can have Genie baked into it too. So if you're seeing all those new Genie features and thinking, "oh, that's cool, but my users would never want to use it", well, they can use it through Databricks One, and it's just a cleaner, much more streamlined experience. It's still Genie under the hood, it still works the same way, and their responses and review requests feed back into the normal Genie room that you administer through fully fledged Databricks. It's just a better way for business users to interact with it. Huge amounts of stuff going on there. We're expecting Databricks One to be absolutely massive in terms of uptake from our clients, just because it's a much more engaging, much more approachable way of using Databricks.
- 00:33:56 Right, a bunch of other things. Moving on: Unity Catalog itself has a load of new things going in. Now, I kind of mentioned this when we were talking about Genie: last year they mentioned this thing called Unity Catalog Metrics, and now we've got a load more information, we can actually see it out in the wild and go and have a play. Unity Catalog metric views is how it's eventually been released. Essentially, you define a metric view by saying: here are my measures, here are my dimensions, here are my filter statements. That makes it much more like a cube: I've got this measure, but I can cut and slice my data, and it will calculate the measure based on the filter context of the various different dimensions I'm using, much like any other semantic model in other tools. Those are going to be baked into Unity Catalog and consumed by Genie, so you can start to see it, and you saw Genie picking some of this up; then we can go and push that out. So loads of things happening in that space. (I forgot the slides. Yeah, Iceberg's in there; we talked about that already.)
- 00:34:55 This is what I want to talk about: the idea of Unity Catalog Metrics. There are already announcements about a load of different BI tools that are going to be able to consume it. So if you're in Sigma, Tableau, or ThoughtSpot, you'll be able to bring in data from Unity Catalog and it will have awareness of those metrics. Now, there's a big green F missing from this diagram: we don't see anything to do with Power BI or Fabric. So we're really curious where that is on the roadmap; I've not found out yet, but hopefully will soon. That's a missing piece currently, but most BI tools in the stack can use these metrics. Basically, rather than having all your metrics and logic defined downstream in your different tools, where if you're using ThoughtSpot and Tableau and Sigma you might have those metrics defined three times, once in each, you can pull that upstream into your lake, into your gold layer, store it right next to your data, and then whatever's consuming it downstream, you've just got one definition of your KPIs, which is what we want in life. So, really cool.
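As a hedged sketch of what consuming a metric view looks like from Spark SQL (the view and measure names here are invented; my understanding is that measures are evaluated through a MEASURE() wrapper under whatever dimension and filter context you query with):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The metric view, not this query, owns the KPI definition
    spark.sql("""
        SELECT region, MEASURE(total_revenue) AS total_revenue
        FROM main.gold.sales_metrics
        GROUP BY region
    """).show()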
- 00:35:55 Loads of stuff happening there. There's also the idea of domains: in Unity Catalog, being able to say these different objects are all part of this domain, helping us with the whole product-ownership idea, helping us with the slightly distributed, mesh-style approach people are going for. Lots of stuff in there which just helps. And there's this whole idea of a data quality monitoring tool that will go and actively monitor data quality across my different tables. We've seen that with Lakehouse Monitoring, but that was such a super-deep data profiling scan; it wasn't really the "I just want a little bit of information about the quality of everything" tool. This is absolutely that: just keep track of my data quality, look at the freshness, look at the completeness, nice chunky data quality metrics that can run on top of everything and give me a nice dashboard. That is coming; loads of stuff we've seen about that.
- 00:36:45of other stuff inside Unity. We've got
- 00:36:48 We've got the ability to certify a dataset and say, "That one's certified; that table is not certified; that table used to be the good one, it's now deprecated." We're adding a bit more information for our users so they can use it as a data discovery tool. It's moving Unity Catalog from being a technical catalog to an actual data discovery tool, a business-facing catalog through Databricks One, all that good stuff. There's also a request-for-access workflow: if I discover a table I don't have access to, I can request access, much like in any other governance tool. We're just seeing these things come in.
- 00:37:24 ABAC (attribute-based access control) is something we've heard Databricks talk about at previous summits. It's the thing where, if I tag a table as sensitive, I can have a control rule saying "this role has access to sensitive data; no one else does," and just by tagging that table, the security gets applied to it. We're seeing loads of use cases for ABAC: GDPR and sensitive data, sure, but even tagging tables to a domain, so you can put controls around domains. It gives you a dimension of control separate from "this catalog, this schema, these tables." You might have tables across all your different schemas that you want to give one role access to, and you don't want to do that manually on each table; you can do it via ABAC, which is currently in beta, so not yet GA.
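As an illustration of that tag-then-control idea (not the beta ABAC DDL itself, which I haven't confirmed), here's what ABAC automates, hand-rolled with existing Unity Catalog SQL; the schema, table, and group names are invented.

```python
# Tag the table once; SET TAGS is existing Unity Catalog syntax.
spark.sql("ALTER TABLE main.gold.customers SET TAGS ('sensitivity' = 'pii')")

# Find every table in the catalog carrying the sensitive tag...
tagged = spark.sql("""
    SELECT catalog_name, schema_name, table_name
    FROM main.information_schema.table_tags
    WHERE tag_name = 'sensitivity' AND tag_value = 'pii'
""").collect()

# ...and lock each one down to a single group, across schemas,
# which is exactly the per-table grind ABAC is meant to remove.
for t in tagged:
    fq = f"{t.catalog_name}.{t.schema_name}.{t.table_name}"
    spark.sql(f"REVOKE SELECT ON TABLE {fq} FROM `account users`")
    spark.sql(f"GRANT SELECT ON TABLE {fq} TO `pii_readers`")
```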
- 00:38:07 Then there are tag policies: you're not allowed to put a table in unless you've actually followed all the tagging policies, which is good. And data classification scanning: someone loads a table, and it spots "that column is an email address, that's PII data" or "that's a bank account, don't share that via Genie." Having that tagged automatically means you can do reactive governance: you can sit there with a control panel, have things flag up, and link the two together. It classifies something as sensitive because it's found a bank number, applies attribute-based access control, and locks that table down before it gets exposed and exfiltrated. All of that is on the catalog roadmap we've been talked through. Oh, we are nearly there, my friends.
- 00:38:46 Do not worry: the final section covers the changes in Mosaic AI, and there are probably eight chunky areas to briefly run through. There's a load going on. One I've talked about already is the whole Agent Bricks thing; that's huge, but big enough that I pulled it out and did a separate little run-through. So Agent Bricks is coming, currently only available in certain Databricks regions, and we'll see it slowly rolling out much like any other Databricks feature.
- 00:39:12 AI functions are the SQL functions you write inside Spark SQL in Databricks, things like ai_extract and ai_parse_document. They've done a load of work to make them much faster: something like three times faster and four times cheaper, running in bulk mode rather than being called incrementally. Just a load of improvements around them, and we'll see people adopt them more and do more with them.
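For instance, a bulk ai_extract call over a whole table looks something like this; the table, columns, and labels are made up, and it assumes a workspace where the AI functions are enabled.

```python
# Run entity extraction over every row in one set-based pass,
# rather than calling a model endpoint incrementally per row.
spark.sql("""
SELECT
  ticket_id,
  ai_extract(body, array('customer_name', 'order_number')) AS entities
FROM main.support.tickets
""").show()
```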
- 00:39:40 Vector search: just before the summit kicked off, we saw the announcement of storage-optimized vector search, which carves out a load of the index and puts it down into cheaper storage. That makes the size and scale so much bigger, but it also makes it much, much cheaper: something like a 7x cost reduction and a massive increase in the number of vectors you can hold in your vector database. So if you're building a RAG architecture and you've struggled with how many embeddings you can fit into that vector database, you've now suddenly got a load more room if you're using storage-optimized mode. Loads of stuff around there.
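Here's roughly what standing up a storage-optimized index looks like with the `databricks-vectorsearch` Python SDK; the `STORAGE_OPTIMIZED` endpoint type is my reading of the new mode, and all names are placeholders, so check the current docs.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Endpoint type for the new cheaper mode -- an assumption to verify.
client.create_endpoint(
    name="vs-storage-optimized",
    endpoint_type="STORAGE_OPTIMIZED",
)

# Delta Sync index that embeds a text column with a managed model.
client.create_delta_sync_index(
    endpoint_name="vs-storage-optimized",
    index_name="main.gold.docs_index",
    source_table_name="main.gold.docs",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```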
- 00:40:14 (My little picture-in-picture is hiding the title of this one: Serverless GPUs.) This has been a big deal if you're dealing with Spark clusters and trying to speed things up with a GPU because you're doing something that would genuinely benefit from one; lots of neural-network-style AI workloads need a GPU to go fast, and they've been really hard to get hold of. Now Databricks is rolling out serverless GPUs, so if you're on serverless thinking "I wish that had a GPU," you're going to be able to start using one. Again, it's only in certain regions so far, but we'll see it rolled out as they find more GPUs hidden down the back of the sofa somewhere.
- 00:40:51 Model serving has gotten much faster too: around 25,000 QPS (queries per second). The sheer scale of how many things we can serve at the same time has exploded. Loads and loads of stuff going on there.
- 00:41:04 AI Gateway is a massive thing for productionizing. So many people build a PoC and never get it into production. AI Gateway lets us do things like rate limiting on throughput, and failover in case we get a bounce-back from an endpoint: all of the good security and management around running an API, baked into Databricks. It lets us host several different models, route certain bits of traffic to different models, and manage that over time, all in AI Gateway. It's really, really important if you're building agents, large language model integrations, or traditional ML. Lots of stuff inside there.
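To give a feel for it, here's the approximate shape of attaching gateway rules to a serving endpoint over the REST API; the field names follow my reading of the docs, and the host, token, and endpoint name are placeholders.

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Rate limiting plus failover across the endpoint's served models;
# exact field names should be checked against the AI Gateway docs.
payload = {
    "rate_limits": [{"calls": 100, "key": "user", "renewal_period": "minute"}],
    "fallback_config": {"enabled": True},
    "usage_tracking_config": {"enabled": True},
}

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/my-agent-endpoint/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
```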
- 00:41:48 MLflow 3.0: there's a new release of MLflow. I feel sad that MLflow is so far down the list of things we're covering, because loads is going in there, lots of nice new features. One of the big ones called out is prompt versioning. If I was crafting the perfect prompt to put into my agentic workflow, and then I went back and tweaked it and changed it and changed it, I didn't really have a way of tracking over time what happened as that prompt changed. Automatic storage of the different versions of a prompt as you work through that flow is now just part of MLflow. Much like when they first brought in automatic versioning of notebooks inside an experiment, they're now getting more and more of the whole LLMOps story into MLflow. Lots of stuff happening there.
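A quick sketch of that prompt registry flow, following my reading of the MLflow 3 `mlflow.genai` API; the prompt name and template are invented.

```python
import mlflow

# Registering under the same name creates a new immutable version,
# so each tweak to the prompt is tracked like a model version.
mlflow.genai.register_prompt(
    name="support_agent_system_prompt",
    template="You are a support agent. Answer using only {{context}}.",
    commit_message="Tighten the grounding instruction",
)

# Later, pin and load a specific version from its registry URI.
prompt = mlflow.genai.load_prompt("prompts:/support_agent_system_prompt/2")
print(prompt.template)
```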
- 00:42:32 And finally, MCP support is coming in. MCP is that common language for getting agents to talk to each other, so I can really, really quickly add more and more integrations to other APIs, tools, and systems, using MCP as the common integration protocol. That's being rolled out and is now available within Databricks.
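As a flavor of what that looks like from the agent side, here's a minimal client using the reference `mcp` Python SDK; the server command is a placeholder.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Hypothetical MCP server; any MCP-speaking tool server works here.
    server = StdioServerParameters(command="my-mcp-server")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools the server exposes -- this is the
            # "common integration protocol" part.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)


asyncio.run(main())
```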
- 00:42:53 So, a bunch of stuff there. Of all those different ones, AI Gateway and MLflow 3.0 are GA; they're out and being used in anger. Everything else is in preview or shipping as incremental changes, and some of it is in beta. Do be aware that not everything is fully out in production currently. But yeah, huge amounts of stuff going on in this world right now.
- 00:43:16 Okay, we made it. That is my big old list of everything announced across six hours of keynotes, two different keynotes, at the summit last week. I'm sure you're with me going, "Wow, that's a lot of stuff." Lakebase alone, bringing OLTP with separation of storage and compute, is a massively huge step in what Databricks is trying to do. Lake Bridge is amazing for super quick migrations, although we then need to refactor and get rid of the technical debt. Agent Bricks is completely blowing open the idea of agents, meaning anyone could build an agent really easily. DLT has suddenly been open-sourced and renamed Declarative Pipelines. Genie has loads of extra features announced that we'll slowly see over the next few months, getting better and better and better. Loads and loads of things happening. Databricks One I'm expecting to be fairly huge, even though it's just a nicer UI. It's just a nicer UI for business people, but that is going to drive mass business adoption: rather than you doing a load of stuff in Databricks and someone else accessing it from somewhere else, it brings people onto that same platform so we're all working in the same place.
- 00:44:30 Huge amounts of stuff going on. Now, obviously, we have skimmed the surface. For some of it there are just marketing slides, and that's all we know about it. Some of it we've been tinkering and playing with, and we're now actually allowed to talk about it. So we've got follow-up videos planned for a whole bunch of these features. Bear with me: I'll be back making videos over the next few weeks, trying to catch up with all of this. And yeah, if there's anything you really want us to dig into, tell us; if I can, we will. Let us know down in the comments what you think and which announced feature is your favorite. And as always, don't forget to like and subscribe. Cheers.
- Lakebase
- Agent Bricks
- Spark 4.0
- Databricks Free Edition
- Unity Catalog
- Genie
- Lake Bridge
- Lakeflow
- AI Gateway
- MLflow 3.0