Data + AI Summit 2025 - Keynote Recap
Summary
TLDR: This video summarizes the key announcements from the Data and AI Summit held in San Francisco, including the launch of new products such as Lakebase, which introduces OLTP with separation of storage and compute, and Agent Bricks, which makes building AI agents easier. Spark 4.0 now ships with ANSI SQL mode as standard, along with a range of new features that have been open-sourced. A free edition of Databricks was also introduced, letting students and individuals learn and experiment with the tooling. In addition, there are updates to Unity Catalog and Genie that improve data governance and interaction. The video gives an in-depth look at all the changes and innovations happening across the Databricks ecosystem.
Takeaways
- 🚀 Lakebase introduces OLTP with separation of storage and compute.
- 🤖 Agent Bricks makes building AI agents easier.
- 📊 Spark 4.0 now ships with ANSI SQL mode as standard.
- 🎓 A free edition of Databricks for students and individuals.
- 📁 Unity Catalog improves data governance and discovery.
- 💬 Genie gains new features such as data sampling and suggested queries.
- 🔄 Lakebridge helps migrate workloads from other systems into Databricks.
- 🔧 Lakeflow is a new ETL framework.
- 🌐 AI Gateway manages APIs for AI models.
- 📈 MLflow 3.0 introduces version management for prompts.
Timeline
- 00:00:00 - 00:05:00
This video introduces the latest update from Advancing Spark, covering the key announcements from the Data and AI Summit in San Francisco. The presenter explains that he will condense almost 7 hours of announcements into the essential points you need to know.
- 00:05:00 - 00:10:00
The presenter walks through the new vision for the Databricks platform, including elements such as AI, SQL, and the data marketplace. Emphasis is placed on Unity Catalog, which now supports Iceberg, letting users choose between Delta and Iceberg for managing their data.
- 00:10:00 - 00:15:00
The biggest announcement is Lakebase, which marks Databricks entering the OLTP (Online Transaction Processing) market. It separates storage and compute, which differs from the traditional OLTP approach, and is built on Postgres.
- 00:15:00 - 00:20:00
Lakebase lets users spin up working databases quickly and efficiently, with the ability to handle millions of concurrent queries. This is a big step in the world of data computing.
- 00:20:00 - 00:25:00
Agent Bricks is introduced as a tool for building agent-based systems more easily. It lets users create agents that automate workflows without requiring deep technical knowledge.
- 00:25:00 - 00:30:00
Spark 4.0 is released with many new features, including improved SQL syntax and stricter error handling. This is a major change that users accustomed to Spark's old behavior need to watch out for.
- 00:30:00 - 00:35:00
Declarative Pipelines, previously known as Delta Live Tables, has been open-sourced and integrated into Lakeflow, a suite of tools for data ingestion and processing.
- 00:35:00 - 00:40:00
Lakehouse Apps is now generally available, letting users build user interfaces for interacting with AI. This is an important step toward making AI more accessible to everyday users.
- 00:40:00 - 00:45:11
Genie, the data-interaction tool that lets users converse with their data, is also now generally available, with a range of new features that improve the user experience.
Video Q&A
What is Lakebase?
Lakebase is a new Databricks product that introduces OLTP with separation of storage and compute.
What is Agent Bricks?
Agent Bricks is a tool for building AI agents in an easier, low-code way.
What's new in Spark 4.0?
Spark 4.0 now has ANSI SQL mode as standard and a range of new features that have been open-sourced.
What is the Databricks Free Edition?
The Databricks Free Edition is a fully featured version that students and individuals can use to learn and experiment.
What is Unity Catalog?
Unity Catalog is a data governance tool that lets users access and manage their data more effectively.
What's new in Genie?
Genie now has new features such as data sampling and query suggestions.
What is Lakebridge?
Lakebridge is a migration tool that helps move workloads from other systems into Databricks.
What is Lakeflow?
Lakeflow is a new ETL framework that brings together a range of data management tools.
What is AI Gateway?
AI Gateway is a tool for managing and optimizing APIs for AI models.
What is MLflow 3.0?
MLflow 3.0 is a new release that introduces version management for prompts in AI workflows.
- 00:00:02 Hello Spark fans. Welcome back to Advancing Spark, brought to you by Advancing Analytics, the only people who actually understand what's going on in the world of Databricks currently. Last week was the Data and AI Summit over in San Francisco, and me and the team were over there for so many hours of keynotes and announcements and things going on in the wacky world that is Databricks. So what I thought I would do is take almost 7 hours' worth of announcements and crush them right down, to just tell you what I think is important, what I think of it, and how to understand it. Now, that is still going to take us a little while, so strap yourselves in; this is not going to be a short video. I've got a lot of marketing, I've taken some screenshots of slides, there's some dodgy quality in the things I've snipped, but that's fine. If it's your first time around here, well, welcome. Don't forget to like and subscribe. And yeah, just buckle up: we'll talk about a ton of stuff that is happening in the world of data intelligence on Databricks.
- 00:01:05 So let's go and have a look. This is the big opening slide we saw them keep going back to. Every year we see a new vision of what the platform looks like, and it's a similar idea: a bit of AI, a bit of SQL. The data marketplace is a new thing, Apps is kind of a new thing in this circle, Lakeflow we'll talk about more in a minute, and there's AI/BI. Loads of stuff going on, all underpinned by Unity Catalog, with a special new entrant down there: Unity Catalog is underpinned not just by Delta but also by Iceberg. One of the big announcements this year was that Iceberg is now fully managed and available in Unity Catalog. You can just create a table and decide: ah, this one's going to be a Delta table, this one's going to be an Iceberg table. They both work together; no one really cares what's under the hood.
- 00:01:53 There's full parity of support for both of those two, so very, very cool in terms of what we're seeing there. But there's a ton of new things. Lakeflow itself is new, but we'll get on to that. So I want to take a while just talking through the new concepts: what I thought when I first heard about each one, and what I think about it now, which is different. Word of warning: there are a lot of products called either lake-something or something-bricks, so you're going to have to get used to remembering all these different names, because there are a lot.
- 00:02:25 Number one: Lakebase. Probably the most surprising announcement we actually had this year. Essentially, Databricks is entering the OLTP market. Now, if you're not from a database background, OLTP is online transaction processing, and you've also got the idea of OLAP, online analytical processing. OLAP looks after big chunky set-based queries: I'm looking to aggregate over millions, billions of rows and get an aggregate answer. OLTP is about lots of small concurrent singleton reads and writes. Very different types of technology; most of the time it's still a database, you still use SQL, you still work with it in the same way. But the tools that have grown and evolved in the big data ecosystem, the big crunchy parallel distributed compute aka Spark, have always been about analytics, about doing things at scale to many, many rows at the same time. So Databricks saying "hey, we now do OLTP" was always a bit of a weird one. When I first heard about it I was like: why? Who's that for? Is that just to tick a box, saying other tools have got a database in there, so we've now got a database? So I honestly held my hand up, and I will happily eat my words. What makes the difference? If they were just ticking a box and going "hey, we've got managed Postgres, you can now go and build a database, and it works like a database, looks like a database", that would not be exciting, and I'd struggle to see why they were trying to do anything different.
- 00:03:56 There's actually a lot of stuff behind what's going on. So firstly, this new Lakebase: yes, it is managed OLTP, it is Postgres working inside the thing, but it's Postgres with separation of storage and compute, and you don't get that with OLTP. For OLTP to be fast, to get millisecond latency, you have to have the data stored in a very highly indexed way so you can do that very fast record return that gives you the low latency OLTP needs. So an announcement saying "look, we've got a database, it's now part of Databricks, but we've separated storage and compute" is nuts.
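To make that OLTP/OLAP distinction concrete, here is a minimal illustrative sketch (the table and column names are made up); the two workloads speak the same SQL but have completely different shapes:

    # OLTP: many tiny concurrent point reads/writes at millisecond latency
    oltp_query = "SELECT balance FROM accounts WHERE account_id = 42"

    # OLAP: one big set-based scan and aggregation over millions of rows
    olap_query = """
        SELECT region, SUM(amount) AS total
        FROM transactions
        GROUP BY region
    """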
- 00:04:34 So there are a few things behind it, a few things they were saying "this is what we need to get this to actually work". One is open source: of course, it's Databricks, they're all about open-sourcing things, so it's built on Postgres, the biggest open-source database there is. That just makes sense. The separated storage and compute, yes, is nuts. Then there's getting huge QPS, queries per second, which is concurrency. Most of the time, if you're talking about analytics, you're going to have tens, maybe hundreds, of users running queries, nothing crazy. If you're talking about a website, about e-commerce, about an actual application at scale, you're talking about millions of queries per second. Concurrency is king in the world of any kind of OLTP system; there's just a lot going on.
- 00:05:23 Now, the other thing is AI. In this world of agents, in the direction we're actually going, we're expecting all this stuff to be so much more ephemeral than it used to be. Someone writes a query that spins off an agent; the agent creates a database that it uses to fulfill that request, and then it trashes it. That's a weird idea. We're used to provisioning compute, and maybe the compute scales up and down, but the idea of spinning up a whole separate database to fulfill the work and then trashing it isn't something we've ever done in the application world, because we've never really separated storage and compute in the application world. So yeah, it's just different. It's way more different than I actually thought it was. (Dropping my clicker.)
- 00:06:06 So why is it different? It makes a little more sense if you roll back to a couple of weeks before the summit, when we got the announcement that Databricks had acquired a company called Neon. Now, Neon are well known; they're a startup doing serverless Postgres that champions separation of storage and compute. That's what they do, so this shouldn't come as that much of a surprise. Databricks have obviously just acquired Neon, but they'd been working on Lakebase for a good year or so, and they were working with Neon along the way; they've said they'd been partnering very heavily, and now they've acquired them. So Lakebase isn't Neon, but it's heavily informed by it, and I'm sure we'll see them come closer together in the future. That just makes a little more sense of where they got the idea from, where it's come from, where the technical expertise has come from. So it's more novel than I thought when I first heard the announcement. It's not "oh hey, we're doing OLTP"; it's "oh hey, we're doing OLTP, but very, very differently."
- 00:07:05 Now, how many people will go "oh yeah, that's different enough, I'm going to take my existing e-commerce system and run it on top of Lakebase"? I don't know. I mean, there's a maturity piece out there: this has only just been announced, before people have actually used it in anger to run their entire business on. There'll be a little bit of maturing, a little bit of migration; there'll be an adoption period, right? It's very new, but the idea is novel enough that, actually, yeah, it's interesting.
- 00:07:31 So here's one of my comically clipped slides that's terribly low-res, but it shows the whole idea: there's the object store, actually in the lake; compute can just be ephemeral; I can spin up a new database that's looking at existing data and spin instances out, each with their own controls around them. It's the buffer store and concurrency management sitting in the middle that's essentially doing some of the magic enabling this to actually work at that level of scale. The actual technical detail of how that works, I don't know; stick around, I'm sure there are much deeper videos we're going to do about Lakebase in the future. But yeah, that's probably the biggest announcement, the biggest "oh, that's different."
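Since Lakebase is Postgres under the hood, a standard Postgres driver should be all you need to talk to it. A minimal sketch only; every connection detail below is a placeholder invented for illustration, not a real Lakebase endpoint:

    import psycopg2

    conn = psycopg2.connect(
        host="my-lakebase-instance.cloud.databricks.com",  # placeholder
        dbname="databricks_postgres",                      # placeholder
        user="me@example.com",                             # placeholder
        password="<oauth-token>",                          # placeholder
        sslmode="require",
    )
    with conn, conn.cursor() as cur:
        # Classic OLTP shape: a singleton point read
        cur.execute("SELECT status FROM orders WHERE order_id = %s", (42,))
        print(cur.fetchone())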
- 00:08:12 That was number one, the first nice gentle announcement just to ease us into things. Number two is a little crazier. We got this thing which I'm so happy they finally decided the name for: Agent Bricks. Yes, it's another combination of lake and bricks. Obviously the world has gone agent-based; the world is agentic these days. Two years ago everyone was mad for ChatGPT and the idea of LLMs, last year it was all about vector databases and RAG, and this year has heavily been defined by everything being an agent. I'm automating all of my workflows: build an agent to go and automate them. You're migrating something: you build an agent. Everything has gone agentic. But there's been a barrier to entry: you can't just pick up and use an agent, you have to go and understand AutoGen and all the other libraries that actually build these things up. It's quite a high technical barrier to entry. It's not crazy, but yeah, you've got to learn some stuff.
- 00:09:03 Now, Agent Bricks is going to tackle that. Agent Bricks is trying to say: how do we do kind-of-low-code for building out an agent? The idea is: how do we build an agentic system that answers a certain question? Essentially, you write a prompt that says "this is what good looks like". Then you've got the LLM judge, and that was a whole thing that came out around the question of how you evaluate a large language model. With machine learning, if I say I'm going to predict this figure, I can then find out what the figure actually was, compare the two, and say scientifically, mathematically, this is how accurate that model was; we can put an actual percentage figure on the accuracy of traditional machine learning. When a large language model brings back some text, I can go "that looks right" or "that doesn't look right; that's factually incorrect", but I can't put a number on it, I can't quantify it as much. So how do you do that at scale? How do you make it repeatable? The answer is: get a large language model to judge the output of another large language model, the whole LLMs-as-judges idea. And that's actually grown: there's technology around it, there are ratings, there are now frameworks built into MLflow that let you do it.
- 00:10:14 So the whole idea is that we can type some sentences to tell the judge what it's looking for, and then under the hood Agent Bricks goes and builds an agent that satisfies that, using it as acceptance criteria, and then it comes back with "right, I've got some options that meet it; which one do you want to go with?" Essentially, it's an agentic workflow to build agents. The robots are now building robots; this is where we've got to. That's essentially what Agent Bricks is. Now, to get this to work they've had to collapse it down a bit; it can't just do anything in the world you can think of. Essentially, there are some out-of-the-box, pre-canned solutions. So: information extraction, where here's a load of documents sat in a Databricks volume and I want to build an agent that will go and figure some of that stuff out; a quick RAG lookup to answer some questions; the Q&A-style things; and multi-agent supervisor, which I was not expecting so early. One of the gaps people have had with Databricks Genie, the whole chat-over-your-data thing, is that because you build them scoped around domains of your data, you can't really have a user come and just ask a question across anything. You need to build an agent that sits on top of a load of Genie APIs and decides which Genie room to ask a question to. Or: use a multi-agent supervisor, put that on top of Genie, and give it the ability to ask a question to a load of Genie rooms or a separate agent. That's just an out-of-the-box thing as part of Agent Bricks, which is crazy. So there's loads of really cool stuff sat underneath there in terms of how it works.
- 00:11:36 There are obviously lots of cool demo videos out there, but this is the kind of workflow: describing things, essentially writing a prompt so it understands how to actually do it; kicking that off; looking at the response; evaluating it and saying "that's not good enough, that's good enough"; letting it iterate over that; and then actually coming out the other side with a production agent we can go and work with.
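The LLM-as-judge loop described above is easy to sketch. This is a minimal, library-agnostic illustration of the pattern, not the Agent Bricks internals; call_llm is a made-up stub so the sketch runs as-is, where a real version would call a model-serving endpoint:

    JUDGE_PROMPT = """You are grading an agent's answer.
    Acceptance criteria: {criteria}
    Question: {question}
    Answer: {answer}
    Reply PASS or FAIL, then one sentence of reasoning."""

    def call_llm(prompt: str) -> str:
        # Stub standing in for a real model call; returns a canned verdict
        return "PASS - the answer satisfies the stated criteria."

    def judge(question: str, answer: str, criteria: str) -> bool:
        # One LLM grades another LLM's output against plain-English criteria
        verdict = call_llm(JUDGE_PROMPT.format(
            criteria=criteria, question=question, answer=answer))
        return verdict.strip().upper().startswith("PASS")

    # Iterate candidate agents/answers until one passes the judge
    print(judge("What drove Q3 churn?",
                "Churn rose 4%, driven by the March pricing change.",
                "Must name a specific driver and quantify the change"))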
- 00:11:59 So we're expecting huge amounts of uptake of Agent Bricks, just because everyone's trying to build agents right now, but not everyone can build an agent; unless it's one of these agents, in which case you can actually get a fairly decent production-grade agent out of it. It's better than things like AutoML. AutoML does a similar thing for traditional machine learning, where we say "I've got this problem, suggest a load of models", but you'd nearly always need to take that model and give it to a data scientist to actually turn into a production-grade thing. Some of these agents are production-grade when you build them, which is madness, but very cool.
- 00:12:35 All right, let's take a step back from all the AI and hype. We've got a release of Spark 4.0, which has been hotly anticipated, with loads of stuff going into it. It's an interesting one if you've been working inside Databricks for a long time, because it's essentially a load of Databricks features that are now open source as part of Spark. If we look at the big wall of what's gone into it, there are loads of things we've seen before: the SQL pipe syntax, the improvements to SQL UDFs, the VARIANT data type, and DataFrame.plot, so you can do some native plotting rather than having to call a separate library. Loads of things we've seen inside Databricks runtimes over the past 6 to 9 months have now been bundled up and put inside the main Spark 4.0 release. So it's a huge release, just open-sourcing all of this good stuff.
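As a taste of one of those features, here is a minimal sketch of the SQL pipe syntax called from PySpark (a toy example of my own; as I understand the Spark 4.0 pipe operators, each |> step transforms the result of the previous one):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.range(10).createOrReplaceTempView("numbers")

    # Pipe syntax reads top-to-bottom instead of inside-out
    spark.sql("""
        FROM numbers
        |> WHERE id % 2 = 0
        |> AGGREGATE COUNT(*) AS even_count
    """).show()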
- 00:13:21 Now, there's one other thing in there, and this is going to catch out so many people: as of Spark 4.0, ANSI SQL mode is standard. So if you're using a runtime that has Spark 4.0 baked into it, it is going to use ANSI SQL mode. Generally that's good: SQL behaves more like SQL in other database systems, which is generally what people want. However, Spark has always had this attitude of "I'm just going to run, and if something happens, I'll ignore it; I'll get to the end, I'm not going to fail." So if you had a divide-by-zero error, it would just return a null and carry on going. If you had a data type collision, it would just return a null and keep going. In ANSI standard mode, those now throw errors. So if you've got a lot of pipelines that were nulling out data, whether you weren't really aware of it or you were doing it deliberately and just letting it run through, those are now going to error. So be careful when you adopt Spark 4.0: there is a behavioral breaking change around how it executes SQL because of ANSI standard mode. But otherwise, there's loads and loads of good stuff in there.
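A quick sketch of that breaking change, toggling the ANSI flag explicitly so both behaviors are visible (in Spark 4.0 the flag simply defaults to true):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT 1 / 0 AS x").show()      # legacy mode: returns NULL

    spark.conf.set("spark.sql.ansi.enabled", "true")
    # ANSI mode: the same query now raises a DIVIDE_BY_ZERO error
    # spark.sql("SELECT 1 / 0 AS x").show()

    # try_divide opts back into null-on-error explicitly, per expression
    spark.sql("SELECT try_divide(1, 0) AS x").show()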
- 00:14:24 I would say check it out. But that's not the biggest thing that has happened in the world of open-source Spark. The biggest thing is Declarative Pipelines for Spark, previously known as Delta Live Tables. DLT, Delta Live Tables, is an ETL framework that's been baked into Databricks for ages. You can define some tables in PySpark or in SQL, and it will just wrap them in dependency management, restartability, full telemetry and logging, and data quality expectations: all the stuff you would expect a good data engineer to actually build into any of their pipelines, it just does automatically. So DLT is a fairly robust, now fairly mature processing framework, and they've just open-sourced it. In open-sourcing it, they have changed the name: you will see loads of people talking about Declarative Pipelines for Spark, and that is the new name for Delta Live Tables. You will not see DLT referred to in any of the docs or any of the libraries anymore. It is now Declarative Pipelines.
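For anyone who hasn't used it, the developer experience looks roughly like this. A minimal sketch in the DLT decorator style as it exists inside Databricks today (import dlt; the naming may differ in the open-source release, and the source path here is a placeholder):

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders landed from cloud storage")
    def bronze_orders():
        # `spark` is provided by the pipeline runtime; path is a placeholder
        return spark.read.format("json").load("/Volumes/raw/orders")

    @dlt.table(comment="Cleaned orders")
    @dlt.expect_or_drop("valid_amount", "amount >= 0")  # data quality expectation
    def silver_orders():
        # Reading bronze_orders is what wires up the dependency graph
        return dlt.read("bronze_orders").withColumn(
            "ingested_at", F.current_timestamp())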
- 00:15:37 Now, there is a slight difference: you've got the open-source version of Declarative Pipelines, but there's also the question of how it's implemented inside Databricks. The answer is this new product group called Lakeflow. They announced Lakeflow last year and said "this is what's coming, we're going to build all these different pieces", and essentially DLT as implemented inside Databricks is part of Lakeflow. Other things like Lakeflow Connect are part of Lakeflow too. Essentially, how you go about doing ETL inside Databricks is now under this umbrella of tools known as Lakeflow. So, two things: Declarative Pipelines is now open source in Spark, and all the other components inside Databricks have been rebranded and relaunched with a load of new stuff as Databricks Lakeflow, which is now generally available. There are three different aspects to it, most of which we've kind of seen before.
- 00:16:27 So: Lakeflow Connect, part of which went GA earlier this year, is the low-code ability to go and get data from somewhere: get data from Salesforce, from Workday, from a bunch of fairly common data sources, click some buttons, and just have it automatically start doing incremental CDC of that data into my lake. Great. Declarative Pipelines sits in the middle: that's Delta Live Tables, but with a few updates, more stuff going into it, and some other cool stuff on top we'll look at in a minute. And then Lakeflow Jobs is essentially a rebranded, revamped version of Databricks Workflows. So this suite of tools, data acquisition, data transformation, and then data orchestration, sits under that umbrella of Lakeflow.
- 00:17:13 If you've not seen Lakeflow Connect, it's this kind of thing: a nice, clean, easy GUI to go off and get some data; this is the source, pull it in. Now, they've added a load more compatible sources as part of last week's release and announcements, but there still need to be more. So we're going to see this running incremental addition of "we've added this data source, this data source, this data source", and gradually you'll be able to do 100% of the sources you need from within Lakeflow. Not yet: there's a bunch of sources it can currently do out of the box, some of which are GA, some of which are still in preview. So yeah, lots of stuff going on there.
- 00:17:51 There are also changes to how DLT, now known as Declarative Pipelines, actually works. You've got a multi-file editor view. If you've been using DLT for a long time, it used to be pretty awful, in that you'd write a notebook, then have to go to an entirely different screen to test and run it, then go back to your notebook; the dev experience was a bit mad. Now we've got an opinionated project structure saying "actually, this is how you should structure your Python and SQL code into your layers", which is a fairly decent way of building out a declarative job. You've got your editor in there with part of your result set, and the DLT-style execution graph all baked in. So basically the IDE for Declarative Pipelines has changed in line with these announcements, and there's just more you can do in there. Lots of nice stuff; essentially it's going to be nicer to work in that environment.
- 00:18:43 The big announcement around this area is this thing, Lakeflow Designer. This is something they demoed in the keynote, and everyone came out kind of gobsmacked, open-mouthed, going "oh, that's going to change things." On the face of it, this is just a low-code editor for building out declarative pipelines: I can drag and drop and say get some data from that source, transform it, write it over there. That's fairly straightforward. But they've gone a step further. The thing they demoed at the keynote last week essentially says: what if we did that, but slapped a natural-language editor on top of it? What if we had something where we just tell it what to do and it goes and builds it for us? That's terrifying, absolutely terrifying, in terms of how it changes the workflow of what we do. We are in the era of vibe engineering, my friends: you're going to be able to build out pipelines component by component just by saying what you want it to do, and it will go and build it out.
- 00:19:46 Now, I've had interesting conversations about the "are we going to use this forever?" question. Well, probably not: it would take me longer to ask that question of each and every single table I'm going to load than it would to define a generic workflow and then do that a thousand times. But this really, really lowers the technical barrier, so almost anyone can go and build these things. I'm just slightly terrified that in a year or two we're going to have a mad spaghetti mess of pipelines, each built slightly differently depending on the syntax and context of how the question was asked. So how we use it is going to be interesting, but the fact that we can use it, with so much power and so much use behind it, is pretty crazy and cool. So yeah: Lakeflow Designer, low-code and agentic pipeline building on top of Declarative Pipelines; it still builds everything using Declarative Pipelines for Spark.
- 00:20:42 All right, next up we've got Lakebridge, previously known as BladeBridge, a company that Databricks acquired, which is an AI-fueled migration tool. This is for when I've got, say, an old SQL Server, or Oracle, or Synapse, and I'm trying to take workloads out of that and put them into Databricks. It is a lift-and-shift migrator. Now, I need to be really specific there: it's a lift-and-shift migrator. So if I had a thousand stored procs inside Snowflake and I wanted to get them into Databricks, it would pick them up, evaluate them, translate all of that code into Spark SQL-compatible code, and then create it in Databricks as a thousand separate Spark SQL notebooks. It's not going to refactor; it's not going to change it into a metadata-driven framework; it's not necessarily going to modernize any of my code. It's just going to take my code, make it compatible, and get it onto the new platform. It's about getting onto the new platform by any means possible. So that might not be the end state: it might be that you do a two-phase migration, where first you lift and shift to get it onto Databricks so you can turn off the other system, and then you refactor into the proper end state, the way you want it to work eventually. But yeah, there's a massive use case for Lakebridge in just getting stuff onto the same platform, so you can get it inside Unity Catalog and then have a single control plane across everything.
- 00:22:04 There are a few steps to go through. It'll do a scan and make a report, telling you all the migration stats: this is what's going to work, this is what's not going to work. It will do the conversion for you, spitting code out from a whole bunch of different potential languages and sources into Databricks SQL code. And then it'll do a load of testing: source-to-target comparisons, validation, a "yes, this is successfully migrated." A really, really cool tool that's now available as this thing called Lakebridge.
- 00:22:33 Moving on, we've got a few GA announcements. Lakehouse Apps has gone generally available. Obviously, part of the whole story of where Databricks has been going over the past year is: yes, you build AI, but people need something to interact with it. You built an agent, great, but where do they go to type things? You need some kind of user interface, and that's essentially what Lakehouse Apps is. It's the thing that lets us use a bunch of common tooling and frameworks such as Streamlit and Gradio, those kinds of things, and now also JavaScript, so if you want to build a React web app front end, you can do that in Lakehouse Apps. And Lakehouse Apps itself, sorry, Databricks Apps, is now generally available, so you can go and do that. A big piece of the puzzle kind of fixed there.
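For a flavor of what those app front ends look like, here is a minimal Streamlit sketch of the kind you can deploy as a Databricks App; the response is a stub, where a real app would call your deployed agent or model-serving endpoint:

    import streamlit as st

    st.title("Ask the data agent")
    question = st.text_input("Your question")
    if question:
        # Stub response; swap in a call to your serving endpoint here
        st.write(f"(placeholder) You asked: {question}")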
- 00:23:20 Then there's another one on the data clean rooms side. Clean Rooms, if you're not familiar with it, is essentially a third-party data collaboration idea: I can have two Databricks workspaces that each Delta Share data into this no-man's-land in the middle, run some scripts on it in a very controlled way, and get some outputs. The important part is that neither collaborator can see the other person's data. We can take two massively sensitive sets of data from both sides, put them together, get the outputs, and view the aggregate-level data without the other person ever seeing the low-level transactional data. I don't have to give away my sensitive information in order to collaborate with people. That's Clean Rooms itself. That went GA back in January, I think, this year, but it only allowed two collaborators, and that wasn't the dream; that wasn't the story we were sold. So, as part of last week's announcements, you can have up to nine different collaborators from different clouds: we can have Azure, AWS, and GCP all collaborating in the same clean room with the same level of security and controls around it. Massively more powerful in terms of the kinds of things you can do.
- 00:24:28 And previously, you could never run the code if you wrote the code: I could submit my notebook, and the other collaborator would have to execute it; you were never allowed to self-run your own code. They've now added the ability to do that if it's approved by the other collaborator. So if the other person trusts you, they go, "yeah, you can run your own code, just go and do it", and you get a faster dev cycle, but only if that's the way you want to work. Otherwise, it works the normal way, where a person cannot both write and run the code. So yeah, some nice updates to Clean Rooms just to make it a little bit more accessible, more functional, more powerful.
- 00:25:03 Now, the next thing that's gone GA is one of my favorite things: Databricks Genie, or AI/BI Genie. That's the talk-with-your-data tool, the chatbot that writes SQL for you and interacts with your data, and it has gone generally available. You can go and use it in anger, use it in production, put it out in the world. Most people I know are already using it in production and would be surprised that it's only now gone GA. But there's a bunch of things in there, a load of new features that kind of snuck out; I didn't really see them in the announcements, so I've scraped this off some of their documentation.
- 00:25:35 Firstly, they've added data sampling. Say I was writing a filter statement, and, to use the example they've got on the website, I asked about a country and got Genie to generate the SQL for me: "tell me all the results from Florida." It would write it out as WHERE country = 'Florida'. But what it doesn't know, because Genie normally never sees your data, is that I've actually stored all that data as state codes. Florida's not a country; the state should be 'FL', not 'Florida'. It didn't know that. But I can now turn this thing on and say "go do some data sampling", and it will build a value dictionary, so it knows exactly what categories it has to choose from, and it will use that to make better, more intelligent decisions in the SQL it generates. So, data sampling. But be aware: you are going to be sending your data through to the large language model, and you need to accept that if you're going to use data sampling. That's new thing number one.
- 00:26:47can't really articulate why. So, this
- 00:26:50whole thing of saying send it for
- 00:26:51review, give it like a medium, I'm not
- 00:26:54really sure, give it an explanation, and
- 00:26:56there's a whole review workflow. So,
- 00:26:57it's just making the idea, the act of
- 00:27:00looking after a genie space just a
- 00:27:02little bit more reactive, a little bit
- 00:27:03more collaborative, a little bit of back
- 00:27:04and forth. Oh, actually, they flagged it
- 00:27:07for a review because they didn't quite
- 00:27:08understand what it was doing. It's
- 00:27:09actually doing the right thing. It's an
- 00:27:10education issue. Or they flagged it for
- 00:27:11review. No, they're right to that's
- 00:27:13wrong. Let's go and fix it. Just just a
- 00:27:15better feedback mechanism baked into
- 00:27:17Genie itself.
- 00:27:18 We've also got suggested queries, driven not by the users within Genie itself, but by how people are using those tables in Unity Catalog. Looking at popular queries and the most recent queries, it'll come back and go "hey, look at the smart ways people are actually using this data elsewhere; why don't you add these as the sample queries inside the Genie room?", just to give people a bit more inspiration, to tweak it a little, to nudge it in the right direction. So: automatically generated query suggestions inside Genie, coming soon. We're going to see that appearing, which is cool.
- 00:27:54 But it gets crazier. There's more stuff around clarification. We've had clarifications in the past: it comes back and asks a question, and if we clarify, that's just used within that chat. But now there's this whole idea of going further. I can say "yes, that was right", and it's going to look at that and go: "well, if that was right, why don't I actually remember that there's this measure, this filter, this whole way of working? Why don't I remember this next time? Do you want to curate this into a set of metrics, a set of dimensions, that we build up over time?" Again, that's coming into Genie, per some of the announcements and blogs they've put out: a better way to build up better semantic definitions inside Genie based on user interactions, and again it can have a whole approval workflow, with people able to reject it and curate it over time. Interesting stuff happening.
- 00:28:52 But not as crazy as this whole new deep research mode. Say I ask something complex: "how do we optimize our marketing funnel?" That's a complex question, and it's going to go "well, I'd break it down, I'd run all these different queries; I get data from here, I look at it, I analyze it, I derive data from there." It's basically performing a research function: it runs all those queries, compiles the results, interprets the results, and then builds them into basically a white paper, essentially "here are all the ways we could actually work." Now, this is very different to how Genie has worked previously. Genie has just been a SQL generator that never sees the data; it just runs the code and you see the results. This is it running the queries, interpreting the results, and actually writing recommendations, building that out for you as something you can then take away and use. So deep research mode is huge. Not out yet, but announced as something that is coming; go find it on the Genie pages. Yeah, a bunch of stuff going on in the world of Genie. We've got to crack on.
- 00:29:56 There are so many other things we need to talk about. One of the big announcements, and an interesting one, is Databricks Free Edition. Now, that was met with some confusion, because there has historically always been Databricks Community Edition. That was a very cut-down version of Databricks that was great if you were learning PySpark or Scala: you could essentially log on, have a single-box, single-core little Spark cluster, write some queries on it, and it was great if you were just learning how PySpark works. It wasn't great if you were learning Databricks, because it didn't have all the other features. You couldn't go and use DLT, now Declarative Pipelines; you couldn't actually play around with any of the AI stuff. It was purely a little Spark playground to help you learn Spark.
- 00:30:38 So last week, Databricks announced Free Edition, which is a completely free edition of fully featured Databricks. Now, it still only has a tiny sliver of compute; they're not giving away their entire product for free, but it is much more fully featured than Community Edition. So if you're learning, if you're a student, if you're trying to do some sandboxing in your spare time to get ahead and understand these things, you can use Free Edition to go and do that. Now, it's not meant for businesses. It's not meant for a data influencer running a boot camp who wants it to cover all their costs; that's not what it's for. This is for students, for personal use, to be able to teach yourself. Maybe you're learning and following some training: absolutely, as a student you can use this Free Edition to do your learning. If you're doing any of the Databricks certifications, you can go and use it for your learning too. Loads of good stuff in there. So that is Databricks Free Edition: subtly different from Community Edition because it's actually got all the features in there, which is cool. That's available now.
- 00:31:40 Now, the other thing that was announced addresses one of the biggest complaints there's always been about Databricks: "yeah, but we can't show it to the business users; we can't let our execs log into it." So even with AI/BI dashboards, even with Genie, with all the new features trying to serve that data consumer role and engage directly with the business, we've always had this complaint from our clients: "I need to pick it up and put it somewhere else, because I can't show that to my exec. It looks too technical, it's too intimidating, there's too much stuff in there, it's too busy on the screen. It feels intimidating and it feels not welcoming."
- 00:32:19 That's where this thing comes in. Brand spanking new is Databricks One, essentially a rounder, nicer, softer, more business-facing portal into Databricks. Think about this as your reporting portal: you've built all your code, you've built all your reports, and other users can log in through Databricks One and just see a nice, clean, shiny version, accessing the things they've been given access to. So I can go in and see things organized by domain, view my dashboards, do my cross-filtering and so on. They're still Lakeview dashboards, still dashboards I built inside fully fledged Databricks; this is just a way for my business users to come in and interact with these things. I've got the whole idea of working via my different domains, I can see different ways of organizing my data, and I can do my data discovery and actually find things inside Unity Catalog, using it as a discovery tool. Just far better ways of interacting with our users.
- 00:33:15 We can have Genie baked into it too. So if you're seeing all those new Genie features and thinking, "oh, that's cool, but my users would never want to use it", well, they can use it through Databricks One, and it's just a cleaner, much more streamlined experience. It's still Genie under the hood, it still works the same way, and their responses and review requests feed back into the normal Genie room that you administer through fully fledged Databricks. It's just a better way for business users to interact with it. Huge amounts of stuff going on there. We're expecting Databricks One to be absolutely massive in terms of uptake from our clients, just because it's a much more engaging, much more approachable way of using Databricks.
- 00:33:56 Right, a bunch of other things. Moving on: Unity Catalog itself has a load of new things going in. Now, I kind of mentioned this when we were talking about Genie: last year they mentioned this thing called Unity Catalog Metrics, and now we've got a load more information, we can actually see it out in the wild and go and have a play. Unity Catalog metric views is how it's eventually been released. Essentially, you define a metric view by saying: here are my measures, here are my dimensions, here are my filter statements. That makes it much more like a cube: I've got this measure, but I can cut and slice my data, and it will calculate the measure based on the filter context of the various different dimensions I'm using, much like any other semantic model in other tools. Those are going to be baked into Unity Catalog and consumed by Genie, so you can start to see it, and you saw Genie picking some of this up; then we can go and push that out. So loads of things happening in that space. (I forgot the slides. Yeah, Iceberg's in there; we talked about that already.)
- 00:34:55 This is what I want to talk about: the idea of Unity Catalog Metrics. There are already announcements about a load of different BI tools that are going to be able to consume it. So if you're in Sigma, Tableau, or ThoughtSpot, you'll be able to bring in data from Unity Catalog and it will have awareness of those metrics. Now, there's a big green F missing from this diagram: we don't see anything to do with Power BI or Fabric. So we're really curious where that is on the roadmap; I've not found out yet, but hopefully will soon. That's a missing piece currently, but most BI tools in the stack can use these metrics. Basically, rather than having all your metrics and logic defined downstream in your different tools, where if you're using ThoughtSpot and Tableau and Sigma you might have those metrics defined three times, once in each, you can pull that upstream into your lake, into your gold layer, store it right next to your data, and then whatever's consuming it downstream, you've just got one definition of your KPIs, which is what we want in life. So, really cool.
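As a hedged sketch of what consuming a metric view looks like from Spark SQL (the view and measure names here are invented; my understanding is that measures are evaluated through a MEASURE() wrapper under whatever dimension and filter context you query with):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The metric view, not this query, owns the KPI definition
    spark.sql("""
        SELECT region, MEASURE(total_revenue) AS total_revenue
        FROM main.gold.sales_metrics
        GROUP BY region
    """).show()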
- 00:35:55 Loads of stuff happening there. There's also the idea of domains: in Unity Catalog, being able to say these different objects are all part of this domain, helping us with the whole product-ownership idea, helping us with the slightly distributed, mesh-style approach people are going for. Lots of stuff in there which just helps. And there's this whole idea of a data quality monitoring tool that will go and actively monitor data quality across my different tables. We've seen that with Lakehouse Monitoring, but that was such a super-deep data profiling scan; it wasn't really the "I just want a little bit of information about the quality of everything" tool. This is absolutely that: just keep track of my data quality, look at the freshness, look at the completeness, nice chunky data quality metrics that can run on top of everything and give me a nice dashboard. That is coming; loads of stuff we've seen about that.
- 00:36:45of other stuff inside Unity. We've got
- 00:36:48 We've got the ability to certify a dataset and say, "That one's certified; that table is not certified; that table used to be the good one, it's now deprecated." We're adding a bit more information for our users so they can use it as a data discovery tool. It's moving Unity Catalog from being a technical catalog to an actual data discovery tool, a business-facing catalog through Databricks One, all that good stuff. There's also a request-for-access workflow: if I discover a table I don't have access to, I can request access, much like in any other governance tool. We're just seeing these things come in.
- 00:37:24 ABAC (attribute-based access control) is something we've heard Databricks talk about at previous summits. It's the thing where, if I tag a table as sensitive, I can have a control rule saying "this role has access to sensitive data; no one else does," and just by tagging that table, the security gets applied to it. We're seeing loads of use cases for ABAC: GDPR and sensitive data, sure, but even tagging tables to a domain, so you can put controls around domains. It gives you a dimension of control separate from "this catalog, this schema, these tables." You might have tables across all your different schemas that you want to give one role access to, and you don't want to do that manually on each table; you can do it via ABAC, which is currently in beta, so not yet GA.
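As an illustration of that tag-then-control idea (not the beta ABAC DDL itself, which I haven't confirmed), here's what ABAC automates, hand-rolled with existing Unity Catalog SQL; the schema, table, and group names are invented.

```python
# Tag the table once; SET TAGS is existing Unity Catalog syntax.
spark.sql("ALTER TABLE main.gold.customers SET TAGS ('sensitivity' = 'pii')")

# Find every table in the catalog carrying the sensitive tag...
tagged = spark.sql("""
    SELECT catalog_name, schema_name, table_name
    FROM main.information_schema.table_tags
    WHERE tag_name = 'sensitivity' AND tag_value = 'pii'
""").collect()

# ...and lock each one down to a single group, across schemas,
# which is exactly the per-table grind ABAC is meant to remove.
for t in tagged:
    fq = f"{t.catalog_name}.{t.schema_name}.{t.table_name}"
    spark.sql(f"REVOKE SELECT ON TABLE {fq} FROM `account users`")
    spark.sql(f"GRANT SELECT ON TABLE {fq} TO `pii_readers`")
```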
- 00:38:07 Then there are tag policies: you're not allowed to put a table in unless you've actually followed all the tagging policies, which is good. And data classification scanning: someone loads a table, and it spots "that column is an email address, that's PII data" or "that's a bank account, don't share that via Genie." Having that tagged automatically means you can do reactive governance: you can sit there with a control panel, have things flag up, and link the two together. It classifies something as sensitive because it's found a bank number, applies attribute-based access control, and locks that table down before it gets exposed and exfiltrated. All of that is on the catalog roadmap we've been talked through. Oh, we are nearly there, my friends.
- 00:38:46 Do not worry: the final section covers the changes in Mosaic AI, and there are probably eight chunky areas to briefly run through. There's a load going on. One I've talked about already is the whole Agent Bricks thing; that's huge, but big enough that I pulled it out and did a separate little run-through. So Agent Bricks is coming, currently only available in certain Databricks regions, and we'll see it slowly rolling out much like any other Databricks feature.
- 00:39:12 AI functions are the SQL functions you write inside Spark SQL in Databricks, things like ai_extract and ai_parse_document. They've done a load of work to make them much faster: something like three times faster and four times cheaper, running in bulk mode rather than being called incrementally. Just a load of improvements around them, and we'll see people adopt them more and do more with them.
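For instance, a bulk ai_extract call over a whole table looks something like this; the table, columns, and labels are made up, and it assumes a workspace where the AI functions are enabled.

```python
# Run entity extraction over every row in one set-based pass,
# rather than calling a model endpoint incrementally per row.
spark.sql("""
SELECT
  ticket_id,
  ai_extract(body, array('customer_name', 'order_number')) AS entities
FROM main.support.tickets
""").show()
```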
- 00:39:40 Vector search: just before the summit kicked off, we saw the announcement of storage-optimized vector search, which carves out a load of the index and puts it down into cheaper storage. That makes the size and scale so much bigger, but it also makes it much, much cheaper: something like a 7x cost reduction and a massive increase in the number of vectors you can hold in your vector database. So if you're building a RAG architecture and you've struggled with how many embeddings you can fit into that vector database, you've now suddenly got a load more room if you're using storage-optimized mode. Loads of stuff around there.
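Here's roughly what standing up a storage-optimized index looks like with the `databricks-vectorsearch` Python SDK; the `STORAGE_OPTIMIZED` endpoint type is my reading of the new mode, and all names are placeholders, so check the current docs.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Endpoint type for the new cheaper mode -- an assumption to verify.
client.create_endpoint(
    name="vs-storage-optimized",
    endpoint_type="STORAGE_OPTIMIZED",
)

# Delta Sync index that embeds a text column with a managed model.
client.create_delta_sync_index(
    endpoint_name="vs-storage-optimized",
    index_name="main.gold.docs_index",
    source_table_name="main.gold.docs",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```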
- 00:40:14 (My little picture-in-picture is hiding the title of this one: Serverless GPUs.) This has been a big deal if you're dealing with Spark clusters and trying to speed things up with a GPU because you're doing something that would genuinely benefit from one; lots of neural-network-style AI workloads need a GPU to go fast, and they've been really hard to get hold of. Now Databricks is rolling out serverless GPUs, so if you're on serverless thinking "I wish that had a GPU," you're going to be able to start using one. Again, it's only in certain regions so far, but we'll see it rolled out as they find more GPUs hidden down the back of the sofa somewhere.
- 00:40:51 Model serving has gotten much faster too: around 25,000 QPS (queries per second). The sheer scale of how many things we can serve at the same time has exploded. Loads and loads of stuff going on there.
- 00:41:04 AI Gateway is a massive thing for productionizing. So many people build a PoC and never get it into production. AI Gateway lets us do things like rate limiting on throughput, and failover in case we get a bounce-back from an endpoint: all of the good security and management around running an API, baked into Databricks. It lets us host several different models, route certain bits of traffic to different models, and manage that over time, all in AI Gateway. It's really, really important if you're building agents, large language model integrations, or traditional ML. Lots of stuff inside there.
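To give a feel for it, here's the approximate shape of attaching gateway rules to a serving endpoint over the REST API; the field names follow my reading of the docs, and the host, token, and endpoint name are placeholders.

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Rate limiting plus failover across the endpoint's served models;
# exact field names should be checked against the AI Gateway docs.
payload = {
    "rate_limits": [{"calls": 100, "key": "user", "renewal_period": "minute"}],
    "fallback_config": {"enabled": True},
    "usage_tracking_config": {"enabled": True},
}

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/my-agent-endpoint/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
```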
- 00:41:48 MLflow 3.0: there's a new release of MLflow. I feel sad that MLflow is so far down the list of things we're covering, because loads is going in there, lots of nice new features. One of the big ones called out is prompt versioning. If I was crafting the perfect prompt to put into my agentic workflow, and then I went back and tweaked it and changed it and changed it, I didn't really have a way of tracking over time what happened as that prompt changed. Automatic storage of the different versions of a prompt as you work through that flow is now just part of MLflow. Much like when they first brought in automatic versioning of notebooks inside an experiment, they're now getting more and more of the whole LLMOps story into MLflow. Lots of stuff happening there.
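A quick sketch of that prompt registry flow, following my reading of the MLflow 3 `mlflow.genai` API; the prompt name and template are invented.

```python
import mlflow

# Registering under the same name creates a new immutable version,
# so each tweak to the prompt is tracked like a model version.
mlflow.genai.register_prompt(
    name="support_agent_system_prompt",
    template="You are a support agent. Answer using only {{context}}.",
    commit_message="Tighten the grounding instruction",
)

# Later, pin and load a specific version from its registry URI.
prompt = mlflow.genai.load_prompt("prompts:/support_agent_system_prompt/2")
print(prompt.template)
```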
- 00:42:32 And finally, MCP support is coming in. MCP is that common language for getting agents to talk to each other, so I can really, really quickly add more and more integrations to other APIs, tools, and systems, using MCP as the common integration protocol. That's being rolled out and is now available within Databricks.
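As a flavor of what that looks like from the agent side, here's a minimal client using the reference `mcp` Python SDK; the server command is a placeholder.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Hypothetical MCP server; any MCP-speaking tool server works here.
    server = StdioServerParameters(command="my-mcp-server")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools the server exposes -- this is the
            # "common integration protocol" part.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)


asyncio.run(main())
```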
- 00:42:53 So, a bunch of stuff there. Of all those different ones, AI Gateway and MLflow 3.0 are GA; they're out and being used in anger. Everything else is in preview or shipping as incremental changes, and some of it is in beta. Do be aware that not everything is fully out in production currently. But yeah, huge amounts of stuff going on in this world right now.
- 00:43:16 Okay, we made it. That is my big old list of everything announced across six hours of keynotes, two different keynotes, at the summit last week. I'm sure you're with me going, "Wow, that's a lot of stuff." Lakebase alone, bringing OLTP with separation of storage and compute, is a massively huge step in what Databricks is trying to do. Lake Bridge is amazing for super quick migrations, although we then need to refactor and get rid of the technical debt. Agent Bricks is completely blowing open the idea of agents, meaning anyone could build an agent really easily. DLT has suddenly been open-sourced and renamed Declarative Pipelines. Genie has loads of extra features announced that we'll slowly see over the next few months, getting better and better and better. Loads and loads of things happening. Databricks One I'm expecting to be fairly huge, even though it's just a nicer UI. It's just a nicer UI for business people, but that is going to drive mass business adoption: rather than you doing a load of stuff in Databricks and someone else accessing it from somewhere else, it brings people onto that same platform so we're all working in the same place.
- 00:44:30 Huge amounts of stuff going on. Now, obviously, we have skimmed the surface. For some of it there are just marketing slides, and that's all we know about it. Some of it we've been tinkering and playing with, and we're now actually allowed to talk about it. So we've got follow-up videos planned for a whole bunch of these features. Bear with me: I'll be back making videos over the next few weeks, trying to catch up with all of this. And yeah, if there's anything you really want us to dig into, tell us; if I can, we will. Let us know down in the comments what you think and which announced feature is your favorite. And as always, don't forget to like and subscribe. Cheers.
- Lakebase
- Agent Bricks
- Spark 4.0
- Databricks Free Edition
- Unity Catalog
- Genie
- Lake Bridge
- Lakeflow
- AI Gateway
- MLflow 3.0