How Airbyte Uses AI to Build Connectors

00:57:22
https://www.youtube.com/watch?v=SR5Spck-IY0

Summary

TLDR: The event introduced attendees to the role of AI in modern technology projects, with a particular focus on Airbyte, a data movement platform, and its AI Assist feature, which uses AI to automate the creation of API connectors. The presenters discussed AI Assist's development cycle, challenges, and successes, notably the importance of evaluations (evals) in continuously improving AI tools. The event served as a forum for industry professionals to share experiences, dwelling on the need to structure AI projects well in order to avoid common pitfalls and succeed in production. Finally, the emphasis was on integrating AI into manual workflows that already exist in order to maximize its impact.

Takeaways

  • 🤝 The event encourages interaction and knowledge sharing around AI integration.
  • 💡 Airbyte and AI Assist are key tools for the automated creation of API connectors.
  • 🛠 AI development needs evaluations to guarantee effectiveness and success.
  • ⚙️ Automating complex tasks like connector building takes well-structured AI.
  • 🔍 Correctly understanding and extracting API data is crucial.
  • 🧩 Workflows must be well structured to integrate AI effectively.
  • 🎯 It is important to understand real needs before starting AI projects.
  • 🔄 AI development is iterative and requires constant adjustment.
  • 📊 Challenges and solutions from running AI in production were shared.
  • 🚀 Tools like AI Assist bring the goal of easy data access closer.

Timeline

  • 00:00:00 - 00:05:00

    Teao introduces Airbyte, a data movement platform, at an interactive event with Fractional. The goal is to discuss AI projects, the dos and don'ts, and to encourage active audience participation.

  • 00:05:00 - 00:10:00

    Natik presents Airbyte's goal of making data accessible to everyone. The company focuses its efforts on frameworks that can read data from arbitrary APIs and launched AI Assist to make building API connectors more efficient.

  • 00:10:00 - 00:15:00

    A demo of AI Assist shows how building an API connector can be cut from several days to roughly an hour. This makes API connections faster to create, which is essential for extending Airbyte's API coverage.

  • 00:15:00 - 00:20:00

    AI Assist started as a naive project using ChatGPT. That approach was too simple and did not work well for complex tasks. The team then developed a more sophisticated approach with Fractional, combining LLMs with extensive software logic.

  • 00:20:00 - 00:25:00

    The main lesson learned is that AI "magic" requires a great deal of tedious engineering work. Testing solutions outside production environments generates little learning until end users actually interact with the software.

  • 00:25:00 - 00:30:00

    Airbyte offers a tool for pulling data from APIs and loading it into various databases and vector destinations. This makes life easier for developers interested in AI prototypes.

  • 00:30:00 - 00:35:00

    Eddie's presentation for Fractional highlights the importance of structuring AI projects well. Fractional helped design AI Assist, making sure the production system uses LLMs in a way that actually adds value.

  • 00:35:00 - 00:40:00

    Eddie stresses the importance of existing manual workflows where AI could improve efficiency. He points out the value of automated evaluations for making sure AI solutions deliver real value to users.

  • 00:40:00 - 00:45:00

    Teao walks through how the evaluation criteria for AI projects evolved, underlining the importance of robust eval systems for tracking progress and regressions so the software keeps improving.

  • 00:45:00 - 00:50:00

    Eddie discusses potential futures for AI, including the rise of autonomous agents, but argues that most successes will come from specialization and the adoption of domain-specific agentic systems.

  • 00:50:00 - 00:57:22

    The event closes with an invitation to continue the discussions and dig deeper into the opportunities AI projects offer, in a relaxed, collaborative atmosphere.



Frequently Asked Questions

  • What is Airbyte?

    Airbyte is a data movement platform that makes it easy to access data from a wide range of systems through API connectors.

  • What was the main goal of the event?

    The event aimed to inform attendees about integrating AI into technology projects and to provide an interactive space to discuss best practices and challenges, especially right after the recent Disrupt event.

  • What is the AI Assist mentioned at the event?

    AI Assist is an AI co-pilot feature built into Airbyte's graphical connector builder that simplifies and automates the creation of API connectors with the help of AI.

  • What was the purpose of Nati's demo?

    It demonstrated how AI can be built into tooling to simplify complex processes such as building API connectors.

  • What aspects of putting AI into practice were discussed at the event?

    The event focused on sharing knowledge about running AI applications in production, including automated evaluation tests (evals) and agentic workflows.

Transcript
  • 00:00:26
    all right I think this works all right
  • 00:00:28
    we're good to go hi everybody it's great
  • 00:00:30
    to see you again fast forward from the
  • 00:00:32
    front door my name is teao I work over
  • 00:00:34
    here at Airbyte we are a data movement
  • 00:00:37
    platform you're going to be learning all
  • 00:00:38
    about tonight along with our partners
  • 00:00:40
    fractional who you're going to learn all
  • 00:00:41
    about them tonight as well thank you for
  • 00:00:43
    making the time to join us tonight we
  • 00:00:44
    hope you're enjoying the food the drinks
  • 00:00:46
    the company um our our aim here is to
  • 00:00:49
    make this a really fun night and a
  • 00:00:50
    really informative night especially
  • 00:00:52
    because many of you are probably still
  • 00:00:54
    about to start the recovery process from
  • 00:00:56
    disrupt um so we're excited to kind of
  • 00:00:59
    be closing out with you all for the day
  • 00:01:02
    um way we're going to do tonight nti's
  • 00:01:04
    gonna go ahead and come in and give his
  • 00:01:06
    presentation we're going to do our
  • 00:01:07
    fireside chat with Eddie and de we're
  • 00:01:09
    going to learn more about fractional and
  • 00:01:10
    how you can think about uh the AI
  • 00:01:13
    projects that you're working on the dos
  • 00:01:14
    the don'ts uh and really the aim for
  • 00:01:16
    tonight is not only to just be us
  • 00:01:19
    talking here and you listening we want
  • 00:01:21
    this to be interactive so if you have ai
  • 00:01:23
    projects that you're working on it's
  • 00:01:25
    like Eddie and everyone else from a
  • 00:01:26
    fractional perspective give their
  • 00:01:27
    thoughts if you want to Pepper Nati with
  • 00:01:29
    personal questions you can do that stuff
  • 00:01:31
    too um but really the night is meant to
  • 00:01:33
    be all about you uh so we're going to
  • 00:01:36
    try to live up to that but with that
  • 00:01:39
    being said I'm going to shut up and go
  • 00:01:40
    to the back here thank you all for
  • 00:01:41
    joining us again natik I'm gonna hand it
  • 00:01:44
    over to
  • 00:01:48
    [Applause]
  • 00:01:50
    you hello
  • 00:01:52
    hello all right a
  • 00:01:55
    sec I'm clumsy so all right uh my goal
  • 00:02:00
    today is not to sell you all on Airbyte
  • 00:02:03
    but to put some context on yeah a few
  • 00:02:06
    minutes on what we are doing and why we
  • 00:02:10
    try to do co-pilot um style AI assist in
  • 00:02:14
    our Dev tools what we've got as a result
  • 00:02:17
    what we've learned um how you can use it
  • 00:02:20
    to grab data for your projects and then
  • 00:02:22
    we're going to talk with Eddie and Eddie
  • 00:02:23
    is going to talk to us about how to
  • 00:02:26
    actually um be better at building with
  • 00:02:29
    AI um and avoid common
  • 00:02:33
    pitfalls
  • 00:02:34
    so Airbyte we started just a few years
  • 00:02:38
    back we're almost four years
  • 00:02:40
    oldish and the slide that Michelle our
  • 00:02:43
    CEO shows to every new hire says that
  • 00:02:46
    our mission is to make data available to
  • 00:02:50
    anyone and anywhere if you own your data
  • 00:02:53
    and it's in any systems databases apis
  • 00:02:55
    you should be able to use your data
  • 00:02:57
    that's why there's a bunch of companies
  • 00:02:58
    like Zapier or like University cases
  • 00:03:00
    right and turns out to fulfill this
  • 00:03:02
    Mission you know things get much easier
  • 00:03:05
    if you have Frameworks that can read
  • 00:03:07
    data from arbitrary apis that's what my
  • 00:03:10
    team is doing I am an engineering
  • 00:03:12
    manager on API extensibility team we're
  • 00:03:15
    doing Frameworks that power all of our
  • 00:03:17
    API
  • 00:03:19
    connectors um in 2021 2022 we had a
  • 00:03:23
    python CDK um connector development kit
  • 00:03:25
    framework we had around a 100 connectors
  • 00:03:29
    at that time and we thought okay well
  • 00:03:31
    how do we scale that we have 20
  • 00:03:33
    Engineers supporting 10 certified
  • 00:03:35
    hardcore connectors Community
  • 00:03:37
    contributes connectors but how do we
  • 00:03:38
    maintain all that so in 2023 we made a
  • 00:03:43
    graphical user interface around our low
  • 00:03:45
    code no code framework that encapsulates
  • 00:03:48
    a connector in a basically a bunch of
  • 00:03:51
    yaml kubernetes resource definition
  • 00:03:53
    style and that's great people started
  • 00:03:55
    being able to make a connector in an
  • 00:03:57
    hour versus you know days but it's still
  • 00:04:01
    a cool hour or more so in 2024 we've
  • 00:04:03
    released AI assist which is essentially
  • 00:04:06
    co-pilot for our graphical user
  • 00:04:07
    interface
  • 00:04:08
    tool
  • 00:04:10
    and I want to show you how it works I'm
  • 00:04:14
    99% confident is going to be fine but
  • 00:04:16
    I'm going to do it one-handed so let's
  • 00:04:20
    see just to give you a sense of what
  • 00:04:22
    this thing is so I figured you know what
  • 00:04:24
    are we going to build today we already
  • 00:04:27
    have a lot of connectors so finding one
  • 00:04:29
    that we don't have was a little bit of a
  • 00:04:31
    challenge and my CFO was walking nearby
  • 00:04:36
    and I thought hey juel do you think it's
  • 00:04:38
    cool if I use our financial data for a
  • 00:04:41
    demo for a Meetup and he said you signed
  • 00:04:45
    an NDA you
  • 00:04:52
    stupid my cache was about to be warmed up
  • 00:04:55
    interesting okay this might take us a
  • 00:04:57
    minute so we might as well continue and
  • 00:05:01
    give it a few seconds while that is
  • 00:05:09
    happening yeah let's almost
  • 00:05:13
    smoothly so we're going to return to
  • 00:05:15
    that but to give you
  • 00:05:17
    perspective data transfer companies are
  • 00:05:19
    only as good as the connector coverage
  • 00:05:22
    that we have if we only support 200 apis
  • 00:05:24
    you have your own API does your own
  • 00:05:26
    thing you want your data we don't
  • 00:05:28
    support it you're not going to use us
  • 00:05:30
    so how are we doing well you know we've
  • 00:05:32
    released AI assist and connector builder
  • 00:05:34
    in like
  • 00:05:35
    2023 um we've
  • 00:05:38
    added what approximately 100 connectors
  • 00:05:42
    from August to the end of October and if
  • 00:05:46
    like our total is less than 400 that's a
  • 00:05:49
    lot of
  • 00:05:50
    connectors how is this live demo thing
  • 00:05:53
    doing oh okay so this roughly is our
  • 00:05:58
    connector Builder and and it needs to
  • 00:06:00
    know things about your API it needs to
  • 00:06:02
    know your base URL which AI Assist
  • 00:06:04
    guessed for me it needs to know how to
  • 00:06:06
    authenticate and it thinks that this API
  • 00:06:08
    is using bearer
  • 00:06:10
    token
  • 00:06:11
    which I'm going to paste
  • 00:06:15
    save and we have streams of data so
  • 00:06:18
    transactions is obviously the most
  • 00:06:20
    interesting it figured out where
  • 00:06:22
    transactions live what HTTP method to
  • 00:06:24
    use um where transaction records are
  • 00:06:29
    within the HTTP
  • 00:06:31
    response um it figured the pagination it
  • 00:06:34
    figured where in the response is the
  • 00:06:36
    cursor to the next page let's see if it
  • 00:06:39
    works and if I actually pasted the
  • 00:06:47
    token come
  • 00:06:51
    on here it is okay I'm not going to show
  • 00:06:54
    you the actual records but what's
  • 00:06:55
    important is uh 100 records per page
  • 00:06:58
    five pages test read is successful
  • 00:07:00
    meaning I only had to paste my
  • 00:07:02
    documentation URL and my API token and
  • 00:07:05
    it figured out um how to get my data in
  • 00:07:09
    fact I did this a little bit earlier
  • 00:07:11
    today and got a bunch of streams and
  • 00:07:14
    then I used this little button here to
  • 00:07:17
    make a pull request and we have a pull
  • 00:07:21
    request in our GitHub I'm going to show
  • 00:07:22
    you that in a little bit that's how we
  • 00:07:26
    are growing from 200 something
  • 00:07:27
    connectors to 400 something
  • 00:07:30
    connectors within just these few
  • 00:07:34
    months
  • 00:07:36
    now we tried three times to get this
  • 00:07:40
    thing right it was a hobby project of
  • 00:07:42
    one of our Engineers like oh LLMs are
  • 00:07:44
    cool let's build something with LLMs um
  • 00:07:46
    didn't quite work
  • 00:07:48
    out the first attempt was very naive
  • 00:07:51
    Eddie will walk you through some of the
  • 00:07:53
    details but we thought you know what
  • 00:07:54
    ChatGPTs are cool let's just let's
  • 00:07:56
    paste the docs give the docs to Chat
  • 00:07:58
    GPT and say hey you output the Manifest
  • 00:08:00
    file of the connector and it works on
  • 00:08:03
    super simple things like PokéAPI or
  • 00:08:05
    like exchange rate API some something
  • 00:08:07
    super simple with one or two streams of
  • 00:08:09
    data doesn't work on anything serious
  • 00:08:11
    cannot figure out authentication then we
  • 00:08:13
    thought okay well it is very difficult
  • 00:08:15
    for a l large language model to Output
  • 00:08:18
    the Manifest in our format it doesn't
  • 00:08:20
    know the constraints the schema but
  • 00:08:23
    there's a lot of open apis specs on the
  • 00:08:25
    internet so what if we ask it to First
  • 00:08:28
    generate open API spec and then from
  • 00:08:30
    that we're going to heuristically generate
  • 00:08:32
    the Manifest it's also extremely
  • 00:08:34
    brittle and then we decided to work with
  • 00:08:37
    fractional on this co-pilot approach
  • 00:08:40
    this works but it's not just a single
  • 00:08:43
    llm
  • 00:08:44
    call it's not just prompt engineering um
  • 00:08:48
    this diagram is probably not very
  • 00:08:50
    visible right but there's basically four
  • 00:08:52
    levels nested logic of how we figure out
  • 00:08:56
    what authentication scheme a given API
  • 00:08:58
    uses given its docs open API spec and if
  • 00:09:03
    we don't have enough information there
  • 00:09:05
    or if there's no open API spec we would
  • 00:09:07
    attempt Googling and scraping SERP
  • 00:09:09
    results uh from Google to figure out how
  • 00:09:12
    to
  • 00:09:14
    authenticate so core lesson the magic
  • 00:09:17
    is just a lot a lot a lot of tedious
  • 00:09:19
    software
  • 00:09:20
    engineering and the thing there is all of
  • 00:09:24
    that time unless your users are actually
  • 00:09:26
    benefiting from your software you're not
  • 00:09:28
    learning anything and just having a
  • 00:09:30
    prototype doesn't give you much you got
  • 00:09:32
    to figure out where you host it how you
  • 00:09:34
    monitor it how you evaluate it how you
  • 00:09:36
    monitor your budget burn how you figure
  • 00:09:38
    out when it moves out of beta
  • 00:09:41
    Etc so we figured Airbyte is not just an
  • 00:09:45
    open-source graphical user interface data
  • 00:09:48
    pipelines tool or ETL uh my personal big
  • 00:09:51
    thing here is to make uh system that
  • 00:09:55
    gives you your data in python or in CLI
  • 00:09:58
    you don't have to use Airbyte proper you
  • 00:09:59
    don't have to use our graphical user
  • 00:10:01
    interfaces to get your data if you have
  • 00:10:03
    hobby projects or things that you do on
  • 00:10:05
    weekends we should be able to help which
  • 00:10:07
    should be handy if you decide to
  • 00:10:09
    prototype stuff with Eddie and
  • 00:10:11
    fractional later
  • 00:10:12
    on um so what we can do um we have Py
  • 00:10:15
    Airbyte which is a CLI or Python library
  • 00:10:18
    that can read data again from anywhere
  • 00:10:20
    and write it to a local DuckDB cache and then
  • 00:10:23
    we have a bunch of destinations
  • 00:10:24
    including a bunch of vector destinations
  • 00:10:26
    and pgvector Pinecone and such
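
A minimal PyAirbyte sketch of what this looks like in practice: read a source into the local DuckDB cache, then pull the records out as a DataFrame. The "source-faker" source and its config are illustrative placeholders; real sources need their own configuration and credentials.

```python
# Minimal PyAirbyte sketch (pip install airbyte). Illustrative only:
# "source-faker" is a demo source; real sources need their own config/credentials.
import airbyte as ab

source = ab.get_source(
    "source-faker",
    config={"count": 1_000},   # connector-specific config
    install_if_missing=True,
)
source.check()                 # validate the config before reading
source.select_all_streams()    # or source.select_streams(["users"])

result = source.read()         # records land in a local DuckDB cache by default
users_df = result["users"].to_pandas()
print(len(users_df), "records cached locally")
```

From that local cache the same records can then be loaded into the destinations mentioned above, including vector stores such as pgvector or Pinecone.
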
  • 00:10:29
    yeah very interesting time let's build
  • 00:10:31
    some stuff together now I'm going to
  • 00:10:33
    pass it to Eddie um and see what we want
  • 00:10:37
    to talk about next
  • 00:10:41
    [Applause]
  • 00:11:00
    are you moderating this section cool
  • 00:11:02
    well hello everybody uh while we're
  • 00:11:04
    waiting for Teo my name is Eddie I'm the
  • 00:11:06
    CTO at Fractional AI uh we're a Dev
  • 00:11:10
    shop that is specifically focused on
  • 00:11:12
    building challenging production
  • 00:11:14
    applications that that use llms in some
  • 00:11:16
    way so you know we were're uh we helped
  • 00:11:20
    build the the AI assist feature you just
  • 00:11:23
    saw which is like a good good example
  • 00:11:25
    when you're trying to dig into the weeds
  • 00:11:26
    of what some of these production AI
  • 00:11:28
    projects look like but we've also seen
  • 00:11:30
    over a hundred of these projects at this
  • 00:11:31
    point and um yeah I'm excited to talk
  • 00:11:35
    about all things about what it really
  • 00:11:36
    means to put put AI projects into
  • 00:11:39
    production that's for you
  • 00:11:41
    ni um I'm just going to be yelling
  • 00:11:44
    because you two are the most important
  • 00:11:45
    people here and from this side of room
  • 00:11:48
    you all are very important
  • 00:11:49
    obviously um I think where I want to
  • 00:11:52
    start Eddie you already kind of gave us
  • 00:11:55
    a little bit of background fractional uh
  • 00:11:57
    on in terms of working on different
  • 00:11:58
    kinds of projects
  • 00:11:59
    I want to go a little bit more
  • 00:12:01
    into the AI assistant when you thought
  • 00:12:04
    about the kinds of kinds of ways you can
  • 00:12:07
    incorporate AI for new projects like I
  • 00:12:09
    think there's a lot of people who are
  • 00:12:10
    looking around where should I be
  • 00:12:12
    implementing AI um ni you you talk a
  • 00:12:15
    little bit about how we want to bring AI
  • 00:12:18
    into our own workflow what's your first
  • 00:12:20
    advice for anyone who's thinking about
  • 00:12:22
    how can I bring AI into my
  • 00:12:26
    Enterprise it's a good question um I
  • 00:12:28
    think there's like a lot of ideas for
  • 00:12:29
    way AI can help um but that things often
  • 00:12:32
    get stuck early in the ideation process
  • 00:12:34
    or at the PoC phase I think one critical
  • 00:12:37
    thing that happened here was a lot of
  • 00:12:41
    the best opportunities for AI exist in a
  • 00:12:43
    manual workflow that you're already
  • 00:12:45
    running somewhere today uh people were
  • 00:12:47
    already building API connectors here and
  • 00:12:50
    so it was very clear like what was hard
  • 00:12:52
    you had a clear set of input output
  • 00:12:54
    pairs to care about you had clear
  • 00:12:56
    historical data you understood your
  • 00:12:57
    domain and could measure the value of
  • 00:13:00
    this thing right this took us quite a
  • 00:13:02
    while to build um if you're going to
  • 00:13:03
    spend all this time building something
  • 00:13:05
    you got to kind of know that there's a
  • 00:13:06
    there there that it's is like going to
  • 00:13:07
    save a lot of people a lot of real time
  • 00:13:09
    and not just be some speculative um
  • 00:13:11
    thing so that would be like the number
  • 00:13:13
    one thing I would focus on is this a
  • 00:13:16
    real existing manual workflow that looks
  • 00:13:19
    like the llm sort of capability set can
  • 00:13:23
    be applied here well and is it valuable
  • 00:13:25
    enough like if we can actually get there
  • 00:13:27
    does this save us a lot of time does it
  • 00:13:29
    it you know what's what's the financial
  • 00:13:31
    impact to us on this does it save us
  • 00:13:33
    hours does it you know generate new
  • 00:13:35
    Revenue what what kind of sort of uh uh
  • 00:13:37
    impact does it have when I think about
  • 00:13:40
    like what are the core capabilities of
  • 00:13:43
    these llms I basically think about it
  • 00:13:47
    as computers can now read write
  • 00:13:53
    make junior employee level decisions and
  • 00:13:57
    they're sort of domain experts about
  • 00:13:58
    everything and like that's the set of
  • 00:14:00
    things that I would look at in these
  • 00:14:01
    manual workflows rather than like oh
  • 00:14:03
    maybe we can apply AI here and it can
  • 00:14:04
    know everything about everything is this
  • 00:14:06
    very specific oh you know we're spending
  • 00:14:08
    a lot of time reading through API docs
  • 00:14:09
    and saying like what did it say um and
  • 00:14:12
    and that's a pretty llm capable
  • 00:14:15
    task did you have anything you want to
  • 00:14:17
    add there because otherwise I'm going to
  • 00:14:18
    take it to this experience directly
  • 00:14:21
    there's the whole you Scope our project
  • 00:14:23
    you decide you want you're going to do
  • 00:14:24
    it I'd love to know what went wrong in
  • 00:14:27
    this situation
  • 00:14:29
    oh so much
  • 00:14:32
    uh the first thing that jumped to mind
  • 00:14:34
    here is that um I think we failed to
  • 00:14:36
    appreciate upfront just how hard some of
  • 00:14:39
    the pure software engineering parts of
  • 00:14:42
    the crawling of API docs would be I
  • 00:14:44
    think we initially thought about this as
  • 00:14:46
    like step one download the docs step two
  • 00:14:49
    get llm to make a bunch of decisions um
  • 00:14:53
    and does that resonate with other people
  • 00:14:55
    you know one two and you're done all
  • 00:14:57
    right we got some hands over there ni
  • 00:15:00
    um and and fundamentally that is still
  • 00:15:01
    What's Happening Here Right like we're
  • 00:15:03
    trying to build a connector into an API
  • 00:15:05
    the kind of steps involved are go to the
  • 00:15:08
    web page that describes how to connect
  • 00:15:09
    this a to this API read through the docs
  • 00:15:11
    and then make a bunch of decisions okay
  • 00:15:13
    here's how we authenticate provide our
  • 00:15:15
    credentials to log into this API here's
  • 00:15:17
    what the set of endpoints looks like uh
  • 00:15:19
    turns out these documentation pages are
  • 00:15:21
    like everything you can possibly imagine
  • 00:15:23
    times like 10 and you have to support a
  • 00:15:25
    very wide variety of use cases you have
  • 00:15:27
    to handle you know rate limiting and
  • 00:15:29
    some docs are behind authentication and
  • 00:15:31
    some docs are like uh the information is
  • 00:15:34
    not even on the web page it's like you
  • 00:15:35
    know that you've got to click on things
  • 00:15:36
    and it's going to go fetch it from the
  • 00:15:37
    server and handling this super wide
  • 00:15:39
    variety of use cases or preventing
  • 00:15:41
    yourself from going and crawling out to
  • 00:15:42
    irrelevant Pages was incredibly hard and
  • 00:15:45
    even now when we look at failure cases
  • 00:15:48
    more often than not they're not uh an
  • 00:15:51
    the AI making a poor decision based on
  • 00:15:53
    good data it's the AI making something
  • 00:15:55
    up based on no data because we failed to
  • 00:15:57
    actually find the right the right sort
  • 00:15:59
    of source material out of the
  • 00:16:01
    web you just seen me make a demo that
  • 00:16:04
    took what like a minute right to process
  • 00:16:08
    and in this minute it tries to figure
  • 00:16:10
    out the relevant docs and figure out the
  • 00:16:13
    base URL then the stream URL
  • 00:16:15
    authentication scheme
  • 00:16:16
    parameters when we started there was the
  • 00:16:20
    happy path prototype connector like woo
  • 00:16:22
    this works really fast that's great but
  • 00:16:25
    then in some cases it took like four and
  • 00:16:30
    a half something minutes in crawling
  • 00:16:32
    docs in headless Chrome and sometimes
  • 00:16:35
    it would get into Loops so you would
  • 00:16:38
    think like in 2024 crawling pages from
  • 00:16:41
    the web should be solved problem and
  • 00:16:42
    there's a bunch of products that say
  • 00:16:44
    they do it right Firecrawl is the one we
  • 00:16:47
    use
  • 00:16:48
    now but can you just out of the box
  • 00:16:51
    Point them and expect them to work like
  • 00:16:54
    nope if you go read like you know a rag
  • 00:16:58
    rag tutorial right now it's going to
  • 00:17:00
    tell you uh you know go download your
  • 00:17:03
    information get get crawl the docs
  • 00:17:05
    download the docs uh strip out some HTML
  • 00:17:09
    chunk it up into pieces put it into a
  • 00:17:12
    vector store and then query your vector
  • 00:17:13
    store um and actually we did kind of
  • 00:17:16
    start there the final implementation we
  • 00:17:18
    ended up with looks something more like
  • 00:17:20
    we don't pre-RAG anything we wait until
  • 00:17:22
    we have a specific task we're trying to
  • 00:17:23
    do like how do you like what is the
  • 00:17:25
    authentication mechanism does this API
  • 00:17:27
    use you know HTTP basic auth for the
  • 00:17:29
    username password does it use an API key
  • 00:17:32
    what is the method and then we purpose
  • 00:17:34
    go crawl for that we start at the
  • 00:17:36
    homepage of the docs and we ask an llm
  • 00:17:38
    to help us navigate toward you know
  • 00:17:40
    where we'd want to want to go we have so
  • 00:17:42
    many fallback mechanisms in here we have
  • 00:17:44
    multiple different Services we use for
  • 00:17:45
    this crawling because there can be rate
  • 00:17:47
    limiting issues they can be flaky um
  • 00:17:49
    there's there's all sorts of issues
  • 00:17:51
    around that we fall back on doing a
  • 00:17:52
    Google search if we can't find the
  • 00:17:54
    information we're looking for we use
  • 00:17:55
    perplexity at some points in the flow uh
  • 00:17:58
    we have a repos repository under the
  • 00:17:59
    hood of a bunch of pre-built OpenAPI
  • 00:18:01
    specs from common repositories like it
  • 00:18:04
    is very complicated under the hood
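
A rough sketch of the fallback structure being described, not the actual Airbyte/Fractional implementation; every helper named here is a hypothetical placeholder.

```python
# Hypothetical sketch of a crawl-with-fallbacks chain. None of these helpers are
# real library calls; they only illustrate the structure described above.
from typing import Callable, Optional

def fetch_docs(url: str, fetchers: list[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Try each fetcher in order and return the first non-empty page."""
    for fetch in fetchers:
        try:
            page = fetch(url)
            if page and page.strip():
                return page
        except Exception:
            continue  # rate-limited or flaky service: fall through to the next option
    return None  # report "no data" instead of letting the model invent an answer

# fetchers = [firecrawl_fetch, headless_chrome_fetch, google_search_snippets]
# docs = fetch_docs("https://example.com/api/docs", fetchers)
```
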
  • 00:18:06
    there's a lot a lot going on that
  • 00:18:08
    doesn't look like you know you're uh
  • 00:18:10
    here's how you ask a question of your
  • 00:18:11
    documents or rag
  • 00:18:14
    tutorial and I kind of want to like
  • 00:18:16
    before we go towards like the next
  • 00:18:17
    question there I want to just get a
  • 00:18:19
    pulse for the room probably should have
  • 00:18:20
    started with this but I think it's
  • 00:18:21
    helpful as we're diving deeper into some
  • 00:18:23
    of these Concepts just to make sure
  • 00:18:24
    we're all kind of on that same
  • 00:18:25
    wavelength would you raise your hand if
  • 00:18:27
    you identify a builder in AI right now
  • 00:18:30
    you're building some kind of company or
  • 00:18:32
    product in the space all right great how
  • 00:18:34
    many of you are not necessarily building
  • 00:18:36
    but pretty well versed in the topic
  • 00:18:38
    you're doing a lot of independent
  • 00:18:40
    research and rais all right those two
  • 00:18:44
    together I think we have a large
  • 00:18:44
    majority for everyone else you're
  • 00:18:45
    probably where I'm at in my like Journey
  • 00:18:48
    so you can go ahead and be Googling
  • 00:18:50
    things on the side just like I'm going
  • 00:18:51
    to be doing over here um yeah yeah call
  • 00:18:53
    me out for if I'm getting too technical
  • 00:18:55
    no no no it's good we want we want to go
  • 00:18:57
    de deeper and this being live stream
  • 00:18:59
    record so you can always come back later
  • 00:19:01
    if you have more questions I want to
  • 00:19:03
    talk about that piece then like thinking
  • 00:19:04
    about all these components that go into
  • 00:19:07
    building an AI you think about
  • 00:19:08
    observability you think about the rag
  • 00:19:10
    like could you talk through what are
  • 00:19:14
    core components for you of a successful
  • 00:19:17
    AI project maybe evaluations or or
  • 00:19:19
    things of that nature where do you want
  • 00:19:20
    to take
  • 00:19:22
    this so I think the one of the earliest
  • 00:19:25
    steps in any project that's going to
  • 00:19:27
    reach this level of success um if
  • 00:19:30
    it's going to have any sort of
  • 00:19:31
    meaningful complexity to it is going to
  • 00:19:33
    have to be building evals and what I
  • 00:19:36
    mean by evals is basically an
  • 00:19:38
    automated test suite for your
  • 00:19:40
    application but one where you're running
  • 00:19:42
    over lots of examples that you want your
  • 00:19:45
    system to be good at and you're testing
  • 00:19:46
    how well it it does at these things so
  • 00:19:48
    you define some metrics up front to
  • 00:19:50
    measure how well am I doing um and so
  • 00:19:52
    like as a concrete example here we're
  • 00:19:53
    trying to build API Integrations our
  • 00:19:55
    first step was let's go gather a bunch
  • 00:19:58
    of existing API Integrations we built
  • 00:20:01
    let's build a a sort of test harness
  • 00:20:03
    that can generate output from our system
  • 00:20:05
    test it against how well does it match
  • 00:20:08
    up with the things that actually people
  • 00:20:09
    built in the past and we produced a
  • 00:20:10
    whole bunch of metrics around these it's
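
The eval harness being described can be pictured as a loop like the sketch below; the case structure, helper names, and metrics are invented for illustration, not the project's actual code.

```python
# Illustrative eval-harness loop: run the generator over connectors that already
# exist and score the output against the hand-built ground truth.
def run_evals(cases, generate_connector, score_fns):
    rows = []
    for case in cases:                      # e.g. known connectors: sentry, zendesk, ...
        generated = generate_connector(case.docs_url)
        row = {"connector": case.name}
        for metric_name, score in score_fns.items():
            row[metric_name] = score(generated, case.ground_truth)  # 0.0 to 1.0
        rows.append(row)
    return rows                             # one row per connector, one column per metric
```
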
  • 00:20:13
    it's actually non-trivial to get this
  • 00:20:15
    right um uh you know even though we had
  • 00:20:17
    a really rich set of ground truth to
  • 00:20:19
    look at here you know we had hundreds of
  • 00:20:20
    connectors to people that built the
  • 00:20:22
    comparisons are not very straightforward
  • 00:20:24
    like sometimes our system comes up with
  • 00:20:25
    different names than than people came up
  • 00:20:27
    with or the the community connectors
  • 00:20:29
    might have only a subset of of things
  • 00:20:32
    defined in them they could have defined
  • 00:20:34
    and that's that's okay for their use
  • 00:20:35
    case um so detecting sort of the
  • 00:20:37
    difference between we didn't generate
  • 00:20:40
    something and we should have versus we
  • 00:20:41
    didn't generate something and that's
  • 00:20:42
    fine um is is not uh it's not trivial
  • 00:20:46
    but you got to start somewhere and if
  • 00:20:48
    you don't do this your starting point is
  • 00:20:50
    gonna be very vibes based you're gonna
  • 00:20:52
    like run your first best idea some
  • 00:20:56
    sometimes it's going to work which is
  • 00:20:57
    going to be really encouraging and cool
  • 00:20:58
    sometimes it's not and you're not like
  • 00:21:00
    going to kind of have some intuition
  • 00:21:01
    about maybe here's how I improve it but
  • 00:21:02
    it's going to be based on whatever sort
  • 00:21:03
    sitting in front of you this is what
  • 00:21:05
    they ended up looking like at some point
  • 00:21:06
    maybe there's like can you go up a slide
  • 00:21:08
    so this is how it looked at the
  • 00:21:09
    beginning when we started we were just
  • 00:21:11
    like so if you can't see the rows here
  • 00:21:15
    are just example connectors that that
  • 00:21:18
    existed already um and we just picked
  • 00:21:21
    three uh knowing that we wanted to be
  • 00:21:23
    better than just doing these three but
  • 00:21:25
    we started somewhere and then each of
  • 00:21:26
    these columns is some some way that we
  • 00:21:28
    measure ourselves against the ground
  • 00:21:29
    truth so if we ask our system to produce
  • 00:21:31
    a Sentry connector there's already a
  • 00:21:33
    Sentry connector out there how well do
  • 00:21:35
    we do at all these things and uh and
  • 00:21:37
    produce these these metrics and we try
  • 00:21:39
    and kind of like produce a score that is
  • 00:21:41
    roughly weighted by how valuable is it
  • 00:21:43
    to a user if we screw this up or get it
  • 00:21:46
    right uh and and then you start now you
  • 00:21:49
    can actually sort of measure how well
  • 00:21:50
    you're doing this is a super powerful
  • 00:21:53
    tool there's sort of a Dark Art to like
  • 00:21:56
    you know perfect versus good on this but
  • 00:21:59
    um if you get this into a good place it
  • 00:22:02
    guides development in a very real way
  • 00:22:03
    like first of all you can tell in an
  • 00:22:04
    unbiased way like how are we doing
  • 00:22:06
    overall you can track your progress you
  • 00:22:08
    can track regressions and if you sort of
  • 00:22:11
    if you're doing some prompt engineering
  • 00:22:12
    and you're like tweaking the language
  • 00:22:13
    all the time to get better at some
  • 00:22:14
    specific failure mode you're seeing what
  • 00:22:17
    how do you know if you tweak your prompt
  • 00:22:18
    it's like not going to make you worse
  • 00:22:19
    the thing you tried to get better at
  • 00:22:20
    yesterday so this will help you track
  • 00:22:22
    regressions it also
  • 00:22:24
    drives uh the sort of anecdotal evidence
  • 00:22:28
    you want to see for where to invest your
  • 00:22:30
    attention next if you go you know
  • 00:22:32
    like you know we're doing pretty well
  • 00:22:34
    actually at this stage um but like
  • 00:22:37
    there's still some zeros in here um so
  • 00:22:40
    like my intuition from seeing this is
  • 00:22:43
    like okay we're like doing okay at
  • 00:22:44
    whatever this thing is for zenitz and
  • 00:22:46
    we're like doing not that good for this
  • 00:22:48
    schema thing for zenitz like wonder what
  • 00:22:50
    that is and i' click into what it's It
  • 00:22:52
    Go actually look at what we generated
  • 00:22:54
    and say ah okay like this the LLM got
  • 00:22:57
    this wrong because we're feeding it the
  • 00:22:58
    wrong information this is a crawling
  • 00:22:59
    problem not prompting problem and we' go
  • 00:23:02
    update our crawler and so sort of tells
  • 00:23:04
    you what to work on next and then over
  • 00:23:06
    time we expanded to that that next slide
  • 00:23:08
    that you were on a second ago which
  • 00:23:09
    is the evals just got bigger and bigger
  • 00:23:12
    and bigger we just kept getting more use
  • 00:23:13
    cases in there trying to get a wider and
  • 00:23:15
    wider set of examples to look at um and
  • 00:23:18
    it's what drove you know you showed the
  • 00:23:19
    sort of workflow diagram in your slides
  • 00:23:21
    earlier that was like kind of the
  • 00:23:22
    spaghetti look of all the different
  • 00:23:24
    steps that go into just one of the
  • 00:23:25
    questions here that evolved out of this
  • 00:23:28
    exploration trying to get better and
  • 00:23:30
    better by adding more sort of uh catches
  • 00:23:33
    for things that could go
  • 00:23:36
    wrong did you want to add anything
  • 00:23:39
    there can hope to add some context at
  • 00:23:44
    the high level this is the diagram for
  • 00:23:46
    the whole thing in the
  • 00:23:48
    beginning and so the idea was okay we're
  • 00:23:51
    going to crawl all of the documents now
  • 00:23:53
    we're going to index everything shove it
  • 00:23:55
    into a vector store and then there's
  • 00:23:57
    going to be like three four different
  • 00:23:58
    components one's going to figure out the
  • 00:24:00
    AL the other is going to figure out the
  • 00:24:02
    pagination um right and then the the the
  • 00:24:05
    different ones going to figure out the
  • 00:24:07
    list of streams basically stream is an
  • 00:24:08
    API endpoint like oh you know
  • 00:24:10
    repositories and GitHub is a stream
  • 00:24:12
    issues and GitHub is a stream if you
  • 00:24:14
    look at this one right here deaf is
  • 00:24:17
    what we call a record selector at Air
  • 00:24:19
    byte is basically where exactly in the
  • 00:24:22
    response Json is the useful information
  • 00:24:26
    and the schema means okay what are The
  • 00:24:29
    Columns of data what are the fields of
  • 00:24:31
    the useful objects that we
  • 00:24:33
    want and as we grew into this even the
  • 00:24:37
    number of things that we've paid
  • 00:24:39
    attention to increased and each
  • 00:24:42
    particular component became this huge
  • 00:24:44
    spaghetti because it turns out that like
  • 00:24:47
    originally we thought you know what each
  • 00:24:49
    component is going to be a subset of
  • 00:24:51
    index docs the tagged and a prompt and
  • 00:24:55
    hopefully a single prompt is going to
  • 00:24:58
    just make it fine like we crawled
  • 00:25:00
    everything already right and turns out
  • 00:25:02
    in reality like every component that we
  • 00:25:04
    need answer to like every field where
  • 00:25:06
    you can get an AI assist prompt is
  • 00:25:08
    basically a program in
  • 00:25:11
    itself I want Tove the spaghetti piece
  • 00:25:14
    also we're going to change this up
  • 00:25:15
    because originally I was just going to
  • 00:25:17
    like have a point where it's purely
  • 00:25:18
    audio audience Q&A if you're having
  • 00:25:21
    questions about things as we come up
  • 00:25:23
    raise your hand and I will kind of bring
  • 00:25:25
    you into the conversation rather than
  • 00:25:27
    just wait for the end um but I'm curious
  • 00:25:30
    about how the spaghetti evolves over
  • 00:25:32
    here what surprised you the most about
  • 00:25:35
    the way your evaluation criteria early
  • 00:25:38
    on differ when you think about the end
  • 00:25:46
    state so I I'm surprised by the number
  • 00:25:49
    of random fallbacks and stuff in the
  • 00:25:51
    system like that we're still Google
  • 00:25:53
    searching in perplexity you know
  • 00:25:54
    searching under the hood to get to some
  • 00:25:56
    of the answers we want um
  • 00:26:01
    uh I think a very useful but difficult
  • 00:26:05
    thing on this project was thinking about
  • 00:26:07
    how to progress along this path to how
  • 00:26:10
    do we arrive at the right spaghetti um
  • 00:26:13
    uh because if you were to just guess it
  • 00:26:14
    up front you wouldn't guess right like
  • 00:26:16
    you have to kind of evolve your way
  • 00:26:18
    toward it and then that's intention with
  • 00:26:21
    like how do we know we're going to get
  • 00:26:22
    there like how do we
  • 00:26:24
    know how do we know this is even
  • 00:26:26
    possible um let alone that going to get
  • 00:26:28
    there in like a reasonable amount of
  • 00:26:29
    time and I think that question is very
  • 00:26:33
    challenging for AI projects right like
  • 00:26:34
    there's there's some stat that like 70%
  • 00:26:36
    of of POCs never make it to production
  • 00:26:39
    with with AI projects and I think it's
  • 00:26:42
    very challenging to know what a good POC
  • 00:26:44
    looks like and how to get from there to
  • 00:26:45
    production um and and so if you like
  • 00:26:48
    take take the AI assist project as just
  • 00:26:50
    like an example of a broader
  • 00:26:52
    theme um I mean you mentioned you guys
  • 00:26:55
    tried it a few times before right and
  • 00:26:56
    you weren't exactly sure what do we make
  • 00:26:58
    of this like I think this says this is
  • 00:27:00
    possible but I don't know how we get
  • 00:27:01
    there and like
  • 00:27:03
    the if you were just gonna try tomorrow
  • 00:27:06
    to say like is it possible to build
  • 00:27:08
    these API Integrations with llms like
  • 00:27:10
    the first thing You' try is you just
  • 00:27:12
    like go ask chat GPT to do it you'd show
  • 00:27:13
    chat GPT an example of these connectors
  • 00:27:16
    are just a file under the hood you
  • 00:27:17
    showed chat GPT an example of the file
  • 00:27:19
    and you said you know build me one like
  • 00:27:21
    this but for this
  • 00:27:22
    API and then something will come out
  • 00:27:25
    like probably something pretty good uh
  • 00:27:28
    because the files are sort of
  • 00:27:29
    inscrutable and if you don't know what
  • 00:27:30
    you're looking for it's going to look
  • 00:27:31
    right even if it's like technically
  • 00:27:32
    doesn't run later um and then you're
  • 00:27:34
    kind of stuck you don't really
  • 00:27:36
    know what did this really tell me about
  • 00:27:38
    is it possible you you can't really
  • 00:27:40
    iterate on it like how do you make chat
  • 00:27:42
    PT better at this now um how do you know
  • 00:27:45
    what array of stuff it's good at versus
  • 00:27:46
    bad at and it's not going to get you to
  • 00:27:48
    this like eventual kind of spaghetti
  • 00:27:50
    looking
  • 00:27:50
    diagram um so instead the approach we we
  • 00:27:54
    tend to take is we try and build POCs
  • 00:27:57
    that are 100% on the critical path to
  • 00:27:58
    production um we try and be thoughtful
  • 00:28:01
    about which pieces we build early but
  • 00:28:03
    early in the project we didn't start by
  • 00:28:05
    saying let's just show like a really
  • 00:28:06
    shiny marketing demo that shows complete
  • 00:28:08
    end to end it working perfectly for one
  • 00:28:11
    connector we said let's pick three
  • 00:28:13
    connectors as examples and it's going to
  • 00:28:15
    start out kind of crappy and then we're
  • 00:28:16
    going to try and make it better over
  • 00:28:17
    time um and that that diagram you showed
  • 00:28:20
    a second ago that's like the the this
  • 00:28:23
    one yes this one this was our sketch
  • 00:28:25
    like a few weeks into the project of
  • 00:28:27
    what we we imagined the eventual
  • 00:28:29
    spaghetti might look like and it ended
  • 00:28:30
    up changing over time and what we tried
  • 00:28:32
    to do was tackle these pieces in order
  • 00:28:35
    um to try and drisk the riskiest parts
  • 00:28:37
    of the project we're like all right
  • 00:28:39
    let's try and work on the box that's
  • 00:28:40
    about authentication right now and see
  • 00:28:42
    like what's it look what's it look like
  • 00:28:44
    start to feel it out get rid of unknown
  • 00:28:45
    unknowns get that to a place where we're
  • 00:28:47
    like I believe that with iteration this
  • 00:28:48
    part is possible then tackle the next
  • 00:28:50
    piece and tackle the next piece and
  • 00:28:52
    start to flesh this out I think actually
  • 00:28:53
    the screenshot like the gray boxes were
  • 00:28:55
    like things we didn't try yet or
  • 00:28:57
    something um or de prioritize for p so
  • 00:29:00
    like you know we hadn't we didn't
  • 00:29:02
    actually tackle all of these but we
  • 00:29:03
    tried to tackle as many as we could to
  • 00:29:04
    start to drisk it and then that process
  • 00:29:08
    drove us to a more robust eval driven
  • 00:29:11
    now it feels like iteration doesn't feel
  • 00:29:13
    like we're building a V1 of something it
  • 00:29:14
    feels like we're kind of like you know
  • 00:29:15
    iterating iterating iterating and that
  • 00:29:17
    drives the ideas for where to add the
  • 00:29:19
    sort of branching Paths of that that
  • 00:29:21
    workflow
  • 00:29:23
    diagram I do like the idea that we
  • 00:29:24
    should only be talking about evals
  • 00:29:26
    in the context of spaghetti going
  • 00:29:29
    so let's keep Let's uh maybe keep that
  • 00:29:31
    one up all night um thinking about the
  • 00:29:38
    yeah yeah in terms of the eval how are
  • 00:29:41
    you do are you just compar
  • 00:30:01
    yeah that's that's a great question um
  • 00:30:02
    yeah so the question was like what are
  • 00:30:04
    we measuring how are we doing these
  • 00:30:05
    evals um in this case are we just
  • 00:30:07
    comparing ourselves to an existing
  • 00:30:09
    connector that we know is good or uh he
  • 00:30:11
    said he's heard of some examples of
  • 00:30:13
    using an llm to evaluate how the other
  • 00:30:15
    llm did um it's a great question uh
  • 00:30:21
    so what we see across successful
  • 00:30:24
    projects varies a lot um part of what
  • 00:30:27
    makes the actually difficult is that
  • 00:30:29
    they rarely fit this like clean academic
  • 00:30:32
    standard for what you would want to see
  • 00:30:34
    um clean input output pairs great ground
  • 00:30:37
    truth you know how to compare these
  • 00:30:38
    things and how to measure them sometimes
  • 00:30:39
    the thing we're measuring ourselves
  • 00:30:40
    against is we like ship an example
  • 00:30:43
    output to some team somewhere and we're
  • 00:30:44
    like you're the experts on this domain
  • 00:30:45
    did we do a good job or not they ship it
  • 00:30:47
    back and like trying to evaluate based
  • 00:30:48
    on that and so the the mess wrangling
  • 00:30:51
    the mess is hard um we have seen
  • 00:30:54
    successful examples of using it that
  • 00:30:57
    that technique is called llm as judge
  • 00:30:59
    where you you have an llm evaluate how
  • 00:31:01
    you're doing it's good for like very
  • 00:31:03
    subjective things if you're generating
  • 00:31:04
    free form text and you're like does this
  • 00:31:06
    seem like it answered my question that's
  • 00:31:08
    like a task for an llm in this case we
  • 00:31:10
    were able to circumvent that I think in
  • 00:31:13
    every case uh we do some like
  • 00:31:15
    deterministic fuzzy stuff where we're
  • 00:31:17
    like does this name almost match that
  • 00:31:19
    name if so we're good um uh and so there
  • 00:31:22
    is some like deep Logic for like trying
  • 00:31:26
    to score ourselves uh in a way way
  • 00:31:28
    that's not not as straightforward is
  • 00:31:29
    just like does this thing equal that
  • 00:31:31
    thing um um but we've seen sort of
  • 00:31:34
    everything and at some point you do need
  • 00:31:36
    to sort of stop like looking for the
  • 00:31:38
    perfect thing and find something
  • 00:31:39
    directionally useful um we've had
  • 00:31:41
    projects where like you have a workflow
  • 00:31:43
    diagram this pop this this complicated
  • 00:31:45
    and the only thing we're able to measure
  • 00:31:46
    is like what's going on down here um
  • 00:31:48
    because it's like the only place where
  • 00:31:49
    you can design clean evals and then
  • 00:31:52
    you just sort of put up with that and
  • 00:31:54
    and do the best you can
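
As a simplified illustration of the "does this name almost match that name" scoring mentioned above (assumed logic, not the real eval code):

```python
# Simplified fuzzy comparison in the spirit of "does this name almost match":
# normalize casing/punctuation, then fall back to a similarity ratio.
from difflib import SequenceMatcher

def names_match(generated: str, expected: str, threshold: float = 0.85) -> bool:
    norm = lambda s: s.lower().replace("_", "").replace("-", "").strip()
    a, b = norm(generated), norm(expected)
    if a == b:
        return True   # e.g. "Transactions" vs "transactions"
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(names_match("Transactions", "transactions"))  # True
```
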
  • 00:32:17
    AG
  • 00:32:40
    so it's so what does the output look
  • 00:32:43
    like is actually very critical to what I
  • 00:32:45
    think made this possible here um so
  • 00:32:47
    we've actually built uh sort of uh AI
  • 00:32:51
    powered integration Builders multiple
  • 00:32:53
    times um this is this is one of them for
  • 00:32:55
    airite I think one amazing asset that
  • 00:32:58
    airb has here is they have this format
  • 00:33:01
    that they call their their well I don't
  • 00:33:03
    know what you call it your low low code
  • 00:33:04
    cdk format your your this this spec for
  • 00:33:07
    how to define an API integration as
  • 00:33:09
    configuration instead of as code big
  • 00:33:12
    file that describ and in fact in our
  • 00:33:14
    pipeline we never have an llm write this
  • 00:33:18
    thing as output we write this as output
  • 00:33:20
    deterministically using code and we use
  • 00:33:22
    the llm to answer specific questions we
  • 00:33:24
    have about this process so we ask it
  • 00:33:27
    picking off authentication method for me
  • 00:33:28
    and then we use that to
  • 00:33:29
    deterministically generate the
  • 00:33:30
    authentication part of this that's part
  • 00:33:32
    of what makes this an approachable
  • 00:33:34
    problem we've built this before uh where
  • 00:33:37
    the end goal is to write code performs
  • 00:33:40
    way worse um and even in that process we
  • 00:33:43
    have uh under the hood we have an
  • 00:33:46
    intermediate format that is not I mean
  • 00:33:49
    it's like conceptually similar to this
  • 00:33:52
    that we're using to sort of constrain
  • 00:33:53
    the problem so much of the trick with
  • 00:33:55
    these LLMs is constraining the domain in
  • 00:33:56
    which they're thinking right if if you
  • 00:33:58
    say write me some code you're going to
  • 00:34:00
    get something code shaped as output
  • 00:34:01
    whether it's good nobody knows um if you
  • 00:34:04
    ask it for a very specific constrained
  • 00:34:06
    answer where it's only allowed to answer
  • 00:34:08
    within a very specific Universe it's
  • 00:34:09
    much more tunable it's going to perform
  • 00:34:10
    a lot better just kind of made that
  • 00:34:12
    possible yeah I mean
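
A sketch of the pattern Eddie is describing: ask the model one narrowly constrained question (pick an auth method from a fixed menu), then assemble the config deterministically in code. The prompt, option list, and output fields are illustrative assumptions, not the real AI Assist pipeline or manifest schema.

```python
# Illustrative only: constrain the LLM to a fixed set of answers, then build the
# config in ordinary code. Not the actual AI Assist pipeline or manifest schema.
from openai import OpenAI

AUTH_OPTIONS = ["api_key", "bearer_token", "basic_http", "oauth2", "no_auth"]

def pick_auth_method(doc_excerpt: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Which auth method does this API use? "
                f"Answer with exactly one of {AUTH_OPTIONS}.\n\n{doc_excerpt}"
            ),
        }],
    )
    answer = resp.choices[0].message.content.strip()
    return answer if answer in AUTH_OPTIONS else "no_auth"  # reject anything off-menu

def build_auth_config(method: str) -> dict:
    # Deterministic assembly: the LLM never writes this structure itself.
    return {
        "bearer_token": {"type": "BearerAuthenticator"},
        "api_key": {"type": "ApiKeyAuthenticator"},
    }.get(method, {"type": "NoAuth"})
```
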
  • 00:34:25
    I'm I can take that
  • 00:34:29
    so to clarify the last two questions
  • 00:34:33
    it's I think it's both relevant to evals
  • 00:34:35
    ands to outputs uh the way we eval is we
  • 00:34:39
    compare what the model gives us with
  • 00:34:41
    what we have in connectors we know is
  • 00:34:43
    good it's not always one to one because
  • 00:34:46
    for example if you have a stream that's
  • 00:34:48
    called capital T transactions is it
  • 00:34:51
    still the same or like is it if if the
  • 00:34:54
    wording is slightly different but the
  • 00:34:55
    scheme is very similar if the schemas
  • 00:34:58
    are compatible but the columns are not
  • 00:34:59
    the same is it is it a match is it not
  • 00:35:01
    match like that that kind of stuff the
  • 00:35:03
    output is uh are pieces of the Manifest
  • 00:35:07
    and the AI Builder thing like we have a
  • 00:35:10
    python library that enforces the format
  • 00:35:14
    of the Manifest essentially think
  • 00:35:16
    kubernetes resource definitions right
  • 00:35:18
    there are fields that are required they
  • 00:35:20
    can be only of certain format so Builder
  • 00:35:23
    before outputting that as a suggestion
  • 00:35:26
    validates that it's
  • 00:35:28
    legit and then one use case is sure
  • 00:35:32
    right just a co-pilot thing in Builder
  • 00:35:35
    itself um what we see is the match
  • 00:35:38
    success rate like we see successful good
  • 00:35:41
    suggestions very very often like it's
  • 00:35:43
    probably north of 90% on each particular
  • 00:35:46
    field today but the thing is there's a
  • 00:35:48
    bunch of fields and those probabilities
  • 00:35:50
    multiply so the probability that you get
  • 00:35:53
    full connector end to end correctly is
  • 00:35:57
    you slightly lower but we're getting
  • 00:35:59
    there this use case is okay let's get a
  • 00:36:02
    lot of connectors let's make new
  • 00:36:03
    connectors Let's help people make
  • 00:36:06
    connectors for themselves and then share
  • 00:36:07
    them with our community but also I have
  • 00:36:11
    450 connectors and like more than 250 of
  • 00:36:14
    them are in that format so the whole
  • 00:36:16
    connector is just a big manifest file
  • 00:36:18
    and what I can do is I already have a CI
  • 00:36:20
    pipeline that runs every week and you
  • 00:36:23
    see there's this thing called version
  • 00:36:24
    right like this is the version of the
  • 00:36:26
    framework that it's using
  • 00:36:28
    and my CI pipeline checks hey do I have
  • 00:36:30
    a newer version of the framework and if
  • 00:36:32
    I do I'm going to update all of my
  • 00:36:35
    manifest as long as it's not breaking
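
A rough sketch of the weekly version-bump check being described; the directory layout, manifest keys, and version rule are assumptions for illustration only.

```python
# Hypothetical sketch of the weekly CI job: bump each manifest's framework version
# when a newer, non-breaking release exists. Paths and keys are assumptions.
from pathlib import Path
import yaml  # pip install pyyaml

LATEST_CDK = "6.5.0"  # in real CI this would come from the framework's release feed

def is_breaking(old: str, new: str) -> bool:
    return old.split(".")[0] != new.split(".")[0]  # treat major bumps as breaking

for manifest_path in Path("connectors").glob("*/manifest.yaml"):
    manifest = yaml.safe_load(manifest_path.read_text())
    current = str(manifest.get("version", "0.0.0"))
    if current != LATEST_CDK and not is_breaking(current, LATEST_CDK):
        manifest["version"] = LATEST_CDK
        manifest_path.write_text(yaml.safe_dump(manifest, sort_keys=False))
```
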
  • 00:36:37
    another thing we could do basically on
  • 00:36:39
    CI uh or regularly is uh create another
  • 00:36:42
    endpoint in our AI assist thing and have
  • 00:36:46
    another flow where we say hey here's the
  • 00:36:49
    name of the connector here's the API
  • 00:36:51
    docs here's the existing manifest do you
  • 00:36:54
    think there may be some new streams that
  • 00:36:56
    we don't have
  • 00:36:59
    and like these or you know like maybe
  • 00:37:01
    there's a new authentication method
  • 00:37:03
    maybe there are some deprecations that
  • 00:37:04
    we want to clean up today the way this
  • 00:37:07
    Today the way this works is: a connector fails for someone, a stream doesn't work anymore, somebody files a GitHub issue, and we say, well, we're open source, you're very welcome to contribute. They contribute, we run regression tests to verify nothing is broken, then we merge. When we had just the Python framework that took months; now it takes days. But if I can automate this, cool. So thank you for the suggestion.
  • 00:37:34
    Should I... okay, I'll do it.
  • 00:37:37
    Oh, thanks. All right, I want to pull a little more on a thread that Samantha brought up, which is that you can envision a future where an agent, or something like it, is doing this. Since the GPT era started it seems like there's always something new and exciting that people are talking about: it was RAG, agents, graph RAG, countless things. A year from now, do you feel any of these will continue to be just as pertinent a part of the conversation, or do you think something new will be the dominant point of discussion? And if you do think it will be something new, what is that thing?
  • 00:38:21
    I'm less of an AI futurist and more of an AI-today practitioner. But when people talk about agents, for example, I think there are multiple things they might mean. One thing they might mean is: build something that has a lot of autonomy around what it can do. You give it a bunch of tools and you let it decide; it's less of this deterministic "we do this, then we do this, then we do this", and you give it access to whatever it wants. I've yet to see anything like that come to fruition in practice for a significant system. I could see that changing over time, but right now it seems very theoretical to me, and it may happen if it gets driven by a big boost in what foundation models are capable of.
  • 00:39:13
    But I think the more interesting today-thing for agents, what people tend to mean, is less about autonomy and more about specialization: how do you break your problem down into specific components that are each in charge of a very small subdomain and are experts in that subdomain? That, I think, is going to get even more common. People are realizing, first, the complexity of these projects in practice (what looks at a high level like "hey ChatGPT, give me a connector" looks more like this under the hood), and second, that so much of the mystery of building with LLMs is actually just software engineering under the hood. I think that is going to drive more adoption of that type of agent system, and we're seeing more and more of it. We're talking here about a very tech-forward company and use case, but we also see hundred-year-old heavy-equipment manufacturers talking about these workflows, the kind you might call agentic, in a very realistic way that I think will be in production within the next year at companies like that. So I see that part of it being very real over the next year.
  • 00:40:25
    Having jumped so deep into building this thing, my horizon for thinking about AI a year from now is very, very short. My personal biggest thing is: we have manifest connectors, but we also have Python connectors and Java connectors, and we also have bugs in those. So my biggest dreams are around those software-programming agents, which can be as simple as a little bash script that says: hey, here's a GitHub issue, here's the bug report, here are the logs, here's the directory with all of the source files, here's the script that builds and tests the connector, and here's the bug output. Can you fix it? Then the script applies the changes proposed by the model, runs the tests, and if they fail it says "yeah, that didn't work, try again", in a while loop, just until it wraps up. This is my next hobby project, I think, after this thing is successful. As for what that means for other industries, and for programmers and businesses that build with AI, Eddie is the boss there.
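That "little bash script" idea translates roughly into the loop sketched below, written in Python for consistency with the other examples. The helpers are stand-ins (the model call and patching logic are left unimplemented); this is not an existing Airbyte tool.

```python
"""Sketch of the propose-patch, run-tests, retry loop described above."""
import subprocess

MAX_ATTEMPTS = 5

def ask_model_for_patch(issue_text: str, logs: str, feedback: str) -> str:
    raise NotImplementedError("call your LLM of choice with the issue, logs, and sources")

def apply_patch(patch: str) -> None:
    raise NotImplementedError("write the model's proposed file changes to the working tree")

def run_build_and_tests() -> tuple[bool, str]:
    result = subprocess.run(["./build_and_test.sh"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def fix_bug(issue_text: str, logs: str) -> bool:
    feedback = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        patch = ask_model_for_patch(issue_text, logs, feedback)
        apply_patch(patch)
        passed, output = run_build_and_tests()
        if passed:
            print(f"fixed on attempt {attempt}")
            return True
        feedback = f"That didn't work, the tests failed:\n{output}\nTry again."
    return False
```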
  • 00:41:44
    [Audience question, inaudible: whether there is a final, up-to-date version of the architecture diagram.]
    I don't know that we have a final one fully put together. I don't think there is a full, final one, and there's very little of the framework code under the hood in it; there's some, but it's not substantial.
  • 00:42:10
    So it's kind of not representative of the final thing? No, that's okay. Maybe this is close. The biggest place where I think the diagram diverged is around the crawling of the docs: we don't do an upfront crawling step at all, and so it stops looking like that.
  • 00:42:27
    I guess the other big change is that at that point we envisioned a URL to the API docs as input and a connector as output: one shot, build the whole thing all at once. Where it ended up going is that that is what the initial experience looks like in the UI, but there are lots of little buttons you can push to fill in fields here and there. So the flow is much more decomposed into a bunch of different endpoints and smaller workflows that leverage some shared pieces under the hood. It's not exactly one left-to-right, end-to-end thing; it's more like twelve end-to-end things that share some common stuff.
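One way to picture "a dozen small end-to-end flows with shared pieces" is the sketch below; the function names are invented for illustration and are not the real AI Assist endpoints.

```python
# Each small flow answers one question end to end, but they all share the same
# retrieval and model plumbing instead of one monolithic "build the connector" call.
def find_relevant_docs(api_name: str, topic: str) -> str:
    return f"(snippets of {api_name} docs about {topic})"   # shared retrieval step, stubbed

def ask_llm(prompt: str) -> str:
    return f"(model answer to: {prompt[:40]}...)"            # shared model call, stubbed

def suggest_auth(api_name: str) -> str:
    docs = find_relevant_docs(api_name, "authentication")
    return ask_llm(f"Given these docs, which auth method applies?\n{docs}")

def suggest_pagination(api_name: str) -> str:
    docs = find_relevant_docs(api_name, "pagination")
    return ask_llm(f"Given these docs, how is pagination handled?\n{docs}")

def build_connector(api_name: str) -> dict:
    # The "one-shot" UI experience is just these smaller flows composed together;
    # each one is also reachable on its own behind a "fill in this field" button.
    return {"auth": suggest_auth(api_name), "pagination": suggest_pagination(api_name)}

print(build_connector("ExampleAPI"))
```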
  • 00:43:15
    [Audience question, inaudible: what inputs does the system take?]
    Yes, so there are actually two inputs. You can give us an OpenAPI spec as input; for those who don't know, an OpenAPI spec is a common standard format you can use to describe an API. It's optional, but if you give it to us we'll use it. We also have our own curated repo of common APIs that are out there and their specs, which we sometimes use. Other supplemental information is all stuff living on the web: Google searching, crawling. Anything else? Yeah, I think that's all the supplemental stuff.
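For readers unfamiliar with the format, a bare-bones OpenAPI document looks roughly like this, shown as a Python dict for consistency with the other examples (a real spec is usually YAML or JSON and far more detailed):

```python
# Minimal OpenAPI 3.0 document describing a single GET endpoint.
minimal_openapi_spec = {
    "openapi": "3.0.3",
    "info": {"title": "Example API", "version": "1.0.0"},
    "paths": {
        "/v1/transactions": {
            "get": {
                "summary": "List transactions",
                "responses": {"200": {"description": "A page of transactions"}},
            }
        }
    },
}
```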
  • 00:44:04
    [Audience, partly inaudible] I wonder, for some stages that are disconnected and separated by an artifact, have you tried to combine them? ... but sometimes it makes sense. And a following question, if you have time...
  • 00:44:47
    So I think the first part was around: instead of treating these different alternative steps for finding information as fallbacks to one another, can you do them in parallel and then try to combine the information? Is that right?
  • 00:45:09
    [Audience] ... steps that are somewhere in here, in a sequence. So have you tried to reconcile them in a single step, with, let's call it, an agentic application or an agentic step in which you do both tasks? You can, right?
  • 00:45:28
    So the two tasks here are basically: go out and find the relevant information for a question like authentication, and then...
    [Audience, partly inaudible] Imagine there are two simple tasks that you have separated by an artifact: you generate one, and instead you...
  • 00:45:57
    Yeah, I think it actually often starts the opposite way: we start with a larger problem, like "build this whole thing", and realize it needs to be broken down into subcomponents. It's possible that has happened somewhere in the details I'm less familiar with; no use case is jumping to mind, but I think the tactic makes sense to me.
  • 00:46:24
    Yeah, in practice one area where we've had to break things down is deeply nested questions, where we might ask the LLM: which of these authentication methods is used? If it's this one I need this information, if it's that one I need that information. When you ask those deeply nested questions it sort of falls off, gets lazy, and stops following the instructions, so we've had to chop it up into the sub-pieces. That's a little bit the opposite of the flow you're describing, but I could see, if we'd started out with the multi-step version, wondering whether we could do it all at once, which does save you on latency and cost.
  • 00:47:03
    [Audience, partly inaudible] ... it's easier once you have that, to at least try to reconcile something. For example, when I started doing this, I start basically from functions, and I automate each function with an agent step, then another agent step, and then I link them together. But then I say: okay, maybe these two can be reconciled into a single step, instead of having a separate step agent to maintain. It doesn't have to make sense for everything; if you end up with a single blob of an agent that performs everything, it's not going to work, which is what you were saying at the very beginning.
  • 00:47:52
    at the very begin yeah we've seen I
  • 00:47:55
    think this is only tangentially Rel
  • 00:47:56
    related to what you're asking but we
  • 00:47:57
    have seen on another another project so
  • 00:48:00
    it looks pretty different to this but
  • 00:48:02
    it's fundamentally basically like it's a
  • 00:48:04
    Content moderation projects it's for a a
  • 00:48:06
    company called change.org where they
  • 00:48:08
    they have like a petition uh platform
  • 00:48:11
    where people can can post petitions
  • 00:48:13
    about you know political things and
  • 00:48:15
    local things and stuff like that um and
  • 00:48:17
    they have kind of a challenging content
  • 00:48:20
    moderation problem because it's not as
  • 00:48:21
    simple as saying like did someone just
  • 00:48:24
    post spam or did someone just post hate
  • 00:48:25
    speech it's actually like a valid use of
  • 00:48:27
    their platform to say something like
  • 00:48:29
    somewhat inflammatory but like it can't
  • 00:48:31
    cross the lines of of their Community
  • 00:48:33
    guidelines and so um getting uh these
  • 00:48:37
    agents to sort of understand the
  • 00:48:38
    different nuances of like what does it
  • 00:48:40
    mean to to um to violate our policies is
  • 00:48:44
    is challenging and under the hood what
  • 00:48:46
    we do is we have these sort of
  • 00:48:47
    specialist agents that do look at this
  • 00:48:49
    through different lenses they write out
  • 00:48:51
    their sort of reasoning their Chain of
  • 00:48:53
    Thought they give us confidence scores
  • 00:48:54
    at the end and then we take a bunch of
  • 00:48:56
    these different answers together at the
  • 00:48:57
    end and we give it to one bigger process
  • 00:48:59
    that's like all right now that you
  • 00:49:00
    understand all the Nuance of these
  • 00:49:01
    different angles make a final decision
  • 00:49:03
    and it's sort of combining um these
  • 00:49:05
    different sort of Sub sub viewpoints if
  • 00:49:06
    that makes sense it's not exactly what
  • 00:49:08
    you were talking about but it's a sort
  • 00:49:09
    similar idea um on the 01 question uh he
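A stripped-down sketch of that "specialists plus a final decision" pattern. The lens names, prompts, and the unimplemented model call are invented for illustration; this is not the actual change.org moderation system.

```python
"""Sketch: several specialist reviews, each through its own lens, then one final
decision call that sees all of their reasoning and confidence labels."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("call your model here")

LENSES = {
    "hate_speech": "Does this petition contain hate speech? Think step by step.",
    "spam": "Is this petition spam or commercial solicitation? Think step by step.",
    "harassment": "Does this petition target a private individual? Think step by step.",
}

def moderate(petition_text: str) -> str:
    specialist_reports = []
    for lens, question in LENSES.items():
        report = call_llm(
            f"{question}\n\nPetition:\n{petition_text}\n\n"
            "End with a confidence label: low, medium, high, or very high."
        )
        specialist_reports.append(f"[{lens}]\n{report}")
    # The final pass sees every specialist's chain of thought and confidence label.
    return call_llm(
        "You are making the final moderation decision. Specialist reviews:\n\n"
        + "\n\n".join(specialist_reports)
        + "\n\nDecide: allow, remove, or escalate to a human. Explain briefly."
    )
```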
  • 00:49:13
    On the o1 question (whether we had tried o1 at any point): we have. The biggest drawback with o1 is that it's slow, and this is already too latency-sensitive an application. It takes a while to generate a connector, there are a lot of substeps, and if you added 20 seconds to one of the prompts it would probably be a nonstarter, especially given that the bottleneck here is less the AI's intelligence and more our ability to give the AI the right information at the right time.
  • 00:49:42
    We're getting to that point where we have a lot of pizza that people still need to eat, so I want to start putting the bows on the present here and just confirm: is there anything else you all wanted to share with the audience that we haven't had a chance to talk about? I'll also give the opportunity for any burning final questions, so feel free to get those in. But I know there are slides and a lot of things you all might want to show. Anything you wanted to toss out? Anything else?
  • 00:50:17
    [Audience question, largely inaudible; from the answer, it concerns using LLMs to evaluate candidates in a hiring workflow.]
  • 00:51:24
    So I guess I'll start by saying this domain sounds very hard. The thing that makes me say that is that hiring sounds hard, and we struggle to train humans to do it today. If I struggle to picture how to get a pretty junior person to figure out how to reliably produce this output, then I also struggle to see how to get an LLM to do it. The analogy that jumps to mind, though, is that this kind of problem is present for AI phone-agent applications. There are a lot of people trying to put AI agents on the phone, and they have to be robust in the face of "people can say anything". It's hard to build a customer support bot for an airline if you're afraid it's going to just give someone a free ticket because they say "ignore previous instructions". I don't get the sense that anyone has figured this out super well.
  • 00:52:26
    The tactic used there is a sort of hybrid between what you picture for a phone tree (press one if you're a good candidate) and still leveraging the LLM's ability to handle inputs it has never seen before. So it tends to look like a state machine: you have different states the agent can be in, and at each point it's trying to assess very specific, narrow things, but the way it decides to move from state to state is based on LLM logic, logic described in English, not a very deterministic sort of thing.
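A compact sketch of that hybrid: deterministic states, with the LLM (stubbed out here) answering one narrow question per state and thereby picking the transition. The states and prompts are invented for illustration.

```python
"""Sketch of a phone-agent state machine where the model only ever answers one
narrow question per state, and its answer selects the next state."""

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your model here")

# Each state maps to (narrow question for the model, answer -> next state).
STATES = {
    "greet": ("Did the caller state why they are calling? Answer yes or no.",
              {"yes": "triage", "no": "greet"}),
    "triage": ("Is this a billing issue or a technical issue? Answer billing or technical.",
               {"billing": "billing_flow", "technical": "tech_flow"}),
}
TERMINAL_STATES = {"billing_flow", "tech_flow"}

def next_state(state: str, transcript_so_far: str) -> str:
    question, transitions = STATES[state]
    answer = ask_llm(f"{question}\n\nCall so far:\n{transcript_so_far}").strip().lower()
    # If the model answers off-script, stay in the current state rather than guessing.
    return transitions.get(answer, state)

def run_call(caller_turns: list[str]) -> str:
    state, transcript = "greet", ""
    for turn in caller_turns:
        transcript += turn + "\n"
        state = next_state(state, transcript)
        if state in TERMINAL_STATES:
            break
    return state
```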
  • 00:53:04
    And then I would still take the approach of building evals based on old transcripts of calls that have gone off the rails, and measuring yourself against known bad cases. Getting to perfection on this sounds pretty challenging. Also, getting LLMs to state how confident they are in something is its own sub-problem, so you may eventually be able to get this to a point where it can tell you when it doesn't know, but tuning that is also going to be challenging, because they tend to overstate their confidence.
  • 00:54:00
    [Audience follow-up, inaudible: about having the model report its own confidence.]
    Yes, but the interpretation and tuning is a real challenge. A lot of our projects have steps in the middle of the workflow where we ask for an evaluation of the form: think out loud, then come up with your answer, and then tell us how confident you are in your answer. It's usually not a number; it's usually low, medium, high, or very high. And then you don't just trust what that means, you measure it against your evals: is this predictive of anything? It may turn out that "very high" means "maybe possibly correct", and so you only filter down to the "very high" ones.
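In code, the "don't trust the label, measure it" step can be as simple as bucketing eval outcomes by the model's self-reported confidence. The numbers below are made up for illustration.

```python
# Given eval records of (self-reported confidence, was the answer actually correct),
# check whether the labels predict anything before using them to filter.
from collections import defaultdict

eval_records = [  # illustrative data only
    ("very high", True), ("very high", True), ("very high", False),
    ("high", True), ("high", False), ("medium", False), ("low", False),
]

outcomes_by_label: dict[str, list[bool]] = defaultdict(list)
for label, correct in eval_records:
    outcomes_by_label[label].append(correct)

for label in ("very high", "high", "medium", "low"):
    outcomes = outcomes_by_label[label]
    if outcomes:
        rate = sum(outcomes) / len(outcomes)
        print(f"{label:>9}: {rate:.0%} correct over {len(outcomes)} cases")
# If "very high" only means roughly two-thirds correct, keep just those and still verify them.
```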
  • 00:54:35
    [Audience, partly inaudible] So, talking about things that have worked out really well, do you have any? To give you one example, I found that if I put some example inputs and some perfect outputs into the context, then it spits out results that are similar...
  • 00:55:01
    The examples thing does work; showing it examples usually gets us back on the rails. I'm sure you've seen all the trendy little tricks: offer it a big tip, put a bunch of exclamation points in there, threaten to fire it if it's not going to do a good job. Those things may give you lift, but it's going to be challenging to know if you don't measure it.
  • 00:55:23
    More often in practice it's about finding specific cases where you did poorly and baking them into your prompt: trying a wide variety of things, noticing that it's off on a particular case, and then describing that case to it. I don't have a handy list of tricks, and I bet if I polled the folks on our team, everybody would have a different favorite bag of tricks.
  • 00:55:57
    That's also a danger on these AI projects: it's easy to fall into, because it's a really good nerd-snipe machine, right? You can be like, "I'm pretty sure tipping is going to be the great thing to try on this project", and the evals help keep you on task there. The set of tactics is out there; you can Google for people's long lists of tactics. One random thing we've had good success with is Anthropic's prompt generator: you can just paste in your current prompt and it will rewrite it. We've had surprising results where visually it doesn't look any better (we're like, "that's kind of what I already said in my prompt") and then the metrics just go up. But it's not one weird trick; it's: try lots of things and measure your progress.
  • 00:56:41
    All right, thank you everybody for coming tonight. We're super excited that you made the time to be with us. A quick round of applause for Eddie and...
  • 00:56:57
    The office is going to be open for the next 20 minutes or so, and like I said there's lots of pizza to eat and there are still drinks as well, so go enjoy. Pester Eddie and the team with any further questions you maybe didn't get a chance to ask now; they are going to be around, and if they weren't planning to be, now they are. But again, thank you for being here, I hope you had a great time, and let's keep partying.
Tags
  • Airbyte
  • AI Assist
  • API connectors
  • evaluations (evals)
  • AI development
  • data platform
  • AI integration
  • co-pilot tooling
  • automation
  • agent workflows