Samuel Mueller | "PFNs: Use neural networks for 100x faster Bayesian predictions"

00:51:04
https://www.youtube.com/watch?v=XnngBWe2WYE

Summary

TLDR: Samuel presented how transformers can be used for Bayesian inference in his work on automated machine learning. He explained how the model is trained on data sampled from a prior to optimize predictions, and how the method lets transformers quickly produce accurate results on unseen data. Experiments were also presented comparing the performance of the proposed method with traditional Bayesian approaches, showing that transformers can deliver comparable or better results with considerably less compute time. The discussion covers future research directions and possibilities for improving the method.

Takeaways

  • 📊 Overall overview of Bayesian inference with transformers.
  • ⚡ Fast and accurate predictions on unseen data.
  • 🧠 Meta-learning trains on synthetic datasets sampled from a prior.
  • 🔍 Comparison between different machine-learning methods.
  • ⏳ The training process required only one day of compute.
  • 📈 Direction towards better generalization with larger datasets.
  • 📚 The importance of using appropriate priors.
  • 💻 Potential for efficient automation of machine learning.
  • 🧩 Interdisciplinary opportunities for future research.

Timeline

  • 00:00:00 - 00:05:00

    Introduction to today's presentation, in which Samuel presents his work on transformer networks in the context of Bayesian inference and automated machine learning. The goal is to make fast and accurate predictions for unseen problems using standard transformer models.

  • 00:05:00 - 00:10:00

    Samuel describes the difference between supervised learning and Bayesian supervised learning, with a focus on calibration and interpretability. He introduces the idea of 'meta-learning', where the model learns to learn from multiple datasets.

  • 00:10:00 - 00:15:00

    Explanation of meta-learning, where an algorithm is trained on several datasets (meta-training) and then applied to a new, unseen dataset (meta-testing). The goal is to learn to classify objects quickly by drawing on earlier, similar datasets.

  • 00:15:00 - 00:20:00

    Samuel explains Bayesian inference, where a relationship between input and output is established via a latent variable. He shows how Bayes' rule is used to obtain the posterior predictive distribution.

  • 00:20:00 - 00:25:00

    The presentation now turns to how standard transformer models can be adapted to perform Bayesian inference by approximating the posterior predictive distribution directly, rather than applying traditional Bayesian methods.

  • 00:25:00 - 00:30:00

    Examples of training a network on data generated from a given prior, which makes it possible to create arbitrarily many datasets for training. This allows the transformer to learn quickly and efficiently from these datasets.

  • 00:30:00 - 00:35:00

    Discussion of the importance of permutation invariance and how the transformer can be designed so that the training set aggregates information about itself. Samuel stresses the importance of restricting the held-out points' attention to the data in the training set.

  • 00:35:00 - 00:40:00

    Samuel presents a concrete example of how the data is assembled and how the method evaluates additional observations. He discusses the importance of an accurate representation of the output distribution for better inference.

  • 00:40:00 - 00:45:00

    Presentation of the test results, in which different models are evaluated against each other. Samuel compares transformer-based methods with other standard methods, such as a GP with maximum-likelihood hyperparameter estimation.

  • 00:45:00 - 00:51:04

    Final discussion of the potential of using transformer models to automate machine learning, with a focus on real-time evaluation and how the methods can be scaled to larger datasets in the future.



Video Q&A

  • What is the main focus of Samuel's presentation?

    The presentation focuses on how transformers can be adapted to perform Bayesian inference for supervised learning tasks.

  • What are the key aspects of the transformer model discussed?

    Key aspects include the training process, the relationship between meta-learning and Bayesian inference, and the advantages of using transformers for fast and accurate predictions.

  • What is few-shot learning as mentioned in the talk?

    Few-shot learning refers to the process where an algorithm learns to classify unseen data based on prior knowledge from multiple similar datasets.

  • How does Bayesian inference relate to transformers?

    The transformer is trained to directly approximate the Bayesian posterior predictive distribution, which yields calibrated, well-generalizing predictions on unseen data.

  • What experiments were conducted to validate the approach?

    Experiments included training on Gaussian priors and comparing performance against traditional Bayesian approaches, showing that transformers can achieve comparable or better results.

  • What is the significance of using priors in this model?

    Priors allow the generation of datasets for training and help in approximating true posterior distributions for accurate predictions.

  • How long was the transformer trained?

    The transformer was trained for roughly one day on a single GPU.

  • What are the future directions suggested in the talk?

    Future directions include exploring larger datasets, improving prior generations, and enhancing interpretability.

  • Is this method effective for larger datasets?

    The method generalizes to unseen data, but the TabPFN experiments focused on small datasets (up to about 1,000 examples); scaling to larger datasets is named as future work.

  • What does the presentation suggest about the speed of transformer predictions?

    The model can produce predictions in a fraction of the time compared to traditional Bayesian methods.


Transcript
  • 00:00:02
    okay hello everyone um
  • 00:00:05
    today is the last session before our
  • 00:00:06
    summer break um
  • 00:00:08
    and that's my great pleasure to
  • 00:00:09
    introduce our today speaker samuel he's
  • 00:00:11
    a phd student at the university of freiburg
  • 00:00:14
    um yeah working on
  • 00:00:16
    transformers and you know transformers
  • 00:00:18
    foundation models uh also in the
  • 00:00:20
    context of automated machine learning
  • 00:00:22
    so yeah with that the floor is yours
  • 00:00:26
    from here thank you very much erin um
  • 00:00:29
    yeah and that's also what i would like
  • 00:00:31
    to talk about today so
  • 00:00:33
    that is how we can get our transformers
  • 00:00:37
    uh pretty standard transformers um to do
  • 00:00:41
    bayesian inference for us
  • 00:00:43
    on any new problem that we never saw
  • 00:00:45
    before during inference
  • 00:00:48
    um so i will just first outline what
  • 00:00:52
    what kind of the problem we work on here
  • 00:00:55
    just so that it's it's more clear
  • 00:00:58
    what what the rest of this talk will be
  • 00:00:59
    about so what we care about here is like
  • 00:01:01
    supervised learning and specifically
  • 00:01:04
    bayesian supervised learning so
  • 00:01:06
    uh we have a label data set uh the
  • 00:01:09
    training set and we want to predict
  • 00:01:11
    labels for unlabeled validation set
  • 00:01:14
    and we care especially about calibration
  • 00:01:17
    here interpretability
  • 00:01:19
    but also of course we want to be
  • 00:01:20
    accurate and lastly something that so
  • 00:01:24
    far fell a little bit under the table in
  • 00:01:26
    many bayesian works uh we want to be fast
  • 00:01:31
    so
  • 00:01:32
    from here on i will start with a little
  • 00:01:34
    bit of background first a little bit of
  • 00:01:36
    few-shot learning um and then a
  • 00:01:38
    little bit of bayesian learning but it
  • 00:01:41
    should be easy i guess
  • 00:01:44
    so first few-shot learning which is also
  • 00:01:46
    called meta learning sometimes
  • 00:01:49
    i don't know probably many of you know
  • 00:01:51
    it already i'll still explain it very
  • 00:01:53
    quickly
  • 00:01:54
    um
  • 00:01:55
    the idea is that we
  • 00:01:57
    learn
  • 00:01:58
    on one level higher so
  • 00:02:00
    we're not given just a data set of
  • 00:02:02
    images with labels but we're given a
  • 00:02:05
    data set
  • 00:02:07
    or a set of data sets
  • 00:02:09
    so we're given many many different data
  • 00:02:12
    sets
  • 00:02:13
    um with classification tasks in this
  • 00:02:15
    case so here
  • 00:02:17
    we're given for example a cat versus
  • 00:02:19
    bird classification
  • 00:02:21
    data set or a flowers versus bikes
  • 00:02:23
    classification data set and the idea is
  • 00:02:24
    now
  • 00:02:26
    that an algorithm learns to learn or
  • 00:02:28
    learns to classify here
  • 00:02:30
    for a new
  • 00:02:32
    unseen data set and that would be here
  • 00:02:34
    the dog versus otter data set on the
  • 00:02:36
    right never seen during training and the
  • 00:02:39
    idea now is that it learned
  • 00:02:41
    how to classify objects and can do so
  • 00:02:44
    really fast now that's the few-shot part
  • 00:02:47
    we can classify objects of course
  • 00:02:48
    already with standard neural networks
  • 00:02:50
    but the idea is that you learn to do it
  • 00:02:53
    better by having available this set of
  • 00:02:55
    other data sets that look similar
  • 00:02:59
    and
  • 00:03:00
    we call this first phase the meta
  • 00:03:02
    training phase so the meta training is
  • 00:03:04
    on on these many data sets and then we
  • 00:03:06
    have the second phase the meta testing
  • 00:03:08
    um and that is where we tried it out and
  • 00:03:10
    never seen data sets before and we can
  • 00:03:13
    formalize this as like during meta
  • 00:03:16
    training we have a set of data sets
  • 00:03:18
    available which i call d1 and d2 and so
  • 00:03:20
    on here
  • 00:03:21
    each consisting of pairs of
  • 00:03:25
    inputs and outputs
  • 00:03:26
    and
  • 00:03:28
    then we have validation set there as
  • 00:03:30
    well uh here one validation example only
  • 00:03:33
    but it could be a set as well uh we have
  • 00:03:35
    the same for meta testing
  • 00:03:37
    um just a
  • 00:03:38
    data set as well as uh validation sets
  • 00:03:41
    and what we care about here like if we
  • 00:03:43
    look at all these what's the thing that
  • 00:03:45
    it predicted
  • 00:03:46
    and what is predicted here is only this
  • 00:03:49
    white head so so the rest of these is
  • 00:03:51
    used for training but all that is
  • 00:03:53
    predicted here is the white test that's
  • 00:03:55
    the special thing
  • 00:03:57
    we just learned from the left
  • 00:03:59
    how to predict and we predict only on
  • 00:04:01
    the right
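    A compact formalization of the few-shot setup just described; the notation is illustrative and not taken from the slides:

        % Meta-training: a collection of K labeled datasets
        \mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{n_k}, \qquad k = 1, \dots, K
        % the model f_\theta learns to predict a held-out label from the rest of the dataset
        \hat{y}_*^k = f_\theta\big(x_*^k,\; \mathcal{D}_k \setminus \{(x_*^k, y_*^k)\}\big)
        % Meta-testing: a new, unseen dataset and query; only y_* is predicted
        \hat{y}_* = f_\theta\big(x_*,\; \mathcal{D}_{\mathrm{test}}\big)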
  • 00:04:02
    is this clear how like what few-shot
  • 00:04:04
    learning is in general
  • 00:04:10
    and i guess since there's a few yeah you
  • 00:04:13
    can easily interrupt me and this won't
  • 00:04:15
    like throw me off track or
  • 00:04:17
    anything um so don't hesitate to
  • 00:04:20
    interrupt me please um
  • 00:04:23
    the other background that i guess
  • 00:04:26
    relevant is bayesian inference especially
  • 00:04:29
    for supervised learning
  • 00:04:31
    and
  • 00:04:32
    here we just generally consider a prior
  • 00:04:35
    like a prior over some latent variable i
  • 00:04:37
    call it t here but you can call it
  • 00:04:39
    whatever
  • 00:04:40
    and this t usually or this theta describes
  • 00:04:43
    your
  • 00:04:43
    your dependency between the input and
  • 00:04:46
    the output so uh for example a bayesian
  • 00:04:48
    neural network i call it bnn here
  • 00:04:51
    uh
  • 00:04:52
    describes uh or says that the output
  • 00:04:56
    should have a relationship to the input
  • 00:04:57
    such that there is a neural network
  • 00:05:00
    that's mapping the inputs to the outputs
  • 00:05:03
    but you can be more creative than that
  • 00:05:05
    in terms of what your prior is
  • 00:05:10
    um
  • 00:05:11
    and this is kind of how you given such a
  • 00:05:14
    or given such a latent and this is how
  • 00:05:16
    you traditionally solve the
  • 00:05:18
    classification problem uh or like a
  • 00:05:21
    standard supervised learning
  • 00:05:22
    classification problems so we are given
  • 00:05:23
    some data set d
  • 00:05:25
    and we want to now predict our y
  • 00:05:28
    and to do that we use uh
  • 00:05:32
    bayes rule here
  • 00:05:33
    to get the distribution or the
  • 00:05:35
    distribution over the latent given the
  • 00:05:37
    data and then we can
  • 00:05:39
    integrate that out to get the
  • 00:05:41
    distribution of the y and this is what
  • 00:05:43
    we call the
  • 00:05:44
    posterior predictive distribution and in
  • 00:05:47
    this work we will actually approximate
  • 00:05:49
    this thing here directly we will not go
  • 00:05:53
    this way we will not use bayes uh
  • 00:05:57
    formula or anything we will just do this
  • 00:05:59
    directly um and maybe you see already
  • 00:06:02
    the connection to few-shot learning
  • 00:06:04
    here as we have a distribution that
  • 00:06:06
    takes in a data set
  • 00:06:09
    similar to
  • 00:06:10
    the data sets here
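    Written out, the two routes he contrasts (standard formulas, with phi standing for the latent he calls t):

        % traditional route: Bayes' rule for the posterior over the latent ...
        p(\phi \mid D) = \frac{p(D \mid \phi)\, p(\phi)}{\int p(D \mid \phi')\, p(\phi')\, d\phi'}
        % ... then integrate the latent out to get the posterior predictive distribution (PPD)
        p(y \mid x, D) = \int p(y \mid x, \phi)\, p(\phi \mid D)\, d\phi
        % PFN route: approximate the PPD directly with a network q_theta, skipping both steps
        q_\theta(y \mid x, D) \approx p(y \mid x, D)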
  • 00:06:12
    okay so
  • 00:06:14
    i will talk about this relationship
  • 00:06:16
    between meta-learning and bayesian
  • 00:06:17
    inference and we will exploit this for
  • 00:06:21
    uh for the approximation of the
  • 00:06:23
    posteriors for gaussian processes uh
  • 00:06:25
    even with meta priors uh
  • 00:06:28
    and for bayesian neural networks and uh
  • 00:06:30
    we'll get into that a little more
  • 00:06:32
    and lastly also this is possible to use
  • 00:06:35
    for bayesian optimization even
  • 00:06:38
    okay so so what is the approach in in in
  • 00:06:41
    general and this is a little example of
  • 00:06:44
    how to train
  • 00:06:45
    a prior data fitted network that is the
  • 00:06:48
    network that we propose
  • 00:06:50
    um which is
  • 00:06:52
    which which is a transformer in the end
  • 00:06:56
    and
  • 00:06:57
    how we how we train these is
  • 00:06:59
    we we train on many different data sets
  • 00:07:02
    similar to meta learning
  • 00:07:04
    but here the data sets are generated by
  • 00:07:06
    our prior so we define a prior just like
  • 00:07:10
    the one
  • 00:07:11
    for bayesian inference
  • 00:07:13
    and we sample data from this prior so
  • 00:07:15
    you can imagine
  • 00:07:17
    uh a gp prior for example and this is
  • 00:07:19
    where this data actually comes from as
  • 00:07:21
    well so this is data that we sampled
  • 00:07:23
    from a gp prior and this way we generate
  • 00:07:26
    data sets so these are our x-y pairs
  • 00:07:28
    this is the x axis this is the y so we
  • 00:07:30
    want to label uh the the the cross here
  • 00:07:34
    uh given the blue dots
  • 00:07:36
    and
  • 00:07:37
    we want to label the y of this cross
  • 00:07:41
    and we generate data sets like this
  • 00:07:43
    artificial data sets as many as we want
  • 00:07:46
    it's usually in the millions
  • 00:07:49
    and train our transformer or our pfn to
  • 00:07:52
    predict this uh this this holdout
  • 00:07:54
    example
  • 00:07:55
    correctly so it learns to
  • 00:07:59
    it does meta learning or it does um
  • 00:08:01
    few-shot learning but on a on data
  • 00:08:04
    that comes from a prior
  • 00:08:07
    and what is so cool about this is if we
  • 00:08:09
    have now real data that comes let's say
  • 00:08:12
    this is the function what what could
  • 00:08:13
    this be the altitude of some some hilly
  • 00:08:16
    region here and we want to predict the
  • 00:08:18
    altitude for the point 0.8 here all we all
  • 00:08:21
    we have to do is
  • 00:08:23
    feed this data through to our pfn
  • 00:08:26
    and it will spit out not only like a
  • 00:08:28
    like uh like a mean expectation
  • 00:08:31
    but uh or sorry a mean prediction
  • 00:08:35
    but it will spit out a distribution over
  • 00:08:37
    what is likely to be the mean so
  • 00:08:40
    it will spit out the posterior for or
  • 00:08:44
    the posterior predictive distribution
  • 00:08:46
    for this particular prior that we chose
  • 00:08:47
    to train on
  • 00:08:53
    so actually good question here could you
  • 00:08:55
    say could you go back to this plot sorry
  • 00:08:57
    um so training and testing so
  • 00:09:00
    you test them on um
  • 00:09:03
    on the same functions that you use
  • 00:09:04
    during training but just different data
  • 00:09:07
    okay
  • 00:09:09
    no you generate um
  • 00:09:12
    so if you say testing you mean meta
  • 00:09:14
    testing i assume well you say test here
  • 00:09:16
    in your plot so you have your training
  • 00:09:18
    points and test data points so
  • 00:09:20
    um oh i am sorry okay so you mean this
  • 00:09:22
    test data point here yeah these yeah oh
  • 00:09:25
    they come from the same function yes
  • 00:09:26
    okay but this is your meta testing this
  • 00:09:28
    is your the testing for your meta
  • 00:09:30
    training basically
  • 00:09:32
    um
  • 00:09:33
    these are basically just a
  • 00:09:35
    examples that we use during training
  • 00:09:37
    and then this side i would call meta
  • 00:09:40
    testing okay
  • 00:09:42
    okay i'm sorry the vocabulary is yeah it
  • 00:09:45
    kind of can get mixed up between the
  • 00:09:46
    meta and the not-meta level okay no
  • 00:09:51
    [Music]
  • 00:09:52
    orange cross i want to say because i
  • 00:09:54
    wasn't so explicit about it these are
  • 00:09:56
    all samples from the same prior and then
  • 00:09:58
    we just randomly select what's the
  • 00:09:59
    holdout the holdout is not sampled in
  • 00:10:01
    any different way it's just like a
  • 00:10:03
    random subset of our data set that is the
  • 00:10:05
    holdout
  • 00:10:07
    uh luis
  • 00:10:11
    um yeah i think maybe you'll get to this
  • 00:10:14
    but i just wanted to quickly ask so can
  • 00:10:16
    this actually satisfy the kolmogorov
  • 00:10:18
    extension theorem
  • 00:10:21
    oh i don't know the kolmogorov
  • 00:10:22
    extension for stochastic processes um
  • 00:10:26
    like can you comment on
  • 00:10:27
    sort of the the permutation invariance
  • 00:10:31
    yeah we
  • 00:10:32
    oh okay okay yeah we have we have that
  • 00:10:35
    um i will get to that um
  • 00:10:37
    and maybe you tell me if we satisfy the
  • 00:10:39
    kolmogorov extension theorem um but we
  • 00:10:42
    have permutation invariance for sure okay
  • 00:10:44
    okay
  • 00:10:45
    i'll just wait and listen then
  • 00:10:47
    yeah it comes in like three slides or so
  • 00:10:49
    the architecture
  • 00:10:52
    um
  • 00:10:52
    [Music]
  • 00:10:54
    yeah so
  • 00:10:55
    maybe this is not easy for you because
  • 00:10:57
    the questions were already quite
  • 00:10:58
    advanced
  • 00:11:00
    but i will just say once again like
  • 00:11:02
    written out in
  • 00:11:04
    in formula
  • 00:11:06
    what what we do
  • 00:11:08
    we we sample data sets from here it's
  • 00:11:10
    called p of d we sample from our prior
  • 00:11:13
    in some way we sample these data sets so
  • 00:11:15
    this could be a gp prior or a bayesian
  • 00:11:17
    neural network prior something like
  • 00:11:19
    that we sample a lot of these data sets
  • 00:11:21
    um this uppercase k here
  • 00:11:24
    is in our experiments usually something
  • 00:11:26
    like
  • 00:11:27
    like a few million
  • 00:11:29
    um so we sample a lot of these data sets
  • 00:11:31
    because it's super cheap to sample them
  • 00:11:33
    we can just do that over and over and
  • 00:11:35
    over again
  • 00:11:36
    um and then we train our pfn with just
  • 00:11:40
    the very normal negative log likelihood
  • 00:11:42
    or cross-entropy loss on on these data
  • 00:11:45
    sets or on the held-out part of these
  • 00:11:47
    data sets
  • 00:11:49
    and what we get out there is an
  • 00:11:51
    approximation
  • 00:11:53
    to the
  • 00:11:55
    to the
  • 00:11:56
    the true
  • 00:11:57
    um posterior predictive distribution so
  • 00:11:59
    to the bayesian
  • 00:12:02
    posterior predictive distribution that
  • 00:12:04
    that one cares about if one has this
  • 00:12:05
    prior one wants this distribution um and
  • 00:12:08
    we actually get an approximation to that
  • 00:12:10
    uh on
  • 00:12:11
    data that comes from the real world so
  • 00:12:13
    not feeding in the same data again or so
  • 00:12:16
    but like unseen data that the network
  • 00:12:18
    never saw before so what we have here is
  • 00:12:20
    a network that will do
  • 00:12:22
    that we'll do a form of bayesian
  • 00:12:23
    prediction um
  • 00:12:25
    or they will do an approximation of
  • 00:12:27
    patient prediction
  • 00:12:28
    um as the forward path so you just give
  • 00:12:30
    it the data and the hold out examples
  • 00:12:32
    and it will do patient prediction for
  • 00:12:34
    you uh there's no no gradient or
  • 00:12:36
    anything involved there
  • 00:12:38
    at inference time or at this
  • 00:12:41
    meta inference time
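    A minimal sketch of the meta-training loop he just described, assuming a hypothetical sample_dataset_from_prior() and a pfn module mapping (train_x, train_y, test_x) to a predictive distribution; this is an illustration, not the authors' code:

        import torch

        def meta_train(pfn, sample_dataset_from_prior, steps=1_000_000, lr=1e-4):
            """Fit a PFN by maximum likelihood on datasets sampled from the prior."""
            opt = torch.optim.Adam(pfn.parameters(), lr=lr)
            for _ in range(steps):
                # 1) sample one synthetic dataset from the prior (e.g. a GP or a random BNN)
                #    and split it into a training part plus held-out queries
                x, y = sample_dataset_from_prior()              # x: (n, d), y: (n,)
                cut = torch.randint(1, len(x), (1,)).item()     # vary the train-set size
                train_x, train_y = x[:cut], y[:cut]
                test_x, test_y = x[cut:], y[cut:]

                # 2) one forward pass: the PFN sees the training pairs plus the query
                #    inputs and returns a predictive distribution for each query
                pred = pfn(train_x, train_y, test_x)

                # 3) plain negative log-likelihood / cross-entropy on the held-out labels
                loss = -pred.log_prob(test_y).mean()
                opt.zero_grad(); loss.backward(); opt.step()
            return pfn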
  • 00:12:43
    and yeah just very quick because this is
  • 00:12:46
    actually
  • 00:12:47
    quite easy connection um
  • 00:12:50
    so this is just the connection to to
  • 00:12:53
    bayesian inference and the connection
  • 00:12:54
    basically just boils down to
  • 00:12:57
    the loss that i just showed you
  • 00:12:59
    before or the generalization of that
  • 00:13:02
    loss over
  • 00:13:03
    over the whole expectation of these data
  • 00:13:05
    sets um
  • 00:13:07
    actually
  • 00:13:08
    is in a meaningful way
  • 00:13:10
    connected to the kl divergence
  • 00:13:13
    of our
  • 00:13:14
    posterior approximation to the real
  • 00:13:16
    posterior so it's a
  • 00:13:18
    it's the
  • 00:13:21
    it's the mean kl divergence of our
  • 00:13:24
    approximation to the real posterior uh
  • 00:13:26
    predictive distribution plus some
  • 00:13:28
    constant
  • 00:13:31
    and we can optimize this directly that's
  • 00:13:32
    the cool thing so we optimize directly
  • 00:13:35
    to become the posterior predictive
  • 00:13:37
    distribution
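    The connection in formulas (a sketch consistent with what he states; q_theta is the PFN, p the prior's true posterior predictive):

        % expected held-out NLL over datasets and queries drawn from the prior ...
        \mathbb{E}_{(x,y,D) \sim p}\big[ -\log q_\theta(y \mid x, D) \big]
            = \mathbb{E}_{(x,D) \sim p}\Big[ \mathrm{KL}\big( p(\cdot \mid x, D) \,\|\, q_\theta(\cdot \mid x, D) \big) \Big] + \text{const}
        % ... so minimizing the training loss directly minimizes the average KL to the true PPD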
  • 00:13:39
    okay and now
  • 00:13:40
    the part luis asked about i think and
  • 00:13:44
    that is what is our model and
  • 00:13:47
    our model is uh is the transformer
  • 00:13:50
    where we uh throw away
  • 00:13:52
    the
  • 00:13:53
    the uh the positional encodings and that
  • 00:13:56
    is actually pretty much enough to make
  • 00:13:58
    the transformer position invariant or
  • 00:14:01
    sorry permutation invariant
  • 00:14:04
    and then the other thing we do is we
  • 00:14:06
    restrict the attention such that
  • 00:14:08
    our
  • 00:14:10
    our holdout example can only look at the
  • 00:14:12
    data set
  • 00:14:14
    so can attend to the data set but the
  • 00:14:16
    data set can only attend to other points
  • 00:14:18
    in the dataset so the dataset actually
  • 00:14:20
    aggregates information about itself or
  • 00:14:23
    the sorry i said data set i mean
  • 00:14:25
    training set in this case and the holdout
  • 00:14:27
    example can then look things up in
  • 00:14:30
    in these building um
  • 00:14:32
    building up representations and here we
  • 00:14:35
    have actually two holdout examples
  • 00:14:36
    before i always just showed one but in
  • 00:14:38
    general we train with multiple for
  • 00:14:40
    efficiency reasons
  • 00:14:42
    and these are separated such that each
  • 00:14:44
    one of these can only attend to the
  • 00:14:46
    training set and not to each other i
  • 00:14:48
    mean that that generally was pretty bad
  • 00:14:50
    if they could do that because
  • 00:14:52
    in general they should not depend on
  • 00:14:54
    each other
  • 00:14:55
    that's i think if you know transformers
  • 00:14:58
    already i know this is kind of assuming
  • 00:14:59
    you know a little like a lot about
  • 00:15:01
    transformers but if you know a lot about
  • 00:15:03
    uh like if you know how transformers
  • 00:15:05
    work i think this should
  • 00:15:07
    like describe the whole model um maybe
  • 00:15:10
    one more thing is how we encode the x's
  • 00:15:12
    and y's
  • 00:15:13
    in in our case it's just linear layers
  • 00:15:15
    so we
  • 00:15:16
    so we just put the x's in the linear
  • 00:15:18
    layer and the y's in the linear layer
  • 00:15:20
    and add them together if we want to
  • 00:15:21
    encode a pair or we don't have the y if
  • 00:15:24
    we don't want to encode the y because we
  • 00:15:26
    don't want to give away the information
  • 00:15:27
    for the holdout examples
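    A minimal sketch of the attention masking just described, written against the PyTorch convention where True marks a blocked position; the linear x/y encoders mentioned above would feed into the same sequence, and this is an illustration rather than the authors' implementation:

        import torch

        def pfn_attention_mask(n_train: int, n_test: int) -> torch.Tensor:
            """Mask for a sequence [train_1 .. train_n, test_1 .. test_m].

            Training positions attend only to the training set (it aggregates
            information about itself); each held-out position also attends only
            to the training set, never to other held-out positions, so the
            predictions stay independent of each other.
            """
            n = n_train + n_test
            mask = torch.ones(n, n, dtype=torch.bool)   # True = attention blocked
            mask[:, :n_train] = False                   # everyone may attend to the training set
            mask.fill_diagonal_(False)                  # and to itself
            return mask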
  • 00:15:34
    so one thing that is missing in what i
  • 00:15:37
    described here is
  • 00:15:39
    how exactly this distribution up here
  • 00:15:41
    works so i just say yeah what comes out
  • 00:15:43
    here
  • 00:15:45
    will be this posterior distribution but
  • 00:15:47
    i
  • 00:15:49
    uh i didn't explain how
  • 00:15:51
    and the traditional way would be to to
  • 00:15:53
    predict uh
  • 00:15:54
    something like a gaussian here some
  • 00:15:56
    normal distribution predict the mean and
  • 00:15:58
    the variance we tried that and it didn't
  • 00:16:01
    work and what worked much better was
  • 00:16:03
    like a
  • 00:16:04
    discretization of the space so
  • 00:16:07
    so this is our space we care about
  • 00:16:08
    predictions in uh so we want to predict
  • 00:16:11
    uh things that that lie within let's
  • 00:16:14
    hear say let's say minus three up to
  • 00:16:15
    three
  • 00:16:16
    and what we do now is we discretize the
  • 00:16:18
    space into little buckets
  • 00:16:20
    and put a soft max in front so we just
  • 00:16:23
    classify
  • 00:16:24
    uh basically uh things are
  • 00:16:27
    either or we classify
  • 00:16:29
    things to be in one of these buckets we
  • 00:16:33
    this is something that just worked
  • 00:16:35
    out of the box pretty much and you don't
  • 00:16:37
    need to tune very much
  • 00:16:39
    usually 1 000 buckets is a good number
  • 00:16:42
    um
  • 00:16:44
    this is a problem though because if you
  • 00:16:46
    don't know this probability for your
  • 00:16:48
    prior so sometimes maybe a point lies
  • 00:16:50
    here like below minus three you get like
  • 00:16:53
    a minus infinity loss which
  • 00:16:55
    would be bad
  • 00:16:56
    so very simple hack to
  • 00:16:59
    have good uh good losses it's just to
  • 00:17:02
    replace the sides with like half normals
  • 00:17:03
    so now
  • 00:17:04
    we actually have support for
  • 00:17:07
    minus infinity to plus infinity um we
  • 00:17:09
    have support everywhere
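    A sketch of the discretized output head he describes, using equal-width buckets and a softmax; the half-normal tails that give full support are omitted for brevity, and all names are illustrative:

        import torch
        import torch.nn.functional as F

        class BucketHead(torch.nn.Module):
            """Predict p(y) by classifying y into buckets on [lo, hi]."""
            def __init__(self, d_model, n_buckets=1000, lo=-3.0, hi=3.0):
                super().__init__()
                self.register_buffer("borders", torch.linspace(lo, hi, n_buckets + 1))
                self.register_buffer("widths", self.borders[1:] - self.borders[:-1])
                self.proj = torch.nn.Linear(d_model, n_buckets)

            def nll(self, h, y):
                """Negative log-density of continuous targets y under the bucketed model."""
                logits = self.proj(h)                                        # (batch, n_buckets)
                idx = (torch.bucketize(y, self.borders) - 1).clamp(0, logits.shape[-1] - 1)
                log_p = F.log_softmax(logits, dim=-1).gather(-1, idx.unsqueeze(-1)).squeeze(-1)
                # density inside a bucket = bucket probability / bucket width
                return -(log_p - torch.log(self.widths[idx]))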
  • 00:17:12
    other questions about the model because
  • 00:17:14
    then we could already start into the
  • 00:17:16
    experiments next that's so actually this
  • 00:17:18
    side i haven't fully understood so
  • 00:17:20
    basically instead of a particular mean
  • 00:17:22
    and the variance you predict
  • 00:17:24
    quantiles
  • 00:17:25
    or something like that yeah instead of
  • 00:17:27
    predicting a mean and a variance we
  • 00:17:28
    basically uh we we have a softmax in
  • 00:17:31
    the end and we predict in which one of
  • 00:17:33
    these buckets our
  • 00:17:35
    our y will land
  • 00:17:37
    so these are the bucket the first bucket
  • 00:17:39
    goes from -3 to minus 1.5 for example
  • 00:17:42
    and then we give you the probability and
  • 00:17:43
    say 0.1
  • 00:17:46
    likelihood that it lands in this bucket
  • 00:17:48
    okay so basically make it a
  • 00:17:50
    classification problem
  • 00:17:51
    right exactly
  • 00:17:53
    why not why not predicting quantiles so
  • 00:17:57
    why not what predicting the quantile so
  • 00:18:02
    um i i didn't really think about this uh
  • 00:18:05
    so you mean like predicting the borders
  • 00:18:07
    of these
  • 00:18:08
    quantiles
  • 00:18:09
    are the quantile in the distribution or
  • 00:18:12
    yeah
  • 00:18:13
    okay this is this data point yeah
  • 00:18:14
    exactly um
  • 00:18:16
    the quantizer question
  • 00:18:19
    [Music]
  • 00:18:20
    i mean the point is to make it a
  • 00:18:22
    classification problem right
  • 00:18:24
    because transformers are good at that
  • 00:18:26
    [Music]
  • 00:18:27
    right tim
  • 00:18:29
    yeah yeah yeah yeah for sure for sure i
  • 00:18:31
    mean
  • 00:18:32
    okay i think still one could probably do
  • 00:18:34
    like zero to one and then the bars again
  • 00:18:38
    but uh okay that's probably not what you
  • 00:18:39
    mean uh yeah i i didn't try the quantile
  • 00:18:43
    but i had the feeling that the
  • 00:18:44
    regression head works much worse with
  • 00:18:46
    transformers or with neural networks
  • 00:18:48
    maybe even in general no that's too much
  • 00:18:50
    of a claim that's used in many places
  • 00:18:51
    but with transformers it seemed to work
  • 00:18:53
    much worse
  • 00:18:54
    you could still make the classification
  • 00:18:56
    problem but then it just predicts
  • 00:18:58
    okay this this point is like within the
  • 00:19:00
    ten percent quantile twenty percent
  • 00:19:01
    quantile and
  • 00:19:02
    so on yeah for sure you can
  • 00:19:04
    um
  • 00:19:06
    we kind of do that actually in a way
  • 00:19:08
    because we how we define these bars is
  • 00:19:11
    we take a big example from our prior
  • 00:19:14
    and now we create bars for every
  • 00:19:17
    quantile so for every one percent type
  • 00:19:19
    of quantile we have one bar
  • 00:19:21
    so we will have for normal just like for
  • 00:19:24
    gaussian process prior for example we
  • 00:19:25
    have smaller bars in the center and
  • 00:19:27
    wider on the sides so we kind of do
  • 00:19:29
    quantile prediction i guess okay i see
  • 00:19:31
    yeah okay
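    The quantile-based bucket borders he mentions could be computed roughly like this (a sketch; sample_ys_from_prior is a hypothetical helper returning a large sample of target values from the prior):

        import numpy as np

        def quantile_bucket_borders(sample_ys_from_prior, n_buckets=1000, n_samples=100_000):
            """Place bucket borders at evenly spaced quantiles of a big prior sample,
            so each bucket carries roughly 1/n_buckets of the prior mass: narrow
            buckets where the prior is dense (the center), wide buckets in the tails."""
            ys = sample_ys_from_prior(n_samples)
            return np.quantile(ys, np.linspace(0.0, 1.0, n_buckets + 1))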
  • 00:19:33
    and and as a remark here i think it
  • 00:19:35
    would be a really interesting project if
  • 00:19:37
    anyone is interested in this to
  • 00:19:40
    look into
  • 00:19:41
    whether we can
  • 00:19:43
    just in generally and in general do
  • 00:19:45
    regression like this and whether in
  • 00:19:47
    general this is a good idea to to do
  • 00:19:49
    regression
  • 00:19:51
    like this with neural networks because
  • 00:19:54
    sometimes we tried regression with
  • 00:19:55
    neural networks and it didn't work all
  • 00:19:57
    that well
  • 00:19:58
    and um yeah this could be a generic
  • 00:20:01
    approach for
  • 00:20:02
    just how to do regression with neural
  • 00:20:04
    nets but um
  • 00:20:05
    yeah we haven't looked into it if
  • 00:20:07
    someone would left like to do that um
  • 00:20:09
    maybe together that'd be great
  • 00:20:11
    get in touch
  • 00:20:15
    yeah
  • 00:20:17
    okay so any more questions
  • 00:20:21
    just a quick question do you have any
  • 00:20:22
    insight on
  • 00:20:24
    let's say at a finite collection of test
  • 00:20:27
    points
  • 00:20:28
    um do you have any insight about what
  • 00:20:31
    the marginal looks like because you know
  • 00:20:33
    in in gaussian processes we have these
  • 00:20:35
    guarantees about the marginal
  • 00:20:37
    distribution at a finite collection
  • 00:20:39
    they're jointly gaussian
  • 00:20:42
    is that even kind of useful thing to
  • 00:20:44
    think about in this setting and if so do
  • 00:20:47
    you have any insight on what what those
  • 00:20:48
    things look like
  • 00:20:50
    so you mean the joint distribution of
  • 00:20:52
    multiple points
  • 00:20:53
    yeah exactly so so the the joint
  • 00:20:56
    distribution at a finite collection of
  • 00:20:57
    points will always be a multivariate
  • 00:21:00
    gaussian right in the case by definition
  • 00:21:02
    in the case of gaussian processes
  • 00:21:05
    yeah we actually
  • 00:21:06
    don't like this empirically
  • 00:21:09
    okay um
  • 00:21:10
    i mean what
  • 00:21:12
    i didn't look into the joint
  • 00:21:13
    distributions i have to say so i can't
  • 00:21:15
    really tell you much about this we can
  • 00:21:17
    say that each of these distributions
  • 00:21:19
    looks very much like a gaussian so if you
  • 00:21:20
    like pull up the distribution they will
  • 00:21:22
    match pretty exactly but just like by
  • 00:21:24
    eye because in the end one is like bars
  • 00:21:27
    and the other is like a like a like a
  • 00:21:29
    continuous line
  • 00:21:31
    um but for the joints i don't know i
  • 00:21:34
    guess empirically you could calculate
  • 00:21:36
    the covariances at any pair of points
  • 00:21:39
    and you can kind of see
  • 00:21:40
    how it's distributed i think that would
  • 00:21:42
    be kind of interesting
  • 00:21:44
    and i i would think it would look like a
  • 00:21:47
    gaussian thing just because the prior
  • 00:21:49
    you're learning from is is gaussian
  • 00:21:51
    um
  • 00:21:52
    another quick question i had is uh in
  • 00:21:54
    the context of
  • 00:21:56
    like uh marginal likelihood um just
  • 00:21:59
    tuning the hyper parameters as you would
  • 00:22:02
    in a traditional gp
  • 00:22:03
    um like do you have do you consider when
  • 00:22:06
    when you're drawing your prior data sets
  • 00:22:08
    different
  • 00:22:09
    uh length scales or kernel amplitudes
  • 00:22:13
    uh variances things like this or do you
  • 00:22:15
    just keep it fixed and does that
  • 00:22:17
    actually then
  • 00:22:19
    from learning from that prior does that
  • 00:22:21
    actually
  • 00:22:22
    um
  • 00:22:23
    generalize well
  • 00:22:25
    yeah we'll get to that um we do a lot of
  • 00:22:27
    that so we do a lot of this like mixing
  • 00:22:29
    different different prior hyper
  • 00:22:31
    parameters right
  • 00:22:34
    but we'll get to that in the experiment
  • 00:22:36
    okay cool
  • 00:22:37
    thanks but but maybe to answer luis's
  • 00:22:42
    first question um
  • 00:22:42
    first um
  • 00:22:44
    so i i do think we know what the
  • 00:22:45
    marginals look like they look
  • 00:22:48
    pretty much exactly like what we would
  • 00:22:50
    expect um with the true posterior
  • 00:22:53
    distribution
  • 00:22:54
    um
  • 00:22:55
    so if we train on a prior that comes
  • 00:22:58
    from a gp then well our predictions will
  • 00:23:01
    look pretty much exactly like a gaussian
  • 00:23:03
    because well that that's a kl divergence
  • 00:23:05
    we're optimizing between these two
  • 00:23:06
    distributions um
  • 00:23:09
    the joint of
  • 00:23:11
    multiple predictions
  • 00:23:13
    well actually that will look very
  • 00:23:15
    different
  • 00:23:16
    um we're just not going to predict any
  • 00:23:19
    cross-correlations here
  • 00:23:21
    because that that's what sam said we
  • 00:23:23
    predict independently for x4 and for x5
  • 00:23:26
    if you want to join then you would have
  • 00:23:28
    to predict for x4 put it into the
  • 00:23:30
    training data do another forward prop
  • 00:23:32
    and get the prediction for x5 and then i
  • 00:23:34
    would expect that we get pretty nice
  • 00:23:37
    joined
  • 00:23:38
    gaussians but that we haven't tried yet
  • 00:23:43
    i see gotcha
  • 00:23:48
    thanks okay
  • 00:23:50
    um then we jump into the experiments and
  • 00:23:53
    the experiments will actually be first
  • 00:23:55
    experiments of the original paper that
  • 00:23:57
    was also i think in the invitation so
  • 00:23:59
    this transformers can do bayesian
  • 00:24:00
    inference and then we'll get to
  • 00:24:02
    experiments of a paper we just uploaded
  • 00:24:04
    to archive so very fresh out of the oven
  • 00:24:08
    you'll see
  • 00:24:09
    um
  • 00:24:10
    so first
  • 00:24:12
    uh the motivating experiment is we train
  • 00:24:14
    on a gaussian
  • 00:24:15
    prior and want to see that our network
  • 00:24:19
    actually does something
  • 00:24:20
    that flower creates something that looks
  • 00:24:22
    like the gaussian posterior
  • 00:24:24
    and we actually do get that that was uh
  • 00:24:27
    that was nice back then um and so what
  • 00:24:30
    we see on the left here is the black
  • 00:24:33
    dots are are
  • 00:24:35
    our given examples or the training that
  • 00:24:37
    you could say so
  • 00:24:38
    this is known up front and
  • 00:24:41
    now we're interested in the
  • 00:24:43
    posterior distribution of our of our
  • 00:24:45
    gaussian process given these points and
  • 00:24:48
    so here we plot the density and we can
  • 00:24:50
    see that this is pretty smooth there are
  • 00:24:53
    some edges in there but i think this is
  • 00:24:54
    mostly due to x being in buckets
  • 00:24:58
    not so much why because we use a lot of
  • 00:25:01
    uh we use a lot of buckets here like
  • 00:25:03
    that 1000 or so
  • 00:25:05
    um
  • 00:25:07
    this is not so hard to compare to a
  • 00:25:09
    to a gp like for a gp we know the
  • 00:25:11
    posterior right so
  • 00:25:12
    it's actually a nice motivating example
  • 00:25:14
    because we can compare to the true
  • 00:25:16
    predictions unlike for many other priors
  • 00:25:18
    uh we can we can hear compared to the to
  • 00:25:21
    the ground truth
  • 00:25:23
    and this is on the right here
  • 00:25:26
    in blue it's the pfn and in green we
  • 00:25:29
    have the gp
  • 00:25:30
    and you see that at least for these
  • 00:25:32
    simple examples they uh they match very
  • 00:25:36
    exactly there are some differences we
  • 00:25:38
    mark them here with errors but in
  • 00:25:40
    general it looks like
  • 00:25:42
    it fits it pretty pretty well and even
  • 00:25:45
    the confidence interval looks looks
  • 00:25:47
    pretty similar
  • 00:25:50
    we can do this also here for another
  • 00:25:53
    length scale to show that yeah even the
  • 00:25:56
    length scale is different it also still
  • 00:25:57
    works and with more points uh same
  • 00:26:00
    experiment
  • 00:26:01
    and
  • 00:26:03
    and now we compared to the attentive neural
  • 00:26:06
    processes but this is in red here and
  • 00:26:08
    that's kind of what was there before
  • 00:26:10
    and we can see that yeah there is a
  • 00:26:13
    clear difference the transformer
  • 00:26:14
    architecture seems to be really good at
  • 00:26:16
    this task surprisingly compared to an
  • 00:26:19
    architecture that was invented for
  • 00:26:21
    few-shot learning
  • 00:26:25
    and yeah now i would like to show you a
  • 00:26:27
    little demo
  • 00:26:29
    um you can find the link to this little
  • 00:26:31
    demo actually in the paper or on the
  • 00:26:34
    github
  • 00:26:35
    so if you want to try it yourself you
  • 00:26:37
    can
  • 00:26:38
    so
  • 00:26:40
    i think that's so yeah let's see so in
  • 00:26:42
    this little demo we basically create the
  • 00:26:44
    same figures we had before
  • 00:26:47
    uh with exactly the same legend again
  • 00:26:50
    but we can create our own and we could
  • 00:26:53
    now for example here create a point in
  • 00:26:56
    the middle somewhere
  • 00:26:57
    i wanted to try an example before but no
  • 00:26:59
    i didn't really and make it much higher
  • 00:27:02
    because they're pretty much on one line
  • 00:27:03
    so that it's maybe a little more
  • 00:27:04
    interesting
  • 00:27:09
    okay so we have something something like
  • 00:27:10
    this and then okay
  • 00:27:13
    i always press new column that's a
  • 00:27:15
    problem you press new column right
  • 00:27:21
    what did i type here 0.9
  • 00:27:23
    and now a new row and now you can do
  • 00:27:26
    0.6 and we can see like if we make it
  • 00:27:29
    more extreme it will be harder for the
  • 00:27:31
    for the network
  • 00:27:37
    let's see what with this so it should
  • 00:27:39
    have like a steep ascent let's see if it
  • 00:27:41
    still can model it uh it still can model
  • 00:27:44
    it but you can see that there are more
  • 00:27:45
    errors than before i would say
  • 00:27:48
    for sure um and
  • 00:27:51
    and you can like this is
  • 00:27:53
    this is due to being much more out of
  • 00:27:55
    the distribution than
  • 00:27:57
    and
  • 00:27:59
    like the first example but yeah you can
  • 00:28:01
    play around with this demo if you like
  • 00:28:02
    as well
  • 00:28:05
    yeah frank posted the link very nice
  • 00:28:07
    cool let's go back to the presentation
  • 00:28:15
    so
  • 00:28:15
    now we look at uh
  • 00:28:18
    not anymore at night but her yes you
  • 00:28:20
    know still has nice of course everything
  • 00:28:22
    are and all of these are nice but um now
  • 00:28:25
    we look at um
  • 00:28:26
    performance plot uh here we see the
  • 00:28:29
    negative log likelihood so as we've shown
  • 00:28:31
    like as we can show that this is a
  • 00:28:33
    measure for the kl divergence between
  • 00:28:35
    our posterior and the true posterior um
  • 00:28:39
    we use the negative log likelihood to
  • 00:28:41
    measure our method uh on prior data
  • 00:28:44
    and we compare it here with the true gp
  • 00:28:46
    so this is again an example where we
  • 00:28:47
    have the true true gp so we don't do
  • 00:28:50
    anything where we can't get the true
  • 00:28:52
    posterior up to here two thousand
  • 00:28:54
    examples
  • 00:28:56
    and
  • 00:28:57
    what we can see is that we have
  • 00:28:59
    different pfns and this is the number of
  • 00:29:01
    data sets they've seen during training
  • 00:29:03
    so this uppercase k from back then
  • 00:29:06
    um
  • 00:29:08
    and what we can see is that
  • 00:29:09
    the pfns like always get better
  • 00:29:12
    with more training data like a very
  • 00:29:14
    traditional result i guess in machine
  • 00:29:16
    learning but um you can see they they
  • 00:29:18
    they get better and better with more
  • 00:29:20
    training and more training and we can
  • 00:29:21
    see similar things actually with the
  • 00:29:23
    transformers side um not as strong but
  • 00:29:26
    similar
  • 00:29:30
    quick question
  • 00:29:32
    um
  • 00:29:32
    do you have some insight about uh like
  • 00:29:36
    at what point you reach
  • 00:29:37
    point of diminishing returns like i
  • 00:29:40
    assume with all things transformers you
  • 00:29:42
    have you have these power laws and uh
  • 00:29:45
    more is always more data is always
  • 00:29:46
    better and so on but let's say i wanted
  • 00:29:48
    to deploy this myself i wanted to train
  • 00:29:50
    this from scratch by myself
  • 00:29:52
    and i don't have all the time or compute
  • 00:29:54
    budget in the world um like how do you
  • 00:29:57
    have some intuition about how
  • 00:30:00
    how high or how much data to go to
  • 00:30:03
    and and at what point it's maybe just
  • 00:30:05
    not worth it anymore like you hit the
  • 00:30:07
    point of diminishing returns
  • 00:30:10
    so i would say this generally depends a
  • 00:30:13
    lot on your prior so if you have a very
  • 00:30:15
    simple prior of course you hit that
  • 00:30:16
    point earlier so probably one of the
  • 00:30:20
    examples where you hit it earlier would
  • 00:30:21
    be something like the sculpture process
  • 00:30:23
    here
  • 00:30:24
    um
  • 00:30:25
    and
  • 00:30:26
    in general like what i can see is that
  • 00:30:28
    training for a day on one gpu
  • 00:30:32
    yeah will yield pretty good results and
  • 00:30:34
    training for five days will be like a
  • 00:30:36
    little bit better but not much
  • 00:30:39
    this is something it can give you as
  • 00:30:41
    like a
  • 00:30:42
    so
  • 00:30:43
    yeah that's that's so far
  • 00:30:45
    how how we scale we hope we find some
  • 00:30:47
    way to make better use of more compute
  • 00:30:49
    but i think so far
  • 00:30:51
    like after one day or
  • 00:30:53
    yeah we spent sometimes five days on the
  • 00:30:55
    training but for this first paper it was
  • 00:30:57
    only one day
  • 00:30:59
    and generally this is also where it
  • 00:31:00
    doesn't get much better
  • 00:31:03
    and just quick question on the gp prior
  • 00:31:05
    that you use to generate the data sets
  • 00:31:07
    like do you just always have uh you
  • 00:31:10
    uniformly sample um
  • 00:31:12
    a fixed number of
  • 00:31:14
    input
  • 00:31:15
    like observed input points and then just
  • 00:31:18
    one test point
  • 00:31:20
    and you just yeah um domain or this is
  • 00:31:22
    very so
  • 00:31:24
    so yeah for the gp prior exactly since
  • 00:31:26
    the gp prior doesn't really entail
  • 00:31:28
    x's we have to sample the x's
  • 00:31:30
    ourselves in some way
  • 00:31:32
    and we do
  • 00:31:34
    yeah pretty much what you say we sample
  • 00:31:36
    uniformly at random or from uh from a
  • 00:31:39
    standard normal
  • 00:31:41
    and
  • 00:31:43
    we
  • 00:31:44
    what we do though is that what is
  • 00:31:46
    important is to train for different uh
  • 00:31:49
    for different numbers of data points
  • 00:31:50
    because you want to generalize among
  • 00:31:53
    different data set sizes
  • 00:31:56
    um and so we we sample uniformly at
  • 00:31:58
    random how many
  • 00:32:00
    examples are
  • 00:32:01
    training set and how many are tested
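    A sketch of that sampling scheme (an RBF-kernel GP prior, inputs from a standard normal, a randomly varied dataset size and split; the hyperparameters are placeholders, not the paper's settings):

        import numpy as np

        def sample_gp_dataset(n_max=100, d=5, lengthscale=1.0, noise=1e-4, rng=None):
            """Draw one synthetic dataset from a GP prior with an RBF kernel."""
            rng = rng or np.random.default_rng()
            n = int(rng.integers(2, n_max + 1))                  # vary the dataset size
            x = rng.standard_normal((n, d))                      # inputs from a standard normal
            sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
            k = np.exp(-0.5 * sq / lengthscale**2) + noise * np.eye(n)
            y = rng.multivariate_normal(np.zeros(n), k)          # function values from the GP
            cut = int(rng.integers(1, n))                        # random train/test split
            return (x[:cut], y[:cut]), (x[cut:], y[cut:])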
  • 00:32:07
    okay
  • 00:32:07
    and
  • 00:32:09
    do do you do this for higher dimensions
  • 00:32:12
    or is it just the one dimensional
  • 00:32:14
    problem
  • 00:32:16
    uh i'm sorry this is missing here but
  • 00:32:18
    this is for example for five dimensions
  • 00:32:20
    this plot we do it for high dimensions
  • 00:32:22
    as well yeah
  • 00:32:23
    we go
  • 00:32:25
    i think our largest experiment like like
  • 00:32:28
    close to 800 dimensions like 784
  • 00:32:32
    okay
  • 00:32:33
    use this center strategy
  • 00:32:36
    so do you use um a fixed kernel for
  • 00:32:39
    your gp prior or do you use um yeah
  • 00:32:42
    different types of kernels different
  • 00:32:43
    length scales noise and so on
  • 00:32:45
    uh yeah i'm sorry i i just see that i
  • 00:32:47
    threw out the plot where i where i use
  • 00:32:50
    different length scales and kernels um
  • 00:32:52
    but in this plot this is just the same
  • 00:32:54
    length scale kernel but you can do the
  • 00:32:57
    same um
  • 00:32:58
    by
  • 00:32:59
    just varying the length scale as you
  • 00:33:01
    sample like for each dataset use a
  • 00:33:02
    different length scale distribution over
  • 00:33:04
    these and we did that as well so the
  • 00:33:06
    so-called hyper priors in the literature
  • 00:33:09
    and um that works as well but we don't
  • 00:33:12
    have the nice baseline anymore
  • 00:33:13
    because there's no or there's no easy
  • 00:33:16
    way to get the
  • 00:33:17
    to get the correct um to get the correct
  • 00:33:20
    criteria we can only approximate it
  • 00:33:22
    there are only approximations for it out
  • 00:33:24
    there
  • 00:33:24
    so it's a little harder to compare i'm
  • 00:33:27
    just wondering because if you if you
  • 00:33:28
    then sample different length scales and
  • 00:33:30
    potentially different kernels and your
  • 00:33:32
    prior gets more more uninformative right
  • 00:33:34
    and then it's you know
  • 00:33:36
    it yeah what is your model to learn
  • 00:33:39
    right here
  • 00:33:40
    in the extreme case you have like a
  • 00:33:42
    super uninformative prior and then yeah
  • 00:33:44
    well the model is not going to help
  • 00:33:45
    anymore because it's just models
  • 00:33:48
    yeah for sure that's that's that's the
  • 00:33:50
    trade-off there's some sweet spot yeah
  • 00:33:54
    yeah you don't want to make it too broad
  • 00:33:56
    but so far
  • 00:33:58
    yeah generally so far we saw making it
  • 00:33:59
    broader can can help in many
  • 00:34:01
    applications
  • 00:34:03
    um so because i think so far priors are
  • 00:34:05
    pretty restricted
  • 00:34:07
    with this really traditional vi or mcmc
  • 00:34:10
    methods because we have to have these
  • 00:34:12
    like easy to interpret latents or easy
  • 00:34:14
    to generate latent
  • 00:34:15
    ah
  • 00:34:16
    we don't need these anymore
  • 00:34:18
    cool
  • 00:34:19
    um yeah so this is uh since we talked
  • 00:34:22
    about it this is a bayesian nn then or
  • 00:34:24
    no we didn't really talk about it
  • 00:34:26
    this is the
  • 00:34:28
    bayesian neural network and we we do the
  • 00:34:30
    approximation there here as well and
  • 00:34:32
    what we look here at is um just to show
  • 00:34:34
    you the time scales that's actually
  • 00:34:37
    quite interesting so the svi baseline
  • 00:34:39
    here is the
  • 00:34:40
    bayes by backprop or like a newer
  • 00:34:42
    variation of that and then we have the
  • 00:34:44
    nuts like an mcmc
  • 00:34:47
    sampler
  • 00:34:48
    and
  • 00:34:51
    these are two other approximations to
  • 00:34:52
    the to the prior to the posterior for
  • 00:34:56
    bayesian neural networks and we compare
  • 00:34:59
    our method to these so this is our
  • 00:35:00
    method and it takes
  • 00:35:03
    we can't really change the time it takes
  • 00:35:05
    for us to predict the label for a data
  • 00:35:07
    set because it's a forward pass we can't
  • 00:35:09
    really decide to only do half of our
  • 00:35:11
    forward pass so that's why we have a dot
  • 00:35:14
    here so our method takes like uh like a
  • 00:35:17
    hundredth of a second to
  • 00:35:20
    to predict uh the labels for a given
  • 00:35:22
    task and then we looked at different
  • 00:35:24
    budgets
  • 00:35:25
    uh for our baselines and we saw that
  • 00:35:27
    they reach our performance after using
  • 00:35:30
    like something like 1000 x of the time
  • 00:35:34
    compared to us
  • 00:35:36
    which was yeah
  • 00:35:37
    quite cool
  • 00:35:38
    and and we can do this like we do this a
  • 00:35:41
    and b here it's just both be an end but
  • 00:35:44
    different sizes the left is a little
  • 00:35:45
    smaller than the right and generally
  • 00:35:47
    both of them are really small because
  • 00:35:49
    otherwise you can't do nuts
  • 00:35:52
    so
  • 00:35:53
    in this case are you just uh that the
  • 00:35:56
    data that you're learning from the data
  • 00:35:57
    sets that you generate are these just
  • 00:35:59
    literally um
  • 00:36:01
    you you have a bnn with some with
  • 00:36:04
    different random initializations and
  • 00:36:06
    that's how you're generating the data
  • 00:36:07
    sets
  • 00:36:08
    so yeah we have an mlp with random
  • 00:36:10
    initialization and then we generate our
  • 00:36:13
    x's randomly like before for the gps and
  • 00:36:16
    feed them through and generate our y's
  • 00:36:18
    that way
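    A sketch of that generation step (a freshly initialized MLP applied to random inputs; the architecture and the thresholding into two classes are illustrative choices, not the paper's exact prior):

        import torch

        def sample_bnn_dataset(n=100, d=5, hidden=64):
            """Draw one dataset from a 'random MLP' prior: new random weights per dataset."""
            mlp = torch.nn.Sequential(                 # re-initialized for every dataset
                torch.nn.Linear(d, hidden), torch.nn.ReLU(),
                torch.nn.Linear(hidden, 1),
            )
            x = torch.randn(n, d)                      # random inputs
            with torch.no_grad():
                z = mlp(x).squeeze(-1)                 # latent function values
            y = (z > z.median()).long()                # threshold into two classes
            return x, y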
  • 00:36:19
    okay and just to clarify what i'm
  • 00:36:21
    looking at here this is so this is the
  • 00:36:23
    test likelihood i'm assuming on on some
  • 00:36:25
    regression or some
  • 00:36:28
    uh
  • 00:36:29
    problem like classification or
  • 00:36:30
    regression problems or is this um this
  • 00:36:33
    is um yeah so sorry this is uh
  • 00:36:36
    this is actually classification problems
  • 00:36:39
    um and these are classification problems
  • 00:36:42
    sampled from the prior so it's still in
  • 00:36:45
    prior with like we
  • 00:36:46
    to get rid of compounding factors like
  • 00:36:49
    choosing a choosing a data set that
  • 00:36:50
    might not lie within the prior
  • 00:36:53
    we trained on unseen data sets from the
  • 00:36:55
    prior
  • 00:36:58
    okay thanks
  • 00:37:00
    and yes there's no fine tuning here
  • 00:37:01
    right so you wouldn't
  • 00:37:03
    you do the you train your transformer
  • 00:37:05
    and then
  • 00:37:06
    it's basically just testing so you
  • 00:37:07
    wouldn't update your
  • 00:37:10
    uh the transformer based on this new
  • 00:37:11
    data set right
  • 00:37:13
    no yeah the weights stay the same
  • 00:37:17
    it's yeah it's one training and then
  • 00:37:18
    only feed forward
  • 00:37:22
    don't know feed forward the wrong word
  • 00:37:24
    forward
  • 00:37:26
    so yeah okay here a little outlook um
  • 00:37:28
    it's actually yeah
  • 00:37:30
    yeah it it just to show you this is the
  • 00:37:32
    average regret and we can do uh we can
  • 00:37:35
    do we compare here the gp here is a
  • 00:37:38
    sorry we do bayesian optimization in
  • 00:37:39
    this plot and we compared to a gp that
  • 00:37:42
    uses mle-ii so that's kind of the
  • 00:37:45
    standard approach right now in bayesian
  • 00:37:47
    optimization for
  • 00:37:48
    to approximate these uh these type of
  • 00:37:51
    prior
  • 00:37:52
    gps
  • 00:37:53
    um
  • 00:37:54
    and uh
  • 00:37:56
    we compare our pfn against that and
  • 00:37:58
    against random search and we train our
  • 00:38:00
    pfn on the exact same prior the gp has
  • 00:38:02
    so the expectation would be that they
  • 00:38:05
    should be similar because it's two
  • 00:38:06
    similar approximations to the same
  • 00:38:08
    problem
  • 00:38:09
    and this is only for like up to 15
  • 00:38:12
    trials so relatively small
  • 00:38:14
    examples um
  • 00:38:17
    and both use ei so we can actually use
  • 00:38:20
    ei on top of our pfn so that's expected
  • 00:38:23
    improvements like one specific
  • 00:38:24
    acquisition function and what we could
  • 00:38:26
    see there is that our method performs
  • 00:38:29
    actually very similar to
  • 00:38:31
    to the gp and the cool thing now is yeah
  • 00:38:33
    you know you can get creative and build
  • 00:38:35
    new priors that are not possible with
  • 00:38:37
    the gp to to even beat it and that's
  • 00:38:40
    what i'm working on right now actually
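    For illustration, expected improvement can be read straight off a discretized predictive distribution like the one above (a sketch for maximization; probs and bucket_centers would come from a hypothetical PFN output head):

        import torch

        def expected_improvement(probs, bucket_centers, best_y):
            """EI(x) = E[max(y - best_y, 0)] under the bucketed predictive distribution.

            probs:          (n_candidates, n_buckets) bucket probabilities per candidate
            bucket_centers: (n_buckets,) representative y value of each bucket
            best_y:         best observed objective value so far
            """
            improvement = (bucket_centers - best_y).clamp(min=0.0)   # (n_buckets,)
            return probs @ improvement                               # (n_candidates,)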
  • 00:38:44
    then the last part we're already at 42
  • 00:38:47
    but yeah i guess we don't really need
  • 00:38:48
    discussion up there
  • 00:38:50
    um
  • 00:38:53
    uh the last part is the tabpfn
  • 00:38:55
    so that's the paper we just
  • 00:38:57
    just submitted and also just uploaded to
  • 00:38:59
    archive
  • 00:39:01
    and um
  • 00:39:04
    and that is basically one PFN, so one of
  • 00:39:07
    these prior-data fitted networks, one
  • 00:39:10
    transformer, one could also say,
  • 00:39:13
    in which we train a full AutoML
  • 00:39:15
    method, so we
  • 00:39:17
    want to make this one thing be good
  • 00:39:20
    at
  • 00:39:21
    tabular AutoML, in our case
  • 00:39:23
    AutoML on tabular datasets
  • 00:39:28
    what's cool about this is that it makes
  • 00:39:30
    AutoML pretty much instant:
  • 00:39:32
    our method performs comparably to
  • 00:39:35
    state-of-the-art AutoML methods that get
  • 00:39:38
    five minutes of time, while we only take
  • 00:39:40
    a second
  • 00:39:43
    and
  • 00:39:44
    this
  • 00:39:45
    simplifies AutoML, at least from the
  • 00:39:47
    view of a deep learner, from the
  • 00:39:49
    viewpoint of a deep learner, because
  • 00:39:51
    before
  • 00:39:52
    you had a setup like this, for example,
  • 00:39:56
    where we have some meta-learning,
  • 00:39:57
    Bayesian optimization, etc. etc., which is
  • 00:40:01
    all
  • 00:40:02
    replaced now by a single neural network
  • 00:40:06
    plus some pre-processing; i should say
  • 00:40:08
    that the pre-processing is something like
  • 00:40:09
    normalizing data, so i hope that's
  • 00:40:12
    allowed
  • 00:40:14
    and
  • 00:40:16
    one caveat here i want to give up front
  • 00:40:19
    is that we
  • 00:40:20
    looked mostly at small data, or we
  • 00:40:22
    looked at small data sets so we care
  • 00:40:24
    only about data sets up to 1000 examples
  • 00:40:28
    okay so
  • 00:40:30
    now
  • 00:40:32
    let's talk about the prior so as i said
  • 00:40:35
    before you can get creative with these
  • 00:40:36
    priors because we just have to be able
  • 00:40:38
    to sample data sets from them. so what
  • 00:40:40
    even is a prior? i mean, you can also call
  • 00:40:41
    it like a simulation of real data, so
  • 00:40:44
    um
  • 00:40:45
    and here we just started from
  • 00:40:47
    scratch: we want to be able to
  • 00:40:50
    yield a transformer that's just good at
  • 00:40:52
    classification on
  • 00:40:55
    tabular datasets, so yeah, of course we
  • 00:40:57
    wanted it to be simple
  • 00:40:58
    so
  • 00:40:59
    we thought
  • 00:41:01
    we should always have something in there
  • 00:41:02
    that forces our latents, or
  • 00:41:07
    whatever the latent is, to
  • 00:41:09
    represent simple solutions that can be
  • 00:41:11
    described with few words
  • 00:41:15
    and where we took inspiration is from
  • 00:41:17
    the,
  • 00:41:18
    from the Bayesian NNs, so on the
  • 00:41:21
    left here
  • 00:41:22
    that's the standard Bayesian NN, so
  • 00:41:24
    how we talked about it with Luis,
  • 00:41:26
    where,
  • 00:41:27
    where, during training, we
  • 00:41:30
    just sample the, sorry, we
  • 00:41:32
    just sample the weights
  • 00:41:33
    from a standard initialization
  • 00:41:36
    and now feed random data through it to
  • 00:41:39
    generate y's, and this is now our
  • 00:41:41
    data set,
  • 00:41:42
    one data set
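A minimal sketch of this data-generating prior: sample the weights of a small network from a standard initialization, push random inputs through it, and treat the outputs as labels. Every call yields one synthetic dataset, so training data is effectively unlimited. The sizes and the thresholding into two classes below are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

def sample_dataset_from_bnn_prior(n_samples=100, n_features=5, hidden=20):
    # Fresh random weights = one sample from the prior over latent functions.
    net = nn.Sequential(
        nn.Linear(n_features, hidden), nn.Tanh(),
        nn.Linear(hidden, 1),
    )
    with torch.no_grad():
        x = torch.randn(n_samples, n_features)   # random inputs
        score = net(x).squeeze(-1)               # latent function values
        y = (score > score.median()).long()      # illustrative: threshold into 2 classes
    return x, y
```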
  • 00:41:44
    and um
  • 00:41:46
    what we can do then on top of
  • 00:41:48
    that is
  • 00:41:49
    what we call the NAS part, but only
  • 00:41:52
    in quotation marks, the NAS part,
  • 00:41:55
    and that is, we can actually sample
  • 00:41:57
    the architecture as well, so we build
  • 00:42:00
    a posterior over different architectures,
  • 00:42:03
    not only
  • 00:42:04
    a single architecture like a normal
  • 00:42:05
    Bayesian NN has, but multiple
  • 00:42:07
    architectures, so that it basically
  • 00:42:09
    chooses the architecture that's most
  • 00:42:11
    likely, or we have like an ensemble over
  • 00:42:13
    architectures
  • 00:42:14
    and that was in the earlier paper,
  • 00:42:17
    and now in the new TabPFN paper we took
  • 00:42:19
    that one step further
  • 00:42:20
    and looked at structural causal
  • 00:42:23
    models, which can give some
  • 00:42:26
    nice
  • 00:42:29
    buildup, i think, for our prior
  • 00:42:32
    and how they work is basically,
  • 00:42:34
    these are pretty much pruned,
  • 00:42:37
    or yeah, i guess for you,
  • 00:42:39
    these are pruned Bayesian neural networks,
  • 00:42:42
    and now we just change where the inputs
  • 00:42:45
    are and where the outputs are; the outputs
  • 00:42:47
    and the inputs don't have to be at the
  • 00:42:48
    beginning or the end anymore, we have
  • 00:42:50
    random inputs, these are like these
  • 00:42:52
    epsilons here,
  • 00:42:54
    here, these epsilons
  • 00:42:56
    and, like, if i say input now i mean
  • 00:42:59
    input to the neural
  • 00:43:02
    network, but our inputs that we feed to
  • 00:43:04
    the transformer, then later to the PFN,
  • 00:43:06
    can then, for example, be here at
  • 00:43:08
    the output node; it doesn't really matter
  • 00:43:10
    where they are, or the y can actually
  • 00:43:13
    influence the x, like the
  • 00:43:16
    x can come from the y, so it could be
  • 00:43:19
    before, so they're basically just at
  • 00:43:20
    random points in this graph, and the
  • 00:43:22
    interaction comes from what are called
  • 00:43:25
    the links, so these,
  • 00:43:27
    the arrows, boil down to
  • 00:43:30
    weighted sums, and we think of
  • 00:43:33
    these as like the causal links for the
  • 00:43:35
    next thing, and
  • 00:43:39
    this way we can generate graphs with
  • 00:43:41
    like few causal links and generate data
  • 00:43:44
    that, at least for us, looks
  • 00:43:46
    very much like real data
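A hedged sketch of this structural-causal-model flavour of the prior: a sparse random DAG whose nodes are weighted sums of their parents plus noise, with randomly chosen nodes used as the observed features and the target. The sparsity, noise model, and discretization of y below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def sample_scm_dataset(n_samples=100, n_nodes=12, n_features=5, edge_prob=0.2, rng=None):
    rng = rng or np.random.default_rng()
    # Random sparse upper-triangular weight matrix -> acyclic graph with few causal links.
    w = rng.normal(size=(n_nodes, n_nodes)) * (rng.random((n_nodes, n_nodes)) < edge_prob)
    w = np.triu(w, k=1)
    z = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):                      # ancestral sampling, node by node
        noise = rng.normal(size=n_samples)        # the "epsilon" inputs at every node
        z[:, j] = z @ w[:, j] + noise             # weighted sum of parents + noise
    # Features and target are just random nodes of the graph, as described above.
    feature_nodes = rng.choice(n_nodes, size=n_features, replace=False)
    label_node = rng.choice(np.setdiff1d(np.arange(n_nodes), feature_nodes))
    x = z[:, feature_nodes]
    y = (z[:, label_node] > np.median(z[:, label_node])).astype(int)
    return x, y
```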
  • 00:43:49
    um there's a lot of action in the chats
  • 00:43:51
    today
  • 00:43:52
    okay, Frank
  • 00:43:53
    that's good
  • 00:43:54
    um
  • 00:43:56
    well in the end you could refer to it
  • 00:43:57
    but maybe let's push it okay
  • 00:44:00
    okay
  • 00:44:02
    and
  • 00:44:04
    yeah, so what i mean with it looks like
  • 00:44:05
    real data: these are the covariance
  • 00:44:07
    matrices of
  • 00:44:08
    a real data set that we found in our
  • 00:44:10
    benchmark compared to a data set that we
  • 00:44:12
    generate, and we see these covariance
  • 00:44:14
    matrices, they actually look pretty much
  • 00:44:16
    the same: they build clusters and have other
  • 00:44:19
    parts that are really not very
  • 00:44:20
    correlated
  • 00:44:22
    it's similar in a way, so
  • 00:44:24
    here on the right it's
  • 00:44:26
    synthetic data and on the left we have
  • 00:44:28
    the real data
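That sanity check is easy to reproduce in a few lines, assuming `real_x` and `synthetic_x` are (n, d) feature matrices (names are placeholders, not from the paper):

```python
import numpy as np

def covariance_images(real_x, synthetic_x):
    # Feature-by-feature covariance matrices of the real and the sampled data;
    # plotting both (e.g. with matplotlib's imshow) lets you eyeball whether
    # the synthetic data shows similar cluster structure.
    return np.cov(real_x, rowvar=False), np.cov(synthetic_x, rowvar=False)
```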
  • 00:44:31
    cool i hope it's kind of clear what this
  • 00:44:34
    prior is all about and
  • 00:44:38
    we can
  • 00:44:39
    jump to
  • 00:44:41
    the experiments, and
  • 00:44:44
    what we did for experiments: we have
  • 00:44:46
    one
  • 00:44:46
    main experiment here, and that is based
  • 00:44:49
    on the OpenML CC-18 suite; we didn't
  • 00:44:52
    want to cherry-pick our tabular data
  • 00:44:54
    sets, because i think that actually
  • 00:44:55
    happens pretty easily, there are many
  • 00:44:57
    available and it's not clear which one
  • 00:44:58
    is the ImageNet of tabular data sets
  • 00:45:01
    so
  • 00:45:02
    it can happen easily that you cherry-pick,
  • 00:45:04
    that's why we use the benchmark
  • 00:45:07
    that's out there, and as i said before,
  • 00:45:10
    we restricted ourselves to only 1000
  • 00:45:13
    examples, so from this benchmark we threw
  • 00:45:15
    away
  • 00:45:16
    all the data sets that have over 1000
  • 00:45:19
    training examples
  • 00:45:22
    and we were left with
  • 00:45:24
    around two thirds of the data
  • 00:45:27
    sets from this benchmark, so
  • 00:45:29
    this was still around
  • 00:45:31
    30 data sets, i think, or so,
  • 00:45:34
    that we had in the end
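A sketch of that dataset selection with the openml Python package; the exact filtering criteria in the paper may differ (for example, additional limits on the number of features or classes):

```python
import openml

# Keep only the OpenML-CC18 tasks whose datasets have at most 1000 instances.
suite = openml.study.get_suite("OpenML-CC18")
small_task_ids = []
for task_id in suite.tasks:
    dataset = openml.tasks.get_task(task_id).get_dataset()
    if dataset.qualities["NumberOfInstances"] <= 1000:
        small_task_ids.append(task_id)
print(f"kept {len(small_task_ids)} of {len(suite.tasks)} tasks")
```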
  • 00:45:36
    and
  • 00:45:38
    here we compared to a few of the
  • 00:45:41
    modern AutoML frameworks, as well as
  • 00:45:44
    boosted trees and some very
  • 00:45:46
    simple methods like GPs,
  • 00:45:48
    logistic regression, and k-nearest
  • 00:45:51
    neighbors, and what we look at
  • 00:45:53
    here is
  • 00:45:55
    in each plot we have,
  • 00:45:57
    on the x-axis, the time, so
  • 00:45:59
    how much time does it take to get
  • 00:46:02
    results on a data set on average for
  • 00:46:04
    these real-world data sets that we have
  • 00:46:06
    from the OpenML CC-18
  • 00:46:08
    benchmark, and what performance do we get
  • 00:46:11
    in that time; so on the very left we have
  • 00:46:13
    the ROC AUC,
  • 00:46:15
    like the average of the
  • 00:46:18
    ROC AUCs
  • 00:46:20
    for these different data sets, and we
  • 00:46:22
    can see we're here on the very far left,
  • 00:46:24
    and we get comparable results only
  • 00:46:27
    after like 100x the time or so,
  • 00:46:31
    by methods like Auto-sklearn 2
  • 00:46:34
    and AutoGluon.
  • 00:46:36
    we also compared in terms of the number
  • 00:46:39
    of wins and the
  • 00:46:41
    average ranking, and we see a similar
  • 00:46:43
    result:
  • 00:46:45
    we have the best rank and the most
  • 00:46:48
    wins at the very beginning, but even when
  • 00:46:50
    given one hour, which we don't use, we use
  • 00:46:52
    only one second, we have
  • 00:46:54
    comparable performance
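The headline metric on the left of those plots is just the ROC AUC averaged over datasets; a small sketch, assuming `results` holds one (y_true, positive-class probability) pair per benchmark dataset:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(results):
    # `results`: list of (y_true, y_score) pairs, one per dataset.
    # For binary tasks y_score is the positive-class probability; multi-class
    # tasks would additionally need multi_class="ovr" and full probability matrices.
    return float(np.mean([roc_auc_score(y, p) for y, p in results]))
```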
  • 00:46:57
    so
  • 00:46:58
    how long did it take to train the
  • 00:46:59
    transformer so this
  • 00:47:01
    was fantastic
  • 00:47:02
    this particular transformer was trained
  • 00:47:04
    for one day on a GPU
  • 00:47:09
    but it can be, so,
  • 00:47:11
    yeah,
  • 00:47:12
    it can be used now for all kinds of data
  • 00:47:14
    sets, but it's only trained once, and
  • 00:47:16
    it's published, and
  • 00:47:17
    you can now use it for your other data
  • 00:47:19
    sets that we never saw before
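A usage sketch of that "train once, then just apply it" workflow, assuming the published tabpfn package exposes a scikit-learn-style classifier (interface details may have changed since the talk):

```python
from tabpfn import TabPFNClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# A small real dataset (569 rows) that the pre-trained model has never seen.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()     # loads the already-published, pre-trained weights
clf.fit(X_tr, y_tr)          # roughly: preprocess and store the training set as context
print(clf.predict_proba(X_te)[:3])
```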
  • 00:47:22
    i guess the comparable
  • 00:47:24
    question would be how long did it take
  • 00:47:26
    to
  • 00:47:28
    develop Auto-sklearn 2, and i guess
  • 00:47:30
    that's a couple of years, so,
  • 00:47:32
    it's really a one-time training
  • 00:47:36
    as part of algorithm development
  • 00:47:40
    and when we come up with a new prior,
  • 00:47:42
    then that's sort of like coming up with
  • 00:47:43
    a new framework
  • 00:47:45
    right, but i mean, okay, i think that's,
  • 00:47:47
    that's out of the scope of this
  • 00:47:49
    paper and maybe also
  • 00:47:51
    besides the point, but a fair comparison
  • 00:47:53
    would be maybe, you know, how would
  • 00:47:55
    other AutoML with transfer,
  • 00:47:58
    transfer learning, work here, right? so,
  • 00:48:00
    but maybe, but there's no transfer,
  • 00:48:03
    it's just running on artificial data
  • 00:48:06
    yeah, that's the training
  • 00:48:07
    okay, i see, yeah, yeah
  • 00:48:09
    it never saw a real tabular data set
  • 00:48:11
    before
  • 00:48:12
    okay, yeah, i see the point, yeah, yeah, true
  • 00:48:16
    okay, then as a last cherry, a
  • 00:48:20
    last little cool thing that we don't
  • 00:48:23
    really understand, though,
  • 00:48:24
    is, we then tried it later with longer
  • 00:48:27
    data sets, so data sets that have more
  • 00:48:29
    training examples than we ever
  • 00:48:30
    considered during training, because,
  • 00:48:32
    like i said before, we sample these
  • 00:48:34
    artificial data sets during training, and
  • 00:48:35
    we,
  • 00:48:36
    in this case, we sampled data sets of
  • 00:48:38
    length
  • 00:48:39
    1024, that's what we always sampled,
  • 00:48:41
    um
  • 00:48:42
    so it should learn to be able to
  • 00:48:44
    predict for everything smaller than or equal
  • 00:48:46
    to 1024 examples,
  • 00:48:49
    but we later ran on larger data sets
  • 00:48:51
    from the AutoML benchmark, and it
  • 00:48:54
    actually still generalized and
  • 00:48:57
    could
  • 00:48:58
    gain improvements
  • 00:49:00
    after that point, so it can
  • 00:49:03
    somehow generalize and make use of
  • 00:49:04
    larger data sets that it never saw during
  • 00:49:06
    training, which is, i guess, a quite
  • 00:49:08
    interesting feature of transformers
  • 00:49:13
    okay, and now this is the final slide
  • 00:49:16
    already, and that's the outlook,
  • 00:49:18
    what one could use this for and
  • 00:49:22
    what's still out there, and the
  • 00:49:25
    first thing is the Bayesian optimization,
  • 00:49:26
    which we work on, and then of
  • 00:49:30
    course larger data sets are important; i said
  • 00:49:31
    we always restrict ourselves to 1000
  • 00:49:34
    examples, and as you know, training a
  • 00:49:36
    transformer with very long sequences can
  • 00:49:38
    be a problem, and that's why we restrict
  • 00:49:40
    ourselves to something like that
  • 00:49:43
    but we're working on more,
  • 00:49:45
    yeah, making it larger, and also, if you're
  • 00:49:47
    interested, reach out; and then
  • 00:49:50
    there's this creative thing of inventing
  • 00:49:52
    new priors, and what
  • 00:49:54
    really describes the world really
  • 00:49:56
    well, where we are able to
  • 00:49:58
    maybe generate data but we can't, for
  • 00:50:00
    example, compute the probability of
  • 00:50:02
    different data points or so, because we
  • 00:50:04
    don't need to, we just need to sample now
  • 00:50:06
    um
  • 00:50:08
    and
  • 00:50:09
    yeah, other things are the parametric
  • 00:50:10
    posterior approximation, which would be
  • 00:50:12
    pretty much simulation-based inference,
  • 00:50:14
    and interpretability is also
  • 00:50:18
    something that could be interesting here,
  • 00:50:20
    because you can do forward passes much
  • 00:50:22
    faster than before, so you can actually
  • 00:50:24
    try out a lot of different
  • 00:50:27
    scenarios to figure out what
  • 00:50:29
    actually makes some
  • 00:50:32
    prediction be one or zero, and you can
  • 00:50:36
    also take gradients through the PFN, so
  • 00:50:38
    you can,
  • 00:50:40
    basically, do gradient descent to find
  • 00:50:42
    the direction in which your label
  • 00:50:44
    changes, and lastly, yeah, you can do
  • 00:50:47
    something like aleatoric versus
  • 00:50:48
    epistemic uncertainty quantification, but
  • 00:50:50
    yeah, that's all on the horizon, basically,
  • 00:50:52
    i just wanted to tell you there's many
  • 00:50:55
    paths
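A sketch of the gradient-based interpretability idea: because the PFN is an ordinary differentiable network, one can take the gradient of a predicted class probability with respect to a test input to see in which direction the features would have to move to change the label. `pfn` is again a hypothetical pre-trained module mapping (train_x, train_y, test_x) to logits, not the authors' actual interface:

```python
import torch

def label_sensitivity(pfn, train_x, train_y, test_x, target_class):
    # Gradient of the target-class probability w.r.t. the test features.
    test_x = test_x.clone().requires_grad_(True)
    probs = torch.softmax(pfn(train_x, train_y, test_x), dim=-1)
    (grad,) = torch.autograd.grad(probs[:, target_class].sum(), test_x)
    return grad   # per-feature direction that increases the target-class probability
```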
  • 00:50:58
    cool
  • 00:50:58
    um
  • 00:51:00
    yeah that's it and i guess we can answer
  • 00:51:03
    some questions
Tags
  • Bayesian Inference
  • Transformers
  • Meta-Learning
  • Few-Shot Learning
  • Machine Learning
  • Automatic Machine Learning
  • Supervised Learning
  • Prior Data
  • Inference
  • Neural Networks