Samuel Mueller | "PFNs: Use neural networks for 100x faster Bayesian predictions"
Summary
TLDR: Samuel presented how transformers can be used for Bayesian inference in his work on automated machine learning. He explained how the model is trained on data sampled from a prior in order to optimize predictions, and how the method lets a transformer quickly produce accurate results on unseen data. He also presented experiments comparing the proposed method with traditional Bayesian approaches, showing that transformers can deliver comparable or better results in considerably less compute time. The discussion covered future research directions and ways to improve the method.
Takeaways
- 📊 Overall overview of Bayesian inference with transformers.
- ⚡ Fast and accurate predictions on unseen data.
- 🧠 Meta-learning uses datasets sampled from a prior for training.
- 🔍 Comparison between different machine learning methods.
- ⏳ The training process required only one day of compute.
- 📈 Direction toward better generalization with larger datasets.
- 📚 The importance of using appropriate priors.
- 💻 Potential for efficient automation of machine learning.
- 🧩 Interdisciplinary opportunities for future research.
Timeline
- 00:00:00 - 00:05:00
Introduction to the presentation: Samuel presents his work on transformer networks in the context of Bayesian inference and automated machine learning. The goal is to make fast and accurate predictions for previously unseen problems using standard transformer models.
- 00:05:00 - 00:10:00
Samuel describes the difference between supervised learning and Bayesian supervised learning, with a focus on calibration and interpretability. He introduces the idea of meta-learning, where a model learns to learn from multiple datasets.
- 00:10:00 - 00:15:00
Explanation of meta-learning, where an algorithm is trained on many datasets (meta-training) and then applied to a new, unseen dataset (meta-testing). The goal is to learn to classify objects quickly by exploiting experience from similar datasets.
- 00:15:00 - 00:20:00
Samuel explains Bayesian inference, where the relationship between inputs and outputs is modeled via a latent variable. He introduces the posterior predictive distribution, obtained via Bayes' rule.
- 00:20:00 - 00:25:00
The presentation turns to how standard transformer models can be adapted to perform Bayesian inference by approximating the posterior predictive distribution directly, rather than going through traditional Bayesian machinery.
- 00:25:00 - 00:30:00
Examples of training a network on data generated from a given prior, which makes it possible to create arbitrarily large training sets. This lets the transformer learn the prediction task efficiently from these synthetic datasets.
- 00:30:00 - 00:35:00
Discussion of the importance of permutation invariance and how the transformer's attention can be restricted so that the training set aggregates information about itself. Samuel stresses that hold-out points may only attend to the training set.
- 00:35:00 - 00:40:00
Samuel presents a concrete example of how data is assembled and how the method evaluates additional observations. He discusses the importance of an accurate representation of the output distribution for better inference.
- 00:40:00 - 00:45:00
Presentation of the test results, where different models are evaluated against each other. Samuel compares the transformer-based method with standard baselines, including a GP fitted by maximum likelihood.
- 00:45:00 - 00:51:04
Final discussion of the potential of transformer models for automating machine learning, with a focus on near-instant evaluation and how the method can scale to larger datasets in the future.
Video Q&A
What is the main focus of Samuel's presentation?
The presentation focuses on how transformers can be adapted to perform Bayesian inference for supervised learning tasks.
What are the key aspects of the transformer model discussed?
Key aspects include the training process, the relationship between meta-learning and Bayesian inference, and the advantages of using transformers for fast and accurate predictions.
What is few-shot learning as mentioned in the talk?
Few-shot learning refers to the process where an algorithm learns to classify unseen data based on prior knowledge from multiple similar datasets.
How does Bayesian inference relate to transformers?
Bayesian inference is leveraged in transformers to optimize predictions, allowing for better generalization on unseen data.
What experiments were conducted to validate the approach?
Experiments included training on Gaussian process and Bayesian neural network priors and comparing performance against traditional Bayesian approaches, showing that transformers can achieve comparable or better results.
What is the significance of using priors in this model?
Priors allow the generation of datasets for training and help in approximating true posterior distributions for accurate predictions.
How long was the transformer trained?
The transformer was trained for one day on a TPU.
What are the future directions suggested in the talk?
Future directions include exploring larger datasets, improving prior generations, and enhancing interpretability.
Is this method effective for larger datasets?
The experiments focused on datasets with up to about 1,000 examples, but the model generalized surprisingly well to larger datasets than it ever saw during training.
What does the presentation suggest about the speed of transformer predictions?
The model can produce predictions in a fraction of the time compared to traditional Bayesian methods.
- 00:00:02okay hello everyone um
- 00:00:05today is the last session before our
- 00:00:06summer break um
- 00:00:08and that's my great pleasure to
- 00:00:08introduce our today's speaker, Samuel. he's
- 00:00:11a PhD student at the University of Freiburg,
- 00:00:14um yeah, working on
- 00:00:16transformers and, you know, transformer
- 00:00:18foundation models, uh, also in the
- 00:00:20context of automated machine learning
- 00:00:22so yeah, with that, the floor is yours
- 00:00:26from here thank you very much erin um
- 00:00:29yeah and that's also what i would like
- 00:00:31to talk about today so
- 00:00:33that is how we can get our transformers
- 00:00:37uh pretty standard transformers um to do
- 00:00:41bayesian inference for us
- 00:00:43on any new problem that we never saw
- 00:00:45before, during inference
- 00:00:48um so i will just first outline what
- 00:00:52what kind of the problem we work on here
- 00:00:55just so that it's it's more clear
- 00:00:58what what the rest of this talk will be
- 00:00:59about so what we care about here is like
- 00:01:01supervised learning and specifically
- 00:01:04Bayesian supervised learning so
- 00:01:06uh we have a label data set uh the
- 00:01:09training set and we want to predict
- 00:01:11labels for unlabeled validation set
- 00:01:14and we care especially about calibration
- 00:01:17here interpretability
- 00:01:19but also of course we want to be
- 00:01:20accurate and lastly something that so
- 00:01:24far fell a little bit under the table in
- 00:01:26many Bayesian works: uh, we want to be fast
- 00:01:31so
- 00:01:32from here on i will start with a little
- 00:01:34bit of background first a little bit of
- 00:01:36few-shot learning um and then a
- 00:01:38little bit of bayesian learning but it
- 00:01:41should be easy i guess
- 00:01:44so first, few-shot learning, which is also
- 00:01:46called meta learning sometimes
- 00:01:49i don't know probably many of you know
- 00:01:51it already i'll still explain it very
- 00:01:53quickly
- 00:01:54um
- 00:01:55the idea is that we
- 00:01:57learn
- 00:01:58on one level higher so
- 00:02:00we're not given just a data set of
- 00:02:02images with labels where we're given a
- 00:02:05data set
- 00:02:07or a set of data sets
- 00:02:09so we're given many many different data
- 00:02:12sets
- 00:02:13um with classification tasks in this
- 00:02:15case so here
- 00:02:17we're given for example a cats-versus-
- 00:02:19birds classification data
- 00:02:21set, or a flowers-versus-bikes
- 00:02:23classification data set and the idea is
- 00:02:24now
- 00:02:26that an algorithm learns to learn or
- 00:02:28learns to classify here
- 00:02:30for a new
- 00:02:32unseen data set and that would be here
- 00:02:34the dog-versus-otter data set on the
- 00:02:36right never seen during training and the
- 00:02:39idea now is that it learned
- 00:02:41how to classify objects and can do so
- 00:02:44really fast now that's the few-shot part
- 00:02:47we can classify objects of course
- 00:02:48already with standard neural networks
- 00:02:50but the idea is that you learn to do it
- 00:02:53better by having available this set of
- 00:02:55other data sets that look similar
- 00:02:59and
- 00:03:00we call this first phase the meta
- 00:03:02training phase so the meta training is
- 00:03:04on on these many data sets and then we
- 00:03:06have the second phase the meta testing
- 00:03:08um and that is where we tried it out and
- 00:03:10never seen data sets before and we can
- 00:03:13formalize this as like during meta
- 00:03:16training we have a set of data sets
- 00:03:18available which i call d1 and d2 and so
- 00:03:20on here
- 00:03:21each consisting of pairs of
- 00:03:25inputs and outputs
- 00:03:26and
- 00:03:28then we have validation set there as
- 00:03:30well uh here one validation example only
- 00:03:33but it could be a set as well uh we have
- 00:03:35the same for meta testing
- 00:03:37um just a
- 00:03:38data set as well as uh validation sets
- 00:03:41and what we care about here like if we
- 00:03:43look at all these what's the thing that
- 00:03:45it predicted
- 00:03:46and what is predicted here is only this
- 00:03:49y hat, so, so the rest of these is
- 00:03:51used for training but all that is
- 00:03:53predicted here is the y at test time, that's
- 00:03:55the special thing
- 00:03:57we just learned from the left
- 00:03:59how to predict and we predict only on
- 00:04:01the right
- 00:04:02is this clear, like, what few-shot
- 00:04:04learning is in general
- 00:04:10and i guess since there's a few yeah you
- 00:04:13can easily interrupt me and this won't
- 00:04:15like throw me off track or
- 00:04:17anything um so don't hesitate to
- 00:04:20interrupt me please um
- 00:04:23the other background that i guess
- 00:04:26relevant is Bayesian inference especially
- 00:04:29for supervised learning
- 00:04:31and
- 00:04:32here we just generally consider a prior
- 00:04:35like a prior over some latent variable i
- 00:04:37call it t here but you can call it
- 00:04:39whatever
- 00:04:40and this t usually, or this latent, describes
- 00:04:43your
- 00:04:43your dependency between the input and
- 00:04:46the output so uh for example a bayesian
- 00:04:48neural network, i call it BNN here
- 00:04:51uh
- 00:04:52describes uh or says that the output
- 00:04:56should have a relationship to the input
- 00:04:57such that there is a neural network
- 00:05:00that's mapping the inputs to the outputs
- 00:05:03but you can be more creative than that
- 00:05:05in terms of what your prior is
- 00:05:10um
- 00:05:11and this is kind of how you given such a
- 00:05:14it's given such a late and this is how
- 00:05:16you traditionally solve the
- 00:05:18classification problem uh or like a
- 00:05:21standard supervised learning
- 00:05:22classification problems so we are given
- 00:05:23some data set d
- 00:05:25and we want to now predict our y
- 00:05:28and to do that we use uh
- 00:05:32bayes rule here
- 00:05:33to get the distribution or the
- 00:05:35distribution over the latent given the
- 00:05:37data and then we can
- 00:05:39integrate that out to get the
- 00:05:41distribution of the y and this is what
- 00:05:43we call the
- 00:05:44posterior predictive distribution and in
- 00:05:47this work we will actually approximate
- 00:05:53this way, we will not use Bayes', uh,
- 00:05:53this way we will not do base based uh
- 00:05:57formula or anything we will just do this
- 00:05:59directly um and maybe you see already
- 00:06:02the connection to few-shot learning
- 00:06:04here as we have a distribution that
- 00:06:06takes in a data set
- 00:06:09similar to
- 00:06:10the data sets here
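For reference, the two steps he is describing (Bayes' rule over the latent, then marginalization to get the posterior predictive distribution) can be written as follows; the symbol φ for the latent is my shorthand for the "t" he mentions:

```latex
p(\phi \mid D) \;=\; \frac{p(D \mid \phi)\, p(\phi)}{\int p(D \mid \phi')\, p(\phi')\, \mathrm{d}\phi'},
\qquad
p(y \mid x, D) \;=\; \int p(y \mid x, \phi)\, p(\phi \mid D)\, \mathrm{d}\phi .
```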
- 00:06:12okay so
- 00:06:14i will talk about this relationship
- 00:06:16between meta-learning and bayesian
- 00:06:17inference and we will exploit this for
- 00:06:21uh for the approximation of the
- 00:06:23posteriors for gaussian processes uh
- 00:06:25even with meta priors uh
- 00:06:28and for Bayesian neural networks and uh
- 00:06:30we'll get into that a little more
- 00:06:32and lastly also this is possible to use
- 00:06:35for Bayesian optimization even
- 00:06:38okay so so what is the approach in in in
- 00:06:41general and this is a little example of
- 00:06:44how to train
- 00:06:45a prior data fitted network that is the
- 00:06:48network that we propose
- 00:06:50um which is
- 00:06:52which which is a transformer in the end
- 00:06:56and
- 00:06:57how we how we train these is
- 00:06:59we we train on many different data sets
- 00:07:02similar to metal learning
- 00:07:04but here the data sets are generated by
- 00:07:06our prior so we define a prior just like
- 00:07:10for, uh,
- 00:07:11for Bayesian inference
- 00:07:13and we sample data from this prior so
- 00:07:15you can imagine
- 00:07:17uh a gp prior for example and this is
- 00:07:19where this data actually comes from as
- 00:07:21well so this is data that we sampled
- 00:07:23from a GP prior and this way we generate
- 00:07:26data sets so these are our x-y pairs
- 00:07:28this is the x axis this is the y so we
- 00:07:30want to label uh the the the cross here
- 00:07:34uh given the blue dots
- 00:07:36and
- 00:07:37we want to label the y of this cross
- 00:07:41and we generate data sets like this
- 00:07:43artificial data sets as many as we want
- 00:07:46it's usually in the millions
- 00:07:49and train our transformer, or our PFN, to
- 00:07:52predict this uh this this holdout
- 00:07:54example
- 00:07:55correctly so it learns to
- 00:07:59it does meta-learning, or it does um
- 00:08:01few-shot learning, but on, on data
- 00:08:04that comes from a prior
- 00:08:07and what is so cool about this is if we
- 00:08:09have now real data that comes let's say
- 00:08:12this is the function what what could
- 00:08:13this be the altitude of some some hilly
- 00:08:16region here and we want to predict the
- 00:08:18altitude for the point 0.8 here all we all
- 00:08:21we have to do is
- 00:08:23feed this data through our to our psn
- 00:08:23feed this data through our, to our PFN
- 00:08:28like uh like a mean expectation
- 00:08:31but uh or sorry a mean prediction
- 00:08:35but it will spit out a distribution over
- 00:08:37what is likely to be the mean so
- 00:08:40it will spit out the posterior for or
- 00:08:44the posterior predictive distribution
- 00:08:46for this particular prior that we chose
- 00:08:47to train on
- 00:08:53so actually good question here could you
- 00:08:55say could you go back to this plot sorry
- 00:08:57um so training and testing so
- 00:09:00you test them on um
- 00:09:03on the same functions that you use
- 00:09:04during training but just different data
- 00:09:07okay
- 00:09:09no you generate um
- 00:09:12so if you say testing you mean meta
- 00:09:14testing i assume well you say test here
- 00:09:16in your plot, so you have the training data
- 00:09:18points and test data points so
- 00:09:20um oh i am sorry okay so you mean this
- 00:09:22test data point here yeah these yeah oh
- 00:09:25they come from the same function yes
- 00:09:26okay but this is your meta testing this
- 00:09:28is your the testing for your meta
- 00:09:30training basically
- 00:09:32um
- 00:09:33these are basically just a
- 00:09:35examples that we use during training
- 00:09:37and then this side i would call meta
- 00:09:40testing okay
- 00:09:42okay i'm sorry the vocabulary is yeah it
- 00:09:45kind of can get mixed up between the
- 00:09:46meta and the not-meta level okay no
- 00:09:52orange cross i want to say because i
- 00:09:54wasn't so explicit about it these are
- 00:09:56all samples from the same prior and then
- 00:09:58we just randomly select what's the
- 00:09:59holdout the holdout is not sampled in
- 00:10:01any different way it's just like a
- 00:10:03random subset of our data set is that
- 00:10:05hold up
- 00:10:07uh louise
- 00:10:11um yeah i think maybe you'll get to this
- 00:10:14but i just wanted to quickly ask so can
- 00:10:16this actually satisfy the Kolmogorov
- 00:10:18extension theorem
- 00:10:21oh i don't know the Kolmogorov
- 00:10:22extension theorem for stochastic processes um
- 00:10:26like can you comment on
- 00:10:27sort of the the permutation invariants
- 00:10:31yeah we
- 00:10:32oh okay okay yeah we have we have that
- 00:10:35um i will get to that um
- 00:10:37and maybe you tell me if we satisfy the
- 00:10:39kolmogorov extension theorem um but we
- 00:10:42have permutation invariance for sure okay
- 00:10:44okay
- 00:10:45i'll just wait and listen then
- 00:10:47yeah it comes in like three slides or so
- 00:10:49the architecture
- 00:10:52um
- 00:10:54yeah so
- 00:10:55maybe this is not easy for you because
- 00:10:57the questions were already quite
- 00:10:58advanced
- 00:11:00but i will just say once again like
- 00:11:02written out in
- 00:11:04in formula
- 00:11:06what what we do
- 00:11:08we, we sample data sets, here they are
- 00:11:10called p(D), and we sample from our prior
- 00:11:13in some way we sample these data sets so
- 00:11:15this could be a GP prior or a Bayesian
- 00:11:17neural network prior, something like
- 00:11:19that we sample a lot of these data sets
- 00:11:21um this uppercase k here
- 00:11:24is in our experiments usually something
- 00:11:26like
- 00:11:27like a few million
- 00:11:29um so we sample a lot of these data sets
- 00:11:31because it's super cheap to sample them
- 00:11:33we can just do that over and over and
- 00:11:35over again
- 00:11:36um and then we train our PFN with just
- 00:11:40the very normal negative log likelihood
- 00:11:42or cross-entropy loss on on these data
- 00:11:45sets or on the pulled out part of these
- 00:11:47data sets
- 00:11:49and what we get out there is an
- 00:11:51approximation
- 00:11:53to the
- 00:11:55to the
- 00:11:56the true
- 00:11:57um posterior predictive distribution so
- 00:11:59to the bayesian
- 00:12:02posterior predictive distribution that
- 00:12:04that one cares about if one has this
- 00:12:05prior one wants this distribution um and
- 00:12:08we actually get an approximation to that
- 00:12:10uh on
- 00:12:11data that comes from the real world so
- 00:12:13not feeding in the same data again or so
- 00:12:16but like unseen data that the network
- 00:12:18never saw before so what we have here is
- 00:12:20a network that will do
- 00:12:22that we'll do a form of bayesian
- 00:12:23prediction um
- 00:12:25or they will do an approximation of
- 00:12:27Bayesian prediction
- 00:12:28um as the forward pass so you just give
- 00:12:30it the data and the hold out examples
- 00:12:32and it will do Bayesian prediction for
- 00:12:34you uh there's no no gradient or
- 00:12:36anything involved there
- 00:12:38at inference time or at this
- 00:12:41meta-inference time
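A minimal sketch of the training loop described here, assuming a classification prior so the labels are integer class indices; `model` and `sample_dataset_from_prior` are placeholders, not the authors' code (for regression you would first map y to buckets, see the discretization sketch further down):

```python
import torch
import torch.nn.functional as F

def train_pfn(model, sample_dataset_from_prior, steps=1_000_000, lr=1e-4):
    """Prior-data fitted network training: only synthetic datasets sampled
    from the prior are ever seen; no real data and no test-time gradients."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x, y = sample_dataset_from_prior()            # x: (n, d) float, y: (n,) long
        cut = torch.randint(1, len(x), (1,)).item()   # random train / hold-out split
        # Predict a distribution over the hold-out labels, conditioned on the rest.
        logits = model(x[:cut], y[:cut], x[cut:])     # (n - cut, n_classes)
        loss = F.cross_entropy(logits, y[cut:])       # negative log-likelihood
        opt.zero_grad(); loss.backward(); opt.step()
```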
- 00:12:43and yeah just very quick because this is
- 00:12:46actually
- 00:12:47quite easy connection um
- 00:12:50so this is just the connection to to
- 00:12:53bayesian inference and the connection
- 00:12:54basically just boils down to
- 00:12:57the loss that i just showed you
- 00:12:59before or the generalization of that
- 00:13:02loss over
- 00:13:03over the whole expectation of these data
- 00:13:05sets um
- 00:13:07actually
- 00:13:08is in a meaningful way
- 00:13:10connected to the kl divergence
- 00:13:13of our
- 00:13:14posterior approximation to the real
- 00:13:16posterior so it's a
- 00:13:18it's the
- 00:13:21it's the mean kl divergence of our
- 00:13:24approximation to the real posterior uh
- 00:13:26predictive distribution plus some
- 00:13:28constant
- 00:13:31and we can optimize this directly that's
- 00:13:32the cool thing so we optimize directly
- 00:13:35to become the posterior predictive
- 00:13:37distribution
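Written out, the connection he refers to is roughly the following (my notation: q_θ is the PFN, p(D) the prior over datasets); minimizing the expected negative log-likelihood is the same as minimizing an expected KL divergence to the true posterior predictive distribution:

```latex
\mathbb{E}_{(x,\,y,\,D) \sim p(\mathcal{D})}\!\left[-\log q_\theta(y \mid x, D)\right]
\;=\;
\mathbb{E}_{(x,\,D)}\!\left[\mathrm{KL}\!\left(p(\cdot \mid x, D)\,\big\|\,q_\theta(\cdot \mid x, D)\right)\right] + \text{const}.
```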
- 00:13:39okay and now
- 00:13:40the part luis asked about i think and
- 00:13:44that is what is our model and
- 00:13:47our model is uh is the transformer
- 00:13:50where we uh throw away
- 00:13:52the
- 00:13:53the uh the positional encodings and that
- 00:13:56is actually pretty much enough to make
- 00:13:58the transformer positioner or
- 00:14:01sorry permutation invariant
- 00:14:04and then the other thing we do is we
- 00:14:06restrict the attention such that
- 00:14:08our
- 00:14:10our holdout example can only look at the
- 00:14:12data set
- 00:14:14so can attend to the data set but the
- 00:14:16data set can only attend to other points
- 00:14:18in the dataset so the dataset actually
- 00:14:20aggregates information about itself or
- 00:14:23the sorry i said data set i mean
- 00:14:25training set in this case and the hold-
- 00:14:27out example can then look things up in
- 00:14:30in these building um
- 00:14:32building up representations and here we
- 00:14:35have actually two hold-out examples
- 00:14:36before i always just showed one but in
- 00:14:38general we train with multiple for
- 00:14:40efficiency reasons
- 00:14:42and these are separated such that each
- 00:14:44one of these can only attend to the
- 00:14:46training set and not to each other i
- 00:14:48mean that that generally was pretty bad
- 00:14:50if they could do that because
- 00:14:52in general they should not depend on
- 00:14:54each other
- 00:14:55that's i think if you know transformers
- 00:14:58already i know this is kind of assuming
- 00:14:59you know a little like a lot about
- 00:15:01transformers but if you know a lot about
- 00:15:03uh like if you know how transformers
- 00:15:05work i think this should
- 00:15:07like describe the whole model um maybe
- 00:15:10one more thing is how we encode the x's
- 00:15:12and y's
- 00:15:13in in our case it's just linear layers
- 00:15:15so we
- 00:15:16so we just put the x's in the linear
- 00:15:18layer and the y's in the linear layer
- 00:15:20and add them together if we want to
- 00:15:21encode a pair or we don't have the y if
- 00:15:24we don't want to encode the y because we
- 00:15:26don't want to give away the information
- 00:15:27for the hold-out examples
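A minimal sketch of the architecture as described: linear encoders for x and (x, y) pairs, a standard transformer encoder with no positional encodings, and an attention mask under which every token may attend to the training set (and itself) while hold-out points stay independent of each other. All names are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class TinyPFN(nn.Module):
    def __init__(self, d_x, d_model=256, n_buckets=1000, nhead=4, nlayers=6):
        super().__init__()
        self.enc_x = nn.Linear(d_x, d_model)   # linear encoder for inputs
        self.enc_y = nn.Linear(1, d_model)     # linear encoder for labels
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, nlayers)  # no positional encodings anywhere
        self.head = nn.Linear(d_model, n_buckets)

    def forward(self, x_train, y_train, x_test):
        n_tr, n_te = len(x_train), len(x_test)
        # Training pairs are encoded as enc(x) + enc(y); hold-out points as enc(x) only,
        # so their labels are never revealed to the model.
        tokens = torch.cat([self.enc_x(x_train) + self.enc_y(y_train[:, None].float()),
                            self.enc_x(x_test)], dim=0)[None]      # (1, n_tr + n_te, d_model)
        n = n_tr + n_te
        mask = torch.ones(n, n, dtype=torch.bool)                   # True = may NOT attend
        mask[:, :n_tr] = False                                      # everyone sees the training set
        mask[torch.arange(n), torch.arange(n)] = False              # ... and itself
        out = self.backbone(tokens, mask=mask)
        return self.head(out[0, n_tr:])   # one set of bucket logits per hold-out point
```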
- 00:15:34so one thing that is missing in what i
- 00:15:37described here is
- 00:15:39how exactly this distribution up here
- 00:15:41works so i just say yeah what comes out
- 00:15:43here
- 00:15:45will be this posterior distribution but
- 00:15:47i
- 00:15:49uh i didn't explain how
- 00:15:51and the traditional way would be to to
- 00:15:53predict uh
- 00:15:54something like a gaussian here some
- 00:15:56normal distribution predict the mean and
- 00:15:58the variance we tried that and it didn't
- 00:16:01work and what worked much better was
- 00:16:03like a
- 00:16:04discretization of the space so
- 00:16:07so this is our space we care about
- 00:16:08predictions in uh so we want to predict
- 00:16:11uh things that that lie within let's
- 00:16:14hear say let's say minus three up to
- 00:16:15three
- 00:16:16and what we do now is we discretize the
- 00:16:18space into little buckets
- 00:16:20and put a soft max in front so we just
- 00:16:23classify
- 00:16:24uh basically uh things are
- 00:16:27either or we classify
- 00:16:29things to be in one of these buckets we
- 00:16:33this is something we that just worked
- 00:16:35out of the box pretty much and you don't
- 00:16:37need to tune very much
- 00:16:39usually 1 000 buckets is a good number
- 00:16:42um
- 00:16:44this is a problem though because if you
- 00:16:46don't know this probability for your
- 00:16:48prior so sometimes maybe a point lies
- 00:16:50here like below minus three you get like
- 00:16:53a minus infinity loss which
- 00:16:55would be bad
- 00:16:56so very simple hack to
- 00:16:59have good uh good losses it's just to
- 00:17:02replace the sides with, like, half-normals
- 00:17:03so now
- 00:17:04we actually have support for
- 00:17:07minus infinity to plus infinity um we
- 00:17:09have support everywhere
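A sketch of the discretized ("bucketized") output head he describes, with equal-width buckets for simplicity; the half-normal tails he mentions are noted but omitted, and the exact parameterization here is a guess rather than the paper's code:

```python
import torch

borders = torch.linspace(-3.0, 3.0, steps=1001)   # 1000 equal-width buckets on [-3, 3]

def bucketize(y):
    """Map continuous targets to bucket indices, the classification targets."""
    return torch.clamp(torch.searchsorted(borders, y) - 1, 0, len(borders) - 2)

def log_density(logits, y):
    """Log p(y) under the piecewise-constant predictive distribution:
    probability mass of the bucket divided by its width.
    (The talk additionally replaces the two outermost buckets with
    half-normal tails so the support is all of R; omitted here.)"""
    idx = bucketize(y)
    log_probs = torch.log_softmax(logits, dim=-1)
    widths = borders[1:] - borders[:-1]
    return log_probs.gather(-1, idx[..., None]).squeeze(-1) - torch.log(widths[idx])
```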
- 00:17:12other questions about the model because
- 00:17:14then we could already start into the
- 00:17:16experiments next that's so actually this
- 00:17:18side i haven't fully understood so
- 00:17:20basically instead of a particular mean
- 00:17:22and the variance you predict
- 00:17:24quantiles
- 00:17:25or something like that yeah instead of
- 00:17:27predicting a mean and a variance we
- 00:16:28basically uh we we have a softmax in
- 00:17:31the end and we predict in which one of
- 00:17:33these buckets our
- 00:17:35our y will land
- 00:17:37so these are the bucket the first bucket
- 00:17:39goes from -3 to minus 1.5 for example
- 00:17:42and then we give you the probability and
- 00:17:43say 0.1
- 00:17:46likelihood that it landed in this bucket
- 00:17:48okay so basically make it a
- 00:17:50classification problem
- 00:17:51right exactly
- 00:17:53why not why not predicting quantiles so
- 00:17:57why not what predicting the quantile so
- 00:18:02um i i didn't really think about this uh
- 00:18:05so you mean like predicting the borders
- 00:18:07of these
- 00:18:08quantiles
- 00:18:09or the quantiles of the distribution or
- 00:18:12yeah
- 00:18:13okay this is this data point yeah
- 00:18:14exactly um
- 00:18:16the quantile question
- 00:18:20i mean the point is to make it a
- 00:18:22classification problem right
- 00:18:24because transformers are good at that
- 00:18:27right tim
- 00:18:29yeah yeah yeah yeah for sure for sure i
- 00:18:31mean
- 00:18:32okay i think still one could probably do
- 00:18:34like zero to one and then the bars again
- 00:18:38but uh okay that's probably not what you
- 00:18:39mean uh yeah i i didn't try the quantile
- 00:18:43but i had the feeling that the
- 00:18:44regression head works much worse with
- 00:18:46transformers or with neural networks
- 00:18:48maybe even in general no that's too much
- 00:18:50of a claim that's used in many places
- 00:18:51but with transformers it seemed to work
- 00:18:53much worse
- 00:18:54you could still make the classification
- 00:18:56problem, but it just then predicts,
- 00:18:58okay, this, this point is, like, within the
- 00:19:00ten percent quantile, twenty percent
- 00:19:01quantile, and so on.
- 00:19:02yeah for sure you can
- 00:19:04um
- 00:19:06we kind of do that actually in a way
- 00:19:08because we how we define these bars is
- 00:19:11we take a big example from our prior
- 00:19:11we take a big sample from our prior
- 00:19:17quantile so for every one percent type
- 00:19:19of quantile we have one bar
- 00:19:21so we will have for normal just like for
- 00:19:24gaussian process prior for example we
- 00:19:25have smaller bars in the center and
- 00:19:27wider on the sides so we kind of do
- 00:19:29quantity prediction i guess okay i see
- 00:19:29quantile prediction i guess okay i see
- 00:19:33and and as a remark here i think it
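The border construction he sketches (one bucket per prior quantile, so narrow buckets where the prior puts a lot of mass and wide ones in the tails) could look like this; `sample_targets_from_prior` is a placeholder:

```python
import numpy as np

def quantile_borders(sample_targets_from_prior, n_buckets=1000, n_samples=100_000):
    """Bucket borders chosen as quantiles of a large sample of prior targets,
    so every bucket carries roughly 1/n_buckets of the prior mass."""
    ys = sample_targets_from_prior(n_samples)
    return np.quantile(ys, np.linspace(0.0, 1.0, n_buckets + 1))
```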
- 00:19:35would be a really interesting project if
- 00:19:37anyone is interested in this to
- 00:19:40look into
- 00:19:41whether we can
- 00:19:43just in generally and in general do
- 00:19:45regression like this and whether in
- 00:19:47general this is a good idea to to do
- 00:19:49regression
- 00:19:51like this with neural networks because
- 00:19:54sometimes we tried regression with
- 00:19:55neural networks and it didn't work all
- 00:19:57that well
- 00:19:58and um yeah this could be a generic
- 00:20:01approach for
- 00:20:02just how to do regression with neural
- 00:20:04nets but um
- 00:20:05yeah we haven't looked into it if
- 00:20:07someone would like to do that um
- 00:20:09maybe together that'd be great
- 00:20:11get in touch
- 00:20:15yeah
- 00:20:17okay so any more questions
- 00:20:21just a quick question do you have any
- 00:20:22insight on
- 00:20:24let's say at a finite collection of test
- 00:20:27points
- 00:20:28um do you have any insight about what
- 00:20:31the marginal looks like because you know
- 00:20:33in in Gaussian processes we have these
- 00:20:35guarantees about the marginal
- 00:20:37distribution at a finite collection
- 00:20:39they're jointly gaussian
- 00:20:42is that even kind of useful thing to
- 00:20:44think about in this setting and if so do
- 00:20:47you have any insight on what what those
- 00:20:48things look like
- 00:20:50so you mean the joint distribution of
- 00:20:52multiple points
- 00:20:53yeah exactly so so the the joint
- 00:20:56distribution at a finite collection of
- 00:20:57points will always be a multivariate
- 00:21:00gaussian right in the case by definition
- 00:21:02in the case of gaussian processes
- 00:21:05yeah we actually
- 00:21:06don't know this, like, empirically
- 00:21:09okay um
- 00:21:10i mean what
- 00:21:12i didn't look into the joint
- 00:21:13distributions i have to say so i can't
- 00:21:15really tell you much about this we can
- 00:21:17say that each of these distributions
- 00:21:19looks very much like a Gaussian so if you
- 00:21:20like pull up the distribution they will
- 00:21:22match pretty exactly but just like by
- 00:21:24eye because in the end one is like bars
- 00:21:27and the other is like a like a like a
- 00:21:29continuous line
- 00:21:31um but for the joints i don't know i
- 00:21:34guess empirically you could calculate
- 00:21:36the covariances at any pair of points
- 00:21:39and you can kind of see
- 00:21:40how it's distributed i think that would
- 00:21:42be kind of interesting
- 00:21:44and i i would think it would look like a
- 00:21:47gaussian thing just because the prior
- 00:21:49you're learning from is is gaussian
- 00:21:51um
- 00:21:52another quick question i had is uh in
- 00:21:54the context of
- 00:21:56like uh marginal likelihood um just
- 00:21:59tuning the hyper parameters as you would
- 00:22:02in a traditional gp
- 00:22:03um like do you have do you consider when
- 00:22:06when you're drawing your prior data sets
- 00:22:08different
- 00:22:09uh length scales or kernel amplitudes
- 00:22:13uh variances things like this or do you
- 00:22:15just keep it fixed and does that
- 00:22:17actually then
- 00:22:19from learning from that prior does that
- 00:22:21actually
- 00:22:22um
- 00:22:23generalize well
- 00:22:25yeah we'll get to that um we do a lot of
- 00:22:27that so we do a lot of this like mixing
- 00:22:29different different prior hyper
- 00:22:31parameters right
- 00:22:34but we'll get to that in the experiment
- 00:22:36okay cool
- 00:22:37thanks. but but maybe to answer luis's
- 00:22:40first question um
- 00:22:42first um
- 00:22:44so i i do think we know what the
- 00:22:45marginals look like they look
- 00:22:48pretty much exactly like what we would
- 00:22:50expect um with the true posterior
- 00:22:53distribution
- 00:22:54um
- 00:22:55so if we train on a prior that comes
- 00:22:58from a gp then well our predictions will
- 00:23:01look pretty much exactly like a gaussian
- 00:23:03because well that that's a kl divergence
- 00:23:05we're optimizing between these two
- 00:23:06distributions um
- 00:23:09the joint of
- 00:23:11multiple predictions
- 00:23:13well actually that will look very
- 00:23:15different
- 00:23:16um we're just not going to predict any
- 00:23:19cross-correlations here
- 00:23:21because that that's what sam said we
- 00:23:23predict independently for x4 and for x5
- 00:23:26if you want to join then you would have
- 00:23:28to predict for x4 put it into the
- 00:23:30training data do another forward prop
- 00:23:32and get the prediction for x5 and then i
- 00:23:34would expect that we get pretty nice
- 00:23:37joined
- 00:23:38gaussians but that we haven't tried yet
- 00:23:43i see gotcha
- 00:23:48thanks okay
- 00:23:50um then we jump into the experiments and
- 00:23:53the experiments will actually be first
- 00:23:55experiments of the original paper that
- 00:23:57was also i think in the investigation so
- 00:23:57was also, i think, in the invitation, so
- 00:23:59that's 'transformers can do Bayesian
- 00:24:00inference', and then we'll get to
- 00:24:04to archive so very fresh out of the oven
- 00:24:08you'll see
- 00:24:09um
- 00:24:10so first
- 00:24:12uh the motivating experiment is we train
- 00:24:14on a gaussian
- 00:24:15prior and want to see that our network
- 00:24:19actually does something
- 00:24:20that flower creates something that looks
- 00:24:20that actually creates something that looks
- 00:24:24and we actually do get that that was uh
- 00:24:27that was nice back then um and so what
- 00:24:30we see on the left here is the black
- 00:24:33dots are are
- 00:24:35our given examples or the training that
- 00:24:35our given examples, or the training data,
- 00:24:38this is known up front and
- 00:24:41now we're interested in the
- 00:24:43posterior distribution of our of our
- 00:24:45gaussian process given these points and
- 00:24:48so here we plot the density and we can
- 00:24:50see that this is pretty smooth there are
- 00:24:53some edges in there but i think this is
- 00:24:54mostly due to x being in buckets
- 00:24:58not so much why because we use a lot of
- 00:25:01uh we use a lot of packets here like
- 00:25:03that 1000 or so
- 00:25:05um
- 00:25:07this is not so hard to compare to a,
- 00:25:09to a GP, like for a GP we know the
- 00:25:11posterior, right, so
- 00:25:12it's actually a nice motivating example
- 00:25:14because we can compare to the true
- 00:25:16predictions unlike for many other priors
- 00:25:18uh we can we can hear compared to the to
- 00:25:21the ground truth
- 00:25:23and this is on the right here
- 00:25:26in blue it's the pfn and in green we
- 00:25:29have the gp
- 00:25:30and you see that at least for these
- 00:25:32simple examples they uh they match very
- 00:25:36exactly there are some differences we
- 00:25:38mark them here with arrows but in
- 00:25:40general it looks like
- 00:25:42it fits it pretty pretty well and even
- 00:25:45the confidence interval looks looks
- 00:25:47pretty similar
- 00:25:50we can do this also here for another
- 00:25:53length scale to show that yeah even the
- 00:25:56length scale is different it also still
- 00:25:57works and with more points uh same
- 00:26:00experiment
- 00:26:01and
- 00:26:03and now we compared to the attentive neural
- 00:26:06processes, that is in red here, and
- 00:26:08that's kind of what was there before
- 00:26:10and we can see that yeah there is a
- 00:26:13clear difference the transformer
- 00:26:14architecture seems to be really good at
- 00:26:16this task surprisingly compared to an
- 00:26:19architecture that was invented for
- 00:26:21few-shot learning
- 00:26:25and yeah now i would like to show you a
- 00:26:27little demo
- 00:26:29um you can find the link to this little
- 00:26:31demo actually in the paper or on the
- 00:26:34github
- 00:26:35so if you want to try it yourself you
- 00:26:37can
- 00:26:38so
- 00:26:40i think that's so yeah let's see so in
- 00:26:42this little demo we basically create the
- 00:26:44same figures we had before
- 00:26:47uh with exactly the same legend again
- 00:26:50but we can create our own and we could
- 00:26:53now for example here create a point in
- 00:26:56the middle somewhere
- 00:26:57i wanted to try an example before but no
- 00:26:59i didn't really and make it much higher
- 00:27:02because they're pretty much on one line
- 00:27:03so that it's maybe a little more
- 00:27:04interesting
- 00:27:09okay so we have something something like
- 00:27:10this and then okay
- 00:27:13i always press new column that's a
- 00:27:15problem you press new column right
- 00:27:21what did i type here 0.9
- 00:27:23and now a new row and now you can do
- 00:27:260.6 and we can see like if we make it
- 00:27:29more extreme it will be harder for the
- 00:27:31for the network
- 00:27:37let's see what with this so it should
- 00:27:39have like a steep stand let's see if it
- 00:27:41still can model it uh it still can model
- 00:27:44it but you can see that there are more
- 00:27:45errors than before i would say
- 00:27:48for sure um and
- 00:27:51and you can like this is
- 00:27:53this is due to being much more out of
- 00:27:55the distribution than
- 00:27:57and
- 00:27:59like the first example but yeah you can
- 00:28:01play around with this demo if you like
- 00:28:02as well
- 00:28:05yeah frank posted the link very nice
- 00:28:07cool let's go back to the presentation
- 00:28:15so
- 00:28:15now we look at uh
- 00:28:18not anymore at nice plots, but, well, you
- 00:28:20know, these are still nice, of course,
- 00:28:22all of these are nice, but um now
- 00:28:25we look at um
- 00:28:26performance plot uh here we see the
- 00:28:29negative likelihood so as we've shown
- 00:28:31like as we can show that this is a
- 00:28:33measure for the kl divergence between
- 00:28:35our posterior and the true posterior um
- 00:28:39we use the negative log likelihood to
- 00:28:41measure our method uh on prior data
- 00:28:44and we compare it here with the true gp
- 00:28:46so this is again an example where we
- 00:28:47have the true true gp so we don't do
- 00:28:50anything where we can't get the true
- 00:28:52posterior, up to, here, two thousand
- 00:28:54examples
- 00:28:56and
- 00:28:57what we can see is that we have
- 00:28:59different pfns and this is the number of
- 00:29:01data sets they've seen during training
- 00:29:03so this uppercase k from back then
- 00:29:06um
- 00:29:08and what we can see is that
- 00:29:09the PFNs like always get better
- 00:29:12with more training data like a very
- 00:29:14traditional result i guess in machine
- 00:29:16learning but um you can see they they
- 00:29:18they get better and better with more
- 00:29:20training and more training and we can
- 00:29:21see similar things actually with the
- 00:29:23transformer size um not as strong but
- 00:29:26similar
- 00:29:30quick question
- 00:29:32um
- 00:29:32do you have some insight about uh like
- 00:29:36at what point you reach
- 00:29:37point of diminishing returns like i
- 00:29:40assume with all things transformers you
- 00:29:42have you have these power laws and uh
- 00:29:45more is always more data is always
- 00:29:46better and so on but let's say i wanted
- 00:29:48to deploy this myself i wanted to train
- 00:29:50this from scratch by myself
- 00:29:52and i don't have all the time or compute
- 00:29:54budget in the world um like how do you
- 00:29:57have some intuition about how
- 00:30:00how high or how much data to go to
- 00:30:03and and at what point it's maybe just
- 00:30:05not worth it anymore like you hit the
- 00:30:07point of diminishing returns
- 00:30:10so i would say this generally depends a
- 00:30:13lot on your prior so if you have a very
- 00:30:15simple prior of course you hit that
- 00:30:16point earlier so probably one of the
- 00:30:20examples where you hit it earlier would
- 00:30:21be something like the sculpture process
- 00:30:23here
- 00:30:24um
- 00:30:25and
- 00:30:26in general like what i can see is that
- 00:30:28training for a day or one gpu
- 00:30:28training for a day on one GPU
- 00:30:34training for five days will be like a
- 00:30:36little bit better but not much
- 00:30:39this is something it can give you as
- 00:30:41like a
- 00:30:42so
- 00:30:43yeah that's that's so far
- 00:30:45how how we scale we hope we find some
- 00:30:47way to make better use of more compute
- 00:30:49but i think so far
- 00:30:51like after one day or
- 00:30:53yeah we spent sometimes five days on the
- 00:30:55training but for this first paper it was
- 00:30:57only one day
- 00:30:59and generally this is also where it
- 00:31:00doesn't get much better
- 00:31:03and just quick question on the gp prior
- 00:31:05that you use to generate the data sets
- 00:31:07like do you just always have uh you
- 00:31:10uniformly sample um
- 00:31:12a fixed number of
- 00:31:14input
- 00:31:15like observed input points and then just
- 00:31:18one test point
- 00:31:20and you just yeah um domain or this is
- 00:31:22very so
- 00:31:24so yeah for the gp prior exactly since
- 00:31:26the gp prior doesn't really entail
- 00:31:28access we have to sign up for the access
- 00:31:28x's, we have to sample the x's
- 00:31:32and we do
- 00:31:34yeah pretty much what you say we sample
- 00:31:36uniformly at random or from uh from a
- 00:31:39standard normal
- 00:31:41and
- 00:31:43we
- 00:31:44what we do though is that what is
- 00:31:46important is to train for different uh
- 00:31:49for different numbers of data points
- 00:31:50because you want to generalize among
- 00:31:53different data set sizes
- 00:31:56um and so we we sample uniformly at
- 00:31:58random how many
- 00:32:00examples are
- 00:32:01training set and how many are tested
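A minimal sketch of this data-generation step for a GP prior, using scikit-learn's RBF kernel; the standard-normal x's and the varying dataset/split sizes follow what is said here, the rest is illustrative:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

def sample_gp_dataset(n_max=100, d=5, length_scale=0.5, noise=1e-4, rng=None):
    """One synthetic dataset from a GP prior: x's from a standard normal,
    y's jointly Gaussian with the kernel matrix as covariance."""
    rng = rng or np.random.default_rng()
    n = int(rng.integers(2, n_max + 1))          # vary the dataset size during training
    x = rng.standard_normal((n, d))
    cov = RBF(length_scale)(x) + noise * np.eye(n)
    y = rng.multivariate_normal(np.zeros(n), cov)
    cut = int(rng.integers(1, n))                # first `cut` points act as the training set
    return (x[:cut], y[:cut]), (x[cut:], y[cut:])
```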
- 00:32:07okay
- 00:32:07and
- 00:32:09do do you do this for higher dimensions
- 00:32:12or is it just the one dimensional
- 00:32:14problem
- 00:32:16uh i'm sorry this is missing here but
- 00:32:18this is for example for five dimensions
- 00:32:20this plot we do it for high dimensions
- 00:32:22as well yeah
- 00:32:23we go
- 00:32:25i think our largest experiment like like
- 00:32:28close to 800 dimensions like 784
- 00:32:32okay
- 00:32:33using the same strategy
- 00:32:36so do you use, um, use a fixed kernel for
- 00:32:39your GP prior, or do you use, um, yeah,
- 00:32:42different types of kernels, different
- 00:32:43length scales, noise and so on
- 00:32:45uh yeah i'm sorry i i just see that i
- 00:32:47threw out the plot where i where i use
- 00:32:50different length scales and kernels um
- 00:32:52but in this plot this is just the same
- 00:32:54length scale kernel but you can do the
- 00:32:57same um
- 00:32:58by
- 00:32:59just varying the length scale as you
- 00:33:01sample like for each dataset use a
- 00:33:02different length scale distribution over
- 00:33:04these and we did that as well so the
- 00:33:06so-called hyper priors in the literature
- 00:33:09and um that works as well but we don't
- 00:33:12have the nice baseline anymore
- 00:33:13because there's no or there's no easy
- 00:33:16way to get the
- 00:33:17to get the correct um to get the correct
- 00:33:20posterior, we can only approximate it
- 00:33:22there are only approximations for it out
- 00:33:24there
- 00:33:24so it's a little harder to compare. i'm
- 00:33:27just wondering because if you if you
- 00:33:28then sample different length scales and
- 00:33:30potentially different colors and your
- 00:33:32prior gets more more uninformative right
- 00:33:34and then it's you know
- 00:33:36it yeah what is your model to learn
- 00:33:39right here
- 00:33:40in the extreme case you have like a
- 00:33:42super uninformative prior and then yeah
- 00:33:44well the model is not going to help
- 00:33:45anymore because it just models everything
- 00:33:48yeah for sure that's that's that's the
- 00:33:50trade-off there's some sweet spot yeah
- 00:33:54yeah you don't want to make it too broad
- 00:33:56but so far
- 00:33:58yeah generally so far we saw making it
- 00:33:59broader can can help in many
- 00:34:01applications
- 00:34:03um so because i think so far priors are
- 00:34:05pretty restricted
- 00:34:07with this really traditional vi or mcmc
- 00:34:10methods because we have to have these
- 00:34:12like easy to interpret latents or easy
- 00:34:14to generate latent
- 00:34:15ah
- 00:34:16we don't need these anymore
- 00:34:18cool
- 00:34:19um yeah so this is uh since we talked
- 00:34:22about it, this is a Bayesian NN then, or
- 00:34:24no we didn't really talk about it
- 00:34:26this is the
- 00:34:28bayesian neural network and we we do the
- 00:34:30approximation there here as well and
- 00:34:32what we look here at is um just to show
- 00:34:34you the time scales that's actually
- 00:34:37quite interesting so the svi baseline
- 00:34:39here is the
- 00:34:40base by backdrop or like a newer
- 00:34:40Bayes by Backprop or like a newer
- 00:34:44nuts like an mcmc
- 00:34:47sampler
- 00:34:48and
- 00:34:51these are two other approximations to
- 00:34:52the to the prior to the posterior for
- 00:34:56bayesian neural networks and we compare
- 00:34:59our method to these so this is our
- 00:35:00method and it takes
- 00:35:03we can't really change the time it takes
- 00:35:05for us to predict the label for a data
- 00:35:07set because it's a forward pass we can't
- 00:35:09really decide to only do half of our
- 00:35:11forward pass, also that's why we have a dot
- 00:35:14here so our method takes like uh like a
- 00:35:17hundredth of a second to
- 00:35:20to predict uh the labels for a given
- 00:35:22task and then we looked at different
- 00:35:24budgets
- 00:35:25uh for our baselines and we saw that
- 00:35:27they reach our performance after using
- 00:35:30like something like 1000 x of the time
- 00:35:34compared to us
- 00:35:36which was yeah
- 00:35:37quite cool
- 00:35:38and and we can do this like we do this a
- 00:35:41and b here it's just both be an end but
- 00:35:41and b here, it's just both BNNs but
- 00:35:45smaller than the right and generally
- 00:35:47both of them are really small because
- 00:35:49otherwise you can't do nuts
- 00:35:52so
- 00:35:53in this case are you just uh that the
- 00:35:56data that you're learning from the data
- 00:35:57sets that you generate are these just
- 00:35:59literally um
- 00:36:01you you have a bnn with some with
- 00:36:04different random initializations and
- 00:36:06that's how you're generating the data
- 00:36:07sets
- 00:36:08so yeah we have an mlp with random
- 00:36:10initialization and then we generate our
- 00:36:13x's randomly like before for the gps and
- 00:36:16feed them through and generate our y's
- 00:36:18that way
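A minimal sketch of the BNN-prior sampling he just described (a freshly initialized MLP per dataset, random x's, labels read off the outputs); turning the outputs into class labels via argmax is my assumption for the classification case:

```python
import torch
import torch.nn as nn

def sample_bnn_dataset(n=100, d=5, n_classes=2, hidden=64):
    """One synthetic classification dataset from a 'BNN prior':
    sample a random MLP, sample random inputs, run them through the net."""
    net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, n_classes))
    with torch.no_grad():
        x = torch.randn(n, d)
        y = net(x).argmax(dim=-1)   # discretize the outputs into class labels (an assumption)
    return x, y
```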
- 00:36:19okay and just to clarify what i'm
- 00:36:21looking at here this is so this is the
- 00:36:23test likelihood i'm assuming on on some
- 00:36:25regression or some
- 00:36:28uh
- 00:36:29problem like classification or
- 00:36:30regression problems or is this um this
- 00:36:33is um yeah so sorry this is uh
- 00:36:36this is actually classification problems
- 00:36:39um and these are classification problems
- 00:36:42sampled from the prior so it's still in
- 00:36:45prior with like we
- 00:36:46to get rid of compounding factors like
- 00:36:49choosing a choosing a data set that
- 00:36:50might not lie within the prior, we,
- 00:36:53we tested on unseen data sets from the
- 00:36:55prior
- 00:36:58okay thanks
- 00:37:00and yes there's no fine tuning here
- 00:37:01right so you wouldn't
- 00:37:03you do the you train your transformer
- 00:37:05and then
- 00:37:06it's basically just testing so you
- 00:37:07wouldn't update your
- 00:37:10uh the transformer based on this new
- 00:37:11data set right
- 00:37:13no yeah the weights stay the same
- 00:37:17it's yeah it's one training and then
- 00:37:18only feed forward
- 00:37:22don't know feed forward the wrong word
- 00:37:24forward
- 00:37:26so yeah okay here a little lookout um
- 00:37:28it's actually yeah
- 00:37:30yeah it it just to show you this is the
- 00:37:32average regret and we can do uh we can
- 00:37:35do we compare here the gp here is a
- 00:37:38sorry we do bayesian optimization in
- 00:37:39this plot and we compared to a gp that
- 00:37:42uses MLE-II so that's kind of the
- 00:37:45standard approach right now in Bayesian
- 00:37:47optimization for
- 00:37:48to approximate these uh these type of
- 00:37:51prior
- 00:37:52gps
- 00:37:53um
- 00:37:54and uh
- 00:37:56we compare our pfn against that and
- 00:37:58against random search, and we train our
- 00:38:00PFN on the exact same prior the GP has
- 00:38:02so the expectation would be that they
- 00:38:05should be similar because it's two
- 00:38:06similar approximations to the same
- 00:38:08problem
- 00:38:09and this is only for like up to 15
- 00:38:12trials so relatively small
- 00:38:14examples um
- 00:38:17and both use ei so we can actually use
- 00:38:20ei on top of our pfn so that's expected
- 00:38:23improvements like one specific
- 00:38:24acquisition function and what we could
- 00:38:26see there is that our method performs
- 00:38:29actually very similar to
- 00:38:31to the gp and the cool thing now is yeah
- 00:38:33you know you can get creative and build
- 00:38:35new priors that are not possible with
- 00:38:37the gp to to even beat it and that's
- 00:38:40what i'm working on right now actually
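Under the bucketized predictive distribution, expected improvement has a simple closed form; a sketch assuming `probs` are the predicted bucket probabilities at a candidate point and `centers` are the bucket mid-points (maximization setting):

```python
import torch

def expected_improvement(probs, centers, y_best):
    """EI(x) = E[max(0, y - y_best)], approximating each bucket by its mid-point."""
    return (probs * torch.clamp(centers - y_best, min=0.0)).sum(dim=-1)
```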
- 00:38:44then the last part we're already at 42
- 00:38:47but yeah i guess we don't really need
- 00:38:48discussion up there
- 00:38:50um
- 00:38:53uh the last part is the TabPFN
- 00:38:55so that's the paper we just
- 00:38:57just submitted and also just uploaded to
- 00:38:59archive
- 00:39:01and um
- 00:39:04and that is basically one pfn so one of
- 00:39:07these prior data fitted networks one
- 00:39:10transformer one could also say
- 00:39:13in which we train a whole AutoML
- 00:39:15method so we
- 00:39:17want to make this one thing um be good
- 00:39:20at
- 00:39:21tabular automl in our case like the
- 00:39:23AutoML on tabular datasets
- 00:39:28what's cool about this is that makes
- 00:39:30AutoML pretty much instant
- 00:39:32our method performs comparably to like
- 00:39:35state-of-the-art automl methods that get
- 00:39:38five minutes of time while we only take
- 00:39:40a second
- 00:39:43and
- 00:39:44this
- 00:39:45simplifies AutoML, at least from the
- 00:39:47view of a deep learner, from the
- 00:39:49viewpoint of a deep learner because
- 00:39:51before
- 00:39:52you had a setup like this for example uh
- 00:39:56where we have some meta-learning,
- 00:39:57Bayesian optimization etc etc which is
- 00:40:01all
- 00:40:02replaced now by a single neural network
- 00:40:06plus some pre-processing, i have to admit,
- 00:40:08but the pre-processing is something like
- 00:40:09normalizing data so i hope that's
- 00:40:12allowed
- 00:40:14and
- 00:40:16one caveat here i want to give up front
- 00:40:19is that we
- 00:40:20looked mostly at small datas or we
- 00:40:22looked at small data sets so we care
- 00:40:24only about data sets up to 1000 examples
- 00:40:28okay so
- 00:40:30now
- 00:40:32let's talk about the prior so as i said
- 00:40:35before you can get creative with these
- 00:40:36priors because we just have to be able
- 00:40:38to sample data sets from them so what is
- 00:40:40it even a prior i mean you can also call
- 00:40:41it like a simulation of a real data so
- 00:40:44um
- 00:40:45and here we just like started from
- 00:40:47scratch we want to be able to
- 00:40:50yield a transformer that's just good at
- 00:40:52uh it's good at classification um on
- 00:40:55tabular datasets so yeah of course we
- 00:40:57wanted to be simple
- 00:40:58so
- 00:40:59we thought like
- 00:41:01always we should have something in there
- 00:41:02that like forces our um or latents or
- 00:41:07whatever the latent is um to to
- 00:41:09represent simple solutions that can be
- 00:41:11described with few words
- 00:41:15and where we took inspiration is from
- 00:41:17the
- 00:41:18from the Bayesian neural networks, so on the, on the
- 00:41:21left here
- 00:41:22that's the standard Bayesian neural network, so
- 00:41:24how we talked about it with luis uh
- 00:41:26where
- 00:41:27where we just like during training we
- 00:41:30just sample the bayesian uh sorry we
- 00:41:32just sampled the weight
- 00:41:33from from a standard initialization
- 00:41:36and now feed random data through it to
- 00:41:39get, generate y's, and this is now our
- 00:41:41data set
- 00:41:42one data set
- 00:41:44and um
- 00:41:46what we what we can do then on top of
- 00:41:48that is the
- 00:41:49what we call the NAS part, but only in,
- 00:41:52it's in quotation marks, uh, the NAS part
- 00:41:55and that is uh we can actually sample
- 00:41:57the architecture as well so we we build
- 00:42:00a posterior over different architectures
- 00:42:03not only
- 00:42:04a single architecture like a normal
- 00:42:05vision enhance but but multiple
- 00:42:05Bayesian neural network has, but but multiple
- 00:42:09chooses the architecture that's most
- 00:42:11likely or we have like an ensemble over
- 00:42:13architectures
- 00:42:14and that was though in the earlier paper
- 00:42:17and now in the new TabPFN paper we took
- 00:42:19that one step further
- 00:42:20and and looked at structural causal
- 00:42:23models which can give some
- 00:42:26uh some some some nice uh
- 00:42:29buildup i think for our prior
- 00:42:32and how they work is basically
- 00:42:34these are pretty much pruned
- 00:42:37or, yeah, i guess, for you,
- 00:42:39these are pruned Bayesian neural networks
- 00:42:42and now we just change where the input
- 00:42:45is and where the outputs are the outputs
- 00:42:47and the inputs don't have to be at the
- 00:42:48beginning or the end anymore we have
- 00:42:50random inputs these are like these
- 00:42:52epsilons here
- 00:42:54here these epsilons
- 00:42:56and um uh like if i say input now i mean
- 00:42:59input to the to the to the neural
- 00:43:02network, but our inputs that we feed to
- 00:43:04the transformer, to the PFN,
- 00:43:06then later can for example be here at
- 00:43:08the output node it doesn't really matter
- 00:43:10where they are or the the y can actually
- 00:43:13influence the x um like the
- 00:43:16x can can come from the y so it could be
- 00:43:19before so they're basically just at
- 00:43:20random points in this graph and the
- 00:43:22integration comes from these are called
- 00:43:25the links so these like
- 00:43:27the the the arrows boil down to um
- 00:43:30weighted sums then and and we think of
- 00:43:33these as like the causal links for the
- 00:43:35next thing um and
- 00:43:39this way we can generate graphs with
- 00:43:41like few causal links and generate data
- 00:43:44that at least for us look
- 00:43:46very much like real data
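A toy version of the structural-causal-model prior he sketches: a random sparse DAG, each node a squashed weighted sum of its parents plus noise, with feature columns and the label read off random nodes of the graph. This is an illustrative simplification, not the actual TabPFN prior:

```python
import numpy as np

def sample_scm_dataset(n=100, n_nodes=8, d_features=4, rng=None):
    rng = rng or np.random.default_rng()
    # Random sparse DAG: node j may depend on any earlier node i < j.
    weights = np.triu(rng.standard_normal((n_nodes, n_nodes)), k=1)
    weights *= rng.random((n_nodes, n_nodes)) < 0.3      # drop most edges ("few causal links")
    values = np.zeros((n, n_nodes))
    for j in range(n_nodes):
        noise = rng.standard_normal(n)                   # the epsilon inputs
        values[:, j] = np.tanh(values @ weights[:, j] + noise)
    # Features and the label are read off random nodes of the graph.
    cols = rng.choice(n_nodes, size=d_features + 1, replace=False)
    x, y_raw = values[:, cols[:-1]], values[:, cols[-1]]
    y = (y_raw > np.median(y_raw)).astype(int)           # turn the label node into classes
    return x, y
```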
- 00:43:49um there's a lot of action in the chats
- 00:43:51today
- 00:43:52okay frank
- 00:43:53that's good
- 00:43:54um
- 00:43:56well in the end you could refer to it
- 00:43:57but maybe let's push it okay
- 00:44:00okay
- 00:44:02and
- 00:44:04yeah so what i mean with it looks like
- 00:44:05real data these are the covariance
- 00:44:07matrices of
- 00:44:08a real data set that we found in our
- 00:44:10benchmark compared to a data set that we
- 00:44:12generate and we see these covariant
- 00:44:14matrices they actually look pretty much
- 00:44:16the same build clusters and have other
- 00:44:19parts that are really not very
- 00:44:20correlated
- 00:44:22it's similar in a way so
- 00:44:24here it's on the on the right it's
- 00:44:26synthetic data and on the left we have
- 00:44:28the real data
- 00:44:31cool i hope it's kind of clear what this
- 00:44:34prior is all about and
- 00:44:38we can
- 00:44:39jump to
- 00:44:41the experiment to the experiments and
- 00:44:44what we did for experiments we have like
- 00:44:46one
- 00:44:46main experiment here and that is based
- 00:44:49on the OpenML CC-18 suite so we didn't
- 00:44:52want to cherry pick our tabular data
- 00:44:54sets because i think that actually
- 00:44:55happens pretty easily there are many
- 00:44:57available and it's not clear which one
- 00:44:58is the ImageNet of tabular data sets
- 00:45:01so
- 00:45:02it can happen easily that you cherry
- 00:45:04pick that's why we use use the benchmark
- 00:45:07that's out there and as i said before we
- 00:45:10uh we restricted ourselves to only 1000
- 00:45:13examples so from this benchmark we threw
- 00:45:15away
- 00:45:16all the data sets that have over 1 000
- 00:45:19um training examples
- 00:45:22and we were left with
- 00:45:24uh like around two thirds of the data
- 00:45:27sets from this from this benchmark so
- 00:45:29this was still around
- 00:45:3130 data sets i think or so
- 00:45:34that we had in the end
- 00:45:36and
- 00:45:38here we compared to like a few of the
- 00:45:41modern autumn l frameworks as well as
- 00:45:41modern AutoML frameworks as well as
- 00:45:44boosted trees and some very
- 00:45:48logistic regression and k nearest
- 00:45:46simple methods like GPs and
- 00:45:53here is um
- 00:45:55in each plot we have
- 00:45:57on the on the x-axis we have the time so
- 00:45:59how much time does it take to to get
- 00:46:02results on a data set on average for
- 00:46:04these real-world data sets that we have
- 00:46:06from the OpenML CC-18
- 00:46:08benchmark and what performance do we get
- 00:46:11in that time so on the very left we have
- 00:46:13the, the ROC AUC
- 00:46:15um for these uh like the average of the
- 00:46:18ROC AUCs
- 00:46:20for these uh different data sets and we
- 00:46:22can see we're here on the very far left
- 00:46:24and we get comparable results only like
- 00:46:27after like 100x time or so
- 00:46:31um by methods like Auto-sklearn 2
- 00:46:34and AutoGluon.
- 00:46:36we also compared in terms of the number
- 00:46:39of wins and the number of or
- 00:46:41average ranking and we see a similar
- 00:46:43result like
- 00:46:45we are going to have the best rank and
- 00:46:48win at the very beginning but even then
- 00:46:50given one hour which we don't use we use
- 00:46:52only one second uh we are we have
- 00:46:54comparable performance
- 00:46:57so
- 00:46:58how long did it take to train the
- 00:46:59transformer so this
- 00:47:01was fantastic
- 00:47:02this particular transformer was trained
- 00:47:04for one day on a TPU
- 00:47:09um but it can be so
- 00:47:11yeah the
- 00:47:12it can be used now for all kinds of data
- 00:47:14sets but it's only trained once and uh
- 00:47:16it's published and
- 00:47:17you can now use it for your other data
- 00:47:19sets that we never saw before
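The released model is available as the `tabpfn` Python package with a scikit-learn-style interface; a minimal usage sketch (the package and class name are real, but check the repository for the current API, and the constructor argument here is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier(device="cpu")   # one forward pass per prediction, no gradient steps
clf.fit(X_tr, y_tr)                    # "fitting" just stores the data as the context
print(clf.predict_proba(X_te)[:3])
```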
- 00:47:22i guess you you would the comparable
- 00:47:24question would be how long did it take
- 00:47:26to um
- 00:48:28develop Auto-sklearn 2 and i guess
- 00:47:30that's a couple of years so
- 00:47:32um it's it's really a one-time training
- 00:47:36um as part of algorithm development
- 00:47:40and and when we come up with a new prior
- 00:47:42then that's sort of like coming up with
- 00:47:43a new framework
- 00:47:45right but but i mean okay i think that's
- 00:47:47that's out of the scope of this of this
- 00:47:49paper and maybe also
- 00:47:51besides the point but a fair comparison
- 00:47:53would be maybe you know how would
- 00:47:55um all the other ml with transfo uh
- 00:47:58transfer learning work here right so um
- 00:48:00but maybe but it's there's no transfer
- 00:48:03it's just running on artificial data
- 00:48:06yeah that's training
- 00:48:07okay i see yeah yeah
- 00:48:09it never saw a real tabular data set
- 00:48:11before
- 00:48:12okay yeah i see a point yeah yeah true
- 00:48:16okay then at the last uh cherry like a
- 00:48:20last little cool thing that uh we don't
- 00:48:23really understand though
- 00:48:24is um we then tried it later with longer
- 00:48:27data sets so data sets that have more
- 00:48:29training examples that we ever
- 00:48:30considered during training because like
- 00:48:32like i said before we sample these
- 00:48:34artificial data sets during training and
- 00:48:35we
- 00:48:36in this case we sample data sets of the
- 00:48:38length
- 00:48:391024 we always sampled
- 00:48:41um
- 00:48:42so it should learn to to be able to
- 00:48:44predict for everything smaller or equal
- 00:48:46to one one thousand twenty four examples
- 00:48:49but we later run on larger data sets
- 00:48:51from the AutoML benchmark and um it
- 00:48:54actually finally still generalized and
- 00:48:57uh and could
- 00:48:58gain improvements
- 00:49:00uh after after the point so it can
- 00:49:03somehow generalize and make use of
- 00:49:04larger data sets that never saw during
- 00:49:06training which is i guess a quite
- 00:49:08interesting feature of transformers
- 00:49:13okay and now this is the final slide
- 00:49:16already and that's the lookout and uh
- 00:49:18what what one could use this for and the
- 00:49:22and what's still out there and uh the
- 00:49:25first thing, the Bayesian optimization
- 00:49:26which we which we work on and uh then of
- 00:49:30course larger data sets important i said
- 00:49:31we always restrict ourselves to 1000
- 00:49:34examples and as you know training a
- 00:49:36transformer with very long sequences can
- 00:49:38be a problem and that's why we restrict
- 00:49:40ourselves to something like that
- 00:49:43but we're working on more
- 00:49:45yeah making it larger and also if you're
- 00:49:47interested reached out and and then
- 00:49:50there's this creative thing of inventing
- 00:49:52new priors and what
- 00:49:54has really described the world really
- 00:49:56well uh where we where we are able to
- 00:49:58maybe generate data but we can't for
- 00:50:00example compute the probability of
- 00:50:02different data points or so because we
- 00:50:04don't need to we just need to sample now
- 00:50:06um
- 00:50:08and
- 00:50:09yeah other things are the parametric
- 00:50:10posterior approximation which would be
- 00:50:12pretty much simulation-based inference
- 00:50:14and um interpretability is also
- 00:50:18something that could be interesting here
- 00:50:20because you can do forward passes much
- 00:50:22faster than before so you can actually
- 00:50:24try out a lot of different uh different
- 00:50:27scenarios to figure out what
- 00:50:29what actually makes uh make some
- 00:50:32prediction be one or zero and um you can
- 00:50:36also take gradients to the pfn so
- 00:50:38you can figure out like you can
- 00:50:40basically do gradient descent to find
- 00:50:42the direction in which your label
- 00:50:44changes and lastly uh yeah you can do
- 00:50:47something like aleatoric versus
- 00:50:48epistemic uncertainty quantification but
- 00:50:50yeah that's all on the horizon basically
- 00:50:52i just wanted to tell you there's many
- 00:50:55paths
- 00:50:58cool
- 00:50:58um
- 00:51:00yeah that's it and i guess we can answer
- 00:51:03some questions
- Bayesian Inference
- Transformers
- Meta-Learning
- Few-Shot Learning
- Machine Learning
- Automatic Machine Learning
- Supervised Learning
- Prior Data
- Inference
- Neural Networks