Samuel Mueller | "PFNs: Use neural networks for 100x faster Bayesian predictions"

00:51:04
https://www.youtube.com/watch?v=XnngBWe2WYE

Summary

TLDR: Samuel presented how transformers can be used for Bayesian inference in his work on automated machine learning. He explained how the model is trained on data sampled from a prior to optimize predictions, and how the method lets transformers quickly produce accurate results on unseen data. Experiments were also presented comparing the performance of the proposed method with traditional Bayesian approaches, showing that transformers can deliver comparable or better results with considerably less compute time. The discussion covers future research directions and possibilities for improving the method.

Takeaways

  • 📊 Overall overview of Bayesian inference with transformers.
  • ⚡ Fast and accurate predictions on unseen data.
  • 🧠 Meta-learning trains on synthetic datasets sampled from a prior.
  • 🔍 Comparison between different machine-learning methods.
  • ⏳ The training process required only one day of compute.
  • 📈 Direction towards better generalization with larger datasets.
  • 📚 The importance of using appropriate priors.
  • 💻 Potential for efficient automation of machine learning.
  • 🧩 Interdisciplinary opportunities for future research.

Timeline

  • 00:00:00 - 00:05:00

    Introduction to today's presentation, in which Samuel presents his work on transformer networks in the context of Bayesian inference and automated machine learning. The goal is to make fast and accurate predictions for unseen problems using standard transformer models.

  • 00:05:00 - 00:10:00

    Samuel describes the difference between supervised learning and Bayesian supervised learning, with a focus on calibration and interpretability. He introduces the idea of 'meta-learning', where the model learns to learn from multiple datasets.

  • 00:10:00 - 00:15:00

    Explanation of meta-learning, where an algorithm is trained on several datasets (meta-training) and then applied to a new, unseen dataset (meta-testing). The goal is to learn to classify objects quickly by drawing on earlier, similar datasets.

  • 00:15:00 - 00:20:00

    Samuel explains Bayesian inference, where a relationship between input and output is established via a latent variable. He shows how Bayes' rule is used to obtain the posterior predictive distribution.

  • 00:20:00 - 00:25:00

    The presentation now turns to how standard transformer models can be adapted to perform Bayesian inference by approximating the posterior predictive distribution directly, rather than applying traditional Bayesian methods.

  • 00:25:00 - 00:30:00

    Examples of training a network on data generated from a given prior, which makes it possible to create arbitrarily many datasets for training. This allows the transformer to learn quickly and efficiently from these datasets.

  • 00:30:00 - 00:35:00

    Discussion of the importance of permutation invariance and how the transformer can be designed so that the training set aggregates information about itself. Samuel stresses the importance of restricting the held-out points' attention to the data in the training set.

  • 00:35:00 - 00:40:00

    Samuel presents a concrete example of how the data is assembled and how the method evaluates additional observations. He discusses the importance of an accurate representation of the output distribution for better inference.

  • 00:40:00 - 00:45:00

    Presentation of the test results, in which different models are evaluated against each other. Samuel compares transformer-based methods with other standard methods, such as a GP with maximum-likelihood hyperparameter estimation.

  • 00:45:00 - 00:51:04

    Final discussion of the potential of using transformer models to automate machine learning, with a focus on real-time evaluation and how the methods can be scaled to larger datasets in the future.



Video Q&A

  • What is the main focus of Samuel's presentation?

    The presentation focuses on how transformers can be adapted to perform Bayesian inference for supervised learning tasks.

  • What are the key aspects of the transformer model discussed?

    Key aspects include the training process, the relationship between meta-learning and Bayesian inference, and the advantages of using transformers for fast and accurate predictions.

  • What is few-shot learning as mentioned in the talk?

    Few-shot learning refers to the process where an algorithm learns to classify unseen data based on prior knowledge from multiple similar datasets.

  • How does Bayesian inference relate to transformers?

    The transformer is trained to directly approximate the Bayesian posterior predictive distribution, which yields calibrated, well-generalizing predictions on unseen data.

  • What experiments were conducted to validate the approach?

    Experiments included training on Gaussian priors and comparing performance against traditional Bayesian approaches, showing that transformers can achieve comparable or better results.

  • What is the significance of using priors in this model?

    Priors allow the generation of datasets for training and help in approximating true posterior distributions for accurate predictions.

  • How long was the transformer trained?

    The transformer was trained for roughly one day on a single GPU.

  • What are the future directions suggested in the talk?

    Future directions include exploring larger datasets, improving prior generations, and enhancing interpretability.

  • Is this method effective for larger datasets?

    The method generalizes to unseen data, but the TabPFN experiments focused on small datasets (up to about 1,000 examples); scaling to larger datasets is named as future work.

  • What does the presentation suggest about the speed of transformer predictions?

    The model can produce predictions in a fraction of the time compared to traditional Bayesian methods.


Transcript
  • 00:00:02
    okay hello everyone um
  • 00:00:05
    today is the last session before our
  • 00:00:06
    summer break um
  • 00:00:08
    and that's my great pleasure to
  • 00:00:09
    introduce our today speaker samuel he's
  • 00:00:11
    a phd student at the university of freiburg
  • 00:00:14
    um yeah working on
  • 00:00:16
    transformers and you know transformers
  • 00:00:18
    foundation models uh also in the
  • 00:00:20
    context of automated machine learning
  • 00:00:22
    so yeah with that the floor is yours
  • 00:00:26
    from here thank you very much erin um
  • 00:00:29
    yeah and that's also what i would like
  • 00:00:31
    to talk about today so
  • 00:00:33
    that is how we can get our transformers
  • 00:00:37
    uh pretty standard transformers um to do
  • 00:00:41
    bayesian inference for us
  • 00:00:43
    on any new problem that we never saw
  • 00:00:45
    before during inference
  • 00:00:48
    um so i will just first outline what
  • 00:00:52
    what kind of the problem we work on here
  • 00:00:55
    just so that it's it's more clear
  • 00:00:58
    what what the rest of this talk will be
  • 00:00:59
    about so what we care about here is like
  • 00:01:01
    supervised learning and specifically
  • 00:01:04
    bayesian supervised learning so
  • 00:01:06
    uh we have a label data set uh the
  • 00:01:09
    training set and we want to predict
  • 00:01:11
    labels for unlabeled validation set
  • 00:01:14
    and we care especially about calibration
  • 00:01:17
    here interpretability
  • 00:01:19
    but also of course we want to be
  • 00:01:20
    accurate and lastly something that so
  • 00:01:24
    far fell a little bit under the table in
  • 00:01:26
    many bayesian works uh we want to be fast
  • 00:01:31
    so
  • 00:01:32
    from here on i will start with a little
  • 00:01:34
    bit of background first a little bit of
  • 00:01:36
    few-shot learning um and then a
  • 00:01:38
    little bit of bayesian learning but it
  • 00:01:41
    should be easy i guess
  • 00:01:44
    so first few-shot learning which is also
  • 00:01:46
    called meta learning sometimes
  • 00:01:49
    i don't know probably many of you know
  • 00:01:51
    it already i'll still explain it very
  • 00:01:53
    quickly
  • 00:01:54
    um
  • 00:01:55
    the idea is that we
  • 00:01:57
    learn
  • 00:01:58
    on one level higher so
  • 00:02:00
    we're not given just a data set of
  • 00:02:02
    images with labels but we're given a
  • 00:02:05
    data set
  • 00:02:07
    or a set of data sets
  • 00:02:09
    so we're given many many different data
  • 00:02:12
    sets
  • 00:02:13
    um with classification tasks in this
  • 00:02:15
    case so here
  • 00:02:17
    we're given for example a cat versus
  • 00:02:19
    bird classification
  • 00:02:21
    data set or a flowers versus bikes
  • 00:02:23
    classification data set and the idea is
  • 00:02:24
    now
  • 00:02:26
    that an algorithm learns to learn or
  • 00:02:28
    learns to classify here
  • 00:02:30
    for a new
  • 00:02:32
    unseen data set and that would be here
  • 00:02:34
    the dog versus otter data set on the
  • 00:02:36
    right never seen during training and the
  • 00:02:39
    idea now is that it learned
  • 00:02:41
    how to classify objects and can do so
  • 00:02:44
    really fast now that's the few-shot part
  • 00:02:47
    we can classify objects of course
  • 00:02:48
    already with standard neural networks
  • 00:02:50
    but the idea is that you learn to do it
  • 00:02:53
    better by having available this set of
  • 00:02:55
    other data sets that look similar
  • 00:02:59
    and
  • 00:03:00
    we call this first phase the meta
  • 00:03:02
    training phase so the meta training is
  • 00:03:04
    on on these many data sets and then we
  • 00:03:06
    have the second phase the meta testing
  • 00:03:08
    um and that is where we tried it out and
  • 00:03:10
    never seen data sets before and we can
  • 00:03:13
    formalize this as like during meta
  • 00:03:16
    training we have a set of data sets
  • 00:03:18
    available which i call d1 and d2 and so
  • 00:03:20
    on here
  • 00:03:21
    each consisting of pairs of
  • 00:03:25
    inputs and outputs
  • 00:03:26
    and
  • 00:03:28
    then we have validation set there as
  • 00:03:30
    well uh here one validation example only
  • 00:03:33
    but it could be a set as well uh we have
  • 00:03:35
    the same for meta testing
  • 00:03:37
    um just a
  • 00:03:38
    data set as well as uh validation sets
  • 00:03:41
    and what we care about here like if we
  • 00:03:43
    look at all these what's the thing that
  • 00:03:45
    it predicted
  • 00:03:46
    and what is predicted here is only this
  • 00:03:49
    white head so so the rest of these is
  • 00:03:51
    used for training but all that is
  • 00:03:53
    predicted here is the white test that's
  • 00:03:55
    the special thing
  • 00:03:57
    we just learned from the left
  • 00:03:59
    how to predict and we predict only on
  • 00:04:01
    the right
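    A compact formalization of the few-shot setup just described; the notation is illustrative and not taken from the slides:

        % Meta-training: a collection of K labeled datasets
        \mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{n_k}, \qquad k = 1, \dots, K
        % the model f_\theta learns to predict a held-out label from the rest of the dataset
        \hat{y}_*^k = f_\theta\big(x_*^k,\; \mathcal{D}_k \setminus \{(x_*^k, y_*^k)\}\big)
        % Meta-testing: a new, unseen dataset and query; only y_* is predicted
        \hat{y}_* = f_\theta\big(x_*,\; \mathcal{D}_{\mathrm{test}}\big)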
  • 00:04:02
    is this clear how like what few-shot
  • 00:04:04
    learning is in general
  • 00:04:10
    and i guess since there's a few yeah you
  • 00:04:13
    can easily interrupt me and this won't
  • 00:04:15
    like throw me off track or
  • 00:04:17
    anything um so don't hesitate to
  • 00:04:20
    interrupt me please um
  • 00:04:23
    the other background that i guess
  • 00:04:26
    relevant is bayesian inference especially
  • 00:04:29
    for supervised learning
  • 00:04:31
    and
  • 00:04:32
    here we just generally consider a prior
  • 00:04:35
    like a prior over some latent variable i
  • 00:04:37
    call it t here but you can call it
  • 00:04:39
    whatever
  • 00:04:40
    and this t usually or this theta describes
  • 00:04:43
    your
  • 00:04:43
    your dependency between the input and
  • 00:04:46
    the output so uh for example a bayesian
  • 00:04:48
    neural network i call it bnn here
  • 00:04:51
    uh
  • 00:04:52
    describes uh or says that the output
  • 00:04:56
    should have a relationship to the input
  • 00:04:57
    such that there is a neural network
  • 00:05:00
    that's mapping the inputs to the outputs
  • 00:05:03
    but you can be more creative than that
  • 00:05:05
    in terms of what your prior is
  • 00:05:10
    um
  • 00:05:11
    and this is kind of how you given such a
  • 00:05:14
    or given such a latent and this is how
  • 00:05:16
    you traditionally solve the
  • 00:05:18
    classification problem uh or like a
  • 00:05:21
    standard supervised learning
  • 00:05:22
    classification problems so we are given
  • 00:05:23
    some data set d
  • 00:05:25
    and we want to now predict our y
  • 00:05:28
    and to do that we use uh
  • 00:05:32
    bayes rule here
  • 00:05:33
    to get the distribution or the
  • 00:05:35
    distribution over the latent given the
  • 00:05:37
    data and then we can
  • 00:05:39
    integrate that out to get the
  • 00:05:41
    distribution of the y and this is what
  • 00:05:43
    we call the
  • 00:05:44
    posterior predictive distribution and in
  • 00:05:47
    this work we will actually approximate
  • 00:05:49
    this thing here directly we will not go
  • 00:05:53
    this way we will not use bayes uh
  • 00:05:57
    formula or anything we will just do this
  • 00:05:59
    directly um and maybe you see already
  • 00:06:02
    the connection to few-shot learning
  • 00:06:04
    here as we have a distribution that
  • 00:06:06
    takes in a data set
  • 00:06:09
    similar to
  • 00:06:10
    the data sets here
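    Written out, the two routes he contrasts (standard formulas, with phi standing for the latent he calls t):

        % traditional route: Bayes' rule for the posterior over the latent ...
        p(\phi \mid D) = \frac{p(D \mid \phi)\, p(\phi)}{\int p(D \mid \phi')\, p(\phi')\, d\phi'}
        % ... then integrate the latent out to get the posterior predictive distribution (PPD)
        p(y \mid x, D) = \int p(y \mid x, \phi)\, p(\phi \mid D)\, d\phi
        % PFN route: approximate the PPD directly with a network q_theta, skipping both steps
        q_\theta(y \mid x, D) \approx p(y \mid x, D)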
  • 00:06:12
    okay so
  • 00:06:14
    i will talk about this relationship
  • 00:06:16
    between meta-learning and bayesian
  • 00:06:17
    inference and we will exploit this for
  • 00:06:21
    uh for the approximation of the
  • 00:06:23
    posteriors for gaussian processes uh
  • 00:06:25
    even with meta priors uh
  • 00:06:28
    and for bayesian neural networks and uh
  • 00:06:30
    we'll get into that a little more
  • 00:06:32
    and lastly also this is possible to use
  • 00:06:35
    for bayesian optimization even
  • 00:06:38
    okay so so what is the approach in in in
  • 00:06:41
    general and this is a little example of
  • 00:06:44
    how to train
  • 00:06:45
    a prior data fitted network that is the
  • 00:06:48
    network that we propose
  • 00:06:50
    um which is
  • 00:06:52
    which which is a transformer in the end
  • 00:06:56
    and
  • 00:06:57
    how we how we train these is
  • 00:06:59
    we we train on many different data sets
  • 00:07:02
    similar to meta learning
  • 00:07:04
    but here the data sets are generated by
  • 00:07:06
    our prior so we define a prior just like
  • 00:07:10
    the one
  • 00:07:11
    for bayesian inference
  • 00:07:13
    and we sample data from this prior so
  • 00:07:15
    you can imagine
  • 00:07:17
    uh a gp prior for example and this is
  • 00:07:19
    where this data actually comes from as
  • 00:07:21
    well so this is data that we sampled
  • 00:07:23
    from a gp prior and this way we generate
  • 00:07:26
    data sets so these are our x-y pairs
  • 00:07:28
    this is the x axis this is the y so we
  • 00:07:30
    want to label uh the the the cross here
  • 00:07:34
    uh given the blue dots
  • 00:07:36
    and
  • 00:07:37
    we want to label the y of this cross
  • 00:07:41
    and we generate data sets like this
  • 00:07:43
    artificial data sets as many as we want
  • 00:07:46
    it's usually in the millions
  • 00:07:49
    and train our transformer or our pfn to
  • 00:07:52
    predict this uh this this holdout
  • 00:07:54
    example
  • 00:07:55
    correctly so it learns to
  • 00:07:59
    it does meta learning or it does um
  • 00:08:01
    few-shot learning but on a on data
  • 00:08:04
    that comes from a prior
  • 00:08:07
    and what is so cool about this is if we
  • 00:08:09
    have now real data that comes let's say
  • 00:08:12
    this is the function what what could
  • 00:08:13
    this be the altitude of some some hilly
  • 00:08:16
    region here and we want to predict the
  • 00:08:18
    altitude for the point 0.8 here all we all
  • 00:08:21
    we have to do is
  • 00:08:23
    feed this data through to our pfn
  • 00:08:26
    and it will spit out not only like a
  • 00:08:28
    like uh like a mean expectation
  • 00:08:31
    but uh or sorry a mean prediction
  • 00:08:35
    but it will spit out a distribution over
  • 00:08:37
    what is likely to be the mean so
  • 00:08:40
    it will spit out the posterior for or
  • 00:08:44
    the posterior predictive distribution
  • 00:08:46
    for this particular prior that we chose
  • 00:08:47
    to train on
  • 00:08:53
    so actually good question here could you
  • 00:08:55
    say could you go back to this plot sorry
  • 00:08:57
    um so training and testing so
  • 00:09:00
    you test them on um
  • 00:09:03
    on the same functions that you use
  • 00:09:04
    during training but just different data
  • 00:09:07
    okay
  • 00:09:09
    no you generate um
  • 00:09:12
    so if you say testing you mean meta
  • 00:09:14
    testing i assume well you say test here
  • 00:09:16
    in your plot so you have your training
  • 00:09:18
    points and test data points so
  • 00:09:20
    um oh i am sorry okay so you mean this
  • 00:09:22
    test data point here yeah these yeah oh
  • 00:09:25
    they come from the same function yes
  • 00:09:26
    okay but this is your meta testing this
  • 00:09:28
    is your the testing for your meta
  • 00:09:30
    training basically
  • 00:09:32
    um
  • 00:09:33
    these are basically just a
  • 00:09:35
    examples that we use during training
  • 00:09:37
    and then this side i would call meta
  • 00:09:40
    testing okay
  • 00:09:42
    okay i'm sorry the vocabulary is yeah it
  • 00:09:45
    kind of can get mixed up between the
  • 00:09:46
    meta and the not-meta level okay no
  • 00:09:51
    [Music]
  • 00:09:52
    orange cross i want to say because i
  • 00:09:54
    wasn't so explicit about it these are
  • 00:09:56
    all samples from the same prior and then
  • 00:09:58
    we just randomly select what's the
  • 00:09:59
    holdout the holdout is not sampled in
  • 00:10:01
    any different way it's just like a
  • 00:10:03
    random subset of our data set that is the
  • 00:10:05
    holdout
  • 00:10:07
    uh luis
  • 00:10:11
    um yeah i think maybe you'll get to this
  • 00:10:14
    but i just wanted to quickly ask so can
  • 00:10:16
    this actually satisfy the kolmogorov
  • 00:10:18
    extension theorem
  • 00:10:21
    oh i don't know the kolmogorov
  • 00:10:22
    extension for stochastic processes um
  • 00:10:26
    like can you comment on
  • 00:10:27
    sort of the the permutation invariance
  • 00:10:31
    yeah we
  • 00:10:32
    oh okay okay yeah we have we have that
  • 00:10:35
    um i will get to that um
  • 00:10:37
    and maybe you tell me if we satisfy the
  • 00:10:39
    kolmogorov extension theorem um but we
  • 00:10:42
    have permutation invariance for sure okay
  • 00:10:44
    okay
  • 00:10:45
    i'll just wait and listen then
  • 00:10:47
    yeah it comes in like three slides or so
  • 00:10:49
    the architecture
  • 00:10:52
    um
  • 00:10:52
    [Music]
  • 00:10:54
    yeah so
  • 00:10:55
    maybe this is not easy for you because
  • 00:10:57
    the questions were already quite
  • 00:10:58
    advanced
  • 00:11:00
    but i will just say once again like
  • 00:11:02
    written out in
  • 00:11:04
    in formula
  • 00:11:06
    what what we do
  • 00:11:08
    we we sample data sets from here it's
  • 00:11:10
    called p of d we sample from our prior
  • 00:11:13
    in some way we sample these data sets so
  • 00:11:15
    this could be a gp prior or a bayesian
  • 00:11:17
    neural network prior something like
  • 00:11:19
    that we sample a lot of these data sets
  • 00:11:21
    um this uppercase k here
  • 00:11:24
    is in our experiments usually something
  • 00:11:26
    like
  • 00:11:27
    like a few million
  • 00:11:29
    um so we sample a lot of these data sets
  • 00:11:31
    because it's super cheap to sample them
  • 00:11:33
    we can just do that over and over and
  • 00:11:35
    over again
  • 00:11:36
    um and then we train our pfn with just
  • 00:11:40
    the very normal negative log likelihood
  • 00:11:42
    or cross-entropy loss on on these data
  • 00:11:45
    sets or on the held-out part of these
  • 00:11:47
    data sets
  • 00:11:49
    and what we get out there is an
  • 00:11:51
    approximation
  • 00:11:53
    to the
  • 00:11:55
    to the
  • 00:11:56
    the true
  • 00:11:57
    um posterior predictive distribution so
  • 00:11:59
    to the bayesian
  • 00:12:02
    posterior predictive distribution that
  • 00:12:04
    that one cares about if one has this
  • 00:12:05
    prior one wants this distribution um and
  • 00:12:08
    we actually get an approximation to that
  • 00:12:10
    uh on
  • 00:12:11
    data that comes from the real world so
  • 00:12:13
    not feeding in the same data again or so
  • 00:12:16
    but like unseen data that the network
  • 00:12:18
    never saw before so what we have here is
  • 00:12:20
    a network that will do
  • 00:12:22
    that we'll do a form of bayesian
  • 00:12:23
    prediction um
  • 00:12:25
    or they will do an approximation of
  • 00:12:27
    patient prediction
  • 00:12:28
    um as the forward path so you just give
  • 00:12:30
    it the data and the hold out examples
  • 00:12:32
    and it will do patient prediction for
  • 00:12:34
    you uh there's no no gradient or
  • 00:12:36
    anything involved there
  • 00:12:38
    at inference time or at this
  • 00:12:41
    meta inference time
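    A minimal sketch of the meta-training loop he just described, assuming a hypothetical sample_dataset_from_prior() and a pfn module mapping (train_x, train_y, test_x) to a predictive distribution; this is an illustration, not the authors' code:

        import torch

        def meta_train(pfn, sample_dataset_from_prior, steps=1_000_000, lr=1e-4):
            """Fit a PFN by maximum likelihood on datasets sampled from the prior."""
            opt = torch.optim.Adam(pfn.parameters(), lr=lr)
            for _ in range(steps):
                # 1) sample one synthetic dataset from the prior (e.g. a GP or a random BNN)
                #    and split it into a training part plus held-out queries
                x, y = sample_dataset_from_prior()              # x: (n, d), y: (n,)
                cut = torch.randint(1, len(x), (1,)).item()     # vary the train-set size
                train_x, train_y = x[:cut], y[:cut]
                test_x, test_y = x[cut:], y[cut:]

                # 2) one forward pass: the PFN sees the training pairs plus the query
                #    inputs and returns a predictive distribution for each query
                pred = pfn(train_x, train_y, test_x)

                # 3) plain negative log-likelihood / cross-entropy on the held-out labels
                loss = -pred.log_prob(test_y).mean()
                opt.zero_grad(); loss.backward(); opt.step()
            return pfn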
  • 00:12:43
    and yeah just very quick because this is
  • 00:12:46
    actually
  • 00:12:47
    quite easy connection um
  • 00:12:50
    so this is just the connection to to
  • 00:12:53
    bayesian inference and the connection
  • 00:12:54
    basically just boils down to
  • 00:12:57
    the loss that i just showed you
  • 00:12:59
    before or the generalization of that
  • 00:13:02
    loss over
  • 00:13:03
    over the whole expectation of these data
  • 00:13:05
    sets um
  • 00:13:07
    actually
  • 00:13:08
    is in a meaningful way
  • 00:13:10
    connected to the kl divergence
  • 00:13:13
    of our
  • 00:13:14
    posterior approximation to the real
  • 00:13:16
    posterior so it's a
  • 00:13:18
    it's the
  • 00:13:21
    it's the mean kl divergence of our
  • 00:13:24
    approximation to the real posterior uh
  • 00:13:26
    predictive distribution plus some
  • 00:13:28
    constant
  • 00:13:31
    and we can optimize this directly that's
  • 00:13:32
    the cool thing so we optimize directly
  • 00:13:35
    to become the posterior predictive
  • 00:13:37
    distribution
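    The connection in formulas (a sketch consistent with what he states; q_theta is the PFN, p the prior's true posterior predictive):

        % expected held-out NLL over datasets and queries drawn from the prior ...
        \mathbb{E}_{(x,y,D) \sim p}\big[ -\log q_\theta(y \mid x, D) \big]
            = \mathbb{E}_{(x,D) \sim p}\Big[ \mathrm{KL}\big( p(\cdot \mid x, D) \,\|\, q_\theta(\cdot \mid x, D) \big) \Big] + \text{const}
        % ... so minimizing the training loss directly minimizes the average KL to the true PPD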
  • 00:13:39
    okay and now
  • 00:13:40
    the part luis asked about i think and
  • 00:13:44
    that is what is our model and
  • 00:13:47
    our model is uh is the transformer
  • 00:13:50
    where we uh throw away
  • 00:13:52
    the
  • 00:13:53
    the uh the positional encodings and that
  • 00:13:56
    is actually pretty much enough to make
  • 00:13:58
    the transformer position invariant or
  • 00:14:01
    sorry permutation invariant
  • 00:14:04
    and then the other thing we do is we
  • 00:14:06
    restrict the attention such that
  • 00:14:08
    our
  • 00:14:10
    our holdout example can only look at the
  • 00:14:12
    data set
  • 00:14:14
    so can attend to the data set but the
  • 00:14:16
    data set can only attend to other points
  • 00:14:18
    in the dataset so the dataset actually
  • 00:14:20
    aggregates information about itself or
  • 00:14:23
    the sorry i said data set i mean
  • 00:14:25
    training set in this case and the holdout
  • 00:14:27
    example can then look things up in
  • 00:14:30
    in these building um
  • 00:14:32
    building up representations and here we
  • 00:14:35
    have actually two holdout examples
  • 00:14:36
    before i always just showed one but in
  • 00:14:38
    general we train with multiple for
  • 00:14:40
    efficiency reasons
  • 00:14:42
    and these are separated such that each
  • 00:14:44
    one of these can only attend to the
  • 00:14:46
    training set and not to each other i
  • 00:14:48
    mean that that generally was pretty bad
  • 00:14:50
    if they could do that because
  • 00:14:52
    in general they should not depend on
  • 00:14:54
    each other
  • 00:14:55
    that's i think if you know transformers
  • 00:14:58
    already i know this is kind of assuming
  • 00:14:59
    you know a little like a lot about
  • 00:15:01
    transformers but if you know a lot about
  • 00:15:03
    uh like if you know how transformers
  • 00:15:05
    work i think this should
  • 00:15:07
    like describe the whole model um maybe
  • 00:15:10
    one more thing is how we encode the x's
  • 00:15:12
    and y's
  • 00:15:13
    in in our case it's just linear layers
  • 00:15:15
    so we
  • 00:15:16
    so we just put the x's in the linear
  • 00:15:18
    layer and the y's in the linear layer
  • 00:15:20
    and add them together if we want to
  • 00:15:21
    encode a pair or we don't have the y if
  • 00:15:24
    we don't want to encode the y because we
  • 00:15:26
    don't want to give away the information
  • 00:15:27
    for the holdout examples
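    A minimal sketch of the attention masking just described, written against the PyTorch convention where True marks a blocked position; the linear x/y encoders mentioned above would feed into the same sequence, and this is an illustration rather than the authors' implementation:

        import torch

        def pfn_attention_mask(n_train: int, n_test: int) -> torch.Tensor:
            """Mask for a sequence [train_1 .. train_n, test_1 .. test_m].

            Training positions attend only to the training set (it aggregates
            information about itself); each held-out position also attends only
            to the training set, never to other held-out positions, so the
            predictions stay independent of each other.
            """
            n = n_train + n_test
            mask = torch.ones(n, n, dtype=torch.bool)   # True = attention blocked
            mask[:, :n_train] = False                   # everyone may attend to the training set
            mask.fill_diagonal_(False)                  # and to itself
            return mask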
  • 00:15:34
    so one thing that is missing in what i
  • 00:15:37
    described here is
  • 00:15:39
    how exactly this distribution up here
  • 00:15:41
    works so i just say yeah what comes out
  • 00:15:43
    here
  • 00:15:45
    will be this posterior distribution but
  • 00:15:47
    i
  • 00:15:49
    uh i didn't explain how
  • 00:15:51
    and the traditional way would be to to
  • 00:15:53
    predict uh
  • 00:15:54
    something like a gaussian here some
  • 00:15:56
    normal distribution predict the mean and
  • 00:15:58
    the variance we tried that and it didn't
  • 00:16:01
    work and what worked much better was
  • 00:16:03
    like a
  • 00:16:04
    discretization of the space so
  • 00:16:07
    so this is our space we care about
  • 00:16:08
    predictions in uh so we want to predict
  • 00:16:11
    uh things that that lie within let's
  • 00:16:14
    hear say let's say minus three up to
  • 00:16:15
    three
  • 00:16:16
    and what we do now is we discretize the
  • 00:16:18
    space into little buckets
  • 00:16:20
    and put a soft max in front so we just
  • 00:16:23
    classify
  • 00:16:24
    uh basically uh things are
  • 00:16:27
    either or we classify
  • 00:16:29
    things to be in one of these buckets we
  • 00:16:33
    this is something that just worked
  • 00:16:35
    out of the box pretty much and you don't
  • 00:16:37
    need to tune very much
  • 00:16:39
    usually 1 000 buckets is a good number
  • 00:16:42
    um
  • 00:16:44
    this is a problem though because if you
  • 00:16:46
    don't know this probability for your
  • 00:16:48
    prior so sometimes maybe a point lies
  • 00:16:50
    here like below minus three you get like
  • 00:16:53
    a minus infinity loss which
  • 00:16:55
    would be bad
  • 00:16:56
    so very simple hack to
  • 00:16:59
    have good uh good losses it's just to
  • 00:17:02
    replace the sides with like half normals
  • 00:17:03
    so now
  • 00:17:04
    we actually have support for
  • 00:17:07
    minus infinity to plus infinity um we
  • 00:17:09
    have support everywhere
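    A sketch of the discretized output head he describes, using equal-width buckets and a softmax; the half-normal tails that give full support are omitted for brevity, and all names are illustrative:

        import torch
        import torch.nn.functional as F

        class BucketHead(torch.nn.Module):
            """Predict p(y) by classifying y into buckets on [lo, hi]."""
            def __init__(self, d_model, n_buckets=1000, lo=-3.0, hi=3.0):
                super().__init__()
                self.register_buffer("borders", torch.linspace(lo, hi, n_buckets + 1))
                self.register_buffer("widths", self.borders[1:] - self.borders[:-1])
                self.proj = torch.nn.Linear(d_model, n_buckets)

            def nll(self, h, y):
                """Negative log-density of continuous targets y under the bucketed model."""
                logits = self.proj(h)                                        # (batch, n_buckets)
                idx = (torch.bucketize(y, self.borders) - 1).clamp(0, logits.shape[-1] - 1)
                log_p = F.log_softmax(logits, dim=-1).gather(-1, idx.unsqueeze(-1)).squeeze(-1)
                # density inside a bucket = bucket probability / bucket width
                return -(log_p - torch.log(self.widths[idx]))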
  • 00:17:12
    other questions about the model because
  • 00:17:14
    then we could already start into the
  • 00:17:16
    experiments next that's so actually this
  • 00:17:18
    side i haven't fully understood so
  • 00:17:20
    basically instead of a particular mean
  • 00:17:22
    and the variance you predict
  • 00:17:24
    quantiles
  • 00:17:25
    or something like that yeah instead of
  • 00:17:27
    predicting a mean and a variance we
  • 00:17:28
    basically uh we we have a softmax in
  • 00:17:31
    the end and we predict in which one of
  • 00:17:33
    these buckets our
  • 00:17:35
    our y will land
  • 00:17:37
    so these are the bucket the first bucket
  • 00:17:39
    goes from -3 to minus 1.5 for example
  • 00:17:42
    and then we give you the probability and
  • 00:17:43
    say 0.1
  • 00:17:46
    likelihood that it lands in this bucket
  • 00:17:48
    okay so basically make it a
  • 00:17:50
    classification problem
  • 00:17:51
    right exactly
  • 00:17:53
    why not why not predicting quantiles so
  • 00:17:57
    why not what predicting the quantile so
  • 00:18:02
    um i i didn't really think about this uh
  • 00:18:05
    so you mean like predicting the borders
  • 00:18:07
    of these
  • 00:18:08
    quantiles
  • 00:18:09
    are the quantile in the distribution or
  • 00:18:12
    yeah
  • 00:18:13
    okay this is this data point yeah
  • 00:18:14
    exactly um
  • 00:18:16
    the quantizer question
  • 00:18:19
    [Music]
  • 00:18:20
    i mean the point is to make it a
  • 00:18:22
    classification problem right
  • 00:18:24
    because transformers are good at that
  • 00:18:26
    [Music]
  • 00:18:27
    right tim
  • 00:18:29
    yeah yeah yeah yeah for sure for sure i
  • 00:18:31
    mean
  • 00:18:32
    okay i think still one could probably do
  • 00:18:34
    like zero to one and then the bars again
  • 00:18:38
    but uh okay that's probably not what you
  • 00:18:39
    mean uh yeah i i didn't try the quantile
  • 00:18:43
    but i had the feeling that the
  • 00:18:44
    regression head works much worse with
  • 00:18:46
    transformers or with neural networks
  • 00:18:48
    maybe even in general no that's too much
  • 00:18:50
    of a claim that's used in many places
  • 00:18:51
    but with transformers it seemed to work
  • 00:18:53
    much worse
  • 00:18:54
    you could still make the classification
  • 00:18:56
    problem but then it just predicts
  • 00:18:58
    okay this this point is like within the
  • 00:19:00
    ten percent quantile twenty percent
  • 00:19:01
    quantile and
  • 00:19:02
    so on yeah for sure you can
  • 00:19:04
    um
  • 00:19:06
    we kind of do that actually in a way
  • 00:19:08
    because we how we define these bars is
  • 00:19:11
    we take a big example from our prior
  • 00:19:14
    and now we create bars for every
  • 00:19:17
    quantile so for every one percent type
  • 00:19:19
    of quantile we have one bar
  • 00:19:21
    so we will have for normal just like for
  • 00:19:24
    gaussian process prior for example we
  • 00:19:25
    have smaller bars in the center and
  • 00:19:27
    wider on the sides so we kind of do
  • 00:19:29
    quantile prediction i guess okay i see
  • 00:19:31
    yeah okay
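    The quantile-based bucket borders he mentions could be computed roughly like this (a sketch; sample_ys_from_prior is a hypothetical helper returning a large sample of target values from the prior):

        import numpy as np

        def quantile_bucket_borders(sample_ys_from_prior, n_buckets=1000, n_samples=100_000):
            """Place bucket borders at evenly spaced quantiles of a big prior sample,
            so each bucket carries roughly 1/n_buckets of the prior mass: narrow
            buckets where the prior is dense (the center), wide buckets in the tails."""
            ys = sample_ys_from_prior(n_samples)
            return np.quantile(ys, np.linspace(0.0, 1.0, n_buckets + 1))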
  • 00:19:33
    and and as a remark here i think it
  • 00:19:35
    would be a really interesting project if
  • 00:19:37
    anyone is interested in this to
  • 00:19:40
    look into
  • 00:19:41
    whether we can
  • 00:19:43
    just in generally and in general do
  • 00:19:45
    regression like this and whether in
  • 00:19:47
    general this is a good idea to to do
  • 00:19:49
    regression
  • 00:19:51
    like this with neural networks because
  • 00:19:54
    sometimes we tried regression with
  • 00:19:55
    neural networks and it didn't work all
  • 00:19:57
    that well
  • 00:19:58
    and um yeah this could be a generic
  • 00:20:01
    approach for
  • 00:20:02
    just how to do regression with neural
  • 00:20:04
    nets but um
  • 00:20:05
    yeah we haven't looked into it if
  • 00:20:07
    someone would left like to do that um
  • 00:20:09
    maybe together that'd be great
  • 00:20:11
    get in touch
  • 00:20:15
    yeah
  • 00:20:17
    okay so any more questions
  • 00:20:21
    just a quick question do you have any
  • 00:20:22
    insight on
  • 00:20:24
    let's say at a finite collection of test
  • 00:20:27
    points
  • 00:20:28
    um do you have any insight about what
  • 00:20:31
    the marginal looks like because you know
  • 00:20:33
    in in gaussian processes we have these
  • 00:20:35
    guarantees about the marginal
  • 00:20:37
    distribution at a finite collection
  • 00:20:39
    they're jointly gaussian
  • 00:20:42
    is that even kind of useful thing to
  • 00:20:44
    think about in this setting and if so do
  • 00:20:47
    you have any insight on what what those
  • 00:20:48
    things look like
  • 00:20:50
    so you mean the joint distribution of
  • 00:20:52
    multiple points
  • 00:20:53
    yeah exactly so so the the joint
  • 00:20:56
    distribution at a finite collection of
  • 00:20:57
    points will always be a multivariate
  • 00:21:00
    gaussian right in the case by definition
  • 00:21:02
    in the case of gaussian processes
  • 00:21:05
    yeah we actually
  • 00:21:06
    don't like this empirically
  • 00:21:09
    okay um
  • 00:21:10
    i mean what
  • 00:21:12
    i didn't look into the joint
  • 00:21:13
    distributions i have to say so i can't
  • 00:21:15
    really tell you much about this we can
  • 00:21:17
    say that each of these distributions
  • 00:21:19
    looks very much like a gaussian so if you
  • 00:21:20
    like pull up the distribution they will
  • 00:21:22
    match pretty exactly but just like by
  • 00:21:24
    eye because in the end one is like bars
  • 00:21:27
    and the other is like a like a like a
  • 00:21:29
    continuous line
  • 00:21:31
    um but for the joints i don't know i
  • 00:21:34
    guess empirically you could calculate
  • 00:21:36
    the covariances at any pair of points
  • 00:21:39
    and you can kind of see
  • 00:21:40
    how it's distributed i think that would
  • 00:21:42
    be kind of interesting
  • 00:21:44
    and i i would think it would look like a
  • 00:21:47
    gaussian thing just because the prior
  • 00:21:49
    you're learning from is is gaussian
  • 00:21:51
    um
  • 00:21:52
    another quick question i had is uh in
  • 00:21:54
    the context of
  • 00:21:56
    like uh marginal likelihood um just
  • 00:21:59
    tuning the hyper parameters as you would
  • 00:22:02
    in a traditional gp
  • 00:22:03
    um like do you have do you consider when
  • 00:22:06
    when you're drawing your prior data sets
  • 00:22:08
    different
  • 00:22:09
    uh length scales or kernel amplitudes
  • 00:22:13
    uh variances things like this or do you
  • 00:22:15
    just keep it fixed and does that
  • 00:22:17
    actually then
  • 00:22:19
    from learning from that prior does that
  • 00:22:21
    actually
  • 00:22:22
    um
  • 00:22:23
    generalize well
  • 00:22:25
    yeah we'll get to that um we do a lot of
  • 00:22:27
    that so we do a lot of this like mixing
  • 00:22:29
    different different prior hyper
  • 00:22:31
    parameters right
  • 00:22:34
    but we'll get to that in the experiment
  • 00:22:36
    okay cool
  • 00:22:37
    thanks but but maybe to answer luis's
  • 00:22:42
    first question um
  • 00:22:42
    first um
  • 00:22:44
    so i i do think we know what the
  • 00:22:45
    marginals look like they look
  • 00:22:48
    pretty much exactly like what we would
  • 00:22:50
    expect um with the true posterior
  • 00:22:53
    distribution
  • 00:22:54
    um
  • 00:22:55
    so if we train on a prior that comes
  • 00:22:58
    from a gp then well our predictions will
  • 00:23:01
    look pretty much exactly like a gaussian
  • 00:23:03
    because well that that's a kl divergence
  • 00:23:05
    we're optimizing between these two
  • 00:23:06
    distributions um
  • 00:23:09
    the joint of
  • 00:23:11
    multiple predictions
  • 00:23:13
    well actually that will look very
  • 00:23:15
    different
  • 00:23:16
    um we're just not going to predict any
  • 00:23:19
    cross-correlations here
  • 00:23:21
    because that that's what sam said we
  • 00:23:23
    predict independently for x4 and for x5
  • 00:23:26
    if you want to join then you would have
  • 00:23:28
    to predict for x4 put it into the
  • 00:23:30
    training data do another forward prop
  • 00:23:32
    and get the prediction for x5 and then i
  • 00:23:34
    would expect that we get pretty nice
  • 00:23:37
    joined
  • 00:23:38
    gaussians but that we haven't tried yet
  • 00:23:43
    i see gotcha
  • 00:23:48
    thanks okay
  • 00:23:50
    um then we jump into the experiments and
  • 00:23:53
    the experiments will actually be first
  • 00:23:55
    experiments of the original paper that
  • 00:23:57
    was also i think in the invitation so
  • 00:23:59
    this transformers can do bayesian
  • 00:24:00
    inference and then we'll get to
  • 00:24:02
    experiments of a paper we just uploaded
  • 00:24:04
    to archive so very fresh out of the oven
  • 00:24:08
    you'll see
  • 00:24:09
    um
  • 00:24:10
    so first
  • 00:24:12
    uh the motivating experiment is we train
  • 00:24:14
    on a gaussian
  • 00:24:15
    prior and want to see that our network
  • 00:24:19
    actually does something
  • 00:24:20
    that flower creates something that looks
  • 00:24:22
    like the gaussian posterior
  • 00:24:24
    and we actually do get that that was uh
  • 00:24:27
    that was nice back then um and so what
  • 00:24:30
    we see on the left here is the black
  • 00:24:33
    dots are are
  • 00:24:35
    our given examples or the training that
  • 00:24:37
    you could say so
  • 00:24:38
    this is known up front and
  • 00:24:41
    now we're interested in the
  • 00:24:43
    posterior distribution of our of our
  • 00:24:45
    gaussian process given these points and
  • 00:24:48
    so here we plot the density and we can
  • 00:24:50
    see that this is pretty smooth there are
  • 00:24:53
    some edges in there but i think this is
  • 00:24:54
    mostly due to x being in buckets
  • 00:24:58
    not so much why because we use a lot of
  • 00:25:01
    uh we use a lot of buckets here like
  • 00:25:03
    that 1000 or so
  • 00:25:05
    um
  • 00:25:07
    this is not so hard to compare to a
  • 00:25:09
    to a gp like for a gp we know the
  • 00:25:11
    posterior right so
  • 00:25:12
    it's actually a nice motivating example
  • 00:25:14
    because we can compare to the true
  • 00:25:16
    predictions unlike for many other priors
  • 00:25:18
    uh we can we can hear compared to the to
  • 00:25:21
    the ground truth
  • 00:25:23
    and this is on the right here
  • 00:25:26
    in blue it's the pfn and in green we
  • 00:25:29
    have the gp
  • 00:25:30
    and you see that at least for these
  • 00:25:32
    simple examples they uh they match very
  • 00:25:36
    exactly there are some differences we
  • 00:25:38
    mark them here with errors but in
  • 00:25:40
    general it looks like
  • 00:25:42
    it fits it pretty pretty well and even
  • 00:25:45
    the confidence interval looks looks
  • 00:25:47
    pretty similar
  • 00:25:50
    we can do this also here for another
  • 00:25:53
    length scale to show that yeah even the
  • 00:25:56
    length scale is different it also still
  • 00:25:57
    works and with more points uh same
  • 00:26:00
    experiment
  • 00:26:01
    and
  • 00:26:03
    and now we compared to the attentive neural
  • 00:26:06
    processes but this is in red here and
  • 00:26:08
    that's kind of what was there before
  • 00:26:10
    and we can see that yeah there is a
  • 00:26:13
    clear difference the transformer
  • 00:26:14
    architecture seems to be really good at
  • 00:26:16
    this task surprisingly compared to an
  • 00:26:19
    architecture that was invented for
  • 00:26:21
    few-shot learning
  • 00:26:25
    and yeah now i would like to show you a
  • 00:26:27
    little demo
  • 00:26:29
    um you can find the link to this little
  • 00:26:31
    demo actually in the paper or on the
  • 00:26:34
    github
  • 00:26:35
    so if you want to try it yourself you
  • 00:26:37
    can
  • 00:26:38
    so
  • 00:26:40
    i think that's so yeah let's see so in
  • 00:26:42
    this little demo we basically create the
  • 00:26:44
    same figures we had before
  • 00:26:47
    uh with exactly the same legend again
  • 00:26:50
    but we can create our own and we could
  • 00:26:53
    now for example here create a point in
  • 00:26:56
    the middle somewhere
  • 00:26:57
    i wanted to try an example before but no
  • 00:26:59
    i didn't really and make it much higher
  • 00:27:02
    because they're pretty much on one line
  • 00:27:03
    so that it's maybe a little more
  • 00:27:04
    interesting
  • 00:27:09
    okay so we have something something like
  • 00:27:10
    this and then okay
  • 00:27:13
    i always press new column that's a
  • 00:27:15
    problem you press new column right
  • 00:27:21
    what did i type here 0.9
  • 00:27:23
    and now a new row and now you can do
  • 00:27:26
    0.6 and we can see like if we make it
  • 00:27:29
    more extreme it will be harder for the
  • 00:27:31
    for the network
  • 00:27:37
    let's see what with this so it should
  • 00:27:39
    have like a steep ascent let's see if it
  • 00:27:41
    still can model it uh it still can model
  • 00:27:44
    it but you can see that there are more
  • 00:27:45
    errors than before i would say
  • 00:27:48
    for sure um and
  • 00:27:51
    and you can like this is
  • 00:27:53
    this is due to being much more out of
  • 00:27:55
    the distribution than
  • 00:27:57
    and
  • 00:27:59
    like the first example but yeah you can
  • 00:28:01
    play around with this demo if you like
  • 00:28:02
    as well
  • 00:28:05
    yeah frank posted the link very nice
  • 00:28:07
    cool let's go back to the presentation
  • 00:28:15
    so
  • 00:28:15
    now we look at uh
  • 00:28:18
    not anymore at night but her yes you
  • 00:28:20
    know still has nice of course everything
  • 00:28:22
    are and all of these are nice but um now
  • 00:28:25
    we look at um
  • 00:28:26
    performance plot uh here we see the
  • 00:28:29
    negative log likelihood so as we've shown
  • 00:28:31
    like as we can show that this is a
  • 00:28:33
    measure for the kl divergence between
  • 00:28:35
    our posterior and the true posterior um
  • 00:28:39
    we use the negative log likelihood to
  • 00:28:41
    measure our method uh on prior data
  • 00:28:44
    and we compare it here with the true gp
  • 00:28:46
    so this is again an example where we
  • 00:28:47
    have the true true gp so we don't do
  • 00:28:50
    anything where we can't get the true
  • 00:28:52
    posterior up to here two thousand
  • 00:28:54
    examples
  • 00:28:56
    and
  • 00:28:57
    what we can see is that we have
  • 00:28:59
    different pfns and this is the number of
  • 00:29:01
    data sets they've seen during training
  • 00:29:03
    so this uppercase k from back then
  • 00:29:06
    um
  • 00:29:08
    and what we can see is that
  • 00:29:09
    the pfns like always get better
  • 00:29:12
    with more training data like a very
  • 00:29:14
    traditional result i guess in machine
  • 00:29:16
    learning but um you can see they they
  • 00:29:18
    they get better and better with more
  • 00:29:20
    training and more training and we can
  • 00:29:21
    see similar things actually with the
  • 00:29:23
    transformers side um not as strong but
  • 00:29:26
    similar
  • 00:29:30
    quick question
  • 00:29:32
    um
  • 00:29:32
    do you have some insight about uh like
  • 00:29:36
    at what point you reach
  • 00:29:37
    point of diminishing returns like i
  • 00:29:40
    assume with all things transformers you
  • 00:29:42
    have you have these power laws and uh
  • 00:29:45
    more is always more data is always
  • 00:29:46
    better and so on but let's say i wanted
  • 00:29:48
    to deploy this myself i wanted to train
  • 00:29:50
    this from scratch by myself
  • 00:29:52
    and i don't have all the time or compute
  • 00:29:54
    budget in the world um like how do you
  • 00:29:57
    have some intuition about how
  • 00:30:00
    how high or how much data to go to
  • 00:30:03
    and and at what point it's maybe just
  • 00:30:05
    not worth it anymore like you hit the
  • 00:30:07
    point of diminishing returns
  • 00:30:10
    so i would say this generally depends a
  • 00:30:13
    lot on your prior so if you have a very
  • 00:30:15
    simple prior of course you hit that
  • 00:30:16
    point earlier so probably one of the
  • 00:30:20
    examples where you hit it earlier would
  • 00:30:21
    be something like the sculpture process
  • 00:30:23
    here
  • 00:30:24
    um
  • 00:30:25
    and
  • 00:30:26
    in general like what i can see is that
  • 00:30:28
    training for a day on one gpu
  • 00:30:32
    yeah will yield pretty good results and
  • 00:30:34
    training for five days will be like a
  • 00:30:36
    little bit better but not much
  • 00:30:39
    this is something it can give you as
  • 00:30:41
    like a
  • 00:30:42
    so
  • 00:30:43
    yeah that's that's so far
  • 00:30:45
    how how we scale we hope we find some
  • 00:30:47
    way to make better use of more compute
  • 00:30:49
    but i think so far
  • 00:30:51
    like after one day or
  • 00:30:53
    yeah we spent sometimes five days on the
  • 00:30:55
    training but for this first paper it was
  • 00:30:57
    only one day
  • 00:30:59
    and generally this is also where it
  • 00:31:00
    doesn't get much better
  • 00:31:03
    and just quick question on the gp prior
  • 00:31:05
    that you use to generate the data sets
  • 00:31:07
    like do you just always have uh you
  • 00:31:10
    uniformly sample um
  • 00:31:12
    a fixed number of
  • 00:31:14
    input
  • 00:31:15
    like observed input points and then just
  • 00:31:18
    one test point
  • 00:31:20
    and you just yeah um domain or this is
  • 00:31:22
    very so
  • 00:31:24
    so yeah for the gp prior exactly since
  • 00:31:26
    the gp prior doesn't really entail
  • 00:31:28
    x's we have to sample the x's
  • 00:31:30
    ourselves in some way
  • 00:31:32
    and we do
  • 00:31:34
    yeah pretty much what you say we sample
  • 00:31:36
    uniformly at random or from uh from a
  • 00:31:39
    standard normal
  • 00:31:41
    and
  • 00:31:43
    we
  • 00:31:44
    what we do though is that what is
  • 00:31:46
    important is to train for different uh
  • 00:31:49
    for different numbers of data points
  • 00:31:50
    because you want to generalize among
  • 00:31:53
    different data set sizes
  • 00:31:56
    um and so we we sample uniformly at
  • 00:31:58
    random how many
  • 00:32:00
    examples are
  • 00:32:01
    training set and how many are tested
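    A sketch of that sampling scheme (an RBF-kernel GP prior, inputs from a standard normal, a randomly varied dataset size and split; the hyperparameters are placeholders, not the paper's settings):

        import numpy as np

        def sample_gp_dataset(n_max=100, d=5, lengthscale=1.0, noise=1e-4, rng=None):
            """Draw one synthetic dataset from a GP prior with an RBF kernel."""
            rng = rng or np.random.default_rng()
            n = int(rng.integers(2, n_max + 1))                  # vary the dataset size
            x = rng.standard_normal((n, d))                      # inputs from a standard normal
            sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
            k = np.exp(-0.5 * sq / lengthscale**2) + noise * np.eye(n)
            y = rng.multivariate_normal(np.zeros(n), k)          # function values from the GP
            cut = int(rng.integers(1, n))                        # random train/test split
            return (x[:cut], y[:cut]), (x[cut:], y[cut:])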
  • 00:32:07
    okay
  • 00:32:07
    and
  • 00:32:09
    do do you do this for higher dimensions
  • 00:32:12
    or is it just the one dimensional
  • 00:32:14
    problem
  • 00:32:16
    uh i'm sorry this is missing here but
  • 00:32:18
    this is for example for five dimensions
  • 00:32:20
    this plot we do it for high dimensions
  • 00:32:22
    as well yeah
  • 00:32:23
    we go
  • 00:32:25
    i think our largest experiment like like
  • 00:32:28
    close to 800 dimensions like 784
  • 00:32:32
    okay
  • 00:32:33
    use this center strategy
  • 00:32:36
    so do you use um a fixed kernel for
  • 00:32:39
    your gp prior or do you use um yeah
  • 00:32:42
    different types of kernels different
  • 00:32:43
    length scales noise and so on
  • 00:32:45
    uh yeah i'm sorry i i just see that i
  • 00:32:47
    threw out the plot where i where i use
  • 00:32:50
    different length scales and kernels um
  • 00:32:52
    but in this plot this is just the same
  • 00:32:54
    length scale kernel but you can do the
  • 00:32:57
    same um
  • 00:32:58
    by
  • 00:32:59
    just varying the length scale as you
  • 00:33:01
    sample like for each dataset use a
  • 00:33:02
    different length scale distribution over
  • 00:33:04
    these and we did that as well so the
  • 00:33:06
    so-called hyper priors in the literature
  • 00:33:09
    and um that works as well but we don't
  • 00:33:12
    have the nice baseline anymore
  • 00:33:13
    because there's no or there's no easy
  • 00:33:16
    way to get the
  • 00:33:17
    to get the correct um to get the correct
  • 00:33:20
    criteria we can only approximate it
  • 00:33:22
    there are only approximations for it out
  • 00:33:24
    there
  • 00:33:24
    so it's a little harder to compare i'm
  • 00:33:27
    just wondering because if you if you
  • 00:33:28
    then sample different length scales and
  • 00:33:30
    potentially different kernels and your
  • 00:33:32
    prior gets more more uninformative right
  • 00:33:34
    and then it's you know
  • 00:33:36
    it yeah what is your model to learn
  • 00:33:39
    right here
  • 00:33:40
    in the extreme case you have like a
  • 00:33:42
    super uninformative prior and then yeah
  • 00:33:44
    well the model is not going to help
  • 00:33:45
    anymore because it's just models
  • 00:33:48
    yeah for sure that's that's that's the
  • 00:33:50
    trade-off there's some sweet spot yeah
  • 00:33:54
    yeah you don't want to make it too broad
  • 00:33:56
    but so far
  • 00:33:58
    yeah generally so far we saw making it
  • 00:33:59
    broader can can help in many
  • 00:34:01
    applications
  • 00:34:03
    um so because i think so far priors are
  • 00:34:05
    pretty restricted
  • 00:34:07
    with this really traditional vi or mcmc
  • 00:34:10
    methods because we have to have these
  • 00:34:12
    like easy to interpret latents or easy
  • 00:34:14
    to generate latent
  • 00:34:15
    ah
  • 00:34:16
    we don't need these anymore
  • 00:34:18
    cool
  • 00:34:19
    um yeah so this is uh since we talked
  • 00:34:22
    about it this is a bayesian nn then or
  • 00:34:24
    no we didn't really talk about it
  • 00:34:26
    this is the
  • 00:34:28
    bayesian neural network and we we do the
  • 00:34:30
    approximation there here as well and
  • 00:34:32
    what we look here at is um just to show
  • 00:34:34
    you the time scales that's actually
  • 00:34:37
    quite interesting so the svi baseline
  • 00:34:39
    here is the
  • 00:34:40
    bayes by backprop or like a newer
  • 00:34:42
    variation of that and then we have the
  • 00:34:44
    nuts like an mcmc
  • 00:34:47
    sampler
  • 00:34:48
    and
  • 00:34:51
    these are two other approximations to
  • 00:34:52
    the to the prior to the posterior for
  • 00:34:56
    bayesian neural networks and we compare
  • 00:34:59
    our method to these so this is our
  • 00:35:00
    method and it takes
  • 00:35:03
    we can't really change the time it takes
  • 00:35:05
    for us to predict the label for a data
  • 00:35:07
    set because it's a forward pass we can't
  • 00:35:09
    really decide to only do half of our
  • 00:35:11
    forward pass so that's why we have a dot
  • 00:35:14
    here so our method takes like uh like a
  • 00:35:17
    hundredth of a second to
  • 00:35:20
    to predict uh the labels for a given
  • 00:35:22
    task and then we looked at different
  • 00:35:24
    budgets
  • 00:35:25
    uh for our baselines and we saw that
  • 00:35:27
    they reach our performance after using
  • 00:35:30
    like something like 1000 x of the time
  • 00:35:34
    compared to us
  • 00:35:36
    which was yeah
  • 00:35:37
    quite cool
  • 00:35:38
    and and we can do this like we do this a
  • 00:35:41
    and b here it's just both be an end but
  • 00:35:44
    different sizes the left is a little
  • 00:35:45
    smaller than the right and generally
  • 00:35:47
    both of them are really small because
  • 00:35:49
    otherwise you can't do nuts
  • 00:35:52
    so
  • 00:35:53
    in this case are you just uh that the
  • 00:35:56
    data that you're learning from the data
  • 00:35:57
    sets that you generate are these just
  • 00:35:59
    literally um
  • 00:36:01
    you you have a bnn with some with
  • 00:36:04
    different random initializations and
  • 00:36:06
    that's how you're generating the data
  • 00:36:07
    sets
  • 00:36:08
    so yeah we have an mlp with random
  • 00:36:10
    initialization and then we generate our
  • 00:36:13
    x's randomly like before for the gps and
  • 00:36:16
    feed them through and generate our y's
  • 00:36:18
    that way
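    A sketch of that generation step (a freshly initialized MLP applied to random inputs; the architecture and the thresholding into two classes are illustrative choices, not the paper's exact prior):

        import torch

        def sample_bnn_dataset(n=100, d=5, hidden=64):
            """Draw one dataset from a 'random MLP' prior: new random weights per dataset."""
            mlp = torch.nn.Sequential(                 # re-initialized for every dataset
                torch.nn.Linear(d, hidden), torch.nn.ReLU(),
                torch.nn.Linear(hidden, 1),
            )
            x = torch.randn(n, d)                      # random inputs
            with torch.no_grad():
                z = mlp(x).squeeze(-1)                 # latent function values
            y = (z > z.median()).long()                # threshold into two classes
            return x, y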
  • 00:36:19
    okay and just to clarify what i'm
  • 00:36:21
    looking at here this is so this is the
  • 00:36:23
    test likelihood i'm assuming on on some
  • 00:36:25
    regression or some
  • 00:36:28
    uh
  • 00:36:29
    problem like classification or
  • 00:36:30
    regression problems or is this um this
  • 00:36:33
    is um yeah so sorry this is uh
  • 00:36:36
    this is actually classification problems
  • 00:36:39
    um and these are classification problems
  • 00:36:42
    sampled from the prior so it's still in
  • 00:36:45
    prior with like we
  • 00:36:46
    to get rid of compounding factors like
  • 00:36:49
    choosing a choosing a data set that
  • 00:36:50
    might not lie within the prior
  • 00:36:53
    we trained on unseen data sets from the
  • 00:36:55
    prior
  • 00:36:58
    okay thanks
  • 00:37:00
    and yes there's no fine tuning here
  • 00:37:01
    right so you wouldn't
  • 00:37:03
    you do the you train your transformer
  • 00:37:05
    and then
  • 00:37:06
    it's basically just testing so you
  • 00:37:07
    wouldn't update your
  • 00:37:10
    uh the transformer based on this new
  • 00:37:11
    data set right
  • 00:37:13
    no yeah the weights stay the same
  • 00:37:17
    it's yeah it's one training and then
  • 00:37:18
    only feed forward
  • 00:37:22
    don't know feed forward the wrong word
  • 00:37:24
    forward
  • 00:37:26
    so yeah okay here a little outlook um
  • 00:37:28
    it's actually yeah
  • 00:37:30
    yeah it it just to show you this is the
  • 00:37:32
    average regret and we can do uh we can
  • 00:37:35
    do we compare here the gp here is a
  • 00:37:38
    sorry we do bayesian optimization in
  • 00:37:39
    this plot and we compared to a gp that
  • 00:37:42
    uses mle-ii so that's kind of the
  • 00:37:45
    standard approach right now in bayesian
  • 00:37:47
    optimization for
  • 00:37:48
    to approximate these uh these type of
  • 00:37:51
    prior
  • 00:37:52
    gps
  • 00:37:53
    um
  • 00:37:54
    and uh
  • 00:37:56
    we compare our pfn against that and
  • 00:37:58
    against random search and we train our
  • 00:38:00
    pfn on the exact same prior the gp has
  • 00:38:02
    so the expectation would be that they
  • 00:38:05
    should be similar because it's two
  • 00:38:06
    similar approximations to the same
  • 00:38:08
    problem
  • 00:38:09
    and this is only for like up to 15
  • 00:38:12
    trials so relatively small
  • 00:38:14
    examples um
  • 00:38:17
    and both use ei so we can actually use
  • 00:38:20
    ei on top of our pfn so that's expected
  • 00:38:23
    improvements like one specific
  • 00:38:24
    acquisition function and what we could
  • 00:38:26
    see there is that our method performs
  • 00:38:29
    actually very similar to
  • 00:38:31
    to the gp and the cool thing now is yeah
  • 00:38:33
    you know you can get creative and build
  • 00:38:35
    new priors that are not possible with
  • 00:38:37
    the gp to to even beat it and that's
  • 00:38:40
    what i'm working on right now actually
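    For illustration, expected improvement can be read straight off a discretized predictive distribution like the one above (a sketch for maximization; probs and bucket_centers would come from a hypothetical PFN output head):

        import torch

        def expected_improvement(probs, bucket_centers, best_y):
            """EI(x) = E[max(y - best_y, 0)] under the bucketed predictive distribution.

            probs:          (n_candidates, n_buckets) bucket probabilities per candidate
            bucket_centers: (n_buckets,) representative y value of each bucket
            best_y:         best observed objective value so far
            """
            improvement = (bucket_centers - best_y).clamp(min=0.0)   # (n_buckets,)
            return probs @ improvement                               # (n_candidates,)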
  • 00:38:44
    then the last part we're already at 42
  • 00:38:47
    but yeah i guess we don't really need
  • 00:38:48
    discussion up there
  • 00:38:50
    um
  • 00:38:53
    uh the last part is the tabpfn
  • 00:38:55
    so that's the paper we just
  • 00:38:57
    just submitted and also just uploaded to
  • 00:38:59
    archive
  • 00:39:01
    and um
  • 00:39:04
    and that is basically one PFN, so one of
  • 00:39:07
    these prior-data fitted networks, one
  • 00:39:10
    transformer, one could also say,
  • 00:39:13
    in which we train a full AutoML
  • 00:39:15
    method, so we
  • 00:39:17
    want to make this one thing be good
  • 00:39:20
    at
  • 00:39:21
    tabular AutoML, in our case
  • 00:39:23
    AutoML on tabular datasets
  • 00:39:28
    what's cool about this is that it makes
  • 00:39:30
    AutoML pretty much instant:
  • 00:39:32
    our method performs comparably to
  • 00:39:35
    state-of-the-art AutoML methods that get
  • 00:39:38
    five minutes of time, while we only take
  • 00:39:40
    a second
  • 00:39:43
    and
  • 00:39:44
    this
  • 00:39:45
    simplifies AutoML, at least from the
  • 00:39:47
    view of a deep learner, from the
  • 00:39:49
    viewpoint of a deep learner, because
  • 00:39:51
    before
  • 00:39:52
    you had a setup like this, for example,
  • 00:39:56
    where we have some meta-learning,
  • 00:39:57
    Bayesian optimization, etc. etc., which is
  • 00:40:01
    all
  • 00:40:02
    replaced now by a single neural network
  • 00:40:06
    plus some pre-processing; i should say
  • 00:40:08
    that the pre-processing is something like
  • 00:40:09
    normalizing data, so i hope that's
  • 00:40:12
    allowed
  • 00:40:14
    and
  • 00:40:16
    one caveat here i want to give up front
  • 00:40:19
    is that we
  • 00:40:20
    looked mostly at small data, or we
  • 00:40:22
    looked at small data sets so we care
  • 00:40:24
    only about data sets up to 1000 examples
  • 00:40:28
    okay so
  • 00:40:30
    now
  • 00:40:32
    let's talk about the prior so as i said
  • 00:40:35
    before you can get creative with these
  • 00:40:36
    priors because we just have to be able
  • 00:40:38
    to sample data sets from them. so what
  • 00:40:40
    even is a prior? i mean, you can also call
  • 00:40:41
    it like a simulation of real data, so
  • 00:40:44
    um
  • 00:40:45
    and here we just started from
  • 00:40:47
    scratch: we want to be able to
  • 00:40:50
    yield a transformer that's just good at
  • 00:40:52
    classification on
  • 00:40:55
    tabular datasets, so yeah, of course we
  • 00:40:57
    wanted it to be simple
  • 00:40:58
    so
  • 00:40:59
    we thought
  • 00:41:01
    we should always have something in there
  • 00:41:02
    that forces our latents, or
  • 00:41:07
    whatever the latent is, to
  • 00:41:09
    represent simple solutions that can be
  • 00:41:11
    described with few words
  • 00:41:15
    and where we took inspiration is from
  • 00:41:17
    the,
  • 00:41:18
    from the Bayesian NNs, so on the
  • 00:41:21
    left here
  • 00:41:22
    that's the standard Bayesian NN, so
  • 00:41:24
    how we talked about it with Luis,
  • 00:41:26
    where,
  • 00:41:27
    where, during training, we
  • 00:41:30
    just sample the, sorry, we
  • 00:41:32
    just sample the weights
  • 00:41:33
    from a standard initialization
  • 00:41:36
    and now feed random data through it to
  • 00:41:39
    generate y's, and this is now our
  • 00:41:41
    data set,
  • 00:41:42
    one data set
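A minimal sketch of this data-generating prior: sample the weights of a small network from a standard initialization, push random inputs through it, and treat the outputs as labels. Every call yields one synthetic dataset, so training data is effectively unlimited. The sizes and the thresholding into two classes below are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

def sample_dataset_from_bnn_prior(n_samples=100, n_features=5, hidden=20):
    # Fresh random weights = one sample from the prior over latent functions.
    net = nn.Sequential(
        nn.Linear(n_features, hidden), nn.Tanh(),
        nn.Linear(hidden, 1),
    )
    with torch.no_grad():
        x = torch.randn(n_samples, n_features)   # random inputs
        score = net(x).squeeze(-1)               # latent function values
        y = (score > score.median()).long()      # illustrative: threshold into 2 classes
    return x, y
```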
  • 00:41:44
    and um
  • 00:41:46
    what we can do then on top of
  • 00:41:48
    that is
  • 00:41:49
    what we call the NAS part, but only
  • 00:41:52
    in quotation marks, the NAS part,
  • 00:41:55
    and that is, we can actually sample
  • 00:41:57
    the architecture as well, so we build
  • 00:42:00
    a posterior over different architectures,
  • 00:42:03
    not only
  • 00:42:04
    a single architecture like a normal
  • 00:42:05
    Bayesian NN has, but multiple
  • 00:42:07
    architectures, so that it basically
  • 00:42:09
    chooses the architecture that's most
  • 00:42:11
    likely, or we have like an ensemble over
  • 00:42:13
    architectures
  • 00:42:14
    and that was in the earlier paper,
  • 00:42:17
    and now in the new TabPFN paper we took
  • 00:42:19
    that one step further
  • 00:42:20
    and looked at structural causal
  • 00:42:23
    models, which can give some
  • 00:42:26
    nice
  • 00:42:29
    buildup, i think, for our prior
  • 00:42:32
    and how they work is basically,
  • 00:42:34
    these are pretty much pruned,
  • 00:42:37
    or yeah, i guess for you,
  • 00:42:39
    these are pruned Bayesian neural networks,
  • 00:42:42
    and now we just change where the inputs
  • 00:42:45
    are and where the outputs are; the outputs
  • 00:42:47
    and the inputs don't have to be at the
  • 00:42:48
    beginning or the end anymore, we have
  • 00:42:50
    random inputs, these are like these
  • 00:42:52
    epsilons here,
  • 00:42:54
    here, these epsilons
  • 00:42:56
    and, like, if i say input now i mean
  • 00:42:59
    input to the neural
  • 00:43:02
    network, but our inputs that we feed to
  • 00:43:04
    the transformer, then later to the PFN,
  • 00:43:06
    can then, for example, be here at
  • 00:43:08
    the output node; it doesn't really matter
  • 00:43:10
    where they are, or the y can actually
  • 00:43:13
    influence the x, like the
  • 00:43:16
    x can come from the y, so it could be
  • 00:43:19
    before, so they're basically just at
  • 00:43:20
    random points in this graph, and the
  • 00:43:22
    interaction comes from what are called
  • 00:43:25
    the links, so these,
  • 00:43:27
    the arrows, boil down to
  • 00:43:30
    weighted sums, and we think of
  • 00:43:33
    these as like the causal links for the
  • 00:43:35
    next thing, and
  • 00:43:39
    this way we can generate graphs with
  • 00:43:41
    like few causal links and generate data
  • 00:43:44
    that, at least for us, looks
  • 00:43:46
    very much like real data
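A hedged sketch of this structural-causal-model flavour of the prior: a sparse random DAG whose nodes are weighted sums of their parents plus noise, with randomly chosen nodes used as the observed features and the target. The sparsity, noise model, and discretization of y below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def sample_scm_dataset(n_samples=100, n_nodes=12, n_features=5, edge_prob=0.2, rng=None):
    rng = rng or np.random.default_rng()
    # Random sparse upper-triangular weight matrix -> acyclic graph with few causal links.
    w = rng.normal(size=(n_nodes, n_nodes)) * (rng.random((n_nodes, n_nodes)) < edge_prob)
    w = np.triu(w, k=1)
    z = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):                      # ancestral sampling, node by node
        noise = rng.normal(size=n_samples)        # the "epsilon" inputs at every node
        z[:, j] = z @ w[:, j] + noise             # weighted sum of parents + noise
    # Features and target are just random nodes of the graph, as described above.
    feature_nodes = rng.choice(n_nodes, size=n_features, replace=False)
    label_node = rng.choice(np.setdiff1d(np.arange(n_nodes), feature_nodes))
    x = z[:, feature_nodes]
    y = (z[:, label_node] > np.median(z[:, label_node])).astype(int)
    return x, y
```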
  • 00:43:49
    um there's a lot of action in the chats
  • 00:43:51
    today
  • 00:43:52
    okay, Frank
  • 00:43:53
    that's good
  • 00:43:54
    um
  • 00:43:56
    well in the end you could refer to it
  • 00:43:57
    but maybe let's push it okay
  • 00:44:00
    okay
  • 00:44:02
    and
  • 00:44:04
    yeah, so what i mean with it looks like
  • 00:44:05
    real data: these are the covariance
  • 00:44:07
    matrices of
  • 00:44:08
    a real data set that we found in our
  • 00:44:10
    benchmark compared to a data set that we
  • 00:44:12
    generate, and we see these covariance
  • 00:44:14
    matrices, they actually look pretty much
  • 00:44:16
    the same: they build clusters and have other
  • 00:44:19
    parts that are really not very
  • 00:44:20
    correlated
  • 00:44:22
    it's similar in a way, so
  • 00:44:24
    here on the right it's
  • 00:44:26
    synthetic data and on the left we have
  • 00:44:28
    the real data
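That sanity check is easy to reproduce in a few lines, assuming `real_x` and `synthetic_x` are (n, d) feature matrices (names are placeholders, not from the paper):

```python
import numpy as np

def covariance_images(real_x, synthetic_x):
    # Feature-by-feature covariance matrices of the real and the sampled data;
    # plotting both (e.g. with matplotlib's imshow) lets you eyeball whether
    # the synthetic data shows similar cluster structure.
    return np.cov(real_x, rowvar=False), np.cov(synthetic_x, rowvar=False)
```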
  • 00:44:31
    cool i hope it's kind of clear what this
  • 00:44:34
    prior is all about and
  • 00:44:38
    we can
  • 00:44:39
    jump to
  • 00:44:41
    the experiments, and
  • 00:44:44
    what we did for experiments: we have
  • 00:44:46
    one
  • 00:44:46
    main experiment here, and that is based
  • 00:44:49
    on the OpenML CC-18 suite; we didn't
  • 00:44:52
    want to cherry-pick our tabular data
  • 00:44:54
    sets, because i think that actually
  • 00:44:55
    happens pretty easily, there are many
  • 00:44:57
    available and it's not clear which one
  • 00:44:58
    is the ImageNet of tabular data sets
  • 00:45:01
    so
  • 00:45:02
    it can happen easily that you cherry-pick,
  • 00:45:04
    that's why we use the benchmark
  • 00:45:07
    that's out there, and as i said before,
  • 00:45:10
    we restricted ourselves to only 1000
  • 00:45:13
    examples, so from this benchmark we threw
  • 00:45:15
    away
  • 00:45:16
    all the data sets that have over 1000
  • 00:45:19
    training examples
  • 00:45:22
    and we were left with
  • 00:45:24
    around two thirds of the data
  • 00:45:27
    sets from this benchmark, so
  • 00:45:29
    this was still around
  • 00:45:31
    30 data sets, i think, or so,
  • 00:45:34
    that we had in the end
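A sketch of that dataset selection with the openml Python package; the exact filtering criteria in the paper may differ (for example, additional limits on the number of features or classes):

```python
import openml

# Keep only the OpenML-CC18 tasks whose datasets have at most 1000 instances.
suite = openml.study.get_suite("OpenML-CC18")
small_task_ids = []
for task_id in suite.tasks:
    dataset = openml.tasks.get_task(task_id).get_dataset()
    if dataset.qualities["NumberOfInstances"] <= 1000:
        small_task_ids.append(task_id)
print(f"kept {len(small_task_ids)} of {len(suite.tasks)} tasks")
```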
  • 00:45:36
    and
  • 00:45:38
    here we compared to a few of the
  • 00:45:41
    modern AutoML frameworks, as well as
  • 00:45:44
    boosted trees and some very
  • 00:45:46
    simple methods like GPs,
  • 00:45:48
    logistic regression, and k-nearest
  • 00:45:51
    neighbors, and what we look at
  • 00:45:53
    here is
  • 00:45:55
    in each plot we have,
  • 00:45:57
    on the x-axis, the time, so
  • 00:45:59
    how much time does it take to get
  • 00:46:02
    results on a data set on average for
  • 00:46:04
    these real-world data sets that we have
  • 00:46:06
    from the OpenML CC-18
  • 00:46:08
    benchmark, and what performance do we get
  • 00:46:11
    in that time; so on the very left we have
  • 00:46:13
    the ROC AUC,
  • 00:46:15
    like the average of the
  • 00:46:18
    ROC AUCs
  • 00:46:20
    for these different data sets, and we
  • 00:46:22
    can see we're here on the very far left,
  • 00:46:24
    and we get comparable results only
  • 00:46:27
    after like 100x the time or so,
  • 00:46:31
    by methods like Auto-sklearn 2
  • 00:46:34
    and AutoGluon.
  • 00:46:36
    we also compared in terms of the number
  • 00:46:39
    of wins and the
  • 00:46:41
    average ranking, and we see a similar
  • 00:46:43
    result:
  • 00:46:45
    we have the best rank and the most
  • 00:46:48
    wins at the very beginning, but even when
  • 00:46:50
    given one hour, which we don't use, we use
  • 00:46:52
    only one second, we have
  • 00:46:54
    comparable performance
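The headline metric on the left of those plots is just the ROC AUC averaged over datasets; a small sketch, assuming `results` holds one (y_true, positive-class probability) pair per benchmark dataset:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(results):
    # `results`: list of (y_true, y_score) pairs, one per dataset.
    # For binary tasks y_score is the positive-class probability; multi-class
    # tasks would additionally need multi_class="ovr" and full probability matrices.
    return float(np.mean([roc_auc_score(y, p) for y, p in results]))
```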
  • 00:46:57
    so
  • 00:46:58
    how long did it take to train the
  • 00:46:59
    transformer so this
  • 00:47:01
    was fantastic
  • 00:47:02
    this particular transformer was trained
  • 00:47:04
    for one day on a GPU
  • 00:47:09
    but it can be, so,
  • 00:47:11
    yeah,
  • 00:47:12
    it can be used now for all kinds of data
  • 00:47:14
    sets, but it's only trained once, and
  • 00:47:16
    it's published, and
  • 00:47:17
    you can now use it for your other data
  • 00:47:19
    sets that we never saw before
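A usage sketch of that "train once, then just apply it" workflow, assuming the published tabpfn package exposes a scikit-learn-style classifier (interface details may have changed since the talk):

```python
from tabpfn import TabPFNClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# A small real dataset (569 rows) that the pre-trained model has never seen.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()     # loads the already-published, pre-trained weights
clf.fit(X_tr, y_tr)          # roughly: preprocess and store the training set as context
print(clf.predict_proba(X_te)[:3])
```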
  • 00:47:22
    i guess the comparable
  • 00:47:24
    question would be how long did it take
  • 00:47:26
    to
  • 00:47:28
    develop Auto-sklearn 2, and i guess
  • 00:47:30
    that's a couple of years, so,
  • 00:47:32
    it's really a one-time training
  • 00:47:36
    as part of algorithm development
  • 00:47:40
    and when we come up with a new prior,
  • 00:47:42
    then that's sort of like coming up with
  • 00:47:43
    a new framework
  • 00:47:45
    right, but i mean, okay, i think that's,
  • 00:47:47
    that's out of the scope of this
  • 00:47:49
    paper and maybe also
  • 00:47:51
    besides the point, but a fair comparison
  • 00:47:53
    would be maybe, you know, how would
  • 00:47:55
    other AutoML with transfer,
  • 00:47:58
    transfer learning, work here, right? so,
  • 00:48:00
    but maybe, but there's no transfer,
  • 00:48:03
    it's just running on artificial data
  • 00:48:06
    yeah, that's the training
  • 00:48:07
    okay, i see, yeah, yeah
  • 00:48:09
    it never saw a real tabular data set
  • 00:48:11
    before
  • 00:48:12
    okay, yeah, i see the point, yeah, yeah, true
  • 00:48:16
    okay, then as a last cherry, a
  • 00:48:20
    last little cool thing that we don't
  • 00:48:23
    really understand, though,
  • 00:48:24
    is, we then tried it later with longer
  • 00:48:27
    data sets, so data sets that have more
  • 00:48:29
    training examples than we ever
  • 00:48:30
    considered during training, because,
  • 00:48:32
    like i said before, we sample these
  • 00:48:34
    artificial data sets during training, and
  • 00:48:35
    we,
  • 00:48:36
    in this case, we sampled data sets of
  • 00:48:38
    length
  • 00:48:39
    1024, that's what we always sampled,
  • 00:48:41
    um
  • 00:48:42
    so it should learn to be able to
  • 00:48:44
    predict for everything smaller than or equal
  • 00:48:46
    to 1024 examples,
  • 00:48:49
    but we later ran on larger data sets
  • 00:48:51
    from the AutoML benchmark, and it
  • 00:48:54
    actually still generalized and
  • 00:48:57
    could
  • 00:48:58
    gain improvements
  • 00:49:00
    after that point, so it can
  • 00:49:03
    somehow generalize and make use of
  • 00:49:04
    larger data sets that it never saw during
  • 00:49:06
    training, which is, i guess, a quite
  • 00:49:08
    interesting feature of transformers
  • 00:49:13
    okay, and now this is the final slide
  • 00:49:16
    already, and that's the outlook,
  • 00:49:18
    what one could use this for and
  • 00:49:22
    what's still out there, and the
  • 00:49:25
    first thing is the Bayesian optimization,
  • 00:49:26
    which we work on, and then of
  • 00:49:30
    course larger data sets are important; i said
  • 00:49:31
    we always restrict ourselves to 1000
  • 00:49:34
    examples, and as you know, training a
  • 00:49:36
    transformer with very long sequences can
  • 00:49:38
    be a problem, and that's why we restrict
  • 00:49:40
    ourselves to something like that
  • 00:49:43
    but we're working on more,
  • 00:49:45
    yeah, making it larger, and also, if you're
  • 00:49:47
    interested, reach out; and then
  • 00:49:50
    there's this creative thing of inventing
  • 00:49:52
    new priors, and what
  • 00:49:54
    really describes the world really
  • 00:49:56
    well, where we are able to
  • 00:49:58
    maybe generate data but we can't, for
  • 00:50:00
    example, compute the probability of
  • 00:50:02
    different data points or so, because we
  • 00:50:04
    don't need to, we just need to sample now
  • 00:50:06
    um
  • 00:50:08
    and
  • 00:50:09
    yeah, other things are the parametric
  • 00:50:10
    posterior approximation, which would be
  • 00:50:12
    pretty much simulation-based inference,
  • 00:50:14
    and interpretability is also
  • 00:50:18
    something that could be interesting here,
  • 00:50:20
    because you can do forward passes much
  • 00:50:22
    faster than before, so you can actually
  • 00:50:24
    try out a lot of different
  • 00:50:27
    scenarios to figure out what
  • 00:50:29
    actually makes some
  • 00:50:32
    prediction be one or zero, and you can
  • 00:50:36
    also take gradients through the PFN, so
  • 00:50:38
    you can,
  • 00:50:40
    basically, do gradient descent to find
  • 00:50:42
    the direction in which your label
  • 00:50:44
    changes, and lastly, yeah, you can do
  • 00:50:47
    something like aleatoric versus
  • 00:50:48
    epistemic uncertainty quantification, but
  • 00:50:50
    yeah, that's all on the horizon, basically,
  • 00:50:52
    i just wanted to tell you there's many
  • 00:50:55
    paths
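A sketch of the gradient-based interpretability idea: because the PFN is an ordinary differentiable network, one can take the gradient of a predicted class probability with respect to a test input to see in which direction the features would have to move to change the label. `pfn` is again a hypothetical pre-trained module mapping (train_x, train_y, test_x) to logits, not the authors' actual interface:

```python
import torch

def label_sensitivity(pfn, train_x, train_y, test_x, target_class):
    # Gradient of the target-class probability w.r.t. the test features.
    test_x = test_x.clone().requires_grad_(True)
    probs = torch.softmax(pfn(train_x, train_y, test_x), dim=-1)
    (grad,) = torch.autograd.grad(probs[:, target_class].sum(), test_x)
    return grad   # per-feature direction that increases the target-class probability
```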
  • 00:50:58
    cool
  • 00:50:58
    um
  • 00:51:00
    yeah that's it and i guess we can answer
  • 00:51:03
    some questions
Tags
  • Bayesian Inference
  • Transformers
  • Meta-Learning
  • Few-Shot Learning
  • Machine Learning
  • Automatic Machine Learning
  • Supervised Learning
  • Prior Data
  • Inference
  • Neural Networks