Strasbourg Summer School in Chemoinformatics, 2024 : Alex TROPSHA

01:07:28
https://www.youtube.com/watch?v=oj_kKetKUgo

Summary

TL;DR: The talk opens with an introduction to cheminformatics and its challenges, focusing on data quality and its effect on modeling. The speaker, an experienced cheminformatician, reflects on key developments and trends in the field. He highlights in particular the problem of errors in databases, which can lead to unreliable models. The talk also covers recent advances in cheminformatics, including the use of artificial intelligence and machine learning to analyze large data sets and the integration of these models into clinical practice. The hype around overrated scientific discoveries is examined critically. The talk closes by stressing the importance of careful, accurate handling of cheminformatics data.

Key takeaways

  • 🎓 Cheminformatics requires careful data preparation to avoid modeling errors.
  • 📚 Learning from the past is essential for the further development of cheminformatics.
  • 🔍 Data quality remains a central challenge in cheminformatics.
  • 🤝 Integrating cheminformatics with clinical data can deliver immediate clinical benefits.
  • 🚫 Excessive hype around new discoveries is criticized as potentially harmful.
  • 🧠 Machine learning improves efficiency and accuracy but has to be applied with care.
  • 🗂️ Correcting data errors in existing databases is indispensable.
  • 🔄 Corrective approaches and control mechanisms are crucial for data quality and curation.
  • 📈 The shift to large data sets brings new challenges and opportunities.
  • 📝 Scientists should be wary of making overblown promises in publications.
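
As a rough illustration of the curation step behind these takeaways, the sketch below groups records by structure and separates duplicates whose measurements agree from duplicates whose measurements conflict, which the talk singles out as a special case. It is only a sketch under stated assumptions: the `curate_duplicates` helper, the placeholder keys, and the `tolerance` threshold are hypothetical and do not come from the talk, and the keys are assumed to be structure identifiers that have already been standardized.

```python
from collections import defaultdict
from statistics import mean


def curate_duplicates(records, tolerance=0.5):
    """Merge concordant duplicates and set discordant ones aside.

    records: iterable of (structure_key, measured_value) pairs. The key is
    assumed to be an already standardized structure identifier.
    tolerance: hypothetical maximum spread allowed within a duplicate group.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    curated, discordant = {}, {}
    for key, values in groups.items():
        if max(values) - min(values) <= tolerance:
            # Measurements agree: keep one record with the averaged value.
            curated[key] = mean(values)
        else:
            # Measurements conflict: flag the group for manual review.
            discordant[key] = values
    return curated, discordant


# Placeholder data: KEY-A is a concordant duplicate, KEY-C a discordant one.
records = [
    ("KEY-A", 6.1), ("KEY-A", 6.3),
    ("KEY-B", 5.0),
    ("KEY-C", 4.2), ("KEY-C", 7.9),
]
curated, discordant = curate_duplicates(records)
print(sorted(curated))
print(discordant)
```

Flagging conflicting groups instead of averaging them is a deliberate choice in this sketch: a large spread may point to an experimental or transcription error that averaging would hide.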

Timeline

  • 00:00:00 - 00:05:00

    The speaker thanks Sasha for the introduction and the invitation and congratulates Professor Varnek on winning the American Chemical Society's Skolnik Award. He notes that these workshops have trained generations of cheminformaticians and stresses how important it is to exchange ideas.

  • 00:05:00 - 00:10:00

    He describes the challenge he was given: to cover the entire field of cheminformatics in less than an hour. He will speak from the perspective of a sarcastic but optimistic cheminformatician and address both the past and the future of the field.

  • 00:10:00 - 00:15:00

    He introduces the concept of reminiscence, the recall of episodic memories from one's personal past, and explains how it can trigger thoughts about the future. He discusses the philosophical side of reminiscing about the future.

  • 00:15:00 - 00:20:00

    He gives an overview of the present and future challenges of cheminformatics, such as drug discovery, networks of connected chemical information, and the need to tie computation to experiments so that it benefits research.

  • 00:20:00 - 00:25:00

    He stresses the importance of error detection in databases, especially for large data volumes. He discusses protocols for curating accurate chemical data and notes that, unlike chemical structures, biological data have no canonical measurements.

  • 00:25:00 - 00:30:00

    He discusses predicting external data sets, the precautions this requires, and the history of the applicability domain. He emphasizes that the reliability of predictions has to be ensured.

  • 00:30:00 - 00:35:00

    He introduces the concept of data set modelability, which indicates how well a data set can be modeled. He also discusses the conditions under which chemical models can be interpreted.

  • 00:35:00 - 00:40:00

    He criticizes the frequent publication of misleading model results. He explains why model validation should not rely on plain accuracy, especially for imbalanced data sets.

  • 00:40:00 - 00:45:00

    He points out that data set balance is decisive in model development. When data sets are imbalanced, it is harder to build accurate models, which often leads to flawed results.

  • 00:45:00 - 00:50:00

    He discusses the challenges of interpreting chemical models and shows that chemical features never act in isolation. Explaining a model through a few descriptors is often not sensible.

  • 00:50:00 - 00:55:00

    He turns to recent developments in cheminformatics and points out how important it is to read beyond the current literature in order to build on the insights of earlier scientists.

  • 00:55:00 - 01:00:00

    He discusses current challenges in cheminformatics, including the need for better machine learning benchmarks and for valid chemical structures. He also mentions cases where faulty data can lead to wrong conclusions.

  • 01:00:00 - 01:07:28

    He shows how deep learning is often overhyped and notes that corrections have to be made even in highly regarded publications when the original conclusions were inaccurate.
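
The modelability index (MODI) that the talk describes as effectively a one-nearest-neighbor model can be sketched in a few lines. This is a minimal sketch, assuming the class-averaged formulation (per class, the fraction of compounds whose nearest neighbor carries the same label, averaged over classes). The `modi` function name, the toy one-dimensional descriptors, and the Euclidean distance are illustrative assumptions, not the speaker's implementation.

```python
import math
from collections import defaultdict


def modi(descriptors, labels):
    """Class-averaged fraction of compounds whose nearest neighbor shares their class."""
    same = defaultdict(int)   # per class: compounds whose nearest neighbor has the same label
    total = defaultdict(int)  # per class: number of compounds
    for i, (x, label) in enumerate(zip(descriptors, labels)):
        # Nearest neighbor of compound i among all other compounds.
        nearest = min(
            (j for j in range(len(descriptors)) if j != i),
            key=lambda j: math.dist(x, descriptors[j]),
        )
        total[label] += 1
        if labels[nearest] == label:
            same[label] += 1
    return sum(same[c] / total[c] for c in total) / len(total)


# Toy 1-D descriptors (hypothetical): two groups of three points each.
points = [(0.0,), (1.0,), (2.5,), (10.0,), (11.0,), (12.5,)]
separable = [0, 0, 0, 1, 1, 1]  # every nearest neighbor shares the class
cliffs = [0, 1, 0, 1, 0, 1]     # every nearest neighbor is in the other class
mixed = [0, 0, 1, 1, 1, 0]      # one cliff per class

print(modi(points, separable))
print(modi(points, cliffs))
print(modi(points, mixed))
```

In the talk, data sets with an index below roughly 0.6 typically could not be modeled with sufficient accuracy regardless of the algorithm. A cheap check of this kind can be run before investing in more complex models.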

Video Q&A

  • What is the main focus of the talk?

    The focus is on data quality and modeling in cheminformatics.

  • Which main problems in cheminformatics are addressed?

    Problems with data quality, the handling of errors in databases, and the significance of model results.

  • Which new trends in cheminformatics are discussed?

    The use of machine learning, large data sets, and integration with clinical applications.

  • Why is the history of cheminformatics emphasized?

    To stress the importance of learning from past scientific work and of not repeating mistakes.

  • According to the speaker, what is an essential part of a successful cheminformatics project?

    Preparing and cleaning the data is essential.

  • What criticism is made of current scientific publications?

    Excessive hype is criticized, particularly around unconfirmed discoveries.

  • What role does machine learning play in cheminformatics?

    Machine learning improves the efficiency and accuracy of modeling, especially for large data sets.

  • Which historical figure is quoted to illustrate the value of the past in research?

    Isaac Newton is quoted with his remark about standing on the shoulders of giants.

  • What is identified as a future challenge in cheminformatics?

    Integrating chemical information with clinical data in order to achieve direct clinical impact.

  • What advice is given for handling chemical data?

    The talk stresses the need for strict data curation and quality control.
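
The point about the significance of model results can be made concrete with the 4:1 example from the talk: 80 molecules of one class, 20 of another, and a "model" that assigns everything to the majority class. The sketch below compares total accuracy with balanced accuracy (the mean of per-class recall, which the talk equates with the correct classification rate). The function names and the class labels 1 and 2 are illustrative choices, not part of the talk.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        members = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in members) / len(members))
    return sum(recalls) / len(recalls)


# 4:1 imbalanced data set: 80 molecules of class 1, 20 of class 2.
y_true = [1] * 80 + [2] * 20
# No model at all: everything is called class 1.
y_pred = [1] * 100

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The majority-class guess scores 80% on total accuracy but only 50% on balanced accuracy (100% recall for one class, 0% for the other), which is why total accuracy alone says little about an imbalanced data set.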

Transcript
  • 00:00:00
    right so Sasha thank you for um the
  • 00:00:03
    great intro um as a few people in the
  • 00:00:07
    room I've been part of this circus since
  • 00:00:10
    since the beginning and enjoyed many uh
  • 00:00:14
    wonderful events lectures friendship
  • 00:00:17
    professional interactions so uh thank
  • 00:00:19
    you for inviting me again for keeping me
  • 00:00:22
    on my toes for so many
  • 00:00:23
    years uh I think it's a special year uh
  • 00:00:27
    because uh not only should I thank
  • 00:00:31
    Professor Varnek for inviting me but also
  • 00:00:34
    congratulate him on winning the Skolnik
  • 00:00:35
    award from the American Chemical Society
  • 00:00:37
    this year which uh a few people in this
  • 00:00:40
    room are going to be attending and
  • 00:00:42
    speaking at this event and in addition
  • 00:00:45
    to his uh multiple scientific
  • 00:00:47
    contributions he was also acknowledged
  • 00:00:49
    for um for these workshops for training
  • 00:00:53
    generations of cheminformaticians and
  • 00:00:56
    allowing us to get together and exchange
  • 00:00:58
    scientific ideas so congratulations and
  • 00:01:01
    thank you for um a lot of service to the
  • 00:01:04
    community that you've done over the
  • 00:01:06
    years so I'm really honored to to speak
  • 00:01:09
    here um thank you also for giving me an
  • 00:01:12
    impossible
  • 00:01:13
    task uh because when Sasha approached me
  • 00:01:16
    he said just speak about everything the
  • 00:01:19
    entire field in less than one hour he
  • 00:01:22
    restricted me
  • 00:01:24
    unfortunately um and that is to explain a
  • 00:01:28
    subtitle
  • 00:01:30
    uh of my talk uh which is that you're
  • 00:01:32
    going to hear comments from a sarcastic
  • 00:01:34
    but optimistic cheminformatician uh which
  • 00:01:38
    is how I describe myself and so
  • 00:01:39
    hopefully it will be a combination of uh
  • 00:01:42
    notes about the past and present and
  • 00:01:46
    future of the field and I will start
  • 00:01:48
    with a
  • 00:01:50
    philosophical comment uh to explain the
  • 00:01:53
    title of the lecture because um I think
  • 00:01:55
    it's somewhat self-contradictory
  • 00:01:57
    reminiscing about the future uh but
  • 00:02:00
    apparently there is a field there's the
  • 00:02:01
    entire field called reminiscence and the
  • 00:02:04
    comments are coming from a paper
  • 00:02:06
    published in the International Journal
  • 00:02:08
    of Reminiscence and Life
  • 00:02:11
    Review you can't imagine how many
  • 00:02:13
    journals exist these days you could find
  • 00:02:15
    whenever you have some idea you could
  • 00:02:16
    always find something published but it's
  • 00:02:19
    actually quite interesting uh
  • 00:02:21
    reminiscence by definition involves
  • 00:02:22
    recalling episodic memories from one's
  • 00:02:24
    personal past which is how we started
  • 00:02:27
    this is this is memories about these
  • 00:02:29
    workshops
  • 00:02:30
    this process often triggers thoughts of
  • 00:02:32
    the future so think about this I think
  • 00:02:34
    it's actually accurate and conversely
  • 00:02:37
    imagining our future can frequently
  • 00:02:38
    stimulate
  • 00:02:40
    reminiscence most interesting what I
  • 00:02:43
    found is that not only past and future
  • 00:02:45
    share certain properties in common that
  • 00:02:47
    are distinct from the present which is
  • 00:02:50
    where we are right now and the present
  • 00:02:52
    is the least attractive the least
  • 00:02:54
    interesting because both past and future
  • 00:02:57
    are infinite in contrast the present is
  • 00:02:59
    coming into being and slipping away
  • 00:03:02
    lasting perhaps less than an hour of
  • 00:03:06
    this lecture so um that's that's that's
  • 00:03:10
    my intro um I think it's it's pretty
  • 00:03:13
    profound and fortunately I found
  • 00:03:16
    something so um so here's the outline as
  • 00:03:21
    advertised I will talk about the past
  • 00:03:24
    the basics the reminiscence uh but I'll
  • 00:03:26
    try to position reminiscence in the
  • 00:03:28
    context of current and future research uh in
  • 00:03:33
    cheminformatics uh and I also uh found the
  • 00:03:36
    words from
  • 00:03:37
    Chris Rea's famous song The Blue Cafe which I
  • 00:03:41
    think applies to this field where have
  • 00:03:43
    you been where are you going to I want to
  • 00:03:44
    know what's new I want to go with you
  • 00:03:46
    and I think that that should Define the
  • 00:03:49
    essence of study in cheminformatics for
  • 00:03:51
    all of us uh and I want to make a point
  • 00:03:55
    that the good old days cheminformatics concepts
  • 00:03:57
    are still valid so hopefully um you will
  • 00:04:00
    agree I'll talk about recent
  • 00:04:03
    developments as well as warnings and that
  • 00:04:05
    will be a somewhat sarcastic part of the
  • 00:04:08
    lecture uh discussing Sense and
  • 00:04:10
    Sensibility of cheminformatics that now we
  • 00:04:13
    use the word deep always um as as we
  • 00:04:16
    discuss this concept on occasion
  • 00:04:18
    throwing in terms such as AI and other
  • 00:04:21
    forms of learning and in the future the
  • 00:04:23
    impactful trends um and and I think the
  • 00:04:27
    the most impactful trend that is
  • 00:04:29
    reflected in the field is that it has
  • 00:04:31
    become a big data science discipline
  • 00:04:34
    with everything that big data science
  • 00:04:36
    brings into the challenges that are
  • 00:04:39
    facing us so I'll start by the
  • 00:04:43
    foundations of the field in 2009 the
  • 00:04:46
    Journal of Cheminformatics was formed and
  • 00:04:48
    the very first
  • 00:04:49
    issue uh there were Grand challenges
  • 00:04:52
    that were defined um and as I'm not
  • 00:04:55
    going to read this to you I'll let you
  • 00:04:56
    read but perhaps you could see that the
  • 00:05:01
    same Grand challenges can be formulated
  • 00:05:03
    today almost entirely the same way
  • 00:05:06
    except perhaps that we're dealing with
  • 00:05:09
    uh the different types of data different
  • 00:05:11
    volumes of data and different technical
  • 00:05:14
    challenges but philosophically the grand
  • 00:05:16
    challenges such as drug discovery green
  • 00:05:19
    chemistry understanding life as chemists
  • 00:05:23
    and enabling networks of information and
  • 00:05:25
    connected knowledge to be explored these
  • 00:05:28
    are still big challenges in front of the
  • 00:05:30
    field so it's it's it's good because it
  • 00:05:32
    it only illustrates how fundamental this
  • 00:05:35
    discipline that at some point was coined
  • 00:05:37
    and called cheminformatics is so I'll go
  • 00:05:41
    through a very few Concepts that I
  • 00:05:43
    consider key paradigms of the field uh
  • 00:05:46
    and and overall and that has become is
  • 00:05:49
    becoming I reflect on this more and more
  • 00:05:51
    important that that the computations are
  • 00:05:55
    done in the name of experiments fueled
  • 00:05:57
    by experiments processed just
  • 00:05:59
    computational but it's always important
  • 00:06:02
    to remember that in the end we want to
  • 00:06:03
    use cheminformatics methods and tools in
  • 00:06:06
    order to affect in the most plausible
  • 00:06:08
    way the experiments and inform and and
  • 00:06:11
    increase the experimental hit
  • 00:06:14
    rate it's also important uh as and still
  • 00:06:18
    important today to recognize and it's
  • 00:06:21
    one of the most wonderful papers in the
  • 00:06:23
    history of the field how not to do
  • 00:06:25
    research and I um again will talk about
  • 00:06:28
    how not to do research today and how
  • 00:06:31
    some of the concepts especially those
  • 00:06:33
    that that I highlight they still haven't
  • 00:06:36
    prevailed in many publications in the field
  • 00:06:38
    but for those who are relatively new to
  • 00:06:40
    the field knowing how not to do research
  • 00:06:44
    continues to be very important and
  • 00:06:46
    learning about 21 and in fact that should
  • 00:06:48
    be
  • 00:06:49
    22 Key rules of how not uh to do
  • 00:06:54
    research in cheminformatics is very
  • 00:06:57
    important error detection
  • 00:07:00
    um has been continues to be uh one of
  • 00:07:04
    the key elements of CH informatics
  • 00:07:06
    research and I would like to emphasize
  • 00:07:09
    and re-emphasize how important it is to
  • 00:07:11
    pay attention to the quality of data
  • 00:07:14
    especially as we process large massive
  • 00:07:17
    data today and here are just a few
  • 00:07:19
    examples of the types of errors that
  • 00:07:21
    could be continuously found in databases
  • 00:07:24
    especially duplicates which is a special
  • 00:07:27
    case where um data should be properly
  • 00:07:29
    treated for chemical duplicates and uh
  • 00:07:33
    um chemical duplicates that have
  • 00:07:36
    differential measurements associated
  • 00:07:38
    with this and and the existence of uh
  • 00:07:41
    errors in databases uh affects both
  • 00:07:46
    simple and very complex models and
  • 00:07:48
    continues to
  • 00:07:49
    affect so protocols have been developed
  • 00:07:53
    and continuously should be applied in
  • 00:07:55
    order to curate both chemical data uh
  • 00:07:58
    and and we and others have developed
  • 00:07:59
    special protocols going methodically and
  • 00:08:02
    systematically through uh chemical data
  • 00:08:05
    and curating accurate and and and
  • 00:08:08
    representing chemical structures
  • 00:08:10
    accurately as well as and that that is
  • 00:08:13
    even more important dealing with
  • 00:08:15
    biological uncertainty and there is a um
  • 00:08:18
    an aspect of data curation that I always
  • 00:08:20
    highlight and that is that one could
  • 00:08:23
    talk about canonical chemical structures
  • 00:08:26
    and curate chemical structures
  • 00:08:28
    accurately with some caveats where it's
  • 00:08:30
    impossible such as multiple chiral centers
  • 00:08:32
    that have not been resolved or tautomers that
  • 00:08:35
    have not been resolved but for the most
  • 00:08:37
    part canonical chemical rules can be
  • 00:08:40
    applied uh to curate chemical structures
  • 00:08:42
    accurately but when it comes to biological
  • 00:08:44
    data there are no canonical measurements
  • 00:08:47
    biological data is inherently inaccurate
  • 00:08:51
    perhaps irreproducible so a lot of
  • 00:08:53
    attention needs to be paid to processing
  • 00:08:56
    data and eliminating duplicates and
  • 00:08:58
    looking at whether or not duplicative
  • 00:08:59
    chemical structures after curation have
  • 00:09:02
    identical or dissimilar properties
  • 00:09:05
    associated with them so um this is has
  • 00:09:08
    been and continues to be fundamental
  • 00:09:10
    aspects of uh cheminformatics research and
  • 00:09:14
    then how we model the data uh how we
  • 00:09:17
    process the data and how important
  • 00:09:18
    continuously is to uh think about
  • 00:09:23
    prediction for external data sets rather
  • 00:09:26
    than uh overemphasizing the quality of
  • 00:09:29
    models developed for the training set
  • 00:09:32
    which we could continuously especially
  • 00:09:34
    especially with big data and I allude to
  • 00:09:37
    this um also uh overtrained models or
  • 00:09:41
    the more powerful algorithms we use the
  • 00:09:43
    more is the chance that the models are
  • 00:09:45
    going to be overtrained so it's important
  • 00:09:47
    to figure out how to convince yourself
  • 00:09:50
    and the world that the models that we've
  • 00:09:52
    built with past data can extrapolate to
  • 00:09:55
    new data and what comes with this
  • 00:09:57
    extrapolation what kind of uh precautions we
  • 00:10:00
    need to take um to assure accurate
  • 00:10:04
    extrapolation of models developed on the
  • 00:10:06
    training set so to deal with these issues
  • 00:10:10
    historically another key Paradigm
  • 00:10:12
    applicability domain uh which continues
  • 00:10:14
    to be very important part of modern
  • 00:10:16
    research how to define the applicability
  • 00:10:19
    domain it now has a slightly different name
  • 00:10:21
    I'll talk about it but uh the objective
  • 00:10:23
    is to recognize that no matter how large
  • 00:10:25
    our training set is it has limited size
  • 00:10:29
    and extrapolation can be reliable only
  • 00:10:32
    in certain directions in the chemistry
  • 00:10:33
    space and within proximity of the
  • 00:10:36
    training set that needs to be defined
  • 00:10:38
    the chemical space needs to be defined
  • 00:10:40
    and the proximity to the training set
  • 00:10:41
    needs to be defined
  • 00:10:43
    quantitatively and uh the associated
  • 00:10:45
    predictions are can be reliable or less
  • 00:10:47
    reliable depending on how we Define the
  • 00:10:50
    applicability
  • 00:10:51
    domain and then some Concepts that have
  • 00:10:54
    been developed uh which I also find uh
  • 00:10:56
    fundamental such as uh whether a data
  • 00:11:00
    set can be modelable and it's a funny
  • 00:11:02
    word modelability of the data but uh
  • 00:11:07
    empirically uh I think every cheminformatician at
  • 00:11:10
    some point had found out that the data
  • 00:11:13
    for some reason the models they trained perhaps
  • 00:11:16
    look great for the training set perhaps
  • 00:11:18
    did not even look great for the training
  • 00:11:20
    set but the U extrapolation has been
  • 00:11:24
    repeatedly unreliable and so we've
  • 00:11:26
    defined this concept of modelability and
  • 00:11:29
    index that we've called MODI which
  • 00:11:31
    effectively is a one nearest neighbor
  • 00:11:33
    model and if a data set has a large
  • 00:11:35
    number of nearest neighbors with
  • 00:11:38
    different designation of the target
  • 00:11:40
    property that particular index that
  • 00:11:44
    calculate calculates the fraction of
  • 00:11:46
    these formal property cliffs or activity
  • 00:11:48
    cliffs turned out to be a very strong and
  • 00:11:51
    simple indicator of our ability to
  • 00:11:54
    develop or not develop a model that can
  • 00:11:57
    extrapolate well and here is a um
  • 00:12:00
    also an
  • 00:12:02
    observation uh from the same paper that
  • 00:12:04
    defined MODI um and that is uh that what
  • 00:12:08
    we've demonstrated is that if we take
  • 00:12:10
    large number of data sets and calculate
  • 00:12:12
    this simple MODI index one nearest neighbor
  • 00:12:16
    model uh what we find is no matter how
  • 00:12:19
    um
  • 00:12:20
    complex strong deep algorithms we try to
  • 00:12:24
    use to establish a model of high
  • 00:12:27
    accuracy if the MODI index is below
  • 00:12:29
    0.6 no matter how much you try you
  • 00:12:31
    typically are unable to develop a model of
  • 00:12:34
    sufficient accuracy and that's just the
  • 00:12:36
    property of the data uh it's a point
  • 00:12:39
    I'll come back to and I've chosen some
  • 00:12:41
    points that I will come back to sort of
  • 00:12:43
    in a new way uh reflect on the recent
  • 00:12:47
    development in the
  • 00:12:49
    field uh and then couple of more points
  • 00:12:51
    that I want to make from the past uh
  • 00:12:53
    which again I find important to share
  • 00:12:56
    and emphasize as part of classical
  • 00:12:58
    historic
  • 00:12:59
    fundamental cheminformatics research one
  • 00:13:02
    is how we convince ourselves and the
  • 00:13:05
    world that the model is accurate and
  • 00:13:07
    here's one of the uh standard errors
  • 00:13:10
    that uh have been uh articulated on how
  • 00:13:13
    not to do QSAR by John Dearden uh and and
  • 00:13:17
    that is uh we need to be mindful about
  • 00:13:19
    the distribution of the data and what
  • 00:13:22
    comes with it many data sets I'll come
  • 00:13:24
    back to this point too um are imbalanced
  • 00:13:27
    and here is a simple example that if a
  • 00:13:28
    data set has 4:1 ratio 80 molecules of
  • 00:13:33
    one class and 20 molecules of another
  • 00:13:35
    class if we just calculate total
  • 00:13:38
    accuracy which is the most intuitive and
  • 00:13:40
    most
  • 00:13:40
    immediate um attempt that one makes uh
  • 00:13:44
    then the total accuracy is going to be
  • 00:13:45
    80% without even building a model if you
  • 00:13:49
    say that everything is class one all 100
  • 00:13:52
    molecules and 80 of them are class one
  • 00:13:53
    you're 80% accurate and so uh that that
  • 00:13:57
    certainly doesn't make sense and
  • 00:13:59
    therefore uh more specific metrics need
  • 00:14:02
    to be considered when and for the most
  • 00:14:04
    part data sets are imbalanced imbalanced
  • 00:14:07
    uh so in this case a balanced accuracy
  • 00:14:10
    calculation or correct classification
  • 00:14:12
    rate which is essentially the same as
  • 00:14:14
    balanced accuracy when applied to the same
  • 00:14:16
    data set immediately shows that the
  • 00:14:18
    model on average is 50% accurate right
  • 00:14:21
    100% for one class 0% for another class and
  • 00:14:25
    so uh we've always advocated for data
  • 00:14:29
    set balancing before you run
  • 00:14:31
    calculations and the reason for this was
  • 00:14:33
    that when the data set is imbalanced and
  • 00:14:35
    you have small number of objects in a
  • 00:14:37
    smaller
  • 00:14:38
    class then uh there is a rather small
  • 00:14:42
    number of chemicals similar enough to
  • 00:14:45
    the molecules in the training set of
  • 00:14:47
    class
  • 00:14:48
    one that
  • 00:14:50
    um that uh um make it relatively
  • 00:14:55
    difficult to build a classification
  • 00:14:57
    model and if you increase the size of um
  • 00:15:00
    if you don't down sample if you don't
  • 00:15:02
    balance the data you make the task of
  • 00:15:05
    developing a model easier and easier the
  • 00:15:08
    more imbalance is in the data set that
  • 00:15:11
    what we what we always Advocate the data
  • 00:15:14
    set needs to be balanced to make the
  • 00:15:16
    task of classification model development
  • 00:15:18
    more difficult because you are
  • 00:15:20
    discriminating a small number of
  • 00:15:22
    instances from another small number of
  • 00:15:25
    similar instances and recommending that the
  • 00:15:27
    similarity-based sampling and down
  • 00:15:30
    sampling of the second class bigger
  • 00:15:32
    class needs to be accomplished I will
  • 00:15:34
    come back to this point because that's
  • 00:15:36
    that's uh a great example of change in
  • 00:15:39
    paradigm reflecting the new trend in the
  • 00:15:42
    field and and understanding the data
  • 00:15:44
    associated with it the final point I
  • 00:15:47
    want to make uh is uh and it's a very
  • 00:15:50
    popular topic so-called explainable AI in
  • 00:15:52
    today's world is how models or can
  • 00:15:54
    models be interpreted and a typical uh
  • 00:15:58
    statement of interpretation simply means
  • 00:16:00
    that we we identifying or emphasizing
  • 00:16:03
    specific chemical functional groups a
  • 00:16:05
    small number of descriptors that have
  • 00:16:07
    the highest loading in the chemical
  • 00:16:09
    sense and essentially reduce the model
  • 00:16:12
    and reduce the interpretation of the
  • 00:16:14
    model to a few statements about chemical
  • 00:16:17
    functional groups that are important or
  • 00:16:18
    a subset of descriptors that is
  • 00:16:21
    important and uh what I show here to
  • 00:16:23
    illustrate this point is a distribution
  • 00:16:25
    of descriptor values for a particular
  • 00:16:27
    data set
  • 00:16:29
    and then and then we calculate um how
  • 00:16:33
    descriptor values change when you go
  • 00:16:35
    from one molecule to another and what I
  • 00:16:37
    want to emphasize here is that a minor
  • 00:16:40
    perturbation of of of the initial
  • 00:16:43
    molecule affects nearly every chemical
  • 00:16:46
    descriptor that we calculate directly
  • 00:16:47
    from the chemical
  • 00:16:49
    structure and um as we run these uh
  • 00:16:53
    highly similar molecules every single
  • 00:16:56
    descriptor is affected so there is no
  • 00:16:59
    objectively specific or any any reason
  • 00:17:02
    to say that one descriptor really
  • 00:17:05
    explains the entire um behavior of the
  • 00:17:08
    entire molecule and so it's very
  • 00:17:10
    critical as we struggle because chemists
  • 00:17:13
    always want information as we struggle
  • 00:17:15
    to explain statistical models it's
  • 00:17:18
    important to remember that chemical
  • 00:17:20
    features never act in isolation from
  • 00:17:23
    each other and when you modify a
  • 00:17:24
    molecule multiple features are getting
  • 00:17:26
    affected and therefore explanation of
  • 00:17:28
    multivariate models by one or a few
  • 00:17:31
    descriptors is typically not sensible
  • 00:17:33
    there are ways of taking this into account and
  • 00:17:36
    uh I'll leave this out but in general um we've
  • 00:17:40
    published about how you combine model
  • 00:17:45
    interpretation if you are absolutely forced by
  • 00:17:49
    whatever reason to identify a few
  • 00:17:51
    chemical features that are chemically
  • 00:17:53
    modifiable for instance you could use
  • 00:17:56
    global QSAR models in order to evaluate
  • 00:17:59
    predicted consequences of a small
  • 00:18:02
    chemical modification uh which is always
  • 00:18:04
    local so there are some ways of dealing
  • 00:18:07
    with this but this is a very general
  • 00:18:08
    statement uh which explains uh the
  • 00:18:10
    impossibility of simple interpretation
  • 00:18:13
    of chemical models and that uh I wanted
  • 00:18:16
    to start with a few notes just kind of
  • 00:18:18
    to position ourselves uh with respect to
  • 00:18:21
    the history of of the field of cheminformatics
  • 00:18:24
    and some key Concepts that people have
  • 00:18:25
    formulated and elaborated upon in um in
  • 00:18:30
    the last 20 or some years since since uh
  • 00:18:33
    the name of the field was coined and now
  • 00:18:37
    uh let's talk about recent and current
  • 00:18:39
    developments and some warnings
  • 00:18:41
    associated with these developments and
  • 00:18:42
    that's where I will allow myself to be
  • 00:18:45
    somewhat sarcastic at least so Sense and
  • 00:18:47
    Sensibility of uh deep cheminformatics uh
  • 00:18:51
    and I want to highlight and emphasize
  • 00:18:53
    how important it is to read Beyond
  • 00:18:57
    current literature and relate what we do
  • 00:19:00
    today to what has been done by wonderful
  • 00:19:04
    and um um great scientists of the past
  • 00:19:09
    and I personally want to acknowledge
  • 00:19:10
    three scientists uh from uh all of our
  • 00:19:14
    recent past um one is
  • 00:19:17
    Archimedes um who worked a few years ago
  • 00:19:20
    as as all of you know uh who presented a
  • 00:19:24
    an incredible case of Ingenuity in
  • 00:19:27
    solving a difficult problem using
  • 00:19:29
    relatively simple means right and
  • 00:19:31
    determining the the density of the
  • 00:19:34
    crown and then getting excited somehow
  • 00:19:37
    when I was looking for the history of
  • 00:19:40
    Archimedes the most exciting um story
  • 00:19:44
    associated with him uh in majority of
  • 00:19:48
    sources that I saw was that he was
  • 00:19:49
    running naked uh on the streets not not
  • 00:19:52
    the scientific content of the discovery
  • 00:19:55
    and then of course uh Isaac Newton uh
  • 00:19:57
    one of the most famous statements uh
  • 00:20:00
    which is very accurate and and I always
  • 00:20:02
    find this is inspirational if I've seen
  • 00:20:04
    further uh it's by standing on the
  • 00:20:06
    shoulders of the Giants so again uh
  • 00:20:09
    there is a tendency not to read the past
  • 00:20:11
    literature and uh it is a statement from
  • 00:20:14
    one of the greatest Minds uh in the
  • 00:20:16
    humankind who um had explained how
  • 00:20:20
    studies and education should run and
  • 00:20:23
    finally um Albert Einstein who came up
  • 00:20:27
    with what I will refer in the end of my
  • 00:20:30
    lecture to uh as a universal formula of
  • 00:20:35
    research and
  • 00:20:36
    Discovery uh because I I I realized um
  • 00:20:40
    that this this particular formula has
  • 00:20:42
    much greater value than even what
  • 00:20:45
    Einstein initially
  • 00:20:47
    contemplated
  • 00:20:50
    so
  • 00:20:51
    uh what's new what new happened in the
  • 00:20:55
    uh before sort of in the recent history
  • 00:20:58
    of cheminformatics that's the next
  • 00:21:00
    section of this lecture and uh what I've
  • 00:21:03
    used in order to highlight uh what is
  • 00:21:06
    considered current challenges in the
  • 00:21:09
    field are blogs that one of our colleagues Pat
  • 00:21:12
    Walters uh of Relay Therapeutics has
  • 00:21:15
    been formulating so Pat publishes blogs
  • 00:21:18
    very profound highlighted and
  • 00:21:21
    educational so um those of you who have not
  • 00:21:25
    read his blogs he publishes maybe three
  • 00:21:27
    to five blogs a year where he addresses
  • 00:21:30
    current challenges in front of
  • 00:21:32
    cheminformaticians and in 2023
  • 00:21:35
    2024 a few blogs that I summarized
  • 00:21:39
    here sounded awfully familiar if if you
  • 00:21:43
    recall how I um cited the
  • 00:21:46
    Journal of Cheminformatics so what do we
  • 00:21:48
    need to do today that we did not know or
  • 00:21:51
    that we did not do 20 years
  • 00:21:54
    ago we need better benchmarks for
  • 00:21:57
    machine learning we need valid
  • 00:21:59
    structures this is direct quote
  • 00:22:01
    consistent chemical representation and
  • 00:22:04
    account of stereochemistry consistent
  • 00:22:06
    measurements realistic dynamic range of
  • 00:22:10
    the data um fundamental for cheminformatics and
  • 00:22:15
    cut offs that we use when we transform
  • 00:22:17
    data into a a format that enables
  • 00:22:20
    classification
  • 00:22:21
    models data curation explicitly and here is
  • 00:22:25
    a funny story within a recent blog uh
  • 00:22:28
    there is a blood-brain barrier data set
  • 00:22:30
    there's a um um classical Benchmark data
  • 00:22:33
    set called MoleculeNet that people have
  • 00:22:35
    used in hundreds of Publications and
  • 00:22:39
    Pat's observation the data set contains
  • 00:22:41
    59 duplicative structures 10 of those
  • 00:22:43
    duplicative structures have different
  • 00:22:45
    labels and yet hundreds of papers are
  • 00:22:49
    published including today with the
  • 00:22:51
    newest cheminformatics machine learning AI
  • 00:22:54
    deep learning deepest learning super
  • 00:22:56
    deep learning incredibly deep
  • 00:22:58
    and all of that are published
  • 00:23:01
    claiming superiority to previous models
  • 00:23:03
    by about
  • 00:23:05
    0.00005 AUC which is the most common
  • 00:23:09
    uh type of publications these
  • 00:23:12
    days when the field is flooded
  • 00:23:14
    by people trained in computer science
  • 00:23:17
    but not necessarily in chemistry and so
  • 00:23:19
    they're trained to
  • 00:23:21
    compete on benchmarking and validation
  • 00:23:25
    and unfortunately they're not trained to
  • 00:23:28
    think about the type of data that
  • 00:23:30
    they're dealing with and processing so
  • 00:23:32
    they compete on benchmarks and so that's
  • 00:23:35
    what happens today right and
  • 00:23:37
    this is everything that I've talked
  • 00:23:39
    about everything that was formulated in
  • 00:23:41
    the beginning by the foundational
  • 00:23:44
    Publications and continues to today
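Editor's sketch: the duplicate-structure problem described above for the blood-brain barrier benchmark (59 duplicated structures, 10 of them with different labels) is the kind of check that can be run before any modeling. The snippet below is a minimal illustration, not the procedure used in the blog or the lecture. It assumes each structure has already been reduced to one canonical key (for example a canonical SMILES or an InChIKey), and the function name `find_duplicates` and the toy records are hypothetical.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group (structure_key, label) pairs by structure key.

    Returns (duplicated, conflicting):
      duplicated  - keys that occur more than once
      conflicting - the subset of those keys that carry more than one distinct label
    """
    labels_by_key = defaultdict(list)
    for key, label in records:
        labels_by_key[key].append(label)
    duplicated = {k for k, v in labels_by_key.items() if len(v) > 1}
    conflicting = {k for k in duplicated if len(set(labels_by_key[k])) > 1}
    return duplicated, conflicting


# Hypothetical toy records: (canonical structure key, binary label).
toy_records = [
    ("CCO", 1),
    ("CCO", 1),        # exact duplicate, same label
    ("c1ccccc1", 0),
    ("c1ccccc1", 1),   # duplicate with a conflicting label
    ("CC(=O)O", 0),
]

duplicated, conflicting = find_duplicates(toy_records)
print("duplicated:", sorted(duplicated))
print("conflicting:", sorted(conflicting))
```

Conflicting duplicates cannot be resolved by code alone; they usually require going back to the primary source for each measurement.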
  • 00:23:47
    and then model validation here is uh the
  • 00:23:50
    direct title of one of Pat's recent blogs
  • 00:23:52
    comparing classification models you're
  • 00:23:54
    probably doing it wrong and he goes into
  • 00:23:57
    a great extent uh explaining how you
  • 00:24:00
    cannot just compare models by statistics
  • 00:24:04
    assigned to fixed values because values
  • 00:24:06
    have distribution and we need to take
  • 00:24:08
    distribution into account so really um
  • 00:24:12
    simple simply uh written uh but very
  • 00:24:16
    powerful blog so um so I want to use a
  • 00:24:19
    few recent examples we're talking about
  • 00:24:22
    cheminformatics done in 2023
  • 00:24:25
    2024 and what's new compared to what I
  • 00:24:28
    called foundational
  • 00:24:30
    cheminformatics and here is uh our own
  • 00:24:32
    recent exercise which should have been
  • 00:24:35
    as I thought an exercise that would take
  • 00:24:38
    a few hours of one of my graduate
  • 00:24:40
    students time because we've been um made
  • 00:24:43
    part of a very large project at UNC uh
  • 00:24:46
    on direct antiviral drug Discovery and
  • 00:24:49
    we decided to look across the ChEMBL
  • 00:24:52
    database to see if people have
  • 00:24:54
    discovered broad spectrum antivirals
  • 00:24:56
    previously so the task was to uh go into
  • 00:25:00
    ChEMBL identify all antiviral assays
  • 00:25:04
    identify compounds tested in those assays
  • 00:25:07
    and then look up if the same compounds
  • 00:25:09
    show activities in different assays right a
  • 00:25:13
    few minute exercise
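Editor's sketch: the task as stated (collect antiviral assay records, then look for compounds active in assays against different viruses) reduces to a simple grouping once the records are clean. The snippet below is a hedged illustration of that grouping only. The record fields (`compound_id`, `virus`, `active`), the function name `broad_spectrum_candidates`, and the toy data are assumptions and do not reflect the real ChEMBL schema.

```python
from collections import defaultdict


def broad_spectrum_candidates(assay_records, min_viruses=2):
    """Return {compound_id: sorted list of viruses} for compounds reported
    active against at least `min_viruses` distinct viruses."""
    viruses_by_compound = defaultdict(set)
    for rec in assay_records:
        if rec["active"]:
            viruses_by_compound[rec["compound_id"]].add(rec["virus"])
    return {
        cid: sorted(viruses)
        for cid, viruses in viruses_by_compound.items()
        if len(viruses) >= min_viruses
    }


# Hypothetical, already-curated records; real records are rarely this clean.
toy_assay_records = [
    {"compound_id": "CPD-1", "virus": "SARS-CoV-2", "active": True},
    {"compound_id": "CPD-1", "virus": "Zika", "active": True},
    {"compound_id": "CPD-2", "virus": "SARS-CoV-2", "active": True},
    {"compound_id": "CPD-2", "virus": "Zika", "active": False},
    {"compound_id": "CPD-3", "virus": "Dengue", "active": True},
]

hits = broad_spectrum_candidates(toy_assay_records)
print(hits)
```

The grouping itself is trivial; the point of the story that follows is that getting records into this clean shape is where the time goes.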
  • 00:25:15
    however uh what Holli Martin the graduate
  • 00:25:18
    student working on this uh discovered is
  • 00:25:22
    that the way the data is written in a
  • 00:25:24
    highly curated ChEMBL data set and
  • 00:25:26
    we've complained in the past about uh
  • 00:25:28
    curation effort um and they fixed it
  • 00:25:31
    many things have been fixed I I I've
  • 00:25:34
    published about it but what you want to
  • 00:25:37
    know if you're comparing molecules
  • 00:25:38
    tested in antiviral assays you want to
  • 00:25:40
    know the type of the assay and you want
  • 00:25:42
    to know assay conditions so that you
  • 00:25:44
    could compare apples and apples
  • 00:25:48
    naturally when it comes to assay conditions
  • 00:25:51
    as recorded in um the database what what
  • 00:25:54
    do we typically want to know it's
  • 00:25:55
    antiviral assay we want to know the
  • 00:25:57
    virus cell assay time concentration
  • 00:26:01
    assessment one of the common ways of
  • 00:26:03
    recording the data in ChEMBL a molecule
  • 00:26:06
    had antiviral activity against
  • 00:26:09
    SARS end of
  • 00:26:11
    story best case
  • 00:26:16
    scenario very elaborate description
  • 00:26:19
    small number of data but that's the
  • 00:26:21
    ideal case scenario and there is a huge
  • 00:26:24
    Spectrum in between when the cells are
  • 00:26:27
    mentioned the virus is mentioned but no
  • 00:26:30
    experimental conditions and you're left
  • 00:26:32
    to wonder whether it was left out from
  • 00:26:35
    the primary publication or um was
  • 00:26:38
    never recorded and so uh Holli then
  • 00:26:43
    started and I asked her to document the
  • 00:26:46
    amount of time I'm talking about the
  • 00:26:47
    most simplistic task preparing a data
  • 00:26:50
    set which typically people describe in
  • 00:26:52
    the publication in one line I prepared the
  • 00:26:54
    data set okay
  • 00:27:04
    so phenotypic assays were missing the cell
  • 00:27:08
    type the most critical parameter of a phenotypic
  • 00:27:11
    assay in what cells was the assay
  • 00:27:14
    conducted it was found in the assay
  • 00:27:17
    description but not where you expect
  • 00:27:21
    to find it so she needed to extract this
  • 00:27:23
    from the assay description or it was
  • 00:27:26
    completely missing 64% of the data it
  • 00:27:29
    was recorded but not in the right place
  • 00:27:32
    and in some cases it was not reported time
  • 00:27:35
    span 150
  • 00:27:37
    hours sure graduate salary is relatively
  • 00:27:42
    low but still not something and and
  • 00:27:45
    again it's against expectation because
  • 00:27:47
    we're talking about processing easy to
  • 00:27:49
    get data that should be
  • 00:27:51
    really uh so that's that was that so
  • 00:27:54
    total time 200 hours just dealing with
  • 00:27:56
    phenotypic assays um and then and then Holli
  • 00:28:00
    recognized that if we searched with the most
  • 00:28:01
    common most obvious qualifier BioAssay
  • 00:28:05
    Ontology uh definition of assay type you
  • 00:28:08
    would have missed
  • 00:28:10
    99.4% of all the data because the data was
  • 00:28:13
    there but not in the right place not in
  • 00:28:15
    the right column and so retrieving data is
  • 00:28:18
    critical and as much as as mundane as
  • 00:28:23
    the task appears as simple as the task
  • 00:28:27
    appears but if you don't have the data
  • 00:28:29
    you don't have the models or if the data
  • 00:28:31
    is mislabeled you don't have the models
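Editor's sketch: the situation described above, where the cell type is absent from its dedicated field but present in the free-text assay description, can be illustrated with a small fallback lookup. This is a simplified assumption of how such a rescue step might look, not the procedure actually used to build the database mentioned in the lecture. The field names, the short list of cell lines, and the function `resolve_cell_type` are all hypothetical.

```python
import re

# Hypothetical vocabulary; longer names come first so "Vero E6" wins over "Vero".
KNOWN_CELL_LINES = ("Vero E6", "Vero", "Huh-7", "A549", "Calu-3")


def resolve_cell_type(record):
    """Return (cell_type, source), where source is 'field', 'description' or 'missing'."""
    if record.get("cell_type"):
        return record["cell_type"], "field"
    description = record.get("description", "")
    for name in KNOWN_CELL_LINES:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, description, flags=re.IGNORECASE):
            return name, "description"
    return None, "missing"


# Hypothetical records covering the three cases described in the lecture.
toy_assays = [
    {"assay_id": "A1", "cell_type": "Vero E6",
     "description": "Antiviral activity against SARS-CoV-2"},
    {"assay_id": "A2", "cell_type": None,
     "description": "Inhibition of Zika virus replication in Huh-7 cells after 48 h"},
    {"assay_id": "A3", "cell_type": None,
     "description": "Antiviral activity against SARS"},
]

sources = [resolve_cell_type(a)[1] for a in toy_assays]
print(sources)
```

A query that only reads the structured field would return one of these three records; the keyword fallback recovers a second, and the third still needs manual curation against the primary publication.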
  • 00:28:33
    and so uh for new and old people in the
  • 00:28:38
    field and I always emphasize this we
  • 00:28:40
    have to spend time and typically
  • 00:28:43
    frequently manual time cleaning the data
  • 00:28:46
    preparing the data it's unexciting you
  • 00:28:47
    cannot get an NIH grant for it because uh
  • 00:28:51
    you may succeed on significance perhaps
  • 00:28:54
    impact you will never succeed on
  • 00:28:56
    methodology because there is no
  • 00:28:58
    methodology there's no innovation in
  • 00:29:00
    methodology there's no innovation so and
  • 00:29:02
    if you take standard grant scoring
  • 00:29:05
    criteria you're not really
  • 00:29:07
    um going to get the grant so what what
  • 00:29:10
    was supposed to be a few hours of work
  • 00:29:13
    turned out to be a good portion of a
  • 00:29:15
    graduate student thesis and in the end
  • 00:29:17
    she's built a database called SMACC and
  • 00:29:19
    we're using this database to ask
  • 00:29:22
    questions and building the second
  • 00:29:23
    version of this database now for uh host
  • 00:29:26
    directed therapies uh with the same idea
  • 00:29:28
    of looking up uh or enabling people to
  • 00:29:31
    look up the data and compare compounds
  • 00:29:34
    that they find against compounds found
  • 00:29:35
    in the
  • 00:29:37
    data um other aspects of the field so
  • 00:29:41
    that that was a little bit depressing
  • 00:29:42
    but uh not unexpected what else has been
  • 00:29:45
    happening in cheminformatics recently and
  • 00:29:48
    this is a a publication from K who who
  • 00:29:51
    is in the room and is speaking later um
  • 00:29:55
    here and uh and I chose the publication
  • 00:29:58
    because it I think reflects the sort of
  • 00:30:00
    transformation and transition in the
  • 00:30:02
    field from relatively naive um
  • 00:30:06
    discussion identification and discussion
  • 00:30:08
    of critical Concepts to much deeper much
  • 00:30:13
    much more rigorous much more thorough
  • 00:30:16
    analysis of uh the core concepts in
  • 00:30:19
    cheminformatics such as chemistry space and uh
  • 00:30:23
    this publication defined a concept of
  • 00:30:26
    roughness of the chemical landscape as a
  • 00:30:28
    way of addressing the same problem that
  • 00:30:30
    I've alluded to whether or not a data
  • 00:30:33
    set is modelable but again at much more
  • 00:30:37
    thorough
  • 00:30:39
    level but the aspects of the publication
  • 00:30:43
    that I highlight um talk about Concepts
  • 00:30:46
    that have been predefined where we're
  • 00:30:48
    redefining the the concepts that have
  • 00:30:50
    been defined previously but in more
  • 00:30:53
    computational and theoretical rigorous
  • 00:30:55
    way and so that has been part of the
  • 00:30:58
    modern cheminformatics uh the influx of
  • 00:31:01
    uh people who think much deeper and from
  • 00:31:04
    a different perspective about problems
  • 00:31:06
    that we have encountered there are
  • 00:31:07
    multiple Publications that address
  • 00:31:09
    issues of uh fundamental value to
  • 00:31:13
    cheminformatics but um that's sort of the
  • 00:31:15
    angle that I chose to introduce this
  • 00:31:20
    speak and then
  • 00:31:23
    come the most popular aspects of
  • 00:31:26
    cheminformatics in the last uh few years and
  • 00:31:28
    that's uh what I've called going deep
  • 00:31:31
    deep here deep there deep
  • 00:31:33
    everywhere and and I want this part of
  • 00:31:36
    the lecture to be um very educational uh
  • 00:31:40
    and not taken in the wrong way because
  • 00:31:43
    I'm going to be critical of ourselves as
  • 00:31:48
    uh scientists and as cheminformaticians
  • 00:31:50
    and my first criticism is when you
  • 00:31:54
    develop a new methodology when you
  • 00:31:56
    capture new methodology
  • 00:31:58
    the most obvious um reaction human
  • 00:32:01
    reaction is excitement and when you're
  • 00:32:04
    excited you don't see the light you
  • 00:32:07
    don't you stop being objective and then
  • 00:32:10
    uh and then uh you kind of I called it
  • 00:32:14
    narcissistic
  • 00:32:16
    model because uh you look at the
  • 00:32:19
    reflection in the mirror and it excites
  • 00:32:21
    you because you're
  • 00:32:23
    doing cool things and so
  • 00:32:28
    2017 2018 someone who shares the last
  • 00:32:33
    name with me
  • 00:32:36
    uh along with other
  • 00:32:38
    people said we've we've we've we've
  • 00:32:41
    never done this before we're doing this
  • 00:32:43
    for the first time and we observe higher
  • 00:32:45
    accuracy than those models developed
  • 00:32:47
    with simple machine learning techniques
  • 00:32:49
    of course it's new it's novel you learn
  • 00:32:53
    new things you get excited and you
  • 00:32:55
    ignore some uh
  • 00:32:58
    um sober voices happen to uh be
  • 00:33:03
    published at the same time that it's not
  • 00:33:05
    always provide
  • 00:33:08
    de but nevertheless the excitement is
  • 00:33:11
    there and the excitement will always be
  • 00:33:12
    there because we constantly invent new
  • 00:33:15
    stuff
  • 00:33:17
    so you start analyzing deep learning on
  • 00:33:21
    papers deeper
  • 00:33:29
    thank
  • 00:33:30
    you and here's one of the um original
  • 00:33:33
    Publications from one of the original
  • 00:33:35
    developers of deep learning methods a
  • 00:33:38
    hardcore um
  • 00:33:41
    statistician uh with verbal statement we
  • 00:33:43
    found the deep learning methods significantly
  • 00:33:45
    outperform all competing methods that's
  • 00:33:48
    a very powerful statement especially the
  • 00:33:50
    word significant because this word is
  • 00:33:52
    used colloquially or scientifically and
  • 00:33:55
    this is not a colloquial use of the word
  • 00:33:58
    and then you look
  • 00:34:02
    um at the actual data because that's
  • 00:34:04
    what we need to learn how to look at the
  • 00:34:06
    actual data not the reflection on the
  • 00:34:08
    mirror and when you look at the data
  • 00:34:10
    what you find is that if you compare a
  • 00:34:13
    deep neural net model with plain
  • 00:34:15
    SVM or random forest the largest AUC
  • 00:34:19
    difference is
  • 00:34:21
    0.04 AUC units the largest and yet the
  • 00:34:25
    error is an order of magnitude
  • 00:34:28
    higher and it's just that so we have to
  • 00:34:33
    stop ourselves from using words that
  • 00:34:35
    don't have very strict meaning regardless
  • 00:34:38
    of how excited
  • 00:34:39
    we are and here's another
  • 00:34:42
    story one of my
  • 00:34:45
    favorite superbug killed by super drug
  • 00:34:50
    and it's in Cell and Cell is a
  • 00:34:53
    very highly cited Journal a deep
  • 00:34:55
    learning approach to antibiotic
  • 00:34:57
    discovery
  • 00:34:58
    and a lot of hype associated with it in
  • 00:35:00
    mass media Publications and that's
  • 00:35:03
    another thing I'll come back to this
  • 00:35:04
    point in a few
  • 00:35:06
    minutes so what's behind it what's
  • 00:35:08
    behind it is a wonderful highly
  • 00:35:11
    appealing publication in Journal of
  • 00:35:13
    Chemical Information and Modeling
  • 00:35:16
    with a new way of uh learning how to
  • 00:35:20
    represent molecules and how to compare
  • 00:35:22
    molecules and how to build QSAR models of
  • 00:35:26
    the highest accuracy today
  • 00:35:28
    this publication
  • 00:35:30
    collected more than 750 citations very
  • 00:35:34
    highly cited paper there is a
  • 00:35:36
    little red sign if you go to the
  • 00:35:40
    website um that most people probably
  • 00:35:43
    miss but I am a sarcastic modeler I
  • 00:35:45
    clicked if you click it brings you to uh
  • 00:35:49
    and this is the original publication as
  • 00:35:51
    you could see there is a a very
  • 00:35:53
    substantial Improvement so deep learning
  • 00:35:54
    0.8 random forest 0.6 dramatic
  • 00:35:57
    Improvement 20% better 13% better 26%
  • 00:36:01
    better 39% better deep learning models
  • 00:36:04
    work much much
  • 00:36:07
    better only eight actually it's 20 I I
  • 00:36:11
    was too lazy to to replace all the SP so
  • 00:36:13
    20 citations of the uh Red Line um
  • 00:36:18
    publication and it says correction and
  • 00:36:21
    what this correction says that due to an
  • 00:36:23
    error in processing of the random Forest
  • 00:36:25
    models good old simple random forest
  • 00:36:28
    models you develop a new deep learning
  • 00:36:31
    approach but you cannot process random
  • 00:36:34
    Forest models some right the random
  • 00:36:37
    forest numbers were incorrect
  • 00:36:39
    and when you
  • 00:36:43
    correct you totally lose significant
  • 00:36:50
    Improvement but who
  • 00:36:52
    cares because you got a new bug
  • 00:36:56
    and you got a new drug against the new
  • 00:36:58
    bug so who cares what happens with the
  • 00:37:00
    method well turned out that this
  • 00:37:04
    result was also recalled because it
  • 00:37:05
    turned out that another lab three years
  • 00:37:07
    earlier discovered that same compound
  • 00:37:10
    and decided not to pursue it anymore
  • 00:37:11
    because it was
  • 00:37:16
    toxic here's another
  • 00:37:18
    story what if you have the most
  • 00:37:21
    unlimited computational power in the
  • 00:37:22
    world what do you do you you right you
  • 00:37:24
    discover
  • 00:37:25
    drugs and then you tell the world that
  • 00:37:28
    you have the fastest supercomputer and
  • 00:37:30
    therefore you're the best in the world drug
  • 00:37:32
    discoverer because you run hundreds and
  • 00:37:34
    hundreds of hours of super computer time
  • 00:37:38
    and the super computer not a human the
  • 00:37:40
    super computer identified 77 new
  • 00:37:44
    antiviral drugs at Oak Ridge National
  • 00:37:46
    Laboratory and it was
  • 00:37:49
    published not in nature for some reason
  • 00:37:51
    a little bit suspicious but nevertheless
  • 00:37:54
    enough to get publicity
  • 00:37:57
    and then the same group had the
  • 00:37:59
    guts of publishing the letter in in in
  • 00:38:02
    uh New England Journal of Medicine the
  • 00:38:04
    journal with the highest one of the
  • 00:38:06
    highest uh citation uh value how to
  • 00:38:09
    discover antiviral drugs quickly no
  • 00:38:15
    less and that's the consequence of
  • 00:38:18
    pandemic thousands and thousands of drug
  • 00:38:20
    Discovery papers that appeared during
  • 00:38:22
    computational papers the whole world
  • 00:38:25
    sits at home you can't go to
  • 00:38:27
    the lab you can't make
  • 00:38:28
    molecules so you run docking studies and you
  • 00:38:31
    publish papers and that's that's that's
  • 00:38:33
    the pandemic um of drug
  • 00:38:36
    Discovery
  • 00:38:38
    uh the best comment I've I read uh about
  • 00:38:43
    publications discovering novel
  • 00:38:45
    inhibitors or novel drugs etc uh was from
  • 00:38:48
    John Chodera who asked the question in
  • 00:38:50
    his blog is it really discovery of new
  • 00:38:52
    Inhibitors there is zero experimental
  • 00:38:53
    data Maybe prop pual but even that's a
  • 00:38:56
    stretch digital dreams of new Inhibitors
  • 00:39:00
    that's how we should
  • 00:39:01
    be and that's a warning also that's
  • 00:39:04
    that's a warning of not being excited and
  • 00:39:06
    using correct terminology when we make
  • 00:39:09
    discoveries we make digital discoveries
  • 00:39:12
    there's now a journal a journal that's
  • 00:39:13
    called digital Discovery so the next
  • 00:39:16
    Journal should be called digital dreams
  • 00:39:18
    and and we all should be published there
  • 00:39:20
    because without experiment that's what
  • 00:39:22
    we produce we produce digital dreams I
  • 00:39:25
    found a wonderful quote from Feynman
  • 00:39:28
    which I really want all of us to
  • 00:39:31
    inscribe on our heads we've learned from
  • 00:39:34
    experience that truth will come out
  • 00:39:35
    other experimenters will repeat your
  • 00:39:37
    experiment and find out whether you are
  • 00:39:39
    wrong or right and although you may gain
  • 00:39:42
    some temporary Fame and excitement you
  • 00:39:44
    will not gain a good reputation as a
  • 00:39:46
    scientist and reputation is everything
  • 00:39:48
    for a scientist so we've published with
  • 00:39:53
    and AR is in the room a little feature
  • 00:39:56
    that was surely a joke in Mr doy and I
  • 00:39:59
    encourage you to read it because we show
  • 00:40:01
    there with data that all these
  • 00:40:03
    predictions by super computer were
  • 00:40:05
    experimentally shown to be inactive at
  • 00:40:09
    the time when they were educating the world
  • 00:40:10
    how to
  • 00:40:13
    discover so just to finish the sarcastic
  • 00:40:17
    part um one of our colleagues had
  • 00:40:20
    presented um at the AAAS meeting in
  • 00:40:24
    2019 kind of saying that we have to be
  • 00:40:27
    careful especially when we deal with big
  • 00:40:29
    data and super powerful
  • 00:40:31
    algorithms that the the algorithms have
  • 00:40:33
    been specifically developed to optimize
  • 00:40:35
    stuff and the more data you have and the
  • 00:40:37
    more computers you run the better you
  • 00:40:39
    optimize and with more powerful
  • 00:40:41
    computers that few people have access to
  • 00:40:44
    the computational experiments become
  • 00:40:45
    irreproducible because nobody else has
  • 00:40:47
    enough power to challenge and and and
  • 00:40:50
    revalidate models and so uh results and
  • 00:40:54
    I Illustrated um with a few examples can
  • 00:40:56
    be misleading and often completely wrong
  • 00:40:58
    there is general recognition of a crisis
  • 00:41:00
    in science I disagree with the last uh
  • 00:41:03
    point I don't think there is a crisis in
  • 00:41:05
    science but this is something that we
  • 00:41:06
    have to be very very careful with and
  • 00:41:09
    just to imprint this there's a
  • 00:41:10
    psychological rule of three uh repeats
  • 00:41:13
    right if you want your kids to remember
  • 00:41:15
    to clean dishes or clean the table you
  • 00:41:18
    have to say it three times this is this
  • 00:41:20
    is how we need to educate each other so
  • 00:41:24
    I've looked up um the
  • 00:41:27
    with AI what is if you say AI three
  • 00:41:31
    times and Urban Dictionary says the
  • 00:41:33
    price used by most evil people with
  • 00:41:35
    pattern functions
  • 00:41:38
    so please keep this in mind as you
  • 00:41:40
    investigate new algorithms that brings me
  • 00:41:42
    to the impactful trends and I'll try to
  • 00:41:45
    wrap
  • 00:41:46
    up so why do we want new methods this is
  • 00:41:51
    the biggest transformation I don't think
  • 00:41:53
    it's methods many methods that we
  • 00:41:55
    practice today have been kind developed
  • 00:41:58
    or originated many many years ago the
  • 00:42:00
    absolute biggest transformation of the
  • 00:42:02
    discipline of cheminformatics is because of
  • 00:42:05
    the data because of immense huge amount
  • 00:42:08
    of data in every sense of the word here
  • 00:42:12
    is a summary of current number of
  • 00:42:15
    molecules in
  • 00:42:16
    PubChem hundreds of millions of molecules
  • 00:42:19
    with assigned biological activity 41
  • 00:42:22
    million abstracts and 8.7 so European
  • 00:42:25
    PMC for some reason is bigger than the
  • 00:42:26
    American p
  • 00:42:28
    uh but enormous amount of literature and
  • 00:42:31
    as you know enormous amount of
  • 00:42:32
    literature really enabled the
  • 00:42:34
    development of tools such as ChatGPT
  • 00:42:37
    and alike
  • 00:42:39
    and constantly growing chemical libraries for
  • 00:42:42
    virtual screening and this immense amount
  • 00:42:45
    of data challenges model development
  • 00:42:48
    challenges Hardware challenges uh human
  • 00:42:51
    Minds challenges people who develop
  • 00:42:53
    algorithms and as we do it again I ask
  • 00:42:56
    you to uh
  • 00:42:58
    remember uh of problems and challenges
  • 00:43:00
    associated with with big data so uh here
  • 00:43:04
    is my summary of some of the Trends in
  • 00:43:08
    the field and absolutely impossible to
  • 00:43:10
    to reflect on everything despite the
  • 00:43:12
    request that I had from Sasha but this
  • 00:43:15
    is what I consider um emerging studies
  • 00:43:20
    and what's really helpful I mentioned in
  • 00:43:22
    the beginning that there is no way I
  • 00:43:25
    could cover in one lecture what is important
  • 00:43:29
    and I actually love this quote the public
  • 00:43:30
    have an insatiable curiosity to know everything except what is worth
  • 00:43:32
    knowing so this is my list of things
  • 00:43:35
    that are worth noting ability to model
  • 00:43:37
    increasingly large and complex for
  • 00:43:39
    instance chemical mixture data and
  • 00:43:41
    screen ultra-large chemical libraries
  • 00:43:43
    that's
  • 00:43:44
    emerging continue to emerge as the
  • 00:43:46
    challenge in front of us new algorithms
  • 00:43:49
    enhanced by machine learning to improve
  • 00:43:50
    both efficiency and accuracy of
  • 00:43:53
    models instant integration with
  • 00:43:56
    experiments
  • 00:43:58
    um I think Gisbert published a paper
  • 00:44:00
    defining dmta or was one of the first
  • 00:44:02
    people defining design make test analyze
  • 00:44:04
    cycle the cycle used to be disconnected
  • 00:44:07
    we design we make and then someone tests
  • 00:44:10
    and then someone synthesizes self-
  • 00:44:12
    driving labs uh and intelligent chemical
  • 00:44:15
    design with ersion or algorithms that
  • 00:44:17
    work at the level of building blocks
  • 00:44:19
    which is still smaller than the size of
  • 00:44:21
    modern libraries there are 50 60 billion
  • 00:44:24
    compounds and there are interesting
  • 00:44:27
    algorithms continue to emerge that look
  • 00:44:30
    at the space and mine the space of
  • 00:44:31
    building
  • 00:44:33
    blocks in silico real I'm not going to talk
  • 00:44:36
    about I'll talk about a few points just
  • 00:44:38
    to illustrate um but um it's becoming
  • 00:44:41
    increasingly important to work with
  • 00:44:44
    scientists conducting in vitro
  • 00:44:46
    experiments and combining in silico and
  • 00:44:48
    in vitro to develop models that provide um
  • 00:44:51
    reliable alternative to animal testing
  • 00:44:54
    both for toxicity testing
  • 00:44:57
    and for biological activity and
  • 00:44:59
    then cross disciplinary knowledge
  • 00:45:01
    integration and Mining to achieve
  • 00:45:03
    clinical impact and I would like to
  • 00:45:06
    suggest that focusing on ways of
  • 00:45:08
    achieving clinical impact through
  • 00:45:10
    cheminformatics research is one of the
  • 00:45:13
    outstanding hot challenges in cheminformatics
  • 00:45:16
    these days and then endless applications
  • 00:45:18
    beyond QSAR what I defined at some point as
  • 00:45:20
    QSAR without borders so uh that's the summary
  • 00:45:24
    let me give you a few examples summary
  • 00:45:26
    search hour some not uh and uh what I
  • 00:45:30
    also um said here this is what's really
  • 00:45:33
    helpful to Define Trends just look at
  • 00:45:36
    the lecture titles in this Workshop
  • 00:45:39
    because because the program it's
  • 00:45:41
    outstanding and it captures all the
  • 00:45:43
    current and emergent CS that I cannot
  • 00:45:45
    cover and nobody could cover my lecture
  • 00:45:47
    but the entire Workshop really stand out
  • 00:45:50
    when I looked at the program in terms of
  • 00:45:52
    um topics that will be covered um
  • 00:45:55
    beginning tomorrow
  • 00:45:58
    uh one of the
  • 00:46:00
    challenges that uh our group found
  • 00:46:03
    exciting is uh how
  • 00:46:05
    to uh achieve ultra-fast virtual screening
  • 00:46:10
    there have been Publications I mentioned
  • 00:46:12
    50 billion compounds that Enamine released
  • 00:46:14
    recently in their Library so it's very
  • 00:46:16
    apart and it's important uh and
  • 00:46:20
    impossible to use traditional methods um
  • 00:46:24
    Matthias is here and his company has
  • 00:46:26
    developed
  • 00:46:27
    or published one of the first not
  • 00:46:29
    published sorry released one of the
  • 00:46:33
    first algorithms to enable uh rapid
  • 00:46:35
    searches but but the challenge was out
  • 00:46:37
    there and people have been uh developing
  • 00:46:40
    Solutions right uh and why is it
  • 00:46:42
    difficult Ultra large libraries and and
  • 00:46:45
    traditionally High feature
  • 00:46:47
    dimensionality if you think about making
  • 00:46:51
    50 billion comparisons using 1024-sized
  • 00:46:54
    fingerprints that takes time that's what
  • 00:46:56
    makes uh things really
  • 00:46:59
    um difficult and so
  • 00:47:02
    uh what I want to share is is an
  • 00:47:04
    approach that we have developed over the
  • 00:47:06
    past year uh that uh and that was the
  • 00:47:09
    challenge in front to enable querying of
  • 00:47:12
    billion-sized libraries rapidly using as
  • 00:47:15
    little computing power as possible and
  • 00:47:18
    uh allow concurrent queries and free to
  • 00:47:20
    use so what's behind it is uh what has
  • 00:47:25
    constituted the sort of the the depth of
  • 00:47:28
    of uh deep chemistry in in in in the
  • 00:47:32
    last years and that is that transition
  • 00:47:34
    and with um five people um who speak at
  • 00:47:37
    this Workshop coauthored a paper last
  • 00:47:39
    year um Gisbert and Sasha and uh Olexandr
  • 00:47:44
    uh and Art so they're all in the
  • 00:47:46
    room and we coauthored a paper on what we
  • 00:47:48
    called emergence of deep QSAR and that
  • 00:47:51
    what we've highlighted there is the
  • 00:47:53
    principal difference between old QSAR and
  • 00:47:54
    new QSAR and that is that we operate uh
  • 00:47:58
    not in traditional computed chemical
  • 00:48:00
    descriptors but in the embedding space and
  • 00:48:03
    so the embedding uh enables different ways
  • 00:48:06
    of transforming chemical reality and
  • 00:48:07
    dealing with chemical reality and one of
  • 00:48:10
    them as has been shown uh in in
  • 00:48:13
    Professor Varnek's group and um uh Elana is
  • 00:48:16
    going to give a lecture about uh um
  • 00:48:20
    chemography is that that uh this type of
  • 00:48:24
    representation uh allows us to ask
  • 00:48:26
    questions in traditional questions in a
  • 00:48:27
    different way and calculate chemical
  • 00:48:30
    similarity and and address the issues of
  • 00:48:32
    chemical landscape in a way traditional
  • 00:48:35
    descriptors disallow and so uh one
  • 00:48:39
    aspect that we have investigated and
  • 00:48:41
    noticed is that as we run different
  • 00:48:44
    types of encoders and autoencoders and
  • 00:48:46
    transformations in the latent
  • 00:48:49
    space chemical similarity is not by
  • 00:48:52
    default preserved and when you analyze
  • 00:48:54
    the distances between compounds and the
  • 00:48:57
    space as Illustrated here using naive
  • 00:48:59
    types of autoencoders compounds tend to
  • 00:49:02
    disaggregate even if they're similar in the
  • 00:49:05
    original chemistry space and so we've
  • 00:49:07
    developed an algorithm that we call
  • 00:49:09
    structurally aware latent space
  • 00:49:11
    autoencoder that addressed the specific
  • 00:49:14
    issue of how to embed molecules in the
  • 00:49:16
    latent space while intentionally
  • 00:49:19
    preserving similarity between them and
  • 00:49:21
    one of the fundamental questions that
  • 00:49:23
    we've asked that I encourage every group
  • 00:49:26
    to discuss
  • 00:49:27
    uh and the discussion becomes hotter and
  • 00:49:30
    hotter and more and more informal the
  • 00:49:32
    more beer and other beverages like that
  • 00:49:34
    you consume and that's the discussion
  • 00:49:36
    about what is chemical similarity so uh
  • 00:49:39
    what we have decided to use is what I
  • 00:49:42
    think may be discussed as a fundamental
  • 00:49:46
    chemical similarity metric and that's
  • 00:49:48
    graph edit distance because that is a
  • 00:49:51
    mathematical
  • 00:49:52
    concept it's hard to calculate
  • 00:49:54
    especially if you want to compare two
  • 00:49:56
    very different molecules but you could
  • 00:49:59
    define similarity objectively not
  • 00:50:01
    using any particular descriptors
  • 00:50:03
    directly from the chemical structure and
  • 00:50:04
    chemical structure representation as a
  • 00:50:06
    graph and in order to train the model
  • 00:50:08
    we've developed large data sets of
  • 00:50:11
    analogs and each molecule in this data
  • 00:50:13
    set the anchor molecule is taken from
  • 00:50:15
    ChEMBL and then immediate analogs with
  • 00:50:18
    one graph edit
  • 00:50:20
    distance uh continuously is developed
  • 00:50:23
    five to 10 analogs are developed and
  • 00:50:26
    they're guaranteed to be nearest neighbors of each
  • 00:50:28
    other in the objectively defined
  • 00:50:30
    chemical graph space and then and then
  • 00:50:34
    we run the encoder and what this
  • 00:50:36
    slide illustrates is that as we embed
  • 00:50:38
    molecules we preserve original graph
  • 00:50:40
    edit distances calculated in the
  • 00:50:43
    chemical graph space and more broad
  • 00:50:46
    sense here is a comparison of the
  • 00:50:48
    distribution of embedding distances on the left
  • 00:50:52
    naive autoencoder on the right uh SALSA
  • 00:50:56
    space which which preserves chemical
  • 00:50:57
    similarity intentionally because that's
  • 00:51:00
    built into the
  • 00:51:01
    algorithm
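One way to check whether an embedding preserves similarity, as described above, is to compare the ranking of pairwise latent distances with the ranking of reference graph distances. The sketch below does this with a Spearman rank correlation written in plain Python. The four molecules, their graph edit distances, and both one-dimensional embeddings are invented toy values, and this is not the SALSA training procedure itself.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def distance_preservation(embeddings, graph_distance):
    """Spearman correlation between latent distances and reference graph distances."""
    pairs = list(combinations(sorted(embeddings), 2))
    latent = [euclidean(embeddings[a], embeddings[b]) for a, b in pairs]
    reference = [graph_distance[(a, b)] for a, b in pairs]
    return pearson(ranks(latent), ranks(reference))


# Hypothetical graph edit distances between four toy molecules.
GRAPH_DISTANCE = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 5,
    ("B", "C"): 1, ("B", "D"): 4, ("C", "D"): 3,
}
# Two hypothetical 1-D embeddings: one keeps the ordering, one scrambles it.
PRESERVING = {"A": (0.0,), "B": (1.0,), "C": (2.0,), "D": (5.0,)}
SCRAMBLED = {"A": (0.0,), "B": (5.0,), "C": (1.0,), "D": (2.0,)}

preserving_score = distance_preservation(PRESERVING, GRAPH_DISTANCE)
scrambled_score = distance_preservation(SCRAMBLED, GRAPH_DISTANCE)
```

A score near 1 means nearest neighbors in graph space stay nearest neighbors in the latent space; a low or negative score corresponds to the disaggregation seen with the naive autoencoder.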
  • 00:51:03
    and what's the use of it the use is that
  • 00:51:06
    we control the size of the chemical
  • 00:51:09
    space and we could decide uh which size
  • 00:51:11
    of the latent space we want to use to
  • 00:51:13
    encode molecules while preserving
  • 00:51:15
    chemical
  • 00:51:16
    similarity and uh what we have shown is
  • 00:51:20
    that uh SmallSA which is a version of the
  • 00:51:22
    SALSA with small size latent space
  • 00:51:26
    preserves chemical
  • 00:51:28
    similarity uh much better than than when
  • 00:51:31
    you use uh fingerprints and at the same
  • 00:51:33
    time enables the fastest computational
  • 00:51:35
    speed uh as we compare distances in
  • 00:51:39
    the embedded space with chemical graph
  • 00:51:41
    distances um for compounds um even
  • 00:51:44
    arbitrarily chosen my second example of
  • 00:51:48
    addressing a fundamental concept of
  • 00:51:50
    cheminformatics in um sort of novel way is
  • 00:51:54
    revisiting the concept of data
  • 00:51:58
    balance or imbalance and as I mentioned
  • 00:52:00
    initially we advocated for the use of uh
  • 00:52:03
    balanced uh
  • 00:52:05
    balanced data the objective here was to
  • 00:52:09
    train a model with available data to
  • 00:52:11
    distinguish standard objective active
  • 00:52:12
    from inactive but there is a caveat that
  • 00:52:16
    everybody observes typically when you
  • 00:52:18
    look at um large data sets from from
  • 00:52:20
    sources such as even ChEMBL let alone
  • 00:52:23
    PubChem is that most training sets are
  • 00:52:24
    imbalanced and when we screen compounds from
  • 00:52:28
    huge virtual libraries we know a
  • 00:52:30
    priori that a very small fraction of
  • 00:52:32
    those compounds even if all of them are
  • 00:52:34
    tested will turn out to be to have the
  • 00:52:37
    activity um the the desired activity so
  • 00:52:40
    the training sets are imbalanced in
  • 00:52:42
    practicality and the external virtual
  • 00:52:45
    screening sets are hugely imbalanced and
  • 00:52:48
    what we've realized and that's that
  • 00:52:50
    paper is that we need to change our
  • 00:52:54
    historical Trends as to how we've been
  • 00:52:55
    building models
  • 00:52:57
    and change the metrics that we use in
  • 00:52:59
    order to enable the recovery of the
  • 00:53:01
    largest number of positive hits because
  • 00:53:02
    when we do virtual screening and then
  • 00:53:04
    experimental testing we are hunting for
  • 00:53:07
    positives we're not hunting for models
  • 00:53:09
    with the highest balanced accuracy and
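The contrast between hunting for positives and maximizing balanced accuracy can be sketched with two toy models scored on the same imbalanced set. The scores, labels, threshold, and batch size `k` below are invented for illustration; the sketch only shows that a model with lower balanced accuracy can still put more true actives into the small batch that actually gets tested.

```python
def ppv_at_k(scores, labels, k):
    """Fraction of true actives among the k highest-scored compounds."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def balanced_accuracy(scores, labels, threshold=0.5):
    """Mean of sensitivity and specificity at a fixed score threshold."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_active = score >= threshold
        if label == 1 and predicted_active:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_active:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy imbalanced set: 2 actives followed by 18 inactives (hypothetical data).
LABELS = [1, 1] + [0] * 18

# Model X calls both actives correctly at the 0.5 threshold, but three
# inactives receive even higher scores and occupy the top of the ranking.
SCORES_X = [0.6, 0.6] + [0.9, 0.9, 0.9] + [0.1] * 15

# Model Y misses one active at the threshold, yet ranks both actives first.
SCORES_Y = [0.95, 0.2] + [0.1] * 18

K = 2  # size of the batch sent for experimental testing (hypothetical)

ba_x, ba_y = balanced_accuracy(SCORES_X, LABELS), balanced_accuracy(SCORES_Y, LABELS)
ppv_x, ppv_y = ppv_at_k(SCORES_X, LABELS, K), ppv_at_k(SCORES_Y, LABELS, K)
```

In this toy case model X wins on balanced accuracy while every compound in its top batch is inactive, and model Y does the opposite, which is the situation the talk argues matters for virtual screening.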
  • 00:53:12
    so uh and then the practical
  • 00:53:15
    aspect of this work was that typically
  • 00:53:16
    people screen compounds in batches in
  • 00:53:19
    uh well plates and so um what we used
  • 00:53:23
    here is uh um positive predictive value as
  • 00:53:27
    the metric of accuracy estimated and
  • 00:53:30
    selecting 128 compounds from the
  • 00:53:33
    external set and sets were created from
  • 00:53:35
    PubChem uh split into training and to nominate
  • 00:53:38
    from test set and they were
  • 00:53:40
    intentionally imbalanced both the
  • 00:53:41
    training set and external set and what
  • 00:53:44
    we found that uh and this this is a
  • 00:53:47
    comparison between balanced and
  • 00:53:48
    imbalanced in five different data sets
  • 00:53:51
    is that when we train models to have the
  • 00:53:53
    highest positive predictive value
  • 00:53:57
    we achieve much higher
  • 00:54:00
    positive predictive value in the
  • 00:54:01
    external set than with models trained
  • 00:54:04
    traditionally with balanced accuracy and
  • 00:54:06
    that's the last column here um more um
  • 00:54:10
    vividly Illustrated in the next slide
  • 00:54:13
    when we um compare the performance of
  • 00:54:15
    traditional models and models
  • 00:54:17
    emphasizing nontraditional non-standard
  • 00:54:19
    metric of evaluation such as
  • 00:54:24
    PPV a few comments about
  • 00:54:27
    um trends that I find emerging and and
  • 00:54:30
    important and they will be covered in
  • 00:54:31
    this workshop such as AI/ML-enhanced
  • 00:54:34
    generative chemistry so Gisbert is going
  • 00:54:37
    to talk about it tomorrow uh design make
  • 00:54:40
    test analyze cycle integration within
  • 00:54:42
    SDL so um this is um what I
  • 00:54:46
    illustrate here is a publication from uh
  • 00:54:48
    Alán Aspuru-Guzik's work uh several years ago
  • 00:54:52
    but there is an absolutely strongly
  • 00:54:55
    emerging Trend in integrating
  • 00:54:56
    computations inside the DMTA cycle
  • 00:54:59
    implemented within self-driving labs and
  • 00:55:01
    so I think it's an important area of
  • 00:55:03
    research and of course various uh
  • 00:55:05
    machine learning accelerated
  • 00:55:07
    calculations uh that are made fast and
  • 00:55:09
    accurate and so Olexandr is going to talk
  • 00:55:11
    about this Paradigm of ml enhanced
  • 00:55:15
    calculations as applied to quantum
  • 00:55:16
    mechanical tasks um I think on
  • 00:55:19
    Friday and deep docking that Art who made it
  • 00:55:23
    despite um the bad train
  • 00:55:26
    performance uh as promised um developed
  • 00:55:29
    Deep Docking so you're going to hear about
  • 00:55:31
    that uh from Art I'm going to skip our
  • 00:55:33
    own work and sort of extension in the
  • 00:55:36
    interest of time and talk about the last
  • 00:55:39
    topic the importance of integration of
  • 00:55:42
    cheminformatics with clinical work and uh even
  • 00:55:46
    though traditionally we're focusing on lead
  • 00:55:49
    discovery and target discovery as
  • 00:55:50
    standard cheminformatics tasks um what
  • 00:55:54
    is emerging is holistic understanding of
  • 00:55:56
    the entire pipeline from left to right
  • 00:55:58
    and development and application of
  • 00:56:00
    cheminformatics or cheminformatics-like
  • 00:56:02
    techniques to address the entirety of
  • 00:56:05
    the pipeline all the way to the last
  • 00:56:08
    point which is adherence Precision
  • 00:56:10
    medicine and Drug application one
  • 00:56:12
    particular area that is reemerging to be
  • 00:56:16
    exciting is called drug repurposing
  • 00:56:18
    which is application of drugs existing
  • 00:56:20
    drugs for new applications and to me
  • 00:56:22
    this is one of the most exciting
  • 00:56:25
    potential types of research for
  • 00:56:27
    cheminformaticians because it could make an
  • 00:56:29
    immediate impact in the clinic if you
  • 00:56:31
    suggest an existing drug to be
  • 00:56:33
    repurposed for a new disease um
  • 00:56:36
    physicians have the right to use those
  • 00:56:39
    drugs in clinic if they are
  • 00:56:41
    convinced that they could do it and
  • 00:56:42
    there is nothing against it and here is
  • 00:56:45
    a a story that I found fascinating
  • 00:56:48
    inspirational and motivational about a
  • 00:56:51
    new colleague of mine named David
  • 00:56:53
    Fajgenbaum who is shown here more
  • 00:56:55
    than 10 years ago on the left an athlete from
  • 00:56:58
    UPenn who got stricken by a rare disease
  • 00:57:02
    called Castleman and uh had his uh Last
  • 00:57:06
    Words uh essentially and and uh was
  • 00:57:09
    nearly dying so two two cases of
  • 00:57:12
    clinical death this is how he looks
  • 00:57:15
    today and what he did was that he
  • 00:57:18
    reasoned
  • 00:57:19
    overd on him on himself on his own and
  • 00:57:24
    once he was diagnosed with what's called
  • 00:57:26
    idiopathic multicentric Castleman
  • 00:57:28
    disease what he learned about himself is
  • 00:57:30
    that he had elevated cytokines uh
  • 00:57:33
    increased expression of um one of the
  • 00:57:36
    proteins and T cell activation so he asked
  • 00:57:38
    the question is there any drug that
  • 00:57:40
    could regulate those clinical phenotypes
  • 00:57:43
    and found sirolimus an mTOR inhibitor known to
  • 00:57:47
    be used uh in patients that received
  • 00:57:50
    organ
  • 00:57:51
    transplant and performed poorly and the
  • 00:57:54
    organ is rejected and there is toxicity
  • 00:57:56
    associated with it so he linked
  • 00:57:58
    symptoms that he learned about in his
  • 00:58:00
    medical research or medical training um
  • 00:58:03
    of patients with organ transplant
  • 00:58:05
    rejection and his own symptoms and
  • 00:58:08
    decided and his doctor basically said
  • 00:58:10
    you're dying anyway why not so he's been in
  • 00:58:13
    remission for 10 years and published uh
  • 00:58:16
    a New York Times bestseller about uh how
  • 00:58:19
    to cure your own chasing your own cure
  • 00:58:20
    is the title of the book so
  • 00:58:24
    um so there's the this is a slide I
  • 00:58:26
    think we all know about drug repurposing
  • 00:58:28
    and and how potentially expeditious
  • 00:58:32
    could the drug Discovery be if we do
  • 00:58:34
    this and we also have been exposed to a
  • 00:58:36
    recent case of drug repurposing uh by
  • 00:58:39
    BenevolentAI who nominated baricitinib as a
  • 00:58:44
    potential drug of choice for COVID as early
  • 00:58:47
    as in February of 2020 just uh couple of
  • 00:58:50
    months after the world learned about
  • 00:58:53
    this disease and the way they reasoned
  • 00:58:55
    was that um we have a viral
  • 00:58:58
    entry and it elicits inflammation so it's a
  • 00:59:01
    very simple reasoning and so what is behind
  • 00:59:04
    inflammation cytokine signaling uh which
  • 00:59:07
    is mediated by a few kinases um what
  • 00:59:11
    works on those kinases a few drugs but
  • 00:59:13
    baricitinib
  • 00:59:15
    specifically is a polypharmacology drug
  • 00:59:18
    that inhibits all three kinases involved
  • 00:59:21
    either in endocytosis or in cytokine storm and
  • 00:59:24
    so that was sort of the reasoning um that
  • 00:59:26
    they used to nominate this drug looking
  • 00:59:28
    at what historically appears as a
  • 00:59:30
    remember chemical systems biology is one
  • 00:59:33
    of the areas of cheminformatics and the
  • 00:59:35
    original uh
  • 00:59:37
    publication and what they really used
  • 00:59:40
    was what's called the knowledge graph and
  • 00:59:41
    and Knowledge Graph studies uh I find
  • 00:59:44
    probably one of the most exciting
  • 00:59:46
    projects around lab and in Labs that are
  • 00:59:48
    studying knowledge graphs uh which have
  • 00:59:50
    been introduced by Google and allow
  • 00:59:53
    people to
  • 00:59:54
    answer really really bizarre questions
  • 00:59:57
    such as what is common between da
  • 01:00:00
    Vinci and the Eiffel
  • 01:00:03
    Tower and you must be deeply drunk or
  • 01:00:06
    nuts to ask a question like this unless
  • 01:00:09
    you have access to the knowledge
  • 01:00:11
    graph behind Wikipedia which says that da
  • 01:00:14
    Vinci painted Mona Lisa which is in the Louvre
  • 01:00:16
    which is in
  • 01:00:19
    Paris now that's a path through the
  • 01:00:22
    knowledge graph and it's a crazy path
  • 01:00:24
    but if you contemplate the process of
  • 01:00:28
    consolidating the World Knowledge
  • 01:00:30
    distributed across multiple data
  • 01:00:32
    clinical nonclinical chemical biological
  • 01:00:36
    biomedical historical medical clinical
  • 01:00:39
    Etc uh and integrate this
  • 01:00:42
    knowledge
  • 01:00:44
    uh compiled by domain scientists and uh
  • 01:00:48
    stored in databases integrated with
  • 01:00:51
    inferences made from this data to affect
  • 01:00:53
    what domain scientists do in their labs
  • 01:00:56
    that can be done through Knowledge Graph
  • 01:00:58
    construction Knowledge Graph curation
  • 01:01:01
    familiar term Knowledge Graph
  • 01:01:04
    compression Knowledge Graph completion
  • 01:01:06
    from computer science perspective
  • 01:01:08
    Knowledge Graph completion is the same
  • 01:01:09
    as making predictions about new edges
  • 01:01:12
    that connect two semantic objects such as
  • 01:01:14
    a drug and a disease and that effectively
  • 01:01:17
    means nomination of a new drug for a new
  • 01:01:20
    disease predicting this particular uh
  • 01:01:23
    new connection and so
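Knowledge graph completion as edge prediction can be illustrated with a very small path-counting sketch. All entities and relations below (`drug_A`, `protein_1`, `disease_X`, `inhibits`, and so on) are hypothetical placeholders, and counting two-step paths is just one simple scoring choice made for this example, not the algorithm used in the projects mentioned in the talk.

```python
from collections import defaultdict

# Toy knowledge graph as (head, relation, tail) triples; all names are hypothetical.
TRIPLES = [
    ("drug_A", "treats", "disease_X"),
    ("drug_A", "inhibits", "protein_1"),
    ("drug_A", "inhibits", "protein_2"),
    ("drug_B", "inhibits", "protein_2"),
    ("protein_1", "associated_with", "disease_X"),
    ("protein_1", "associated_with", "disease_Y"),
    ("protein_2", "associated_with", "disease_Y"),
]


def build_index(triples):
    """Map each head node to the set of (relation, tail) edges leaving it."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[head].add((relation, tail))
    return index


def two_step_paths(index, drug, disease):
    """Count paths drug -> intermediate node -> disease."""
    count = 0
    for _, middle in index.get(drug, ()):
        for _, tail in index.get(middle, ()):
            if tail == disease:
                count += 1
    return count


def predict_new_edges(triples, drugs, diseases):
    """Rank drug-disease pairs that lack a 'treats' edge by supporting path count."""
    index = build_index(triples)
    known = {(h, t) for h, r, t in triples if r == "treats"}
    candidates = []
    for drug in drugs:
        for disease in diseases:
            if (drug, disease) in known:
                continue
            score = two_step_paths(index, drug, disease)
            if score > 0:
                candidates.append((score, drug, disease))
    return sorted(candidates, reverse=True)


predictions = predict_new_edges(
    TRIPLES, ["drug_A", "drug_B"], ["disease_X", "disease_Y"]
)
```

Each returned tuple is a proposed new edge with its path count as a crude strength of association; real systems would weight relation types and validate candidates before nominating anything.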
  • 01:01:26
    um there are multiple algorithms being
  • 01:01:29
    developed again I find this very
  • 01:01:30
    exciting for cheminformaticians to work on
  • 01:01:32
    this the simplest approaches that can be
  • 01:01:34
    used is to discover rules that connect
  • 01:01:37
    known or that are behind biological
  • 01:01:40
    Pathways that are behind known drug
  • 01:01:42
    disease connections and then these rules
  • 01:01:44
    can be used to forecast whether a drug
  • 01:01:48
    and disease that have a connected path
  • 01:01:51
    found frequently in association with drugs
  • 01:01:53
    and diseases could be repurposed for
  • 01:01:56
    a new disease and I'm part of a huge
  • 01:01:59
    project recently awarded to UPenn by
  • 01:02:02
    ARPA-H to examine and this is a familiar
  • 01:02:06
    territory all against all similarity
  • 01:02:08
    searches all against all docking
  • 01:02:10
    calculations all against all biological
  • 01:02:14
    biomedical graph mining to calculate the
  • 01:02:17
    strength of association between every
  • 01:02:19
    drug and every disease and through this
  • 01:02:21
    integration searches and identification
  • 01:02:24
    of um scores uh the hope is that
  • 01:02:28
    relatively low-hanging fruits will emerge
  • 01:02:31
    similar to the story of David Fajgenbaum
  • 01:02:33
    curing himself that can be uh tried in
  • 01:02:37
    the clinic and so that's a cheminformatics-type
  • 01:02:40
    analysis of a Knowledge Graph remember
  • 01:02:42
    chemicals are graphs and here again
  • 01:02:46
    we're back to the concept of graphs and
  • 01:02:48
    uh ways of mining graphs that are
  • 01:02:50
    familiar to us as informaticians in
  • 01:02:51
    order to uh recover drugs and nominate
  • 01:02:54
    them for clinical trial I promised that
  • 01:02:57
    E equals mc² is a universal formula and I
  • 01:03:01
    found uh this picture the real meaning of
  • 01:03:04
    E equals mc² so that's not mine but what
  • 01:03:08
    is mine is the new formula of scientific
  • 01:03:11
    discovery that I share with you and
  • 01:03:13
    encourage you to use and that's what my
  • 01:03:15
    talk was about how to combine modeling
  • 01:03:18
    and Curative content and so that
  • 01:03:21
    together uh really creates opportunity
  • 01:03:23
    for new scientific discovery overall my
  • 01:03:26
    last two slides as cheminformaticians we
  • 01:03:30
    depend on data tools and challenges and
  • 01:03:32
    when these three semantic objects
  • 01:03:35
    are combined new hypotheses emerge
  • 01:03:39
    now
  • 01:03:40
    hypotheses have to be
  • 01:03:42
    validated and filtered out so filtering
  • 01:03:46
    building filtering models is very
  • 01:03:47
    important and after filters a value is
  • 01:03:51
    created um Sasha talked about Investors
  • 01:03:54
    so that's the point of of interaction
  • 01:03:56
    with them but value is not necessarily
  • 01:03:58
    dollars it's a new knowledge that goes
  • 01:04:00
    back and fuels data tools and challenges
  • 01:04:03
    integration it takes a village it takes
  • 01:04:06
    learning it takes multiple scientists
  • 01:04:08
    interacting with each other um someone
  • 01:04:10
    once asked me if I have drawn this this
  • 01:04:13
    particular picture of scientists working
  • 01:04:15
    with each other I said no it was Picasso
  • 01:04:17
    but um I was puzzled by the question in
  • 01:04:21
    my personal opinion and I'm allowed to
  • 01:04:23
    share my personal opinion my personal
  • 01:04:24
    opinion there is another missing
  • 01:04:26
    element of this entire
  • 01:04:29
    workflow thank you for um giving me a
  • 01:04:32
    cup of coffee before this lecture Sasha it
  • 01:04:33
    certainly was very
  • 01:04:35
    helpful and then I'll I'll end with last
  • 01:04:38
    two slides I've talked a lot about hype
  • 01:04:40
    this uh you may have seen this this is
  • 01:04:42
    Gartner is a consulting company that
  • 01:04:44
    publishes stuff like this hype cycle so
  • 01:04:47
    hopefully with computer-assisted drug
  • 01:04:49
    Discovery we're Beyond this
  • 01:04:53
    hype uh and we're into what's called um
  • 01:04:57
    plateau of productivity with all the
  • 01:04:59
    methods so I'm ending on a really positive
  • 01:05:01
    note despite multiple criticism that
  • 01:05:04
    you've Pro so uh this is this is the
  • 01:05:07
    summary again I'm not going to read it
  • 01:05:08
    I've talked about Big Data I've talked
  • 01:05:10
    about exciting developments really the
  • 01:05:12
    interface between computation organic
  • 01:05:14
    chemistry tools and data sharing there
  • 01:05:16
    are many many topics I have not talked
  • 01:05:18
    about and with that my last educational
  • 01:05:20
    comment um from Derek Lowe it's not that
  • 01:05:24
    machines are going to replace chemists it's that
  • 01:05:26
    chemists who use machines I really love
  • 01:05:28
    this quote will replace those who don't
  • 01:05:31
    so that's what everybody is learning
  • 01:05:32
    here today thank you very
  • 01:05:50
    much Alex I think you you extremely
  • 01:05:55
    important
  • 01:05:56
    that the experiment will always
  • 01:05:58
    validate how do you deal with the
  • 01:06:00
    knowledge that 80% of what is published
  • 01:06:04
    in biology
  • 01:06:06
    C you
  • 01:06:09
    can't uh but uh there is ongoing
  • 01:06:14
    effort I didn't have time to recruit to
  • 01:06:17
    to to discuss there are two ongoing
  • 01:06:19
    efforts I
  • 01:06:23
    um uh highly support the concept of
  • 01:06:25
    self-driving labs because it's data
  • 01:06:29
    generation clean in C reproducibly right
  • 01:06:34
    so essentially what I'm saying is we
  • 01:06:36
    cannot fix the past but we could change
  • 01:06:39
    the present and the
  • 01:06:41
    future
  • 01:06:44
    and the second ongoing effort that
  • 01:06:46
    Structural Genomics Consortium is pushing
  • 01:06:48
    is a new effort in generating high
  • 01:06:51
    quality data using various types of
  • 01:06:54
    experimental tools
  • 01:06:56
    sharing this data there's a database
  • 01:06:58
    called AIRCHECK that they're creating
  • 01:07:00
    and and basically having the community
  • 01:07:02
    effort to build it but I I I don't think
  • 01:07:05
    there's anything we could do with past
  • 01:07:08
    data okay I suggest to move the discussion to
  • 01:07:13
    the welcome party otherwise we need to
  • 01:07:16
    choose I have some
  • 01:07:20
    or thank thank you
Tags
  • Cheminformatics
  • Data quality
  • Modeling
  • Error management
  • Big data
  • Machine learning
  • Clinical integration
  • Scientific hype
  • Historical reflection
  • Data curation