Strasbourg Summer School in Chemoinformatics, 2024 : Alex TROPSHA
Summary
TL;DR: The talk opens with an introduction to cheminformatics and its challenges, focusing on data quality and its effect on modeling. The speaker, an experienced cheminformatician, reflects on key developments and trends in the field. He highlights in particular the problem of errors in databases, which can lead to unreliable models. The talk covers recent advances in cheminformatics, including the use of artificial intelligence and machine learning to analyze large data sets and the integration of these models with clinical applications. He also takes a critical look at the hype around overrated scientific discoveries. The talk closes by stressing the importance of careful, rigorous handling of cheminformatics data.
Highlights
- 🎓 Cheminformatics requires careful data preparation to avoid modeling errors.
- 📚 Learning from the past is essential to advancing cheminformatics.
- 🔍 Data quality remains a central challenge in cheminformatics.
- 🤝 Integrating cheminformatics with clinical data can deliver direct clinical benefits.
- 🚫 Excessive hype around new discoveries is criticized as a problem.
- 🧠 Machine learning can improve efficiency and accuracy, but it has to be applied with care.
- 🗂️ Correcting data errors in existing databases is essential.
- 🔄 Curation protocols and control mechanisms are essential for data quality.
- 📈 The shift to large data sets brings new challenges and opportunities.
- 📝 Scientists should be wary of overblown promises in publications.
Timeline
- 00:00:00 - 00:05:00
The speaker thanks Sasha for the introduction and the invitation and congratulates Professor Varnek on winning this year's Skolnik Award from the American Chemical Society. He notes that these workshops have trained generations of cheminformaticians and stresses the value of getting together to exchange scientific ideas.
- 00:05:00 - 00:10:00
He describes the challenge he was given: to cover the entire field of cheminformatics in less than an hour. He will speak as a sarcastic but optimistic cheminformatician and address the past, present, and future of the field.
- 00:10:00 - 00:15:00
He introduces the concept of reminiscence, the recalling of episodic memories from one's personal past, and explains how this can trigger thoughts about the future. He discusses the philosophical side of reminiscing about the future.
- 00:15:00 - 00:20:00
He reviews the grand challenges of cheminformatics, which remain current today: drug discovery, green chemistry, networks of connected chemical information, and the need for computation to inform and serve experiments.
- 00:20:00 - 00:25:00
He stresses the importance of error detection in databases, especially with large volumes of data. He discusses protocols for curating chemical data accurately and notes that, while chemical structures can largely be canonicalized, biological data has no canonical measurements, so duplicates and their differing measurements need careful treatment.
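A minimal sketch of the duplicate check described in the transcript: group records by structure, then separate duplicates that agree on the measured label from those that disagree. It assumes each structure has already been reduced to a canonical key upstream (for example a canonical SMILES or InChIKey from a cheminformatics toolkit); the function name `split_duplicates` and the toy records are hypothetical, not taken from the talk.

```python
from collections import defaultdict


def split_duplicates(records):
    """Separate unique, concordant-duplicate and discordant-duplicate structures.

    records: iterable of (structure_key, label) pairs, where structure_key is
    assumed to be an already-canonicalized identifier.
    """
    groups = defaultdict(list)
    for key, label in records:
        groups[key].append(label)

    unique, concordant, discordant = {}, {}, {}
    for key, labels in groups.items():
        if len(labels) == 1:
            unique[key] = labels[0]
        elif len(set(labels)) == 1:
            concordant[key] = labels[0]  # same label every time: keep one copy
        else:
            discordant[key] = labels     # conflicting labels: flag for review
    return unique, concordant, discordant


# Toy data: keys stand in for canonical structure identifiers.
records = [
    ("CCO", 1), ("CCO", 1),          # duplicate with identical labels
    ("c1ccccc1", 0),                 # unique structure
    ("CC(=O)O", 1), ("CC(=O)O", 0),  # duplicate with conflicting labels
]
unique, concordant, discordant = split_duplicates(records)
print(unique, concordant, discordant)
```

What to do with the discordant group (drop it, average it, or go back to the source) is a curation decision that the sketch deliberately leaves open.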
- 00:25:00 - 00:30:00
He discusses prediction for external data sets, the precautions it requires, and the history of the applicability domain. He stresses that predictions are reliable only within a quantitatively defined proximity to the training set.
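One simple way to make "proximity to the training set" quantitative is a nearest-neighbor distance cutoff. The sketch below is one common distance-based formulation and not necessarily the one used by the speaker: the threshold (mean nearest-neighbor distance within the training set plus `z` standard deviations), the value of `z`, and the use of Euclidean distance on descriptor vectors are all assumptions for illustration.

```python
import math
from statistics import mean, pstdev


def nearest_distance(point, others):
    """Euclidean distance from point to its closest neighbor in others."""
    return min(math.dist(point, o) for o in others)


def domain_threshold(train, z=0.5):
    """Cutoff = mean + z * std of nearest-neighbor distances inside the training set."""
    nn = [
        nearest_distance(p, train[:i] + train[i + 1:])
        for i, p in enumerate(train)
    ]
    return mean(nn) + z * pstdev(nn)


def in_domain(query, train, threshold):
    """A query is inside the domain if it lies close enough to some training point."""
    return nearest_distance(query, train) <= threshold


# Toy descriptor space: four training compounds on a unit square.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
threshold = domain_threshold(train)

print(in_domain([0.5, 0.5], train, threshold))  # query inside the square
print(in_domain([5.0, 5.0], train, threshold))  # query far from all training points
```

Predictions for queries that fall outside the cutoff are not necessarily wrong, but there is no training data nearby to support them.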
- 00:30:00 - 00:35:00
He introduces the concept of data set modelability, which indicates how well a data set can be modeled, and the MODI index used to estimate it. He also touches on the conditions under which chemical models can be interpreted.
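The transcript describes MODI as effectively a one-nearest-neighbor model that measures how often nearest neighbors carry a different class label. A minimal sketch follows, assuming the index is the class-averaged fraction of compounds whose nearest neighbor has the same label, and assuming Euclidean distance on descriptor vectors; the function name `modi` and the toy data are illustrative only.

```python
import math
from collections import defaultdict


def modi(descriptors, labels):
    """Class-averaged fraction of compounds whose nearest neighbor shares their label."""
    same = defaultdict(int)
    total = defaultdict(int)
    n = len(descriptors)
    for i in range(n):
        # Index of the closest other compound in descriptor space.
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(descriptors[i], descriptors[j]),
        )
        total[labels[i]] += 1
        if labels[nearest] == labels[i]:
            same[labels[i]] += 1
    return sum(same[c] / total[c] for c in total) / len(total)


# Four compounds in a one-dimensional toy descriptor space.
X = [[0.0], [0.1], [1.0], [1.1]]

clustered = modi(X, [0, 0, 1, 1])  # neighbors share labels
cliffs = modi(X, [0, 1, 0, 1])     # every nearest neighbor is an activity cliff
print(clustered, cliffs)
```

According to the transcript, data sets with an index below roughly 0.6 typically resist accurate modeling regardless of the algorithm used.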
- 00:35:00 - 00:40:00
He criticizes the frequent publication of misleading model results. He warns against relying on overall accuracy in model validation, especially for imbalanced data sets.
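The transcript's example of a data set with a 4:1 class ratio can be worked through directly. The sketch below labels all 100 molecules as class one, with no model at all, and compares total accuracy with balanced accuracy (the mean of the per-class recalls). The helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# 80 molecules of class 1, 20 of class 2; predict class 1 for everything.
y_true = [1] * 80 + [2] * 20
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)  # 0.8 0.5
```

The 80% figure reflects only the class ratio; the balanced figure of 50% (100% recall for one class, 0% for the other) shows that nothing has been learned.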
- 00:40:00 - 00:45:00
He emphasizes that balancing data sets is critical in model development. When a data set is imbalanced, a model can look accurate without really discriminating between the classes, which often leads to misleading results.
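The transcript recommends similarity-based down-sampling of the larger class so that the model has to discriminate a small number of instances from an equally small number of similar ones. The sketch below is one possible reading of that idea, not the speaker's protocol: fingerprints are represented as sets of feature identifiers, similarity is the Tanimoto coefficient, and the majority class is cut down to the members most similar to any minority-class member. All names and data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def downsample_by_similarity(minority, majority):
    """Keep the len(minority) majority fingerprints most similar to the minority class."""
    ranked = sorted(
        majority,
        key=lambda fp: max(tanimoto(fp, m) for m in minority),
        reverse=True,
    )
    return ranked[: len(minority)]


# Toy fingerprints as sets of feature ids.
minority = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
majority = [
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
    frozenset({2, 3, 4, 6}),
    frozenset({10, 11}),
    frozenset({1, 9}),
]

kept = downsample_by_similarity(minority, majority)
print(kept)
```

Keeping the hardest (most similar) majority-class examples makes the classification task deliberately more difficult, which is the point made in the talk.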
- 00:45:00 - 00:50:00
He discusses the challenges of interpreting chemical models and shows that chemical features never act in isolation. A simple interpretation based on a few descriptors is usually not sensible.
- 00:50:00 - 00:55:00
He turns to recent developments in cheminformatics and stresses the importance of reading beyond the current literature and building on the insights of earlier scientists.
- 00:55:00 - 01:00:00
He discusses current challenges in cheminformatics, including the need for better machine learning benchmarks and valid chemical structures. He also mentions cases of erroneous data that can lead to wrong conclusions.
- 01:00:00 - 01:07:28
He describes how deep learning is often overhyped and points out that corrections have had to be made in high-profile publications when the original conclusions were inaccurate.
Video Q&A
What is the main focus of the talk?
The focus is on data quality and modeling in cheminformatics.
What main problems in cheminformatics are addressed?
Data quality problems, the handling of errors in databases, and the significance of model results.
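On the significance of model results, the transcript points out that models cannot be compared by single fixed statistics, because those statistics have a distribution. A minimal sketch of one way to take that into account is a paired bootstrap of the accuracy difference between two models on the same test set. The resampling scheme, the 95% percentile interval, and the toy predictions are assumptions for illustration and are not taken from the talk or the blog it cites.

```python
import random


def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """95% percentile interval for accuracy(A) - accuracy(B) under paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample test compounds with replacement; both models see the same sample.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(pred_a[i] == y_true[i] for i in idx) / n
        acc_b = sum(pred_b[i] == y_true[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Toy test set of 50 compounds; model A makes 5 errors, model B makes 6 different ones.
y_true = [0, 1] * 25
pred_a = list(y_true)
pred_b = list(y_true)
for i in range(5):
    pred_a[i] = 1 - pred_a[i]
for i in range(5, 11):
    pred_b[i] = 1 - pred_b[i]

lower, upper = bootstrap_accuracy_diff(y_true, pred_a, pred_b)
print(lower, upper)
```

If the interval contains zero, as it should for a one-compound difference on 50 test compounds, the claim that one model is superior is not supported by this test set.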
What new trends in cheminformatics are discussed?
The use of machine learning, large data sets, and integration with clinical applications.
Why is the history of cheminformatics emphasized?
To stress the importance of learning from past scientific work and not repeating its mistakes.
According to the speaker, what is an essential part of a successful cheminformatics project?
Preparing and cleaning the data is essential.
What criticism is made of current scientific publications?
Excessive hype is criticized, particularly around unconfirmed discoveries.
What role does machine learning play in cheminformatics?
Machine learning can improve the efficiency and accuracy of modeling, particularly for large data sets.
Which historical figure is quoted to illustrate the value of the past in research?
Isaac Newton is quoted on standing on the shoulders of giants.
What is identified as a future challenge in cheminformatics?
Integrating chemical information with clinical data to achieve direct clinical impact.
What advice is given for handling chemical data?
The talk stresses the need for strict data curation and quality control.
- 00:00:00right so Sasha thank you for um the
- 00:00:03great intro um as a few people in the
- 00:00:07room I've been part of this circus since
- 00:00:10since the beginning and enjoyed many uh
- 00:00:14wonderful events lectures friendship
- 00:00:17professional interactions so uh thank
- 00:00:19you for inviting me again for keeping me
- 00:00:22on my toes for so many
- 00:00:23years uh I think it's a special year uh
- 00:00:27because uh not only should I thank
- 00:00:31Professor Varnek for inviting me but also
- 00:00:34congratulate him on winning the Skolnik
- 00:00:35award from the American Chemical Society
- 00:00:37this year which uh a few people in this
- 00:00:40room are going to be attending and
- 00:00:42speaking at this event and in addition
- 00:00:45to his uh multiple scientific
- 00:00:47contributions he was also acknowledged
- 00:00:49for um for these workshops for training
- 00:00:53generations of cheminformaticians and
- 00:00:56allowing us to get together and exchange
- 00:00:58scientific ideas so congratulations and
- 00:01:01thank you for um a lot of service to the
- 00:01:04community that you've done over the
- 00:01:06years so I'm really honored to to speak
- 00:01:09here um thank you also for giving me an
- 00:01:12impossible
- 00:01:13task uh because when Sasha approached me
- 00:01:16he said just speak about everything the
- 00:01:19entire field in less than one hour he
- 00:01:22restricted me
- 00:01:24unfortunately um and that is to explain a
- 00:01:28subtitle
- 00:01:30uh of my talk uh which is that you're
- 00:01:32going to hear comments from a sarcastic
- 00:01:34but optimistic cheminformatician uh which
- 00:01:38is how I describe myself and so
- 00:01:39hopefully will be combination of uh
- 00:01:42notes about the past and present and
- 00:01:46future of the field and I will start
- 00:01:48with a
- 00:01:50philosophical comment uh to explain the
- 00:01:53title of the lecture because um I think
- 00:01:55it's somewhat self-contradictory
- 00:01:57reminiscing about the future uh but
- 00:02:00apparently there is a field there's the
- 00:02:01entire field called reminiscence and the
- 00:02:04comments are coming from a paper
- 00:02:06published in the International Journal
- 00:02:08of Reminiscence and Life
- 00:02:11Review you can't imagine how many
- 00:02:13journals exist these days you could find
- 00:02:15whenever you have some idea you could
- 00:02:16always find something published but it's
- 00:02:19actually quite interesting uh
- 00:02:21reminiscence by definition involves
- 00:02:22recalling episodic memories from one's
- 00:02:24personal past which is how we started
- 00:02:27this is this is memories about these
- 00:02:29workshops
- 00:02:30this process often triggers thoughts of
- 00:02:32the future so think about this I think
- 00:02:34it's actually accurate and conversely
- 00:02:37imagining our future can frequently
- 00:02:38stimulate
- 00:02:40reminiscence most interesting what I
- 00:02:43found is that not only past and future
- 00:02:45share certain properties in common that
- 00:02:47are distinct from the present which is
- 00:02:50where we are right now and the present
- 00:02:52is the least attractive the least
- 00:02:54interesting because both past and future
- 00:02:57are infinite in contrast the present is
- 00:02:59coming into being and slipping away
- 00:03:02lasting perhaps less than an hour of
- 00:03:06this lecture so um that's that's that's
- 00:03:10my intro um I think it's it's pretty
- 00:03:13profound and fortunately I found
- 00:03:16something so um so here's the outline as
- 00:03:21advertised I will talk about the past
- 00:03:24the basics the reminiscence uh but I'll
- 00:03:26try to position reminiscence in the
- 00:03:28context of current and future research uh in
- 00:03:33cheminformatics uh and I also uh found the
- 00:03:36words from
- 00:03:37Chris Rea's famous song The Blue Café which I
- 00:03:41think applies to this field where have
- 00:03:43you been where are you going to I want to
- 00:03:44know what's new I want to go with you
- 00:03:46and I think that that should Define the
- 00:03:49essence of study in cheminformatics for
- 00:03:51all of us uh and I want to make a point
- 00:03:55that the good old days cheminformatics concepts
- 00:03:57are still valid so hopefully um you will
- 00:04:00agree I'll talk about recent
- 00:04:03developments as well as warnings and that
- 00:04:05will be a somewhat sarcastic part of the
- 00:04:08lecture uh discussing Sense and
- 00:04:10Sensibility of cheminformatics that now we
- 00:04:13use the word deep always um as we
- 00:04:16discuss this concept on occasion
- 00:04:18throwing in terms such as AI and other
- 00:04:21forms of learning and in the future the
- 00:04:23impactful trends um and and I think the
- 00:04:27the most impactful Trend that is
- 00:04:29reflected in the field is that it has
- 00:04:31become a big data science discipline
- 00:04:34with everything that big data science
- 00:04:36brings into the challenges that are
- 00:04:39facing us so I'll start by the
- 00:04:43foundations of the field in 2009 the
- 00:04:46Journal of Cheminformatics was formed and in
- 00:04:48the very first
- 00:04:49issue uh there were Grand challenges
- 00:04:52that were defined um and as I'm not
- 00:04:55going to read this to you I'll let you
- 00:04:56read but perhaps you could see that the
- 00:05:01same Grand challenges can be formulated
- 00:05:03today almost entirely the same way
- 00:05:06except perhaps that we're dealing with
- 00:05:09uh the different types of data different
- 00:05:11volumes of data and different technical
- 00:05:14challenges but philosophically the grand
- 00:05:16challenges such as drug Discovery green
- 00:05:19chemistry understanding life as chemists
- 00:05:23and enabling networks of information and
- 00:05:25connected knowledge to be explored these
- 00:05:28are still big challenges in front of the
- 00:05:30field so it's it's it's good because it
- 00:05:32it only illustrates how fundamental this
- 00:05:35discipline that at some point was coined
- 00:05:37and called cheminformatics is so I'll go
- 00:05:41through a very few Concepts that I
- 00:05:43consider key paradigms of the field uh
- 00:05:46and and overall and that has become is
- 00:05:49becoming I reflect on this more and more
- 00:05:51important that that the computations are
- 00:05:55done in the name of experiments fueled
- 00:05:57by experiments processed just
- 00:05:59computationally but it's always important
- 00:06:02to remember that in the end we want to
- 00:06:03use cheminformatics methods and tools in
- 00:06:06order to affect in the most plausible
- 00:06:08way the experiments and inform and and
- 00:06:11increase the experimental hit
- 00:06:14rate it's also important uh as and still
- 00:06:18important today to recognize and it's
- 00:06:21one of the most wonderful papers in the
- 00:06:23history of the field how not to do
- 00:06:25research and I um again will talk about
- 00:06:28how not to do research today and how
- 00:06:31some of the concepts especially those
- 00:06:33that that I highlight they still haven't
- 00:06:36prevailed in many Publications in the field
- 00:06:38but for those who are relatively new to
- 00:06:40the field knowing how not to do research
- 00:06:44continues to be very important and
- 00:06:46learning about 21 and in fact that should
- 00:06:48be
- 00:06:4922 Key rules of how not uh to do
- 00:06:54research in cheminformatics is very
- 00:06:57important error detection
- 00:07:00um has been continues to be uh one of
- 00:07:04the key elements of cheminformatics
- 00:07:06research and I would like to emphasize
- 00:07:09and re-emphasize how important it is to
- 00:07:11pay attention to the quality of data
- 00:07:14especially as we process large massive
- 00:07:17data today and here are just a few
- 00:07:19examples of the types of errors that
- 00:07:21could be continuously found in databases
- 00:07:24especially duplicates which is special
- 00:07:27case where um data should be properly
- 00:07:29treated for chemical duplicates and uh
- 00:07:33um chemical duplicates that have
- 00:07:36differential measurements associated
- 00:07:38with this and and the existence of uh
- 00:07:41errors in databases uh affects both
- 00:07:46simple and very complex models and
- 00:07:48continues to
- 00:07:49affect them so protocols have been developed
- 00:07:53and continuously should be applied in
- 00:07:55order to curate both chemical data uh
- 00:07:58and and we and others have developed
- 00:07:59special protocols going methodically and
- 00:08:02systematically through uh chemical data
- 00:08:05and curating and
- 00:08:08representing chemical structures
- 00:08:10accurately as well as and that that is
- 00:08:13even more important dealing with
- 00:08:15biological uncertainty and there is a um
- 00:08:18an aspect of data curation that I always
- 00:08:20highlight and that is that one could
- 00:08:23talk about canonical chemical structures
- 00:08:26and curate chemical structures
- 00:08:28accurately with some caveats where it's
- 00:08:30impossible such as multiple chiral centers
- 00:08:32that have not been resolved or tautomers that
- 00:08:35have not been resolved but for the most
- 00:08:37part canonical chemical rules can be
- 00:08:40applied uh to curate chemical structures
- 00:08:42accurately but when it comes to biological
- 00:08:44data there are no canonical measurements
- 00:08:47biological data is inherently inaccurate
- 00:08:51perhaps irreproducible so a lot of
- 00:08:53attention needs to be paid to processing
- 00:08:56data and eliminating duplicates and
- 00:08:58looking at whether or not duplicative
- 00:08:59chemical structures after curation have
- 00:09:02identical or dissimilar properties
- 00:09:05associated with them so um this has
- 00:09:08been and continues to be fundamental
- 00:09:10aspects of uh cheminformatics research and
- 00:09:14then how we model the data uh how we
- 00:09:17process the data and how important
- 00:09:18continuously is to uh think about
- 00:09:23prediction for external data sets rather
- 00:09:26than uh overemphasizing the quality of
- 00:09:29models developed for the training set
- 00:09:32which we could continuously especially
- 00:09:34especially with big data and I allude to
- 00:09:37this um also uh overtrained models or
- 00:09:41the more powerful algorithms we use the
- 00:09:43more is the chance that the models are
- 00:09:45going to overtrain so it's important
- 00:09:47to figure out how to convince yourself
- 00:09:50and the world that the models that we've
- 00:09:52built with this data can extrapolate to
- 00:09:55new data and what comes with this
- 00:09:57extrapolation kind of uh precautions we
- 00:10:00need to take um to assure accurate
- 00:10:04extrapolation of models developed on the
- 00:10:06training set so to deal with these issues
- 00:10:10historically another key Paradigm
- 00:10:12applicability domain uh which continues
- 00:10:14to be very important part of modern
- 00:10:16research how to define the applicability
- 00:10:19domain it now has slightly different name
- 00:10:21I'll talk about it but uh the objective
- 00:10:23is to recognize that no matter how large
- 00:10:25our training set is it has limited size
- 00:10:29and extrapolation can be reliable only
- 00:10:32in certain directions in the chemistry
- 00:10:33space and within proximity of the
- 00:10:36training set that needs to be defined
- 00:10:38the chemical space needs to be defined
- 00:10:40and the proximity to the training set
- 00:10:41needs to be defined
- 00:10:43quantitatively and uh the associated
- 00:10:45predictions can be reliable or less
- 00:10:47reliable depending on how we Define the
- 00:10:50applicability
- 00:10:51domain and then some Concepts that have
- 00:10:54been developed uh which I also find uh
- 00:10:56fundamental such as uh whether a data
- 00:11:00set can be modelable and it's a funny
- 00:11:02word modelability of the data but uh
- 00:11:07empirically uh I think every cheminformatician at
- 00:11:10some point had found out that the data
- 00:11:13for some reason the model perhaps
- 00:11:16look great for the training set perhaps
- 00:11:18did not even look great for the training
- 00:11:20set but the uh extrapolation has been
- 00:11:24repeatedly unreliable and so we've
- 00:11:26defined this concept of modelability and
- 00:11:29index that we've called MODI which
- 00:11:31effectively is a one nearest neighbor
- 00:11:33model and if a data set has a large
- 00:11:35number of nearest neighbors with
- 00:11:38different designation of the target
- 00:11:40property that particular index that
- 00:11:44calculate calculates the fraction of
- 00:11:46these property cliffs or activity
- 00:11:48cliffs turned out to be a very strong and
- 00:11:51simple indicator of our ability to
- 00:11:54develop and not develop a model that can
- 00:11:57extrapolate well and here is a um
- 00:12:00also an
- 00:12:02observation uh from the same paper that
- 00:12:04defined MODI um and that is uh that what
- 00:12:08we've demonstrated is that if we take
- 00:12:10large number of data sets and calculate
- 00:12:12this simple MODI index one nearest neighbor
- 00:12:16model uh what we find is no matter how
- 00:12:19um
- 00:12:20complex strong deep algorithms we try to
- 00:12:24use to establish a model of high
- 00:12:27accuracy if the MODI index is below
- 00:12:29point 6 no matter how much you try you
- 00:12:31typically are unable to develop model of
- 00:12:34sufficient accuracy and that's just the
- 00:12:36property of the data uh it's a point
- 00:12:39I'll come back to and I've chosen some
- 00:12:41points that I will come back to sort of
- 00:12:43in a new way uh reflect on the recent
- 00:12:47development in the
- 00:12:49field uh and then couple of more points
- 00:12:51that I want to make from the past uh
- 00:12:53which again I find important to share
- 00:12:56and emphasize as part of classical
- 00:12:58historic
- 00:12:59fundamental cheminformatics research one
- 00:13:02is how we convince ourselves and the
- 00:13:05world that the model is accurate and
- 00:13:07here's one of the uh standard errors
- 00:13:10that uh have been uh articulated on how
- 00:13:13not to do QSAR by John Dearden uh and and
- 00:13:17that is uh we need to be mindful about
- 00:13:19the distribution of the data and what
- 00:13:22comes with it many data sets I'll come
- 00:13:24back to this point too um are imbalanced
- 00:13:27and here is a simple example that if a
- 00:13:28data set has 4:1 ratio 80 molecules of
- 00:13:33one class and 20 molecules of another
- 00:13:35class if we just calculate total
- 00:13:38accuracy which is the most intuitive and
- 00:13:40most
- 00:13:40immediate um attempt that one makes uh
- 00:13:44then the total accuracy is going to be
- 00:13:4580% without even building a model if you
- 00:13:49say that everything is class one all 100
- 00:13:52molecules and 80 of them are class one
- 00:13:53you're 80% accurate and so uh that that
- 00:13:57certainly doesn't make sense and
- 00:13:59therefore uh more specific metrics need
- 00:14:02to be considered when and for the most
- 00:14:04part data sets are imbalanced
- 00:14:07uh so in this case a balanced accuracy
- 00:14:10calculation or correct classification
- 00:14:12rate which is essentially the same as
- 00:14:14balanced accuracy when applied to the same
- 00:14:16data set immediately shows that the
- 00:14:18model on average is 50% accurate right
- 00:14:21100% for one class 0% for another class and
- 00:14:25so uh we've always advocated for data
- 00:14:29set balancing before your own
- 00:14:31calculations and the reason for this was
- 00:14:33that when the data set is imbalanced and
- 00:14:35you have small number of objects in a
- 00:14:37smaller
- 00:14:38class then uh there is a rather small
- 00:14:42number of chemicals similar enough to
- 00:14:45the molecules in the training set of
- 00:14:47class
- 00:14:48one that
- 00:14:50um that uh um make it relatively
- 00:14:55difficult to build a classification
- 00:14:57model and if you increase the size of um
- 00:15:00if you don't down sample if you don't
- 00:15:02balance the data you make the task of
- 00:15:05developing a model easier and easier the
- 00:15:08more imbalance is in the data set that's
- 00:15:11what we always Advocate the data
- 00:15:14set needs to be balanced to make the
- 00:15:16task of classification model development
- 00:15:18more difficult because you are
- 00:15:20discriminating a small number of
- 00:15:22instances from another small number of
- 00:15:25similar instances and recommending that the
- 00:15:27similarity based sampling and down
- 00:15:30sampling of the second class bigger
- 00:15:32class needs to be accomplished I will
- 00:15:34come back to this point because that's
- 00:15:36that's uh a great example of change in
- 00:15:39Paradigm reflecting the new trend in the
- 00:15:42field and and understanding the data
- 00:15:44associated with it the final point I
- 00:15:47want to make uh is uh and it's a very
- 00:15:50popular topic so-called explainable AI in
- 00:15:52today's world is how models or can
- 00:15:54models be interpreted and a typical uh
- 00:15:58statement of interpretation simply means
- 00:16:00that we're identifying or emphasizing
- 00:16:03specific chemical functional groups a
- 00:16:05small number of descriptors that have
- 00:16:07the highest loading in the chemical
- 00:16:09sense and essentially reduce the model
- 00:16:12and reduce the interpretation of the
- 00:16:14model to a few statements about chemical
- 00:16:17functional groups that are important or
- 00:16:18a subset of descriptors that is
- 00:16:21important and uh what I show here to
- 00:16:23illustrate this point is a distribution
- 00:16:25of descriptor values for a particular
- 00:16:27data set
- 00:16:29and then and then we calculate um how
- 00:16:33descriptor values change when you go
- 00:16:35from one molecule to another and what I
- 00:16:37want to emphasize here is that a minor
- 00:16:40perturbation of of of the initial
- 00:16:43molecule affects nearly every chemical
- 00:16:46descriptor that we calculate directly
- 00:16:47from the chemical
- 00:16:49structure and um as we run these uh
- 00:16:53highly similar molecules every single
- 00:16:56descriptor is affected so there is no
- 00:16:59objectively specific or any any reason
- 00:17:02to say that one descriptor really
- 00:17:05explains the entire um behavior of the
- 00:17:08entire molecule and so it's very
- 00:17:10critical as we struggle because chemists
- 00:17:13always want information as we struggle
- 00:17:15to explain statistical models it's
- 00:17:18important to remember that chemical
- 00:17:20features never act in isolation from
- 00:17:23each other and when you modify a
- 00:17:24molecule multiple features are getting
- 00:17:26affected and therefore explanation of
- 00:17:28multivariate models by one or a few
- 00:17:31descriptors is typically not sensible
- 00:17:33there are ways of taking this into account and
- 00:17:36uh I'll leave this out but in general um
- 00:17:40published about how you combine model
- 00:17:45interpretation if you're absolutely forced by
- 00:17:49whatever reason to identify a few
- 00:17:51chemical features that are chemically
- 00:17:53modifiable for instance you could use
- 00:17:56Global QSAR models in order to evaluate
- 00:17:59predicted consequences of a small
- 00:18:02chemical modification uh which is always
- 00:18:04local so there are some way of dealing
- 00:18:07with this but this is a very general
- 00:18:08statement uh which explains uh the
- 00:18:10impossibility of simple interpretation
- 00:18:13of chemical models and that uh I wanted
- 00:18:16to start with a few notes just kind of
- 00:18:18to position ourselves uh with respect to
- 00:18:21the history of of the field of cheminformatics
- 00:18:24and some key Concepts that people have
- 00:18:25formulated and elaborated upon in um in
- 00:18:30the last 20 or some years since since uh
- 00:18:33the name of the field was coined and now
- 00:18:37uh let's talk about recent and current
- 00:18:39developments and some warnings
- 00:18:41associated with these developments and
- 00:18:42that's where I will allow myself to be
- 00:18:45somewhat sarcastic at least so Sense and
- 00:18:47Sensibility of uh deep cheminformatics uh
- 00:18:51and I want to highlight and emphasize
- 00:18:53how important it is to read Beyond
- 00:18:57current literature and relate what we do
- 00:19:00today to what has been done by wonderful
- 00:19:04and um um great scientists of the past
- 00:19:09and I personally want to acknowledge
- 00:19:10three scientists uh from uh all of our
- 00:19:14recent past um one is
- 00:19:17Archimedes um who worked a few years ago
- 00:19:20as as all of you know uh who presented a
- 00:19:24an incredible case of Ingenuity in
- 00:19:27solving a difficult problem using
- 00:19:29relatively simple means right and
- 00:19:31determining the the density of the
- 00:19:34crown and then getting excited somehow
- 00:19:37when I was looking for the history of
- 00:19:40Archimedes the most exciting um story
- 00:19:44associated with him uh in majority of
- 00:19:48sources that I saw was that he was
- 00:19:49running naked uh on the streets not not
- 00:19:52the scientific content of the discovery
- 00:19:55and then of course uh Isaac Newton uh
- 00:19:57one of the most famous statements uh
- 00:20:00which is very accurate and and I always
- 00:20:02find this is inspirational if I've seen
- 00:20:04further uh it's by standing on the
- 00:20:06shoulders of giants so again uh
- 00:20:09there is a tendency not to read the past
- 00:20:11literature and uh it is a statement from
- 00:20:14one of the greatest Minds uh in the
- 00:20:16humankind who um had explained how
- 00:20:20studies and education should run and
- 00:20:23finally um Albert Einstein who came up
- 00:20:27with what I will refer in the end of my
- 00:20:30lecture to uh as a universal formula of
- 00:20:35research and
- 00:20:36Discovery uh because I I I realized um
- 00:20:40that this this particular formula has
- 00:20:42much greater value than even what
- 00:20:45Einstein initially
- 00:20:47contemplated
- 00:20:50so
- 00:20:51uh what's new what new happened in the
- 00:20:55uh before sort of in the recent history
- 00:20:58of cheminformatics that's the next
- 00:21:00section of this lecture and uh what I've
- 00:21:03used in order to highlight uh what is
- 00:21:06considered current challenges in the
- 00:21:09field are blogs that one of our colleagues Pat
- 00:21:12Walters uh of Relay Therapeutics has
- 00:21:15formulated so Pat publishes blogs
- 00:21:18very profound highlighted and
- 00:21:21educational so um those of you have not
- 00:21:25read his blogs he publishes maybe three
- 00:21:27to five blogs a year where he addresses
- 00:21:30current challenges in front of
- 00:21:32cheminformaticians and in 2023
- 00:21:352024 a few blogs that I summarized
- 00:21:39here sounded awfully familiar if if you
- 00:21:43recall how I I I I um cited the
- 00:21:46Journal of Cheminformatics so what do we
- 00:21:48need to do today that we did not know or
- 00:21:51that we did not do 20 years
- 00:21:54ago we need better benchmarks for
- 00:21:57machine learning we need valid
- 00:21:59structures this is direct quote
- 00:22:01consistent chemical representation and
- 00:22:04account of stereochemistry consistent
- 00:22:06measurements realistic dynamic range of
- 00:22:10the data um fundamental for cheminformatics and
- 00:22:15cut offs that we use when we transform
- 00:22:17data into a a format that enables
- 00:22:20classification
- 00:22:21model data curation explicitly and here is
- 00:22:25a funny story within a recent blog uh
- 00:22:28there is a blood-brain barrier data set
- 00:22:30there's a um um classical Benchmark data
- 00:22:33set called MoleculeNet that people have
- 00:22:35used in hundreds of Publications and
- 00:22:39in Pat's observation the data set contains
- 00:22:4159 duplicative structures 10 of those
- 00:22:43duplicative structures have different
- 00:22:45labels and yet hundreds of papers are
- 00:22:49published including today with the
- 00:22:51newest cheminformatics machine learning AI
- 00:22:54deep learning deepest learning super
- 00:22:56deep learning incredibly deep
- 00:22:58and and all of that are published
- 00:23:01claiming superiority to previous models
- 00:23:03by about
- 00:23:050.00005 AUC which is the most common
- 00:23:09uh type of of Publications these
- 00:23:12days when the field is flooded
- 00:23:14by people trained in computer science
- 00:23:17but not necessarily in chemistry and so
- 00:23:19they're trained to
- 00:23:21compete on benchmarking and validation
- 00:23:25and unfortunately they're not trained to
- 00:23:28think about the type of data that
- 00:23:30they're dealing with and processing so
- 00:23:32they canete on Benchmark and so that's
- 00:23:35that's what's happens today right and
- 00:23:37this is everything that I've talked
- 00:23:39about everything that was formulated in
- 00:23:41the beginning by the foundational
- 00:23:44Publications and C continues to today
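
The duplicate-structure problem described above for the blood-brain barrier set is easy to check for. The following is a minimal sketch, not Pat Walters's procedure: it assumes every structure has already been standardized into a comparable key (for example a canonical SMILES or InChIKey from a cheminformatics toolkit), and the function name and toy records are invented for illustration.

```python
from collections import defaultdict


def find_label_conflicts(records):
    """Group (structure_key, label) pairs and report duplicates and conflicts.

    Assumption: `structure_key` is an already standardized identifier,
    so identical molecules share the same key.
    """
    labels_by_key = defaultdict(list)
    for key, label in records:
        labels_by_key[key].append(label)

    # A structure is duplicated if it occurs more than once ...
    duplicates = {k: v for k, v in labels_by_key.items() if len(v) > 1}
    # ... and conflicting if its copies do not all carry the same label.
    conflicts = {k: v for k, v in duplicates.items() if len(set(v)) > 1}
    return duplicates, conflicts


# Toy records, invented for illustration only.
records = [
    ("CCO", 1),
    ("CCO", 1),        # duplicate with a consistent label
    ("c1ccccc1", 0),
    ("c1ccccc1", 1),   # duplicate with a conflicting label
    ("CC(=O)O", 0),
]

duplicates, conflicts = find_label_conflicts(records)
print(len(duplicates), "duplicated structures,", len(conflicts), "with conflicting labels")
```

Running a check of this kind before training costs seconds, whereas conflicting labels put a ceiling on the accuracy any model can honestly reach on the affected compounds.
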
- 00:23:47 And then model validation: here is the
- 00:23:50 direct title of one of Pat's recent blogs,
- 00:23:52 "Comparing Classification Models: You're
- 00:23:54 Probably Doing It Wrong", and he goes to
- 00:23:57 a great extent explaining how you
- 00:24:00 cannot just compare models by statistics
- 00:24:04 assigned to fixed values, because values
- 00:24:06 have distributions and we need to take the
- 00:24:08 distribution into account. So it's really
- 00:24:12 simply written, but a very
- 00:24:16 powerful blog. So I want to use a
- 00:24:19 few recent examples, we're talking about
- 00:24:22 cheminformatics done in 2023 and
- 00:24:25 2024, and what's new compared to what I
- 00:24:28 called foundational
- 00:24:30 cheminformatics. And here is our own
- 00:24:32 recent exercise, which should have been,
- 00:24:35 as I thought, an exercise that would take
- 00:24:38 a few hours of one of my graduate
- 00:24:40 students' time, because we've been made
- 00:24:43 part of a very large project at UNC
- 00:24:46 on direct antiviral drug discovery, and
- 00:24:49 we decided to look across the ChEMBL
- 00:24:52 database to see if people have
- 00:24:54 discovered broad-spectrum antivirals
- 00:24:56 previously. So the task was to go into
- 00:25:00 ChEMBL, identify all antiviral assays,
- 00:25:04 identify compounds tested in those assays,
- 00:25:07 and then look up if the same compounds
- 00:25:09 show activity in different assays, right? A
- 00:25:13 few-minute exercise.
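
On paper, the task just described is a simple group-by. The sketch below assumes the antiviral activity records have already been extracted into `(compound_id, virus, is_active)` tuples; that tuple layout, the function name and the toy data are hypothetical and do not reflect the ChEMBL schema.

```python
from collections import defaultdict


def broad_spectrum_candidates(activities, min_viruses=2):
    """Return compounds reported active against at least `min_viruses` distinct viruses."""
    viruses_by_compound = defaultdict(set)
    for compound_id, virus, is_active in activities:
        if is_active:
            viruses_by_compound[compound_id].add(virus)
    return {
        compound: sorted(viruses)
        for compound, viruses in viruses_by_compound.items()
        if len(viruses) >= min_viruses
    }


# Toy activity records, invented for illustration only.
activities = [
    ("CPD-1", "virus A", True),
    ("CPD-1", "virus B", True),
    ("CPD-2", "virus A", True),
    ("CPD-2", "virus A", True),   # same virus twice does not make it broad-spectrum
    ("CPD-3", "virus B", False),
    ("CPD-3", "virus C", True),
]

hits = broad_spectrum_candidates(activities)
print(hits)
```

The query itself really is a few minutes of work; it only becomes meaningful once the underlying records are comparable, meaning the same assay type and known assay conditions.
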
- 00:25:15 However, what Holli Martin, the graduate
- 00:25:18 student working on this, discovered is
- 00:25:22 the way the data is written in the
- 00:25:24 highly curated ChEMBL data set. And
- 00:25:26 we've complained in the past about the
- 00:25:28 curation effort, and they fixed it,
- 00:25:31 many things have been fixed, I've
- 00:25:34 published about it. But what you want to
- 00:25:37 know, if you're comparing molecules
- 00:25:38 tested in antiviral assays: you want to
- 00:25:40 know the type of the assay and you want
- 00:25:42 to know the assay conditions, so that you
- 00:25:44 could compare apples and apples,
- 00:25:48 naturally. When it comes to assay conditions
- 00:25:51 as recorded in the database, what
- 00:25:54 do we typically want to know? It's an
- 00:25:55 antiviral assay: we want to know the
- 00:25:57 virus, cell, assay time, concentration,
- 00:26:01 assessment. One of the common ways of
- 00:26:03 recording the data in ChEMBL: a molecule
- 00:26:06 had antiviral activity against
- 00:26:09 SARS. End of
- 00:26:11 story. Best-case
- 00:26:16 scenario: very elaborate description,
- 00:26:19 small number of data, but that's the
- 00:26:21 ideal-case scenario. And there is a huge
- 00:26:24 spectrum in between, when the cells are
- 00:26:27 mentioned, the virus is mentioned, but no
- 00:26:30 experimental conditions, and you're left
- 00:26:32 to wonder whether it was left out from
- 00:26:35 the primary publication or was
- 00:26:38 never recorded. And so Holli then
- 00:26:43 started, and I asked her to document the
- 00:26:46 amount of time. I'm talking about the
- 00:26:47 most simplistic task, preparing a data
- 00:26:50 set, which typically people describe in
- 00:26:52 the publication in one line: "I prepared the
- 00:26:54 data set." Okay.
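
The range of annotation quality described above, from "active against SARS, end of story" to a fully specified assay, can be quantified before any modeling starts. This is a minimal sketch under stated assumptions: the field names in `REQUIRED_FIELDS` are hypothetical placeholders for the conditions listed in the lecture (virus, cell, assay time, concentration), not actual ChEMBL column names, and the records are invented.

```python
# Hypothetical field names standing in for the assay conditions named in the lecture.
REQUIRED_FIELDS = ("virus", "cell_line", "assay_time_h", "concentration_uM")


def missing_fields(record):
    """List the required assay-condition fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def completeness_report(records):
    """Count fully annotated records and, per field, how often it is missing."""
    missing_counts = {f: 0 for f in REQUIRED_FIELDS}
    n_complete = 0
    for record in records:
        missing = missing_fields(record)
        for f in missing:
            missing_counts[f] += 1
        if not missing:
            n_complete += 1
    return n_complete, missing_counts


# Toy records, invented for illustration only.
records = [
    {"virus": "SARS"},                              # "end of story"
    {"virus": "virus A", "cell_line": "cell X"},    # cells and virus, no conditions
    {"virus": "virus A", "cell_line": "cell X",
     "assay_time_h": 48, "concentration_uM": 10.0}, # ideal case
]

n_complete, missing_counts = completeness_report(records)
print(n_complete, "of", len(records), "records fully annotated;", missing_counts)
```

A report like this does not recover the missing values, but it makes the size of the curation job visible before anyone writes "I prepared the data set".
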
- 00:27:04 So phenotypic assays were missing the cell
- 00:27:08 type, the most critical parameter of a phenotypic
- 00:27:11 assay: in what cells was the assay
- 00:27:14 conducted? It was found in the assay
- 00:27:17 description, but not where you expect
- 00:27:21 to find it, so she needed to extract this
- 00:27:23 from the assay description, or it was
- 00:27:26 completely missing. For 64% of the data it
- 00:27:29 was recorded but not in the right place,
- 00:27:32 and in some cases it was not reported. Time
- 00:27:35 spent: 150
- 00:27:37 hours. Sure, a graduate salary is relatively
- 00:27:42 low, but it's still not nothing, and
- 00:27:45 again it's against expectation, because
- 00:27:47 we're talking about processing easy-to-get
- 00:27:49 data that should be
- 00:27:51 really... So that was that. So
- 00:27:54 total time: 200 hours just dealing with
- 00:27:56 phenotypic assays. And then Holli
- 00:28:00 recognized that if we searched with the most
- 00:28:01 common, most obvious qualifier, the BioAssay
- 00:28:05 Ontology definition of assay type, we
- 00:28:08 would have missed
- 00:28:10 99.4% of all the data, because the data was
- 00:28:13 there but not in the right place, not in
- 00:28:15 the right column. And so retrieving data is
- 00:28:18 critical, as mundane as
- 00:28:23 the task appears, as simple as the task
- 00:28:27 appears. But if you don't have the data,
- 00:28:29 you don't have the models, or if the data
- 00:28:31 is mislabeled, you don't have the models.
- 00:28:33 And so, for new and old people in the
- 00:28:38 field, and I always emphasize this: we
- 00:28:40 have to spend time, and
- 00:28:43 frequently manual time, cleaning the data,
- 00:28:46 preparing the data. It's unexciting. You
- 00:28:47 cannot get an NIH grant for it, because
- 00:28:51 you may succeed on significance, perhaps
- 00:28:54 impact; you will never succeed on
- 00:28:56 methodology, because there is no
- 00:28:58 methodology, there's no innovation in
- 00:29:00 methodology, there's no innovation. So
- 00:29:02 if you take the standard grant scoring
- 00:29:05 criteria, you're not really
- 00:29:07 going to get the grant. So what
- 00:29:10 was supposed to be a few hours of work
- 00:29:13 turned out to be a good portion of a
- 00:29:15 graduate student's thesis, and in the end
- 00:29:17 she's built a database called SMACC, and
- 00:29:19 we're using this database to ask
- 00:29:22 questions and building the second
- 00:29:23 version of this database now for
- 00:29:26 host-directed therapies, with the same idea
- 00:29:28 of looking up, or enabling people to
- 00:29:31 look up, the data and compare compounds
- 00:29:34 that they find against compounds found
- 00:29:35 in the
- 00:29:37 data. Other aspects of the field: so
- 00:29:41 that was a little bit depressing,
- 00:29:42 but not unexpected. What else has been
- 00:29:45 happening in cheminformatics recently? And
- 00:29:48 this is a publication from K, who
- 00:29:51 is in the room and is speaking later
- 00:29:55 here, and I chose the publication
- 00:29:58 because I think it reflects the sort of
- 00:30:00 transformation and transition in the
- 00:30:02 field from a relatively naive
- 00:30:06 identification and discussion
- 00:30:08 of critical concepts to a much deeper,
- 00:30:13 much more rigorous, much more thorough
- 00:30:16 analysis of the core concepts in
- 00:30:19 cheminformatics, such as chemistry space. And
- 00:30:23 this publication defined a concept of
- 00:30:26 roughness of the chemical landscape as a
- 00:30:28 way of addressing the same problem that
- 00:30:30 I've alluded to, whether or not a data
- 00:30:33 set is modelable, but again at a much more
- 00:30:37 thorough
- 00:30:39 level. But the aspects of the publication
- 00:30:43 that I highlight talk about concepts
- 00:30:46 that have been predefined, where we're
- 00:30:48 redefining the concepts that have
- 00:30:50 been defined previously, but in a more
- 00:30:53 computationally and theoretically rigorous
- 00:30:55 way. And so that has been part of
- 00:30:58 modern cheminformatics: the influx of
- 00:31:01 people who think much deeper and from
- 00:31:04 a different perspective about problems
- 00:31:06 that we have encountered. There are
- 00:31:07 multiple publications that address
- 00:31:09 issues of fundamental value to
- 00:31:13 cheminformatics, but that's sort of the
- 00:31:15 angle that I chose to introduce this
- 00:31:20 topic. And then
- 00:31:23 come the most popular aspects of
- 00:31:26 cheminformatics in the last few years, and
- 00:31:28 that's what I've called going deep:
- 00:31:31 deep here, deep there, deep
- 00:31:33 everywhere. And I want this part of
- 00:31:36 the lecture to be very educational
- 00:31:40 and not taken in the wrong way, because
- 00:31:43 I'm going to be critical of ourselves as
- 00:31:48 scientists and as cheminformaticians.
- 00:31:50 And my first criticism is: when you
- 00:31:54 develop a new methodology, when you
- 00:31:56 capture a new methodology,
- 00:31:58 the most obvious
- 00:32:01 human reaction is excitement, and when you're
- 00:32:04 excited you don't see the light, you
- 00:32:07 stop being objective. And then
- 00:32:10 you kind of... I called it
- 00:32:14 narcissistic
- 00:32:16 modeling, because you look at the
- 00:32:19 reflection in the mirror and it excites
- 00:32:21 you because you're
- 00:32:23 doing cool things. And so,
- 00:32:28 2017, 2018: someone who shares the last
- 00:32:33 name with me,
- 00:32:36 along with other
- 00:32:38 people, said: we've
- 00:32:41 never done this before, we're doing this
- 00:32:43 for the first time, and we observe higher
- 00:32:45 accuracy than those models developed
- 00:32:47 with simple machine learning techniques.
- 00:32:49 Of course: it's new, it's novel, you learn
- 00:32:53 new things, you get excited, and you
- 00:32:55 ignore some
- 00:32:58 sober voices that happened to be
- 00:33:03 published at the same time, saying that it does not
- 00:33:05 always provide
- 00:33:08 [inaudible]. But nevertheless the excitement is
- 00:33:11 there, and the excitement will always be
- 00:33:12 there, because we constantly invent new
- 00:33:15 stuff.
- 00:33:17 So you start analyzing deep learning
- 00:33:21 papers deeper.
- 00:33:29 Thank
- 00:33:30 you. And here's one of the original
- 00:33:33 publications from one of the original
- 00:33:35 developers of deep learning methods, a
- 00:33:38 hardcore
- 00:33:41 statistician, with the verbatim statement: "we
- 00:33:43 found that deep learning methods significantly
- 00:33:45 outperform all competing methods". That's
- 00:33:48 a very powerful statement, especially the
- 00:33:50 word "significantly", because this word is
- 00:33:52 used colloquially or scientifically, and
- 00:33:55 this is not a colloquial use of the word.
- 00:33:58 And then you look
- 00:34:02 at the actual data, because that's
- 00:34:04 what we need to learn: how to look at the
- 00:34:06 actual data, not the reflection in the
- 00:34:08 mirror. And when you look at the data,
- 00:34:10 what you find is that, if you compare a
- 00:34:13 deep neural net model with plain
- 00:34:15 SVM or random forest, the largest AUC
- 00:34:19 difference is
- 00:34:21 0.04 AUC units, the largest, and yet the
- 00:34:25 error is an order of magnitude
- 00:34:28 higher. And it's just that: we have to
- 00:34:33 stop ourselves from using words that
- 00:34:35 have a very strict meaning, regardless
- 00:34:38 of how excited
- 00:34:39 we are. And here's another
- 00:34:42 story, one of my
- 00:34:45 favorites: superbug killed by super drug.
- 00:34:50 And it's in Cell, and Cell is a
- 00:34:53 very highly cited journal: "A Deep
- 00:34:55 Learning Approach to Antibiotic
- 00:34:57 Discovery",
- 00:34:58 and a lot of hype associated with it in
- 00:35:00 mass media publications, and that's
- 00:35:03 another thing; I'll come back to this
- 00:35:04 point in a few
- 00:35:06 minutes. So what's behind it? What's
- 00:35:08 behind it is a wonderful, highly
- 00:35:11 appealing publication in the Journal of
- 00:35:13 Chemical Information and Modeling
- 00:35:16 with a new way of learning how to
- 00:35:20 represent molecules and how to compare
- 00:35:22 molecules and how to build QSAR models of
- 00:35:26 the highest accuracy. Today
- 00:35:28 this publication has
- 00:35:30 collected more than 750 citations, a very
- 00:35:34 highly cited paper. There is a
- 00:35:36 little red sign, if you go to the
- 00:35:40 website, that most people probably
- 00:35:43 miss, but I am a sarcastic modeler, I
- 00:35:45 clicked. If you click, it brings you to...
- 00:35:49 And this is the original publication; as
- 00:35:51 you can see, there is a very
- 00:35:53 substantial improvement: deep learning
- 00:35:54 0.8, random forest 0.6, dramatic
- 00:35:57 improvement, 20% better, 13% better, 26%
- 00:36:01 better, 39% better. Deep learning models
- 00:36:04 work much, much
- 00:36:07 better. Only eight, actually it's 20, I
- 00:36:11 was too lazy to update the slide, so
- 00:36:13 20 citations of the red-sign
- 00:36:18 publication, and it says "Correction". And
- 00:36:21 what this correction says is that, due to an
- 00:36:23 error in processing of the random forest
- 00:36:25 models, good old simple random forest
- 00:36:28 models (you develop a new deep learning
- 00:36:31 approach, but you cannot process random
- 00:36:34 forest models somehow, right?), the random
- 00:36:37 forest numbers were incorrect,
- 00:36:39 and when you
- 00:36:43 correct them, you totally lose the significant
- 00:36:50 improvement. But who
- 00:36:52 cares? Because you got a new bug
- 00:36:56 and you got a new drug against the new
- 00:36:58 bug, so who cares what happens
- 00:37:00 with the method? Well, it turned out that this
- 00:37:04 result was also recalled, because it
- 00:37:05 turned out that another lab three years
- 00:37:07 earlier discovered that same compound
- 00:37:10 and decided not to pursue it anymore
- 00:37:11 because it was
- 00:37:16 toxic. Here's another
- 00:37:18 story. What if you have the most
- 00:37:21 unlimited computational power in the
- 00:37:22 world? What do you do? Right, you
- 00:37:24 discover
- 00:37:25 drugs. And then you tell the world that
- 00:37:28 you have the fastest supercomputer and
- 00:37:30 therefore you are the best in the world at drug
- 00:37:32 discovery, because you run hundreds and
- 00:37:34 hundreds of hours of supercomputer time.
- 00:37:38 And the supercomputer, not a human, the
- 00:37:40 supercomputer identified 77 new
- 00:37:44 antiviral drugs at Oak Ridge National
- 00:37:46 Laboratory, and it was
- 00:37:49 published, not in Nature for some reason,
- 00:37:51 a little bit suspicious, but nevertheless
- 00:37:54 enough to get publicity.
- 00:37:57 And then the same group had the
- 00:37:59 guts to publish a letter in
- 00:38:02 the New England Journal of Medicine, the
- 00:38:04 journal with one of the
- 00:38:06 highest citation values: "How to
- 00:38:09 Discover Antiviral Drugs Quickly", no
- 00:38:15 less. And that's the consequence of the
- 00:38:18 pandemic: thousands and thousands of drug
- 00:38:20 discovery papers that appeared during it,
- 00:38:22 computational papers. Everyone in the world
- 00:38:25 sits at home, you can't go to the lab,
- 00:38:27 you can't make
- 00:38:28 molecules, so you run docking studies and you
- 00:38:31 publish papers, and that's
- 00:38:33 the pandemic of drug
- 00:38:36 discovery.
- 00:38:38 The best comment I've read about
- 00:38:43 publications discovering novel
- 00:38:45 inhibitors or novel drugs, etc., was from
- 00:38:48 John Chodera, who asked the question in
- 00:38:50 his blog: is it really a discovery of new
- 00:38:52 inhibitors? There is zero experimental
- 00:38:53 data. Maybe "proposal", but even that's a
- 00:38:56 stretch. Digital dreams of new inhibitors:
- 00:39:00 that's how it should
- 00:39:01 be. And that's a warning also,
- 00:39:04 a warning about not getting excited and
- 00:39:06 using correct terminology. When we make
- 00:39:09 discoveries, we make digital discoveries;
- 00:39:12 there's now a journal that's
- 00:39:13 called Digital Discovery. So the next
- 00:39:16 journal should be called Digital Dreams,
- 00:39:18 and we all should be published there,
- 00:39:20 because without experiment that's what
- 00:39:22 we produce: we produce digital dreams. I
- 00:39:25 found a wonderful quote from Feynman
- 00:39:28 which I really want all of us to
- 00:39:31 inscribe in our heads: "We've learned from
- 00:39:34 experience that the truth will come out.
- 00:39:35 Other experimenters will repeat your
- 00:39:37 experiment and find out whether you were
- 00:39:39 wrong or right. And although you may gain
- 00:39:42 some temporary fame and excitement, you
- 00:39:44 will not gain a good reputation as a
- 00:39:46 scientist." And reputation is everything
- 00:39:48 for a scientist. So we've published, with Art,
- 00:39:53 and Art is in the room, a little feature
- 00:39:56 that was "Surely you are joking, Mr Docking!", and I
- 00:39:59 encourage you to read it, because we show
- 00:40:01 there with data that all these
- 00:40:03 predictions by the supercomputer were
- 00:40:05 experimentally shown to be inactive at
- 00:40:09 the time when they were educating the world
- 00:40:10 how to
- 00:40:13 discover. So, just to finish the sarcastic
- 00:40:17 part: one of our colleagues
- 00:40:20 presented at the AAAS meeting in
- 00:40:24 2019, kind of saying that we have to be
- 00:40:27 careful, especially when we deal with big
- 00:40:29 data and super powerful
- 00:40:31 algorithms, that the algorithms have
- 00:40:33 been specifically developed to optimize
- 00:40:35 stuff, and the more data you have and the
- 00:40:37 more computers you run, the better you
- 00:40:39 optimize. And with more powerful
- 00:40:41 computers that few people have access to,
- 00:40:44 the computational experiments become
- 00:40:45 irreproducible, because nobody else has
- 00:40:47 enough power to challenge
- 00:40:50 and revalidate the models. And so results, as
- 00:40:54 I illustrated with a few examples, can
- 00:40:56 be misleading and often completely wrong;
- 00:40:58 there is general recognition of a reproducibility crisis
- 00:41:00 in science. I disagree with the last
- 00:41:03 point, I don't think there is a crisis in
- 00:41:05 science, but this is something that we
- 00:41:06 have to be very, very careful with. And
- 00:41:09 just to imprint this: there's a
- 00:41:10 psychological rule of three repeats,
- 00:41:13 right? If you want your kids to remember
- 00:41:15 to clean the dishes or clean the table, you
- 00:41:18 have to say it three times. This
- 00:41:20 is how we need to educate each other. So
- 00:41:24 I've looked up,
- 00:41:27 with AI, what it is if you say AI three
- 00:41:31 times, and Urban Dictionary says: the
- 00:41:33 phrase used by most evil people with
- 00:41:35 pattern functions.
- 00:41:38 So please keep this in mind as you
- 00:41:40 investigate new algorithms. That brings me
- 00:41:42 to the impactful trends, and I'll try to
- 00:41:45 wrap
- 00:41:46 up. So why do we want new methods? This is
- 00:41:51 the biggest transformation; I don't think
- 00:41:53 it's methods. Many methods that we
- 00:41:55 practice today were developed
- 00:41:58 or originated many, many years ago. The
- 00:42:00 absolute biggest transformation of the
- 00:42:02 discipline of cheminformatics is because of
- 00:42:05 the data, because of the immense, huge amount
- 00:42:08 of data in every sense of the word. Here
- 00:42:12 is a summary of the current number of
- 00:42:15 molecules in
- 00:42:16 PubChem: hundreds of millions of molecules
- 00:42:19 with assigned biological activity, 41
- 00:42:22 million abstracts and 8.7... so Europe
- 00:42:25 PMC for some reason is bigger than the
- 00:42:26 American one,
- 00:42:28 but an enormous amount of literature, and
- 00:42:31 as you know, an enormous amount of
- 00:42:32 literature really enabled the
- 00:42:34 development of tools such as ChatGPT
- 00:42:37 and the like,
- 00:42:39 and constantly growing chemical libraries for
- 00:42:42 virtual screening. And this immense amount
- 00:42:45 of data challenges model development,
- 00:42:48 challenges hardware, challenges human
- 00:42:51 minds, challenges people who develop
- 00:42:53 algorithms. And as we do it, again I ask
- 00:42:56 you to
- 00:42:58 remember the problems and challenges
- 00:43:00 associated with big data. So here
- 00:43:04 is my summary of some of the trends in
- 00:43:08 the field; it's absolutely impossible to
- 00:43:10 reflect on everything, despite the
- 00:43:12 request that I had from Sasha, but this
- 00:43:15 is what I consider emerging studies.
- 00:43:20 And what's really helpful: I mentioned in
- 00:43:22 the beginning that there is no way I
- 00:43:25 could cover in one lecture what is important,
- 00:43:29 and I actually love this quote: "The public
- 00:43:30 have an insatiable curiosity to know everything, except what is worth
- 00:43:32 knowing." So this is sort of my list of things
- 00:43:35 that are worth knowing. The ability to model
- 00:43:37 increasingly large and complex data, for
- 00:43:39 instance chemical mixture data, and
- 00:43:41 screen ultra-large chemical libraries:
- 00:43:43 that's
- 00:43:44 emerging and continues to emerge as the
- 00:43:46 challenge in front of us. New algorithms
- 00:43:49 enhanced by machine learning to improve
- 00:43:50 both efficiency and accuracy of
- 00:43:53 models. Instant integration with
- 00:43:56 experiments.
- 00:43:58 I think Gisbert published a paper
- 00:44:00 defining DMTA, or was one of the first
- 00:44:02 people defining the design-make-test-analyze
- 00:44:04 cycle. The cycle used to be disconnected:
- 00:44:07 we design, we make, and then someone tests
- 00:44:10 and then someone synthesizes. Self-driving
- 00:44:12 labs, and intelligent chemical
- 00:44:15 design with algorithms that
- 00:44:17 work at the level of building blocks,
- 00:44:19 which is still smaller than the size of
- 00:44:21 modern libraries, where there are 50-60 billion
- 00:44:24 compounds, and interesting
- 00:44:27 algorithms continue to emerge that look
- 00:44:30 at the space and mine the space of
- 00:44:31 building
- 00:44:33 blocks. In silico [inaudible]: I'm not going to talk
- 00:44:36 about everything, I'll talk about a few points just
- 00:44:38 to illustrate, but it's becoming
- 00:44:41 increasingly important to work with
- 00:44:44 scientists conducting in vitro
- 00:44:46 experiments, combining in silico and
- 00:44:48 in vitro to develop models that provide a
- 00:44:51 reliable alternative to animal testing,
- 00:44:54 both for toxicity testing
- 00:44:57 and for biological activity. And
- 00:44:59 then cross-disciplinary knowledge
- 00:45:01 integration and mining to achieve
- 00:45:03 clinical impact, and I would like to
- 00:45:06 suggest that focusing on ways of
- 00:45:08 achieving clinical impact through
- 00:45:10 cheminformatics research is one of the
- 00:45:13 outstanding hot challenges in cheminformatics
- 00:45:16 these days. And then endless applications
- 00:45:18 beyond QSAR, what I defined at some point as
- 00:45:20 QSAR without borders. So that's the summary.
- 00:45:24 Let me give you a few examples, some
- 00:45:26 are ours, some not. And what I
- 00:45:30 also said here: this is what's really
- 00:45:33 helpful to define trends, just look at
- 00:45:36 the lecture titles in this workshop,
- 00:45:39 because the program is
- 00:45:41 outstanding and it captures all the
- 00:45:43 current and emerging trends that I cannot
- 00:45:45 cover and nobody could cover in one lecture,
- 00:45:47 but the entire workshop really stood out
- 00:45:50 when I looked at the program in terms of
- 00:45:52 topics that will be covered
- 00:45:55 beginning tomorrow.
- 00:45:58 One of the
- 00:46:00 challenges that our group found
- 00:46:03 exciting is how
- 00:46:05 to achieve ultra-fast virtual screening.
- 00:46:10 There have been publications; I mentioned
- 00:46:12 50 billion compounds that Enamine released
- 00:46:14 recently in their library, so it's very
- 00:46:16 large, and it's important, and
- 00:46:20 impossible to use traditional methods.
- 00:46:24 Maas is here, and his company has
- 00:46:26 developed
- 00:46:27 or published, one of the first, not
- 00:46:29 published, sorry, released one of the
- 00:46:33 first algorithms to enable rapid
- 00:46:35 searches, but the challenge was out
- 00:46:37 there and people have been developing
- 00:46:40 solutions, right? And why is it
- 00:46:42 difficult? Ultra-large libraries and
- 00:46:45 traditionally high feature
- 00:46:47 dimensionality: if you think about making
- 00:46:51 50 billion comparisons using 1024-sized
- 00:46:54 fingerprints, that takes time; that's what
- 00:46:56 makes things really
- 00:46:59 difficult. And so
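
To make the scale concrete, here is a back-of-the-envelope sketch. It assumes fingerprints are plain 1024-bit vectors compared with the Tanimoto coefficient, a common choice that the lecture does not name, and it stores each fingerprint as a Python integer purely for illustration.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two bit-vector fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")    # bits set in either fingerprint
    return common / total if total else 0.0


# Library size and fingerprint length as quoted in the lecture.
N_COMPOUNDS = 50_000_000_000
N_BITS = 1024

# Raw storage for the fingerprints alone, before any index structure.
storage_bytes = N_COMPOUNDS * N_BITS // 8
storage_terabytes = storage_bytes / 1e12

print(f"one similarity: {tanimoto(0b1011, 0b0011):.3f}")
print(f"fingerprint storage: {storage_terabytes} TB")
```

A single query against the whole library therefore touches about 6.4 TB of fingerprint data if done by brute force, which is why a lower-dimensional representation is attractive.
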
- 00:47:02 what I want to share is an
- 00:47:04 approach that we have developed over the
- 00:47:06 past year, and that was the
- 00:47:09 challenge in front of us: to enable querying of
- 00:47:12 billion-size libraries rapidly, using as
- 00:47:15 little computing power as possible, and
- 00:47:18 allow concurrent queries, and free to
- 00:47:20 use. So what's behind it is what has
- 00:47:25 constituted sort of the depth of
- 00:47:28 deep chemistry in the
- 00:47:32 last years, and that is that transition.
- 00:47:34 And five people who speak at
- 00:47:37 this workshop coauthored a paper last
- 00:47:39 year: Gisbert and Sasha and Oles
- 00:47:44 and Art, so they're all in the
- 00:47:46 room, and we coauthored a paper on what we
- 00:47:48 called the emergence of deep QSAR. And
- 00:47:51 what we've highlighted there is the
- 00:47:53 principal difference between old QSAR and
- 00:47:54 new QSAR, and that is that we operate
- 00:47:58 not in traditional computed chemical
- 00:48:00 descriptors but in the embedding space. And
- 00:48:03 so the embedding enables different ways
- 00:48:06 of transforming chemical reality and
- 00:48:07 dealing with chemical reality, and one of
- 00:48:10 them, as has been shown in
- 00:48:13 Professor Varnek's group, and Elana is
- 00:48:16 going to give a lecture about
- 00:48:20 chemography, is that this type of
- 00:48:24 representation allows us to ask
- 00:48:26 traditional questions in a
- 00:48:27 different way and calculate chemical
- 00:48:30 similarity and address the issues of
- 00:48:32 chemical landscape in a way traditional
- 00:48:35 descriptors disallow. And so one
- 00:48:39 aspect that we have investigated and
- 00:48:41 noticed is that as we run different
- 00:48:44 types of encoders and autoencoders and
- 00:48:46 transformations in the latent
- 00:48:49 space, chemical similarity is not by
- 00:48:52 default preserved, and when you analyze
- 00:48:54 the distances between compounds in the
- 00:48:57 space, as illustrated here, using naive
- 00:48:59 types of autoencoders, compounds tend to
- 00:49:02 disaggregate even if they are similar in the
- 00:49:05 original chemistry space. And so we've
- 00:49:07 developed an algorithm that we call a
- 00:49:09 structurally aware latent space
- 00:49:11 autoencoder, that addressed the specific
- 00:49:14 issue of how to embed molecules in the
- 00:49:16 latent space while intentionally
- 00:49:19 preserving similarity between them. And
- 00:49:21 one of the fundamental questions that
- 00:49:23 we've asked, that I encourage every group
- 00:49:26 to discuss,
- 00:49:27 and the discussion becomes hotter and
- 00:49:30 hotter and more and more informal the
- 00:49:32 more beer and other beverages like that
- 00:49:34 you consume, and that's the discussion
- 00:49:36 about what is chemical similarity. So
- 00:49:39 what we have decided to use is what I
- 00:49:42 think may be discussed as a fundamental
- 00:49:46 chemical similarity metric, and that's
- 00:49:48 graph edit distance, because that is a
- 00:49:51 mathematical
- 00:49:52 concept. It's hard to calculate,
- 00:49:54 especially if you want to compare
- 00:49:56 two very different molecules, but you could
- 00:49:59 define similarity objectively, not
- 00:50:01 using any particular descriptors,
- 00:50:03 directly from the chemical structure and
- 00:50:04 chemical structure representation as a
- 00:50:06 graph. And in order to train the model
- 00:50:08 we've developed large data sets of
- 00:50:11 analogs: for each molecule in this data
- 00:50:13 set, the anchor molecule is taken from
- 00:50:15 ChEMBL, and then immediate analogs with
- 00:50:18 a graph edit
- 00:50:20 distance of one are developed;
- 00:50:23 five to 10 analogs are developed, and
- 00:50:26 they are guaranteed to be nearest neighbors of each
- 00:50:28 other in the objectively defined
- 00:50:30 chemical graph space. And then
- 00:50:34 we run the encoder, and what this
- 00:50:36 slide illustrates is that as we embed
- 00:50:38 molecules we preserve the original graph
- 00:50:40 edit distances calculated in the
- 00:50:43 chemical graph space. And in a broader
- 00:50:46 sense, here is a comparison of the
- 00:50:48 distribution of distances: on the left, the
- 00:50:52 naive autoencoder; on the right, the SALSA
- 00:50:56 space, which preserves chemical
- 00:50:57 similarity intentionally, because that's
- 00:51:00 built into the
- 00:51:01 algorithm.
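
One generic way to check whether an embedding preserves graph edit distance is to rank-correlate the two sets of pairwise distances. The sketch below is not the evaluation used by the authors: it is a plain Spearman correlation with average ranks for ties, and all distance values are invented.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented distances for five molecule pairs.
graph_edit_distance = [1, 1, 2, 3, 5]
latent_similarity_aware = [0.10, 0.12, 0.25, 0.40, 0.90]  # tracks edit distance
latent_naive = [0.80, 0.10, 0.50, 0.20, 0.30]             # ordering scrambled

print(round(spearman(graph_edit_distance, latent_similarity_aware), 3))
print(round(spearman(graph_edit_distance, latent_naive), 3))
```

A value near 1 means the latent space keeps close analogs close; a value near 0 corresponds to the disaggregation described for naive autoencoders.
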
- 00:51:03 And what's the use of it? The use is that
- 00:51:06 we control the size of the chemical
- 00:51:09 space, and we could decide which size
- 00:51:11 of the latent space we want to use to
- 00:51:13 encode molecules while preserving
- 00:51:15 chemical
- 00:51:16 similarity. And what we have shown is
- 00:51:20 that SmallSA, which is a version of
- 00:51:22 SALSA with a small-size latent space,
- 00:51:26 preserves chemical
- 00:51:28 similarity much better than when
- 00:51:31 you use fingerprints, and at the same
- 00:51:33 time enables the fastest computational
- 00:51:35 speed, as we compare distances in
- 00:51:39 the embedded space with chemical graph
- 00:51:41 distances for compounds, even
- 00:51:44 arbitrarily chosen. My second example of
- 00:51:48 addressing a fundamental concept of
- 00:51:50 cheminformatics in a sort of novel way is
- 00:51:54 revisiting the concept of data
- 00:51:58 balance or imbalance. And as I mentioned,
- 00:52:00 initially we advocated for the use of
- 00:52:03 balanced
- 00:52:05 data. The objective here was to
- 00:52:09 train a model with available data to
- 00:52:11 distinguish, as the standard objective, active
- 00:52:12 from inactive. But there is a caveat that
- 00:52:16 everybody observes, typically when you
- 00:52:18 look at large data sets from
- 00:52:20 sources such as even ChEMBL, let alone
- 00:52:23 PubChem: most training sets are
- 00:52:24 imbalanced. And when we screen compounds from
- 00:52:28 huge virtual libraries, we know a
- 00:52:30 priori that a very small fraction of
- 00:52:32 those compounds, even if all of them are
- 00:52:34 tested, will turn out to have the
- 00:52:37 desired activity. So
- 00:52:40 the training sets are imbalanced in
- 00:52:42 practice, and the external virtual
- 00:52:45 screening sets are hugely imbalanced. And
- 00:52:48 what we've realized, and that's that
- 00:52:50 paper of ours, is that we need to change our
- 00:52:54 historical trends as to how we've been
- 00:52:55 building models
- 00:52:57and change the metcs that we use in
- 00:52:59order to enable the recovery of the
- 00:53:01largest number of positive he because
- 00:53:02when we do virtual screening and then
- 00:53:04experimental testing we are hunting for
- 00:53:07positives we're not hunting for models
- 00:53:09with the highest balanced accuracy and
- 00:53:12so uh and then and then the Practical
- 00:53:15aspect of this work was that typically
- 00:53:16people screen compounds in batches in in
- 00:53:19in uh well plates and so um what we used
- 00:53:23here is uh um positive itive value As
- 00:53:27the metric of accuracy estimated and
- 00:53:30selecting 128 compounds from the
- 00:53:33external set and sets were created from
- 00:53:35Pam uh split into train and to nominate
- 00:53:38from test set and they were
- 00:53:40intentionally imbalanced both the
- 00:53:41training set and external set and what
- 00:53:44we found and this is a
- 00:53:47comparison between balanced and
- 00:53:48imbalanced in five different data sets
- 00:53:51is that when we train models to have the
- 00:53:53highest positive predictive value
- 00:53:57we achieve much higher
- 00:54:00positive predictive value in the
- 00:54:01external set than with models trained
- 00:54:04traditionally with balanced accuracy and
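The metric described above, positive predictive value over a plate-sized batch of top-ranked compounds, can be sketched as follows. This is a minimal illustration and not the code from the paper: the function names `ppv_at_top_n` and `balanced_accuracy`, the synthetic scores, the score threshold, and the 1% active rate are all assumptions made for the example. Only the batch size of 128 comes from the talk.

```python
import random


def ppv_at_top_n(scores, labels, n=128):
    """Fraction of true actives among the n highest-scoring compounds."""
    if n <= 0 or not scores:
        raise ValueError("need n > 0 and at least one scored compound")
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    selected = ranked[:n]
    return sum(label for _, label in selected) / len(selected)


def balanced_accuracy(predicted, labels):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    true_neg = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


if __name__ == "__main__":
    # Synthetic, hugely imbalanced "virtual library" (assumed: 1% actives).
    rng = random.Random(0)
    labels = [1 if rng.random() < 0.01 else 0 for _ in range(10_000)]
    # Assumed toy model: actives tend to score higher than inactives.
    scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
    predicted = [1 if s > 0.5 else 0 for s in scores]

    print("balanced accuracy:", balanced_accuracy(predicted, labels))
    print("PPV in top 128:   ", ppv_at_top_n(scores, labels, n=128))
```

The two numbers answer different questions: balanced accuracy scores the classifier over the whole library, while PPV in the top 128 counts how many of the compounds actually sent for testing would be hits.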
- 00:54:06that's the last column here more
- 00:54:10vividly illustrated in the next slide
- 00:54:13when we compare the performance of
- 00:54:15traditional models and models
- 00:54:17emphasizing nontraditional non-standard
- 00:54:19metric of evaluation such as
- 00:54:24PPV a few comments about
- 00:54:27trends that I find emerging and
- 00:54:30important and they will be covered in
- 00:54:31this workshop such as AI/ML-enhanced
- 00:54:34generative chemistry so Gisbert is going
- 00:54:37to talk about it tomorrow design make
- 00:54:40test analyze cycle integration with an
- 00:54:42SDL so what I
- 00:54:46illustrate here is a publication from
- 00:54:48Alán Aspuru-Guzik's work several years ago
- 00:54:52but there is an absolutely strongly
- 00:54:55emerging trend in integrating
- 00:54:56computations inside the DMTA cycle
- 00:54:59implemented within self-driving labs and
- 00:55:01so I think it's an important area of
- 00:55:03research and of course various
- 00:55:05machine learning accelerated
- 00:55:07calculations that are made fast and
- 00:55:09accurate and so Olexandr is going to talk
- 00:55:11about this paradigm of ML-enhanced
- 00:55:15calculations as applied to quantum
- 00:55:16mechanical tasks I think on
- 00:55:19Friday and Deep Docking that Art who made it
- 00:55:23despite the bad train
- 00:55:26performance as promised developed
- 00:55:29Deep Docking so you're going to hear about
- 00:55:31that from Art I'm going to skip our
- 00:55:33own work and sort of extension in the
- 00:55:36interest of time and talk about the last
- 00:55:39topic the importance of integration of
- 00:55:42cheminformatics with clinical work and even
- 00:55:46though traditionally we're focusing on lead
- 00:55:49discovery and target discovery as
- 00:55:50standard cheminformatics tasks what
- 00:55:54is emerging is holistic understanding of
- 00:55:56the entire pipeline from left to right
- 00:55:58and development and application of
- 00:56:00cheminformatics or cheminformatics-like
- 00:56:02techniques to address the entirety of
- 00:56:05the pipeline all the way to the last
- 00:56:08point which is adherence precision
- 00:56:10medicine and drug application one
- 00:56:12particular area that is reemerging to be
- 00:56:16exciting is called drug repurposing
- 00:56:18which is application of existing
- 00:56:20drugs for new applications and to me
- 00:56:22this is one of the most exciting
- 00:56:25potential types of research for
- 00:56:27cheminformaticians because it could make an
- 00:56:29immediate impact in the clinic if you
- 00:56:31suggest an existing drug to be
- 00:56:33repurposed for a new disease
- 00:56:36physicians have the right to use those
- 00:56:39drugs in the clinic if they are
- 00:56:41convinced that they could do it and
- 00:56:42there is nothing against it and here is
- 00:56:45a story that I found fascinating
- 00:56:48inspirational and motivational about a
- 00:56:51new colleague of mine named David
- 00:56:53Fajgenbaum who is shown here more
- 00:56:55than 10 years ago on the left an athlete from
- 00:56:58UPenn who got stricken by a rare disease
- 00:57:02called Castleman and had his last
- 00:57:06words essentially and was
- 00:57:09nearly dying so two cases of
- 00:57:12clinical death this is how he looks
- 00:57:15today and what he did was that he
- 00:57:18reasoned
- 00:57:19over data on himself on his own and
- 00:57:24once he was diagnosed with what's called
- 00:57:26idiopathic multicentric Castleman
- 00:57:28disease what he learned about himself is
- 00:57:30that he had elevated cytokines
- 00:57:33increased expression of one of the
- 00:57:36proteins and T-cell activation so he asked
- 00:57:38the question is there any drug that
- 00:57:40could regulate those clinical phenotypes
- 00:57:43and found sirolimus an mTOR inhibitor known to
- 00:57:47be used in patients that received
- 00:57:50organ
- 00:57:51transplant and performed poorly and the
- 00:57:54organ is rejected and there is toxicity
- 00:57:56associated with it so he linked
- 00:57:58symptoms that he learned about in his
- 00:58:00medical research or medical training
- 00:58:03of patients with organ transplant
- 00:58:05rejection and his own symptoms and
- 00:58:08decided and his doctor basically said
- 00:58:10you're dying anyway why not so he's been in
- 00:58:13remission for 10 years and published
- 00:58:16a New York Times bestseller about how
- 00:58:19to cure your own chasing your own cure
- 00:58:20is the title of the book so
- 00:58:24so this is a slide I
- 00:58:26think we all know about drug repurposing
- 00:58:28and how potentially expeditious
- 00:58:32drug discovery could be if we do
- 00:58:34this and we also have been exposed to a
- 00:58:36recent case of drug repurposing by
- 00:58:39BenevolentAI who nominated baricitinib as a
- 00:58:44potential drug of choice for COVID as early
- 00:58:47as February of 2020 just a couple of
- 00:58:50months after the world learned about
- 00:58:53this disease and the way they reasoned
- 00:58:55was that we have a viral
- 00:58:58entry and it elicits inflammation so it's a
- 00:59:01very simple reasoning and so what is behind
- 00:59:04inflammation cytokine signaling which
- 00:59:07is mediated by a few kinases what
- 00:59:11works on those kinases a few drugs but
- 00:59:13baricitinib
- 00:59:15specifically is a polypharmacology drug
- 00:59:18that inhibits all three kinases involved
- 00:59:21either in endocytosis or in cytokine storm and
- 00:59:24so that was sort of the reasoning that
- 00:59:26they used to nominate this drug looking
- 00:59:28at what historically appears as a
- 00:59:30remember chemical systems biology is one
- 00:59:33of the areas of cheminformatics and the
- 00:59:35original
- 00:59:37publication and what they really used
- 00:59:40was what's called the knowledge graph and
- 00:59:41knowledge graph studies I find
- 00:59:44probably one of the most exciting
- 00:59:46projects in our lab and in labs that are
- 00:59:48studying knowledge graphs which have
- 00:59:50been introduced by Google and allow
- 00:59:53people to
- 00:59:54answer really really bizarre questions
- 00:59:57such as what is common between da
- 01:00:00Vinci and the Eiffel
- 01:00:03Tower and you must be deeply drunk or
- 01:00:06nuts to ask a question like this unless
- 01:00:09you have access to the knowledge
- 01:00:11graph behind Wikipedia which says that da
- 01:00:14Vinci painted Mona Lisa which is in the Louvre
- 01:00:16which is in
- 01:00:19Paris now that's a path through the
- 01:00:22knowledge graph and it's a crazy path
- 01:00:24but if you contemplate the process of
- 01:00:28consolidating the World Knowledge
- 01:00:30distributed across multiple data
- 01:00:32clinical nonclinical chemical biological
- 01:00:36biomedical historical medical clinical
- 01:00:39etc and integrate this
- 01:00:42knowledge
- 01:00:44compiled by domain scientists and
- 01:00:48stored in databases integrated with
- 01:00:51inferences made from this data to affect
- 01:00:53what domain scientists do in their labs
- 01:00:56that can be done through knowledge graph
- 01:00:58construction knowledge graph curation
- 01:01:01familiar term knowledge graph
- 01:01:04compression knowledge graph completion
- 01:01:06from computer science perspective
- 01:01:08knowledge graph completion is the same
- 01:01:09as making predictions about new edges
- 01:01:12that connect two semantic objects such as
- 01:01:14a drug and a disease and that effectively
- 01:01:17means nomination of a new drug for a new
- 01:01:20disease predicting this particular
- 01:01:23new connection and so
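The two ideas above, a path through a knowledge graph and completion as prediction of a missing edge, can be made concrete on a toy graph. This is a minimal sketch under stated assumptions: the relation names (`painted`, `located_in`, `inhibits`, `involved_in`), the placeholder entities `drug_A`, `drug_B`, `protein_X`, `protein_Z`, `disease_Y`, and the idea of scoring a candidate drug-disease edge by counting two-step paths are all hypothetical choices for illustration. It does not represent the method of any project mentioned in the talk.

```python
from collections import defaultdict, deque

# Toy triples (head, relation, tail). The biomedical entries are
# hypothetical placeholders, not real drug or disease data.
TRIPLES = [
    ("da Vinci", "painted", "Mona Lisa"),
    ("Mona Lisa", "located_in", "Louvre"),
    ("Louvre", "located_in", "Paris"),
    ("Eiffel Tower", "located_in", "Paris"),
    ("drug_A", "inhibits", "protein_X"),
    ("drug_B", "inhibits", "protein_X"),
    ("drug_B", "inhibits", "protein_Z"),
    ("protein_X", "involved_in", "disease_Y"),
    ("protein_Z", "involved_in", "disease_Y"),
]


def build_graph(triples):
    """Adjacency map: node -> set of (relation, neighbor), both directions."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        graph[head].add((relation, tail))
        graph[tail].add((relation + "^-1", head))  # inverse edge
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _, neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def count_two_hop_paths(graph, drug, disease,
                        first="inhibits", second="involved_in"):
    """Count drug -first-> X -second-> disease paths (assumed toy score)."""
    count = 0
    for relation_1, middle in graph[drug]:
        if relation_1 != first:
            continue
        for relation_2, end in graph[middle]:
            if relation_2 == second and end == disease:
                count += 1
    return count


if __name__ == "__main__":
    kg = build_graph(TRIPLES)
    print(" -> ".join(shortest_path(kg, "da Vinci", "Eiffel Tower")))
    for drug in ("drug_A", "drug_B"):
        print(drug, "disease_Y", count_two_hop_paths(kg, drug, "disease_Y"))
```

In this framing a candidate drug-disease edge with more supporting paths ranks higher. Real systems would weight path types learned from known drug-disease pairs rather than simply counting them.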
- 01:01:26there are multiple algorithms being
- 01:01:29developed again I find this very
- 01:01:30exciting for cheminformaticians to work on
- 01:01:32this the simplest approach that can be
- 01:01:34used is to discover rules that connect
- 01:01:37known or that are behind biological
- 01:01:40pathways that are behind known drug
- 01:01:42disease connections and then these rules
- 01:01:44can be used to forecast whether a drug
- 01:01:48and disease that have a connected path
- 01:01:51found frequently in association with drugs
- 01:01:53and diseases could be repurposed for
- 01:01:56a new disease and I'm part of a huge
- 01:01:59project recently awarded to David Fajgenbaum by
- 01:02:02ARPA-H to examine and this is a familiar
- 01:02:06territory all against all similarity
- 01:02:08searches all against all docking
- 01:02:10calculations all against all biological
- 01:02:14biomedical graph mining to calculate the
- 01:02:17strength of association between every
- 01:02:19drug and every disease and through this
- 01:02:21integration searches and identification
- 01:02:24of scores the hope is that
- 01:02:28relatively low-hanging fruits will emerge
- 01:02:31similar to the story of David Fajgenbaum
- 01:02:33curing himself that can be tried in
- 01:02:37the clinic and so that's a cheminformatics-type
- 01:02:40analysis of a knowledge graph remember
- 01:02:42chemicals are graphs and here again
- 01:02:46we're back to the concept of graphs and
- 01:02:48ways of mining graphs that are
- 01:02:50familiar to us as cheminformaticians in
- 01:02:51order to recover drugs and nominate
- 01:02:54them for clinical trials I promised that
- 01:02:57E equals mc² is a universal formula and I
- 01:03:01found this picture the real meaning of
- 01:03:04E equals mc² so that's not mine but what
- 01:03:08is mine is the new formula of scientific
- 01:03:11discovery that I share with you and
- 01:03:13encourage you to use and that's what my
- 01:03:15talk was about how to combine modeling
- 01:03:18and curated content and so that
- 01:03:21together really creates opportunity
- 01:03:23for new scientific discovery overall my
- 01:03:26last two slides as cheminformaticians we
- 01:03:30depend on data tools and challenges and
- 01:03:32when these three semantic objects
- 01:03:35are combined new hypotheses emerge
- 01:03:39now
- 01:03:40hypotheses have to be
- 01:03:42validated and filtered out so filtering
- 01:03:46building filtering models is very
- 01:03:47important and after filtering value is
- 01:03:51created Sasha talked about investors
- 01:03:54so that's the point of interaction
- 01:03:56with them but value is not necessarily
- 01:03:58dollars it's new knowledge that goes
- 01:04:00back and fuels data tools and challenges
- 01:04:03integration it takes a village it takes
- 01:04:06learning it takes multiple scientists
- 01:04:08interacting with each other someone
- 01:04:10once asked me if I had drawn this
- 01:04:13particular picture of scientists working
- 01:04:15with each other I said no it was Picasso
- 01:04:17but I was puzzled by the question in
- 01:04:21my personal opinion and I'm allowed to
- 01:04:23share my personal opinion
- 01:04:24there is another missing
- 01:04:26element of this entire
- 01:04:29workflow thank you for giving me a
- 01:04:32cup of coffee before this lecture Sasha it
- 01:04:33certainly was very
- 01:04:35helpful and then I'll end with the last
- 01:04:38two slides I've talked a lot about hype
- 01:04:40this you may have seen this
- 01:04:42Gartner is a consulting company that
- 01:04:44publishes stuff like this hype cycle so
- 01:04:47hopefully with computer-assisted drug
- 01:04:49discovery we're beyond this
- 01:04:53hype and we're into what's called
- 01:04:57plateau of productivity with all the
- 01:04:59methods so I'm ending on a really positive
- 01:05:01note despite multiple criticisms that
- 01:05:04I've provided so this is the
- 01:05:07summary again I'm not going to read it
- 01:05:08I've talked about Big Data I've talked
- 01:05:10about exciting developments really the
- 01:05:12interface between computation organic
- 01:05:14chemistry tools and data sharing there
- 01:05:16are many many topics I have not talked
- 01:05:18about and with that my last educational
- 01:05:20comment from Derek Lowe it's not that
- 01:05:24machines are going to replace chemists it's that
- 01:05:26chemists who use machines I really love
- 01:05:28this quote will replace those who don't
- 01:05:31so that's what everybody is learning
- 01:05:32here today thank you very
- 01:05:50much Alex I think you made an extremely
- 01:05:55important point
- 01:05:56that the experiment will always
- 01:05:58validate how do you deal with the
- 01:06:00knowledge that 80% of what is published
- 01:06:04in biology
- 01:06:06cannot be reproduced you
- 01:06:09can't but there is an ongoing
- 01:06:14effort I didn't have time
- 01:06:17to discuss there are two ongoing
- 01:06:19efforts I
- 01:06:23highly support the concept of self-
- 01:06:25driving labs because it's data
- 01:06:29generation cleanly and reproducibly right
- 01:06:34so essentially what I'm saying is we
- 01:06:36cannot fix the past but we could change
- 01:06:39the present and the
- 01:06:41future
- 01:06:44and the second ongoing effort that the
- 01:06:46Structural Genomics Consortium is pushing
- 01:06:48is a new effort in generating high
- 01:06:51quality data using various types of
- 01:06:54experimental tools
- 01:06:56sharing this data there's a database
- 01:06:58called AIRCHECK that they're creating
- 01:07:00and basically having a community
- 01:07:02effort to build it but I don't think
- 01:07:05there's anything we could do with past
- 01:07:08data okay I suggest to move the discussion to
- 01:07:13the welcome party otherwise we need to
- 01:07:16choose I have some
- 01:07:20or thank you
- Cheminformatics
- Data quality
- Modeling
- Error management
- Big data
- Machine learning
- Clinical integration
- Scientific hype
- Historical reflection
- Data curation