Bioinformatics Project from Scratch - Drug Discovery #6 (Deploy Model as Web App) | Streamlit #22

00:24:28
https://www.youtube.com/watch?v=m0sePkuyTKs

Summary

TLDR: In this tutorial from the 'Bioinformatics from Scratch' series, viewers learn to turn a machine learning model built for bioactivity prediction into a deployable web application using Python and Streamlit. The video recaps earlier steps in the series (compiling bioactivity datasets from the ChEMBL database, performing exploratory data analysis, and building and comparing machine learning models such as random forests) and then deploys the final model as a web app. The app lets users run predictions against a target protein, here acetylcholinesterase, which is relevant to Alzheimer's research. The accompanying GitHub repository contains the required folders and code, so users can activate their conda environment and run the app locally. The video also covers the PaDEL-Descriptor software used to compute molecular fingerprints and the model preparation done in a Jupyter notebook.

Key takeaways

  • 🔍 Learn to integrate machine learning models into web applications.
  • 💻 Use Python and Streamlit for cloud deployment.
  • 📁 Access code and resources via GitHub.
  • 🔬 Focus on bioactivity related to acetylcholinesterase.
  • 🧬 Convert SMILES notation to molecular descriptors.
  • 📊 Demonstrate and evaluate using random forest models.
  • 🛠 Customize and deploy apps using minimal coding.
  • 📈 Visualize molecular fingerprint data effectively.
  • 🔗 Connect model to real-time predictions for users.
  • 🧪 Specialize in predictive modeling for biochemistry.

Timeline

  • 00:00:00 - 00:05:00

    In previous videos of the "Bioinformatics from Scratch" series, topics covered included compiling bioactivity datasets from the ChEMBL database, performing exploratory data analysis on Lipinski descriptors, building a random forest model, and comparing model performances using the Lazy Predict library. The current video focuses on converting these machine learning models into a web application using the Streamlit library, allowing deployment in the cloud for user interaction and prediction tasks. The example application predicts bioactivity for acetylcholinesterase, a target protein for Alzheimer's disease.

  • 00:05:00 - 00:10:00

    The video demonstrates the setup of the application environment, starting with activating a Conda environment and locating the app files, including 'app.py'. It walks through running the application with Streamlit and shows an example where users upload an input file containing SMILES and ChEMBL ID. These identifiers aid in generating molecular fingerprints with the PaDEL-Descriptor tool, which are crucial for bioactivity predictions using the trained models. An acetylcholinesterase model is highlighted for its relevance to Alzheimer's research.

  • 00:10:00 - 00:15:00

    The process of building and using a random forest model is explained, beginning with feature selection and descriptor preparation from SMILES strings. The procedure involves reducing descriptors from 881 to 218 to enhance model efficiency. The video provides insight into model training using Jupyter Notebook, saving the model as a pickle file, and preparing it for integration into the web app. Performance evaluation metrics, such as R-squared and mean squared error, are calculated to assess model prediction effectiveness.

  • 00:15:00 - 00:24:28

    A detailed breakdown of the 'app.py' script is provided, where Python libraries such as Streamlit, Pandas, and Pickle are used to construct the web application. The script calls PaDEL-Descriptor from the command line to calculate descriptors, applies the saved prediction model, and displays the results with a download link. The script is compact (under 90 lines) and built around three custom functions: molecular descriptor calculation, file download, and loading the saved model to make predictions. The video emphasizes practical deployment, showcasing interactivity to upload datasets and generate downloadable bioactivity predictions.

Video Q&A

  • What is the purpose of this video?

    The video teaches how to convert a machine learning model into a web application for bioactivity prediction.

  • Which libraries are used in this project?

    The project uses Python, Streamlit, Pandas, and the PaDEL-Descriptor software.

  • What type of model is being deployed in the web app?

    The web app deploys a random forest machine learning model.

  • What is the target protein in this demonstration?

    The target protein is acetylcholinesterase.

  • How are molecular fingerprints used in this project?

    Molecular fingerprints are computed using PaDEL-Descriptor and used as input for model predictions.

  • What database is used to compile bioactivity data sets?

    The ChEMBL database is used for compiling datasets.

  • Where can the app folder be found?

    The app folder is available in a GitHub repository linked in the video description.

  • What file format is used for input data?

    Input data is uploaded in .txt or .csv format.

  • What happens when you hit the "predict" button in the app?

    It computes molecular descriptors and makes predictions on bioactivity.

  • How many descriptors are used from the molecular fingerprint?

    A subset of 218 descriptors is used out of the 881 initially computed.

Transcript
  • 00:00:00
    in prior videos of the bioinformatics
  • 00:00:02
    from scratch
  • 00:00:03
    series you have learned how to compile
  • 00:00:06
    your very own
  • 00:00:07
    bioactivity data sets directly from the
  • 00:00:10
    ChEMBL database
  • 00:00:11
    how to perform exploratory data analysis
  • 00:00:14
    on the computed Lipinski descriptors you
  • 00:00:17
    have also learned how to build a random
  • 00:00:19
    forest model
  • 00:00:20
    as well as building several machine
  • 00:00:22
    learning models for comparing the model
  • 00:00:24
    performance
  • 00:00:25
    using the Lazy Predict library and so in
  • 00:00:28
    this video
  • 00:00:28
    we will be taking a look at how we can
  • 00:00:31
    take that machine learning model of the
  • 00:00:33
    bioactivity
  • 00:00:34
    data set and convert it into a web
  • 00:00:36
    application
  • 00:00:37
    that you could deploy on the cloud that
  • 00:00:39
    will allow users to be able to
  • 00:00:41
    make predictions on your machine
  • 00:00:44
    learning model
  • 00:00:45
    for the target protein of your interest
  • 00:00:47
    and so without further ado
  • 00:00:49
    we're starting right now
  • 00:00:53
    okay so the first thing that you want to
  • 00:00:55
    do is
  • 00:00:57
    go to the bioactivity prediction app
  • 00:01:00
    folder
  • 00:01:01
    and so this folder will be provided in
  • 00:01:03
    the github link
  • 00:01:04
    in the video description and so before
  • 00:01:07
    we start
  • 00:01:08
    let me show you what the app looks like
  • 00:01:13
    so i'm going to activate my conda
  • 00:01:15
    environment
  • 00:01:18
    and for you please make sure to activate
  • 00:01:20
    your own conda environment as well
  • 00:01:22
    so on my computer i'm using the data
  • 00:01:24
    professor
  • 00:01:25
    environment so i'm going to activate it
  • 00:01:28
    by
  • 00:01:28
    typing in conda activate data professor
  • 00:01:32
    and i'm going to go to the desktop
  • 00:01:34
    because that is where
  • 00:01:36
    the Streamlit folder resides
  • 00:01:43
    and then we're going to go to the
  • 00:01:45
    bioactivity folder
  • 00:01:51
    let's have a look at the contents so the
  • 00:01:53
    app.py
  • 00:01:54
    will be the application and so we're
  • 00:01:57
    going to type in
  • 00:01:58
    streamlit run app.py in order to launch
  • 00:02:02
    this bioactivity prediction app
  • 00:02:09
    okay and so this is the bioactivity
  • 00:02:11
    prediction app
  • 00:02:12
    that i'm going to be showing you today
  • 00:02:14
    how you could build one
  • 00:02:15
    and so let's have a look at the example
  • 00:02:17
    input file
  • 00:02:20
    so this is the example input file
  • 00:02:23
    so in order to proceed with using this
  • 00:02:26
    app
  • 00:02:27
    we're going to have to upload the file
  • 00:02:30
    drag and drop right here
  • 00:02:31
    or browse files and select the input
  • 00:02:33
    file and so while waiting for
  • 00:02:35
    an input file to be uploaded you can see
  • 00:02:38
    here that the
  • 00:02:39
    blue box will be giving us a waiting
  • 00:02:42
    message
  • 00:02:43
    so it's saying upload input data in the
  • 00:02:45
    sidebar to start
  • 00:02:47
    so essentially the input file contains
  • 00:02:50
    the SMILES notation and the ChEMBL
  • 00:02:53
    ID and so the ChEMBL ID you can think of
  • 00:02:56
    it as kind of like the name
  • 00:02:57
    of the molecule here and particularly
  • 00:03:00
    the ChEMBL ID is a unique
  • 00:03:02
    identification number of this particular
  • 00:03:04
    molecule
  • 00:03:05
    that the ChEMBL database has assigned to it
  • 00:03:08
    and the
  • 00:03:09
    SMILES notation here is a one-dimensional
  • 00:03:12
    representation of this particular
  • 00:03:15
    chemical structure and so this
  • 00:03:17
    SMILES notation will be used by the
  • 00:03:20
    PaDEL-Descriptor software
  • 00:03:22
    that we're going to be using here today
  • 00:03:24
    in the app in order to generate
  • 00:03:26
    molecular fingerprints which describe the
  • 00:03:29
    unique chemical features of the molecule
  • 00:03:31
    and then such molecular fingerprints
  • 00:03:34
    will then be used by the machine
  • 00:03:36
    learning model
  • 00:03:36
    to make a prediction okay
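A minimal sketch of reading an input file like the one described here, assuming one molecule per line with the SMILES string first and the ChEMBL ID second, separated by whitespace. The transcript does not state the delimiter, and the function name `parse_input` and the placeholder IDs below are made up for illustration.

```python
def parse_input(text):
    """Parse lines of '<SMILES> <molecule ID>' into (smiles, molecule_id) tuples."""
    molecules = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        smiles, molecule_id = parts[0], parts[1]
        molecules.append((smiles, molecule_id))
    return molecules


# Hypothetical sample: the SMILES strings are ethanol and benzene,
# the IDs are placeholders rather than real ChEMBL IDs.
sample = "CCO MOL_0001\nc1ccccc1 MOL_0002\n"
molecules = parse_input(sample)
```

Each tuple keeps the structure and its name together, which is what later allows the predicted value to be reported next to the molecule's ID.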
  • 00:03:40
    and so the prediction will be the pIC50
  • 00:03:42
    values
  • 00:03:43
    that you see here and the pIC50 value
  • 00:03:45
    is the bioactivity
  • 00:03:47
    against the target protein of interest
  • 00:03:49
    and so in this application
  • 00:03:51
    the target protein is
  • 00:03:53
    acetylcholinesterase
  • 00:03:55
    and this target protein is a target for
  • 00:03:57
    the
  • 00:03:58
    Alzheimer's disease okay and so this app
  • 00:04:01
    is built in Python
  • 00:04:02
    using the Streamlit library and the
  • 00:04:05
    molecular fingerprints
  • 00:04:06
    are calculated using the PaDEL
  • 00:04:12
    descriptor
  • 00:04:23
    so back in 2016 we published a
  • 00:04:26
    paper
  • 00:04:27
    describing the development of a QSAR
  • 00:04:30
    model
  • 00:04:31
    for predicting the bioactivity of the
  • 00:04:35
    acetylcholinesterase and so if you're
  • 00:04:37
    interested in this article
  • 00:04:38
    please feel free to read it and so i'm
  • 00:04:40
    going to provide you the link
  • 00:04:41
    in the video description as well
  • 00:04:46
    okay so let's drag and drop the
  • 00:04:49
    input file so example acetylcholinesterase
  • 00:04:54
    i'm going to drag and drop here
  • 00:04:58
    and then in order to initiate the
  • 00:05:00
    prediction i'm going to
  • 00:05:02
    press on the predict button
  • 00:05:05
    and as you see here the input file is
  • 00:05:08
    giving you
  • 00:05:09
    this data frame and then it's
  • 00:05:11
    calculating the descriptor
  • 00:05:13
    and the calculated descriptor is
  • 00:05:15
    provided here
  • 00:05:17
    in this particular data frame so you're
  • 00:05:19
    going to see here that there are a total
  • 00:05:21
    of five
  • 00:05:22
    input molecules and there are 882
  • 00:05:26
    columns and you're going to see here
  • 00:05:28
    that the first column
  • 00:05:29
    is the ChEMBL ID so in reality
  • 00:05:33
    you're going to have a total of 881
  • 00:05:36
    molecular fingerprints and the molecular
  • 00:05:39
    fingerprints that we're using today
  • 00:05:40
    is the PubChem fingerprint and because
  • 00:05:43
    we have previously built
  • 00:05:44
    a machine learning model which i will be
  • 00:05:46
    showing you using this
  • 00:05:48
    file the jupyter notebook file
  • 00:05:51
    we had reduced the number of descriptors
  • 00:05:54
    from
  • 00:05:54
    881 to 217 no actually 218 because we
  • 00:05:58
    have already deleted the first column
  • 00:06:00
    the name of the ChEMBL ID column and
  • 00:06:03
    so we have reduced from 881
  • 00:06:05
    columns to 218 columns okay and so
  • 00:06:09
    in the code we're going to be selecting
  • 00:06:11
    the same
  • 00:06:12
    218 columns that you see here which
  • 00:06:15
    corresponds to the descriptor subsets
  • 00:06:18
    from the initial full set of
  • 00:06:21
    881 okay so we're going to use the
  • 00:06:23
    218
  • 00:06:24
    as the x variables in order to predict
  • 00:06:27
    the pIC50
  • 00:06:28
    and finally we have the prediction
  • 00:06:30
    output in the last data frame here
  • 00:06:33
    and we have the corresponding ChEMBL ID
  • 00:06:35
    and then we could also download the
  • 00:06:37
    prediction
  • 00:06:37
    by pressing on this link
  • 00:06:46
    and then the prediction is provided here
  • 00:06:48
    in the csv
  • 00:06:49
    file okay
  • 00:06:53
    so the data is provided here
  • 00:06:56
    all right and so let's get started shall
  • 00:07:00
    we
  • 00:07:06
    okay so we have to first build our
  • 00:07:09
    prediction model
  • 00:07:10
    using the jupyter notebook and then
  • 00:07:12
    we're going to
  • 00:07:13
    save the model as a pickle file right
  • 00:07:16
    here
  • 00:07:16
    okay so let me show you in which will
  • 00:07:20
    take
  • 00:07:20
    just a moment so let me open up a new
  • 00:07:23
    terminal
  • 00:07:25
    and then i'm going to activate jupyter
  • 00:07:28
    typing in jupyter notebook
  • 00:07:32
    okay so i have to first activate the conda
  • 00:07:34
    environment
  • 00:07:37
    conda activate data professor so it's
  • 00:07:39
    the same environment
  • 00:07:40
    and then jupyter notebook
  • 00:07:45
    all right there you go and then i'm
  • 00:07:47
    going to open up the jupyter notebook
  • 00:07:51
    all right and here we go so actually
  • 00:07:53
    this was adapted from one of the prior
  • 00:07:56
    tutorials in this bioinformatics from
  • 00:07:58
    scratch
  • 00:07:59
    series and essentially we're going to
  • 00:08:01
    just
  • 00:08:02
    download the calculated fingerprints
  • 00:08:05
    from the github of data professor
  • 00:08:07
    using this url link and so we're
  • 00:08:10
    importing pandas as pd and then we're
  • 00:08:13
    downloading
  • 00:08:13
    and reading it in using pandas and the
  • 00:08:16
    resulting data frame looks like this
  • 00:08:18
    and so you're gonna see here that we
  • 00:08:20
    have
  • 00:08:21
    all of this so one column the last
  • 00:08:24
    column is pIC50
  • 00:08:26
    and we have 881 columns for the
  • 00:08:29
    pubchem fingerprint and then the next
  • 00:08:32
    cell here is we're going to be
  • 00:08:35
    dropping the last column or the pIC50
  • 00:08:38
    column
  • 00:08:39
    in order to assign it to the x variable
  • 00:08:43
    and then we're going to just select the
  • 00:08:46
    last column denoted here by -1
  • 00:08:50
    and assigning it to the y variable
  • 00:08:53
    and so now that we have the x and y
  • 00:08:55
    separated we're going to next
  • 00:08:57
    remove the low variance features from the
  • 00:09:00
    x
  • 00:09:00
    variable so initially we have 881
  • 00:09:04
    and so applying a threshold of 0.1 this
  • 00:09:08
    resulted in 218 columns
  • 00:09:11
    and then we're going to be saving it
  • 00:09:13
    into a descriptor_list.csv
  • 00:09:16
    file so let me show you that
  • 00:09:21
    descriptor_list.csv file
  • 00:09:24
    okay and then you're going to see here
  • 00:09:25
    that the first row
  • 00:09:27
    will contain the names of the
  • 00:09:29
    fingerprints that are retained
  • 00:09:31
    in other words the name of the
  • 00:09:33
    descriptors of the 218 columns here
  • 00:09:37
    here you can see that PubChem
  • 00:09:39
    fingerprint 0
  • 00:09:40
    1 2 has been removed and we have
  • 00:09:43
    fingerprint 3
  • 00:09:44
    and fingerprints 4 until 11 has been
  • 00:09:47
    removed
  • 00:09:48
    fingerprint 14 has been removed
  • 00:09:50
    fingerprint 17 has also been removed
  • 00:09:54
    so more than 600 fingerprints have been
  • 00:09:58
    deleted
  • 00:09:58
    from the x variable and so the removal
  • 00:10:01
    of excessive redundant features will
  • 00:10:03
    allow us to build the model
  • 00:10:05
    much quicker okay and so in just a few
  • 00:10:08
    moments i will be telling you
  • 00:10:10
    how we're going to be making use of this
  • 00:10:12
    descriptor list
  • 00:10:14
    in order to select the subsets from the
  • 00:10:16
    computed
  • 00:10:17
    descriptors that we obtained from the
  • 00:10:20
    input query right here let me show you
  • 00:10:25
    that we get from the input query right
  • 00:10:27
    here
  • 00:10:28
    so out of this SMILES notation we
  • 00:10:31
    generated 881
  • 00:10:34
    columns and then we're going to be
  • 00:10:36
    selecting
  • 00:10:37
    a subset of 218 from the initial 881
  • 00:10:41
    by using this particular list of
  • 00:10:44
    descriptors okay
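The low-variance filtering described above can be sketched in plain Python, assuming the fingerprints are 0/1 columns and using the 0.1 threshold mentioned in the video. This is an illustration of the idea, not the notebook's code (which presumably relies on a library routine); the column names and toy matrix are invented.

```python
from statistics import pvariance


def low_variance_filter(columns, rows, threshold=0.1):
    """Return the names of columns whose population variance exceeds threshold."""
    kept = []
    for index, name in enumerate(columns):
        values = [row[index] for row in rows]
        if pvariance(values) > threshold:
            kept.append(name)
    return kept


# Toy binary fingerprint matrix (hypothetical column names).
columns = ["PubchemFP0", "PubchemFP1", "PubchemFP2"]
rows = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
kept = low_variance_filter(columns, rows)

# The retained names form the single header row of the descriptor list file.
header_line = ",".join(kept)
```

Here the first column is constant (variance 0) and is dropped, while the other two have variance 0.25 and are kept. Saving only the retained names is what lets the app later pick the same columns from freshly computed fingerprints.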
  • 00:10:49
    and let's go back to the
  • 00:10:52
    jupyter notebook all right
  • 00:10:58
    let's save it
  • 00:11:03
    and then we're going to be building the
  • 00:11:05
    model random forest model
  • 00:11:08
    we're setting here the random state to
  • 00:11:10
    be 42
  • 00:11:11
    the number of estimators to be 500 and
  • 00:11:14
    we're using the random forest regressor
  • 00:11:16
    and we fit the model here in order to
  • 00:11:18
    train it and then we're going to be
  • 00:11:20
    calculating the score
  • 00:11:21
    which is the r2 score
  • 00:11:25
    and then we're assigning it to the r2
  • 00:11:27
    variable and then finally
  • 00:11:29
    we're going to be applying the trained
  • 00:11:32
    model
  • 00:11:32
    to make a prediction on the x variable
  • 00:11:35
    which is the training set
  • 00:11:36
    also and then we're assigning it to the
  • 00:11:39
    y_pred
  • 00:11:40
    variable
  • 00:11:44
    okay so here we see that the r squared
  • 00:11:46
    value is 0.86
  • 00:11:49
    and then let's print out the performance
  • 00:11:53
    mean squared error of 0.34 and let's
  • 00:11:55
    make the scatter plot
  • 00:11:57
    of the actual and predicted values
  • 00:12:01
    okay so we get this plot here
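For reference, the two metrics quoted here can be computed by hand as follows. This is a minimal sketch with made-up numbers, not the notebook's code, and the function names are chosen for illustration only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented pIC50-like values, purely for illustration.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

Because the video evaluates the model on the same data it was trained on, figures such as 0.86 and 0.34 describe the fit to the training set and are likely to be more optimistic than performance on unseen molecules.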
  • 00:12:04
    and then finally we're going to be
  • 00:12:05
    saving the model by
  • 00:12:07
    dumping it using the pickle function
  • 00:12:09
    pickle.dump
  • 00:12:11
    and then as input argument we're going
  • 00:12:12
    to have model and then we're going to
  • 00:12:14
    save it as
  • 00:12:15
    acetylcholinesterase_model.pkl
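A minimal sketch of the save-and-reload round trip with `pickle`. A plain dictionary stands in for the trained regressor so the example runs without any machine learning library, and the exact file name (with an underscore) and the temporary directory are assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model: any picklable object is handled the same way.
model = {"name": "stand-in model", "n_estimators": 500, "random_state": 42}

# Assumed file name; written to a temporary directory for this illustration.
model_path = os.path.join(tempfile.mkdtemp(), "acetylcholinesterase_model.pkl")

# Save the object to disk (what the notebook does after training).
with open(model_path, "wb") as handle:
    pickle.dump(model, handle)

# Load it back (what the web app does before predicting).
with open(model_path, "rb") as handle:
    loaded_model = pickle.load(handle)
```

Pickle files can execute code when loaded, so an app should only unpickle model files it created or otherwise trusts.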
  • 00:12:20
    and there you go we have already saved
  • 00:12:22
    the model okay so i'm going to go ahead
  • 00:12:24
    and
  • 00:12:24
    close this Jupyter notebook
  • 00:12:33
    and let's hop over back and
  • 00:12:37
    let's take a look at the app.py file
  • 00:12:42
    okay so let's have a brief look you're
  • 00:12:44
    going to see here that
  • 00:12:46
    the app.py is less than 90 lines of code
  • 00:12:50
    and about 87 to be exact and you're
  • 00:12:53
    going to see that there are
  • 00:12:54
    some white spaces so even if we delete
  • 00:12:57
    all
  • 00:12:57
    the white space it might be even less
  • 00:13:00
    maybe 80 lines of code
  • 00:13:02
    okay so the first seven lines of code
  • 00:13:05
    will be
  • 00:13:06
    importing the necessary libraries and so
  • 00:13:09
    we're making use of streamlit as the web
  • 00:13:11
    framework
  • 00:13:12
    and we're using pandas in order to
  • 00:13:14
    display the data frame
  • 00:13:15
    and the image function from the pil
  • 00:13:18
    library is used to display this
  • 00:13:21
    illustration
  • 00:13:22
    and the descriptor calculation will be
  • 00:13:24
    made possible by
  • 00:13:25
    using the subprocess library so that
  • 00:13:28
    will allow us to compute the
  • 00:13:29
    PaDEL descriptor via the use of Java and
  • 00:13:32
    we're
  • 00:13:33
    using the os library in order to perform
  • 00:13:36
    file handling so here you're going to
  • 00:13:38
    see that we're using
  • 00:13:40
    os.remove in order to remove the
  • 00:13:42
    molecule.smi
  • 00:13:43
    file so i'm going to explain to you that
  • 00:13:45
    in just a moment
  • 00:13:47
    base64 will be used for encoding
  • 00:13:49
    decoding
  • 00:13:50
    of the file when we will make the file
  • 00:13:54
    available for download the prediction
  • 00:13:56
    results
  • 00:13:56
    and the pickle library will be used for
  • 00:13:59
    loading up the pickled file
  • 00:14:01
    of the model okay and so you're going to
  • 00:14:04
    be seeing here that we're
  • 00:14:05
    making three custom functions so lines
  • 00:14:09
    10
  • 00:14:09
    through 15 the first custom function
  • 00:14:12
    will be our
  • 00:14:13
    molecular descriptor calculator so we're
  • 00:14:16
    defining a function called desc_calc
  • 00:14:19
    and then the statement underneath it
  • 00:14:21
    will be the bash command
  • 00:14:23
    and so this bash command is what we're
  • 00:14:25
    normally using
  • 00:14:27
    when we type into the command line
  • 00:14:30
    okay and so this option here will allow
  • 00:14:33
    us to run the code in the command line
  • 00:14:35
    without
  • 00:14:36
    launching a GUI version of PaDEL
  • 00:14:38
    descriptor and so without this
  • 00:14:40
    option here it will launch a gui version
  • 00:14:43
    but since
  • 00:14:44
    we don't want that to happen we're going
  • 00:14:46
    to use this option
  • 00:14:48
    okay and so we're using the jar file to
  • 00:14:51
    make the calculation
  • 00:14:52
    of the fingerprints and then you're
  • 00:14:53
    going to see here that we have
  • 00:14:55
    additional
  • 00:14:56
    options such as removing salt
  • 00:14:58
    standardizing the nitro group of the
  • 00:15:00
    molecule
  • 00:15:01
    and then we're using the fingerprint to
  • 00:15:03
    be the pubchem fingerprint
  • 00:15:05
    using the xml file here and then finally
  • 00:15:08
    we're generating the molecular
  • 00:15:10
    descriptor file by saving it to the
  • 00:15:12
    descriptors_output.csv
  • 00:15:15
    file and so this bash command will be
  • 00:15:18
    serving as input
  • 00:15:19
    right here in the subprocess.Popen
  • 00:15:23
    function okay
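The descriptor calculation described above might look roughly like this. It is a sketch under assumptions: the folder layout, jar and XML file names, and the exact PaDEL-Descriptor flag spellings should be checked against the repository and the PaDEL-Descriptor documentation, and running `desc_calc` requires Java plus the jar file.

```python
import os
import subprocess


def build_padel_command(padel_dir="./PaDEL-Descriptor", output_csv="descriptors_output.csv"):
    """Assemble the PaDEL-Descriptor call as an argument list (paths are assumed)."""
    return [
        "java",
        "-Djava.awt.headless=true",  # run without opening the GUI
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt",               # strip salts from the structures
        "-standardizenitro",         # standardize nitro groups
        "-fingerprints",             # compute fingerprints
        "-descriptortypes", f"{padel_dir}/PubchemFingerprinter.xml",
        "-dir", "./",                # folder containing molecule.smi
        "-file", output_csv,         # where the results are written
    ]


def desc_calc():
    """Run the command, wait for it to finish, then delete the temporary input."""
    process = subprocess.Popen(
        build_padel_command(), stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    process.communicate()
    if os.path.exists("molecule.smi"):
        os.remove("molecule.smi")


# Only build the command here; calling desc_calc() needs Java installed.
command = build_padel_command()
```

Passing the command as a list avoids going through a shell, which sidesteps quoting problems if a path contains spaces.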
  • 00:15:26
    and then finally after the descriptor
  • 00:15:28
    has been calculated
  • 00:15:30
    we're removing the molecule.smi file and
  • 00:15:33
    so the molecule.smi file
  • 00:15:35
    will be generated in another function so
  • 00:15:37
    i will be discussing that in just a
  • 00:15:39
    moment
  • 00:15:40
    and the second custom function that
  • 00:15:42
    we're generating here
  • 00:15:44
    is file download so after making the
  • 00:15:46
    prediction
  • 00:15:47
    we're going to be encoding decoding the
  • 00:15:49
    results
  • 00:15:50
    and then the output will be available as
  • 00:15:52
    a file for downloading
  • 00:15:53
    using this link and the third function
  • 00:15:56
    that we're
  • 00:15:57
    creating is called build model so it
  • 00:16:00
    will be accepting
  • 00:16:01
    the input argument which is the input
  • 00:16:03
    data
  • 00:16:04
    and then it will be loading up the
  • 00:16:07
    pickle
  • 00:16:08
    file which is the built model into a
  • 00:16:11
    load model
  • 00:16:12
    variable and then the model which we
  • 00:16:14
    have loaded
  • 00:16:15
    will be used for making a prediction
  • 00:16:18
    on the input data which is specified
  • 00:16:21
    here
  • 00:16:22
    and after a prediction has been made
  • 00:16:24
    we're going to be
  • 00:16:25
    assigning it to the prediction variable
  • 00:16:27
    then we're going to be printing out the
  • 00:16:29
    header called prediction outputs which
  • 00:16:32
    is right here
  • 00:16:33
    and underneath it we're going to create
  • 00:16:36
    a variable called prediction output
  • 00:16:38
    and we're going to be creating a
  • 00:16:40
    pd.Series
  • 00:16:41
    so essentially it is a column using
  • 00:16:43
    pandas
  • 00:16:44
    and so the first column is prediction
  • 00:16:47
    and then we're
  • 00:16:48
    naming it pIC50 which is here
  • 00:16:53
    and then we're going to create another
  • 00:16:55
    variable called molecule name
  • 00:16:57
    and the column that we're creating is
  • 00:17:00
    the ChEMBL ID
  • 00:17:01
    or the molecule name which is right here
  • 00:17:05
    the first column and then we're going to
  • 00:17:08
    be combining these two columns
  • 00:17:11
    given by the individual variables
  • 00:17:14
    called prediction outputs and molecule
  • 00:17:17
    name
  • 00:17:17
    so we're using the pd.concat function
  • 00:17:21
    and then in bracket we're using molecule
  • 00:17:23
    name which is the first column
  • 00:17:25
    prediction output which is the second
  • 00:17:26
    column and then we're using an axis
  • 00:17:28
    equals to one in order to tell it to
  • 00:17:31
    combine
  • 00:17:32
    the two variables or the two columns
  • 00:17:35
    in a side-by-side manner okay so axis
  • 00:17:38
    one will allow us to have the two
  • 00:17:39
    columns side by side
  • 00:17:41
    otherwise it will be stacked underneath
  • 00:17:43
    it so the pIC50 column will be stacked
  • 00:17:45
    underneath the molecule name if the axis
  • 00:17:48
    was to be
  • 00:17:49
    zero okay and finally we're writing out
  • 00:17:52
    the data frame which is here
  • 00:17:54
    and then we're allowing it to generate
  • 00:17:57
    the
  • 00:17:57
    download link which is right here and
  • 00:18:00
    we're
  • 00:18:00
    making use of the file download function
  • 00:18:03
    described earlier on here
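A sketch of how the side-by-side results table and its download link could be produced with the standard library alone. The app itself uses pandas and its own file download function, so the function name, column headers, file name, and link text below are illustrative assumptions.

```python
import base64
import csv
import io


def build_download_link(names, predictions, filename="prediction.csv"):
    """Return an HTML <a> tag that embeds the results as a base64-encoded CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["molecule_name", "pIC50"])
    # zip pairs each name with its prediction, so the two columns sit side by side
    writer.writerows(zip(names, predictions))
    encoded = base64.b64encode(buffer.getvalue().encode()).decode()
    return (
        f'<a href="data:text/csv;base64,{encoded}" '
        f'download="{filename}">Download Predictions</a>'
    )


# Placeholder IDs and values, purely for illustration.
link = build_download_link(["MOL_0001", "MOL_0002"], [5.2, 6.8])
```

In a Streamlit app, a string like this could be rendered with `st.markdown(link, unsafe_allow_html=True)`. Embedding the data in the link means no file has to be written on the server.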
  • 00:18:05
    okay and then at line number 38
  • 00:18:08
    we're generating this or displaying this
  • 00:18:12
    image of the web app
  • 00:18:15
    okay and lines number 43 until 51
  • 00:18:19
    or 52 is the header here
  • 00:18:22
    the bioactivity prediction app title and
  • 00:18:24
    then the description
  • 00:18:26
    of the app and then the credits of the
  • 00:18:28
    app and
  • 00:18:29
    this is written in markdown language
  • 00:18:32
    all right and so let's have a look
  • 00:18:34
    further lines 55
  • 00:18:37
    until 59 will be displaying the sidebar
  • 00:18:41
    right here so 55 will be displaying the
  • 00:18:44
    header
  • 00:18:45
    number one upload your csv data and then
  • 00:18:48
    we're creating a
  • 00:18:49
    variable called uploaded file and here
  • 00:18:52
    we're using the
  • 00:18:54
    st.sidebar.file_uploader function and then
  • 00:18:57
    as
  • 00:18:57
    input argument we're displaying the text
  • 00:19:00
    upload your input file which is also
  • 00:19:02
    right here
  • 00:19:04
    and then the type of the file will be
  • 00:19:06
    the txt file so right here
  • 00:19:10
    and then we're creating a link using
  • 00:19:12
    markdown language
  • 00:19:13
    to the example file provided here to the
  • 00:19:16
    example acetylcholinesterase file
  • 00:19:18
    so it's going to be the exact same file
  • 00:19:20
    that we have selected
  • 00:19:21
    as input okay so that's the sidebar
  • 00:19:27
    function that you see here all right and
  • 00:19:29
    so let's have a look further
  • 00:19:31
    so here you can see that from line 61
  • 00:19:34
    until 87
  • 00:19:35
    we have the if and else condition so if
  • 00:19:39
    we click on the predict button which is
  • 00:19:41
    right here
  • 00:19:42
    using the st.sidebar.button function
  • 00:19:45
    with input argument of predict so if we
  • 00:19:48
    click on it
  • 00:19:49
    it will make the descriptor calculation
  • 00:19:51
    and
  • 00:19:52
    apply the machine learning model to make
  • 00:19:54
    a prediction and finally
  • 00:19:56
    displaying the results of the prediction
  • 00:19:58
    right here
  • 00:19:59
    and allow the user to download the
  • 00:20:01
    predictions however
  • 00:20:03
    if we didn't click anything whereby we
  • 00:20:06
    loaded up the web page
  • 00:20:07
    from the beginning as i will show you
  • 00:20:09
    right now
  • 00:20:10
    you will see a blue box displaying the
  • 00:20:12
    message of
  • 00:20:14
    upload input data in the sidebar to
  • 00:20:16
    start
  • 00:20:17
    okay so two conditions if the predict
  • 00:20:19
    button is clicked
  • 00:20:20
    it will make a prediction otherwise it
  • 00:20:23
    will just display the text here
  • 00:20:25
    that it is waiting for you to upload the
  • 00:20:27
    input data
  • 00:20:28
    okay so let's have a look under the if
  • 00:20:30
    condition
  • 00:20:31
    so upon clicking on the predict button
  • 00:20:34
    as you have guessed it will load the
  • 00:20:36
    data that you had just dragged and dropped
  • 00:20:38
    and then it will be saving it as a
  • 00:20:40
    molecule.smi file and this
  • 00:20:42
    very same file here molecule.smi
  • 00:20:46
    will be used by the desc calculation
  • 00:20:49
    function
  • 00:20:50
    that we have discussed earlier on
  • 00:20:52
    particularly
  • 00:20:53
    the molecule.smi file will be used by
  • 00:20:56
    the PaDEL-Descriptor software
  • 00:20:58
    for the molecular descriptor calculation
  • 00:21:01
    and after
  • 00:21:02
    the descriptors have been calculated we
  • 00:21:04
    will assign it
  • 00:21:05
    as the x variable it's right here
  • 00:21:09
    okay so i'm going to tell you in just a
  • 00:21:11
    moment line number 65
  • 00:21:13
    will be printing out the header right
  • 00:21:15
    here
  • 00:21:16
    so let me make a prediction first so
  • 00:21:18
    that we can see
  • 00:21:20
    let's drag and drop the input file
  • 00:21:24
    press on the predict button it's right
  • 00:21:27
    here
  • 00:21:28
    original input data
  • 00:21:31
    line number 65. line number 66
  • 00:21:34
    will be printing out the data frame of
  • 00:21:36
    the input file
  • 00:21:38
    so you're going to see here two columns
  • 00:21:40
    the SMILES notation
  • 00:21:42
    which represents the chemical structure
  • 00:21:44
    information and the ChEMBL ID column
  • 00:21:46
    line number 68 will be displaying a
  • 00:21:50
    spinner so upon loading up this
  • 00:21:54
    results here by pressing on the predict
  • 00:21:56
    button you saw earlier on that they had
  • 00:21:58
    a
  • 00:21:59
    yellow message box saying calculating
  • 00:22:01
    descriptor
  • 00:22:02
    and so underneath we have the desc
  • 00:22:04
    calculation function
  • 00:22:06
    and after it is calculated it will be
  • 00:22:08
    displaying the following content
  • 00:22:10
    the calculated molecular descriptor
  • 00:22:12
    which follows here
  • 00:22:14
    on line number 72 right here
  • 00:22:17
    calculated molecular descriptor and then
  • 00:22:20
    it will be reading in
  • 00:22:21
    the calculated descriptor from the
  • 00:22:23
    descriptors_output.csv file
  • 00:22:26
    it will be assigning it to the desc
  • 00:22:28
    variable
  • 00:22:29
    then we're going to be writing out right
  • 00:22:31
    here
  • 00:22:32
    and showing the data frame of the
  • 00:22:34
    descriptors that have been calculated
  • 00:22:36
    and then we're going to be printing out
  • 00:22:37
    the shape of the descriptor and so we
  • 00:22:40
    see here that
  • 00:22:41
    it has five rows or five molecules
  • 00:22:45
    881 molecular fingerprints
  • 00:22:48
    and then in lines number 78 until 82
  • 00:22:51
    is going to be the subset of descriptor
  • 00:22:54
    that is read
  • 00:22:55
    from the previously built model from the
  • 00:22:58
    file
  • 00:22:59
    descriptor_list.csv and so you can
  • 00:23:02
    see here that we're going to create a
  • 00:23:03
    variable called x list and then we're
  • 00:23:06
    reading in
  • 00:23:07
    the columns okay and then we're going to
  • 00:23:09
    be
  • 00:23:10
    from the initial descriptor of 881
  • 00:23:14
    we're going to be selecting a subset
  • 00:23:17
    provided in the x list
  • 00:23:18
    and then we assign the subset of
  • 00:23:21
    descriptor
  • 00:23:22
    which is 218 descriptors selected from
  • 00:23:25
    the initial
  • 00:23:26
    set of 881 and then we assigned that to
  • 00:23:30
    the desc
  • 00:23:30
    subset variable and then finally we
  • 00:23:33
    printed out
  • 00:23:34
    as a data frame and we also print out
  • 00:23:36
    the dimension as well
  • 00:23:37
    so we see here that there are five
  • 00:23:39
    molecules and 218
  • 00:23:42
    columns or 218 fingerprints
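The subset selection just described can be sketched without pandas as follows, assuming the descriptor list holds the retained column names in its first row. The function name, column names, and toy values are invented for illustration.

```python
def select_columns(header, rows, wanted):
    """Keep only the columns named in wanted, in that order."""
    positions = [header.index(name) for name in wanted]
    return [[row[i] for i in positions] for row in rows]


# Toy example: 4 computed fingerprint columns, 2 of which the model was trained on.
header = ["PubchemFP0", "PubchemFP1", "PubchemFP2", "PubchemFP3"]
rows = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
# In the app these names would be read from the descriptor list CSV.
descriptor_list = ["PubchemFP1", "PubchemFP2"]
subset = select_columns(header, rows, descriptor_list)
```

With a pandas DataFrame the same selection is a single indexing step, passing the list of names in square brackets. Either way, the point is that the columns fed to the model match, in name and order, the ones it saw during training.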
  • 00:23:45
    and finally we make use of this
  • 00:23:47
    calculated
  • 00:23:48
    molecular descriptor subset and use it
  • 00:23:51
    as an
  • 00:23:51
    input argument to the build model
  • 00:23:53
    function
  • 00:23:54
    and then as i have mentioned earlier on
  • 00:23:56
    it will be
  • 00:23:57
    building the model and then finally it
  • 00:23:59
    will be displaying the model
  • 00:24:01
    prediction result right here so users
  • 00:24:04
    can download it
  • 00:24:05
    into their own computer if you're
  • 00:24:07
    finding value in this video
  • 00:24:09
    please help us out by smashing the like
  • 00:24:11
    button subscribing if you haven't
  • 00:24:13
    already
  • 00:24:14
    and make sure to hit on the notification
  • 00:24:16
    bell so that you'll be notified
  • 00:24:18
    of the next video and as always the best
  • 00:24:21
    way to learn data science
  • 00:24:22
    is to do data science and please enjoy
  • 00:24:25
    the journey
Tags
  • bioinformatics
  • machine learning
  • web application
  • Streamlit
  • bioactivity
  • acetylcholinesterase
  • Python
  • PaDEL-Descriptor
  • GitHub
  • ChEMBL