What is the purpose of this video?

The video teaches how to convert a machine learning model into a web application for bioactivity prediction.

Which libraries are used in this project?

The project uses Python, Streamlit, Pandas, and the PaDEL-Descriptor software.

What type of model is being deployed in the web app?

The web app deploys a random forest machine learning model.

What is the target protein in this demonstration?

The target protein is acetylcholinesterase.

How are molecular fingerprints used in this project?

Molecular fingerprints are computed with PaDEL-Descriptor and used as the input features for model predictions.

What database is used to compile bioactivity data sets?

The ChEMBL database is used for compiling datasets.

Where can the app folder be found?

The app folder is available in a GitHub repository linked in the video description.

What file format is used for input data?

Input data is uploaded as a .txt file that lists each molecule's SMILES notation and ChEMBL ID.

What happens when you hit the "predict" button in the app?

It computes molecular descriptors and makes predictions on bioactivity.

How many descriptors are used from the molecular fingerprint?

A subset of 218 descriptors is used out of the 881 that are initially computed.

Bioinformatics Project from Scratch - Drug Discovery #6 (Deploy Model as Web App) | Streamlit #22

00:24:28

https://www.youtube.com/watch?v=m0sePkuyTKs

Summary

TLDR: In this tutorial from the 'Bioinformatics from Scratch' series, viewers learn to turn a machine learning model built for bioactivity prediction into a deployable web application using Python and Streamlit. The video recaps the earlier steps of the series (compiling bioactivity datasets from the ChEMBL database, performing exploratory data analysis, and building and comparing machine learning models such as random forests) and then deploys the final model as a web app. The app lets users make interactive predictions for a target protein, here acetylcholinesterase, which is relevant to Alzheimer's research. The provided GitHub repository includes the required folders and code, so users can activate their environment and run the script to launch the app. The use of PaDEL-Descriptor for molecular fingerprints and the model-building steps in a Jupyter notebook are also covered.

Takeaways

🔍 Learn to integrate machine learning models into web applications.
💻 Use Python and Streamlit for cloud deployment.
📁 Access code and resources via GitHub.
🔬 Focus on bioactivity related to acetylcholinesterase.
🧬 Convert SMILES notations to molecular descriptors.
📊 Demonstrate and evaluate using random forest models.
🛠 Customize and deploy apps using minimal coding.
📈 Visualize molecular fingerprint data effectively.
🔗 Connect model to real-time predictions for users.
🧪 Specialize in predictive modeling for biochemistry.

Timeline

00:00:00 - 00:05:00
In previous videos of the "Bioinformatics from Scratch" series, topics covered included compiling bioactivity datasets from the ChEMBL database, performing exploratory data analysis on Lipinski descriptors, building a random forest model, and comparing model performances using the Lazy Predict library. The current video focuses on converting these machine learning models into a web application using the Streamlit library, allowing deployment in the cloud for user interaction and prediction tasks. The example application predicts bioactivity for acetylcholinesterase, a target protein for Alzheimer's disease.
00:05:00 - 00:10:00
The video demonstrates the setup of the application environment, starting with activating a Conda environment and locating the app files, including 'app.py'. It walks through running the application with Streamlit and shows an example where users upload an input file containing SMILES notations and ChEMBL IDs. The PaDEL-Descriptor tool converts the SMILES notations into molecular fingerprints, which the trained model uses to predict bioactivity as pIC50 values. An acetylcholinesterase model is highlighted for its relevance to Alzheimer's research.
00:10:00 - 00:15:00
The process of building and using a random forest model is explained, beginning with loading the precomputed PubChem fingerprints and separating them from the pIC50 values. Low-variance fingerprints are removed with a threshold of 0.1, which reduces the descriptors from 881 to 218 so the model builds faster. The video shows model training in a Jupyter notebook, saving the model as a pickle file, and preparing it for integration into the web app. Performance metrics, namely R-squared (0.86) and mean squared error (0.34), are calculated on the training set.
00:15:00 - 00:24:28
A detailed breakdown of the 'app.py' script is provided, covering the Python libraries it imports (Streamlit, Pandas, PIL, subprocess, os, base64, and pickle) and its three custom functions: one that calls PaDEL-Descriptor from the command line to calculate descriptors, one that creates a download link for the results, and one that loads the pickled model and makes predictions. The video emphasizes practical deployment, showing how users upload a dataset and obtain downloadable bioactivity predictions.

Subtitles

00:00:00
in prior videos of the bioinformatics
00:00:02
from scratch
00:00:03
series you have learned how to compile
00:00:06
your very own
00:00:07
bio activity data sets directly from the
00:00:10
ChEMBL database
00:00:11
how to perform exploratory data analysis
00:00:14
on the computed lipinski descriptors you
00:00:17
have also learned how to build a random
00:00:19
forest model
00:00:20
as well as building several machine
00:00:22
learning models for comparing the model
00:00:24
performance
00:00:25
using the lazy predict library and so in
00:00:28
this video
00:00:28
we will be taking a look at how we can
00:00:31
take that machine learning model of the
00:00:33
bioactivity
00:00:34
data set and convert it into a web
00:00:36
application
00:00:37
that you could deploy on the cloud that
00:00:39
will allow users to be able to
00:00:41
make predictions on your machine
00:00:44
learning model
00:00:45
for the target protein of your interest
00:00:47
and so without further ado
00:00:49
we're starting right now
00:00:53
okay so the first thing that you want to
00:00:55
do is
00:00:57
go to the bioactivity prediction app
00:01:00
folder
00:01:01
and so this folder will be provided in
00:01:03
the github link
00:01:04
in the video description and so before
00:01:07
we start
00:01:08
let me show you what the app looks like
00:01:13
so i'm going to activate my conda
00:01:15
environment
00:01:18
and for you please make sure to activate
00:01:20
your own conda environment as well
00:01:22
so on my computer i'm using the data
00:01:24
professor
00:01:25
environment so i'm going to activate it
00:01:28
by
00:01:28
typing in conda activate data professor
00:01:32
and i'm going to go to the desktop
00:01:34
because that is where
00:01:36
the streamlit folder resides
00:01:43
and then we're going to go to the
00:01:45
bioactivity folder
00:01:51
let's have a look at the contents so the
00:01:53
app.py
00:01:54
will be the application and so we're
00:01:57
going to type in
00:01:58
streamlit run app.py in order to launch
00:02:02
this bioactivity prediction app
00:02:09
okay and so this is the bioactivity
00:02:11
prediction app
00:02:12
that i'm going to be showing you today
00:02:14
how you could build one
00:02:15
and so let's have a look at the example
00:02:17
input file
00:02:20
so this is the example input file
00:02:23
so in order to proceed with using this
00:02:26
app
00:02:27
we're going to have to upload the file
00:02:30
drag and drop right here
00:02:31
or browse files and select the input
00:02:33
file and so while waiting for
00:02:35
an input file to be uploaded you can see
00:02:38
here that the
00:02:39
blue box will be giving us a waiting
00:02:42
message
00:02:43
so it's saying upload input data in the
00:02:45
sidebar to start
00:02:47
so essentially the input file contains
00:02:50
the SMILES notation and the ChEMBL
00:02:53
ID and so the ChEMBL ID you can think of
00:02:56
it as kind of like the name
00:02:57
of the molecule here and particularly
00:03:00
the ChEMBL ID is a unique
00:03:02
identification number of this particular
00:03:04
molecule
00:03:05
that the ChEMBL database has assigned to it
00:03:08
and the
00:03:09
SMILES notation here is a one-dimensional
00:03:12
representation of this particular
00:03:15
chemical structure and so this
00:03:17
SMILES notation will be used by the
00:03:20
PaDEL-Descriptor software
00:03:22
that we're going to be using here today
00:03:24
in the app in order to generate
00:03:26
molecular fingerprints which describe the
00:03:29
unique chemical features of the molecule
00:03:31
and then such molecular fingerprints
00:03:34
will then be used by the machine
00:03:36
learning model
00:03:36
to make a prediction okay
00:03:40
and so the prediction will be the pIC50
00:03:42
values
00:03:43
that you see here and the pIC50 value
00:03:45
is the bioactivity
00:03:47
against the target protein of interest
00:03:49
and so in this application
00:03:51
the target protein is
00:03:53
acetylcholinesterase
00:03:55
and this target protein is a target for
00:03:57
the
00:03:58
alzheimer's disease okay and so this app
00:04:01
is built in python
00:04:02
using the streamlit library and the
00:04:05
molecular fingerprints
00:04:06
are calculated using the
00:04:12
PaDEL-Descriptor
00:04:23
so back in 2016 we have published a
00:04:26
paper
00:04:27
describing the development of a QSAR
00:04:30
model
00:04:31
for predicting the bioactivity of the
00:04:35
acetylcholinesterase and so if you're
00:04:37
interested in this article
00:04:38
please feel free to read it and so i'm
00:04:40
going to provide you the link
00:04:41
in the video description as well
00:04:46
okay so let's drag and drop the
00:04:49
input file so example acetylcholinesterase
00:04:54
i'm going to drag and drop here
00:04:58
and then in order to initiate the
00:05:00
prediction i'm going to
00:05:02
press on the predict button
00:05:05
and as you see here the input file is
00:05:08
giving you
00:05:09
this data frame and then it's
00:05:11
calculating the descriptor
00:05:13
and the calculated descriptor is
00:05:15
provided here
00:05:17
in this particular data frame so you're
00:05:19
going to see here that there are a total
00:05:21
of five
00:05:22
input molecules and there are 882
00:05:26
columns and you're going to see here
00:05:28
that the first column
00:05:29
is the ChEMBL ID so in reality
00:05:33
you're going to have a total of 881
00:05:36
molecular fingerprints and the molecular
00:05:39
fingerprints that we're using today
00:05:40
is the PubChem fingerprint and because
00:05:43
we have previously built
00:05:44
a machine learning model which i will be
00:05:46
showing you using this
00:05:48
file the jupyter notebook file
00:05:51
we had reduced the number of descriptors
00:05:54
from
00:05:54
881 to 217 no actually 218 because we
00:05:58
have already deleted the first column
00:06:00
the name of the ChEMBL ID column and
00:06:03
so we have reduced from 881
00:06:05
columns to 218 columns okay and so
00:06:09
in the code we're going to be selecting
00:06:11
the same
00:06:12
218 columns that you see here which
00:06:15
corresponds to the descriptor subsets
00:06:18
from the initial full set of
00:06:21
881 okay so we're going to use the
00:06:23
218
00:06:24
as the x variables in order to predict
00:06:27
the pIC50
00:06:28
and finally we have the prediction
00:06:30
output in the last data frame here
00:06:33
and we have the corresponding ChEMBL ID
00:06:35
and then we could also download the
00:06:37
prediction
00:06:37
by pressing on this link
00:06:46
and then the prediction is provided here
00:06:48
in the csv
00:06:49
file okay
00:06:53
so the data is provided here
00:06:56
all right and so let's get started shall
00:07:00
we
00:07:06
okay so we have to first build our
00:07:09
prediction model
00:07:10
using the jupyter notebook and then
00:07:12
we're going to
00:07:13
save the model as a pickle file right
00:07:16
here
00:07:16
okay so let me show you in which will
00:07:20
take
00:07:20
just a moment so let me open up a new
00:07:23
terminal
00:07:25
and then i'm going to activate jupyter
00:07:28
typing in jupyter notebook
00:07:32
okay so i have to first activate the conda
00:07:34
environment
00:07:37
conda activate data professor so it's
00:07:39
the same environment
00:07:40
and then jupyter notebook
00:07:45
all right there you go and then i'm
00:07:47
going to open up the jupyter notebook
00:07:51
all right and here we go so actually
00:07:53
this was adapted from one of the prior
00:07:56
tutorials in this bioinformatics from
00:07:58
scratch
00:07:59
series and essentially we're going to
00:08:01
just
00:08:02
download the calculated fingerprints
00:08:05
from the github of data professor
00:08:07
using this url link and so we're
00:08:10
importing pandas as pd and then we're
00:08:13
downloading
00:08:13
and reading it in using pandas and the
00:08:16
resulting data frame looks like this
00:08:18
and so you're gonna see here that we
00:08:20
have
00:08:21
all of this so one column the last
00:08:24
column is pIC50
00:08:26
and we have 881 columns for the
00:08:29
PubChem fingerprint and then the next
00:08:32
cell here is we're going to be
00:08:35
dropping the last column or the pIC50
00:08:38
column
00:08:39
in order to assign it to the x variable
00:08:43
and then we're going to just select the
00:08:46
last column denoted here by -1
00:08:50
and assigning it to the y variable
00:08:53
and so now that we have the x and y
00:08:55
separated we're going to next
00:08:57
remove the low variance features from the
00:09:00
x
00:09:00
variable so initially we have 881
00:09:04
and so applying a threshold of 0.1 this
00:09:08
resulted in 218 columns
00:09:11
and then we're going to be saving it
00:09:13
into a descriptor list.csv
00:09:16
file so let me show you that
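The low-variance filtering step described here can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function name `remove_low_variance`, the toy fingerprint values, and the rule of keeping a column only when its population variance is strictly above the threshold are all assumptions.

```python
from statistics import pvariance


def remove_low_variance(columns, threshold=0.1):
    """Return only the columns whose population variance exceeds threshold.

    `columns` maps a descriptor name to its list of values, one per molecule.
    """
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


# Hypothetical binary fingerprints for four molecules.
fingerprints = {
    "PubchemFP0": [1, 1, 1, 1],   # constant bit, variance 0.0
    "PubchemFP3": [1, 0, 1, 0],   # variance 0.25
    "PubchemFP12": [0, 0, 0, 1],  # variance 0.1875
}

kept = remove_low_variance(fingerprints, threshold=0.1)
descriptor_list = list(kept)  # names that would form the header row
print(descriptor_list)
```

A bit that is set (or unset) in nearly every molecule carries little information for the model, which is why the constant column is the one dropped in this toy example.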
00:09:21
descriptor list dot csv file
00:09:24
okay and then you're going to see here
00:09:25
that the first row
00:09:27
will contain the names of the
00:09:29
fingerprints that are retained
00:09:31
in other words the name of the
00:09:33
descriptors of the 218 columns here
00:09:37
here you can see that PubChem
00:09:39
fingerprint 0
00:09:40
1 2 have been removed and we have
00:09:43
fingerprint 3
00:09:44
and fingerprints 4 until 11 have been
00:09:47
removed
00:09:48
fingerprint 14 has been removed
00:09:50
fingerprint 17 has also been removed
00:09:54
so more than 600 fingerprints have been
00:09:58
deleted
00:09:58
from the x variable and so the removal
00:10:01
of excessive redundant features will
00:10:03
allow us to build the model
00:10:05
much quicker okay and so in just a few
00:10:08
moments i will be telling you
00:10:10
how we're going to be making use of this
00:10:12
descriptor list
00:10:14
in order to select the subsets from the
00:10:16
computed
00:10:17
descriptors that we obtained from the
00:10:20
input query right here let me show you
00:10:25
that we get from the input query right
00:10:27
here
00:10:28
so out of this SMILES notation we
00:10:31
generated 881
00:10:34
columns and then we're going to be
00:10:36
selecting
00:10:37
a subset of 218 from the initial 881
00:10:41
by using this particular list of
00:10:44
descriptors okay
00:10:49
and let's go back to the
00:10:52
jupyter notebook all right
00:10:58
let's save it
00:11:03
and then we're going to be building the
00:11:05
model random forest model
00:11:08
we're setting here the random state to
00:11:10
be 42
00:11:11
the number of estimators to be 500 and
00:11:14
we're using the random forest regressor
00:11:16
and we fit the model here in order to
00:11:18
train it and then we're going to be
00:11:20
calculating the score
00:11:21
which is the r2 score
00:11:25
and then we're assigning it to the r2
00:11:27
variable and then finally
00:11:29
we're going to be applying the trained
00:11:32
model
00:11:32
to make a prediction on the x variable
00:11:35
which is the training set
00:11:36
also and then we're assigning it to the
00:11:39
y pred
00:11:40
variable
00:11:44
okay so here we see that the r squared
00:11:46
value is 0.86
00:11:49
and then let's print out the performance
00:11:53
mean squared error of 0.34 and let's
00:11:55
make the scatter plot
00:11:57
of the actual and predicted values
00:12:01
okay so we get this plot here
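The two performance figures quoted here, R squared and mean squared error, follow from simple formulas. The sketch below computes both in plain Python on made-up numbers; the function names and the sample values are assumptions for illustration and are not the video's data.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical pIC50 values, not taken from the video's data set.
y_actual = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]

mse = mean_squared_error(y_actual, y_pred)
r2 = r_squared(y_actual, y_pred)
print(round(mse, 3), round(r2, 3))  # 0.125 0.9
```

Because the predictions in the video are made on the same data the model was trained on, figures like these describe how well the model fits the training set rather than how it would perform on unseen molecules.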
00:12:04
and then finally we're going to be
00:12:05
saving the model by
00:12:07
dumping it using the pickle function
00:12:09
pickle dot dump
00:12:11
and then as input argument we're going
00:12:12
to have model and then we're going to
00:12:14
save it as
00:12:15
acetylcholinesterase model dot pkl
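Saving and reloading with pickle works the same way for any picklable object, so the round trip can be sketched without training a model. In this minimal example a plain dictionary stands in for the trained random forest, and the file name `example_model.pkl` is a hypothetical placeholder.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable Python object behaves the same.
model = {"n_estimators": 500, "random_state": 42}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "example_model.pkl")  # hypothetical name

    # Save: serialize the object to a binary file.
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)

    # Load: what the web app does before making predictions.
    with open(model_path, "rb") as handle:
        load_model = pickle.load(handle)

print(load_model == model)  # True
```

Unpickling can execute arbitrary code, so a pickled model should only be loaded from a source you trust.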
00:12:20
and there you go we have already saved
00:12:22
the model okay so i'm going to go ahead
00:12:24
and
00:12:24
close this jupyter notebook
00:12:33
and let's hop over back and
00:12:37
let's take a look at the app.py file
00:12:42
okay so let's have a brief look you're
00:12:44
going to see here that
00:12:46
the app.py is less than 90 lines of code
00:12:50
and about 87 to be exact and you're
00:12:53
going to see that there are
00:12:54
some white spaces so even if we delete
00:12:57
all
00:12:57
the white space it might be even less
00:13:00
maybe 80 lines of code
00:13:02
okay so the first seven lines of code
00:13:05
will be
00:13:06
importing the necessary libraries and so
00:13:09
we're making use of streamlit as the web
00:13:11
framework
00:13:12
and we're using pandas in order to
00:13:14
display the data frame
00:13:15
and the image function from the pil
00:13:18
library is used to display this
00:13:21
illustration
00:13:22
and the descriptor calculation will be
00:13:24
made possible by
00:13:25
using the subprocess library so that
00:13:28
will allow us to compute the
00:13:29
PaDEL descriptor via the use of java and
00:13:32
we're
00:13:33
using the os library in order to perform
00:13:36
file handling so here you're going to
00:13:38
see that we're using the os
00:13:40
dot remove in order to remove the
00:13:42
molecule.smi
00:13:43
file so i'm going to explain to you that
00:13:45
in just a moment
00:13:47
base64 will be used for encoding
00:13:49
decoding
00:13:50
of the file when we will make the file
00:13:54
available for download the prediction
00:13:56
results
00:13:56
and the pickle library will be used for
00:13:59
loading up the pickled file
00:14:01
of the model okay and so you're going to
00:14:04
be seeing here that we're
00:14:05
making three custom functions so lines
00:14:09
10
00:14:09
through 15 the first custom function
00:14:12
will be our
00:14:13
molecular descriptor calculator so we're
00:14:16
defining a function called desc_calc
00:14:19
and then the statement underneath it
00:14:21
will be the bash command
00:14:23
and so this bash command is what we're
00:14:25
normally using
00:14:27
when we type into the command line
00:14:30
okay and so this option here will allow
00:14:33
us to run the code in the command line
00:14:35
without
00:14:36
launching a gui version of
00:14:38
PaDEL-Descriptor and so without this
00:14:40
option here it will launch a gui version
00:14:43
but since
00:14:44
we don't want that to happen we're going
00:14:46
to use this option
00:14:48
okay and so we're using the jar file to
00:14:51
make the calculation
00:14:52
of the fingerprints and then you're
00:14:53
going to see here that we have
00:14:55
additional
00:14:56
options such as removing salt
00:14:58
standardizing the nitro group of the
00:15:00
molecule
00:15:01
and then we're using the fingerprint to
00:15:03
be the PubChem fingerprint
00:15:05
using the xml file here and then finally
00:15:08
we're generating the molecular
00:15:10
descriptor file by saving it to the
00:15:12
descriptors underscore output.csv
00:15:15
file and so this bash command will be
00:15:18
serving as input
00:15:19
right here in the
00:15:23
subprocess.Popen function okay
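The command-line call described here can be sketched as an argument list handed to `subprocess.Popen`. This is an illustration under assumptions rather than the app's exact code: the helper names `build_padel_command` and `run_command` and all file paths are hypothetical, and the option spellings follow the PaDEL-Descriptor command-line interface as best recalled, so they should be checked against its own documentation before use.

```python
import subprocess


def build_padel_command(jar_path, descriptor_xml, output_csv, work_dir="./"):
    """Assemble the Java command as a list of arguments (no shell needed)."""
    return [
        "java",
        "-Djava.awt.headless=true",  # run without opening a GUI window
        "-jar", jar_path,
        "-removesalt",               # strip salts from the molecules
        "-standardizenitro",         # standardize nitro groups
        "-fingerprints",
        "-descriptortypes", descriptor_xml,  # selects the fingerprint type
        "-dir", work_dir,            # folder containing the .smi input
        "-file", output_csv,         # where the descriptors are written
    ]


def run_command(command):
    """Run the command and return its exit code and captured output."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    output, _ = process.communicate()
    return process.returncode, output


# Hypothetical paths; adjust to wherever the jar and XML file actually live.
command = build_padel_command(
    "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    "./PaDEL-Descriptor/PubchemFingerprinter.xml",
    "descriptors_output.csv",
)
print(" ".join(command))
```

Passing a list instead of a single string avoids problems with spaces in paths; `run_command(command)` would start the calculation on a machine that has Java and the jar file available.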
00:15:26
and then finally after the descriptor
00:15:28
has been calculated
00:15:30
we're removing the molecule.smi file and
00:15:33
so the molecule.smi file
00:15:35
will be generated in another function so
00:15:37
i will be discussing that in just a
00:15:39
moment
00:15:40
and the second custom function that
00:15:42
we're generating here
00:15:44
is file download so after making the
00:15:46
prediction
00:15:47
we're going to be encoding decoding the
00:15:49
results
00:15:50
and then the output will be available as
00:15:52
a file for downloading
00:15:53
using this link and the third function
00:15:56
that we're
00:15:57
creating is called build model so it
00:16:00
will be accepting
00:16:01
the input argument which is the input
00:16:03
data
00:16:04
and then it will be loading up the
00:16:07
pickle
00:16:08
file which is the built model into a
00:16:11
load model
00:16:12
variable and then the model which we
00:16:14
have loaded
00:16:15
will be used for making a prediction
00:16:18
on the input data which is specified
00:16:21
here
00:16:22
and after a prediction has been made
00:16:24
we're going to be
00:16:25
assigning it to the prediction variable
00:16:27
then we're going to be printing out the
00:16:29
header called prediction outputs which
00:16:32
is right here
00:16:33
and underneath it we're going to create
00:16:36
a variable called prediction output
00:16:38
and we're going to be creating a pd dot
00:16:40
series
00:16:41
so essentially it is a column using
00:16:43
pandas
00:16:44
and so the first column is prediction
00:16:47
and then we're
00:16:48
naming it pIC50 which is here
00:16:53
and then we're going to create another
00:16:55
variable called molecule name
00:16:57
and the column that we're creating is
00:17:00
the ChEMBL ID
00:17:01
or the molecule name which is right here
00:17:05
the first column and then we're going to
00:17:08
be combining these two columns
00:17:11
given by the individual variables
00:17:14
called prediction outputs and molecule
00:17:17
name
00:17:17
so we're using the pd.concat function
00:17:21
and then in bracket we're using molecule
00:17:23
name which is the first column
00:17:25
prediction output which is the second
00:17:26
column and then we're using an axis
00:17:28
equals to one in order to tell it to
00:17:31
combine
00:17:32
the two variables or the two columns
00:17:35
in a side-by-side manner okay so axis
00:17:38
one will allow us to have the two
00:17:39
columns side by side
00:17:41
otherwise it will be stacked underneath
00:17:43
it so the pIC50 column will be stacked
00:17:45
underneath the molecule name if the axis
00:17:48
was to be
00:17:49
zero okay and finally we're writing out
00:17:52
the data frame which is here
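The difference between the two axis settings is easiest to see on a tiny example. The sketch below assumes pandas is installed; the molecule names and pIC50 values are made up for illustration.

```python
import pandas as pd

# Two single-column Series, mirroring the variables named in the video.
molecule_name = pd.Series(["mol_1", "mol_2"], name="molecule_name")
prediction_output = pd.Series([6.1, 5.4], name="pIC50")

# axis=1 places the Series side by side as two columns of one table.
side_by_side = pd.concat([molecule_name, prediction_output], axis=1)

# axis=0 stacks the second Series underneath the first.
stacked = pd.concat([molecule_name, prediction_output], axis=0)

print(side_by_side.shape)  # (2, 2)
print(stacked.shape)       # (4,)
```

With `axis=1` each row pairs a molecule with its prediction, which is the layout needed for the results table.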
00:17:54
and then we're allowing it to generate
00:17:57
the
00:17:57
download link which is right here and
00:18:00
we're
00:18:00
making use of the file download function
00:18:03
described earlier on here
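A download link of this kind can be built by embedding the CSV text in the link itself as base64. The following is a minimal sketch, assuming the results are already available as CSV text; the function signature, file name, and link label are illustrative choices, not necessarily those of the app.

```python
import base64


def filedownload(csv_text, filename="prediction.csv"):
    """Return an HTML anchor whose href embeds csv_text as base64."""
    b64 = base64.b64encode(csv_text.encode()).decode()  # str -> bytes -> str
    return (
        f'<a href="data:file/csv;base64,{b64}" '
        f'download="{filename}">Download Predictions</a>'
    )


csv_text = "molecule_name,pIC50\nmol_1,6.1\n"  # hypothetical results
href = filedownload(csv_text)
print(href)
```

In a Streamlit app, a string like this is typically rendered with `st.markdown(href, unsafe_allow_html=True)` so that the browser treats it as a clickable link.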
00:18:05
okay and then at line number 38
00:18:08
we're generating this or displaying this
00:18:12
image of the web app
00:18:15
okay and lines number 43 until 51
00:18:19
or 52 is the header here
00:18:22
the bioactivity prediction app title and
00:18:24
then the description
00:18:26
of the app and then the credits of the
00:18:28
app and
00:18:29
this is written in markdown language
00:18:32
all right and so let's have a look
00:18:34
further lines 55
00:18:37
until 59 will be displaying the sidebar
00:18:41
right here so 55 will be displaying the
00:18:44
header
00:18:45
number one upload your csv data and then
00:18:48
we're creating a
00:18:49
variable called uploaded file and here
00:18:52
we're using the st.sidebar
00:18:54
dot file uploader and then
00:18:57
as
00:18:57
input argument we're displaying the text
00:19:00
upload your input file which is also
00:19:02
right here
00:19:04
and then the type of the file will be
00:19:06
the txt file so right here
00:19:10
and then we're creating a link using
00:19:12
markdown language
00:19:13
to the example file provided here to the
00:19:16
example acetylcholinesterase
00:19:18
so it's going to be the exact same file
00:19:20
that we have selected
00:19:21
as input okay so that's the sidebar
00:19:27
function that you see here all right and
00:19:29
so let's have a look further
00:19:31
so here you can see that from line 61
00:19:34
until 87
00:19:35
we have the if and else condition so if
00:19:39
we click on the predict button which is
00:19:41
right here
00:19:42
using the st.sidebar dot button function
00:19:45
with input argument of predict so if we
00:19:48
click on it
00:19:49
it will make the descriptor calculation
00:19:51
and
00:19:52
apply the machine learning model to make
00:19:54
a prediction and finally
00:19:56
displaying the results of the prediction
00:19:58
right here
00:19:59
and allow the user to download the
00:20:01
predictions however
00:20:03
if we didn't click anything whereby we
00:20:06
loaded up the web page
00:20:07
from the beginning as i will show you
00:20:09
right now
00:20:10
you will see a blue box displaying the
00:20:12
message of
00:20:14
upload input data in the sidebar to
00:20:16
start
00:20:17
okay so two conditions if the predict
00:20:19
button is clicked
00:20:20
it will make a prediction otherwise it
00:20:23
will just display the text here
00:20:25
that it is waiting for you to upload the
00:20:27
input data
00:20:28
okay so let's have a look under the if
00:20:30
condition
00:20:31
so upon clicking on the predict button
00:20:34
as you have guessed it will load the
00:20:36
data that you had just dragged and dropped
00:20:38
and then it will be saving it as a
00:20:40
molecule.smi file and this
00:20:42
very same file here molecule.smi
00:20:46
will be used by the descriptor calculation
00:20:49
function
00:20:50
that we have discussed earlier on
00:20:52
particularly
00:20:53
the molecule.smi file will be used by
00:20:56
the PaDEL-Descriptor software
00:20:58
for the molecular descriptor calculation
00:21:01
and after
00:21:02
the descriptors have been calculated we
00:21:04
will assign it
00:21:05
as the x variable it's right here
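The step of turning the uploaded file into molecule.smi can be sketched as follows. The layout is an assumption: the upload is taken to hold one molecule per line as a SMILES string and an identifier separated by whitespace, and the .smi file is written with a tab between them. The function name `write_smi` and the sample identifiers are hypothetical.

```python
import os
import tempfile


def write_smi(input_text, smi_path):
    """Write 'SMILES<TAB>ID' lines to smi_path and return the parsed rows."""
    rows = []
    for line in input_text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or incomplete lines
            rows.append((parts[0], parts[1]))
    with open(smi_path, "w") as handle:
        for smiles, molecule_id in rows:
            handle.write(f"{smiles}\t{molecule_id}\n")
    return rows


# Hypothetical upload: ethanol and benzene with made-up identifiers.
uploaded_text = "CCO MOL_1\nc1ccccc1 MOL_2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    smi_path = os.path.join(tmp_dir, "molecule.smi")
    rows = write_smi(uploaded_text, smi_path)
    with open(smi_path) as handle:
        written = handle.read()
    os.remove(smi_path)  # the temporary input is deleted once it has been used

print(rows)
```

Removing the file at the end mirrors the cleanup described for the descriptor function, so a later upload never picks up molecules left over from an earlier run.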
00:21:09
okay so i'm going to tell you in just a
00:21:11
moment line number 65
00:21:13
will be printing out the header right
00:21:15
here
00:21:16
so let me make a prediction first so
00:21:18
that we can see
00:21:20
let's drag and drop the input file
00:21:22
[Music]
00:21:24
press on the predict button it's right
00:21:27
here
00:21:28
original input data
00:21:31
line number 65. line number 66
00:21:34
will be printing out the data frame of
00:21:36
the input file
00:21:38
so you're going to see here two columns
00:21:40
the SMILES notation
00:21:42
which represents the chemical structure
00:21:44
information and the ChEMBL ID column
00:21:46
line number 68 will be displaying a
00:21:50
spinner so upon loading up these
00:21:54
results here by pressing on the predict
00:21:56
button you saw earlier on that there was
00:21:58
a
00:21:59
yellow message box saying calculating
00:22:01
descriptor
00:22:02
and so underneath we have the descriptor
00:22:04
calculation function
00:22:06
and after it is calculated it will be
00:22:08
displaying the following content
00:22:10
the calculated molecular descriptor
00:22:12
which follows here
00:22:14
on line number 72 right here
00:22:17
calculated molecular descriptor and then
00:22:20
it will be reading in
00:22:21
the calculated descriptor from the
00:22:23
descriptors output.csv file
00:22:26
it will be assigning it to the desc
00:22:28
variable
00:22:29
then we're going to be writing out right
00:22:31
here
00:22:32
and showing the data frame of the
00:22:34
descriptors that have been calculated
00:22:36
and then we're going to be printing out
00:22:37
the shape of the descriptor and so we
00:22:40
see here that
00:22:41
it has five rows or five molecules
00:22:45
881 molecular fingerprints
00:22:48
and then in lines number 78 until 82
00:22:51
is going to be the subset of descriptor
00:22:54
that is read
00:22:55
from the previously built model from the
00:22:58
file
00:22:59
descriptor list dot csv and so you can
00:23:02
see here that we're going to create a
00:23:03
variable called x list and then we're
00:23:06
reading in
00:23:07
the columns okay and then we're going to
00:23:09
be
00:23:10
from the initial descriptor of 881
00:23:14
we're going to be selecting a subset
00:23:17
provided in the x list
00:23:18
and then we assign the subset of
00:23:21
descriptor
00:23:22
which is 218 descriptors selected from
00:23:25
the initial
00:23:26
set of 881 and then we assigned that to
00:23:30
the desc
00:23:30
subset variable and then finally we
00:23:33
printed out
00:23:34
as a data frame and we also print out
00:23:36
the dimension as well
00:23:37
so we see here that there are five
00:23:39
molecules and 218
00:23:42
columns or 218 fingerprints
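Selecting the retained descriptors comes down to reading the column names from the saved list and indexing the full descriptor table with them. The sketch below assumes pandas is installed and uses an in-memory stand-in for the descriptor list file; the column names and values are made up, and three descriptors stand in for the full set of 881.

```python
import io

import pandas as pd

# Stand-in for the descriptor list file: only its header row matters here.
descriptor_list_csv = io.StringIO("PubchemFP3,PubchemFP12\n1,0\n")
Xlist = list(pd.read_csv(descriptor_list_csv).columns)

# Stand-in for the freshly calculated descriptors of two uploaded molecules.
desc = pd.DataFrame(
    {
        "Name": ["mol_1", "mol_2"],
        "PubchemFP0": [1, 1],
        "PubchemFP3": [1, 0],
        "PubchemFP12": [0, 1],
    }
)

# Keep only the columns the model was trained on, in the same order.
desc_subset = desc[Xlist]
print(desc_subset.shape)  # (2, 2)
```

Indexing by the saved names keeps the columns in the order the model saw during training, which matters because the model matches features by position.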
00:23:45
and finally we make use of this
00:23:47
calculated
00:23:48
molecular descriptor subset and use it
00:23:51
as an
00:23:51
input argument to the build model
00:23:53
function
00:23:54
and then as i have mentioned earlier on
00:23:56
it will be
00:23:57
building the model and then finally it
00:23:59
will be displaying the model
00:24:01
prediction result right here so users
00:24:04
can download it
00:24:05
into their own computer if you're
00:24:07
finding value in this video
00:24:09
please help us out by smashing the like
00:24:11
button subscribing if you haven't
00:24:13
already
00:24:14
and make sure to hit on the notification
00:24:16
bell so that you'll be notified
00:24:18
of the next video and as always the best
00:24:21
way to learn data science
00:24:22
is to do data science and please enjoy
00:24:25
the journey