In prior videos of the Bioinformatics From Scratch series, you have learned how to compile your very own bioactivity data sets directly from the ChEMBL database, how to perform exploratory data analysis on the computed Lipinski descriptors, how to build a random forest model, and also how to build several machine learning models and compare their performance using the lazypredict library. In this video, we will take a look at how we can take that machine learning model of the bioactivity data set and convert it into a web application that you can deploy to the cloud, allowing users to make predictions with your machine learning model for the target protein of your interest. And so, without further ado, we're starting right now.
Okay, so the first thing you want to do is go to the bioactivity prediction app folder, which is provided in the GitHub link in the video description. Before we start, let me show you what the app looks like. I'm going to activate my conda environment, and please make sure to activate your own conda environment as well. On my computer I'm using the data-professor environment, so I'm going to activate it by typing in conda activate followed by the environment name. Then I'm going to go to the Desktop, because that is where the streamlit folder resides, and then into the bioactivity folder. Let's have a look at the contents: app.py is the application, so we're going to type in streamlit run app.py in order to launch this bioactivity prediction app.
Okay, so this is the bioactivity prediction app that I'm going to show you how to build today. Let's have a look at the example input file. So this is the example input file, and in order to use the app we have to upload it: drag and drop it right here, or browse and select the input file. While waiting for an input file to be uploaded, you can see here that the blue box gives us a waiting message, saying "Upload input data in the sidebar to start".
Essentially, the input file contains the SMILES notation and the ChEMBL ID. You can think of the ChEMBL ID as kind of like the name of the molecule; in particular, it is a unique identification number that the ChEMBL database has assigned to this particular molecule. The SMILES notation here is a one-dimensional representation of this particular chemical structure. This SMILES notation will be used by the PaDEL-Descriptor software, which we're using in the app today, in order to generate molecular fingerprints that describe the unique chemical features of the molecule, and those molecular fingerprints will then be used by the machine learning model to make a prediction.
The prediction will be the pIC50 values that you see here, and the pIC50 value is the bioactivity against the target protein of interest. In this application, the target protein is acetylcholinesterase, which is a drug target for Alzheimer's disease. This app is built in Python using the Streamlit library, and the molecular fingerprints are calculated using PaDEL-Descriptor.
Back in 2016 we published a paper describing the development of a QSAR model for predicting bioactivity against acetylcholinesterase, so if you're interested in this article, please feel free to read it; I'm going to provide the link in the video description as well.
Okay, so let's drag and drop the input file, the example acetylcholinesterase file, right here, and then in order to initiate the prediction I'm going to press the Predict button.
As you can see here, the app displays the input file as this data frame, then it calculates the descriptors, and the calculated descriptors are provided here in this particular data frame. You're going to see that there are a total of five input molecules and 882 columns, and that the first column is the ChEMBL ID, so in reality you have a total of 881 molecular fingerprints. The molecular fingerprint we're using today is the PubChem fingerprint. Because we have previously built a machine learning model, which I will be showing you using this Jupyter notebook file, we had reduced the number of descriptors from 881 to 218 columns (after first deleting the ChEMBL ID name column). And so, in the code we're going to be selecting the same 218 columns that you see here, which correspond to the descriptor subset from the initial full set of 881. We're going to use the 218 as the X variables in order to predict the pIC50. Finally, we have the prediction output in the last data frame here, along with the corresponding ChEMBL IDs, and we can also download the predictions by pressing this link.
And then the predictions are provided here in the CSV file; the data is provided here. All right, and so let's get started, shall we?
Okay, so we have to first build our prediction model using the Jupyter notebook, and then we're going to save the model as a pickle file right here. Let me show you; it will take just a moment. Let me open up a new terminal. I have to first activate the conda environment, the same data-professor environment, and then launch Jupyter by typing in jupyter notebook. All right, there you go, and then I'm going to open up the Jupyter notebook.
All right, here we go. This notebook was actually adapted from one of the prior tutorials in this Bioinformatics From Scratch series. Essentially, we're going to download the calculated fingerprints from the data-professor GitHub using this URL link. We're importing pandas as pd, then downloading and reading the data in using pandas, and the resulting data frame looks like this. You're going to see here that the last column is pIC50, and we have 881 columns for the PubChem fingerprint. In the next cell, we drop the last column, the pIC50 column, in order to assign the rest to the X variable, and then we select just the last column (denoted here by -1) and assign it to the Y variable.
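That X/Y split can be sketched like this (a tiny hypothetical frame stands in for the real 881-fingerprint data set):

```python
import pandas as pd

# Tiny stand-in for the real data set: fingerprint columns plus pIC50 at the end
df = pd.DataFrame({
    "PubchemFP0": [0, 1, 0],
    "PubchemFP1": [1, 1, 0],
    "pIC50": [6.5, 5.2, 7.1],
})

X = df.drop("pIC50", axis=1)  # everything except the pIC50 column
Y = df.iloc[:, -1]            # the last column (index -1), the pIC50 values
```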
And so, now that we have X and Y separated, we're going to next remove the low-variance features from the X variable. Initially we have 881 features, and applying a variance threshold of 0.1 results in 218 columns. We then save the names of the retained columns into a descriptor_list.csv file; let me show you that descriptor_list.csv file.
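A minimal sketch of that feature-pruning step, assuming scikit-learn's VarianceThreshold and a toy binary matrix in place of the real 881 fingerprints:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
# Toy binary fingerprint matrix standing in for the 881 PubChem columns
X = pd.DataFrame(rng.integers(0, 2, size=(100, 10)),
                 columns=[f"PubchemFP{i}" for i in range(10)])
X["PubchemFP_const"] = 0  # a zero-variance column that should be dropped

selection = VarianceThreshold(threshold=0.1)  # the 0.1 threshold from the video
selection.fit(X)
X_reduced = X.loc[:, selection.get_support()]

# Persist the retained column names (first row of the CSV) so the app
# can later select the same subset from freshly computed descriptors
pd.DataFrame(columns=X_reduced.columns).to_csv("descriptor_list.csv", index=False)
```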
Okay, you're going to see here that the first row contains the names of the fingerprints that are retained, in other words the names of the descriptors in the 218 columns. Here you can see that PubChem fingerprints 0, 1, and 2 have been removed and we keep fingerprint 3; fingerprints 4 through 11 have been removed, fingerprint 14 has been removed, and fingerprint 17 has also been removed. So more than 600 fingerprints have been deleted from the X variable, and the removal of these excessive, redundant features will allow us to build the model much quicker.
In just a few moments I will be telling you how we make use of this descriptor list in order to select the subset of the descriptors computed from the input query right here. Let me show you what we get from the input query: out of this SMILES notation we generate 881 columns, and then we select a subset of 218 from the initial 881 by using this particular list of descriptors.
Let's go back to the Jupyter notebook, and let's save it. Then we're going to be building the model, a random forest model. We set the random state to 42 and the number of estimators to 500, using RandomForestRegressor. We fit the model here in order to train it, then calculate the score, which is the R² score, and assign it to the r2 variable. Finally, we apply the trained model to make predictions on the X variable (which is also the training set) and assign the result to the Y_pred variable. Okay, so here we see that the R² value is 0.86; then let's print out the performance, a mean squared error of 0.34, and make the scatter plot of the actual versus predicted values. Okay, so we get this plot here. Finally, we save the model by dumping it with the pickle function, pickle.dump, passing the model as the input argument and saving it as the acetylcholinesterase model .pkl file. And there you go, we have already saved the model. So I'm going to go ahead and close this Jupyter notebook.
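The notebook's model-building steps can be sketched roughly like this (synthetic data stands in for the 218 fingerprint columns; the filename mirrors the acetylcholinesterase model saved in the video):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.random((50, 5))   # stand-in for the 218 descriptor columns
Y = X @ rng.random(5)     # synthetic pIC50-like target

model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X, Y)                          # train on the full training set
r2 = model.score(X, Y)                   # R^2 on the training data
Y_pred = model.predict(X)
mse = mean_squared_error(Y, Y_pred)

# Serialize the trained model so the Streamlit app can load it later
with open("acetylcholinesterase_model.pkl", "wb") as f:
    pickle.dump(model, f)
```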
Let's head back over and take a look at the app.py file.
Okay, so let's have a brief look. You're going to see here that app.py is less than 90 lines of code, about 87 to be exact, and there are some blank lines, so if we deleted all the white space it might be even less, maybe 80 lines of code.
Okay, so the first seven lines of code import the necessary libraries. We're making use of streamlit as the web framework; pandas in order to display the data frames; and the Image function from the PIL library to display this illustration. The descriptor calculation is made possible by the subprocess library, which allows us to compute the PaDEL descriptors via Java. We're using the os library for file handling: here you're going to see that we use os.remove to delete the molecule.smi file, which I'm going to explain in just a moment. base64 will be used for encoding and decoding the prediction results file when we make it available for download, and the pickle library will be used for loading the pickled model file.

You're going to see here that we define three custom functions. Lines 10 through 15 define the first custom function, our molecular descriptor calculator: we define a function called desc_calc, and the statement underneath it is the bash command that we would normally type into the command line. The headless option here allows us to run the code on the command line without launching the GUI version of PaDEL-Descriptor; without this option it would launch the GUI, and since we don't want that to happen, we use this option. We use the jar file to calculate the fingerprints, and you're going to see additional options such as removing salts and standardizing the nitro groups of the molecules; we also specify the fingerprint to be the PubChem fingerprint using the XML file here. Finally, we generate the molecular descriptor file by saving it to descriptors_output.csv. This bash command serves as the input right here to the subprocess.Popen function. And then, after the descriptors have been calculated, we remove the molecule.smi file; the molecule.smi file is generated in another function, which I will be discussing in just a moment.
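Putting those pieces together, the descriptor-calculation function looks roughly like this. This is a sketch: the exact jar and XML paths are assumptions based on what's shown on screen, and Java plus the PaDEL-Descriptor files must be present for it to actually run.

```python
import os
import subprocess

# The headless PaDEL-Descriptor command described above; the paths are
# assumed relative to the app folder
bashCommand = (
    "java -Xms2G -Xmx2G -Djava.awt.headless=true "
    "-jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar "
    "-removesalt -standardizenitro -fingerprints "
    "-descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml "
    "-dir ./ -file descriptors_output.csv"
)

def desc_calc():
    """Run PaDEL-Descriptor on molecule.smi, then delete the temporary file."""
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove("molecule.smi")
```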
The second custom function that we're creating here is filedownload: after making the prediction, we encode and decode the results, and the output becomes available as a file for download using this link.
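The encoding step can be sketched like this (the link text and output filename are assumptions; the helper itself needs only pandas and base64, and the app would pass the returned HTML to Streamlit for rendering):

```python
import base64
import pandas as pd

def filedownload(df):
    """Embed a DataFrame as a base64-encoded CSV inside an HTML download link."""
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()  # str -> bytes -> base64 str
    return (f'<a href="data:file/csv;base64,{b64}" '
            f'download="prediction.csv">Download Predictions</a>')

link = filedownload(pd.DataFrame({"molecule_name": ["CHEMBL1"], "pIC50": [6.4]}))
```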
The third function that we're creating is called build_model. It accepts the input data as its argument and loads the pickled file, which is the built model, into a load_model variable; the model which we have loaded is then used for making a prediction on the input data specified here.
After a prediction has been made, we assign it to the prediction variable, then print out the header called "Prediction output", which is right here. Underneath it, we create a variable called prediction_output as a pd.Series, so essentially a column built using pandas, holding the predictions and named pIC50, which is here. Then we create another variable called molecule_name, whose column is the ChEMBL ID, or the molecule name, which is right here in the first column. Then we combine these two columns, given by the individual variables prediction_output and molecule_name, using the pd.concat function: in the brackets we pass molecule_name as the first column and prediction_output as the second, and we use axis=1 in order to tell it to combine the two variables, or the two columns, in a side-by-side manner. So axis=1 gives us the two columns side by side; otherwise, with axis=0, the pIC50 column would be stacked underneath the molecule names. Finally, we write out the data frame, which is here, and generate the download link, which is right here, making use of the filedownload function described earlier.
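The Series-and-concat mechanics just described can be sketched in isolation (hypothetical values stand in for real model output):

```python
import pandas as pd

# Suppose these values came back from load_model.predict(...)
prediction = [6.4, 5.9, 7.1]
prediction_output = pd.Series(prediction, name="pIC50")
molecule_name = pd.Series(["CHEMBL1", "CHEMBL2", "CHEMBL3"], name="molecule_name")

# axis=1 places the two Series side by side as columns;
# axis=0 would instead stack the pIC50 values underneath the names
df = pd.concat([molecule_name, prediction_output], axis=1)
```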
Then, on line 38, we display this image of the web app. Lines 43 until 52 are the header here: the bioactivity prediction app title, the description of the app, and the credits of the app, all written in Markdown.
All right, so let's have a look further. Lines 55 until 59 display the sidebar right here: line 55 displays the header "1. Upload your CSV data", and then we create a variable called uploaded_file using st.sidebar.file_uploader, with the display text "Upload your input file", which is also right here, as an input argument, and the file type set to txt, so right here. Then we create a link in Markdown to the example acetylcholinesterase file provided here, which is the exact same file that we selected as input. So that's the sidebar that you see here.

All right, so let's look further. Here you can see that from line 61 until line 87 we have the if/else condition. If we click on the Predict button, which is right here, created using the st.sidebar.button function with the input argument "Predict", the app will run the descriptor calculation, apply the machine learning model to make a prediction, display the prediction results right here, and allow the user to download the predictions. However, if we haven't clicked anything, as when the web page has just been loaded from the beginning, as I will show you right now, you will see a blue box displaying the message "Upload input data in the sidebar to start". Okay, so two conditions: if the Predict button is clicked, it makes a prediction; otherwise it just displays the text here saying that it is waiting for you to upload the input data.
Okay, so let's have a look under the if condition. Upon clicking the Predict button, as you might have guessed, it loads the data that you had just dragged and dropped, and then it saves it as a molecule.smi file. This very same molecule.smi file is used by the desc_calc function that we discussed earlier; particularly, the molecule.smi file is used by the PaDEL-Descriptor software for the molecular descriptor calculation. After the descriptors have been calculated, we assign them to the X variable, right here.
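That re-saving step can be sketched as follows (a one-row hypothetical table stands in for the uploaded file; the tab-separated, header-free layout is an assumption about the format PaDEL expects):

```python
import pandas as pd

# Hypothetical uploaded data: one SMILES string plus its ChEMBL ID
load_data = pd.DataFrame([["CC(=O)OC1=CC=CC=C1C(=O)O", "CHEMBL25"]])

# PaDEL-Descriptor reads its input from molecule.smi, so write the table
# under that name, tab-separated, with no header or index
load_data.to_csv("molecule.smi", sep="\t", header=False, index=False)
```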
Okay, I'm going to tell you more in just a moment. Line 65 prints out the header right here; let me make a prediction first so that we can see it. Let's drag and drop the input file and press the Predict button, it's right here. "Original input data": that's line 65. Line 66 prints out the data frame of the input file, so you're going to see here two columns: the SMILES notation, which represents the chemical structure information, and the ChEMBL ID column.
Line 68 displays a spinner: when loading up these results by pressing the Predict button, you saw earlier that there was a yellow message box saying "Calculating descriptors". Underneath, we have the desc_calc function, and after the calculation it displays the following content, the calculated molecular descriptors, which follows here on line 72: "Calculated molecular descriptors". It then reads the calculated descriptors in from the descriptors_output.csv file and assigns them to the desc variable. Then we write it out right here, showing the data frame of the descriptors that have been calculated, and we also print out the shape of the descriptors, and so we see here that it has five rows, or five molecules, and 881 molecular fingerprints.
Then, lines 78 until 82 take the subset of descriptors that is read from the previously built model's file, descriptor_list.csv. You can see here that we create a variable called Xlist and read in the column names, and then, from the initial 881 descriptors, we select the subset given in the Xlist, assigning the subset of 218 descriptors, selected from the initial set of 881, to the desc_subset variable. Finally, we print it out as a data frame and print out the dimensions as well, so we see here that there are five molecules and 218 columns, or 218 fingerprints.
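The subset selection can be sketched with a toy table (two fingerprint columns standing in for the real 881):

```python
import pandas as pd

# Toy descriptor table as PaDEL would produce: a Name column plus fingerprints
desc = pd.DataFrame({
    "Name": ["CHEMBL1", "CHEMBL2"],
    "PubchemFP3": [1, 0],
    "PubchemFP12": [0, 1],
})

# Normally: Xlist = list(pd.read_csv("descriptor_list.csv").columns)
Xlist = ["PubchemFP3", "PubchemFP12"]  # columns retained at training time
desc_subset = desc[Xlist]              # same columns the model was trained on
```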
And finally, we make use of this calculated molecular descriptor subset as the input argument to the build_model function, which, as I have mentioned earlier, loads the model and then displays the model prediction results right here, so users can download them to their own computer. If you're finding value in this video, please help us out by smashing the like button, subscribing if you haven't already, and hitting the notification bell so that you'll be notified of the next video. And as always, the best way to learn data science is to do data science, and please enjoy the journey.