Now, say this with me: I will not illegally scrape any website with the things I learn in this video. I had a client who asked me to build an AI chatbot for his WhatsApp Business based on his e-commerce products. It was all fine until I asked him for access to the products database. He stared at me for a few seconds and said, "Well, I used a shared hosting platform." The problem with this shared hosting platform is that it blocks remote MySQL access, and while I could whitelist the server's IP, there were more issues behind the data I needed, and it was all too complicated. Scraping would be the way to go. The client's website had some default bot blockers that made it a bit more difficult to scrape, and because of that, in this video I'd like to show you ways you can scrape data from a website while bypassing some anti-bot systems, just in case your client's data is a mess despite the front end being neat. And since this channel is focused on AI building, we'll also use Crawl4AI to fetch the data, structured beautifully, using a local DeepSeek model.
I get why the majority of people would search for a video like this: either they want to scrape social media or a website that requires login, and I really don't want to incentivize that. So what I did was create my own website, and it has a lot of features that would prevent a bot from scraping it. If you're a code bro looking at this, you might be thinking, "Well, this isn't the ideal way of preventing scraping; for rate limiting, instead of using Redis you're using the server's local memory," and so on, but you get the point: this is really just to test how the system would work. Inside this website I implemented five different ways to prevent scraping. The first was reCAPTCHA: it's placed so that this text and the dynamic table will only show after the reCAPTCHA has been validated. Some scrapers just stay in headless mode and don't configure the user agent, and if you don't do that, your fetch will include something like "Headless" inside of it, or maybe "bot", depending on which framework you're using to perform the fetch. Many times a website will identify you by that inside the user agent, so I did that too: it redirects those users to the block page.
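As a rough sketch of what that server-side check can look like (the token list is my assumption, not the exact list the demo site uses):

```python
def is_bot_user_agent(user_agent: str) -> bool:
    """Flag user agents that leak automation hints like 'HeadlessChrome'."""
    suspicious = ("headlesschrome", "headless", "bot", "spider", "crawl")
    ua = user_agent.lower()
    return any(token in ua for token in suspicious)

# A default headless Chrome fetch advertises itself in the user agent:
print(is_bot_user_agent(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"))  # True
```

A server doing this just routes any `True` hit to the block page.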
Then, as you can also see here, I'm restricting by geolocation, so you won't be able to access the website from the United Kingdom; you'll get sent to the blocked-country route. I've also placed a rate limiter. This isn't using Redis, as I said; it's just a local way to identify whether, in a span of 10 minutes, the user accessed the website multiple times, and after the user accesses the website for the fifth time, it restricts access. We can test that right now just by refreshing the website. Yeah, the United Kingdom is blocked, and why is that? It's because I'm connected through a VPN to the United Kingdom. If I deactivate that and try to access the secure path again, it opens up just fine. Then F5 again: opened up. Third time, fourth time, fifth: it opens, and now it should block my access. Yeah. So that's a simple rate limiter. Obviously, a real website wouldn't block your access after five fetches in 10 minutes; it would probably block you after something like the hundredth request in a minute, and ideally you wouldn't want to do that anyway, unless you're trying to DDoS the website. So again, you promised me at the beginning: you won't use this to do anything illegal, right?
Now, let's open the actual scrapers. I have a simple scraper, and then the Scraper on Steroids, which is using Crawl4AI along with a pretty cool proxy. For our simple scraper we're using Puppeteer, and as I said in a previous video, that should be enough for 90% of cases; you can scrape a website just using Puppeteer, Selenium, or BeautifulSoup. Just with Puppeteer you can get past two of the problems we had earlier. One of them is the user agent, and the way we bypass that is by simulating an actual real user, which you can do easily with just this line of code. Also, in this line we're disabling the automation flags. These automation flags normally come with the scraper, and they kind of tell the website it's a bot accessing it, but with this simple line you can deactivate that.
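I won't reproduce the exact Puppeteer lines here, but the two ideas boil down to an options set like this sketch. The user-agent string is just an example, and `--disable-blink-features=AutomationControlled` is the Chromium flag commonly used to hide the automation hint:

```python
def stealth_launch_options(user_agent: str) -> dict:
    """The two knobs from above, expressed as options you'd feed to a
    Puppeteer/Playwright-style launcher: a real-browser user agent plus
    the Chromium automation hint disabled."""
    return {
        "headless": True,
        # suppresses navigator.webdriver-style automation signals in Chromium
        "args": ["--disable-blink-features=AutomationControlled"],
        "user_agent": user_agent,
    }

opts = stealth_launch_options(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
```

With those two set, the fetch no longer advertises "HeadlessChrome" and no longer raises the automation flags the site checks for.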
As for reCAPTCHA, what it tries to do is find an actual user inside the page, and a bot moves differently from a human being. If my mouse is right here and I move it straight down to this line, it won't be a perfectly straight movement, whereas a bot could go directly from this coordinate down to this one in a straight line. Since humans aren't able to do that exactly, the system identifies that it's not a human there. Puppeteer can handle some of that by default, or you can just build a random function that moves the cursor around randomly, and then maybe the reCAPTCHA won't detect you.
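A minimal sketch of that random-movement idea: generate a jittered, human-ish path between two points instead of a straight line (the step count and jitter amount are arbitrary choices):

```python
import random

def humanish_path(start, end, steps=25, jitter=4.0):
    """Interpolate from start to end, adding random noise at each step
    so the trajectory is not a perfectly straight line."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = x0 + (x1 - x0) * t + random.uniform(-jitter, jitter)
        y = y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)
        points.append((x, y))
    points[0], points[-1] = start, end  # pin the endpoints exactly
    return points

# Feed each point to your driver's mouse-move call with small delays.
path = humanish_path((0, 0), (100, 100))
```

Each point would then be passed to something like Puppeteer's `page.mouse.move(x, y)` with short pauses in between.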
So at this point we've bypassed a lot of things just by using Puppeteer; out of the box it does a lot, and along with that we can implement things like this that help guarantee we can access the website. But for geo-blocking and rate limiting, which involve our IP, there's no better way to get past them than using a proxy. I understand that sometimes I receive comments like these, but for problems like these, companies like Iami really come in handy. Back in the story from the beginning of the video, if I had used this residential proxy from Iami, I calculated that fetching every 10 minutes, every day, every week of the month, I would have at most about $2 of expenses per month. Along with all the features a proxy like Iami can provide, you even have the security of not accessing the website with your own IP, and you'll notice that really helps you avoid being blocked by that site on your own IP and being unable to access it even through your browser. Heading over to Iami's documentation, you'll see that we can integrate it with what we're using right now, which is Puppeteer, as well as BeautifulSoup, Playwright, and Selenium, and we'll also integrate this proxy with Crawl4AI in just a bit.
Integrating with a proxy is something really simple, but Iami's dashboard just makes it really intuitive. You can set the location settings right here; let's select Brazil for now. Down here is the most important step to guarantee that you don't get picked up by rate limiting. Earlier today there was a comment on one of my videos in which the person got a rate-limiting error, and that would not have happened if they were using this. Down here in the expert settings you'll have an ad-block toggle, and you'd want to toggle it on, especially when you're feeding the data to an LLM. I don't know if you've ever used speedtest.net, probably to test your internet connection; this is how using a proxy would make it look.
Now let's move all the way down to the mode. The quality mode is what guarantees that you avoid CAPTCHAs and can access websites with strict anti-bot measures, because honestly, what I did here was just a rookie way of trying to block bots; there are a lot of different apps and tools that people use to try to block them.
So what this code is going to do is run concurrent requests: it's going to fire seven requests at that website, first without a proxy, and you'll see that it's able to bypass the CAPTCHA as well as the user-agent blocking, but beyond the fifth request it will start getting blocked. With Iami's proxy it won't get blocked, and it will also show the IPs coming from the country we selected.
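The concurrent test harness is roughly this shape. To keep the sketch self-contained I stub the HTTP call with the demo site's five-per-window limit; in the real script each task would do an actual fetch, with or without the proxy:

```python
import asyncio

MAX_REQUESTS = 5  # mirrors the demo site's five-per-window limit
_counter = {"served": 0}

async def fetch(i: int) -> str:
    """Stub standing in for an HTTP GET; a real run would use aiohttp or httpx."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    _counter["served"] += 1
    return "200 OK" if _counter["served"] <= MAX_REQUESTS else "429 Too Many Requests"

async def run_test(n: int = 7) -> list[str]:
    # fire all n requests concurrently, like the script in the video
    return await asyncio.gather(*(fetch(i) for i in range(n)))

results = asyncio.run(run_test())
print(results)
```

Without the proxy, five requests come back fine and the last two hit the limiter; with per-request proxy rotation, each request arrives from a different IP, so the counter never fills up for any single one.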
Yeah, so I'm running this now; first it's testing without a proxy. Yeah, I exceeded the rate limit, probably because I had already been visiting the site some minutes ago. Now, with the proxy enabled and the selected country being Brazil, you see that it goes through a different IP for each call and then successfully brings back all the information. If I change my IP over to the United States, start my VPN, and then run this again, you'll see that the run without the proxy might fetch, actually not might, it certainly will fetch five requests successfully, but then two of them, as you can see here, will be blocked because the rate limit was exceeded.
Now here's the thing: I want to scrape this table and get it back in a structured way. The trick is that if I F5 this, you'll see that one HTML tag here is figure, then address, then figcaption. If I refresh, those HTML tags are now summary, output, and details, and if we F5 again, you'll see that we have a div. So every single time, the HTML tags change. I acknowledge that we could try to scrape it from the styling or some other attribute, or from the position, since we know there's only this one table in here, and we wouldn't necessarily need to use an LLM. But there are cases where I wouldn't even know in advance that I wanted to get to this exact website; the crawler just found it and wanted to scrape it. In those cases you would not know the structure.
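For this particular trick, one non-LLM fallback worth knowing is tag-agnostic parsing: walk the document and collect the text while ignoring the rotating tag names. A stdlib sketch (the sample rows are made up):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects visible text regardless of which tag names wrap it."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> list[str]:
    parser = TextCollector()
    parser.feed(html)
    return parser.chunks

# The same cell text survives no matter which tags the site rotates in:
variant_a = "<figure><address>Blue Mug</address><figcaption>$9.90</figcaption></figure>"
variant_b = "<details><summary>Blue Mug</summary><output>$9.90</output></details>"
assert extract_text(variant_a) == extract_text(variant_b) == ["Blue Mug", "$9.90"]
```

It only works when you already know what the page contains, though; for pages a crawler discovered blind, the LLM approach below the fold is what saves you.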
If you want more details on this, please check out my video above. Also, if you try out this code, you have to get this string right here from Iami, go over to the environment variables, find PROXY_URL, and just place it right there; the code will identify your user and your password and proceed with the proxy.
For my Scraper on Steroids I did it a bit differently. The only real difference is that I take the string and grab the initial part of it, which is the server; you can also find this information right over here. So, username, password, hostname, port: you can get all of that up there. Just place all of that information inside this proxy config dictionary; it gets passed to the browser config, which is then passed to the web crawler, and then you avoid being rate limited.
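Parsing the proxy string and shaping that dictionary can look like this. I'm assuming a standard `user:pass@host:port` proxy URL and the `server`/`username`/`password` keys that Crawl4AI's proxy config accepts; the credentials below are hypothetical placeholders:

```python
from urllib.parse import urlparse

def build_proxy_config(proxy_url: str) -> dict:
    """Split a user:pass@host:port proxy URL into the dict the
    browser config expects."""
    parsed = urlparse(proxy_url)
    return {
        "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
        "username": parsed.username,
        "password": parsed.password,
    }

# Hypothetical credentials, the same shape a provider dashboard gives you:
cfg = build_proxy_config("http://myuser:mypass@proxy.example.com:8080")
# cfg then flows into the browser config, e.g.
# BrowserConfig(proxy_config=cfg), which the crawler uses for every fetch.
```

Keeping the raw URL in an environment variable and deriving the dict from it means you never hard-code credentials in the scraper itself.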
So, proceeding on, let's just run python main.py and let Crawl4AI do its job. This is what it brought back to us, and if you copy the exact message it retrieved, open a test JSON, place that in there, and then format the code, that's it: it's completely structured. You got it despite the page having some strange HTML tags; you could scrape this. You might be questioning, "Well, I had some expenses, because I'm using the OpenAI API key here," and that's where Ollama comes in.
To use DeepSeek locally, you'll just have to head over to Ollama, download it, then head over to either your CMD or your terminal, type in ollama pull deepseek-r1:14b, and wait for it to be installed. You'll notice that we change the provider to ollama/deepseek-r1:14b and we've removed the API token, since we're running a local model.
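The switch amounts to something like this in the extraction settings. I'm showing the general shape with litellm-style provider strings, not the file from the video verbatim, and the cloud model name and key are placeholders:

```python
# Provider settings for the extraction step: cloud model vs. local model.
openai_settings = {
    "provider": "openai/gpt-4o-mini",  # example cloud model; billed per token
    "api_token": "sk-...",             # placeholder, not a real key
}

ollama_settings = {
    "provider": "ollama/deepseek-r1:14b",  # served locally by Ollama
    "api_token": None,                     # no key needed for a local model
}
```

Swapping one dict for the other is the whole cost optimization: the extraction prompt and schema stay the same.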
While that is downloading, let me give you some insight into how you can crawl or scrape a website that needs credentials. The way websites identify whether you're logged in is through cookies, and you can find your cookies in Inspect: go over to Application and you'll find a bunch of cookies right here; these are for this particular website. In there you'll find some strings that you can pass over to your code, whether you're using Puppeteer, Selenium, or Crawl4AI, and it will interpret that session and continue logged in while it's scraping. So it kind of simulates that you're already logged in, because it has the session from the login you performed previously.
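One simple way to hand those DevTools cookies to a scraper is to turn them into a Cookie request header. The cookie names and values here are made up; real session cookie names depend on the site:

```python
def cookie_header(cookies: dict[str, str]) -> str:
    """Serialize DevTools-copied cookies into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

# Hypothetical session cookies copied from Inspect -> Application:
session = {"sessionid": "abc123", "csrftoken": "xyz789"}
headers = {"Cookie": cookie_header(session)}
# Pass `headers` along with your fetch, or use your framework's own cookie
# API (e.g. page.setCookie in Puppeteer) with the same name/value pairs.
```

Either route makes the server see the session from your earlier login, so the scraper stays logged in without automating the login form itself.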
So that's done; it already pulled deepseek-r1:14b, we have that configured in there, so let's run our code. This might take up to a minute, and for real, I wouldn't recommend using this in production; Ollama with deepseek-r1:14b doesn't have a high capacity for solving even pretty easy things, and it might take a while depending on your GPU. Okay, it's done, and this is what it brought back, so let's copy that and send it over to our test JSON. It actually got even more than expected: if we head over to the website, you'll see that we have all of this scattered around the website, and it got everything. I suppose that's more because of my prompting; the weaker the LLM is, the more specific you need to be, so it just comes down to optimizing this prompt, and then it would probably have gotten it right. That is it for today. If you have any questions, please let me know in the comment section, and I'll see you in the next video. Till then!