Now, say this with me: I will not illegally scrape any website with the things I learn in this video. I had a client who asked me to build an AI chatbot for his WhatsApp Business based on his e-commerce products. It was all fine until I asked him for access to the products database. He stared at me for a few seconds and said, "Well, I used a shared hosting platform." The problem with this shared hosting platform is that it blocks remote MySQL access, and while I could whitelist the server's IP, there were more issues behind the data I needed, and it was all too complicated. Scraping would be the way to go. The client's website had some default bot blockers that made it a bit more difficult to scrape, and because of that, in this video I'd like to show you ways you can scrape data from a website while bypassing some anti-bot systems, just in case your client's data is a mess despite the front end being neat. And since this channel is focused on AI building, we'll also use Crawl4AI to fetch the data, structured beautifully, using a local DeepSeek model.
I get why the majority of people would search for a video like this: either they want to scrape social media or a website that requires login, and I really don't want to incentivize that. So what I did was create my own website, and it has a lot of features that would prevent a bot from scraping it. If you're a code bro looking at this, you might be thinking, "Well, this isn't the ideal way of preventing scraping; for rate limiting, instead of using Redis you're using the server's local memory," and so on, but you get the point: this is really just to test how the system would work. Inside this website I implemented five different ways to prevent scraping. The first was reCAPTCHA: it's placed so that this text and the dynamic table will only show after the reCAPTCHA has been validated. Some scrapers just stay in headless mode and don't configure the user agent, and if you don't do that, your fetch will include something like "Headless" inside of it, or maybe "bot", depending on which framework you're using to perform the fetch. Many times a website will identify you by that inside the user agent, so I did that too: it redirects those users to the block page.
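As a rough sketch of what that server-side check can look like (the token list is my assumption, not the exact list the demo site uses):

```python
def is_bot_user_agent(user_agent: str) -> bool:
    """Flag user agents that leak automation hints like 'HeadlessChrome'."""
    suspicious = ("headlesschrome", "headless", "bot", "spider", "crawl")
    ua = user_agent.lower()
    return any(token in ua for token in suspicious)

# A default headless Chrome fetch advertises itself in the user agent:
print(is_bot_user_agent(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"))  # True
```

A server doing this just routes any `True` hit to the block page.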
Then, as you can also see here, I'm restricting by geolocation, so you won't be able to access the website from the United Kingdom; you'll get sent to the blocked-country route. I've also placed a rate limiter. This isn't using Redis, as I said; it's just a local way to identify whether, in a span of 10 minutes, the user accessed the website multiple times, and after the user accesses the website for the fifth time, it restricts access. We can test that right now just by refreshing the website. Yeah, the United Kingdom is blocked, and why is that? It's because I'm connected through a VPN to the United Kingdom. If I deactivate that and try to access the secure path again, it opens up just fine. Then F5 again: opened up. Third time, fourth time, fifth: it opens, and now it should block my access. Yeah. So that's a simple rate limiter. Obviously, a real website wouldn't block your access after five fetches in 10 minutes; it would probably block you after something like the hundredth request in a minute, and ideally you wouldn't want to do that anyway, unless you're trying to DDoS the website. So again, you promised me at the beginning: you won't use this to do anything illegal, right?
Now, let's open the actual scrapers. I have a simple scraper, and then the Scraper on Steroids, which is using Crawl4AI along with a pretty cool proxy. For our simple scraper we're using Puppeteer, and as I said in a previous video, that should be enough for 90% of cases; you can scrape a website just using Puppeteer, Selenium, or BeautifulSoup. Just with Puppeteer you can get past two of the problems we had earlier. One of them is the user agent, and the way we bypass that is by simulating an actual real user, which you can do easily with just this line of code. Also, in this line we're disabling the automation flags. These automation flags normally come with the scraper, and they kind of tell the website it's a bot accessing it, but with this simple line you can deactivate that.
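I won't reproduce the exact Puppeteer lines here, but the two ideas boil down to an options set like this sketch. The user-agent string is just an example, and `--disable-blink-features=AutomationControlled` is the Chromium flag commonly used to hide the automation hint:

```python
def stealth_launch_options(user_agent: str) -> dict:
    """The two knobs from above, expressed as options you'd feed to a
    Puppeteer/Playwright-style launcher: a real-browser user agent plus
    the Chromium automation hint disabled."""
    return {
        "headless": True,
        # suppresses navigator.webdriver-style automation signals in Chromium
        "args": ["--disable-blink-features=AutomationControlled"],
        "user_agent": user_agent,
    }

opts = stealth_launch_options(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
```

With those two set, the fetch no longer advertises "HeadlessChrome" and no longer raises the automation flags the site checks for.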
As for reCAPTCHA, what it tries to do is find an actual user inside the page, and a bot moves differently from a human being. If my mouse is right here and I move it straight down to this line, it won't be a perfectly straight movement, whereas a bot could go directly from this coordinate down to this one in a straight line. Since humans aren't able to do that exactly, the system identifies that it's not a human there. Puppeteer can handle some of that by default, or you can just build a random function that moves the cursor around randomly, and then maybe the reCAPTCHA won't detect you.
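A minimal sketch of that random-movement idea: generate a jittered, human-ish path between two points instead of a straight line (the step count and jitter amount are arbitrary choices):

```python
import random

def humanish_path(start, end, steps=25, jitter=4.0):
    """Interpolate from start to end, adding random noise at each step
    so the trajectory is not a perfectly straight line."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = x0 + (x1 - x0) * t + random.uniform(-jitter, jitter)
        y = y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)
        points.append((x, y))
    points[0], points[-1] = start, end  # pin the endpoints exactly
    return points

# Feed each point to your driver's mouse-move call with small delays.
path = humanish_path((0, 0), (100, 100))
```

Each point would then be passed to something like Puppeteer's `page.mouse.move(x, y)` with short pauses in between.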
So at this point we've bypassed a lot of things just by using Puppeteer; out of the box it does a lot, and along with that we can implement things like this that help guarantee we can access the website. But for geo-blocking and rate limiting, which involve our IP, there's no better way to get past them than using a proxy. I understand that sometimes I receive comments like these, but for problems like these, companies like Iami really come in handy. Back in the story from the beginning of the video, if I had used this residential proxy from Iami, I calculated that fetching every 10 minutes, every day, every week of the month, I would have at most about $2 of expenses per month. Along with all the features a proxy like Iami can provide, you even have the security of not accessing the website with your own IP, and you'll notice that really helps you avoid being blocked by that site on your own IP and being unable to access it even through your browser. Heading over to Iami's documentation, you'll see that we can integrate it with what we're using right now, which is Puppeteer, as well as BeautifulSoup, Playwright, and Selenium, and we'll also integrate this proxy with Crawl4AI in just a bit.
Integrating with a proxy is something really simple, but Iami's dashboard just makes it really intuitive. You can set the location settings right here; let's select Brazil for now. Down here is the most important step to guarantee that you don't get picked up by rate limiting. Earlier today there was a comment on one of my videos in which the person got a rate-limiting error, and that would not have happened if they were using this. Down here in the expert settings you'll have an ad-block toggle, and you'd want to toggle it on, especially when you're feeding the data to an LLM. I don't know if you've ever used speedtest.net, probably to test your internet connection; this is how using a proxy would make it look.
Now let's move all the way down to the mode. The quality mode is what guarantees that you avoid CAPTCHAs and can access websites with strict anti-bot measures, because honestly, what I did here was just a rookie way of trying to block bots; there are a lot of different apps and tools that people use to try to block them.
So what this code is going to do is run concurrent requests: it's going to fire seven requests at that website, first without a proxy, and you'll see that it's able to bypass the CAPTCHA as well as the user-agent blocking, but beyond the fifth request it will start getting blocked. With Iami's proxy it won't get blocked, and it will also show the IPs coming from the country we selected.
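The concurrent test harness is roughly this shape. To keep the sketch self-contained I stub the HTTP call with the demo site's five-per-window limit; in the real script each task would do an actual fetch, with or without the proxy:

```python
import asyncio

MAX_REQUESTS = 5  # mirrors the demo site's five-per-window limit
_counter = {"served": 0}

async def fetch(i: int) -> str:
    """Stub standing in for an HTTP GET; a real run would use aiohttp or httpx."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    _counter["served"] += 1
    return "200 OK" if _counter["served"] <= MAX_REQUESTS else "429 Too Many Requests"

async def run_test(n: int = 7) -> list[str]:
    # fire all n requests concurrently, like the script in the video
    return await asyncio.gather(*(fetch(i) for i in range(n)))

results = asyncio.run(run_test())
print(results)
```

Without the proxy, five requests come back fine and the last two hit the limiter; with per-request proxy rotation, each request arrives from a different IP, so the counter never fills up for any single one.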
Yeah, so I'm running this now; first it's testing without a proxy. Yeah, I exceeded the rate limit, probably because I had already been visiting the site some minutes ago. Now, with the proxy enabled and the selected country being Brazil, you see that it goes through a different IP for each call and then successfully brings back all the information. If I change my IP over to the United States, start my VPN, and then run this again, you'll see that the run without the proxy might fetch, actually not might, it certainly will fetch five requests successfully, but then two of them, as you can see here, will be blocked because the rate limit was exceeded.
Now here's the thing: I want to scrape this table and get it back in a structured way. The trick is that if I F5 this, you'll see that one HTML tag here is figure, then address, then figcaption. If I refresh, those HTML tags are now summary, output, and details, and if we F5 again, you'll see that we have a div. So every single time, the HTML tags change. I acknowledge that we could try to scrape it from the styling or some other attribute, or from the position, since we know there's only this one table in here, and we wouldn't necessarily need to use an LLM. But there are cases where I wouldn't even know in advance that I wanted to get to this exact website; the crawler just found it and wanted to scrape it. In those cases you would not know the structure.
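For this particular trick, one non-LLM fallback worth knowing is tag-agnostic parsing: walk the document and collect the text while ignoring the rotating tag names. A stdlib sketch (the sample rows are made up):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects visible text regardless of which tag names wrap it."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> list[str]:
    parser = TextCollector()
    parser.feed(html)
    return parser.chunks

# The same cell text survives no matter which tags the site rotates in:
variant_a = "<figure><address>Blue Mug</address><figcaption>$9.90</figcaption></figure>"
variant_b = "<details><summary>Blue Mug</summary><output>$9.90</output></details>"
assert extract_text(variant_a) == extract_text(variant_b) == ["Blue Mug", "$9.90"]
```

It only works when you already know what the page contains, though; for pages a crawler discovered blind, the LLM approach below the fold is what saves you.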
If you want more details on this, please check out my video above. Also, if you try out this code, you have to get this string right here from Iami, go over to the environment variables, find PROXY_URL, and just place it right there; the code will identify your user and your password and proceed with the proxy.
For my Scraper on Steroids I did it a bit differently. The only real difference is that I take the string and grab the initial part of it, which is the server; you can also find this information right over here. So, username, password, hostname, port: you can get all of that up there. Just place all of that information inside this proxy config dictionary; it gets passed to the browser config, which is then passed to the web crawler, and then you avoid being rate limited.
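Parsing the proxy string and shaping that dictionary can look like this. I'm assuming a standard `user:pass@host:port` proxy URL and the `server`/`username`/`password` keys that Crawl4AI's proxy config accepts; the credentials below are hypothetical placeholders:

```python
from urllib.parse import urlparse

def build_proxy_config(proxy_url: str) -> dict:
    """Split a user:pass@host:port proxy URL into the dict the
    browser config expects."""
    parsed = urlparse(proxy_url)
    return {
        "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
        "username": parsed.username,
        "password": parsed.password,
    }

# Hypothetical credentials, the same shape a provider dashboard gives you:
cfg = build_proxy_config("http://myuser:mypass@proxy.example.com:8080")
# cfg then flows into the browser config, e.g.
# BrowserConfig(proxy_config=cfg), which the crawler uses for every fetch.
```

Keeping the raw URL in an environment variable and deriving the dict from it means you never hard-code credentials in the scraper itself.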
So, proceeding on, let's just run python main.py and let Crawl4AI do its job. This is what it brought back to us, and if you copy the exact message it retrieved, open a test JSON, place that in there, and then format the code, that's it: it's completely structured. You got it despite the page having some strange HTML tags; you could scrape this. You might be questioning, "Well, I had some expenses, because I'm using the OpenAI API key here," and that's where Ollama comes in.
To use DeepSeek locally, you'll just have to head over to Ollama, download it, then head over to either your CMD or your terminal, type in ollama pull deepseek-r1:14b, and wait for it to be installed. You'll notice that we change the provider to ollama/deepseek-r1:14b and we've removed the API token, since we're running a local model.
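The switch amounts to something like this in the extraction settings. I'm showing the general shape with litellm-style provider strings, not the file from the video verbatim, and the cloud model name and key are placeholders:

```python
# Provider settings for the extraction step: cloud model vs. local model.
openai_settings = {
    "provider": "openai/gpt-4o-mini",  # example cloud model; billed per token
    "api_token": "sk-...",             # placeholder, not a real key
}

ollama_settings = {
    "provider": "ollama/deepseek-r1:14b",  # served locally by Ollama
    "api_token": None,                     # no key needed for a local model
}
```

Swapping one dict for the other is the whole cost optimization: the extraction prompt and schema stay the same.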
While that is downloading, let me give you some insight into how you can crawl or scrape a website that needs credentials. The way websites identify whether you're logged in is through cookies, and you can find your cookies in Inspect: go over to Application and you'll find a bunch of cookies right here; these are for this particular website. In there you'll find some strings that you can pass over to your code, whether you're using Puppeteer, Selenium, or Crawl4AI, and it will interpret that session and continue logged in while it's scraping. So it kind of simulates that you're already logged in, because it has the session from the login you performed previously.
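One simple way to hand those DevTools cookies to a scraper is to turn them into a Cookie request header. The cookie names and values here are made up; real session cookie names depend on the site:

```python
def cookie_header(cookies: dict[str, str]) -> str:
    """Serialize DevTools-copied cookies into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

# Hypothetical session cookies copied from Inspect -> Application:
session = {"sessionid": "abc123", "csrftoken": "xyz789"}
headers = {"Cookie": cookie_header(session)}
# Pass `headers` along with your fetch, or use your framework's own cookie
# API (e.g. page.setCookie in Puppeteer) with the same name/value pairs.
```

Either route makes the server see the session from your earlier login, so the scraper stays logged in without automating the login form itself.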
So that's done; it already pulled deepseek-r1:14b, we have that configured in there, so let's run our code. This might take up to a minute, and for real, I wouldn't recommend using this in production; Ollama with deepseek-r1:14b doesn't have a high capacity for solving even pretty easy things, and it might take a while depending on your GPU. Okay, it's done, and this is what it brought back, so let's copy that and send it over to our test JSON. It actually got even more than expected: if we head over to the website, you'll see that we have all of this scattered around the website, and it got everything. I suppose that's more because of my prompting; the weaker the LLM is, the more specific you need to be, so it just comes down to optimizing this prompt, and then it would probably have gotten it right. That is it for today. If you have any questions, please let me know in the comment section, and I'll see you in the next video. Till then!