How to scrape through captchas, geo blockers and rate limits (crawl4ai + Local Deepseek + Proxy)

00:11:32
https://www.youtube.com/watch?v=Htb_NsGlbgc

Summary

TLDR: In this video tutorial, the presenter walks through web scraping methods while addressing the legal and ethical implications. The scenario begins with a client who needs an AI chatbot but cannot grant access to his e-commerce database because of shared hosting limitations. The presenter explains how to scrape the data while bypassing anti-bot systems, using tools like Puppeteer and crawl4ai. Viewers learn about five different measures used to prevent scraping, such as CAPTCHA validation and geolocation blocking. Additionally, the video covers integrating residential proxies, managing user agents, and handling login sessions with cookies for effective data extraction.

Takeaways

  • 🤖 Scraping requires ethical considerations.
  • 🛡️ Learn to bypass anti-bot measures.
  • 📊 Use Puppeteer for effective scraping.
  • 🌐 Proxies can hide your IP address.
  • 🔒 Manage cookies for logged-in scraping.
  • 💡 Simulate a user's behavior to avoid detection.
  • 📈 Rate limiting can be circumvented.
  • 📉 Understand the website's structure for efficient scraping.
  • 🔎 Use tools like crawl4ai for professional-grade scraping.
  • 🏁 Always test your scrapers effectively before deployment.

Timeline

  • 00:00:00 - 00:05:00

    In this video, the speaker discusses building an AI chatbot for a client's e-commerce business and the challenges faced when trying to access the product database due to restrictions on their shared hosting platform. The speaker emphasizes the importance of not using scraping techniques for illicit purposes while introducing methods for bypassing bot detection and scraping website data effectively. As a demonstration, they implement various anti-scraping mechanisms on their own website, such as IP blocking, CAPTCHA, and rate limiting, which they then bypass using Puppeteer and proxies (a sketch of such a rate limiter follows this timeline).

  • 00:05:00 - 00:11:32

    The speaker delves deeper into the scraping tools and techniques, explaining how to use Puppeteer with residential proxies from Iami and how features like the ad-block toggle help avoid CAPTCHA and rate-limiting issues. They demonstrate practical examples of scraping data while handling changing HTML structures and discuss running a local DeepSeek model through Ollama for processing the data. The presentation concludes with a reminder about ethical scraping practices and an invitation for viewers to ask questions.
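
The rate limiter mentioned in the first timeline entry is easy to picture in code. Below is a minimal sketch of that idea, assuming a Flask app as a stand-in (the video never shows the demo site's actual stack; the /secure route name and the limits come from the narration):

    import time
    from collections import defaultdict
    from flask import Flask, abort, request

    app = Flask(__name__)

    WINDOW_SECONDS = 10 * 60   # the 10-minute span mentioned in the video
    MAX_REQUESTS = 5           # restrict access after the fifth hit
    hits = defaultdict(list)   # IP -> hit timestamps, in server memory instead of Redis

    @app.route("/secure")
    def secure():
        now = time.time()
        ip = request.remote_addr
        # keep only the hits that still fall inside the current window
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW_SECONDS]
        if len(hits[ip]) >= MAX_REQUESTS:
            abort(429)  # too many requests
        hits[ip].append(now)
        return "protected content"

As the speaker notes, this is a deliberately rookie setup: a production site would back this with Redis and far higher thresholds.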

Video Q&A

  • What is the main focus of this video?

    The video demonstrates how to scrape data from websites while bypassing various anti-bot measures.

  • What tools are used for web scraping in this tutorial?

    Puppeteer and crawl4ai are primarily used for web scraping.

  • How can anti-bot measures be bypassed?

    By using user agent simulation, proxy servers, and cookie management.

  • What is the ethical stance on web scraping mentioned?

    The presenter emphasizes that scraping should not be used for illegal activities.

  • What is a recommended strategy for managing geolocation restrictions during scraping?

    Using proxy servers can help you avoid geolocation restrictions.
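
    A hedged sketch of that idea in Playwright for Python (the video itself uses Puppeteer; the proxy endpoint and credentials below are placeholders):

        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            # route the whole browser through a proxy in an allowed country,
            # so the target site sees the proxy's geolocation, not yours
            browser = p.chromium.launch(proxy={
                "server": "http://proxy.example.com:7777",  # hypothetical endpoint
                "username": "YOUR_PROXY_USER",
                "password": "YOUR_PROXY_PASS",
            })
            page = browser.new_page()
            page.goto("https://httpbin.org/ip")  # echoes the IP the site sees
            print(page.content())                # should print the proxy's IP
            browser.close()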

  • Why is it important to manage the user agent during scraping?

    Some websites block requests based on the user agent, so simulating a real user is crucial.
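
    This is the fix the video applies with Puppeteer; here is the equivalent sketch in Playwright for Python (the UA string is just an example desktop Chrome value):

        from playwright.sync_api import sync_playwright

        REAL_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36")

        with sync_playwright() as p:
            # without this flag, Chromium advertises automation to the page
            browser = p.chromium.launch(
                headless=True,
                args=["--disable-blink-features=AutomationControlled"],
            )
            context = browser.new_context(user_agent=REAL_UA)  # pose as a real user
            page = context.new_page()
            page.goto("https://httpbin.org/user-agent")  # echoes the UA the site sees
            print(page.content())
            browser.close()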

  • What should you do if a website requires a login to access data?

    Use cookies collected from a logged-in session to simulate being logged in during the scrape.
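
    A minimal sketch of that cookie trick, again in Playwright for Python; the cookie name and value are placeholders you would copy from DevTools > Application:

        from playwright.sync_api import sync_playwright

        session_cookies = [{
            "name": "session_id",                 # hypothetical cookie name
            "value": "PASTE_VALUE_FROM_DEVTOOLS",
            "domain": "example.com",
            "path": "/",
        }]

        with sync_playwright() as p:
            browser = p.chromium.launch()
            context = browser.new_context()
            context.add_cookies(session_cookies)  # inject before navigating
            page = context.new_page()
            page.goto("https://example.com/account")  # loads as if logged in
            print(page.title())
            browser.close()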

  • What is the purpose of using proxies in scraping?

    Proxies help avoid rate limiting and keep your own IP address hidden.
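
    The video's rate-limit test fires seven requests at the demo site; here is a simplified sketch of the same check with the requests library (serial rather than concurrent, with a placeholder proxy URL in user:pass@host:port form):

        import requests

        PROXY_URL = "http://YOUR_USER:YOUR_PASS@proxy.example.com:7777"
        proxies = {"http": PROXY_URL, "https": PROXY_URL}

        for i in range(7):
            # with a rotating residential proxy, each call should exit from a
            # different IP, so no single IP trips the server's rate limiter
            r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
            print(i + 1, r.json())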

Subtitles (en)
  • 00:00:00
    now say this with me I will not
  • 00:00:01
    illegally scrape any website with the
  • 00:00:04
    things I learn in this video I had a
  • 00:00:06
    client that asked me to build an AI
  • 00:00:08
    chatbot for his WhatsApp business based
  • 00:00:09
    on his e-commerce products it was all
  • 00:00:12
    fine until I asked him for the access to
  • 00:00:14
    the products database he stared at me
  • 00:00:16
    for some seconds and said well I used a
  • 00:00:19
    shared hosting platform and the problem
  • 00:00:21
    with this shared hosting platform is
  • 00:00:22
    that it basically blocks remote MySQL
  • 00:00:25
    access and while I could whitelist the
  • 00:00:27
    server's IP there were more issues
  • 00:00:29
    behind the DB I needed and it was all too
  • 00:00:31
    complicated scraping would be the way to
  • 00:00:34
    go the client's website had some default
  • 00:00:36
    bot blockers that made it a bit more
  • 00:00:38
    difficult to scrape and because of that
  • 00:00:40
    in this video I'd like to show you ways
  • 00:00:42
    you can scrape data from a website while
  • 00:00:43
    bypassing some antibot systems just in
  • 00:00:46
    case your client's data is a mess
  • 00:00:48
    despite the front end being neat and
  • 00:00:50
    since this channel is focused on AI
  • 00:00:52
    building we will also use crawl4ai to
  • 00:00:54
    fetch the data structured beautifully by
  • 00:00:56
    using the local DeepSeek model so I get
  • 00:00:58
    why the majority of people would
  • 00:01:00
    search for a video like this either they
  • 00:01:01
    want to scrape social media or a website
  • 00:01:04
    that requires login and I really don't
  • 00:01:06
    want to incentivize that so what I did
  • 00:01:08
    was create my own website and it has a
  • 00:01:10
    lot of features that would prevent a bot
  • 00:01:12
    from scraping it and if you are a code
  • 00:01:14
    bro and you're looking at this you might
  • 00:01:15
    be thinking well this isn't like the
  • 00:01:17
    ideal form of preventing scraping uh
  • 00:01:20
    like rate limiting instead of
  • 00:01:22
    using Redis I'm using the server's local
  • 00:01:24
    memory and all but you guys get the
  • 00:01:26
    point this is really just to test how
  • 00:01:28
    the system would work so inside this
  • 00:01:30
    website I implemented five different
  • 00:01:32
    ways to prevent scraping the first way
  • 00:01:34
    was using reCAPTCHA so it's placed there
  • 00:01:36
    in a way that this text and also the
  • 00:01:39
    dynamic table will only show after the
  • 00:01:42
    reCAPTCHA has been validated some
  • 00:01:44
    scrapers just stay on headless mode and
  • 00:01:46
    they don't configure the agent and if
  • 00:01:48
    you don't do that then your fetch will
  • 00:01:51
    include like the Headless inside of it
  • 00:01:53
    or a bot maybe like depending on which
  • 00:01:55
    framework you're using to perform the
  • 00:01:57
    fetch and many times a website will
  • 00:01:59
    identify that inside of the user
  • 00:02:01
    agent so I did that too it redirects the
  • 00:02:03
    users to the block page if so then as
  • 00:02:06
    you can also see here I'm restricting by
  • 00:02:08
    geolocation so you won't be able to
  • 00:02:10
    access the website from the United
  • 00:02:11
    Kingdom then you'll get sent to the
  • 00:02:13
    block country route I've also placed a
  • 00:02:15
    rate limiter this isn't using Redis as
  • 00:02:17
    I said it's just a local way to identify
  • 00:02:20
    like in a span of 10 minutes did the
  • 00:02:23
    user access the website like multiple
  • 00:02:24
    times and after the user accesses the
  • 00:02:27
    website for the fifth time then it just
  • 00:02:29
    restricts the access and we can test
  • 00:02:31
    that right now just by F5-ing the website
  • 00:02:35
    yeah United Kingdom is blocked and why
  • 00:02:37
    is that it's because I'm connected
  • 00:02:39
    through a VPN in the United Kingdom so
  • 00:02:41
    if I deactivate that and try to access
  • 00:02:44
    the secure path again it will open up
  • 00:02:47
    just fine then F5 again opened up third
  • 00:02:50
    time fourth time fifth it will open and
  • 00:02:54
    now it should block my access yeah so
  • 00:02:57
    that's a simple rate limiter and
  • 00:02:59
    obviously a website wouldn't block your
  • 00:03:00
    access after five times you fetched
  • 00:03:03
    it in 10 minutes it would probably block
  • 00:03:05
    your access after like the 100th time
  • 00:03:07
    that you're trying to access it in a
  • 00:03:08
    minute and ideally you wouldn't want to
  • 00:03:11
    do that right unless you're trying to
  • 00:03:13
    DDoS the website so again you promised me
  • 00:03:15
    this in the beginning you won't use this
  • 00:03:16
    to do anything illegal right now let's
  • 00:03:19
    open the actual scrapers uh I have a
  • 00:03:21
    simple scraper and then the scraper on
  • 00:03:23
    steroids which is using crawl4ai
  • 00:03:25
    along with a pretty cool proxy and for
  • 00:03:27
    our simple scraper we're using Puppeteer
  • 00:03:29
    and as I said in a previous video that
  • 00:03:31
    should be enough like for 90% of cases
  • 00:03:33
    you can scrape a website just using
  • 00:03:35
    Puppeteer Selenium or Beautiful Soup
  • 00:03:37
    just with Puppeteer you can get past two
  • 00:03:39
    of the problems we had earlier one of
  • 00:03:41
    them is the user agent and the way we
  • 00:03:43
    bypass that is just simulating an actual
  • 00:03:45
    real user you can simulate that easily
  • 00:03:47
    with just this line of code and also in
  • 00:03:49
    this line we're disabling the automation
  • 00:03:51
    flags these automation flags normally
  • 00:03:53
    come with the scraper and they kind of
  • 00:03:55
    tell the website it's a bot accessing
  • 00:03:57
    them but with this simple line you can
  • 00:03:59
    deactivate that as for reCAPTCHA what it
  • 00:04:02
    tries to do is find an actual user
  • 00:04:04
    inside of the page so a bot would move
  • 00:04:07
    differently from a human being my mouse
  • 00:04:09
    if it's right here and I move it
  • 00:04:11
    straight down to this line it won't be a
  • 00:04:13
    straight movement as if it were a bot it
  • 00:04:15
    could directly go from this coordinate
  • 00:04:16
    down to this in a direct line and like
  • 00:04:19
    since humans aren't able to exactly do
  • 00:02:21
    that the reCAPTCHA will identify that it's
  • 00:02:24
    not a human there Puppeteer can solve that by
  • 00:04:26
    default or you can just build a random
  • 00:04:28
    function that moves the the cursor
  • 00:04:30
    around randomly and then maybe the
  • 00:04:32
    reCAPTCHA won't detect you so at this
  • 00:04:34
    point we bypassed a lot of things just
  • 00:04:36
    by using Puppeteer by default like out
  • 00:04:38
    of the box it does a lot of things and
  • 00:04:40
    along with that we can implement things
  • 00:04:41
    like this that guarantee that we can
  • 00:04:43
    access the website but for geo blocking
  • 00:04:46
    and rate limiting that involves our IP
  • 00:04:48
    there's no better way to get past that
  • 00:04:49
    than using a proxy I understand that
  • 00:04:52
    sometimes I receive comments like these
  • 00:04:53
    but for some problems like these
  • 00:04:55
    companies like Iami really come in
  • 00:04:56
    handy back then in the story of the
  • 00:04:58
    beginning of the video if I used this
  • 00:05:00
    residential proxy from Iami I calculated
  • 00:05:03
    that fetching every 10 minutes every day
  • 00:05:05
    every week of the month I would have up
  • 00:05:08
    to just $2 of expenses every month and
  • 00:05:10
    along with all the features of proxy
  • 00:05:12
    like Iami can provide you even have
  • 00:05:14
    the security of not accessing the
  • 00:05:16
    website with your own IP and you'll
  • 00:05:18
    notice that that will really help you
  • 00:05:20
    avoid being blocked by that site on
  • 00:05:22
    your own IP and not being able to access
  • 00:05:24
    it even through your browser so heading
  • 00:05:26
    over to Iami's documentation you'll see
  • 00:05:28
    that we can integrate them with what
  • 00:05:30
    we're using right now which is Puppeteer
  • 00:05:32
    Beautiful Soup Playwright Selenium and also
  • 00:05:35
    we'll integrate this proxy with
  • 00:05:36
    crawl4ai just in a bit integrating with a
  • 00:05:39
    proxy is something really simple but
  • 00:05:41
    Iami's dashboard just makes it really
  • 00:05:42
    intuitive so you can set the location
  • 00:05:45
    settings right here let's select residential
  • 00:05:47
    for now down here is the most important
  • 00:05:49
    step to guarantee that you don't get
  • 00:05:51
    picked up on rate limiting earlier today
  • 00:05:53
    there was a comment in one of my videos
  • 00:05:55
    on which the person got a rate limiting
  • 00:05:56
    error and that would not have happened
  • 00:05:58
    if they were using this down here in the
  • 00:06:00
    expert settings you'll have an ad block
  • 00:06:01
    toggle and you'd want to toggle on this
  • 00:06:03
    ad block especially when you're feeding
  • 00:06:05
    the data to an llm I don't know if
  • 00:06:06
    you've ever used speedtest.net probably
  • 00:06:09
    to test your internet connection this is
  • 00:06:11
    how using a proxy would make it look now
  • 00:06:13
    let's move all the way down to the mode
  • 00:06:15
    the quality mode is what guarantees that
  • 00:06:16
    you avoid captchas as well as can access
  • 00:06:19
    websites with strict antibot measures
  • 00:06:21
    because honestly what I did here was
  • 00:06:22
    just a rookie way of trying to block
  • 00:06:24
    Bots there are a lot of different apps
  • 00:06:26
    and tools that people will use to try to
  • 00:06:28
    block these bots so what this code is going
  • 00:06:30
    to do is run concurrent requests it's
  • 00:06:32
    going to run seven requests over to that
  • 00:06:34
    website first not using proxy and then
  • 00:06:37
    you'll see that it will be able to
  • 00:06:39
    bypass captcha as well as the user agent
  • 00:06:42
    blocking but from the sixth request it
  • 00:06:44
    will start blocking it as for Iami's
  • 00:06:47
    proxy it won't block it and it'll also
  • 00:06:49
    show the IPs going from the country that
  • 00:06:51
    we selected yeah so I'm running this
  • 00:06:53
    now it's first testing it without a
  • 00:06:56
    proxy yeah I hit the rate limit
  • 00:07:00
    probably because I was already
  • 00:07:00
    visiting the site some minutes ago yeah
  • 00:07:02
    now with the proxy enabled and the
  • 00:07:04
    country selected being Brazil you see
  • 00:07:07
    that it goes through different IPs for
  • 00:07:09
    each call and then it successfully
  • 00:07:12
    brings all the information if I change
  • 00:07:14
    my IP over to the United States and
  • 00:07:18
    start my VPN and then run this again
  • 00:07:22
    you'll see that the search
  • 00:07:24
    without proxy might fetch not might it
  • 00:07:27
    certainly will fetch five requests
  • 00:07:30
    successfully but then two of them as you
  • 00:07:32
    can see here will be blocked because the
  • 00:07:34
    rate limiting was exceeded now here's
  • 00:07:36
    the thing I want to scrape this table and
  • 00:07:38
    get it back in a structured way the
  • 00:07:40
    trick is that if I F5 this and
  • 00:07:42
    you'll see that one HTML tag here is
  • 00:07:45
    figure then address then figcaption if
  • 00:07:48
    I refresh those HTML tags change now we
  • 00:07:51
    have summary output and details now if
  • 00:07:55
    we F5 again you'll see that we have a
  • 00:07:57
    div so every single time the HTML tag
  • 00:08:00
    is changing and I acknowledge that we
  • 00:08:03
    could try to scrape it from the styling
  • 00:08:05
    or some other variable or from the
  • 00:08:08
    position like we understand that
  • 00:08:10
    there's only this table in here and not
  • 00:08:12
    necessarily would we need to use an llm
  • 00:08:14
    but there are cases where maybe I
  • 00:08:16
    wouldn't even know that I wanted to
  • 00:08:18
    get to this exact website and the
  • 00:08:20
    crawler just found it then it wanted to
  • 00:08:22
    scrape it in those cases you would not
  • 00:08:24
    know the structure and if you want more
  • 00:08:25
    details on this please check out my
  • 00:08:27
    video above and also if you try out this
  • 00:08:29
    code you have to get this string right
  • 00:08:30
    here from Iami go over to the
  • 00:08:33
    environment variables place PROXY_URL and
  • 00:08:35
    just place it right here the code will
  • 00:08:37
    identify your user your password and
  • 00:08:39
    just proceed with the proxy for my
  • 00:08:41
    Scraper on steroids I did it a bit
  • 00:08:43
    differently so yeah the only difference
  • 00:08:45
    is really I get the string I get the
  • 00:08:49
    initial part of it this is the server
  • 00:08:50
    you can also find this information right
  • 00:08:53
    over here yeah so username password host
  • 00:08:56
    name Port yeah you can get that all up
  • 00:08:58
    there just place all the
  • 00:09:00
    information inside of this proxy config
  • 00:09:01
    dictionary this will be sent to the
  • 00:09:03
    browser config which will then be sent
  • 00:09:05
    to the WebCrawler and then you avoid
  • 00:09:07
    being rate limited so proceeding on let's
  • 00:09:10
    just python main.py and let crawl4ai do
  • 00:09:13
    its job this is what it brought back to
  • 00:09:15
    us and if you copy the exact message it
  • 00:09:18
    retrieved create a test.json place that in
  • 00:09:21
    there and then format the code that's it
  • 00:09:24
    it's completely structured you got it
  • 00:09:27
    despite it having some strange HTML tags
  • 00:09:30
    like you could scrape this you
  • 00:09:32
    might be questioning well I had some
  • 00:09:33
    expenses because I'm using the OpenAI API
  • 00:09:35
    key here and that's where Ollama comes in
  • 00:09:38
    to use DeepSeek locally you'll just have to
  • 00:09:40
    head over to Ollama download it now head
  • 00:09:43
    over to either your CMD or your terminal
  • 00:09:45
    type in ollama pull deepseek-r1:14b
  • 00:09:49
    and wait for it to be installed
  • 00:09:52
    you'll notice that we change the
  • 00:09:53
    provider to ollama/deepseek-r1:14b and we've
  • 00:09:57
    removed the API token since we're
  • 00:09:59
    running a local model while that is
  • 00:10:01
    downloading let me give you guys some
  • 00:10:02
    insights on how you can crawl or scrape
  • 00:10:04
    a website that needs some credentials so
  • 00:10:08
    the way websites use to identify if
  • 00:10:09
    you're logged in is through cookies and
  • 00:10:11
    if you find your cookies in here in
  • 00:10:13
    inspect you go over to application
  • 00:10:15
    you'll find a bunch of cookies right
  • 00:10:17
    here this is for like this particular
  • 00:10:19
    website so in there you'll find some
  • 00:10:21
    strings that you can pass over to your
  • 00:10:23
    code either if you're using Puppeteer
  • 00:10:25
    Selenium or crawl4ai and then it will
  • 00:10:27
    interpret that session and
  • 00:10:29
    continue logged in while it's scraping
  • 00:10:32
    so it kind of simulates that you're
  • 00:10:33
    already logged in because it has the
  • 00:10:35
    session of the login that you've
  • 00:10:37
    performed previously so that's done it
  • 00:10:40
    already pulled deepseek-r1:14b we have that
  • 00:10:44
    configured in there let's run our code
  • 00:10:47
    this might take up to a minute and for
  • 00:10:49
    real I wouldn't recommend using this for
  • 00:10:51
    production like Ollama DeepSeek R1 14b it
  • 00:10:54
    doesn't have a high capacity of solving
  • 00:10:56
    pretty easy things and it might take a
  • 00:10:59
    while depending on your GPU okay it's
  • 00:11:01
    done and this is what it brought back so
  • 00:11:03
    let's copy that and send it over to our
  • 00:11:05
    test.json it actually got even more than
  • 00:11:08
    expected so if we head over to the
  • 00:11:09
    website you'll see that we have all this
  • 00:11:13
    scattered around the website it got
  • 00:11:15
    everything but I suppose that's more
  • 00:11:16
    because of my prompting the weaker
  • 00:11:18
    the LLM is the more specific you need to
  • 00:11:20
    be so it just comes down to just
  • 00:11:22
    optimizing this prompt it would probably
  • 00:11:24
    have got it correctly that is it for
  • 00:11:26
    today if you have any questions please
  • 00:11:27
    let me know in the comment section and
  • 00:11:29
    I'll see you in the next video till then
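
To tie the transcript together, here is a hedged sketch of the "scraper on steroids" it describes: crawl4ai with a proxy_config passed through BrowserConfig, and an LLM extraction strategy pointed at a local DeepSeek model served by Ollama. crawl4ai's API has shifted between releases, so treat the exact keyword arguments as assumptions, and the URL and credentials as placeholders:

    import asyncio
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    proxy_config = {
        "server": "http://proxy.example.com:7777",  # placeholder, from the proxy dashboard
        "username": "YOUR_PROXY_USER",
        "password": "YOUR_PROXY_PASS",
    }

    strategy = LLMExtractionStrategy(
        provider="ollama/deepseek-r1:14b",  # local model, so no api_token is set
        instruction="Extract each row of the product table as a JSON object.",
    )

    async def main():
        browser_config = BrowserConfig(headless=True, proxy_config=proxy_config)
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(
                url="https://example.com/secure",  # hypothetical demo page
                config=CrawlerRunConfig(extraction_strategy=strategy),
            )
            print(result.extracted_content)  # paste into a test.json to inspect

    asyncio.run(main())

Before running it, pull the model once with ollama pull deepseek-r1:14b, and remember the speaker's caveat: a local 14b model is slow and weak for production extraction, so prompts need to be very specific.
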
Tags
  • web scraping
  • anti-bot measures
  • Puppeteer
  • crawl4ai
  • data extraction
  • proxies
  • geolocation
  • user agent
  • cookies
  • ethical scraping