The Biggest Mistake Beginners Make When Web Scraping

00:10:20
https://www.youtube.com/watch?v=G7s0eGOaRPE

Summary

TL;DR: In this video the speaker walks through how to get data from a modern website effectively. Modern websites consist of a front-end and a back-end system, and the data that matters lives on the back end, so trying to pull it from the front end is inefficient. The front end is built with a JavaScript framework that makes AJAX requests (for example with Axios) to the back end, the system that actually holds the data. To reach that data without going through the front end, a request usually has to carry a valid cookie, which the speaker ties to CORS (Cross-Origin Resource Sharing). He demonstrates this with Playwright: a browser session is started to obtain a valid cookie (it can run headless), and that cookie is then used for the data requests.

Highlights

  • 💡 The front end is for presentation; the back end holds the data.
  • 🍪 Cookies are crucial for access to the back end.
  • 📜 Modern sites use JavaScript frameworks to make their data requests.
  • 🔄 Sites that restrict cross-origin requests (CORS) often also expect a cookie.
  • 🛠 Use Playwright to obtain cookies via a headless browser.
  • 🔍 Insomnia or Postman can help you build valid requests.
  • ⏳ Cookies expire; refresh them regularly.
  • 🔧 Playwright helps you get the right cookie information.
  • ⚙️ Use sessions for repeated requests to the back end.
  • 🔑 Focus on gaining access to the back end for the complete data.
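
The point about sessions can be sketched roughly as follows, assuming Python and the third-party requests library. The function names are hypothetical and not taken from the video, and the cookie dicts are assumed to have the `name` and `value` keys that Playwright's `context.cookies()` returns.

```python
from typing import Any


def to_name_value(cookies: list[dict[str, Any]]) -> dict[str, str]:
    """Reduce Playwright-style cookie dicts to a plain name -> value mapping."""
    return {c["name"]: c["value"] for c in cookies}


def build_session(cookies: list[dict[str, Any]]):
    """Return a requests.Session that sends the given cookies on every call."""
    # Third-party import kept local so to_name_value works without requests.
    import requests

    session = requests.Session()
    session.cookies.update(to_name_value(cookies))
    return session


def fetch_many(session, urls: list[str]) -> list[Any]:
    """Fetch several JSON endpoints with one session, i.e. one shared cookie set."""
    results = []
    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()  # surface rejected or failed requests
        results.append(response.json())
    return results
```

With a session, the cookie is attached once and every later call to a detail endpoint reuses it, so only the URL changes between requests.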

Timeline

  • 00:00:00 - 00:05:00

    On modern websites the back end stores the data, while the front end is usually written in JavaScript and sends requests to the back end. For someone who wants the data, it is more efficient to go to the back end directly. Because of the restrictions the speaker attributes to CORS (cross-origin resource sharing), however, you have to present yourself as a front-end user by sending a cookie. A tool such as Playwright can obtain that cookie by loading a headless browser, after which the cookie can be reused for later requests, for example from Postman.

  • 00:05:00 - 00:10:20

    Cookies play a crucial role in getting data from modern websites because they carry the identifying information the back end checks on each request. With Playwright you can load a browser in headless mode, collect the cookies, and use them for API requests. Passing the cookie on to the requests library is useful when the JSON returned by the back end is very large: in the video, the oversized response caused Playwright itself to fail. Reusing the same cookie for successive requests also simplifies repeated access and reduces the manual steps shown with tools such as Insomnia.
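
The flow in this timeline might look something like the sketch below, assuming Python with the third-party playwright and requests packages installed. `page_url`, `api_url`, and all function names are placeholders rather than the code shown in the video, and a real site may need its cookie-consent button clicked before the cookies are set.

```python
from typing import Any


def cookie_header(cookies: list[dict[str, Any]]) -> str:
    """Join Playwright-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def fetch_cookies(page_url: str, headless: bool = True) -> list[dict[str, Any]]:
    """Load the front-end page in Chromium and return the context's cookies."""
    # Third-party import kept local so cookie_header works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context()  # the context is where cookies live
        page = context.new_page()
        page.goto(page_url)
        # Assumption: if the site shows a consent banner, click it here first.
        cookies = context.cookies()
        browser.close()
    return cookies


def fetch_json(api_url: str, cookies: list[dict[str, Any]]) -> Any:
    """Request the back-end endpoint directly, sending the browser's cookies."""
    import requests

    response = requests.get(
        api_url, headers={"Cookie": cookie_header(cookies)}, timeout=30
    )
    response.raise_for_status()
    return response.json()
```

Playwright does one job here, collecting the cookies, and requests handles the large JSON response. Calling `fetch_json(api_url, fetch_cookies(page_url))` chains the two steps.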

Video Q&A

  • Why is it not a good idea to pull data from a website's front end?

    Modern websites keep their data on the back end; the front end only handles presenting it.

  • What does CORS mean in web technology?

    CORS stands for Cross-Origin Resource Sharing, a mechanism browsers use to decide whether a web page may request restricted resources from a different domain.

  • How can you get data from the back end without going through the front end?

    You can use Playwright to drive a headless browser, obtain a cookie, and then send requests straight to the back end with that cookie.

  • Why is a cookie important when making back-end requests?

    A cookie lets your request pass as coming from a legitimate user of the website, which is what gets you access to the back end.

  • What do you do when a cookie expires?

    When a cookie expires, you start a new headless browser session to obtain a fresh one.

  • What is the main lesson of the video?

    The video stresses getting data from the back end directly, using a cookie, with tools such as Insomnia or Postman to work out a request that succeeds.
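
The answer on expired cookies could be handled along these lines. This is a hedged sketch in Python; `CookieRejected`, `CookieCache`, and `call_with_refresh` are hypothetical names, and how a particular site signals a rejected cookie (status code, blank page, redirect) is an assumption you would need to check per site.

```python
from typing import Any, Callable

Cookie = dict[str, Any]


class CookieRejected(Exception):
    """Raised by the caller's request function when the back end refuses the cookie."""


class CookieCache:
    """Hold cookies from a (slow) browser session until they stop working."""

    def __init__(self, fetch_cookies: Callable[[], list[Cookie]]) -> None:
        self._fetch = fetch_cookies  # e.g. a function that launches Playwright
        self._cookies: list[Cookie] = []

    def get(self) -> list[Cookie]:
        if not self._cookies:
            self._cookies = self._fetch()
        return self._cookies

    def invalidate(self) -> None:
        self._cookies = []


def call_with_refresh(
    do_request: Callable[[list[Cookie]], Any], cache: CookieCache
) -> Any:
    """Run do_request; if the cookie is rejected, fetch a new one and retry once."""
    try:
        return do_request(cache.get())
    except CookieRejected:
        cache.invalidate()
        return do_request(cache.get())
```

Retrying only once keeps a genuinely blocked request from launching browsers in a loop. The browser session is only started again when the back end actually rejects the stored cookie.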

Subtitles
  • 00:00:00
    if you're trying to take the data from
  • 00:00:02
    the front end of a website there's a
  • 00:00:04
    good chance that you're going to be
  • 00:00:05
    doing it wrong and you're not going to
  • 00:00:06
    get what you need
  • 00:00:08
    modern websites are made up of a
  • 00:00:10
    front-end and a back-end system and it's
  • 00:00:12
    the back-end that has all the
  • 00:00:14
    information all the data on it that we
  • 00:00:16
    want so why would we make a request to
  • 00:00:18
    the front end when it's the back end
  • 00:00:20
    that's actually got the data well to
  • 00:00:22
    work this out and understand we need to
  • 00:00:23
    talk a little bit about how a modern
  • 00:00:25
    website works including using CORS
  • 00:00:27
    which is the cross origin resource
  • 00:00:30
    sharing so the front-end website that we
  • 00:00:32
    load up in our browser is pretty much
  • 00:00:34
    always javascript whichever framework is
  • 00:00:37
    most popular at the time probably and
  • 00:00:39
    what that does is you go to the page and
  • 00:00:41
    it will use something like ajax with
  • 00:00:43
    axios or something like that that will
  • 00:00:44
    make a request to an endpoint on the
  • 00:00:46
    back end of the website
  • 00:00:49
    which will be completely separate that
  • 00:00:51
    will then send that data to the front
  • 00:00:53
    end so then it will be displayed and
  • 00:00:55
    rendered properly for us the end
  • 00:00:57
    user so what we want to do is we want to
  • 00:01:00
    be able to go straight to the back end
  • 00:01:01
    and get the data but you see it's not
  • 00:01:03
    going to allow us to do that unless we
  • 00:01:06
    pretend that we are coming through the
  • 00:01:07
    front end of this site through
  • 00:01:09
    CORS which is generally going to
  • 00:01:11
    involve a cookie so what i'm going to do
  • 00:01:13
    is i'm going to walk you through an
  • 00:01:15
    example that i've done here i'll just
  • 00:01:16
    show you the code now
  • 00:01:18
    and i'm going to tell you about why i've
  • 00:01:20
    made some of these decisions what they
  • 00:01:23
    mean and also how you can take a cookie
  • 00:01:26
    from loading up a headless chrome using
  • 00:01:29
    something like playwright playwright in
  • 00:01:31
    this case and then we can send it to
  • 00:01:33
    requests so we can actually get a new
  • 00:01:36
    cookie every time that we want to do
  • 00:01:38
    this because cookies do expire before we
  • 00:01:41
    get to that today's video is sponsored
  • 00:01:43
    by skillshare skillshare is an online
  • 00:01:46
    learning community with thousands of
  • 00:01:47
    classes ready to help you explore your
  • 00:01:49
    creativity and inspire you if you have a
  • 00:01:51
    specific skill you're trying to learn or
  • 00:01:54
    maybe you're like me and you like to
  • 00:01:55
    utilize the breadth and depth of classes
  • 00:01:58
    to help you with the other parts of
  • 00:02:00
    personal growth to support your side
  • 00:02:02
    projects this week i've been watching
  • 00:02:04
    creativity unleashed discover hone and
  • 00:02:07
    share your voice online by nathaniel
  • 00:02:09
    drew nathaniel is a youtuber who i am
  • 00:02:11
    very familiar with having followed
  • 00:02:13
    online for several years now and i was
  • 00:02:15
    very excited to take his class i believe
  • 00:02:17
    there's great value to be had in
  • 00:02:19
    watching and learning from someone who's
  • 00:02:21
    out there creating and making stuff
  • 00:02:23
    every day and this was exactly that so
  • 00:02:25
    the first 1000 people to use the link in
  • 00:02:28
    the description below or my code john
  • 00:02:30
    watson rooney will get one month free
  • 00:02:33
    access to skillshare so once again click
  • 00:02:35
    that link in the description below or my
  • 00:02:37
    code john watson rooney and thank you to
  • 00:02:39
    skillshare for sponsoring this episode
  • 00:02:41
    so let's move over to the actual website
  • 00:02:44
    which i've got here and you'll see that
  • 00:02:46
    when you load this up for the first time
  • 00:02:47
    especially in private browsing it tells
  • 00:02:50
    you you need to accept cookies and this
  • 00:02:52
    is very common and this is exactly what
  • 00:02:54
    we need to do so i'm going to hit accept
  • 00:02:56
    all it's going to load up the page and
  • 00:02:58
    it's going to have all the information
  • 00:02:59
    on now you'll see that here here's the
  • 00:03:01
    list and it's all done in a nice fancy
  • 00:03:03
    way so you click on it and it loads up
  • 00:03:05
    more stuff etc etc we're all familiar
  • 00:03:07
    with how these websites work what i'm
  • 00:03:09
    going to do is we're going to go to the
  • 00:03:10
    inspect element tool and go to the
  • 00:03:12
    network tab try and make this a bit
  • 00:03:14
    bigger hit reload and we're going to see
  • 00:03:17
    that the front end is making requests to
  • 00:03:19
    the back end for the actual information
  • 00:03:22
    there's quite a few here but what i'm
  • 00:03:24
    going to show you is the page data let's
  • 00:03:27
    move this out of the way here
  • 00:03:30
    move it
  • 00:03:31
    so you can see that in this one we have
  • 00:03:34
    these specific headers that we are
  • 00:03:37
    requesting with our request headers and
  • 00:03:39
    the response headers these ones up here
  • 00:03:41
    and we can see that the actual response
  • 00:03:43
    and even though in this case has been
  • 00:03:45
    truncated and i'll come back to that
  • 00:03:47
    actually has the information from the
  • 00:03:50
    website that we are after
  • 00:03:52
    so what we want to do is we want to just
  • 00:03:54
    make this request ourselves
  • 00:03:56
    but it's not that simple because we need
  • 00:03:59
    to obey the rules of CORS the cross
  • 00:04:01
    origin resource sharing so we need to
  • 00:04:04
    have a cookie so we can actually
  • 00:04:06
    mimic this and be a part of this now in
  • 00:04:08
    my previous videos if you've watched any
  • 00:04:10
    of those i've said just copy this copy
  • 00:04:13
    it as curl and we'll use postman or
  • 00:04:15
    insomnia and that's great and that works
  • 00:04:18
    but when you actually get to the point
  • 00:04:19
    where you need a new cookie you have to
  • 00:04:21
    make a new request
  • 00:04:23
    what i did is i did copy as curl and i
  • 00:04:26
    opened up insomnia which i've got here
  • 00:04:28
    and what i've done is i've just been
  • 00:04:30
    through the header section this is the
  • 00:04:32
    request and i've ticked out all of the
  • 00:04:34
    ones that i don't think that we need
  • 00:04:36
    except for the cookie and when i run
  • 00:04:38
    this it will take a second because
  • 00:04:41
    as i said this response is quite big on
  • 00:04:43
    the opposite side which is just hidden
  • 00:04:45
    by my head let's move that out of the
  • 00:04:47
    way you'll see that we get this neat
  • 00:04:50
    json data with all of the information
  • 00:04:53
    that we could possibly want now this is
  • 00:04:54
    the information that the back end has
  • 00:04:56
    sent to the front end part of the
  • 00:04:58
    website which has rendered all nice and
  • 00:05:00
    neat
  • 00:05:01
    in here to show us this
  • 00:05:04
    and you can actually click through and
  • 00:05:06
    every time you click on a person's name
  • 00:05:08
    it makes a new request and this is its
  • 00:05:10
    own endpoint but we're still using the
  • 00:05:12
    same cookie so if you wanted to do that
  • 00:05:14
    you could actually expand on this and
  • 00:05:16
    get the information from each one of
  • 00:05:17
    these as well so let's go back to our
  • 00:05:21
    insomnia or postman or whatever you're
  • 00:05:23
    using if i untick the cookie and tick
  • 00:05:26
    everything else
  • 00:05:27
    for example so we just have the
  • 00:05:30
    we don't send the cookie so you can see
  • 00:05:32
    here's our CORS and everything like
  • 00:05:33
    that it's basically all of the
  • 00:05:35
    information that's being sent over if we
  • 00:05:37
    send this we get this blank page and
  • 00:05:40
    that is basically the response is
  • 00:05:43
    there'll be some javascript in here
  • 00:05:44
    which insomnia is not loading up telling
  • 00:05:46
    us that we need to have a cookie or need
  • 00:05:48
    to accept the cookie or something
  • 00:05:49
    similar okay so let's unselect all of
  • 00:05:51
    these again
  • 00:05:54
    and click the cookie back on
  • 00:05:58
    and then run this now
  • 00:06:01
    we're gonna get all the information back
  • 00:06:04
    so this is the main header that's the
  • 00:06:05
    most important one this is what's
  • 00:06:07
    identifying us what i like to do from
  • 00:06:09
    here is to use my
  • 00:06:12
    api tool to actually generate some code
  • 00:06:15
    for me you can see here because i've
  • 00:06:17
    only got the cookie header
  • 00:06:21
    selected that's the one that's come back
  • 00:06:23
    out and this is the one that we need so
  • 00:06:25
    as i said before
  • 00:06:27
    we could just use this code here exactly
  • 00:06:29
    and paste it into vs code or whatever
  • 00:06:31
    and this would give us that json data
  • 00:06:34
    but as soon as this cookie expires and
  • 00:06:35
    that's different for different websites
  • 00:06:38
    this will no longer work so we needed to
  • 00:06:40
    make it more repeatable and that's where
  • 00:06:41
    we're going to use playwright to load a
  • 00:06:43
    browser up
  • 00:06:45
    so if we go back to our code you'll see
  • 00:06:48
    here that i'm using playwright to load
  • 00:06:51
    up my chromium browser and i'm asking
  • 00:06:53
    for the context because the context is
  • 00:06:55
    where the
  • 00:06:56
    cookie information is so if we come back
  • 00:06:59
    to one of my working files so this is
  • 00:07:01
    just the playwright part let's move this
  • 00:07:03
    over here
  • 00:07:04
    and i print out the cookie context from
  • 00:07:07
    from playwright
  • 00:07:08
    you'll see that it loaded the browser up
  • 00:07:10
    and that's because we needed to do that
  • 00:07:12
    and i've got this in headless is true
  • 00:07:14
    it's false at the moment so i could see
  • 00:07:16
    what's going on but you'll see that we
  • 00:07:18
    get this dictionary back with all the
  • 00:07:19
    cookies with all the headers rather and
  • 00:07:21
    this is the one that we were interested
  • 00:07:23
    in
  • 00:07:24
    and this should be very similar to the
  • 00:07:26
    one i was passing off into requests so
  • 00:07:29
    we want to take this out and then move
  • 00:07:31
    it into requests but why i wanted to do
  • 00:07:34
    that was because of the actual size of
  • 00:07:38
    the json response that i was getting so
  • 00:07:40
    if you're trying to do this on a
  • 00:07:41
    different site and the actual response
  • 00:07:44
    that you're after for json is not that
  • 00:07:47
    big you could just stop right here and
  • 00:07:49
    then get the response.json
  • 00:07:52
    but because the actual json file that
  • 00:07:54
    we're getting back from this website has
  • 00:07:56
    so much information you can see it's
  • 00:07:58
    super long
  • 00:07:59
    it was too big and it was causing my
  • 00:08:02
    playwright to fail
  • 00:08:04
    but that led me on to pushing the cookie
  • 00:08:06
    into requests which i think is quite
  • 00:08:08
    valuable
  • 00:08:09
    so we can go back to it here and we can
  • 00:08:11
    see then
  • 00:08:13
    i'm taking the cookie for requests and
  • 00:08:15
    the cookie context
  • 00:08:17
    number three which was the third
  • 00:08:19
    index of the list we're grabbing the
  • 00:08:21
    value and taking the code from what
  • 00:08:24
    insomnia had generated we can see
  • 00:08:28
    that the cookie is in this format here
  • 00:08:30
    and this is specific to requests on how
  • 00:08:32
    it's going to be sent over they're just
  • 00:08:34
    formatted slightly differently so all i
  • 00:08:36
    did was copy this
  • 00:08:38
    into here and then used an f string
  • 00:08:42
    to add in the actual
  • 00:08:45
    cookie part v with all the information
  • 00:08:47
    that i was getting back from
  • 00:08:49
    playwright and that means that we can
  • 00:08:51
    then use the same cookie and we could
  • 00:08:53
    have a session in here if we were going
  • 00:08:55
    to
  • 00:08:56
    want to make the other requests like i
  • 00:08:59
    showed you down here these ones with
  • 00:09:01
    all the extra specific information
  • 00:09:04
    we would use a request session to use
  • 00:09:07
    the cookie the same cookie over and over
  • 00:09:09
    again
  • 00:09:10
    from here it was just a case of then
  • 00:09:13
    printing out the json and i've
  • 00:09:15
    specifically indexed it down here
  • 00:09:18
    this is actually all the information so
  • 00:09:20
    what i liked about this was using
  • 00:09:22
    playwright to do one thing grab me the
  • 00:09:24
    cookie and then pass it off onto
  • 00:09:26
    requests to then
  • 00:09:29
    use it so we could actually make that
  • 00:09:31
    request so if we didn't have the cookie
  • 00:09:33
    to send through with requests our
  • 00:09:35
    request would fail like i showed
  • 00:09:37
    you when we were doing it in insomnia so
  • 00:09:39
    i'm going to put this code in the
  • 00:09:40
    description down below for you to have a
  • 00:09:42
    look at and have a play with what i was
  • 00:09:45
    trying to show you here is that if
  • 00:09:47
    you're trying to get data from a website
  • 00:09:49
    and you're trying to grab it
  • 00:09:51
    from the front end and it's a modern
  • 00:09:52
    website you really want to try to put
  • 00:09:55
    your efforts into grabbing it from the
  • 00:09:57
    back end directly
  • 00:10:00
    using the cookie that you can grab this
  • 00:10:02
    way or from the actual request you made
  • 00:10:04
    in your browser initially if that works
  • 00:10:07
    for you
  • 00:10:08
    if you've enjoyed this video i think
  • 00:10:10
    you're going to like this one here which
  • 00:10:11
    goes into this method in a slightly
  • 00:10:13
    different way but more in-depth coding
  • 00:10:16
    it out so that might be more useful to
  • 00:10:18
    some of you
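
In the transcript the speaker takes the cookie at a fixed position in the list Playwright returns and drops its value into an f-string for requests. A slightly more defensive variant, sketched here in Python under the assumption that you know the cookie's name, looks it up by name instead of by index. The cookie name `consent` and both function names are purely illustrative.

```python
from typing import Any


def pick_cookie(cookies: list[dict[str, Any]], name: str) -> dict[str, Any]:
    """Return the cookie dict with the given name, or raise KeyError."""
    for cookie in cookies:
        if cookie["name"] == name:
            return cookie
    raise KeyError(f"no cookie named {name!r}")


def single_cookie_header(cookies: list[dict[str, Any]], name: str) -> dict[str, str]:
    """Build a headers dict for requests carrying only the named cookie."""
    cookie = pick_cookie(cookies, name)
    return {"Cookie": f"{cookie['name']}={cookie['value']}"}


# Example input shaped like the list from Playwright's context.cookies()
sample = [
    {"name": "a", "value": "1"},
    {"name": "consent", "value": "yes"},
    {"name": "b", "value": "2"},
]
headers = single_cookie_header(sample, "consent")
```

Looking the cookie up by name keeps the script working if the site adds or reorders cookies, which a fixed list index would not survive.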
Tags
  • web scraping
  • front-end
  • back-end
  • CORS
  • cookies
  • Playwright
  • JavaScript
  • axios
  • AJAX
  • Insomnia