Elasticsearch text analysis and full text search - a quick introduction

00:10:03
https://www.youtube.com/watch?v=ajNfOPeWiAY

Summary

TL;DR: The video explains the differences between full text search and term level queries, walking through the text analysis process that transforms lengthy text into manageable tokens for easier matching in search queries. Using a product review, it demonstrates various aspects of search in Elasticsearch, including match queries and phrase queries, alongside parameters like slop, and introduces the concepts of precision and recall. Viewers are encouraged to explore the documentation for deeper insights into full text search and relevance.

Key takeaways

  • 🔍 Full text search uses analyzed text for complex queries.
  • 📜 Text analysis transforms text into tokens for better matching.
  • ⚙️ Analyzers have character filters, tokenizers, and token filters.
  • ✨ Tokenization splits text based on word boundaries.
  • 📏 Precision demands exact matches but yields fewer results.
  • 🔄 Recall allows broader matches but may reduce relevance.
  • 🔧 Elasticsearch provides tools for balancing precision and recall.
  • 📖 Explore Elasticsearch documentation for in-depth knowledge.

Timeline

  • 00:00:00 - 00:10:03

    The video discusses the differences between full text search and term level queries, with a focus on text analysis for complex text queries. It highlights how full text search facilitates match queries without requiring exact term placement, which is critical for handling longer, analyzed texts like product reviews. Storing text in a text field means it is tokenized, enabling effective matching, as illustrated by a dehumidifier review that demonstrates how search criteria are converted into tokens for evaluation against the stored text.

Video Q&A

  • What is the main difference between full text search and term level queries?

    Full text search runs against analyzed text and allows for complex queries, while term level queries use exact field values.

  • What is text analysis?

    Text analysis is the process of transforming text into tokens for better matching in search queries.

  • What components does an analyzer have?

    An analyzer consists of character filters, a tokenizer, and token filters.

  • How does tokenization work?

    The tokenizer splits text into individual tokens based on specific boundaries, like spaces and punctuation.

  • What is precision in the context of searching?

    Precision refers to the accuracy of search results, typically resulting in fewer but highly relevant hits.

  • What is recall in searching?

    Recall involves broader search criteria, yielding more hits, but with varying degrees of relevance.

  • What tools does Elasticsearch provide for text search?

    Elasticsearch offers various query types and parameters to balance precision and recall in searches.

  • What is a phrase query?

    A phrase query requires the terms to appear in a specific order within the document.

Transcript (en)
  • 00:00:00 - 00:01:59

    Full text search works very differently to a term level query. At a high level, term level queries run against the exact values stored in the field, while full text queries run against analyzed text and allow for complex text queries, providing a lot of functionality over how documents are matched. I'll explain what text analysis is and why it's performed. We have an index containing product reviews; here's an example from one of the reviews: "The dehumidifier helps dry my laundry really quickly. It's great looking, efficient, and I use the water from the tank to water my plants. It's quite noisy though." One difference between this sort of text and the type of text we've stored in keyword fields so far is the length of the text. Keyword fields are used to store shorter pieces of text where exact or almost exact matching is frequently done. Paragraph-length text like this is by nature going to require a different set of tools for matching. If we stored the data in a keyword field and wanted to find reviews that mentioned drying laundry quickly, our options are pretty limited. We could use a wildcard query matching for "laundry" followed by "quickly", but we'd also need to look for "quickly" followed by "laundry" in case someone had written "it quickly dries my laundry". We'd also need to make sure the query was case insensitive, which would make a slow query even slower. Another option would be a regular expression, which again would perform very badly and be painful to write. And how could we deal with somebody writing "it dries my laundry really quickly", "it dries my laundry real quick", or "my laundry has never gotten dry so quick"? This is the sort of problem full text search is designed to solve. Here's how we could solve some of these issues using the default full text search options. To use full text search, we store the review in a text field, then write a match query. It doesn't matter which order we put the query terms in, and neither does the case make a difference.
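    The match query described here could be sketched as the following request body; the index and field names ("product_reviews", "review_text") are assumptions for illustration, not from the video.

    ```python
    # A minimal sketch of the match query described above. The field name
    # "review_text" is an assumption; send the body to
    # GET /product_reviews/_search (index name also assumed).

    def match_query(field: str, text: str) -> dict:
        """Build a match query request body."""
        return {"query": {"match": {field: text}}}

    # Term order and case don't matter for a match query: both of these
    # can hit "...helps dry my laundry really quickly..."
    body_a = match_query("review_text", "laundry quickly")
    body_b = match_query("review_text", "QUICKLY laundry")
    ```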
  • 00:01:59 - 00:02:43

    How is a match query over a text field able to do all of this? Two words: text analysis. When text is stored in a text field, it's not stored in its exact form. The analysis process, conducted by an analyzer, takes text and produces a stream of tokens, which are maintained in the inverted index. A token is a fragment of the original text, usually an individual word or derived from an individual word. When you run a query, your search criteria go through the same analysis process, resulting in another stream of tokens. Elasticsearch then looks for matches where tokens from your search criteria are in the field you're matching against.
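    The idea can be illustrated with a toy sketch (not Elasticsearch's actual implementation): analyze both the stored text and the search criteria into tokens, then check whether any query token appears among the document's tokens.

    ```python
    # Toy illustration of analyze-then-match, as described above.
    # This is NOT how Elasticsearch is implemented; it only mimics the idea.
    import re

    def toy_analyze(text: str) -> list[str]:
        # Stand-in for an analyzer: lowercase, then split on non-letters.
        return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

    doc_tokens = set(toy_analyze("The dehumidifier helps dry my laundry really quickly."))
    query_tokens = toy_analyze("LAUNDRY quickly")

    # Any query token found among the document's tokens produces a match,
    # regardless of term order or case.
    matches = any(tok in doc_tokens for tok in query_tokens)
    ```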
  • 00:02:43 - 00:04:36

    An analyzer has three components. The first step in analysis is done by character filters. Character filters work on individual characters in the source text, adding, modifying or removing characters. This step is commonly used to replace characters or remove them entirely, if you wanted to take out formatting or HTML, for example. You're able to use multiple character filters in an analyzer, and they are executed in the order they're defined. Character filters aren't required, though. Once all character filters have completed their work, or if there aren't any character filters in the analyzer, the resulting (or original) text is sent to the tokenizer. The tokenizer splits up text into individual tokens. You must have exactly one tokenizer in the analyzer. The text is usually, but not always, split on word boundaries like spaces, full stops and other punctuation. Another very important role of the tokenizer is to record the position of each token in the original text, such as the start and end offsets. This data is used for some query types, as well as for highlighting the search criteria in hits. The tokens produced by the tokenizer are sent to the final step: the token filters. Token filters add, modify or remove tokens produced by the tokenizer. This is another optional step, and like character filters, multiple token filters can be used in the analyzer. Token filters typically perform tasks like converting tokens to lowercase, removing tokens you don't want (like very short, common or rude words), and stemming: reducing a word to its root form. In the dehumidifier review, for example, the word "quickly" could be stemmed to just "quick". There are several analyzers you can use in Elasticsearch, and you're even able to craft your own by selecting the character filters, tokenizer and token filters you want.
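    A custom analyzer assembled from built-in components could be sketched as index settings like the following; the analyzer name "review_analyzer" and the specific component choices are illustrative assumptions, not from the video.

    ```python
    # A sketch of index settings defining a custom analyzer from built-in
    # Elasticsearch components, following the three-part structure above.
    # Name and component choices are assumptions for illustration.

    def custom_analyzer_settings() -> dict:
        """Settings body for creating an index with a custom analyzer."""
        return {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "review_analyzer": {
                            "type": "custom",
                            "char_filter": ["html_strip"],    # character filters (optional, ordered)
                            "tokenizer": "standard",          # exactly one tokenizer
                            "filter": ["lowercase", "stop"],  # token filters (optional, ordered)
                        }
                    }
                }
            }
        }
    ```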
  • 00:04:36 - 00:05:36

    Text fields by default use the standard analyzer. We're able to see the output of the analysis process by using the analyze API, so let's try that with a product review's text. We make a POST request to _analyze, providing the analyzer we want to use and the text we want analyzed in the request body. The standard analyzer doesn't have any character filters; it tokenizes based on word boundaries using the standard tokenizer, then lowercases each token. It also has a stop token filter, which is disabled by default but can be used to remove tokens you don't want included, referred to as stop words. The output of the analyze API shows us all the tokens that will be included in the contents of the text field. This is sometimes referred to as a bag of words. All the words in the original text are in the array of tokens, but they've been lowercased by the lowercase token filter. The start and end offsets and the position are also calculated for each token.
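    The _analyze request described here could be sketched as follows; the review text is the example from the video, while the surrounding helper is an assumption for illustration.

    ```python
    # A sketch of the request body for POST /_analyze, as described above.

    def analyze_request(analyzer: str, text: str) -> dict:
        """Request body for the analyze API."""
        return {"analyzer": analyzer, "text": text}

    body = analyze_request(
        "standard",
        "The dehumidifier helps dry my laundry really quickly.",
    )
    # The response lists each token along with its start_offset,
    # end_offset and position.
    ```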
  • 00:05:36 - 00:06:25

    Using different search criteria shows that not all the terms in the search criteria need to be in the field value in order to get a match. Searching for "laundry quickly rabbit" still produces a hit, despite there being no mention of a rabbit in the review. Adding a couple more reviews will give us a better picture of what's going on. Running that same query for "laundry quickly rabbit" now produces three hits: the original for the dehumidifier and the two new documents. Notice the value of _score, though. The rabbit review gets a higher score: all three search terms were matched in that document, two were matched in the dehumidifier review, and only one in the note on full text search. Elasticsearch, or Lucene really, indicates this in the scores for the three hits.
  • 00:06:25 - 00:06:38

    What if we only wanted to match documents containing all three of those terms? We can tell the match query to do this using the operator parameter. This narrows down the results to just the rabbit review.
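    The operator parameter described here could be sketched as the following request body; the field name "review_text" is an assumption for illustration.

    ```python
    # A sketch of a match query with "operator": "and", as described above:
    # every query term must match for a document to be a hit.

    def match_all_terms(field: str, text: str) -> dict:
        return {"query": {"match": {field: {"query": text, "operator": "and"}}}}

    # Only documents containing all of "laundry", "quickly" and "rabbit"
    # are hits, narrowing the results to the rabbit review.
    body = match_all_terms("review_text", "laundry quickly rabbit")
    ```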
  • 00:06:38 - 00:07:05

    There's another useful parameter in the match query: minimum_should_match. We've used this before in bool queries, to specify how many should clauses must match in order for a document to be a hit. It can also be used with a match query, to specify how many of our query terms must match the text field for the document to be a hit. We can see this by changing the minimum_should_match value for this query and seeing how it affects the results.
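    This parameter could be sketched as follows; the field name is again an assumption. With a value of 2, at least two of the three query terms must match for a document to be a hit.

    ```python
    # A sketch of minimum_should_match on a match query, as described above.

    def match_min_terms(field: str, text: str, minimum: int) -> dict:
        return {
            "query": {
                "match": {
                    field: {"query": text, "minimum_should_match": minimum}
                }
            }
        }

    # Require at least 2 of the 3 terms to match.
    body = match_min_terms("review_text", "laundry quickly rabbit", 2)
    ```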
  • 00:07:05 - 00:07:49

    It's clear from what we've seen here that the position of the terms in the hit doesn't matter. The query for "really laundry rabbit" with a minimum_should_match of 3 still produced a hit on the rabbit review, despite the field terms not being in the same order as the query terms. We can look for tokens in order by searching for a phrase. A phrase is a sequence of terms in a specific order. When you search for a phrase, documents must have those same terms in the same order as the phrase. To search for a phrase, we use a match phrase query. This query only matches the dehumidifier review, as that's the only one containing "really" and "quickly" in that order.
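    The match phrase query described here could be sketched as the following request body (field name assumed): only documents containing "really" immediately followed by "quickly" will match.

    ```python
    # A sketch of the match phrase query described above.

    def phrase_query(field: str, phrase: str) -> dict:
        return {"query": {"match_phrase": {field: phrase}}}

    # Matches "...my laundry really quickly..." but not a document where
    # "really" and "quickly" appear separately or in the other order.
    body = phrase_query("review_text", "really quickly")
    ```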
  • 00:07:49 - 00:08:21

    The match phrase query has a really useful parameter called slop, which you can use to broaden your search. It allows for a certain number of terms between the terms in the phrase. As an example, if we search for "water plants", nothing will hit. If we want to look a bit wider and allow a single term between the words "water" and "plants", we can add a slop parameter with a value of one. This finds the dehumidifier review, because the word "my" between "water" and "plants" is permitted by the slop.
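    The slop parameter described here could be sketched as follows (field name assumed): a slop of 1 permits one extra term between "water" and "plants", so "water my plants" in the review now matches.

    ```python
    # A sketch of a match phrase query with slop, as described above.

    def phrase_query_with_slop(field: str, phrase: str, slop: int) -> dict:
        return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}

    # slop=1 lets "water my plants" match the phrase "water plants".
    body = phrase_query_with_slop("review_text", "water plants", 1)
    ```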
  • 00:08:21 - 00:10:03

    I've explained the high-level view of full text search and shown how you're able to use different queries and parameters to match documents based on your own requirements. This is only an introduction, though; there's a lot more to full text search, but you can get a long way with the tools I'm going to give you in this course. There are two concepts I want to introduce now that you've seen some examples of searching for text: precision and recall. Precision and recall are two extremes of a query spectrum. At the precision end, you'd be very precise with your criteria, with very specific requirements for documents that need to match. Precise queries won't match many documents, but the ones that you do match will be exactly what you're looking for, and the score differences between the hits will be small. The opposite end, recall, is when you're casting a very wide net when looking for matches. Your requirements will be much looser and will result in more hits. Those hits will still match your query, but because your query is less strict, the hits will match to varying degrees, and the score differences between them will be much larger. When crafting a full text query, you'll need to find the right balance between precision and recall. Now, what defines the right balance is completely up to you and will be based on what your users expect to see. There are lots of tools available in Elasticsearch to help you strike that balance, wherever it may be. We'll cover some of these tools, but I encourage you to read through all the nitty-gritty in the Elasticsearch documentation, and other material about full text search, relevance, and information retrieval as a whole. The rabbit hole goes very deep.
Tags
  • Elasticsearch
  • full text search
  • text analysis
  • tokenization
  • precision
  • recall
  • analyzer
  • query parameters
  • match query
  • phrase query