Elasticsearch text analysis and full text search - a quick introduction

00:10:03
https://www.youtube.com/watch?v=ajNfOPeWiAY

Summary

TL;DR: The video explains the differences between full text search and term level queries, walking through the text analysis process that transforms lengthy text into manageable tokens for easier matching in search queries. Using a product review, it demonstrates various aspects of search in Elasticsearch, including match queries and phrase queries, alongside parameters like slop, and introduces the concepts of precision and recall. Viewers are encouraged to explore the documentation for deeper insights into full text search and relevance.

Key takeaways

  • 🔍 Full text search uses analyzed text for complex queries.
  • 📜 Text analysis transforms text into tokens for better matching.
  • ⚙️ Analyzers have character filters, tokenizers, and token filters.
  • ✨ Tokenization splits text based on word boundaries.
  • 📏 Precision demands exact matches but yields fewer results.
  • 🔄 Recall allows broader matches but may reduce relevance.
  • 🔧 Elasticsearch provides tools for balancing precision and recall.
  • 📖 Explore Elasticsearch documentation for in-depth knowledge.

Timeline

  • 00:00:00 - 00:10:03

    The video discusses the differences between full text search and term level queries, with a focus on text analysis for complex text queries. It highlights how full text search facilitates match queries without requiring exact term placement, which is critical for handling longer, analyzed texts like product reviews. Storing text in a text field means it is tokenized, enabling effective matching, as illustrated by a dehumidifier review that demonstrates how search criteria are converted into tokens for evaluation against the stored text.

Video Q&A

  • What is the main difference between full text search and term level queries?

    Full text search runs against analyzed text and allows for complex queries, while term level queries use exact field values.

  • What is text analysis?

    Text analysis is the process of transforming text into tokens for better matching in search queries.

  • What components does an analyzer have?

    An analyzer consists of character filters, a tokenizer, and token filters.

  • How does tokenization work?

    The tokenizer splits text into individual tokens based on specific boundaries, like spaces and punctuation.

  • What is precision in the context of searching?

    Precision refers to the accuracy of search results, typically resulting in fewer but highly relevant hits.

  • What is recall in searching?

    Recall involves broader search criteria, yielding more hits, but with varying degrees of relevance.

  • What tools does Elasticsearch provide for text search?

    Elasticsearch offers various query types and parameters to balance precision and recall in searches.

  • What is a phrase query?

    A phrase query requires the terms to appear in a specific order within the document.

Transcript (en)
  • 00:00:00 - 00:01:59

    Full text search works very differently to a term level query. At a high level, term level queries run against the exact values stored in the field, while full text queries run against analyzed text and allow for complex text queries, providing a lot of functionality over how documents are matched. I'll explain what text analysis is and why it's performed. We have an index containing product reviews; here's an example from one of the reviews: "The dehumidifier helps dry my laundry really quickly. It's great looking, efficient, and I use the water from the tank to water my plants. It's quite noisy though." One difference between this sort of text and the type of text we've stored in keyword fields so far is the length of the text. Keyword fields are used to store shorter pieces of text where exact or almost exact matching is frequently done. Paragraph-length text like this is by nature going to require a different set of tools for matching. If we stored the data in a keyword field and wanted to find reviews that mentioned drying laundry quickly, our options are pretty limited. We could use a wildcard query matching for "laundry" followed by "quickly", but we'd also need to look for "quickly" followed by "laundry" in case someone had written "it quickly dries my laundry". We'd also need to make sure the query was case insensitive, which would make a slow query even slower. Another option would be a regular expression, which again would perform very badly and be painful to write. And how could we deal with somebody writing "it dries my laundry really quickly", "it dries my laundry real quick", or "my laundry has never gotten dry so quick"? This is the sort of problem full text search is designed to solve. Here's how we could solve some of these issues using the default full text search options. To use full text search, we store the review in a text field, then write a match query. It doesn't matter which order we put the query terms in, and neither does the case make a difference.
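    The match query described here could be sketched as the following request body; the index and field names ("product_reviews", "review_text") are assumptions for illustration, not from the video.

    ```python
    # A minimal sketch of the match query described above. The field name
    # "review_text" is an assumption; send the body to
    # GET /product_reviews/_search (index name also assumed).

    def match_query(field: str, text: str) -> dict:
        """Build a match query request body."""
        return {"query": {"match": {field: text}}}

    # Term order and case don't matter for a match query: both of these
    # can hit "...helps dry my laundry really quickly..."
    body_a = match_query("review_text", "laundry quickly")
    body_b = match_query("review_text", "QUICKLY laundry")
    ```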
  • 00:01:59 - 00:02:43

    How is a match query over a text field able to do all of this? Two words: text analysis. When text is stored in a text field, it's not stored in its exact form. The analysis process, conducted by an analyzer, takes text and produces a stream of tokens, which are maintained in the inverted index. A token is a fragment of the original text, usually an individual word or derived from an individual word. When you run a query, your search criteria go through the same analysis process, resulting in another stream of tokens. Elasticsearch then looks for matches where tokens from your search criteria are in the field you're matching against.
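    The idea can be illustrated with a toy sketch (not Elasticsearch's actual implementation): analyze both the stored text and the search criteria into tokens, then check whether any query token appears among the document's tokens.

    ```python
    # Toy illustration of analyze-then-match, as described above.
    # This is NOT how Elasticsearch is implemented; it only mimics the idea.
    import re

    def toy_analyze(text: str) -> list[str]:
        # Stand-in for an analyzer: lowercase, then split on non-letters.
        return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

    doc_tokens = set(toy_analyze("The dehumidifier helps dry my laundry really quickly."))
    query_tokens = toy_analyze("LAUNDRY quickly")

    # Any query token found among the document's tokens produces a match,
    # regardless of term order or case.
    matches = any(tok in doc_tokens for tok in query_tokens)
    ```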
  • 00:02:43 - 00:04:36

    An analyzer has three components. The first step in analysis is done by character filters. Character filters work on individual characters in the source text, adding, modifying or removing characters. This step is commonly used to replace characters or remove them entirely, if you wanted to take out formatting or HTML, for example. You're able to use multiple character filters in an analyzer, and they are executed in the order they're defined. Character filters aren't required, though. Once all character filters have completed their work, or if there aren't any character filters in the analyzer, the resulting (or original) text is sent to the tokenizer. The tokenizer splits up text into individual tokens. You must have exactly one tokenizer in the analyzer. The text is usually, but not always, split on word boundaries like spaces, full stops and other punctuation. Another very important role of the tokenizer is to record the position of each token in the original text, such as the start and end offsets. This data is used for some query types, as well as for highlighting the search criteria in hits. The tokens produced by the tokenizer are sent to the final step: the token filters. Token filters add, modify or remove tokens produced by the tokenizer. This is another optional step, and like character filters, multiple token filters can be used in the analyzer. Token filters typically perform tasks like converting tokens to lowercase, removing tokens you don't want (like very short, common or rude words), and stemming: reducing a word to its root form. In the dehumidifier review, for example, the word "quickly" could be stemmed to just "quick". There are several analyzers you can use in Elasticsearch, and you're even able to craft your own by selecting the character filters, tokenizer and token filters you want.
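    A custom analyzer assembled from built-in components could be sketched as index settings like the following; the analyzer name "review_analyzer" and the specific component choices are illustrative assumptions, not from the video.

    ```python
    # A sketch of index settings defining a custom analyzer from built-in
    # Elasticsearch components, following the three-part structure above.
    # Name and component choices are assumptions for illustration.

    def custom_analyzer_settings() -> dict:
        """Settings body for creating an index with a custom analyzer."""
        return {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "review_analyzer": {
                            "type": "custom",
                            "char_filter": ["html_strip"],    # character filters (optional, ordered)
                            "tokenizer": "standard",          # exactly one tokenizer
                            "filter": ["lowercase", "stop"],  # token filters (optional, ordered)
                        }
                    }
                }
            }
        }
    ```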
  • 00:04:36 - 00:05:36

    Text fields by default use the standard analyzer. We're able to see the output of the analysis process by using the analyze API, so let's try that with a product review's text. We make a POST request to _analyze, providing the analyzer we want to use and the text we want analyzed in the request body. The standard analyzer doesn't have any character filters; it tokenizes based on word boundaries using the standard tokenizer, then lowercases each token. It also has a stop token filter, which is disabled by default but can be used to remove tokens you don't want included, referred to as stop words. The output of the analyze API shows us all the tokens that will be included in the contents of the text field. This is sometimes referred to as a bag of words. All the words in the original text are in the array of tokens, but they've been lowercased by the lowercase token filter. The start and end offsets and the position are also calculated for each token.
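    The _analyze request described here could be sketched as follows; the review text is the example from the video, while the surrounding helper is an assumption for illustration.

    ```python
    # A sketch of the request body for POST /_analyze, as described above.

    def analyze_request(analyzer: str, text: str) -> dict:
        """Request body for the analyze API."""
        return {"analyzer": analyzer, "text": text}

    body = analyze_request(
        "standard",
        "The dehumidifier helps dry my laundry really quickly.",
    )
    # The response lists each token along with its start_offset,
    # end_offset and position.
    ```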
  • 00:05:36 - 00:06:25

    Using different search criteria shows that not all the terms in the search criteria need to be in the field value in order to get a match. Searching for "laundry quickly rabbit" still produces a hit, despite there being no mention of a rabbit in the review. Adding a couple more reviews will give us a better picture of what's going on. Running that same query for "laundry quickly rabbit" now produces three hits: the original for the dehumidifier and the two new documents. Notice the value of _score, though. The rabbit review gets a higher score: all three search terms were matched in that document, two were matched in the dehumidifier review, and only one in the note on full text search. Elasticsearch, or Lucene really, indicates this in the scores for the three hits.
  • 00:06:25 - 00:06:38

    What if we only wanted to match documents containing all three of those terms? We can tell the match query to do this using the operator parameter. This narrows down the results to just the rabbit review.
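    The operator parameter described here could be sketched as the following request body; the field name "review_text" is an assumption for illustration.

    ```python
    # A sketch of a match query with "operator": "and", as described above:
    # every query term must match for a document to be a hit.

    def match_all_terms(field: str, text: str) -> dict:
        return {"query": {"match": {field: {"query": text, "operator": "and"}}}}

    # Only documents containing all of "laundry", "quickly" and "rabbit"
    # are hits, narrowing the results to the rabbit review.
    body = match_all_terms("review_text", "laundry quickly rabbit")
    ```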
  • 00:06:38 - 00:07:05

    There's another useful parameter in the match query: minimum_should_match. We've used this before in bool queries, to specify how many should clauses must match in order for a document to be a hit. It can also be used with a match query, to specify how many of our query terms must match the text field for the document to be a hit. We can see this by changing the minimum_should_match value for this query and seeing how it affects the results.
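    This parameter could be sketched as follows; the field name is again an assumption. With a value of 2, at least two of the three query terms must match for a document to be a hit.

    ```python
    # A sketch of minimum_should_match on a match query, as described above.

    def match_min_terms(field: str, text: str, minimum: int) -> dict:
        return {
            "query": {
                "match": {
                    field: {"query": text, "minimum_should_match": minimum}
                }
            }
        }

    # Require at least 2 of the 3 terms to match.
    body = match_min_terms("review_text", "laundry quickly rabbit", 2)
    ```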
  • 00:07:05 - 00:07:49

    It's clear from what we've seen here that the position of the terms in the hit doesn't matter. The query for "really laundry rabbit" with a minimum_should_match of 3 still produced a hit on the rabbit review, despite the field terms not being in the same order as the query terms. We can look for tokens in order by searching for a phrase. A phrase is a sequence of terms in a specific order. When you search for a phrase, documents must have those same terms in the same order as the phrase. To search for a phrase, we use a match phrase query. This query only matches the dehumidifier review, as that's the only one containing "really" and "quickly" in that order.
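    The match phrase query described here could be sketched as the following request body (field name assumed): only documents containing "really" immediately followed by "quickly" will match.

    ```python
    # A sketch of the match phrase query described above.

    def phrase_query(field: str, phrase: str) -> dict:
        return {"query": {"match_phrase": {field: phrase}}}

    # Matches "...my laundry really quickly..." but not a document where
    # "really" and "quickly" appear separately or in the other order.
    body = phrase_query("review_text", "really quickly")
    ```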
  • 00:07:49 - 00:08:21

    The match phrase query has a really useful parameter called slop, which you can use to broaden your search. It allows for a certain number of terms between the terms in the phrase. As an example, if we search for "water plants", nothing will hit. If we want to look a bit wider and allow a single term between the words "water" and "plants", we can add a slop parameter with a value of one. This finds the dehumidifier review, because the word "my" between "water" and "plants" is permitted by the slop.
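    The slop parameter described here could be sketched as follows (field name assumed): a slop of 1 permits one extra term between "water" and "plants", so "water my plants" in the review now matches.

    ```python
    # A sketch of a match phrase query with slop, as described above.

    def phrase_query_with_slop(field: str, phrase: str, slop: int) -> dict:
        return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}

    # slop=1 lets "water my plants" match the phrase "water plants".
    body = phrase_query_with_slop("review_text", "water plants", 1)
    ```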
  • 00:08:21 - 00:10:03

    I've explained the high-level view of full text search and shown how you're able to use different queries and parameters to match documents based on your own requirements. This is only an introduction, though; there's a lot more to full text search, but you can get a long way with the tools I'm going to give you in this course. There are two concepts I want to introduce now that you've seen some examples of searching for text: precision and recall. Precision and recall are two extremes of a query spectrum. At the precision end, you'd be very precise with your criteria, with very specific requirements for documents that need to match. Precise queries won't match many documents, but the ones that you do match will be exactly what you're looking for, and the score differences between the hits will be small. The opposite end, recall, is when you're casting a very wide net when looking for matches. Your requirements will be much looser and will result in more hits. Those hits will still match your query, but because your query is less strict, the hits will match to varying degrees, and the score differences between them will be much larger. When crafting a full text query, you'll need to find the right balance between precision and recall. Now, what defines the right balance is completely up to you and will be based on what your users expect to see. There are lots of tools available in Elasticsearch to help you strike that balance, wherever it may be. We'll cover some of these tools, but I encourage you to read through all the nitty-gritty in the Elasticsearch documentation, and other material about full text search, relevance, and information retrieval as a whole. The rabbit hole goes very deep.
Tags
  • Elasticsearch
  • full text search
  • text analysis
  • tokenization
  • precision
  • recall
  • analyzer
  • query parameters
  • match query
  • phrase query