Full text search works very differently to a term-level query. At a high level, term-level queries run against the exact values stored in the field, while full text queries run against analyzed text and allow for complex text queries, providing a lot of functionality over how documents are matched. I'll explain what text analysis is and why it's performed.
We have an index containing product reviews. Here's an example from one of the reviews: "The dehumidifier helps dry my laundry really quickly. It's great looking, efficient, and I use the water from the tank to water my plants. It's quite noisy, though."
One difference between this sort of text and the type of text we've stored in keyword fields so far is the length of the text. Keyword fields are used to store shorter pieces of text where exact, or almost exact, matching is frequently done.
Paragraph-length text like this is, by nature, going to require a different set of tools for matching. If we stored the data in a keyword field and wanted to find reviews that mentioned drying laundry quickly, our options would be pretty limited.
We could use a wildcard query matching for `laundry*quickly`, but we'd also need to look for `quickly*laundry` in case someone had written "it quickly dries my laundry". We'd also need to make sure the query was case insensitive, which would make a slow query even slower.
Another option would be a regular expression, which again would perform very badly and be painful to write. And how could we deal with it when somebody wrote "it dries my laundry really quickly", "it dries my laundry real quick", or "my laundry has never gotten dry so quick"? This is the sort of problem full text search is designed to solve.
Here's how we could solve some of these issues using the default full text search options. To use full text search, we store the review in a text field, then write a match query. It doesn't matter which order we put the query terms in, and neither does the case make a difference.
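As a sketch of what that could look like (the index name `product_reviews` and field name `review` are my own assumptions, not from the lesson):

```
PUT product_reviews
{
  "mappings": {
    "properties": {
      "review": { "type": "text" }
    }
  }
}

POST product_reviews/_search
{
  "query": {
    "match": {
      "review": "Laundry Quickly"
    }
  }
}
```

Swapping the query to "quickly laundry", or changing the capitalization, would produce the same hit.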
How is running a match query over a text field able to do all of this? Two words: text analysis. When text is stored in a text field, it's not stored in its exact form. The analysis process, conducted by an analyzer, takes text and produces a stream of tokens, which are maintained in the inverted index.
A token is a fragment of the original text, usually an individual word or derived from an individual word. When you run a query, your search criteria go through the same analysis process, resulting in another stream of tokens. Elasticsearch then looks for matches where tokens from your search criteria are in the field you're matching against.
An analyzer has three components. The first step in analysis is done by character filters. Character filters work on individual characters in the source text, adding, modifying, or removing characters. This step is commonly used to replace characters or remove them entirely, if you wanted to take out formatting or HTML, for example. You're able to use multiple character filters in an analyzer, and they are executed in the order they're defined. Character filters aren't required, though.
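A quick way to see a character filter in action is the analyze API with the built-in `html_strip` character filter (the sample text here is my own, not from the lesson):

```
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<p>It&apos;s great looking</p>"
}
```

The response shows tokens produced after the HTML tags have been stripped and the entity decoded.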
Once all character filters have completed their work, or if there aren't any character filters in the analyzer, the resulting or original text is sent to the tokenizer. The tokenizer splits up the text into individual tokens. You must have exactly one tokenizer in the analyzer. The text is usually, but not always, split on word boundaries like spaces, full stops, and other punctuation. Another very important role of the tokenizer is to record the position of each token in the original text, such as the start and end offsets. This data is used for some query types, as well as for highlighting the search criteria in hits.
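You can see the tokenizer's work on its own by calling the analyze API with just a tokenizer (the sample text is assumed for illustration):

```
POST _analyze
{
  "tokenizer": "standard",
  "text": "It's quite noisy, though."
}
```

Each token in the response carries `start_offset`, `end_offset`, and `position` fields recorded by the tokenizer.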
The tokens produced by the tokenizer are sent to the final step: the token filters. Token filters add, modify, or remove tokens produced by the tokenizer. This is another optional step, and like character filters, multiple token filters can be used in the analyzer. Token filters typically perform tasks like converting tokens to lowercase; removing tokens you don't want, like very short, common, or rude words; and stemming, reducing a word to its root form. In the dehumidifier review, for example, the word "quickly" could be stemmed to just "quick".
There are several analyzers you can use in Elasticsearch, and you're even able to craft your own by selecting the character filters, tokenizer, and token filters you want. Text fields by default use the standard analyzer.
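As an illustration of crafting your own, a custom analyzer is defined in the index settings from those three kinds of components (all the names here, like `my_analyzer` and the index name, are made up for the example):

```
PUT product_reviews_custom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "review": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```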
We're able to see the output of the analysis process by using the analyze API, so let's try that with a product review's text. We make a POST request to `_analyze`, providing the analyzer we want to use and the text we want analyzed in the request body. The standard analyzer doesn't have any character filters. It tokenizes based on word boundaries using the standard tokenizer, then lowercases each token. It also has a stop token filter, which is disabled by default but can be used to remove tokens you don't want included, referred to as stop words. The output of the analyze API shows us all the tokens that will be included in the contents of the text field. This is sometimes referred to as a bag of words.
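The request might look like this, using the dehumidifier review text from earlier:

```
POST _analyze
{
  "analyzer": "standard",
  "text": "The dehumidifier helps dry my laundry really quickly."
}
```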
All the words in the original text are in the array of tokens, but they've been lowercased by the lowercase token filter. The start and end offsets and the position are also calculated for each token.
Using different search criteria shows that not all the terms in the search criteria need to be in the field value in order to get a match. Searching for "laundry quickly rabbit" still produces a hit, despite there being no mention of a rabbit in the review.
Adding a couple more reviews will give us a better picture of what's going on. Running that same query for "laundry quickly rabbit" now produces three hits: the original for the dehumidifier and the two new documents. Notice the value of `_score`, though. The rabbit review gets a higher score. All three search terms were matched in that document, two were matched in the dehumidifier review, and only one in the third document. Elasticsearch, or Lucene really, indicates this in the scores for the three hits.
What if we only wanted to match documents containing all three of those terms? We can tell the match query to do this using the operator parameter. This narrows down the results to just the rabbit review.
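A sketch of the match query with the operator parameter (index and field names are assumed for illustration):

```
POST product_reviews/_search
{
  "query": {
    "match": {
      "review": {
        "query": "laundry quickly rabbit",
        "operator": "and"
      }
    }
  }
}
```

The default operator is `or`, which is why the earlier query matched documents containing only some of the terms.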
There's another useful parameter in the match query: minimum_should_match. We've used this before in bool queries to specify how many should clauses must match in order for a document to be a hit. It can also be used with a match query, to specify how many of our query terms must match the text field for the document to be a hit. We can see this by changing the minimum_should_match value for this query and seeing how it affects the results.
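For example, requiring at least two of the three terms to match (again with assumed index and field names):

```
POST product_reviews/_search
{
  "query": {
    "match": {
      "review": {
        "query": "laundry quickly rabbit",
        "minimum_should_match": 2
      }
    }
  }
}
```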
It's clear from what we've seen here that the position of the terms in the hit doesn't matter. The query for "really laundry rabbit" with a minimum_should_match of 3 still produced a hit on the rabbit review, despite the field terms not being in the same order as the query terms.
We can look for tokens in order by searching for a phrase. A phrase is a sequence of terms in a specific order. When you search for a phrase, documents must have those same terms in the same order as the phrase. To search for a phrase, we use a match_phrase query. This query only matches the dehumidifier review, as that's the only one containing "really" and "quickly" in that order.
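The phrase search might look like this (index and field names assumed as before):

```
POST product_reviews/_search
{
  "query": {
    "match_phrase": {
      "review": "really quickly"
    }
  }
}
```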
The match_phrase query has a really useful parameter called slop, which you can use to broaden your search. It allows for a certain number of terms between the terms in the phrase. As an example, if we search for "water plants", nothing will hit. If we want to look a bit wider and allow a single term between the words "water" and "plants", we can add a slop parameter with a value of one. This finds the dehumidifier review, because the word "my" between "water" and "plants" is permitted by the slop.
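A sketch of the slop version of that query:

```
POST product_reviews/_search
{
  "query": {
    "match_phrase": {
      "review": {
        "query": "water plants",
        "slop": 1
      }
    }
  }
}
```

With `"slop": 1`, the phrase "water my plants" in the review is close enough to match.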
I've explained the high-level view of full text search and shown how you're able to use different queries and parameters to match documents based on your own requirements. This is only an introduction, though. There's a lot more to full text search, but you can get a long way with the tools I'm going to give you in this course.
There are two concepts I want to introduce now that you've seen some examples of searching for text: precision and recall. Precision and recall are two extremes of a query spectrum. At the precision end, you'd be very precise with your criteria, with very specific requirements for documents that need to match. Precise queries won't match many documents, but the ones that you do match will be exactly what you're looking for, and the score differences between the hits will be small. The opposite end, recall, is when you're casting a very wide net when looking for matches. Your requirements will be much looser and will result in more hits. Those hits will still match your query, but because your query is less strict, the hits will match to varying degrees, and the score differences between the hits will be much larger.
When crafting a full text query, you'll need to find the right balance between precision and recall. Now, what defines the right balance is completely up to you, and will be based on what your users expect to see. There are lots of tools available in Elasticsearch to help you strike that balance, wherever it may be.
We'll cover some of these tools, but I encourage you to read through all the nitty-gritty in the Elasticsearch documentation, and other material about full text search, relevance, and information retrieval as a whole. The rabbit hole goes very deep.