In search engine indexing, a body of text is usually processed before it's indexed. A common example is stemming, where words are reduced to their root form (plurals are dropped, tense is normalized). Other examples are lemmatization, soundex transformation, casing, etc.
So, this sentence…
My name is Bond, James Bond
…might be indexed as the following tokens.
A basic principle of information retrieval is that this only works if you do the transformation on the query as well.
If I were to search for “James”, it would not match because that token was transformed to “jame”. My search can only work reliably if the exact same set of transformations happens on the query as well, so my query for “James” would be similarly transformed into “jame” before any token matching was attempted.
(I liken this to algebra, where you have to do the same calculations on both sides of the equals sign.)
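To make the idea concrete, here is a minimal sketch. The `stem` function below is a deliberately naive stand-in for a real stemmer (a production system would use something like the Porter stemmer); the point is only that one shared `analyze` pipeline must run on both the document and the query.

```python
# Minimal sketch: the same analysis pipeline must run at index time
# and at query time, or tokens will never match.

def stem(token: str) -> str:
    """Toy stemmer: lowercase, then strip a trailing 's' (illustration only)."""
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def analyze(text: str) -> list[str]:
    """The shared transformation applied to both documents and queries."""
    return [stem(t) for t in text.replace(",", " ").split()]

# Index time: the document is analyzed before its tokens are stored.
index = set(analyze("My name is Bond, James Bond"))

# A raw-token lookup misses, because "James" was stored as "jame"...
assert "James" not in index

# ...but running the query through the SAME analyzer matches.
assert all(token in index for token in analyze("James"))
```

Skipping the query-side transformation is exactly the failure mode described above: the stored token is “jame”, the untransformed query token is “James”, and they never compare equal.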
Is there a name for this principle of having to transform both the indexed content and the query content equally before attempting to compare them? I'm trying to explain this concept to some students, and it would be helpful if there were an existing term for it.