September 29, 2012

ElasticSearch Notes

Full text search is a vast and deep subject, and the more I learn, the more I realize how little I know. Here are some ElasticSearch notes.

Analyzers, tokenizers, and filters can be defined under _settings; _mapping can then reference these definitions to control how fields are analyzed at index time and at search time.

For reference, here is the "contains" search setup by Mikko:


$ curl -X PUT "http://localhost:9200/catalog" -d '{
  "mappings" : {
    "product" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "search_analyzer" : "str_search_analyzer",
          "index_analyzer" : "str_index_analyzer"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "str_search_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        },
        "str_index_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase", "substring"]
        }
      },
      "filter" : {
        "substring" : {
          "type" : "nGram",
          "min_gram" : 1,
          "max_gram" : 20
        }
      }
    }
  }
}'; echo

Explanation
  1. We specify our mapping in the mappings block. The product block specifies that we want to apply the mapping to the product type, and the properties block lets us set properties for its fields.
  2. For the title field of the product type we set the field type to string. We also set two analyzers for it: the index_analyzer, used when a new product is indexed, and the search_analyzer, used when searching for products. These need to be different: we want to apply the nGram tokenization only when indexing, keeping the user's search query as entered. Had we applied it in the search_analyzer as well, a query like wire would have been tokenized into substrings too, so searching for wire would also have matched any of w, wi, wir, wire, i, ir, ire, r, re.
  3. Next we define the analyzers str_search_analyzer and str_index_analyzer in the settings block. Both use the keyword tokenizer, which leaves the input string untouched, followed by the lowercase filter. This allows a search string like LOgI to match logi, and Logitech to be indexed as logitech.
  4. For the str_index_analyzer we also apply the substring filter, defined below. This is the one that does all the magic: it tokenizes the lowercased input string into substrings.
  5. Last but not least, we define our own substring filter, which uses the nGram filter with min_gram set to 1 and max_gram set to 20. This means the longest substring we can search for is 20 characters.
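
To sanity check the "contains" behavior, we can index a product and then search for a fragment of its title (the document id and title value below are made up):

$ curl -X PUT "http://localhost:9200/catalog/product/1" -d '{ "title" : "Logitech Wireless Mouse" }'; echo
$ curl -X POST "http://localhost:9200/catalog/_refresh"; echo
$ curl -X GET "http://localhost:9200/catalog/product/_search?q=title:gitech"; echo

The last query should return the document, since gitech is one of the substrings produced at index time.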


Stopwords are the words a search engine doesn't care to index (the usual English function words, for example); in ElasticSearch the list can be customized.
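
A minimal sketch of a custom stopword list on a standard-type analyzer (index and analyzer names here are made up):

$ curl -X PUT "http://localhost:9200/myindex" -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_stop_analyzer" : {
          "type" : "standard",
          "stopwords" : ["the", "a", "an", "of"]
        }
      }
    }
  }
}'; echo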

nGram is a tokenization technique where n is the length of the produced tokens; it is well worth reading up on, especially the part where n-grams originate from the Markov model in statistics, which is quite interesting.
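
Assuming the catalog index created above exists, the index-scoped _analyze API shows what the nGram filter actually emits:

$ curl -X GET "http://localhost:9200/catalog/_analyze?analyzer=str_index_analyzer" -d 'wire'; echo

This should list every substring of wire up to 20 characters: w, wi, wir, wire, i, ir, ire, r, re.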

edgeNGram is an nGram anchored at the edge of the input; it can be used to build a site's autocomplete, and like nGram it lets you specify the range of n.
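
A minimal autocomplete-oriented sketch (index, analyzer, and filter names are made up):

$ curl -X PUT "http://localhost:9200/autocomplete" -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "prefix_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase", "front_edge"]
        }
      },
      "filter" : {
        "front_edge" : {
          "type" : "edgeNGram",
          "min_gram" : 1,
          "max_gram" : 20,
          "side" : "front"
        }
      }
    }
  }
}'; echo

Indexing wire through this analyzer produces only w, wi, wir, wire, which is exactly the prefix set an autocomplete needs.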

The keyword tokenizer emits the entire input as a single token, unaffected by things like whitespace or stopwords.

stop, strictly a token filter rather than a tokenizer, simply removes stopwords.

The whitespace tokenizer splits the input on whitespace.
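
These behaviors are easy to verify with the _analyze API (the stop filter here uses the default English stopword list):

$ curl -X GET "http://localhost:9200/_analyze?tokenizer=keyword" -d 'New York City'; echo
$ curl -X GET "http://localhost:9200/_analyze?tokenizer=whitespace" -d 'New York City'; echo
$ curl -X GET "http://localhost:9200/_analyze?tokenizer=whitespace&filters=stop" -d 'the city of New York'; echo

The first returns the single token New York City, the second returns New, York, City, and the third drops the and of.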

Stemming reduces related word forms such as car and cars to a common stem. Many stemming algorithms are available; the Porter stemmer is said to be on the aggressive side, with some words reduced all the way back to their roots, so much of the time we need to know which language we are processing before stemming can be effective.
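
For instance, Porter stemming can be tried directly through _analyze (the camelCase filter name follows the 0.x naming convention, so treat it as an assumption for other versions):

$ curl -X GET "http://localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,porterStem" -d 'ponies running easily'; echo

This yields stems like poni, run, easili, showing how far from dictionary words the Porter algorithm is willing to go.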

The snowball analyzer chains the standard tokenizer with the lowercase and stop filters plus a snowball filter; the snowball filter is a Snowball-generated stemmer whose language can be specified. We can also opt for a more harmless stemmer such as KStem, which mainly just normalizes singular/plural forms.
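
A sketch defining both styles side by side: a language-specific snowball analyzer and a milder KStem-based one (index and analyzer names are made up):

$ curl -X PUT "http://localhost:9200/stemming" -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "english_snowball" : {
          "type" : "snowball",
          "language" : "English"
        },
        "mild_stemmer" : {
          "tokenizer" : "standard",
          "filter" : ["lowercase", "kstem"]
        }
      }
    }
  }
}'; echo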

Generally speaking, stemming is a double-edged sword: even when the user asks precisely for what they want, the query still gets transformed by the stemmer, and the results carry no score penalty for it. My personal take is that title search may not need stemming at all, or only a gentle stemmer, especially for well-known brand or personal names, whereas full-text body search is where stemming is really needed.

Facets are faceted categories, essentially classifications. At search time they can be combined with tags to compute counts that drive suggested searches on a site; the nice property is that these suggestions are guaranteed to return results.
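
For example, a terms facet over a tags field (the field name is assumed) returns the counts needed to drive such suggestions:

$ curl -X GET "http://localhost:9200/catalog/_search" -d '{
  "query" : { "match_all" : {} },
  "facets" : {
    "popular_tags" : {
      "terms" : { "field" : "tags", "size" : 10 }
    }
  }
}'; echo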

ElasticSearch's default operator is 'OR', meaning a document matches as long as any one term matches; at search time this can be switched to the 'AND' most of us are used to via the default_operator parameter.
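
For example, with the URI request syntax:

$ curl -X GET "http://localhost:9200/catalog/_search?q=wireless+mouse&default_operator=AND"; echo

With AND, a document must contain both wireless and mouse to match.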

When searching multiple fields at once, scoring is easily skewed by the different fields; if you know which part the user is actually searching, it is best to query and score against that field only.
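
For example, a query_string query pinned to the title field (index and field names follow the earlier example):

$ curl -X GET "http://localhost:9200/catalog/_search" -d '{
  "query" : {
    "query_string" : {
      "default_field" : "title",
      "query" : "wireless mouse"
    }
  }
}'; echo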



