Design a Search Engine
Major components of search engine are
Crawler ← → Indexer ← → Query processing Engine
before you run indexer in the crawled documents you need to
- clean the noise (remove hyperlink and other infomation which is not needed)
- remove the stop words (The,is,a,….) They convey no information and are bad for the index (blow up memory)
- Do stemming of the works : eg quickly and quick should be represented as same
Indexer: (is the heart of any search system). See Lucene architecture for more detail
Will index all the document and create a inverted index for fast searching
Inverted index Data structure in C++ this can be
unordered_map<string,list<int,int>> invertedIndex;
the key is all the terms (strings) in the your documents.
The list represent two integer .<doc_id,position_in_doc>. These list are sorted by document id. This helps later in merge operation ( phrase queries)
How to handle suffix queries? eg “las*”
in that case you need to have the inverted index like. Here you can apply the binary search in the map keys.
map<string,list<int,int>> inverted-Index.
Distributed index?
Mostly the index stay on memory (RAM) for fast performance. if the collection is too large (internet) you need to have this index in multiple systems (shard) and merge the results after the query is done in each index shard. A lot of effort and energy is poured in to compact the data structures so more index is packed in the memory of one shard.
Another interesting fact is how to snapshot the shard of a index? we need a way to recover from induvial shard failure
Incremental indexing is another interesting features of a indexers. How do you accumulate changes done in a collection of documents without running the full indexing again?
An extremely good short course to understand all the indexing related issue are explained in the following videos