Design a Search Engine

Deepak Pant
2 min readDec 13, 2020

Major components of search engine are

Crawler ← → Indexer ← → Query processing Engine

before you run indexer in the crawled documents you need to

  1. clean the noise (remove hyperlink and other infomation which is not needed)
  2. remove the stop words (The,is,a,….) They convey no information and are bad for the index (blow up memory)
  3. Do stemming of the works : eg quickly and quick should be represented as same

Indexer: (is the heart of any search system). See Lucene architecture for more detail

Will index all the document and create a inverted index for fast searching

Inverted index Data structure in C++ this can be

unordered_map<string,list<int,int>> invertedIndex;

the key is all the terms (strings) in the your documents.

The list represent two integer .<doc_id,position_in_doc>. These list are sorted by document id. This helps later in merge operation ( phrase queries)

How to handle suffix queries? eg “las*”

in that case you need to have the inverted index like. Here you can apply the binary search in the map keys.

map<string,list<int,int>> inverted-Index.

Distributed index?

Mostly the index stay on memory (RAM) for fast performance. if the collection is too large (internet) you need to have this index in multiple systems (shard) and merge the results after the query is done in each index shard. A lot of effort and energy is poured in to compact the data structures so more index is packed in the memory of one shard.

Another interesting fact is how to snapshot the shard of a index? we need a way to recover from induvial shard failure

Incremental indexing is another interesting features of a indexers. How do you accumulate changes done in a collection of documents without running the full indexing again?

An extremely good short course to understand all the indexing related issue are explained in the following videos

(165) IR7 Inverted Indexing — YouTube

--

--