Why and when are distributed systems used?
Distributed systems emerged from the need to scale computation and storage for large-scale data processing.
We should always frame a software system as having two distinct components: compute (usually done on CPU or GPU, plus RAM) and storage (hard disk, SSD, etc.).
As a software architect, you should consider using a distributed system only if at least one of the following holds true:
- After applying program optimization techniques (say, multi-threading or algorithmic optimization), processing is still slower than expected, OR a rough calculation shows that a single machine can't handle the compute within a reasonable time.
- Data sizes exceed what can feasibly be stored on a single machine. The threshold depends on the data and the use case. If the data is in the TB range, I think it is worth considering distributed data storage.
Keep in mind that data growth, and the rate of that growth, is an important aspect to consider. Data based on user clicks and activity is likely to grow much faster than user registration data. Given the same amount of initial data in your system, which one is more likely to hit capacity limits first: user registration data or user activity (click/browsing) data?
Now let's ask: what is the key enabler of distributed computing?
Divide and conquer (then merge). We should be able to divide the task or data, and then merge the results from each node to get the final result. As long as you can think of a way to divide and conquer the task at hand, you can implement it using a distributed system. This strategy is usually application independent, and hence the standard libraries and frameworks for distributed systems implement it.
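To make the divide, process, and merge steps concrete, here is a minimal single-process sketch. The function names (divide, process_chunk, merge, distributed_sum) are my own illustrations, not part of any framework, and the per-chunk work runs in a plain loop here; in a real distributed system each chunk would be sent to a separate node.

```python
from functools import reduce


def divide(items, num_parts):
    """Split items into num_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Work done independently on one (hypothetical) node: a partial sum."""
    return sum(chunk)


def merge(partials):
    """Combine the per-node results into the final answer."""
    return reduce(lambda a, b: a + b, partials, 0)


def distributed_sum(items, num_nodes):
    chunks = divide(items, num_nodes)
    # Assumption: each call below could run on a different machine.
    partials = [process_chunk(chunk) for chunk in chunks]
    return merge(partials)


print(distributed_sum(list(range(101)), 4))
```

The key property is that process_chunk never needs to see another chunk, which is what makes the work safe to spread across nodes.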
Let's turn our attention to the second pillar: distributed data storage.
Q) How do we build a storage system with 10 TB of capacity if we have many machines, each equipped with only a 1 TB hard disk?
Expand horizontally (and not vertically, by adding capacity to a single system)! Instead of getting a single system with a 10 TB hard drive, build one out of smaller 1 TB systems. This has several advantages:
- Economy (systems with smaller disks are cheaper).
- It distributes data so that it resides on the same machine/node, or at least the same rack, where it is being processed.
- Higher IO throughput can be achieved, as IO tasks can be done in parallel.
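As a rough illustration of horizontal expansion, the sketch below works out how many 1 TB nodes a 10 TB store needs and spreads data blocks across them. Round-robin placement is my own simplifying assumption, chosen because it is easy to follow; it ignores replication and is not how any particular storage system necessarily places data. The names nodes_needed and place_blocks are hypothetical.

```python
import math


def nodes_needed(total_tb, disk_tb):
    """Smallest number of nodes whose disks can hold total_tb of data."""
    return math.ceil(total_tb / disk_tb)


def place_blocks(num_blocks, num_nodes):
    """Round-robin placement: block i is stored on node i % num_nodes."""
    placement = {node: [] for node in range(num_nodes)}
    for block in range(num_blocks):
        placement[block % num_nodes].append(block)
    return placement


num_nodes = nodes_needed(total_tb=10, disk_tb=1)
placement = place_blocks(num_blocks=25, num_nodes=num_nodes)
print(num_nodes, placement[0], placement[9])
```

Because every node holds a roughly equal share of the blocks, all disks can be read at the same time, which is where the parallel IO advantage comes from.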
Let's take a classic, concrete example.
Given “1 million documents, each of which has 10 MB of text, your job is to create a word frequency count of the corpus.” This kind of problem is a classic, canonical one for search engines.
Q) Do you think this can be done by a single machine with 16 CPUs in less than 10 minutes?
Let's try to estimate: 1 million (files) x 10 MB (per file) = 10 TB of total data.
How long do you think it will take to load this into memory from a fast SSD?
Assume a fast SSD can read about 1 GB per second. Then 10 TB = 10 x 1000 seconds = 10,000 seconds to load, or about 167 minutes. This does not include any processing time; it is just the time to load the data with state-of-the-art SSDs. I think you get the point. At some point it becomes impractical to get such a large-scale task done on a single computer! Hence we need two or more computers working together toward a common goal, forming a distributed computer.
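The same back-of-the-envelope estimate can be written as a few lines of code, which makes it easy to change the assumptions. This is only a sketch: it uses decimal units (1 GB = 1000 MB), takes the 1 GB per second read rate from the text as given, and assumes reads scale perfectly across machines with no network or coordination overhead. The function names are my own.

```python
import math

MB_PER_GB = 1000  # decimal units, matching the rough math in the text


def total_size_gb(num_files, mb_per_file):
    return num_files * mb_per_file / MB_PER_GB


def load_time_seconds(size_gb, read_gb_per_s, machines=1):
    """Time to read size_gb when `machines` disks read in parallel."""
    return size_gb / (read_gb_per_s * machines)


def machines_needed(size_gb, read_gb_per_s, budget_seconds):
    """Smallest machine count whose combined reads fit within the budget."""
    return math.ceil(size_gb / (read_gb_per_s * budget_seconds))


size_gb = total_size_gb(1_000_000, 10)       # 10,000 GB = 10 TB
single = load_time_seconds(size_gb, 1.0)     # one SSD at 1 GB/s
print(f"single machine: {single / 60:.0f} minutes")
print("machines for a 10-minute load:", machines_needed(size_gb, 1.0, 600))
```

Under these idealized assumptions, a single machine needs about 167 minutes just to read the data, while roughly 17 machines reading in parallel would fit the load into the 10-minute budget.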
One aspect to notice in the above problem is the clear division of work into smaller chunks. You can imagine a small set of files being processed by a single machine (using multi-threading), so that the entire set of 1 million files is distributed across N machines and the partial results are later combined to get the final result. Hence this type of problem is very well suited to distributed computing (we will talk more about it in the map-reduce paradigm).
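Here is a small sketch of how the word frequency problem splits into per-machine work plus a final merge. Everything runs in one process, so the "nodes" are simulated with a loop, and the tokenization rule (lowercase runs of letters, digits, and apostrophes) is my own simplifying assumption. The function names are hypothetical, and this is not the API of any map-reduce framework.

```python
import re
from collections import Counter


def count_words(text):
    """Count words in one document (simplified tokenization)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def count_on_node(documents):
    """Work done by one (simulated) machine on its share of the files."""
    counts = Counter()
    for doc in documents:
        counts.update(count_words(doc))
    return counts


def merge_counts(partials):
    """Combine the per-machine counts into corpus-wide counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_frequency(documents, num_nodes):
    # Give every num_nodes-th document to the same node.
    shards = [documents[i::num_nodes] for i in range(num_nodes)]
    # Assumption: each shard would be processed on a separate machine.
    partials = [count_on_node(shard) for shard in shards]
    return merge_counts(partials)


docs = ["the cat sat", "The dog sat", "a cat"]
print(word_frequency(docs, num_nodes=2))
```

Since adding counts is order independent, the final result is the same no matter how many nodes the documents are spread across.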
Q) Is there any high-level guidance on when to use a distributed system vs. a single machine?
Here is my rule of thumb for deciding which type of system might be needed, based on the size of the data:
- In GBs: If the performance goals are not too aggressive, a single machine might do the job with careful optimization!
- Low TBs: Small-scale distributed systems or high-performance data warehousing are needed to achieve decent performance.
- High TBs: Small-to-medium-scale distributed systems are needed to achieve decent performance.
- In PBs: Large-scale distributed systems are needed to achieve decent performance.
Questions to ponder:
Q1) What are the key challenges in distributed storage?
Q2) Can you outline the difference between multitasking, multiprocessing, and multithreading?
This blog is part of the distributed systems series here. I highly encourage you to read the content in order to gain maximal understanding.
Further reading: