Data Modelling: Relational DB OR No-SQL.
Think do you need to make a SQL database vs No-Sql vs hybrid? How do you choose choose one vs other? Here are some key points which can help you decide
- For Relational DB you have to think schema upfront and it become harder to change it later. If your application have requirement which are fixed (think designing a blog with fixed format) you can go for Relational DB. But if your schema is hard to predict upfront and you expect changes overtime go for No-SQL based system which have flexible schema
- SQL will be needed if you need multi row transaction support. Think of implementing a ticketing service like ticket master . A multi-row transaction support is needed when you have to update a multiple rows (within a table or in multiple table) atomically. Notice that SQL have good support of transactions using BEGIN TRANSACTION <….> END TRANSACTION. No-SQL have limited transaction support (atleast does not have full support).Since NoSQL databse shared the tables (collection) based on the row-key so it is quiet possible that one row of a table is in node X while other in node Y. This is the reason why no-sql DB does not support multirow transaction (as it have to involve multiple machine to synchronize which kills the performance) Also notice you can’t do application level locking to achieve similar effect of transaction level support. Since you might be running multiple application servers so locking at app level will not make transaction atomic
- if your data is structured it is OK to use SQL. For semi-structured or unstructured data (think product catalog) it is not a good idea to store it in SQL (although you can model it). In these cases you can still store data in Relational table either by doing sparse rows or making a table of key-value. Performance (read/write/query) will not be good for relationals DB in this case. Also notice even for storing the sparse data in columns you need to know what key-value pair will be there upfront (during the DB design phase). you will make each key as a column.
- Time series data which have high velocity usually gets better handled (fast) in No-SQl database. See below for detailed description. Also many No-SQL database have support for range query [from_ts1, to_ts2]
- Relational DB(PostgreSQL) does have support of full text and you can also put index on text field. so for website like blog and all it is still OK to use SQL as long as your blog fields are static (not changing). How ever if you are desigining a customizable blog than No-Sql might be better
- if you decide to use SQL you can use EXPLAIN ANALYZE <QUERY> to profile the query (prostgreSQL). This will help you in checking if you need secondary index or not
Based on the above guidance you should decide if you need SQL or No-Sql DB. Notice that in both SQL and no-SQL db you can always handle table like structure so the table design is kind of similar
Learn how to model things in SQL? How to design schema?
Following video give some rational about designing a SQL schema (designing Open table).
- Identify objects (DB tables): Any real word entity is usually an table: eg user , restaurant, table_in_restaurant. Software artifacts will also be a table like blog_post, picture_table. User activity also become a table like: reservation_table (opentable) , like_table (FB)
- Once you have identify all the tables identify the realation between them. identify if it is 1–1 or 1-many or many-many relation.
DB (Relational) schema design for a blog post system
Notice user,post,comment,category are all table. They are either a realword entity or a object which exist in software world. Now can you figure out relationship between these table?
Same schema an be done for NO-Sql (Wide-column) system:
Notice we have 1–many relation between post and comment, so comment can be embedded for the row-key. max row size 32MB.
You can have secondary index for any column you like. Row key always primary index and you can make it composite. You can do range scan in the row key. so make row key <fixed_attribute>+something_for_range_scan
Make row-key such that it is rather random when hashed. This is how No-Sql system shards the table.
Modeling 1-Many relationship in No-Sql (wide column) store
Modelling Many to many relationship
In sql system you need a third table (user_book_rating) to represent many to many relationship between user and book.
In No-Sql you are replicating the data in both user and book table
Read more about the schema design in No-Sql
Another Beautiful must watch video about No-SQl modelling benefit
Following video gives basic introduction to schema design in mongo-db. it also highlight the following facts
- Sql (relational DB ) based system were designed i`1n 1970 when disk were expensive so one of the goal is to minimize the repetition of infomation. This constraint is not true currenty
- SQL database schema is rigid and very hard to change once you design it. Mongo-DB you can change the schema on the fly so it is very flexible. You should validate your schema in your application code to have a consistent schema.
- Similar to SQL You can put multiple index in the column of the mongodb (ie keys of the document). Of couse these comes with a cost of write or update performance
- mongoDB Document is equivalent to a row in a table. Collection is equavelent to table. Max size of collection is ²⁶⁴ (which is huge). Max size of a document is 16MB so if you are trying to embed a things which grow is size (Size not known at the beginning) its a good idea to use reference (similar to foriegn key in sql world) rather than embeddeding
- 1–1 relation can be easily manage by doing embedding
- 1-many relation can also be handled by doing embedding
- Make sure the things which you are embedding does not grow too large (16 MB) limit so in a way only embeded things where you know the max-size upfront
- Many to Many relationship can be handled by embedding the data in both the documents. Thnks customer and books relation. 1 customer can read many books and A book can be read by many customer. embed both ways
- Write concern is mongo db is a way to ensure that your data is durable in multiple sources before you get an OK signal from the mongo-db. Ofcourse it comes with a extra latency cost. So do it when you have data you want to be 100% sure to have a replication before OK is send by the mongo-db. This is tradeoff between your availablity of DB and speed of write.
Following video gives highlight about the schema flexibility given by No-SQL system. Also they are more scalable as they autoshard based on row-key. Design your rowkey carefully as it will be used for determining which shard the row will go. Also row-key are automatically index and searches are fast in it. Design row key : <SENSOR_ID>_timestamp. and then you can get all the rows associated with sensor_id between the time range [ range scan ]
Detailed blog for Sql and No-Sql dara modelling
Wide column oriented DB schema design
Caching:
Memcached [AP system]
Uses consistent hashing at client side to figure out which bucket your key lies. Every client have mapping from bucket → IP+PORT of memcached server. Memcached do not do replication. It assumes its just a cache and its OK to loose some buckets (nodes)
Here are some ways to connect to the memcached servers.
var memcached = new Memcached({ '192.168.0.102:11211': 1, '192.168.0.103:11211': 2, '192.168.0.104:11211': 1 });var memcached = new Memcached([ '192.168.0.102:11211', '192.168.0.103:11211', '192.168.0.104:11211' ]);var memcached = new Memcached('192.168.0.102:11211');
Configure options for the failure of server. Failed server will be removed from the list of server and the failOverServers will take over.
var memcached = new Memcached('localhost:11211', {retries:10,retry:10000,remove:true,failOverServers:['192.168.0.103:11211']});
Above code will work OK as long as you have 1 client (application server) and multiple memcached server. If you have multiple client and multiple servers some might see the server down and remove that server from the list, while other will keep using from that server.
How to deal with memcached insistencies
Performace benchmark for databases
Time Series modelling
Key insight:
- Data is immutable
- We need to have query like give me all value associated with Key X between Tstart and Tend
Sql for time series database
SQL traditional way to model:
TS;KEY;value
OpenTSTB format
sensor_id;time_window_start;time_offset1;time_offset2…..;BLOB( for backfilling)
Usually ,Single row fetch is much faster than multi row fetch in DB. here we are trying to compact / group lot of info in single row.
4 Node mapr ==> 100 M insertion / sec with linear scaling
Ted Dunning (Mapr) for times series database
Common architecure
Sharding + read replica for scale