Data Modelling: Relational DB OR No-SQL.

8 min readSep 30, 2019

Think do you need to make a SQL database vs No-Sql vs hybrid? How do you choose choose one vs other? Here are some key points which can help you decide

For Relational DB you have to think schema upfront and it become harder to change it later. If your application have requirement which are fixed (think designing a blog with fixed format) you can go for Relational DB. But if your schema is hard to predict upfront and you expect changes overtime go for No-SQL based system which have flexible schema
SQL will be needed if you need multi row transaction support. Think of implementing a ticketing service like ticket master . A multi-row transaction support is needed when you have to update a multiple rows (within a table or in multiple table) atomically. Notice that SQL have good support of transactions using BEGIN TRANSACTION <….> END TRANSACTION. No-SQL have limited transaction support (atleast does not have full support).Since NoSQL databse shared the tables (collection) based on the row-key so it is quiet possible that one row of a table is in node X while other in node Y. This is the reason why no-sql DB does not support multirow transaction (as it have to involve multiple machine to synchronize which kills the performance) Also notice you can’t do application level locking to achieve similar effect of transaction level support. Since you might be running multiple application servers so locking at app level will not make transaction atomic
if your data is structured it is OK to use SQL. For semi-structured or unstructured data (think product catalog) it is not a good idea to store it in SQL (although you can model it). In these cases you can still store data in Relational table either by doing sparse rows or making a table of key-value. Performance (read/write/query) will not be good for relationals DB in this case. Also notice even for storing the sparse data in columns you need to know what key-value pair will be there upfront (during the DB design phase). you will make each key as a column.
Time series data which have high velocity usually gets better handled (fast) in No-SQl database. See below for detailed description. Also many No-SQL database have support for range query [from_ts1, to_ts2]
Relational DB(PostgreSQL) does have support of full text and you can also put index on text field. so for website like blog and all it is still OK to use SQL as long as your blog fields are static (not changing). How ever if you are desigining a customizable blog than No-Sql might be better
if you decide to use SQL you can use EXPLAIN ANALYZE <QUERY> to profile the query (prostgreSQL). This will help you in checking if you need secondary index or not

Based on the above guidance you should decide if you need SQL or No-Sql DB. Notice that in both SQL and no-SQL db you can always handle table like structure so the table design is kind of similar

Learn how to model things in SQL? How to design schema?

Following video give some rational about designing a SQL schema (designing Open table).

sql schema design for opentable

Identify objects (DB tables): Any real word entity is usually an table: eg user , restaurant, table_in_restaurant. Software artifacts will also be a table like blog_post, picture_table. User activity also become a table like: reservation_table (opentable) , like_table (FB)
Once you have identify all the tables identify the realation between them. identify if it is 1–1 or 1-many or many-many relation.

DB (Relational) schema design for a blog post system

Notice user,post,comment,category are all table. They are either a realword entity or a object which exist in software world. Now can you figure out relationship between these table?

Same schema an be done for NO-Sql (Wide-column) system:

Notice we have 1–many relation between post and comment, so comment can be embedded for the row-key. max row size 32MB.

Notice No-Sql (HBASE) sytem works good for blog post system

You can have secondary index for any column you like. Row key always primary index and you can make it composite. You can do range scan in the row key. so make row key <fixed_attribute>+something_for_range_scan

Make row-key such that it is rather random when hashed. This is how No-Sql system shards the table.

Modeling 1-Many relationship in No-Sql (wide column) store

Modelling Many to many relationship

In sql system you need a third table (user_book_rating) to represent many to many relationship between user and book.

In No-Sql you are replicating the data in both user and book table

Read more about the schema design in No-Sql

Guidelines for HBase Schema Design

In this blog post, I’ll discuss how HBase schema is different from traditional relational schema modeling, and I’ll…

mapr.com

Another Beautiful must watch video about No-SQl modelling benefit

Following video gives basic introduction to schema design in mongo-db. it also highlight the following facts

Sql (relational DB ) based system were designed i`1n 1970 when disk were expensive so one of the goal is to minimize the repetition of infomation. This constraint is not true currenty
SQL database schema is rigid and very hard to change once you design it. Mongo-DB you can change the schema on the fly so it is very flexible. You should validate your schema in your application code to have a consistent schema.
Similar to SQL You can put multiple index in the column of the mongodb (ie keys of the document). Of couse these comes with a cost of write or update performance
mongoDB Document is equivalent to a row in a table. Collection is equavelent to table. Max size of collection is ²⁶⁴ (which is huge). Max size of a document is 16MB so if you are trying to embed a things which grow is size (Size not known at the beginning) its a good idea to use reference (similar to foriegn key in sql world) rather than embeddeding
1–1 relation can be easily manage by doing embedding
1-many relation can also be handled by doing embedding
Make sure the things which you are embedding does not grow too large (16 MB) limit so in a way only embeded things where you know the max-size upfront
Many to Many relationship can be handled by embedding the data in both the documents. Thnks customer and books relation. 1 customer can read many books and A book can be read by many customer. embed both ways
Write concern is mongo db is a way to ensure that your data is durable in multiple sources before you get an OK signal from the mongo-db. Ofcourse it comes with a extra latency cost. So do it when you have data you want to be 100% sure to have a replication before OK is send by the mongo-db. This is tradeoff between your availablity of DB and speed of write.

Following video gives highlight about the schema flexibility given by No-SQL system. Also they are more scalable as they autoshard based on row-key. Design your rowkey carefully as it will be used for determining which shard the row will go. Also row-key are automatically index and searches are fast in it. Design row key : <SENSOR_ID>_timestamp. and then you can get all the rows associated with sensor_id between the time range [ range scan ]

Detailed blog for Sql and No-Sql dara modelling

Data Modeling Guidelines for NoSQL JSON Document Databases

In this blog post, I’ll discuss how NoSQL data modeling is different from traditional relational schema data modeling…

mapr.com

Wide column oriented DB schema design

Guidelines for HBase Schema Design

In this blog post, I’ll discuss how HBase schema is different from traditional relational schema modeling, and I’ll…

mapr.com

Caching:

Memcached [AP system]

Uses consistent hashing at client side to figure out which bucket your key lies. Every client have mapping from bucket → IP+PORT of memcached server. Memcached do not do replication. It assumes its just a cache and its OK to loose some buckets (nodes)

Here are some ways to connect to the memcached servers.

var memcached = new Memcached({ '192.168.0.102:11211': 1, '192.168.0.103:11211': 2, '192.168.0.104:11211': 1 });var memcached = new Memcached([ '192.168.0.102:11211', '192.168.0.103:11211', '192.168.0.104:11211' ]);var memcached = new Memcached('192.168.0.102:11211');

Configure options for the failure of server. Failed server will be removed from the list of server and the failOverServers will take over.

var memcached = new Memcached('localhost:11211', {retries:10,retry:10000,remove:true,failOverServers:['192.168.0.103:11211']});

Above code will work OK as long as you have 1 client (application server) and multiple memcached server. If you have multiple client and multiple servers some might see the server down and remove that server from the list, while other will keep using from that server.

Distributed Caching with Memcached

Memcached is a high-performance, distributed caching system. Although application-neutral, it’s most commonly used to…

www.linuxjournal.com

How to deal with memcached insistencies

What are the most common sources of memcached cache inconsistency?

Answer: 1. Application-level bugs (including bugs in previous versions of the code that didn’t delete values that they…

www.quora.com

Performace benchmark for databases

Apache Cassandra NoSQL Performance Benchmarks

Apache Cassandra™ is a leading NoSQL database platform for modern applications. By offering benefits of continuous…

academy.datastax.com

If you can live with non-consistent system Cassandra is the Best

Time Series modelling

Key insight:

Data is immutable
We need to have query like give me all value associated with Key X between Tstart and Tend

Sql for time series database

SQL traditional way to model:

TS;KEY;value

OpenTSTB format

sensor_id;time_window_start;time_offset1;time_offset2…..;BLOB( for backfilling)

Usually ,Single row fetch is much faster than multi row fetch in DB. here we are trying to compact / group lot of info in single row.

4 Node mapr ==> 100 M insertion / sec with linear scaling

Ted Dunning (Mapr) for times series database

Common architecure

A three tier architecure of typical web service

Sharding + read replica for scale

Data Modelling: Relational DB OR No-SQL.

Guidelines for HBase Schema Design

In this blog post, I’ll discuss how HBase schema is different from traditional relational schema modeling, and I’ll…

Data Modeling Guidelines for NoSQL JSON Document Databases

In this blog post, I’ll discuss how NoSQL data modeling is different from traditional relational schema data modeling…

Guidelines for HBase Schema Design

In this blog post, I’ll discuss how HBase schema is different from traditional relational schema modeling, and I’ll…

Distributed Caching with Memcached

Memcached is a high-performance, distributed caching system. Although application-neutral, it’s most commonly used to…

What are the most common sources of memcached cache inconsistency?

Answer: 1. Application-level bugs (including bugs in previous versions of the code that didn’t delete values that they…

Apache Cassandra NoSQL Performance Benchmarks

Apache Cassandra™ is a leading NoSQL database platform for modern applications. By offering benefits of continuous…

Written by Deepak Pant

No responses yet