Summary of my experience writing a trading app's storage layer

Deepak Pant
5 min read · Mar 5, 2021

System setup: Windows 10 Home (host) + a VM (VirtualBox 6.2) running Ubuntu 20, with 20 GB RAM and an i5 processor (12 threads). Both Windows and the VM are installed on a 1 TB SSD.

Data is hard to acquire: It is very hard to find a good, reliable data source for stock data. I am using yfinance (not an official Yahoo API) for daily data and polygon.io ($199/month) for minute data. I am quite happy with their performance. It's not cheap, but I could not find any cheaper (or free) alternative data source.
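For reference, a minimal sketch of pulling daily bars with yfinance (the ticker and period here are just illustrative):

import yfinance as yf

# Fetch the full daily history for one ticker; returns a pandas DataFrame
# with Open/High/Low/Close/Volume columns indexed by date.
df = yf.download("SNPS", period="max", interval="1d")
print(df.tail())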

sqlite: works well (speed-wise) when the database is smaller than ~10 GB. My daily data for all ~9K tickers (max history) is ~4 GB, so it works very well. The minute data, for only 550 tickers going back 5 years, is around 35 GB. This is where sqlite starts to choke: slow write performance, ~200 writes/sec, versus 2-3K writes/sec in a smaller DB (<10 GB). Also notice that the DB size grows very fast: if I had chosen to store minute data for all tradable symbols for 5 years, I would have ended up with a DB of ~900 GB.
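Write throughput also depends heavily on how inserts are batched; committing every row separately is what produces numbers in the low hundreds of writes per second. A minimal sketch of batching rows into a single transaction (the table and row layout are illustrative):

import sqlite3

rows = [("SNPS", "2021-03-05 09:30:00", 250.1, 251.0, 249.8, 250.6, 12000)]

conn = sqlite3.connect("minute.db")
with conn:  # one transaction for the whole batch instead of one per row
    conn.executemany(
        "INSERT INTO stock_price_minute_data VALUES (?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
conn.close()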

Sqlite does not have good concurrency support. My webapp goes offline (read ops fail) because the DB returns errors while my data-fetcher scripts update it (every night) with new data.
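One commonly cited mitigation is SQLite's WAL journal mode, which lets readers proceed concurrently with a single writer (whether it fully fixes a given workload is another matter). A minimal sketch:

import sqlite3

conn = sqlite3.connect("stocks.db")
# WAL mode persists in the database file; readers are no longer blocked
# by a single writer (they see the last committed snapshot).
conn.execute("PRAGMA journal_mode=WAL")
conn.close()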

The sqlite database file is easily corruptible. It got corrupted many times when sqlitebrowser hung and I had to kill it; after that I was not able to read the DB again. Luckily I had a backup (a copy of the DB file).
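Rather than copying the file by hand (which is only safe when nothing is writing), Python's sqlite3 module exposes SQLite's online backup API; a minimal sketch:

import sqlite3

src = sqlite3.connect("stocks.db")
dst = sqlite3.connect("stocks_backup.db")
with dst:
    src.backup(dst)  # consistent copy even while other connections read
dst.close()
src.close()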

Storing 5 years of historical data for the S&P 500 is not trivial; it is almost 300 million data points. There are multiple ways to handle this: making a separate database for every year, month, or week (time period), and making different tables per symbol, will help with speed. But this increases code complexity a lot, since you are doing the sharding yourself. It also makes migration (to other databases) hard, as you might want to merge all these tables and DBs back into a single DB. In sqlite3, ATTACH can bind multiple database files to the same connection object (see the sketch below), but there is a maximum limit on how many databases one connection can ATTACH (10 by default).
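A minimal sketch of querying across per-year shards with ATTACH (file and table names are illustrative):

import sqlite3

conn = sqlite3.connect("minute_2020.db")
conn.execute("ATTACH DATABASE 'minute_2019.db' AS y2019")

# Tables in the attached file are addressed with the schema prefix.
cur = conn.execute("""
    SELECT COUNT(*) FROM stock_price_minute_data
    UNION ALL
    SELECT COUNT(*) FROM y2019.stock_price_minute_data
""")
print(cur.fetchall())
conn.close()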

The following query on a (single) combined 28 GB table is extremely fast and resolves in under a second:

SELECT COUNT(*) FROM stock_price_minute_data WHERE ticker = 'SNPS';

Storage overhead:

There is considerable overhead in storing data in sqlite3 (or any database). The overhead comes from a variety of sources: indexes, padding, empty DB pages, etc. Some of this storage (usually ~10% at most) can be recovered with the VACUUM command (which works in both postgres and sqlite3); there is an auto-VACUUM option too.

I tried the VACUUM command, and it reclaimed about 10% of space by removing the empty-page overhead, rearranging rows, etc.

[Figure: VACUUM in sqlite3 saves a bit in DB size]
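A minimal sketch of running VACUUM from Python and checking the file-size difference:

import os
import sqlite3

before = os.path.getsize("stocks.db")
conn = sqlite3.connect("stocks.db")
conn.execute("VACUUM")  # rewrites the file, dropping free pages
conn.close()
after = os.path.getsize("stocks.db")
print(f"reclaimed {before - after} bytes")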

To give a concrete example: when I calculate the raw size of the data in one of my storage tables (row size × number of rows), it turns out to be 17 MB, while sqlite3 took almost 47 MB to store it. I used the following query to find the size of a sqlite3 table (notice its output does not include the index overhead; it also requires the dbstat virtual table to be enabled in your SQLite build):

SELECT SUM("pgsize") FROM "dbstat" WHERE name='stock_price_minute_data_A';

Another important takeaway: even though sqlite3 is not a type-strict database, for applications using a lot of data it is important to specify the column types in the schema. This makes a lot of difference in the database (or table) size.

My initial schema is below. Notice that the fields with no declared type get no type affinity, so values are stored however they arrive (e.g., as text if your loader passes strings), and given the floating-point nature of stock prices that can take a lot of space. It also creates problems when migrating to a type-strict database (postgres).

CREATE TABLE stock_price_daily_all_tradable (
ticker TEXT NOT NULL,
date TIMESTAMP NOT NULL,
open NOT NULL,
high,
low ,
close,
volume,
PRIMARY KEY (ticker , date)
);

So the correct way to specify a schema is

CREATE TABLE stock_price_daily_all_tradable (
ticker TEXT NOT NULL,
date TIMESTAMP NOT NULL,
open REAL NOT NULL,
high REAL,
low REAL,
close REAL,
volume REAL,
PRIMARY KEY (ticker , date)
);

The above schema change took the table size from 2.3 GB to 2.2 GB (relatively minor savings), but it will be a major saving when we import the data into postgres.

Since my database file is 3.3 GB, roughly 1 GB of space is presumably going to the index.

sqlite3_analyzer is another good tool to figure out what percentage of storage is going where.

[Figure: sqlite3_analyzer results]

Postgres is a bit better with storage usage. For my daily stock prices (all tickers, max history), sqlite3 took ~3.5 GB of total space, while postgres used 1.8 GB (table size) + 890 MB (index size) = 2.6 GB of total space, so it is more efficient storage-wise.

The postgres index and table sizes can be checked with the following commands:

# Gives index size
testdb=> SELECT pg_indexes_size('stock_price_daily_all_tradable');

# Gives table size
testdb=> \dt+
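The same numbers can be pulled from a script; a minimal sketch using psycopg2 (connection parameters are illustrative):

import psycopg2

conn = psycopg2.connect(dbname="testdb")
cur = conn.cursor()
cur.execute("""
    SELECT pg_size_pretty(pg_table_size('stock_price_daily_all_tradable')),
           pg_size_pretty(pg_indexes_size('stock_price_daily_all_tradable'))
""")
print(cur.fetchone())  # e.g. table size and index size, pretty-printed
conn.close()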

Data storage compression

Compressing this data further natively is not possible: sqlite3 does not provide any built-in capability for compression during storage. There are some hacky ways to compress using extensions, but I don't know how stable they are. The same is true for postgres; to get real (full-table) compression in postgres you need the Postgres Pro version.
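One application-level workaround (not something I used; sketched here under the assumption that minute bars are read back a whole day at a time) is to pack each day's bars into a blob and compress it before insertion:

import sqlite3
import struct
import zlib

# One trading day of minute bars for one ticker: (open, high, low, close, volume)
bars = [(250.1, 251.0, 249.8, 250.6, 12000.0)] * 390

# Pack the floats into a binary blob and deflate it.
raw = b"".join(struct.pack("5d", *bar) for bar in bars)
blob = zlib.compress(raw)

conn = sqlite3.connect("compressed.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS daily_blob "
    "(ticker TEXT, date TEXT, bars BLOB, PRIMARY KEY (ticker, date))"
)
with conn:
    conn.execute("INSERT INTO daily_blob VALUES (?, ?, ?)", ("SNPS", "2021-03-05", blob))
conn.close()

The obvious trade-off is that you can no longer filter individual bars in SQL; the whole blob must be read and decompressed.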

Connection object: (A word of caution)

One should create the sqlite3 connection object local to the function using it. If you create a global connection object and use it across functions, you will end up with race conditions when multiple functions start using the database at the same time (very common in webapp + cronjob scenarios). Connection objects and cursors are not thread-safe, and if you share them across threads you will end up clobbering your data.
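A minimal sketch of the per-function-connection pattern (the query is illustrative):

import sqlite3

DB_PATH = "stocks.db"

def latest_close(ticker: str):
    # Open a fresh connection per call so each thread gets its own.
    conn = sqlite3.connect(DB_PATH)
    try:
        cur = conn.execute(
            "SELECT close FROM stock_price_daily_all_tradable "
            "WHERE ticker = ? ORDER BY date DESC LIMIT 1",
            (ticker,),
        )
        row = cur.fetchone()
        return row[0] if row else None
    finally:
        conn.close()

print(latest_close("SNPS"))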

Migration from sqlite3 → postgres

While sqlite3 is a good place to start your storage solution, you will quickly realize (for the reasons mentioned above) that postgres is a better choice for storage.

Migration is a bit of a pain in the database world: it takes time and you might have to write some custom scripts to do it. Also, sqlite3 is not type-strict while postgres is, so the same SQL syntax for creating a table (CREATE TABLE ...) will not work in both without modification.
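A minimal sketch of such a custom script, streaming rows from sqlite3 into postgres in batches (connection details and batch size are illustrative; it assumes the postgres table was already created with the type-strict schema above):

import sqlite3
import psycopg2

BATCH = 10_000

src = sqlite3.connect("stocks.db")
dst = psycopg2.connect(dbname="testdb")
cur = dst.cursor()

reader = src.execute(
    "SELECT ticker, date, open, high, low, close, volume "
    "FROM stock_price_daily_all_tradable"
)
while True:
    rows = reader.fetchmany(BATCH)
    if not rows:
        break
    cur.executemany(
        "INSERT INTO stock_price_daily_all_tradable VALUES (%s, %s, %s, %s, %s, %s, %s)",
        rows,
    )
dst.commit()
dst.close()
src.close()

psycopg2's executemany is not the fastest bulk-load path (COPY is), but it is the simplest place to start.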
