Carlos Fenollosa — Blog

Thoughts on science and tips for researchers who use computers

SQLite: a standalone database for your application

November 10, 2011 — Carlos Fenollosa

Us researchers are used to store data in plain text formats, because it's very easy to parse and work with. While this is appropriate for some data types—and, I'd add, very useful to send later to R—, in some cases disk access is slow or just inefficient.

This topic is actually very important for some projects, as storing records into a plain text file is very slow to query afterwards. And this is the key question to ask ourselves before considering to use a database. Databases are great for complex, unordered queries, but not so great for sequential access of raw data. Let's see an example.

There is a data file which stores atom coordinates, for example, from a Molecular Dynamics simulation. This data is very likely to be read once, sequentially, then processed in memory. The information represents a matrix which will be processed by mathematic functions. This is the classic example where data files (either binary of plain text) are used correctly.

But now let's think of a list of proteins and some properties, for example, molecular weight and number of available 3D structures. All these objects are independent, they have their own entity. While you can store a text file with one line per <protein, weight, structures>, it makes more sense to store it on a database.

Databases allow complex queries to be resolved very quickly. For example, give me all proteins with molecular weight > 50,000 , list all proteins which have no crystal structures, or print all the proteins which have duplicate structures. Were we working with a text file, we would need to process it completely every time we perform a query. That's very, very slow. Databases internally store the information in such a way that queries don't need to go through all elements to get the answer. Namely, they store data on trees by indices.

How do indices work? It's a complex issue but let's think of a very basic example. Let's say you have three protein structures (1BCD, 2KI5, 1AGI) which you want to index by name and molecular weight. The system will then automatically build a protein binary tree where 1BCD is the parent, the left child is 1AGI and 2KI5 is the right child. Then, it will create another tree where the left child is the lightest protein, the parent is the middle one, and the right child is the heaviest one.

If the index tree is always sorted where the left child is alphabetically inferior than the parent, and the right child is always superior to the parent, then we can access any element or group of elements not only without checking every item but also in logarithmic time. Databases do this once for every index you configure, so complex queries can be solved super fast because for each of them the system only needs to process a few items of the many millions you might have stored in the DB. That's because every time you jump to a child element, the system is avoiding to process half of the database, then a half of this half (1/4), then 1/8, etc.

To summarize, if you have some data where each record has its own entity (i.e. can be thought of as an "object") and you think you'll make queries which retrieve an arbitrary number of the elements, then you need to use a database. Databases have even more advantages, like relationships between objects (e.g. each crystal structure has its own entity, and can be related to a protein), but database design is a complex topic and this article will cover only the basic data storage.

However, databases are usually configured by the system administrator and handled by a daemon—oracle, mysql, postgresql. Here I will talk about yet another way of creating databases, without the need to start any daemons, have any user privileges and, more importantly, easily portable. This is done via sqlite.

SQLite is a library that implements a SQL engine inside your own application. This means that while the database is persistent inside a file, all the querying infrastructure is deployed along with your code and stopped when the code finishes running. The databases can be created very easily, making it easy to have multiple DBs for testing, and without the need to bother the system administrator.

sqlite has bindings for almost all popular languages and also a commandline interface which is handy for testing and debugging. The data is stored on a single file which can be deployed with your application without needing to install any standalone servers. Obviously, it is not a replacement for Oracle's solutions, but it can speed up a lot some applications which need to work querying data and don't have access to one.

Most popular software uses some kind of database to store data, as this is a super fast way to access preferences and other items. For scientific programs, it is always necessary to think twice before using one, as database design is an art on its own, and as said before, it does not suit all needs.

When used properly, a small ad-hoc database like sqlite can speed up software, make data access very easy and allow the manipulation of large, objectified, in-related data collections with simple queries instead of writing long and slow algorithms which process all the data when you only need one item.

Tags: software, programming

Comments? Tweet