Specialized Databases in Data Science

Many specialized databases now provide features aimed specifically at data science. They differ in storage medium, scaling topology, which aspects of the CAP theorem they favor, how they organize data on media, and storage model (object, graph, document, or relational table, all of which are substantially related to each other). Because these attributes can be combined in so many ways, the number of databases in existence is very large. We find that a few database technologies are particularly important.

Analytic databases

These databases are distinguished by the way they organize data on disk: instead of storing data a row at a time, they store the data for each column together. This saves IO for queries that do not involve all columns, particularly those that aggregate over a single column. They are therefore very good at queries against a star schema, especially queries that calculate aggregates. Storing similar data consecutively also makes data type-specific compression much more effective, which further reduces IO.
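The difference between the two layouts can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular product stores data: the table, its column names (`order_id`, `region`, `amount`), and the "values read" counters are all hypothetical stand-ins for disk IO. Run-length encoding is used here only as one simple example of compression that benefits from similar values sitting next to each other.

```python
from itertools import groupby

# Hypothetical toy table: (order_id, region, amount)
rows = [
    (1, "east", 10.0),
    (2, "east", 12.5),
    (3, "west", 7.5),
    (4, "west", 20.0),
]

# Row layout: the values of one record sit next to each other.
row_store = [value for row in rows for value in row]

# Column layout: the values of one column sit next to each other.
column_store = {
    "order_id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def sum_amount_row_store(store, width=3, amount_index=2):
    """Sum the amount column; the scan walks past every value of every row."""
    total, values_read = 0.0, 0
    for start in range(0, len(store), width):
        record = store[start:start + width]
        values_read += len(record)
        total += record[amount_index]
    return total, values_read


def sum_amount_column_store(store):
    """Sum the amount column; only that column's values are read."""
    amounts = store["amount"]
    return sum(amounts), len(amounts)


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


row_total, row_reads = sum_amount_row_store(row_store)
col_total, col_reads = sum_amount_column_store(column_store)
encoded_region = run_length_encode(column_store["region"])
```

Both scans produce the same total, but the column layout touches only the values of the one column being aggregated, and the `region` column collapses into a handful of runs because equal values are adjacent.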

The aspects of a column store that make it good at analytical queries make it poor for queries that create, read, update, or delete (CRUD) all columns of a single row. An easy way to judge when a column store will perform well is how closely your data model resembles a star schema, since the organization of a table on disk in a column store is essentially the same as a star schema.
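The cost of whole-row operations can be sketched the same way. In this hypothetical example (the function names and the "locations written" count are illustrative assumptions, not a measurement of any real database), replacing one record touches a single place in a row layout but one place per column in a column layout.

```python
def update_row_in_row_store(records, row_index, new_record):
    """Replace one record; the whole row lives in one place."""
    records[row_index] = new_record
    return 1  # one contiguous location written


def update_row_in_column_store(columns, row_index, new_record):
    """Replace one record; every column holds a piece of it."""
    writes = 0
    for name, value in new_record.items():
        columns[name][row_index] = value
        writes += 1  # one separate location per column
    return writes


# Hypothetical two-row table held in both layouts.
records = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 7.5},
]
columns = {
    "id": [1, 2],
    "region": ["east", "west"],
    "amount": [10.0, 7.5],
}

replacement = {"id": 2, "region": "north", "amount": 9.0}
row_writes = update_row_in_row_store(records, 1, dict(replacement))
column_writes = update_row_in_column_store(columns, 1, replacement)
```

With three columns the column layout needs three scattered writes where the row layout needs one, and the gap widens as the table gets wider.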

We have deployed Actian's Vectorwise to impressive effect in a data warehouse scenario. We have also deployed MonetDB and LucidDB, which are free and open source analytic/column store databases, to good effect. Amazon Redshift is a cloud-based, scalable analytic database that may be attractive as well; it is based on technology from ParAccel, an MPP (massively parallel processing) analytic database that, like Vectorwise, is now part of Actian.

Scalable document databases

These databases relax consistency in favor of availability and partition tolerance. Typically they are document-based rather than table-based. They are very fast to read and write, and can scale to very large clusters. They are typically the back-end data store for online web applications or real-time systems that demand scalability but can tolerate eventual consistency. We are most familiar with Cassandra (commercialized by DataStax) and MongoDB (commercialized by 10gen).
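To make "eventual consistency" concrete, here is a minimal sketch of two replicas that accept writes independently and converge later. It is a toy model under stated assumptions, not the replication protocol of Cassandra or MongoDB: the `Replica` class, the integer version numbers, the "highest version wins" rule, and the `synchronize` function are all hypothetical.

```python
class Replica:
    """A toy replica that keeps the highest-versioned document per key."""

    def __init__(self):
        self.data = {}  # key -> (version, document)

    def write(self, key, version, document):
        current = self.data.get(key)
        # Keep the incoming write only if it is newer than what we hold.
        if current is None or version > current[0]:
            self.data[key] = (version, document)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def synchronize(replicas):
    """Background pass: offer every known entry to every replica."""
    entries = [
        (key, version, document)
        for replica in replicas
        for key, (version, document) in replica.data.items()
    ]
    for replica in replicas:
        for key, version, document in entries:
            replica.write(key, version, document)


a, b = Replica(), Replica()

# The write is acknowledged by replica a alone, so it is fast.
a.write("user:1", 1, {"name": "Ada"})

# A read served by replica b before synchronization sees nothing yet.
stale = b.read("user:1")

synchronize([a, b])

# After the background pass both replicas agree.
fresh = b.read("user:1")
```

The write returns without waiting for the other replica, which is where the speed and scalability come from; the price is the window in which a read from `b` returns out-of-date data.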

There are several other categories of data store, and many attributes of a data store, that do not bear mentioning here but are nonetheless important.

