Agile Data Modeling with Big Data
August 25, 2015
Here at Zoomph we use an agile software development methodology that lets us deliver features and updates to our customers rapidly. As with most things, though, there is no free lunch. Agile design methodologies are, to an extent, predicated on the idea that changing software is ‘cheap’. Generally this is true: changing the layout of a user interface or creating a new workflow is not hugely time consuming, and such changes come with few major technical hurdles. The same cannot be said of large-scale databases. In this article I’m going to go over some best practices I’ve learned over the years that allow Zoomph’s databases to be just as agile as the rest of our platform:
Make use of Schemas
There has been a lot of buzz in the industry about NoSQL databases that claim to be schema-less as a major selling point. Here is the dirty secret, though: all databases have a schema, even “schema-less” ones. The only difference is whether the schema lives in your application’s code or in the DB itself. By allowing the DB to be a free-for-all, the complexity of dealing with all sorts of data anomalies and quirks is pushed into the code, which can end up enormously complex as a result; that complexity leads to bugs, technical debt, and analytics difficulties. That being said, you don’t always need a hardcore third-normal-form database with constraints (these generally don’t scale reads well either); it’s a balancing act. Just keep in mind that once the data starts getting messy it can prove extremely complicated, if not downright impossible, to clean up, so don’t let it get messy in the first place.
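The “schema lives in your code” problem is easy to sketch. In the hypothetical example below (the field names and version history are invented for illustration, not taken from any real Zoomph collection), a document’s shape drifted across three releases, and every reader now has to carry the accumulated schema in code:

```python
# Sketch of the hidden schema in a "schema-less" store: once document
# shapes drift, every reader must handle every historical shape.
def get_author_name(doc: dict) -> str:
    """Normalize a user document whose shape drifted over time."""
    # v1 stored a plain string, v2 a nested object, v3 renamed the key.
    author = doc.get("author") or doc.get("created_by")
    if isinstance(author, str):
        return author
    if isinstance(author, dict):
        return author.get("name", "unknown")
    return "unknown"

docs = [
    {"author": "alice"},                   # v1 shape
    {"author": {"name": "bob", "id": 7}},  # v2 shape
    {"created_by": "carol"},               # v3 shape
]
print([get_author_name(d) for d in docs])  # every reader pays this cost
```

Multiply this normalization logic by every field and every consumer of the data, and the cost of letting the DB be a free-for-all becomes clear.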
Open Ended Schemas
To the above point, schemas are important, but they should not hold back your ability to adapt to change. Designing schemas that can be easily updated later is vitally important, especially in big data systems where changing the schema can be massively (if not intractably) resource intensive. One example of this that I like to use is avoiding boolean types in databases; use integers/enums instead. For example, if you’re modeling the status of a blog post, the initial requirement may be for published/unpublished to be the only states, which is easy to model with a boolean. But what happens when a third and fourth state are added? By using an integer you have tons of flexibility to add new states, whether through bit masking (which allows multiple states to be on/off at the same time) or by simply adding new members to an enumeration.
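Here is a minimal sketch of both variants for the blog-post example (the status names beyond published/unpublished are hypothetical). The DB column stays a plain integer either way, so adding a state is a code change, not a schema migration:

```python
from enum import IntEnum, IntFlag

# Mutually exclusive states: just add members as requirements grow.
class PostStatus(IntEnum):
    UNPUBLISHED = 0
    PUBLISHED = 1
    SCHEDULED = 2   # third state, added later with no migration
    ARCHIVED = 3    # fourth state, likewise

# Bit-mask variant when several states can be on at the same time.
class PostFlags(IntFlag):
    NONE = 0
    PUBLISHED = 1
    FEATURED = 2
    LOCKED = 4

flags = PostFlags.PUBLISHED | PostFlags.FEATURED
print(int(flags))  # 3 -- the value that actually lands in the integer column
```

A boolean column would have forced a migration (and a backfill) the moment the third state appeared; the integer absorbs it for free.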
Make a point of understanding where the system is heading
Within most systems, the data model / database provides core APIs and abstractions that the rest of the system sits on top of. When these APIs and abstractions change, they have a rippling effect through the rest of the codebase. While you don’t necessarily want to waterfall your database, some planning for the future and understanding where the business is going goes a long way toward making future updates less painful.
Pay close attention to how big the data sets are planned on growing
Whether businesses like it or not, as data sets grow, the amount of time spent dealing with the problems arising from their size will grow as well. Paying attention to how your data sets are going to grow is of the utmost importance if you want to avoid investing time in architectures that can’t scale to meet demand. During development, data sets are generally fairly small, which lets developers architect solutions that work well at first but quickly fall apart as the data sets grow. Some examples of this you can run into are:
- Multifaceted/Full text search
- Many common databases offer FTS (full text search) as a feature, which works great as long as the data sets are relatively small, but it can quickly fail to scale if the DB is not designed specifically for this use case. Here at Zoomph we make use of ElasticSearch, which is designed from the ground up to scale search.
- Joins / Search across disparate databases
- In many environments (including Zoomph) you will have multiple databases, each fulfilling its own role. But what happens when you need to perform queries that require data from multiple DBs? A very common approach is to simply query both databases from application code and then merge the data together in code. This can work, but certain types of operations become extremely expensive as the data sets grow, which can force a re-architecture of the databases to support new use cases.
- Doing analytics in application code
- If developers are tasked with determining the top used words in a set of tweets, they may be tempted to simply query the DB for the tweets and then count the words in application code. This will work as long as the number of tweets is small, but it is an O(N) operation in both data transferred and memory. When 1K tweets are being analyzed it will work flawlessly; at 100K it will be slow but likely still work; at 10M you may end up waiting several minutes for all of the tweets to be loaded into the application server and counted.
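The cross-database merge described above (query both stores, join in code) looks like this in miniature. The two in-memory lists stand in for real query results, and all names are illustrative:

```python
# Application-side hash join between two stores: users from a relational
# DB, posts from a document store. Both full result sets must cross the
# wire before the merge can even begin -- O(U + P) network and memory.
users = [  # pretend: SELECT id, name FROM users
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]
posts = [  # pretend: a document-store query for recent posts
    {"user_id": 1, "title": "hello"},
    {"user_id": 1, "title": "world"},
    {"user_id": 2, "title": "hi"},
]

by_user = {u["id"]: u["name"] for u in users}  # build side of the join
joined = [
    {"author": by_user[p["user_id"]], "title": p["title"]} for p in posts
]
print(joined)
```

Harmless at this size; at millions of rows per side, shipping both result sets to the application server is exactly the kind of operation that forces a re-architecture.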
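The word-count trap is equally easy to sketch. The small list below stands in for a DB query result; the point is that every tweet must travel to the application server just to be counted:

```python
from collections import Counter

# O(N) analytics in application code: fine at 1K rows, painful at 10M,
# because all N tweets cross the wire before counting starts.
tweets = [  # pretend: SELECT text FROM tweets
    "big data is big",
    "data modeling is fun",
    "big plans for big data",
]

counts = Counter(word for t in tweets for word in t.split())
print(counts.most_common(2))  # [('big', 4), ('data', 3)]
```

The scalable alternative is to push the aggregation into the store itself (a terms aggregation in a search engine, or a GROUP BY in SQL) so that only the small top-K summary crosses the wire, not the raw tweets.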
Prepare for horizontal scaling from the start
Vertical scaling offers lots of benefits early on, such as simpler setup and management, but it can quickly turn into a losing game as time goes on (there are limits to how ‘up’ you can go). That being said, you don’t need to invest in huge database clusters from the get-go; simply avoiding practices which CAN’T scale horizontally goes a long way. Examples include making use of map-reduce-friendly algorithms and avoiding keeping state in your application code’s memory.
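What “map-reduce friendly” means in practice is that the work splits into per-shard pieces whose partial results merge with an associative, commutative operation, so no single server has to hold all the state. A minimal sketch, reusing the word-count idea (the shard contents are illustrative):

```python
from collections import Counter
from functools import reduce

# Map-reduce-friendly aggregation: each shard computes a partial result
# near its data (map), and partials merge with '+' (reduce). Counter
# addition is associative and commutative, so shards can be processed on
# any number of machines and merged in any order.
shards = [
    ["big data is big"],
    ["data modeling is fun"],
    ["big plans for big data"],
]

def map_shard(tweets):
    """Runs independently per shard; holds only that shard's state."""
    return Counter(word for t in tweets for word in t.split())

partials = [map_shard(s) for s in shards]     # parallelizable map step
total = reduce(lambda a, b: a + b, partials)  # cheap merge step
print(total.most_common(2))
```

Contrast this with accumulating one giant dictionary in a single application server’s memory: that version works on one box and cannot be spread across ten.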
Also, choosing a database that can natively scale horizontally is a must. Many databases can technically be scaled horizontally, but many do not do a particularly good job of it. You can save yourself a lot of operational pain by selecting databases that scale horizontally natively, not via some third-party plugin which never works quite right.