
Agile Data Modeling with Big Data

Here at Zoomph we use an agile software development methodology, which lets us deliver features and updates to our customers rapidly. However, as with most things, there is no free lunch. Agile design methodologies are, to an extent, predicated on the idea that changing software is ‘cheap’. Generally this is true: changing the layout of a user interface or creating a new workflow is not a hugely time-consuming activity, and neither comes with many major technical hurdles. This is not true of large-scale databases, though. In this article I’m going to go over some best practices I’ve learned over the years which allow Zoomph’s databases to be just as agile as the rest of our platform:

Make use of Schemas

There has been a lot of buzz in the industry about NoSQL databases, many of which claim to be schema-less as a major selling point. Here is the dirty secret, though: all databases have a schema, even “schema-less” ones. The only difference is whether the schema lives in your application’s code or in the DB itself. By allowing the DB to be a free-for-all, the complexity of dealing with all sorts of data anomalies and quirks is pushed into the code, where the schema can end up with enormous complexity; complexity which leads to bugs, technical debt, and analytics difficulties. That being said, you don’t always need a hardcore third-normal-form database with constraints (these generally don’t scale reads well either); it’s a balancing act. Just keep in mind that once the data starts getting messy it can prove extremely complicated, if not downright impossible, to clean up, so don’t let it get messy in the first place.
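To make that concrete, here’s a minimal (and entirely hypothetical) Python sketch of what “the schema lives in your code” tends to look like once a schema-less store has accumulated a few generations of documents:

```python
# Hypothetical example: with no schema enforced by the DB, application code
# ends up reconciling every historical shape of a document by hand.
def normalize_user(doc: dict) -> dict:
    # v1 documents stored a single "name" string; v2 split it into two fields.
    if "name" in doc and "first_name" not in doc:
        first, _, last = doc["name"].partition(" ")
        doc["first_name"], doc["last_name"] = first, last
    # Some writers stored "age" as a string, others as an int.
    if isinstance(doc.get("age"), str):
        doc["age"] = int(doc["age"])
    # Every new anomaly adds another branch here -- this IS the schema,
    # it just lives in code instead of in the database.
    return doc
```

Every branch in that function is schema logic that a constrained database would have enforced at write time.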

Open-Ended Schemas

To the above point, schemas are important, but they should not hold back your ability to adapt to change. Designing schemas which can be easily updated later is vitally important, especially in big data systems, where changing the schema can be massively (if not intractably) resource intensive. One example of this which I like to use is avoiding boolean types in databases; use integers/enums instead. For example, if you’re modeling the status of a blog post, the initial requirement may be for published/unpublished to be the only states, which is easy to model with a boolean. But what happens when a third and fourth state are added? By using an integer you have tons of flexibility to add new states, whether through bit masking (which allows multiple states to be on/off at the same time) or just by adding new members to an enumeration.
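As a rough illustration (the enum and its members are made up for this example), here is how the bit-masking version looks in Python; the database column stays a plain integer while the set of states grows:

```python
from enum import IntFlag

class PostStatus(IntFlag):
    UNPUBLISHED = 0
    PUBLISHED   = 1
    ARCHIVED    = 2   # added later, no schema migration required
    FEATURED    = 4   # bit masking lets states combine

# The stored value is just an int; only the enum in code grows.
status = PostStatus.PUBLISHED | PostStatus.FEATURED
print(int(status))                          # 5 -- the value actually stored
print(bool(status & PostStatus.FEATURED))   # True -- check a single bit
```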

Make a point of understanding where the system is heading

Within most systems, the data model / database provides core APIs and abstractions which the rest of the system sits on top of. When these APIs and abstractions change, they have a rippling effect through the rest of the codebase. While you don’t necessarily want to waterfall your database, some planning for the future and understanding where the business is going goes a long way toward making future updates less painful.

Pay close attention to how large the data sets are expected to grow

Whether businesses like it or not, as data sets grow, the amount of time spent dealing with the problems arising from their size will grow as well. Paying attention to how they are going to grow is of utmost importance to save yourself from investing time in architectures which can’t scale to meet demand. During development the data sets are generally fairly small, which lets developers architect solutions that work well at first but quickly fall apart as the data sets grow. Some examples of this you can run into are:

  • Multifaceted/Full text search
    • Many common databases offer FTS (full text search) as a feature, which works great as long as the data sets are relatively small, but it can quickly fail to scale if the DB is not designed specifically for this use case.  Here at Zoomph we make use of ElasticSearch, which is designed from the ground up to scale search.
  • Joins / Search across disparate databases
    • In many environments (including Zoomph) you will have multiple databases, each of which fulfills its own role.  But what happens when you need to perform queries which require data from multiple DBs?  A very common approach is to simply query both databases from application code and then merge the data together in code.  This can work, but certain types of operations become extremely expensive as the data sets grow, which can force a re-architecture of the databases to support new use cases.
  • Doing analytics in application code
    • If developers are tasked with determining the most-used words in a set of tweets, they may be tempted to simply query the DB for the tweets and then count words in application code.  This works as long as the number of tweets is small, but it is an O(N) operation.  When 1K tweets are being analyzed it will work flawlessly; when 100K tweets are being analyzed it will be slow but likely still work; when 10M tweets are being analyzed you may end up waiting several minutes for all of the tweets to be loaded onto the application server and counted (see the sketch after this list).
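To illustrate that last point, here is a sketch of both approaches; the cluster address, index name, and field name are assumptions for the example (using the ElasticSearch 8.x Python client), not our actual setup. The first function ships every tweet to the app server; the second pushes the counting to the database with a terms aggregation, so only ten buckets come back over the wire:

```python
from collections import Counter
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical cluster address

def top_words_in_app(fetch_all_tweets):
    # O(N) over the wire: every tweet is loaded into the app server first.
    counts = Counter()
    for tweet in fetch_all_tweets():
        counts.update(tweet["text"].lower().split())
    return counts.most_common(10)

def top_words_in_db():
    # The database does the counting; only 10 buckets cross the wire.
    # "words" is assumed to be an aggregatable (keyword) field on the index.
    resp = es.search(
        index="tweets",                       # hypothetical index name
        size=0,                               # no hits, just the aggregation
        aggs={"top_words": {"terms": {"field": "words", "size": 10}}},
    )
    return [(b["key"], b["doc_count"])
            for b in resp["aggregations"]["top_words"]["buckets"]]
```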

Prepare for horizontal scaling from the start

Vertical scaling offers lots of benefits early on, such as being simple to manage and set up, but it can quickly turn into a losing game as time goes on (there are limits to how ‘up’ you can go).  That being said, you don’t need to invest in huge database clusters from the get-go; simply avoiding practices which CAN’T scale horizontally goes a long way.  Examples of this include making use of map-reduce-friendly algorithms and avoiding keeping state in your application code’s memory.
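For instance, here is a small, purely illustrative sketch of what “map-reduce friendly” means for the word-count example above: the per-shard work is independent and the merge step is associative and commutative, so the same code can run on one box today and fan out across N workers later without rework:

```python
from collections import Counter
from functools import reduce

def count_shard(tweets):
    # "Map": runs independently per shard/partition, no shared state.
    counts = Counter()
    for text in tweets:
        counts.update(text.lower().split())
    return counts

def merge(a, b):
    # "Reduce": associative and commutative, so merge order doesn't matter.
    return a + b

shards = [["big data is big"], ["agile data modeling", "big data again"]]
total = reduce(merge, (count_shard(s) for s in shards), Counter())
print(total.most_common(3))
```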

Also, choosing a database which can natively scale horizontally is a must.  Many databases can technically be scaled horizontally, but many do not do a particularly good job of it.  You can save yourself a lot of operational pain by selecting databases which scale horizontally natively, not with some 3rd-party plugin which never works quite right.

Visit us at Zoomph to learn about the latest and the greatest in social media listening and data enrichment.
