Tidy vs EAVil data structures

“Don’t take a fence down until you know why it was put up.” G.K. Chesterton

I once had the joy [sarcasm] of working with an entity-attribute-value (EAV) data model. When I first laid eyes on it I knew something was different but I didn’t know what exactly – I didn’t even know what EAV was at the time. I typically work downstream of database (DB) design, so ETL/SQL experts and DB administrators look after the design of DB and I have simply queried DBs to get the data I want for analysis. So I’ve seen a lot of DBs but I haven’t designed any – however, you don’t have to be an architect to recognize when a structure looks a bit dodgy!

From my basic knowledge of relational DB structure:

  1. Each table should be related to a particular subject or entity, e.g. a product table, a customer table, etc.
  2. Each row represents a record. So a row in in the product table is a product, a row in the customer table is a customer, etc.
  3. Each column is an attribute holding some type of information about the records,  e.g. in the customer table there could be a first_name column, an age column, etc. Also, columns should have reasonably meaningful names.

This comes from Codd’s 3rd normal form and it is the basis for Hadley Wickham’s approach to tidy data (pdf). The EAV model I came across was not structured like this, see the sample snippet in this google spreadsheet to compare how the same data is structured in the EAV versus the tidy format. As mentioned already, I am not a DB architect so maybe this data structure was something awesome that I just had not heard of yet – I’m always willing to learn new things and so I started to fish around for explanations but what I discovered only confirmed my initial fears:

  • This Microsoft Database Design Basics article specifically mentions that “When you see columns numbered this way [repeating groups like numerator 1, numerator 2, numerator 3, etc], you should revisit your design.”
  • My exact concern came up once on StackExchange and the most popular answer gives a very strong argument against EAV.
  • We needed to create scatter plots on dashboards for this client and to create scatter plots with tidy data I could simply plot one column against another but with EAV data we’d need to write quite a few lines of code to reshape the data before doing a simple scatter plot.
  • Tidy data is more amenable to the analytical base table structure that is required for model building.
  • EAV is placing different data types in the same column. Often our data are strictly counts or decimals and it is better practice to store them as integer or float data types as appropriate.
  • EAV is not easily human readable and generally needs to be pivoted or reshaped to do anything useful. A conventional relational model is more human readable because it is easy to look across a row and see attributes for a particular entity.

Unfortunately for me, a great deal of reporting and tools had been built on the existing database so even when I eventually persuaded the client to revert to a more conventional structure there was a real risk of breaking other things.

My takeaway from this experience is that a good default first guess in any arena is the conventional solution but we should always be open to hearing about new and better ways of doing things. Conversely, if you are proposing a new approach, be sure that you thoroughly know the conventional solution first. Which is my long winded way of paraphrasing that beautiful Chesterton quote at the top of this post – see here for more on that quote.

P.S. If you ever find yourself in my predicament and you need more ammo to talk someone out of EAV here are some more resources to check out:

  • Number 3 in this list of Five Simple Database Design Errors You should Avoid gives an example of how queries can become very complicated in EAV models.
  • This post on Ten Common Database Design Mistakes again mentions the issue of repeating column names appended with numbers, i.e. num1, num2, num3, etc, and the problems it can lead to.
  • This article outlines some of the attractions of EAV, namely it’s simple data model and the ease with which new attributes can be added but ultimately he considers it dangerous and recommends extreme caution before implementing EAV.
  • Here is a rather long but humorous story of how an EAV structure brought down an entire IT department.
  • I followed Karl Popper’s scientific method and kept looking for something that would prove me wrong and I eventually found one example of a large e-commerce company called Magento using EAV and it seems the attraction for them is the scalability. But it will take more than this one example for me to embrace EAV!
Advertisements

Scaled Agile Framework

In December 2015 I earned my Scaled Agile Framework Practitioner certificate. Suffice to say I am a big fan of the Agile Manifesto. The SAFe homepage is a rich source of information and if you google “scrum” or “agile” you will find a host of awesome resources.

But right here I just want to share my takeaways from the training and specifically how it can relate to delivering data analytics projects. Think of it as Anto’s unofficial guide to Agile in no particular order:

  1. Always remember what the fundamental deliverable is: value. We can lose sight of that when we’re down in the weeds of a project.
  2. Write your tests before you code, not after.
  3. Agile is robust and designed to handle the world as it is, not as we would like it to be. However, C-level people often want certainty, i.e. what will be completed when, and this is difficult to deliver in an uncertain world. Do not underestimate the challenge of selling Agile methodology to clients.
  4. Having a constant predictable work velocity is better than having alternate periods of high and low workload.
  5. User stories are not pushed on people, they are pulled down by members of the team. This is a bottom up approach to management, people are empowered to take on work items.
  6. Build good people and they will build good products.
  7. Five Whys: Asking why five times consecutively will often get to the root of a problem.
  8. Clearly define performance metrics upfront. Do not assume that everyone agrees on a definition, make sure stakeholders sign off on it.
  9. Data, like food, has a shelf life. It is better to get interim “good enough” information to stakeholders in a timely manner rather than spend a long time working on the “best” product.
  10. Multitasking is a drag, see here, here and here. Work one story at a time, one sprint at a time.
  11. Quality, scope, time: pick 2 out of the 3. We should always want to maintain high quality. Time can often be rigid due to external stakeholder needs. Scope is the one we sometimes need to adjust.
  12. Timeboxing at every level is important whether it’s planning meetings, scrum meetings or sprints themselves.
  13. There should be one product owner. I know from experience, too many chiefs can really suck.
  14. All activities should have business value and you should be able to explain that value to stakeholders. Even experimental efforts that fail can add value.
  15. Structure user stories like so: As a <…> I want to <…> so that <…>.
  16. User story is not complete without acceptance criteria.
  17. No changing priorities during 2 week sprint.
  18. Leave a little slack in your plan to allow for inevitable unknowns.