Denormalise & join tabular data into simple rows

01 Jul 2019
tutorial 3 minute read

Like a database query join, how to denormalise tabular data to create expanded rows, ideal for mail-merges and data imports.

Introduction

Before importing data into a system, or creating a mail-merge, you might need to denormalise your data. In simple terms, this means taking a more efficient data representation and expand it into a series of (often partially repeating) rows.

This is essential because whilst it is easy for you to read a compressed table of data, it isn’t good fodder for feeding into a mail-merge routine or another system. For this, you will need a single row of data for each ‘thing’ you are representing - e.g. label you want to print.

Source Data

Imagine we need to print book labels, with a single label for each book. Every individual will need one book for each subject, and each of these books will need a label. If we have 30 names, and 10 subjects, we will need 300 labels printed. To do this with a conventional mail-merge, we will need 300 rows in our source spreadsheet.

We might start with a simple table containing the names of each individual and a list of all the subjects (assuming everyone takes the same subjects).

Names [A]Group [B]Subjects [C]
John Doe6AMaths
Jane Doe6AEnglish
Jimmy Doe6BScience
  History
  Geography
  

We now need to turn this into our de-normalised rows, one row for each name (Column A) and subject (Column C). We can do this with a couple of clever little formulas.

Denormalising

The first of these formulas will duplicate the names so that there are multiple rows (matching the number of subjects in column C) for each name. We start with the second row (A2:A) because the first row contains our column headings.

ARRAYFORMULA(TRANSPOSE(SPLIT(CONCATENATE(ARRAYFORMULA(SPLIT(REPT(FILTER(A2:A, A2:A <> "")&"✏", COUNTIF(C2:C, "<>")),"✏"))&"✏"),"✏")))

To pull across the group codes for each name, we could use a similar formula, or we could fall back to using a simple index/match approach (here we are assuming the output data begins on column E).

IF(LEN(E2)>0,INDEX(B:B,MATCH(E2,A:A,0),1),)

Finally, we use a similar formula to above to create the list of subjects. But, instead of being the same value repeated a certain number of times, it is the entire list of subjects, repeated for each name.

TRANSPOSE(ARRAYFORMULA(SPLIT(CONCATENATE(TRANSPOSE(ARRAYFORMULA(SPLIT(REPT(FILTER(C2:C, C2:C <> "")&"✏", COUNTIF(A2:A, "<>")),"✏")))&"✏"),"✏")))

Output Rows

Using these formulas on our example data from above (3 names and 5 subjects), we will end up with 15 output rows that look something like this:

Names [D]Group [E]Subjects [F]
John Doe6AMaths
John Doe6AEnglish
John Doe6AScience
John Doe6AHistory
John Doe6AGeography
Jane Doe6AMaths
Jane Doe6AEnglish
Jane Doe6AScience
Jane Doe6AHistory
Jane Doe6AGeography
Jimmy Doe6BMaths
Jimmy Doe6BEnglish
Jimmy Doe6BScience
Jimmy Doe6BHistory
Jimmy Doe6BGeography

You can see the two patterns here, 5 (the number of subjects) rows for each of the names, and the 5 subjects repeated for each of the 3 names.

This data can then be used to create our label mailmerge. Best of all, if we need to add extra subjects or names to our source data; our output rows will automatically update to reflect the new changes!

Why the ✏?

No reason at all! You just need to use any character or symbol that will not appear in your values at all! The pencil unicode symbol is just a memorable one chosen to meet this requirement.

Tagged Google, Sheet

Related Google Calendars on the Web (01 Sep 2018), Protecting Data with Google Drive (14 Aug 2018), Dealing with confidential information in schools (24 Jul 2018)