How to identify dissimilar duplicate rows in sheets

15 Feb 2018
tutorial 4 minute read

How to identify duplicate rows in a sheet that have dissimilar values, such as multiple grades for the same student in the same subject from different teachers.

Introduction

You might encounter a situation where you have duplicate data rows that you need to identify and deal with. This commonly occurs during data collection cycles, where multiple people might erroneously enter data about the same ‘thing’. Grades for students in a particular subject is a classic example.

Having realised that you have these pseudo-duplicates, where the majority of the row is the same but just one column/cell might be different, how do you use formulas to show you which duplicate rows need your attention automatically?

Sorting to the rescue

It’s not as difficult as it might initially seem. Consider the rows below, which is a simplified version of what you might face.

StudentNameSubjectSkillGrade
1001Ann Q. PaceEnglishReadingB
1001Ann Q. PaceMathsNumeracyB
1001Ann Q. PaceEnglishReadingA
1002Bernard A. HansonEnglishReadingC
1003Adrienne EricksonEnglishReadingA+
1002Bernard A. HansonMathsNumeracyB
1004Hakeem FaulknerMathsNumeracyE
1004Hakeem FaulknerEnglishReadingC
1004Hakeem FaulknerEnglishReadingC-
1003Adrienne EricksonMathsNumeracyD

Of course, with such a small number of rows, it’s easy to identify the two pairs of pseudo-duplicate entries (Rows 1 & 3, 8 & 9). But imagine if this dataset was for 200 students, with ten subjects and four skills in each subject. That’s 8,000 rows already, which is too many to scan manually and accurately.

The first task is to ensure our data is sorted suitably. We need to order the data by student, then subject and finally skill (in that precedence). The data in the above it almost sorted, with just a few minor adjustments needed.

StudentNameSubjectSkillGrade
1001Ann Q. PaceEnglishReadingB
1001Ann Q. PaceEnglishReadingA
1001Ann Q. PaceMathsNumeracyB
1002Bernard A. HansonEnglishReadingC
1002Bernard A. HansonMathsNumeracyB
1003Adrienne EricksonEnglishReadingA+
1003Adrienne EricksonMathsNumeracyD
1004Hakeem FaulknerEnglishReadingC
1004Hakeem FaulknerEnglishReadingC-
1004Hakeem FaulknerMathsNumeracyE

Once the data has sorted correctly, we can use two comparison formulas to identify which rows are about the same student/subject/skill, and which have dissimilar grades.

Concatenating Keys

To compare our rows, we need to reduce them to two keys. The first needs to concatenate the student, subject and skill, and the second adds the grade to this mix.

CONCATENATE(A2,"_",C2,"_",D2)
CONCATENATE(A2,"_",C2,"_",D2,"_",E2)

To work out whether the current row is similar (same student, subject and skill) to the previous one, we need to compare the first of these keys. To work out if it has a different grade, we need to compare the second key to the previous row.

IF(CONCATENATE(A2,"_",C2,"_",D2)=CONCATENATE(A3,"_",C3,"_",D3),TRUE,FALSE)
IF(CONCATENATE(A2,"_",C2,"_",D2,"_",E2)=CONCATENATE(A3,"_",C3,"_",D3,"_",E3),TRUE,FALSE)

A duplicate row would have a true value for the first comparison (it is similar to the row before) and a false value for the second (the grade is not the same).

Single Formula Solution

We can build a conditional statement to test this, which will show true if the row is a pseudo-duplicate (different grade) of the previous one. The formula can be simplified by just comparing the grade as the second part of the conditional test.

IF(AND(IF(CONCATENATE(A2,"_",C2,"_",D2)=CONCATENATE(A3,"_",C3,"_",D3),TRUE,FALSE)=TRUE, IF(E2=E3,TRUE,FALSE)=FALSE), TRUE,)

This outputs the following, correctly identifying the rows which are similar to the previous ones (apart from the grade). The sorts we implemented ensures that all the duplicates will be found.

StudentNameSubjectSkillGradeDuplicate
1001Ann Q. PaceEnglishReadingB 
1001Ann Q. PaceEnglishReadingATRUE
1001Ann Q. PaceMathsNumeracyB 
1002Bernard A. HansonEnglishReadingC 
1002Bernard A. HansonMathsNumeracyB 
1003Adrienne EricksonEnglishReadingA+ 
1003Adrienne EricksonMathsNumeracyD 
1004Hakeem FaulknerEnglishReadingC 
1004Hakeem FaulknerEnglishReadingC-TRUE
1004Hakeem FaulknerMathsNumeracyE 

Tagged Google, Sheet, Markbook

Related Google Calendars on the Web (01 Sep 2018), Protecting Data with Google Drive (14 Aug 2018), Dealing with confidential information in schools (24 Jul 2018)