How to identify duplicate rows in a sheet that have dissimilar values, such as multiple grades for the same student in the same subject from different teachers.
Introduction
You might encounter a situation where you have duplicate data rows that you need to identify and deal with. This commonly occurs during data collection cycles, where multiple people might erroneously enter data about the same ‘thing’. Grades for students in a particular subject is a classic example.
Having realised that you have these pseudo-duplicates, where the majority of the row is the same but just one column/cell might be different, how do you use formulas to show you which duplicate rows need your attention automatically?
Sorting to the rescue
It’s not as difficult as it might initially seem. Consider the rows below, which is a simplified version of what you might face.
Student | Name | Subject | Skill | Grade |
---|---|---|---|---|
1001 | Ann Q. Pace | English | Reading | B |
1001 | Ann Q. Pace | Maths | Numeracy | B |
1001 | Ann Q. Pace | English | Reading | A |
1002 | Bernard A. Hanson | English | Reading | C |
1003 | Adrienne Erickson | English | Reading | A+ |
1002 | Bernard A. Hanson | Maths | Numeracy | B |
1004 | Hakeem Faulkner | Maths | Numeracy | E |
1004 | Hakeem Faulkner | English | Reading | C |
1004 | Hakeem Faulkner | English | Reading | C- |
1003 | Adrienne Erickson | Maths | Numeracy | D |
Of course, with such a small number of rows, it’s easy to identify the two pairs of pseudo-duplicate entries (Rows 1 & 3, 8 & 9). But imagine if this dataset was for 200 students, with ten subjects and four skills in each subject. That’s 8,000 rows already, which is too many to scan manually and accurately.
The first task is to ensure our data is sorted suitably. We need to order the data by student, then subject and finally skill (in that precedence). The data in the above it almost sorted, with just a few minor adjustments needed.
Student | Name | Subject | Skill | Grade |
---|---|---|---|---|
1001 | Ann Q. Pace | English | Reading | B |
1001 | Ann Q. Pace | English | Reading | A |
1001 | Ann Q. Pace | Maths | Numeracy | B |
1002 | Bernard A. Hanson | English | Reading | C |
1002 | Bernard A. Hanson | Maths | Numeracy | B |
1003 | Adrienne Erickson | English | Reading | A+ |
1003 | Adrienne Erickson | Maths | Numeracy | D |
1004 | Hakeem Faulkner | English | Reading | C |
1004 | Hakeem Faulkner | English | Reading | C- |
1004 | Hakeem Faulkner | Maths | Numeracy | E |
Once the data has sorted correctly, we can use two comparison formulas to identify which rows are about the same student/subject/skill, and which have dissimilar grades.
Concatenating Keys
To compare our rows, we need to reduce them to two keys. The first needs to concatenate the student, subject and skill, and the second adds the grade to this mix.
To work out whether the current row is similar (same student, subject and skill) to the previous one, we need to compare the first of these keys. To work out if it has a different grade, we need to compare the second key to the previous row.
A duplicate row would have a true value for the first comparison (it is similar to the row before) and a false value for the second (the grade is not the same).
Single Formula Solution
We can build a conditional statement to test this, which will show true if the row is a pseudo-duplicate (different grade) of the previous one. The formula can be simplified by just comparing the grade as the second part of the conditional test.
This outputs the following, correctly identifying the rows which are similar to the previous ones (apart from the grade). The sorts we implemented ensures that all the duplicates will be found.
Student | Name | Subject | Skill | Grade | Duplicate |
---|---|---|---|---|---|
1001 | Ann Q. Pace | English | Reading | B | |
1001 | Ann Q. Pace | English | Reading | A | TRUE |
1001 | Ann Q. Pace | Maths | Numeracy | B | |
1002 | Bernard A. Hanson | English | Reading | C | |
1002 | Bernard A. Hanson | Maths | Numeracy | B | |
1003 | Adrienne Erickson | English | Reading | A+ | |
1003 | Adrienne Erickson | Maths | Numeracy | D | |
1004 | Hakeem Faulkner | English | Reading | C | |
1004 | Hakeem Faulkner | English | Reading | C- | TRUE |
1004 | Hakeem Faulkner | Maths | Numeracy | E |
Tagged ► Google, Sheet, Markbook
Related ► Google Calendars on the Web (01 Sep 2018), Protecting Data with Google Drive (14 Aug 2018), Dealing with confidential information in schools (24 Jul 2018)