
Collecting, analyzing, and reporting with data can be daunting. The person whom SAGE Publishing, the parent of MethodSpace, turns to when it has questions is Diana Aleman, editor extraordinaire for SAGE Stats and U.S. And now she is bringing her trials, tribulations, and expertise with data to you in a monthly blog, Tips with Diana. Stay tuned for Diana’s experiences, tips, and tricks with finding, analyzing, and visualizing data.

There may be more than a few data points to double-check as you review and clean a data file. These can include blank values, outlier data points, data label misspellings, and so on. Duplicate data points are probably among the most difficult problems to spot unless you’re lucky. Duplicates are exactly what they sound like: exact copies of the same data point. For instance, if I am looking at a data set on the number of hamsters across the United States and I see that Wisconsin has two data points, both of which are 50,000 (totally fabricated!), then I can infer that the data set has mistakenly included the Wisconsin value twice. So why does this matter? It matters because duplicate data points may inadvertently lead to miscalculation or misunderstanding of the data. The appearance of duplicates does not necessarily mean the entire data set is wrong, only that it may require a closer eye and some additional clean-up work, as most data sets do.

Thankfully, Excel offers two handy features that simplify the identification and removal of duplicate data points from a file! Finding one set of duplicate data points is usually enough for me to decide that Conditional Formatting should be applied to check whether any additional duplicates are present in the file.

The Conditional Formatting feature programmatically identifies duplicates in an entire data set. Without this feature I would be forced to manually check each data point. That may not be a big deal for a data set with about 50 rows of data, but it can be an incredibly inefficient process for a data set that contains, say, over 50,000 rows. If you want to identify duplicates across the entire data set, then select the entire set. You don’t have to select the entire data set, though: you may want to identify duplicate values only in a particular column or row, in which case you select just that column or row.
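For readers who also work outside Excel, the duplicate check described in this post can be sketched in a few lines of Python. This is only a rough illustration of the idea, not a description of how Conditional Formatting works internally; the `find_duplicates` helper and the sample rows are made up for the example, just like the hamster numbers.

```python
from collections import Counter


def find_duplicates(values):
    """Return the set of values that appear more than once."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n > 1}


# Made-up sample data: (state, number of hamsters).
rows = [
    ("Wisconsin", 50000),
    ("Ohio", 12000),
    ("Wisconsin", 50000),
    ("Iowa", 8000),
]

# Check whole rows, similar to selecting the entire data set.
duplicate_rows = find_duplicates(rows)

# Check a single column, similar to selecting only that column.
duplicate_states = find_duplicates([state for state, _ in rows])

# Flag each row the way highlighting would, so it can be reviewed by eye.
for line_number, row in enumerate(rows, start=1):
    flag = "DUPLICATE" if row in duplicate_rows else ""
    print(line_number, row[0], row[1], flag)
```

Like the highlighting in Excel, this only flags the repeated rows. Deciding whether a repeat is a genuine mistake still takes a closer look at the data.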
