Analytics Calculations: Match+Index 3: Finding Duplicates
josephmckeown
Posts: 45Moderator, Lightspeed Staff moderator
We can use Analytics to help identify instances of duplicate Products, Customers, or pretty much any dimension on a report. Say we want to find duplicate Products. We could start by running the "Dusty Inventory" report, and removing the filters for "Is Dusty" and "Quantity on Hand is greater than zero"
Let's also add System ID as a dimension, to distinguish instances of Products with the same name from one another...
Then, to rule out case-sensitivity, let's use the "Upper" calculation from our cell-based functions to convert all content in the Product Description to upper case...
So when saving, everything is presented in capital letters...
The next thing we want to do is ask Analytics to find the first instance of each Description. We can do this by using the "Match" calculation from our table-based functions. I'll start by creating the calculation in brackets...
And we are going to ask Analytics to match the upper case descriptions...
...to the upper case descriptions...
...this should show us the row that each Description first shows up on, and we should expect to find that in most cases...
...the first instance is identical to the row.
Most cases.
It's the ones that don't match their row that we want to find now.
Let's save our first version of the report...
So now, let's ask Analytics to identify where the first instance is not the same as the row ID, this means that the first instance is higher on the table, and the current instance therefore, a duplicate.
Let's create an IF statement. If the row is the same as the first instance, return a 0, if not, return a 1.
So we'll start the IF statement...
Copy+paste the "First Instance" calculation...
and compare it to the row...
So that when we save...
We get a new calculation highlighting duplicates over others. Again, let's save this report as new...
Now we don't want to just identify the duplicates. We want to pull them out in a meaningful way. To start this, let's have the duplicates rise in sequence. We could do this by using the running_total function from our math functions...
and ask it to add the 1s from the duplicates...
so that when we save...
...with each new duplicate on the list comes a new number in sequence.
Let's save our work so far as a new report...
So with each new duplicate comes a new integer: 1, 2, 3, etc. Does this remind you of any other sequence? If you guessed rows, you're right!
That means we can ask Analytics, by row, to find where on the table Duplicate 1 occurs, again, using the Match function.
We'll start it with brackets...
First, we'll ask it to look at the Row...
...then, we'll copy the running total of duplicates...
...and ask it to look at this sequence...
Now look at the numbers that show up on our table...
It tells us that the first duplicate is in row 298, that the second duplicate is in row 409, that the third is in row 700, and so on...
We're almost done, but once again, let's save our work as new...
Finally, let's use the Index function to pull up these Descriptions so that we can now find them in Lightspeed!
We'll start our Index function with brackets...
Then ask it to look in the Descriptions column...
...for our matched position...
...of rising duplicates.
So now when we save...
We get a list of only duplicated Products!
Let's save our work again as new.
There are a few cosmetic things we can do now to tighten up the report. First, we can remove all but perhaps the final two calculations...
So that gives us more real estate to play with...
Also, let's hide the first set of dimensions from the Visualization...
This is an important step because the Dimensions on the left-hand-side have no bearing to the results on the right...
Finally, we could also ask Analytics to look for the System IDs for duplicate products, just by copy+pasting our final calculation...
...but substituting the index column from Description for the the System ID
This way, when we save...
We get one report that finds the results we need for our next steps in Lightspeed!
*Note, this process will work only for the first 5,000 rows of results in Analytics. Remember you can filter results by starting letters of Description, so that if duplicates are not in the same Category or Vendor, you can still identify them!
Calculations Home
Let's also add System ID as a dimension, to distinguish instances of Products with the same name from one another...
Then, to rule out case-sensitivity, let's use the "Upper" calculation from our cell-based functions to convert all content in the Product Description to upper case...
So when saving, everything is presented in capital letters...
The next thing we want to do is ask Analytics to find the first instance of each Description. We can do this by using the "Match" calculation from our table-based functions. I'll start by creating the calculation in brackets...
And we are going to ask Analytics to match the upper case descriptions...
...to the upper case descriptions...
...this should show us the row that each Description first shows up on, and we should expect to find that in most cases...
...the first instance is identical to the row.
Most cases.
It's the ones that don't match their row that we want to find now.
Let's save our first version of the report...
So now, let's ask Analytics to identify where the first instance is not the same as the row ID, this means that the first instance is higher on the table, and the current instance therefore, a duplicate.
Let's create an IF statement. If the row is the same as the first instance, return a 0, if not, return a 1.
So we'll start the IF statement...
Copy+paste the "First Instance" calculation...
and compare it to the row...
So that when we save...
We get a new calculation highlighting duplicates over others. Again, let's save this report as new...
Now we don't want to just identify the duplicates. We want to pull them out in a meaningful way. To start this, let's have the duplicates rise in sequence. We could do this by using the running_total function from our math functions...
and ask it to add the 1s from the duplicates...
so that when we save...
...with each new duplicate on the list comes a new number in sequence.
Let's save our work so far as a new report...
So with each new duplicate comes a new integer: 1, 2, 3, etc. Does this remind you of any other sequence? If you guessed rows, you're right!
That means we can ask Analytics, by row, to find where on the table Duplicate 1 occurs, again, using the Match function.
We'll start it with brackets...
First, we'll ask it to look at the Row...
...then, we'll copy the running total of duplicates...
...and ask it to look at this sequence...
Now look at the numbers that show up on our table...
It tells us that the first duplicate is in row 298, that the second duplicate is in row 409, that the third is in row 700, and so on...
We're almost done, but once again, let's save our work as new...
Finally, let's use the Index function to pull up these Descriptions so that we can now find them in Lightspeed!
We'll start our Index function with brackets...
Then ask it to look in the Descriptions column...
...for our matched position...
...of rising duplicates.
So now when we save...
We get a list of only duplicated Products!
Let's save our work again as new.
There are a few cosmetic things we can do now to tighten up the report. First, we can remove all but perhaps the final two calculations...
So that gives us more real estate to play with...
Also, let's hide the first set of dimensions from the Visualization...
This is an important step because the Dimensions on the left-hand-side have no bearing to the results on the right...
Finally, we could also ask Analytics to look for the System IDs for duplicate products, just by copy+pasting our final calculation...
...but substituting the index column from Description for the the System ID
This way, when we save...
We get one report that finds the results we need for our next steps in Lightspeed!
*Note, this process will work only for the first 5,000 rows of results in Analytics. Remember you can filter results by starting letters of Description, so that if duplicates are not in the same Category or Vendor, you can still identify them!
Calculations Home
Post edited by josephmckeown on
Joseph McKeown
Lightspeed Analytics and reporting consultant
Lightspeed HQ
Lightspeed Analytics and reporting consultant
Lightspeed HQ