Day 13: Let's remove some duplicates
You're probably familiar with the 'remove duplicates' feature in Microsoft Excel
It simply removes all duplicates from the data you've selected
Now...
Why you'd need to remove duplicates depends on the kind of analysis you're doing, but there's an easy way to do it in R
Continuing with the tidyverse package, the distinct function keeps... you guessed it - what's distinct
But how does this work?
In simple terms, it keeps only the unique rows in a dataset.
And what counts as unique comes down to the columns you pass it.
Focusing on just one column? You get the distinct values in that column
Focusing on multiple? You get every distinct combination of those columns
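Here's a tiny sketch of that difference, using a made-up tibble (the data and column names are just illustrative, and it assumes you have dplyr loaded):

```r
library(dplyr)

# A toy tibble with some repeated values (hypothetical data)
df <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  tailnum = c("N1", "N1", "N2", "N3")
)

# One column: the distinct values of carrier - two rows
df |> distinct(carrier)

# Two columns: the distinct carrier-tailnum combinations - three rows,
# because ("AA", "N1") appears twice and is kept only once
df |> distinct(carrier, tailnum)
```

Same function, different results, purely because of which columns you name.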
Continuing with the flights dataset,
flights |>
distinct()
You should see the same number of rows as yesterday - i.e., 336,776
This makes sense given what we know about the dataset
However,
If we narrow it down to just two columns,
flights |>
  distinct(carrier, tailnum)
You should see far fewer rows...
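If you want to put a number on "fewer", you can count the rows before and after (this assumes the flights dataset from the nycflights13 package is loaded):

```r
library(dplyr)
library(nycflights13)

# Every row in flights is already unique
nrow(flights)  # 336,776

# But carrier-tailnum combinations repeat a lot,
# so keeping one row per combination shrinks the data dramatically
flights |>
  distinct(carrier, tailnum) |>
  nrow()
```

One row per carrier-tailnum pairing - a much smaller table than the full dataset.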
Personally, I'm not happy with how this data looks, so I've rearranged it
How?
You can do that using the arrange function from yesterday.
What's the code?
I'll leave that for you to figure out :)
Get some thinking in 😉
You're probably thinking today's code is the easiest you've had to run all week, but remember -
it's often the simplest code that's the most effective