Day 13: Let's remove some duplicates
You're probably familiar with the 'remove duplicates' feature in Microsoft Excel
It simply removes all duplicates from the data you've selected
Now...
Why you'd need to remove duplicates depends on the kind of analysis you're doing, but there's an easy way to do it in R
Continuing with the tidyverse package, the distinct function keeps... you guessed it - what's distinct
But how does this work?
In simple terms, it keeps only the unique rows in a dataset.
And what counts as unique comes down to the columns you pass it.
Focusing on just one column? You get the distinct values in that column
Focusing on multiple? You get every distinct combination of those columns
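Here's a tiny sketch of that difference, using a made-up tibble (the data and column names are just illustrative, and it assumes you have dplyr loaded):

```r
library(dplyr)

# A toy tibble with some repeated values (hypothetical data)
df <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  tailnum = c("N1", "N1", "N2", "N3")
)

# One column: the distinct values of carrier - two rows
df |> distinct(carrier)

# Two columns: the distinct carrier-tailnum combinations - three rows,
# because ("AA", "N1") appears twice and is kept only once
df |> distinct(carrier, tailnum)
```

Same function, different results, purely because of which columns you name.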
Continuing with the flights dataset,
flights |>
distinct()
You should see the same number of rows as yesterday - i.e., 336,776
This makes sense given what we know about the dataset
However,
If we narrow it down to just two columns,
flights |>
  distinct(carrier, tailnum)
You should see far fewer rows...
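If you want to put a number on "fewer", you can count the rows before and after (this assumes the flights dataset from the nycflights13 package is loaded):

```r
library(dplyr)
library(nycflights13)

# Every row in flights is already unique
nrow(flights)  # 336,776

# But carrier-tailnum combinations repeat a lot,
# so keeping one row per combination shrinks the data dramatically
flights |>
  distinct(carrier, tailnum) |>
  nrow()
```

One row per carrier-tailnum pairing - a much smaller table than the full dataset.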
Personally, I'm not happy with how this data looks, so I've rearranged it
How?
You can do that using the arrange function from yesterday.
What's the code?
I'll leave that for you to figure out :)
Get some thinking in 😉
You're probably thinking today's code is the easiest you've had to run all week, but remember -
it's often the simplest code that's the most effective