Python Pandas – Find difference between two data frames

Python Pandas – Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update :

The above method only works for those data frames that dont already have duplicates themselves. For example:

df1=pd.DataFrame({A:[1,2,3,3],B:[2,3,4,4]})
df2=pd.DataFrame({A:[1],B:[2]})

It will output like below , which is wrong

Wrong Output :

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct Output

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4

How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how=left).loc[lambda x : x[_merge]!=both]
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only

For rows, try this, where Name is the joint index column (can be a list for multiple common columns, or specify left_on and right_on):

m = df1.merge(df2, on=Name, how=outer, suffixes=[, _], indicator=True)

The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2, categorized into 3 possible kinds: left_only, right_only or both.

For columns, try this:

set(df1.columns).symmetric_difference(df2.columns)

Python Pandas – Find difference between two data frames

Accepted answer Method 1 will not work for data frames with NaNs inside, as pd.np.nan != pd.np.nan. I am not sure if this is the best way, but it can be avoided by

df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]

Its slower, because it needs to cast data to string, but thanks to this casting pd.np.nan == pd.np.nan.

Lets go trough the code. First we cast values to string, and apply tuple function to each row.

df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)

Thanks to that, we get pd.Series object with list of tuples. Each tuple contains whole row from df1/df2.
Then we apply isin method on df1 to check if each tuple is in df2.
The result is pd.Series with bool values. True if tuple from df1 is in df2. In the end, we negate results with ~ sign, and applying filter on df1. Long story short, we get only those rows from df1 that are not in df2.

To make it more readable, we may write it as:

df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]

Leave a Reply

Your email address will not be published.