python pandas remove duplicate columns
python pandas remove duplicate columns
Heres a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()]
How it works:
Suppose the columns of the data frame are [alpha,beta,alpha]
df.columns.duplicated()
returns a boolean array: a True
or False
for each column. If it is False
then the column name is unique up to that point, if it is True
then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True]
.
Pandas
allows one to index using boolean values whereby it selects only the True
values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True]
)
Finally, df.loc[:,[True,True,False]]
selects only the non-duplicated columns using the aforementioned indexing capability.
Note: the above only checks columns names, not column values.
It sounds like you already know the unique column names. If thats the case, then df = df[Time, Time Relative, N2]
would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=[Time, H1, N2, Time Relative, N2, Time] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data thats messing it up. We could give more help if theres more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file dummy.csv I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table
gives unique columns and works properly:
In [151]: df2 = pd.read_table(dummy.csv)
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesnt let your, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table(dummy.csv, header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.ix[col_ix]:
cnt = col_counts[col]
col_counts[col] += 1
suf = _ + str(cnt) if cnt else
cols.append(col + suf)
cols
Out[172]:
[Time, H1, N2, Time_1, N2_1, Time Relative]
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
python pandas remove duplicate columns
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
groups = frame.columns.to_series().groupby(frame.dtypes).groups
dups = []
for t, v in groups.items():
dcols = frame[v].to_dict(orient=list)
vs = dcols.values()
ks = dcols.keys()
lvs = len(vs)
for i in range(lvs):
for j in range(i+1,lvs):
if vs[i] == vs[j]:
dups.append(ks[i])
break
return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory efficient version that treats nans like any other value:
from pandas.core.common import array_equivalent
def duplicate_columns(frame):
groups = frame.columns.to_series().groupby(frame.dtypes).groups
dups = []
for t, v in groups.items():
cs = frame[v].columns
vs = frame[v]
lcs = len(cs)
for i in range(lcs):
ia = vs.iloc[:,i].values
for j in range(i+1, lcs):
ja = vs.iloc[:,j].values
if array_equivalent(ia, ja):
dups.append(cs[i])
break
return dups