Index of duplicate items in a Python list

Pass the optional second argument to index, which is the position where the search should start. After each match, reset this argument to the position just after the match that was found.

def list_duplicates_of(seq,item):
    start_at = -1
    locs = []
    while True:
        try:
            loc = seq.index(item,start_at+1)
        except ValueError:
            break
        else:
            locs.append(loc)
            start_at = loc
    return locs

source = "ABABDBAAEDSBQEWBAFLSAFB"
print(list_duplicates_of(source, "B"))

Prints:

[1, 3, 5, 11, 15, 22]

You can find all the duplicates at once in a single pass through source, by using a defaultdict to keep a list of all seen locations for any item, and returning those items that were seen more than once.

from collections import defaultdict

def list_duplicates(seq):
    tally = defaultdict(list)
    for i,item in enumerate(seq):
        tally[item].append(i)
    return ((key,locs) for key,locs in tally.items() 
                            if len(locs)>1)

for dup in sorted(list_duplicates(source)):
    print(dup)

Prints:

('A', [0, 2, 6, 7, 16, 20])
('B', [1, 3, 5, 11, 15, 22])
('D', [4, 9])
('E', [8, 13])
('F', [17, 21])
('S', [10, 19])

If you want to test various keys against the same source repeatedly, you can use functools.partial to create a new function with a partially filled argument list, that is, with seq already specified but the item to search for left open:

from functools import partial
dups_in_source = partial(list_duplicates_of, source)

for c in "ABDEFS":
    print(c, dups_in_source(c))

Prints:

A [0, 2, 6, 7, 16, 20]
B [1, 3, 5, 11, 15, 22]
D [4, 9]
E [8, 13]
F [17, 21]
S [10, 19]

>>> def indices(lst, item):
...   return [i for i, x in enumerate(lst) if x == item]
... 
>>> indices(List, "A")
[0, 2]

To get all duplicates, you can use the method below, but it is not very efficient. If efficiency is important, you should consider Ignacio's solution instead.

>>> dict((x, indices(List, x)) for x in set(List) if List.count(x) > 1)
{'A': [0, 2]}

As for solving it using the index method of list instead, that method takes a second optional argument indicating where to start, so you could just repeatedly call it with the previous index plus 1.

>>> List.index("A")
0
>>> List.index("A", 1)
2

I benchmarked all the solutions suggested here and also added another solution to this problem (described at the end of this answer).

Benchmarks

First, the benchmarks. I initialize a list of n random ints within the range [1, n/2] and then call timeit on each algorithm.
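
The benchmark code itself is not reproduced in this answer, so the following is only a minimal sketch of what such a harness might look like. The names dupl_single_pass, dupl_count_based, make_data and run_benchmark are hypothetical stand-ins, not the dupl_* functions timed in the results; the fixed seed is likewise an assumption added for repeatability.

```python
import random
import timeit
from collections import defaultdict


def dupl_single_pass(seq):
    """Map each duplicated value to all of its indexes, in one pass."""
    tally = defaultdict(list)
    for i, item in enumerate(seq):
        tally[item].append(i)
    return {key: locs for key, locs in tally.items() if len(locs) > 1}


def dupl_count_based(seq):
    """Same result, but rescans seq for every distinct value."""
    return {x: [i for i, y in enumerate(seq) if y == x]
            for x in set(seq) if seq.count(x) > 1}


def make_data(n, seed=0):
    """Build n random ints in [1, n // 2] (seed is a hypothetical knob)."""
    rng = random.Random(seed)
    return [rng.randint(1, max(1, n // 2)) for _ in range(n)]


def run_benchmark(algorithms, n, loops):
    """Time each algorithm on the same list and return {name: seconds}."""
    data = make_data(n)
    results = {}
    for algo in algorithms:
        # Bind algo as a default argument so each timer calls its own function.
        results[algo.__name__] = timeit.timeit(
            lambda algo=algo: algo(data), number=loops)
    return results


timings = run_benchmark([dupl_single_pass, dupl_count_based], n=100, loops=1000)
for name, seconds in timings.items():
    print("Algorithm:", name)
    print("Timing:", seconds)
```

Timing every algorithm on the same list keeps the comparison fair; the absolute numbers will differ from machine to machine.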

The solutions of @Paul McGuire and @Ignacio Vazquez-Abrams run almost twice as fast as the next-best solution, and several times faster than the slowest, on a list of 100 ints:

Testing algorithm on the list of 100 items using 10000 loops
Algorithm: dupl_eat
Timing: 1.46247477189
####################
Algorithm: dupl_utdemir
Timing: 2.93324529055
####################
Algorithm: dupl_lthaulow
Timing: 3.89198786645
####################
Algorithm: dupl_pmcguire
Timing: 0.583058259784
####################
Algorithm: dupl_ivazques_abrams
Timing: 0.645062989076
####################
Algorithm: dupl_rbespal
Timing: 1.06523873786
####################

If you change the number of items to 1000, the difference becomes much bigger (BTW, I'd be happy if someone could explain why):

Testing algorithm on the list of 1000 items using 1000 loops
Algorithm: dupl_eat
Timing: 5.46171654555
####################
Algorithm: dupl_utdemir
Timing: 25.5582547323
####################
Algorithm: dupl_lthaulow
Timing: 39.284285326
####################
Algorithm: dupl_pmcguire
Timing: 0.56558489513
####################
Algorithm: dupl_ivazques_abrams
Timing: 0.615980005148
####################
Algorithm: dupl_rbespal
Timing: 1.21610942322
####################
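
One plausible explanation for the widening gap, offered as an assumption because the timed implementations are not shown here: both runs process the same total number of items (100 x 10000 and 1000 x 1000), so a single-pass algorithm does the same total work in each, while an algorithm that rescans the list once per distinct value does about ten times more. That would match dupl_pmcguire staying near 0.57 s while dupl_lthaulow grows from about 3.9 s to about 39 s. The arithmetic can be sketched as follows; the n // 2 bound on distinct values comes from drawing ints from [1, n/2], and the function names are hypothetical.

```python
def total_visits_single_pass(n, loops):
    """Element visits when each call walks the list exactly once."""
    return n * loops


def total_visits_rescanning(n, loops):
    """Element visits when each call rescans the list once per distinct value.

    Assumption: up to n // 2 distinct values, since ints come from [1, n/2].
    """
    distinct_upper_bound = n // 2
    return distinct_upper_bound * n * loops


for n, loops in [(100, 10000), (1000, 1000)]:
    print(n, total_visits_single_pass(n, loops), total_visits_rescanning(n, loops))
# 100 1000000 50000000
# 1000 1000000 500000000
```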

On bigger lists, @Paul McGuire's solution remains the most efficient, while my algorithm starts to fall behind.

Testing algorithm on the list of 1000000 items using 1 loops
Algorithm: dupl_pmcguire
Timing: 1.5019953958
####################
Algorithm: dupl_ivazques_abrams
Timing: 1.70856155898
####################
Algorithm: dupl_rbespal
Timing: 3.95820421595
####################

The full code of the benchmark is here

Another algorithm

Here is my solution to the same problem:

def dupl_rbespal(c):
    alreadyAdded = False
    dupl_c = dict()
    # Sort the indexes by the value they point to, so equal values become adjacent.
    sorted_ind_c = sorted(range(len(c)), key=lambda x: c[x])

    # Walk over neighbouring pairs of sorted indexes.
    for i in range(len(c) - 1):
        # Two consecutive indexes pointing to the same value mark a duplicate.
        if c[sorted_ind_c[i]] == c[sorted_ind_c[i+1]]:
            if not alreadyAdded:
                dupl_c[c[sorted_ind_c[i]]] = [sorted_ind_c[i], sorted_ind_c[i+1]]
                alreadyAdded = True
            else:
                dupl_c[c[sorted_ind_c[i]]].append(sorted_ind_c[i+1])
        else:
            alreadyAdded = False
    return dupl_c

Although it's not the fastest, it let me generate the slightly different structure I needed for my problem (something like a linked list of the indexes that share a value).
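
The linked structure itself is not shown in this answer. One possible reading, purely an assumption, is a "next occurrence" table in which each position points to the next index holding the same value. A minimal sketch of that idea, using the hypothetical name next_same_value:

```python
def next_same_value(c):
    """For each position, give the next index holding the same value, or None."""
    next_idx = [None] * len(c)
    last_seen = {}  # value -> most recent index where it appeared
    for i, value in enumerate(c):
        if value in last_seen:
            # Link the previous occurrence forward to this one.
            next_idx[last_seen[value]] = i
        last_seen[value] = i
    return next_idx


print(next_same_value(list("ABABA")))
# [2, 3, 4, None, None]
```

Following the links from the first occurrence of a value visits every index of that value in order, which is the behaviour a linked list of indexes would give.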
