How do I calculate the MD5 checksum of a file in Python?
Regarding your error and what's missing in your code: m is a name that is not defined inside the getmd5() function.
No offence, I know you are a beginner, but your code is all over the place. Let's look at your issues one by one 🙂
First, you are not using the hashlib.md5.hexdigest() method correctly. Please refer to the explanation of the hashlib functions in the Python library documentation. The correct way to return the MD5 of a given string is something like this:
>>> import hashlib
>>> hashlib.md5(b"filename.exe").hexdigest()  # Python 3: md5() needs bytes, hence the b prefix
'2a53375ff139d9837e93a38a279d63e5'
However, you have a bigger problem here. You are calculating the MD5 of a file name string, whereas in reality MD5 is calculated from the file's contents. You will need to read the file contents and pipe them through MD5. My next example is not very efficient, but something like this:
>>> import hashlib
>>> hashlib.md5(open("filename.exe", "rb").read()).hexdigest()
'd41d8cd98f00b204e9800998ecf8427e'
As you can clearly see, the second MD5 hash is totally different from the first one. The reason is that we are now pushing the contents of the file through MD5, not just the file name.
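To make the difference concrete, here is a minimal sketch that hashes a file's name and then its contents. The temporary file and its contents (b"hello") are made up for illustration and are not from the question.

```python
import hashlib
import os
import tempfile

# Create a throwaway file with known contents (hypothetical example data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".exe") as tmp:
    tmp.write(b"hello")
    path = tmp.name

# Hash of the file *name*: depends only on the characters in the path.
name_md5 = hashlib.md5(path.encode("utf-8")).hexdigest()

# Hash of the file *contents*: depends only on the bytes stored in the file.
with open(path, "rb") as f:
    contents_md5 = hashlib.md5(f.read()).hexdigest()

os.remove(path)

print(name_md5)
print(contents_md5)  # MD5 of b"hello": 5d41402abc4b2a76b9719d911017c592
```

Renaming the file would change the first value but not the second, which is why a checksum has to be computed from the contents.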
A simple solution could be something like this:
# Import hashlib library (md5 method is part of it)
import hashlib

# File to check
file_name = "filename.exe"

# Correct original md5 goes here
original_md5 = "5d41402abc4b2a76b9719d911017c592"

# Open, read and close the file, and calculate MD5 on its contents
with open(file_name, "rb") as file_to_check:
    # read contents of the file
    data = file_to_check.read()
    # pipe contents of the file through
    md5_returned = hashlib.md5(data).hexdigest()

# Finally compare original MD5 with freshly calculated
if original_md5 == md5_returned:
    print("MD5 verified.")
else:
    print("MD5 verification failed!")
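If you need this check in more than one place, it could be wrapped in a small function. This is only a sketch: the name md5_matches is hypothetical, and lowercasing the expected value is an assumption on my part, since published checksums are sometimes written in uppercase.

```python
import hashlib


def md5_matches(path, expected_md5):
    """Return True if the MD5 of the file at `path` equals `expected_md5`."""
    with open(path, "rb") as file_to_check:
        actual = hashlib.md5(file_to_check.read()).hexdigest()
    # hexdigest() always returns lowercase hex, so normalize the expected
    # value before comparing.
    return actual == expected_md5.strip().lower()
```

Like the snippet above, it reads the whole file into memory, so it is best suited to small files.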
Please look at the post "Python: Generating a MD5 checksum of a file". It explains in detail a couple of ways to do this efficiently.
Best of luck.
In Python 3.8+ you can do
import hashlib

with open("your_filename.png", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
On Python 3.7 and below:
import hashlib

with open("your_filename.png", "rb") as f:
    file_hash = hashlib.md5()
    chunk = f.read(8192)
    while chunk:
        file_hash.update(chunk)
        chunk = f.read(8192)

print(file_hash.hexdigest())
This reads the file 8192 (or 2¹³) bytes at a time, instead of all at once with f.read(), to use less memory.
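The same chunked loop can be packaged as a reusable function. The sketch below uses the hypothetical name file_md5 and assumes you may be running on different Python versions: hashlib.file_digest() was added in Python 3.11 and does the chunked reading itself, so the function uses it when available and otherwise falls back to a manual loop.

```python
import hashlib


def file_md5(path, chunk_size=8192):
    """Return the hex MD5 digest of the file at `path` without loading it whole."""
    with open(path, "rb") as f:
        # Python 3.11+: let hashlib read the file in chunks for us.
        # (chunk_size is not used on this path.)
        if hasattr(hashlib, "file_digest"):
            return hashlib.file_digest(f, "md5").hexdigest()
        # Older versions: read chunk_size bytes at a time until EOF (b"").
        file_hash = hashlib.md5()
        for chunk in iter(lambda: f.read(chunk_size), b""):
            file_hash.update(chunk)
        return file_hash.hexdigest()
```

Both branches produce the same digest as hashing the full contents in one call; only the peak memory use differs.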
Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippets). It's cryptographically secure and faster than MD5.
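For illustration, here is what that replacement could look like as a function. The name file_blake2b is hypothetical, and the loop is the pre-3.8 style so it runs on older versions too.

```python
import hashlib


def file_blake2b(path, chunk_size=8192):
    """Return the hex BLAKE2b digest of the file at `path`, read in chunks."""
    file_hash = hashlib.blake2b()  # default digest size is 64 bytes
    with open(path, "rb") as f:
        chunk = f.read(chunk_size)
        while chunk:
            file_hash.update(chunk)
            chunk = f.read(chunk_size)
    return file_hash.hexdigest()
```

Note that the result is 128 hex characters rather than MD5's 32, so it cannot be compared against a published MD5 checksum. This only helps when you control both sides of the comparison.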
hashlib functions also accept mmap objects, so I often use:
from hashlib import md5
from mmap import mmap, ACCESS_READ

path = ...
with open(path) as file, mmap(file.fileno(), 0, access=ACCESS_READ) as file:
    print(md5(file).hexdigest())
where path is the path to your file.
Ref: https://docs.python.org/library/mmap.html#mmap.mmap
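One caveat with this approach is that mmap raises ValueError for an empty file. A minimal sketch that guards against that case follows. The function name mmap_md5 is hypothetical, and opening the file in "rb" mode is my choice rather than part of the snippet above.

```python
import hashlib
import os
from mmap import ACCESS_READ, mmap


def mmap_md5(path):
    """Return the hex MD5 digest of the file at `path` using a memory map."""
    # An empty file cannot be memory-mapped, so hash empty input directly.
    if os.path.getsize(path) == 0:
        return hashlib.md5(b"").hexdigest()
    with open(path, "rb") as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mapped:
        # mmap objects support the buffer protocol, so md5() accepts them.
        return hashlib.md5(mapped).hexdigest()
```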
Edit: Comparison with the plain-read method.
from hashlib import md5
from mmap import ACCESS_READ, mmap

from matplotlib.pyplot import grid, legend, plot, show, tight_layout, xlabel, ylabel
from memory_profiler import memory_usage
from numpy import arange


def MemoryMap():
    with open(path) as file, mmap(file.fileno(), 0, access=ACCESS_READ) as file:
        print(md5(file).hexdigest())


def PlainRead():
    with open(path, "rb") as file:
        print(md5(file.read()).hexdigest())


if __name__ == "__main__":
    path = ...

    # memory_usage samples every 0.01 s, so index / 100 gives seconds
    y = memory_usage(MemoryMap, interval=0.01)
    plot(arange(len(y)) / 100, y, label="mmap")

    y = memory_usage(PlainRead, interval=0.01)
    plot(arange(len(y)) / 100, y, label="read")

    ylabel("Memory Usage (MiB)")
    xlabel("Time (s)")
    legend()
    grid()
    tight_layout()
    show()
Here path is the path to a 3.77 GiB CSV file.