CoffeeFiend said:
Modified Windows files? How about running sfc /scannow? That's built in, and meant to fix precisely those kinds of problems (there's system restore too). Or otherwise, why not compare the SHA1 hash of the file with one of the online lists that already exist, or from a known good file on another machine? As for identifying malware by running fc /b on 2 files... Most people have an antivirus which seems like a far better option for that
I don't need to endure belittling comments like this in my attempts to just try and learn something. All I want is advice, and I'll pursue the necessary steps based on feedback I've been given to improve on this. That's all I really want here...
Please note that I already mentioned that I'm aware of the System File Checker command, and that I was just trying to find a very quick example. The uses of this thing are limited only by imagination, aside from the script's own limits with larger files, which several other members have already made me well aware of throughout this thread. I don't need to have it reiterated over and over; that only gives me a headache and adds extra posts to this thread that give me deja vu when I find I'm reading what I've already been told.
I KNOW I'm not the greatest batch programmer, but I need some new information as well. Having criticism about how useless my script is bashed into my head does not help me. It may be useless no matter what I do to this script. But I just want to learn how to improve on it so I can become better at batch.
I'm not looking to make a masterpiece here, or the absolute best script in the world for comparing files. If I were to do that, I probably wouldn't even use batch, as it's slow anyway.
Take a look at what you've said to me below here...
CoffeeFiend said:
A simple byte-for-byte comparison would catch that. That's pretty easy to write in any language if it's not already built-in (you said you're using perl which has File::Compare, python has filecmp.cmp, etc). Hashing here only adds CPU load for no reason (and as Jaclaz already pointed out, MD5 is quite old and a bad idea in general, SHA1 is a common replacement for it). It also increases comparison time not only by being CPU bound, but also by forcing you to hash the whole thing, whereas when you're doing a byte-for-byte comparison you can easily quit at the first byte that's dissimilar (and it very well may be the first byte of a file that's hundreds of MBs). Using hashes is mainly useful in different scenarios, like comparing one file to a known hash i.e. when you don't have the other file on hand, or don't want to send/copy it elsewhere to compare it there (and other tasks like for password authentication obviously). Unless you want to compare a large number of files together and identify duplicates (not necessarily comparing against one specific file), in which case hashing indeed works nicely (it saves time by not having to re-read lots of files, lots of times)
This part is something new to take into account. I already know well about password authentication and hash comparison methods, used on websites for the most part to authenticate a user against a MySQL database, and either SHA1 or MD5 is most commonly used for that.
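For reference, the early-exit byte-for-byte comparison described in the quote above could be sketched in Python roughly as follows. This is only a minimal sketch: the function name files_identical and the 64 KB chunk size are illustrative assumptions, not anything from this thread, and Python's built-in filecmp.cmp (mentioned in the quote) already covers the same job.

```python
import os


def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files have exactly the same bytes.

    Hypothetical helper for illustration only.
    """
    # Files of different sizes can never match, so nothing needs to be read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                # Stop at the first chunk that differs; the rest is never read.
                return False
            if not chunk_a:
                # Both files hit end-of-file together without a mismatch.
                return True
```

The point of the sketch is the early return: if the very first chunk differs, a file that is hundreds of MBs is rejected after reading only a few KB, whereas a hash would need the whole file.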
CoffeeFiend said:
Multi-threading is of no use here anyway. I'm not sure how you were expecting to use it, or what for. But if you try to read two or more files at once and then hash them it's going to be quite a bit slower, due to drastically increased seeking (except on SSDs). Unless you plan on having one thread reading the entire file (which might be huge) to RAM, and then while it hashes the other thread loads another file to RAM -- or one thread that queues files to hash in byte arrays in RAM while the other thread does the hashing. That would require TONS of RAM if there are large files (e.g. comparing two DVD9 .iso's would require more than 16GB of free RAM), and the speed gain is rather minimal vs using streams (which uses very little memory).
What would be the purpose of comparing DVDs though? Multi-threading can be of use; it just depends on how you want to use it. As you say, you could compare 100 different files on the same thread, but memory would be a factor there.
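To illustrate the "streams" approach from the quote above, here is a rough Python sketch that hashes a file in fixed-size chunks, so memory use stays small no matter how big the file is. The function name hash_file, the 1 MB chunk size and the default of SHA1 are assumptions made for the example only.

```python
import hashlib


def hash_file(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it one chunk at a time.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.new(algorithm)  # e.g. "sha1" or "md5"
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Only one chunk is held in memory at a time, so even a DVD-sized .iso needs about 1 MB of RAM here instead of the whole file. Calling hash_file(path, "md5") switches the algorithm without changing anything else.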
CoffeeFiend said:
I'm well aware of those points
SHA1 is getting old indeed, but it's still "good enough" for most file comparison tasks and what's still getting used the most today, even when security is involved. Other hash algos tend to be slower and mainly overkill for this particular job here. MD5 though... I can't think of a reason I'd start on a new program/design using that in 2012.
There's nothing wrong with MD5 in my opinion. It still works. There's tons of hashes out there, but MD5 is common in most things. If you say SHA1 is still "good enough" then I don't see a reason as to why MD5 can't be classified in the same way.
As jaclaz said, a collision is highly unlikely, and I've never come across an MD5 collision myself either. It's still fairly unlikely unless someone deliberately tries to produce a match. But that would be hard to do with something like malware, because the malicious function has to remain operational while the file still matches an MD5. If you're binding code to a Windows file, you'd also have to work out how that can be done, possibly compressing the Windows file so that the bound malicious code matches the original file size as well after everything is done. Not a very easy task for malicious code developers.
It's still a VERY common and widely used hashing algorithm, so there must not be anything wrong with it. The C programming language is old, but it's still used a lot. Something being old doesn't force us to change on the assumption that there must be something better out there.
This post has been edited by AceInfinity: 28 January 2012 - 05:54 PM