[python] Poor man’s forensics

So after a period of ‘lesser technical times’ I finally got a chance to play around with bits, bytes and other subjects of the information security world. A while back I got involved in a forensic investigation and worked with the team to answer the investigative questions. This was an interesting journey since a lot of things piqued my interest or ended up on one of my todo lists.

One of the reasons that my interest was piqued is that, yes, you can use a lot of pre-made tools to process the disk images and start your investigation once that processing is done. However, there are still a lot of questions you could answer much quicker if you had a subset of that data available ‘instantly’. The other reason is that not all the tools understand all the file systems out there, which means that if you encounter an exotic file system your options are heavily reduced. One of the tools I like and which inspired me for these quick & dirty scripts is ‘mac-robber’ (be aware that it changes file times if the destination is not mounted read-only), since it’s able to process any file system as long as it’s mounted on an operating system on which mac-robber is able to run. An example of running mac-robber:

sudo mac-robber mnt/ | head

You can even timeline the output if you want with mactime:

sudo mac-robber mnt/ | mactime -d | head
Date,Size,Type,Mode,UID,GID,Meta,File Name
Thu Jan 01 1970 01:00:00,2048,...b,dr-xr-xr-x,0,0,0,"mnt/.disk"
Thu Jan 01 1970 01:00:00,0,...b,-r--r--r--,0,0,0,"mnt/.disk/base_installable"
Thu Jan 01 1970 01:00:00,37,...b,-r--r--r--,0,0,0,"mnt/.disk/casper-uuid-generic"
Thu Jan 01 1970 01:00:00,15,...b,-r--r--r--,0,0,0,"mnt/.disk/cd_type"
Thu Jan 01 1970 01:00:00,60,...b,-r--r--r--,0,0,0,"mnt/.disk/info"

Now that’s pretty useful and quick! One of the things I missed however was the ability to quickly extend the tools as well as to focus on just files. From a penetration testing perspective I find files much more interesting in a forensic investigation than directories and their meta-data. This is of course tied to the type of investigation you are doing, the goal of the investigation and the questions you need answered.

I decided to write a mac-robber(ish) python version to aid me in future investigations as well as learning a thing or two along the way. Before you continue reading please be aware that:

  1. The scripts have not gone through extensive testing
  2. Thus should not be blindly trusted to produce forensically sound output
  3. The regular ‘professional’ tools are not perfect either and still contain bugs ;)

That being said, let’s have a look at the type of questions you can answer with a limited set of data and how that could be done with custom written tools. If you don’t care about my ramblings, just access the Github repo here. It has become a bit of a long article, so here are the ‘chapters’ that you will encounter:

  1. What data do we want?
  2. How do we get the data?
  3. Working with the data, answering questions
    1. Converting to body file format
    2. Finding duplicate hashes
    3. Permission issues
    4. Entropy / file type issues
  4. Final thoughts

What data do we want?

We can answer this question partially by looking at the standard fls tool:

  • md5
  • file type as reported in file name and metadata structure
  • Metadata Address
  • name
  • mtime (last modified time)
  • atime (last accessed time)
  • ctime (last changed time)
  • crtime (created time)
  • size (in bytes)
  • uid (User ID)
  • gid (Group ID)

The above can answer quite a lot of questions, although it would be nice to also have information like:

  • multiple hash formats
  • file entropy
  • file type (as outputted normally by the ‘file’ command)

Multiple hash formats are nice to have since md5 should really be deprecated, and the entropy is nice since it may help us find encrypted files. This brings us into the danger region of ‘oh but this data is also nice’ behaviour, so for now we are going to settle on the following:

  • <variable hashes> (as supported by hashlib)
  • path (file path)
  • atime
  • mtime
  • ctime
  • size
  • uid
  • gid
  • permissions (octal representation)
  • permissions_h (symbolic representation)
  • inode
  • device_id
  • st_blocks
  • st_blksize
  • st_rdev
  • st_flags
  • st_gen
  • st_birthtime
  • st_ftype
  • st_attrs
  • st_obtype
  • entropy (shannon calculation)
  • type (file output)

Why the above data? Mostly because of:

  • Large part is standard to other tools as well
  • Large part is just all the output of python’s ‘os.stat’
  • Some is useful to have and avoids processing or querying the file again
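As a rough illustration of the ‘os.stat’ part of that list, here is a minimal sketch of how those fields could be gathered into one record. It is written for Python 3, the function name collect_stat is made up for this example, and it is not the code from pmf.py. Platform specific fields simply stay empty when the operating system does not provide them.

```python
import os
import stat


def collect_stat(path):
    """Return a dict of os.stat-derived fields for one file (file content is not read)."""
    st = os.stat(path)
    record = {
        "path": path,
        "atime": st.st_atime,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,
        "size": st.st_size,
        "uid": st.st_uid,
        "gid": st.st_gid,
        # Octal permission bits only, e.g. "0644"
        "permissions": "%04o" % stat.S_IMODE(st.st_mode),
        # Symbolic form, e.g. "-rw-r--r--"
        "permissions_h": stat.filemode(st.st_mode),
        "inode": st.st_ino,
        "device_id": st.st_dev,
    }
    # These attributes only exist on some platforms, so default to an empty string.
    for name in ("st_blocks", "st_blksize", "st_rdev", "st_flags", "st_gen", "st_birthtime"):
        record[name] = getattr(st, name, "")
    return record
```

Calling collect_stat on a path returns a plain dict, which makes it easy to hand the record to a CSV writer later on.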

How do we get the data?

I chose to use python since it’s easy to develop for and it has a lot of built-in libraries. Additionally it runs on a lot of platforms, in case you ever need to run the script somewhere else. So what are some of the requirements?

  • Run on mount points
    • Focus on files only
  • Be fast
    • Avoid redoing tasks
    • Try to be disk i/o efficient
      • Read once, operate many
  • Workable output format

The reason for the requirement of the script to operate on mount points is that this way we can avoid the challenge of operating on obscure file systems. If we need to operate on an obscure file system we can just expose that file system over some kind of sharing mechanism like NFS or SMB or run the script directly on the operating system. This does of course influence how forensically accurate the data can be depending on the sharing method, but since this is just intended to answer some quick questions while the more professional tools are working we should be fine.

If what we want to achieve has to be done while other tools are retrieving more detailed data for forensic analysis, it means that our script should be as quick as possible (within the constraints of an interpreted language). For this we are going to use the multiprocessing module. The reason for this is that python threads are not really as effective as ‘real’ threads. If you are wondering why, you should read this article. The short version is that python has a ‘Global Interpreter Lock (GIL)’ which allows only one thread at a time to execute python bytecode, so threads do not really run python code in parallel. Thus if we really need concurrent operations to happen we have to resort to splitting the tasks up into different processes. Another way of improving speed is of course to not redo tasks which other specialised tools can perform much faster. For example, recursively walking a directory tree could be done with find:

find / -type f > filelist.txt

Thus to avoid redoing tasks it’d be great if we could just do:

cat filelist.txt | our_script.py
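On the python side, accepting such a piped file list could look roughly like the sketch below. The helper name iter_paths is hypothetical and this is not taken from pmf.py. It assumes one path per line, so file names that contain a newline would be split incorrectly.

```python
import sys


def iter_paths(stream):
    """Yield one file path per input line, skipping empty lines."""
    for line in stream:
        path = line.rstrip("\n")
        if path:
            yield path


# Typical use inside a script that is fed by a pipe:
#     for path in iter_paths(sys.stdin):
#         print(path)
```

Since the function only needs something it can iterate over line by line, the same code works for sys.stdin and for an opened filelist.txt.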

Since we are going to script it, reducing disk i/o is a must. Most of the data could be retrieved with bash and standard linux tools, but that would greatly slow the process down since every tool reads the file from disk again. We can meet the speed requirement by reading each file once in manageable chunks and performing as many of the data extraction operations as possible on each chunk. I chose to implement this in the following way (fiddle with the chunk size if you want fewer read operations per file, but avoid reading the entire file at once due to possible memory constraints):

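    def chunked_reading(self):
        # Yield the file in CHUNKSIZE pieces so large files never sit fully in memory
        with open(self.fileloc, 'rb') as f:
            while True:
                chunk = f.read(CHUNKSIZE)
                if not chunk:
                    break  # read() returns an empty result at end of file
                yield chunk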

The above function makes it possible to iterate over each chunk, thus being able to do this:

        for ictr, i in enumerate(self.chunked_reading()):
            if ictr == 0:
                self.magic = magic.from_buffer(i) #comment if no filemagic available

Pretty cool right? We just need one disk i/o operation to get:

  • Multiple hashes
  • File type, based on the lib-magic library
  • Entropy of the file
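To make the ‘read once, operate many’ idea a bit more concrete, here is a minimal Python 3 sketch that feeds every chunk to several hashlib objects and to a byte counter for the Shannon entropy. The names read_chunks and hashes_and_entropy and the chunk size are assumptions for this example, and the entropy is expressed in bits per byte (0 to 8), which may not be the exact scale pmf.py uses.

```python
import hashlib
import math
from collections import Counter

CHUNKSIZE = 1024 * 1024  # assumed chunk size; tune for your disks and memory


def read_chunks(path, chunksize=CHUNKSIZE):
    """Yield the file's content in fixed-size chunks until end of file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            yield chunk


def hashes_and_entropy(path, algorithms=("md5", "sha1")):
    """Read the file once and feed every chunk to all hashers and the byte counter."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    byte_counts = Counter()
    total = 0
    for chunk in read_chunks(path):
        for hasher in hashers.values():
            hasher.update(chunk)
        byte_counts.update(chunk)  # iterating over bytes yields ints 0..255
        total += len(chunk)
    # Shannon entropy in bits per byte: -sum(p * log2(p)); stays 0.0 for an empty file
    entropy = 0.0
    for count in byte_counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    digests = {name: hasher.hexdigest() for name, hasher in hashers.items()}
    return digests, entropy
```

The file type detection would hook into the same loop by handing only the first chunk to the magic library.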

The one thing that does not seem to be possible is retrieving the os.stat output within the same disk i/o operation, which for our script is fine. Coming back to the speed requirement, it also means that we can process each file individually and thus operate with multiple processes at once, like so:

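def create_workers(filelist_q, output_q, algorithms, amount=get_cpucount()):
    workers = list()
    for ictr in range(amount):
        procname = "processfile.%s" % ictr
        p = Process(target=processfile, name=procname, args=(filelist_q, output_q, algorithms))
        p.start()  # start the worker right away (assumed; move this if you start them elsewhere)
        workers.append(p)  # keep a handle so the caller can join the workers later
    return workers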
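How the workers and the two queues fit together can be sketched as follows. This is a simplified Python 3 stand-in, not the pmf.py code: the worker only reports the file size, and the names worker, run and SENTINEL are invented for this example.

```python
import os
from multiprocessing import Process, Queue

SENTINEL = None  # hypothetical marker telling a worker that no more paths will arrive


def worker(filelist_q, output_q):
    """Consume paths until the sentinel shows up; emit one (path, size) result per path."""
    while True:
        path = filelist_q.get()
        if path is SENTINEL:
            break
        try:
            output_q.put((path, os.stat(path).st_size))
        except OSError:
            output_q.put((path, None))  # unreadable or vanished file


def run(paths, amount=2):
    """Start `amount` workers, feed them the paths and collect one result per path."""
    paths = list(paths)
    filelist_q, output_q = Queue(), Queue()
    workers = [Process(target=worker, args=(filelist_q, output_q)) for _ in range(amount)]
    for p in workers:
        p.start()
    for path in paths:
        filelist_q.put(path)
    for _ in workers:
        filelist_q.put(SENTINEL)  # one sentinel per worker
    results = [output_q.get() for _ in paths]  # drain the output before joining
    for p in workers:
        p.join()
    return results
```

On platforms where multiprocessing spawns fresh interpreters (Windows, recent macOS), run() should be called from under an `if __name__ == "__main__":` guard.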

After all the above has been implemented in pmf.py, the output looks like this (shortened to keep the layout):

dev@devm:~$ sudo python pmf.py /etc/ md5 sha1 | head -n2

As you can see, I chose the CSV output format with all the values quoted as the ‘workable’ format. This will hopefully make it easy to convert the output to other formats and work with it on the command line.

Working with the data, answering questions

Since we now have a script that is able to produce the data which we need, let’s see how we can use this data to answer a couple of example questions. All the work will be done on the command line to keep it simple, which also has the benefit that you can reuse the commands with other CSV based forensic output.

One of the easiest ways to work with the CSV format is to use ‘csvtool’, which you can install by running:

sudo apt-get install csvtool

If you want to benefit from the libmagic file type identification of pmf.py you need to install the corresponding python library with pip:

sudo pip install python-magic

We will also need some test data. For this I used the file ‘ubuntu-16.04-desktop-amd64.iso’ and mounted it on the ‘mnt’ folder within the following directory structure:

  • ~/
  • ~/test/
  • ~/test/mnt

The command I used was:

sudo mount -o loop,ro,noexec /mnt/hgfs/iso/ubuntu-16.04-desktop-amd64.iso mnt/

After this was done I generated the data with the following commands:

sudo find mnt/ -type f > filelist.txt
sudo python pmf.py mnt/ md5 sha1 > o.txt

It might take a while, but just let it run. If you need to list the hashing algorithms that are guaranteed by hashlib you can do:

python -c "import hashlib;print hashlib.algorithms"

Now that we’ve set up the data that we’d like to work with, let’s start answering questions.

Converting to body file format (timeline)

One of the goals was to have a somewhat workable format, which means that we should be able to get pretty close to the defined body file format as follows:

csvtool -u \| namedcol md5,path,inode,permissions_h,uid,gid,size,atime,mtime,ctime,st_birthtime o.txt

which should output:

csvtool -u \| namedcol md5,path,inode,permissions_h,uid,gid,size,atime,mtime,ctime,st_birthtime o.txt | head

As you might realise, this gives you the ability to timeline it with mactime, but you lose the extra data that we added.
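If you prefer to stay in python for this conversion, a minimal Python 3 sketch could look like the one below. It assumes the CSV header uses exactly the column names from the csvtool command, and the function name csv_to_body is made up for this example. Paths that contain a ‘|’ character are not escaped.

```python
import csv

# Column order of the body file format, using the column names from the CSV header
BODY_COLUMNS = ["md5", "path", "inode", "permissions_h", "uid", "gid",
                "size", "atime", "mtime", "ctime", "st_birthtime"]


def csv_to_body(csv_lines):
    """Yield one pipe-delimited body line per CSV row; missing values become empty fields."""
    for row in csv.DictReader(csv_lines):
        yield "|".join(row.get(name) or "" for name in BODY_COLUMNS)
```

Since csv.DictReader accepts any iterable of lines, an opened o.txt can be passed in directly and the result printed line by line.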

Finding duplicate hashes

Finding the hashes:

csvtool namedcol md5 o.txt | csvtool drop 1 - | sort -t, -k1,1 | uniq -c | grep -v '^ *1 ' | sed s/'^ *'//

2 4a4dd3598707603b3f76a2378a4504aa
3 d41d8cd98f00b204e9800998ecf8427e

Displaying the names sorted by size can be done as well by extending the one liner:

for i in $(csvtool namedcol md5 o.txt | csvtool drop 1 - | sort -t, -k1,1 | uniq -c | grep -v '^ *1 ' | sed s/'^ *'// | cut -d' ' -f2);do grep $i o.txt | csvtool cols 1,7,3 - | sort -t, -k3,3;done
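The same grouping can be sketched in a few lines of Python 3, which avoids the repeated grep over the whole file. The function name find_duplicates is hypothetical and the column names ‘md5’ and ‘path’ are assumed to match the CSV header.

```python
import csv
from collections import defaultdict


def find_duplicates(csv_lines, hash_column="md5"):
    """Return {hash: [paths]} for every hash value that occurs more than once."""
    paths_by_hash = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        paths_by_hash[row[hash_column]].append(row["path"])
    return {h: paths for h, paths in paths_by_hash.items() if len(paths) > 1}
```

Passing hash_column="sha1" gives the same overview for any other hash that was included when generating the data.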


Permission issues

csvtool namedcol uid,permissions,path e.txt | csvtool drop 1 - | sort -t, -k2,2 | cut -d, -f2 | uniq -c | sort -b -k1 -n

1 0444
2 0440
13 0664
16 0640
25 0600
213 0755
1380 0644

Or if you want the more human readable output:

csvtool namedcol permissions_h,path e.txt | csvtool drop 1 - | sort -t, -k1,1 | cut -d, -f1 | uniq -c | sort -b -k1 -n

1 -r--r--r--
2 -r--r-----
13 -rw-rw-r--
16 -rw-r-----
25 -rw-------
213 -rwxr-xr-x
1380 -rw-r--r--

With the above output you can now zoom in on files which are, for example, world readable or world writeable by just grepping for the permissions in the original data.
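Grepping works fine for a single permission string; for a question like ‘which files are world writeable’ a small Python 3 sketch that looks at the permission bits may be easier. The function names are made up for this example and the ‘permissions’ column is assumed to hold the octal string shown above.

```python
import csv
from collections import Counter


def permission_summary(csv_lines):
    """Count how often each octal permission string occurs."""
    return Counter(row["permissions"] for row in csv.DictReader(csv_lines))


def world_writable(csv_lines):
    """Return the paths whose 'other' write bit (octal 0002) is set."""
    return [row["path"] for row in csv.DictReader(csv_lines)
            if int(row["permissions"], 8) & 0o002]
```

Swapping the mask for 0o004 lists the world readable files instead.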

Entropy / file type issues

To create an overview of the files and their entropy you can just do:

csvtool namedcol entropy,path,type o.txt | csvtool drop 1 - | sort -t, -k1,1

0.1104900679412788,mnt/isolinux/boot.cat,"FoxPro FPT, blocks size 0, next free block index 16777216"
0.7708022808463271,mnt/boot/grub/x86_64-efi/legacy_password_test.mod,"ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV)"
1.5567796494470394,mnt/dists/xenial/main/binary-i386/Packages.gz,"gzip compressed data, from Unix, max compression"
1.5567796494470394,mnt/dists/xenial/restricted/binary-i386/Packages.gz,"gzip compressed data, from Unix, max compression"

If you are wondering how this can be useful: when we rerun the same command on the pmf.py output for the /etc/ directory we find:

6.020547926192353,/etc/ssl/private/ssl-cert-snakeoil.key,ASCII text

Due to the ‘.key’ extension this one is pretty obvious, but without that extension the ‘ASCII text’ file type would not have told us that this is in fact a private key, while the high entropy does make it stand out. Entropy is a neat little extra piece of information which can help you to find crypto containers, private keys or other high entropy data.
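To zoom in on such files directly, a minimal Python 3 sketch could filter on an entropy threshold and sort numerically. The threshold of 6.0 is an arbitrary choice for this example and the function name high_entropy is hypothetical; pick a value that fits the entropy scale of your data.

```python
import csv


def high_entropy(csv_lines, threshold=6.0):
    """Return (entropy, path, type) tuples at or above the threshold, highest first."""
    hits = []
    for row in csv.DictReader(csv_lines):
        try:
            value = float(row["entropy"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable entropy value
        if value >= threshold:
            hits.append((value, row["path"], row.get("type", "")))
    return sorted(hits, reverse=True)
```

Combining this with a filter on the type column, for example only rows reported as ‘ASCII text’, narrows the list down to text files that look suspiciously random.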

Final thoughts

Just like the title says, this is just an example of how you can perform some poor man’s forensics with self-written tools. Even though it isn’t as sophisticated as the professional tools out there, you can get a fair amount of work done by just using some rudimentary information and ‘querying’ it in a smart way. If you want to improve the ‘querying’ part you could of course import the data into a Splunk or ELK instance.

Make sure you read and understand the source of the scripts and that you verify and validate the output. Like all software they contain bugs ;)

