As you have probably read, Fox-IT released their incident response framework called dissect, but before that they released the cstruct part of the framework. Ever since it went public I have been looking for an excuse to play with it on public projects. I witnessed the birth of cstruct back when I was still working at Fox-IT and am very happy to see that it has finally all been made public. It sure has evolved since I looked at the very first version! Special thanks to Erik Schamper (@Schamperr) for answering late night questions about some of the inner workings of dissect.cstruct.
Atop binary log files are one of those things that you can encounter during an incident response assignment, and life is a bit easier if you can just parse the binary file format with Python. Since with incident response you never know exactly in which format you want to receive the data for analysis, or what you are looking for, it really helps to work with tools that can be rapidly adjusted. Python is an ideal environment for this. An added benefit of parsing the structures ourselves with Python is that we can avoid string parsing, and thus avoid confusion and mistakes.
The atop tool is a performance monitoring tool that can write the output into a binary file format. The creator explains it way better than I do:
Atop is an ASCII full-screen performance monitor for Linux that is capable of reporting the activity of all processes (even if processes have finished during the interval), daily logging of system and process activity for long-term analysis, highlighting overloaded system resources by using colors, etc. At regular intervals, it shows system-level activity related to the CPU, memory, swap, disks (including LVM) and network layers, and for every process (and thread) it shows e.g. the CPU utilization, memory growth, disk utilization, priority, username, state, and exit code.
In combination with the optional kernel module netatop, it even shows network activity per process/thread.
The atop tool website
As you can imagine, having the above information is a nice treasure trove to find during an incident response, even if it is sampled at a pre-set interval. At the very least, you can extract process executions with their respective command lines and the corresponding timestamps.
Since this is an open source tool we can just look at the structure definitions in C and lift them right into cstruct to start parsing. The atop tool itself can also parse the binary files it has written, for example using this command:
atop -PPRG -r <file>
For the rest of this blog entry we will look at parsing atop binary log files with Python and dissect.cstruct. It is mostly intended as a walkthrough of the thought process as well.
You can also skip reading the rest of this blog entry and jump to the code if you are impatient or familiar with similar thought processes.
As with most ideas that you want to implement, the best place to start is the code and/or the documentation. Seriously, start there first; it will save you many hours of tinkering around [1]. For atop things are made easy by the availability of the code on GitHub and the fact that the code is very neat and documented.
Reading the man page is already useful, since that is how you discover that a) atop has binary raw files and b) atop is capable of parsing those binary files to display their content. This also gives us our first indication that, when we go looking through the code for the structures, we should keep an eye out for references to ‘raw’ or ‘binary’. If you read through the list of files on GitHub you’ll notice two files that are probably related to parsing the binary log file:
https://github.com/Atoptool/atop/blob/fdf3526bd35c1a84dd11bb73110c1a1f4148e39d/rawlog.h
https://github.com/Atoptool/atop/blob/fdf3526bd35c1a84dd11bb73110c1a1f4148e39d/rawlog.c
In addition, the ‘cat’ file also caught my attention, mainly because its existence indicates that regular binary concatenation is not sufficient and that the developer implemented code specifically to do this for their file format.
https://github.com/Atoptool/atop/blob/fdf3526bd35c1a84dd11bb73110c1a1f4148e39d/atopcat.c
If we read the three files we get a pretty clear picture of how the binary file format is set up. Besides the code itself with the writing/reading logic, it also really helps that the developer of the tool provides a nice visual overview at the top of the rawlog.h header file:
/*
** structure describing the raw file contents
**
** layout raw file:    rawheader
**
**                     rawrecord                           \
**                     compressed system-level statistics   | sample 1
**                     compressed process-level statistics /
**
**                     rawrecord                           \
**                     compressed system-level statistics   | sample 2
**                     compressed process-level statistics /
**
** etcetera .....
*/
The references between the structures are as follows:
- The rawheader specifies the size of the rawrecord
- The rawheader specifies the size of the uncompressed process-level (tstat) structure
- The rawrecord specifies the size of the compressed system-level (sstat) & compressed process-level (tstat) statistics
There is much more information that helps with other sizes, but the above is the main gist of what we need to parse the file format. All that we need to do is basically:
1. Read the header and get the size of the raw record
2. Read the raw record and get the sstat, tstat sizes
3. Read the compressed sstat
4. Read the compressed tstat
5. Go back to step 2
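To make the steps above concrete, here is a minimal sketch of the same walk using only the standard library. It is not the code from my repo and it does not use dissect.cstruct: the two layouts below are hypothetical, heavily simplified stand-ins for rawheader and rawrecord (the real structures in rawlog.h have many more fields), and the toy file is built in memory with zlib purely so the example is self-contained.

```python
import io
import struct
import zlib

# Hypothetical, simplified layouts; NOT the real atop structures.
TOY_HEADER = struct.Struct("<IHH")  # magic, rawreclen, tstatlen
TOY_RECORD = struct.Struct("<III")  # curtime, scomplen, pcomplen


def walk_samples(fh):
    """Yield (curtime, compressed sstat, compressed tstat) for every sample."""
    # Step 1: the header tells us how large each raw record is.
    _magic, rawreclen, _tstatlen = TOY_HEADER.unpack(fh.read(TOY_HEADER.size))
    while True:
        # Step 2: the raw record tells us the sizes of the two blobs.
        raw = fh.read(rawreclen)
        if len(raw) < rawreclen:
            break  # end of file, or a truncated trailing record
        curtime, scomplen, pcomplen = TOY_RECORD.unpack_from(raw)
        # Steps 3 and 4: read the compressed sstat and tstat blobs.
        sstat = fh.read(scomplen)
        tstat = fh.read(pcomplen)
        # Step 5: hand the sample back and loop to the next raw record.
        yield curtime, sstat, tstat


def build_toy_file(samples):
    """Build an in-memory file in the toy layout (test data only)."""
    out = io.BytesIO()
    out.write(TOY_HEADER.pack(0x12345678, TOY_RECORD.size, 16))
    for curtime, sstat, tstat in samples:
        s, p = zlib.compress(sstat), zlib.compress(tstat)
        out.write(TOY_RECORD.pack(curtime, len(s), len(p)))
        out.write(s + p)
    out.seek(0)
    return out


toy = build_toy_file([(100, b"sys-a", b"proc-a"), (110, b"sys-b", b"proc-b")])
for curtime, sstat, tstat in walk_samples(toy):
    print(curtime, zlib.decompress(sstat), zlib.decompress(tstat))
```

The point of the sketch is that nothing in the file needs to be searched for: every size needed to find the next piece is stated by the piece before it.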
After all of that we have everything we need to decompress the data that is relevant in our case, which is tstat. The code for the above logic is just a couple of lines:
# atopversion and the helper functions used below are defined elsewhere in the full script
atopbinfile = open(sys.argv[2], "rb")
atop_rheader = atop_header(atopbinfile)

# ensure we are parsing the right version
# you can skip this if you just want to yolo
atopbinfile_version = int(get_version(atop_rheader.aversion).replace(".", ""))
if atopversion != atopbinfile_version:
    print(
        f"[!] Version mismatch file:{atopbinfile_version} arg:{atopversion}",
        file=sys.stderr,
    )
    sys.exit()

print(struct2json(atop_rheader))

# walk every sample: print the raw record, then each decompressed process entry
atop_rcompressed = atop_records_compressed(atopbinfile, atop_rheader)
for record_compressed in atop_rcompressed:
    print(struct2json(record_compressed["rawrecord"]))
    for process_entry in decompress_processlevel(
        record_compressed["rawrecord"], record_compressed["processlevel_compressed"]
    ):
        print(struct2json(process_entry))

atopbinfile.close()
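The decompress_processlevel helper itself lives in the repo, so what follows is not that function but a rough sketch of the idea behind it: decompress the process-level blob, then cut it into entries of the size the header gave us (tstatlen). Two assumptions are baked in: the compression is zlib, and the 16-byte entry layout below is a made-up placeholder, since the real tstat structure is far larger and differs per atop version.

```python
import struct
import zlib

# Hypothetical 16-byte per-process entry: a pid plus a fixed-width name.
# This is a placeholder, NOT the real tstat layout.
TOY_TSTAT = struct.Struct("<I12s")


def split_entries(compressed, entry_len):
    """Decompress a blob and cut it into entry_len-sized chunks."""
    data = zlib.decompress(compressed)
    if len(data) % entry_len:
        # A remainder usually means the struct definition does not match the file.
        raise ValueError("decompressed size is not a multiple of entry_len")
    return [data[i:i + entry_len] for i in range(0, len(data), entry_len)]


# Two toy process entries, compressed the way a sample's tstat blob would be.
blob = zlib.compress(TOY_TSTAT.pack(1, b"init") + TOY_TSTAT.pack(4242, b"sshd"))

for chunk in split_entries(blob, TOY_TSTAT.size):
    pid, name = TOY_TSTAT.unpack(chunk)
    print(pid, name.rstrip(b"\x00").decode())
```

The size check is worth keeping in a real parser too: if the decompressed size is not a clean multiple of the entry size, you are most likely applying the structures of a different atop version.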
See the GitHub repo linked earlier in this blog post for all the code. I can confirm that dissect.cstruct makes life a lot more pleasant when you have to deal with C structures.
Oh, and one more bonus tip: when debugging file formats, structs & offsets, please use ImHex, that tool is awesome! It is a similar concept to dissect.cstruct: it accepts C structs and applies them to the file format. I’ve included a very limited ImHex pattern that I used for debugging some initial alignment issues.
[1] Yes, there are also situations where the documentation is just plain wrong and it wastes many hours.