ETL Stages

From truxwiki.com

Explanation

ETL Stages are key to solving the problem of process coordination. The Truxton exploitation process is broken down into a series of steps, and each step is performed in a stage. When all of the stages are complete, Truxton has finished processing that media.

By associating a stage number (1-255) to an exploitation process, Truxton can manage the transition from chaotic to ordered processing.

[Figure: Les State Machine.png]

ETL Communications

The ETL processes communicate using a message bus. The messages contain information about a file needing processing and how to get the file contents.
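As a rough illustration of what such a message might carry, here is a minimal sketch. The class name FileMessage, every field name, and the JSON encoding are assumptions made for this example; the actual Truxton message format is not described on this page.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FileMessage:
    """Hypothetical message layout; all field names are illustrative only."""
    media_id: str    # the piece of media (load) the file belongs to
    file_id: str     # the file that needs processing
    depot_path: str  # where the file contents can be retrieved from
    offset: int      # assumed position of the contents inside the depot file
    length: int      # assumed size of the contents in bytes


def encode(msg: FileMessage) -> bytes:
    """Serialize a message for the bus (JSON is an assumption)."""
    return json.dumps(asdict(msg)).encode("utf-8")


def decode(raw: bytes) -> FileMessage:
    """Rebuild a message from the bytes taken off the bus."""
    return FileMessage(**json.loads(raw.decode("utf-8")))
```

The point of the sketch is only that a message identifies a file and tells the receiving ETL how to fetch its contents, rather than carrying the contents itself.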

Simple Exploitation Walk Through

The first stage is Load. It performs the following tasks:

  1. Navigates the source media (disk image, folder or file)
  2. Puts file contents into depot files
  3. Puts metadata into the database
  4. Identifies each file's contents
  5. Routes each file, based on its type, to the appropriate exploitation process by putting a message into that ETL's message queue
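The routing step at the end of the Load stage could be sketched as follows. The queue names (archives, email, pst, registry) come from the ETL table on this page, but the file type labels, the ROUTES mapping, and the in-memory queues are hypothetical stand-ins for the real message bus.

```python
from collections import defaultdict, deque

# Hypothetical mapping of identified file type -> ETL queue name.
ROUTES = {
    "zip": "archives",
    "eml": "email",
    "pst": "pst",
    "registry_hive": "registry",
}

# In-memory stand-in for the message bus: one FIFO per queue name.
queues = defaultdict(deque)


def route(file_id, file_type):
    """Put a message on the queue of the ETL that handles file_type.

    Returns the queue name used, or None when no ETL is registered
    for that type (the file is simply not routed).
    """
    queue_name = ROUTES.get(file_type)
    if queue_name is None:
        return None
    queues[queue_name].append({"file_id": file_id, "file_type": file_type})
    return queue_name
```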

The ETL process will:

  1. Wait until a message arrives for it to process
  2. Retrieve the contents of the file referenced in the message
  3. Report status to the Load Status Monitor (Les)
  4. Exploit those contents to produce more files, or entities, or chat messages, etc.
  5. Send any files it produces to other ETLs for further exploitation
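The steps above can be sketched as a simple worker loop. This is an illustration only: the function run_etl, its callback parameters, and the use of None as a shutdown sentinel are assumptions, and Python's standard queue.Queue stands in for the message bus.

```python
import queue


def run_etl(inbox, fetch_contents, exploit, report_status, route_child):
    """Process messages from inbox until a None sentinel arrives."""
    while True:
        msg = inbox.get()               # 1. wait until a message arrives
        if msg is None:                 # hypothetical shutdown signal
            break
        data = fetch_contents(msg)      # 2. retrieve the referenced contents
        report_status("busy", msg)      # 3. report status to Les
        for child in exploit(data):     # 4. exploit to produce more files
            route_child(child)          # 5. send products to other ETLs
        report_status("idle", msg)
```

A usage example with stub callbacks:

```python
inbox = queue.Queue()
inbox.put({"file_id": "a"})
inbox.put(None)
run_etl(inbox, lambda m: b"xy", lambda d: [d], print, print)
```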

When all ETLs have finished, the load is complete.

Stages and Status Monitoring

To process all of the data, we must begin with utter chaos and transition to ordered steps that must be completed linearly. Some ETLs thrive in the chaos, some cannot, and some live in both worlds. Each ETL is assigned a stage value from one of several ranges. The rule is: if two ETL processes have the same stage value, they can operate in parallel; if one ETL has a higher stage value than another, it executes after that other ETL. This rule is ignored in the Chaos region, becomes relevant in the Semi-Chaotic region and becomes law in the Linear region.
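The ordering rule can be illustrated with a small sketch. It assumes a plain mapping of ETL name to stage value; the function name execution_waves and the OtherReport entry are made up for illustration, while Poly, Stitch, Alert and Report use the stage values given on this page.

```python
from itertools import groupby


def execution_waves(etl_stages):
    """Group ETLs into waves: same stage runs together, higher stages later."""
    ordered = sorted(etl_stages.items(), key=lambda kv: kv[1])
    return [
        (stage, sorted(name for name, _ in group))
        for stage, group in groupby(ordered, key=lambda kv: kv[1])
    ]


stages = {
    "Report": 160,
    "Poly": 32,
    "Alert": 128,
    "Stitch": 40,
    "OtherReport": 160,  # hypothetical ETL sharing stage 160 with Report
}
waves = execution_waves(stages)
```

Each tuple in waves is one step of the ordering: ETLs inside a tuple may run in parallel, and a tuple only starts once the previous one has finished. As the text notes, this strict reading applies to the Linear region and is not enforced in the Chaos region.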

It is the job of the Load Status Monitor (Les) to watch all of the ETLs and advance the media through the stages of exploitation.

Chaos

The chaotic stages are where files are produced and/or atomically exploited. A file is considered "atomic" if it is stand-alone, requiring no other files to exploit it. ETLs that handle atomic files are the easiest to write.

One example of a chaotic file is a zip file. An ETL process that unzips the file to produce child files is atomic in that all it needs to do its job is the contents of that one zip file. ETL processes run in parallel and can operate on different media simultaneously (participating in different "loads"). Chaotic stages produce files in random order.

These stages can be thought of as executing in non-linear time. Child files can be processed before their parents. Files are processed in random order.

In the above illustration, loading and expanding are the chaos stages. They can produce any number of files in any order.

Semi-Chaotic

When the chaotic ETLs have finished producing files for a piece of media, the next stage of exploitation can begin. Semi-chaotic ETLs produce files but in a more orderly fashion and after the chaos is finished. When they produce files, everything falls down again into utter chaos.

The exploitation process loops between Chaos and Semi-Chaotic until no Semi-Chaotic ETL produces any files. Only then can the linear processing take place.
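One way to picture this loop is the sketch below. It is not how Les is implemented; the function process_media and its callbacks are hypothetical, and it assumes that as soon as a semi-chaotic ETL produces files, control drops back into chaos before any later semi-chaotic ETL runs.

```python
def process_media(run_chaos, semi_chaotic_etls, run_linear):
    """Loop chaos -> semi-chaotic until nothing new is produced, then go linear.

    semi_chaotic_etls is a list of callables in ascending stage order;
    each returns the number of files it produced.
    Returns the number of chaos passes that were needed.
    """
    passes = 0
    while True:
        run_chaos()                  # let all chaotic ETLs drain their queues
        passes += 1
        produced = 0
        for etl in semi_chaotic_etls:
            produced = etl()
            if produced:             # new files appeared: back into chaos
                break
        if not produced:             # no semi-chaotic ETL produced anything
            break
    run_linear()
    return passes
```

For example, if a Poly-like ETL produces files on its first run and nothing on its second, the media goes through two chaos passes before the linear stages begin.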

One example of a semi-chaotic file is a spanned zip file. This is a zip archive that spans several files. It cannot reliably be exploited in the Chaotic stages because all of the files in the archive might not yet exist in Truxton. By waiting until the chaos has subsided, we know that all of the files in the span will be present. Truxton will notify the Poly File Expander when the chaos is complete. Poly will then:

  1. Search the current media for any files known to be part of a type that requires more than one file to exploit
  2. Exploit each such file, querying for the other files it needs along the way
  3. Produce child files, which cause the chaos stage to reignite

For the spanned zip, Poly will find all pieces of the span, combine them, then expand the archive. Since Poly will restart the chaos, it keeps track of which files it has processed. When it is again told to process a piece of media, it can ignore files it has already processed.
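The bookkeeping described here could look something like the following sketch. The class PolyTracker, its method names, and the in-memory storage are assumptions for illustration; how Poly actually records processed files is not stated on this page.

```python
class PolyTracker:
    """Remembers, per media, which multi-part files were already processed."""

    def __init__(self):
        self._done = {}  # media_id -> set of processed file ids

    def pending(self, media_id, candidate_file_ids):
        """Return the candidates not yet processed for this media, in order."""
        done = self._done.get(media_id, set())
        return [f for f in candidate_file_ids if f not in done]

    def mark_processed(self, media_id, file_ids):
        """Record files so later notifications for this media skip them."""
        self._done.setdefault(media_id, set()).update(file_ids)
```

On the second notification for the same media, pending returns only the files that appeared since the previous pass, so Poly's own output does not cause it to redo earlier work.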

Semi-chaotic ETLs operate in non-linear time but after chaos. Once they are complete, processing enters the linear stage.

Note: Stage Value Behavior

Stitching (fragmented file carving) is another semi-chaotic ETL. Poly is stage 32 while Stitch is stage 40. This means that Stitch will not start until Poly is completely finished. If Stitch were stage 32, it would be told to execute at the same time as Poly and would miss any data produced by Poly file expansion.

Linear

Linear stages are where some sort of sanity is brought to the process. ETLs in this stage execute one after another. For example, Alerts (stage 128) is executed before Reports (stage 160) because any alerts that were generated need to be included in reports. If two different linear ETLs have the same stage number, they will execute in parallel.

While earlier stages are mainly concerned with producing files and artifacts, Linear and later stages are concerned with media processing. Early-stage ETLs are routed based on file type. From this stage forward, every ETL is notified of media to process instead of a file to process.

Untracked

There is a class of ETLs that operate in a fashion that does not affect the status of media being processed. Some examples include:

  • SOLRFile - Indexes contents and metadata into a solr cluster for full text searching.
  • Forensic Logging - Log messages are sent to various destinations such as Azure Log Analytics.
  • SendGrid Notifier - When a load completes, the Media Summary report is emailed to a distribution list

Stage Ranges

Here are the values for all of the stages:

| Name | Value (Inclusive) | Meaning |
|---|---|---|
| Load/Expand | 1-31 | Chaos. ETLs in this stage range can produce files and artifacts in random order. |
| Poly File Expansion | 32-63 | Semi-Chaotic. ETLs in this stage range can produce files and artifacts in random order but only after previous stage ranges have completed. |
| Summarizing | 64-96 | All files and artifacts (entities) have been produced. ETLs in this range query the data to produce summaries such as unique lists of artifacts. |
| Alerting | 128-159 | ETLs in this range query the data to generate any alerts analysts may have wanted. |
| Reporting | 160-191 | ETLs in this range query the data produced by any previous stage and create reports from it. |
| Feeding | 192-223 | ETLs in this range query the data produced by any previous stage and feed it to external systems. |
| Finished | 240-254 | ETLs perform any final tasks needed to make the media ready for the analyst. No further processing will take place. |
| DoNotTrack | 255 | This ETL should not be considered when determining the status of media. |
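For reference, the table can be turned into a small lookup. This is a sketch only: the names STAGE_RANGES and stage_range_name are made up for this example, the bounds are copied from the table as written, and values that fall outside every listed range simply return None.

```python
# (low, high, name) with inclusive bounds, copied from the table above.
STAGE_RANGES = [
    (1, 31, "Load/Expand"),
    (32, 63, "Poly File Expansion"),
    (64, 96, "Summarizing"),
    (128, 159, "Alerting"),
    (160, 191, "Reporting"),
    (192, 223, "Feeding"),
    (240, 254, "Finished"),
    (255, 255, "DoNotTrack"),
]


def stage_range_name(stage):
    """Return the name of the range containing stage, or None if unlisted."""
    for low, high, name in STAGE_RANGES:
        if low <= stage <= high:
            return name
    return None
```

For example, stage_range_name(40) gives "Poly File Expansion" for Stitch and stage_range_name(255) gives "DoNotTrack".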

ETLs and Their Stages

Here's a list of ETLs, their stages and message bus queue names. Remember, stage 255 means "do not track."

| ETL | Executable | Stage | Percent Complete | Queue Name |
|---|---|---|---|---|
| Load | Load.exe | 1 | 48% | loadq |
| Truxton Alert Generator | Alert.exe | 128 | 75% | alert |
| Truxton Archive Expander | Archives.exe | 6 | 48% | archives |
| Truxton Azure Image Analyzer | Azure.AnalyzeImage.exe | 9 | 48% | azureanalyzeimage |
| Truxton Azure OCR | Azure.OCR.exe | 9 | 48% | azureocr |
| Truxton Carve | Carve.exe | 4 | 48% | carve |
| Truxton Contact Sheet Creator | ContactSheet.exe | 6 | 48% | contactsheet |
| Load as an ETL | Load.exe | 11 | 48% | loadq |
| Truxton Email | EMail.exe | 12 | 48% | email |
| Truxton Expand | Expand.exe | 3 | 48% | expand |
| Truxton Finished Loads Monitor | Finished.exe | 240 | 94-100% | finished |
| Truxton Forensic Finding Logger | ForensicLogger.exe | 255 | N/A | flogger |
| Truxton Identify | Identify.exe | 2 | 48% | identify |
| Truxton Language Identifier | LangID.exe | 18 | 48% | langid |
| Maintenance | Maintenance.exe | 255 | 100% | Maintenance |
| Truxton Notifier | Notify.exe | 255 | 100% | Notify |
| Truxton Poly File Coordinator | Poly.exe | 32 | 60% | poly |
| Truxton PST Processor | PST.exe | 18 | 48% | pst |
| Truxton Registry Expander | Registry.exe | 9 | 48% | registry |
| RegRipper | RegRipper.exe | 8 | 48% | regripper |
| Truxton Remote Expand | RemoteFileExpander.exe | 3 | 48% | remoteexpand |
| Report | Report.exe | 160 | 80% | report |
| Truxton SOLR Contents Indexer | SOLR.exe | 192 | 90% | solrcontentstage |
| Truxton SOLR File Indexer | SOLRFile.exe | 255 | N/A | solrfile |
| Truxton File Stitcher | Stitch.exe | 40 | 62% | stitch |
| Truxton Text Extractor | TextExtract.exe | 15 | 48% | tqueue |
| Truxton Thumbnail Generator | Thumbnail.exe | 8 | 48% | thumbnail |
| Truxton Yara Scanner | Yara.exe | 7 | 48% | yara |

ETLs sorted by Stage:

| ETL | Executable | Stage | Percent Complete | Queue Name |
|---|---|---|---|---|
| Load | Load.exe | 1 | 48% | loadq |
| Truxton Identify | Identify.exe | 2 | 48% | identify |
| Truxton Expand | Expand.exe | 3 | 48% | expand |
| Truxton Carve | Carve.exe | 4 | 48% | carve |
| Truxton Archive Expander | Archives.exe | 6 | 48% | archives |
| Truxton Yara Scanner | Yara.exe | 7 | 48% | yara |
| Truxton Thumbnail Generator | Thumbnail.exe | 8 | 48% | thumbnail |
| Truxton Registry Expander | Registry.exe | 9 | 48% | registry |
| Truxton Remote Expand | RemoteFileExpander.exe | 10 | 48% | remoteexpand |
| Load as an ETL | Load.exe | 11 | 48% | loadq |
| Truxton Email | EMail.exe | 12 | 48% | email |
| RegRipper | RegRipper.exe | 13 | 48% | regripper |
| Truxton Contact Sheet Creator | ContactSheet.exe | 14 | 48% | contactsheet |
| Truxton Text Extractor | TextExtract.exe | 15 | 48% | tqueue |
| Truxton Azure OCR | Azure.OCR.exe | 16 | 48% | azureocr |
| Truxton Azure Image Analyzer | Azure.AnalyzeImage.exe | 17 | 48% | azureanalyzeimage |
| Truxton Language Identifier | LangID.exe | 18 | 48% | langid |
| Truxton PST Processor | PST.exe | 18 | 48% | pst |
| Truxton Poly File Coordinator | Poly.exe | 32 | 60% | poly |
| Truxton File Stitcher | Stitch.exe | 40 | 62% | stitch |
| Truxton Alert Generator | Alert.exe | 128 | 75% | alert |
| Report | Report.exe | 160 | 80% | report |
| Truxton SOLR Contents Indexer | SOLR.exe | 192 | 90% | solrcontentstage |
| Truxton Finished Loads Monitor | Finished.exe | 240 | 94-100% | finished |
| Truxton Notifier | Notify.exe | 255 | 100% | Notify |
| Truxton Forensic Finding Logger | ForensicLogger.exe | 255 | N/A | flogger |
| Truxton SOLR File Indexer | SOLRFile.exe | 255 | N/A | solrfile |
| Maintenance | Maintenance.exe | 255 | 100% | Maintenance |