Investigative Data and Evidence Analyser (IDEA)

The Investigative Data and Evidence Analyser (IDEA) is a toolkit for conducting investigations using data. It is a Python package, written in Python and R.

Table of Contents

[TOC]

Description

IDEA is a toolkit for conducting investigations using data, written in Python and R. It provides a Python package which bundles functionality for case management, item/evidence comparisons, triangulation, checking for corroboration, data cleaning, metadata analysis, internet analysis, network analysis, web crawling, and more. IDEA can read data from, and write results to, a wide variety of file types (e.g. .xlsx, .csv, .txt, .json, .graphML).

Features

Case management

  • File management
  • Item/evidence analysis and comparisons
  • Object-oriented case management interface

Data cleaning

  • Text cleaning
    • Reformatting
    • Stopword removal
  • Text tokenizing
    • Word tokenization
    • Sentence tokenization
  • HTML parsing

Metadata analysis

  • Metadata similarity analysis

Text analysis

  • Keyword analysis
  • Extraction of key information (e.g. names, locations)
  • Text similarity analysis

Image analysis

  • Reverse image search

Location analysis

  • Geolocation
  • Chronolocation

Internet analysis

  • Web scraping and crawling
  • WhoIs lookups on domains and IP addresses
  • Web search
  • Website similarity analysis
  • Web archiving
    • Internet Archive/Wayback Machine
    • Archive.is
    • Common Crawl

Social media analysis

  • Platform-specific searches
  • Username lookups
  • Scraping

Network analysis

  • Centrality analysis
  • Co-link analysis
  • Community detection
  • and much more...

Data visualisation

  • Network visualisation
  • Timelines

User Guide

Installation

To download from GitHub, run the following code in your console:

gh repo clone J-A-Ha/IDEA
cd IDEA
pip install -r requirements.txt

Or download as a .zip folder from GitHub.

Examples

Importing IDEA

import idea

Creating and saving a case from an Excel file

example = idea.open_case(case_name = 'example', file_address = 'example.xlsx')

example.save_as(file_name = 'example', file_type = 'case', file_address = '/')

Creating a case from a web crawl

example = idea.Case(case_name = 'example')
example.crawl_web()

# You will be asked to input a URL or list of URLs to crawl from.

Running all analysis functions on a case using run_full_analysis()

This will:

  • Parse all raw data
  • Extract keywords
  • Identify instances of coinciding data, metadata, links, etc.
  • Index all items, data, metadata, etc.
  • Generate similarity networks based on inputted data
  • Generate link networks if links are provided
  • Run all statistical analytics
  • Save to the Case object

example = idea.open_case(case_name = 'example', file_address = 'example.case')
example.run_full_analysis()
print(example.analytics)
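
To give a feel for what "identifying coinciding data" can mean, here is a minimal, dependency-free sketch. It is not IDEA's implementation: the item names, metadata fields, and the find_coincidences helper are all hypothetical and exist only for illustration. It groups items by any metadata value they share.

```python
from collections import defaultdict

# Hypothetical item metadata; field names and values are invented for illustration.
items = {
    "item_1": {"author": "alice", "domain": "example.com"},
    "item_2": {"author": "bob", "domain": "example.com"},
    "item_3": {"author": "alice", "domain": "example.org"},
}


def find_coincidences(items):
    """Map each (field, value) pair to the sorted list of items that share it."""
    seen = defaultdict(set)
    for item_id, metadata in items.items():
        for field, value in metadata.items():
            seen[(field, value)].add(item_id)
    # Keep only values that appear in more than one item.
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


coincidences = find_coincidences(items)
print(coincidences)
```

In this toy data, two items share an author and two share a domain; the analytics that run_full_analysis() actually stores on a Case may be organised quite differently.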

Beginner's Guide

1. Install Python (version >=3.9) if it is not yet installed.

Download Python from python.org or install it using a tool like Anaconda.

2. Install the repository and dependencies.

# In the command line, navigate to the folder you wish to install IDEA in.

gh repo clone J-A-Ha/IDEA
cd IDEA
pip install -r requirements.txt

3. Run Python and import the package.

python
import idea

4. Create a Project.

project = idea.Project(project_name = 'project')

5. Add a Case.

# Adding a blank case
project.add_case(case_name = 'example')

# Viewing the case's contents and properties
project.example

6. Run a limited web crawl and add the results to the case.

# visit_limit defines the number of websites to be crawled.
project.example.from_web_crawl(seed_urls='https://example.com/', visit_limit=5, be_polite=True)

7. Run analyses.

project.example.parse_rawdata()
project.example.generate_keywords()
project.example.infer_all_info_categories()
project.example.generate_indexes()
project.example.generate_all_networks()
project.example.generate_analytics()

# Viewing analytics results
print(project.example.analytics)

8. Save the Case.

project.example.save_as()

The package will ask you to input:

  • File name.
  • File type. ('.case' is the recommended format).
  • File path to save to.

Key Classes and Functions

Classes

Project
class Project(...)

A collection of Case objects. See in docs.

Cases can be selected by:

  1. Entering '<project_name>.<case_name>'
  2. Using subscripting ('<project_name>["<case_name>"]')
  3. Using the method .get_case("<case_name>").

project.case

# is the same as
project['case']

# and
project.get_case('case')
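
For readers curious how one container can support all three access styles, the sketch below shows one common Python pattern. MiniProject is a hypothetical stand-in written for this README, not IDEA's Project class, and IDEA may implement this differently.

```python
class MiniProject:
    """Hypothetical stand-in for a Project-like container (not IDEA's code)."""

    def __init__(self):
        self._cases = {}

    def add_case(self, case_name, case=None):
        # Store the given object, or an empty placeholder if none is given.
        self._cases[case_name] = case if case is not None else {}

    def get_case(self, case_name):
        return self._cases[case_name]

    def __getitem__(self, case_name):
        # Enables project['example'].
        return self.get_case(case_name)

    def __getattr__(self, case_name):
        # Only called when normal attribute lookup fails,
        # which enables project.example.
        try:
            return self.__dict__["_cases"][case_name]
        except KeyError:
            raise AttributeError(case_name) from None


project = MiniProject()
project.add_case("example", {"notes": "demo"})
same = project.example is project["example"] is project.get_case("example")
print(same)
```

Routing both dunder methods through a single lookup keeps the three styles from drifting apart.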

Key methods:

  • contents: Returns the Project’s attributes as a list. Excludes the object's properties attribute.
  • add_case: Adds a Case object to the Project.
  • get_case: Returns a Case when given its attribute name.
  • export_folder: Exports Project’s contents to a folder.
  • save_as: Exports the Project to file or folder type of your choice.
  • save: Exports the Project to an existing file or folder.

Case

class Case(...)

An object to store raw data, metadata, and other information related to investigative cases. See in docs.

Contents:

  • properties
  • dataframes
  • items
  • entities
  • events
  • indexes
  • networks
  • analytics
  • description
  • notes

Case contents can be selected by:

  1. Entering '<case_name>.<attribute_name>'
  2. Using subscripting ('<case_name>["<attribute_name>"]')
  3. Using get methods, e.g.:
    • .get_item("<item_name>")
    • .get_entity("<entity_name>")
    • .get_network("<network_name>")

case.attribute

# is the same as
case['attribute']

Some contents of case attributes can themselves be subscripted:


case.get_item('item')

# is the same as
case.items.item

# and
case['items'].item

# and
case['items']['item']

You can even subscript using the name of a dataframe, item, entity, event, or network:

case['item']

# is the same as
case.items['item']

# and
case['items']['item']

# and
case.get_item('item')

Key methods:

  • backup: Creates backup of the Case.
  • make_default: Sets the Case as the default case in the environment.
  • contents: Returns the Case’s attributes as a list.
  • search: Searches Case for a query string. If found, returns a dataframe of all items containing the string.
  • advanced_search: An advanced search function. Searches items using a series of keyword commands. If found, returns a dataframe of all items containing the string.
  • add_item: Adds an item to the Case’s item set.
  • from_web_crawl: Creates a Case object from a web crawl.
  • get_item: Returns an item if given its ID.
  • get_info: Returns all information entries as a Pandas series.
  • get_metadata: Returns all metadata entries as a Pandas series.
  • get_keywords: Returns a keywords dataframe based on user’s choice of ranking metric.
  • get_project: If the Case is assigned to a Project, returns that Project.
  • parse_rawdata: Parses raw data entries for all items.
  • generate_indexes: Generates all indexes and assigns them to the Case’s CaseIndexes attribute. Returns the updated CaseIndexes.
  • generate_all_networks: Generates all network types and assigns to the Case’s CaseNetworks collection.
  • generate_analytics: Generates all analytics and appends the results to the Case’s CaseAnalytics collection.
  • identify_coincidences: Runs all coincidence identification methods.
  • infer_all_info_categories: Identifies potential information from items’ text data and appends to information sets. Parses data if not parsed.
  • run_full_analysis: Runs all analysis functions on the Case.
  • export_folder: Exports the Case to a folder.
  • export_network: Exports a network to one of a variety of graph file types. Defaults to .graphML.
  • save: Saves the Case to its source file. If no source given, saves to a new file.
  • save_as: Saves the Case to a file.

CaseData

class CaseData(...)

A collection of Pandas dataframes containing the combined data for a Case. See in docs.

Contents:

  • data: item data
  • metadata: item metadata
  • information: items' labelled information
  • other: items' links, references, contents, and other miscellaneous data.
  • keywords: keywords associated with the case. A CaseKeywords object containing dataframes:
    • frequent_words
    • central_words
  • coinciding_data: patterns of how data coincides. A dictionary containing dataframes.
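
As a rough illustration of what a frequent_words table summarises, the sketch below counts words after removing stopwords. It uses only the standard library; the stopword list, the sample texts, and the frequent_words function are assumptions made for this example and are not taken from IDEA.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "of", "and", "was", "to", "in"}


def frequent_words(texts, top_n=3):
    """Return the top_n most common non-stopword tokens with their counts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(top_n)


texts = [
    "The invoice was sent to the bank.",
    "A second invoice and a bank transfer.",
]
top = frequent_words(texts, top_n=2)
print(top)
```

IDEA's keyword dataframes may rank words by other metrics as well (central_words, for instance), which this sketch does not attempt.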

Dataframes can be selected by:

  1. Entering 'dataframes.<dataframe_name>'
  2. Using subscripting ('dataframes["<dataframe_name>"]')
  3. Using the method .get_dataframe("<dataframe_name>").

case.dataframes.dataframe

# is the same as
case.dataframes['dataframe']

# and
case.dataframes.get_dataframe('dataframe')

CaseItem

class CaseItem(...)

An object representing a piece of material or evidence associated with a Case. See in docs.

Contents:

  • properties
  • data
  • metadata
  • information
  • whois
  • links
  • references
  • contains
  • files
  • relations
  • user_assessments

Item contents can be selected by:

  1. Entering '<item_name>.<attribute_name>'
  2. Using subscripting: '<item_name>["<attribute_name>"]'.
  3. Using get methods, e.g.:
    • .get_data()
    • .get_metadata()
    • .get_info()

item.data

# is the same as
item['data']

# and
item.get_data()

You can retrieve a CaseItem from a CaseItemSet object by:

  1. Entering 'items.<item_name>'
  2. Subscripting using its name: 'items["<item_name>"]'
  3. Subscripting using a numeric index, in the same style as a list: 'items[index]'
  4. Using the .get_item('<item_name>') method.

items.item

# is the same as
items['item']

# and (if 0 is the item's index position)
items[0]

# and 
items.get_item('item')
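
Accepting both a name and a numeric index in the same subscript usually comes down to checking the key's type. The sketch below shows that idea only; MiniItemSet is hypothetical and is not how CaseItemSet is necessarily written.

```python
class MiniItemSet:
    """Hypothetical item collection that accepts names or list-style indexes."""

    def __init__(self):
        # Dicts preserve insertion order (Python 3.7+),
        # so each item also has a stable position.
        self._items = {}

    def add_item(self, name, item):
        self._items[name] = item

    def get_item(self, name):
        return self._items[name]

    def __getitem__(self, key):
        if isinstance(key, int):
            # Numeric key: treat the collection like a list.
            return list(self._items.values())[key]
        # Any other key: treat it as an item name.
        return self.get_item(key)


items = MiniItemSet()
items.add_item("item", {"data": "first"})
items.add_item("other", {"data": "second"})
by_name = items["item"]
by_index = items[0]
print(by_name is by_index)
```

Negative indexes work here too, because the lookup is delegated to an ordinary list.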

Key methods:

CaseNetwork
class CaseNetwork(igraph.Graph)

A modified igraph.Graph object. It provides additional analytics methods and functionality for Case management. CaseNetworks can convert both igraph and NetworkX graph objects. See in docs.

Key attributes:

  • vs['name']: returns a list of vertex names.
  • es['name']: returns a list of edge names.
  • es['weight']: returns a list of edge weights.

Key methods:

  • attributes: Returns the network's global attributes.
  • summary: Returns the summary of the network.
  • vs.attributes: Returns a list of the names of all vertex attributes.
  • es.attributes: Returns a list of the names of all edge attributes.
  • get_adjacency: Returns the adjacency matrix of the network.
  • degree: Returns some vertex degrees from the network.
  • density: Calculates the density of the network.
  • average_path_length: Calculates the average path length in the network.
  • diameter: Calculates the diameter of the network.
  • betweenness: Calculates or estimates the betweenness of vertices in the network.
  • eigenvector_centrality: Calculates the eigenvector centralities of the vertices in the network.
  • all_centralities: Calculates all centrality measures for network. Returns as a dataframe.
  • colinks: Runs a colink analysis on the network. Returns a dataframe.
  • community_detection: Identifies communities in the network. Gives the option of using different algorithms.
  • degrees_dataframe: Returns the network's degree distribution as a dataframe.
  • export_network: Exports network to one of a variety of graph file types. Defaults to .graphML.
  • to_networkx: Converts the CaseNetwork to networkx format.
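
To make two of these measures concrete, here is a small sketch that computes vertex degree and network density by hand for an undirected graph without self-loops. It deliberately avoids igraph so that it runs anywhere; the edge list and both function names are hypothetical, and CaseNetwork's own degree and density methods should be preferred in practice.

```python
from collections import defaultdict

# Hypothetical undirected edge list, invented for illustration.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def degrees(edges):
    """Count how many edges touch each vertex."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


def density(edges):
    """Share of possible vertex pairs that are connected: 2m / (n(n - 1))."""
    vertices = {vertex for edge in edges for vertex in edge}
    n = len(vertices)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))


print(degrees(edges))
print(density(edges))
```

Here vertex c touches three of the four edges, and four of the six possible pairs are connected.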
CaseFile
class CaseFile(CaseObject)

An object which stores details about a digital file associated with a case or piece of evidence. See in docs.

Contents:

  • path: the file's filepath in the directory.
  • name: the file's filename.
  • suffix: the file's filetype or extension.
  • type: the type of directory object, e.g. directory, folder, or file.
  • absolute: the file's absolute filepath.
  • parent: the filepath of the file's parent directory.
  • root: the filepath of the root directory.
  • and more...

Key functions:

  • get_children
  • listdir
  • walk
  • scandir

Functions

Case management
  • open_case: Opens a Case from a file.
  • save: Saves a Case to its source file. If no file exists, requests file details from user input.
  • save_as: Saves a Case to a file. Requests file details from user input.
  • get_backups: Returns the Backups directory and registry.
  • set_default_case: Sets a case as the default in the environment.
  • get_default_case: Returns the default case.
Importing files
  • import_case_excel: Imports a Case from a formatted Excel (.xlsx) file.
  • import_case_csv_folder: Imports a Case from a folder of formatted CSV (.csv) files.
  • import_case_txt: Imports a Case from a pickled text file (.txt or .case).
  • read_pdf: Loads and parses PDF file. Returns a dictionary of data.
  • read_pdf_url: Downloads and parses PDF file from a URL. Returns a dictionary of data.
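
The phrase "pickled text file" suggests Python's pickle serialisation. The sketch below shows a plain pickle round trip with a stand-in dictionary; whether IDEA uses the pickle module in exactly this way is an assumption, and case_like is a hypothetical object, not a real Case.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a Case; a real Case object is far richer.
case_like = {"case_name": "example", "items": {"item_1": {"data": "raw text"}}}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.case"

    # Serialise the object to disk in binary mode.
    with open(path, "wb") as file:
        pickle.dump(case_like, file)

    # Load it back into a new, equal object.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == case_like)
```

Unpickling can execute arbitrary code, so only open pickled files that come from a source you trust.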
Location analysis
Internet analysis
  • lookup_whois: Performs a WhoIs lookup on an inputted domain or IP address.
Web searching
  • open_url: Opens URL in the default web browser.
  • open_url_source: Opens URL’s source code in the default web browser.
  • search_web: Launches a website-specific Google search for an inputted query and URL.
  • multi_search_web: Launches multiple web searches by iterating on a query through a list of terms.
  • search_images: Launches an image search using the default web browser.
  • search_social_media: Launches a Google search focused on specified social media platform for inputted query.
Web crawling
  • crawl_site: Crawls website’s internal pages. Returns any links found as a list.
  • crawl_web: Crawls internet from a single URL or list of URLs. Returns details like links found, HTML scraped, and site metadata.
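
The general idea behind a crawl with a visit limit can be sketched without touching the network. The example below is not IDEA's crawler: PAGES is a hypothetical in-memory "web", and LinkParser and crawl are illustrative names. It extracts links with the standard library's HTML parser and visits pages breadth-first until the limit is reached.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical in-memory "web" used in place of real HTTP requests.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/contact">Contact</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="/team">Team</a>',
    "https://example.com/contact": "",
    "https://example.com/team": "",
}


def crawl(seed_url, visit_limit):
    """Visit pages breadth-first from seed_url, stopping at visit_limit pages."""
    visited = []
    queue = deque([seed_url])
    while queue and len(visited) < visit_limit:
        url = queue.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.append(url)
        parser = LinkParser(url)
        parser.feed(PAGES[url])
        queue.extend(parser.links)
    return visited


visited = crawl("https://example.com/", visit_limit=3)
print(visited)
```

A real crawler would also fetch pages over HTTP, pause between requests, and handle errors; how crawl_web and the be_polite option handle these details is not covered by this sketch.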
Social media analysis

Documentation

For the full documentation, click here.

Contributing

Authors and acknowledgments

IDEA was created by Jamie Hancock.

It relies on packages, modules, and datasets created by:

License

IDEA is licensed under GPL-3.0.

IDEA is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

IDEA is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with IDEA. If not, see https://www.gnu.org/licenses/.

Appendix and FAQ

Project Timeline


gantt
    title Timeline
tags: Python, R, Investigations, OSI, OSINT, Documentation