Investigative Data and Evidence Analyser (IDEA)

The Investigative Data and Evidence Analyser (IDEA) is a toolkit for conducting investigations using data. It is distributed as a Python package and is written in Python and R.

Description

IDEA bundles functionality for case management, item/evidence comparisons, triangulation, checking for corroboration, data cleaning, metadata analysis, internet analysis, network analysis, web crawling, and more. It can read and write results in a wide variety of file types (e.g. .xlsx, .csv, .txt, .json, .graphML).

Features

Case management

File management
Item/evidence analysis and comparisons
Object-oriented case management interface

Data cleaning

Text cleaning
- Reformatting
- Stopword removal
Text tokenizing
- Word tokenization
- Sentence tokenization
HTML parsing

Metadata analysis

Metadata similarity analysis

Text analysis

Keyword analysis
Extraction of key information (e.g. names, locations)
Text similarity analysis

Image analysis

Reverse image search

Location analysis

Geolocation
Chronolocation

Internet analysis

Web scraping and crawling
WhoIs lookups on domains and IP addresses
Web search
Website similarity analysis
Web archiving
- Internet Archive/Wayback Machine
- Archive.is
- Common Crawl

Social media analysis

Platform-specific searches
Username lookups
Scraping

Network analysis

Centrality analysis
Co-link analysis
Community detection
and much more...

Data visualisation

Network visualisation
Timelines

User Guide

Installation

To download from GitHub using the GitHub CLI (gh), run the following commands in your console:

gh repo clone J-A-Ha/IDEA
cd IDEA
pip install -r requirements.txt

Or download as a .zip folder from GitHub.

Examples

Importing IDEA

import idea

Creating and saving a case from an Excel file

example = idea.open_case(case_name = 'example', file_address = 'example.xlsx')

example.save_as(file_name = 'example', file_type = 'case', file_address = '/')

Creating a case from a web crawl

example = idea.Case(case_name = 'example')
example.crawl_web()

# You will be asked to input a URL or list of URLs to crawl from.

Running all analysis functions on a case using run_full_analysis()

This will:

Parse all raw data
Extract keywords
Identify instances of coinciding data, metadata, links, etc.
Index all items, data, metadata, etc.
Generate similarity networks based on inputted data
Generate link networks if links are provided
Run all statistical analytics
Save to the Case object

example = idea.open_case(case_name = 'example', file_address = 'example.case')
example.run_full_analysis()
print(example.analytics)

Beginner's Guide

1. Install Python (version >=3.9) if it is not yet installed.

Download Python from the official website (python.org) or install it using a tool like Anaconda.

2. Install the repository and dependencies.

# In the command line, navigate to the folder you wish to install IDEA in.

gh repo clone J-A-Ha/IDEA
cd IDEA
pip install -r requirements.txt

3. Run Python and import the package.

python

import idea

4. Creating a Project.

project = idea.Project(project_name = 'project')

5. Adding a Case.

# Adding a blank case
project.add_case(case_name = 'example')

# Viewing the case's contents and properties
project.example

6. Running a limited web crawl and adding results to the case.

# visit_limit defines the number of websites to be crawled.
project.example.from_web_crawl(seed_urls='https://example.com/', visit_limit=5, be_polite=True)
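
To illustrate what a visit limit does, here is a minimal offline sketch of a breadth-first crawl that stops after a fixed number of pages. It is not IDEA's crawler: the function `crawl`, the `get_links` callback, and the `FAKE_WEB` dictionary are hypothetical names introduced for this example, and the behaviour of `be_polite` is not modelled.

```python
from collections import deque

def crawl(seed_urls, get_links, visit_limit=5):
    """Breadth-first crawl that stops after visiting `visit_limit` pages.

    `get_links` is a caller-supplied function mapping a URL to the list of
    URLs it links to (hypothetical; a real crawler would fetch the page).
    """
    if isinstance(seed_urls, str):
        seed_urls = [seed_urls]
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < visit_limit:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# An in-memory stand-in for the web, so the sketch runs without a network.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}

pages = crawl("https://example.com/", lambda u: FAKE_WEB.get(u, []), visit_limit=3)
print(pages)
```

With a limit of 3, the sketch visits the seed page and the two pages it links to, then stops before reaching the fourth page.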

7. Running analyses

project.example.parse_rawdata()
project.example.generate_keywords()
project.example.infer_all_info_categories()
project.example.generate_indexes()
project.example.generate_all_networks()
project.example.generate_analytics()

# Viewing analytics results
print(project.example.analytics)
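
If you run these steps often, they can be wrapped in a small helper. The sketch below is an assumption-laden convenience, not part of IDEA: `run_analysis_steps` and `StubCase` are hypothetical names, and the helper only assumes that the case object exposes the six method names used in this step. The stub exists so the sketch runs without IDEA installed.

```python
ANALYSIS_STEPS = [
    "parse_rawdata",
    "generate_keywords",
    "infer_all_info_categories",
    "generate_indexes",
    "generate_all_networks",
    "generate_analytics",
]

def run_analysis_steps(case, steps=ANALYSIS_STEPS):
    """Call each named method on `case` in order; return the steps completed."""
    completed = []
    for name in steps:
        method = getattr(case, name)  # raises AttributeError if a step is missing
        method()
        completed.append(name)
    return completed

class StubCase:
    """Stand-in used only to demonstrate the helper without IDEA installed."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that normal lookup cannot find.
        if name in ANALYSIS_STEPS:
            return lambda: self.calls.append(name)
        raise AttributeError(name)

stub = StubCase()
print(run_analysis_steps(stub))
```

With a real case you would pass `project.example` instead of the stub.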

8. Saving the Case

project.example.save_as()

The package will ask you to input:

File name.
File type. ('.case' is the recommended format).
File path to save to.

Key Classes and Functions

Classes

Project

class Project(...)

A collection of Case objects. See in docs.

Cases can be selected by:

Entering '<project_name>.<case_name>'
Using subscripting ( '<project_name>["case_name"]')
Using the method .get_case("<case_name>").

project.case

# is the same as
project['case']

# and
project.get_case('case')
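
For readers curious how one object can support all three access styles, here is a toy sketch. It is not IDEA's implementation: `MiniProject` is a hypothetical class that stores each case as an ordinary attribute and routes subscripting through the same lookup.

```python
class MiniProject:
    """Toy container whose members are reachable three ways."""

    def __init__(self, project_name):
        self.project_name = project_name

    def add_case(self, case_name, case=None):
        # Store the case as a regular attribute under its name.
        setattr(self, case_name, case if case is not None else {})

    def get_case(self, case_name):
        return getattr(self, case_name)

    def __getitem__(self, case_name):
        # Subscripting delegates to the same attribute lookup.
        return self.get_case(case_name)

project = MiniProject("project")
project.add_case("example", case={"notes": "blank case"})
print(project.example is project["example"] is project.get_case("example"))
```

All three expressions return the same object, so changes made through one style are visible through the others.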

Key methods:

contents: Returns the Project’s attributes as a list. Excludes object properties attribute.
add_case: Adds a Case object to the Project.
get_case: Returns a Case when given its attribute name.
export_folder: Exports Project’s contents to a folder.
save_as: Exports the Project to file or folder type of your choice.
save: Exports the Project to an existing file or folder.

Case

class Case(...)

An object to store raw data, metadata, and other information related to investigative cases. See in docs.

Contents:

properties
dataframes
items
entities
events
indexes
networks
analytics
description
notes

Case contents can be selected by:

Entering '<case_name>.<attribute_name>'
Using subscripting ( '<case_name>["attribute_name"]')
Using get methods, e.g.:
- .get_item("<item_name>").
- .get_entity("<entity_name>").
- .get_network("<network_name>").

case.attribute

# is the same as
case['attribute']

Some contents of case attributes can themselves be subscripted:


case.get_item('item')

# is the same as
case.items.item

# and
case['items'].item

# and
case['items']['item']

You can even subscript using the name of a dataframe, item, entity, event, or network:

case['item']

# is the same as
case.items['item']

# and
case['items']['item']

# and
case.get_item('item')

Key methods:

backup: Creates backup of the Case.
make_default: Sets the Case as the default case in the environment.
contents: Returns the Case’s attributes as a list.
search: Searches Case for a query string. If found, returns a dataframe of all items containing the string.
advanced_search: An advanced search function. Searches items using a series of keyword commands. If found, returns a dataframe of all items containing the string.
add_item: Adds an item to the Case’s item set.
from_web_crawl: Creates a Case object from a web crawl.
get_item: Returns an item if given its ID.
get_info: Returns all information entries as a Pandas series.
get_metadata: Returns all metadata entries as a Pandas series.
get_keywords: Returns a keywords dataframe based on user’s choice of ranking metric.
get_project: If the Case is assigned to a Project, returns that Project.
parse_rawdata: Parses raw data entries for all items.
generate_indexes: Generates all indexes and assigns them to the Case’s CaseIndexes attribute. Returns the updated CaseIndexes.
generate_all_networks: Generates all network types and assigns to the Case’s CaseNetworks collection.
generate_analytics: Generates all analytics and appends the results to the Case’s CaseAnalytics collection.
identify_coincidences: Runs all coincidence identification methods.
infer_all_info_categories: Identifies potential information from items’ text data and appends to information sets. Parses data if not parsed.
run_full_analysis: Runs all analysis functions on the Case.
export_folder: Exports the Case to a folder.
export_network: Exports a network to one of a variety of graph file types. Defaults to .graphML.
save: Saves the Case to its source file. If no source given, saves to a new file.
save_as: Saves the Case to a file.
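
As a rough illustration of what a substring search over items involves, the sketch below scans every field of every item for a query string. It is a simplification under stated assumptions: `search_items` is a hypothetical function, items are modelled as plain dictionaries, and the result is a list of item names rather than the dataframe that `search` returns.

```python
def search_items(items, query, case_sensitive=False):
    """Return the names of items with at least one field containing `query`."""
    needle = query if case_sensitive else query.lower()
    matches = []
    for name, fields in items.items():
        for value in fields.values():
            text = str(value) if case_sensitive else str(value).lower()
            if needle in text:
                matches.append(name)
                break  # one hit per item is enough
    return matches

items = {
    "item_1": {"title": "Company registry", "text": "Director: Jane Doe"},
    "item_2": {"title": "News article", "text": "No relevant names here"},
    "item_3": {"title": "Forum post", "text": "posted by jane doe in 2020"},
}
print(search_items(items, "Jane Doe"))
```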

CaseData

class CaseData(...)

A collection of Pandas dataframes containing the combined data for a Case. See in docs.

Contents:

data: item data
metadata: item metadata
information: items' labelled information
other: items' links, references, contents, and other miscellaneous data.
keywords: keywords associated with the case. A CaseKeywords object containing dataframes:
- frequent_words
- central_words
coinciding_data: patterns of how data coincides. A dictionary containing dataframes.

Dataframes can be selected by:

Entering 'dataframes.<dataframe_name>'
Using subscripting ( 'dataframes["dataframe_name"]')
Using the method .get_dataframe("<dataframe_name>").

case.dataframes.dataframe

# is the same as
case.dataframes['dataframe']

# and
case.dataframes.get_dataframe('dataframe')

CaseItem

class CaseItem(...)

An object representing a piece of material or evidence associated with a Case. See in docs.

Contents:

properties
data
metadata
information
whois
links
references
contains
files
relations
user_assessments

Item contents can be selected by:

Entering '<item_name>.<attribute_name>'
Using subscripting: '<item_name>["<attribute_name>"]'.
Using get methods. E.g.,
- .get_data().
- .get_metadata()
- .get_info()

item.data

# is the same as
item['data']

# and
item.get_data()

You can retrieve a CaseItem from a CaseItemSet object by:

Entering 'items.<item_name>'
Subscripting using its name: 'items["<item_name>"]'
Subscripting using a numeric index, in the same style as a list: 'items[index]'
Using the .get_item('<item_name>') method.

items.item

# is the same as
items['item']

# and (if 0 is the item's index position)
items[0]

# and 
items.get_item('item')
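
The mix of name-based and position-based retrieval can be sketched with a small toy class. This is an illustration only, not how CaseItemSet is written: `MiniItemSet` is a hypothetical class that keeps items in an insertion-ordered dictionary and checks whether the subscript is an integer.

```python
class MiniItemSet:
    """Toy ordered collection addressable by name, position, or attribute."""

    def __init__(self):
        self._items = {}  # dicts preserve insertion order (Python 3.7+)

    def add_item(self, name, item):
        self._items[name] = item

    def get_item(self, name):
        return self._items[name]

    def __getitem__(self, key):
        if isinstance(key, int):
            # Numeric index: look up by insertion position, like a list.
            return list(self._items.values())[key]
        return self.get_item(key)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        try:
            return self.__dict__["_items"][name]
        except KeyError:
            raise AttributeError(name) from None

items = MiniItemSet()
items.add_item("item", {"url": "https://example.com/"})
print(items.item is items["item"] is items[0] is items.get_item("item"))
```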

Key methods:

add_metadata: Adds single metadata entry to an item’s metadata dataframe.
add_data: Adds single data entry to an item’s data dataframe.
add_info: Adds a single information entry to object.
add_link: Adds a link to an item’s list of links.
get_data: Returns item’s data.
get_metadata: Returns item’s metadata.
get_info: Returns item’s information.
get_url: Returns URL metadata.
scrape_url: Scrapes data from item URL’s site.
crawl_web_from_url: Runs web crawl from item’s URL metadata.
export_excel: Exports item as Excel (.xlsx) file.

CaseNetwork

class CaseNetwork(igraph.Graph)

A modified igraph.Graph object. It provides additional analytics methods and functionality for Case management. CaseNetworks can convert both igraph and NetworkX graph objects. See in docs.

Key attributes:

vs['name']: returns a list of vertex names.
es['name']: returns a list of edge names.
es['weight']: returns a list of edge weights.

Key methods:

attributes: returns the network's global attributes.
summary: Returns the summary of the network.
vs.attributes: returns a list of the names of all vertex attributes.
es.attributes: returns a list of the names of all edge attributes.
get_adjacency: Returns the adjacency matrix of the network.
degree: Returns the degrees of some or all vertices in the network.
density: Calculates the density of the network.
average_path_length: Calculates the average path length in the network.
diameter: Calculates the diameter of the network.
betweenness: Calculates or estimates the betweenness of vertices in the network.
eigenvector_centrality: Calculates the eigenvector centralities of the vertices in the network.
all_centralities: Calculates all centrality measures for network. Returns as a dataframe.
colinks: Runs a colink analysis on the network. Returns a dataframe.
community_detection: Identifies communities in the network. Gives the option of using different algorithms.
degrees_dataframe: Returns the network's degree distribution as a dataframe.
export_network: Exports network to one of a variety of graph file types. Defaults to .graphML.
to_networkx: Converts the CaseNetwork to networkx format.
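
To give a feel for co-link analysis, the sketch below uses one common definition: two pages are co-linked whenever a third page links to both, and the count is the number of such third pages. This is an assumption about the general technique; the `colinks` method may define or report co-links differently. `colink_counts` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def colink_counts(links):
    """Count, for each pair of targets, how many sources link to both.

    `links` maps a source page to the collection of pages it links to.
    """
    counts = Counter()
    for targets in links.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts

links = {
    "blog": ["site_a", "site_b", "site_c"],
    "forum": ["site_a", "site_b"],
    "news": ["site_b", "site_c"],
}
counts = colink_counts(links)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are frequently referenced together, which can suggest a relationship worth investigating further.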

CaseFile

class CaseFile(CaseObject)

An object which stores details about a digital file associated with a case or piece of evidence. See in docs.

Contents:

path: the file's filepath in the directory.
name: the file's filename.
suffix: the file's filetype or extension.
type: the type of directory object, e.g. directory, folder, or file.
absolute: the file's absolute filepath.
parent: the filepath of the file's parent directory.
root: the filepath of the root directory.
and more...

Key functions:

get_children
listdir
walk
scandir
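
Several of the names above resemble those in Python's standard library (`pathlib` path attributes, and `os.listdir`, `os.walk`, `os.scandir`). Whether CaseFile wraps these is an assumption, but the standard-library equivalents give a sense of what each field holds. The file path below is made up for illustration.

```python
from pathlib import PurePosixPath

# A made-up path; PurePosixPath does not touch the file system.
path = PurePosixPath("/cases/example/evidence/report.pdf")

details = {
    "path": str(path),
    "name": path.name,            # final component of the path
    "suffix": path.suffix,        # file extension, including the dot
    "parent": str(path.parent),   # containing directory
    "root": path.root,            # root of the path
}

for key, value in details.items():
    print(f"{key}: {value}")
```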

Functions

Case management

open_case: Opens a Case from a file.
save: Saves a Case to its source file. If no file exists, requests file details from user input.
save_as: Saves a Case to a file. Requests file details from user input.
get_backups: Returns the Backups directory and registry.
set_default_case: Sets a case as the default in the environment.
get_default_case: Returns the default case.

Importing files

import_case_excel: Imports a Case from a formatted Excel (.xlsx) file.
import_case_csv_folder: Imports a Case from a folder of formatted CSV (.csv) files.
import_case_txt: Imports a Case from a pickled text file (.txt or .case).
read_pdf: Loads and parses PDF file. Returns a dictionary of data.
read_pdf_url: Downloads and parses PDF file from a URL. Returns a dictionary of data.
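
Because `import_case_txt` works with pickled files, it helps to know what pickling involves. The sketch below is a generic round trip using Python's `pickle` module, assuming nothing about IDEA's actual on-disk layout; the dictionary stands in for a Case object. Note that unpickling can execute arbitrary code, so only open pickled files from sources you trust.

```python
import os
import pickle
import tempfile

# A plain dictionary standing in for a Case object (illustrative only).
case_like = {"case_name": "example", "items": ["item_1", "item_2"]}

# Write the object to a temporary '.case' file, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.case")
    with open(path, "wb") as f:
        pickle.dump(case_like, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == case_like)
```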

Location analysis

get_coordinates_location: Takes coordinates and returns the location associated by Geopy’s geocoder.
get_location_address: Takes location details and returns the address associated by Geopy’s geocoder.
get_location_coordinates: Takes location details and returns the coordinates associated by Geopy’s geocoder.

Internet analysis

lookup_whois: Performs a WhoIs lookup on an inputted domain or IP address.
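
A lookup that accepts either a domain or an IP address first has to tell the two apart. The sketch below shows one way to do that with the standard library's `ipaddress` module. `classify_target` is a hypothetical helper, not an IDEA function, and how `lookup_whois` makes this decision internally is not documented here.

```python
import ipaddress

def classify_target(target):
    """Return 'ip' if `target` parses as an IPv4/IPv6 address, else 'domain'."""
    try:
        ipaddress.ip_address(target.strip())
        return "ip"
    except ValueError:
        return "domain"

for target in ["93.184.216.34", "2001:db8::1", "example.com"]:
    print(target, "->", classify_target(target))
```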

Web searching

open_url: Opens URL in the default web browser.
open_url_source: Opens URL’s source code in the default web browser.
search_web: Launches a website-specific Google search for an inputted query and URL.
multi_search_web: Launches multiple web searches by iterating on a query through a list of terms.
search_images: Launches an image search using the default web browser.
search_social_media: Launches a Google search focused on specified social media platform for inputted query.
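
A website-specific Google search is typically expressed with the `site:` operator in the query string. The sketch below builds such a URL with the standard library. `build_site_search_url` is a hypothetical helper, and the exact URL that `search_web` constructs is an assumption.

```python
from urllib.parse import urlencode

def build_site_search_url(query, site=None):
    """Build a Google search URL, optionally restricted to one site."""
    terms = f"{query} site:{site}" if site else query
    return "https://www.google.com/search?" + urlencode({"q": terms})

url = build_site_search_url("annual report", site="example.com")
print(url)

# To open the result in the default browser:
# import webbrowser; webbrowser.open(url)
```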

Web crawling

crawl_site: Crawls website’s internal pages. Returns any links found as a list.
crawl_web: Crawls internet from a single URL or list of URLs. Returns details like links found, HTML scraped, and site metadata.

Social media analysis

search_username: Runs a Sherlock search for a username.

Documentation

For the full documentation, click here.

Contributing

Authors and acknowledgments

IDEA was created by Jamie Hancock.

It relies on packages, modules, and datasets created by:

Geocoder: Denis Carriere
Geopy: Adam Tygart et al.
Shodan: John Matherly
Sherlock: Siddharth Dushantha et al.
Instaloader: Alexander Graf et al.
youtube-comment-downloader: Egbert Bouman
youtube-dl: Ricardo Garcia Gonzalez et al.
RPy2: Laurent Gautier
ERGM: Mark S. Handcock et al.
python-whois: Richard Penman
ipwhois: Phillip Hane
Trafilatura: Adrien Barbaresi
Cloudscraper: VeNoMouS
Levenshtein: Max Bachmann
names-dataset: Philippe Remy
country_list: Niels Lemmens
geonamescache: Ramiro Gómez, using GeoNames
langcodes: Elia Robyn Lake (Robyn Speer)
language_data: Elia Robyn Lake (Robyn Speer)

License

IDEA is licensed under GPL-3.0.

IDEA is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

IDEA is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with IDEA. If not, see https://www.gnu.org/licenses/.

Appendix and FAQ

Project Timeline


tags: Python, R, Investigations, OSI, OSINT, Documentation
