# Investigative Data and Evidence Analyser (IDEA)

The Investigative Data and Evidence Analyser (IDEA) is a toolkit for conducting investigations using data, provided as a Python package written in Python and R.
[TOC]
IDEA is a toolkit for conducting investigations using data, written in Python and R. The package bundles functionality for case management, item/evidence comparisons, triangulation, corroboration checking, data cleaning, metadata analysis, internet analysis, network analysis, web crawling, and more. IDEA can read from and write results to a wide variety of file types (e.g. .xlsx, .csv, .txt, .json, .graphML).
- **Case management**
  - File management
  - Item/evidence analysis and comparisons
  - Object-oriented case management interface
- **Data cleaning**
  - Text cleaning
  - Reformatting
  - Stopword removal
  - Text tokenization
    - Word tokenization
    - Sentence tokenization
  - HTML parsing
- **Metadata analysis**
  - Metadata similarity analysis
- **Text analysis**
  - Keyword analysis
  - Extraction of key information (e.g. names, locations)
  - Text similarity analysis
- **Image analysis**
  - Reverse image search
- **Location analysis**
  - Geolocation
  - Chronolocation
- **Internet analysis**
  - Web scraping and crawling
  - WhoIs lookups on domains and IP addresses
  - Web search
  - Website similarity analysis
  - Web archiving
    - Internet Archive/Wayback Machine
    - Archive.is
    - Common Crawl
- **Social media analysis**
  - Platform-specific searches
  - Username lookups
  - Scraping
- **Network analysis**
  - Centrality analysis
  - Co-link analysis
  - Community detection
  - and much more...
- **Data visualisation**
  - Network visualisation
  - Timelines
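Several of these features rest on standard techniques. Text similarity, for example, is commonly measured with Levenshtein edit distance (the Levenshtein library appears among IDEA's credited dependencies). A minimal stdlib sketch of that metric, independent of IDEA's own implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # → 3
```

The smaller the distance, the more similar two strings are; similarity scores are typically normalised by the length of the longer string.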
To download and install IDEA from GitHub, run the following commands in your console:

```bash
gh repo clone J-A-Ha/IDEA
cd IDEA
pip install -r requirements.txt
```

Alternatively, download IDEA as a .zip folder from GitHub.
To open a case from a file and save it:

```python
import idea

# Open a case from a file
example = idea.open_case(case_name = 'example', file_address = 'example.xlsx')

# Save the case in IDEA's native .case format
example.save_as(file_name = 'example', file_type = 'case', file_address = '/')
```

To create a new case and gather data with a web crawl:

```python
example = idea.Case(case_name = 'example')
example.crawl_web()
# You will be asked to input a URL or list of URLs to crawl from.
```
This will:
- Parse all raw data
- Extract keywords
- Identify instances of coinciding data, metadata, links, etc.
- Index all items, data, metadata, etc.
- Generate similarity networks based on inputted data
- Generate link networks if links are provided
- Run all statistical analytics
- Save to the Case object
To open a saved case and run a full analysis:

```python
example = idea.open_case(case_name = 'example', file_address = 'example.case')
example.run_full_analysis()
print(example.analytics)
```
Download and install Python, either from the official Python website or using a tool like Anaconda.
```bash
# In the command line, navigate to the folder you wish to install IDEA in.
gh repo clone J-A-Ha/IDEA
cd IDEA
pip install -r requirements.txt

# Launch Python
python
```
```python
import idea

# Creating a project
project = idea.Project(project_name = 'project')

# Adding a blank case
project.add_case(case_name = 'example')

# Viewing the case's contents and properties
project.example

# Running a web crawl and adding the results to the case.
# visit_limit defines the number of websites to be crawled.
project.example.from_web_crawl(seed_urls='https://example.com/', visit_limit=5, be_polite=True)

# Parsing the raw data collected
project.example.parse_rawdata()

# Generating keywords
project.example.generate_keywords()

# Identifying key information (e.g. names, locations)
project.example.infer_all_info_categories()

# Generating indexes and networks
project.example.generate_indexes()
project.example.generate_all_networks()

# Running analytics
project.example.generate_analytics()

# Viewing analytics results
print(project.example.analytics)

# Saving the case
project.example.save_as()
```
The package will ask you to input:

- A file name.
- A file type ('.case' is the recommended format).
- A file path to save to.
### class Project(...)

A collection of Case objects. See in docs.

Cases can be selected by:

- Entering `<project_name>.<case_name>`
- Using subscripting: `<project_name>["<case_name>"]`
- Using the method `.get_case("<case_name>")`
```python
project.case
# is the same as
project['case']
# and
project.get_case('case')
```
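The three access routes are equivalent. As an illustration of how a container can support all three, here is a minimal stand-in (`MockProject` is hypothetical, not the real `idea.Project`):

```python
class MockProject:
    """Toy stand-in for idea.Project (illustrative only) supporting
    attribute access, subscripting, and a get_case() method."""
    def __init__(self):
        self._cases = {}

    def add_case(self, case_name):
        case = {'name': case_name}
        self._cases[case_name] = case
        setattr(self, case_name, case)   # enables project.<case_name>

    def __getitem__(self, case_name):    # enables project['<case_name>']
        return self._cases[case_name]

    def get_case(self, case_name):       # enables project.get_case(...)
        return self._cases[case_name]

project = MockProject()
project.add_case('example')
print(project.example is project['example'] is project.get_case('example'))  # → True
```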
Key methods:
- contents: Returns the Project’s attributes as a list. Excludes the properties attribute.
- add_case: Adds a Case object to the Project.
- get_case: Returns a Case when given its attribute name.
- export_folder: Exports Project’s contents to a folder.
- save_as: Exports the Project to file or folder type of your choice.
- save: Exports the Project to an existing file or folder.
### class Case(...)

An object to store raw data, metadata, and other information related to investigative cases. See in docs.
Contents:
- properties
- dataframes
- items
- entities
- events
- indexes
- networks
- analytics
- description
- notes
Case contents can be selected by:

- Entering `<case_name>.<attribute_name>`
- Using subscripting: `<case_name>["<attribute_name>"]`
- Using get methods, e.g.:
  - `.get_item("<item_name>")`
  - `.get_entity("<entity_name>")`
  - `.get_network("<network_name>")`
```python
case.attribute
# is the same as
case['attribute']
```
Some contents of case attributes can themselves be subscripted:
```python
case.get_item('item')
# is the same as
case.items.item
# and
case['items'].item
# and
case['items']['item']
```
You can even subscript using the name of a dataframe, item, entity, event, or network:
```python
case['item']
# is the same as
case.items['item']
# and
case['items']['item']
# and
case.get_item('item')
```
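One way to read this name-based fall-through: subscripting first checks the case's own attributes, then the item collection. A hypothetical sketch of that mechanism (`MockCase` is illustrative, not the package's actual implementation):

```python
class MockCase:
    """Toy stand-in for idea.Case (illustrative only): __getitem__ checks
    the case's own attributes first, then falls back to the item set, so
    case['item'] works even though items live under case.items."""
    def __init__(self):
        self.items = {}

    def __getitem__(self, key):
        if key in self.__dict__:         # attribute names: 'items', etc.
            return self.__dict__[key]
        return self.items[key]           # fall back to item names

    def get_item(self, item_name):
        return self.items[item_name]

case = MockCase()
case.items['doc1'] = {'data': 'raw text'}
print(case['doc1'] is case['items']['doc1'] is case.get_item('doc1'))  # → True
```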
Key methods:
- backup: Creates backup of the Case.
- make_default: Sets the Case as the default case in the environment.
- contents: Returns the Case’s attributes as a list.
- search: Searches Case for a query string. If found, returns a dataframe of all items containing the string.
- advanced_search: An advanced search function. Searches items using a series of keyword commands and returns a dataframe of all items matched.
- add_item: Adds an item to the Case’s item set.
- from_web_crawl: Creates a Case object from a web crawl.
- get_item: Returns an item if given its ID.
- get_info: Returns all information entries as a Pandas series.
- get_metadata: Returns all metadata entries as a Pandas series.
- get_keywords: Returns a keywords dataframe based on user’s choice of ranking metric.
- get_project: If the Case is assigned to a Project, returns that Project.
- parse_rawdata: Parses raw data entries for all items.
- generate_indexes: Generates all indexes and assigns them to the Case’s CaseIndexes attribute. Returns the updated CaseIndexes.
- generate_all_networks: Generates all network types and assigns to the Case’s CaseNetworks collection.
- generate_analytics: Generates all analytics and appends the results to the Case’s CaseAnalytics collection.
- identify_coincidences: Runs all coincidence identification methods.
- infer_all_info_categories: Identifies potential information from items’ text data and appends to information sets. Parses data if not parsed.
- run_full_analysis: Runs all analysis functions on the Case.
- export_folder: Exports the Case to a folder.
- export_network: Exports a network to one of a variety of graph file types. Defaults to .graphML.
- save: Saves the Case to its source file. If no source given, saves to a new file.
- save_as: Saves the Case to a file.
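The plain search method described above amounts to a substring scan over item text. A minimal stdlib sketch of that behaviour (illustrative only; the actual method returns a Pandas dataframe rather than a dict, and `search_items` is a hypothetical helper):

```python
def search_items(items, query):
    """Return the items whose text contains the query, case-insensitively.
    Stand-in for a Case.search-style lookup (illustrative only)."""
    q = query.lower()
    return {name: item for name, item in items.items()
            if q in item.get('text', '').lower()}

items = {
    'doc1': {'text': 'Meeting at the harbour on Friday'},
    'doc2': {'text': 'Invoice for consulting services'},
}
print(list(search_items(items, 'harbour')))  # → ['doc1']
```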
### class CaseData(...)

A collection of Pandas dataframes containing the combined data for a Case. See in docs.
Contents:
- data: item data
- metadata: item metadata
- information: items' labelled information
- other: items' links, references, contents, and other miscellaneous data.
- keywords: keywords associated with the case. A CaseKeywords object containing dataframes:
- frequent_words
- central_words
- coinciding_data: patterns of how data coincides. A dictionary containing dataframes.
Dataframes can be selected by:

- Entering `dataframes.<dataframe_name>`
- Using subscripting: `dataframes["<dataframe_name>"]`
- Using the method `.get_dataframe("<dataframe_name>")`
```python
case.dataframes.dataframe
# is the same as
case.dataframes['dataframe']
# and
case.dataframes.get_dataframe('dataframe')
```
### class CaseItem(...)

An object representing a piece of material or evidence associated with a Case. See in docs.
Contents:
- properties
- data
- metadata
- information
- whois
- links
- references
- contains
- files
- relations
- user_assessments
Item contents can be selected by:

- Entering `<item_name>.<attribute_name>`
- Using subscripting: `<item_name>["<attribute_name>"]`
- Using get methods, e.g.:
  - `.get_data()`
  - `.get_metadata()`
  - `.get_info()`
```python
item.data
# is the same as
item['data']
# and
item.get_data()
```
You can retrieve a CaseItem from a CaseItemSet object by:

- Entering `items.<item_name>`
- Subscripting using its name: `items["<item_name>"]`
- Subscripting using a numeric index, in the same style as a list: `items[index]`
- Using the method `.get_item("<item_name>")`
```python
items.item
# is the same as
items['item']
# and (if 0 is the item's index position)
items[0]
# and
items.get_item('item')
```
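Supporting both name-based and positional subscripting needs only a type check on the key. A hypothetical sketch (`MockItemSet` is illustrative, not the real CaseItemSet):

```python
class MockItemSet:
    """Toy stand-in for a CaseItemSet (illustrative only): items can be
    fetched by name or by insertion position."""
    def __init__(self):
        self._items = {}                 # dicts preserve insertion order

    def add(self, name, item):
        self._items[name] = item

    def __getitem__(self, key):
        if isinstance(key, int):         # items[0], items[1], ...
            return list(self._items.values())[key]
        return self._items[key]          # items['<item_name>']

    def get_item(self, name):
        return self._items[name]

items = MockItemSet()
items.add('report', {'data': '...'})
print(items['report'] is items[0] is items.get_item('report'))  # → True
```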
Key methods:
- add_metadata: Adds single metadata entry to an item’s metadata dataframe.
- add_data: Adds single data entry to an item’s data dataframe.
- add_info: Adds a single information entry to object.
- add_link: Adds a link to an item’s list of links.
- get_data: Returns item’s data.
- get_metadata: Returns item’s metadata.
- get_info: Returns item’s information.
- get_url: Returns URL metadata.
- scrape_url: Scrapes data from item URL’s site.
- crawl_web_from_url: Runs web crawl from item’s URL metadata.
- export_excel: Exports item as Excel (.xlsx) file.
### class CaseNetwork(igraph.Graph)

A modified igraph.Graph object. It provides additional analytics methods and functionality for Case management. CaseNetworks can be converted to and from both igraph and NetworkX graph objects. See in docs.
Key attributes:
- vs['name']: returns a list of vertex names.
- es['name']: returns a list of edge names.
- es['weight']: returns a list of edge weights.
Key methods:
- attributes: returns the network's global attributes.
- summary: Returns the summary of the network.
- vs.attributes: returns a list of the names of all vertex attributes.
- es.attributes: returns a list of the names of all edge attributes.
- get_adjacency: Returns the adjacency matrix of the network.
- degree: Returns some vertex degrees from the network.
- density: Calculates the density of the network.
- average_path_length: Calculates the average path length in the network.
- diameter: Calculates the diameter of the network.
- betweenness: Calculates or estimates the betweenness of vertices in the network.
- eigenvector_centrality: Calculates the eigenvector centralities of the vertices in the network.
- all_centralities: Calculates all centrality measures for network. Returns as a dataframe.
- colinks: Runs a colink analysis on the network. Returns a dataframe.
- community_detection: Identifies communities in the network. Gives the option of using different algorithms.
- degrees_dataframe: Returns the network's degree distribution as a dataframe.
- export_network: Exports network to one of a variety of graph file types. Defaults to .graphML.
- to_networkx: Converts the CaseNetwork to networkx format.
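For intuition about the simpler metrics above: degree counts a vertex's neighbours, and density is the ratio of edges present to edges possible. A stdlib sketch on a dict-of-sets adjacency (illustrative only; CaseNetwork uses igraph's implementations):

```python
def degrees(adj):
    """Degree of each vertex in an undirected graph given as
    {vertex: set_of_neighbours}."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def density(adj):
    """Edges present divided by edges possible, n*(n-1)/2 for n vertices."""
    n = len(adj)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2   # each edge counted twice
    return 2 * edges / (n * (n - 1)) if n > 1 else 0.0

adj = {'a': {'b', 'c'}, 'b': {'a'}, 'c': {'a'}}
print(degrees(adj))   # → {'a': 2, 'b': 1, 'c': 1}
print(density(adj))   # → 0.666... (2 of 3 possible edges)
```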
### class CaseFile(CaseObject)

An object which stores details about a digital file associated with a case or piece of evidence. See in docs.
Contents:
- path: the file's filepath in the directory.
- name: the file's filename.
- suffix: the file's filetype or extension.
- type: the type of directory object, i.e. directory, folder, file, etc.
- absolute: the file's absolute filepath.
- parent: the filepath of the file's parent directory.
- root: the filepath of the root directory.
- and more...
Key methods:

- get_children
- listdir
- walk
- scandir
Key package-level functions:

- open_case: Opens a Case from a file.
- save: Saves a Case to its source file. If no file exists, requests file details from user input.
- save_as: Saves a Case to a file. Requests file details from user input.
- get_backups: Returns the Backups directory and registry.
- set_default_case: Sets a case as the default in the environment.
- get_default_case: Returns the default case.
- import_case_excel: Imports a Case from a formatted Excel (.xlsx) file.
- import_case_csv_folder: Imports a Case from a folder of formatted CSV (.csv) files.
- import_case_txt: Imports a Case from a pickled text file (.txt or .case).
- read_pdf: Loads and parses PDF file. Returns a dictionary of data.
- read_pdf_url: Downloads and parses PDF file from a URL. Returns a dictionary of data.
- get_coordinates_location: Takes coordinates and returns the associated location using Geopy’s geocoder.
- get_location_address: Takes location details and returns the associated address using Geopy’s geocoder.
- get_location_coordinates: Takes location details and returns the associated coordinates using Geopy’s geocoder.
- lookup_whois: Performs a WhoIs lookup on an inputted domain or IP address.
- open_url: Opens URL in the default web browser.
- open_url_source: Opens URL’s source code in the default web browser.
- search_web: Launches a website-specific Google search for an inputted query and URL.
- multi_search_web: Launches multiple web searches by iterating on a query through a list of terms.
- search_images: Launches an image search using the default web browser.
- search_social_media: Launches a Google search focused on specified social media platform for inputted query.
- crawl_site: Crawls website’s internal pages. Returns any links found as a list.
- crawl_web: Crawls internet from a single URL or list of URLs. Returns details like links found, HTML scraped, and site metadata.
- search_username: Runs a Sherlock search for a username.
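At the core of crawl_site-style functions is extracting links from fetched HTML. A stdlib sketch of that step using html.parser (illustrative only; IDEA's crawler also handles fetching, politeness, and metadata, and `LinkExtractor` is a hypothetical helper):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(urljoin(self.base_url, value))

html = '<p><a href="/about">About</a> <a href="https://example.org/x">External</a></p>'
extractor = LinkExtractor('https://example.com/')
extractor.feed(html)
print(extractor.links)  # → ['https://example.com/about', 'https://example.org/x']
```

A crawler would queue each extracted link, fetch it, and repeat up to a visit limit, as the `visit_limit` parameter shown earlier suggests.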
For the full documentation, click here.
IDEA was created by Jamie Hancock.
It relies on packages, modules, and datasets created by:
- Geocoder: Denis Carriere
- Geopy: Adam Tygart et al.
- Shodan: John Matherly
- Sherlock: Siddharth Dushantha et al.
- Instaloader: Alexander Graf et al.
- youtube-comment-downloader: Egbert Bouman
- youtube-dl: Ricardo Garcia Gonzalez et al.
- RPy2: Laurent Gautier
- ERGM: Mark S. Handcock et al.
- python-whois: Richard Penman
- ipwhois: Phillip Hane
- Trafilatura: Adrien Barbaresi
- Cloudscraper: VeNoMouS
- Levenshtein: Max Bachmann
- names-dataset: Philippe Remy
- country_list: Niels Lemmens
- geonamescache: Ramiro Gómez, using GeoNames
- langcodes: Elia Robyn Lake (Robyn Speer)
- language_data: Elia Robyn Lake (Robyn Speer)
IDEA is licensed under GPL-3.0.
IDEA is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
IDEA is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with IDEA. If not, see https://www.gnu.org/licenses/.