App is hosted at: https://data-scrapper.up.railway.app
A modern application to discover, crawl, and extract structured knowledge from company websites. It provides a user-friendly web interface for extracting company information, discovering related URLs, blogs, and technical content, and organizing the results for further analysis.
- Company Crawler: Crawls a company website to find all internal pages, blogs, and related URLs.
- Company Info Extraction: Extracts company name, description, industry, founders, key people, and social media links using LLMs or fallback methods.
- Founder Discovery: Finds founders via web search and on-site analysis.
- Blog & External Mention Discovery: Identifies blog posts, founder blogs, and external mentions using the Google Custom Search API and LLMs.
- Knowledge Scrapper: Scrapes and processes technical content from discovered URLs, supporting HTML, PDF, and plain text.
- Database Integration: Stores extracted knowledge in a MongoDB database for search and statistics.
- Modern Web UI: Elegant Flask-based interface for running crawls, scrapes, and viewing results.
- File Management: Delete URL files and their corresponding subpage files to clean up storage.
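As a rough illustration of the internal-page discovery described in the Company Crawler feature, the sketch below collects same-host links from a single HTML page using only the Python standard library. It is not the project's actual crawler: `LinkCollector` and `internal_links` are hypothetical names, and a real crawl would also need fetching, rate limiting, and a visited-set across pages.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(base_url, html):
    """Return the sorted, de-duplicated links in html that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment so that
        # /about and /about#team count as the same page.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            found.add(absolute)
    return sorted(found)


page = """
<a href="/blog/post-1">Post</a>
<a href="about#team">About</a>
<a href="https://twitter.com/example">Twitter</a>
<a href="mailto:hi@example.com">Mail</a>
"""
print(internal_links("https://example.com/", page))
```

Links to other hosts (such as social media profiles) are filtered out here; a fuller version might keep them in a separate list, since the feature list also mentions social media links and external mentions.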
You can check existing data for the following team IDs:
- aline123
- groove123
- numeric20
Follow these steps for a typical workflow using the web UI:
- Open Crawler and Add Company Webpage
- Go to Scrapper and Add Team ID
- Return to Crawler and Enable External URL Search
- Go to Scrapper and Turn Off Iterative Subdirectory Discovery
- This step will yield nearly all the detailed technical information available about the company.
- Install dependencies:
pip install -r requirements.txt
- Start the UI:
python UI/run_ui.py
- Open http://localhost:5000 in your browser.
Create a .env file in the root directory with the following variables:
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_CSE_ID=your_google_custom_search_engine_id_here
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.0-flash-lite
# MongoDB Configuration
MONGODB_URI=mongodb+srv://sh.2uayc9a.mongodb.net/
MONGODB_DATABASE=your_database_name_here
MONGODB_COLLECTION=your_collection_name_here
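Before starting the UI, it can help to confirm that every variable listed above is present. The following is a minimal sketch of such a check, assuming the simple `KEY=value` format shown here; `parse_env` and `missing_keys` are hypothetical helpers and not part of the project, which may load its configuration differently.

```python
# Variable names taken from the .env listing above.
REQUIRED_KEYS = (
    "GOOGLE_API_KEY",
    "GOOGLE_CSE_ID",
    "GEMINI_API_KEY",
    "GEMINI_MODEL",
    "MONGODB_URI",
    "MONGODB_DATABASE",
    "MONGODB_COLLECTION",
)


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URIs with
        # query strings keep any later "=" characters.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty, in listing order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
GOOGLE_API_KEY=abc
# MongoDB Configuration
MONGODB_URI=mongodb+srv://host/?retryWrites=true
"""
print(missing_keys(parse_env(sample)))
```

To check a real file, pass its contents instead of `sample`, for example `parse_env(open(".env").read())`.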