
🕸️ Froxy – A chill open-source web indexing engine built with Go, Node.js, and Next.js. Crawls, analyzes, and serves structured web data with TF-IDF magic and Supabase as the brain.


MultiX0/froxy


🕷️ Froxy

A chill, open-source web engine that crawls, indexes, and vibes with web content using semantic search.

froxy banner


💡 What is Froxy?

Froxy is a modular full-stack web engine designed to crawl web pages, extract content, and index it using semantic embeddings for intelligent search, all powered by modern tools. It includes:

  • A Go-based crawler (aka the spider 🕷️) with real-time indexing
  • FastEmbed service for generating semantic embeddings
  • Qdrant vector database for semantic search
  • Froxy Apex - AI-powered intelligent search (Perplexity-style)
  • A PostgreSQL database for structured data
  • A Next.js front-end UI (fully integrated with real APIs)

This project is built for learning, experimenting, and extending. It is great for developers who want to understand how modern semantic search engines work from scratch.

Fun fact: I made this project in just 3 days, so it might not be perfect, but you know what? It works!

(We'll keep evolving this codebase together ❤️)

Note: I prefer simplicity over unnecessary complexity. We might make the architecture more advanced in the future, but for now it's simple, clean, and straightforward: no fancy stuff, no over-engineering. It's just a chill project for now. If needed, we can scale and make it more complex later. After all, it started as a fun project, nothing more. <3


🔍 Features

  • 🌐 Crawl websites with real-time indexing (Go)
  • 🧠 Semantic search using embeddings (FastEmbed + Qdrant)
  • 🤖 AI-powered intelligent search with LLM integration (Froxy Apex)
  • 🚀 Vector similarity search for intelligent results
  • 📊 Chunk-based relevance scoring with cosine similarity
  • 🕺 Store structured data in PostgreSQL
  • 🎨 Modern UI in Next.js + Tailwind
  • 🐳 Fully containerized with Docker

The frontend is fully connected to the backend and provides semantic search capabilities.


📂 Folder Structure

froxy/
├── front-end/          # Next.js frontend
│   ├── app/            # App routes (search, terms, about, etc.)
│   ├── components/     # UI components (shadcn-style)
│   ├── hooks/          # React hooks
│   ├── lib/            # Utility logic
│   ├── public/         # Static assets
│   └── styles/         # TailwindCSS setup
├── indexer-search/     # Node.js search backend
│   └── lib/
│       ├── functions/
│       ├── services/   # DB + search service
│       └── utils/      # Helper utilities
├── froxy-apex/         # AI-powered intelligent search service
│   ├── api/            # API endpoints
│   ├── db/             # Database connections
│   ├── functions/      # AI processing logic
│   ├── llama/          # LLM integration
│   ├── models/         # Data models
│   └── utils/          # Helper utilities
├── spider/             # Web crawler in Go with real-time indexing
│   ├── db/             # DB handling (PostgreSQL + Qdrant)
│   ├── functions/      # Crawl + indexing logic + proxies (if needed)
│   ├── models/         # Data models
│   └── utils/          # Misc helpers
├── fastembed/          # FastEmbed embedding service
│   ├── models/         # Cached embedding models
│   └── docker-compose.yml
├── qdrant/             # Qdrant vector database
│   └── docker-compose.yml
├── db/                 # PostgreSQL database
│   ├── scripts/        # Shell backups
│   └── docker-compose.yml
├── froxy.sh            # Automated setup & runner script
├── LICENSE             # MIT License
└── readme.md           # This file

⚙️ Getting Started

Requirements

  • Node.js (18+)
  • pnpm or npm
  • Go (1.23+)
  • Docker & Docker Compose
  • At least 2GB RAM (for embedding service)

Quick Setup (Recommended for Crawler)

For the fastest crawler setup without dealing with configuration details:

# Make the script executable and run it
chmod +x froxy.sh
./froxy.sh

The script will automatically:

  • Set up all environment variables with default values
  • Create the Docker network
  • Start all required services (PostgreSQL, Qdrant, FastEmbed)
  • Health check all containers
  • Guide you through the crawling process

Note: The froxy.sh script only handles the crawler setup. You'll need to manually start the froxy-apex AI service and front-end after crawling.

Manual Setup

If you prefer to set things up manually:

# 1. Create Docker network
docker network create froxy-network

# 2. Start Qdrant vector database
cd qdrant
docker-compose up -d --build

# 3. Start PostgreSQL database
cd ../db
# Set proper permissions for PostgreSQL data directory
sudo chown -R 999:999 postgres_data/
docker-compose up -d --build

# 4. Start FastEmbed service
cd ../fastembed
docker-compose up -d --build

# 5. Wait for all services to be healthy, then run the crawler
cd ../spider
go run main.go

# 6. After crawling, start the search backend
cd ../indexer-search
npm install
npm start

# 7. Start the AI-powered search service (Froxy Apex)
# Make sure to configure froxy-apex/.env first
cd ../froxy-apex
go run main.go

# 8. Launch the front-end
cd ../front-end
npm i --legacy-peer-deps
npm run dev

🔐 Environment Variables

Default Configuration

The services use these environment variables. froxy.sh sets the database, Qdrant, and FastEmbed values automatically; the froxy-apex keys must be set by hand:

# Database Configuration (for spider & indexer-search)
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable

# Vector Database Configuration
QDRANT_API_KEY=froxy-secret-key
QDRANT_HOST=http://localhost:6333

# FastEmbed Service
EMBEDDING_HOST=http://localhost:5050

# AI Service (for froxy-apex)
LLM_API_KEY=your_groq_api_key
API_KEY=your_froxy_apex_api_key

Service-Specific Variables

db/.env

POSTGRES_DB=froxy_db
POSTGRES_USER=froxy_user
POSTGRES_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable

qdrant/.env

QDRANT_API_KEY=froxy-secret-key

spider/.env & indexer-search/.env

DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
QDRANT_API_KEY=froxy-secret-key
EMBEDDING_HOST=http://localhost:5050

froxy-apex/.env

LLM_API_KEY=your_groq_api_key
QDRANT_HOST=http://localhost:6333
EMBEDDING_HOST=http://localhost:5050
API_KEY=your_froxy_apex_api_key
QDRANT_API_KEY=froxy-secret-key

front-end/.env

API_URL=http://localhost:8080
API_KEY=your_api_key
WEBSOCKET_URL=ws://localhost:8080/ws/search
FROXY_APEX_API_KEY=your_froxy_apex_api_key
ACCESS_CODE=auth_access_for_froxy_apex_ui
AUTH_SECRET_TOKEN=jwt_token_for_apex_ui_to_calc_the_usage

💡 The froxy.sh script automatically creates .env files with working default values for the crawler and database services. You'll need to manually configure froxy-apex/.env and front-end/.env for the AI search and UI components.


🤔 How it works

Traditional Search

  1. Crawler pulls website content from your provided URLs
  2. Real-time indexing generates semantic embeddings using FastEmbed
  3. Qdrant stores vector embeddings for semantic similarity search
  4. PostgreSQL stores structured metadata
  5. Frontend provides intelligent semantic search interface
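The similarity lookup in step 3 can be sketched as a plain HTTP call. This example assumes Qdrant's REST search endpoint (`POST /collections/{name}/points/search`, authenticated with an `api-key` header), which newer Qdrant releases supplement with a query endpoint. The collection name `pages` and the toy vector are placeholders, because this README does not document Froxy's actual collection or embedding dimension. The program only builds the request and does not send it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// searchBody is the JSON payload for a Qdrant points search.
type searchBody struct {
	Vector      []float32 `json:"vector"`
	Limit       int       `json:"limit"`
	WithPayload bool      `json:"with_payload"`
}

// newSearchRequest builds (but does not send) a similarity search request.
// The collection name comes from the caller; Froxy's real one is not documented here.
func newSearchRequest(host, apiKey, collection string, vector []float32, limit int) (*http.Request, error) {
	body, err := json.Marshal(searchBody{Vector: vector, Limit: limit, WithPayload: true})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/collections/%s/points/search", host, collection)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("api-key", apiKey)
	return req, nil
}

func main() {
	host := os.Getenv("QDRANT_HOST")
	if host == "" {
		host = "http://localhost:6333"
	}
	// "pages" and the 3-dimensional vector are placeholders for illustration only.
	req, err := newSearchRequest(host, os.Getenv("QDRANT_API_KEY"), "pages", []float32{0.1, 0.2, 0.3}, 5)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

In a real query the vector would come from the FastEmbed service and must match the dimension the collection was created with.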

AI-Powered Search (Froxy Apex)

  1. User query is received and processed
  2. Query enhancement using Llama 3.1 8B via Groq API
  3. Embedding generation for the enhanced query using FastEmbed
  4. Vector search in Qdrant to find relevant pages
  5. Content chunking of relevant pages for detailed analysis
  6. Cosine similarity calculation for each chunk against the query
  7. LLM processing to generate structured response with:
    • Concise summary
    • Detailed results with sources
    • Relevance scores
    • Reference links and favicons
    • Confidence ratings
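Steps 5 and 6 (chunking and per-chunk cosine similarity) can be sketched as follows. This is a simplified illustration rather than Froxy Apex's actual code: it assumes fixed-size word chunks, and it uses small hand-written vectors where the real pipeline would use FastEmbed embeddings. The names `chunkWords`, `cosine`, and `rankChunks` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// chunkWords splits text into chunks of at most size words (size must be > 0).
func chunkWords(text string, size int) []string {
	if size <= 0 {
		return nil
	}
	words := strings.Fields(text)
	var chunks []string
	for start := 0; start < len(words); start += size {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
	}
	return chunks
}

// cosine returns the cosine similarity of a and b, or 0 for mismatched or zero vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// scored pairs a chunk index with its similarity to the query.
type scored struct {
	Index int
	Score float64
}

// rankChunks scores every chunk vector against the query, highest score first.
func rankChunks(query []float64, chunkVecs [][]float64) []scored {
	out := make([]scored, len(chunkVecs))
	for i, v := range chunkVecs {
		out[i] = scored{Index: i, Score: cosine(query, v)}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	page := "Qdrant stores vectors. PostgreSQL stores metadata. The crawler indexes pages in real time."
	chunks := chunkWords(page, 5)
	// Toy 2-dimensional vectors, one per chunk; real embeddings come from FastEmbed.
	vectors := [][]float64{{0.9, 0.1}, {0.4, 0.6}, {0.1, 0.9}}
	query := []float64{1, 0}
	for _, s := range rankChunks(query, vectors) {
		fmt.Printf("%.3f  %s\n", s.Score, chunks[s.Index])
	}
}
```

The top-ranked chunks are the ones that would be handed to the LLM in step 7, so the chunk size trades context per chunk against precision of the ranking.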

Response Format

{
  "summary": "Concise overview addressing the query directly",
  "results": [
    {
      "point": "Detailed information in markdown format",
      "reference": "https://exact-source-url.com",
      "reference_favicon": "https://exact-source-url.com/favicon.ico",
      "relevance_score": 0.95,
      "timestamp": "when this info was published/updated"
    }
  ],
  "language": "detected_language_code",
  "last_updated": "timestamp",
  "confidence": 0.90
}
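A client written in Go could decode this response with structs like the ones below. The field names follow the JSON keys shown above; the type names `ApexResponse` and `Result` and the helper `parseResponse` are illustrative, and the timestamps are kept as strings because the README does not specify their format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result is one entry of the "results" array.
type Result struct {
	Point            string  `json:"point"`
	Reference        string  `json:"reference"`
	ReferenceFavicon string  `json:"reference_favicon"`
	RelevanceScore   float64 `json:"relevance_score"`
	Timestamp        string  `json:"timestamp"`
}

// ApexResponse is the top-level response object (illustrative type name).
type ApexResponse struct {
	Summary     string   `json:"summary"`
	Results     []Result `json:"results"`
	Language    string   `json:"language"`
	LastUpdated string   `json:"last_updated"`
	Confidence  float64  `json:"confidence"`
}

// parseResponse decodes a raw JSON body into an ApexResponse.
func parseResponse(data []byte) (ApexResponse, error) {
	var resp ApexResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

func main() {
	// Sample body with made-up values in the documented shape.
	raw := []byte(`{
	  "summary": "Example summary",
	  "results": [{
	    "point": "Example point",
	    "reference": "https://example.com",
	    "reference_favicon": "https://example.com/favicon.ico",
	    "relevance_score": 0.95,
	    "timestamp": "unknown"
	  }],
	  "language": "en",
	  "last_updated": "unknown",
	  "confidence": 0.9
	}`)
	resp, err := parseResponse(raw)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	for _, r := range resp.Results {
		fmt.Printf("%.2f  %s  (%s)\n", r.RelevanceScore, r.Point, r.Reference)
	}
}
```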

Crawling Process

When you run the spider, you'll be prompted to:

  • Enter URLs you want to crawl
  • Set the number of workers (default: 5)

The crawler will:

  • Extract content from each page
  • Generate embeddings in real-time
  • Store vectors in Qdrant
  • Store metadata in PostgreSQL
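The worker setting maps naturally onto a Go worker pool. The sketch below shows one common way to structure it, assuming a pool of goroutines reading URLs from a channel; it is not the spider's actual implementation. `crawlAll`, `pageResult`, and the `fetch` callback are hypothetical, and `fetch` stands in for the real extract, embed, and store steps.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultWorkers matches the default worker count mentioned in this README.
const defaultWorkers = 5

// pageResult records the outcome of processing one URL.
type pageResult struct {
	URL string
	Err error
}

// crawlAll processes urls with a fixed number of workers and collects the results.
// fetch is a placeholder for extracting content, embedding it, and storing it.
func crawlAll(urls []string, workers int, fetch func(string) error) []pageResult {
	if workers <= 0 {
		workers = defaultWorkers
	}
	jobs := make(chan string)
	results := make(chan pageResult)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- pageResult{URL: u, Err: fetch(u)}
			}
		}()
	}

	// Feed the jobs, then close results once every worker has finished.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	var out []pageResult
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	urls := []string{"https://example.com", "https://example.org"}
	// A no-op fetch keeps the sketch runnable without network access.
	results := crawlAll(urls, defaultWorkers, func(u string) error { return nil })
	for _, r := range results {
		fmt.Println(r.URL, "error:", r.Err)
	}
}
```

Results arrive in completion order, not input order, which is usually acceptable for indexing because each page is stored independently.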

Manual Service Configuration

Since froxy.sh only handles the crawler, you'll need to manually configure:

  • Froxy Apex: Set up your Groq API key and other environment variables
  • Frontend: Configure API endpoints and keys
  • Service startup: Start each service individually after crawler completes
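To show where the Groq key ends up, here is a hedged sketch of a query-enhancement request. It assumes Groq's OpenAI-compatible chat completions endpoint and the model id `llama-3.1-8b-instant`; both should be checked against Groq's current documentation. The system prompt and the helper `newEnhanceRequest` are invented for illustration and are not Froxy Apex's real prompt or code. The program only builds the request and does not send it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Assumed values: verify the endpoint and model id against Groq's documentation.
const (
	groqURL   = "https://api.groq.com/openai/v1/chat/completions"
	groqModel = "llama-3.1-8b-instant"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newEnhanceRequest builds (but does not send) a chat completion request
// asking the model to rewrite a search query. The prompt is a placeholder.
func newEnhanceRequest(apiKey, query string) (*http.Request, error) {
	payload := chatRequest{
		Model: groqModel,
		Messages: []chatMessage{
			{Role: "system", Content: "Rewrite the search query to be clearer and more specific. Reply with the query only."},
			{Role: "user", Content: query},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, groqURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}

func main() {
	req, err := newEnhanceRequest(os.Getenv("LLM_API_KEY"), "how does froxy index pages")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```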

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Next.js UI    │───▶│  Search Backend  │───▶│   PostgreSQL    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │
         │                       ▼
         │             ┌──────────────────┐    ┌─────────────────┐
         │             │     Qdrant       │◀───│   FastEmbed     │
         │             │ (Vector Search)  │    │   (Embeddings)  │
         │             └──────────────────┘    └─────────────────┘
         │                       ▲                      ▲
         │                       │                      │
         │             ┌──────────────────┐             │
         │             │   Go Crawler     │─────────────┘
         │             │  (Real-time      │
         │             │   Indexing)      │
         │             └──────────────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Froxy Apex    │───▶│   Groq LLM API   │    │  Chunk Analysis │
│ (AI Search)     │    │ (Llama 3.1 8B)   │◀───│ (Cosine Sim)    │
└─────────────────┘    └──────────────────┘    └─────────────────┘

📙 Tech Stack

  • 🕷️ Go (Golang) – crawler with real-time indexing
  • 🧠 FastEmbed – embedding generation service
  • 🚀 Qdrant – vector database for semantic search
  • 🤖 Froxy Apex – AI-powered search with LLM integration
  • 🦙 Llama 3.1 8B – language model via Groq API
  • 💪 Node.js – search backend API
  • 📀 PostgreSQL – structured data storage
  • ⚛️ Next.js – frontend interface
  • 🎨 TailwindCSS + shadcn/ui – UI components
  • 🐳 Docker – containerized services
  • 🌐 Docker Network – service communication

🚀 Key Improvements

  • AI-Powered Search: Perplexity-style intelligent search with LLM integration
  • Semantic Search: Find content by meaning, not just keywords
  • Real-time Indexing: Content is indexed as it's crawled
  • Vector Similarity: Intelligent search results based on context
  • Chunk Analysis: Deep content analysis with cosine similarity
  • Structured Responses: Rich JSON responses with sources and confidence scores
  • Query Enhancement: AI-powered query understanding and improvement
  • Scalable Architecture: Microservices with Docker containers
  • Automated Setup: One-command deployment with froxy.sh

📬 Want to contribute?

  • Fork it 🌛
  • Open a PR 🚰
  • Share your ideas 💡

📜 License

MIT. Feel free to fork, remix, and learn from it.


Made with ❤️ for the curious minds of the internet.

Stay weird. Stay building.

"Not all who wander are lost; some are just crawling the web with semantic understanding."
