A chill, open-source web engine that crawls, indexes, and vibes with web content using semantic search.
Froxy is a modular full-stack web engine designed to crawl web pages, extract content, and index it using semantic embeddings for intelligent search, all powered by modern tools. It includes:
- A Go-based crawler (aka the spider 🕷️) with real-time indexing
- FastEmbed service for generating semantic embeddings
- Qdrant vector database for semantic search
- Froxy Apex - AI-powered intelligent search (Perplexity-style)
- A PostgreSQL database for structured data
- A Next.js front-end UI (fully integrated with real APIs)
This project is built for learning, experimenting, and extending. It's great for developers who want to understand how modern semantic search engines work from scratch.
Fun fact: I made this project in just 3 days, so it might not be perfect, but you know what? It works!
(We'll keep evolving this codebase together ❤️)
Note: I prefer simplicity over unnecessary complexity. For now the architecture is simple, clean, and straightforward: no fancy stuff, no over-engineering. It started as a fun project, nothing more, and if needed we can scale it and make it more advanced later. <3
- Crawl websites with real-time indexing (Go)
- Semantic search using embeddings (FastEmbed + Qdrant)
- AI-powered intelligent search with LLM integration (Froxy Apex)
- Vector similarity search for intelligent results
- Chunk-based relevance scoring with cosine similarity
- Store structured data in PostgreSQL
- Modern UI in Next.js + Tailwind
- Fully containerized with Docker
The frontend is fully connected to the backend and provides semantic search capabilities.
froxy/
├── front-end/            # Next.js frontend
│   ├── app/              # App routes (search, terms, about, etc.)
│   ├── components/       # UI components (shadcn-style)
│   ├── hooks/            # React hooks
│   ├── lib/              # Utility logic
│   ├── public/           # Static assets
│   └── styles/           # TailwindCSS setup
├── indexer-search/       # Node.js search backend
│   ├── lib/
│   ├── functions/
│   ├── services/         # DB + search service
│   └── utils/            # Helper utilities
├── froxy-apex/           # AI-powered intelligent search service
│   ├── api/              # API endpoints
│   ├── db/               # Database connections
│   ├── functions/        # AI processing logic
│   ├── llama/            # LLM integration
│   ├── models/           # Data models
│   └── utils/            # Helper utilities
├── spider/               # Web crawler in Go with real-time indexing
│   ├── db/               # DB handling (PostgreSQL + Qdrant)
│   ├── functions/        # Crawl + indexing logic + proxies (if needed)
│   ├── models/           # Data models
│   └── utils/            # Misc helpers
├── fastembed/            # FastEmbed embedding service
│   ├── models/           # Cached embedding models
│   └── docker-compose.yml
├── qdrant/               # Qdrant vector database
│   └── docker-compose.yml
├── db/                   # PostgreSQL database
│   ├── scripts/          # Shell backup scripts
│   └── docker-compose.yml
├── froxy.sh              # Automated setup & runner script
├── LICENSE               # MIT License
└── readme.md             # This file
- Node.js (18+)
- pnpm or npm
- Go (1.23+)
- Docker & Docker Compose
- At least 2GB RAM (for embedding service)
For the fastest crawler setup without dealing with configuration details:
# Make the script executable and run it
chmod +x froxy.sh
./froxy.sh
The script will automatically:
- Set up all environment variables with default values
- Create the Docker network
- Start all required services (PostgreSQL, Qdrant, FastEmbed)
- Health check all containers
- Guide you through the crawling process
Note: The froxy.sh script only handles the crawler setup. You'll need to manually start the froxy-apex AI service and front-end after crawling.
If you prefer to set things up manually:
# 1. Create Docker network
docker network create froxy-network
# 2. Start Qdrant vector database
cd qdrant
docker-compose up -d --build
# 3. Start PostgreSQL database
cd ../db
# Set proper permissions for PostgreSQL data directory
sudo chown -R 999:999 postgres_data/
docker-compose up -d --build
# 4. Start FastEmbed service
cd ../fastembed
docker-compose up -d --build
# 5. Wait for all services to be healthy, then run the crawler
cd ../spider
go run main.go
# 6. After crawling, start the search backend
cd ../indexer-search
npm install
npm start
# 7. Start the AI-powered search service (Froxy Apex)
# Make sure to configure froxy-apex/.env first
cd ../froxy-apex
go run main.go
# 8. Launch the front-end
cd ../front-end
npm i --legacy-peer-deps
npm run dev
All services use these environment variables (automatically set by froxy.sh):
# Database Configuration (for spider & indexer-search)
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
# Vector Database Configuration
QDRANT_API_KEY=froxy-secret-key
QDRANT_HOST=http://localhost:6333
# FastEmbed Service
EMBEDDING_HOST=http://localhost:5050
# AI Service (for froxy-apex)
LLM_API_KEY=your_groq_api_key
API_KEY=your_froxy_apex_api_key
POSTGRES_DB=froxy_db
POSTGRES_USER=froxy_user
POSTGRES_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
QDRANT_API_KEY=froxy-secret-key
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
QDRANT_API_KEY=froxy-secret-key
EMBEDDING_HOST=http://localhost:5050
LLM_API_KEY=your_groq_api_key
QDRANT_HOST=http://localhost:6333
EMBEDDING_HOST=http://localhost:5050
API_KEY=your_froxy_apex_api_key
QDRANT_API_KEY=froxy-secret-key
API_URL=http://localhost:8080
API_KEY=your_api_key
WEBSOCKET_URL=ws://localhost:8080/ws/search
FROXY_APEX_API_KEY=your_froxy_apex_api_key
ACCESS_CODE=auth_access_for_froxy_apex_ui
AUTH_SECRET_TOKEN=jwt_token_for_apex_ui_to_calc_the_usage
💡 The froxy.sh script automatically creates .env files with working default values for the crawler and database services. You'll need to manually configure froxy-apex/.env and front-end/.env for the AI search and UI components.
- Crawler pulls website content from your provided URLs
- Real-time indexing generates semantic embeddings using FastEmbed
- Qdrant stores vector embeddings for semantic similarity search
- PostgreSQL stores structured metadata
- Frontend provides intelligent semantic search interface
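The vector lookup in this flow could be sketched as a plain HTTP call to Qdrant's REST search endpoint. This is only an assumption about how a client might talk to Qdrant: the collection name "pages", the searchBody type, and the newSearchRequest helper are hypothetical, and the project's own services may use a client library or a different endpoint instead.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// searchBody is the JSON payload for Qdrant's points/search endpoint.
type searchBody struct {
	Vector      []float32 `json:"vector"`
	Limit       int       `json:"limit"`
	WithPayload bool      `json:"with_payload"`
}

// newSearchRequest builds (but does not send) a similarity search
// request. host is QDRANT_HOST, apiKey is QDRANT_API_KEY, and
// collection is a hypothetical collection name.
func newSearchRequest(host, apiKey, collection string, vector []float32, limit int) (*http.Request, error) {
	payload, err := json.Marshal(searchBody{Vector: vector, Limit: limit, WithPayload: true})
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(host, "/") + "/collections/" + collection + "/points/search"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("api-key", apiKey) // Qdrant reads the key from this header
	return req, nil
}

func main() {
	// The vector would normally come from the FastEmbed service;
	// these numbers are placeholders.
	req, err := newSearchRequest("http://localhost:6333", "froxy-secret-key", "pages", []float32{0.12, 0.48, 0.31}, 5)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

Passing the request to http.DefaultClient.Do would return the nearest points together with their stored payload.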
- User query is received and processed
- Query enhancement using Llama 3.1 8B via Groq API
- Embedding generation for the enhanced query using FastEmbed
- Vector search in Qdrant to find relevant pages
- Content chunking of relevant pages for detailed analysis
- Cosine similarity calculation for each chunk against the query
- LLM processing to generate structured response with:
  - Concise summary
  - Detailed results with sources
  - Relevance scores
  - Reference links and favicons
  - Confidence ratings
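The chunking and cosine similarity steps above can be sketched in a few lines of Go. This is a minimal sketch under stated assumptions, not the Froxy Apex implementation: splitting by a fixed word count is just one possible chunking strategy, the function names are hypothetical, and the vectors in main are hand-written stand-ins for real FastEmbed embeddings.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// chunkWords splits text into chunks of at most size words.
// Fixed-size word windows are an assumption for this sketch.
func chunkWords(text string, size int) []string {
	if size <= 0 {
		return nil
	}
	words := strings.Fields(text)
	var chunks []string
	for start := 0; start < len(words); start += size {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
	}
	return chunks
}

// cosine returns the cosine similarity of a and b. It returns 0 when
// the lengths differ or either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// scoredChunk pairs a chunk index with its similarity to the query.
type scoredChunk struct {
	Index int
	Score float64
}

// rankChunks scores every chunk vector against the query vector and
// returns the results sorted from most to least similar.
func rankChunks(query []float64, chunkVecs [][]float64) []scoredChunk {
	out := make([]scoredChunk, len(chunkVecs))
	for i, v := range chunkVecs {
		out[i] = scoredChunk{Index: i, Score: cosine(query, v)}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	chunks := chunkWords("semantic search finds content by meaning not keywords", 3)
	fmt.Println("chunks:", len(chunks))

	query := []float64{1, 0}
	vecs := [][]float64{{0, 1}, {1, 0}, {1, 1}}
	for _, s := range rankChunks(query, vecs) {
		fmt.Printf("chunk %d score %.3f\n", s.Index, s.Score)
	}
}
```

The highest-scoring chunks are the ones that would be handed to the LLM as context, which keeps the prompt small while still covering the most relevant parts of each page.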
{
  "summary": "Concise overview addressing the query directly",
  "results": [
    {
      "point": "Detailed information in markdown format",
      "reference": "https://exact-source-url.com",
      "reference_favicon": "https://exact-source-url.com/favicon.ico",
      "relevance_score": 0.95,
      "timestamp": "when this info was published/updated"
    }
  ],
  "language": "detected_language_code",
  "last_updated": "timestamp",
  "confidence": 0.90
}
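A client consuming this response could decode it into typed structs. The sketch below mirrors the JSON keys shown above; the Go type names (ApexResponse, Result) and the parseResponse helper are hypothetical and chosen for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors one entry of the "results" array.
type Result struct {
	Point            string  `json:"point"`
	Reference        string  `json:"reference"`
	ReferenceFavicon string  `json:"reference_favicon"`
	RelevanceScore   float64 `json:"relevance_score"`
	Timestamp        string  `json:"timestamp"`
}

// ApexResponse mirrors the top-level response object.
type ApexResponse struct {
	Summary     string   `json:"summary"`
	Results     []Result `json:"results"`
	Language    string   `json:"language"`
	LastUpdated string   `json:"last_updated"`
	Confidence  float64  `json:"confidence"`
}

// parseResponse decodes a raw JSON body into an ApexResponse.
func parseResponse(data []byte) (ApexResponse, error) {
	var r ApexResponse
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	raw := []byte(`{
		"summary": "Concise overview addressing the query directly",
		"results": [{
			"point": "Detailed information in markdown format",
			"reference": "https://exact-source-url.com",
			"reference_favicon": "https://exact-source-url.com/favicon.ico",
			"relevance_score": 0.95,
			"timestamp": "when this info was published/updated"
		}],
		"language": "detected_language_code",
		"last_updated": "timestamp",
		"confidence": 0.90
	}`)

	resp, err := parseResponse(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%d result(s), confidence %.2f\n", len(resp.Results), resp.Confidence)
	for _, r := range resp.Results {
		fmt.Printf("- %s (score %.2f)\n", r.Reference, r.RelevanceScore)
	}
}
```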
When you run the spider, you'll be prompted to:
- Enter URLs you want to crawl
- Set the number of workers (default: 5)
The crawler will:
- Extract content from each page
- Generate embeddings in real-time
- Store vectors in Qdrant
- Store metadata in PostgreSQL
Since froxy.sh only handles the crawler, you'll need to manually configure:
- Froxy Apex: Set up your Groq API key and other environment variables
- Frontend: Configure API endpoints and keys
- Service startup: Start each service individually after crawler completes
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Next.js UI    │───▶│  Search Backend  │───▶│   PostgreSQL    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │
         │                      ▼
         │             ┌──────────────────┐    ┌─────────────────┐
         │             │      Qdrant      │◀───│    FastEmbed    │
         │             │ (Vector Search)  │    │  (Embeddings)   │
         │             └──────────────────┘    └─────────────────┘
         │                      ▲                       ▲
         │                      │                       │
         │             ┌──────────────────┐             │
         │             │    Go Crawler    │─────────────┘
         │             │   (Real-time     │
         │             │    Indexing)     │
         │             └──────────────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Froxy Apex    │───▶│   Groq LLM API   │    │ Chunk Analysis  │
│   (AI Search)   │    │  (Llama 3.1 8B)  │◀───│  (Cosine Sim)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
- Go (Golang): crawler with real-time indexing
- FastEmbed: embedding generation service
- Qdrant: vector database for semantic search
- Froxy Apex: AI-powered search with LLM integration
- Llama 3.1 8B: language model via Groq API
- Node.js: search backend API
- PostgreSQL: structured data storage
- Next.js: frontend interface
- TailwindCSS + shadcn/ui: UI components
- Docker: containerized services
- Docker Network: service communication
- AI-Powered Search: Perplexity-style intelligent search with LLM integration
- Semantic Search: Find content by meaning, not just keywords
- Real-time Indexing: Content is indexed as it's crawled
- Vector Similarity: Intelligent search results based on context
- Chunk Analysis: Deep content analysis with cosine similarity
- Structured Responses: Rich JSON responses with sources and confidence scores
- Query Enhancement: AI-powered query understanding and improvement
- Scalable Architecture: Microservices with Docker containers
- Automated Setup: One-command deployment with froxy.sh
- Fork it
- Open a PR
- Share your ideas
MIT: feel free to fork, remix, and learn from it.
Made with ❤️ for the curious minds of the internet.
Stay weird. Stay building.
"Not all who wander are lost; some are just crawling the web with semantic understanding."