This is a web scraping and conversion tool called Markdown Lab that combines Python and Rust components to scrape websites and convert HTML content to markdown format. It supports sitemap parsing, semantic chunking for RAG (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.
Key features include HTML-to-markdown conversion with support for various elements (headers, links, images, lists, code blocks), intelligent content chunking that preserves document structure, and systematic content discovery through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
- Scrapes any accessible website with robust error handling and rate limiting
- Parses sitemap.xml to discover and scrape the most relevant content
- Converts HTML to clean Markdown format
- Implements intelligent chunking for RAG (Retrieval-Augmented Generation) systems
- Handles various HTML elements:
  - Headers (h1-h6)
  - Paragraphs
  - Links with resolved relative URLs
  - Images with resolved relative URLs
  - Ordered and unordered lists
  - Blockquotes
  - Code blocks
- Preserves document structure
- Comprehensive logging
- Robust error handling with exponential backoff
- Performance optimizations and best practices
git clone https://github.com/ursisterbtw/markdown_lab.git
cd markdown_lab
pip install -r requirements.txt
# Build the Rust library
cargo build --release
python main.py https://www.example.com -o output.md
python main.py https://www.example.com -o output.md --save-chunks --chunk-dir my_chunks
python main.py https://www.example.com -o output_dir --use-sitemap --save-chunks
python main.py https://www.example.com -o output_dir \
--use-sitemap \
--min-priority 0.5 \
--include "blog/*" "products/*" \
--exclude "*.pdf" "temp/*" \
--limit 50 \
--save-chunks \
--chunk-dir my_chunks \
--requests-per-second 2.0
Argument | Description | Default |
---|---|---|
`url` | The URL to scrape | (required) |
`-o, --output` | Output markdown file/directory | `output.md` |
`--save-chunks` | Save content chunks for RAG | False |
`--chunk-dir` | Directory to save chunks | `chunks` |
`--chunk-format` | Format for chunks (`json`, `jsonl`) | `jsonl` |
`--chunk-size` | Maximum chunk size (chars) | 1000 |
`--chunk-overlap` | Overlap between chunks (chars) | 200 |
`--requests-per-second` | Rate limit for requests | 1.0 |
`--use-sitemap` | Use sitemap.xml to discover URLs | False |
`--min-priority` | Minimum priority for sitemap URLs | None |
`--include` | Regex patterns for URLs to include | None |
`--exclude` | Regex patterns for URLs to exclude | None |
`--limit` | Maximum number of URLs to scrape | None |
from main import MarkdownScraper
scraper = MarkdownScraper()
html_content = scraper.scrape_website("https://example.com")
markdown_content = scraper.convert_to_markdown(html_content, "https://example.com")
scraper.save_markdown(markdown_content, "output.md")
from main import MarkdownScraper
scraper = MarkdownScraper(requests_per_second=2.0)
# Scrape using sitemap discovery
scraped_urls = scraper.scrape_by_sitemap(
base_url="https://example.com",
output_dir="output_dir",
min_priority=0.5, # Only URLs with priority >= 0.5
include_patterns=["blog/*"], # Only blog URLs
exclude_patterns=["temp/*"], # Exclude temporary pages
limit=20, # Maximum 20 URLs
save_chunks=True, # Enable chunking
chunk_dir="my_chunks", # Save chunks here
chunk_format="jsonl" # Use JSONL format
)
print(f"Successfully scraped {len(scraped_urls)} URLs")
from sitemap_utils import SitemapParser, discover_site_urls
# Quick discovery of URLs from sitemap
urls = discover_site_urls(
base_url="https://example.com",
min_priority=0.7,
include_patterns=["products/*"],
limit=10
)
# Or with more control
parser = SitemapParser()
parser.parse_sitemap("https://example.com")
urls = parser.filter_urls(min_priority=0.5)
parser.export_urls_to_file(urls, "sitemap_urls.txt")
The library intelligently discovers and parses XML sitemaps to scrape exactly what you need:
- Automatic Discovery: Finds sitemaps through robots.txt or common locations
- Sitemap Index Support: Handles multi-level sitemap index files
- Priority-Based Filtering: Choose URLs based on their priority in the sitemap
- Pattern Matching: Include or exclude URLs with regex patterns
- Optimized Scraping: Only scrape the pages that matter most
- Structured Organization: Creates meaningful filenames based on URL paths
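The automatic discovery step described above can be pictured as checking `robots.txt` for a `Sitemap:` directive and falling back to common locations. The sketch below is illustrative only and is not the code in `sitemap_utils.py`; the function name and fallback paths are assumptions:

```python
# Illustrative sketch of sitemap discovery; the actual implementation
# lives in sitemap_utils.SitemapParser and may differ.
from typing import Optional
from urllib.parse import urljoin

import requests

def find_sitemap_url(base_url: str, timeout: float = 10.0) -> Optional[str]:
    """Look for a sitemap via robots.txt, then fall back to common paths."""
    robots_url = urljoin(base_url, "/robots.txt")
    try:
        response = requests.get(robots_url, timeout=timeout)
        if response.ok:
            for line in response.text.splitlines():
                if line.lower().startswith("sitemap:"):
                    return line.split(":", 1)[1].strip()
    except requests.RequestException:
        pass  # robots.txt missing or unreachable; try defaults below

    for candidate in ("/sitemap.xml", "/sitemap_index.xml"):
        candidate_url = urljoin(base_url, candidate)
        try:
            if requests.head(candidate_url, timeout=timeout).ok:
                return candidate_url
        except requests.RequestException:
            continue
    return None
```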
The library implements intelligent chunking designed specifically for RAG (Retrieval-Augmented Generation) systems:
- Semantic Chunking: Preserves the semantic structure of documents by chunking based on headers
- Content-Aware: Large sections are split into overlapping chunks for better context preservation
- Metadata-Rich: Each chunk contains detailed metadata for better retrieval
- Multiple Formats: Save chunks as individual JSON files or as a single JSONL file
- Customizable: Control chunk size and overlap to balance between precision and context
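To make this behaviour concrete, here is a minimal, illustrative sketch of header-based chunking with character overlap. It is not the project's `chunk_utils.py` or `chunker.rs` implementation; the function name and splitting rules are assumptions, and the real chunker additionally attaches per-chunk metadata:

```python
# Illustrative sketch only: split markdown at headers, then break large
# sections into overlapping windows to preserve context.
import re
from typing import List

def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    # Split at markdown headers so each section keeps its heading context.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks: List[str] = []
    for section in sections:
        if len(section) <= chunk_size:
            if section.strip():
                chunks.append(section)
            continue
        # Large section: emit overlapping windows of at most chunk_size chars.
        start = 0
        while start < len(section):
            chunks.append(section[start:start + chunk_size])
            start += chunk_size - overlap
    return chunks
```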
The project includes comprehensive unit tests. To run them:
pytest
# Run unit and integration tests
cargo test
# Run tests with logging
RUST_LOG=debug cargo test -- --nocapture
# Run Python binding tests
pytest tests/test_python_bindings.py -v
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench html_to_markdown
cargo bench chunk_markdown
After running the benchmarks, you can visualize the results:
python scripts/visualize_benchmarks.py
This will create a `benchmark_results.png` file with a bar chart showing the performance of each operation.
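The bundled script handles this for you, but if you want to roll your own chart, a sketch along these lines works against Criterion's default output layout (the directory structure and JSON fields assume Criterion's standard `estimates.json`; adjust if your setup differs):

```python
# Illustrative alternative to scripts/visualize_benchmarks.py: read Criterion's
# estimates.json files and plot the mean execution time per benchmark.
import json
from pathlib import Path

import matplotlib.pyplot as plt

names, means_ms = [], []
for estimates in Path("target/criterion").glob("*/new/estimates.json"):
    with open(estimates) as f:
        data = json.load(f)
    names.append(estimates.parent.parent.name)              # benchmark id
    means_ms.append(data["mean"]["point_estimate"] / 1e6)   # ns -> ms

plt.bar(names, means_ms)
plt.ylabel("Mean time (ms)")
plt.title("Criterion benchmark results")
plt.tight_layout()
plt.savefig("benchmark_results.png")
```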
- `src/`: Rust source code
  - `lib.rs`: Main library and Python bindings
  - `html_parser.rs`: HTML parsing utilities
  - `markdown_converter.rs`: HTML to Markdown conversion
  - `chunker.rs`: Markdown chunking logic
  - `js_renderer.rs`: JavaScript page rendering
- `tests/`: Test files
  - Rust integration tests
  - Python binding tests
- `benches/`: Benchmark files
  - Performance tests for core operations
To enable real JavaScript rendering with headless Chrome:
cargo build --release --features real_rendering
- HTML to Markdown conversion is optimized for medium to large documents
- Chunking algorithm balances semantic coherence with performance
- JavaScript rendering can be CPU and memory intensive
- requests: Web scraping and HTTP requests
- beautifulsoup4: HTML parsing
- pytest: Testing framework
- typing-extensions: Additional type checking support
- pathlib: Object-oriented filesystem paths
- python-dateutil: Powerful extensions to the standard datetime module
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- `main.py`: The main scraper implementation
- `chunk_utils.py`: Utilities for chunking text for RAG
- `sitemap_utils.py`: Sitemap parsing and URL discovery
- `throttle.py`: Rate limiting for web requests
- `test_*.py`: Unit tests
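The rate limiting and retry behaviour referenced earlier (the `requests_per_second` option and exponential backoff) can be approximated in a few lines. This is an illustrative sketch, not the contents of `throttle.py`; the function name and retry parameters are assumptions:

```python
# Illustrative sketch of request spacing plus exponential backoff;
# the real logic lives in throttle.py and the scraper itself.
import time

import requests

_last_request_time = 0.0

def fetch_with_throttle(url: str, requests_per_second: float = 1.0,
                        max_retries: int = 3) -> requests.Response:
    """Fetch a URL, spacing requests and retrying with exponential backoff."""
    global _last_request_time
    min_interval = 1.0 / requests_per_second
    for attempt in range(max_retries):
        # Keep at least min_interval between consecutive requests.
        wait = min_interval - (time.monotonic() - _last_request_time)
        if wait > 0:
            time.sleep(wait)
        _last_request_time = time.monotonic()
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```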
- Add support for more HTML elements
- Implement chunking for RAG
- Add sitemap.xml parsing for systematic scraping
- Add support for JavaScript-rendered pages
- Implement custom markdown templates
- Add concurrent scraping for multiple URLs
- Include CSS selector support
- Add configuration file support
ursister