Enhanced Summarizer API with Recursive Text Splitting #14

Open · hemanth opened this issue Nov 22, 2024 · 3 comments

hemanth commented Nov 22, 2024

Problem

Currently, our summarizer API doesn't handle large documents efficiently. When the input text exceeds the model's context window, the API fails to process it. Users need to split large texts manually and manage the summarization process themselves, which is error-prone and produces inconsistent results.

Proposed Enhancement

Add automatic text splitting and recursive summarization capabilities to the API, with progress monitoring through callbacks.

Key Features

  1. Automatic Document Chunking

    • Split large documents into manageable chunks automatically
    • Maintain context through overlapping chunks
    • Smart splitting at natural boundaries (sentences/paragraphs)
    • Configurable chunk sizes and overlap amounts
  2. Recursive Summarization

    • Process chunks recursively for very large documents
    • Combine intermediate summaries intelligently
    • Maintain consistent summarization quality across the document
  3. Progress Monitoring

    • Callback system to track processing status
    • Monitor individual chunk processing
    • Track intermediate summaries
    • Get completion statistics

Example Usage

const summarizer = await ai.summarizer.create({
  sharedContext: "An article from the Daily Economic News magazine",
  type: "headline",
  length: "short",
  
  // Optional chunking configuration
  chunking: {
    maxChunkSize: 2000,
    overlapSize: 200
  },
  
  // Optional progress callbacks
  callbacks: {
    onChunk: async (chunk, depth) => {
      // Track chunk processing
    },
    onSummary: async (summary, sourceChunk, depth) => {
      // Monitor intermediate summaries
    },
    onComplete: async (finalSummary, stats) => {
      // Handle completion
    }
  }
});
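
With the summarizer configured, summarizing a large document could remain a single call. The snippet below is only an illustrative sketch: summarize() is the existing Summarizer API method, while the automatic chunking, recursion, and callback invocations behind it are the proposed additions, and the article URL is made up.

// Illustrative sketch -- summarize() exists today; the chunking and
// callbacks behind it are the proposed behavior, not a shipped feature.
const article = await fetch("/articles/economy-report.txt").then((r) => r.text());

// Chunking, recursive summarization, and callback invocations would all
// happen inside this one call.
const finalSummary = await summarizer.summarize(article);
console.log(finalSummary);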

Benefits

  1. Better User Experience

    • No manual text splitting required
    • Consistent results for documents of any size
    • Progress visibility for long-running summarizations
  2. Improved Summary Quality

    • Context preservation through chunk overlap
    • Hierarchical summarization for very large documents
    • Consistent summarization approach across chunks
  3. Developer Flexibility

    • Optional configuration for advanced use cases
    • Progress monitoring for UI updates
    • TypeScript support for better type safety

Backward Compatibility

The enhanced API maintains full compatibility with the current simple usage pattern:

// Simple usage still works
const quickSummary = await ai.summarizer.process(text, {
  type: "headline",
  length: "short"
});

Implementation Considerations

  1. Chunking Strategy (see the sketch after this list)

    • Default chunk size based on model's optimal context window
    • Smart text splitting at sentence/paragraph boundaries
    • Configurable overlap to maintain context
  2. Resource Usage

    • Manage concurrent chunk processing
    • Consider memory usage for very large documents
    • Optional batch processing for resource constraints
  3. Error Handling

    • Graceful degradation for partial failures
    • Clear error messages for configuration issues
    • Recovery strategies for failed chunks
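
To make the chunking and recursion strategy concrete, here is a minimal sketch of how the loop could work. All names (summarizeRecursively, maxChunkSize, overlapSize, onSummary) are illustrative and not part of any shipped API, and a real implementation would split at sentence/paragraph boundaries rather than at fixed character offsets.

// Minimal sketch: character-based chunking with overlap, then recursion
// over the intermediate summaries until the text fits in one chunk.
async function summarizeRecursively(summarizer, text, options = {}, depth = 0) {
  const { maxChunkSize = 2000, overlapSize = 200, onSummary } = options;

  // Base case: the text already fits into a single chunk.
  if (text.length <= maxChunkSize) {
    return summarizer.summarize(text);
  }

  // Split into overlapping chunks so context carries across boundaries.
  const chunks = [];
  for (let start = 0; start < text.length; start += maxChunkSize - overlapSize) {
    chunks.push(text.slice(start, start + maxChunkSize));
  }

  // Summarize each chunk, reporting intermediate summaries if requested.
  const partials = [];
  for (const chunk of chunks) {
    const summary = await summarizer.summarize(chunk);
    if (onSummary) await onSummary(summary, chunk, depth);
    partials.push(summary);
  }

  // Recurse on the combined intermediate summaries.
  return summarizeRecursively(summarizer, partials.join("\n\n"), options, depth + 1);
}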
tomayac (Contributor) commented Nov 22, 2024

@andreban has implemented a client-side solution that essentially does what you describe, modulo the automatic context keeping.

andreban commented

Here's a demo and source code for the client-side solution. It does support overlapping chunks. It uses langchain.js to chunk the text and the Summarizer API to generate summaries, with handwritten recursive summarization.
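
In outline (this is an illustrative sketch, not the demo's actual source), the approach combines langchain.js's RecursiveCharacterTextSplitter with the Summarizer API roughly like this; the chunk sizes and the "tl;dr" type are placeholder choices:

// Sketch of the client-side pattern: langchain.js splits, the Summarizer
// API summarizes, and overly long combined summaries are summarized again.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function clientSideSummarize(text) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200, // overlapping chunks preserve context across boundaries
  });
  const chunks = await splitter.splitText(text);

  const summarizer = await ai.summarizer.create({ type: "tl;dr", length: "short" });
  const partials = [];
  for (const chunk of chunks) {
    partials.push(await summarizer.summarize(chunk));
  }

  // If the combined partial summaries are still too long, recurse.
  const combined = partials.join("\n");
  return combined.length > 1000
    ? clientSideSummarize(combined)
    : summarizer.summarize(combined);
}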

hemanth (Author) commented Nov 22, 2024

Yes, I was cooking up something similar, then came across @andreban's work and felt this should rather be part of the API! ❤️
