read common binary files, such as PDFs and those of Microsoft Office or LibreOffice, in Vim

Konfekt/vim-office

This plug-in lets Vim read the text of documents in formats such as PDF, Microsoft Office (Word (doc(x)), Excel (xls(x)) or PowerPoint (ppt(x))), Open Document (odt), EPUB, and others. Text extraction depends on external tools, but most use cases are covered by installing

  • LibreOffice and a common text browser (such as lynx), and
  • pdftotext.

Extractors

Whenever available, the plug-in uses an appropriate external converter such as unrtf, pandoc, docx2txt.pl, odt2txt, xlscat, xlsx2csv.py or pptx2md ...; otherwise it falls back to:

  • Either LibreOffice, an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDF. (On Microsoft Windows, ensure after installation that the folder containing the executable, by default %ProgramFiles%\LibreOffice\program, is added to the %PATH% environment variable.)

  • Or Tika, a content extractor that can handle all the formats listed above and many more. To use it:

    1. Download the latest runnable tika-app-...jar from Tika and save it as ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).

    2. Create

      • on Linux, a shell script ~/bin/tika that reads
          #!/bin/sh
          exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null

      and mark it executable (by chmod a+x ~/bin/tika).

      • on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads
          @echo off
          java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*

    3. Add the folder of the newly created tika executable to your environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):

      • on Linux, if you use bash or zsh, by adding the following line to ~/.profile (bash) or ~/.zshenv (zsh):
          PATH=$PATH:~/bin
      • on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
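
Once the steps above are done, a quick way to confirm that the fallback tools are reachable is to ask the shell where it finds them. The following is a minimal sketch and not part of the plug-in: check_tools is a hypothetical helper name, and the executable names soffice (LibreOffice's launcher), lynx, pdftotext and tika are assumptions about how the tools are installed on your system.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands are found on PATH.
# Prints one line per command: "<name>: found" or "<name>: missing".
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

# Assumed executable names; adjust them to your installation.
check_tools soffice lynx pdftotext tika
```

If tika is reported as missing right after editing ~/.profile or ~/.zshenv, start a new login shell so that the changed PATH takes effect.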

OCR

To extract (English) text from image files in common formats, the plug-in uses tesseract, which is available for Microsoft Windows and Linux, whenever its executable is found. On Microsoft Windows, ensure after installation that the folder containing its executable, by default %ProgramFiles%\Tesseract-OCR, is added to the %PATH% environment variable.

Pass additional command-line options to tesseract through the variable g:office_tesseract (empty by default), for example

  let g:office_tesseract = '-l eng+ita'

to properly extract Italian words (as well as English ones).
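
To check outside Vim that such options behave as expected, they can be tried on the command line first. This is a sketch under stated assumptions: ocr_to_stdout is a hypothetical helper, scan.png is a placeholder file name, and the exact command that the plug-in builds from g:office_tesseract is not shown in this README. The sketch uses tesseract's standard form, in which the output base stdout prints the recognized text to the terminal.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on an image and print the text to stdout.
#   $1: path to the image file
#   $2: extra options, e.g. '-l eng+ita' (intentionally left unquoted below
#       so that the shell splits it into separate arguments)
ocr_to_stdout() {
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract not found on PATH" >&2
    return 127
  fi
  # 'stdout' as the output base makes tesseract print instead of writing a file.
  tesseract "$1" stdout $2
}

# Example (scan.png is a placeholder):
#   ocr_to_stdout scan.png '-l eng+ita'
```

Running tesseract --list-langs shows which language data is installed; a language passed with -l that is not in that list cannot be used.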

Other (media) file formats

To go even further, for example to read media files (among many other file formats) in Vim, add this Vimscript snippet from lesspipe.sh to your vimrc.

Pandoc

To convert a file to markdown, add the following command to your vimrc and run :PandocToMarkdown inside the buffer of the opened file:

  command! -range=% PandocToMarkdown exe '<line1>,<line2>!pandoc --wrap=preserve --from='..PandocFiletype(&l:filetype)..' --to markdown %:S'
  function! PandocFiletype(filetype) abort
    if a:filetype ==# 'tex'
      return 'latex'
    elseif a:filetype ==# 'pandoc'
      return 'markdown'
    elseif a:filetype ==# 'text' || empty(a:filetype)
      return expand('%:e')
    else
      return a:filetype
    endif
  endfunction

Related

This answer shows how this plug-in works in principle. It also refers to vim-util, an alternative implementation for some Word document formats that uses the textutil command on macOS and also allows writing back the text edited in Vim.
