This plug-in lets Vim read text documents of type PDF, Microsoft Office such as Word (doc(x)), Excel (xls(x)) or PowerPoint (ppt(x)), Open Document (odt), EPUB, ...
The text extraction depends on external tools, but most use cases are covered by an installation of LibreOffice and a common text browser (such as lynx), and pdftotext.
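To see which of the tools named above are already installed, a minimal shell sketch could look as follows. The executable names soffice (for LibreOffice), lynx and pdftotext are assumptions about your installation, and the helper name missing_tools is hypothetical.

```shell
#!/bin/sh
# Print the names of the given commands that are not found on $PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed executable names; adjust them to your installation.
missing=$(missing_tools soffice lynx pdftotext)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names are printed on one line.
  echo "Not found on PATH:" $missing
else
  echo "All basic tools found."
fi
```

Any tool reported as missing has to be installed, or its folder added to the PATH, before the plug-in can call it.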
It uses, whenever available, appropriate external converters such as unrtf, pandoc, docx2txt.pl, odt2txt, xlscat, xlsx2csv.py or pptx2md ..., but will fall back to:
- Either LibreOffice, an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDF. (On Microsoft Windows, ensure after installation that the folder containing its executable, by default %ProgramFiles%\LibreOffice\program, is added to the %PATH% environment variable.)
- Or Tika, a content extractor that can handle all the formats listed above and many more. To use it:
  - Download the latest runnable tika-app-...jar from Tika to ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).
  - Create
    - on Linux, a shell script ~/bin/tika that reads

          #!/bin/sh
          exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null

      and mark it executable (by chmod a+x ~/bin/tika).
    - on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads

          @echo off
          java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*
  - Add the folder of the newly created tika executable to your environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):
    - on Linux, if you use bash or zsh, by adding to ~/.profile or ~/.zshenv the line

          PATH=$PATH:~/bin

    - on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
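The fallback order described above (a specialized converter when available, otherwise LibreOffice or Tika) can be sketched in shell. This only illustrates the idea and is not the plug-in's actual code; the helper name first_available, the preference order and the executable name soffice are assumptions.

```shell
#!/bin/sh
# Print the first command from the argument list that exists on $PATH;
# return 1 if none of them is found.
first_available() {
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      return 0
    fi
  done
  return 1
}

# Assumed preference order for a .docx file: specialized converters first,
# then the general fallbacks (soffice is LibreOffice's executable).
converter=$(first_available docx2txt.pl pandoc soffice tika) ||
  converter="none"
echo "docx converter: $converter"
```

If this prints "none", the tika wrapper script created in the steps above is probably not yet on the PATH.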
For the (English) text extraction from image files of common formats, it uses tesseract whenever its executable is found; tesseract is available on Microsoft Windows and Linux.
On Microsoft Windows, ensure after installation that the folder containing its executable, by default %ProgramFiles%\Tesseract-OCR, is added to the %PATH% environment variable.
Pass additional command-line options to tesseract through g:office_tesseract (empty by default), for example

    let g:office_tesseract = '-l eng+ita'

to properly extract Italian words (as well as English ones).
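To try such options outside Vim first, the following shell sketch builds a tesseract command line. How the plug-in itself assembles the call is not shown here, so the layout below (image, then stdout as output base, then the options) is an assumption based on tesseract's usual command-line form; the file name scan.png and the helper name tesseract_cmd are hypothetical.

```shell
#!/bin/sh
# Build the tesseract command line for an image file and extra options.
# "stdout" as output base makes tesseract print the recognized text.
tesseract_cmd() {
  image=$1
  options=$2   # for example the value of g:office_tesseract
  printf 'tesseract %s stdout %s\n' "$image" "$options"
}

# Show the command, then run it only if tesseract and the image exist.
cmd=$(tesseract_cmd scan.png '-l eng+ita')
echo "$cmd"
if command -v tesseract >/dev/null 2>&1 && [ -f scan.png ]; then
  tesseract scan.png stdout -l eng+ita
fi
```

Once the output looks right on the command line, the same option string can be assigned to g:office_tesseract.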
To go even further, for example to read media files (among many other file formats) in Vim, add this Vimscript snippet from lesspipe.sh to your vimrc!
To convert a file to Markdown, add the following command to your vimrc and run :PandocToMarkdown inside the buffer of the opened file:
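    " Filter the given range (default: whole buffer) through pandoc, which
    " reads the current file (%:S, shell-escaped) and converts it to Markdown.
    command! -range=% PandocToMarkdown exe '<line1>,<line2>!pandoc --wrap=preserve --from='..PandocFiletype(&l:filetype)..' --to markdown %:S'

    " Map Vim's filetype to the name of a pandoc input format.
    function! PandocFiletype(filetype) abort
      if a:filetype ==# 'tex'
        return 'latex'
      elseif a:filetype ==# 'pandoc'
        return 'markdown'
      elseif a:filetype ==# 'text' || empty(a:filetype)
        " No useful filetype: fall back to the file extension.
        return expand('%:e')
      else
        return a:filetype
      endif
    endfunction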
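To check outside Vim which pandoc command the mapping above leads to, the same logic can be sketched in shell. The function pandoc_format mirrors PandocFiletype() and takes the file extension as a second argument, since expand('%:e') is not available in a shell; the file name report.docx is hypothetical.

```shell
#!/bin/sh
# Map a Vim filetype (with the file extension as fallback) to a pandoc
# input format, mirroring PandocFiletype().
pandoc_format() {
  filetype=$1
  extension=$2
  case $filetype in
    tex)      echo latex ;;
    pandoc)   echo markdown ;;
    text|'')  echo "$extension" ;;
    *)        echo "$filetype" ;;
  esac
}

# Print the command line that would be run for report.docx when Vim
# has not set a filetype for the buffer.
echo "pandoc --wrap=preserve --from=$(pandoc_format '' docx) --to markdown report.docx"
```

Whether pandoc accepts the resulting name as an input format can be checked against the output of pandoc --list-input-formats.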
This answer shows how this plug-in works in principle, and refers to vim-util as an alternative implementation for some Word document formats; it uses the textutil command on macOS, which also allows writing back the text edited in Vim.