PDF on Saleem Ansari

Recent content in PDF on Saleem Ansari (/tags/pdf/)
(c) 2024 Saleem Ansari

Extract Text From PDF
/2015/12/20/extract-text-from-pdf/
Sun, 20 Dec 2015 04:12:29 +0530

Extract Text from Images in multi-page PDF

To extract text from a PDF, you need two pieces of software installed on your machine: ghostscript and tesseract OCR.

Installing these on Fedora is very easy:

$ sudo yum install -y ghostscript tesseract

Now if your PDF file is named story.pdf, then you can extract text as follows:

$ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
$ for f in page*.png ; do tesseract $f $f.

Extract Text from multi-page PDF with only Images
/2013/07/19/extract-text-from-from-multi-page-pdf-with-only-images/
Fri, 19 Jul 2013 00:00:00 +0000

Sometimes a PDF contains only images. In such cases you cannot select text to copy and paste, or even for reference. To extract text from an image or a PDF containing only images, I used the Tesseract OCR engine and Ghostscript. I am running Fedora 19 at the moment; however, these steps should also apply to older versions of Fedora or to Ubuntu. (I believe this can be done on Windows as well.)
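
The two steps these posts describe, rendering each PDF page to a PNG with Ghostscript and then running Tesseract on every image, could be wrapped in a few shell functions along the following lines. This is only a sketch and not the author's script: the function names (ocr_output_base, pdf_to_pngs, ocr_pages) are hypothetical, it assumes the Ghostscript binary is available as gs, and the final step that joins the per-page text files into one file is an added assumption.

```shell
#!/usr/bin/env bash
# Sketch: OCR every page of an image-only PDF.
# Assumes `gs` (Ghostscript) and `tesseract` are on PATH.
# All function names below are hypothetical, chosen for illustration.

# Strip the .png suffix; tesseract appends ".txt" to this output base itself.
ocr_output_base() {
    printf '%s\n' "${1%.png}"
}

# Render each page of the PDF ($1) to a 300 dpi PNG:
# page001.png, page002.png, ...
pdf_to_pngs() {
    gs -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 \
       -sOutputFile="page%03d.png" "$1"
}

# Run tesseract on each page image, then join the per-page
# text files (page001.txt, ...) into the single file named by $1.
# Assumption: $1 does not itself match page*.txt.
ocr_pages() {
    local f
    for f in page*.png; do
        [ -e "$f" ] || continue   # skip the literal pattern if no PNGs exist
        tesseract "$f" "$(ocr_output_base "$f")"
    done
    cat page*.txt > "$1"
}
```

After sourcing the file, a run for story.pdf might look like `pdf_to_pngs story.pdf && ocr_pages story.txt`. Quoting "$f" keeps the loop safe for file names with spaces, and the zero-padded page%03d pattern keeps the shell glob in page order.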