PDF files are everywhere, so there is strong demand for converting the tables they contain into CSV/Excel so the data can be extracted easily. In Python, several FREE libraries, such as tabula-py and PDFMiner, can convert PDF to CSV/Excel, but the results are not always ideal.
In Aug 2021, StatCan released a FREE library, SLICEmyPDF, built expressly for converting tabular data. Its results usually beat those of other free Python/R libraries.
However, on Windows SLICEmyPDF requires several other free software packages to be installed and slight modifications to its code, because the library was created on Linux.
Follow the steps below to install SLICEmyPDF and the needed software.
1. Download SLICEmyPDF from https://github.com/StatCan/SLICEmyPDF
a) Where to save slicemypdf folder?
C:\Users\user-name\anaconda3\Lib
Note: move the slicemypdf folder out of the “SLICEmyPDF-main” folder first.
b) Where to save “utilities” folder?
Inside the slicemypdf folder, copy the “utilities” folder and save the copy in
C:\Users\user-name\anaconda3\Lib
c) To import slicemypdf in code editor
from slicemypdf.slicemypdf import Extractor
2. Install ImageMagick software and Wand library
a) Make sure only one Python installation (either 32-bit or 64-bit) is on your PC.
b) Follow the installation instructions of ImageMagick for Windows from
https://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows
c) Download the dynamic (DLL) version of ImageMagick that matches your PC and Python (32-bit or 64-bit).
d) During setup, tick “Install development headers and libraries for C and C++” so that Wand can link to ImageMagick.
e) Set the MAGICK_HOME environment variable to the ImageMagick installation folder, as described in the guide above.
f) pip install wand
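Before moving on, it can help to confirm which bitness your Python actually has and whether the environment variable is visible to it. The snippet below is only a rough check written for this post, not part of Wand or SLICEmyPDF. The helper name python_bitness is made up, and MAGICK_HOME is assumed to be the variable you set by following the Wand guide.

```python
import os
import struct


def python_bitness():
    """Return 32 or 64 depending on the running Python interpreter."""
    # A pointer ("P") is 4 bytes on 32-bit builds and 8 bytes on 64-bit builds.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python is", python_bitness(), "bit")
    # MAGICK_HOME is assumed here to be the variable set in step 2e.
    print("MAGICK_HOME =", os.environ.get("MAGICK_HOME", "<not set>"))
```

If the bitness printed here differs from the ImageMagick build you downloaded, Wand will most likely fail to load the ImageMagick libraries.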
3. Download pdftotext & pdftohtml from poppler
a) Create a new conda environment and install the poppler package in it. Installing poppler into an existing environment may conflict with packages already there.
conda create --name poppler poppler
b) Go to bin folder in poppler envir folder
C:\Users\user-name\anaconda3\envs\poppler\Library\bin
c) Check if pdftotext.exe and pdftohtml.exe are in the folder
d) In settings.yaml, amend the dependency paths:
pdf_text_path: C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftotext
pdf_html_path: C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftohtml
e) Test that pdftotext & pdftohtml are accessible in cmd
In cmd, type
C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftotext
C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftohtml
Each command should print the tool's usage text instead of a “not recognized” error.
4. Install the delegator.py library (not delegator)
pip install delegator.py
5. Install geopandas library without dependencies
This is to avoid installing fiona & gdal libraries which in turn need Microsoft C++ Build Tools. These libraries and Microsoft C++ Build Tools are not needed for SLICEmyPDF.
a) pip install --no-deps geopandas
b) Separately install only the required geopandas dependencies, excluding fiona (https://geopandas.org/getting_started/install.html):
Dependencies
• numpy
• pandas (version 0.24 or later)
• shapely (interface to GEOS)
• fiona (interface to GDAL): skip, not needed for SLICEmyPDF
• pyproj (interface to PROJ; version 2.2.0 or later)
• rtree
c) For shapely, if pip cannot install the needed dependencies, use conda instead:
conda install shapely
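Once step 5 is done, a quick way to see whether anything is still missing is to ask Python which of the packages it can locate. This is just an illustrative check: the names REQUIRED and find_missing are invented for this example, and the list simply mirrors the dependencies above without fiona.

```python
import importlib.util

# Dependencies listed in step 5b, minus fiona.
REQUIRED = ("numpy", "pandas", "shapely", "pyproj", "rtree")


def find_missing(module_names):
    """Return the names that Python cannot locate on the current path."""
    return [m for m in module_names if importlib.util.find_spec(m) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED + ("geopandas",))
    print("Everything is installed" if not missing else f"Still missing: {missing}")
```

find_spec only looks the packages up without importing them, so the check stays fast and does not fail on a half-installed geopandas.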
6. Install the other libraries imported in slicemypdf.py
7. Amend the code below in slicemypdf.py
a) State the exact directory of settings.yaml
settings = yaml.safe_load(open(r"C:/Users/user-name/anaconda3/Lib/slicemypdf/settings.yaml"))
b) Change cat to type
#b = delegator.run("cat output.xml") # cat is used on Linux systems
b = delegator.run("type output.xml") # type is the Windows equivalent
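If the same copy of slicemypdf.py has to run on both Windows and Linux, one option is to pick the command at run time instead of hard-coding it. The sketch below is not how SLICEmyPDF itself is written; dump_command is a hypothetical helper whose result could be passed to delegator.run.

```python
import os


def dump_command(filename):
    """Build the shell command that prints a text file on this OS."""
    # os.name is "nt" on Windows, where "type" replaces the Unix "cat".
    tool = "type" if os.name == "nt" else "cat"
    return f"{tool} {filename}"


if __name__ == "__main__":
    # The string could replace the hard-coded command given to delegator.run.
    print(dump_command("output.xml"))
```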
=======================================================
Steps for installing geopandas with fiona & gdal libraries
a) Download the wheel for Fiona & gdal from
https://www.lfd.uci.edu/~gohlke/pythonlibs/#gdal
https://www.lfd.uci.edu/~gohlke/pythonlibs/#fiona
b) Download the version that matches your Python version, bitness and Windows platform
EG: Python 3.8.8, 64-bit AMD64
Pick the whl tagged cp38 (because you are running CPython 3.8), win (because Windows) and amd64 (because 64-bit)
- GDAL-3.2.3-cp38-cp38-win_amd64.whl
- Fiona-1.8.19-cp38-cp38-win_amd64.whl
Note: the Fiona whl must be compatible with the GDAL whl
c) Install the whl by specifying the path to the saved whl
pip install C:/Users/user-name/anaconda3/Lib/GDAL-3.2.3-cp38-cp38-win_amd64.whl
d) Add an Environment Variable named GDAL_DATA with
value = C:\Users\user-name\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\osgeo\data\gdal
(adjust this to the osgeo\data\gdal folder inside your own Python's site-packages)
e) Install Visual C++ build tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/
f) Re-start PC
g) pip install C:/Users/user-name/anaconda3/Lib/Fiona-1.8.19-cp38-cp38-win_amd64.whl
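Choosing the right whl in step b) comes down to reading the tags at the end of the file name. As a rough illustration, the snippet below splits a wheel file name into its tags and prints the tags the running interpreter would need. The function names are made up for this example, and it assumes CPython on x86 Windows (it does not handle ARM builds).

```python
import struct
import sys


def wheel_tags(filename):
    """Split a wheel file name into (python_tag, abi_tag, platform_tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def interpreter_tags():
    """Return the (python_tag, platform_tag) expected for CPython on Windows."""
    python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    platform_tag = "win_amd64" if struct.calcsize("P") == 8 else "win32"
    return python_tag, platform_tag


if __name__ == "__main__":
    py_tag, _, plat_tag = wheel_tags("GDAL-3.2.3-cp38-cp38-win_amd64.whl")
    print("Wheel targets:", py_tag, plat_tag)
    print("This Python needs:", *interpreter_tags())
```

If the two printed pairs differ, pip will refuse the wheel as not supported on this platform.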
can I be fine without installing or going through these steps?
For a Linux environment, follow the readme on this page: https://github.com/StatCan/SLICEmyPDF