Thursday, 7 April 2022

How to install StatCan's SLICEmyPDF on Windows?

PDF files are everywhere, so there is huge demand for converting the tables in them into CSV/Excel for easy data extraction. In Python, several free libraries, such as tabula-py and PDFMiner, can convert PDF to CSV/Excel, but the results are not always ideal.

In Aug 2021, StatCan released a free library, SLICEmyPDF, built expressly for converting tabular data. Its results usually beat those of other free Python/R libraries.

However, SLICEmyPDF requires installing several other free software packages and making slight modifications to the code for use on Windows (the library was created on Linux).

Follow the steps below to install SLICEmyPDF and the needed software.

1. Download SLICEmyPDF from https://github.com/StatCan/SLICEmyPDF

    a) Where to save the slicemypdf folder?

         C:\Users\user-name\anaconda3\Lib

         Note: take the slicemypdf folder out of the “SLICEmyPDF-main” folder and save only the slicemypdf folder here.

    b) Where to save the “utilities” folder?

         Copy the “utilities” folder found inside the slicemypdf folder and save the copy in

         C:\Users\user-name\anaconda3\Lib

    c) To import slicemypdf in your code editor

         from slicemypdf.slicemypdf import Extractor
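
Before running the import, it can help to confirm that Python can actually see the two folders you copied. The snippet below is only a minimal sketch using the standard library; the helper name is_importable is my own and is not part of SLICEmyPDF.

```python
import importlib.util


def is_importable(top_level_name):
    """Return True if Python can locate the top-level package or module."""
    return importlib.util.find_spec(top_level_name) is not None


# "slicemypdf" and "utilities" are the two folders copied in steps 1a and 1b.
for name in ("slicemypdf", "utilities"):
    status = "found" if is_importable(name) else "NOT found on sys.path"
    print(f"{name}: {status}")
```

If either folder is reported as not found, re-check that it sits directly inside the Lib folder of the Python you are running.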

 

2. Install ImageMagick software and Wand library

   a) Make sure only one Python (either 32-bit or 64-bit) is installed on your PC.

   b) Follow the ImageMagick installation instructions for Windows at

       https://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows

   c) Download the dynamic version of ImageMagick that matches the bitness (32-bit or 64-bit) of your PC.

   d) During installation, check “Install development headers and libraries for C and C++” so that Wand can link to ImageMagick.

   e) Set the environment variable described in that guide (MAGICK_HOME, pointing to the ImageMagick installation folder).

   f) pip install wand
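
If you are not sure whether your Python is 32-bit or 64-bit, you can ask the interpreter itself. This is a small sketch, not part of the Wand or ImageMagick instructions; python_bitness is a name I made up, and MAGICK_HOME is the variable name I am assuming from the Wand installation guide.

```python
import os
import struct


def python_bitness():
    """Return 32 or 64 depending on the running interpreter's pointer size."""
    return struct.calcsize("P") * 8


print(f"This Python is {python_bitness()}-bit")

# Assumption: MAGICK_HOME is the variable the Wand guide asks you to set.
# None here means the variable is not visible to this Python process.
print("MAGICK_HOME =", os.environ.get("MAGICK_HOME"))
```

Run it after setting the environment variable and reopening your terminal or editor, since already-open programs do not pick up new variables.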

 

3. Download pdftotext & pdftohtml from poppler

      a) Create a new environment and install the poppler package in it. If you use an existing environment, installing poppler may conflict with packages already installed there.

        conda create --name poppler poppler

    b)   Go to bin folder in poppler envir folder

        C:\Users\user-name\anaconda3\envs\poppler\Library\bin

    c)  Check if pdftotext.exe and pdftohtml.exe are in the folder

    d)  In settings.yaml, amend the dependency paths

         pdf_text_path: C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftotext

         pdf_html_path: C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftohtml

     e) Test that pdftohtml & pdftotext are accessible in cmd

         In cmd, type

           C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftotext

           C:/Users/user-name/anaconda3/envs/poppler/Library/bin/pdftohtml

         You should see each tool print its usage message rather than a "not recognized" error.
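
The check in step 3c can also be done from Python. The sketch below only looks for the two files in the folder; missing_tools is a hypothetical helper of mine, and the folder is the example path from step 3b, so replace user-name with your own.

```python
from pathlib import Path


def missing_tools(bin_dir, tools=("pdftotext", "pdftohtml")):
    """Return the tools that have no matching file (with or without .exe) in bin_dir."""
    folder = Path(bin_dir)
    return [
        tool
        for tool in tools
        if not any((folder / (tool + ext)).is_file() for ext in ("", ".exe"))
    ]


# Example path from step 3b; replace user-name with your own Windows user name.
bin_dir = "C:/Users/user-name/anaconda3/envs/poppler/Library/bin"
print("Missing:", missing_tools(bin_dir) or "none")
```

An empty result means both executables are where settings.yaml expects them.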

 

 

4. Install the delegator.py library (not delegator)

     pip install delegator.py

 

5. Install geopandas library without dependencies

This is to avoid installing fiona & gdal libraries which in turn need Microsoft C++ Build Tools. These libraries and Microsoft C++ Build Tools are not needed for SLICEmyPDF.

    a) pip install --no-deps geopandas

    b) Separately install only the required dependencies (excluding fiona) for geopandas (https://geopandas.org/getting_started/install.html):

        Dependencies

        •numpy

        •pandas (version 0.24 or later)

        •shapely (interface to GEOS)

        •fiona (interface to GDAL): not needed, skip

        •pyproj (interface to PROJ; version 2.2.0 or later)

        •rtree

     c) For shapely, if pip cannot install the needed dependencies, use conda

         conda install shapely
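
Because geopandas was installed without its dependencies, it is easy to miss one. Here is a rough sketch, assuming Python 3.8 or later, that lists which of the packages from step 5b are installed; installed_versions is my own helper name, not something from geopandas.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Dependencies listed in step 5b, with fiona left out on purpose.
required = ["geopandas", "numpy", "pandas", "shapely", "pyproj", "rtree"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```

Anything shown as not installed still needs a pip install or conda install.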

 

6. Install the other libraries stated in slicemypdf.py

 

7. Amend the code below in slicemypdf.py

       a) State the exact path to settings.yaml in

    settings = yaml.safe_load(open(r"C:/Users/user-name/anaconda3/Lib/slicemypdf/settings.yaml"))

       b) Change cat to type

    # b = delegator.run("cat output.xml")  # cat is used on Linux systems

    b = delegator.run("type output.xml")  # type is used on Windows systems
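
If you want one copy of slicemypdf.py that works on both Windows and Linux, one option is to pick the command from os.name instead of hard-coding it. This is just a sketch of the idea; print_file_command is a hypothetical helper, and I am assuming its result would be passed to delegator.run in place of the literal string.

```python
import os


def print_file_command(filename):
    """Build the shell command that prints a file: 'type' on Windows, 'cat' elsewhere."""
    tool = "type" if os.name == "nt" else "cat"
    return f"{tool} {filename}"


# On Windows this gives the "type" form from step 7b, on Linux the "cat" form.
print(print_file_command("output.xml"))
```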



=======================================================

Steps for installing geopandas with fiona & gdal libraries
 

a) Download the wheel for Fiona & gdal from

https://www.lfd.uci.edu/~gohlke/pythonlibs/#gdal
https://www.lfd.uci.edu/~gohlke/pythonlibs/#fiona

b) Download the version that matches your Python version, bitness and Windows

E.g. for Python 3.8.8, 64-bit (AMD64):

Choose the whl with cp38 (because CPython 3.8 is what you’re running), win (because Windows) and amd64 (because 64-bit)

  • GDAL-3.2.3-cp38-cp38-win_amd64.whl
  • Fiona-1.8.19-cp38-cp38-win_amd64.whl

Note: the Fiona whl must be compatible with the gdal whl

c) Install the whl by specifying the path to the saved whl

pip install C:/Users/user-name/anaconda3/Lib/GDAL-3.2.3-cp38-cp38-win_amd64.whl

d) Add an environment variable named GDAL_DATA whose value is the osgeo\data\gdal folder inside your Python's site-packages, for example

value = C:\Users\user-name\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\osgeo\data\gdal

e) Install Visual C++ build tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/

f) Re-start PC

g) pip install C:/Users/user-name/anaconda3/Lib/Fiona-1.8.19-cp38-cp38-win_amd64.whl
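
To double-check a wheel before installing it, you can read the tags straight from the file name and compare them with the Python you are running. A minimal sketch, assuming the standard wheel naming scheme; wheel_tags and running_python_tag are names I made up for illustration.

```python
import sys


def wheel_tags(filename):
    """Split a wheel file name into its name, version, Python, ABI and platform tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # The first two fields are name and version; the last three are always
    # the Python tag, ABI tag and platform tag.
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def running_python_tag():
    """Return the CPython tag of this interpreter, e.g. 'cp38' for Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


tags = wheel_tags("GDAL-3.2.3-cp38-cp38-win_amd64.whl")
print(tags)
print("Matches this Python:", tags["python"] == running_python_tag())
```

If the last line prints False, download the wheel whose cp tag matches your Python instead.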

2 comments:

  1. Can I be fine without installing or going through the steps under "Steps for installing geopandas with fiona & gdal libraries"?

  2. For a Linux environment, follow the readme on this page: https://github.com/StatCan/SLICEmyPDF

