
PDFPLUMBER: Extract Data You Need With This Super Easy To Use Python Module by geekgirl

· @geekgirl ·
$131.55
<center>![pytonpdf.jpg](https://images.hive.blog/DQmVGgeToFjpmq85FnwsDNQmn644EysojtZh92b1ZrZrrUs/pytonpdf.jpg)</center>

I love Python! It is easy to learn. It is fun to use. But most importantly, it saves time. Time is the most precious asset we all have. Often we spend our time doing repetitive work over and over again. Computers are really good at repetitive work, and they do it more efficiently. To tell computers what to do, we need a way to communicate with them. This is where programming languages come in. Learning any programming language is a big task. Python makes learning how to code easy and accessible to anybody willing to put in a little effort. With the right mindset, anybody can learn the basics of Python, use it for daily repetitive tasks, and ultimately save time.

Python has a big community and many libraries available to tackle various tasks. One of the Python modules I have been using lately is `pdfplumber`. As the name suggests, this module works with PDF files and helps with extracting relevant data.

PDF is one of the most widely used document formats. If your business, work, or school activities involve any documents, chances are you are familiar with PDF files. What if your daily activities involve reading through large numbers of PDF documents with many pages? Over time we can get more efficient and effective at processing these documents manually. But we still have physical limitations and end up spending countless hours on such repetitive tasks.

Using `pdfplumber` we can tell the computer to do the repetitive parts of the task: identifying what is needed, extracting relevant data, and maybe even using this data for further analysis or storing it for future use and comparison. This is not the only module that helps with extracting data from PDF files. There are many more solutions out there. I found this one to be the easiest to understand and use. And it just works. If you know of any better solutions, feel free to let me know in the comments.

`pdfplumber` has great documentation with examples that demonstrate how it works. Please visit the [pdfplumber GitHub page](https://github.com/jsvine/pdfplumber) for the details.

The most important feature I have been using is extracting text from PDF files. The first step is to open the file and get its pages, which can be done as follows:

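```
import pdfplumber

# Open the PDF; the with-block closes the file when we are done
with pdfplumber.open("path/to/file.pdf") as pdf:
    pages = pdf.pages  # list of page objects
    first_page = pages[0]

    print(first_page.page_number)  # page numbers start at 1
    print(first_page.width)        # page width
    print(first_page.height)       # page height
    print(len(first_page.chars))   # number of characters on the page
```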
<br>
`pdf.pages` in the code above returns the list of all pages, as a list of page objects. The `.page_number`, `.width`, and `.height` properties return these self-explanatory values. `.chars` returns a list of all characters on the page. Each character has many useful properties of its own, which can be used for more complex data extraction. I will share more about `.chars` a bit later.

What makes `pdfplumber` awesome and super easy to use is its line by line text extraction. Take a look at the following code.

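```
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    pages = pdf.pages
    for page in pages:
        # extract_text() can return None in some pdfplumber versions
        # when a page has no text, so fall back to an empty string
        text = (page.extract_text() or "").split('\n')
        print(len(text))  # length of the list of lines for this page
```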

This code reads the PDF file and stores its pages in a `pages` variable. Then we iterate through the pages and extract the text of each page. We split the extracted text on newlines and get a list with one item per line of text. If we know what documents we are working with, we can identify certain text patterns to keep the text we need and throw away the rest.
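Here is a minimal sketch of that filtering idea. The helper names `extract_lines` and `filter_lines` and the example pattern are made up for illustration; they are not part of `pdfplumber`. I import `pdfplumber` inside the function only so that the filtering helper can be tried on a plain list of strings without a PDF at hand.

```python
import re


def extract_lines(path):
    """Return every text line of the PDF at `path`, in page order."""
    import pdfplumber  # imported here so filter_lines works without it

    lines = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page with no text
            text = page.extract_text() or ""
            lines.extend(text.splitlines())
    return lines


def filter_lines(lines, pattern):
    """Keep only the lines that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]
```

For example, `filter_lines(extract_lines("path/to/file.pdf"), r"^Total")` would keep only the lines that start with "Total". What pattern is useful depends entirely on the documents you work with.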

Since the text lines are already in the order in which they appear in the document, we can build more useful code based on what text appears after certain text patterns. This line-by-line text extraction may seem very simple, but it is very powerful and saves me a lot of time.
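As a rough sketch of using line order, the hypothetical helper below (my own name, not a `pdfplumber` function) returns the lines that come right after a marker line. It works on any list of strings, such as the list we get from splitting the extracted text.

```python
def lines_after(lines, marker, count=1):
    """Return the `count` lines that follow the first line containing `marker`.

    Returns an empty list if no line contains `marker`.
    """
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index + 1 : index + 1 + count]
    return []
```

If a document always prints a customer's name on the line after "Customer name", then `lines_after(text, "Customer name")` would pick it out. The marker text here is only an assumption about what a document might contain.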

If you want to build more complex algorithms for extracting the data you need, the `.chars` property of the page can be very helpful. It lists the page's characters one at a time and provides a lot of information about each character, like its value, font, size, and x and y locations on the page. To see the full list of `.chars` properties, visit the GitHub link above and/or experiment in your code.
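To give a feel for `.chars`, here is a small sketch. Each character is a dictionary, and the keys used below (`text`, `fontname`, `size`, `x0`, `top`) are among those listed in the `pdfplumber` documentation. The function names and the idea of picking out large text by font size are my own illustration, not something the module provides.

```python
def text_at_least(chars, min_size):
    """Join the text of all characters whose font size is >= min_size."""
    return "".join(char["text"] for char in chars if char["size"] >= min_size)


def describe_chars(path, limit=5):
    """Print a few properties of the first `limit` characters on page 1."""
    import pdfplumber  # imported here so text_at_least works without it

    with pdfplumber.open(path) as pdf:
        for char in pdf.pages[0].chars[:limit]:
            print(char["text"], char["fontname"], char["size"],
                  char["x0"], char["top"])
```

Something like `text_at_least(page.chars, 14)` could then be a starting point for finding headings, assuming headings in your documents use a larger font than the body text.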

This module can also extract various other objects in a PDF file, like lines, rectangles, curves, annotations, and images. They all have properties similar to those of the char object. Moreover, `pdfplumber` can also help with table extraction and has a visual debugging feature.
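For table extraction, a page has an `extract_table()` method that returns the table as a list of rows, where each row is a list of cell values. The sketch below assumes the first row is a header row, which will not be true for every document. The helper names `table_to_records` and `first_table` are hypothetical.

```python
def table_to_records(table):
    """Turn a table (list of rows, first row = header) into a list of dicts.

    Empty cells, which may come back as None, become empty strings.
    """
    if not table:
        return []
    header = [cell or "" for cell in table[0]]
    return [
        dict(zip(header, (cell or "" for cell in row)))
        for row in table[1:]
    ]


def first_table(path, page_index=0):
    """Extract one table from the given page and return it as records."""
    import pdfplumber  # imported here so table_to_records works without it

    with pdfplumber.open(path) as pdf:
        # extract_table() returns None when no table is found
        table = pdf.pages[page_index].extract_table()
    return table_to_records(table)
```

How well this works depends on how the table is drawn in the PDF, so it is worth checking the result against the original page.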

If you work with PDF files a lot and use Python, give this module a try. I hope it can help you automate some tasks and save time as well. If you already use it, let me know about your experience with the module in the comments.
Posted 2021-11-16 02:26:48
@ace108 ·
$0.43
Cool. Looks like this ranks higher than PyPDF2.
Thanks for the information.
Posted 2021-11-17 01:57:12
@geekgirl ·
I was going to try PyPDF2 next. Haven't tried it yet.
Posted 2021-11-17 02:47:18
@anomadsoul · (edited)
$0.66
Learning Python is on my list once I am done with react and nodejs, so I'm saving this post for later (right now I am pretty sure I am just going to be confused and lonely) and I'll be back in around one month (hopefully since I am putting in like 8-10 hours a day to learn to code) to see what's this all about :D
Posted 2021-11-16 20:46:30, edited 2021-11-17 03:49:42
@geekgirl ·
$0.13
I remember seeing that you were learning javascript. I always wanted to learn react too. That is awesome. When you get a chance you should look into threejs. Looking forward to seeing some cool apps from you.
Posted 2021-11-17 02:46:06
@anomadsoul ·
$0.30
I'm still there and damn, I'm loving every step of the way although I'm getting a little too obsessed with progress and some days I go on for too long without breaks, so I gotta pace myself. 
I will definitely check threejs (never heard of it). I hope that at some point in early 2022 I am able to start developing. If so, you are definitely on the list of hivers I'll tell before release :D
Posted 2021-11-17 16:13:33
@benthomaswwd ·
$0.28
Sounds like a handy tool. Thanks for sharing, very informative. Have the best day.
Posted 2021-11-16 12:11:27
@emeka4 ·
$0.24
Thanks for updating us on stuff like this, it's really awesome. We live in a world where technology has gone viral, with the aim of making work easier and faster for us to handle, and it's also nice to learn about Python programming.
Posted 2021-11-16 03:43:45
@gabrielatravels ·
$0.29
I kept saying to myself that I should start learning coding, and especially Python. Everything looks so easy when it's explained by someone else, but when it's your turn, things are different. 🙄
Posted 2021-11-17 06:11:33
@geekgirl ·
You can do it.
Posted 2021-11-18 05:57:06
@videoaddiction ·
$0.40
It has always been a mess to copy text from a PDF file to Word or Notepad. It seems it will be easy to do with Python.
Posted 2021-11-16 06:18:21