Extracting PDF Data With Pdfplumber - Lines, Rectangles, And Crop by geekgirl

![pdfplumber_data.png](https://images.hive.blog/DQmesCaeFmx11bh64otb8SBqoZribdW9eUiBTvH54VtjEBw/pdfplumber_data.png)

In the past I have written about how useful the **pdfplumber** library is for extracting data from pdf files. Its true power becomes evident when dealing with multiple pdf files that have hundreds of pages. When you know what you are looking for, don't want to go through hundreds of pages manually, and have to deal with such files on a daily basis, the best thing to do is to automate. That's what Python is great at. **Pdfplumber**, as the name suggests, works with pdf files and makes it easy to extract data. It works best with machine-generated pdf files rather than scanned ones.

When extracting data from pdf files we can use multiple approaches. If we just need some text, we can start with the simple `.extract_text()` method. However, **pdfplumber** lets us extract all objects in the document, such as images, lines, rectangles, curves, and chars, or we can get all of these objects at once with `.objects`. Sometimes machine-generated pdf files use lines and rectangles to separate the information on the page. This can help us identify the type of text within those lines or rectangles. I recently came across some financial pdf data formatted in such a way. The location of these lines and rectangles can then be used to select the text in that area with **pdfplumber**'s `.crop()` method.

First, let's take a look at basic text extraction with `pdfplumber`. 

```
import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    page1_text = page1.extract_text().split('\n')
    for text in page1_text:
        print(text)
```

We open the file with pdfplumber. `.pages` returns a list of the pages in the pdf, along with all the data within those pages. Since it is a list, we can access the pages one by one. In the example above we are only looking at page one. The `.extract_text()` method gives us all the text of page one as one long string. To separate the text line by line, we use `.split('\n')`. Now that we have a list of lines of text from page one, we can iterate through the list and display each line.
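The same pattern extends from one page to the whole document. The sketch below is only an illustration: `iter_page_lines` is a hypothetical helper name, not part of pdfplumber, and it assumes each page object has an `.extract_text()` method, as pdfplumber pages do. The `or ""` is a guard for pages without text, where the return value may be `None` or an empty string depending on the version. `FakePage` is a stand-in so the sketch runs without a pdf file.

```python
def iter_page_lines(pages):
    """Yield (page_number, line) pairs for every line of text on every page."""
    for page_number, page in enumerate(pages, start=1):
        # Fall back to an empty string if the page has no extractable text.
        text = page.extract_text() or ""
        for line in text.splitlines():
            yield page_number, line


class FakePage:
    """Stand-in for a pdfplumber page so the sketch runs without a pdf file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


# Two fake pages: one with text, one without.
demo_pages = [FakePage("Invoice 42\nTotal: 10.00"), FakePage(None)]
for page_number, line in iter_page_lines(demo_pages):
    print(page_number, line)
```

With a real file, you would pass `pdf.pages` to the helper inside the `with pdfplumber.open(...)` block.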

In most cases, this might be all you need. But sometimes you may want to extract these lines of text and retain the layout formatting. To do this, we add the `layout=True` parameter to the `.extract_text()` method, like this: `page1.extract_text(layout=True).split('\n')`. Be careful when using `layout=True`, because this feature is experimental and not stable yet. It might work in most cases, but sometimes it may return unexpected results.

Now that we know how to extract the text from the page, we can apply some string manipulation and regex to get only the data that we actually need. If we know the exact area on the page where our data is located, we can use the `.crop()` method and extract only that data with the same extraction methods described above.
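As a rough illustration of the string manipulation and regex step, here is a sketch that keeps only the lines ending in a decimal amount. The pattern, the helper name `parse_amount_lines`, and the sample lines are all made up for this example; a real document will need its own pattern.

```python
import re

# Hypothetical pattern: a label, some whitespace, then an amount such as 1,234.56.
AMOUNT_LINE = re.compile(r"^(?P<label>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$")


def parse_amount_lines(lines):
    """Return (label, amount) pairs for lines that end in a decimal amount."""
    results = []
    for line in lines:
        match = AMOUNT_LINE.match(line.strip())
        if match:
            # Drop thousands separators before converting to a number.
            amount = float(match.group("amount").replace(",", ""))
            results.append((match.group("label"), amount))
    return results


# Made-up lines standing in for the output of .extract_text().split('\n').
sample_lines = ["Statement for July", "Opening balance   1,250.00", "Fees  -12.50"]
print(parse_amount_lines(sample_lines))
```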

The **pdfplumber.Page** class has properties like `.page_number`, `.width`, and `.height`. We can use the width and height of the page to determine which area we are going to crop. Let's take a look at a code example using `.crop()`:

```
import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    bounding_box = (200, 300, 400, 450)  # (x0, top, x1, bottom), measured from the top-left corner
    crop_area = page1.crop(bounding_box)
    crop_text = crop_area.extract_text().split('\n')
    for text in crop_text:
        print(text)
```

Once we have our page instance, we use the `.crop(bounding_box)` method. The result is still a page, but one that only covers the area defined by bounding_box. Think of it as a piece of the page that still behaves like a page, so we can apply other methods like `.extract_text()` to it.
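The bounding box passed to `.crop()` is a tuple of `(x0, top, x1, bottom)`, measured from the top-left corner of the page. Instead of hard-coding the numbers, the box can be derived from the page width and height. The helper below is a sketch with a hypothetical name, `fraction_bbox`, and it assumes the width and height are plain numbers.

```python
def fraction_bbox(page_width, page_height, left=0.0, top=0.0, right=1.0, bottom=1.0):
    """Convert fractions of the page (0.0 to 1.0) into an (x0, top, x1, bottom) tuple."""
    for value in (left, top, right, bottom):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0.0 and 1.0")
    if left >= right or top >= bottom:
        raise ValueError("left must be less than right, and top less than bottom")
    return (left * page_width, top * page_height, right * page_width, bottom * page_height)


# Example: the top half of a 612 x 792 point page (US Letter size).
print(fraction_bbox(612, 792, bottom=0.5))
```

With a real page this could be used as `page1.crop(fraction_bbox(page1.width, page1.height, bottom=0.5))`.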

Cropping can be very useful if you know the exact area your text is located in. This feature becomes even more useful when the pdf documents we are working with use lines and rectangles for formatting and separating information. We can extract all the lines and rectangles on the page and get their locations. Using these locations, we can easily identify which area of the page we need to crop. To get the lines on the page, we use the `.lines` property, and to get the rectangles we use the `.rects` property. To see how many lines we have on the page and the properties of a line, we can run the following code.

```
import pdfplumber
import pprint

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    lines = page1.lines
    print(len(lines))
    pprint.pprint(lines[0])
```

The output shows the properties of a line object and their values. Some of them will be useful; others we can ignore.

```
{'bottom': 130.64999999999998,
 'doctop': 130.64999999999998,
 'evenodd': False,
 'fill': False,
 'height': 0.0,
 'linewidth': 1,
 'non_stroking_color': [0.859],
 'object_type': 'line',
 'page_number': 1,
 'pts': [(18.0, 661.35), (590.25, 661.35)],
 'stroke': True,
 'stroking_color': (0, 0, 0),
 'top': 130.64999999999998,
 'width': 572.25,
 'x0': 18.0,
 'x1': 590.25,
 'y0': 661.35,
 'y1': 661.35}
```

Which properties to use will depend on the project. In my case I would be using ***top, bottom, x0, and x1***. The top and bottom values are the same in this example because the line is horizontal (its height is 0.0), but I would still get both values in case a line in a future document is not perfectly flat.
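One detail worth knowing when picking properties: `top` and `bottom` are measured from the top of the page, while `y0` and `y1` are measured from the bottom. In the sample line above, 130.65 plus 661.35 equals 792, which is consistent with a page that is 792 points tall (US Letter). The page height is my inference from those numbers, not something the output states, so treat it as an assumption in this small check.

```python
def top_from_y(page_height, y1):
    """Distance from the top of the page, given a y value measured from the bottom."""
    return page_height - y1


# Values copied from the sample line above; the page height of 792 is an assumption.
sample_line = {"top": 130.64999999999998, "y0": 661.35, "y1": 661.35}
assumed_page_height = 792

computed_top = top_from_y(assumed_page_height, sample_line["y1"])
# Compare with a small tolerance because these are floating point values.
print(computed_top, abs(computed_top - sample_line["top"]) < 1e-6)
```

Since `.crop()` expects top-based coordinates, `top` and `bottom` are the values to plug into a bounding box, not `y0` and `y1`.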

We get the rectangles on the page the same way as we did with lines, just changing the property to `.rects`. With rects, the top and bottom values will differ because a rectangle has a height. Now that we have the coordinates of the area we need to crop and extract text from, we just plug the values we get from `.lines` and `.rects` into our bounding_box for the `.crop()` method.
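Here is a sketch of how the values from `.lines` and `.rects` could be turned into bounding boxes. The helper names are hypothetical, and the functions only rely on the dictionary keys shown in the line output earlier (`x0`, `x1`, `top`, `bottom`), so they run on plain dictionaries without a pdf file. The second demo line is made up for illustration.

```python
def rect_to_bbox(rect):
    """Build a crop box (x0, top, x1, bottom) from a rect-like dictionary."""
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def horizontal_lines(lines, tolerance=0.5):
    """Keep lines whose top and bottom nearly coincide, sorted from top to bottom."""
    flat = [line for line in lines if abs(line["bottom"] - line["top"]) <= tolerance]
    return sorted(flat, key=lambda line: line["top"])


def bbox_between_lines(upper, lower):
    """Build a crop box covering the area between two horizontal lines."""
    if upper["bottom"] >= lower["top"]:
        raise ValueError("upper line must sit above lower line")
    x0 = min(upper["x0"], lower["x0"])
    x1 = max(upper["x1"], lower["x1"])
    return (x0, upper["bottom"], x1, lower["top"])


# Dictionaries shaped like pdfplumber line objects; the first one is made up.
demo_lines = [
    {"x0": 18.0, "x1": 590.25, "top": 210.0, "bottom": 210.0},
    {"x0": 18.0, "x1": 590.25, "top": 130.65, "bottom": 130.65},
]
upper, lower = horizontal_lines(demo_lines)[:2]
print(bbox_between_lines(upper, lower))
```

With a real page, the lines would come from `page1.lines`, and the resulting tuple would go straight into `page1.crop(...)` followed by `.extract_text()`.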

I just started using these features of **pdfplumber** today, and so far everything is working great and I have not seen any issues yet. If you work with many pdf files to extract data, and these documents have repeating lines and rectangles that separate information, you too may find **pdfplumber** useful for automating these tasks. Let me know your thoughts and experiences about text extraction from pdf documents in the comments.

Pdfplumber has great documentation. Feel free to visit the GitHub page: https://github.com/jsvine/pdfplumber
Posted by @geekgirl on 2022-08-02 in python. Tags: python, pdfplumber, coding, programming, vyb, proofofbrain, stem, neoxian.
@adolf39 · (edited)
Fantastic tutorial on extracting PDF data with Pdfplumber! The step-by-step guide to working with lines, rectangles, and crop features is incredibly helpful. For those looking to take their PDF manipulation to the next level, I highly recommend checking out https://pdfflex.com/png-to-pdf – a free PDF converter that simplifies editing, merging, and compressing PDFs with just a few clicks. It's a game-changer! 
@chinito ·
that is neat! very helpful tool! 😉🤙
@diyhub ·
<div class="pull-right"><a href="https://steempeak.com/trending/hive-189641"><img src="https://cdn.steemitimages.com/DQmV9e1dikviiK47vokoSCH3WjuGWrd6PScpsgEL8JBEZp5/icon_comments.png"></a></div>

###### Thank you for sharing this amazing post on HIVE!

- Your content got selected by our fellow curator @priyanarc & you just received a little thank you via an upvote from our **non-profit** curation initiative!

- You will be **featured in** one of our recurring **curation compilations** and on our **pinterest** boards! Both are aiming to offer you a **stage to widen your audience** within and outside of the DIY scene of hive.

**Join** the official [DIYHub community on HIVE](https://peakd.com/trending/hive-189641) and show us more of your amazing work and feel free to connect with us and other DIYers via our discord server: https://discord.gg/mY5uCfQ !

If you want to support our goal to motivate other DIY/art/music/homesteading/... creators just delegate to us and earn 100% of your curation rewards!

###### Stay creative & hive on!
@emeka4 ·
This is really nice @geekgirl and thanks for sharing
@garryob ·
Merging PDF files sometimes takes a lot of time but it is still a solvable problem. While Adobe Acrobat Pro DC seems like an obvious choice, its capabilities fall short of expectations. I use Guru's feature-rich PDF converter, this tool not only flattens PDFs but also bypasses the file size and usage restrictions faced by other online sources. All the tricks and innovations of this file conversion technology are described in the blog https://pdfguru.com/blog/pdf-history-and-future .
Therefore, using a PDF converter, you can quickly and efficiently combine PDF files and solve the problem associated with the complexity of layers.
@hivebuzz ·
Congratulations @geekgirl! You have completed the following achievement on the Hive blockchain and have been rewarded with new badge(s):

<table><tr><td><img src="https://images.hive.blog/60x70/http://hivebuzz.me/@geekgirl/upvoted.png?202208020713"></td><td>You received more than 180000 upvotes.<br>Your next target is to reach 190000 upvotes.</td></tr>
</table>

<sub>_You can view your badges on [your board](https://hivebuzz.me/@geekgirl) and compare yourself to others in the [Ranking](https://hivebuzz.me/ranking)_</sub>
<sub>_If you no longer want to receive notifications, reply to this comment with the word_ `STOP`</sub>



**Check out the last post from @hivebuzz:**
<table><tr><td><a href="/hive-122221/@hivebuzz/pum-202207-result"><img src="https://images.hive.blog/64x128/https://i.imgur.com/mzwqdSL.png"></a></td><td><a href="/hive-122221/@hivebuzz/pum-202207-result">Hive Power Up Month Challenge 2022-07 - Winners List</a></td></tr><tr><td><a href="/hive-122221/@hivebuzz/pum-202208"><img src="https://images.hive.blog/64x128/https://i.imgur.com/M9RD8KS.png"></a></td><td><a href="/hive-122221/@hivebuzz/pum-202208">The 8th edition of the Hive Power Up Month starts today!</a></td></tr><tr><td><a href="/hive-122221/@hivebuzz/pud-202208"><img src="https://images.hive.blog/64x128/https://i.imgur.com/805FIIt.jpg"></a></td><td><a href="/hive-122221/@hivebuzz/pud-202208">Hive Power Up Day - August 1st 2022</a></td></tr></table>
@iykewatch12 ·
You have widened my horizons with this information. I will use this system to get pdf data whenever I have the need. Thank you a lot.
@lhes ·
I am not that good with regards to things like this.
Thank you for sharing
@maggard ·
Great information.  Thank you.
@marshallsss45 ·
I will say that for those who deal with a large number of scanned documents, PDF Harvester from CoolUtils on the website https://www.coolutils.com/PDFCombine will be a real godsend, which is much easier to use. Not only does it merge files, but it also automatically removes those annoying blank pages. Saved me a lot of time and will definitely do the same for you. You can start by using the free version on the website.
@shohana1 ·
>Pdfplumber has great documentation

Agreed on that, and GitHub is a great source for collecting resources. Thanks for sharing such a helpful blog with us.
@stemsocial ·
<div class='text-justify'> <div class='pull-left'>
 <img src='https://stem.openhive.network/images/stemsocialsupport7.png'> </div>

Thanks for your contribution to the <a href='/trending/hive-196387'>STEMsocial community</a>. Feel free to join us on <a href='https://discord.gg/9c7pKVD'>discord</a> to get to know the rest of us!

Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).

You may also include @stemsocial as a beneficiary of the rewards of this post to get a stronger support.&nbsp;<br />&nbsp;<br />
</div>
@videoaddiction ·
Extracting text from a PDF is a real mess. With pdfplumber, we can also extract the tables or shapes from a PDF page. Perhaps it will become much more capable of handling scanned PDFs after some further development.