PDF to Excel conversion: Your ultimate guide to the best tools

Need to extract data from PDF files into a spreadsheet so you can analyze it? Find out how seven PDF to Excel conversion tools fared in head-to-head tests with increasingly complex data sources.

Sasirin Pamai / Shutterstock / Rob Schultz

In an ideal world, the data we need to analyze would be available in a ready-to-use format. In the world we live in, though, a lot of valuable data is locked inside Portable Document Format (PDF) documents. How do you extract that data from PDFs into an Excel spreadsheet? You have a number of PDF to Excel converters to choose from.

There’s software from major vendors like Microsoft and Adobe, task-specific cloud services including PDFTables and Cometdocs, services from general-purpose cloud providers such as Amazon, and even free open-source options.

Which is the “best” PDF to Excel converter? As with the “best computer,” the answer depends on your specific circumstances.

3 questions to ask when choosing a PDF to Excel converter

There are several important considerations when selecting a PDF converter.

1. Was my PDF generated by an application or is it a scanned image? There are two types of PDF files. One is generated by an application like Microsoft Word; the other comes from a scanned or other image file. You can tell which one you have by trying to highlight some text in the document. If a click and drag works to highlight text, your PDF is app-generated. If it doesn’t, you’ve got a scan. Not all PDF conversion tools work on scanned PDFs.

2. How complex is the data structure? Almost every tool will work well on a simple one-page table. Things get more complicated if tables are spread over multiple pages, table cells are merged, or some data within a table cell wraps over multiple lines.

3. Do I have a large number of files that need batch file conversions or automation? Our best-performing tool on app-generated PDFs may not be the best choice for you if you want to automate frequent batch conversions.

In addition, as with any software choice, you need to decide how much you value performance versus cost and ease of use.
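
For question 1, the click-and-drag check can also be scripted when there are many files to sort. The sketch below is an illustration only and was not part of the testing described here. It assumes the third-party pypdf library is installed; the function names and the 20-character threshold are invented for this example. A scan that has already been run through OCR will also report text, so treat the result as a hint.

```python
from typing import Iterable, List


def looks_app_generated(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True if the pages hold enough extractable text to suggest a text layer.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the text of each page with pypdf (assumed installed; imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Pages without a text layer (typical of scans) yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `looks_app_generated(extract_page_texts("report.pdf"))`, where the filename is a placeholder, returns `True` when pypdf finds a usable text layer.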

How we tested the converter tools

To help you find what’s best for your tasks, we tested seven PDF to Excel conversion tools using four different PDF files ranging from simple to nightmare. You’ll see how all the tools perform in each scenario, and find out the strengths and weaknesses of each one.

PDF to Excel converters we tested

Here are the tools we tested, starting with our overall best performers (but remember that “best” depends in part on the specific source document). All these tools did pretty well on at least some of our tasks, so ratings range from “Excellent” to “Good.”

Adobe Acrobat Export PDF subscription

Adobe created the Portable Document Format standard, so you’d expect it to do well in parsing PDFs, and it does. A full-featured conversion subscription is somewhat pricey, but there’s also an inexpensive $2/month plan (annual subscription required) that includes an unlimited number of PDF to Excel conversions. (You can output Microsoft Word files with this tool as well.)

The Excel conversions include any text on pages that have both text and tables. This can be a benefit if you’d like to keep that context or a drawback if you just want data for further analysis.

Rating: Excellent. Our hands-down winner for non-scanned PDFs.

Cost: $24/year

Pros: Outstanding results; preserves much of the original formatting; deals well with tables spanning multiple pages; unlimited conversions of files up to 100MB; affordable for frequent users.

Cons: No built-in scripting/automation workflow; expensive if you only convert a few documents a year.

Bottom line: If you don’t need to script or automate a lot of conversions and don’t mind paying $24 per year, this is a great choice.

Amazon AWS Textract

For an AWS cloud service, Textract is surprisingly easy to use. While you certainly can go through the usual multi-step AWS setup and coding process for Textract, Amazon also offers a drag-and-drop web demo that lets you download results as zipped CSVs. You just need to sign up for a (free) Amazon AWS account.

Rating: Excellent. This was our best option for a complicated scanned PDF.

Cost: 1.5 cents per page (100 pages per month free for your first three months at AWS)

Pros: Best option tested for a complicated scanned PDF; performed extremely well on all the app-generated PDFs; offers a choice of viewing results with merged or unmerged cell layout; easy to use; affordable.

Cons: Uploaded files are limited to 10 pages at a time. For those who want to automate, using this API is more complicated than some other options.

Bottom line: An excellent choice if you don’t mind the AWS setup and either manual upload or coding with a complex API.
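
The 10-page upload limit means longer PDFs have to be split before they go into the web demo. Here is a minimal sketch of one way to do that, not the method used in these tests. `chunk_page_ranges` is plain Python; `split_pdf` assumes the third-party pypdf library is installed, and the output naming scheme is invented for illustration.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, limit: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges of at most `limit` pages."""
    if total_pages < 0 or limit < 1:
        raise ValueError("total_pages must be >= 0 and limit must be >= 1")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


def split_pdf(pdf_path: str, out_prefix: str, limit: int = 10) -> List[str]:
    """Write one PDF per chunk using pypdf (assumed installed; imported lazily)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out_paths = []
    for first, last in chunk_page_ranges(len(reader.pages), limit):
        writer = PdfWriter()
        for index in range(first - 1, last):  # pypdf pages are 0-indexed
            writer.add_page(reader.pages[index])
        out_path = f"{out_prefix}-{first:03d}-{last:03d}.pdf"  # hypothetical naming
        with open(out_path, "wb") as handle:
            writer.write(handle)
        out_paths.append(out_path)
    return out_paths
```

For an 11-page file, `chunk_page_ranges(11)` returns `[(1, 10), (11, 11)]`, so the last page ends up in its own upload.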

Tabula

If you’re looking for free and open source, give Tabula a try. Unlike some free options from the Python world, Tabula is easy both to install and to use. And it has both a command-line and a browser interface, making it equally useful for batch conversions and point-and-click use.

Tabula did very well on PDFs of low or moderate complexity, although it did have an issue with the complex one (as did many of the paid platforms). Tabula requires a separate Java installation on Windows and Linux.

Rating: Very good, and you can’t beat the price.

Cost: Free

Pros: Free; easy to install; has both a GUI and scripting options; lets you manually change which areas of the page should be analyzed for tables; can save results as a CSV, TSV, JSON, or script; offers two different data extraction methods.

Cons: Needed some manual data cleanup on complex formatting; works on app-generated PDFs only.

Bottom line: A good choice if cost, ease of use, and automation options are high on your list of desired features and your PDFs aren’t scanned.

PDFTables

A key advantage of this service is automation. Its API is well documented and supports everything from Windows PowerShell and VBA (Office Visual Basic for Applications) to programming languages like Java, C++, PHP, Python, and R.

PDFTables performed well on most of the app-generated PDF tables, even understanding that a two-column header would be best as a single-column header row. It did have some difficulty with data in columns that were mostly empty but also had some data in cells spread over two lines. And while it choked on the scanned nightmare PDF, at least it didn’t charge me for that.

Rating: Very good overall; excellent on automation.

Cost: 50 pages free at signup, including API use. After that it’s $40 for up to 1,000 pages, and your credits are only good for a year.

Pros: Very good API; better performance on the moderately complex PDF than several of its paid rivals.

Cons: Pricey, especially if you use more than the 50 free pages but fewer than 1,000 page conversions in a year. Doesn’t work on scanned PDFs.

Bottom line: Performs well and is easy to use both on the web and through scripting and programming. If you don’t need a sophisticated API, however, you may prefer a less expensive option.
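
To see how the prices quoted so far compare at a given yearly volume, a rough calculation helps. The sketch below uses only the figures listed above for Adobe, AWS Textract, and PDFTables. It makes two simplifying assumptions that are mine, not the vendors’: Textract’s introductory free pages are ignored, and every additional PDFTables block of up to 1,000 pages is priced at the same $40. Real price lists may differ and change over time.

```python
import math
from decimal import Decimal

ADOBE_PER_YEAR = Decimal("24.00")          # flat annual plan
TEXTRACT_PER_PAGE = Decimal("0.015")       # 1.5 cents per page
PDFTABLES_FREE_PAGES = 50                  # free pages at signup
PDFTABLES_BLOCK_PRICE = Decimal("40.00")   # per block of up to 1,000 pages
PDFTABLES_BLOCK_PAGES = 1000


def yearly_cost(pages: int) -> dict:
    """Estimate the yearly cost in dollars of converting `pages` pages with each service."""
    paid_pages = max(0, pages - PDFTABLES_FREE_PAGES)
    # Assumption: each further block of pages costs the same as the first.
    blocks = math.ceil(paid_pages / PDFTABLES_BLOCK_PAGES)
    return {
        "Adobe Export PDF": ADOBE_PER_YEAR,
        "AWS Textract": TEXTRACT_PER_PAGE * pages,
        "PDFTables": PDFTABLES_BLOCK_PRICE * blocks,
    }
```

Under these assumptions, 200 pages a year works out to $3 with Textract, $24 with Adobe, and $40 with PDFTables, and Textract matches Adobe’s flat fee at 1,600 pages.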

PDFtoExcel.com

This is a freemium platform with paid options. It proved to be the lone free choice that was able to handle our scanned nightmare PDF.

Rating: Good.

Cost: Free in the cloud; $5/month or $49/year for premium cloud with batch conversions and faster service; desktop software $35 for 30-day use or $150 lifetime.

Pros: A lot of capability for the free option; works on scanned PDFs; affordable.

Cons: No API or cloud automation (we didn’t test the desktop software); paid option required for batch conversions; split single-row multi-line data into multiple rows.

Bottom line: Nice balance of cost and features. This was most compelling for complex scanned PDFs, but others did better when cell data ran across multiple lines.

Cometdocs

This web-based service is notable for multiple file format conversions: In addition to producing Excel, it can download results as Word, PowerPoint, AutoCAD, HTML, OpenOffice, and others. Free accounts can convert up to five files per week (30MB each); paid users get an unlimited number of conversions (2GB/day data limit).

Cometdocs is a supporter of public service journalism; the service offers free premium accounts to Investigative Reporters & Editors members (disclosure: I have one).

Rating: Good.

Cost: 5 free conversions/week; otherwise $10/month, $70/year, or $130 “lifetime.”

Pros: Works on scanned PDFs; multiple input and output formats; generally good results; did extremely well on a 2-page PDF with complex table format.

Cons: Not as robust on complex scanned PDFs as some other options; split one row’s multi-line data into multiple rows; no clear script/automation option.

Bottom line: Particularly compelling if you’re interested in multiple format exports and not just Excel.

Microsoft Excel

Many people don’t know that Excel can import PDFs directly, but only if you’ve got a Microsoft 365 or Office 365 subscription on Windows. It was a good choice for the simple file but got more cumbersome to use as PDF complexity rose. It’s also likely to be confusing to people who aren’t familiar with Excel’s Power Query / Get & Transform interface.

How to import a PDF directly into Excel: In the Ribbon toolbar, go to Data > Get Data > From File > From PDF and select your file. For a single table, you’ll likely have one choice to import. Select it and you should see a preview of the table and an option to either load it or transform the data before loading. Click Load and the table will pop into your Excel sheet.

For a single table on one page, this is a quick and fairly simple choice. If you have multiple tables in a multi-page PDF, this also works well, as long as each table is confined to one page. Things get a bit more complicated if you’ve got one table over multiple PDF pages, though, and you’ll need knowledge of Power Query commands.

It’s somewhat unfair to compare Power Query data transformation with the other tools, since the results of any of these other PDF to Excel converters can be imported into Excel for Power Query wrangling.

Rating: Good.

Cost: Included in a Microsoft 365/Office 365 Windows subscription.

Pros: You don’t have to leave Excel to deal with the file; a lot of built-in data wrangling available for those who know Power Query.

Cons: Complex to use compared with most others on all but the simplest of PDFs; doesn’t work on scanned PDFs; requires a Microsoft 365/Office 365 subscription on Windows.

Bottom line: If you’ve already got Microsoft 365/Office 365 on Windows and you’ve got a simple conversion task, Excel is worth a try. If you already know Power Query, definitely consider this for more PDF conversions! (If you don’t, Power Query is a great skill to learn for Excel users in general.) If your PDF is more challenging and you don’t already use Power Query / Get & Transform, though, you’re probably better off with another option.

PDF to Excel test results

Here’s how the seven tools fared in our four conversion tests:

1. Simple PDF to Excel challenge

Our “simple” task was a single-page app-generated PDF pulled from page 5 of a Boston housing report. It contained one table and some text, but column headers and two data cells did include wrapped text over two lines.

All the platforms we tested handled this one well. However, several broke up the multi-line text into multiple rows. The problem was easy to spot and fix in this example, but it could be difficult in larger files. For this simple one-pager, though, the PDF to Excel converters that weren’t in first or second place still had very good results. All were worth using for this type of conversion.

First place: A tie between Adobe and AWS Textract. With Adobe, no data cleanup was needed. The column headers even had the color formatting of the original. Adobe’s conversion included text (with nice formatting), which is useful if you want to keep written explanations along with the data in Excel. You’d need to delete the text manually if you want data only, but that’s simple enough.

AWS Textract converted data only. No data cleanup was needed.

Close second: Excel. Data only. Excel didn’t break wrapped text into two rows, but it did appear to run text together without a space in multi-line cells. The data was actually correct, though, when you looked at it in the formula bar; it just appeared wrong in the overall spreadsheet. This was easily fixed by formatting cells with “wrap text.” However, not everyone might know to do that when looking at their spreadsheet.

Others:

PDFTables: returned data and text. Same issue as Excel, appearing to keep wrapped text on a single line with no space between words. This was also easily fixed by wrapping text, if you knew to do so. This result would also need cleanup of a couple of words from a logo that appeared below the data. Explanatory text outside the logo had no problems, though.

Tabula: data only. Split multi-line cells into multiple rows.

Cometdocs: data and text. Split multi-line cells into multiple rows. Surrounding text was accurate, including logo text.

PDFtoExcel.com: similar performance to Cometdocs.
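
When a converter splits a wrapped cell into several rows, the extra rows can sometimes be folded back in with a short script instead of by hand. The sketch below rests on a heuristic that is an assumption on my part, not something any of these tools document: a row whose first cell is empty is treated as a continuation of the row above it. The function name and the sample rows in the usage note are invented for illustration and are not taken from the Boston report.

```python
from typing import List


def merge_continuation_rows(rows: List[List[str]]) -> List[List[str]]:
    """Fold rows with an empty first cell into the previous row.

    Heuristic only: it fails if the first column itself wraps or is
    legitimately blank, so spot-check the result against the PDF.
    """
    merged: List[List[str]] = []
    for row in rows:
        is_continuation = bool(merged) and bool(row) and not row[0].strip()
        if not is_continuation:
            merged.append(list(row))  # copy so the input is not modified
            continue
        previous = merged[-1]
        for index, cell in enumerate(row):
            text = cell.strip()
            if not text:
                continue
            while len(previous) <= index:  # pad short rows before writing
                previous.append("")
            previous[index] = f"{previous[index]} {text}".strip()
    return merged
```

Given `[["Area", "Median"], ["", "rent"], ["Allston", "1,900"]]`, the function returns `[["Area", "Median rent"], ["Allston", "1,900"]]`.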

2. Moderate PDF to Excel challenge

Our moderate PDF challenge was a single app-generated table spanning multiple PDF pages, via the Boston-area Massachusetts Water Resources Authority data monitoring wastewater for Covid-19 traces.

First place: Adobe. One of the few to recognize that all the pages were the same table, so there were no blank rows between pages. Headers were in a single row and spaces between words in the column names were maintained. Data structure was excellent, including keeping the multi-line wrap as is. It even reproduced background and text colors. The 11-page length wasn’t a problem.

Second: AWS Textract. Header row was correct. Each page came back as a separate table, although it would be easy enough to combine them. The one strange issue: There were apostrophes added at the beginning of the cells, possibly due to how I split the PDF, since I needed to create a file with only 10 pages. However, those apostrophes were easy to see and remove with a single search and replace, since the data didn’t include any words with apostrophes. It was easier to get the exact data I needed than with Tabula, but more cumbersome to get the full data set.
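
If the data did contain words with apostrophes, a blanket search and replace would damage them. A more cautious alternative, sketched here with Python’s standard csv module, removes only an apostrophe that is the first character of a cell. The function name is made up for this example, and the sample values in the usage note are illustrative, not from the wastewater file.

```python
import csv
import io


def strip_leading_apostrophes(csv_text: str) -> str:
    """Remove one leading apostrophe from each cell, leaving other apostrophes alone."""
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [cell[1:] if cell.startswith("'") else cell for cell in row]
        )
    return output.getvalue()
```

For instance, the text `'Sample,'O'Brien` comes back as `Sample,O'Brien`, with the apostrophe inside the name intact.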

Close third: Tabula. No blank rows between pages, data in the correct columns, wrapped cells stayed in a single row. Unfortunately, while the wrapped data appeared properly when you looked at the cell contents in the formula bar, once again the data seemed to merge together in the full spreadsheet, and this wasn’t as easily fixed by formatting with text wrapping as with Excel and PDFTables in the simple PDF.

For example, this was the content of one cell as it appeared in the formula bar:

B.1.1.7

76%

But in the overall spreadsheet, that same cell looked like

B.1.1.776%

I was able to get that to display properly at times by increasing the row height manually, but this was an added step that most people wouldn’t know to do, and it didn’t seem to work all the time.
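
Another way around the display problem is to replace the line breaks inside each cell with a visible separator before the data reaches Excel, for example when working from Tabula’s CSV output. This is a minimal sketch of that idea rather than a step from these tests; the function names and the choice of a single space as the separator are assumptions for illustration.

```python
from typing import List


def flatten_cell(value: str, separator: str = " ") -> str:
    """Join the lines of a multi-line cell with `separator`, dropping blank lines."""
    parts = [part.strip() for part in value.splitlines()]
    return separator.join(part for part in parts if part)


def flatten_rows(rows: List[List[str]], separator: str = " ") -> List[List[str]]:
    """Apply flatten_cell to every cell in a table of strings."""
    return [[flatten_cell(cell, separator) for cell in row] for row in rows]
```

With this, `flatten_cell("B.1.1.7\n76%")` returns `"B.1.1.7 76%"`, which reads correctly whether or not text wrapping is switched on.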

Others:

PDFtoExcel.com: several problems. The first few pages were fine except for multi-row headers, but data over two lines in single cells broke into two rows in the data, producing blank rows elsewhere that would need to be fixed. In addition, columns were shifted to the right in one section. This would need cleanup.

PDFTables: several problems. All the data came in fine for most of the pages, but toward the end, a few cells that should have been in column J got merged with column I in ways that would be harder to fix than PDFtoExcel’s. For example, this single cell:

Omicron

559 23%

was supposed to be 559 in one cell and Omicron 23% in the next cell.

Cometdocs: failed. Conversion failed on the full PDF and even the 10-page version I uploaded to AWS. It was able to convert a version with just the first five pages, but the full file should have been well below Cometdocs’ account limits.

Excel: it was possible to get the data in a format I wanted, but it required data manipulation in Power Query as well as wrapping text. That’s not a fair comparison with other platforms that were a single upload or command. Still, results were ultimately excellent. If you’re an Excel/Power Query power user, this is a good choice.

3. Complex PDF to Excel challenge

Local election results are some of my favorite examples of analysis-hostile public data. The app-generated PDF from Framingham, Mass., shown below was only 3 pages but had table formatting that was not designed for ease of data import. Is there a PDF conversion tool that can handle it?

IDG

Page 1 of the PDF showing recent election results for Framingham, Mass.

First place: A tie between Adobe and PDFtoExcel.com. Adobe returned an Excel file in perfect format, complete with original cell colors.

While PDFtoExcel.com’s spreadsheet didn’t have the pretty formatting of Adobe, all the data came in accurately, and it was usable as is.

Others:

AWS Textract: fair. Results came back in 5 tables. In one case, you’d need to copy and paste them together manually and look at the original to make sure you were doing so correctly.

PDFTables: poor. Data came back, but some in the wrong columns, whether I tried to download as multiple sheets or one sheet. This would need manual checking and cleanup.

Tabula: poor. Similar problem as PDFTables with some data in the wrong columns, but at least I didn’t have to pay for it. I tried both the Stream and Lattice extraction methods, and both had some wrong-column issues (although the issues were different).

Cometdocs: conversion failed.

4. Nightmare PDF to Excel challenge

Our nightmare comes courtesy of a presentation at this year’s National Institute for Computer-Assisted Reporting conference, as an example of data that would be useful for training students, if it were in a format that could be easily analyzed. It’s a multi-page scanned PDF with four months of data from the federal Refugee Processing Center on refugee arrivals by country of origin and U.S. state of destination.

This PDF’s challenges range from multi-page tables to lots of merged columns. In addition, the table on page 1 proved to be somewhat different from tables on the other pages, at least in terms of how several tools were able to handle them, even though they look the same.

I only tested the first 10 pages because of the AWS 10-page limit, to be fair to all the tools.

First place: AWS Textract. By far the best of the group. This is where sophisticated machine learning is an advantage. Results are downloaded as a zipped file of multiple CSVs, one for each page. Instead of manually importing these files one by one into Excel, though, you can go to Data > Get Data > From File > From Folder and select the folder with these newly unzipped CSVs (after removing a few extra files, such as one with metadata and another with all the text in text format).

I also had to rename file table-10 to table-910, because the import was putting table-10 right after table-1 and before table-2 (sorting the filenames alphabetically, so -10 did not come after -9). While I didn’t check all the data, the rows I spot-checked were all accurate. This was the easiest way to gather all the pages into a single, usable spreadsheet.
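
The renaming step can be avoided by combining the CSVs outside Excel with a sort that compares the numbers in the filenames as numbers. The sketch below uses only Python’s standard library. The `table-*.csv` pattern is an assumption based on the file names mentioned above, the function names are invented for this example, and any header row repeated on each page is kept as is, so it would still need removing.

```python
import csv
import re
from pathlib import Path
from typing import List


def natural_key(name: str) -> list:
    """Split a name into text and integer chunks so table-10 sorts after table-9."""
    return [
        int(chunk) if chunk.isdigit() else chunk.lower()
        for chunk in re.split(r"(\d+)", name)
    ]


def combine_csvs(folder: str, pattern: str = "table-*.csv") -> List[List[str]]:
    """Read every matching CSV in page order and return all rows as one list."""
    rows: List[List[str]] = []
    paths = sorted(Path(folder).glob(pattern), key=lambda path: natural_key(path.name))
    for path in paths:
        with path.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.reader(handle))
    return rows
```

Sorting `["table-1.csv", "table-10.csv", "table-2.csv", "table-9.csv"]` with `key=natural_key` gives `["table-1.csv", "table-2.csv", "table-9.csv", "table-10.csv"]`.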

Second: A tie between Cometdocs and PDFtoExcel.com. Cometdocs had unexplained “Grand” lines on each of the pages (a single row with just “Grand” in the first column), but otherwise the data looked very good. And it came in a single file.

Including text on each page could be a benefit or a drawback, depending on your needs. Context is useful, but combining the data into one table would definitely take more work.

PDFtoExcel.com’s performance was similar to Cometdocs’, including the extra “Grand” lines.

Others:

Adobe: fair. While a lot of the data was fine, results merged some data on the first page and would need manual checking and cleanup.

PDFTables: doesn’t work on scanned PDFs without using OCR software first, although at least it doesn’t charge account credits when tables aren’t detected.

Tabula: doesn’t work on scanned PDFs without using OCR software first.

PDF conversion tools we didn’t test

There are plenty of other useful tools out there that require a bit more setup work or some significant coding in order to turn your PDF into data you can analyze. If none of the ones we tested work for you, though, here are a few others:

Excalibur: I received a tip about this one, a web interface built for the Camelot Python library. However, it involves installing several dependencies, and that installation may be a turn-off for those who don’t already have Python experience (even though you don’t need to run Python in order to use it).

Google Cloud Document AI: Google Cloud’s Document AI setup is somewhat more involved than using AWS Textract. Plus, while you can upload a test document of up to 5 pages to see what’s extracted, I didn’t see an obvious way to download the results through the web interface as a CSV or Excel file; the only option was JSON. You’d probably need a programming language like Python or R to use this effectively.

Microsoft Azure Form Recognizer: This is also more complex to set up than AWS Textract, but if you’re already an Azure user, it’s probably worth a look. 500 free pages per month.
