How to extract text from PDF file using iTextSharp with C#

In this tutorial, I explain how to extract text from a PDF file using iTextSharp with C# in ASP.NET. The steps are given below.

Creating ASP.NET Empty Application

Create an empty ASP.NET Web Forms project as follows.
Go to File → New → Project. A new window will open.
Now go to Web → Visual Studio 2012 → select .NET Framework 4.5 → select ASP.NET Empty Web Application, give the project a name, and click OK.

An empty ASP.NET project will now be created. Add a new web form to the application.

Installing iTextSharp

The next step is to add an iTextSharp reference to your application. There are two ways to do this.

First: Download from the Internet
Click the link below to download the DLL.
https://github.com/itext/itextsharp
Once the file is downloaded, extract it; you will find 6 more .rar files. Extract itextsharp-dll-core.rar, then add a reference to itextsharp.dll in your project.

Second: NuGet Package Manager
Go to TOOLS → Library Package Manager → Manage NuGet Packages for Solution... and a new window will open. Search for iTextSharp and click the Install button. Once it has installed successfully, you can see iTextSharp in the References folder.

You can also install it from the Package Manager Console. Go to TOOLS → Library Package Manager → Package Manager Console, type Install-Package iTextSharp, and press Enter. This installs iTextSharp in the application.

In aspx file

In the .aspx file, create two Button controls: the first generates the PDF file and the second extracts text from it. Add one TextBox control to display the extracted text. The markup looks like this.

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="WebForm1.aspx.cs" Inherits="WebApplication1.WebForm1" %>

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
        <div>
            <table>
                <tr>
                    <td><b>Extract Text from PDF file using iTextSharp</b></td>
                </tr>
                <tr>
                    <td>
                        <asp:Button ID="btnGeneratePDF" runat="server" Text="Generate PDF File" OnClick="btnGeneratePDF_Click" />
                    </td>
                </tr>
                <tr>
                    <td>
                        <asp:Button ID="btnExtract" runat="server" Text="Extract Text From PDF File" OnClick="btnExtract_Click" />
                    </td>
                </tr>
                <tr>
                    <td>
                        <asp:TextBox ID="TextBox1" runat="server" TextMode="MultiLine" Style="width: 500px; min-height: 150px;"></asp:TextBox>
                    </td>
                </tr>
            </table>
        </div>
    </form>
</body>
</html>

C# Code

Complete C# code is given below.

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

namespace WebApplication1
{
    public partial class WebForm1 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
        }

        protected void btnGeneratePDF_Click(object sender, EventArgs e)
        {
            string filePath = Server.MapPath("Example.pdf");

            // Remove any PDF left over from a previous run
            if (File.Exists(filePath))
            {
                File.Delete(filePath);
            }

            // Create the PDF file and save it to the root directory of the application
            using (FileStream fs = new FileStream(filePath, FileMode.Create))
            {
                Document doc = new Document();
                PdfWriter.GetInstance(doc, fs);
                doc.Open();

                Paragraph page = new Paragraph("This is first page (page number 1)");
                doc.Add(page);

                // First paragraph, highlighted in yellow
                Paragraph para1 = new Paragraph();
                Chunk c1 = new Chunk(@"This is first Paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph. This is first paragraph.");
                c1.SetBackground(BaseColor.YELLOW);
                para1.Add(c1);
                doc.Add(para1);

                // Second paragraph, highlighted in green
                Paragraph para2 = new Paragraph();
                Chunk c2 = new Chunk(@"This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph. This is second paragraph.");
                c2.SetBackground(BaseColor.GREEN);
                para2.Add(c2);
                doc.Add(para2);

                // Closing the document writes the PDF to the stream
                doc.Close();
            }
        }

        protected void btnExtract_Click(object sender, EventArgs e)
        {
            //string FilePath = @"H:\Demo\WebApplication1\WebApplication1\Example.pdf";

            string FilePath = Server.MapPath("Example.pdf");

            if (File.Exists(FilePath))
            {
                string ExtractedData = string.Empty;

                using (PdfReader reader = new PdfReader(FilePath))
                {
                    ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();

                    // 1. If the PDF document has only one page.
                    // The second parameter is the PDF page number (pages start at 1).
                    ExtractedData = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);

                    /*// 2. If the PDF document has more than one page,
                    // iterate through all pages and append each page's text.
                    // A new strategy is used per page so earlier pages are not repeated.
                    for (int i = 1; i <= reader.NumberOfPages; i++)
                    {
                        ExtractedData += PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
                    }*/

                    /*// If a single PDF page has more than one paragraph,
                    // split the extracted text on newlines
                    ExtractedData = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
                    string[] lines = ExtractedData.Split('\n');
                    StringBuilder sb = new StringBuilder();
                    foreach (string line in lines)
                    {
                        //
                    }*/
                }
                TextBox1.Text = ExtractedData;
            }
        }
    }
}
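
The commented-out splitting step in the listing above leaves the loop body empty. As a minimal sketch of one way to fill it in, the helper below breaks extracted text into non-empty, trimmed lines. The class and method names (ExtractedTextHelper, SplitLines) are hypothetical and not part of iTextSharp. Note that the extracted text generally contains one entry per visual line, so a paragraph that wraps in the PDF will likely come back as several lines rather than one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class ExtractedTextHelper
{
    // Splits extracted page text on '\n' and returns the non-empty, trimmed lines.
    public static List<string> SplitLines(string extractedText)
    {
        List<string> result = new List<string>();
        if (string.IsNullOrEmpty(extractedText))
        {
            return result;
        }

        foreach (string line in extractedText.Split('\n'))
        {
            // Trim() also removes a trailing '\r' if the text uses Windows line endings
            string trimmed = line.Trim();
            if (trimmed.Length > 0)
            {
                result.Add(trimmed);
            }
        }
        return result;
    }
}
```

In btnExtract_Click this could be called as ExtractedTextHelper.SplitLines(ExtractedData), and the resulting lines joined or processed as needed before assigning them to TextBox1.Text.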

When you click the Generate PDF File button, a PDF is generated and saved in the root directory of the application. When you open the PDF file, you will see three paragraphs.

Now when you click Extract Text From PDF File, all the text from page one is extracted and displayed in the TextBox. You can iterate through all the pages using a for loop; the code for this is included, commented out, in the listing above.
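
The page loop mentioned above can be sketched as a small standalone helper. This is an illustration under stated assumptions, not the original author's code: the class and method names (PdfTextHelper, ExtractAllPages) are hypothetical, while PdfReader.NumberOfPages, PdfTextExtractor.GetTextFromPage and LocationTextExtractionStrategy come from iTextSharp 5. A new strategy object is created for each page, because a LocationTextExtractionStrategy instance keeps the text it has already collected, so reusing one across pages would repeat earlier pages in later results.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class PdfTextHelper
{
    // Returns the text of every page in the PDF, one page after another.
    public static string ExtractAllPages(PdfReader reader)
    {
        StringBuilder sb = new StringBuilder();

        // iTextSharp page numbers start at 1, not 0
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Fresh strategy per page so text from earlier pages is not repeated
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }

        return sb.ToString();
    }
}
```

Inside the using (PdfReader reader = ...) block of btnExtract_Click, the single-page call could then be replaced with ExtractedData = PdfTextHelper.ExtractAllPages(reader);.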



This post first appeared on ASPArticles.
