Patagames Blog

This tutorial explains how to compose a PDF document from a set of scanned images using simple C# code and the PDFium library. Possible uses include bulk scanning of documents, creating e-books, converting printed books to an electronically readable format, and so on.

So, let’s see how you can do this using the PDFium .Net SDK. Here are the steps you should follow:

1. Enable namespace

For the library to work, you need to import the following namespaces into the application:

using Patagames.Pdf.Net;
using Patagames.Pdf.Enums;

You also need these standard ones:

using System.Drawing;
using System;
using System.Drawing.Imaging;

2. Initialize the library

Before you can use functions of the library, you should initialize it. To initialize the library, add the following line to the program:

`PdfCommon.Initialize();`

To release the library, call:

`PdfCommon.Release();`

The initialization is static, which means the following:

Initialization enables using PDF functions for the entire process or Web application pool. Once you call Initialize(), all threads of the process or all Web apps in the application pool will be able to use PDFium capabilities. Initialization is safe, so calling it multiple times is ok.

However, finalization is also static. Whenever you release the library in one Web application, you release all instances of the library in all other apps in the Web application pool too. As a result, before calling Release() you should make sure that no other Web applications or threads of the process are still using PDFium.

Initialization is thread-safe. You can call it from any thread of your application without worrying about synchronization.
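For a standalone console program, where no other thread or application shares the library, the initialize/release pairing described above can be sketched as follows. LibraryScope is a hypothetical helper, not part of the PDFium .Net SDK; it takes the initialize and release calls as delegates so that the pattern is visible on its own.

```csharp
using System;

// Hypothetical helper (not part of the PDFium .Net SDK): runs `body` between
// an initialize call and a release call, and guarantees that release still
// runs if `body` throws.
public static class LibraryScope
{
    public static void Run(Action initialize, Action release, Action body)
    {
        if (initialize == null) throw new ArgumentNullException(nameof(initialize));
        if (release == null) throw new ArgumentNullException(nameof(release));
        if (body == null) throw new ArgumentNullException(nameof(body));

        initialize();
        try
        {
            body();
        }
        finally
        {
            // Release is process-wide, so this is only appropriate when
            // nothing else in the process is still using the library.
            release();
        }
    }
}
```

In the program from this tutorial it could be called as `LibraryScope.Run(() => PdfCommon.Initialize(), () => PdfCommon.Release(), () => { /* build the PDF here */ });`. In a Web application pool, where other applications may still be using PDFium, this pattern is not suitable.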

3. Create a new PDF document

Here is the code:

`var doc = PdfDocument.CreateNew();`

Here, we create an instance of the PdfDocument class. The static CreateNew method can take a PdfForms parameter to enable interactivity powered by AcroForms. We don’t need this option to create a PDF from images, so we use the parameterless overload instead.

Note that the PdfDocument object implements the IDisposable interface, so make sure to call Dispose() afterwards. Alternatively, you can just use the using statement.

using (var doc = PdfDocument.CreateNew()) 
{ 
    ... 
}

Here is what we’ve got so far:

using Patagames.Pdf.Net; 
using System.Drawing; 
using System; 
using System.Drawing.Imaging; 
using Patagames.Pdf.Enums; 
  
namespace PdfFromImages 
{ 
    class Program 
    { 
        static void Main(string[] args) 
        { 
            int pageIndex = 0; 
            PdfCommon.Initialize(); 
            using (var doc = PdfDocument.CreateNew()) 
            { 
                ... 
            } 
  
        } 
    } 
}

As you see, everything is pretty simple. The pageIndex variable will count pages of our document.

4. Scan and load images

Now, we need to load the scanned images into our PDF document using the object we have just created.

var files = System.IO.Directory.GetFiles(@"SourceImages", "*.*", System.IO.SearchOption.AllDirectories);

GetFiles takes the path to search, the search mask and the search options. The code above returns every file in the SourceImages folder and all of its subfolders, so the folder is expected to contain only images.
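If the folder may also contain other files, or if the page order matters, the file list can be filtered and sorted before it is used. The sketch below is one possible approach, not part of the original tutorial: ImageFileFinder and its extension list are assumptions. Directory.GetFiles does not guarantee the order of the names it returns, so the sketch sorts them explicitly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: finds image files under a root folder, filtered by
// extension and sorted by path so the page order is predictable.
public static class ImageFileFinder
{
    // Extensions treated as scanned images. This list is an assumption;
    // adjust it to the formats your scanner produces.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".bmp", ".gif", ".jpg", ".jpeg", ".png", ".tif", ".tiff"
        };

    public static bool IsImageFile(string path)
    {
        return Extensions.Contains(Path.GetExtension(path));
    }

    public static string[] Find(string root)
    {
        return Directory.GetFiles(root, "*.*", SearchOption.AllDirectories)
            .Where(IsImageFile)
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }
}
```

With this helper, the loop would iterate over `ImageFileFinder.Find(@"SourceImages")`. The sort is a plain string comparison, so page10.png comes before page2.png; zero-padded names such as page02.png avoid that.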

As soon as we obtain the list of files in the files variable, we can fetch individual images from it:

foreach(var file in files)
{
    var image = Bitmap.FromFile(file, true) as Bitmap;
    ...
}

Here, the image variable is a Bitmap, but the actual format may vary. So we need to “normalize” the .Net image by converting it to the PdfBitmap format using the following call:

`var pdfBitmap = CreateBitmap(image);`

CreateBitmap is a function that detects the actual image format and converts the image to a PdfBitmap. The code of the function is as follows:

private static PdfBitmap CreateBitmap(Bitmap image)
{
    BitmapFormats pdfFormat;
    int[] palette;
    GetPdfFormat(image, out pdfFormat, out palette);
     
    var lockInfo = image.LockBits(new Rectangle(0, 0, image.Width, image.Height), ImageLockMode.ReadOnly, image.PixelFormat);
    var pdfBitmap = new PdfBitmap(image.Width, image.Height, pdfFormat, lockInfo.Scan0, lockInfo.Stride);
    image.UnlockBits(lockInfo);
     
    if(palette!= null)
        pdfBitmap.Palette = palette;
    return pdfBitmap;
}

Here, we transform the .Net image format to PdfBitmap format. All we do here is detect the PixelFormat of the input image, and create a new PdfBitmap object. We also detect a palette of the image (for indexed color images) and apply it to the output PdfBitmap object if necessary.

Note how we lock the image to receive the array of pixels from it using the LockBits method. Don’t forget to UnlockBits when the work is done.

To obtain the actual format of the image we will use the following function:

public static void GetPdfFormat(Bitmap image, out BitmapFormats pdfFormat, out int[] palette)
{
    palette = null;
    switch (image.PixelFormat)
    {
        case PixelFormat.Format1bppIndexed:
            pdfFormat = BitmapFormats.FXDIB_1bppRgb;
            palette = GetPalette(image);
            break;
        case PixelFormat.Format8bppIndexed:
            pdfFormat = BitmapFormats.FXDIB_8bppRgb;
            palette = GetPalette(image);
            break;
        case PixelFormat.Format24bppRgb:
            pdfFormat = BitmapFormats.FXDIB_Rgb;
            break;
        case PixelFormat.Format32bppArgb:
        case PixelFormat.Format32bppPArgb:
            pdfFormat = BitmapFormats.FXDIB_Argb;
            break;
        case PixelFormat.Format32bppRgb:
            pdfFormat = BitmapFormats.FXDIB_Rgb32;
            break;
        default:
            throw new Exception("Unsupported Image Format");
    }
}

In this function, we take the pixel format of the input Bitmap and match it against the supported color depth formats. If we don’t find a suitable PDF format, we throw an exception. The GetPalette function retrieves the image palette as an array of int values.

Here is the function:

public static int[] GetPalette(Bitmap image)
{
    var ret = new int[image.Palette.Entries.Length];
    for (int i = 0; i < ret.Length; i++)
        ret[i] = ((Color)image.Palette.Entries.GetValue(i)).ToArgb();
    return ret;
}
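Each int that Color.ToArgb() returns packs four 8-bit channels in the order alpha, red, green, blue (0xAARRGGBB). The sketch below shows that layout with plain arithmetic; ArgbPacking is a hypothetical helper written for illustration and is not part of System.Drawing or the PDFium .Net SDK.

```csharp
using System;

// Hypothetical helper that shows how a 32-bit ARGB value is laid out.
public static class ArgbPacking
{
    // Packs four 8-bit channels into one int laid out as 0xAARRGGBB,
    // the same layout that System.Drawing.Color.ToArgb() returns.
    public static int Pack(byte a, byte r, byte g, byte b)
    {
        return (a << 24) | (r << 16) | (g << 8) | b;
    }

    // Splits a packed ARGB value back into its four channels.
    public static void Unpack(int argb, out byte a, out byte r, out byte g, out byte b)
    {
        a = (byte)((argb >> 24) & 0xFF);
        r = (byte)((argb >> 16) & 0xFF);
        g = (byte)((argb >> 8) & 0xFF);
        b = (byte)(argb & 0xFF);
    }
}
```

Because the alpha byte sits in the top bits, fully opaque colors appear as negative numbers: opaque black is -16777216 (0xFF000000) and opaque white is -1 (0xFFFFFFFF).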

After the call to CreateBitmap we should create a PdfImageObject, the actual PDF object that holds the image and renders it on the page:

var imageObject = PdfImageObject.Create(doc);
imageObject.SetBitmap(pdfBitmap);

And now all the chunks of code assembled together:

using Patagames.Pdf.Net;
using System.Drawing;
using System;
using System.Drawing.Imaging;
using Patagames.Pdf.Enums;
namespace PdfFromImages
{
    class Program
    {
        static void Main(string[] args)
        {
            int pageIndex = 0;
            PdfCommon.Initialize();
            using (var doc = PdfDocument.CreateNew())
            {
                var files = System.IO.Directory.GetFiles(@"SourceImages", "*.*", System.IO.SearchOption.AllDirectories);
                foreach(var file in files)
                {
                    using (var image = Bitmap.FromFile(file, true) as Bitmap)
                    {
                        var pdfBitmap = CreateBitmap(image);
                        var imageObject = PdfImageObject.Create(doc);
                        imageObject.SetBitmap(pdfBitmap);
                        ...
                    }
                }
            }
        }
 
        private static PdfBitmap CreateBitmap(Bitmap image)
        {
            BitmapFormats pdfFormat;
            int[] palette;
            GetPdfFormat(image, out pdfFormat, out palette);
 
            var lockInfo = image.LockBits(new Rectangle(0, 0, image.Width, image.Height), ImageLockMode.ReadOnly, image.PixelFormat);
            var pdfBitmap = new PdfBitmap(image.Width, image.Height, pdfFormat, lockInfo.Scan0, lockInfo.Stride);
            image.UnlockBits(lockInfo);
 
            if(palette!= null)
                pdfBitmap.Palette = palette;
            return pdfBitmap;
        }
 
        public static void GetPdfFormat(Bitmap image, out BitmapFormats pdfFormat, out int[] palette)
        {
            palette = null;
            switch (image.PixelFormat)
            {
                case PixelFormat.Format1bppIndexed:
                    pdfFormat = BitmapFormats.FXDIB_1bppRgb;
                    palette = GetPalette(image);
                    break;
                case PixelFormat.Format8bppIndexed:
                    pdfFormat = BitmapFormats.FXDIB_8bppRgb;
                    palette = GetPalette(image);
                    break;
                case PixelFormat.Format24bppRgb:
                    pdfFormat = BitmapFormats.FXDIB_Rgb;
                    break;
                case PixelFormat.Format32bppArgb:
                case PixelFormat.Format32bppPArgb:
                    pdfFormat = BitmapFormats.FXDIB_Argb;
                    break;
                case PixelFormat.Format32bppRgb:
                    pdfFormat = BitmapFormats.FXDIB_Rgb32;
                    break;
                default:
                    throw new Exception("Unsupported Image Format");
            }
        }
 
        public static int[] GetPalette(Bitmap image)
        {
            var ret = new int[image.Palette.Entries.Length];
            for (int i = 0; i < ret.Length; i++)
                ret[i] = ((Color)image.Palette.Entries.GetValue(i)).ToArgb();
            return ret;
        }
    }
}

Notes on this part of code:

The Image object also implements IDisposable, so we need to dispose of it. Again, we use the using statement for this. The static Create method of the PdfImageObject class takes a PdfDocument as its parameter; in our case, this is the doc object created above. The image object needs the document to create its inner PDF dictionaries.

And now for the most important part of the code…

5. Calculate the required PDF page size

Now we need to calculate the size of the page. Basically, the size should be equal to the size of the scanned image, but the trick is to convert pixels of the image to points of the PDF. We do this by calling the following function:

var size = CalculateSize(pdfBitmap.Width, pdfBitmap.Height, image.HorizontalResolution, image.VerticalResolution);

And here is the function itself:

private static SizeF CalculateSize(int width, int height, float dpiX, float dpiY)
{
    return new SizeF()
    {
        Width = width * 72 / dpiX,
        Height = height * 72 / dpiY
    };
}

The function takes width and height of the bitmap in pixels as well as horizontal and vertical DPI and calculates the size of the PDF page. To understand the conversion you should know the following:

One inch contains exactly 72 PDF points.
The DPI of the scanned image may vary and depends on the scanning resolution.

So, to convert pixels to typographic points, we divide the dimension of the bitmap in pixels by the DPI in that dimension to get the number of inches the image would take, and then multiply this value by 72 to find the number of PDF points. That’s what the function does.
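As a worked example (the numbers are illustrative and not taken from the tutorial), consider an A4 sheet scanned at 300 DPI, which gives a bitmap of about 2480 × 3508 pixels:

```latex
\begin{aligned}
\text{width}  &= \frac{2480\ \text{px}}{300\ \text{px/in}} \times 72\ \text{pt/in} = 595.2\ \text{pt} \\
\text{height} &= \frac{3508\ \text{px}}{300\ \text{px/in}} \times 72\ \text{pt/in} = 841.92\ \text{pt}
\end{aligned}
```

The result is close to the nominal A4 size of roughly 595 × 842 points, so the page comes out at the physical size of the scanned sheet.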

6. Insert the page to the PDF document

We are almost done here. It is time to finally insert the page into the document and place the converted image on that page. Here we go:

doc.Pages.InsertPageAt(pageIndex, size);
doc.Pages[pageIndex].PageObjects.InsertObject(imageObject);

Now, if you assemble the last pieces of code from steps 5 and 6 and run the program, you will end up with empty PDF pages. Why is that? Because we didn’t tell the PDF renderer how it should render the image. To do this, we need to set a transformation matrix for the image object.

`imageObject.SetMatrix(size.Width, 0, 0, size.Height, 0, 0);`

The line above tells the renderer that we want the image to be stretched over the entire width and height of the page. Then we need to apply the changes to the page and regenerate its content:

`doc.Pages[pageIndex].GenerateContent();`

If you just need the work done quickly, add these lines and skip to step 7. However, if you want to understand the idea of the transformation matrix, here is a brief explanation.

Understanding transformation matrix

If you barely know what matrices are, please make sure to read our guide to matrices here first. However, you don’t need that information to finish this tutorial. You only need it to understand what you can do with objects using transformation matrices and why the things work like they do.

The transformation matrix looks as follows:

| a  b  0 |
| c  d  0 |
| e  f  1 |

where a, b, c, d, e and f are the coefficients of the matrix. The SetMatrix function takes these six coefficients as its parameters:

`SetMatrix(a, b, c, d, e, f);`

Depending on which coefficients we use, we can transform the object (the image in our case) in a variety of ways: scale, translate, rotate or shear. More specifically, the transformation matrix represents a transformation between two coordinate systems: the original one and the transformed one. While rendering, each pixel of the image is mapped from the original coordinate system to the transformed one using the transformation matrix, which effectively results in a scaled, rotated or translated image.

The coordinate vector and the transformation matrix are multiplied to produce a transformed coordinate system – scaled, rotated and so on. In our case, we want to stretch the image to the entire size of the page. This corresponds to the scale operation.
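Written out in the row-vector convention used by the PDF specification, the multiplication is as follows, where (x, y) is a point in the original coordinate system and (x', y') is the same point after the transformation:

```latex
\begin{bmatrix} x' & y' & 1 \end{bmatrix}
=
\begin{bmatrix} x & y & 1 \end{bmatrix}
\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix}
\quad\Longrightarrow\quad
\begin{aligned}
x' &= a\,x + c\,y + e \\
y' &= b\,x + d\,y + f
\end{aligned}
```

In PDF, an image occupies the unit square from (0, 0) to (1, 1) in its own coordinate system. Without a matrix it would therefore cover only one point by one point, which is why the pages look empty. With a = size.Width, d = size.Height and the other coefficients set to zero, the corner (1, 1) maps to (size.Width, size.Height), the top-right corner of the page.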

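The transformation matrix for such a scale operation looks as follows:

| sx  0   0 |
| 0   sy  0 |
| 0   0   1 |

Here sx and sy are the horizontal and vertical scale factors, which take the places of a and d. So, we need to set the a parameter of the SetMatrix function to size.Width and the d parameter to size.Height. Hence the code: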

`imageObject.SetMatrix(size.Width, 0, 0, size.Height, 0, 0);`
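The same six coefficients can also combine scaling with translation. As an illustration that goes slightly beyond the tutorial, the sketch below computes coefficients that fit an image inside a page of a different size while keeping its aspect ratio and centering it. ImagePlacement, FitAndCenter and Map are hypothetical names, and all sizes are assumed to be in PDF points.

```csharp
using System;

// Hypothetical helper: computes transformation matrix coefficients that scale
// an image to fit inside a page, keep its aspect ratio, and center it.
public static class ImagePlacement
{
    // Returns { a, b, c, d, e, f } in the order the SetMatrix call uses.
    public static float[] FitAndCenter(float imageWidth, float imageHeight,
                                       float pageWidth, float pageHeight)
    {
        if (imageWidth <= 0 || imageHeight <= 0)
            throw new ArgumentOutOfRangeException(nameof(imageWidth), "Image dimensions must be positive.");

        // Use the smaller ratio so the image fits in both directions.
        float scale = Math.Min(pageWidth / imageWidth, pageHeight / imageHeight);
        float scaledWidth = imageWidth * scale;
        float scaledHeight = imageHeight * scale;

        // e and f translate the image so the leftover space is split evenly.
        float offsetX = (pageWidth - scaledWidth) / 2f;
        float offsetY = (pageHeight - scaledHeight) / 2f;

        return new[] { scaledWidth, 0f, 0f, scaledHeight, offsetX, offsetY };
    }

    // Applies x' = a*x + c*y + e and y' = b*x + d*y + f to a point.
    public static void Map(float[] m, float x, float y, out float mappedX, out float mappedY)
    {
        mappedX = m[0] * x + m[2] * y + m[4];
        mappedY = m[1] * x + m[3] * y + m[5];
    }
}
```

For example, a 200 × 100 image on a 100 × 100 page yields the coefficients (100, 0, 0, 50, 0, 25): the image is halved and shifted up by 25 points. The result could then be passed on as `imageObject.SetMatrix(m[0], m[1], m[2], m[3], m[4], m[5]);`, assuming the page was inserted with the fixed page size rather than with size.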

You can find more information on using transformation matrices in PDF documents here.

7. Wrap things up

Let’s summarize what we have done so far.

First, we enable the namespace and initialize the PDFium library. Then, we load up scanned images we want to create the PDF document from. Then, we normalize the images from .Net format to the PDF-compatible format. Finally, we calculate the page size and insert images one by one using the transformation matrix to scale them up to the entire size of the page.

Here is the final code:

using Patagames.Pdf.Net;
using Patagames.Pdf.Enums;
using System.Drawing;
using System;
using System.Drawing.Imaging;
 
namespace PdfFromImages
{
    class Program
    {
        static void Main(string[] args)
        {
            int pageIndex = 0;
            PdfCommon.Initialize();
            using (var doc = PdfDocument.CreateNew())
            {
                var files = System.IO.Directory.GetFiles(@"SourceImages", "*.*", System.IO.SearchOption.AllDirectories);
                foreach (var file in files)
                {
                    using (var image = Bitmap.FromFile(file, true) as Bitmap)
                    {
                        var pdfBitmap = CreateBitmap(image);
                        var imageObject = PdfImageObject.Create(doc);
                        imageObject.SetBitmap(pdfBitmap);
 
                        var size = CalculateSize(pdfBitmap.Width, pdfBitmap.Height, image.HorizontalResolution, image.VerticalResolution);
 
                        doc.Pages.InsertPageAt(pageIndex, size);
                        doc.Pages[pageIndex].PageObjects.InsertObject(imageObject);
 
                        imageObject.SetMatrix(size.Width, 0, 0, size.Height, 0, 0);
 
                        doc.Pages[pageIndex].GenerateContent();
                        pageIndex++;
                    }
                }
                doc.Save(@"saved.pdf", SaveFlags.NoIncremental);
            }
        }
 
        private static PdfBitmap CreateBitmap(Bitmap image)
        {
            BitmapFormats pdfFormat;
            int[] palette;
            GetPdfFormat(image, out pdfFormat, out palette);
 
            var lockInfo = image.LockBits(new Rectangle(0, 0, image.Width, image.Height), ImageLockMode.ReadOnly, image.PixelFormat);
            var pdfBitmap = new PdfBitmap(image.Width, image.Height, pdfFormat, lockInfo.Scan0, lockInfo.Stride);
            image.UnlockBits(lockInfo);
 
            if (palette != null)
                pdfBitmap.Palette = palette;
            return pdfBitmap;
        }
 
        public static void GetPdfFormat(Bitmap image, out BitmapFormats pdfFormat, out int[] palette)
        {
            palette = null;
            switch (image.PixelFormat)
            {
                case PixelFormat.Format1bppIndexed:
                    pdfFormat = BitmapFormats.FXDIB_1bppRgb;
                    palette = GetPalette(image);
                    break;
                case PixelFormat.Format8bppIndexed:
                    pdfFormat = BitmapFormats.FXDIB_8bppRgb;
                    palette = GetPalette(image);
                    break;
                case PixelFormat.Format24bppRgb:
                    pdfFormat = BitmapFormats.FXDIB_Rgb;
                    break;
                case PixelFormat.Format32bppArgb:
                case PixelFormat.Format32bppPArgb:
                    pdfFormat = BitmapFormats.FXDIB_Argb;
                    break;
                case PixelFormat.Format32bppRgb:
                    pdfFormat = BitmapFormats.FXDIB_Rgb32;
                    break;
                default:
                    throw new Exception("Unsupported Image Format");
            }
        }
 
        public static int[] GetPalette(Bitmap image)
        {
            var ret = new int[image.Palette.Entries.Length];
            for (int i = 0; i < ret.Length; i++)
                ret[i] = ((Color)image.Palette.Entries.GetValue(i)).ToArgb();
            return ret;
        }
 
        private static SizeF CalculateSize(int width, int height, float dpiX, float dpiY)
        {
            return new SizeF()
            {
                Width = width * 72 / dpiX,
                Height = height * 72 / dpiY
            };
        }
    }
}

Two comments about the final code. After each page is generated, pageIndex++ advances the page counter. Once all images are processed, doc.Save writes the finished document to saved.pdf.

That’s it! This is how you can create a PDF document from an array of scanned images using the above C# code and the PDFium library. If you have any questions on this tutorial, found a mistake or want to suggest something, please don’t hesitate to leave your comments below.

Akshara
2/16/2017 10:29:08 AM

can u please tell me about how to convert this new generated pdf into searchable pdf
- Paul
  2/16/2017 12:46:08 PM
  
  Please look at here blog.patagames.com/.../how-to-make-a-searchable-pdf-from-scanned-pages

Jessie
5/24/2017 9:21:02 PM

How do we do this??? ===> As a result, before calling Release() you should make sure that no other Web applications or threads of the process are still using PDFium.
- Paul Rayman
  5/28/2017 11:10:56 PM
  
  Actually you do not need to call Release in a Web Apps.

How to Create a PDF from Multiple Scanned Pages in Your C# Code
