Digitization

  • Digitization
  • Imaging
  • Scanning
  • Digital Cleaning
  • Image Manipulation
  • Quality Control
  • Post Imaging
  • Storage

Digitization is a process of converting data and information (in paper, analogue sound tracks, film, or some other media) to electronic form (binary coded files for use in computers).

Before starting a digitization project, it is necessary to understand:

- The mission and scope of the project: the purpose of digitizing, as well as the patrons/users of the products of digitization, should be considered so that their expectations can be met.

- The goals of the project should be established, i.e. what the project is to accomplish, for instance, improving access to a wider audience, improving preservation of materials by reducing handling of originals, etc.

- It should also be established how the information, whether images or otherwise, will be used to meet the needs of the users.

Planning

Planning is an important stage in the digitization process where the following requirements are determined:

- The size of surrogate image (in KB, MB, etc.).

- Desired file format (TIFF, JPEG, PSD, etc.).

- Metrics such as the digital imaging resolution (dpi) to be used.

- Tonality (bi-tonal, grey scale or color).

- Directory or file-naming requirements.

- Indexing or metadata requirements.

- Characteristics of the collection (e.g. A4-size documents).

- Special requirements, for instance, on handling of documents.

- Equipment to be used – whether the right mix of equipment is available.

- The number of documents to be digitized (with paper size).

- The number of missing pages and any other anomalies (These should be documented on a processing sheet and attached to the document sheets).

- Service Provider required to undertake the digitization.

- Availability of adequate physical workspace.

- Time required to complete the project (e.g. if there are any critical deadlines).

- Determining the copyright status.

- The expectations of the patron, and the means to be used to access, retrieve and use the digital images.

- Storage space required; digital images are usually large in size, therefore storage space is an important consideration.

- Back-up requirements: since hard drives and other computer storage devices are not infallible, copies of the digital images need to be created throughout the digitization process.

- Delivery medium, e.g. compact disc (CD), digital versatile disc (DVD), external hard drive or the Internet.

- Anticipated interruptions in workflow.

During planning, benchmarks are set for the technical standards required of the digital images, and these benchmarks are communicated to the Service Provider. For instance, the scanning Service Provider needs to know what constitutes the ‘standard’ for a scanned image (size, resolution, format, etc.).
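The storage requirement listed above can be estimated before scanning starts. The following is a minimal sketch, assuming uncompressed images and an A4 page scanned at 300 dpi in 24-bit colour. The function names and the figure of two copies (one working copy plus one backup) are illustrative assumptions, not values taken from any project specification.

```python
# A4 is 210 mm x 297 mm; there are 25.4 mm in an inch.
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4


def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed size of one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def project_bytes(page_count, bytes_per_page, copies=2):
    """Total space for all pages, including backup copies (assumed: 2 copies)."""
    return page_count * bytes_per_page * copies


# One A4 page, 300 dpi, 24-bit colour (assumed benchmark values).
page_bytes = uncompressed_size_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, dpi=300, bits_per_pixel=24)
total_bytes = project_bytes(1000, page_bytes)

print(f"per page: {page_bytes / 1_000_000:.1f} MB")
print(f"1000 pages, 2 copies: {total_bytes / 1_000_000_000:.1f} GB")
```

Under these assumptions a single page comes to roughly 26 MB, so even a modest collection needs tens of gigabytes once backup copies are counted. Compressed formats would need less space.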

Dry-run

A 2-3 week dry-run is recommended before the actual digitization begins. This facilitates:

• Service Provider (if outsourcing) introduction and orientation to the goals and objectives of the project.

• Task and responsibility allocation for Service Provider. Task allocation must be commensurate with the skills for the process to run well.

• Resource allocation.

• User Training

• Equipment testing: Workstations, scanners, printers and other equipment must be tested to confirm that they meet the required standards.

• Issues such as viewing conditions, monitor calibration and color management should be addressed. Natural light is recommended for digitization, but the light needs to be controllable to avoid reflections. Monitor calibration involves adjusting the monitor’s color conversion settings to a standard so that images look the same when displayed on a variety of monitors. The monitor should be brighter than any other light source in the room. The display background color should be changed to a neutral grey, and any desktop patterns should be turned off. Together, these measures form a color management strategy for the digitization project.

If the digitization process has been outsourced the following issues need to be addressed:

• Accessibility to the collection.

• Security of the collection.

• Working hours.

It should be emphasized that the development of a management system is crucial to the effectiveness of any digitization project, and the system should be in place before any scanning starts.

[Figure: graphical representation of the digitization process]

Before the actual scanning, the scanning staff prepare the documents by removing them from the document shelves and boxes and cleaning them with a special cloth to remove dust and other marks.

Ensure that the PCs and scanners are clean, and use protective gloves when handling the documents.


Document Preparation: Whether scanning is done from film or paper, the first step to ensure high-quality images and proper handling of your data is document preparation.


Paper document preparation usually consists of removing staples, paper clips, rubber bands, brads, or other types of binders. Thorough document preparation is vital to minimizing scanner jams and double feeds, which occur when two sheets are fed at the same time and are scanned as one. Non-paper media undergo cleaning and dusting to prevent foreign matter from corrupting the data. The standard document preparation steps to ensure proper image capture and quality are:

1. Purging

Part of the process of converting paper to images requires an evaluation of the material involved. Some file folders, documents or other materials contain extraneous material, duplicates, notes and other information that need not be scanned. In these instances, you must decide whether it is more cost effective to purge files before scanning or to scan everything and purge extraneous images. In some instances, purging requires the use of personnel with knowledge of the documents being scanned. We call this subjective purging. In other instances, persons without such knowledge can do purging guided by specifications e.g. purge all handwritten notes and all Post-it notes. We call this objective purging.

2. Organizing

Like pieces of paper, images typically are grouped into documents. Accordingly, the beginning and end of each paper document must be clearly defined. Document separator pages are inserted between documents during the document preparation phase. Separator pages typically have a bar code printed on them. This code indicates to the software that one document has ended and another has begun.
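The way separator pages divide a scanned stream into documents can be sketched as follows. This is an illustrative sketch only: it assumes the scanning software has already decoded each page into a label, and the marker value "SEP" and the function name are hypothetical.

```python
SEPARATOR = "SEP"  # hypothetical value decoded from a separator page's bar code


def split_into_documents(pages, separator=SEPARATOR):
    """Group a flat sequence of scanned pages into documents.

    A separator page closes the current document and is not kept as an image.
    """
    documents = []
    current = []
    for page in pages:
        if page == separator:
            if current:  # ignore leading or repeated separators
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:  # the last document has no trailing separator
        documents.append(current)
    return documents


scanned = ["p1", "p2", "SEP", "p3", "SEP", "p4", "p5", "p6"]
docs = split_into_documents(scanned)
print(docs)
```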

3. Paper Preparation

Preparation is required before documents go through a scanner. Staples, brads, paper clips, Post-it notes and attachments must be removed. Depending on the job, it may be necessary to rebind documents after scanning.

Preparation typically requires documents to be "jogged" so that all leading edges are aligned before they are fed into the scanner. Physical activity is required to jog documents. This process is necessary to eliminate scanner jams and double feeds.

Paper size and weight must be considered. Many scanner auto-feeders cannot handle mixed widths and weights. An appropriate scanner and feeder must be selected to match the requirements of each job. Flatbed scanners may be required in some cases.

Finally, consideration should be given to "batching" documents. Batching will improve control and efficiency of the conversion process. Batching provides a convenient way to audit the process and ensure that the number of scanned documents matches the number of images.
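The batch audit described above can be sketched as a simple comparison of counts. The batch identifiers and the function name below are hypothetical; the sketch assumes that the number of pages in each batch was recorded on the processing sheet during preparation.

```python
def audit_batches(expected_counts, scanned_counts):
    """Return {batch_id: (expected, scanned)} for every batch whose counts differ."""
    mismatches = {}
    for batch_id in sorted(set(expected_counts) | set(scanned_counts)):
        expected = expected_counts.get(batch_id, 0)
        scanned = scanned_counts.get(batch_id, 0)
        if expected != scanned:
            mismatches[batch_id] = (expected, scanned)
    return mismatches


# Pages counted during preparation versus images produced by the scanner.
expected = {"B001": 120, "B002": 95, "B003": 60}
scanned = {"B001": 120, "B002": 94, "B003": 60}

problems = audit_batches(expected, scanned)
print(problems)
```

A shortfall such as the one in batch B002 may point to a double feed, so that batch would be re-checked against the originals.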

The image should be assessed against the original to produce quality scans that will require minimal manipulation during correction. A good scanner scans more than 55 pages per minute.


The cleaning staff take the scanned images and remove dust specks and any strings or hairs that may have been picked up from the scanner.

Scratch marks and lines on the pages that are visible on the digital image are also removed. Image manipulation software such as Adobe® Photoshop® is used to remove the dirt, using tools such as the clone tool and the healing brush.

Cleaning an image can take an average of five minutes for a 120MB image at 200% zoom.
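The five-minute average can be turned into a rough staffing estimate. This is a back-of-envelope sketch: it assumes every image takes the stated average, whereas the figure quoted above applies to 120MB images, so other sizes will differ. The function name and the image count are illustrative.

```python
def cleaning_hours(image_count, minutes_per_image=5.0):
    """Estimated total cleaning time in hours at a fixed average per image."""
    return image_count * minutes_per_image / 60


hours = cleaning_hours(1200)
print(f"{hours:.0f} hours to clean 1200 images")
```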

Correction is intended to bring the digital image as close as possible to the original physical document. As the image is captured using the scanner, some information may be lost, e.g. the sharpness/contrast, colour, etc.

The resultant image should be comparable to the original or of improved quality.

Quality control constitutes procedures and practices put in place to ensure consistency, integrity and reliability of the digitization process, whereas quality assurance entails procedures by which one checks the quality of the final product.

This task is allocated to one or more staff members, who pick a 5-10% sample of the images for a quality check.
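One way to pick such a sample is to draw it at random so that no part of the batch is systematically skipped. The sketch below assumes random selection, which the text does not prescribe; the function name, the file names and the seed are hypothetical.

```python
import random


def pick_qc_sample(image_names, percent=5, seed=None):
    """Pick a random sample of the images, rounded up to a whole image."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    names = list(image_names)
    sample_size = (len(names) * percent + 99) // 100  # integer ceiling
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(names, sample_size))


names = [f"img_{i:04d}.tif" for i in range(200)]
sample = pick_qc_sample(names, percent=5, seed=1)
print(len(sample), "images selected for checking")
```

Recording the seed alongside the batch lets a second checker reproduce the same selection.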

The person undertaking quality control should ask himself/herself the following questions:

- Is the output file named for the correct original object?

- Does the image include all the information in the original image, e.g. how many elements of the original have been included or omitted?

- Does the image conform to the agreed upon file standards in the specification, e.g. if the commitment is to provide 300 dpi images, has the image output achieved this?

- Does the information recorded about the image accurately represent the technical image information?
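The check against the agreed file standards can be partly automated once the technical details of each image have been recorded. The following sketch compares a recorded entry with a specification. The field names, the specification values and the rule that resolution is treated as a minimum are all assumptions made for illustration.

```python
# Hypothetical specification agreed with the Service Provider.
SPEC = {"dpi": 300, "format": "TIFF", "mode": "colour", "bit_depth": 24}


def check_against_spec(record, spec):
    """Return a list of human-readable problems; an empty list means a pass."""
    problems = []
    for field, required in spec.items():
        actual = record.get(field)
        if field == "dpi":
            # Assumption: resolution is a minimum, so higher values also pass.
            ok = actual is not None and actual >= required
        else:
            ok = actual == required
        if not ok:
            problems.append(f"{field}: expected {required!r}, found {actual!r}")
    return problems


good = {"dpi": 300, "format": "TIFF", "mode": "colour", "bit_depth": 24}
bad = {"dpi": 200, "format": "TIFF", "mode": "colour", "bit_depth": 24}

print(check_against_spec(good, SPEC))
print(check_against_spec(bad, SPEC))
```

A check like this only covers the recorded values; visual questions, such as whether all the information in the original is present, still need a person to compare the image with the original.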

Technical considerations for the quality assurance checker would include:

- Orientation, cropping and border areas etc.

- Alignment of image

- Size of image

- Image resolution

- File format

- Correct image mode – colour/grey scale

- Bit depth

- Details in highlights and shadows

- Tonal values

- Brightness and contrast

- Noise

- Missing lines or pixels

- Dithering

- Poor-quality interpolation in access images and thumbnails

Overall Evaluation

The quality checker should assess the quality of the image as a whole, asking himself/herself:

- Is any essential information conveyed by the original missing from the image (e.g. the translucency of a watercolour painting)?

- In his/her opinion, is the image unacceptable, adequate but of diminished quality, comparable in quality to the original, or of improved quality?

Verification of data

Any recorded information should be verified against the information accompanying the original.


Backup

A digitization project must include a backup plan for all the processed images and data, for disaster recovery if media fail or a computer crashes. Backup is important at every stage where images are modified, to avoid redoing previously completed stages. Therefore, backup must be done after scanning, cleaning, colour correction and quality control.

A final backup of all the finished images must be made on durable media that allow easy access, such as hard drives or DVDs. The images must be migrated frequently to new media that are compatible with the existing software, to guarantee accessibility in the future.
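A backup is only useful if the copy is identical to the original, so it is worth verifying each copy as it is made. The following is a minimal sketch that copies a file and compares SHA-256 checksums. The function names and file names are hypothetical, and the demonstration runs in a temporary directory rather than on real backup media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a file, read in chunks so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy one file into backup_dir and confirm the copy matches."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target


# Demonstration with a stand-in file in a temporary directory.
with tempfile.TemporaryDirectory() as work:
    original = Path(work) / "scan_0001.tif"
    original.write_bytes(b"example image bytes")
    copy = backup_file(original, Path(work) / "backup")
    copy_matches = sha256_of(copy) == sha256_of(original)

print("copy verified:", copy_matches)
```

The same checksums can be stored and re-checked after each migration to new media.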

Privileges

Deletion privileges must be assigned to one person charged with managing the flow of images. This reduces random deletion of images that can lead to substantive rework of otherwise complete work. For consistency purposes the same person should perform backup tasks and folder creation. With a large digitization project and team, this cannot be overemphasized.

Metadata Creation

Metadata for the digital images should be readily available so that quality control can cover both image quality and the metadata. Metadata creation produces indexes that work together: descriptive information about the item that will be of interest to the patron, e.g. title, description and format, is generated and organized. The Dublin Core metadata format consists of 15 elements identified as basic descriptive elements for electronic resources. The core fields in this system include: title, identifier, publisher, creator, date, subject, description and type. Other fields are optional and include: coverage (temporal), coverage (spatial), relation, format, source, contributor, language and rights. Technical image metadata (such as format, resolution and size) and capture details (e.g. creation date) can be noted throughout the scanning process as the relevant information becomes available.
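A metadata record along these lines can be checked for completeness before it goes to quality control. The sketch below treats the eight core fields named above as required and the remaining fields as optional, with coverage counted once. The record values and function names are invented for illustration.

```python
CORE_FIELDS = ("title", "identifier", "publisher", "creator",
               "date", "subject", "description", "type")
OPTIONAL_FIELDS = ("coverage", "relation", "format", "source",
                   "contributor", "language", "rights")


def missing_core_fields(record):
    """Core fields that are absent or blank in the record."""
    return [name for name in CORE_FIELDS
            if not str(record.get(name) or "").strip()]


def unknown_fields(record):
    """Fields in the record that are not among the 15 elements."""
    allowed = set(CORE_FIELDS) | set(OPTIONAL_FIELDS)
    return sorted(set(record) - allowed)


# Illustrative record; every value here is made up.
record = {
    "title": "Annual report",
    "identifier": "ITEM-001",
    "creator": "Registry",
    "date": "1965",
    "subject": "Administration",
    "description": "Typed report, 24 pages",
    "type": "Text",
    "format": "image/tiff",
    "colour": "greyscale",
}

print("missing:", missing_core_fields(record))
print("unknown:", unknown_fields(record))
```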

Appropriate software should be selected for the digitized images and metadata. It should satisfy the following criteria, which apply to any database management system to be used in such a project:

- The database should allow for searching using the available fields.

- It should support sorting of information in many ways based on criteria that a user may choose.

- It should be easy to learn and use, flexible and simple.

- It should allow for definition of fields, and their size.

- It should be robust and able to handle increasing amounts of data.

- Records should be easily exportable to other programs such as spreadsheets, and should also allow for exporting and importing in DBF format.

- It should be relational, i.e. it should allow organization and access to data according to the relationships between data items without the need for any consideration of physical orientation and relationship. Relationships between data items are expressed by means of tables.

- It should generate thumbnails from images in a variety of formats.
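Several of these criteria (field definition, searching, sorting, relations between tables and export) can be illustrated with a small relational database. The sketch below uses SQLite through Python's standard sqlite3 module purely as an example; it is not a recommendation of a particular product. The table layout and sample rows are invented, and export is shown as CSV rather than DBF, since CSV can be opened by spreadsheets.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two related tables: one row per item, one row per image of an item.
conn.execute("""
    CREATE TABLE items (
        identifier TEXT PRIMARY KEY,
        title      TEXT NOT NULL,
        creator    TEXT,
        date       TEXT
    )""")
conn.execute("""
    CREATE TABLE images (
        file_name  TEXT PRIMARY KEY,
        identifier TEXT NOT NULL REFERENCES items(identifier),
        dpi        INTEGER
    )""")
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
    ("ITEM-001", "Annual report", "Registry", "1965"),
    ("ITEM-002", "Survey map", "Survey Office", "1948"),
    ("ITEM-003", "Annual report supplement", "Registry", "1966"),
])
conn.executemany("INSERT INTO images VALUES (?, ?, ?)", [
    ("ITEM-001_0001.tif", "ITEM-001", 300),
    ("ITEM-002_0001.tif", "ITEM-002", 300),
])


def search_items(conn, text):
    """Search the title field and sort the matches by date."""
    rows = conn.execute(
        "SELECT identifier, title FROM items WHERE title LIKE ? ORDER BY date",
        (f"%{text}%",),
    )
    return rows.fetchall()


def export_csv(conn):
    """Join images to their items and export the result as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_name", "title", "dpi"])
    writer.writerows(conn.execute(
        "SELECT images.file_name, items.title, images.dpi "
        "FROM images JOIN items USING (identifier) "
        "ORDER BY images.file_name"
    ))
    return buffer.getvalue()


matches = search_items(conn, "report")
exported = export_csv(conn)
print(matches)
print(exported)
```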

Follow-up processes

These include re-filing of physical documents, storing images and adhering to physical image preservation policies of the institution.

• High standards of cleanliness must be maintained at all times within the digitization room with housekeeping being done on a daily basis.

• Strict filing procedures, tidiness and organization must be observed to ensure that documents are not mixed up or misplaced, as searching for and locating them can cost many hours. Strict naming conventions and filing of the imaged documents should also be maintained to facilitate quick and easy retrieval.

• Environmental controls must be observed: air conditioners must be in place and must be kept dust free.

• Security measures must be in place to protect the documents from theft and unauthorized access. Anti-intruder alarm systems must be installed.
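The strict naming conventions mentioned above are easier to keep when file names are generated and checked by a script rather than typed by hand. The pattern below (collection code, item number and page number, each zero-padded to four digits) is a hypothetical convention chosen for illustration, not one prescribed here.

```python
import re

# Hypothetical convention: COLLECTION_ITEM_PAGE.tif, e.g. ARCH_0012_0003.tif
NAME_PATTERN = re.compile(r"[A-Z0-9]+_\d{4}_\d{4}\.tif")


def image_file_name(collection, item_number, page_number):
    """Build a file name that follows the assumed convention."""
    return f"{collection.upper()}_{item_number:04d}_{page_number:04d}.tif"


def follows_convention(file_name):
    """True only if the whole file name matches the assumed pattern."""
    return NAME_PATTERN.fullmatch(file_name) is not None


name = image_file_name("arch", 12, 3)
print(name, follows_convention(name))
print(follows_convention("scan final(2).tif"))
```

Zero-padded numbers keep files in page order when a folder is sorted by name.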