PDF Conversion of DOI and Summaries into XML for an IT Company in NY, USA
Leading software developers, New York, USA
Industry: IT/ Software
Converting DOI documents and summaries in PDF format, to XML for human-reading and machine reading, and for use in the web service as and when required.
Designed & developed a workflow to ensure flawless keying in of data into word files, followed with flawless process of XML tagging and multi-layered quality check for converting PDF files to XML as per client specifications.
MS Word, XML Oxygen Editor and Mozilla Firefox
PDF files received from client’s end were not clear and required great amount of concentration, ultimately increasing the time spent on each file to read, decipher & then key in data. This increased the count of resources required to complete the project.
Client’s checklist for XML tagging was complex and lengthy.
Reconstructing the original document structure from PDF file was the most challenging task because PDF document did not carry any such information.
Non unique fonts or font size where in both words and numerical meaning delimiters were in different font size.
PDF source document contained peculiar typesetting errors. Few additions were shown in regular font, but others were in different fonts.
The font and font size attributes in the XML file needed detailed inspection in order to detect and mitigate the whitespace characters to avoid erroneous typeface information that could cause clutter across the transitions in the XML process.
The client now was able to manipulate the XML information programmatically (under machine control), for separating the XML documents from disparate sources, or take them apart and re-use in multiple ways. They can be converted into any other format with no loss of information, which also allowed invoking of several XML files from several machines and various locations.
Client now was able to use these XML files to describe and identify information accurately and unambiguously, in a manner where the computers can be programmed to ‘understand’ your information.
Our PDF to XML conversion enabled the client to create and handle sets of documents, consistently and without structural errors, because the XML files provided a standardized way of describing, controlling, or allowing/disallowing particular types of document structure.
The files delivered were ready to use on multiple platforms like internet, handheld sets and palmtops.
The XML files made it more convenient for the clients to easily use and re-purpose the content for publishing.
Client saved upon dollars as the new XML file format required less storage space.
Keeping up with ever evolving technological demands, of our clients and for our clients, we consistently upgrade systems, processes and skills of our employees; to ensure that our data conversion solutions are the ones with latest technological adoptions, to assist our clients to stay competitive.
Client was given the option to shortlist and selects 50 data entry operators from our pool of talented professionals. Upon selecting the most skilled experts, an workflow was discussed, designed and developed for uninterrupted keying in of data from pdf files, quality check of keyed data, XML tagging, XML validation, and of course the auditing process.
Steps that were followed for DOI in PDF to xml conversion:
Upon receipt of input data in the form of pdf files from the client via FTP server; first the data was keyed from pdf to Word documents by a team of data entry experts.
All the fields were thoroughly checked using specific check lists in order to maintain data accuracy.
A dedicated team was assigned to add XML tags in the data files.
The XML files were validated by XML experts for discrepancies – if any.
Random audit of 10% XML files was done by the quality check team and they rejected files even with a single error in XML tagging/structure.
Successfully completed and checked XML files were dispatched to client via FTP server.
Quality Control Process that was adhered to:
QC team checked the keyed data using check list & upon finding any junk character, the files were passed to the data entry team for rectification.
The QC team also checked for special characters if found in the pdf.
QC team approved the MS Word documents only after 100% data accuracy was achieved and only after the data in MS Word document exactly matched the data in the input pdf file.
The team of XML experts checked the XML files in XML Oxygen Editor and FireFox Mozilla browser for viewing the outputs and checked every single paragraph for correct XML structure. Upon identifying any incorrect structure, even with the minutest error, were referred back to the XML tagging team.
10% XML files were audited randomly for locating errors.
Quality Control checklist that was adhered to by the QC Team:
Check Case Name 4 times [In XML File Tags:- <Descrpition>, <CASENAME>, <HEADLINE>, <HEADLINE>].
Check Page Number and Colum No. [in XML File Tags:- <PAGENUMBER>, <COLUMNNUMBER>].
Check Issue No. and Volume Number [In XML File Tag:- <ISSUENUMBER>].
Check Date in XML File [In XML File Tag:-<PUBDATE>].
Check Judge Name 3 times [In XML File Tag:- <JUDGES> ].
Check Court [In XML File Tag:- <COURT>].
Check Practice Area [In XML File Tag:-<PRACTICEAREA>].
Check First Line of <ABSTRACT> in File.
Check Summary Body in <SUMMARYBODY> XML File.
Check Paragraph Indent in XML File.
Check Italics in XML File.
Check Superscript for Footnotes in XML File.
Check Paragraph Starting and End Tagsadd XML tags in the data files.
Ask the Experts.
Schedule a free 30 minute consultation with our experts. We’d love to talk to you!