Client Profile.
The client is a Florida-based technology company that provides procurement and sourcing data on government contracts through an integrated web-based platform. Its comprehensive database includes detailed information on government spending across verticals, helping vendors and agencies explore leads for new business opportunities.
Business Need.
To maintain a comprehensive government spending database updated in near real time, the company aggregated millions of PO and tender records from multiple agencies. This included a nationwide contact directory of these agencies, with the names, phone numbers, email addresses and mailing addresses of officials, managers, etc.
Aggregating the voluminous contact data of federal, state and local agencies from multiple sources, while ensuring its credibility, was a challenging task.
The company therefore partnered with Hitech to aggregate, validate and cleanse its large government contact database and maintain data integrity and credibility for its subscribers.
Challenges.
Data aggregation and validation, which involved a mix of manual and automated processes, posed the following challenges:
- Dependence on agency authorities to respond to emails soliciting information, which added delays.
- Email responses were often unstructured and contained inaccuracies.
- Web scraping of contact data from unstructured and structured sources necessitated a cleansing, standardization and deduplication process.
- Validating the right sources for parsing data was critical to ensure data accuracy and authority.
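
The cleansing, standardization and deduplication step named above can be illustrated with a minimal Python sketch. The field names (`name`, `email`, `phone`), the matching key and the sample records are assumptions made for illustration; the case study does not describe Hitech's actual rules or tooling.

```python
import re


def standardize(record):
    """Return a copy with collapsed whitespace, lower-cased email and digits-only phone."""
    name = " ".join(record.get("name", "").split())
    email = record.get("email", "").strip().lower()
    phone = re.sub(r"\D", "", record.get("phone", ""))
    return {"name": name, "email": email, "phone": phone}


def deduplicate(records):
    """Keep the first record per email; fall back to (name, phone) when email is empty."""
    seen = set()
    unique = []
    for raw in records:
        rec = standardize(raw)
        # Assumed matching rule: email identifies a contact when present.
        key = rec["email"] or (rec["name"].lower(), rec["phone"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


# Hypothetical scraped rows: the first two are the same person formatted differently.
scraped = [
    {"name": "Jane  Doe", "email": "Jane.Doe@example.gov ", "phone": "(555) 010-0100"},
    {"name": "Jane Doe", "email": "jane.doe@example.gov", "phone": "555-010-0100"},
    {"name": "John Roe", "email": "", "phone": "555.010.0101"},
]
cleaned = deduplicate(scraped)
```

Standardizing before comparing is what lets the two differently formatted "Jane Doe" rows collapse into one; a production process would likely need fuzzier matching than this exact-key comparison.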
Solution.
Hitech delivered an updated database of close to 500,000 contacts at federal, state and local government agencies to power a robust, high-performing government spending platform. This involved a blend of manual and automated data collection, validation and cleansing processes. Interactive reports and dashboards with granular as well as analytical data were also included.
The verified and validated data improved the accuracy of the AI algorithms and ensured better and more relevant search results on construction project data.
Approach.
The client defined the government websites to be visited for agency contact data and shared the list with the Hitech data team.
- The Hitech team first designed a data extraction and collection workflow that factored in the number of contacts at each URL and the complexity of extracting them. URLs with 40+ contacts were funneled into the automated process, while the others were set aside for manual verification.
- For the manual list, our team emailed the designated authorities to solicit the relevant information. The responses received were then entered into a structured Excel file in a standardized format.
- Web crawlers customized for specific site structures and sources were deployed on the websites identified for automated processing. The scrapers were programmed to extract relevant contact data from the URLs and funnel it into a spreadsheet using standardized templates. Contact details included names, designations, mailing addresses, email addresses, phone numbers, etc.
- The database was cleansed, de-duplicated and standardized.
- Quality check: A double-layer validation and verification process was applied to ensure data authenticity. This included manual as well as rule-based validation against the source.
- Deliverables: Close to 500,000 contacts from 75,000 websites were updated with 99.5% accuracy. Detailed reports on the number of websites scraped vs. received, accuracy, productivity, etc. were shared with the client in weekly updates.
- Tools & technology: Custom scraping tools
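
Two steps of the approach above, routing URLs by contact count and rule-based validation, can be sketched in Python as follows. This is an illustrative sketch only: the function names, the reading of "40+" as "40 or more", the specific validation rules and the 10- or 11-digit US phone assumption are all hypothetical, and the comparison of each record against its source page is not shown.

```python
import re

AUTOMATION_THRESHOLD = 40  # assumed reading of "40+ contacts" as 40 or more


def route(url_contact_counts):
    """Split URLs into automated and manual queues by estimated contact count."""
    automated, manual = [], []
    for url, count in url_contact_counts.items():
        (automated if count >= AUTOMATION_THRESHOLD else manual).append(url)
    return automated, manual


# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(contact):
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if not contact.get("name", "").strip():
        problems.append("missing name")
    if not EMAIL_RE.match(contact.get("email", "")):
        problems.append("malformed email")
    digits = re.sub(r"\D", "", contact.get("phone", ""))
    if len(digits) not in (10, 11):  # assumed US-style phone numbers
        problems.append("unexpected phone length")
    return problems


# Hypothetical inputs for illustration.
automated, manual = route({
    "https://agency-a.example/staff": 120,
    "https://agency-b.example/contact": 6,
})
issues = validate({"name": "Jane Doe", "email": "jane.doe@example", "phone": "555-0100"})
```

Records that return a non-empty list from `validate` could then be passed to the manual review layer rather than being rejected outright, which mirrors the double-layer check described above.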