The client is a business intelligence firm that equips customers with market intelligence and data-backed insights on suppliers and systems, helping them make informed tendering decisions. It also offers consultancy services for liaising effectively with government agencies. Backed by extensive experience and expertise, the firm serves clients across England in the consultancy, media & market, e-procurement, and supply chain solutions sectors.
Driven by increasing customer demand for exhaustive industry and government data, the company wanted to derive intelligent insights from an integrated database built through data capture from disparate online sources. The database would serve as a single source of information on all government tenders for manufacturing, construction, and media & market, along with consultancy insights for decision making. To implement this, they were looking for a technology partner to support them with:
- Acquisition of real-time and updated data from multiple online sources
- Automation of data sourcing, web extraction, data formatting, and data integration processes
- Dashboard creation and alerts for real-time visibility and insights into tender data
The company approached Hitech to capture, standardize, and classify this government tender data from thousands of disparate online sources.
The firm needed to extract large volumes of up-to-date tender data from thousands of government websites. Their existing manual web data scraping approach posed multiple challenges:
- Capturing data manually by crawling over 4000 websites on an ongoing basis.
- Organizing the huge volumes of data captured from disparate sources into a structured format was error-prone.
- Each government site had a specific format and required custom data extraction techniques.
- Complicated navigation on websites needed technical know-how to extract relevant data.
Delivery of an optimized and automated data management solution using RPA, Mozenda, and Python to accelerate data processes such as:
- Data extraction, conversion, cleansing, and updating through rule-driven algorithms
- Synchronizing structured data directly with the database
- Creation of intuitive reporting dashboard for activity logs, website issues, missing instances, volumes, etc.
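The rule-driven cleansing step can be illustrated with a minimal sketch. The field names, date format, and rules below are assumptions for illustration, not the client's actual schema:

```python
import re
from datetime import datetime

# Hypothetical field-level cleansing rules; each rule normalizes one field.
CLEANSING_RULES = [
    ("title", lambda v: re.sub(r"\s+", " ", v).strip()),            # collapse whitespace
    ("deadline", lambda v: datetime.strptime(v, "%d/%m/%Y").date().isoformat()),
    ("value", lambda v: float(re.sub(r"[£,]", "", v))),             # strip currency marks
]

def cleanse(record: dict) -> dict:
    """Apply each field rule in order; leave unknown fields untouched."""
    cleaned = dict(record)
    for field, rule in CLEANSING_RULES:
        if field in cleaned:
            cleaned[field] = rule(cleaned[field])
    return cleaned

raw = {"title": "  Road  resurfacing\ttender ",
       "deadline": "01/09/2021",
       "value": "£1,250,000"}
print(cleanse(raw))
# {'title': 'Road resurfacing tender', 'deadline': '2021-09-01', 'value': 1250000.0}
```

Because every rule is data-driven, new source formats can be supported by extending the rule table rather than rewriting pipeline code.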
- Division of the project into three phases and implementation of close collaboration between analysts at the client’s end and Hitech’s automation engineers.
- Definition and documentation of metrics, milestones, quality benchmarks and scope of work.
- Creation of web forms by the project engineers, shared with the client to collect inputs on source website addresses, data formats and types, focus keywords, and other relevant details to support bot creation.
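The inputs collected through those web forms map naturally onto a per-site bot configuration. A minimal sketch, with field names assumed from the details listed above rather than taken from Hitech's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class BotConfig:
    """Hypothetical per-source configuration derived from the client web form."""
    source_url: str
    data_format: str                 # e.g. "HTML table", "PDF link list"
    data_type: str                   # e.g. "construction tender"
    focus_keywords: list = field(default_factory=list)

    def matches(self, text: str) -> bool:
        """A page is in scope if any focus keyword appears in its text."""
        text = text.lower()
        return any(k.lower() in text for k in self.focus_keywords)

cfg = BotConfig("https://example.gov.uk/tenders", "HTML table",
                "construction tender", ["tender", "procurement"])
print(cfg.matches("Invitation to Tender: highway works"))  # True
```

Capturing these details up front lets one engineer configure a new bot without re-examining each source site from scratch.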
- Phase 1: Automating web scraping and cleansing
- Development of custom web scrapers for each source site structure and data by the automation engineers.
- Reduction in web scraping time from several hours to a few minutes by creating a custom agent in the Mozenda console for each site, tailored to its format and information presentation.
- Extraction of data from all source websites using a few thousand web scraper bots.
- Use of manual web scraping for the roughly 10% of source URLs that blocked scraper bots or were broken, to avoid missing any information.
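The blocked-or-broken fallback can be sketched in plain Python. The real solution used custom Mozenda agents per site; this is an illustrative stand-in that queues failing URLs for manual handling, mirroring the ~10% manual fallback described above:

```python
import urllib.request
import urllib.error

def fetch(url: str, timeout: float = 10.0):
    """Return (html, None) on success, or (None, reason) for the manual queue."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace"), None
    except urllib.error.HTTPError as e:
        # 401/403 often mean the site refuses bots; 404 means a broken URL.
        return None, f"HTTP {e.code}"
    except (urllib.error.URLError, TimeoutError) as e:
        return None, f"unreachable: {e}"

manual_queue = []
for url in ["https://example.invalid/tenders"]:   # .invalid never resolves
    html, reason = fetch(url)
    if html is None:
        manual_queue.append((url, reason))
print(manual_queue)
```

Routing failures into a queue, rather than silently skipping them, is what guarantees the "no missing information" goal.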
- Phase 2: Data formatting and validation
- Presentation of information in XML format, auto-exported to a shared drive to trigger the RPA tool – Automation Anywhere – for further actions.
- Automation Anywhere checks the shared drive for files, continuously maps them against keywords and relevance criteria, validates them, and removes duplicate entries.
- Phase 3: Data science and creating dashboards
- Receipt of data at the Hitech data center for validation against the entries in the database at the client’s end.
- Manual addition to the XML file of any missing or additional entries received from web scraping.
- Creation of dashboards using Python to drive actionable insights from the data.
- Dashboard insights would help identify websites that were inactive or inaccessible and remove them from the web scraping list.
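The inactive-site insight can be sketched as a simple aggregation over the scraping activity log. The dashboards were built in Python; the log format and failure threshold below are illustrative assumptions:

```python
from collections import Counter

# Hypothetical activity log: (site, run status) pairs from the scraping bots.
scrape_log = [
    ("siteA.gov.uk", "ok"), ("siteA.gov.uk", "ok"),
    ("siteB.gov.uk", "error"), ("siteB.gov.uk", "error"),
    ("siteC.gov.uk", "ok"), ("siteC.gov.uk", "error"),
]

def inactive_sites(log, failure_rate: float = 0.9) -> list:
    """Flag sites whose runs almost always fail, for removal from the list."""
    runs, fails = Counter(), Counter()
    for site, status in log:
        runs[site] += 1
        if status != "ok":
            fails[site] += 1
    return [s for s in runs if fails[s] / runs[s] >= failure_rate]

print(inactive_sites(scrape_log))  # ['siteB.gov.uk']
```

Pruning dead sources this way keeps bot capacity focused on sites that still publish tenders.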
- Quality checks:
- Quality check of random samples in the initial phase; validation of data against source through bots.
- Checks by a senior QC specialist for exceptions, such as garbage values, where deploying bots isn’t possible.
- Manual rectification of errors and fixes to the bots’ logical programming.
- Reduction in the QC sample size as sampling errors decreased.
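The adaptive QC sampling described above can be sketched as a simple rule: shrink the sample once observed error rates stay low. The thresholds here are assumptions for illustration, not the documented quality benchmarks:

```python
def next_sample_size(current: int, error_rate: float,
                     floor: int = 50, threshold: float = 0.01) -> int:
    """Halve the QC sample when errors fall below threshold, else keep it."""
    if error_rate < threshold:
        return max(floor, current // 2)
    return current

# Simulated QC rounds: errors above threshold keep the sample at 1000,
# then two clean rounds halve it twice.
size = 1000
for observed in [0.05, 0.008, 0.004]:
    size = next_sample_size(size, observed)
print(size)  # 250
```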
- Shipping of final custom scraping bots, RPA tools and dashboards to the client.