Web Scraping Services of Healthcare Data

Web Scraping of unstructured healthcare data from multiple web threads using NLP and RPA

Client Profile.

Founded in 2007, the US-based medical knowledge management company manages a large repository of healthcare data and shares relevant insights with consumers, patients, healthcare service providers and caregivers. These insights support evidence-based healthcare decisions, from healthy living and prevention to diagnosis, treatment and home care.

Business Need.

To further its healthcare data management business, the company was building a comprehensive repository of healthcare data gathered from threads in more than 70 subreddits. The subreddits covered a wide spectrum of conditions, from common ones such as blood pressure, diabetes, diet and weight loss to serious ones such as depression, cancer, liver disease and alcoholism, among many others. The types of data to be extracted included:

Disease condition
Username of topic creator
Usernames of respondents on main comments
Usernames of respondents on sub-comments
Text of all the topics, discussions, comments, replies and sub-comments

The company approached HitechDigital to capture and standardize Reddit posts and discussions and to classify them by race, ethnicity, age, etc.

Challenges.

The team at HitechDigital studied the processes to understand the scope of work, the technology to be used, and the workflow to be designed. The following project requirements added to the difficulty of the process:

Extract unstructured data in the form of paragraphs or free text with no repeating pattern
Identify, flag and skip comments removed by moderators, which were not to be collected
Expand sub-replies and collapsed comments according to the hierarchy to collect/scrape data
Skip conversations that broke the continuity of the discussion thread and led to another page altogether
Omit extraneous data such as timestamps, points, etc., which increased project complexity
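
The skip and omit requirements above can be illustrated with a minimal Python sketch. It is not HitechDigital's implementation: it assumes each comment has already been parsed into a dictionary with hypothetical `author`, `body` and `replies` keys, and that removed comments carry a placeholder body such as `[removed]` or `[deleted]`.

```python
from typing import Any, Dict, Optional

# Assumption: placeholder bodies that mark removed or deleted comments.
REMOVED_MARKERS = {"[removed]", "[deleted]"}
# Assumption: the only fields the output needs; timestamps, points, etc. are dropped.
KEPT_FIELDS = ("author", "body")


def clean_comment(comment: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a trimmed copy of a comment, or None if the comment was removed."""
    if comment.get("body", "").strip() in REMOVED_MARKERS:
        # Design choice for this sketch: drop the removed comment and its replies.
        return None
    cleaned: Dict[str, Any] = {field: comment.get(field, "") for field in KEPT_FIELDS}
    cleaned["replies"] = []
    for reply in comment.get("replies", []):
        kept = clean_comment(reply)
        if kept is not None:
            cleaned["replies"].append(kept)
    return cleaned
```

Dropping the whole subtree under a removed comment is one possible reading of "skip"; a pipeline that needs the surviving replies would keep them and only blank out the removed parent.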

Solution.

HitechDigital designed an ongoing web scraping and unstructured data extraction process, using NLP and RPA, to collect data from healthcare posts and discussions on subreddits. In the automated data management workflow, collected data that reached the final stage triggered a macro that moved it through a predefined quality and profiling process.

Approach.

Manual data extraction requires complex workflows and significant hand-coding to extract, cleanse, and validate unstructured data. So, data professionals at HitechDigital started off by deriving a smarter, easier way to automate unstructured data extraction workflows.

Implementation:

The email/FTP/folder integration feature of the automated workflow was used to receive sets of subreddit pages and predefined instructions for the output.
Programmable bots and scripts were used to scrape and tag discussion threads in a predefined hierarchical sequence: discussion title, description, username and replies.
The execution order was:
- Pick a topic and collect all the discussion links, and their count, from healthcare subreddit pages
- Go through each discussion > Sort by OLD > Expand at Maximum > Save Webpage
- Read each discussion page and extract the requisite data
- Generate a structured text file output in a predefined hierarchical form
- Present the data in the form of comprehensive reports and dashboards
The workflow supported auto-creation of data extraction patterns to speed up data preparation and improve data quality.
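
As an illustration of the "structured text file output in a predefined hierarchical form" step, here is a minimal Python sketch. The field labels, the indentation scheme and the `render_thread` function are assumptions made for this example; the source does not specify the actual output layout.

```python
from typing import Any, Dict, List


def render_thread(title: str, description: str, author: str,
                  comments: List[Dict[str, Any]], indent: str = "    ") -> str:
    """Render one discussion as indented text: title, author, description, replies."""
    lines = [f"TITLE: {title}", f"AUTHOR: {author}", f"DESCRIPTION: {description}"]

    def walk(nodes: List[Dict[str, Any]], depth: int) -> None:
        # Each nesting level is indented one step further than its parent.
        for node in nodes:
            lines.append(f"{indent * depth}{node['author']}: {node['body']}")
            walk(node.get("replies", []), depth + 1)

    walk(comments, 1)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [{"author": "user_a", "body": "First reply",
               "replies": [{"author": "user_b", "body": "Sub-comment"}]}]
    print(render_thread("Example topic", "Example description", "topic_creator", sample))
```

Comments are written in the order they appear in the input, so sorting by oldest first, as in the execution order above, has to happen before rendering.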

Quality Check and Audit:

Programmable bots and macros were deployed to validate and verify extracted data.
Alerts/notifications for missing or deleted comments were generated with ML-backed algorithms.
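
A simple form of this validation can be sketched in Python. Note that this is a plain rule-based check, not the ML-backed alerting described above, and the `find_issues` function and the `author`, `body` and `replies` keys are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List


def find_issues(comments: List[Dict[str, Any]], path: str = "root") -> List[str]:
    """Walk a comment tree and report records with a missing author or body."""
    issues: List[str] = []
    for index, comment in enumerate(comments):
        where = f"{path}/{index}"  # position of the comment in the hierarchy
        if not comment.get("author"):
            issues.append(f"{where}: missing author")
        if not comment.get("body", "").strip():
            issues.append(f"{where}: missing body")
        issues.extend(find_issues(comment.get("replies", []), where))
    return issues
```

An empty result means every comment in the tree has both fields; any returned message could feed an alert or notification step.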

Dispatch:

Automated upload of the comprehensive output text file after conversion, transformation, and validation
