DCL Data Harvester

Harvest new and modified data from public websites

Website Harvesting and AI Transformations That Deliver Structured Data to Your Systems

Organizations need to harvest and structure data and content posted and maintained on public websites. Websites are often the version of record for policy, procedure, legal, and regulatory content. Many businesses benefit from daily robotic scans of updated website content with structured XML feeds back into internal systems.

The volume and complexity of this type of information mean that manual approaches are slow, error-prone, and cost-prohibitive. DCL Data Harvester provides automated website scraping configured to your business needs, with customized XML feeds delivered back to your organization.

Benefits

Daily robotic scans of websites important to your business
Harvest new and modified content in a variety of source formats: PDF, HTML, XML, RTF, Word
Analyze, cleanse, and harmonize data
Provide cross-reference linking
Convert to a target XML schema for delivery

A deeper solution beyond simple website scraping

DCL goes beyond simple web scraping: it combines website harvesting with AI-based transformation of content into structured formats your systems can use.

Some sites provide RSS feeds for updates, but feeds are limited to what the website administrator chooses to include. Projects often need to go further: supplying missing or inaccurate metadata, filtering changes, normalizing content, and meeting format and publishing requirements.

Sites are global and multilingual and contain information in multiple formats, such as HTML, PDF, XML, RTF, and DOCX. This calls for a deeper solution in which data is downloaded, normalized, structured, and converted into a common XML format with defined metadata, and in which related content is linked. It is also critical that website crawling does not look like an attack on the target system, which could trigger distributed denial-of-service (DDoS) alarms.
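One common way to keep a crawler from resembling an attack is to enforce a minimum delay between requests to the same host. The following is a minimal sketch of that idea, not DCL's implementation: the class name `HostThrottle`, the 5-second default interval, and the injectable `clock` and `sleep` parameters are all assumptions made for illustration.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host (illustrative)."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on one host (assumed value)
        self._clock = clock               # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_hit = {}               # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return the seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            # Wait out whatever remains of the per-host interval.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_hit[host] = now + delay
        return delay
```

A crawler would call `throttle.wait(url)` immediately before each download. Requests to different hosts do not delay one another, so overall throughput across many sites stays high while each individual site sees only light traffic.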

DCL has developed methods and bots to facilitate high-volume data retrieval from hundreds of websites, in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in both European and Asian languages. We produce a unified data stream that is converted to XML for ingestion into derivative databases, data analytics platforms, and other downstream systems. Normalizing and transforming content so that it imports automatically into a customer’s business system is what maximizes business value. A key to successful projects is the depth and quality of up-front analysis, which ensures complete and accurate results.
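As a rough illustration of what a unified XML stream can look like, the sketch below serializes harvested records from mixed source formats into one XML document using Python's standard library. The element and attribute names (`harvest`, `document`, `source-format`, `lang`, `url`, `title`, `body`) and the record fields are hypothetical; an actual delivery schema would be defined per customer.

```python
import xml.etree.ElementTree as ET


def to_xml(records):
    """Serialize harvested records into one XML document (hypothetical schema)."""
    root = ET.Element("harvest")
    for rec in records:
        # Metadata that applies to the whole document is stored as attributes.
        doc = ET.SubElement(root, "document", {
            "source-format": rec["format"],
            "lang": rec["lang"],
        })
        ET.SubElement(doc, "url").text = rec["url"]
        ET.SubElement(doc, "title").text = rec["title"]
        ET.SubElement(doc, "body").text = rec["text"]
    # ElementTree escapes characters such as & and < during serialization.
    return ET.tostring(root, encoding="unicode")
```

Because every record ends up in the same structure regardless of whether it started as HTML, PDF, or DOCX, a downstream system needs only one import routine.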

DCL Data Harvester comprises

Filtering programs
Downloading handler
Metadata gatherer
File differencing programs
Natural Language Processing programs
Data and content transformation programs
Secure repository
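To make the idea of detecting new and modified content concrete, here is a minimal sketch that compares content hashes between two harvesting runs. It is a simple stand-in technique, not a description of DCL's file differencing programs: the function names `digest` and `classify`, the choice of SHA-256, and the dictionary-based snapshot format are assumptions for illustration.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of downloaded content."""
    return hashlib.sha256(content).hexdigest()


def classify(previous_digests, fetched):
    """Split fetched pages into new and modified URLs.

    previous_digests: dict of URL -> digest saved from the last run
    fetched:          dict of URL -> raw bytes downloaded in this run
    Returns (new_urls, modified_urls, latest_digests).
    """
    new, modified = [], []
    latest = {}
    for url, content in fetched.items():
        d = digest(content)
        latest[url] = d
        if url not in previous_digests:
            new.append(url)          # never seen before
        elif previous_digests[url] != d:
            modified.append(url)     # seen before, content changed
    return sorted(new), sorted(modified), latest
```

Storing `latest` after each run and passing it as `previous_digests` on the next run means only new or changed documents need to go through the more expensive transformation steps. A byte-level hash flags any change, including cosmetic ones, so a production system would likely normalize content before hashing.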

DCL’s solution harnesses technology in Natural Language Processing and Machine Learning to help enable solutions powered by Artificial Intelligence. With sophisticated automated processes, DCL optimizes content to collect information, streamline compliance, facilitate migration to new systems and databases, maximize reuse potential, and ready it for delivery to all outputs.

Mark Gross, President, DCL

DCL Data Harvester is an ideal website scraping solution for industries that rely on regulatory and compliance management data and need to keep up with constantly changing website content. DCL conducts up-front human analysis of target websites and content to ensure your content and metadata are captured, structured, and complete.