Delivering customer success since 1981
HTML Agility Pack
Daily robotic scans of deep websites
Harvest content across multiple formats: PDF, HTML, Word, RTF, XML, XLS
“DCL developed a series of best practices for web crawling and harvesting technologies, achieving fully automated processing against a wide range of diverse, complex and often poorly structured websites.
Our methodology has been iteratively refined to accommodate the ever-changing landscape of internet content and facilitate a model of continuous improvement.”
-Mark Gross, President, DCL
Global Financial Institution
Data Harvesting and AI Transformations
Keywords: web scraping, data harvesting, Artificial Intelligence, HTML, RTF, DOCX, TXT, XML
Vast amounts of business-critical information appear only on public websites, which are constantly updated with new and modified content. While the information on many of these websites is extremely valuable, no standards exist for how content is organized, presented, and formatted, or for how individual websites are constructed or accessed. This creates a significant challenge for companies that need data from these websites downloaded promptly and structured to support business practices and downstream systems.
A major financial institution selected DCL to accurately track financial compliance requirements across hundreds of jurisdictions.
DCL analyzes and harvests content and data from more than 150 targeted websites. Each day, content is cleaned up, harmonized, and transformed to the customer’s XML schema. DCL employs GATE, Lucene Tokenizer, TensorFlow, and rules-based software to decompose unstructured content, auto-style text, and annotate reference citations. Daily XML feeds are delivered to the client with new or changed content highlighted.
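The idea of a daily feed that flags new or changed content can be illustrated with a minimal sketch. This is not DCL's implementation: it assumes each harvested page is reduced to plain text, compares a SHA-256 fingerprint of today's text against the fingerprint stored from the previous run, and writes only new or changed documents to a simple XML feed. The function names, the `feed` and `document` element names, and the `status` attribute are hypothetical and do not represent the customer's XML schema.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout-only differences are ignored."""
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Return a stable SHA-256 digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def classify_changes(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Label each harvested page as new, changed, or unchanged.

    previous maps source URL -> fingerprint saved from the last run.
    current maps source URL -> text harvested today.
    """
    status = {}
    for url, text in current.items():
        if url not in previous:
            status[url] = "new"
        elif previous[url] != fingerprint(text):
            status[url] = "changed"
        else:
            status[url] = "unchanged"
    return status


def build_feed(current: Dict[str, str], status: Dict[str, str]) -> str:
    """Serialize only new or changed pages into a hypothetical XML feed."""
    root = ET.Element("feed")
    for url, text in current.items():
        if status[url] == "unchanged":
            continue
        doc = ET.SubElement(root, "document", {"source": url, "status": status[url]})
        doc.text = normalize(text)
    return ET.tostring(root, encoding="unicode")
```

In a sketch like this, the fingerprints from each run would be stored and passed in as `previous` on the following day. Normalizing whitespace before hashing is one possible design choice to avoid flagging pages whose layout changed but whose wording did not.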
The client now has a growing repository of structured legal documents with daily highlights of updated content. The data are primed for ingestion into downstream systems, which streamlines compliance processes. The system provides a level of risk avoidance previously unattainable.