Case Studies

Delivering customer success since 1981

Technologies Used

HTML Agility Pack
C#
GATE
Lucene Tokenizer
TensorFlow
Java
JAPE
Perl

Project highlights

Daily robotic scans of deep websites
Harvest content across multiple formats: PDF, HTML, Word, RTF, XML, XLS

“DCL developed a series of best practices for web crawling and harvesting technologies, achieving fully automated processing against a wide range of diverse, complex and often poorly structured websites.

Our methodology has been iteratively refined to accommodate the ever-changing landscape of internet content and facilitate a model of continuous improvement.”

-Mark Gross, President, DCL

Global Financial Institution

Data Harvesting and AI Transformations

Keywords: web scraping, data harvesting, Artificial Intelligence, HTML, RTF, DOCX, TXT, XML

Background

Vast amounts of business-critical information appear only on public websites that are constantly updated with new and modified content. While the information on many of these websites is extremely valuable, no standards exist today for the way content is organized, presented and formatted, or for how individual websites are constructed or accessed. This creates a significant challenge for companies that need data from these websites downloaded promptly and structured to support business practices and downstream systems.

A major financial institution selected DCL to accurately track financial compliance requirements across hundreds of jurisdictions.

Solution

DCL analyzes and harvests content and data from more than 200 targeted websites. Each day, content is cleaned up, harmonized, and transformed to the customer’s XML schema. DCL employs GATE, Lucene Tokenizer, TensorFlow and rules-based software to decompose unstructured content, auto-style text, and annotate reference citations. Daily XML feeds are delivered to the client with new or changed content highlighted.
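
To make the "new or changed content highlighted" step concrete, here is a minimal Python sketch of one common way to detect changes between daily harvests: fingerprint each page's normalized text and compare it with the previous run. This is an illustration under stated assumptions, not DCL's actual pipeline. The function names, the `feed` and `document` elements, and the `source` and `status` attributes are all hypothetical and do not reflect the client's XML schema.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict


def fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so cosmetic reflows do not count as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(previous: Dict[str, str], harvested: Dict[str, str]) -> Dict[str, str]:
    """Label each harvested URL as 'new', 'changed', or 'unchanged'.

    previous  maps URL -> fingerprint stored from the last run.
    harvested maps URL -> raw text collected in today's run.
    """
    statuses = {}
    for url, text in harvested.items():
        digest = fingerprint(text)
        if url not in previous:
            statuses[url] = "new"
        elif previous[url] != digest:
            statuses[url] = "changed"
        else:
            statuses[url] = "unchanged"
    return statuses


def build_feed(harvested: Dict[str, str], statuses: Dict[str, str]) -> str:
    """Serialize only new or changed documents into a (hypothetical) XML feed."""
    root = ET.Element("feed")
    for url, text in harvested.items():
        if statuses[url] == "unchanged":
            continue
        # The status attribute is the "highlight" for downstream consumers.
        doc = ET.SubElement(root, "document", source=url, status=statuses[url])
        doc.text = " ".join(text.split())
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical snapshot from yesterday and harvest from today.
    yesterday = {
        "https://example.org/rule-a": fingerprint("Rule A applies to all filings."),
        "https://example.org/rule-b": fingerprint("Rule B: report within 30 days."),
    }
    today = {
        "https://example.org/rule-a": "Rule A  applies to\nall filings.",
        "https://example.org/rule-b": "Rule B: report within 15 days.",
        "https://example.org/rule-c": "Rule C is newly published.",
    }
    print(build_feed(today, classify(yesterday, today)))
```

Hashing normalized text keeps the comparison cheap across hundreds of sites, but it only flags that a page changed. Pinpointing which passages changed would need an additional diff step, which this sketch leaves out.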

Results

Our client, a major financial institution, now has a growing repository of structured legal documents with daily highlights of updated content. The data are primed for ingestion into downstream systems, which streamlines compliance processes. The system provides a level of risk avoidance previously unattainable.