
© 2019 Data Conversion Laboratory. All rights reserved.

Case Studies

Delivering customer success since 1981

Technologies Used

  • HTML Agility Pack

  • C#

  • GATE

  • Lucene Tokenizer

  • TensorFlow

  • Java

  • JAPE

  • Perl

Project highlights

  • Daily robotic scans of deep websites

  • Harvest content across multiple formats: PDF, HTML, Word, RTF, XML, XLS
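As a rough illustration of the harvesting step above, the sketch below dispatches harvested URLs to per-format handlers and pulls visible text out of HTML pages. It uses only the Python standard library; DCL's production pipeline is built on tools such as HTML Agility Pack and GATE, so the function and handler names here are illustrative assumptions, not DCL's actual code.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath

# Map file extensions to a handler key, covering the formats listed above.
FORMATS = {".pdf": "pdf", ".html": "html", ".htm": "html",
           ".doc": "word", ".docx": "word", ".rtf": "rtf",
           ".xml": "xml", ".xls": "excel", ".xlsx": "excel"}

def classify(url: str) -> str:
    """Pick a format handler for a harvested URL by file extension."""
    suffix = PurePosixPath(url.split("?", 1)[0]).suffix.lower()
    return FORMATS.get(suffix, "html")  # extensionless pages default to HTML

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML document, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> elements
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML page as one whitespace-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In a real crawl, scheduling, authentication, and per-site extraction rules dominate the work; this only shows the format-dispatch idea.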

“DCL developed a series of best practices for web crawling and harvesting technologies, achieving fully automated processing against a wide range of diverse, complex, and often poorly structured websites. Our methodology has been iteratively refined to accommodate the ever-changing landscape of internet content and facilitate a model of continuous improvement.”

– Mark Gross, President, DCL

Global Financial Institution

Data Harvesting and AI Transformations

Keywords: web scraping, data harvesting, Artificial Intelligence, HTML, RTF, DOCX, TXT, XML

Background

Vast amounts of business-critical information appear only on public websites that are constantly updated with new and modified content. While the information on many of these websites is extremely valuable, no standards exist today for how content is organized, presented, and formatted, or for how individual websites are constructed or accessed. This creates a significant challenge for companies that need data from these websites downloaded, structured, and delivered in a timely manner to support business practices and downstream systems.

A major financial institution selected DCL to accurately track financial compliance requirements across hundreds of jurisdictions. 

Solution

DCL analyzes and harvests content and data from more than 150 targeted websites. Each day, content is cleaned, harmonized, and transformed to the customer's XML schema. DCL employs GATE, Lucene Tokenizer, TensorFlow, and rules-based software to decompose unstructured content, auto-style text, and annotate reference citations. Daily XML feeds are delivered to the client with new or changed content highlighted.
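The "new or changed content highlighted" step can be sketched as a paragraph-level diff between yesterday's and today's harvest, with changed paragraphs flagged in the output XML. The element and attribute names below are hypothetical, since the client's schema is not described in the case study; this is a minimal sketch using the Python standard library's difflib and ElementTree.

```python
import difflib
import xml.etree.ElementTree as ET

def diff_feed(old_paras, new_paras, doc_id):
    """Build an XML fragment for today's document, marking paragraphs that
    are new or changed relative to the previous harvest.
    (Element/attribute names are illustrative, not the client's schema.)"""
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)
    doc = ET.Element("document", id=doc_id)
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        for para in new_paras[j1:j2]:  # emit only today's paragraphs
            el = ET.SubElement(doc, "para")
            el.text = para
            if op in ("replace", "insert"):
                el.set("changed", "yes")  # highlight for downstream review
    return doc
```

A paragraph-level SequenceMatcher keeps unchanged runs cheap to compare while still localizing edits, which matters when most of a jurisdiction's pages are stable from day to day.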

Results

Our client, a major financial institution, now has a growing repository of structured legal documents with daily highlights of updated content. The data are primed for ingestion in downstream systems, which streamlines compliance processes. The system provides a level of risk avoidance previously unattainable.