
Data Harvesting

Harvest new and modified data from public websites

Website Harvesting and AI Transformations That Deliver Structured Data to Your Systems 


Organizations need to harvest and structure data and content posted and maintained on public websites. Websites are often the version of record for policy, procedure, legal, and regulatory content. Many businesses benefit from daily robotic scans of updated website content with structured XML feeds back into internal systems.

The volume and complexity of this type of information make manual approaches slow, error-prone, and cost-prohibitive. We provide automated website scraping configured to your business needs, with customized XML feeds back to your organization.

Tools: Data Harvester, GATE, Lucene tokenizer, Java, JAPE, Perl, TensorFlow


DCL’s solution goes beyond basic web scraping. DCL has developed methods and bots for high-volume data retrieval from hundreds of websites, in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in both European and Asian languages. We produce a unified data stream that is converted to XML for ingestion into derivative databases, data analytics platforms, and other downstream systems. Normalizing and transforming content in this way automates import into a customer’s business system, which is where the business value is realized. A key to successful projects is the depth and quality of up-front analysis, which ensures complete and accurate results.
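As a rough illustration of the HTML-to-XML step described above, the following Python sketch pulls the title and paragraph text out of a page and wraps them in a small XML record. It is not DCL's implementation: the `record`, `title`, `body`, and `para` element names, the `html_to_xml_record` function, and the example URL are all assumptions made for this example, and a production feed would follow whatever schema the receiving system requires.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects the page title and paragraph text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so layout changes do not alter the text.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def html_to_xml_record(html, source_url):
    """Hypothetical helper: turn one HTML page into one XML record string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    record = ET.Element("record", {"source": source_url})
    ET.SubElement(record, "title").text = parser.title
    body = ET.SubElement(record, "body")
    for text in parser.paragraphs:
        ET.SubElement(body, "para").text = text
    return ET.tostring(record, encoding="unicode")


sample = (
    "<html><head><title>Rule 12</title></head>"
    "<body><p>Filing is  due\nin 30 days.</p></body></html>"
)
xml_record = html_to_xml_record(sample, "https://example.org/rule-12")
print(xml_record)
```

Other source formats (RTF, DOCX, TXT) would each need their own extractor, but all of them could emit the same record shape, which is what makes the downstream stream uniform.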

WEB SCRAPING: SCIENCE OR ART?

RELATED CASE STUDY

A major financial institution selected Data Conversion Laboratory to accurately track financial compliance requirements across hundreds of jurisdictions. 


WHAT OUR CUSTOMERS SAY

I've got personal experience with DCL. Their Business Development department is very responsive, and their reputation is stellar. I've never heard anyone in my business have a bad word to say about them. Highly recommended.

  • What is XML conversion?
    XML conversion is the process of transforming content or data to the eXtensible Markup Language. The XML file format makes it possible and affordable to create content once and use it many times in print and digital formats.
  • What types of XML conversion projects are your specialty?
    Data Conversion Laboratory specializes in a broad range of XML conversion services. The most common conversion projects we manage are: Word to XML, HTML to XML, PDF to XML, Excel to XML, text to XML, XML to XML, and NLM XML to JATS XML.
  • What formats can your XML conversion service output?
    Our data conversion system is able to convert from practically any data source into any desired format, including: DITA & DITA Specializations; DITA Learning & Training; DITA for Publishers; PubMed; JATS; JATS and NLM DTDs; BITS; NISO STS; MathML; Bookshelf; MARC; NIMAS; DAISY; ATA; S1000D; MIL-STD; SGML; SPL (Structured Product Labeling); METS/ALTO; PRISM.
  • Can you convert legacy data to XML?
    Many companies and organizations have legacy software systems that use proprietary data formats that are difficult to access or incompatible with newer platforms. Our team of data engineers can develop a custom-tailored XML conversion solution for you. DCL's human and AI-based QA checkpoints can achieve up to 99.9% conversion accuracy.
  • What is the DCL Conversion Hub?
    The DCL Conversion Hub is a system we designed to streamline the process of converting data from practically any format into any other format. This hub-and-spoke software architecture accepts inputs in any format, then normalizes them into an XML-based superset from which any other structured format can be output.
  • Can you convert paper documents to XML?
    Yes, we can digitize papers using Optical Character Recognition (OCR), then run that data through our XML conversion services. Learn more about our OCR services.
  • Can you convert data from multiple sources or file types?
    Yes, DCL's Conversion Hub system can take in data from various sources, such as file servers, web servers, databases, and data feeds, then output it in your desired format. This enables you to centralize and unify data that might otherwise be unavailable to other parts of your organization.
  • What is DITA conversion?
    DITA stands for Darwin Information Typing Architecture. DITA is an XML data model for authoring and publishing content. The architecture was originally developed by IBM for its own technical publications and is now a standard for technical publications and documentation. DITA enables the interchange and interoperation of XML content from a wide variety of sources without requiring everyone involved to agree on a single overarching document type definition. It also enables reuse of content among different publications and within the same publication.
  • What is S1000D conversion?
    S1000D is an international XML specification for the procurement and production of technical publications. The specification is primarily used in the aerospace and defense industries.
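The hub-and-spoke arrangement described in the Conversion Hub answer above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the Conversion Hub itself: the `READERS` and `WRITERS` registries, the `convert` function, and the `records`/`record`/`field` hub structure are hypothetical names chosen for the example. Each input format gets one reader that normalizes into a common XML tree, and each output format gets one writer that starts from that tree.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def _to_hub(rows):
    """Normalize an iterable of dict-like rows into the common XML tree."""
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, "field", {"name": key}).text = str(value)
    return root


def read_csv(text):
    return _to_hub(csv.DictReader(io.StringIO(text)))


def read_json(text):
    return _to_hub(json.loads(text))


def write_json(hub):
    return json.dumps([{f.get("name"): f.text for f in rec} for rec in hub])


def write_xml(hub):
    return ET.tostring(hub, encoding="unicode")


# Spokes: one reader per input format, one writer per output format.
READERS = {"csv": read_csv, "json": read_json}
WRITERS = {"json": write_json, "xml": write_xml}


def convert(text, source_format, target_format):
    """Route every conversion through the hub representation."""
    hub = READERS[source_format](text)
    return WRITERS[target_format](hub)


csv_input = "id,title\n1,Rule 12\n"
as_json = convert(csv_input, "csv", "json")
as_xml = convert(csv_input, "csv", "xml")
print(as_json)
print(as_xml)
```

The design point is that supporting a new format means adding one reader or one writer rather than a separate converter for every source and target pair.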
Data Harvesting

DCL Data Harvester Comprises

  • Filtering programs

  • Downloading handler

  • Metadata gatherer

  • File differencing programs

  • Natural Language Processing programs

  • Data and content transformation programs

  • Secure repository
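To make the file differencing component in the list above more concrete, here is a minimal Python sketch of one way to detect new or modified content between two harvests. It is an assumption-laden example rather than a description of the DCL Data Harvester: the `fingerprint` and `detect_change` functions are hypothetical, and the choice of a SHA-256 hash over whitespace-normalized text followed by a unified diff is just one common approach.

```python
import difflib
import hashlib


def fingerprint(text):
    """Hash the text after collapsing whitespace, so layout-only edits are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_change(previous, current):
    """Return None when the content is unchanged, else a list of unified diff lines."""
    if fingerprint(previous) == fingerprint(current):
        return None
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


old = "Section 1\nFiling is due in 30 days.\n"
new = "Section 1\nFiling is due in 45 days.\n"
changes = detect_change(old, new)
for line in changes:
    print(line)
```

Comparing fingerprints first keeps the daily scan cheap for pages that have not changed, and the diff is only computed for pages that need to flow into the downstream feed.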

DCL’s solution harnesses technology in Natural Language Processing and Machine Learning to help enable solutions powered by Artificial Intelligence. With sophisticated automated processes, DCL optimizes content to collect information, streamline compliance, facilitate migration to new systems and databases, maximize reuse potential, and ready it for delivery to all outputs.

Mark Gross, President, DCL

DCL Data Harvester is a website scraping solution for any industry that relies on regulatory and compliance management data or needs to keep up to date with constantly changing website content. DCL conducts upfront human analysis of target websites and content to ensure your content and metadata are captured, structured, and complete.
