Data Harvesting from the Web Delivers Mission-Critical Information in a Big Data World

markgross
Jul 16, 2019
Updated: Jul 28, 2022

The amount of content posted online, whether on social media or elsewhere on the internet, is staggering. Roughly 3.5 billion Google searches occur every day, Twitter boasts nearly 500,000 tweets per minute, and an astounding 90 percent of the data hosted on the internet was created and posted in the last 40 months.*

Vast amounts of business-critical data are among the torrent of new content, and much of this information, crucial to companies and organizations across diverse industries, is posted on public websites. These daily refreshes can include new, updated, and modified content that affects regulatory requirements, safety protocols, and other critical competencies for businesses around the world, yet very often this content is not delivered to the relevant parties. Instead, these organizations must search for, collect, process, and format this information to suit their needs and systems.

In most cases, a manual approach to identifying and harvesting these updates is impractical, if not impossible.

Data Harvesting

Data harvesting is the process of programmatically combing through content posted online, collecting specific and relevant information, then formatting, storing and republishing the information—either internally or on companies' own websites—to stay up-to-date with this public-facing data.
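
As a rough illustration of the "collecting specific and relevant information" step, the sketch below pulls the visible text out of a page and discards script and style blocks. It is a minimal example built on Python's standard library, not a description of how DCL Data Harvester works; the names TextExtractor and extract_text are hypothetical, and a production harvester would also need fetching, scheduling, and site-specific extraction rules.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from a page, skipping script and style blocks."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> list:
    """Return the visible text fragments of an HTML document, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling extract_text on a downloaded page returns a list of text fragments that later steps can format, store, and compare.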

While data harvesting (AKA data scraping or data mining) might be more popularly associated with nefarious business practices in some circles (think Facebook and Cambridge Analytica), there are serious reasons why businesses and organizations rely on this process for beneficial, real-world applications.

The need for solutions for data collection and management is real, and it is growing. More and more companies are turning to DCL looking for ways to keep their content current because there are serious implications to not doing so.

DCL Data Harvester

DCL Data Harvester employs the latest innovations in artificial intelligence, combining machine learning and natural language processing to organize and structure content and data.

Automated data harvesting, which employs GATE, Lucene Tokenizer, TensorFlow, and rules-based software, scouts for daily updates, decomposes unstructured content, auto-styles text, and annotates reference citations when collecting data from hundreds of websites.

Many organizations depend on reliable web data to ensure industry-specific compliance. Industries such as finance, scholarly publishing, and state, local, and federal government all operate under various compliance directives. Data harvesting serves these industries by:

collecting content and data from hundreds of targeted websites.
scrubbing, transforming, and normalizing the content to a customer’s XML or database schema.
delivering a feed identifying new or changed content.
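
The normalize-and-deliver steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not DCL's implementation: the feed and item element names are invented for the example (a real deployment would target the customer's own XML or database schema), and change detection here is a plain SHA-256 fingerprint of whitespace-normalized text.

```python
import hashlib
import xml.etree.ElementTree as ET


def normalize(text: str) -> str:
    """Collapse whitespace so cosmetic changes do not register as updates."""
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Hash the normalized text so versions can be compared cheaply."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def build_feed(pages: dict, seen: dict) -> ET.Element:
    """Return a <feed> of new or changed pages and update `seen` in place.

    pages: URL -> harvested text from today's run
    seen:  URL -> fingerprint recorded on the previous run
    The element and attribute names are hypothetical.
    """
    feed = ET.Element("feed")
    for url, text in sorted(pages.items()):
        digest = fingerprint(text)
        previous = seen.get(url)
        if previous == digest:
            continue  # unchanged since the last run
        status = "new" if previous is None else "changed"
        item = ET.SubElement(feed, "item", url=url, status=status)
        item.text = normalize(text)
        seen[url] = digest
    return feed
```

Running build_feed once per day with a persisted seen mapping yields a feed containing only the pages that are new or whose text has changed; ET.tostring(feed) serializes it for delivery.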

Practical Applications of White Hat Data Harvesting

Finance

Federal and state governments operate numerous agencies that regulate and oversee financial markets and companies, and the ever-changing list of regulations and procedures is critical to operations.

"A global financial institution came to DCL looking for a way to accurately track regulatory compliance requirements across hundreds of jurisdictions," explains Bilitzky. "Today we robotically scan web content daily to identify changes. They now have a growing repository of structured legal documents with daily highlights of updated content. The financial data and regulations are primed for ingestion in downstream systems, which streamlines compliance processes and provides a level of risk avoidance that was previously unattainable."

Scholarly publishing

Thousands of scholarly publications are published and distributed around the world every day. For the hundreds of thousands of contributors submitting articles to these publications, manually tracking the journal guidelines—and any subsequent updates and changes—would be an impossible task. Data harvesting can provide a daily scan of all the journal publishers' websites, hunting for changes and updates. Once collected, the information is returned to the source and formatted in an organized, searchable manner, with updated and changed content highlighted.
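
To make "updated and changed content highlighted" concrete, here is a minimal sketch that compares two versions of a guideline page line by line using Python's standard difflib module. The function name highlight_changes and the sample guideline text are hypothetical, and a real system would likely present the differences in a richer format.

```python
import difflib


def highlight_changes(old: str, new: str) -> list:
    """Return only the added (+) and removed (-) lines between two versions."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop both.
    return [line for line in diff if line.startswith(("+ ", "- "))]


yesterday = "Word limit: 5000\nAbstract: 250 words"
today = "Word limit: 6000\nAbstract: 250 words"
for line in highlight_changes(yesterday, today):
    print(line)
```

Lines prefixed with "-" were removed from the earlier version and lines prefixed with "+" were added, so an author can see at a glance which guideline changed.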

These searches can also return errata (corrected errors appended to a book or journal) and addenda, as well as funder information and other useful content relevant to authors, publishers, and industry-related service providers.

Government agencies

For a government agency like the Department of Defense, data harvesting could be used to collect information pertaining to the tools, conditions, materials, and personnel required to perform specific procedures. This can include crucial information related to launching missiles or shutting down a nuclear reactor. The same process can return alerts, warnings, and cautions to other heavily regulated industries, such as aerospace, so organizations can update their data repositories for accurate data sustainment and consistency.

As more mission-critical content is routinely posted online, the need for a streamlined, programmatic approach to harvest, verify, and format that content is vital for businesses that rely on external information. Data harvesting employing AI technology can be invaluable for organizations striving to keep pace with today's data-driven business environment.

*Sourced from Forbes.

2 Comments

Marianne Calilhanna

Aug 29, 2019

@bradg.2029 - Good question. Typically we deliver XML feeds back to the system, customer, platform. If you are looking for another mechanism, I can bring my tech team into a conversation. Just let me know! Thanks for your comment.

bradg.2029

Aug 28, 2019

How does the Data Harvester integrate with other software? Looks like it could be a good solution to deliver data into my analytics platform: https://www.bisok.com/data-science-workbench/augmented-analytics/ Thanks in advance