Question 1

How do you harvest data from websites?

Accepted Answer

DCL starts with a manual analysis of the target websites and content by our expert engineering team. We then use our in-house data harvesting software and custom scripts to scrape, restructure, and validate the collected data. We take special care to ensure that harvesting does not overload the target services or amount to an accidental denial-of-service (DoS) attack.
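
To illustrate the point about not overloading target services, here is a minimal Python sketch of a throttled fetcher that waits between requests to the same host. It is not DCL's in-house software: the `PoliteFetcher` class, the two-second delay, and the user-agent string are all assumptions made for the example.

```python
import time
import urllib.request
from urllib.parse import urlsplit


class PoliteFetcher:
    """Fetch pages one at a time, pausing between requests to the same host.

    Hypothetical helper for illustration; the default delay is an assumption.
    """

    def __init__(self, min_delay=2.0, user_agent="example-harvester/0.1",
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.user_agent = user_agent
        self._clock = clock          # injectable so the timing logic can be tested
        self._sleep = sleep
        self._last_request = {}      # host -> time of the most recent request

    def wait_for(self, host):
        """Sleep until at least min_delay seconds have passed for this host."""
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._clock()

    def fetch(self, url):
        """Download one URL after honoring the per-host delay."""
        self.wait_for(urlsplit(url).netloc)
        request = urllib.request.Request(
            url, headers={"User-Agent": self.user_agent}
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
```

A real harvester would typically also honor `robots.txt` and back off on error responses; those steps are left out to keep the sketch short.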

Question 2

What source and file formats can data be harvested from?

Accepted Answer

Data harvesting can extract data from HTML, RTF, DOCX, TXT, XML, RSS, XLSX, CSV, and practically any other file format.
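
As a rough illustration of handling several source formats, the sketch below pulls plain text out of TXT, CSV, XML/RSS, and HTML content using only the Python standard library. The `extract_text` function and its format labels are hypothetical names chosen for this example. Container formats such as DOCX and XLSX need additional handling and are deliberately left out.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(content, fmt):
    """Return a list of non-empty text fragments from a string of content."""
    fmt = fmt.lower()
    if fmt == "txt":
        return [line.strip() for line in content.splitlines() if line.strip()]
    if fmt == "csv":
        cells = []
        for row in csv.reader(io.StringIO(content)):
            cells.extend(cell.strip() for cell in row if cell.strip())
        return cells
    if fmt in ("xml", "rss"):  # RSS is an XML vocabulary
        root = ET.fromstring(content)
        return [text.strip() for text in root.itertext() if text.strip()]
    if fmt == "html":
        parser = _TextExtractor()
        parser.feed(content)
        parser.close()
        return parser.chunks
    raise ValueError(f"unsupported format: {fmt}")
```

Dispatching on a format label keeps each parser isolated, so support for another format can be added without touching the existing branches.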

Question 3

What types of data can be collected during data harvesting?

Accepted Answer

Data harvesting can gather text, metadata, images, videos, and other files from online sources.

Question 4

Can data harvesting produce structured data in a particular format?

Accepted Answer

DCL data harvesting can output the data in whatever format is desired. The most common formats are XML, DITA, HTML, and S1000D.
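
To make the idea of structured output concrete, here is a minimal sketch that serializes harvested records to generic XML with Python's standard library. The `records_to_xml` function and the element names are assumptions for illustration. Schema-specific output such as DITA or S1000D would require mapping to those vocabularies, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of dicts to an XML string.

    Assumes every dict key is already a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            child = ET.SubElement(item, key)
            child.text = str(value)  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")


xml_text = records_to_xml([{"title": "A & B", "year": 2020}])
```

Building the tree with `ElementTree` rather than concatenating strings means characters such as `&` and `<` are escaped automatically.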

Question 5

What's the difference between data mining and data harvesting?

Accepted Answer

Data mining typically refers to analyzing large datasets, often with AI or machine learning, to uncover hidden trends or statistics that traditional analysis methods may miss. Data harvesting is closely related, but focuses on collecting data from online sources so that it can be analyzed or reused. Data harvesting and data mining often go hand in hand, with harvesting gathering the data to be mined.

Question 6

How can data harvesting be used for data analytics?

Accepted Answer

Analytics are only as good as the data analyzed. DCL’s data harvesting services streamline the collection, validation, and structuring process so analytics are faster and more reliable.

Question 7

Is web scraping the same as data harvesting?

Accepted Answer

Web scraping is a common term for crawling websites and downloading their contents. At DCL, we differentiate our data harvesting from simple web scraping by also incorporating machine learning and natural language processing to ensure the final output is well structured and ready for reuse. In casual conversation, terms such as web scraping, web mining, data scraping, and data extraction are often used interchangeably.

Question 8

Can DCL harvest data in languages other than English?

Accepted Answer

Yes, we can harvest data in European and Asian languages.

Question 9

How is harvested data cleaned and checked for errors?

Accepted Answer

At every stage of the data harvesting process, DCL uses a combination of human and machine validation to verify the quality of the collected data. Our system flags errors so they can be quickly corrected. High-quality, standardized data is DCL’s specialty.
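
The machine side of such validation can be sketched as a pass that separates clean records from flagged ones, so a person can review the flagged set. This is an illustrative example only, not DCL's system: the required fields, the URL pattern, and the `validate_records` function are all assumptions.

```python
import re
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("title", "url")            # hypothetical schema
URL_PATTERN = re.compile(r"^https?://\S+$")   # deliberately simple check


@dataclass
class ValidationReport:
    valid: list = field(default_factory=list)
    flagged: list = field(default_factory=list)  # (index, record, problems)


def validate_records(records):
    """Split records into valid ones and ones flagged for human review."""
    report = ValidationReport()
    seen_urls = set()
    for index, record in enumerate(records):
        problems = []
        for name in REQUIRED_FIELDS:
            value = record.get(name)
            if value is None or not str(value).strip():
                problems.append(f"missing {name}")
        url = str(record.get("url") or "").strip()
        if url:
            if not URL_PATTERN.match(url):
                problems.append("malformed url")
            elif url in seen_urls:
                problems.append("duplicate url")
            seen_urls.add(url)
        if problems:
            report.flagged.append((index, record, problems))
        else:
            report.valid.append(record)
    return report
```

Recording the reason for each flag, rather than silently dropping the record, is what lets a human reviewer correct the problem quickly.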

Question 10

What is data harvesting?

Accepted Answer

Data harvesting is the automated collection of information from online sources like websites or databases, usually by "scraping" the content of pages for text, images, etc. This data can then be used for other purposes like statistics, analytics, machine learning, or integration with other data sources.
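
As a small example of what "scraping" a page involves, the sketch below parses an HTML string and collects the page title, image URLs, and link URLs using Python's standard library. The `PageHarvester` class, the `harvest_page` function, and the sample URL are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageHarvester(HTMLParser):
    """Collect the title, image URLs, and link URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.images = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attributes.get("src"):
            # Resolve relative paths against the page's own URL.
            self.images.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "a" and attributes.get("href"):
            self.links.append(urljoin(self.base_url, attributes["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def harvest_page(html, base_url):
    """Return the harvested pieces of a page as a plain dict."""
    parser = PageHarvester(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "images": parser.images,
        "links": parser.links,
    }
```

The resulting dict is the kind of intermediate record that can then feed statistics, analytics, or integration with other data sources.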

Data Harvesting

Website Harvesting and AI Transformations That Deliver Structured Data to Your Systems

Tools: Data Harvester, GATE, Lucene tokenizer, Java, JAPE, Perl, TensorFlow

A deeper solution beyond simple website scraping

DCL Data Harvester Comprises

Mark Gross, President, DCL

Industries Served

Associations

Publishers

Libraries, Universities, Museums

Government

Defense

Health, Pharma

Legal

Finance, Insurance

Manufacturing, Technology