Keywords: data harvesting, web scraping, XML feeds, automation
Vast amounts of business-critical information appear only on public websites that are constantly updated with new and modified content. While the information on many of these websites is extremely valuable, no standards exist today for how content is organized, presented, and formatted, or for how individual websites are constructed or accessed.
DCL Data Harvester is a streamlined, automated process that crawls websites, scrapes content and metadata, and transforms the content into a standardized XML format.
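The crawl-scrape-transform pipeline described above can be sketched in miniature. The code below is a hypothetical illustration, not DCL's implementation: it extracts a page title and paragraphs from scraped HTML and emits a normalized XML record (the element names `record`, `source`, `title`, `body`, and `para` are invented for this example, not a real DCL schema).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and <p> paragraphs from one scraped page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag whose text we are currently collecting

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def to_xml_record(html_text, source_url):
    """Transform scraped HTML into a standardized XML record (illustrative schema)."""
    parser = ArticleExtractor()
    parser.feed(html_text)
    record = ET.Element("record")
    ET.SubElement(record, "source").text = source_url
    ET.SubElement(record, "title").text = parser.title.strip()
    body = ET.SubElement(record, "body")
    for p in parser.paragraphs:
        ET.SubElement(body, "para").text = p.strip()
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    sample = ("<html><head><title>Q3 Filing</title></head>"
              "<body><p>Revenue rose.</p></body></html>")
    print(to_xml_record(sample, "https://example.com/q3"))
```

A production harvester would add a polite crawler (rate limiting, robots.txt handling), per-site extraction rules, and schema validation of the resulting XML, but the scrape-then-normalize shape stays the same.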
When faced with the challenge of converting eight highly technical journals spanning 95 years, how do you divide responsibility between the content owner and the conversion vendor? This paper demonstrates how an agile approach to content conversion, with close collaboration between publisher and conversion vendor, allowed The Optical Society of America (OSA) and DCL to navigate between the two extremes of an all-publisher and an all-vendor effort and create a high-quality digital archive that will serve OSA’s strategic aims for developing innovative products and services.
Many governmental and private organizations gather massive collections of content, including legal documents, filings, and contracts. Most such collections consist of images and image-based PDFs; they are neither searchable nor minable for the critical information these organizations need to function. As data collections grow into the terabytes, conventional conversion techniques—however efficient—are not economically feasible. The Holy Grail has always been a fully automated process without human intervention. This paper describes the implementation of such a system at the United States Patent and Trademark Office (USPTO).