A scalable web crawler framework for Java.
Open Source Web Crawler for Java
WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.
A scalable, mature and versatile web crawler based on Apache Storm
Open-source Enterprise Grade Search Engine Software
SitemapGen4j is a library to generate XML sitemaps in Java.
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
We introduce TACIT: An Open-Source Text Analysis, Crawling and Interpretation Tool. TACIT's plugin architecture has three main components: (1) crawling plugins, (2) corpus management, and (3) analysis plugins. TACIT's open-source plugin platform allows the architecture to adapt easily to rapid developments in text analysis.