Open Source Web Crawler for Java
WebCollector is an open source web crawler framework based on Java. It provides simple interfaces for crawling the Web, and you can set up a multi-threaded web crawler in less than 5 minutes.
A scalable, mature and versatile web crawler based on Apache Storm
Open-source Enterprise-Grade Search Engine Software
SitemapGen4j is a library to generate XML sitemaps in Java.
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
We introduce TACIT: An Open-Source Text Analysis, Crawling and Interpretation Tool. TACIT's plugin architecture has three main components: (1) crawling plugins, (2) corpus management, and (3) analysis plugins. TACIT's open-source plugin platform allows the architecture to adapt easily to rapid developments in text analysis.