Bits 'n Pieces on Big Data
Innovative information and insight into Big Data (if you like the content, please consider donating to my bitcoin address #1MhtqfDaAsy4TpYwjS2Kq2DMKrecupbx8c)
Curated by onur savas
Scooped by onur savas

Google Research Blog: 11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts


The ClueWeb12 Dataset


"The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 870,043,929 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset."

onur savas's insight:

To obtain the dataset, you have to sign a data license agreement with Carnegie Mellon University. Uncompressed, it takes up 27.3 TB of space.


ClueWeb12 Dataset Manipulation Tool


This is a collection of tools for manipulating the ClueWeb12 collection.

 

"clueweb - Hadoop tools for manipulating ClueWeb collections"

onur savas's insight:

ClueWeb12 is a 27.3 TB dataset of web crawls (http://lemurproject.org/clueweb12/). This tool from Jimmy Lin reduces it to about 860 GB by representing the collection as <doc id, term> vectors.
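
To give a rough feel for what a <doc id, term> vector representation looks like, here is a minimal Python sketch. It is not the code from the clueweb tool, which runs on Hadoop. The function names (`tokenize`, `build_doc_vectors`), the toy documents, and the doc ids are all hypothetical, and the sketch assumes the page markup has already been stripped. The idea is only that each distinct term is stored once in a dictionary and every document becomes a list of small integer ids, which is a large part of why the result can be so much smaller than the raw pages.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_doc_vectors(docs):
    """Turn {doc_id: text} into ({doc_id: [term_id, ...]}, {term: term_id}).

    Term ids are assigned in order of first appearance. This is an
    illustrative choice, not necessarily what the clueweb tool does.
    """
    dictionary = {}
    vectors = {}
    for doc_id, text in docs.items():
        term_ids = []
        for term in tokenize(text):
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            term_ids.append(dictionary[term])
        vectors[doc_id] = term_ids
    return vectors, dictionary


# Hypothetical toy collection; real ClueWeb12 pages are full HTML documents.
docs = {
    "doc-A": "Big data needs big storage",
    "doc-B": "Storage for web data",
}
vectors, dictionary = build_doc_vectors(docs)
for doc_id, term_ids in vectors.items():
    print(doc_id, term_ids)
```

On a real collection, the dictionary and the vectors would be built in a distributed job and written in a compressed binary form rather than held in Python dicts.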
