ontology & knowledge structures
ontology, knowledge repository
Curated by Cezar

Analyzing Big Data with Twitter | A special UC Berkeley iSchool course


Twitter Engineering: Introducing FlockDB


To deliver a tweet, we need to be able to look up someone's followers and page through them rapidly. But we also need to handle heavy write traffic, as followers are added or removed, or spammers are caught and put on ice. And for some operations, like delivering a @mention, we need to do set arithmetic like "who's following both of these users?" These features are difficult to implement in a traditional relational database.


We went through several storage layers in the early days, including abusive use of relational tables and key-value storage of denormalized lists. They were either good at handling write operations or good at paging through giant result sets, but never good at both.


A little over a year ago, we could see that we needed to try something new. Our goals were:

- Write the simplest possible thing that could work.
- Use off-the-shelf MySQL as the storage engine, because we understand its behavior — in normal use as well as under extreme load and unusual failure conditions. Give it enough memory to keep everything in cache.
- Allow for horizontal partitioning so we can add more database hardware as the corpus grows.
- Allow write operations to arrive out of order or be processed more than once. (Allow failures to result in redundant work rather than lost work.)


FlockDB was the result.

FlockDB is a database that stores graph data, but it isn't a database optimized for graph-traversal operations. Instead, it's optimized for very large adjacency lists, fast reads and writes, and page-able set arithmetic queries.
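The excerpt describes three properties worth seeing concretely: adjacency lists kept in both directions, set arithmetic over them, and writes that tolerate reordering and replay. Below is a minimal in-memory sketch of those ideas in Python. It is illustrative only; the class and method names are invented for this example, and FlockDB's actual implementation runs on sharded MySQL.

    # A minimal, in-memory sketch of the ideas described above: adjacency
    # lists, set arithmetic, and writes that tolerate reordering and replay.
    # Illustrative only -- not FlockDB's actual implementation or API.

    class EdgeStore:
        def __init__(self):
            self.following = {}   # user -> set of users they follow
            self.followers = {}   # user -> set of users following them
            self.last_op = {}     # (src, dst) -> timestamp of latest applied write

        def _apply(self, src, dst, ts, present):
            # Last-write-wins by timestamp: replayed or out-of-order writes
            # converge to the same state instead of corrupting it.
            if ts < self.last_op.get((src, dst), -1):
                return
            self.last_op[(src, dst)] = ts
            if present:
                self.following.setdefault(src, set()).add(dst)
                self.followers.setdefault(dst, set()).add(src)
            else:
                self.following.get(src, set()).discard(dst)
                self.followers.get(dst, set()).discard(src)

        def follow(self, src, dst, ts):
            self._apply(src, dst, ts, True)

        def unfollow(self, src, dst, ts):
            self._apply(src, dst, ts, False)

        def common_followers(self, a, b):
            # "Who's following both of these users?" -- a set intersection.
            return self.followers.get(a, set()) & self.followers.get(b, set())

    store = EdgeStore()
    store.follow("alice", "bob", ts=2)
    store.follow("carol", "bob", ts=3)
    store.follow("carol", "dave", ts=4)
    store.follow("alice", "dave", ts=5)
    print(store.common_followers("bob", "dave"))  # {'alice', 'carol'}

The timestamp check is what makes "redundant work rather than lost work" safe: applying the same follow twice, or a stale unfollow after a newer follow, leaves the store in the same state.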


Graph databases and the warehouse | Bloor

The last of five articles about graph databases...


It is worth recapitulating what the different types of database are good for. Briefly:

- Traditional (relational) data warehouses are good for high-performance OLAP and complex, ad hoc analytics running in real time or batch against structured or semi-structured data.

- Hadoop is inexpensive, schema-free, batch-based and can handle any type of data. Neither its performance nor its manageability is anything to write home about. It does not support ad hoc queries and is best suited to statistical analysis, aggregation and search rather than complex analytics.

- Cassandra is essentially similar to Hadoop except that it can handle real-time queries and natively supports time series, which relational environments typically do not (Informix is the exception).

- Graph databases are schema-free, can handle any type of data, and support real-time complex analytics against relationship-based information. Like relational databases they scale up rather than out, so they are relatively expensive compared to Hadoop or Cassandra.


What's a graph? | Bloor

The first in a series of five articles about graph databases...


In essence, the way a graph database works (I will talk about this further in subsequent articles) is that it stores entities and relationships, as discussed, but its processing runs along the edges of the graph. This turns conventional approaches to data storage on their head. In a relational database, for example, the heart of the system is its entities (tables), and you use relationships (primary/foreign keys) only to get from one entity to another: what you are doing is processing data. In a graph database you are processing relationships.
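To make the contrast concrete, here is a toy sketch (mine, not Bloor's) of traversal-style processing; the data and function name are invented for illustration.

    # A toy illustration (not from the article): in a graph store the
    # relationships themselves are the things you walk, so a query like
    # "friends of friends" is a traversal, not a chain of table joins.

    graph = {
        "ann": ["bob", "cat"],
        "bob": ["dan"],
        "cat": ["dan", "eve"],
        "dan": [],
        "eve": [],
    }

    def friends_of_friends(node):
        # Two hops along the edges of the graph.
        return {fof for friend in graph[node] for fof in graph[friend]}

    print(friends_of_friends("ann"))  # {'dan', 'eve'}

A relational database would answer the same question with a self-join of a friendships table on primary/foreign keys; the graph version simply walks the edges.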


E-Prime - Wikipedia, the free encyclopedia


E-Prime (short for English-Prime, sometimes denoted E′) is a version of the English language that excludes all forms of the verb to be. E-Prime does not allow conjugations of to be (am, are, is, was, were, be, been, being), archaic forms (e.g. art, wast, wert), or contractions ('s, 'm, 're).

Some scholars advocate using E-Prime as a device to clarify thinking and strengthen writing.[1] For example, the sentence "the film was good" could translate into E-Prime as "I liked the film" or as "the film made me laugh". The E-Prime versions communicate the speaker's experience rather than judgment, making it harder for the writer or reader to confuse opinion with fact.


Replacing forms of "to be" with verbs such as becomes, remains and equals divides perceptions of, and expressions about, time into operational categories that humans know how to act upon:

- To claim that one thing equals another is a claim only about the present, with no reference to the future or the past; it can be disproved by direct testing (see falsifiability).
- To claim that one thing remains another asserts a relationship that exists in the present and was also true in the past, without reference to the future; it can be disproved by reference to history or memory.
- To claim that one thing becomes another asserts a relationship between the present and the future, without reference to the past; it can be shown undesirable or potentially false (though not disproved) with reference to intent.


What are readers looking for? Wikipedia search data now available — Wikimedia blog


Golden Orb

GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built on best-of-breed software from the Apache Hadoop project and modeled after Google's Pregel architecture.

Our goal is to foster solutions to complex data problems, remove limits to innovation and contribute to the emerging ecosystem that spans all aspects of big data analysis.
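To give a flavour of the Pregel model that GoldenOrb follows, here is a minimal single-machine sketch of its vertex-centric, superstep-and-messages style, using connected components as the classic example. This is an assumption-laden illustration, not GoldenOrb's actual API.

    # A sketch of the Pregel-style "think like a vertex" model: every
    # vertex runs the same compute step each superstep and communicates
    # only by messages. Here each vertex converges on the minimum vertex
    # id in its component -- a standard connected-components example.

    edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
    value = {v: v for v in edges}        # each vertex starts with its own id
    inbox = {v: [] for v in edges}
    active = set(edges)

    superstep = 0
    while active:
        outbox = {v: [] for v in edges}
        next_active = set()
        for v in active:
            best = min(inbox[v] + [value[v]])
            if best < value[v] or superstep == 0:
                value[v] = best
                for nbr in edges[v]:     # send my current label to neighbors
                    outbox[nbr].append(best)
                    next_active.add(nbr)
        inbox, active = outbox, next_active
        superstep += 1

    print(value)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}

In a real Pregel-style system the vertices are spread across a cluster and the messages flow over the network between supersteps; the programming model stays exactly this simple.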


Graph databases and NoSQL | Bloor

The second in a series of five articles about graph databases...


It is arguable that graph databases will have a bigger impact on the database landscape than Hadoop or its competitors.


There are two things that tend to typify NoSQL databases in people's minds:

- Hadoop and its allies are optimised to run on low-cost clusters of commodity hardware;

- they use MapReduce to parallelise processing across those clusters.

This works because these NoSQL databases are effectively doing either statistical analysis or search, and only a limited amount of data is shipped across the network.


This isn't the case with graph databases, especially where you are looking for patterns of relationships for analytic purposes.


The point to understand about graph databases, especially when it comes to analytics, is that the more nodes you have in your graph, the richer the environment becomes and the more information you can get out of it. How much more is a matter for debate: Metcalfe's Law (which is actually no more than a hypothesis) suggests that the value of a network grows approximately with the square of the number of nodes (strictly, n × (n − 1)). However, this has been disputed, not least because some connections (relationships) between nodes are more valuable than others. Other researchers have suggested that n log n would be a more appropriate figure; for n = 1,000 that is the difference between roughly a million and roughly ten thousand, so the dispute matters. The answer is probably somewhere in between, but there seems no doubt that the more information you can collect, the more value you can extract. So, at least for analytics, graph-based data is a big data problem.


Now, you have to bear in mind that processing graph data consists of traversing relationships. If you implemented this on a cluster, those relationships would frequently span different servers within the cluster, which would slow down processing and turn the network into a bottleneck. For this reason a scale-out approach to supporting graph databases doesn't work, and current vendors scale up rather than out. And because you are scaling up, you don't need MapReduce, because parallelism can be built in.
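A back-of-envelope simulation makes the point: if vertices are placed on k servers at random, an edge crosses servers with probability 1 − 1/k, so nearly every traversal hop becomes a network round trip. The numbers and function below are invented for illustration.

    # Illustration (mine, not the article's): under random placement on
    # k servers, the fraction of edges that cross servers approaches
    # 1 - 1/k, so traversal is dominated by network hops.

    import random

    def cross_server_fraction(num_vertices=10_000, num_edges=50_000, k=8):
        placement = {v: random.randrange(k) for v in range(num_vertices)}
        crossing = 0
        for _ in range(num_edges):
            a = random.randrange(num_vertices)
            b = random.randrange(num_vertices)
            crossing += placement[a] != placement[b]
        return crossing / num_edges

    print(cross_server_fraction())  # ~0.875, i.e. 1 - 1/8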


There are significant open source developments in this area, led, as always, by Apache. Other projects include Affinity, Nuvala, Stig and Pegasus. The most notable development of all, however, is SPARQL, which is the graph equivalent of SQL (as if you hadn't guessed). While SPARQL is supported by both Neo4j and YarcData (and IBM), in neither case is it the preferred method for developing queries.
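For readers who have not met SPARQL, here is a small taste of its graph-pattern style. The data and namespace are invented for illustration, and the query is run through Python's rdflib purely so the example is self-contained; it says nothing about how Neo4j or YarcData expose SPARQL.

    # A small taste of SPARQL; the data and names are invented for
    # illustration. rdflib is used only to make the example runnable.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.bob, EX.knows, EX.carol))

    # Graph pattern: who is reachable from alice in exactly two "knows" hops?
    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE {
        ex:alice ex:knows ?friend .
        ?friend  ex:knows ?person .
    }
    """
    for row in g.query(query):
        print(row.person)  # http://example.org/carol

Where SQL joins tables on keys, a SPARQL query matches a pattern of edges against the graph, which is why it is a natural fit for graph stores.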


So the bottom line is that it is not very useful to think of graph databases as a type of NoSQL database. Certainly they have things in common, such as not being relational, but then Adabas is not relational and you wouldn't call it a NoSQL database. Graph databases deserve to be treated as a technology in their own right and not be lumped in with something that is fundamentally different.



DBMS Musings: Hadoop's tremendous inefficiency on graph data management (and how to avoid it)


Our paper, led by my student Jiewen Huang, achieves these enormous speedups in the following ways:

Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.
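A toy contrast shows why the partitioning choice matters. The graph below has two obvious communities; hash partitioning cuts roughly half of all edges, while a community-respecting assignment cuts only the two bridges. (This is an illustration only; the paper relies on a real graph partitioner, such as METIS, rather than the hand-written assignment used here.)

    # Toy contrast between graph-oblivious hash partitioning and a
    # locality-aware assignment, on a graph with two dense communities
    # joined by two bridging edges.

    communities = [list(range(0, 50)), list(range(50, 100))]
    edges = [(a, b) for group in communities
             for a in group for b in group if a < b]  # dense inside groups
    edges += [(0, 50), (10, 60)]                      # two bridging edges

    def cut_edges(assign):
        # Edges whose endpoints land on different machines.
        return sum(assign(a) != assign(b) for a, b in edges)

    hash_assign = lambda v: hash(v) % 2        # graph-oblivious
    cluster_assign = lambda v: 0 if v < 50 else 1  # respects communities

    print("hash partitioning cut edges:  ", cut_edges(hash_assign))
    print("graph partitioning cut edges: ", cut_edges(cluster_assign))

The clustered assignment cuts only the two bridging edges, so almost every traversal step stays on one machine; the hash assignment cuts about half of all edges, and each cut edge is a potential network round trip per query hop.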

Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.
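A small sketch of the replication idea (mine, not the paper's algorithm): only vertices with a neighbor on another machine sit on a partition border, so only they need the extra copies.

    # Sketch: identify border vertices -- those with an edge crossing
    # partitions -- and give only those a higher replication factor.

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
    partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

    border = {v for a, b in edges
              if partition[a] != partition[b]
              for v in (a, b)}

    for v in sorted(partition):
        copies = 2 if v in border else 1  # replicate border vertices more
        print(f"vertex {v}: partition {partition[v]}, copies {copies}")

Only vertices 2 and 3 sit on the A/B border, so only they get an extra replica; interior vertices already have all their neighbors local, and replicating them further buys no locality.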

Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.
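As one concrete example of a graph-optimized layout (an illustration, not the storage engine the paper built), compressed sparse row storage keeps every vertex's neighbors contiguous, so a traversal step is a cheap array slice rather than scattered lookups in a generic store.

    # Compressed sparse row (CSR): offsets[v]..offsets[v+1] delimits
    # vertex v's neighbors inside one contiguous targets array.

    adjacency = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}

    offsets, targets = [0], []
    for v in range(len(adjacency)):
        targets.extend(adjacency[v])
        offsets.append(len(targets))

    def neighbors(v):
        # One contiguous slice per traversal step.
        return targets[offsets[v]:offsets[v + 1]]

    print(neighbors(2))  # [0, 3]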

To a first degree of approximation, each of the above three improvements yields an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor-of-1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, please see our paper.)

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs for which people tend to use systems based on Google's Pregel project today).

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to approach a factor of two, it is likely a good idea to think about what is causing the inefficiency and to find ways to avoid it (while keeping the same scalability properties). Certainly once the factor extends well beyond two (such as the enormous factor of 1340 we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however, it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.
