Big Data Variety: Tables-Text & Triples & Oceanography


The largest fishing ground on our planet, Georges Bank in the Northwest Atlantic, is the direct result of an oceanographic triple: the confluence of three distinct oceanographic conditions that come together to create one of our world’s most productive marine environments. Larger than the landmass of all New England states, Georges Bank sits just over 100 miles offshore with an average depth of 100 meters. What makes it so productive is that depth, combined with the meeting of two major ocean currents: the Gulf Stream, carrying warm, highly saline water, meets the cold, nutrient-rich Labrador Current. These three factors made it one of the richest fishing grounds in the world for hundreds of years. If the depth were 200 meters, there wouldn’t be the same biosphere, and without the nutrients and warm water there wouldn’t be a Georges Bank.

Our oceans are really seas of rivers and currents, much like today’s enterprise data, where data flows like rivers in a never-ending stream of tables, text and many other types of objects. In the age of big data, the three “Vs” (Volume, Velocity and Variety) are king, which makes the selection of the right database for complex business processes extremely important. Organizations are now literally swimming in data, struggling to manage not only multiple databases but also new, larger volumes, velocities and varieties of data.

According to recent research (September 2013) by Asterias Research, a day in the life of a database administrator regularly involves actively managing twenty databases (although many enterprises have hundreds of databases) and moving 200TB of data. Although this doesn’t seem like a lot of data, the overall volume of data has increased dramatically, and some enterprises are now managing and moving more than 5 petabytes of data daily.

Gartner’s latest primary market research on big data found that organizations are struggling not with the volume and velocity of big data, but with its variety. One of the more interesting recent blog posts highlighting this research (and triples) is by Irene Polikoff, and can be found on Information Week’s site at (!).

Organizations are just now beginning to leverage new schema-less NoSQL database technologies to manage data variety. Databases such as MarkLogic 7 include triple stores, or what some might call a graph database, to capture the interrelationships between data types. Triple stores differ from tabular RDBMSs in that each fact is stored as a single sequence of subject, predicate and object, and is linked to other facts through semantic relationships. This new and powerful approach addresses the “variety” issue of big data that early big data implementers are encountering today, and it may provide lightning search speed for users and nearly instant gratification in the information discovery process.
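To make the subject, predicate, object layout concrete, here is a minimal sketch of an in-memory triple store in Python. The `TripleStore` class, its methods and the sample facts are all assumptions made up for illustration. This is not MarkLogic's API, and a real triple store would add persistent indexes and a query language on top.

```python
from collections import defaultdict
from typing import Dict, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Hypothetical in-memory triple store, for illustration only."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()
        # Index by subject so lookups for one entity avoid a full scan.
        self._by_subject: Dict[str, Set[Triple]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Store one fact as a (subject, predicate, object) triple."""
        triple = (subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        else:
            candidates = self._triples
        for s, p, o in candidates:
            if predicate is not None and p != predicate:
                continue
            if obj is not None and o != obj:
                continue
            yield (s, p, o)


# Sample facts taken from the Georges Bank description above.
store = TripleStore()
store.add("Gulf Stream", "carries", "warm, highly saline water")
store.add("Labrador Current", "carries", "cold, nutrient-rich water")
store.add("Gulf Stream", "meets", "Labrador Current")
store.add("Georges Bank", "locatedIn", "Northwest Atlantic")

# Which currents carry what? Sorting makes the output order stable.
for triple in sorted(store.match(predicate="carries")):
    print(triple)
```

Because every fact has the same three-part shape, new kinds of relationships can be added without changing a schema, which is the property that makes this layout attractive for varied data.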

MarkLogic Semantics, for example, provides a specialized triple index that enables industry-standard SPARQL queries to be combined with queries against documents and values, ensuring that all relevant information is delivered in applications and analytic reports. A great MarkLogic blog on its functionality is Adam’s blog.
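The idea of combining a triple query with a document query can be sketched in plain Python. The sample documents, the `about` predicate and the function names below are hypothetical, and nothing here calls MarkLogic or a SPARQL engine. The triple lookup simply stands in for a SPARQL pattern along the lines of `?doc ex:about ex:GeorgesBank`, and the text lookup stands in for a document search.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample data: documents keyed by ID, plus triples describing them.
DOCUMENTS: Dict[str, str] = {
    "doc1": "Survey of cod stocks on Georges Bank after the spring bloom.",
    "doc2": "Gulf Stream temperature readings for the winter season.",
    "doc3": "Georges Bank scallop landings and quota report.",
}

TRIPLES: Set[Triple] = {
    ("doc1", "about", "Georges Bank"),
    ("doc2", "about", "Gulf Stream"),
    ("doc3", "about", "Georges Bank"),
}


def docs_about(topic: str, triples: Set[Triple]) -> Set[str]:
    """Semantic side: document IDs linked to the topic by an 'about' triple."""
    return {s for s, p, o in triples if p == "about" and o == topic}


def docs_containing(word: str, documents: Dict[str, str]) -> Set[str]:
    """Document side: IDs whose text contains the word (simple substring match)."""
    needle = word.lower()
    return {doc_id for doc_id, text in documents.items() if needle in text.lower()}


def combined_query(topic: str, word: str) -> List[str]:
    """Intersect both result sets so a hit must satisfy both conditions."""
    return sorted(docs_about(topic, TRIPLES) & docs_containing(word, DOCUMENTS))


print(combined_query("Georges Bank", "cod"))  # prints ['doc1']
```

The intersection is the point of the sketch: the triples narrow the results by meaning, and the text search narrows them by content.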

Another great example of NoSQL databases in action is their use by social media dating sites such as the UK’s Badoo, the world’s largest and fastest-growing social media site for meeting new people. Marketers at Badoo are conducting high-speed analytics on social media interactions staged on a NoSQL database platform to enhance real-time data about events occurring in their user base.


According to Gartner research conducted this year, only 6% of organizations worldwide are implementing big data initiatives directly related to databases and their infrastructure. Another 20% are in the planning and/or piloting stage of big data deployments. This means that most organizations are in the “early adopter” stage of big data project implementation and have not yet encountered all the issues associated with the three Vs of big data. What keeps CEOs up at night is the disruptive information that they don’t know about and haven’t discovered yet.

Technology is still one of the most important competitive weapons available to business. In our last blog on sematology we elaborated on the use of biological terms in computing jargon. Interestingly, the newest frontier of computing involves the creation of new processors that mimic biological synapses in order to simulate neural networks. According to an article in the Sunday, December 29th NYT, Google researchers created a neural network that scanned 10 million images and, through machine learning, trained itself to recognize cats. There is no doubt that neuromorphic processors will play a seminal role in future information discovery, which today is so dependent on complicated algorithms and statistics-oriented programming.

Until then, searching for and discovering all relevant information in most organizations is like a luckless dream that never ends but keeps promising a great ending.