Big Data Variety: Tables, Text, Triples & Oceanography


The largest fishing ground on our planet, Georges Bank in the Northwest Atlantic, is the direct result of an oceanographic triple: the confluence of three distinct oceanographic conditions coming together to create one of our world’s most productive marine environments. Larger than the landmass of all New England states, Georges Bank sits just over 100 miles offshore with an average depth of 100 meters. What makes it so productive is its shallow depth and the meeting of two major ocean currents: the Gulf Stream, carrying warm, highly saline water, meets the cold, nutrient-rich Labrador Current. These three factors made it one of the richest fishing grounds in the world for hundreds of years. If the depth were 200 meters there wouldn’t be the same biosphere, and without the nutrients and warm water there wouldn’t be a Georges Bank.

Our oceans are really seas of rivers and currents, much like today’s enterprise data, where data flows like rivers in a never-ending stream of tables, text and many other types of objects. In the age of big data, the three “Vs” (Volume, Velocity and Variety) are king, which makes the selection of the right database for complex business processes extremely important. Organizations are now swimming in data, struggling to manage not only multiple databases but also new volumes, velocities and varieties of data.

According to recent research (September 2013) by Asterias Research, a day in the life of a database administrator regularly involves actively managing twenty databases (although many enterprises have hundreds) and moving 200TB of data. Although this doesn’t seem like a lot of data, the overall volume of data has increased dramatically, and some enterprises are now managing and moving more than 5 petabytes of data daily.

Gartner’s latest primary market research on big data found that organizations are struggling not with the volume and velocity of big data, but with its variety. One of the more interesting recent blog posts highlighting this research (and triples) is by Irene Polikoff, and can be found on Information Week’s site (http://www.informationweek.com/big-data/varietys-the-spice-of-life-andndash-and-bane-of-big-data/d/d-id/1112960#!).

Organizations are just now beginning to leverage new schemaless NoSQL database technologies to manage data variety. Databases such as MarkLogic 7 enable triple stores, or what some might call graph databases, to capture the interrelationships between data types. Triple stores are unlike tabular RDBMSs in that each fact is stored as a single subject, predicate and object sequence and is linked to other facts through semantic relationships. This new and powerful approach addresses the “variety” issue that early big data implementers are encountering today, and may provide lightning search speed for users and nearly instant gratification in the information discovery process.
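To make the subject, predicate and object idea concrete, here is a minimal sketch in plain Python. It is not MarkLogic's API or any real triple store; the `Triple` type, the `match` helper and the sample facts (borrowed from the Georges Bank story above) are all made up for illustration.

```python
from collections import namedtuple

# Hypothetical in-memory "triple store": each fact is one
# (subject, predicate, object) record rather than a table row.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

triples = [
    Triple("GulfStream", "carries", "WarmSalineWater"),
    Triple("LabradorCurrent", "carries", "ColdNutrientRichWater"),
    Triple("GulfStream", "meetsAt", "GeorgesBank"),
    Triple("LabradorCurrent", "meetsAt", "GeorgesBank"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


# Which currents meet at Georges Bank?
currents_at_bank = [
    t.subject for t in match(triples, predicate="meetsAt", obj="GeorgesBank")
]
print(currents_at_bank)
```

Because every fact has the same three-part shape, new kinds of relationships can be added without changing a schema, which is the property that helps with data variety.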

MarkLogic Semantics, for example, provides a specialized triple index that enables industry-standard SPARQL queries to be combined with queries against documents and values, ensuring that all relevant information is delivered in applications and analytic reports. A great MarkLogic blog on its functionality is Adam’s blog at http://adamfowlerml.wordpress.com/.

Another great example of NoSQL databases in action is their use in social media dating sites such as the UK’s Badoo, the world’s largest and fastest-growing social media site for meeting new people. Marketers at Badoo are conducting high-speed analytics on social media interactions staged on a NoSQL database platform to enhance real-time data about events occurring in their user base.


Net/Net:

According to Gartner research conducted this year, only 6% of organizations worldwide are implementing big data initiatives directly related to databases and their infrastructure. Another 20% are in the planning or piloting stage of big data deployments. In the final analysis, this means that most organizations are in the early “adopter stage” of big data project implementation and have not encountered all the issues associated with the three Vs of big data. What keeps CEOs up at night is the disruptive information that they don’t know about and haven’t discovered yet.

Technology is still one of the most important competitive weapons available to business. In our last blog on sematology we elaborated on the use of biological terms in computing jargon. Interestingly, the newest frontier of computing involves the creation of new processors that mimic biological synapses in order to simulate neural networks. According to an article in the Sunday, December 29 NYT, Google researchers created a neural network that scanned 10 million images and, through machine learning, trained itself to recognize cats. There is no doubt that neuromorphic processors will play a seminal role in future information discovery, which today is so dependent on complicated algorithms and statistics-oriented programming.

Until then, searching for and discovering all relevant information in most organizations is like a luckless dream that never ends but keeps promising a great ending.


Sematology and Modern Information Management


Welcome to the first Epinomy blog! Over the coming months we will dive into the world of modern information management with a fun biological twist, often examining metaphors and similarities between these two worlds. We believe that big data and the management of information are all about the three T’s, “tables, text and triples,” and during the course of the blog we will attempt to simplify this highly complex enterprise information management process.

Taxonomy, semantics and ontology, normally the jargon of biologists, have now become paramount in the world of information management as organizations strive to discover all relevant information required for business-critical decision-making. During the 1980s and 1990s, biological words like “niche” and “symbiosis” entered the world of IT jargon. They became important and were followed by “ecosystem” at the beginning of this millennium. Now it’s not just about your products and services, what niche you compete in or how symbiotic your products are with other platforms. What is important is the robustness of your ecosystem and the additional value that it brings to the customers at its center.

Taxonomy, semantics and ontology are not new in the world of document, content and records management but are new to mainstream computer jargon, somewhat like NoSQL databases. According to the Oxford English Dictionary, sematology is of Greek origin and is defined as “the doctrine of the use of ‘signs’ (esp. words) in relation to thought and knowledge.” Semantic, originating around 1665, is again of Greek origin; it simply means “to show or signify” or, in a broader sense, “relating to the significance of meaning.” Interestingly, there are many similarities between the worlds of technology and the biological sciences, primarily because both disciplines depend on interrelationships.

Classification of organisms on earth is well established and follows the Linnaean system of nomenclature; however, interrelationships between all organisms are still not well known. Another interesting similarity is that 76% of all animals on earth are little-known invertebrates, and nearly 80% of all data in organizations is in an unstructured format, making it even more difficult to discover and understand its meaning and interrelationships. In the sphere of enterprise information management, taxonomy tools follow a complex doctrine of signs, or what IT professionals call signals.

Typical enterprise content repositories generate signals in the form of folder structures, document titles and, sometimes, useful metadata associated with documents. This contrasts with the WWW, which is a tapestry of metadata linking information together. Most enterprise documents live in isolated silos with little interaction outside of their folder cells. Good metadata makes documents easier to find and can be used to refine search results and to navigate and drill down to the right answer in a few mouse clicks. See the Epinomy white paper on this site, Leveraging Semantics to Find Enterprise Big Data, for a deeper dive into signals, metadata creation and taxonomy management.
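As a rough illustration of how metadata supports refining and drilling down, the sketch below counts facet values and filters a small document list. The document records, field names and the `facet_counts` and `refine` helpers are invented for this example and do not describe any particular product.

```python
from collections import Counter

# Hypothetical documents with folder, title and metadata "signals".
documents = [
    {"title": "Q3 Supplier Contract", "folder": "legal", "type": "contract", "year": 2013},
    {"title": "Onboarding Manual", "folder": "hr", "type": "manual", "year": 2012},
    {"title": "Q4 Supplier Contract", "folder": "legal", "type": "contract", "year": 2013},
    {"title": "Product Launch Deck", "folder": "marketing", "type": "presentation", "year": 2013},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of a metadata field."""
    return Counter(d[field] for d in docs)


def refine(docs, **filters):
    """Keep only the documents whose metadata matches every filter."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]


# Show the available facets, then drill down in two "clicks".
by_type = facet_counts(documents, "type")
contracts_2013 = refine(documents, type="contract", year=2013)
print(by_type)
print([d["title"] for d in contracts_2013])
```

Without the `type` and `year` fields, the only signals left would be the folder name and the title, which is the situation most siloed repositories are in.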


Net/Net:

We don’t think big data is going away anytime soon, and it’s not really new. What we do see is a new “data-driven culture of real-time decision making” that enables organizations to gain and maintain competitive advantage by discovering interrelationships between data types and leveraging that knowledge. This “data-driven paradigm” and the pursuit of “all your data all the time” is not yet a reality in many organizations because of the disparate and siloed nature of enterprise data and what we call metadata madness.

Signals Intelligence

Signals are all around us.

Google uses signals to control search results ranking. Epinomy uses signals as the input data sources to attach to the "One Thing" your enterprise cares most about.

There are many kinds of signals. 

Structured Signals

  • Telemetry
  • Parameters
  • Events
  • Time Series
  • Form Data

Semistructured Signals

  • Tweets
  • Syndication Feeds
  • Posts
  • Web Pages
  • E-Mail

Unstructured Signals

  • Office Documents
  • Memos
  • Instructions
  • Manuals
  • Contracts
  • Agreements
  • Presentations
  • Marketing Collateral
  • Scripts
  • Transcriptions
  • Regulatory Documents
  • Audio/Video/Photo

 

 

Dominate Your Big Text

I will be speaking at the Big Data Bootcamp on Tuesday, May 22. The topic of my presentation will be "Dominate Your Big Text".

As Big Data marches ahead, more and more of that information is unstructured, from tweets to PDFs, and the percentage of unstructured information stored in NoSQL engines is rising fast. This session explores the options for synthesizing structure in big document sets. How do I impose order on my text? What tools can I use to find my text? How do I leverage corporate knowledge and structure to make my text easier to find? The ubiquity of full-text search makes finding this unstructured information possible. But what is the next step? How do you make it even easier to find your unstructured information? This session also focuses on taxonomies, auto-tagging, and faceted navigation of search results.
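One very simple way to picture taxonomy-driven auto-tagging is keyword matching against a small term list, sketched below. The taxonomy, its categories and the `auto_tag` function are assumptions made up for illustration; real auto-taggers typically use richer techniques such as stemming, phrase matching or statistical classifiers.

```python
import re

# Hypothetical taxonomy: category -> terms that signal it.
taxonomy = {
    "Legal": ["contract", "agreement", "regulatory"],
    "Marketing": ["campaign", "collateral", "presentation"],
    "Operations": ["manual", "instructions", "telemetry"],
}


def auto_tag(text, taxonomy):
    """Tag text with every category whose terms appear as whole words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, terms in taxonomy.items() if words & set(terms))


tags = auto_tag(
    "Signed supplier agreement and updated operating instructions.", taxonomy
)
print(tags)
```

The resulting tags become metadata on otherwise unstructured text, which is what makes faceted navigation of search results possible.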

