IoT and Video: An Interesting Problem

While on assignment, I was working with my team on a use case that combined IoT and surveillance. At a high level, we needed to implement a platform that could ingest, report, and store vast amounts of data collected in real time.  The use case centered on real-time event detection and sensor alerting.  The data was geographically distributed, and connectivity was not always reliable.  Retention policy thresholds had to be taken into consideration as well.  Not only did storage strategies need to be factored in, but video-on-demand (VOD) was mandatory.

Streaming all of the data back to the central location was not acceptable, as this would have saturated the backhaul, blocked other data delivery, and, in the end, ruined the customer experience.  If you have never seen churn before, this would be your one-way ticket to bankruptcy.

We decided to run analytics and detection algorithms on the edge, while storing results local to the region (the remote location).  The metadata and required artifacts would be transported back to headquarters for further analysis.  This approach met the requirements, but a question lingered: would it meet future needs?  The pattern was simple: leverage the cloud as needed, keep data local to the region, process on the edge, and send back only the data necessary for later viewing and querying of the remote locations.
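
To make that pattern concrete, here is a minimal, hypothetical sketch of an edge node: detection runs locally, the heavy video payload stays in regional storage, and only compact event metadata is forwarded to headquarters. The paths, threshold, and function names are illustrative assumptions, not the actual platform.

import json
import time
import uuid
from pathlib import Path

# Hypothetical regional store and detection threshold (illustrative values only).
REGIONAL_STORE = Path("regional_video_store")
EVENT_THRESHOLD = 0.8

def detect_events(segment_bytes):
    # Placeholder for the on-edge detection algorithm; a real node would run a model here.
    return [{"label": "motion", "score": 0.93}]

def send_to_headquarters(payload):
    # Stand-in for the real transport (message queue, HTTPS, etc.).
    print("forwarding metadata:", payload)

def process_segment(segment_bytes, site_id):
    REGIONAL_STORE.mkdir(parents=True, exist_ok=True)
    segment_id = str(uuid.uuid4())
    # 1. Keep the heavy payload local to the region; VOD is served from here.
    (REGIONAL_STORE / f"{segment_id}.mp4").write_bytes(segment_bytes)
    # 2. Run analytics at the edge and keep only significant events.
    events = [e for e in detect_events(segment_bytes) if e["score"] >= EVENT_THRESHOLD]
    # 3. Ship only lightweight metadata over the constrained backhaul.
    metadata = {"site": site_id, "segment": segment_id, "ts": int(time.time()), "events": events}
    send_to_headquarters(json.dumps(metadata))

process_segment(b"\x00" * 1024, site_id="remote-07")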

Then I read how IBM and Cisco were advancing to the edge for certain analytic use cases:

“This powerful IoT technology from Cisco and IBM, combined with Bell’s world leading network technology, enables customers to tap into innovative real-time analytics options to maximize performance across their operations, no matter where they are.”

“Deploying the unmatched analytics capabilities of IBM Watson Internet of Things and Cisco networking intelligence with streaming edge analytics will help to further accelerate Bell’s leadership in Canadian IoT.” (ref.)

This solidified the solution and revealed that others are facing, and will continue to face, similar issues, particularly where connectivity is limited.

IBM pointed out that today there are billions of connected devices and sensors gathering vast amounts of real-time data, and the cloud has made it possible to derive valuable insight from that data.  But without high-bandwidth connectivity, much of this insight goes missing or cannot be acted upon in real time. (ref.)

As use cases for IoT continue to drive the industry forward, so will the need for real-time interaction with this data.  Just as Kafka became the standard for data pipelines, hybrid cloud solutions will become the model for IoT platforms.

After eliminating the impossible task of bringing high-bandwidth connections to these locations, what remained was the probable technique of bringing analytics to the remote locations themselves. To put it simply, the team looked to perform analytic computations at the point, or edge, of data collection. (ref.)

While the cloud has its place, just like Big Data, it is not the one solution for all.  The edge will become a key factor, and hybrid solutions like IBM Watson IoT will become embedded within the world of IoT.

“With the vast amount of data being created at the edge of the network, using existing Cisco infrastructure to perform streaming analytics is the perfect way to cost-effectively obtain real-time insights… Our powerful technology provides customers with the flexibility to combine this edge processing with the cognitive computing power of the IBM Watson IoT platform,” Anand said. (ref.)


  1. Watson to make Remote IoT Edge Analytics Elementary
  2. IBM and Cisco to bring IoT to remote locations like oil rigs
  3. IBM and Cisco Combine Watson IoT with Edge Analytics

NoSQL: The Magic Bullet

Many have the belief that NoSQL will be their magic bullet. Their salvation from a failing system. Their redemption from slow response times. Yet nine times out of ten, the issue is not with their current solution. It is due to poor design or, in most cases, inaccurate data models.

With NoSQL, the way data is accessed needs to be addressed upfront. You can’t skip this step. The data access pattern is the driving force in the NoSQL world. In the traditional world, you can cheat performance. For example, with a traditional database, a fact table is created with the required dimensions. Then indexes are added to each dimension. Voilà, decent performance with little effort. Not much upfront design needed. The SLA is achieved, and everyone is happy. As the access patterns are discovered, indexes are refactored for optimal performance. Problem solved, customers delighted.
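
For contrast, here is roughly what that RDBMS "cheat" looks like, as a throwaway sqlite3 sketch with made-up table and column names: create the fact table first, then bolt an index onto each dimension as the access patterns surface.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A simple fact table with a few dimension columns (illustrative schema).
cur.execute("""
CREATE TABLE sales_fact (
    sale_id    INTEGER PRIMARY KEY,
    store_id   INTEGER,
    product_id INTEGER,
    sale_date  TEXT,
    amount     REAL
)
""")

# The after-the-fact fix: one index per dimension, added as queries are discovered.
cur.execute("CREATE INDEX idx_sales_store ON sales_fact (store_id)")
cur.execute("CREATE INDEX idx_sales_product ON sales_fact (product_id)")
cur.execute("CREATE INDEX idx_sales_date ON sales_fact (sale_date)")
conn.commit()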

In the NoSQL world, it is not that easy.

Let’s take HBase, for example. The rowkey design is crucial for performance and for the distribution of data across the regions.  If this step is rushed, or done in a haphazard way, data will saturate a single region (a hot spot), and the cluster will come tumbling down.  Yes, I have done this; I speak from experience, folks, and it is not pretty.  Then we have to consider the model itself. Do we want one column family or more? If more than two are needed, is this a sign of a bad design unfolding? How will this design impact performance? Are timestamps required for data retrieval? Will other data elements be needed besides the rowkey? And the list goes on.
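
For illustration only, here is one common way to shape such a rowkey in plain Python: a small salt prefix to spread writes across pre-split regions, plus a reversed timestamp so the newest events for a device sort first. The bucket count, field layout, and device naming are assumptions for the sketch, not the design we shipped.

import hashlib

SALT_BUCKETS = 16      # assumed to match the number of pre-split regions
MAX_TS_MS = 10**13     # upper bound on epoch milliseconds, used to reverse timestamps

def make_rowkey(device_id: str, event_ts_ms: int) -> bytes:
    # Salt: a stable hash of the device id modulo the bucket count, so a single
    # busy device cannot pin every write to one region (the hot-spot problem).
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    # Reverse the timestamp so a scan returns the most recent events first.
    reversed_ts = MAX_TS_MS - event_ts_ms
    return f"{salt:02d}|{device_id}|{reversed_ts:013d}".encode()

# Two events for the same device land in the same salt bucket; the newer one
# sorts lexicographically ahead of the older one.
print(make_rowkey("sensor-0042", 1600000100000))
print(make_rowkey("sensor-0042", 1600000000000))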

NoSQL does not come without risk. And the design of how data is stored and accessed is critical. If we are going to move into the NoSQL world, then we need to commit one hundred percent.

Remember, most projects do not fail because of the chosen technology. They fail because the solution was rushed into production without proper design and testing.  The use cases and the data access patterns that drove them were not understood. If we rush a NoSQL solution into production, we cannot cheat a fix later as in the RDBMS world.

With that said, the point is this: it is critical in the NoSQL world to understand the requirements, the data access patterns, and how the UI will interact with that data.  If this is not understood, then we run the risk of complete and total failure.  DevOps will break down our doors in the middle of the night because senior leadership is screaming at the top of their lungs over lost revenue.  Customers will no longer use the product because they cannot access the data that drives their revenue. And the million-dollar visualization layers the UI team painfully built over the last twenty-four hours will be rendered useless. All because the solution was rushed into production due to unrealistic demands, just like the RDBMS system that is currently failing. But this time, failure comes at a much greater cost and impact to the business.

Just another random rant on a rainy Seattle day.

IoT is the Space to Play

With the growth in IoT, this field is going to explode over the next few years. Today, analysts are predicting 25 to 200 billion connected devices by 2020, over $300 billion in projected revenue, an expected growth rate of 20% within corporate budgets by 2018, and a 17.5% growth rate in connected devices over the next seven years. This makes IoT the space to play. And with this growth comes tremendous opportunity. Many gaps exist today, which means many solutions are required. Whether these solutions focus on security, an industry-standard protocol, a streamlined framework, a device management platform, or ways to integrate machine learning and predictive analytics, this is the next bubble to embrace.

At first glance, many consider IoT to be something wearable, a sensor collecting, for example, weather data, or something that controls an item in a home. When one steps back and looks at the forest, they will notice the arena is shaping up around a few different verticals. Two of these verticals with massive potential are IIoT (Industrial IoT) and MIoT (Medical IoT). Each of these areas will have its specialty, but each will face the same issues.

These are exciting times indeed, and we will face small data challenges that will outweigh our combined big data initiatives. Now is the time to be positioned as a leader in the IoT industry: start developing strategies and solutions for foreseeable issues. Even though the big players are making their move now, they are all looking at it from the same angle. Instead, look at it from outside the box. Provide solutions that companies will embrace. Remember, there is much room to play in this space and massive potential for those who can foresee and cover the gaps.

My advice: start researching and fill those gaps. If done right, you can be the next Google or Microsoft of the IoT playing field.

Until next time, just a few random thoughts from a gray Tuesday.

Additional References

Free Textbooks: Computer Science


Free textbooks (aka open textbooks) written by knowledgeable scholars are a relatively new phenomenon. Below, find a meta list of Free Computer Science Textbooks, part of our larger collection, 200 Free Textbooks: A Meta Collection. Also see our online collection, 1000 Free Online Courses from Top Universities.

Why Redis beats Memcached for Caching


Memcached or Redis? It’s a question that nearly always arises in any discussion about squeezing more performance out of a modern, database-driven Web application. When performance needs to be improved, caching is often the first step employed, and Memcached and Redis are typically the first places to turn.

These renowned cache engines share a number of similarities, but they also have important differences. Redis, the newer and more versatile of the two, is almost always the superior choice. But there are some key exceptions to this rule.

The similarities

Let’s start with the similarities. Both Memcached and Redis are in-memory, key-value data stores. They both belong to the NoSQL family of data management solutions, and both are based on the same key-value data model. They both keep all data in RAM, which of course makes them supremely useful as a caching layer. In terms of performance, the two data stores are also remarkably similar, exhibiting almost identical characteristics (and metrics) with respect to throughput and latency.

Besides being in-memory, key-value data stores, both Memcached and Redis are mature and hugely popular open source projects. Memcached was originally developed by Brad Fitzpatrick in 2003 for the LiveJournal website. Since then, Memcached has been rewritten in C (the original implementation was in Perl) and put in the public domain, where it has become a cornerstone of modern Web applications. Current development of Memcached is focused on stability and optimizations rather than adding new features.

Redis was created by Salvatore Sanfilippo in 2009, and Sanfilippo remains the lead developer and the sole maintainer of the project today. Redis is sometimes described as “Memcached on steroids,” which is hardly surprising considering that parts of Redis were built in response to lessons learned from using Memcached. Redis has more features than Memcached, which makes it more powerful and flexible but also more complex.

Used by many companies and in countless mission-critical production environments, both Memcached and Redis are supported by client libraries implemented in every conceivable programming language, and both are included in a multitude of libraries and packages that developers use. In fact, it’s a rare Web stack that does not include built-in support for either Memcached or Redis.

Why are Memcached and Redis so popular? Not only are they extremely effective, they’re also relatively simple. Getting started with either Memcached or Redis is considered easy work for a developer. It takes only a few minutes to set them up and get them working with an application. Thus a small investment of time and effort can have an immediate, dramatic impact on performance — usually by orders of magnitude. A simple solution with a huge benefit: That’s as close to magic as you can get.

read more via source
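
To ground the "few minutes to set up" claim, here is a minimal cache-aside sketch using the redis-py client. The key scheme, TTL, and load_user_from_db stand-in are my own placeholders, not something from the article.

import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 300  # arbitrary five-minute expiry for this sketch

def load_user_from_db(user_id):
    # Stand-in for the expensive database query being cached.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: skip the database entirely
    user = load_user_from_db(user_id)      # cache miss: query once, then cache
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(user))
    return user

print(get_user(42))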

Former Facebook Engineers Streamlining Big Data

Former Facebook Engineers Emerge from Stealth with Simplified Big Data Approach


The latest startup emerging from stealth mode to simplify Big Data analytics is Interana Inc., founded by engineers from Facebook Inc. with the goal of providing the same cutting-edge data tools used by the Web giants to companies of all sizes.

After six months of stealth, the Menlo Park, Calif.-based company today unveiled its namesake Big Data analytics software, which focuses on working with “sequences of events” gleaned from the usual sources: Web site clickstreams, phone call detail records, transactions and sensors. The company says billions of these event sequences can be analyzed in seconds via a visual interface that facilitates ad hoc queries by users.

The company seeks to lessen the need for specialized developers, data scientists and complex integration efforts, claiming that companies can just connect the software to various kinds of semi-structured data and get started with queries immediately.

“With Interana focusing on event-based analytics, key business metrics like growth, retention, conversion and engagement can now be made available to decision makers in seconds, rather than the hours or days it can take with many other approaches,” the company quoted Enterprise Strategy Group analyst Nik Rouda as saying. “This will allow a whole new class of applications, delivers real business value and promises more new innovations in real-time analytics.”

read more via source

New App for Apache Spark

H2O Announces Sparkling Water – The Killer App for Apache Spark


SAN FRANCISCO–(BUSINESS WIRE)–H2O announced today the introduction of Sparkling Water, the latest innovation to combine two best-of-breed open source technologies Apache Spark and H2O. Sparkling Water is the newest application on the Apache Spark in-memory platform to extend Machine Learning for better predictions and to quickly deploy models into production. H2O is proud to partner with Cloudera and Databricks to bring this capability to a wide audience.

“One of the major strengths of Spark is its ability to provide a unified platform for building end-to-end data pipelines, and as such become a natural platform for next generation applications,” said Ion Stoica, CEO of Databricks. “We’re thrilled to have H2O bring their machine learning know-how to Apache Spark in the form of Sparkling Water, and look forward to more future collaboration.”

For the data scientist moving between different environments, Sparkling Water removes inherent friction from challenges arising from data formats and structure. Particularly, in the data science workflow, data parsing and transformation along with variable creation takes advantage of Apache Spark while feature selection, modeling, and scoring may leverage H2O.

“Sparkling Water enables data scientists to take advantage of high fidelity data stored in an enterprise data hub to build sophisticated machine learning applications. By marrying the power of Apache Spark in CDH with H2O, applications can leverage scalable, and fast machine learning on Hadoop.” – Jairam Ranganathan, senior director, Product Strategy, Cloudera

As the black box of predictive analytics opens up to a larger community, H2O is laser focused on how to quickly scale cutting-edge machine learning algorithms to the demands of the enterprise to build the next generation of smart applications. With this latest innovation, the Apache Spark community can now apply Deep Learning to solve complex classification problems. Additionally, data scientists may rejoice as Sparkling Water is supported in the most cutting-edge languages, including R, Python, Scala, and Java. Finally, with Sparkling Water the promise of predictive analytics is realized, as H2O also has a robust REST API and a NanoFast™ Scoring Engine to power smart business applications.

“All of the Internet is going to be rewired with Intelligent Applications. Sparkling Water is the convergence of elegant APIs, fast machine learning and in-memory predictive analytics. A unified user & developer experience for building smarter applications will transform enterprises and accelerate big data adoption,” said SriSatish Ambati, CEO and Co-Founder of H2O. “We are excited to team up the Communities of Data Science and Application Developers. Sparkling Water is the middleware for big data.”

read entire article via
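
As a rough idea of what the Python side of this looks like, here is a minimal standalone H2O deep learning sketch (plain H2O rather than Sparkling Water on Spark; the file name and column names are placeholders):

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()  # starts or connects to a local H2O cluster

# Placeholder dataset; in Sparkling Water this frame could come from a Spark DataFrame.
frame = h2o.import_file("events.csv")
frame["label"] = frame["label"].asfactor()   # treat the target column as categorical

model = H2ODeepLearningEstimator(hidden=[32, 32], epochs=10)
model.train(x=[c for c in frame.columns if c != "label"],
            y="label",
            training_frame=frame)

print(model.model_performance(frame))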

Move a Gitlab Repo

The upstream directions are not in the order you actually need them, so here they are with a little modification for an easy move.

By default, omnibus-gitlab stores Git repository data under /var/opt/gitlab/git-data: repositories are stored in /var/opt/gitlab/git-data/repositories, and satellites in /var/opt/gitlab/git-data/gitlab-satellites. You can change the location of the git-data parent directory by adding the following line to /etc/gitlab/gitlab.rb.

sudo nano /etc/gitlab/gitlab.rb

# Add this line. You do not need to create this directory yourself;
# the reconfigure step below will create it for you.

git_data_dir "/data/git-repos"

# Running reconfigure creates the new directory.

sudo gitlab-ctl reconfigure

The rest of the directions seem to be in the right order.

# Prevent users from writing to the repositories while you move them.
sudo gitlab-ctl stop
# Only move 'repositories'; 'gitlab-satellites' will be recreated
# automatically. Note there is _no_ slash behind 'repositories', but there _is_ a
# slash behind the destination '/data/git-repos/'.
sudo rsync -av /var/opt/gitlab/git-data/repositories /data/git-repos/
# Fix permissions if necessary
sudo gitlab-ctl reconfigure
# Double-check directory layout in /data/git-repos. Expected output:
# gitlab-satellites repositories
sudo ls /data/git-repos/
# Done! Start GitLab and verify that you can browse through the repositories in
# the web interface.
sudo gitlab-ctl start

To test that everything is set correctly, move the original repositories directory out of the way and verify that GitLab still serves the repositories from the new location:

sudo mv fromPath/ toPath/

~/Downloads/temp$ sudo mv /var/opt/gitlab/git-data/repositories/ ~/Downloads/temp/repositories/

Five Reasons HPC Startups will Explode

Five Reasons Why High Performance Computing Startups will Explode in 2015


1. The size of the social networks grew beyond any rational expectations

Facebook (FB) official stats state that FB has 1.32 billion monthly active users and 1.07 billion mobile monthly active users. Approximately 81.7% are outside the US and Canada. FB manages a combined 2.4 billion users, including mobile, with 7,185 employees.

The world population, as estimated by the United Nations, was 7.243 billion as of 1 July 2014. Therefore 33% of the world population is on FB. This includes every infant and every person alive, regardless of whether they are literate or not.

Google reports 540 million users per month plus 1.5 billion photos uploaded per week. Add Twitter, Quora, Yahoo and a few more, and we reach 3 billion-plus people who write emails, chat, tweet, write answers to questions and ask questions, read books, see movies and TV, and so on.

Now we have the de facto measurable collective unconscious of this world, ready to be analyzed. It contains information about something inside us that we are not aware we have. This rather extravagant idea comes from Carl Jung, from about 70 years ago. We should take him seriously, as his teachings led to the development of the Myers-Briggs and a myriad of other personality and vocational tests that proved amazingly accurate.

Social media’s life support, profit, depends on meaningful information. FB reports revenues of $2.91 billion for Q2 2014, and only $0.23 billion comes from user payments or fees. 77% of all revenues are processed information monetized through advertising and other related services.

The tools of traditional Big Data (the only data there is, is big data) are no longer sufficient. A few years ago we were talking in the 100-million-user range; now the data sets are in exabyte and zettabyte dimensions.

1 EB = 1000^6 bytes = 10^18 bytes = 1 000 000 000 000 000 000 B = 1 000 petabytes = 1 million terabytes = 1 billion gigabytes

1 ZB = 1,000 EB

I compiled this chart from published information. It shows the growth of the world’s storage capacity, assuming optimal compression, over the years. The 2015 figure is extrapolated from Cisco data and crosses the one-zettabyte mark.

read entire article via

Open Source Monitoring Tools

7 killer open source monitoring tools


Network and system monitoring is a broad category. There are solutions that monitor for the proper operation of servers, network gear, and applications, and there are solutions that track the performance of those systems and devices, providing trending and analysis. Some tools will sound alarms and notifications when problems are detected, while others will even trigger actions to run when alarms sound. Here is a collection of open source solutions that aim to provide some or all of these capabilities.

  1. Cacti: a complete network graphing solution designed to harness the power of RRDTool’s data storage and graphing functionality.
  2. ntop: a network traffic probe that shows the network usage, similar to what the popular top Unix command does.
  3. Zabbix: enterprise-level software designed for monitoring the availability and performance of IT infrastructure components.
  4. Observium: an autodiscovering network monitoring platform supporting a wide range of hardware platforms and operating systems including Cisco, Windows, Linux, HP, Juniper, Dell, FreeBSD, Brocade, Netscaler, NetApp and many more.
  5. NeDi
  6. Icinga
  7. Nagios

read via