Raed's Blog

[Day 0] Starting A Research Project

This is the first post in (hopefully) a series that will document my progress on a new research project I’m starting today. Ideally, by the end of this series I will have ended up with either a small paper or a presentation/talk.

While I’m figuring out what to do after graduation, I have had A LOT of free time on my hands lately, and I started getting bored.

Since I’m planning to continue my education with a PhD project, I thought of starting a small research project. Best case scenario I publish something before starting the actual PhD. Worst case, I have gained some experience with research work.

Why write about this?

Usually people blog about their findings and not their daily progress. I have decided to blog this process for a few reasons:

Motivation

I have taken regular procrastination and made it an art and a religion.

If I don’t have something/someone pushing me for results, I will abandon a project as soon as I start getting bored, and I have around 100 unfinished projects on GitHub to prove that.

Feedback

Whenever I write on this blog I tend to get very constructive feedback.

I have always been a big believer in the power of social media. Beyond sharing feline GIFs, I have always found it a great way to get in touch directly with experts and to ask questions of the people who actually know what they’re talking about.

2016…Year of the Monkey Cat

So naturally, I thought of taking advantage of this medium to help with the project whenever I hit a wall, or say/do something stupid.

What is this work about anyway?

During the past two years I have worked on different aspects of IoT, from low-level development on microprocessors to Cloud integration and crypto. If I’m going to work on something, I’d better work on something I know and care about.

While thinking about something interesting to look into, I remembered that last year in school we learned about MQTT.

It’s a cool protocol used in IoT (amongst other things) that lets clients listen for messages without constantly polling the server.
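To make the push-versus-poll idea concrete, here is a toy in-memory sketch of the publish/subscribe pattern in Python. It is not MQTT and talks to no real broker; `ToyBroker` and the topic name are made up for illustration. Note that nothing checks who is allowed to publish.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler receives (topic, payload) whenever something is published.
Handler = Callable[[str, str], None]


class ToyBroker:
    """In-memory stand-in for a broker (hypothetical, not real MQTT)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # The subscriber registers once, then simply waits to be called.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: str) -> int:
        # The broker pushes the payload to every subscriber of the topic.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


received = []
broker = ToyBroker()
broker.subscribe("classroom/temperature", lambda t, p: received.append((t, p)))
broker.publish("classroom/temperature", "21.5")
```

The subscriber never asks "anything new?"; it only reacts when the broker calls it.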

I remember that the teacher showed us a public test server that displayed temperature values, and asked us to read those values, and then push new values and see the needle move.

mqtt_test_server

I got bored in class (you can see that there is a trend) so I decided to see what happened if I pushed a random value to the temperature sensor. So I did push a random string, and it was funny to see a number of my friends’ applications crash because of it.
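I never saw my friends’ code, but a crash like that usually means the payload was converted to a number without any check. A minimal defensive sketch in Python (the function name is hypothetical) could look like this:

```python
import math
from typing import Optional


def parse_temperature(payload: bytes) -> Optional[float]:
    """Return the payload as a float, or None if it is not a usable number."""
    try:
        value = float(payload.decode("utf-8").strip())
    except (UnicodeDecodeError, ValueError):
        # Not valid UTF-8, or not a number at all (e.g. a random string).
        return None
    # float() also accepts "nan" and "inf", which a gauge cannot display.
    return value if math.isfinite(value) else None
```

A subscriber that drops every `None` keeps running no matter what gets published on the topic.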

I decided to push the prank a bit further and wrote an infinite loop that sent “You shall not pass!” and other random messages and values, over and over again, effectively DoSing everyone trying to do anything with the temperature values.
Needless to say, the teacher was not very happy. (If you’re reading this, sorry Mr. TIGLI ^^’)

So, long-story-short, I will be exploring the implications on privacy and security of applications in production that rely on public or unsecured MQTT brokers.

Day 0: Progress report

After deciding what to do for the project, I started thinking about where to collect the data I’ll be studying.

To get a rough idea of how much data goes through Eclipse’s public MQTT server, I wrote a small script that counts the number of messages it receives per second.

A few runs, at different hours of the day, show that we can expect ~320 messages per second.
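The counting itself can be sketched roughly like this. It is a Python illustration of the idea, not the actual script: in practice the arrival times would come from an MQTT client’s message callback, which is left out here, and both function names are made up.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def messages_per_second(arrival_times: Iterable[float]) -> Dict[int, int]:
    """Bucket arrival times (in seconds) into one-second windows."""
    return dict(Counter(math.floor(t) for t in arrival_times))


def average_rate(arrival_times: List[float]) -> float:
    """Average messages per second over the observed span, empty seconds included."""
    buckets = messages_per_second(arrival_times)
    if not buckets:
        return 0.0
    span_seconds = max(buckets) - min(buckets) + 1
    return sum(buckets.values()) / span_seconds


sample = [0.1, 0.5, 0.9, 1.2, 2.0, 2.7]
per_second = messages_per_second(sample)
rate = average_rate(sample)
```

Averaging over the whole span, rather than only over the seconds that saw traffic, avoids overestimating the rate when the feed goes quiet.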

Each record is composed of 3 fields:

  • Topic : A string that is roughly the equivalent of a #room in IRC. At most, it can be 65,535 bytes, but my tests showed that it averages around 25 bytes.
  • Payload : The actual content of the message. The specification caps the remaining length of a packet at 268,435,455 bytes (just under 256 MB), so the payload has to be slightly smaller than that, and each implementation can limit it further. My tests show that the average payload is less than 300 bytes (and peaks at 600 bytes).
  • Time-stamp : Not part of the MQTT message itself, so it has to be added when the message is received. The human readable `ISO 8601` time format is 24 bytes long.
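As a rough illustration of such a record and its size, here is a Python sketch. The field names and helper functions are assumptions made for the example, and the timestamp uses the 24-character UTC form with milliseconds.

```python
from datetime import datetime, timezone
from typing import Dict, Union

Record = Dict[str, Union[str, bytes]]


def iso_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as e.g. 2016-01-01T12:00:00.000Z."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def make_record(topic: str, payload: bytes, received_at: datetime) -> Record:
    # Hypothetical field names, chosen for this example only.
    return {"topic": topic, "payload": payload, "timestamp": iso_timestamp(received_at)}


def record_size(record: Record) -> int:
    """Raw size in bytes of the three fields, ignoring any storage overhead."""
    topic = record["topic"]
    payload = record["payload"]
    timestamp = record["timestamp"]
    assert isinstance(topic, str) and isinstance(payload, bytes) and isinstance(timestamp, str)
    return len(topic.encode("utf-8")) + len(payload) + len(timestamp.encode("utf-8"))


moment = datetime(2016, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
example = make_record("t" * 25, b"x" * 300, moment)
example_size = record_size(example)
```

This only counts the raw field bytes; whatever database ends up storing the records will add its own overhead on top.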

So on average, each record is 349 bytes (25 + 300 + 24). At ~320 messages per second, we should expect to receive around 111 kilobytes per second, which (if I didn’t mess up any calculation) is around 9.6 gigabytes per day.

This might not be a huge amount of data compared to what ‘Big Data’ companies are dealing with today, but I was worried that MariaDB wouldn’t cut it after a few weeks of data.
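The back-of-the-envelope estimate can be written out in a few lines of Python, using the averages measured above and decimal units (1 GB = 10^9 bytes):

```python
TOPIC_BYTES = 25        # average topic length
PAYLOAD_BYTES = 300     # upper bound on the average payload
TIMESTAMP_BYTES = 24    # ISO 8601 timestamp
MESSAGES_PER_SECOND = 320
SECONDS_PER_DAY = 24 * 60 * 60

record_bytes = TOPIC_BYTES + PAYLOAD_BYTES + TIMESTAMP_BYTES
bytes_per_second = record_bytes * MESSAGES_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{record_bytes} bytes per record")
print(f"{bytes_per_second / 1e3:.1f} kB per second")
print(f"{bytes_per_day / 1e9:.2f} GB per day")
```

That works out to 349 bytes per record, 111,680 bytes per second and 9,649,152,000 bytes per day, so the daily volume lands in the same ballpark as the figure above.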

So I went on Facebook, and asked people about a `Big Data` solution that could fit this use-case. I got only 7 likes but around 31 comments in response (Yes, I hang with a geeky crowd), with different possible solutions and pros and cons.

Some of the proposed solutions:

  • MongoDB
  • Cassandra
  • Riak
  • ElasticSearch
  • Splunk
  • HBase

And so many others…

I have decided to test a few of these, starting with ElasticSearch, because I liked the fact that its API is RESTful and therefore doesn’t limit me in terms of programming languages and platforms.

I followed this tutorial to install ElasticSearch using Vagrant. Then I followed a webinar tutorial to get basic CRUD running with a monitoring GUI.

elastic-gui

I have experimented with insertion, bulk insertion, and basic searching capabilities. There is still a lot to figure out.
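For reference, bulk insertion goes through the `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one document line per record. Here is a small Python sketch that only builds the request body. The index name `mqtt` and the field names are made up for the example, and older ElasticSearch releases also expect a `_type` in the action line or in the URL.

```python
import json
from typing import Iterable, Mapping


def build_bulk_body(index: str, records: Iterable[Mapping[str, str]]) -> str:
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for record in records:
        # Action line: index the document that follows into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(record))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body(
    "mqtt",  # hypothetical index name
    [{"topic": "classroom/temperature", "payload": "21.5",
      "timestamp": "2016-01-01T12:00:00.000Z"}],
)
```

The resulting string would then be sent as the body of a `POST` to `/_bulk` with any HTTP client.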

I have also made a small (and dumb) Node script that inserts each MQTT message it receives. I used the minimalistic esta client (I still haven’t figured out how to do pagination with this client).
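At the REST level, basic pagination is done with the `from` and `size` fields of the search request body. The Python sketch below only builds that body; the `timestamp` sort field and the function name are assumptions made for the example, and whether and how esta passes these fields through is not covered here.

```python
from typing import Any, Dict


def page_query(page: int, page_size: int = 50) -> Dict[str, Any]:
    """Build a search body for a 1-based page of results."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return {
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "query": {"match_all": {}},
        # A stable sort keeps pages consistent; the field name is hypothetical.
        "sort": [{"timestamp": "asc"}],
    }


third_page = page_query(3, page_size=20)
```

By default `from` plus `size` cannot go past 10,000 hits, so for walking through a whole index the scroll API is probably a better fit.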

This was day 0 of the project. In the near future I’ll be looking for other people who have worked on this same issue to see what they have come up with.
I will also be investigating ElasticSearch and its Node clients in a bit more detail.
I will also try to explore other alternatives such as HBase and Cassandra.

If you have any questions or remarks, leave a comment below!

Until the next time,

space_cowboy

 

2 Comments

  1. YOUR BIG BROTHER

    What about Redis man?

  2. Hi raed,
    I have worked on a project to analyse sensor data with ElasticSearch, and I also integrated the Hadoop ecosystem for storage and further processing.
