Archive for the ‘Big Data’ Category

Montreal Data Science Summit

October 23rd, 2015 | by kylie

posted in Big Data, Events, Montreal, People and Interests

Montreal’s first data-science summit is happening on Wednesday 28th and members of the Wajam team are going to be there!

There are a few big names in Big Data that the crew is looking forward to hearing speak. Francis Piéraut, administrator of the Montreal Big Data Group, founder of Qmining and, and Senior researcher a Nuance will be giving a talk around the integration of data science in IT restricted companies. He will also present iPython Notebook, a tool for data scientists.



Shawn Rogers, Chief Research Officer at Dell will be giving a talk about data science at 2:00 PM.

Jeremy Barnes, Founder and CEO at Datacratic, will be attending the event. Datacratic is known for developing Machine Learning models and the Machine Learning Database.



Many other major actors in the data science world will attending this event from National Bank of Canada, to Bandsintown, and Mondobrain. Check out the whole list of speakers on the official website.

Find out more at the official site and hurry to buy your ticket as there are very few free spaces remaining! If you aren’t fast enough to get a ticket, the @Wajam Twitter account will be live Tweeting their time there, so follow along!

BDW14 MTL – Wajam at Big Data Week in Montreal

May 8th, 2014 | by alainwong

posted in Big Data, Company, Engineering, Events, Montreal, Startups

We’re happy to be taking part in Big Data Week happening in Montreal this week featuring speakers like Alistair Croll and representatives from Datacratic, Nexalogy, Radialpoint, and more. Great initiative by organizers Built in Montreal and Exitlist.

Wajam was represented by Suhail Shergill, who gave a presentation on “What Makes Big Data Hard”: what hinders a data scientist when working with ‘Big Data’ and how can we hope to overcome these hurdles.

Tech Meetups in Montreal: Building Reactive Applications Using Play and Akka at Scala Montreal

February 24th, 2014 | by alainwong

posted in Big Data, Company, Events, Montreal

At the Scala Montreal meetup this month, Typesafe consultant Ryan Knight gave a talk on how to build reactive applications using Play and Akka. We also learned a bit of news. Soon to be released Akka 2.3 will have eventsourced integrated, rewritten as akka-persistence.

Check out the Scala Montreal meetup group for the next event.


At Wajam, we’re helping you find recommendations from friends you trust, whenever you need them. In order to do that, we constantly innovate with new technologies. We’d like to invite you to come over to our awesome lounge so that together, we can share our best high-tech adventures. Contact us @Wajam for more details.


Read our latest technical blog posts about how we’ve scaled and improved our social search technology.

How Wajam Answers Business Questions Faster With Hadoop

January 22nd, 2014 | by alainwong

posted in Big Data, Engineering

At Wajam, we’re helping you find recommendations from friends you trust, whenever you need them. In order to do that, we constantly innovate with new technologies. Our technical blog posts are meant to share with you what we’ve learned along our high-tech adventures.

Wajam - Share Your Knowledge


Wajam is a social search engine that gives you access to the knowledge of your friends. We gather your friends’ recommendations from Facebook, Twitter and other social platforms and serve these back to you on supported sites like Google, eBay, TripAdvisor and Wikipedia.

To do this, we aggregate, analyze and index relevant pieces of shared information on our users’ social networks. The challenge for the Business Intelligence team is not so much the storage of the vast number of logs that our millions of users are generating, as our Hadoop cluster is sizeable, rather it is how to quickly answer business questions by running Pig jobs across these entries.

As a Business Intelligence analyst who was only introduced to MapReduce (MR) jobs earlier this year I wanted to blog about how frustrating it can be to write Pig jobs to run across our Hadoop cluster containing our users’ raw logs – oh how I wanted to write that article. It would have consisted of diatribes about Pig’s horrible compile and run time error messaging, litanies of expletives relating to the sheer number of hours spent waiting for MR jobs to run across our cluster. More recent changes implemented at Wajam by one of our Business Intelligence (BI) gurus have, to a large extent, made such a programming polemic null and void. Thanks to just a little further processing of our logs and storing this rigidly formatted output, our reliance on the raw logs has decreased and with it at least the author’s Pig related profanity has lessened.

The primary goal behind these changes was to improve the speed with which we can provide a response to a business query. A secondary goal was to make our data stored on our Hadoop cluster more accessible for everyone.


The pipeline:

Our Hadoop cluster (HC) was and continues to be used for the storage of raw aggregated logs sent to our namenode from the Scribe servers running on top of our many Web servers and for storing the outputs of the automated Pig jobs that run upon these logs. The reporting pipeline of Pig jobs is scheduled by some Scala scripts, with the overall process summarised below:

Wajam BI Pipeline

The logs:

Much of the pipeline’s purpose is to aggregate information in the logs and regularly populate our MySQL server so that queries and alerts related to our traffic can be implemented in R. However it is often the raw logs that are queried for our adhoc requests, as our stored aggregated data may not fit the bill. It is here that our old structure was causing a sticking point.

To explain: the raw logs are broken into events, and each event has a certain amount of generic data e.g. date, timestamp, country, user id etc. A further set of fields is populated on a per-event basis. With this flexibility comes an element of complexity, for when querying these events one has to write pig jobs knowing exactly how all of the events are linked.

As an example, we have an event for every search that a user makes and one of the fields recorded is a unique search id. That same search id will carry across to other events for when we crawl our partners’ servers and for the events generated upon such successful crawls and so on and so forth.

Trawling through our memories for how all these events link together when writing a Pig script to answer a particular question regarding our users’ search behaviour can unnecessarily slow the process down.

Recent changes – of hardware and in mindset

Our solution to this problem? As there is little new under the Sun, we sought inspiration from others who are using Hadoop. LinkedIn, unsurprisingly, was a first port of call e.g. a simple, clean and clever idea like incremental processing to speed up regular Hadoop jobs.

With thoughts along similar lines, the BI team added a pre-processing step: store the collective information about every search that could be gathered from our events. As we have recently increased the size of our cluster to just under 100 nodes, the cost in space for storing this extra information was minimal. The positives of moving to this form of granular pre-processing on the other hand were manifold. Two highlights include:

1. Speed : When a business request came in previously, we often had to run Pig jobs which drew directly from the raw logs. The Pig scripts themselves are not all that difficult to write, however the run time was slowed by the sheer size of the files from which we are drawing. This load time issue was compounded when the request required us to load multiple weeks worth of data. Now a month of search data can be loaded in and processed under 2 hours, where previously an equivalent period in raw logs would have taken 3 – 4 hours or the job would simply have hung.

2. Ease of use: Conceptually it is far easier to query a rigid table like structure rather than try to join a myriad of events. This has the added benefit of making newcomers to Wajam less reticent to venture into the land of Hadoop and Pig.


Further to point 2 above and to lessen our reliance on MySQL, the next step is to add the Impala feather to our bow. Inspired by Google’s Dremel paper, the Cloudera Impala project aims “to bring real-time, ad hoc query capability to Apache Hadoop” .

Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop” . What does this encompass? Essentially multiple clients can be used to issue SQL queries and the Hive metastore provides information as to what databases are available and the schemas associated with same (See diagram below). Then there is a Cloudera Impala process running on every datanode in the HDFS which parses up the original SQL queries and runs them on each node.

Wajam with Cloudera Impala

It is the presence of each of these Impala instances on each node and the associate distributed query engine that means MapReduce can be bypassed. As a result of being able to remove MR, there have been instances of 68x speedup as compared to Hive . The extent of the speed improvement is query dependent.

In order to leverage the power of Impala, we are incrementally migrating our servers to CDH4 however some preliminary testing has already begun.


Coupling our more rigid data structures like our newly created search tables and Impala will likely see an even greater decrease in the turn around time for responding to business requests. The positive effect this will have on our department is hard to quantify – especially as the more approachable SQL like syntax and organised data sets will allow us to explore our data more readily.

That is not to say we will entirely rid ourselves of the need to run adhoc Pig jobs to answer particular requests. However if the bulk of the heavy lifting has been done, the occasional MR job will hopefully keep the colourful language to a minimum.

You can find Xavier Clements on Twitter. Apache Hadoop is a trademark of the Apache Software Foundation.

The Highly Scalable Architecture behind the Wajam Social Search Engine

August 6th, 2013 | by alainwong

posted in Big Data, Engineering

At Wajam, we’re helping you find recommendations from friends you trust, whenever you need them. In order to do that, we constantly innovate with new technologies. Our technical blog posts are meant to share with you what we’ve learned along our high-tech adventures.

Graffiti artwork from the Wajam office


Wajam is a social search engine that gives you access to the knowledge of your friends. We gather your friends’ recommendations from Facebook, Twitter and other social platforms and serve these back to you on supported sites like Google, eBay, TripAdvisor and Wikipedia.

To do this, we aggregate, analyze and index relevant pieces of shared information on our users’ social networks. The main challenge stems from the amount of information we collect as well as the speed at which we must display search results back to users. Our goal is to serve requests in <200ms to remain competitive.


At the beginning, Wajam searches were made in MySQL and soon after using a Sphinx cluster. Sphinx was chosen at the time because some of the first developers at Wajam had experience with it.

However, Sphinx was not suited to be distributed and real-time, so the team had to build a large framework to fulfill these requirements. With time, the solution began to show its age; we encountered low-level and painful cluster management, manual resharding, slow geolocated searches, and high CPU usage.

At the breaking point, we faced a crossroad. Either maintain the current solution by building a bigger framework around it, or go with a new modern solution.


After examining our options which included Solr, we decided to give Elasticsearch a try.

Elasticsearch is built around Apache Lucene, which is a robust and widely used search engine and has a large developer community associated to it. Elasticsearch is built with scalability and real-time indexing as a prime concern, two of the main reasons why we wanted to move over from Sphinx. In addition, Elasticsearch uses a technique called geohash for geo-located queries which offer a significant performance boost for those kinds of queries.

Another important point is the fact that Elasticsearch is written in Java while Sphinx is written in procedural C++. The advantage of well-maintained and open-source Java code is more than just stability, it also means that we could code features ourselves and give back to the project.

Here is a comparison table between Elasticsearch and Sphinx highlighting the key points according to our needs.


At the time of our evaluation in fall 2012, Elasticsearch seemed to gain traction quite rapidly and came out as an interesting alternative that could potentially fill all of our needs.

We were impressed by preliminary benchmarks using our data. At smaller scale, Elasticsearch outperformed Sphinx in most of our use case scenarios. We then decided to fully commit to the transition.


  • Scalability : Horizontal scalability is one of our prime concerns as the quantity of data generated by Wajam is growing exponentially, as is our number of users.
  • Performance : We want to improve performance on every query done to the Search API. The initial goal was to bring search latency under 200 ms (median), including query construction, query execution and parsing.
  • Flexibility : Data is often adapted to new features that we build, so we’re interested in having a flexible data scheme to suit our rapid development cycle.


Our initial idea was to develop the first iteration of the new infrastructure as a direct replacement for Sphinx. That is, keep things simple and use a single index to put in each and every document. As explained later on under pitfalls and lessons learned, this turned out to be a flawed plan.

We already had a defined API, so we built an Elasticsearch search client replacement for the Sphinx one. Following that, we defined the mapping based on what we had in Sphinx and made adjustments to make use of the new Elasticsearch features.

Realtime data comes from Jobs that fetch the data from the diverse social network APIs. The data is backed up on HDFS from the same source. Then we used a Pig UDF (Wonderdog by Infochimps) to bulk load the data into Elasticsearch.

The Backend API and the Jobs were converted to interact with Elasticsearch.

During the transition period, we were pushing data in parallel to Elasticsearch and Sphinx at the same time. To complete the migration, we progressively changed a configuration flag in the Backend to query Elasticsearch instead of Sphinx.


Elasticsearch is not Sphinx. Trying to replicate an architecture from a system to another for simplicity can be a bad idea. It might work in some cases, but in ours, it almost failed. Putting tens of billions of documents in a single index is against the Elasticsearch philosophy and recommendations. We did manage to make this work, but without replicas and some flaky performance. In this situation, each query would hit hundreds of nodes in the cluster. Theoretically this is using the cluster as its full capacity, but in practice it’s not a good idea because nodes will fail, slow down or even disappear. Losing one node means losing indexing capability and degraded search. Don’t do it.

Feature flag. Since we were doing a transition from a legacy system to a new one, we had the opportunity to build the new system in parallel and progressively switch with configuration. Since we had no previous experience with Elasticsearch and had no idea how it would react to our traffic, simply switching to the new architecture would have been risky. So we built the system with a configuration flag to switch from one search engine to another. This way we could iterate until we were able to support all the traffic or rollback at our convenience. This saved us more than once.

Monitor and track everything. Especially with systems like information retrieval, it’s hard to apprehend the impact of every change you make. Having metrics for almost everything helped us measure the impact of the iterative changes we made to our infrastructure. For this we use metrics, graphite and a nice perl script to visualize key performance indicators.

Design query (filters) that are cacheable. The cache in Elasticsearch is awesome. Elasticsearch will build the filter bitsets in memory, and continually add new segments to each bitset as they become available. From then on, query filtering is fullfilled by the cached bitsets instead of reading the segments every time, delivering better performance as a result.


Since then, we iterated many times on the architecture and we are now running a fully replicated Elasticsearch serving all of our users’ searches. We’ve gone through multiple implementations, and are currently using fixed-size monthly bucket indices to give us predictability over cluster behavior and scalability since we’re keeping equal-sized index shards. This way, we can easily model our needs according to our growth rate and add nodes when needed. Moreover, having multiple indices grouped and identified with aliases allow us to be more flexible with the queries we are doing.

During the whole process we have been monitoring the evolution with a large number of performance metrics. Below are some of the latest graphs showing the system response time and indexation rate that we have been able to achieve with the current architecture.

Median search query response time over time

Document indexing over time

Since switching over to Elasticsearch, we’ve enjoyed a 10x improvement in performance compared to Sphinx with a median response time of 150 ms while serving more documents than ever.


At Wajam, we’re always been focused on giving users the best social search experience, and now with Elasticsearch, we’re able to deliver on this promise more quickly and more reliably. You can get recommendations from friends you trust, across all of the sites that we support, both on the desktop and on mobile devices.

Although it was a challenging transition to migrate our data to the new Elasticsearch architecture, we now have a solid foundation from which to scale. Stay tuned for more exciting new features coming soon!

Elasticsearch is launching 0.90.3 today. Check it out here.

I would like to thank personally Shay Banon, the elasticsearch consulting staff and everyone on #elasticsearch on Freenode that helped me on this project. I am often online on this channel as jgagnon. You can also reach me at @jeromegagnon1.

Apache Lucene and Solr are trademarks of the Apache Software Foundation