
Orchestrating data collection with Airbyte

To get anything out of big data, you first need to ingest data from multiple sources. Big data collection is notoriously difficult. But does it have to be?

Imagine having a smartphone that can only run one mobile application. That would be manageable if you only used one or two applications, but what if you used 50? You would need one smartphone for Google News, another for Facebook, another for Instagram, and so on. Sounds awful, right?!

That is one of the main problems we encountered while developing data ingestion for our Collect microservice at Human Managed. Our platform generates intelligence and action from customers’ data, which means we work with many data sources as input for analysis (currently 30+ data sources and counting!).

Initially, we developed numerous Bash and Python scripts, run via cron jobs or AWS Lambda, just to automatically fetch data from multiple APIs and data sources. However, this became difficult to manage once we tried to scale up and replicate the scripts for each data source and customer.

We had to find a solution that we could manage, maintain, scale, and modify for our use cases.

Airbyte Open-Source ELT Platform

Enter Airbyte, an open-source data integration platform that syncs data from applications, APIs, and databases and transfers it to the destination you choose, such as a database or object storage. Remember the analogy I gave earlier? Airbyte is the smartphone that can run many mobile applications at once, which makes data integration simple, secure, and extensible.

We used existing Airbyte connectors where they fit our use cases and developed our own connectors when necessary. Since Airbyte is an open-source platform, the available connectors are developed by the community, and you can request new ones or contribute your own. (We plan to contribute our custom connectors and our improvements to existing ones to give back to the community, so stay tuned!)

Once we converted our automation scripts into Airbyte connectors, we were able to easily implement and organize the batch collection that happens within our Collect microservice. To replicate the same connection for another customer, we just set the configuration in the existing connector, set the destination, and that’s it! We were able to integrate another data source into our platform in just minutes.
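To illustrate the idea of reusing one connector across customers, here is a minimal Python sketch. It does not call Airbyte at all; the field names (api_url, api_key, bucket), the customer names, and the build_connection helper are hypothetical and simply stand in for whatever configuration your connector actually requires.

```python
from copy import deepcopy

# Hypothetical base configuration shared by every customer. The field names
# are illustrative and do not mirror any real Airbyte connector spec.
BASE_SOURCE_CONFIG = {
    "api_url": "https://api.example.com/v1/events",
    "start_date": "2022-01-01T00:00:00Z",
    "page_size": 100,
}


def build_connection(customer_id: str, api_key: str, bucket: str) -> dict:
    """Return a source/destination pair for one customer (hypothetical shape)."""
    source = deepcopy(BASE_SOURCE_CONFIG)  # never mutate the shared template
    source["api_key"] = api_key
    return {
        "name": f"events-{customer_id}",
        "source": source,
        "destination": {
            "type": "object_storage",
            "bucket": bucket,
            "prefix": f"raw/{customer_id}/",
        },
    }


# Onboarding a new customer becomes one extra line of configuration.
connections = [
    build_connection("acme", "key-acme", "collect-acme"),
    build_connection("globex", "key-globex", "collect-globex"),
]
```

The point of the sketch is that only the per-customer values change, while the connector logic and the shared settings stay in one place.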

Figure: Airbyte ELT Overview

Going back to the analogy of juggling multiple smartphones: with Airbyte, we now have a single powerful smartphone that can receive data from many mobile applications (see Airbyte’s list of available connectors). This solved the problem of maintaining and organizing our automated data ingestion, which we can now customize, scale, and manage however suits our needs.

Drawbacks and Limitations

Every tool has drawbacks, and Airbyte is no exception. The first that comes to mind is the learning curve required to develop and maintain custom Airbyte connectors. If you are not familiar with the platform, there are quite a few essential concepts to grasp before you can properly understand and implement a connector. Once you have them figured out, though, most of the development becomes easier than before, thanks to the Airbyte CDK built by the Airbyte team.

As for limitations, Airbyte does not yet support backing up the raw data between the Extract (E) and Load (L) steps. This conflicts with our architecture, which stores the raw logs without the metadata added by Airbyte.
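To make the metadata issue concrete, here is a small Python sketch of one possible post-processing step. It is not a feature of Airbyte and not necessarily how we handle it. It assumes each loaded record is a JSON line wrapped in an envelope with the original payload under _airbyte_data; check those field names against your own destination’s output, since they may differ. The sample record and the unwrap helper are hypothetical.

```python
import json

# Assumed envelope layout for one loaded record; the values are made up and
# the field names should be verified against your destination's output.
raw_line = json.dumps({
    "_airbyte_ab_id": "hypothetical-uuid",
    "_airbyte_emitted_at": 1660000000000,
    "_airbyte_data": {"event": "login", "user": "alice"},
})


def unwrap(line: str) -> dict:
    """Return the original payload, dropping the envelope metadata.

    Lines that carry no envelope are returned unchanged.
    """
    envelope = json.loads(line)
    return envelope.get("_airbyte_data", envelope)


original = unwrap(raw_line)
```

This only recovers the payload after loading; it does not give you a copy of the data as it was between extraction and loading.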

Conclusion

Even though I’ve mentioned the downsides of implementing Airbyte, the benefits far outweigh them. Airbyte gives you the following capabilities:

  • Open source and free for all to contribute
  • Cloud Agnostic
  • Customizable, Modular, Scalable, Secure, and Compliant
  • Set Connection Syncs and Incremental Stream State
  • Unlimited Sync Frequency [No Tier]
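
The incremental stream state item in the list above is easier to picture with a small example. The sketch below is plain Python rather than Airbyte CDK code; the updated_at cursor field, the sample records, and the read_incremental helper are assumptions made for illustration. The idea is that each sync saves a cursor, and the next sync only picks up records past it.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory "source"; a real connector would query an API.
RECORDS = [
    {"id": 1, "updated_at": "2022-08-01T00:00:00Z"},
    {"id": 2, "updated_at": "2022-08-02T00:00:00Z"},
    {"id": 3, "updated_at": "2022-08-03T00:00:00Z"},
]


def read_incremental(
    records: Iterable[Dict], state: Dict
) -> Tuple[List[Dict], Dict]:
    """Return records newer than the saved cursor, plus the updated state."""
    cursor = state.get("updated_at", "")
    # Same-format ISO 8601 UTC timestamps compare correctly as plain strings.
    new_records = [r for r in records if r["updated_at"] > cursor]
    if new_records:
        cursor = max(r["updated_at"] for r in new_records)
    return new_records, {"updated_at": cursor}


# The first sync starts with empty state and reads everything.
first_batch, state = read_incremental(RECORDS, {})
# The second sync reuses the saved state and finds nothing new.
second_batch, state = read_incremental(RECORDS, state)
```

Because the state is returned rather than stored globally, the caller decides where to persist it between syncs.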

Airbyte’s ability to unify data integration pipelines under one fully managed platform makes it a good fit for our Collect microservice: it lets us easily integrate data sources, develop custom connectors when necessary, and manage these connections without worrying too much about security and compliance.

With a fast-growing community of contributors, more connectors will be built and existing ones improved, which will keep pushing Airbyte’s data ingestion capabilities forward.

You can learn more about Airbyte and get involved in the open-source platform by visiting the official Airbyte website.

Note: This blog was originally published in August 2022 and has been updated for accuracy.