We couldn’t build a global cloud whose economics would bankrupt us without massive infusions of cash. We had to apply lean and agile methodologies to our global data center build-out.
At InCountry, we face the herculean task of building a global data storage and processing cloud with points of presence (PoPs) in every country in the world. To make it even more challenging, we need two redundant facilities in each country and have to comply with each country’s specific regulations. Our customers would then be able to store and process data in any country with our multi-tenant offering or use dedicated hosts with our single-tenant offering.
Building fixed infrastructure across all of these countries would quickly add up. There are 193 countries in the United Nations. With two redundant facilities in each country, that’s 386 facilities. With an average annual hosting and bandwidth contract of $100,000 per facility and an average setup cost of $50,000 per facility, we were looking at almost $20 million in setup costs and a $40 million annual spend, with a three-year commitment.
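The back-of-the-envelope math, using those rounded per-facility figures (actual contracts varied by country), looks roughly like this:

```python
# Back-of-the-envelope cost model using the rounded per-facility figures above.
countries = 193
facilities = countries * 2             # two redundant facilities per country = 386
setup_total = facilities * 50_000      # ~$19.3M in one-time setup costs
annual_hosting = facilities * 100_000  # ~$38.6M per year in hosting and bandwidth
three_year_total = setup_total + 3 * annual_hosting  # ~$135M, roughly the $140M cited below

print(f"Facilities: {facilities}")
print(f"Setup: ${setup_total:,}")
print(f"Annual hosting: ${annual_hosting:,}")
print(f"Three-year total: ${three_year_total:,}")
```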
Especially when considering that there can be sudden global economic headwinds like the current pandemic, it’s critical that startups think through how to scale up while not burdening themselves with significant committed costs.
The spendthrift way
We would need roughly $140 million just to set up the infrastructure and cover the three-year agreements! Then we would rent out slices of the facilities to customers down the line and hope they would bring in more money than our costs.
This should sound familiar to anyone who has read about WeWork. WeWork signed long-term leases for office space worldwide (just like our data center facilities), executed build-outs of those facilities, and then sublet desks (just like our multi-tenant offering) and team rooms (just like our single-tenant offering).
We all know how it turned out. They raised a gargantuan amount of money, made large long-term commitments, and had to keep raising even more money just to pay the bills while waiting for enough customers to rent slices of their spaces to cover the cost of each office. Until the strategy hit a wall.
The lean and agile way: v0 product
When we were initially building out the technology in stealth, we used large public cloud regions as our testing grounds. Amazon Web Services, Microsoft Azure, and Google Cloud gave us reach across 17 countries with two separate providers in each country, and we added Alibaba Cloud in two China regions to make it 18 countries.
We were able to leverage Amazon API Gateway, AWS Lambda serverless functions, and dynamic data storage engines like Amazon DynamoDB and Google Cloud Bigtable to offer API access and record storage based on elastic demand within each region rather than as fixed costs. We were also able to test replication across different infrastructure and use a message queue to manage peak requests to the databases.
Everything worked end-to-end. Routing all traffic through serverless functions and using scalable backends offered both high availability and high scalability. Amazon API Gateway helped us create SDKs for multiple languages, and we added client-side encryption using the underlying encryption libraries that came with each language platform.
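As a rough illustration of that elastic write path (a minimal sketch, not the actual InCountry implementation; the table and field names are hypothetical), an API Gateway-backed Lambda handler can persist an already-encrypted record to DynamoDB:

```python
# Minimal sketch: API Gateway invokes this Lambda, which writes one record
# to a hypothetical DynamoDB table named "records" keyed on "record_key".
import json
import boto3

table = boto3.resource("dynamodb").Table("records")

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    table.put_item(
        Item={
            "record_key": body["record_key"],
            "payload": body["payload"],  # already encrypted on the client side
        }
    )
    return {"statusCode": 201, "body": json.dumps({"stored": body["record_key"]})}
```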
The dynamic architecture was very secure and highly available, but not very flexible as it required all traffic to go through a single gateway.
Mini-PoPs: v1 product
For the v1 product, we expanded beyond the cloud to what we termed Mini points of presence. To expand from 18 countries to 50, we rented dedicated databases from two separate providers in each new country. Our SDK encrypted data with 256-bit encryption on the client side, so when the data was stored in the rented dedicated databases, it carried the same level of encryption used for storing data at rest on hard drives. For example, if somebody loses a MacBook with proprietary data encrypted at 256 bits on its storage drive, information security teams will sign off on the loss of the MacBook as not representing a data breach.
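To make that concrete, a client-side sketch of 256-bit encryption might use AES-256-GCM (shown here with Python’s cryptography package; the actual SDK’s algorithm choice, key management, and wire format are assumptions and are not shown):

```python
# Sketch of 256-bit client-side encryption with AES-256-GCM; key management omitted.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_record(plaintext: bytes, key: bytes) -> bytes:
    """Encrypt a record body before it leaves the client. Requires a 32-byte key."""
    nonce = os.urandom(12)                    # unique nonce per record
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ciphertext                 # prepend nonce so the record can be decrypted later

def decrypt_record(blob: bytes, key: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)
```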
We continued to use Amazon API Gateway and AWS Lambda as a dynamic interface to a hybrid mix of cloud vendors and rented database hosts. Eventually, we were able to offer developers a self-service way to store encrypted data in 50 countries.
The dedicated databases were very secure, but they were not highly available.
Midi-PoPs: v2 product
One of the biggest issues with our v1 product was that all requests transited through our AWS gateway service on the US East Coast. We needed to add the capability for our SDK to contact any of our points of presence directly. We also had prospects requesting features like a Border Proxy that would automatically redact and reinsert data in web service calls. These capabilities would require full hosts at our points of presence rather than simple databases.
We sourced dedicated hosts worldwide that we would normalize with the same version of Linux and provision using the HashiCorp stack. At this point, we also put in place the processes and documentation to achieve SOC 2 Type 2, PCI DSS, and HIPAA compliance.
The set of dedicated hosts in each facility ran an API tier and a data tier, and the database replicated automatically between the two facilities in each country. We now offered direct access from our SDK to specific countries, and we introduced a new product, InCountry Border, that could automatically redact data.
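To illustrate the redact-and-reinsert idea behind InCountry Border, here is a highly simplified sketch (the field names, token format, and in-memory store are hypothetical stand-ins, not the real product):

```python
# Simplified redact-and-reinsert sketch: sensitive fields are swapped for
# surrogate tokens before a payload leaves the country, and swapped back on
# the way in. The in-memory dict stands in for the in-country data tier.
import uuid

SENSITIVE_FIELDS = {"name", "national_id"}   # hypothetical regulated fields
in_country_store: dict[str, str] = {}

def redact(payload: dict) -> dict:
    redacted = dict(payload)
    for field in SENSITIVE_FIELDS & payload.keys():
        token = f"tok_{uuid.uuid4().hex}"
        in_country_store[token] = payload[field]   # original value stays in-country
        redacted[field] = token
    return redacted

def reinsert(payload: dict) -> dict:
    restored = dict(payload)
    for field, value in payload.items():
        if isinstance(value, str) and value in in_country_store:
            restored[field] = in_country_store[value]
    return restored
```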
The dedicated hosts were very secure and moderately available.
Super-PoPs: v3 product
At this point, we had significant market signal on the countries and categories of data that were of particular interest to prospects. We identified key market pairings such as “UAE-Health” and “Russia-Profile” and worked to ensure there were appropriate dedicated hosts and providers in those markets. Our risk management team would check underlying certificates of the vendor chain for each facility, so our sales team would know what types of data could be stored in which countries. It became clear at this point that we would need owned facilities that were PCI DSS compliant to offer storage and processing for payment data.
We also rented excess capacity so that we could offer single-tenant dedicated hosts to customers at a higher price point and margin than our multi-tenant offering.
The newer dedicated hosts were very secure, very available, and very flexible because we could deploy newer software to dedicated hosts for customer pilots.
Ultra-PoPs: v4 product
Finally, the destination! We had deals that provided very clear direction about which countries and which types of facilities were required. We also achieved SOC 2 Type 2, PCI DSS, and HIPAA compliance, enabling us to store highly regulated data. We worked with a data center specialist to source and negotiate with top-tier facilities and build out co-location facilities where we could define the networking, hardware, and physical security.
Sourcing and building out co-location facilities in far-flung locations can take two to four months, which mapped quite well to the procurement cycle for highly regulated customers in the finance and health sectors.
The co-located facilities and hosts are highly secure, highly available, and highly flexible.
Lessons Learned
We learned it’s possible to scale a global footprint iteratively using lean and agile methodologies. The cash flow savings compound with iterative learning about each market and with measuring customer demand before over-investing in any specific market.