This blog is also published on Medium (https://propertyguru.tech/once-upon-a-cloud-the-enchanted-migration-of-batdongsan-to-aws-898875c128e7).
This is the story of Batdongsan’s successful migration to AWS — a journey of resilience, collaboration, and innovation. Told through the charm of a fairy tale, this blog celebrates a significant milestone in our cloud transformation.
The story unfolds in four chapters, each opening with a summary before diving into the adventure’s challenges, preparations, and triumphs. Join us as we explore Batdongsan’s path to the cloud, where technology meets imagination in a tale worth remembering.
Let’s begin: Once upon a time…
Chapter 1: The Kingdom of Batdongsan
In a realm where intricate networks connected every land, Batdongsan stood apart, unique in its architecture and traditions. Its infrastructure was unlike any other in the group, presenting challenges and opportunities which required careful navigation and expert guidance to overcome…
The Realm of Untold Challenges
In 2022, in the land known as Batdongsan, a tale of transformation began to unfold. Unlike the other markets basking in the cloud’s light, Batdongsan stood apart, its infrastructure firmly rooted in the soil of on-premise systems. The only threads connecting it to its neighbors were the shared shields of CloudFlare and the enchanting grounds of Kubernetes. Yet, even these commonalities revealed more contrasts than similarities as the unique challenges of Batdongsan began to emerge.
By 2023, Kubernetes version 1.24 had become the standard for modern infrastructure, but Batdongsan’s road to adopting it was fraught with challenges. Years of isolation from the group’s shared practices had left its development, staging, and production clusters diverging significantly. Custom controllers, bespoke configurations, and alternative solutions crafted out of necessity created a tangled web of complexity. Coupled with Kubernetes’ inherent demands — authentication, high availability, networking, storage, and secrets management — this divergence made upgrading to version 1.24 with consistency and confidence nearly impossible. Without the alignment and support that other markets enjoyed, Batdongsan had no choice but to re-architect its clusters entirely, a laborious effort born not of progress but of necessity, highlighting the cost of working in isolation.
In parallel, Batdongsan’s approach to observability painted a stark picture of its isolation. While other markets reveled in the ease of managed tools like DataDog, Batdongsan had to rely on self-hosted solutions such as Grafana and Zabbix. Every metric had to be meticulously crafted by hand, as the engineers dedicated countless hours to ensure that these systems remained vigilant guardians of Batdongsan’s critical infrastructure. The effort invested in maintaining these observability tools was immense, yet their unwavering commitment ensured that the kingdom’s vital systems remained healthy.
As the winds of collaboration began to stir, new challenges arose. Without access to private cloud networking, Batdongsan navigated a convoluted path for traffic between its services and those of other markets. The journey required routing through one private network to CloudFlare, across the unpredictable public internet, and back through another CloudFlare layer into yet another private network. This intricate dance introduced delays and wasted precious computational resources, as manual whitelisting added complexity.
The Spark of a New Dawn
Despite the kingdom’s wealth of talent among its engineers, they faced a daunting truth: competing with the cloud vendors, who commanded specialized knowledge in every facet of infrastructure, was no easy feat. Innovations that were simple for others to implement using managed services demanded considerable effort and resources for Batdongsan. A prime example was the kingdom’s on-prem Kafka cluster, which required a year of dedicated labour to maintain at a production-grade level, complete with monitoring and automation workflows.
These constraints extended to databases, where Batdongsan was limited to traditional options like MSSQL, MySQL, MongoDB, ElasticSearch, and Redis. The team poured their energy into patching, securing, and configuring these databases for high availability, which left little room for experimentation with modern database solutions. Instead of choosing technologies that best suited their needs, they were often forced to shape their technical strategies around the existing tools.
Yet, within these trials, a glimmer of hope began to shine. Batdongsan’s challenges were not merely obstacles; they were the foundation for a transformative journey that would ultimately redefine its infrastructure and unlock the promise of a cloud-powered future.
Chapter 2: The Council of Clouds
The esteemed guardians of Batdongsan, along with their counterparts from across the kingdom, gathered to map the road ahead. They deliberated on costs, charted paths, and sought alliances with the most promising cloud vendors to prepare for the great migration. Their strategic thinking and technical acumen set the foundation for the journey…
The Gathering of the Wise
As the engineers of Batdongsan reflected on their unique challenges, the air began to buzz with the excitement of transformation. To remove the limitations of our on-premise infrastructure and bring Batdongsan closer to the rest of the group, Anh Phan, then the Director of Engineering of Batdongsan, and I, then the Technical Architect of Batdongsan, discussed the need for cloud migration. By the end of 2022, we had set forth a budget line for this ambitious endeavor, though it remained dormant until destiny began to stir in April 2023.
April arrived like a season of renewal, as if guided by the hands of fate. New allies joined our quest: Cong Nguyen, Chida, Bala M., Bimlendu, and Khanh Nguyen, each bringing unique talents and perspectives. Together, we established weekly sync meetings, a collaboration ritual that united Batdongsan and the Infra CoE leaders. With this assembly of minds and spirits, we began to confront the challenges ahead with newfound determination.
The first task on our agenda was to assess our current state and prepare for the journey. With quills in hand and scrolls unfurled, the team mapped out the intricate paths we would take. We reviewed the existing infrastructure, cataloguing every component needing a careful transition to the cloud. Each system was like a treasured artefact, demanding meticulous planning for its safe passage. At the same time, we built compelling business cases and analyzed our infrastructure to estimate the total cost of ownership (TCO) for promising cloud vendors. This process began after the weekly gatherings and laid a strong foundation for our migration strategy. We worked closely with vendors to clarify the unknowns and refine our estimations. These formidable allies provided valuable insights, helping us develop a clearer picture of our migration path. The guidance from all vendors proved instrumental in shaping our strategy and addressing lingering uncertainties. Ultimately, we chose to follow the well-trodden path of AWS, uniting our journey with the rest of the kingdom to harness the shared wisdom of the cloud’s enchanted realm.
Forging the Cloud Pact
Although other markets within the group had been running on AWS for a while, they, too, sought better cloud architecture and infrastructure-as-code practices. This understanding further shaped our mission, which was first articulated at the “Cloud Offsite” in October 2023, when the newly formed CloudSquad gathered for the first time in Ha Noi with the AWS team to discuss high-level architecture and strategies. This significant gathering produced a concrete migration plan, resolved many high-level technical problems, and defined Batdongsan’s mission: to utilize the group’s existing cloud knowledge and establish a new AWS organization with modern and correct architecture from the start. This effort would serve Batdongsan and contribute to upgrading the AWS infrastructure across the entire marketplace.
The Cloud Migration plan, forged during the Offsite, began to take its place in the broader world. It was first introduced outside Batdongsan at the Group Tech Manager Forum and later shared widely at the Vietnam Tech Town Hall and with the Vietnam Leadership Team. Alongside these presentations, we finalized and submitted a more precise budget for the 2024 migration, further solidifying the roadmap for our journey.
The engineers felt renewed purpose as the sun set on the horizon, casting a golden glow upon Batdongsan. We stood on the brink of a great adventure, armed with knowledge, collaboration, and the unwavering spirit to succeed. With the groundwork laid, we were ready to embark on the next phase of our journey, one that would soon lead us to the realm of the cloud, where new opportunities awaited.
Chapter 3: Building of Foundations
With the grand plans in hand, the journey commenced. Skilled craftsmen from Batdongsan and distant lands toiled day and night, crafting intricate pathways and testing the staging deployments. Their united efforts brought forth a tapestry of innovation, transforming challenges into stepping stones for success. The spirit of camaraderie shone brightly as they forged ahead, determined to create a bridge to a new realm of possibilities…
The Crafting of the Pathways
We stood at a crossroads as our journey to the cloud entered its next phase. The road ahead was filled with challenges and opportunities to shape a new realm, and we resolved to embrace the AWS Well-Architected Framework from the beginning. Like a guiding star, this framework illuminated our path, ensuring our decisions aligned with best practices for security, reliability, and operational excellence.
With clarity and determination, we designed a multi-account organization using AWS Control Tower. This became the cornerstone of our governance, enabling us to establish guardrails that protected our infrastructure while granting flexibility to innovate. Each account within the organization was like a chapter in a meticulously crafted story, its boundaries clearly defined and its purpose well-considered.
At the heart of our infrastructure lies the organization network, a hub-and-spoke topology centered around Transit Gateways. Departing from the traditional VPC peering model, this design ensured seamless connectivity and prepared the way for future collaboration between the two AWS organizations: the marketplace and Batdongsan. Our vision extended beyond immediate needs, seeking to unify the realm through thoughtful, scalable architecture.
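One way to see why a hub-and-spoke design scales more gracefully than VPC peering is simply to count connections. VPC peering is not transitive, so full connectivity needs a peering between every pair of VPCs, while a hub needs one attachment per VPC. The sketch below is only that arithmetic; the VPC counts are illustrative and are not Batdongsan’s actual numbers.

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections needed so every VPC can reach every other VPC directly."""
    return vpc_count * (vpc_count - 1) // 2


def hub_and_spoke_attachments(vpc_count: int) -> int:
    """Attachments needed when each VPC connects once to a central hub."""
    return vpc_count


if __name__ == "__main__":
    for n in (3, 10, 30):  # illustrative VPC counts, not real figures
        print(
            f"{n} VPCs: {full_mesh_peerings(n)} peerings "
            f"vs {hub_and_spoke_attachments(n)} hub attachments"
        )
```

The gap grows quadratically, which is why a central hub becomes attractive as soon as the number of accounts and VPCs starts to climb.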
To ensure our journey would leave a lasting legacy, we adopted Architecture Decision Records (ADR) as our framework for capturing critical decisions. Authored and preserved in Confluence, these records became the memory of our architectural journey, ensuring that the knowledge we gained could spread across the group, guiding others in their own quests.
With a shared love for automation and consistency, I led the effort to make our infrastructure manageable as code. We chose Terraform and Terragrunt as tools, leveraging the team’s expertise. While the marketplace team already had a Terraform repository, it needed more modularity and modern design for a well-architected multi-account setup. Collaborating with my fellow CloudSquad members and infrastructure engineers across the group, we raised the bar for our Infrastructure as Code (IaC) practices.
We proposed a new source code architecture that integrated automation through the Atlantis runner and avoided reliance on IAM users or multiple AWS profiles. This setup distinguished between root modules, tailored for direct application, and shared service modules, designed for reuse. Bimlendu further enriched this effort by establishing a network of GitHub repositories for shared modules, complete with a governance program and repository templates. Around this time, Wilson joined us and upgraded the shared SSO module used in the marketplace, adapting it for Batdongsan to align with Bimlendu’s architecture. These combined efforts enabled us to manage user identities through code, significantly reducing manual intervention and forming the foundation of a new generation of Terraform practices across the group.
Meanwhile, our CloudSquad members took on significant design and implementation challenges. Hanh meticulously designed Batdongsan’s network architecture, detailing it down to the subnet and IP-planning level. Cong integrated the existing CloudFlare network with the new cloud infrastructure using CloudFlare Tunnel, minimizing changes to Batdongsan’s existing setup. Van implemented a solution for authenticating Windows servers on EC2 by utilizing the shared Entra ID Forest from the ITSP team. Viet and Dat designed EKS clusters, employing Pod Identity, IAM Roles for Service Accounts (IRSA), and Access Entries to control service and user access. What began as a small team effort with contributions from only Bimlendu and me soon evolved into a collaborative initiative. By this point, ten cloud engineers were actively contributing to Batdongsan’s Terraform repository, while shared module repositories flourished, creating a robust ecosystem for the group’s IaC practices.
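To give a feel for what subnet and IP planning involves, here is a minimal sketch using Python’s standard `ipaddress` module. It carves a VPC range into equal, non-overlapping blocks and assigns them to named subnets. The CIDR range, prefix length, and subnet names are hypothetical; the post does not describe the real plan.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, new_prefix: int, names: list[str]) -> dict:
    """Assign consecutive, non-overlapping subnets of the VPC range to the given names."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    plan = {}
    for name in names:
        try:
            plan[name] = next(blocks)
        except StopIteration:
            # More subnets requested than the VPC range can hold at this size.
            raise ValueError(
                f"{vpc_cidr} has no /{new_prefix} block left for {name}"
            ) from None
    return plan


if __name__ == "__main__":
    # Hypothetical range and names, for illustration only.
    plan = plan_subnets("10.0.0.0/16", 20, ["private-a", "private-b", "public-a"])
    for name, block in plan.items():
        print(f"{name}: {block} ({block.num_addresses} addresses)")
```

Writing the plan down as data like this makes overlaps and exhausted ranges visible before anything is provisioned.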
The Trials of Staging
By April 2024, the staging infrastructure was ready. Application teams began deploying their workloads to the cloud, rigorously testing functionalities and performance. Thanks to GitOps and ArgoCD, only minimal changes were needed in CI/CD pipelines, but teams faced extensive tasks to test their applications manually and automatically. During this phase, we encountered the critical challenge of database migration, which required tailored solutions for transitioning RDBMS and NoSQL systems while minimizing production downtime.
However, this period had its challenges. Application teams had to address incompatibilities while optimizing the performance of legacy systems based on test results. We at CloudSquad experimented with multiple approaches to database migration, each tailored to specific systems. These efforts and the complexity of coordinating across teams delayed our timeline by two months. The challenges mounted in August 2024 when Typhoon Yagi disrupted individual schedules and coordination efforts between application teams and CloudSquad members. Despite these setbacks, we pressed forward. We achieved a significant milestone by late October: a 15-day master plan for production migration was finalized, and the production switchover date was set. It was a moment that marked the beginning of the next chapter in our journey.
Chapter 4: The New Horizon
As the sun rose on the migration day, the Kingdom of Batdongsan prepared to embrace its new destiny. With bated breath, the team watched as their production systems took their final leap into the cloud, soaring like majestic birds into a boundless sky. The moment marked a technical achievement and the dawn of a new era filled with innovation and opportunity. Batdongsan stood united, celebrating the fruits of their labor and the strength of their collaboration. Together, they unlocked a realm of possibilities, ready to explore the vast horizons ahead, where dreams of progress and transformation awaited.
The Leap into the Clouds
The long-anticipated migration to the cloud had reached its climactic moment. Armed with our master plan, we prepared for the production cutover as though it were the final battle in an epic quest. Late Friday night was chosen as the hour for transformation, when disruptions to the realm would be least felt. Messages were sent across the kingdom, informing stakeholders and customers of the impending changes. To bolster our defenses, we summoned aid from the AWS Technical Account Manager teams, ready to intervene should any unseen foes arise. With 15 days to complete the migration, each moment was precious.
Within the tech guild, we divided our forces. My fellow CloudSquad companions and I worked on provisioning production database instances and preparing the application credentials needed for the journey, drawing upon the lessons of our staging trials. We oversaw the allocation of resources, ensuring that everything essential would stay strong. Meanwhile, the application teams deployed their workloads to the cloud and carefully diverted production traffic, starting at 10% and gradually increasing to 30%. As we pressed onward, our watchful eyes remained fixed on Direct Connect bandwidth, response codes, and application latencies. To test the cloud’s uncharted territories before fully opening them to the realm, we implemented an ingenious mechanism to switch between on-prem and AWS backends using headers and cookies. Each request carried a response header proclaiming whether it had journeyed through the old on-prem lands or the new AWS territories, giving us a clear map of our progress.
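The switching mechanism can be pictured roughly as follows. This is a sketch under stated assumptions, not the team’s implementation: the post does not say where the logic lived or what the header and cookie were called, so every name below is hypothetical. An explicit header wins, then a cookie, and otherwise a stable hash of the client places it inside or outside the current rollout percentage.

```python
import hashlib

# Hypothetical names: the real header and cookie names are not given in the post.
OVERRIDE_HEADER = "x-backend-override"
OVERRIDE_COOKIE = "backend"
SERVED_BY_HEADER = "x-served-by"
BACKENDS = ("onprem", "aws")


def choose_backend(headers: dict, cookies: dict, client_id: str, aws_percent: int) -> str:
    """Pick a backend: explicit header first, then cookie, then a weighted split.

    Header names are assumed to be lowercased already.
    """
    for forced in (headers.get(OVERRIDE_HEADER), cookies.get(OVERRIDE_COOKIE)):
        if forced in BACKENDS:
            return forced
    # Hash the client id into a stable bucket 0..99 so a client stays on one backend.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "aws" if bucket < aws_percent else "onprem"


def tag_response(response_headers: dict, backend: str) -> dict:
    """Return a copy of the response headers recording which backend served the request."""
    return {**response_headers, SERVED_BY_HEADER: backend}
```

With something like this in place, a tester can force a single request onto AWS with `choose_backend({"x-backend-override": "aws"}, {}, "tester", 0)` while ordinary traffic still follows the 10% or 30% split.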
Battles in the Skies
As the late Friday night hour struck, the great migration began. The maintenance page was raised like a shield, signaling the start of downtime. All production traffic was routed to AWS, and the on-prem workloads fell silent. The synchronization of databases commenced. MySQL to RDS marched forward swiftly, a model of efficiency. Redis to ElastiCache, however, proved more laborious, taking three hours to complete, while ElasticSearch to OpenSearch demanded an additional hour. Yet, as dawn approached, these trials, too, were conquered.
Just as we thought the battle was in our favor, a shadow emerged at 1:00 AM Saturday. The MongoDB to DocumentDB migration faltered, revealing discrepancies in document counts. One of the databases was far different from what was anticipated, and completing the migration would require an estimated 15 hours, far beyond the downtime we could afford. Gathering quickly, we chose a tactical retreat, leaving MongoDB on-prem for now and resolving to revisit its migration later. Though not a victory, this decision preserved the progress we had made.
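A check of the kind that surfaces such discrepancies can be sketched in a few lines. It assumes the per-collection document counts have already been gathered from the source and the target, for example by running a count query on each side. The collection names and numbers used with it would be hypothetical; the post does not share the real ones.

```python
def count_discrepancies(source_counts: dict, target_counts: dict) -> dict:
    """Return {collection: (source, target)} for every collection whose counts differ.

    A collection missing on one side is treated as having zero documents there.
    """
    mismatches = {}
    for name in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(name, 0)
        dst = target_counts.get(name, 0)
        if src != dst:
            mismatches[name] = (src, dst)
    return mismatches


if __name__ == "__main__":
    # Hypothetical collections and counts, for illustration only.
    source = {"listings": 1200, "users": 300}
    target = {"listings": 1200, "users": 297}
    print(count_discrepancies(source, target))
```

An empty result is a necessary but not sufficient sign of a clean copy, since matching counts say nothing about the content of each document.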
By 2:30 AM, the kingdom rejoiced. The applications were alive in their new cloud home, credentials had been updated, and connections reestablished. The migration seemed a triumph, celebrated by cheers echoing across the guild.
Yet, the following day brought an unwelcome twist. Alarms sounded as consumer site pods running on EKS began behaving erratically, causing a massive consumer lag on Kafka topics. Critical systems teetered on the brink of collapse. We urgently traced the disturbance to its source and deployed a fix by implementing a circuit breaker in the Kubernetes manifests using Istio. By midday, the crisis had been averted, and the realm steadied itself again.
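The post does not show the Istio configuration, so the following is not the team’s fix. It is an application-level illustration, in Python, of the general circuit-breaker pattern that Istio applies at the mesh level: after repeated failures, calls are rejected outright for a cool-down period instead of piling onto a struggling dependency. The thresholds and class names are hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast in this way gives an overloaded service room to recover rather than letting retries and queued work push it further behind.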
The weekend passed peacefully, but as the Sales teams returned on Monday, the storm clouds gathered again. Complaints arose about the legacy Admin application, a lumbering beast infamous for inefficiency. Its dependence on MongoDB, still tethered to the on-prem lands, hindered its performance. While the Seller tribe labored to tame this unruly system, we in CloudSquad prepared for another attempt to bring MongoDB to the cloud.
On Wednesday night, we resumed the migration. This time, it unfolded without needing global downtime, but it was no easy feat. What was planned as a two-hour task stretched to four as the DocumentDB instance sizes strained under the weight of their task. Despite the delay, we prevailed, and MongoDB finally joined the ranks of the cloud.
The next morning began peacefully but quickly turned into a nightmare. What seemed like a minor hiccup escalated into the most daunting challenge of the entire migration. DocumentDB began reporting a relentless rise in connection counts. The numbers crept up slowly at first, but then they surged, overwhelming the system. Critical areas of the application started failing. Alarms blared, dashboards glowed red, and the weight of the issue settled heavily on all of us.
Determined, we chose to fight. Hours of intense investigation followed as we traced the issue through the labyrinthine depths of legacy code. At last, we uncovered the source: a long-forgotten flaw, dormant until awakened by the new environment. We struck back with a patch, deploying the fix and restoring stability. The connection surge abated, the applications steadied, and the migration remained intact. Though the cost was significant, it was far less than the alternative. We emerged from the ordeal with newfound knowledge: a critical metric for MongoDB and invaluable insights into the legacy systems that remained.
A Kingdom Transformed
The results of our journey have been transformative. Our production Kubernetes cluster now runs smoothly on EKS, using only 70% of the CPU that the on-prem cluster consumed. EKS also frees us from the daily debugging of filesystem or network malfunctions. Managed databases provide a wealth of default metrics, revealing critical insights like MongoDB connection counts that were previously hidden. With AWS’s managed services, we can prototype quickly using tools like DynamoDB, Lambda, and EventBridge, and we’ve begun exploring innovative architectures that were unimaginable before.
Our networking has also evolved. With Transit Gateway peering, traffic between Batdongsan and marketplace workloads can now flow securely within AWS, bypassing CloudFlare entirely. Legacy Windows servers now have logs automatically collected in CloudWatch, reducing dependence on manual access. Even developing new features is faster: small databases can be provisioned effortlessly for prototyping before final architectures are decided.
By the end of this journey, we had successfully transitioned all workloads except MSSQL databases to AWS. The trials we faced tested every skill and ounce of resolve we possessed. Yet, as we gazed toward the horizon, it was clear this was not the conclusion but the prologue to a new chapter. The cloud transformation had begun, and an era of opportunity awaited.
And so, Batdongsan’s journey to the cloud became a legend, a tale of transformation and triumph, proving that with unity and determination, even the most complex challenges can lead to a happily ever after.
Acknowledgements
The success of Batdongsan’s cloud migration is a testament to the incredible teamwork across all departments. The engineering teams showed exceptional dedication, handling critical migration tasks and addressing complex challenges. The QA, Security, and ITSP teams ensured rigorous testing, thorough reviews, and operational excellence. Product teams aligned their objectives to support the migration while maintaining innovation.
We extend our gratitude to the AWS team for their technical guidance, timely support on discussions and tickets, and collaborative spirit throughout the project. The program managers, finance and accounting team, and admin team also played crucial roles in managing budgets, coordination, and paperwork to keep everything running smoothly.