Providing affordable health insurance to all citizens of the United States through the Affordable Care Act (Obamacare) was one of the boldest moves of President Barack Obama's administration. The administration launched the website healthcare.gov on October 1, 2013; it was designed to handle 50,000 simultaneous transactions from people buying health insurance. Approximately seven million people were expected to buy insurance through the site in the first year, but the website crashed in its first week when 250,000 people tried to access the system. For the next three weeks, the site continued to have problems: only 30% of visitors could access the website, and fewer than 2% were able to purchase insurance. The crash made headlines in the media from October 1 through 17. This was a major disaster for the administration, and a setback for Congress and for the President personally.

The President and White House Chief of Staff Denis McDonough had two options: fix the website, or scrap it and rewrite the system. Although the initial estimate for the project was $93.7 million, the system had already consumed two years of development time and more than $400 million, roughly half of it on hardware and software. A rough estimate put the number of developers at around 500, at an assumed salary of $200K per developer per year.

On October 17, the White House brought in Jeff Zients to lead the effort to fix the website full-time. He hired a small team of engineers from Silicon Valley and the former Obama campaign. The technical lead was Mikey Dickerson, a Google reliability engineer who had previously worked on the Obama election campaign. The other team members were Jini Kim (formerly of Google), Ryan Panchadsaram, Todd Park, Paul Smith, and Andy Slavitt. On October 25, in a conference call with the media, Zients announced the plan to fix the website by the end of November.
Mikey Dickerson took charge of the project and created the ground rules:

  • There will be two 45-minute meetings every day: mornings at 10:00 AM and evenings at 6:30 PM. A conference line will stay open 24 hours a day where people can talk about the issues.
  • The meeting agenda is to solve problems, not point fingers. Everyone should be honest about the problems, even when they are responsible for creating them. In fact, one team member was applauded when he admitted that his code had caused an issue the previous day.
  • The person who knows the issue best should speak, not the one with the highest rank. Once the management red tape is gone, developers can concentrate on fixing the issue.
  • Prioritize the issues that can cause problems on the website in the next day or two.

The first thing the team realized was that they needed statistics about the system: How many people were logged in at a time, and what were their response times? What was the turnaround time for a simple click? Where was the bottleneck? So, within a day, the team built a dashboard to pinpoint the performance issues. Based on this information, the team created a cache for commonly requested data, improving performance fourfold.
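A cache of this kind can be very small. The actual healthcare.gov code is not public, so the sketch below is only an illustration of the idea: an in-memory cache with a time-to-live that serves common, rarely changing data (here, a hypothetical insurance plan lookup) from memory instead of hitting the database on every request.

```python
import time

class TTLCache:
    """Minimal in-memory cache with a time-to-live (illustrative sketch)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key, loader):
        """Return the cached value, or call loader() and cache its result."""
        entry = self._store.get(key)
        now = time.time()
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: skip the expensive load
        value = loader()
        self._store[key] = (value, now + self.ttl)
        return value

# Hypothetical expensive lookup, standing in for a database query.
calls = 0
def load_plans():
    global calls
    calls += 1
    return ["bronze", "silver", "gold"]

cache = TTLCache(ttl_seconds=60)
first = cache.get("plans", load_plans)
second = cache.get("plans", load_plans)  # served from memory; loader not called again
```

Every request after the first within the TTL window skips the expensive load entirely, which is how a cache on common data can multiply throughput without touching the rest of the code.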

The term dashboard comes from automobiles, which have a control panel in front of the driver. In IT systems, a dashboard is usually a single screen that displays the health of the system in graphical form. For example, the STARS system has a dashboard, used by its users, that displays the status of the trades interfacing with external systems. Its technology team, however, has no dashboard showing the current status of the system itself. Every system should have a dashboard that can display its health.
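Behind any such dashboard sits a handful of numbers distilled from request logs. As a minimal sketch (the metric names, input format, and percentile choice here are invented for illustration, not taken from any real system), a health snapshot might be computed like this:

```python
def health_snapshot(requests):
    """Summarize request logs into the few numbers a dashboard would display.

    `requests` is a list of (latency_seconds, succeeded) tuples.
    """
    total = len(requests)
    ok = sum(1 for _, succeeded in requests if succeeded)
    latencies = sorted(lat for lat, _ in requests)
    # 95th-percentile latency: the worst response time most users will see.
    p95 = latencies[min(total - 1, int(0.95 * total))]
    return {
        "requests": total,
        "availability_pct": round(100.0 * ok / total, 1),
        "p95_latency_s": p95,
    }

# Three fast successes and one slow failure.
snapshot = health_snapshot([(0.4, True), (0.6, True), (8.0, False), (0.5, True)])
```

Rendering these three numbers on one screen, refreshed continuously, is essentially what a system-health dashboard does; the hard part is collecting the logs, not the arithmetic.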

Corporations prefer planned releases to fix the issues in a system. If you have used Microsoft products like Windows or Office and ever downloaded a patch, you know that the system needs to be restarted after the patch is installed. The system accepts the fix only after a restart.

The other, quicker option is a hotfix, a term used for changes that a system can incorporate while it is running, just as relational databases can accept changes without being shut down. Since healthcare.gov ran 24x7, many hotfixes were deployed. Hotfixes allowed the team to fix defects immediately, sometimes even without testing; if a hotfix did not work, a new one was applied as soon as possible. This approach is not advisable for a system that is already working well: a hotfix that breaks existing functionality can put the business in trouble.
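One common mechanism that makes hotfixes possible is routing requests through a registry of handlers that can be swapped while the process keeps running. This is a simplified sketch of the idea only, not how healthcare.gov was built; the handler name and validation rule are hypothetical:

```python
# Hypothetical sketch: handlers registered by name can be replaced at runtime,
# so a defect is fixed without restarting the running system.
handlers = {
    # Buggy version: wrongly rejects applicants who report zero income.
    "validate_income": lambda income: income > 0,
}

def handle(name, *args):
    """Dispatch a request to whatever handler is currently registered."""
    return handlers[name](*args)

before = handle("validate_income", 0)  # buggy handler rejects a valid application

# The "hotfix": swap in a corrected handler while the system keeps serving.
handlers["validate_income"] = lambda income: income >= 0

after = handle("validate_income", 0)   # the same request now succeeds
```

The trade-off the text describes is visible here: nothing forces the replacement handler to be tested before it starts serving live traffic, which is exactly why hotfixes are reserved for systems that are already broken.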

On December 1, 2013, Zients published the progress of the website on the U.S. Department of Health & Human Services website and informed the media that it was working. The team had fixed more than 400 defects and added a dashboard that let them monitor the website's performance in real time. User response time dropped from eight seconds to less than a second through hardware upgrades and code changes. System availability rose to more than 95%, up from the earlier 43%.

The beauty of the healthcare.gov story is not in the details of the problems, as portrayed by the media, but in how they were fixed and in what we can learn from them to make our own systems more efficient and faster. The episode is uncannily similar to project team meetings where issues are shouted out but solutions are not. Unfortunately, once a problem is solved, no one talks about the fix.

An important lesson from healthcare.gov concerns ownership of issues. First, user acceptance testing was done in less than two weeks, which is less than 2% of the total development time, and the government officials responsible for testing never actually took ownership of the issues. Second, the development team was large and, again, had no central ownership. Zients' team supplied that ownership by pioneering a culture of "identify the problem, solve the problem, and try again." Third, the success was due to the team itself: most of its members dropped everything they were doing and rushed to rescue the website. The team toiled for two months mainly to achieve its goal, not for a financial incentive.

A lot is going on among systems, businesses, and technology. A system is the collective intelligence of a corporation. Although SDLC methodologies lay down common rules for creating systems, the dynamics of a project change almost every day. These dynamics need to be prioritized by two rules: the business wants to make money, and users want to go home. No single SDLC process fits every project. For example, systems with very few screens, like the Google search website, do not need detailed prototyping; all the action is behind the screens. Trading systems, by contrast, require more prototyping of the screens because of fast data entry. Fit the SDLC to the project, or use a combination of different processes. Moreover, leave respect, courage, diversity, team building, and motivation politics to the company and to human resources, who are adept at dealing with such issues. Instead of hiring more developers, automate the process and the code so that fewer developers can manage the system. Do your job with fewer resources and less money, and the business will love you. There are no shortcuts to creating the best system in the world. The only way to succeed is to weave intelligence into the code, line by line. Choose the company's business over the project, and the project over the process.

About the author:
Raj Badal is a technologist, author, and designer. He is a technology consultant at one of the largest cable companies. The content and art in this article are taken from his book, Systems: Brains of Corporations, available on Amazon, Barnes & Noble, and Kobo.
© 2016 – 2017 Raj Badal