On April 30th this year, we experienced a service outage for about an hour in the afternoon. We restored our service around 5:20pm Eastern Time.
The cause of this outage was an AWS instance that stopped running: normally, not a problem that should lead to a long outage. Unfortunately, it left a process in our monitoring system stuck on a kernel lock that it couldn’t time out, and hence no alerts were delivered to our staff.
The system was in a “zombie” state where it showed as running in the AWS management console, but consistently failed its AWS health-checks, even across reboots. In this case, the only way we’ve found to recover such an instance is to “stop” it and “start” it again. On boot, there was nothing significant in the instance’s system logs to indicate an obvious problem. One moment it was running and the next it wasn’t. Our theory, and unfortunately it’s still just a theory, is that some bit of AWS low-level networking fabric goes AWOL. We see this DOA situation on a regular basis with our build worker pool, where we start and stop instances with much higher frequency, and we’ve come to accept it as a cost of doing business in AWS. Nodes fail, and the system must survive. Still, we’re all ears for a more informed diagnosis.
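For reference, here’s a minimal sketch of that stop/start recovery using boto3, the AWS SDK for Python; the region, instance ID, and health-check handling are placeholders rather than our actual tooling.

```python
# Sketch: recover a "zombie" EC2 instance by stopping and starting it.
# Assumptions: boto3 credentials are configured; INSTANCE_ID and REGION
# are placeholders; a real recovery path would also check system logs.
import boto3

REGION = "us-east-1"                    # placeholder
INSTANCE_ID = "i-0123456789abcdef0"     # placeholder

ec2 = boto3.client("ec2", region_name=REGION)

# Ask for status even if the instance isn't in the "running" state.
status = ec2.describe_instance_status(
    InstanceIds=[INSTANCE_ID], IncludeAllInstances=True
)["InstanceStatuses"][0]

system_ok = status["SystemStatus"]["Status"] == "ok"
instance_ok = status["InstanceStatus"]["Status"] == "ok"

if not (system_ok and instance_ok):
    # A reboot keeps the instance on the same (possibly broken) host;
    # a stop/start cycle lands it on fresh hardware.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```

Note that stop/start releases the instance’s public IP unless an Elastic IP is attached, so any recovery automation along these lines needs to account for that.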
Over the weekend following the outage, we started rolling out more changes to our system architecture to handle node failures following best practices. We’ve been stably running on a multi-zone replicated database infrastructure for months, and it’s served us well and brought us peace of mind. However, we have not yet eliminated all of the single points of failure in our main app and build execution infrastructure.
Now onto the upgrades that started on 5/4 and are continuing through May and June, including the scheduled downtime this weekend (5/19).
First, we’ve introduced a new load balancer to pave the way for truly HA web and app serving. Second, we’ve upgraded our monitoring tools to properly survive hard networking failures. Next, we’re about to release a new, automatically updating status page based on the awesome stashboard project.
There are still a few single points of failure in our system, and we’re busily working to eliminate them in changes targeted for backend releases toward the end of May.
Since the system changes, you may have noticed that the Tddium app is much snappier (it’s got a lot more horsepower on tap now), but you may also have seen a few builds terminate or found the app unresponsive for a few minutes here and there. A few parameters left over from our old configuration, which sized our web serving pool against a backend processing pool, led to an interprocess deadlock that took out web server procs one by one. We’re fixing those parameters, and rolling out software changes to avoid the situation entirely. The silver lining is that our new monitoring systems have been responsive and timely.
Over the next few weeks, if you have any feedback about a difference in your experience, good or bad, please let us know via our support page at http://support.tddium.com/
We’re planning a short downtime tonight, 4/30/2013 at 10pm PT and a longer maintenance window over the weekend (5/4 from 5-8pm PT) to address some of the root causes of our recent downtime.
We’ll update this blog post with a much more detailed post-mortem, but the 6 word summary: an AWS instance went to lunch. That got a process in our monitoring system stuck on a kernel lock that it couldn’t time out.
During the longer maintenance window over the weekend, we’ll be upgrading our monitoring system, and, further, rolling out a much needed upgrade to our app-server architecture to let us handle these failures much more gracefully.
As we launch full Python support for our CI solution Tddium here at Solano Labs, we are happy to see more opportunities to collaborate with the Python community and contribute back. Last week at the PyCon sprints, I got a chance to sit down with Maciej Fijałkowski (@fijall), Armin Rigo, and the PyPy crew to make some improvements to the test system for PyPy, a project that takes its tests very seriously.
By running the PyPy tests in Tddium, we were able to complete the 5-hour test suite in about 15 minutes, vs. about 28 minutes on PyPy’s existing manually-set-up parallel test system. Tddium requires no maintenance, and the UI is getting good reviews too.
Meanwhile, working with Armin I fixed an issue that had caused some tests to abort on common Linux configurations. The tests had mapped a 20-gigabyte chunk of virtual address space so they could spread objects across it, to help flush out any 32-vs-64-bit bugs. If VM overcommit limits haven’t been disabled on the machine, then just a few of those running in parallel are enough to cause fork() or malloc() to fail! I modified the test setup to keep only a small fraction of the 20 gigabytes committed, and the tests now run great with or without overcommit.
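The actual change lives in PyPy’s test harness, but the shape of the fix can be sketched in a few lines of Python with ctypes: reserve the large span with PROT_NONE (which carries no commit charge), then commit only the small slice the test actually touches. The constants below are the usual Linux x86-64 values and the sizes are illustrative.

```python
# Sketch of the idea behind the fix (not PyPy's actual test code):
# reserve a huge span of address space with PROT_NONE, which is not
# charged against the overcommit limit, then commit only the small
# piece that will actually be written to.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]

PROT_NONE, PROT_READ, PROT_WRITE = 0x0, 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS, MAP_NORESERVE = 0x02, 0x20, 0x4000
MAP_FAILED = ctypes.c_void_p(-1).value

RESERVE = 20 << 30    # 20 GB of address space reserved...
COMMIT = 16 << 20     # ...but only 16 MB of it committed

# Reserve: PROT_NONE + MAP_NORESERVE costs no commit charge, so many
# parallel test workers can each hold a reservation like this.
base = libc.mmap(None, RESERVE, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)
assert base != MAP_FAILED, ctypes.get_errno()

# Commit just the slice the test will actually touch.
assert libc.mprotect(base, COMMIT, PROT_READ | PROT_WRITE) == 0

# Safe to scatter test objects within the committed slice.
ctypes.memset(base, 0xAB, COMMIT)
```

With address space merely reserved, each parallel worker charges only its committed slice against the overcommit limit, so the suite behaves the same whether or not the machine’s overcommit heuristics are in effect.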
We are excited to further contribute to PyPy’s growth — follow PyPy on Twitter @pypyproject to see the latest development and news. We’d love to hear your thoughts in the comment section!
Follow Greg Price on Twitter: @gnprice
* with apologies to Stanley Kubrick
The image of the head-down coder hacking away Tasmanian Devil-like, paying – at best – lip service to writing tests is a thankfully less and less accurate cliché these days. But that wasn’t always the case. These guys (and girls) used to be everywhere. They didn’t get into development to do testing! Look, the code works! Job done. Next feature, bring it on!
Yes, once upon a time the “who needs automated tests” dinosaurs ruled the earth. And I was one of them.
I began my web development career in a large IT consultancy. We had testers, and test managers, and elaborate test plans. They would catch the bugs. It wasn’t a developer’s job to test.
Then I left to co-found my first (bootstrapped) startup, as the sole developer, with a single co-founder. And in Bootstrapped-Startup-Land you don’t have testers, and test managers, and test plans. There is just you. And your code.
And every line of code you write, someone, somewhere in another startup is also writing a line of code, and they might have the same idea as you. And they’re going to launch their feature before yours. And they’re going to beat you.
So each line of code is precious. And you don’t want to “waste” it on a test.
So I still wasn’t interested in testing.
When each feature was completed I manually tested it, then my co-founder tested it. Then I fixed any bugs. Then we both tested it again. Then, when it worked, I moved on to the next feature.
Then things that had worked would break. So I’d go back and make them work, then move on. Then, later, they broke again. A mantra began in my head, quietly at first, then with increasing volume: “Write some tests, write some tests…” But where to find the time with all this bug fixing to do….
Yes, that way madness lies.
Unit-ed we stand
So I began writing tests. This was a baptism of fire as, of course, there was now a large backlog of untested functionality to tackle. But I gritted my teeth, girded my loins (whatever that involves…), dove in and began writing unit tests. And, you know, a funny thing happened…
Slowly, very slowly, assertion by assertion, I learned to love testing.
I began to take actual pleasure (pleasure! imagine that!) in crafting a test then seeing the little green icon in my IDE ping to life when it passed. Knowing that my new feature was fit to go live. And more importantly, that I hadn’t broken something else in the process.
So, this was job done, right? Sure, this didn’t test the front-end. But that’s what humans are for, right?
The extra confidence, and speed, I gained from the unit tests simply seemed to manifest itself in more front-end bugs. Dammit!
Then I discovered Selenium.
This was Selenium 1 (aka Selenium RC) so not the most stable and robust framework ever. We would often see a test fail then pass immediately after without any change to the code or data in between. Hmmm……
Even so, automating browser tests seemed like magic. It was mesmerising to watch the tests running. A never-tiring invisible hand filling in form fields and clicking buttons. Ok, maybe I’m just easily mesmerised.
It was fortunate that I found watching the tests so entertaining, because boy were they s-l-o-w. A full run would take around 90 minutes. It also sent the CPU fan on my laptop crazy and made using it for any other purpose at the same time a painful ordeal.
The upshot of this was that I didn’t run the Selenium tests very often. Which in turn meant that they grew more and more out of date. Which in turn meant that I was even less likely to run them….
So we ended up falling back on manual browser testing again. Doh!
But, wait, what’s that sound? Enter stage left, our hero on a white horse….. it’s Saucelabs!
Yes, I remember distinctly the day I came across the Saucelabs website. I instantly Skyped my co-founder “Praise the Lord!” I exclaimed, or words to that effect. “Selenium is re-born!”
And for us it really was.
After a very small amount of painless integration, there they were, our browser tests running in the cloud. Sweet!
The test sweet…ahem suite still took a long time to run but it was now fire and forget. Just kick off the tests and get on with my normal business.
Although I did kinda miss being able to fry eggs on my Macbook when the tests were running locally. Those were damn good eggs!
Almost as big a deal as being able to run our tests in the cloud was that we now had a complete record of every Selenium test run, including a video of the test running!
Add to this the ability to do ad-hoc cross browser/OS testing with Sauce Launcher, and test against our local dev build with Sauce Connect and it’s fair to say we were pretty ecstatic!
This was by far the best situation we’d been in test-wise and the quality of the code showed it.
But, you know, I’m hard to please.
My two biggest remaining niggles were the speed of the Selenium test suite execution, and the fact that our unit and Selenium test results weren’t integrated.
I’d come to accept these limitations, until…
What’s this, the rumble of horses hooves again? Here comes the second hero of our piece, Tddium, charging into the fray!
Tddium’s claim to completely lift the testing burden – unit and Selenium – into the cloud, and run both in parallel, blew me away. To the extent that I was dubious as to how well it would work.
The answer…. very well indeed!
Now add in Tddium’s Github integration and Continuous Integration support and…. wow… just….wow.
I am currently running 8 Tddium workers in parallel and the runtime for my complete suite of unit and Selenium tests is down from around 2 hours to 15 minutes.
This has been a game-changer in my development routine. I’m now much more ready to take risks and try stuff, knowing I can get such quick and comprehensive test feedback.
So yes, my conversion is now complete. From throwing code over the wall to ‘those tester people’, to being forced by necessity to slowly embrace automated testing, to now realising the full potential of automation with Sauce and Tddium, it’s been quite a ride. And it’s not finished yet…
I still know that someone, somewhere in another startup is still writing that line of code, and they might have the same idea as me. But now I’m not worried that they’re going to launch their feature before us. And they’re not going to beat us.
Unless….they’re using Sauce and Tddium too….
The New Year is already off to a great start here at Solano Labs with new features and product upgrades getting ready to roll out. With the start of the New Year we also decided to take a look back at the year that was and ask as a company “What have we learned?” and “What should our New Year’s resolutions be?”.
2012 saw some high profile successes and failures in the world of software. Some mistakes went unnoticed but others were front-page news. Many cost time and money and a few even destroyed entire companies! However small or large the screw up, there was a common thread… in hindsight, these defects could have been identified earlier and prevented from reaching users with more automated validation!
What follows are a few of the bugs and outages that we found most interesting. For each story, one of our engineers shared his thoughts on the matter. Many of us got sidetracked in the process of researching the outages as they often offer a fascinating look inside the affected businesses. A good jumping off point for your own exploration of software screw-ups in 2012 is the ChannelBiz Top Ten List for 2012 Software Blunders — you can check it out here.
A bug in a newly rolled-out load-balancing software update caused the system to misinterpret which data centers were unavailable. The result was an 18-minute outage in which 8% to 40% of Gmail users were affected by slow performance, timeouts, or errors.
Nobody would even think of pushing new code without testing it first, and probably also doing a staged rollout to catch bugs that only show up in production. But configuration files and other small data aren’t usually given the same consideration, even though bugs in them can have just as devastating consequences. The small size of the data and ease of checking it by eye can give you a false sense of security.
Two practices can help mitigate risks. First, create a verifier for your complex configuration files that checks syntactic correctness and, more importantly, presents the user with a semantic delta from the previously deployed version. This may be difficult depending on the meaning of your configuration, but even a very rough attempt will make unintended changes easy to catch. Second, stage deployment of configuration changes just like code changes and carefully monitor instances that have the new configuration for unexpected behavior.
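As a simplified sketch of that first practice, assuming JSON configuration files: parse the deployed and candidate configs, fail loudly on syntax errors, and print a field-by-field delta so a reviewer sees exactly what is about to change.

```python
# Minimal sketch of a config verifier: validate syntax, then show a
# semantic (field-by-field) delta against the currently deployed config.
# Assumes JSON configs; real configs may also warrant schema checks.
import json
import sys

def load(path):
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError) as e:
        sys.exit(f"{path}: invalid config: {e}")

def flatten(d, prefix=""):
    """Flatten nested dicts into dotted keys for easy comparison."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

def delta(deployed, candidate):
    old, new = flatten(deployed), flatten(candidate)
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            print(f"- {key} = {old[key]!r}")
        elif key not in old:
            print(f"+ {key} = {new[key]!r}")
        elif old[key] != new[key]:
            print(f"~ {key}: {old[key]!r} -> {new[key]!r}")

if __name__ == "__main__":
    # Usage: verify_config.py deployed.json candidate.json
    delta(load(sys.argv[1]), load(sys.argv[2]))
```

Even a rough delta like this makes an unintended one-line change stand out in review before it reaches production.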
Errors in Nasdaq’s computer systems caused delays and mishandling of orders during the start of the Facebook IPO.
Web 2.0 companies usually treat past failures as bygones — the bugs of yesterday are replaced by today’s hot new successes. Facebook wasn’t so lucky, as software glitches in Nasdaq’s touted OMX trading platform “engulfed” its IPO and created confusion that investors haven’t yet forgiven. The series of errors in May cost Facebook’s underwriters an estimated $115 million. According to analysts, Nasdaq’s servers slowed down under heavy load from a 3ms response time to 5ms, and failed to establish an opening price. This caused a 2+ hour window where trades languished unconfirmed – “going into a black hole” – or were lost entirely because of a data rollback. You may think you understand your app’s hotspots, but when all 1 billion of Facebook’s enthusiastic users decide today is the day they want to use your trading platform, you may well be proven wrong. I guess that’s a success problem? The lesson – anticipate and test for extreme load conditions before the thundering herd arrives.
After a Ruby on Rails bug report was ignored, software developer Egor Homakov used the bug to hack into Github! Although unlawful, it brought necessary attention to the bug.
In March, a Github user exploited a mass assignment vulnerability in Github to add new authorized public keys to the Ruby on Rails account and push a change as a proof of concept. Github reacted swiftly to close the security vulnerability and publicized the details in this blog post. Github is to be commended both for closing the hole quickly and for providing a detailed description of the problem. Github is a high-profile website, particularly in the Ruby community, so the incident also brought a prevalent problem to the community’s attention.
The Ruby on Rails approach to using convention over configuration is a large part of what has made it a popular platform for building new web applications. Convention is a powerful way to promote collaboration and productivity within a software development organization – but it can also lead to severe bugs if programmers aren’t cognizant of the implications of the conventions. In the case of Rails, mass assignment makes it easy to map form input data sent by a user’s web browser into a convenient data representation for the application. When combined with an ORM such as ActiveRecord, however, it is all too easy for unsafe updates to slip into the database, for instance updating an access control list, and granting a user unwarranted privileges. Careful validation of user-supplied data is necessary for security and often requires more domain knowledge than simple convention provides.
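In Rails the guardrails for this are mechanisms like attr_accessible and, later, strong parameters; the underlying principle is language-agnostic, and a rough sketch of it in Python looks like this (the field names and model class are made up for illustration):

```python
# Language-agnostic sketch of the fix: never pass raw request params
# straight to a model update; copy over only an explicit whitelist.
# (In Rails this is what attr_accessible / strong parameters enforce.)

ALLOWED_USER_FIELDS = {"name", "email", "bio"}  # note: no "admin", no "public_keys"

def safe_update(user, params):
    """Apply only whitelisted, user-editable fields to the model object."""
    for field in params.keys() & ALLOWED_USER_FIELDS:
        setattr(user, field, params[field])
    return user

class User:  # stand-in for an ORM model
    pass

# A malicious form post trying to escalate privileges...
params = {"name": "mallory", "admin": True, "public_keys": ["ssh-rsa AAAA..."]}

user = safe_update(User(), params)
assert not hasattr(user, "admin")        # the extra fields were dropped
assert not hasattr(user, "public_keys")
```

Whatever the framework, the whitelist belongs close to the model, so that every code path accepting user input has to go through it.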
An overload at the call centers created heavy confusion when changes went live this year.
United Airlines and Continental Airlines merged in 2010 to create one of the world’s largest airlines. Representatives of the combined entity extolled the virtues of the merger — greater ‘reach’ and convenience for customers, a more efficient business which would be positive for investors, and more. Now, merging two such large organizations takes time & effort, and sub-par planning can have a large negative impact. This has been very evident to the airline’s customers. Due to problems in the merged reservation system, many have suffered through flight delays and long call center wait times. The airline apparently did a thorough job in merging the data from their two separate reservation systems (called Apollo and SHARES) into a single one (SHARES, which was chosen as the sole successor). However, they seemed to have been less detail-oriented in their load testing and UI testing efforts. On Mar 3, 2012, SHARES went live as the sole reservation system for United.
Unfortunately, roughly half of the airline’s employees who used the system — at ticket counters, at airline gates, at call centers — were unfamiliar with the interface, which caused delays in boarding and flight departures & arrivals. This in turn caused a 30% increase in calls to the call centers. The airline had planned for no more than a 10% increase. Queue times for customer calls jumped by 120%. Perhaps if the airline had planned and tested for a more extreme increase in traffic, SHARES could have more gracefully handled the increased load. The lesson is similar to that from NASDAQ’s problems with the Facebook IPO — one should certainly test for the thundering herd (apologies to Merrill Lynch for co-opting their tag line) and be extra paranoid when considering the possible size of the herd. Better to think too big than too small!!
A $440+ million mistake occurred when a bug in a recently updated piece of trading software was let loose on the market for over 40 minutes. This eventually caused the collapse of the institutional financial giant.
Knight Trading is one of a number of brokerage houses that act as market makers and provide trade execution to other brokers on Wall Street. Knight specializes in small cap equities and is responsible for roughly 10% of US equities volume. On August 1st, an error in Knight’s trading platform resulted in a $457MM loss for the firm, threatening its viability as a going concern. Core infrastructure in Knight’s trading platform — not fancy quantitative trading algorithms — was responsible for wild swings in the prices of roughly 150 equities traded on the NYSE. Apparently disused portions of Knight’s infrastructure were still running old, incompatible versions of their proprietary software. Talk about integration testing pain! Although conceptually simple, keeping accurate, auditable track of deployed software versions and testing and tracking compatibility in a large deployment infrastructure is no simple task. Tracking and testing multiple versions of a software stack that may be deployed at the same time — intentionally during a rolling upgrade or inadvertently — can have serious real-world consequences for a business. Reportedly the direct trading loss was in excess of $200MM, and Knight was required to pay a massive additional 5% risk premium to Goldman Sachs to exit the position, costing them a further $230+MM.
We can all sympathize with and respect the efforts of engineering and ops teams worldwide to keep the computer systems we rely on running. We here at Solano Labs are no strangers to critical bugs and performance fire drills – in our current work (Tddium had an unexpected downtime in late December 2012) and over our years of experience. But we can all learn from these mistakes, and apply the lessons to our own practice of building, testing and releasing software. Some common themes emerge:
- Small-scale correctness is necessary, but not sufficient. Poor performance under load can turn quickly into incorrect behavior. Especially in a distributed system, retroactive fixes can be too late.
- Similarly, retrofitting security is risky. Nonetheless, plan for security as a war, not a battle. New vulnerabilities and attackers will arise, and it’s critical to be responsive (the Rails core team has done a great job of this!) and to be self-aware about the risks to your business and open source communities.
- Be bullish about traffic growth unless your system has a natural rate limit. Planning a large launch or an event with a large audience? What happens if they all show up? Even for a closed system, getting scaling right can be a multi-month proposition, so start now!
- Configuration and deployment processes are just as crucial as code – make sure they are tested and validated with the same rigor.
As 2013 innovation starts, let’s raise a toast to learning our lessons and to our new year’s resolution: keeping those bugs where they belong! We’re looking forward to another year of helping our customers build great software, and to an open discussion of ways we can make that easier. We’d love to hear your thoughts.
Happy New Year to all from the Team at Solano Labs!
A serious security vulnerability in all released versions of Rails was announced on the Ruby on Rails Security list on January second. You can read more about the details in the original post here and follow the CVE case here. The short version is that all extant versions of the ActiveRecord ORM were vulnerable to an SQL injection attack.
Fortunately, we eat our own dog food at Solano Labs, so upgrading, testing, and deploying the patched version of the software stack was straightforward. In fact, it was less than 45 minutes from the time the vulnerability alert was first mentioned in our internal chat to the time when the update was tested in Tddium and the deployments started going out. We immediately patched not only our production system, but also our staging environments and the few pieces of infrastructure that are also Rails applications. Continuous Integration and Delivery made for a quick, high-assurance turnaround on a critical update. Little Bobby Tables, eat your heart out!
On Wednesday 2012 Dec 19, the main Tddium web service experienced an outage for about 4 hours when our primary database crashed. We had been preparing a warm-standby DB replica for production deployment in January. We were able to use it to recover completely. We are now live with a high-availability architecture in our primary datacenter.
The outage involved two distinct service interruptions. The first lasted around 9 minutes, from 2012 Dec 20 0028 UTC to 0037 UTC (7:28pm ET on Dec 19). It was triggered by a runaway query on our DB master node. Our staff received an IO utilization alert, killed the query and the controlling process, and restored operations. The IO utilization alert masked the alert for another problem: the DB archive volume on our master node was full.
At 0113 UTC, both the primary and archive volumes on our DB master node filled up and caused the DB master to fail. We initiated a snapshot restore procedure, and data movement completed at around 0337 UTC. At this point, we found that incremental backups following the snapshot were corrupt. At 0417 UTC, we prepared our DB standby node for promotion to master, executed that promotion around 0435 UTC, and restored full service at 0445 UTC.
We can now return to full service within 30 minutes of a single-server catastrophic DB failure. Our goal is to survive major datacenter outages with no more than 10 minutes of downtime. We’ll keep you posted as we build to that target.
As always, don’t hesitate to reach out to firstname.lastname@example.org if you have any questions.
The Solano Labs Team