As most of our users are by now no doubt aware, on April 7th a serious vulnerability was announced in recent versions of OpenSSL. Dubbed Heartbleed, CVE-2014-0160 allows a remote attacker to read potentially sensitive data on the server. This vulnerability has had a widespread impact on many providers. We take security and the trust our customers place in us extremely seriously, and so we wanted to take this opportunity to explain the steps we have taken over the last few days to address Heartbleed here at Solano.
Our incident response began immediately upon the release of CVE-2014-0160. We do use SSL/TLS to secure communications between our customers and the service and between components of the service, and we do use the OpenSSL implementation. The response team began by upgrading all parts of our infrastructure, including the front-end website, the core API and control plane, database servers, test environments, and ancillary services (issue trackers, workstations, and so on). We then scheduled downtime for the evening of April 8 to replace all of our certificates and revoke the previous certificate. We are not aware of any compromise of the old certificate, but given the severity of CVE-2014-0160 we believe it is best practice in this case to re-key all servers.
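For readers doing the same on their own servers, re-keying means generating a brand-new private key and issuing a fresh CSR from it, never reusing the old key material. A minimal sketch with the openssl CLI (the filenames and hostname are illustrative, not our actual configuration):

```shell
# Generate a fresh private key; the old key is assumed compromised and discarded.
openssl genrsa -out server.key 2048 2>/dev/null

# Issue a certificate signing request from the new key for your CA to sign.
openssl req -new -key server.key -out server.csr -subj "/CN=example.com"

# Sanity-check the CSR's signature before sending it off.
openssl req -in server.csr -noout -verify && echo "CSR OK"
```

Once the CA returns the signed certificate, install it alongside the new key and ask the CA to revoke the old certificate.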
We continue to monitor the situation and strongly recommend that all users change their authentication tokens not only on Solano services but also with any other providers that they may use.
- All of our infrastructure was patched by 8pm PT on 4/7.
- Fresh, re-keyed certificates were installed across our infrastructure by 11pm PT on 4/8.
- All logged-in sessions were invalidated and reset after the infrastructure updates.
We do use Amazon Web Services to host much of our infrastructure but do not use AWS Elastic Load Balancers (ELB) to terminate SSL. For AWS-specific information, we recommend reading Amazon’s detailed security advisories.
If you have any questions or concerns that are not addressed here, please contact us at email@example.com.
The Solano CI integration with GitHub uses OAuth for authentication. Today we have rolled out the ability to set the credentials used to post GitHub status on a per-repository basis. To configure an alternate set of credentials for a repository, go to the GitHub Status menu item on the repo configuration page (click on the gear icon in the dashboard). You can then select from the list of users that have linked their accounts with GitHub via OAuth or enter a personal OAuth token.
One of the useful features of GitHub OAuth tokens is that they can be used to authenticate command-line tools. In addition to using curl to access the full GitHub API, you can use OAuth tokens to authenticate git over HTTPS or to download files from the command line. The GitHub OAuth token is made available as an automatically managed config variable: GITHUB_OAUTH_TOKEN. This config variable is exported to the build environment as an environment variable, where it can be used by setup hooks and post-build hooks. For instance, you can use it to generate an authenticated URL to download custom esearch plugins from a private repository, or to authenticate git pull or git push as part of setup and teardown.
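A setup hook might use the token along these lines (the repository path is hypothetical, and we stub the token so the snippet runs outside a Solano build; the live network calls are shown commented out):

```shell
# Solano CI exports GITHUB_OAUTH_TOKEN automatically; stub it here for illustration.
GITHUB_OAUTH_TOKEN="${GITHUB_OAUTH_TOKEN:-0123456789abcdef}"

# Authenticated GitHub API call with curl:
# curl -H "Authorization: token $GITHUB_OAUTH_TOKEN" https://api.github.com/user

# Build an authenticated HTTPS URL to pull or push a private repository:
AUTH_URL="https://${GITHUB_OAUTH_TOKEN}@github.com/example-org/private-repo.git"
echo "$AUTH_URL"
# git pull "$AUTH_URL" master   # e.g. in a setup hook
```

Because the token is a plain environment variable, any tool your hooks invoke can read it; avoid echoing it into build logs in real use.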
Around 7am PT, one of our app server nodes (and, alas, also our primary Redis server) started exhibiting average network (ping) latency of several tens of milliseconds, spiking to over 100ms, to our DB master and other nodes in the cluster.
We removed the app server from service and failed over to replicas as of 10am PT.
We are bringing on additional capacity to service the backlog as quickly as we can.
We’ll update this page as we make more progress.
Update: we are in communication with our infrastructure provider to get more information on the root cause of this situation.
Update 5:15pm: we have restored capacity and mostly drained the backlog of builds. We are continuing to work with our infrastructure provider to understand the root cause.
There have been several recent service interruptions that have made the experience of using Tddium fall below our high standards. We at Solano Labs sincerely apologize for these issues. We'd like to take a few minutes to explain the incidents and describe our short- and long-term mitigation strategies.
2/24 tddium.com Domains Unresolvable
Sometime before 8:30am PT on Monday 2/24, our domain registrar, name.com, deleted our DNS glue records. We had received no notice of this change, and discovered it only through DNS investigation.
- Route53 nameservers up and running – Check.
- SOA Records in place – Check.
- dig @184.108.40.206 tddium.com – Failure. How could that have happened?
As DNS caches expired, tddium.com domain names became unresolvable, and our hosts therefore became effectively unreachable. Both our custom monitoring infrastructure and our off-site pings (from New Relic) still had cached DNS entries and reported no errors. Many users and build workers were happily compiling and testing.
We replaced the registrar configuration at 9:55 am, returning service to operation.
After identifying the missing configuration, we emailed name.com support for help understanding what happened. Their response, after 36 hours, was that our domain was "disabled for using our name servers for URL forwarding DDoS attack". That's funny: we have been using Route53's DNS servers since early 2013. We were, however, still using name.com's URL forwarding hosts to route *.tddium.com to www, and we discovered later in the day on 2/24 that our forwarding entries had also been reset to point to a stock parking site hosting ads. We have since switched all redirection to use Route53 and S3 buckets. Our requests for further information from name.com were politely deflected.
Unfortunately, there’s no good way to have a “backup” domain registration, but we plan to switch registrars very soon to a company that will better respect our business interests.
3/10 Elevated Error Rates
Many of our users over the past few months have noticed “waiting for available socket” slowdowns in Chrome. These issues were due to Chrome’s single-host connection limits in the face of long-polling. We’ve been slowly migrating our live-updating UI from polling, to push over long-poll, and finally to WebSockets, which would resolve all of those available socket delays. The last component in the WebSockets conversion was rolling out a web front-end that natively supported WebSockets – specifically, switching from Apache to nginx. After a few weeks of soak testing that convinced us that the switch could be done seamlessly, we began the production nginx rollout on Wednesday 3/5, and it held up firmly under peak load on Thursday 3/6 and Friday 3/7. We declared success.
Monday 3/10, our alerts lit up.
It was immediately obvious from New Relic reporting that something was seriously wrong and that 3/10 traffic was seriously different from Thursday 3/6. Monday is not normally a peak traffic day for us, but for some reason we were seeing huge front-end queue delays and big spikes of 503 errors. These unfortunately manifested to our users as errored builds. We eventually traced it to a combination of nginx and Passenger configuration settings. We prepared a new configuration, ran high-volume load tests of it overnight, and service quality has since returned to more acceptable levels. We are still seeing intermittent recurrences, and we continue to tune.
3/11 GitHub Gets DDoSed
Tuesday afternoon, GitHub experienced major connectivity issues, which they have announced were due to a DDoS attack. We are hugely thankful for, and respectful of, their quick response under fire. On our servers, we had a lingering shadow of git processes stuck on dead TCP connections, and webhooks and commit status updates that never made it to their rightful destinations. We cleaned these up as they happened over the course of the afternoon, and we finally had clean queues around 9pm PT.
Conclusions and Next Steps
Our next steps involve hardening our DNS infrastructure, completing our tuning of nginx, and productizing our admin UIs for displaying the external webhooks we’ve received for easier debugging. We continue to develop our internal monitoring systems and we’re scoping out production canary servers for updates to low-level infrastructure components like nginx.
We’d like to specially thank our partner New Relic for the invaluable insight their monitoring has provided us in debugging these interruptions.
We strive to provide a stable, trustworthy platform on which our customers can build and test great software, faster than ever, and we want to thank all of you for your patience and understanding while we weather these service interruptions. We will push to improve wherever we can, and we welcome your feedback at firstname.lastname@example.org.
The Solano Labs Team
We are extremely happy to announce the launch of an online community blog, based on the interest we have received from our first three Automated Testing Meetup groups. The blog is called AutoTestCentral.com, "Where people who write and test software come to talk about automation." We are very excited to grow and support this community!
Here we will post on all things related to the automation of software testing. We decided to create this blog after trying to share content among our Automated Testing Meetup groups. We currently have groups in San Francisco, New York City, and Boston. We have had some great talks in each meetup, and sharing presentation materials only on each city's own meetup page was not going to cut it! People in SF wanted to know about NYC, and people in Boston were trying to learn what last month's SF talk was about! With hopes of launching in more cities in the new year, we knew we needed to change something, so we created this blog to share all the content from the Meetup groups in one place… here!
We are also going to be asking the community to contribute posts. We already have some great ones posted from leaders in the space. If you or someone you know would like to author a post, please reach out to Sarah at email@example.com, and she will guide you through the process.
If you are in one of our covered cities, please join! If you would like an Automated Testing Meetup group to come to your city, please say so in the comments section.
We hope to see this group grow organically into a place where all testing professionals can learn, share knowledge, post content, and talk with one another.
Thank you! Let's get started!
- The Solano Labs Team
Solano CI uses the exit status of commands to determine whether a test passes or fails. This follows the venerable Unix tradition whereby an exit status of zero indicates success and a non-zero exit status indicates failure.
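The convention is easy to see at a shell prompt, where `$?` holds the exit status of the last command:

```shell
# `true` always exits 0, the status a CI system records as a pass.
true
echo "true exited with $?"

# `false` exits 1 (non-zero), the status recorded as a failure.
false
echo "false exited with $?"
```

This is why a test script must take care to propagate failures: a trailing cleanup command that exits 0 can mask a failing test run's non-zero status.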
On occasion we’ve seen bugs in test frameworks that can cause false positives, or worse, false negatives. Users with Ruby test suites should check that they are not impacted by a recent defect that appears when using SimpleCov 0.8.x, RSpec 2.14, Rails 4.0.x, and Ruby 2.1.0. Details may be found in the GitHub issue: https://github.com/colszowka/simplecov/issues/281.
We’re happy to announce that the changes we’ve been planning to our GitHub authentication integration are live in our production environment!
As we described in an earlier post, we’ve changed our OAuth model to allow users to select the privilege level they grant Tddium to communicate with GitHub. Now, when you link a GitHub account, you’ll see a menu of privilege levels that you can authorize. You can always change the level you’ve authorized by visiting your User Settings page, where you’ll see the same menu. For more information on Tddium’s use of GitHub permissions, see our documentation section.