
Transaction ID Wraparound in Postgres

On Monday, July 20th, Sentry was down for most of the US working day. We deeply regret any issues this may have caused for your team and have taken measures to reduce the risk of this happening in the future. For transparency, and in the hope of helping others who may find themselves in this situation, we’ve described the event in detail below.

If you’re familiar with Sentry at all you know that we live and breathe SQL. For a long time I’ve had the stance that simple solutions are the best solutions. That holds true in choice of databases as well. Unfortunately that choice has also led to some complex problems. One such problem caused a major outage of the hosted Sentry service earlier this week. That problem is Postgres’ transaction ID (XID).

Transaction IDs

To understand the problem we first need to understand how transaction isolation works in Postgres. That turns out to be a lot easier said than done, as public documentation on the internals is sparse. What little documentation exists barely touches the surface. Given that I can’t speak with authority on the subject, we’ll approach this from a high level.

In essence, the first time a transaction modifies rows in the database (typically via an INSERT, UPDATE, or DELETE statement), it is assigned the next value from the database’s XID counter. These XIDs, stamped onto the rows a transaction touches, are what Postgres uses to determine row visibility.
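
To make that a bit more concrete, here’s a minimal sketch you can run in psql. The events table is purely hypothetical, but txid_current() and the xmin system column are standard Postgres:

-- Hypothetical table, purely for illustration
CREATE TABLE events (id int);

BEGIN;
INSERT INTO events (id) VALUES (1);
SELECT txid_current();  -- the ID assigned to this write transaction
COMMIT;

-- Each row carries the XID of the transaction that created it in xmin;
-- visibility checks compare these values against your snapshot.
SELECT xmin, id FROM events;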

In reality, it’s quite a bit more complex than that. The best human-readable description we’ve found is Heroku’s article on MVCC and concurrency.

You don’t need to have a strong grasp of MVCC to really understand the basics. It does what it’s supposed to, and works amazingly well for many workloads.

Wraparound

Due to the way transaction IDs are implemented in Postgres, it’s necessary to run a routine maintenance task called vacuuming. Postgres will do this automatically (via autovacuum), and in most cases it goes unnoticed. Usually the casual PostgreSQL user first encounters autovacuum around the time they start seeing heavy load (especially write load) on their database. The internet is full of awful advice from users suggesting you should turn off autovacuum, run it manually at low-traffic times, or simply adjust its schedule to run less often. To understand why that’s ill-advised, you first need to understand the consequences of autovacuum not running.

For example, let’s say we’ve got a reasonably active database, and we’ve already disabled autovacuum. Eventually, that database is going to hit the point where the XID counter reaches its maximum value. The absolute peak is something around 2 billion, but it can be far lower than that in some situations. Once you’ve reached this value, the semantics of MVCC can no longer function correctly. As mentioned earlier, the XID counter is used to issue unique values that identify an individual transaction. Since there are no longer any unique XID values available to identify new transactions, the database must halt to prevent XID re-use – otherwise, transactions that occurred in the past could suddenly appear to be in the future, leading to all kinds of strange and undesirable behavior (such as deleted rows reappearing, updated rows reverting to previous states, or referential integrity and constraint violations, among others). To protect you from this, Postgres will stop accepting commands when there are fewer than one million transactions left before the maximum XID value is reached.
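
For reference, you can watch how close a database is getting with a simple catalog query; age() here reports how many transactions old the database’s frozen-XID horizon is:

SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

Alerting well before xid_age approaches your autovacuum_freeze_max_age setting leaves plenty of time to react, long before the one-million-transaction safety margin ever comes into play.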

Sentry is a very write-heavy application. On top of that we have some very large relational tables (both in number of rows and in physical size on disk). Because of this we’ve gone back and forth on tuning autovacuum to avoid exactly this situation. Our goal is to vacuum as aggressively as possible so that it never arises.

Unfortunately this very situation is what caused our most recent outage.

Recovering from Disaster

Once Postgres’ XID wraparound protection kicks in, unless you’re willing to accept data loss, your only option is to stop accepting writes and vacuum the relations. One piece of general advice is to stop the database and restart it in single user mode, running the vacuum in a more isolated and safe environment.
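
For the unfamiliar, that route looks roughly like this; the data directory and database name below are placeholders, and single user mode bypasses normal connection handling so the vacuum can run undisturbed:

-- Started from the shell with something along the lines of:
--   postgres --single -D /var/lib/postgresql/data <dbname>
-- a database-wide vacuum from the single-user prompt is simply:
VACUUM;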

This is where it gets a bit tricky in our situation.

The database has stopped accepting writes, but we’re already neck-deep in autovacuums, which certainly need to finish. Now you start asking vital questions:

  • Is a VACUUM resumable?
  • How long until the process completes?
  • Will another 100GB of memory dedicated to maintenance make things faster?
  • Is the VACUUM that’s running now (while the protection has kicked in) even going to allow the database to recover?

At this point we had to make a bet: it was better to let autovacuum finish with our current configuration than to risk interrupting it. Our biggest fear was that we’d restart, either to increase the memory dedicated to the task or to switch to single user mode, and only end up extending the downtime.

So, now we play the waiting game.

Eventually the autovacuums finish. Unfortunately one of them either failed or simply didn’t work correctly. It’s unclear what happened, as the logs contain nothing to suggest failure and we didn’t have verbosity turned on for autovacuums. Now we’re in a state of limbo, trying to decide what to do next.
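
In hindsight, a single postgresql.conf line would likely have given us the visibility we were missing here; setting it to zero logs every autovacuum action along with its timing and statistics:

log_autovacuum_min_duration = 0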

Safeguards

Up until this point we had been operating in a read-only state. We had replicas online that we could point the application at to maintain that view of the world, but that wouldn’t allow us to accept writes. In Sentry’s case this problem compounds itself, as we process a very large amount of data continuously. Our backlog of events in the queue is growing, our Redis buffers are getting out of control, and catastrophe is bound to strike. To avoid making the problem any worse, we decided to flush the entire backlog of events.

Foresight played in our favor a bit here. We had just provisioned new hardware dedicated solely to ensuring this problem was no longer such a risk: new machines with much more memory targeted at maintenance and additional CPUs dedicated to these tasks. As soon as we identified the XID issue, we failed the primary master over to the new hardware, which is where we’d been running the vacuums.

Now we’re several hours in, autovacuums have finished, the queue will survive (assuming we restore writes very shortly), but Postgres still isn’t accepting writes. Did we waste all of that time allowing autovacuum to finish for it to have no effect? Did it do anything at all? Our only choice at this point was to shut down the database and restart in single user mode. Our fear was that we’d now invested nearly three hours and had nothing to show for it.

Accepting Loss

While we couldn’t estimate how long the vacuum was going to take, we were able to identify the problem. By querying Postgres’ internal statistics we identified that the autovacuums had actually finished on all of the relations except one. That relation, however, is a massive one: a very large table which stores a mapping of events to their rollups, and we require one of these entries for every single event Sentry receives. It’s also a table with very limited uses: letting you look up an event by its ID and acting as a safeguard against duplicate events.
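
The catalog queries involved aren’t exotic; a sketch along these lines (not our exact monitoring) shows when autovacuum last finished on each table and which relations are holding the XID horizon back:

-- When did autovacuum last complete on each table?
SELECT relname, last_autovacuum
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST;

-- Which relations have the oldest frozen XIDs (and thus still need vacuuming)?
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;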

At this stage we made the decision that we were willing to risk all of the data in the table if it meant we could recover immediately. In this case the risk was minor: the table is used for features that aren’t critical, and if nothing else we had full backups via replicas.

Armed with our best estimates, we made the call to truncate the table. Five minutes later, the system was fully restored.

Perspective from two days later tells us this was absolutely the right call. As we work towards future-proofing this problem even more, we’re still watching one of our test machines run a vacuum in single user mode, going on 24 hours now. That machine, which is on the older hardware, tells us we made several correct calls: letting autovacuum finish, truncating the table, and, most importantly, aggressively tuning our hardware and configuration to improve vacuum times.

Looking Forward

We’ve got a few ideas on how to resolve this problem, but the biggest is splitting relations into multiple databases. This has its own set of challenges, as connection management is a fairly complex thing, but it’s at the very least predictable and well understood. Time is also on our side now, as the machine tuning has given us plenty of breathing room.

For the interested, our new hardware specs look something akin to:

  • 3x RAIDs (OS, WAL, and Data)
  • 256GB Memory (we bumped up from 128, specifically to dedicate the new memory to vacuum)
  • 24 Cores (dual hexcore)

Note: We don’t store blob data in Postgres at Sentry (that goes in Riak), so we’re able to get away with a fraction of the hardware we otherwise might need for SQL.

We previously ran with too high a value for autovacuum_freeze_max_age and the default of 3 autovacuum workers. Additionally, we had been running with far too much vacuum cost delay, which meant lower load on the system but more idle time during maintenance.

Our updated configuration:

autovacuum_freeze_max_age = 500000000
autovacuum_max_workers = 6
autovacuum_naptime = '15s'
autovacuum_vacuum_cost_delay = 0
maintenance_work_mem = '10GB'
vacuum_freeze_min_age = 10000000

There are a number of other things we tune, though it’s all fairly standard these days.

While we feel pretty good about the progress we’ve made, ultimately we’re going to be moving away from a SQL-based architecture. In the future we’ll talk more about that, but we’re happy Postgres has taken us this far.

Update: We’ve had many discussions since posting this and wanted to provide some additional context.

  • vacuum_freeze_table_age had not been updated with our new configuration. We’ve removed this from the configuration above as it didn’t affect anything.

  • maintenance_work_mem has a hard internal limit of 1GB for the relevant code paths, and thus the 100GB of memory we dedicated to it wasn’t even being used. We were misled by looking at the processes and seeing the large amount of memory they were using, though that was actually coming from other settings (shared_buffers and effective_cache_size).

In the future we’ll be following up with a post on what we’ve deemed as best practices for tuning Postgres under high write loads.

Logging Go Errors to Sentry

Sentry + Go

So you’re picking up Go and wondering “where did all my exceptions go?” It takes a bit to wrap your head around using Go, especially if you’re coming from an interpreted language like Python or Ruby.

Handling Errors

If you haven’t yet, take a look at handling errors in Go. It’s mostly straightforward, and you’ll end up with a lot of your code looking like this:

f, err := os.Open("filename.ext")
if err != nil {
  log.Fatal(err)
}

If you’re doing this, don’t fret: you’re already handling errors correctly! The problem is that you’re doing it via traditional logging. Even worse — this won’t save you when your code hits a panic.

Addressing Panics

One of the big adjustments when coming to Go from other languages is that standard exceptions that can be caught in the traditional way don’t exist. Exceptions do exist in the form of panics, but they can only be recovered from within deferred functions:

func f() {
  defer func() {
    if r := recover(); r != nil {
      log.Println("Recovered in f", r)
    }
  }()
  doSomethingDangerous()
}

For more information check out the post on the Go blog: Defer, Panic, and Recover.

Reporting to Sentry

Go Traceback

Now that you understand the difference between an error and a panic, let’s talk about how we can feed those into Sentry.

We first need to initialize our Raven client with the DSN for our project:

import "github.com/getsentry/raven-go"

func init() {
  raven.SetDSN("<your dsn>")
}

Now we can easily capture errors from anywhere in our application with raven.CaptureError or raven.CaptureErrorAndWait. The difference is that one is fire-and-forget, while the other blocks until the message has been sent.

fp, err := os.Open("filename.ext")
if err != nil {
  // We explicitly want to wait here because we're exiting
  // immediately after
  raven.CaptureErrorAndWait(err, nil)
  log.Fatal(err)
}

If you suspect that a function call may panic, or you just want to be cautious, you can wrap the call in raven.CapturePanic, which implements the standard defer/recover pattern, reports the panic, and continues on.

func doStuff() {
  // Some stuff that we want to guard
}

// Make sure that the call to doStuff doesn't leak a panic
raven.CapturePanic(doStuff, nil)

Capturing Metadata

One of the most important aspects of Sentry is its ability to collect additional metadata. In Go this is no different, but the APIs are fairly different from what you might see in other integrations. Raven clients treat metadata as a context object, requiring an explicit raven.Set{User,Http,Tags}Context() and ultimately a raven.ClearContext() to reset it. This works great for HTTP request middleware, for example.

Users

Sentry is great for capturing user data to attach to the errors being logged. We can do this through raven.SetUserContext, and the user will be added to the context sent along with all future errors.

raven.SetUserContext(&raven.User{ID: "123", Email: "matt@example.com"})

// Later on, call CaptureError and pass along the User
raven.CaptureError(err, nil)

// Now clear it for subsequent calls
raven.ClearContext()

Http Requests

When working with an HTTP server, it’s extremely valuable to capture information about the request that triggered the error. To do that, Raven provides a convenient way to construct a *raven.Http value to append to the current context: raven.NewHttp, which takes a single argument, *http.Request. This provides information about the URL, headers, cookies, the request method, the request’s IP address, and so on.

raven.SetHttpContext(raven.NewHttp(req))

We also provide a hook into the stdlib’s net/http package that can wrap your handler:

func root(w http.ResponseWriter, r *http.Request) {
  // ... do stuff
}
http.HandleFunc("/", raven.RecoveryHandler(root))

Tags

Tags can be used to filter for specific errors within Sentry, and they’re easy to attach via the context as well, with raven.SetTagsContext(). Tags can also be set at the time an error is captured, and they will be merged with any tags already in the context.

raven.SetTagsContext(map[string]string{"browser": "firefox"})

raven.CaptureError(err, map[string]string{"age": "20"})

In this example, the event in Sentry would end up with both the browser and age tags.

Learn More

Take a look at the raven-go project on GitHub to learn more about how things are implemented, as well as additional details on using it with Sentry.

If you’re not yet a Sentry user, you can try Sentry free for 14 days. If that’s not your cup of tea, it’s all open source on GitHub.

Free for Academia

For a long time we’ve had an unwritten policy to support students and education. Over the years that has included various organizations, such as Scratch and The Harvard Crimson. In addition we’ve seen hundreds of students use us on their personal projects.

Today we’re happy to formalize our policy on Sentry for Academia.

If you’re a student, teacher, or work in research at a university, Sentry is free. Drop us an email at education@getsentry.com (from your .edu address) with details and we’ll be happy to get you set up.

Driven by Open Source

Seven years ago I would frequent an IRC channel set up for users of the Django web framework. Like an old-fashioned Stack Overflow, it was a mix of people asking questions and others answering. At some point, someone asked how to log exceptions to the database. While I didn’t entirely understand the need, it didn’t seem overly difficult, and I helped come up with an example. Shortly afterwards I took that example, threw it into a repository, and committed the first lines of code to what would eventually become Sentry.

commit 3c2e87573d3bd16f61cf08fece0638cc47a4fc22
Author: David Cramer <dcramer@gmail.com>
Date:   Mon May 12 16:26:19 2008 +0000

    initial working code

 djangodblog/__init__.py | 35 +++++++++++++++++++++++++++++++++++
 djangodblog/models.py   | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

A few years later Chris Jennings (Sentry’s co-founder) joined Disqus. He then convinced me to come out to San Francisco and help build the fast-growing community platform. Just a couple of months after joining the team, around May of 2010, I saw Disqus was using the example code I had pushed to GitHub years before. To my surprise, the very basic functionality that existed had proven extremely useful to them. It was then that we decided to double down on the idea and really make something out of the tool. This was the day Sentry was truly born.

Discovering Our Roots

During my tenure at Disqus, myself and many others grew to deeply value and contribute to open source. Coincidentally enough, GitHub also happened to be right upstairs from that first office, and we quickly found ourselves routinely at their drinkups. For us, open source became much more than the simple idea of free software; it became part of our lifestyle and began to encompass everything around us. We released hundreds of small Python and Django libraries during that time, in addition to driving continued improvement on Sentry and other products.

Today, Sentry remains just as committed to open source. Our world runs on open technology, from the tools we use to monitor our servers (and applications) to the entire operating system underneath. Sentry couldn’t exist without open source, and it’s our belief that the way forward for this kind of technology is in that same spirit. We will remain just as committed, if not more so, to improving Sentry, and to doing it through open source.

The Opportunity

Through early iterations of Sentry we eventually built a standalone platform. This allowed anyone to report exceptions upstream via a standard API, in any language, on any platform. It quickly led to an ever-increasing amount of interest from our open source community. A lot of our colleagues within the Django world were early to adopt it, but what was even more surprising was the adoption in ecosystems we had no part in.

We saw SDKs being created in Ruby, PHP, JavaScript, and Java, as well as less popular languages like Erlang and R. It seemed like overnight we went from a little Python and Django application to software that was integrated everywhere, from platforms we didn’t know existed to deployments run by many of our role models in Silicon Valley. Sentry adoption was growing massively.

Strength in Community

Sentry's Commit Graph

Fairly early on it became clear that the community wouldn’t contribute much to the Sentry server itself. It was large, complex, and required an understanding of a lot of things. This actually worked out well for us: it meant we were able to clearly drive the direction of the project without having to fumble around with too much contributed code. We took a strong focus on building extensible entry points into Sentry, and once again friends and colleagues stepped up. Soon the ecosystem expanded to include various integrations, from issue trackers to notification systems.

During that time we also saw the number of people running Sentry internally grow. Huge companies had chosen to adopt open source technology instead of inventing their own. It was amazing to see that little project blossom and help thousands of engineers by making their jobs just a little bit easier. Most of all, it was amazing to see people continue to run and expand open source.

Today those contributions still ring true. The Sentry team is primarily the company itself, and we actively improve on the server and continue to open up things for the ecosystem. The ecosystem itself continues to help us maintain our current SDKs as well as drive adoption of new ones. It feels like I discover something new that the community has come up with on a weekly basis.

In Good Company

Over the course of time a recurring theme had come up: people loved Sentry, but sometimes they’d simply rather not host the server themselves. While we didn’t know it yet, eventually we’d be convinced to build out the premium service, and what is now getsentry.com.

Before leaving Disqus, I spent the Christmas of 2012 engulfed in building the minimal set of features needed to support customer billing on top of Sentry. Two months in, on February 28th of 2013, we had our first paying customer. That customer was Matt Robenolt, who has contributed heavily to many parts of Sentry and is now actively involved in the company. Over the next three years we built out the business (in our spare time) and started seeing a real opportunity. What once was a few dozen lines of code had turned into a valuable, sustainable business with thousands of paying customers. Most importantly of all, it was, and still is, open source.

Credit Due

Today Sentry has become the standard. Thousands of customers trust us with their data on our hosted platform, with tens of thousands of developers using our open source software on-premise. None of this would have ever been possible without open source, and we want to pay special thanks to everyone who has written code for, contributed to, or helped maintain the various Sentry projects throughout the years. At the end of the day, we’re able to continue building open source software primarily because our contributors and our customers have put their trust in us.

With that said, we’re proud to be able to say that we’re looking to expand the team. If you love open source and want to help us build the future of application monitoring, we’d be thrilled to talk to you. Exciting things are coming, and we can’t wait to share them with you all.

From everyone at Sentry (myself, Chris, Armin, and Matt),
Thank you
