Crash Reporting with MetricKit

iOS 14 brings crash reporting capabilities to MetricKit. Is this the end of in-process reporting?

Being a macOS developer, I completely glossed over the introduction of MetricKit with iOS 13. But, it’s actually a pretty amazing framework. MetricKit can gather very detailed statistics about the performance of your application, and can report it all back to your app for handling. There are many 3rd-party products out there for doing exactly this. But, MetricKit can gather an impressive amount of data, and presumably does it in a way that is both lower-overhead and more accurate than a 3rd-party SDK ever could.

I have to assume that the most common form of app monitoring is crash reporting. Apple does offer a built-in system that’s integrated into Xcode. But, most apps still make use of a 3rd-party service, which can offer a much richer feature set. These systems all require the use of an in-process crash reporter, usually packaged up in an SDK. But, iOS 14 might turn that all on its head. Because, MetricKit can now be used to capture detailed reports on crashes, and even other forms of app termination.

Could it be that Apple has finally made something that eliminates the need for in-process crash reporting? 🤓

In-Process Monitoring

To build a crash reporting system, you need to have access to actual crash reports. The OS does capture every crash, but those reports are not accessible to anyone but Apple. The only way for developers to get reports is to somehow produce one within the application’s sandbox. It turns out that there actually are APIs that make this possible. They allow you to detect and run code after a crash occurs. That gives you a chance to capture details about what happened, and write it all to disk.

If this sounds kinda perilous, that’s because it is 🙃. As someone that has spent a great deal of time working on in-process crash reporting, I can tell you that it is absolutely terrible. While it is technically interesting, it is also messy, complex, error-prone, and cannot even capture all kinds of crashes. I’d like nothing more than to see an end to all this nonsense, and I’m sure many Apple employees feel the same way.

Before I noticed MetricKit’s new capabilities, I was already doing a bit of crash reporting work. To get ready for ARM-based Macs, I’ve been working on my own lightweight in-process crash monitor, Impact. Finishing up support for arm64 also had the happy side effect of finally making Impact iOS-compatible. Impact might be of interest to you if you are looking for a minimal crash reporting library to build a system around.

But, the stars of the show here are the new MXDiagnostic APIs. So, let’s take a look.

Getting reports with MetricKit

MetricKit doesn’t actually do any analysis or presentation. That part is up to you, or perhaps a service you use. MetricKit just provides you with the data, or payloads, in its terminology. You get access to these payloads via callbacks from MXMetricManager.

I really like how MXMetricManager has been specifically designed to accommodate multiple consumers. Sharing access to monitoring facilities, for crash reporting in particular, is a very problematic aspect of the in-process approach. I think, however, that there should be better ways for the host app developer to restrict access to MetricKit. This could be an important way to prevent unexpected/unwanted access from 3rd-party SDKs (FB7810671).

New functions have been added for diagnostics, and they mirror the pre-existing ones for metric payloads.

// A method of MXMetricManagerSubscriber, called with a batch of diagnostic payloads.
@available(iOS 14.0, *)
func didReceive(_ payloads: [MXDiagnosticPayload]) {
    // handle the payloads here
}
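
For context, here is a minimal sketch of how a subscriber might be wired up. The DiagnosticsReceiver class and its start/stop methods are hypothetical names made up for illustration; only the MXMetricManager, MXMetricManagerSubscriber, and payload calls come from MetricKit. It just prints the size of each payload, so treat it as a starting point rather than a real reporter.

```swift
import Foundation
import MetricKit

// Hypothetical subscriber type; the name and the start/stop methods are
// illustrative, not part of MetricKit.
@available(iOS 13.0, *)
final class DiagnosticsReceiver: NSObject, MXMetricManagerSubscriber {
    // Registers this object with the shared manager so it receives callbacks.
    func start() {
        MXMetricManager.shared.add(self)
    }

    // Unregisters this object again.
    func stop() {
        MXMetricManager.shared.remove(self)
    }

    // Metric payloads, available since iOS 13. Unused in this sketch.
    func didReceive(_ payloads: [MXMetricPayload]) {
    }

    // Diagnostic payloads, new in iOS 14.
    @available(iOS 14.0, *)
    func didReceive(_ payloads: [MXDiagnosticPayload]) {
        for payload in payloads {
            // jsonRepresentation() returns the whole payload as JSON data,
            // which is convenient for uploading or writing to disk.
            let json = payload.jsonRepresentation()
            print("diagnostics \(payload.timeStampBegin) to \(payload.timeStampEnd): \(json.count) bytes")
        }
    }
}
```

You would presumably keep one instance alive for the lifetime of the app and call start() early in launch, since delivery is asynchronous.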

According to the documentation, these callbacks are invoked on the order of once per day. They include data on the previous 24 hours. For testing purposes, you can also use Xcode’s Debug > Simulate MetricKit Payloads to invoke them immediately. With Xcode 12 and iOS 14, this command provides both simulated metrics and diagnostics. This is better than nothing, but I’d really like a way to force the system to deliver real payloads immediately (FB7807493). Right now, testing with real data is proving to be very hard.

You can also get access to these payloads using the pastDiagnosticPayloads property. I’ve read the documentation for this property a number of times, but I still find it difficult to understand. It sounds like a way to get access to previously-delivered payloads. I’m unsure why that would be useful, but I have a feeling this is because I’m just misunderstanding.

The bad news is, so far, I’ve been unable to get the iOS beta to actually deliver a real crash payload to my test app via any mechanism. Even when waiting the full 24 hours after causing a crash. I’m going to blame beta instability here, but this is definitely not instilling confidence (FB7818332). 😭

Update: I have been able to get reports with Beta 1 after about 72 hours. Stupidly, I did not accurately track the number of crashes that I triggered for testing, but it appears that I received the majority of them all within one payload delivery. Hooray! 🥳

What’s in a Diagnostic

Interestingly, MetricKit includes reporting for four distinct types of diagnostics:

MXCrashDiagnostic
MXCPUExceptionDiagnostic
MXHangDiagnostic
MXDiskWriteExceptionDiagnostic

The documentation is a little sparse, but I believe that all of these correspond to situations where the OS would have terminated your process. Even though only one of these event types is technically a crash, from your users’ perspective they are all equally bad. And, it’s very exciting, because an in-process reporter can only detect a subset of what MXCrashDiagnostic covers.

These four diagnostic types share a lot of structure, since they are all subclasses of MXDiagnostic. This superclass provides a bunch of metadata properties on the event and the host process at the time the event occurred. This is all the typical environmental context you might be interested in, like app and OS version.
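
Because the four types share a superclass, a single helper can walk all of them. This is a rough sketch: report and reportAll are hypothetical function names, while the payload array properties and the metaData and applicationVersion properties are the MetricKit ones.

```swift
import MetricKit

// Hypothetical helper: prints one line per diagnostic of the given kind,
// using only the properties shared through the MXDiagnostic superclass.
@available(iOS 14.0, *)
func report(_ kind: String, _ diagnostics: [MXDiagnostic]?) {
    // A nil array means the payload carried no events of this kind.
    for diagnostic in diagnostics ?? [] {
        let meta = diagnostic.metaData
        print("\(kind): app \(diagnostic.applicationVersion), OS \(meta.osVersion), device \(meta.deviceType)")
    }
}

// Hypothetical entry point: visits each of the four diagnostic arrays.
@available(iOS 14.0, *)
func reportAll(in payload: MXDiagnosticPayload) {
    report("crash", payload.crashDiagnostics)
    report("cpu exception", payload.cpuExceptionDiagnostics)
    report("hang", payload.hangDiagnostics)
    report("disk write exception", payload.diskWriteExceptionDiagnostics)
}
```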

The Details of MXCrashDiagnostic

Each of the MXDiagnostic subclasses includes some properties specific to its type of event. MXCrashDiagnostic exposes a bunch of information about the crash event. This includes details of the Mach exception code and signal. These are lower-level bits of context about what specifically went wrong during a crash, and they can be very useful during debugging. Even the virtual memory region information is presented, though it is in the form of a string, which is pretty unwieldy. Here’s what that property for the example event looks like:

0 is not in any region.  Bytes before following region: 4000000000 REGION TYPE                      START - END             [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL UNUSED SPACE AT START ---> __TEXT                 0000000000000000-0000000000000000 [   32K] r-x/r-x SM=COW  ...pp/Tes
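
The signal, by contrast, arrives as a plain number, so a small lookup makes reports easier to read. A minimal sketch, assuming the standard POSIX signal constants that Foundation pulls in; signalName is a hypothetical helper, and the list only covers signals that commonly show up in crash reports.

```swift
import Foundation

// Hypothetical helper: maps a POSIX signal number to its conventional name.
// Unlisted signals fall back to a generic description.
func signalName(_ number: Int32) -> String {
    switch number {
    case SIGABRT: return "SIGABRT"
    case SIGBUS: return "SIGBUS"
    case SIGFPE: return "SIGFPE"
    case SIGILL: return "SIGILL"
    case SIGKILL: return "SIGKILL"
    case SIGSEGV: return "SIGSEGV"
    case SIGTRAP: return "SIGTRAP"
    default: return "signal \(number)"
    }
}

print(signalName(SIGSEGV))
```

With a real MXCrashDiagnostic, the input would come from its optional signal property, something like signalName(crash.signal?.int32Value ?? 0).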

As far as I can tell, MXCrashDiagnostic does not currently expose any information about uncaught exceptions (FB7808000). While these might be a little less common in our increasingly-Swift development environment, I still think they are essential. So many of Apple’s frameworks are written in Objective-C, and exceptions provide additional context that would be sorely missed.

So far, this data pretty much lines up with what you have in the familiar text-based crash report. The bulk of a report is stack traces, which are made available through the MXCallStackTree class.

MXCallStackTree

Disappointingly, the most interesting part of the MXDiagnostic family is also the most opaque. As of the current beta of iOS 14, MXCallStackTree has exactly one method: JSONRepresentation. To interact with stack traces, you must convert them into totally-undocumented (as far as I can tell) JSON data and then parse that back into custom-made structures. This… isn’t ideal (FB7807427). To me, this seems like a really problematic design choice. Besides the inconvenience and wasted work, I’m also worried about the forwards and backwards compatibility of something like this.

But, it’s what we have, so let’s dig in. Here’s what the call stack JSON from a simulated crash payload looks like:

{
  "callStacks" : [
    {
      "threadAttributed" : true,
      "callStackRootFrames" : [
        {
          "binaryUUID" : "247F6DDC-9A6B-4B81-922E-A4D274087FB4",
          "offsetIntoBinaryTextSegment" : 123,
          "sampleCount" : 20,
          "binaryName" : "testBinaryName",
          "address" : 74565
        }
      ]
    }
  ],
  "callStackPerThread" : true
}

MXCallStackTree seems to be a structure capable of describing both sampled and instantaneous stack traces. The threadAttributed and callStackPerThread fields look like they determine which of the structure types this instance represents. I’m going to attempt to reserve some judgement here, but I definitely am concerned. I’m not sure if one instance would actually contain both types of data or not. I think two different containers, with a dedicated MXStackFrame object, might be more reasonable (FB7813451).
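
In the meantime, parsing the JSON back into something usable could look roughly like this. It is a sketch under a big assumption: the Frame, CallStack, and CallStackTree types are hypothetical and mirror only the fields visible in the simulated payload above. Since the format is undocumented, real payloads may well contain more, and keys these types do not declare are silently ignored by the decoder.

```swift
import Foundation

// Hypothetical mirror types for the simulated JSON. Every field except the
// two container arrays is optional, because the format is undocumented.
struct Frame: Codable {
    var binaryUUID: UUID?
    var offsetIntoBinaryTextSegment: Int?
    var sampleCount: Int?
    var binaryName: String?
    var address: UInt64?
}

struct CallStack: Codable {
    var threadAttributed: Bool?
    var callStackRootFrames: [Frame]
}

struct CallStackTree: Codable {
    var callStacks: [CallStack]
    var callStackPerThread: Bool?
}

// Decodes JSON data of the shape produced by MXCallStackTree.
func decodeCallStackTree(from data: Data) throws -> CallStackTree {
    try JSONDecoder().decode(CallStackTree.self, from: data)
}

// The simulated payload from above, collapsed onto one line.
let sampleJSON = Data("""
{"callStacks":[{"threadAttributed":true,"callStackRootFrames":[{"binaryUUID":"247F6DDC-9A6B-4B81-922E-A4D274087FB4","offsetIntoBinaryTextSegment":123,"sampleCount":20,"binaryName":"testBinaryName","address":74565}]}],"callStackPerThread":true}
""".utf8)

do {
    let tree = try decodeCallStackTree(from: sampleJSON)
    // Print the name and address of every root frame in every stack.
    for frame in tree.callStacks.flatMap({ $0.callStackRootFrames }) {
        print(frame.binaryName ?? "?", frame.address ?? 0)
    }
} catch {
    print("unexpected JSON shape: \(error)")
}
```

Making nearly everything optional is a deliberate trade: it tolerates missing fields if the format shifts between OS releases, at the cost of more unwrapping at the call site.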

Now, this is a beta, so this is all subject to change. But, I’m hoping we get something a little more fleshed out. The lack of a real programmatic interface will almost certainly be problematic for years to come for both Apple and the consumers of this API. I’m cautiously optimistic, but if I had to bet, it would be that this is what we’ll get in iOS 14.

I think a wrapper library may prove very helpful for anyone doing on-device processing. And I suspect on-device processing will be common.

Symbolication

One thing that is documented about MXCallStackTree is that it only contains unsymbolicated data. I would imagine that a very common consumer of these payloads would be the SDKs of analytics services. Those probably already have a symbolication system built. But, if you want to consume this data yourself, you’re going to have to deal with this somehow.

Stack traces are just a list of addresses. To get a human-readable form, you have to turn those addresses into symbols (function and method names). This process is called “symbolication”, and Apple has a great document on the topic. While the concept is pretty straightforward, the actual process is quite involved.
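
To make the core idea concrete, here is a toy lookup. It assumes you already have a sorted table of symbol start offsets for a binary, which is precisely the hard part that real tools extract from the binary or its dSYM. The Symbol type, the function name, and the table contents are all made up for illustration.

```swift
// Hypothetical symbol table entry: where a function starts within its
// binary, and what it is called.
struct Symbol {
    let start: UInt64
    let name: String
}

// Returns the symbol with the greatest start offset that does not exceed
// `offset`, or nil if the offset lies before the first symbol.
// `symbols` must be sorted by `start` in ascending order.
func symbol(containing offset: UInt64, in symbols: [Symbol]) -> Symbol? {
    var low = 0
    var high = symbols.count // exclusive upper bound
    while low < high {
        let mid = (low + high) / 2
        if symbols[mid].start <= offset {
            low = mid + 1
        } else {
            high = mid
        }
    }
    // `low` is now the index of the first symbol starting after `offset`.
    return low > 0 ? symbols[low - 1] : nil
}

// Made-up table for demonstration.
let table = [
    Symbol(start: 0x100, name: "main"),
    Symbol(start: 0x180, name: "loadDocument"),
    Symbol(start: 0x240, name: "parseHeader"),
]

print(symbol(containing: 0x1A0, in: table)?.name ?? "<unknown>")
```

This toy has no notion of symbol sizes, so any offset past the last entry is attributed to the last symbol. Real symbolication also has to account for where the binary was loaded in memory, and for inlined functions.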

The frame structure in MXCallStackTree contains a UUID, which is enough information to identify the binary. But, you still need access to the actual binary (or dSYM) itself to perform symbolication. To get all the frames filled in, you’ll need to symbolicate against both your app/libraries and also Apple’s. And remember that you must symbolicate against the exact executable that was loaded at the time of the crash. For Apple’s libraries, these change not just across OS versions, but even within OS versions across devices. Getting access to these binaries isn’t trivial.
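
Since every frame carries a binary UUID, a sensible first step is to work out which binaries (and therefore which dSYMs) a given report needs. A rough sketch, using a hypothetical FrameInfo type that stands in for a parsed frame; the UUIDs and names in the example are invented.

```swift
import Foundation

// Hypothetical stand-in for a frame parsed out of the call stack JSON.
struct FrameInfo {
    let binaryUUID: UUID
    let binaryName: String
    let address: UInt64
}

// Returns one entry per distinct binary referenced by the frames,
// keyed by UUID, so each binary or dSYM only has to be located once.
func binariesNeeded(for frames: [FrameInfo]) -> [UUID: String] {
    var result: [UUID: String] = [:]
    for frame in frames {
        result[frame.binaryUUID] = frame.binaryName
    }
    return result
}

// Invented example data: two frames in the app, one in a system library.
let appUUID = UUID()
let systemUUID = UUID()
let frames = [
    FrameInfo(binaryUUID: appUUID, binaryName: "MyApp", address: 0x1000),
    FrameInfo(binaryUUID: appUUID, binaryName: "MyApp", address: 0x1040),
    FrameInfo(binaryUUID: systemUUID, binaryName: "libsystem_kernel.dylib", address: 0x2000),
]

for (uuid, name) in binariesNeeded(for: frames) {
    print("need symbols for \(name) (\(uuid))")
}
```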

One place where you’ll always have access to the Apple binaries involved in a crash is on the device where the crash happened. I believe it is possible to do symbolication as an in-app post-processing step. For the executable you’ve built, you likely have a corresponding dSYM. In addition to symbols, dSYMs can be used to look up file and line number information. Unfortunately, you won’t have access to the dSYMs on the device, so that requires off-device processing.

I think my ideal approach would be a low-fidelity symbolication on-device, followed by a refinement step off-device when the report is received server-side. This will produce the highest-quality results, but is also the most complex. My intention is to investigate on-device symbolication support in Meter, a companion project I introduce below. But, it would be ideal if MetricKit just did this for us (FB7821667) and saved everyone the hassle.

Long story short: symbolication is going to be an issue for many consumers of MetricKit’s diagnostic facilities.

Relationship to App Store Connect

A fascinating aspect of MetricKit is its relationship to App Store Connect. It looks like MetricKit is an API to the existing OS systems for relaying data back to Connect. The MetricKit APIs have a surprising amount in common with Connect’s functionality. I was able to catch a glimpse of what these structures look like in the “Identify trends with the Power and Performance API” WWDC2020 session. They are extremely similar.

Notably, the App Store Connect data is symbolicated. This makes sense, since Apple’s crash reporting service does server-side symbolication of crash reports. If you don’t need raw logs, using the Connect API could be a much simpler option.

So, is This the Future of Crash Reporting or What?

It’s a little early to be certain, but it’s incredibly promising. This API is definitely something to be excited about. I expect that all the analytics service providers that do in-process crash reporting are investigating MetricKit closely. The ability to capture more types of application terminations alone is a major advantage.

Unfortunately, the lack of tvOS, macOS, and watchOS support is a real disappointment. And the iOS 14 minimum raises the bar pretty high. And, some big questions about behavior and semantics still remain. The timeliness and reliability of MetricKit’s payloads are big unknowns. I’ve been having a lot of difficulty working with it for testing. In fact, when I first wrote this, I had yet to successfully get a single real crash payload (see the update above). It also is pretty unclear what happens if, for example, your app goes a long time between launches. Do crashes from a month ago really get buffered? I’d like to see much better documentation around this.

Compounding that is the OS-level crash sharing privacy system. I believe the opt-in rate is quite low, something like maybe 25%. For very large apps, that probably still represents a sufficient sample size. For smaller apps, you’ll likely end up missing the majority of your reports.

Another thing I worry about is how this system will handle a crash on launch. These are among the most painful for an end-user, and frustrating for a developer. And because MetricKit is both high latency and also uses async delivery, I’m not sure it is possible to ever report a persistent crash-on-launch (FB7815503). That would really be too bad.

Wrapping Up

I’m really, really excited about these new capabilities in MetricKit. And, I’m both surprised and happy that Apple has decided to enable 3rd-party solutions with their design, instead of locking everyone into an Apple-built system. Xcode crash reporting is a nice service, but it doesn’t meet the needs of many apps. That said, I’d still like to see progress on the Xcode side. Perhaps a good starting point would be a dedicated app for metrics? 😉

In fact, I’m interested enough that I’ve begun work on a companion project called Meter. Meter provides a programmatic interface to MXCallStackTree, as well as emulation of a MetricKit-like reporting interface for older OSes and unsupported platforms. I’d also like to investigate providing on-device symbolication. It’s all very much still a work in progress, though.

Bottom line: MetricKit is great. It is not yet a full replacement for in-process reporting, but it’s surprisingly close. The API could definitely use some attention. And, I would really (really really) love to see this supported on all platforms (FB7805036). It would also be nice if it actually worked reliably 😬 (See the update above!). But, I’m very excited to see how it develops and if analytics services start adopting it. I expect they will.

I think we may actually be at the beginning of the end of in-process reporting. It took a long time to get here, and it seems like we still have a ways to go. But, I am just thrilled.

Tue, Jun 30, 2020 - Matt Massicotte
