summaryrefslogtreecommitdiff
path: root/services/backup
diff options
context:
space:
mode:
authorTobias Thierer <tobiast@google.com>2020-06-05 14:48:10 +0100
committerTobias Thierer <tobiast@google.com>2020-07-07 11:55:20 +0000
commit723f5f365f98a7d4d0b9d60c776536ae5e07420f (patch)
tree56606c977fe21d134a9711232da41404728ce4ed /services/backup
parent03b0c47b668c44bf7ff85f5eda8dc75cb5c35567 (diff)
Fix state deletion for transient backup issues.
Since Android 10, backupPm() includes sendDataToTransport(), which was not previously the case. This means that error handling logic that deletes the backup state file (causing initialize_device() on the next attempt, which deletes any existing backup) will now also be triggered upon errors during sendDataToTransport(), which wasn't previously (Android <= 9) the case. This has the potential of making an existing temporary outage much worse: 1. A few devices might run into temporary issues, e.g. a B&R server returning HTTP 503 Service Unavailable (treated as a TransientHttpStatusException instanceof NetworkException, which is mapped to TRANSPORT_ERROR during handleTransportStatus(), which results in a TaskException with stateCompromised==false but which backupPm() wraps in another TaskException that forces stateCompromised=true). 2. On their next backup attempt, those devices throw away any existing backup and start from scratch (initialize_device()), increasing the load on the server. 3. This leads to a positive-feedback loop where more devices than before run into HTTP 503 Service Unavailable. 4. As a result, masses of devices delete their backups and then hammer the B&R server with attempts to upload new backups. 5. Backups are unavailable to any users who would otherwise rely on them during this outage. To improve on this dangerous situation, this CL changes the code to force stateCompromised=true only for TaskExceptions thrown specifically during extractPmAgentData(), and (as before) for all AgentExceptions. Note that the code is still quite brittle. It still seems like we are probably forcing stateCompromised=true in too many situations, but it's hard to say so this CL is being conservative about the changes. Changing back to the old behavior could be done through a local change around KeyValueBackupTask.java:676; a future CL may do this to have a safety hatch in case we want to cherry-pick this CL into an upcoming Android release late in the release cycle. [1] https://android.googlesource.com/platform/frameworks/base/+/refs/heads/pie-dev/services/backup/java/com/android/server/backup/internal/PerformBackupTask.java#1035 [2] https://android.googlesource.com/platform/frameworks/base/+/refs/heads/master/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java#1040 [3] https://source.corp.google.com/piper///depot/google3/java/com/google/android/gmscore/integ/modules/backup/transport/src/com/google/android/gms/backup/transport/GmsBackupTransport.java;l=770;rcl=281845876 Bug: 144030477 Test: Checked that the following passes after this CL, but testRunTask_whenTransportReturnsErrorForPm_updatesFilesAndCleansUp() fails if I revert the state of KeyValueBackupTask.java to before this CL: ROBOTEST_FILTER=KeyValueBackupTaskTest make \ RunBackupFrameworksServicesRoboTests Change-Id: I6c622c55fbd804ec0a12e0bea7ade1308f7a3877 (cherry picked from commit 819ed81faaa295d9e1096f13f599cb43d48cda88)
Diffstat (limited to 'services/backup')
-rw-r--r--services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java25
1 files changed, 21 insertions, 4 deletions
diff --git a/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java b/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java
index 5e10916c4491..a4e58a1faeac 100644
--- a/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java
+++ b/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java
@@ -649,16 +649,33 @@ public class KeyValueBackupTask implements BackupRestoreTask, Runnable {
mReporter.onStartPackageBackup(PM_PACKAGE);
mCurrentPackage = new PackageInfo();
mCurrentPackage.packageName = PM_PACKAGE;
-
try {
- extractPmAgentData(mCurrentPackage);
+ // If we can't even extractPmAgentData(), then we treat the local state as
+ // compromised, just in case. This means that we will clear data and will
+ // start from a clean slate in the next attempt. It's not clear whether that's
+ // the right thing to do, but matches what we have historically done.
+ try {
+ extractPmAgentData(mCurrentPackage);
+ } catch (TaskException e) {
+ throw TaskException.stateCompromised(e); // force stateCompromised
+ }
+ // During sendDataToTransport, we generally trust any thrown TaskException
+ // about whether stateCompromised because those are likely transient;
+ // clearing state for those would have the potential to lead to cascading
+ // failures, as discussed in http://b/144030477.
+ // For specific status codes (e.g. TRANSPORT_NON_INCREMENTAL_BACKUP_REQUIRED),
+ // cleanUpAgentForTransportStatus() or theoretically handleTransportStatus()
+ // still have the opportunity to perform additional clean-up tasks.
int status = sendDataToTransport(mCurrentPackage);
cleanUpAgentForTransportStatus(status);
} catch (AgentException | TaskException e) {
mReporter.onExtractPmAgentDataError(e);
cleanUpAgentForError(e);
- // PM agent failure is task failure.
- throw TaskException.stateCompromised(e);
+ if (e instanceof TaskException) {
+ throw (TaskException) e;
+ } else {
+ throw TaskException.stateCompromised(e); // PM agent failure is task failure.
+ }
}
}