Version: Graph 2.4.0

Error handling for bulk data loading

Overview

This page describes how to manage error handling with the bulk data loader.

Types of errors

The Aerospike Bulk Loader takes input data from files in CSV format. In some situations, the CSV data may contain inconsistent or incorrectly formatted data.

Possible CSV errors include:

  • Vertices with duplicate IDs. If the bulk loader finds vertices with duplicate IDs, it writes only one to the database.

  • Incorrectly typed data. Each CSV file has a header row that specifies the data type of each column. If the bulk loader finds a value that does not match its column's data type, it rejects the row.

  • Orphaned edges. If the bulk loader finds an edge that refers to a non-existent vertex, it rejects the row.
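
To make the duplicate-ID and orphaned-edge cases concrete, here is a minimal Java sketch of a pre-check you might run on IDs already extracted from your CSV files. It is an illustration only and not the bulk loader's own logic; the class `CsvPrecheck`, its methods, and the representation of an edge as a two-element array of source and target IDs are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: illustrates two of the error types, not the loader's implementation.
class CsvPrecheck {
    // Returns the IDs that occur more than once in the vertex ID list.
    static Set<String> duplicateIds(List<String> vertexIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new HashSet<>();
        for (String id : vertexIds) {
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    // Counts edges whose source or target ID is absent from the vertex ID list.
    // Each edge is assumed to be {sourceId, targetId}.
    static int orphanedEdgeCount(List<String> vertexIds, List<String[]> edges) {
        Set<String> known = new HashSet<>(vertexIds);
        int orphaned = 0;
        for (String[] edge : edges) {
            if (!known.contains(edge[0]) || !known.contains(edge[1])) {
                orphaned++;
            }
        }
        return orphaned;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("v1", "v2", "v2", "v3");
        List<String[]> edges = List.of(
            new String[] {"v1", "v2"},
            new String[] {"v3", "v9"});  // v9 is not a known vertex
        System.out.println(duplicateIds(ids));             // prints [v2]
        System.out.println(orphanedEdgeCount(ids, edges)); // prints 1
    }
}
```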

Configuring fault tolerance

You can specify how fault-tolerant the import process is by configuring an allowable number for each type of error in your properties file.

The following configuration options specify how many errors to allow in a bulk load operation before it aborts.

  • aerospike.graphloader.allowed-duplicate-vertex-id-count
  • aerospike.graphloader.allowed-bad-entry-count
  • aerospike.graphloader.allowed-bad-edges-count

note

In distributed mode, you must use the -validate_input_data Spark flag when any of these allowable-error configuration options is in use.

The three allowable-error options all accept positive integers as values, and they all default to unlimited allowable errors. Set a reasonable number of allowable errors before starting a bulk loader job; leaving the default in place may lead to unusable data sets due to a high number of missing rows. Bulk loader performance also degrades when a high percentage of input data is rejected.
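
As a minimal sketch of a check you could run before starting a job, the following Java snippet reads the text of a properties file and reports which of the three options are unset or not positive integers. The class `FaultToleranceCheck`, the method `missingOrInvalid`, and the sample limits are hypothetical; only the option names come from this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check; not part of the bulk loader.
class FaultToleranceCheck {
    // Option names as listed on this page.
    static final String[] LIMIT_KEYS = {
        "aerospike.graphloader.allowed-duplicate-vertex-id-count",
        "aerospike.graphloader.allowed-bad-entry-count",
        "aerospike.graphloader.allowed-bad-edges-count"
    };

    // Returns the option names that are unset or not positive integers.
    static List<String> missingOrInvalid(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> problems = new ArrayList<>();
        for (String key : LIMIT_KEYS) {
            String value = props.getProperty(key);
            try {
                if (value == null || Integer.parseInt(value.trim()) <= 0) {
                    problems.add(key);
                }
            } catch (NumberFormatException e) {
                problems.add(key);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Sample limits are illustrative only; choose values that suit your data.
        String sample = String.join("\n",
            "aerospike.graphloader.allowed-duplicate-vertex-id-count=100",
            "aerospike.graphloader.allowed-bad-entry-count=50");
        System.out.println("Options still at default or invalid: " + missingOrInvalid(sample));
    }
}
```

An option reported by this check is either left at its unlimited default or set to a value the loader is not documented to accept.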

Call API error reporting

If you use the bulk loader with the call API, the command returns a string with the number of each type of error, if any.

String result = (String) g.call("aerospike.graphloader.admin.bulk-load.load").with(...).next();

// If there were no errors, the result string reads:
// Success

// If there were errors, the result string reads:
// Warning: Errors were encountered during bulk loading.
// duplicate-vertex-id-count: 1
// bad-edge-count: 3
// bad-entry-count: 2
// Use the g.call("aerospike.graphloader.admin.bulk-load.errors") command for details.
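
If you want to act on the counts in the result string programmatically, one option is to parse them out. The sketch below assumes the string has exactly the line-per-counter layout shown in the example above, which this page does not guarantee; the class `BulkLoadResultParser` and the method `parseCounts` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser; assumes the "name-count: number" line layout shown in the example.
class BulkLoadResultParser {
    // Collects "name-count: number" lines from the result string; other lines are ignored.
    static Map<String, Integer> parseCounts(String result) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : result.split("\\R")) {
            int colon = line.lastIndexOf(':');
            if (colon < 0) {
                continue;
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.endsWith("-count") && value.matches("\\d+")) {
                counts.put(name, Integer.parseInt(value));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample text copied from the example; a real value would come from the load call.
        String sample = String.join("\n",
            "Warning: Errors were encountered during bulk loading.",
            "duplicate-vertex-id-count: 1",
            "bad-edge-count: 3",
            "bad-entry-count: 2",
            "Use the g.call(\"aerospike.graphloader.admin.bulk-load.errors\") command for details.");
        System.out.println(parseCounts(sample));
        // prints {duplicate-vertex-id-count=1, bad-edge-count=3, bad-entry-count=2}
    }
}
```

A result of "Success" yields an empty map with this approach.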

When the bulk load operation completes, use the following call API command to get a report of how many errors occurred.

g.call("aerospike.graphloader.admin.bulk-load.error-count")

This command returns a map with the following keys:

  • duplicate-vertex-id-count: The number of vertices that have the same ID in the input dataset.

  • bad-edge-count: The number of edges that refer to missing vertex IDs.

  • bad-entry-count: The number of vertices and edges that have incorrectly typed data, for example a string in a column that the header row specifies as an integer.

The value of each key is an integer specifying the number of errors found.
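
A map like this can be reduced to a single pass/fail decision after a load. The following Java sketch sums the three documented counters and compares the total against a limit of your choosing. It assumes the map has already been obtained from the command above and that its values are numeric; the class `ErrorCountCheck`, its methods, and the sample values are hypothetical.

```java
import java.util.Map;

// Hypothetical post-load check on the error-count map.
class ErrorCountCheck {
    // Key names as documented on this page.
    static final String[] KEYS = {
        "duplicate-vertex-id-count", "bad-edge-count", "bad-entry-count"
    };

    // Sums the three documented counters; missing keys count as zero.
    static long totalErrors(Map<String, ? extends Number> counts) {
        long total = 0;
        for (String key : KEYS) {
            Number n = counts.get(key);
            if (n != null) {
                total += n.longValue();
            }
        }
        return total;
    }

    // True when the total number of reported errors does not exceed maxErrors.
    static boolean withinLimit(Map<String, ? extends Number> counts, long maxErrors) {
        return totalErrors(counts) <= maxErrors;
    }

    public static void main(String[] args) {
        // Sample values only; a real map would come from the error-count command.
        Map<String, Integer> counts = Map.of(
            "duplicate-vertex-id-count", 1,
            "bad-edge-count", 3,
            "bad-entry-count", 2);
        System.out.println(totalErrors(counts));     // prints 6
        System.out.println(withinLimit(counts, 5));  // prints false
    }
}
```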

To get details for a specific error type, use the following command:

g.call("get-bulk-load-errors").with("type", "<TYPE>")

Replace <TYPE> with one of:

  • duplicate-vertex-ids
  • bad-edges
  • bad-entry

Each error type returns a map with key-value pairs containing the following information:

duplicate-vertex-ids

  • id: duplicated vertex ID.
  • count: number of times duplicated.

bad-edges

  • bad-vertex-id: the non-existent vertex ID in the edge dataset.
  • count: number of edges attached to the non-existent vertex ID.

bad-entry

  • row: string representation of the CSV row that did not match the data types specified in the header. Compared to the row in the CSV file, the value shown includes additional columns marked as null.
  • file: CSV file that contained the erroneous row.

See the options reference for full descriptions of the error-related configuration options.