Database Log Reference
See Configuring logs for more information on configuring server logs.
Arenax
could not allocate xxxxxxxxxx-byte arena stage xxx: No space left on device
For the index-type flash configuration, indicates that the mount points have run out of space. You may need to delete the arena files manually and run fsck on the disk partitions.
As
allowing x fill-migrations after y seconds delay
This message appears after a recluster event, if a value is set for the migrate-fill-delay
and if the recluster event has caused fill-migrations to be scheduled.
finished clean shutdown - exiting
This message is the last one of a sequence of messages logged during Aerospike server shutdown. The message signifies that Aerospike was shut down with “trusted” status which is a necessary condition for a subsequent fast restart of a namespace that is configured with storage-engine device. See ASD shutdown process.
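A namespace eligible for fast restart might be configured like the following sketch (namespace name and device path are placeholders, not recommendations):

```
namespace test {
    replication-factor 2
    storage-engine device {
        # Device-backed storage; a trusted shutdown of this namespace
        # allows a subsequent fast restart.
        device /dev/nvme0n1
    }
}
```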
waiting for storage: 1569063 objects, 1819777 scanned
Appears when cold starting a node that has data on persistent storage, such as SSD. Objects and scanned values diverge for various reasons, such as scanning records that were previously expired or that expired while the system was down.
objects
Number of retained objects from storage device.
scanned
Number of objects that have been scanned on the storage device.
Batch
abandoned batch from 11.22.33.44 with 23 transactions after 30000 ms.
When a batch-index transaction is abandoned due to one or more delays, where its total time exceeds the allowed threshold of either the client’s total timeout or 30 seconds, if the total timeout is not set by the client. Each occurrence also increments the batch_index_error
statistic.
Prior to 6.4, this stat increments when the total time exceeds twice the client’s total timeout or 30 seconds if the total timeout is not set by the client.
IP Address
The client originating IP address for the transaction.
Number of transactions
The number of batch sub transactions in the impacted batch index transaction.
Abandoned time
The total time the batch index transaction had been running before being abandoned.
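The abandonment rule above can be sketched as a small helper (an illustration of the rule, not server code; the function name is hypothetical):

```python
def abandon_threshold_ms(client_total_timeout_ms):
    # Since 6.4: a batch-index transaction is abandoned when its total time
    # exceeds the client's total timeout, or 30 seconds when no total
    # timeout is set (represented here as 0).
    return client_total_timeout_ms if client_total_timeout_ms > 0 else 30_000

print(abandon_threshold_ms(0))      # no client timeout set: 30 s
print(abandon_threshold_ms(5000))   # client total timeout governs
```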
failed to find active batch queue that is not full
All batch index queues have exceeded the batch-max-buffers-per-queue limit, so batch requests are being rejected. Consult Tuning Batches. To save log space, this message is logged with a (repeated: nnn) prefix just once per ticker-interval during periods of repetition.
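The relevant service-context settings can be sketched as follows (values shown are illustrative defaults, not tuning recommendations):

```
service {
    batch-index-threads 4             # one batch queue per thread
    batch-max-buffers-per-queue 255   # per-queue buffer limit; when all
                                      # queues exceed it, requests are rejected
}
```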
Bin
{NAMESPACE} bin-name quota full - can’t add new bin-name
Max number of bins per namespace (65535) has been reached. Prior to Database 4.9: Max number of bins per namespace (32767) has been reached. See How to clear up bin names when they exceed the limits.
NAMESPACE
The namespace that has reached the maximum number of bins.
Clustering
evicted from cluster by principal node BB9030011AC4202
The paxos principal node has determined that this node is not a valid cluster member. See the log on the principal for more information.
principal
Node ID of the paxos principal node
ignoring node: bb9030011ac4202 - exceeding maximum supported cluster size 1
The cluster has exceeded the asdb-cluster-nodes-limit
in the features.conf
file. Reduce the number of nodes attempting to cluster or install the correct license.
node
Node that is attempting to start cluster.
ignoring paxos accepted from node BB9030011AC4202 - it is not in acceptor list
A paxos clustering message from another node was ignored. This can result from network stability issues.
node
Node sending the invalid paxos message
ignoring paxos accepted from node BB9030011AC4202 with invalid proposal id
A paxos clustering message from another node was ignored. This can result from network stability issues.
node
Node sending the invalid paxos message
ignoring stale join request from node BB9030011AC4202 - delay estimate 83108(ms)
A paxos clustering message from another node was ignored. This can result from network stability issues causing delayed delivery of packets.
node
Node sending the invalid paxos message
delay estimate
Estimated delay of the message in milliseconds.
Config
CRITICAL (config): (features_ee.c:184) trailing garbage in /etc/aerospike/features.conf, line 21
The feature-key-file has been tampered with. Replace the file with the original provided by Aerospike. See Aerospike fails to start due to corrupted features.conf file - trailing garbage error.
failed CONFIG_CHECK check - MESSAGE
A configuration best-practice was violated at startup.
CONFIG_CHECK
Name of the check that was violated.
MESSAGE
Description of how the best-practice was violated.
failed best-practices checks - see 'https://docs.aerospike.com/operations/install/linux/bestpractices'
Indicates failed best practices. This message follows a set of warnings for each best practice that was violated. Becomes a WARNING when enforce-best-practices is set to false.
invalid feature key signature
The key file has been modified and no longer matches the digital signature. The key file must be exactly the same as when it was downloaded. Seemingly harmless operations, like cutting and pasting the file’s contents, can change the file and invalidate it.
Drv_mem
bad set-id x
Indicates an issue with the specified set ID, likely indicating a configuration or referencing error.
x
set ID
bad size x
Indicates an issue with the specified size, likely outside acceptable parameters.
x
size
bad void-time x
Indicates an issue with the specified void-time, likely indicating an error in expiration or deletion timing.
x
void time
bad wblock-id x
Indicates an issue with the specified write block ID.
x
write block ID
can't get set index x from vmap
Indicates a failure to retrieve the specified set index from the version map.
x
set index
community edition called resume_devices()
The Community Edition of Aerospike does not allow the resume_devices() function.
device encrypted but no encryption key file configured
Indicates a device is encrypted but lacks configuration for an encryption key file.
device has AP partition versions but 'strong-consistency' is configured
Indicates a configuration issue where AP partitions exist despite strong consistency being enabled.
device has CP partition versions but 'strong-consistency' is not configured
Indicates a configuration issue where CP partitions exist without strong consistency enabled.
device has 'single-bin' data but 'single-bin' is no longer supported
Indicates a configuration issue where the device contains data in a format no longer supported.
device not encrypted but encryption key file x is configured
Indicates a non-encrypted device has an encryption key file configured unnecessarily.
x
path to encryption key file
devices are smaller than memory stripes
The device size is smaller than the shared-memory stripe size.
drive-name: Aerospike device has old format - must erase device to upgrade
Aerospike device is using an old format and needs to be erased for an upgrade.
drive-name: bad device-id x
Indicates an issue with the device ID on the Aerospike device.
x
device ID
drive-name: bad n-devices x
Indicates an issue with the number of devices specified for the Aerospike device.
x
number of devices
drive-name: bad pristine offset x
Indicates an issue with the pristine offset on the Aerospike device. Pristine blocks are blocks the Aerospike server has never written to before.
x
pristine offset
drive-name: can't change write-block-size from x to y
Indicates an issue changing the write block size on the Aerospike device.
x
previous size
y
target size
drive-name: previous namespace x now y - check config or erase device
Indicates a namespace name change on the Aerospike device and suggests checking configuration or erasing the device.
x
previous namespace name
y
current namespace name
drive-name: random signature is 0
Indicates a problem where the random signature expected on the Aerospike device is 0.
drive-name: unknown version x
Aerospike device is using an unknown version.
x
version number
encryption key or algorithm mismatch
Indicates a mismatch between the encryption key or algorithm configured and what is required.
existing memory stripe size differs from config
Indicates a mismatch between the configured memory stripe size and the existing memory stripe size. To resolve this, configure data-size.
generation 0
Displays when a record is deleted and removed from the tree but not freed.
hit stop-writes limit before drive scan completed
The stop-writes limit is a user-configurable limit to prevent more data being written than is available on the device.
mprotect (address, length, protection) failed: x, (y)
Indicates a failure in setting memory protection for a region of memory.
x
error number
y
error description
{namespace} can't add record to index
Indicates a failure to add a record to the index in the specified namespace.
{namespace} could not allocate x-byte shmem stripe
Indicates a failure to allocate the specified amount of shared memory for a stripe in the namespace.
x
number of bytes
namespace drive set with unmatched headers - devices x & y have different device counts
Indicates a configuration issue where two devices in a namespace drive set have different device counts.
x
device name
y
device name
namespace drive set with unmatched headers - devices x & y have different signatures
Indicates a configuration issue where two devices in a namespace drive set have different signatures.
x
device name
y
device name
{namespace} loaded: objects 12345 device-pcts (20, 30, 25)
Objects that have been loaded during a cold start, and the percentage of each device that has been scanned. The device percentages are listed in the same order as the devices are defined in the config file.
{namespace} loaded: objects 12345 tombstones 77 device-pcts (20, 30, 25)
Objects and tombstones that have been loaded during a cold start, and the percentage of each device that has been scanned. The device percentages are listed in the same order as the devices are defined in the config file.
{namespace}: no keys available for namespace
Indicates a lack of available keys for the specified namespace, preventing operations.
{namespace} no keys available to create stripe
Indicates a lack of available keys to create a stripe in the specified namespace.
{namespace} shmem stripe and device shadow-name have different device counts
Indicates a device count mismatch between shared memory stripe and device in the specified namespace.
{namespace} shmem stripe and device shadow-name have different signatures
Indicates a signature mismatch between shared memory stripe and device in the specified namespace.
{namespace_name}: could not allocate 17179869184-byte shmem stripe
Aerospike server is attempting to pre-allocate in-memory data storage in shared memory, either as stripes (1/8th of data-size) or as a mirror of the storage-backed persistence (filesize or device size), and kernel.shmmax or kernel.shmall is set too low. See configuring namespace storage.
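As a rough check based on the 1/8th-of-data-size stripe rule above, the failed 17179869184-byte allocation corresponds to a 16 GiB stripe, implying a 128 GiB data-size. A sketch (the data-size value is a placeholder):

```python
# Illustrative sizing check: a shmem stripe is 1/8th of data-size.
GIB = 1024 ** 3
data_size = 128 * GIB          # hypothetical data-size from aerospike.conf
stripe = data_size // 8        # one of 8 shared-memory stripes
page_size = 4096               # typical Linux page size

print(stripe)                  # bytes per stripe; kernel.shmmax must be at least this
print(stripe >= 17179869184)   # matches the failed allocation in the example
print(data_size // page_size)  # minimum kernel.shmall (in pages) to cover all stripes
```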
random key generation failed
Indicates a failure in generating a random key.
s: not an Aerospike device but not erased - check config or erase device
Warns that the device is not recognized as an Aerospike device but has not been erased.
shadow-name: DEVICE FAILED open: errno x (y)
Indicates a failure to open the specified device with an error number and description.
x
error number
y
error description
shadow-name: DEVICE FAILED write: errno x (y)
Indicates a write failure on the specified device with an error number and description.
x
error number
y
error description
shadow-name: read failed at all sizes from x to y bytes
Indicates a failure to read the shadow-name at any size within the specified byte range.
x
byte range start
y
byte range end
shadow-name: read failed: errno x (y)
Indicates a read failure on the shadow-name device, providing an error number and description.
x
error number
y
error description
unable to open device x: error y
Displays the error code giving information about the reason the device could not be opened.
x
device name
y
error number
unable to open file x: error y
Displays the error code giving information about the reason the file could not be opened.
x
path to file
y
error number
unable to truncate file: errorno x
This error appears when a shadow file could not be sized properly. See ftruncate() for more information.
x
ftruncate() error code
x - load bins failed
Indicates a failure in loading bins for the specified process or identifier.
x
error number
x - unpack bins failed
Indicates a failure in unpacking bins for the specified process or identifier.
x
error number
x found all y devices fresh during warm restart
Shown when fresh headers were found during warm restart.
x
Namespace name.
y
number of devices
x size y must be greater than header size z
Shown when the file size is less than or equal to the drive header size. To resolve this, configure filesize.
x
Either “file” or “usable device” depending on whether files or devices are used for backing storage in the memory namespace.
y
file size
z
header size
Drv_pmem
get_key: failed pmem_read_record()
Aerospike was not able to read the record from storage, which may indicate a hardware failure. See What is the expected behaviour when an Aerospike node experiences an SSD hardware failure? for more information.
load_bins: failed pmem_read_record()
Aerospike was not able to read the record from storage, which may indicate a hardware failure. See What is the expected behaviour when an Aerospike node experiences an SSD hardware failure? for more information.
load_n_bins: failed pmem_read_record()
Aerospike was not able to read the record from storage, which may indicate a hardware failure. See What is the expected behaviour when an Aerospike node experiences an SSD hardware failure? for more information.
{namespace} loaded: objects 12345 device-pcts (20, 30, 25)
Objects that have been loaded during a cold start, and the percentage of each device that has been scanned. The device percentages are listed in the same order as the devices are defined in the config file.
{namespace} loaded: objects 12345 tombstones 77 device-pcts (20, 30, 25)
Objects and tombstones that have been loaded during a cold start, and the percentage of each device that has been scanned. The device percentages are listed in the same order as the devices are defined in the config file.
(namespace): (namespace.c:550) set-id 1 - negative device bytes!
A statistic was not correctly initialized during a warm restart, and Aerospike server believes the PMEM storage has a negative size. This error may be printed once for every record held on the storage during the warm restart. This error poses no risk to the data stored on the device, and has been resolved in hotfix AER-6478, which is available in Aerospike Database 5.2.0.36, 5.3.0.26, 5.4.0.21, 5.5.0.19, 5.6.0.13, and later.
{namespace_name} out of space
Indicates a shortage of free storage blocks. See How do I recover from Available Percent Zero?. To save log space, this message is logged with a (repeated: nnn)
prefix just once per ticker-interval
during periods of repetition.
namespace
Namespace being written to
{namespace_name} write: size 9437246 - rejecting 1142f0217ababf9fda5b1a4de66e6e8d4e51765e
Most likely appearing as a result of exceeding the write-block-size. The record’s digest is the last item in the log entry.
namespace
Namespace being written to
size
Total size of the record that was rejected
Drv_ssd
(arenax): (arenax_ee.c:98) too many chunks
This error occurs when partition-tree-sprigs has been misconfigured for an all-flash namespace. To size all-flash installations adequately, you must account for the number of partition-tree-sprigs when you estimate the size of the index. Each sprig uses no more than a single 4 KiB chunk. Also, verify that all index entries are stored at the desired fill fraction, which defines the level to which a sprig is filled. This allows for some expansion without overfilling and consuming more than one chunk per sprig. See the Capacity Planning Guide for more information.
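As a rough illustration of the sizing note above (assumed model: 4096 partitions per namespace, one 4 KiB chunk per sprig, scaled by the fill fraction; the sprig count and fill fraction are placeholders):

```python
# Hypothetical all-flash mount sizing sketch for sprig overhead.
PARTITIONS = 4096            # partitions per namespace
CHUNK = 4 * 1024             # each sprig uses at most one 4 KiB chunk

sprigs_per_partition = 256   # placeholder partition-tree-sprigs value
fill_fraction = 0.8          # placeholder target fill level

min_mount_bytes = PARTITIONS * sprigs_per_partition * CHUNK / fill_fraction
print(round(min_mount_bytes / 1024**3, 2), "GiB of mount space for sprigs")
```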
can't add record to index
For the index-type flash
configuration, indicates that the mount points have run out of space. You may need to delete the arena files manually and run fsck
on the disk partitions.
defrag_move_record: couldn’t get swb
Indicates the node has a shortage of free storage blocks. See How do I recover from Available Percent Zero?.
device /dev/sdb - swb buf valloc failed
Indicates a shortage of memory. Make sure the nodes have enough memory.
device
Device that Aerospike was trying to read or write to at the time of the error.
device /dev/sdb: read complete: UNIQUE 20401681 (REPLACED 5619021) (OLDER 11905062) (EXPIRED 0) (EVICTED 0) (UNPARSABLE 4) records
device
Name of the device for which the following stats apply. The stats are a summary of a cold start that read the entire device.
UNIQUE
Total number of unique records loaded from this device.
REPLACED
Number of records that replaced a version loaded earlier during the device scan (won the conflict resolution).
OLDER
Number of records that were skipped because a newer version was loaded earlier during the device scan (lost the conflict resolution).
EXPIRED
Number of records skipped because they were expired.
EVICTED
Number of records skipped because they were evicted.
UNPARSABLE
Number of records skipped because they were unparsable.
device /dev/sdb: read_complete: added 0 expired 0
device
Name of the device for which the following stats apply. The stats are a summary of a cold start that read the entire device.
added
Total number of unique records loaded from this device.
expired
Number of records skipped because they were expired.
device /dev/sdd defrag: rblock_id 952163461 generation mismatch (4:3) :0xc0b12ffb353e0385179f39c85edc7791264f11aa
This can occur when the generation value for the index has been advanced while the generation value for the record on the drive has not advanced. This happened in some very old Aerospike Database versions. The error indicates that defrag found this discrepancy, and it should appear only once as the defrag process resolves the discrepancy.
device
Disk where this occurred.
rblock_id
Identifier of the disk block being defragged.
mismatch
Generation of the record in the index and on disk, respectively.
digest
Digest of the record in question.
device has AP partition versions but 'strong-consistency' is configured
A namespace has data that was written while it was in AP mode, but is being started in SC mode. This is not permitted. See strong-consistency for details.
device not encrypted but encryption key file /etc/aerospike/key-256.dat is configured
This happens when a block device has its header configured using asd without encryption enabled. This can happen if you:
- zeroized the device.
- started asd without enabling encryption in aerospike.conf.
- took down asd.
- restarted asd with encryption enabled in aerospike.conf.

To resolve this and use encryption, zeroize again and reinitialize the drive:
- Take down asd.
- Zeroize the device again.
- Start asd with encryption enabled in aerospike.conf.
/dev/nvme0n1p1: bad device-id 3192497567
Appears in cases of device corruption or if a device was not erased before starting the Aerospike service. See SSD Initialization and SSD Setup for details and recommended steps. In case of system log corruption, the device might need to be replaced.
/dev/nvme0n1p1: read failed: errno 5 (Input/output error)
Indicates an error during a system call to the storage device. Depending on the transaction path where this occurs, the database could abort if the integrity of the underlying data is not known (typically on write transactions). In case of corruption seen on system logs, the firmware version should be checked or the device might need to be replaced.
/dev/nvme3n1 init defrag profile: 3,1,1,2,0,0,1,0,3,1,0,0,0,0,1,1,2,0,0,1,1,1,0,0,0,0,2,1,3,4,2,1,1,2,2,3,1,4,3,3,3,4,2,1,3,3,2,0,4,2
At startup, the initial defragmentation profile for a device. This represents the number of blocks at each fill percentage, up to the defrag-lwm-pct. In this example, 50 buckets (default defrag-lwm-pct set to 50) are shown: 3 blocks that are less than 1% full, 1 that is between 1 and 2% full, and so on, up to 2 blocks that are between 49 and 50% full (last number). Blocks that are more than 50% full are not eligible to be defragmented.
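The profile line can be tallied with a short script (a sketch that assumes the exact layout shown in the example above):

```python
# Parse the init defrag profile from the example log line and total the
# wblocks that are below defrag-lwm-pct (i.e., initially eligible for defrag).
line = ("/dev/nvme3n1 init defrag profile: "
        "3,1,1,2,0,0,1,0,3,1,0,0,0,0,1,1,2,0,0,1,1,1,0,0,0,"
        "0,2,1,3,4,2,1,1,2,2,3,1,4,3,3,3,4,2,1,3,3,2,0,4,2")
buckets = [int(n) for n in line.split("profile:")[1].split(",")]

print(len(buckets))   # one bucket per percent, up to defrag-lwm-pct (50 here)
print(sum(buckets))   # total wblocks initially eligible for defragmentation
```

In this example the total comes to 75, which lines up with the defrag-q 75 reported for the same device in the adjacent init wblocks example.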
/dev/nvme3n1 init wblocks: pristine-id 1464407 pristine 109155 free-q 591298, defrag-q 75
At startup, the status of unwritten blocks, free blocks, and blocks on the defrag queue. The sum of the free-q and pristine counts indicates the total space available for writing (the number of free wblocks).
pristine-id
The ID of the first unwritten (pristine) block on the disk.
pristine
The number of completely unwritten (pristine) blocks on the disk.
free-q
The number of blocks that have been through the defragmentation process and are available to be re-written.
defrag-q
The number of blocks that are awaiting defragmentation on the defrag queue.
(drv_ssd.c:2436) bad end marker for {digest}
This can occur on a subsequent cold restart after the initial upgrade to Database 6.x if you zeroized only the first 8 MB of the headers and didn’t erase the full device before upgrading. This error is generally not critical as any records would be previous versions of records that should have been written anew during the first migration on the node after the upgrade.
encryption key or algorithm mismatch
At startup, this error indicates that the encryption key file
used previously to encrypt data on the storage device does not match the file currently provided.
error: block extends over read size: foff 5242880 boff 1047552 blen 1392
The device likely has a bad sector. If this issue occurs frequently, replace the device.
foff
Offset of file containing malformed block.
boff
Offset of malformed block.
blen
Length of malformed block.
get_key: failed as_storage_record_read_ssd()
Symptom of having run out of storage space. Resolved by a cold start.
get_key: failed ssd_read_record()
Aerospike was not able to read the record from storage. This may indicate a hardware failure. See What is the expected behaviour when an Aerospike node experiences an SSD hardware failure? for more information.
load_bins: failed ssd_read_record()
Aerospike was not able to read the record from storage. This may indicate a hardware failure. See What is the expected behaviour when an Aerospike node experiences an SSD hardware failure? for more information.
load_n_bins: failed ssd_read_record()
Aerospike was not able to read the record from storage. This may indicate a hardware failure. See What is the expected behaviour when an Aerospike node experiences an SSD hardware failure? for more information.
metadata mismatch - removing <DIGEST_ID>
This message indicates the change in a primary index bit and was introduced as a fix to AER-6335. The issue would occur for XDR-enabled namespaces upgrading from a 4.9 Database prior to 4.9.0.19. The workaround is to proceed with a rolling cold-restart.
namespace NS waiting for defrag: 5 pct available, waiting for 10 ...
The node is stuck in a defrag loop at startup, where it cannot defragment enough to raise the available percentage to the defrag-startup-minimum.
namespace
The namespace impacted.
avail pct
Current available percent.
required pct
Target available percent to reach in order to start up.
{namespace} loaded: objects 12345 device-pcts (20, 30, 25)
Objects that have been loaded during a cold start, and the percentage of each device that has been scanned. The device percentages are listed in the same order as the devices are defined in the config file.
{namespace} loaded: objects 12345 tombstones 77 device-pcts (20, 30, 25)
Objects and tombstones that have been loaded during a cold start, and the percentage of each device that has been scanned. The device percentages are listed in the same order as the devices are defined in the config file.
{namespace} out of space
Indicates a shortage of free storage blocks. See How do I recover from Available Percent Zero?. To save log space, this message is logged with a (repeated: nnn)
prefix just once per ticker-interval
during periods of repetition.
{namespace}
Namespace being written to.
{namespace_name} defrag: drive Drive1 totally full - waiting for vacated wblocks to be freed
Indicates that the node has no free storage blocks. In this case, the defragmentation process waits until a free block is available. See How do I recover from Available Percent Zero? for more information. To save log space, this message is logged with a (repeated: nnn) prefix just once per ticker-interval during periods of repetition.
{namespace_name}
Affected namespace name.
drive
Affected storage device name.
{namespace_name} device /dev/sda prior shutdown not clean
Indicates that the previous shutdown was not trusted. The node must perform a cold start.
{namespace_name} /dev/sda: used-bytes 296160983424 free-wblocks 885103 write-q 0 write (12659541,43.3) defrag-q 0 defrag-read (11936852,39.1) defrag-write (3586533,10.2) shadow-write-q 0 tomb-raider-read (13758,598.0)
{namespace}
Name of the namespace the device and stats belong to.
/dev/sda
Name of the device for which the following stats apply.
used-bytes
Number of bytes on this device that are in use. Corresponds to the storage-engine.device[ix].used_bytes
statistic.
free-wblocks
The number of wblocks that are free (the device_available_pct
). Corresponds to the storage-engine.device[ix].free_wblocks
statistic.
write-q
Number of write buffers pending to be written to the SSD. When this reaches the max-write-cache
configured value (default 64M), ‘device overload’ errors are returned and ‘queue too deep’ warnings are printed on the server log. Corresponds to the storage-engine.device[ix].write_q
statistic.
write
Total number of SSD write buffers written to this device since the Aerospike server started, including defragmentation, and the number of write buffers written per second. Corresponds to the storage-engine.device[ix].writes
statistic. Does not include partial flushes at flush-max-ms.
defrag-q
Number of wblocks pending defragmentation. These are blocks that have fallen below the defrag-lwm-pct
and are waiting to be read and have their relevant content recombined in a fresh streaming write buffer. The defrag-sleep
setting controls the sleep period in between each block being read (default 1ms). Corresponds to the storage-engine.device[ix].defrag_q
statistic.
defrag-read
Total number of write blocks that have been sent to the defragmentation queue (defrag-q) and read by the defragmentation thread on this device, and the normalization to the average number of wblocks processed per second during the interval at which this message is logged. Usually, defrag-q is at 0 and wblocks are read as they are put on the defrag-q. In such cases, defrag-read represents the number of wblocks read by the defragmentation thread. Corresponds to the storage-engine.device[ix].defrag_reads
statistic.
defrag-write
Total number of write blocks written by defragmentation on this device since the Aerospike server started, and the number of wblocks written per second (subset of write). Corresponds to the storage-engine.device[ix].defrag_writes
statistic.
shadow-write-q
Number of write buffers pending to be written to the shadow device (only printed when a shadow device is configured). When this reaches the configured max-write-cache
value (default 64M), ‘device overload’ errors are returned and queue too deep
warnings are printed to the server log. Corresponds to the storage-engine.device[ix].shadow_write_q
statistic.
tomb-raider-read
Total number of blocks read by the tomb-raider in the current cycle, and the current number of wblocks read per second. Only printed when the tomb-raider is active.
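A minimal sketch for extracting a few of these fields from the ticker line (the regex assumes the exact layout shown in the example above):

```python
import re

line = ("{namespace_name} /dev/sda: used-bytes 296160983424 free-wblocks 885103 "
        "write-q 0 write (12659541,43.3) defrag-q 0 defrag-read (11936852,39.1) "
        "defrag-write (3586533,10.2) shadow-write-q 0 tomb-raider-read (13758,598.0)")

# Capture device name, used bytes, free wblocks, and the write stats pair.
m = re.search(
    r"(?P<dev>/dev/\S+): used-bytes (?P<used>\d+) free-wblocks (?P<free>\d+) "
    r"write-q (?P<wq>\d+) write \((?P<writes>\d+),(?P<wps>[\d.]+)\)", line)

print(m.group("dev"), int(m.group("used")), int(m.group("free")))
print(float(m.group("wps")))   # write buffers flushed per second
```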
{namespace_name} durable delete fail: queue too deep: exceeds max 544
Indicates that the disks are not keeping up with the load placed upon them, although the disks themselves are not necessarily faulty or nearing end of life. See Why do I see warning - queue too deep for more information.
{namespace_name} immigrate fail: queue too deep: exceeds max 576
Indicates that the disks are not keeping up with the load placed upon them, although the disks themselves are not necessarily faulty or nearing end of life. See Why do I see warning - queue too deep for more information.
{namespace_name} read_ssd: invalid rblock_id :0x0c0008d663318a674a7bd379f6efd3bb1f55141d
Indicates that a record with an invalid read-block is being read. This can happen if a node runs out of memory and swb cannot be allocated. It can also happen on some earlier database versions if a node runs out of device_available_pct
before checking upfront for the available free blocks. A cold start of the node should resolve the issue.
NAMESPACE
Namespace the record with invalid read-block resides in.
rblock_id
Digest of the record with invalid read-block.
{namespace_name} udf fail: queue too deep: exceeds max 512
Indicates that the disks are not keeping up with the load placed upon them, causing UDF writes to fail. See Why do I see warning - queue too deep for more information.
{namespace_name} write fail: queue too deep: exceeds max 512
Indicates that the disks are not keeping up with the load placed upon them, although the disks themselves are not necessarily faulty or nearing end of life. See Why do I see warning - queue too deep for more information.
{namespace_name} write: size 9437246 - rejecting 1142f0217ababf9fda5b1a4de66e6e8d4e51765e
Most likely appears as a result of exceeding the write-block-size. The record’s digest is the last item in the log entry.
namespace
Namespace being written to.
size
Total size of the record that was rejected.
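A quick arithmetic check (illustrative; 8 MiB is the historical maximum write-block-size) shows why a 9437246-byte record cannot fit in any write block:

```python
# The rejected record in the example exceeds even the largest write-block-size.
record_size = 9437246            # size from the log line above
max_wbs = 8 * 1024 * 1024        # 8 MiB, historical maximum write-block-size

print(record_size > max_wbs)     # record does not fit in any write block
print(record_size - max_wbs)     # bytes over the limit
```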
read: bad block magic offset 303403269632
The SSD is corrupted and the expected value of a block does not match the actual value. This issue has three potential causes:
- The names of raw devices were swapped on server reboot. For example, the storage pointed to by /dev/sda is now /dev/sdb. See How to configure storage device to use the disk WWID? to learn about configuring persistent device names.
- The storage-engine configuration in aerospike.conf has changed to reorder devices.
- Hardware failure and data corruption.
offset
Location in the device where the mismatch was noticed.
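To make device names stable across reboots, reference devices by a persistent identifier rather than the kernel-assigned name. A sketch (the WWID shown is a placeholder):

```
storage-engine device {
    # Persistent path from /dev/disk/by-id/ instead of /dev/sda,
    # so kernel device reordering at boot cannot swap storage.
    device /dev/disk/by-id/wwn-0x5000000000000001
}
```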
read failed: expected 512 got -1: fd 9900 data 0x7f277981b000 errno 5
Aerospike tried to read 512 bytes from the device and the read() call returned -1 (error) with errno 5, which is an I/O error. This indicates a potential hardware issue.
expected
Size of block requested from disk.
fd
File descriptor used for read.
data
Memory address of the read buffer.
read_all: failed as_storage_record_read_ssd()
Result of having run out of storage space. Resolved by a cold start.
ssd_read: record b98946e3d616790 has no block associated, fail
This is a result of having run out of space. The record was written to the index but not flushed to disk, creating this inconsistency. You can resolve this issue with a cold start.
record
The record’s hashed key.
write bins: couldn’t get swb
Indicates a shortage of free storage blocks. See How do I recover from Available Percent Zero?.
write: size 9437246 - rejecting digest:0xd751c6d7eea87c82b3d6332467e8bc9a3c630e13
Appears with the WARNING
message about failed as_storage_record_write()
for exceeding the write-block-size
.
size
Total size of the record that was rejected.
digest
Digest of the record that was rejected.
Exchange
blocking client transactions in orphan state!
The node is not currently part of any cluster and will not accept any transactions from clients that are connected to it, meaning clients cannot access any partitions on the node.
error sending exchange data
Failure exchanging partition maps with another node because the node is down or disconnected from the network and unable to receive messages.
received duplicate exchange data from node 783f4ac2fbb57e81
Another node has resent exchange data because it did not receive an acknowledgment from this node within half the heartbeat interval. This is most likely due to a networking issue.
node
NodeID of the origin of the unacknowledged exchange data
received duplicate ready to commit message from node 783f4ac2fbb57e81
A node has resent the ready to commit
message because it did not receive an acknowledgment from this node within half the heartbeat interval. This is most likely due to a networking issue between nodes in the cluster. Those ready to commit
messages are sent by each node to the principal node when the exchange data
has been completed on the node. The principal node also follows this pattern and sends itself such a ready to commit
message. On a given node, the exchange data
is done when the node has successfully sent its partition map to all the nodes in the cluster as well as received the partition map from each node in the cluster. This requires each node to ack back that it has received the exchange data
. Only then would a node tell the principal that it is ready to commit
. While the principal is waiting to receive the ready to commit
from some nodes, other nodes would keep sending their ready to commit
as they are ready and waiting. Therefore, the nodes for which this message is seen are the ones that are ready; the other nodes are the ones potentially having issues completing their exchange data.
node
Node ID of the node that is continuing to send the ready to commit message as it is waiting for the principal to acknowledge.
Exp
build_cond - error 4 mismatched type <n> (TYPE_NAME) expected type <m> (<TYPE_NAME>) at default condition
Example: WARNING (exp): (exp.c:2168) build_cond - error 4 mismatched type 5 (map) expected type 0 (nil) at default condition.
In this example the
cond
expression was illegally invoked by the client with mixed return types. The cond
expression requires that all return types be the same with the exception of the unknown
type.
predexp deprecated - use new expressions API
Indicates use of the deprecated Predicate Expressions
API, which was replaced in Database 5.2 by the Aerospike Expressions
API. The deprecated API is removed in Database 6.0. This warning is logged no more than once per log ticker cycle.
Fabric
error creating fabric published endpoint list
This issue can occur in the case of a failure in network interface initialization. Check the system logs for time of issue and verify the fabric network interface was initialized properly.
msg_read: could not deliver message type 1
Heartbeat message handler is not registered on the node. The message should disappear soon after the node joins the cluster. Otherwise, restart the asd
service on the node.
message type
Internal code.
no IPv4 addresses configured for fabric
This issue can occur in the case of a failure in network interface initialization. Check the system logs for time of issue and verify the fabric network interface was initialized properly.
node b5 fds {via_connect={rw=8 ctrl=1 bulk=2 meta=1} all=24} live 1 q {rw=0 ctrl=0 bulk=0 meta=0}
Provides details on the number of fabric connections to each node in the cluster as well as pending outbound messages for each.
This log line is generated when the info command dump-fabric:verbose=true
is issued.
node
The node to which this node is connected through fabric.
fds rw= ctrl= bulk= meta=
The number of file descriptors to the node, broken up by channel: rw (replica writes, including proxies and duplicate resolution), ctrl (clustering control), bulk (migrations), and meta (system metadata, or SMD).
fds all=
The total number of file descriptors to the node, typically double the sum of rw, ctrl, bulk and meta as each node establishes connections for each channel from each side.
q rw= ctrl= bulk= meta=
The number of pending outbound messages to the node, broken up by channel.
r_msg_sz > sizeof(fc->r_membuf) 1048582
This is from the run_fabric_accept thread rather than the run_fabric_recv thread. This thread is responsible for accepting new connections. These messages are expected to carry only the NodeID and the ChannelID, which amount to just a few bytes and should never exceed 1MiB. This WARNING likely indicates non-Aerospike traffic on the fabric port. Verify that the fabric port (default 3001) is not exposed.
r_msg_sz
Internal code.
Flat
record too small 0
Indicates that an inbound record for migration is corrupt. Appears with the WARNING
message about handle insert: got bad record
and is documented in Why Do I See Stalled Migrations And "record too small" Errors In The Log?
Hardware
(skew_monitor.c:685) node bb90adf81f64eb0 HLC not in sync - hlc 112495319903567872 self-hlc 112644426202057622 diff 2275181556
If clock_skew_stop_writes is in effect, a strong-consistency namespace can hit stop-writes when cluster_clock_skew_ms exceeds the cluster_clock_skew_stop_writes_sec threshold.
failed to submit command to /dev/nvme0: x109
Indicates that the underlying NVMe device does not support the health information check. To suppress this message, set log level to critical for the hardware context.
Hb
Broken pipe (under normal network conditions)
This message appears before any namespace-specific shutdown messages in the normal shutdown sequence for mesh network nodes. It indicates that the node’s heartbeat is being stopped to prepare for prompt removal of the node from the cluster in case the shutdown process is lengthy and the node was not quiesced.
Found a socket 0x7f0979812460 without an associated channel.
This warning occurs in rare cases where a network failure causes two error events on the same socket. No action is required.
socket
HEX number identifying the socket within the OS.
Timeout while connecting
A peer node could not be reached on the heartbeat port. The node may be down or experiencing network issues.
closing mesh heartbeat sockets
This message appears early in the normal shutdown sequence for mesh network nodes, before any namespace-specific shutdown messages. It indicates that the node’s heartbeat is being stopped so the node is promptly removed from the cluster in case the shutdown process is lengthy and the node was not quiesced.
closing multicast heartbeat sockets
This message appears early in the normal shutdown sequence for multicast network nodes, before any namespace-specific shutdown messages. It indicates that the node’s heartbeat is being stopped so the node is promptly removed from the cluster in case the shutdown process is lengthy and the node was not quiesced.
could not create heartbeat connection to node - 10.219.136.101 {10.219.136.101:3012}
This warning can indicate that a cluster node is down. It is also possible that IP addresses in the cluster have changed, which can occur following a node restart. If the indicated node is expected to be part of the cluster, troubleshoot as normal. Otherwise, follow the tip-clear
and services-alumni-reset
steps from the node removal instructions to clear the errors.
When a node is expected to be part of the cluster and cannot be reached on the heartbeat port.
IP:port
IP and heartbeat port of the host that could not be reached.
error allocating space for [multicast/mesh] recv buffer of size 1195725862 on fd 773
Indicates that non-Aerospike traffic may be sending data to the heartbeat port on this machine and some bits have been misinterpreted in the Aerospike protocol as a large buffer size. Such requests can lead to denial of service or disruption of cluster traffic if they are frequent. We recommend that you block access to the heartbeat port from outside the cluster.
size
Size of the message received (or the interpretation of a size).
fd
File descriptor of the socket the non-Aerospike message was received on.
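The implausible size often decodes to the ASCII bytes of foreign protocol traffic. As an illustrative sketch (not the server's actual parsing code), interpreting the start of an HTTP request as a big-endian 32-bit length field yields a number of the same magnitude as the one in the warning:

```python
import struct

# A parser that trusts the first 4 bytes as a big-endian length field
# reads the ASCII of an HTTP request line as a huge buffer size.
payload = b"GET / HTTP/1.1\r\n"
bogus_size = struct.unpack(">I", payload[:4])[0]
print(bogus_size)  # 1195725856 -- same magnitude as the size in the warning
```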
heartbeat TLS client handshake failed - 10.219.136.101 {10.219.136.101:3012}
The other node could be reached, but there was an error in setting up the TLS connection. Check the other log messages near this one for details.
address:ports
Address(es) of the peer where the handshake failed.
heartbeat TLS server handshake with 10.11.12.13:3012 failed
The other node could be reached, but there was an error in setting up the TLS connection. Check the other log messages near this one for details.
address:port
Address of the peer where the handshake failed.
ignoring delayed heartbeat - expected timestamp less than 1483213248012 but was 1483213252345 from node: BB9020011AC4202
A received heartbeat message was not generated during the last heartbeat interval, which may indicate clock skew across the cluster.
expected ts
Latest timestamp expected (in milliseconds since the Aerospike Epoch of 2010-01-01 00:00:00)
actual ts
The actual timestamp in the message.
node
Source of the message.
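Because these timestamps count milliseconds since the Aerospike epoch, converting them to wall-clock time only requires adding that offset. A minimal sketch, with a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

# Aerospike epoch: 2010-01-01 00:00:00 UTC (per the message description above).
AEROSPIKE_EPOCH = datetime(2010, 1, 1, tzinfo=timezone.utc)

def hb_ts_to_utc(ms_since_as_epoch: int) -> datetime:
    """Convert a heartbeat timestamp (ms since the Aerospike epoch) to UTC."""
    return AEROSPIKE_EPOCH + timedelta(milliseconds=ms_since_as_epoch)

# Values from the example message: the delta is the apparent skew.
expected, actual = 1483213248012, 1483213252345
print(hb_ts_to_utc(actual).isoformat())
print(actual - expected, "ms late")  # 4333 ms late
```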
ignoring message from BB9030011AC4202 with different cluster name(dev_cluster)
A node that is not part of the cluster is sending heartbeats. It is possible that the sending node has this node listed incorrectly as a seed node in the network.heartbeat
stanza of its aerospike.conf
.
node ID
ID of the node sending heartbeats.
cluster name
The cluster name that the sending node is presenting, which should help in identifying the node.
mesh size recv failed fd 361: Connection timed out
An incomplete heartbeat message was received because there was a timeout in the socket connection.
fd
ID number of the file descriptor the message was received on.
error message
Error message returned from the OS.
sending mesh message to BB9030011AC4202 on fd 361 failed : Broken pipe
The node sent a heartbeat message to a peer node after the connection broke. The first parameter may be 0 rather than a node ID if the node has not been able to retrieve the remote node ID.
Under rare circumstances, particularly in large clusters after a cluster event, the server may continue to use a heartbeat connection that is no longer usable, which repeatedly generates the Broken pipe message. In this situation, the error occurs on the same file descriptor after the following error:
sending mesh message to bb9030011ac4202 on fd 361 failed : Connection reset by peer
Note: if the error keeps occurring on the same file descriptor, the node is not able to re-join the cluster and must be restarted.
node
Node that should have received the message.
fd
ID number of the file descriptor the message was sent on.
error message
Error message returned from the OS.
sending mesh message to BB9030011AC4202 on fd 361 failed : No route to host
Tried to send a heartbeat message to a peer, but the socket had some problem.
node
Node that should have received the message.
fd
ID number of the file descriptor the message was sent on.
error message
Error message returned from the OS.
unable to parse heartbeat message on fd 361
Received a malformed message on the heartbeat port, possibly due to a non-Aerospike process unintentionally or maliciously trying to connect on the port.
fd
ID number of the file descriptor the message was received on.
updating mesh endpoint address from {10.219.136.101:3002} to {10.219.136.101:3002,10.219.148.101:3002}
This info message indicates that the local node connected to one IP but received two in return. This is common in situations where nodes have multiple NICs and an address is not specified in the network.heartbeat
context.
original address
Address connected to initially.
new addresses
List of addresses advertised by the node connected to.
Index
could not allocate 1073741824-byte arena stage 13: Cannot allocate memory
Aerospike Enterprise edition allocates memory in 1GiB arenas for the primary index. This message indicates that the process failed to allocate the 13th arena. A contiguous 1GiB of memory should be available. The process continues to serve read
and update
transactions, but all new writes fail. This message is always followed by the warning:
WARNING (index): (index.c:737)(repeated:xxx)arenax alloc failed
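As a rough sizing aid: each arena stage is 1GiB and each primary index entry occupies 64 bytes, so the failing stage number indicates how much index memory was already committed. A back-of-the-envelope sketch (treat the results as estimates, not exact accounting):

```python
ARENA_STAGE_BYTES = 1 << 30   # 1 GiB per arena stage
INDEX_ENTRY_BYTES = 64        # size of one primary index entry

def records_per_stage() -> int:
    """How many index entries fit in one 1 GiB arena stage."""
    return ARENA_STAGE_BYTES // INDEX_ENTRY_BYTES

def index_bytes_before_failure(failed_stage: int) -> int:
    """Failing to allocate stage N means stages 1..N-1 were already in use."""
    return (failed_stage - 1) * ARENA_STAGE_BYTES

print(records_per_stage())             # 16777216 entries per stage
print(index_bytes_before_failure(13))  # bytes committed before stage 13 failed
```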
Info
NODE-ID bb97f1d46894206 CLUSTER-SIZE 12 CLUSTER-NAME myCluster
NODE-ID
The generated node ID, based on the MAC address and the service port.
CLUSTER-SIZE
Number of nodes recognized by this node as being in the cluster.
CLUSTER-NAME
Name of the cluster. Appears as null
for unnamed clusters.
batch-index: batches (234,0,0) delays 0
Displayed periodically, every 10 seconds by default, and only if batch-index
transactions have been issued on this node. Message details are aggregated across all namespaces.
batches
Number of batch-index jobs since the server started (Success, Error, Timed out). Success means all the sub-transactions for the batch-index job were dispatched successfully. The sub-transactions could still error or time out individually, even if the parent batch-index job reported a success status. Likewise, an unsuccessful parent batch-index job can have some of its sub-transactions processed with any resulting status. No correlation can be made between the status of a parent batch-index job and the statuses of its sub-transactions.
delays
Number of times a job's response buffer transmission was delayed because the socket would block (WOULDBLOCK), to avoid overflowing the buffer.
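When grepping logs, the counter groups in this line can be split out mechanically. A minimal parsing sketch (the variable names are this example's own):

```python
import re

line = "batch-index: batches (234,0,0) delays 0"

# Pull the (success, error, timeout) triple and the delay counter.
m = re.search(r"batches \((\d+),(\d+),(\d+)\) delays (\d+)", line)
success, error, timeout, delays = map(int, m.groups())
print(success, error, timeout, delays)  # 234 0 0 0
```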
early-fail: demarshal 0 tsvc-client 1 tsvc-from-proxy 0 tsvc-from-proxy-batch-sub 0
Displayed periodically, every 10 seconds by default, and only when any transaction failed early on this node. Cumulative since asd
start and aggregated across all namespaces. Single-digit counts are likely not problematic. Prior to Database 7.2, the log line read early-fail: demarshal 0 tsvc-client 1 tsvc-from-proxy 0 tsvc-batch-sub 0 tsvc-from-proxy-batch-sub 0 tsvc-udf-sub 0 tsvc-ops-sub 0
demarshal
Failure during the demarshal phase of a transaction. Metric: demarshal_error
.
tsvc-client
Indicates a failure for client initiated transactions, before getting to the namespace part. This can be due to authentication failure at the socket level, a missing or bad namespace, or an initial partition imbalance caused by a node that just started and has not yet joined the cluster. The latter would result in an unavailable
error back to the client. Metric: early_tsvc_client_error
.
tsvc-from-proxy
Indicates a failure for proxied transactions, before getting to the namespace part. This can be due to authentication failure at the socket level, a missing or bad namespace, or an initial partition imbalance caused by a node that just started and has not yet joined the cluster. The latter would result in an unavailable
error. Metric: early_tsvc_from_proxy_error
.
tsvc-batch-sub
Part of a batch sub transaction. Metric: early_tsvc_batch_sub_error
.
tsvc-from-proxy-batch-sub
Part of a proxied batch sub transaction. Metric: early_tsvc_from_proxy_batch_sub_error
.
tsvc-udf-sub
Part of a UDF sub transaction. Metric: early_tsvc_udf_sub_error
.
tsvc-ops-sub
Part of a scan/query background ops sub transaction. Metric: early_tsvc_ops_sub_error
.
fabric-bytes-per-second: bulk (1525,7396) ctrl (33156,46738) meta (42,42) rw (128,128)
Fabric traffic statistics, displayed every 10 seconds by default.
bulk
Current transmit and receive rate for fabric-channel-bulk. This channel is used for record migrations during rebalance.
ctrl
Current transmit and receive rate for fabric-channel-ctrl. This channel distributes cluster membership change events and partition migration control messages.
meta
Current transmit and receive rate for fabric-channel-meta. This channel distributes System Meta Data (SMD) after cluster change events.
rw
Current transmit and receive rate for fabric-channel-rw (read/write). This channel is used for replica writes, proxies, duplicate resolution, and various other intra-cluster record operations.
fds: proto (38553,57711444,57672891) heartbeat (27,553,526) fabric (648,2686,2038)
proto
Client connections statistics, in order:
Includes connections that were reaped after going idle (corresponding to reaped_fds), shut down properly by the client (which initiated a socket close), or closed on preliminary packet-parsing errors such as unexpected headers; most of the latter also produce a WARNING in the logs.
heartbeat
Heartbeat connections statistics, in order:
fabric
Fabric connections statistics, in order:
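The triples in this line read as (currently open, opened since start, closed since start), which the example values above bear out: the first number equals the second minus the third. A sketch under that assumption:

```python
import re

line = ("fds: proto (38553,57711444,57672891) "
        "heartbeat (27,553,526) fabric (648,2686,2038)")

# Each triple appears to be (currently open, opened, closed),
# so the first value should equal opened minus closed.
open_fds = {}
for name, cur, opened, closed in re.findall(r"(\w+) \((\d+),(\d+),(\d+)\)", line):
    assert int(cur) == int(opened) - int(closed)
    open_fds[name] = int(cur)
print(open_fds)  # {'proto': 38553, 'heartbeat': 27, 'fabric': 648}
```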
heartbeat-received: self 887075 : foreign 35456447
self
Number of heartbeats the current node has received from itself (should be 0 for mesh).
foreign
Number of heartbeats the current node has received from all other nodes combined.
histogram dump: {ns-name}-{hist-name} (1344911766 total) msec (00: 1262539302) (01: 0049561831) (02: 0013431778) (03: 0007273116) (04: 0004299011) (05: 0003086466) (06: 0002182478) (07: 0001854797) (08: 0000312272) (09: 0000370715)
In the example, 05: 0003086466 means 3,086,466 data points took between 16 and 32 milliseconds. You can access additional histograms by enabling microbenchmarks or storage-benchmarks, statically or dynamically, in the service context of the configuration. See Monitoring latencies for details about the histograms.
Periodically printed to the logs, every 10 seconds by default.
histogram dump
Name of the histogram to follow for the {ns-name} namespace
total
Number of data points represented by this histogram (since the server started)
N
Number of data points whose value, in the histogram's units (e.g. msec or bytes), is greater than or equal to 2^(N-1) and less than 2^N for N>0, or between 0 and 1 for N=0
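Equivalently, bucket N covers the half-open range [2^(N-1), 2^N) in the histogram's units. A short sketch that expands a dump line into labeled ranges:

```python
import re

line = ("histogram dump: {test}-read (1344911766 total) msec "
        "(00: 1262539302) (01: 0049561831) (05: 0003086466)")

# Bucket N counts data points in [2^(N-1), 2^N); bucket 0 covers [0, 1).
ranges = {}
for n, count in re.findall(r"\((\d+): (\d+)\)", line):
    n = int(n)
    lo = 0 if n == 0 else 2 ** (n - 1)
    ranges[(lo, 2 ** n)] = int(count)
print(ranges[(16, 32)])  # 3086466 data points took 16-32 ms
```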
in-progress: info-q 5 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
tsvc-q
Removed in 4.7.0.2. Number of transactions sitting in the transaction queue, waiting to be picked up by a transaction thread. Corresponds to the tsvc_queue
statistic.
info-q
Number of transactions on the info transaction queue. Corresponds to the info_queue
statistic.
nsup-delete-q
Removed in 4.5.1. Number of records queued up for deletion by the nsup thread.
rw-hash
Number of transactions that are parked on the read write hash. This is used for transactions that have to be processed on a different node. For example, prole writes, or read duplicate resolutions (when requested through client policy). Corresponds to the rw_in_progress
statistic.
proxy-hash
Number of transactions on the proxy hash waiting for transmission on the fabric. Corresponds to the proxy_in_progress
statistic.
rec-refs
Removed in 3.10. Number of references to a primary key.
tree-gc-q
Introduced in 3.10. This is the number of trees queued up, ready to be completely removed (partitions drop). Corresponds to the tree_gc_queue
statistic.
long-queries
Introduced in version 6.1. Number of long queries currently active. Corresponds to the long_queries_active
statistic.
{<namespace>} data-usage: used-bytes 1073741824000 avail-pct 20
Displayed periodically for each namespace, every 10 seconds by default. This is a breakdown of the device usage if a storage-engine device
has been configured for the namespace.
used-bytes
Number of bytes used on disk for {ns_name} on the local node.
avail-pct
Minimum percentage of contiguous disk space in {ns_name} on the local node across all devices. Corresponds to the device_available_pct
statistic. Appears for storage-engine device.
cache-read-pct
Percentage of reads from the post-write cache instead of disk. Only applicable when {ns_name} is not configured for data in memory. Corresponds to the cache_read_pct
statistic.
{namespace_name} index-flash-usage: used-bytes 5502926848 used-pct 1 alloc-bytes 16384000 alloc-pct 92
Displayed periodically for each namespace configured with ‘index-type flash’, every 10 seconds by default.
{ns_name}
Name of the namespace the device and stats belongs to.
index-flash-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the primary index used by this namespace on this node.
used-pct
Percentage of the mount in use for the primary index used by this namespace on this node.
alloc-bytes
Total bytes allocated on the mount for the primary index used by this namespace on this node. This statistic represents entire 4KiB chunks that have at least one element in use. This statistic was introduced in 5.6. Corresponds to the index_flash_alloc_bytes
statistic.
alloc-pct
Percentage of the mount allocated for the primary index used by this namespace on this node. This statistic represents entire 4KiB chunks that have at least one element in use. This statistic was introduced in 5.6. Corresponds to the index_flash_alloc_pct
statistic.
{namespace_name} index-pmem-usage: used-bytes 5502926848 used-pct 1
Displayed periodically for each namespace configured with ‘index-type pmem’, every 10 seconds by default.
{ns_name}
Name of the namespace the index and stats belongs to.
index-pmem-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the primary index used by this namespace on this node.
used-pct
Percentage of the mount in use for the primary index used by this namespace on this node.
{namespace_name} index-usage: used-bytes 5502926848
Displayed periodically for each namespace, every 10 seconds by default.
{ns_name}
Name of the namespace the device and stats belongs to.
index-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the primary index used by this namespace on this node.
used-pct
Percentage of the mount in use for the primary index used by this namespace on this node.
{namespace_name} set-index-usage: used-bytes 5502926848 used-pct 1
Displayed periodically for each namespace, every 10 seconds by default.
{ns_name}
Name of the namespace the index and stats belongs to.
set-index-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the primary index used by this namespace on this node.
used-pct
Percentage of the mount in use for the primary index used by this namespace on this node.
{namespace_name} sindex-pmem-usage: used-bytes 12345 used-pct 43
Displayed periodically for each namespace configured with ‘sindex-type pmem’, every 10 seconds by default.
{ns_name}
Name of the namespace the index and stats belongs to.
sindex-pmem-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the secondary indexes used by this namespace on this node.
used-pct
Percentage of the mount in use for the secondary indexes used by this namespace on this node.
{namespace_name} sindex-usage: used-bytes 12345 used-pct 43
Displayed periodically for each namespace, every 10 seconds by default.
{ns_name}
Name of the namespace the index and stats belongs to.
sindex-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the secondary indexes used by this namespace on this node.
used-pct
Percentage of the mount in use for the secondary indexes used by this namespace on this node.
{ns_name} batch-sub: tsvc (0,0) proxy (0,0,0) read (959,0,0,51,1) write (0,0,0,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)
Batch index transaction statistics for each namespace, displayed every 10 seconds by default, and only if batch index transactions hit this namespace on this node. No correlation can be made between the status of a parent batch-index job and the statuses of its sub-transactions. See the batch-index
log entry for details.
tsvc
Number of batch-index read sub transactions that failed in the transaction service (Error,Timed out). Corresponds to the batch_sub_tsvc_error
and batch_sub_tsvc_timeout
statistics.
proxy
Number of proxied batch-index read sub transactions (Success,Error,Timed out). Corresponds to the batch_sub_proxy_complete
, batch_sub_proxy_error
, and batch_sub_proxy_timeout
statistics.
read
Number of batch-index read sub transactions (Success,Error,Timed out,Not found,Filtered out). Corresponds to the batch_sub_read_success
, batch_sub_read_error
, batch_sub_read_timeout
, batch_sub_read_not_found
, and batch_sub_read_filtered_out
statistics.
write
Number of batch-index write sub transactions (Success,Error,Timed out,Filtered out). Corresponds to the batch_sub_write_success
, batch_sub_write_error
, batch_sub_write_timeout
, and batch_sub_write_filtered_out
statistics. Displayed as of Database 6.0.
delete
Number of batch-index delete sub transactions (Success,Error,Timed out,Not found,Filtered out). Corresponds to the batch_sub_delete_success
, batch_sub_delete_error
, batch_sub_delete_timeout
, batch_sub_delete_not_found
, and batch_sub_delete_filtered_out
statistics. Displayed as of Database 6.0.
udf
Number of batch-index udf sub transactions (Complete,Error,Timed out,Filtered out). Corresponds to the batch_sub_udf_complete
, batch_sub_udf_error
, batch_sub_udf_timeout
, and batch_sub_udf_filtered_out
statistics. Displayed as of Database 6.0.
lang
Number of batch-index lang sub transactions (Delete Success,Error,Read Success,Write Success). Corresponds to the batch_sub_lang_delete_success
, batch_sub_lang_error
, batch_sub_lang_read_success
, and batch_sub_lang_write_success
statistics. Displayed as of Database 6.0.
{ns_name} client: tsvc (0,0) proxy (0,0,0) read (126,0,1,3,1) write (2886,0,23,2) delete (197,0,1,19,3) udf (35,0,1,4) lang (26,7,0,3)
Basic client transaction statistics displayed periodically for each namespace, every 10 seconds by default, and only if client transactions hit this namespace on this node.
The following values define various actions displayed in the logs:
- S - success
- C - complete, but success or failure is indeterminate. Proxy diverts and UDFs can successfully send a “FAILURE” response bin.
- E - error
- T - timed out
- N - not found, which for reads and deletes is a result distinguished from success but is not an error
- F - result filtered out or action skipped by predexp
- R,W,D - successful UDF read, write, or delete operation.
tsvc
Failures in the transaction service, before attempting to handle the transaction (E,T). Also reported as the client_tsvc_error
and client_tsvc_timeout
statistics.
proxy
Client proxied transactions (C,E,T). This should only happen during migrations. Also reported as the client_proxy_complete
, client_proxy_error
, and client_proxy_timeout
statistics.
read
Client read transactions (S,E,T,N,F). Also reported as the client_read_success
, client_read_error
, client_read_timeout
, client_read_not_found
, and client_read_filtered_out
statistics.
write
Client write transactions (S,E,T,F). Also reported as the client_write_success
, client_write_error
, client_write_timeout
, and client_write_filtered_out
statistics.
delete
Client delete transactions (S,E,T,N,F). Also reported as the client_delete_success
, client_delete_error
, client_delete_timeout
, client_delete_not_found
, and client_delete_filtered_out
statistics.
udf
Client UDF transactions (C,E,T,F). See the lang
stat breakdown for the underlying operation statuses. Also reported as the client_udf_complete
, client_udf_error
, client_udf_timeout
, and client_udf_filtered_out
statistics.
lang
Statistics for UDF operation statuses (R,W,D,E). Also reported as the client_lang_read_success
, client_lang_write_success
, client_lang_delete_success
, and client_lang_error
statistics.
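Decoding these tuples by eye is error-prone; a small parser that zips each group with its status letters from the legend above can help (an illustrative sketch; the group orders are taken from this entry):

```python
import re

# Status-letter order for each counter group, per the legend above.
LEGEND = {
    "tsvc": ("E", "T"),
    "proxy": ("C", "E", "T"),
    "read": ("S", "E", "T", "N", "F"),
    "write": ("S", "E", "T", "F"),
    "delete": ("S", "E", "T", "N", "F"),
    "udf": ("C", "E", "T", "F"),
    "lang": ("R", "W", "D", "E"),
}

line = ("{test} client: tsvc (0,0) proxy (0,0,0) read (126,0,1,3,1) "
        "write (2886,0,23,2) delete (197,0,1,19,3) udf (35,0,1,4) lang (26,7,0,3)")

stats = {}
for group, values in re.findall(r"(\w+) \(([\d,]+)\)", line):
    stats[group] = dict(zip(LEGEND[group], map(int, values.split(","))))
print(stats["read"])  # {'S': 126, 'E': 0, 'T': 1, 'N': 3, 'F': 1}
```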
{ns_name} data-usage: used-bytes 2054187648 avail-pct 92
If storage-engine pmem
has been configured for the namespace, this is a breakdown of the pmem storage file usage. Displayed periodically for each namespace, every 10 seconds by default.
used-bytes
Number of bytes used on pmem storage files for {ns_name} on the local node.
avail-pct
Minimum percentage of contiguous pmem storage file space in {ns_name} on the local node across all pmem storage files. Corresponds to the pmem_available_pct
statistic.
{ns_name} device-usage: used-bytes 2054187648 avail-pct 92 cache-read-pct 12.35
If storage-engine device
has been configured for the namespace, this is a breakdown of the device usage. Displayed periodically for each namespace, every 10 seconds by default.
used-bytes
Number of bytes used on disk for {ns_name} on the local node.
avail-pct
Minimum percentage of contiguous disk space in {ns_name} on the local node across all devices. Corresponds to the device_available_pct
statistic.
cache-read-pct
Percentage of reads from the post-write cache instead of disk. Only applicable when {ns_name} is not configured for data in memory. Corresponds to the cache_read_pct
statistic.
{ns_name} dup-res: ask 1234 respond (10,4321)
Statistics for transactions that are asking or handling duplicate resolution. Displayed periodically for each namespace, every 10 seconds by default.
ask
Number of duplicate resolution requests made by the node to other individual nodes. Also reported as the dup_res_ask
statistic.
respond
Number of duplicate resolution requests handled by the node, broken up between transactions where a read was required and transactions where a read was not required. Also reported as the dup_res_respond_read
and dup_res_respond_no_read
statistics.
{ns_name} from-proxy: tsvc (0,0) read (105,0,1,7) write (2812,0,22,1) delete (188,0,1,16,2) udf (35,0,1,3) lang (26,7,0,3)
Basic proxied transaction statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after proxied transactions hit this namespace on this node.
The following values define various actions that are displayed in the logs
- S - success
- C - complete, but success/failure indeterminate. UDFs can successfully send a “FAILURE” response bin.
- E - error
- T - timed out
- N - not found, which for reads and deletes is a result distinguished from success but is not an error
- F - result filtered out or action skipped by predexp
- R,W,D - successful UDF read, write, or delete operation.
tsvc
Failures in the transaction service, before attempting to handle the transaction (E,T). Also reported as the from_proxy_tsvc_error and from_proxy_tsvc_timeout statistics.
read
Proxied read transactions (S,E,T,N,F). Also reported as the following statistics:
from_proxy_read_success
from_proxy_read_error
from_proxy_read_timeout
from_proxy_read_not_found
from_proxy_read_filtered_out
write
Proxied write transactions (S,E,T,F). Also reported as the following statistics:
from_proxy_write_success
from_proxy_write_error
from_proxy_write_timeout
from_proxy_write_filtered_out
delete
Proxied delete transactions (S,E,T,N,F). Also reported as the following statistics:
from_proxy_delete_success
from_proxy_delete_error
from_proxy_delete_timeout
from_proxy_delete_not_found
from_proxy_delete_filtered_out
udf
Proxied UDF transactions (C,E,T,F). See the lang
stat breakdown for the underlying operation statuses. Also reported as the from_proxy_udf_complete, from_proxy_udf_error, from_proxy_udf_timeout, and from_proxy_udf_filtered_out statistics.
lang
Statistics for proxied UDF operation statuses (R,W,D,E). Also reported as the following statistics:
from_proxy_lang_read_success
from_proxy_lang_write_success
from_proxy_lang_delete_success
from_proxy_lang_error
{ns_name} from-proxy-batch-sub: tsvc (0,0) read (959,0,0,51,1) write (0,0,0,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)
Proxied batch index transaction statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after proxied batch index transactions hit this namespace on this node. No correlation can be made between the status of a parent batch-index job and the statuses of its sub-transactions. See the batch-index
log entry for details.
tsvc
Number of proxied batch-index read sub transactions that failed in the transaction service (Error,Timed out). Corresponds to the from_proxy_batch_sub_tsvc_error and from_proxy_batch_sub_tsvc_timeout statistics.
read
Number of proxied batch-index read sub transactions (Success,Error,Timed out,Not found,Filtered out). Corresponds to the following statistics:
from_proxy_batch_sub_read_success
from_proxy_batch_sub_read_error
from_proxy_batch_sub_read_timeout
from_proxy_batch_sub_read_not_found
from_proxy_batch_sub_read_filtered_out
write
Number of proxied batch-index write sub transactions (Success,Error,Timed out,Filtered out). Corresponds to the following statistics:
from_proxy_batch_sub_write_success
from_proxy_batch_sub_write_error
from_proxy_batch_sub_write_timeout
from_proxy_batch_sub_write_filtered_out
Displayed as of Database 6.0.
delete
Number of proxied batch-index delete sub transactions (Success,Error,Timed out,Not found,Filtered out). Corresponds to the following statistics:
from_proxy_batch_sub_delete_success
from_proxy_batch_sub_delete_error
from_proxy_batch_sub_delete_timeout
from_proxy_batch_sub_delete_not_found
from_proxy_batch_sub_delete_filtered_out
Displayed as of Database 6.0.
udf
Number of proxied batch-index udf sub transactions (Complete,Error,Timed out,Filtered out). Corresponds to the following statistics:
from_proxy_batch_sub_udf_complete
from_proxy_batch_sub_udf_error
from_proxy_batch_sub_udf_timeout
from_proxy_batch_sub_udf_filtered_out
Displayed as of Database 6.0.
lang
Number of proxied batch-index lang sub transactions (Read Success,Write Success,Delete Success,Error). Corresponds to the following statistics:
from_proxy_batch_sub_lang_read_success
from_proxy_batch_sub_lang_write_success
from_proxy_batch_sub_lang_delete_success
from_proxy_batch_sub_lang_error
Displayed as of Database 6.0.
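Since the ticker groups are purely positional, decoding them is a matter of zipping each tuple with the field orders documented above. The following sketch (Python; `parse_batch_sub` is a hypothetical helper, and the tsvc and lang orderings are assumptions based on the server's other ticker lines) decodes the sample line from this section:

```python
import re

# Sketch: decode a from-proxy-batch-sub ticker line into named counters.
# Field orders follow the breakdown above; the tsvc pair (error, timeout)
# and the lang order (read, write, delete, error) are assumptions.
FIELDS = {
    "tsvc": ("tsvc_error", "tsvc_timeout"),
    "read": ("read_success", "read_error", "read_timeout",
             "read_not_found", "read_filtered_out"),
    "write": ("write_success", "write_error", "write_timeout",
              "write_filtered_out"),
    "delete": ("delete_success", "delete_error", "delete_timeout",
               "delete_not_found", "delete_filtered_out"),
    "udf": ("udf_complete", "udf_error", "udf_timeout", "udf_filtered_out"),
    "lang": ("lang_read_success", "lang_write_success",
             "lang_delete_success", "lang_error"),
}

def parse_batch_sub(line):
    stats = {}
    for group, nums in re.findall(r"(\w+) \(([\d,]+)\)", line):
        for name, value in zip(FIELDS.get(group, ()), nums.split(",")):
            stats["from_proxy_batch_sub_" + name] = int(value)
    return stats

sample = ("{ns1} from-proxy-batch-sub: tsvc (0,0) read (959,0,0,51,1) "
          "write (0,0,0,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)")
stats = parse_batch_sub(sample)
print(stats["from_proxy_batch_sub_read_success"])    # 959
print(stats["from_proxy_batch_sub_read_not_found"])  # 51
```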
{ns_name} index-usage: used-bytes 3121300 used-pct 0.05 alloc-bytes 16777216 alloc-pct 0.50
Primary index usage statistics for {ns_name}, displayed periodically for each namespace, every 10 seconds by default.
used-bytes
Number of bytes in use by the primary index for {ns_name} on the local node.
used-pct
Percentage of the configured index capacity in use for {ns_name} on the local node.
alloc-bytes
Number of bytes allocated for the primary index for {ns_name} on the local node.
alloc-pct
Percentage of the configured index capacity allocated for {ns_name} on the local node.
{ns_name} memory-usage: total-bytes 3121300 index-bytes 140544 set-index-bytes 70272 sindex-bytes 221544 data-bytes 2688940 used-pct 0.05
Memory usage statistics for {ns_name}, displayed periodically for each namespace, every 10 seconds by default. When ‘data-in-memory’ is false, ‘data-bytes’ is not included. ‘set-index-bytes’ is included as of Database 5.6.
total-bytes
Total number of bytes used in memory for {ns_name} on the local node.
index-bytes
Number of bytes holding the primary index in system memory for {ns_name} on the local node. Displays 0 when index is not stored in RAM.
set-index-bytes
Number of bytes holding set indexes in process memory for {ns_name} on the local node. Displayed as of Database 5.6.
sindex-bytes
Number of bytes holding secondary indexes in process memory for {ns_name} on the local node.
data-bytes
Number of bytes holding data in process memory for {ns_name} on the local node. Displayed only when ‘data-in-memory’ is set to true for {ns_name}.
used-pct
Percentage of bytes used in memory for {ns_name} on the local node.
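As a consistency check, total-bytes is the sum of the component counters. Worked against the sample memory-usage line above:

```python
# Values from the sample memory-usage line above; total-bytes should be
# the sum of the index, set-index, sindex, and data components.
index_bytes = 140544
set_index_bytes = 70272
sindex_bytes = 221544
data_bytes = 2688940

total_bytes = index_bytes + set_index_bytes + sindex_bytes + data_bytes
print(total_bytes)  # 3121300, matching total-bytes in the sample line
```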
{ns_name} migrations: remaining (654,289,254) active (1,1,0) complete-pct 88.49
Migration statistics for {ns_name}, displayed periodically for each namespace, every 10 seconds by default. When migrations have completed, this line is reduced to {ns_name} migrations - complete.
{ns_name}
“ns_name” is replaced by the name of a particular namespace.
remaining
Total number of transmit and receive partition migrations outstanding for this node, as well as signals. Signals represent the number of signals to send to other (non-replica) nodes for partitions to drop. This log line changes to complete
after migrations are completed on this node.
active
Number of transmit and receive partition migrations currently in progress, as well as active signals.
complete-pct
Percent of the total number of partition migrations scheduled for this rebalance that have already completed.
{ns_name} objects: all 845922 master 281071 prole 564851
Object statistics for {ns_name}, displayed periodically for each namespace, every 10 seconds by default. Number of objects for this namespace on this node along with the master and prole breakdown.
{ns_name}
“ns_name” is replaced by the name of a particular namespace.
all
Total number of objects for this namespace on this node (master and proles).
master
Number of master objects for this namespace on this node.
prole
Number of prole (replica) objects for this namespace on this node.
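The counts on this line are related: on a node holding no in-transit (non-replica) objects, all is simply master plus prole, as in the sample line:

```python
# Values from the sample objects line above; 'all' should equal
# master + prole when no non-replica objects are present.
master, prole = 281071, 564851
all_objects = master + prole
print(all_objects)  # 845922, matching 'all' in the sample line
```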
{ns_name} ops-sub: tsvc (0,0) write (2651,0,0,1)
Scan/query ops sub-transactions statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after scan/query ops sub-transactions hit this namespace on this node.
tsvc
Number of ops sub-transactions of scan/query background ops jobs that failed in the transaction service (Error,Timed out). Corresponds to the ops_sub_tsvc_error
and ops_sub_tsvc_timeout
statistics.
write
Number of ops sub-transactions of scan/query background ops jobs (Success,Error,Timed out,Filtered out). Corresponds to the ops_sub_write_success
, ops_sub_write_error
, ops_sub_write_timeout
, and ops_sub_write_filtered_out
statistics.
{ns_name} pi-query: short-basic (4,0,0) long-basic (36,0,0) aggr (0,0,0) udf-bg (7,0,0), ops-bg (3,0,0)
Primary index query (pi-query) transaction statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after pi-query transactions hit this namespace on this node.
short-basic
Number of short primary index (pi)-query jobs since the server started (Success,Error,Timed out). Short queries are declared by the client, are unmonitored, and typically run for a second or less. Corresponds to the pi_query_short_basic_complete
, pi_query_short_basic_error
, and pi_query_short_basic_timeout
statistics.
long-basic
Number of long pi-query jobs since the server started (Success,Error,Aborted). Long queries are monitored and not time bounded. Corresponds to the pi_query_long_basic_complete
, pi_query_long_basic_error
, and pi_query_long_basic_abort
statistics.
aggr
Number of pi-query aggregation jobs since the server started (Success,Error,Aborted). Corresponds to the pi_query_aggr_complete
, pi_query_aggr_error
, and pi_query_aggr_abort
statistics.
udf-bg
Number of pi-query background udf jobs since the server started (Success,Error,Aborted). Corresponds to the pi_query_udf_bg_complete
, pi_query_udf_bg_error
, and pi_query_udf_bg_abort
statistics.
ops-bg
Number of pi-query background operations (ops) jobs since the server started (Success,Error,Aborted). Corresponds to the pi_query_ops_bg_complete
, pi_query_ops_bg_error
, and pi_query_ops_bg_abort
statistics.
{ns_name} pmem-usage: used-bytes 2054187648 avail-pct 92
Breakdown of the pmem storage file usage, displayed periodically for each namespace, every 10 seconds by default. Displayed only if storage-engine pmem
has been configured for the namespace.
used-bytes
Number of bytes used on pmem storage files for {ns_name} on the local node.
avail-pct
Minimum percentage of contiguous pmem storage file space in {ns_name} on the local node across all pmem storage files. Corresponds to the pmem_available_pct
statistic.
{ns_name} query: basic (210,0,0) aggr (0,0,0) udf-bg (0,0,0) ops-bg (0,0,0)
Secondary index query transactions statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after query transactions hit this namespace on this node.
basic
Number of secondary index queries since the server started (Completed,Error,Abort). Corresponds to the query_basic_complete
, query_basic_error
and query_basic_abort
statistics.
aggr
Number of query aggregation jobs since the server started (Completed,Error,Abort). Corresponds to the query_aggr_complete
, query_aggr_error
and query_aggr_abort
statistics.
udf-bg
Number of query background udf jobs since the server started (Completed,Error,Abort). Corresponds to the query_udf_bg_complete
, query_udf_bg_error
and query_udf_bg_abort
statistics.
ops-bg
Number of query background operations (ops) jobs since the server started (Success,Failure). Corresponds to the query_ops_bg_success
and query_ops_bg_failure
statistics.
{ns_name} query: basic (5,0) aggr (6,0) udf-bg (1,0) ops-bg (2,0)
Secondary index query transactions statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after query transactions hit this namespace on this node.
basic
Number of secondary index queries since the server started (Success,Abort). Corresponds to the query_lookup_success
and query_lookup_abort
statistics.
aggr
Number of query aggregation jobs since the server started (Success,Abort). Corresponds to the query_agg_success
and query_agg_abort
statistics.
udf-bg
Number of query background udf jobs since the server started (Success,Failure). Corresponds to the query_udf_bg_success
and query_udf_bg_failure
statistics.
ops-bg
Number of query background operations (ops) jobs since the server started (Success,Failure). Corresponds to the query_ops_bg_success
and query_ops_bg_failure
statistics.
{ns_name} read-touch: tsvc (0,0) all-triggers (12345,0,3,45)
Statistics for the read-touch LRU behavior, described in configuring namespace data retention and controlled by default-read-touch-ttl-pct
.
Displayed periodically for each namespace, every 10 seconds by default. The following values define various actions that are displayed in the logs:
- S - success
- E - error
- T - timed out
- K - skipped
tsvc
Read-touch early timeouts (E, T). Also reported as the read_touch_tsvc_error
and read_touch_tsvc_timeout
statistics.
all-triggers
Read-touch transactions (S, E, T, K). Also reported as the read_touch_success
, read_touch_error
, read_touch_timeout
, and read_touch_skip
statistics.
{ns_name} re-repl: tsvc (0,34) all-triggers (525,0,32) unreplicated-records 14
Statistics for transactions re-replicating, applicable to strong-consistency
enabled namespaces. In strong consistency mode, write transactions failing replication are marked as unreplicated, and attempt to re-replicate one time immediately (despite returning an in doubt timeout failure to the client), as well as on any subsequent transaction attempted on the record (read or write).
Displayed periodically for each namespace, every 10 seconds by default. The following values define various actions which are displayed in the logs:
- S - success
- E - error
- T - timed out
tsvc
Re-replication early timeouts (E, T). Also reported as the re_repl_tsvc_error
and re_repl_tsvc_timeout
statistics.
all-triggers
Re-replication transactions (S, E, T). Also reported as the re_repl_success
, re_repl_error
and re_repl_timeout
statistics. Starting with Database 6.3, re_repl_timeout
only applies to timeouts during the actual replication.
unreplicated-records
Number of unreplicated records in the namespace. Displayed as of Database 5.7. Also reported as the unreplicated_records
statistic.
{ns_name} retransmits: migration 1 dup-res (1,2,3,4,5,6,7,8,9,10) repl-ping (1,2) repl-write (1,2,3,4,5,6,7,8)
Retransmit statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only if any retransmit has taken place. Retransmission happens when the server resends a message to another cluster node, after a prior attempt timed out.
migration
Number of retransmits that occurred during migrations. Corresponds to the migrate_record_retransmits
statistic.
dup-res
Groups the retransmission statistics related to duplicate-resolution.
The numbered dup-res
metrics correspond to the following:
- retransmit_all_read_dup_res
- retransmit_all_write_dup_res
- retransmit_all_delete_dup_res
- retransmit_all_udf_dup_res
- retransmit_all_batch_sub_read_dup_res
- retransmit_all_batch_sub_write_dup_res
- retransmit_all_batch_sub_delete_dup_res
- retransmit_all_batch_sub_udf_dup_res
- retransmit_udf_sub_dup_res
- retransmit_ops_sub_dup_res
repl-ping
Groups the retransmission statistics related to Linearized SC reads.
The numbered repl-ping
metrics correspond to the following:
- retransmit_all_read_repl_ping
- retransmit_all_batch_sub_read_repl_ping
repl-write
Groups the retransmission statistics related to replica writes.
The numbered repl-write
metrics correspond to the following:
- retransmit_all_write_repl_write
- retransmit_all_delete_repl_write
- retransmit_all_udf_repl_write
- retransmit_all_batch_sub_write_repl_write
- retransmit_all_batch_sub_delete_repl_write
- retransmit_all_batch_sub_udf_repl_write
- retransmit_udf_sub_repl_write
- retransmit_ops_sub_repl_write
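Because these groups are positional, they can be decoded by zipping each tuple with the metric lists above. A sketch (Python; `parse_retransmits` is a hypothetical helper, and the sample line is the one shown in this section):

```python
import re

# Metric name lists in the documented positional order.
GROUPS = {
    "dup-res": [
        "retransmit_all_read_dup_res",
        "retransmit_all_write_dup_res",
        "retransmit_all_delete_dup_res",
        "retransmit_all_udf_dup_res",
        "retransmit_all_batch_sub_read_dup_res",
        "retransmit_all_batch_sub_write_dup_res",
        "retransmit_all_batch_sub_delete_dup_res",
        "retransmit_all_batch_sub_udf_dup_res",
        "retransmit_udf_sub_dup_res",
        "retransmit_ops_sub_dup_res",
    ],
    "repl-ping": [
        "retransmit_all_read_repl_ping",
        "retransmit_all_batch_sub_read_repl_ping",
    ],
    "repl-write": [
        "retransmit_all_write_repl_write",
        "retransmit_all_delete_repl_write",
        "retransmit_all_udf_repl_write",
        "retransmit_all_batch_sub_write_repl_write",
        "retransmit_all_batch_sub_delete_repl_write",
        "retransmit_all_batch_sub_udf_repl_write",
        "retransmit_udf_sub_repl_write",
        "retransmit_ops_sub_repl_write",
    ],
}

def parse_retransmits(line):
    out = {}
    for group, nums in re.findall(r"([\w-]+) \(([\d,]+)\)", line):
        for name, value in zip(GROUPS.get(group, ()), nums.split(",")):
            out[name] = int(value)
    m = re.search(r"migration (\d+)", line)
    if m:
        out["migrate_record_retransmits"] = int(m.group(1))
    return out

sample = ("{ns1} retransmits: migration 1 dup-res (1,2,3,4,5,6,7,8,9,10) "
          "repl-ping (1,2) repl-write (1,2,3,4,5,6,7,8)")
stats = parse_retransmits(sample)
print(stats["retransmit_all_udf_dup_res"])  # 4
```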
{ns_name} scan: basic (11,0,0) aggr (0,0,0) udf-bg (5,0,0), ops-bg (10,0,0)
Scan transactions statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after scan transactions hit this namespace on this node.
basic
Number of scan jobs since the server started (Success,Error,Aborted). Corresponds to the scan_basic_complete
, scan_basic_error
, and scan_basic_abort
statistics.
aggr
Number of scan aggregation jobs since the server started (Success,Error,Aborted). Corresponds to the scan_aggr_complete
, scan_aggr_error
, and scan_aggr_abort
statistics.
udf-bg
Number of scan background udf jobs since the server started (Success,Error,Aborted). Corresponds to the scan_udf_bg_complete
, scan_udf_bg_error
, and scan_udf_bg_abort
statistics.
ops-bg
Number of scan background operations (ops) jobs since the server started (Success,Error,Aborted). Corresponds to the scan_ops_bg_complete
, scan_ops_bg_error
, and scan_ops_bg_abort
statistics.
{ns_name} sindex-flash-usage: used-bytes 12345 used-pct 43
Secondary index flash usage (sindex-flash-usage) statistics, displayed periodically for each namespace configured with ‘sindex-type flash’, every 10 seconds by default.
{ns_name}
Name of the namespace the index and stats belong to.
sindex-flash-usage
Name for which the following stats apply.
used-bytes
Total bytes in use on the mount for the secondary indexes used by this namespace on this node.
used-pct
Percentage of the mount in use for the secondary indexes used by this namespace on this node.
{ns_name} si-query: short-basic (4,0,0) long-basic (26,0,0) aggr (0,0,0) udf-bg (7,0,0) ops-bg (3,0,0)
Secondary index query (si-query) transactions statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after si-query transactions hit this namespace on this node.
short-basic
Number of short secondary index queries since the server started (Completed,Error,Timed out). Short queries are declared by the client, are unmonitored, and typically run for a second or less. Corresponds to the si_query_short_basic_complete
, si_query_short_basic_error
and si_query_short_basic_timeout
statistics.
long-basic
Number of long secondary index queries since the server started (Completed,Error,Abort). Long queries are monitored and not time bounded. Corresponds to the si_query_long_basic_complete
, si_query_long_basic_error
and si_query_long_basic_abort
statistics.
aggr
Number of si-query aggregation jobs since the server started (Completed,Error,Abort). Corresponds to the si_query_aggr_complete
, si_query_aggr_error
and si_query_aggr_abort
statistics.
udf-bg
Number of si-query background udf jobs since the server started (Completed,Error,Abort). Corresponds to the si_query_udf_bg_complete
, si_query_udf_bg_error
and si_query_udf_bg_abort
statistics.
ops-bg
Number of si-query background operations (ops) jobs since the server started (Completed,Error,Abort). Corresponds to the si_query_ops_bg_complete
, si_query_ops_bg_error
, and si_query_ops_bg_abort
statistics.
{ns_name} special-errors: key-busy (1234, 40) record-too-big 5678 lost-conflict (256,32)
Special errors statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only if any of those errors have occurred. Log line modified in Database 7.2, adding the fail_xdr_key_busy
count as the second number in the first parentheses.
key-busy
Number of key busy errors. Corresponds to the fail_key_busy
and fail_xdr_key_busy
statistics. See Hot Key Error code 14.
record-too-big
Number of record too big errors. Corresponds to the fail_record_too_big
statistic.
lost-conflict
Composed of the following metrics:
{ns_name} special-errors: key-busy 1234 record-too-big 5678
Special errors statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only if any of those errors have occurred.
key-busy
Number of key busy errors. Corresponds to the fail_key_busy
statistic. See Hot Key Error code 14.
record-too-big
Number of record too big errors. Corresponds to the fail_record_too_big
statistic.
{ns_name} tombstones: all 11252 xdr (11223,0) master 5501 prole 5751 non-replica 0
Number of tombstones for this namespace on this node along with the breakdown. Displayed periodically for each namespace, every 10 seconds by default.
{ns_name}
“ns_name” is replaced by the name of a particular namespace.
all
Total number of tombstones for this namespace on this node.
xdr
Number of xdr tombstones and bin cemeteries - (xdr_tombstones
,xdr_bin_cemeteries
).
master
Number of master tombstones for this namespace on this node.
prole
Number of prole (replica) tombstones for this namespace on this node.
non-replica
Number of non-replica tombstones for this namespace on this node.
{ns_name} udf-sub: tsvc (0,0) udf (2651,0,0,1) lang (52,2498,101,0)
Scan/query UDF sub-transactions statistics, displayed periodically for each namespace, every 10 seconds by default. Displayed only after query UDF sub-transactions hit this namespace on this node.
tsvc
Number of udf sub transactions of scan/query background udf jobs that failed in the transaction service (Error,Timed out). Corresponds to the following statistics:
udf
Number of udf sub transactions of scan/query background udf jobs (Success,Error,Timed out,Filtered out). Corresponds to the following statistics:
lang
Different status counts for underlying udf operations for sub transactions of scan/query background udf jobs (Read,Write,Delete,Error). Corresponds to the following statistics:
{ns_name} xdr-client: write (1543,0,15) delete (134,0,3,25)
XDR client transactions statistics, displayed after XDR client transactions hit this namespace on this node. Displayed periodically, every 10 seconds by default, for each namespace receiving write transactions from an XDR client. The values on this line are a subset of the values displayed in the client statistics line just above in the log file.
The following values define various actions which are displayed in the logs:
- S - success
- E - error
- T - timed out
- N - not found, which for reads and deletes is a result that should be distinguished from success but is not an error.
write
XDR client write transactions (S,E,T). Also reported as the following statistics:
delete
XDR client delete transactions (S,E,T,N). Also reported as the following statistics:
{ns_name} xdr-dc dc2: lag 12 throughput 710 bytes-shipped 212456900 in-queue 250563 in-progress 81150 complete (1002215,0,0,0) retries (0,0,23) recoveries (2048,0) hot-keys 4655
Displayed periodically for each combination of namespace and XDR destination cluster (DC), every 10 seconds by default.
{ns_name}
“ns_name” is replaced by the name of a particular namespace.
xdr-dc
Name of the XDR destination cluster.
lag
See lag
.
throughput
See throughput
.
bytes-shipped
See bytes_shipped
.
in-queue
See in_queue
.
in-progress
See in_progress
.
complete
Composed of the following metrics:
retries
Composed of the following metrics:
recoveries
Composed of the following metrics:
hot-keys
See hot_keys
.
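For monitoring, the scalar fields of this line can be scraped directly. A minimal sketch, using the sample line from this section (the tuple fields are skipped here):

```python
import re

# Pull the scalar "name value" pairs out of an xdr-dc ticker line.
# Tuple fields (complete, retries, recoveries) are intentionally skipped.
sample = ("{ns1} xdr-dc dc2: lag 12 throughput 710 bytes-shipped 212456900 "
          "in-queue 250563 in-progress 81150 complete (1002215,0,0,0) "
          "retries (0,0,23) recoveries (2048,0) hot-keys 4655")

scalars = {name: int(value)
           for name, value in re.findall(r"([a-z][\w-]*) (\d+)\b", sample)}
print(scalars["lag"], scalars["in-queue"], scalars["hot-keys"])  # 12 250563 4655
```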
{ns_name} xdr-from-proxy: write (743,0,11) delete (104,0,3,21)
Proxied XDR transactions statistics. Displayed periodically, every 10 seconds by default, for each namespace receiving proxied write transactions from an XDR client. Displayed only after a proxied XDR transaction hits this namespace on this node. The values on this line are a subset of the values displayed in the from-proxy statistics line just above in the log file.
The following values define various actions which are displayed in the logs:
- S - success
- E - error
- T - timed out
- N - not found, which for reads and deletes is a result that should be distinguished from success but is not an error.
write
XDR client write transactions (S,E,T). Also reported as the xdr_from_proxy_write_success
, xdr_from_proxy_write_error
and xdr_from_proxy_write_timeout
statistics.
delete
XDR client delete transactions (S,E,T,N). Also reported as the xdr_from_proxy_delete_success
, xdr_from_proxy_delete_error
, xdr_from_proxy_delete_timeout
, and xdr_from_proxy_delete_not_found
statistics.
process: cpu-pct 28 threads (8,67,46,46) heap-kbytes (71477,72292,118784) heap-efficiency-pct 60.2
cpu-pct
Percent CPU time Aerospike was scheduled since previously reported. Corresponds to process_cpu_pct
.
threads
Thread statistics, in order: (threads_joinable
, threads_detached
, threads_pool_total
, threads_pool_active
). Introduced in Database 5.6.
heap-kbytes
Heap statistics, in order: (heap_allocated_kbytes
, heap_active_kbytes
, heap_mapped_kbytes
).
heap-efficiency-pct
Indicates the jemalloc heap fragmentation. This represents the heap_allocated_kbytes
/ heap_mapped_kbytes
ratio (prior to 5.7), or the heap_allocated_kbytes
/ heap_active_kbytes
ratio (6.0 or later). A lower number indicates a higher fragmentation rate. Corresponds to the heap_efficiency_pct
statistic.
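Worked against the sample line, heap-kbytes (71477,72292,118784) yields the reported 60.2 under the pre-5.7 formula:

```python
# heap-kbytes from the sample line: (allocated, active, mapped).
heap_allocated_kbytes = 71477
heap_active_kbytes = 72292
heap_mapped_kbytes = 118784

# Pre-5.7: allocated / mapped; 6.0 and later: allocated / active.
old_pct = round(100 * heap_allocated_kbytes / heap_mapped_kbytes, 1)
new_pct = round(100 * heap_allocated_kbytes / heap_active_kbytes, 1)
print(old_pct)  # 60.2, matching heap-efficiency-pct in the sample line
print(new_pct)  # 98.9
```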
system: total-cpu-pct 76 user-cpu-pct 44 kernel-cpu-pct 32 free-mem-kbytes 7462956 free-mem-pct 52 thp-mem-kbytes 4096
total-cpu-pct
Percent of time the CPU spent servicing user-space or kernel space tasks. Corresponds to system_total_cpu_pct
.
user-cpu-pct
Percent of time the CPUs spent servicing user-space tasks. Corresponds to system_user_cpu_pct
.
kernel-cpu-pct
Percent of time the CPUs spent servicing kernel-space tasks. Corresponds to system_kernel_cpu_pct
.
free-mem-kbytes
Amount of free RAM in kilobytes for the host. Corresponds to system_free_mem_kbytes
.
free-mem-pct
Percentage of all RAM free, rounded to nearest percent, for the host. Corresponds to the system_free_mem_pct
.
thp-mem-kbytes
Amount of memory in use by the Transparent Huge Page mechanism, in kilobytes. Corresponds to system_thp_mem_kbytes
. Displayed in 5.7 and later.
xdr-dc dc2: nodes 8 latency-ms 19
Displayed periodically for each XDR destination cluster (DC), every 10 seconds by default.
Key busy
DETAIL (info-command): (<file>:<line>) client <ipaddress:port> command '<info-command-text>'
A detail logging context for logging all info protocol commands received by the server.
DETAIL (key-busy): (<file>:<line>) {<namespace>} {digest} <transaction-type> from <source>
Describes the digest, transaction type, and transaction source of a possibly hot key. This message is issued whenever a transaction fails with a KEY_BUSY
error (14).
Possible transaction types:
- read
- write
- delete
- udf
- batch-sub-read
- batch-sub-write
- batch-sub-delete
- batch-sub-udf
Possible source types:
- client ip address:port
- proxy
- bg-udf
- bg-ops
- re-repl
Migrate
handle insert: binless pickle, dropping {digest}:0x2b2c08f859bd4eb401982e038a2bdcae2b74c853
Attempted to migrate a record that had no bins, so it was not inserted into the receiving partition.
digest
The 160-bit digest of the record, in hex.
migrate: handle insert: got bad record
Inbound record for migration is corrupt. Appears with the WARNING
message about record too small 0
and is documented in Why Do I See Stalled Migrations And "record too small" Errors In The Log?
migrate: handle insert: got no record
Appears on the destination (inbound) node with the warning message about handle insert: got bad record.
Associated with the unreadable digest
warning introduced in Database 6.0.
migrate: record flatten failed ce8a775a68c93d49
The storage for this namespace was full at the time of migration. Resolve by allocating more storage for the namespace, or by reducing the volume of data in the namespace.
record
Record identifier.
migrate: unreadable digest
Appears on the source node when the record could not be read locally. Introduced with Database 6.0. Associated with the handle insert: got no record
warning.
missing acks from node BB9030011AC4202
This means that a particular migration thread has begun to throttle because it has 16MB or more of un-acknowledged records, and has remained in that state for at least 5 seconds. The message prints every 5 seconds until the outstanding acks drop below the threshold. This happens when migration has been pushed higher than your network or machine/disks can sustain. To mitigate, decrease migrate-threads
down to a minimum of 1. This change does not take effect immediately; threads terminate as the partition they are handling completes migration. This warning could also be triggered if channel-bulk-recv-threads was reduced during ongoing migrations.
node
The destination node of the migrate thread.
Mrt audit
{namespace} monitor aborting mrt-id 123456789abcdef0
Indicates when monitors kick in to abort transactions.
mrt-id
Transaction ID that was aborted.
Namespace
at set names limit, can't add set
This warning indicates that the maximum number of sets per namespace has been breached. A namespace can hold a maximum of 1023 sets prior to Database 7 and 4095 sets since Database 7. Such an issue can cause migrations to get stuck and writes to fail. See How to clear up set names when they exceed the limit.
can't add SET (at sets limit)
Indicates that the maximum number of sets per namespace has been breached. A namespace can hold a maximum of 1023 sets prior to Database 7 and 4095 sets since Database 7. Such an issue can cause migrations to get stuck and writes to fail. See How to clear up set names when they exceed the limit.
set
Name of set that was being created in excess of the limit.
fail persistent memory delete
Indicates an issue deleting PMem memory on startup. Can result from incorrect permissions on the /mnt/pmem
directory.
{namespace} mrt-finish: verify-read (100,0,0) roll-forward (50,0,0) roll-back (1,0,0)
Appears if any of the following stats are non-zero:
{namespace} mrt-monitor-finish: roll-forward (0,0,0) roll-back (1,0,0)
Appears if any of the following stats are non-zero:
{namespace_name} appeals: remaining-tx 0 active (0,19)
Information about the number of partition appeals not yet sent and currently being sent. Partition appeals occur for namespaces operating under strong-consistency
mode when a node needs to validate the records it has when joining the cluster.
remaining-tx
Current value for appeals_tx_remaining
.
active (active_tx, active_rx)
A tuple based on appeals_tx_active
and appeals_rx_active
.
ns can’t attach persistent memory base block: block does not exist
A missing shared memory block. This typically happens when a node is rebooted, but shared memory blocks can be deleted for other reasons. The node must perform a cold start.
{ns_name} found no valid persistent memory blocks, will cold start
treex
and base
shared memory blocks are missing. This typically happens when a node is rebooted or after an ungraceful shutdown. The node must perform a cold start.
{ns_name} persisted arena stages
This message is one of a sequence of messages logged during Aerospike server shutdown of storage-engine device namespaces. The message signifies that the arena stages for the namespace have been persisted to storage. An unusual delay in the appearance of this message during shutdown might be due to index-type
configured to pmem
.
{ns_name} persisted tree roots
This message is one of a sequence of messages logged during Aerospike server shutdown of storage-engine device namespaces. The message signifies that the namespace’s common partition index tree information has been persisted to storage. An unusual delay in the appearance of this message during shutdown might be due to a high number of partition-tree-sprigs
configured for the namespace.
{ns_name} persisted trusted base block
This message is one of a sequence of messages logged during Aerospike server shutdown of storage-engine device namespaces. The message signifies that the persistent memory base block for the namespace has been persisted to storage with “trusted” status. “trusted” status is a necessary condition for a subsequent fast restart of the namespace.
Network
fabric_connection_process_readable() recv_sz -1 msg_sz 0 errno 110 Connection timed out
The warning message indicates that a fabric connection timed out. In Database 5.6 and later, the log line has the node-id to identify which node is having the fabric connection time out.
Nsup
{NAMESPACE} failed to create evict-prep thread 5
Indicates a shortage of memory. Make sure nodes have enough memory.
NAMESPACE
The namespace nsup was looking at when memory ran low.
thread
Thread identifier.
{bigdata} hwm breached but nothing to evict
The amount of data in memory (high-water-memory-pct) or on disk (high-water-disk-pct, mounts-high-water-pct) has exceeded the limits set for the namespace, triggering evictions, but there is no data with a finite TTL to be evicted.
ns
Namespace where the high water mark was breached.
{<namespace>} breached eviction limit (<reasons>), sys-memory pct:41, indexes-memory sz:0 (0 + 0 + 0), index-device sz:1073741824000 used-pct 88, data used-pct:58
Checking memory or disk usage at the start of an nsup cycle and finding that the high-water mark has been breached.
breached eviction limit
memory
or disk
depending on which limit was breached
memory
Memory used in bytes (primary index + secondary indexes + data in memory), and the amount of the high-water mark (total available * high-water-memory-pct). For Database 5.6 and later, the set index memory used is also added.
index-device
Used space on device for index-type flash
in bytes and the amount of the high-water mark (total available * mounts-high-water-pct).
disk
Disk space used in bytes and the amount of the high-water mark (total available * high-water-disk-pct).
{<namespace>} breached stop-writes limit (<reasons>), sys-memory pct:96, indexes-memory sz:107374182400 (107374182400 + 0 + 0), data avail-pct:20 used-pct:58
Upon checking memory at the start of an nsup cycle and finding that a threshold triggering stop writes has been breached (either memory
based on stop-writes-pct
or stop-writes-sys-memory-pct
, or disk
based on min-avail-pct
or max-used-pct
).
breached stop-writes limit
sys-memory
, memory
, device-avail-pct
, device-used-size
or some combination depending on which threshold was breached
sys-memory
Total system memory used in bytes greater than the stop-writes-sys-memory-pct
threshold, triggered when (100 - system_free_mem_pct
>= stop-writes-sys-memory-pct
)
indexes-memory
Memory used in bytes greater than the stop-writes-pct
threshold (total available * stop-writes-pct
). Prior to Database 5.6, tracked under memory_used_bytes
(primary index + secondary indexes + data in memory). Database 5.6 and later, tracked under memory_used_bytes
(primary index + set index bytes + secondary indexes + data in memory).
data avail-pct
The available contiguous free space, tracked under device_available_pct
data used-pct
The amount of used disk space, tracked as the ratio of the device_used_bytes
and device_total_bytes
metrics.
{namespace} failed set evict-void-time 328967879
When high-water-disk-pct
, high-water-memory-pct
, or mounts-high-water-pct
is breached on any node and eviction starts, the timestamp before which records need to be evicted is propagated to all nodes through the System Meta Data (SMD) mechanism, to ensure that no orphaned replicas are left anywhere in the cluster. If not all nodes have acknowledged the message within 5 seconds, this message is logged. Occasional instances of this message do not indicate a serious problem, as the timestamp is picked up on the next namespace supervisor (nsup) cycle.
ns
The namespace where evictions were triggered.
evict void time
The timestamp, in seconds since the Aerospike Epoch of 2010-01-01T00:00:00, before which records should be evicted.
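Void times are easy to misread because they count from the Aerospike epoch rather than the Unix epoch. A small conversion sketch:

```python
from datetime import datetime, timedelta, timezone

# The Aerospike epoch is 2010-01-01T00:00:00Z; evict void times are
# seconds since that point.
AEROSPIKE_EPOCH = datetime(2010, 1, 1, tzinfo=timezone.utc)

def void_time_to_utc(void_time_secs):
    return AEROSPIKE_EPOCH + timedelta(seconds=void_time_secs)

# The void time from the sample log line above falls in mid-2020.
print(void_time_to_utc(328967879).isoformat())
```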
{namespace} would evict all 146897768 records eligible - not evicting!
The namespace supervisor (nsup) configuration required it to evict all expirable records (records with a configured time-to-live or TTL). Most commonly seen when evict-tenths-pct
is set to 1000 or greater, but could also happen when the distribution of TTLs is extremely skewed.
namespace
Namespace where this happened.
record count
Total number of evictable records (that is, records with a TTL) in the namespace.
{<namespace>} xdr-tomb-raid-start: threads <n>
XDR tomb raider has started for the namespace. This log entry appeared under the NSUP context prior to Database 7.1.
Expected to run every xdr-tomb-raider-period seconds.
namespace
The namespace for which the tombstones are being dropped
n
The number of threads configured for the XDR tomb raider
{ns-name} no records below eviction void-time 329702784 - threshold bucket 9998, width 3154 sec, count 4000000 > target 20000 (0.5 pct)
Checking the eviction buckets from the bottom up, no records are found until one bucket has enough records to put the total over the limit for one eviction cycle.
void-time
Timestamp at the top of the bucket that breached the eviction limit
threshold bucket
Which bucket out of the evict-hist-buckets
breached the eviction limit
width
Width of the bucket in seconds
count
Total records found in this bucket
target
Number of records allowed for this eviction cycle and the evict-tenths-pct
used to calculate it
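Assuming the target is derived from evict-tenths-pct as tenths of a percent of the evictable records, the arithmetic behind the example line can be sketched as:

```python
def eviction_target(evictable_records, evict_tenths_pct):
    """Records nsup may evict in one cycle; evict-tenths-pct is tenths of a percent."""
    return evictable_records * evict_tenths_pct // 1000

# With 4,000,000 evictable records and an evict-tenths-pct of 5 (0.5 pct),
# the per-cycle target is 20000 -- the "target 20000 (0.5 pct)" in the example.
print(eviction_target(4_000_000, 5))  # → 20000
```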
{ns-name} nsup deleted 1.56% of namespace - configure more nsup threads?
Logged when an NSUP cycle takes more than 2 hours and deletes more than 1% of the namespace.
{ns-name} nsup-done: non-expirable 42162 expired (576066,922) evicted (24000935,259985) evict-ttl 134000 total-ms 155
Logged when the namespace supervisor (nsup) completes expiration or eviction processing for a namespace. See the most recently logged nsup-start
entry for the namespace to determine which type of processing has just completed. Expiration processing never evicts records, but eviction processing can expire records.
non-expirable
The number of records without a TTL. These records do not expire and are never eligible for eviction.
expired
Number of records removed due to expiration; total since the node started and total for the current nsup cycle. In this example, 922 records expired in the most recent nsup cycle, and 576066 records have expired since the node was last started.
evicted
Number of records evicted (early-expired); total since the node started and total for the current nsup cycle. In this example, nsup evicted 259985 records in the most recent cycle, and 24000935 records since the node was last started.
evict-ttl
The high-end expiration-time of evicted (early-expired) records (in seconds).
total-ms
Duration of the just completed nsup expiration or eviction processing cycle, in milliseconds. In this example, the processing cycle completed in 155 milliseconds.
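For operators collecting these counters, the nsup-done line format can be parsed with a simple regular expression; this sketch assumes the exact field order shown in the example:

```python
import re

# Illustrative parser for the nsup-done line format shown above.
NSUP_DONE = re.compile(
    r"\{(?P<ns>[^}]+)\} nsup-done: non-expirable (?P<non_expirable>\d+) "
    r"expired \((?P<expired_total>\d+),(?P<expired_cycle>\d+)\) "
    r"evicted \((?P<evicted_total>\d+),(?P<evicted_cycle>\d+)\) "
    r"evict-ttl (?P<evict_ttl>\d+) total-ms (?P<total_ms>\d+)"
)

line = ("{ns-name} nsup-done: non-expirable 42162 expired (576066,922) "
        "evicted (24000935,259985) evict-ttl 134000 total-ms 155")
m = NSUP_DONE.search(line)
# First parenthesized value is cumulative since node start; second is this cycle.
print(m.group("ns"), m.group("expired_cycle"), m.group("evicted_cycle"))
# → ns-name 922 259985
```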
{ns-name} nsup-start: evict-threads 1 evict-ttl 2745 evict-void-time (287665530,287665930)
Logged when the namespace supervisor (nsup) begins eviction processing for a namespace.
evict-threads
The number of threads to be used for the eviction processing cycle.
evict-ttl
The specified eviction depth for the namespace, expressed as a time to live threshold in seconds, below which any eligible records are evicted.
evict-void-time
The current effective eviction depth and the specified eviction depth for the namespace. Each is expressed as a void time, in seconds since 1 January 2010 UTC.
{ns-name} nsup-start: expire-threads 1
Logged when the namespace supervisor (nsup) begins expiration processing for a namespace.
expire-threads
The number of threads to be used for the expiration processing cycle. Corresponds to the configured nsup-threads
.
{ns-name} sindex-gc: Processed: 3133360101, found:365961945, deleted: 365952323: Total time: 62667962 ms
Secondary index (sindex) garbage collection cycle summary. Replaced by the sindex-gc-done message in Database 4.6.0.
Processed
Count of sindex entries that have been checked. Corresponds to the sindex_gc_objects_validated
statistic.
found
Count of sindex entries found eligible for garbage collection. Corresponds to the sindex_gc_garbage_found
statistic.
deleted
Count of sindex entries deleted through garbage collection (may be lower than the above number if those entries were deleted while the garbage collector was running, for example through a competing truncate
command). Corresponds to the sindex_gc_garbage_cleaned
statistic.
Total time
Duration of a cycle of sindex garbage collection in milliseconds.
Os
failed OS_CHECK check - MESSAGE
Indicates that a Linux best practice was violated at startup.
OS_CHECK
Name of the check that was violated.
MESSAGE
Description of how the best practice was violated.
Particle
parse_op - error 4 unable to build expression op
This indicates that an attempt to execute an expression failed. Generic warning triggered by failure to parse the underlying expression encoding.
Example: WARNING (particle): (expop.c:127) parse_op - error 4 unable to build expression op
Partition
{ns_name} 2 of 5 nodes are quiesced
Number of nodes quiesced in the cluster (or sub cluster).
When the cluster changes (node addition, removal, or network splits) or when the cluster receives a ‘recluster’ info command.
nodes quiesced
The number of nodes quiesced in this sub-cluster out of the total number of nodes observed in the sub-cluster. For strong-consistency namespaces, the total is the full roster size rather than the number of nodes observed in the sub-cluster.
{ns_name} 5 of 6 nodes participating - regime 221 -> 223
Number of nodes participating in the cluster (or sub cluster) as well as the regime change.
For strong-consistency namespaces, when the cluster changes, because of node(s) leaving or joining the cluster (or network splits).
nodes participating
The number of nodes participating in this cluster out of the total number of nodes for the full roster.
regime
This number increments every time there is a reclustering event. It is used in strong-consistency namespaces and is leveraged by the client libraries. For further details on regime, see Strong consistency.
{ns_name} rebalanced: expected-migrations (1215,1224,1215) fresh-partitions 397
Number of partitions expected to migrate (transmitted, received, and signals) as well as other migration related statistics and fresh partition number.
For non strong-consistency namespaces, when the cluster changes, because of node addition, removal or network splits.
expected-migrations
The number of partitions expected to migrate (Transmitted, Received, Signals) as part of this reclustering event. Those correspond to the migrate_tx_partitions_initial
, migrate_rx_partitions_initial
, and migrate_signals_remaining
statistics respectively.
fresh-partitions
Number of partitions that are created fresh or empty because a number of nodes, greater than the replication factor, has left the cluster.
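The rebalanced line format above can be parsed to recover the migration counters; this sketch assumes the exact field order shown in the example:

```python
import re

# Illustrative parser for the non-strong-consistency rebalanced line.
REBALANCED = re.compile(
    r"\{(?P<ns>[^}]+)\} rebalanced: expected-migrations "
    r"\((?P<tx>\d+),(?P<rx>\d+),(?P<signals>\d+)\) fresh-partitions (?P<fresh>\d+)"
)

line = "{ns_name} rebalanced: expected-migrations (1215,1224,1215) fresh-partitions 397"
m = REBALANCED.search(line)
# tx/rx/signals correspond to migrate_tx_partitions_initial,
# migrate_rx_partitions_initial, and migrate_signals_remaining respectively.
print(m.group("tx"), m.group("rx"), m.group("signals"), m.group("fresh"))
# → 1215 1224 1215 397
```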
{ns_name} rebalanced: regime 295 expected-migrations (826,826,826) expected-appeals 0 unavailable-partitions 425
Number of partitions expected to migrate (transmitted, received, and signals) as well as other migration related statistics and partition availability details.
For strong-consistency namespaces, when the cluster changes, because of node(s) leaving or joining the cluster (or network splits).
regime
This number increments every time there is a reclustering event. This is used in strong consistency namespaces and is leveraged by the client libraries. For further details on regime, see Strong consistency.
expected-migrations
The number of partitions expected to migrate (Transmitted, Received, Signals) as part of this reclustering event. Those correspond to the migrate_tx_partitions_initial
, migrate_rx_partitions_initial
, and migrate_signals_remaining
statistics respectively.
expected-appeals
The number of appeals expected as part of this reclustering event. Appeals occur after a node has been cold-started. The replication state of each record is lost on cold-start and all records must assume an unreplicated state. An appeal resolves replication state from the partition’s acting master. These are important for performance; an unreplicated record must re-replicate to be read, which adds latency. During a rolling cold-restart, an operator may want to wait for the appeal phase to complete after each restart to minimize the performance impact of the procedure. Corresponds to the appeals_tx_remaining
statistic but only at the initial time of the reclustering event.
unavailable-partitions
The number of partitions that are unavailable as the roster is not complete and all writes that have occurred to those partitions are not present. Partitions remaining unavailable after the cluster is formed by the full roster become dead and require the use of the revive
command to make them available again, which could lead to inconsistencies, depending on what led to those partitions being dead. Revived nodes restore availability only when all nodes are trusted. Corresponds to the unavailable_partitions
statistic.
{ns_name} rebalanced: regime 295 expected-migrations (826,826,826) fresh-partitions 397 expected-appeals 0 dead-partitions 425
Number of partitions expected to migrate (transmitted, received, and signals) as well as other migration related statistics and partition availability details.
For strong-consistency namespaces, when the cluster reforms with all roster members but resulting in dead partitions present.
regime
This number increments every time there is a reclustering event. It is used in strong-consistency namespaces and is leveraged by the client libraries. For further details on regime, see Strong consistency.
expected-migrations
The number of partitions expected to migrate (Transmitted, Received, Signals) as part of this reclustering event. Those correspond to the migrate_tx_partitions_initial
, migrate_rx_partitions_initial
, and migrate_signals_remaining
statistics respectively.
fresh-partitions
Number of partitions that are created fresh or empty because a number of nodes, greater than the replication factor, has left the cluster.
expected-appeals
The number of appeals expected as part of this reclustering event. Appeals occur after a node has been cold-started. The replication state of each record is lost on cold-start and all records must assume an unreplicated state. An appeal resolves replication state from the partition’s acting master. These are important for performance; an unreplicated record must re-replicate to be read, which adds latency. During a rolling cold-restart, an operator may want to wait for the appeal phase to complete after each restart to minimize the performance impact of the procedure. Corresponds to the appeals_tx_remaining
statistic but only at the initial time of the reclustering event.
unavailable-partitions
The number of partitions that are dead. Corresponds to the dead_partitions
statistic. Requires the use of the revive
command to make such partitions available again, which could lead to inconsistencies, depending on what led to those partitions being dead.
Proto
protocol write fail: fd 123 sz 30 errno 32
Client has closed the socket before the server could respond. Client may have timed out, has too many open connections, or there may be network issues.
fd
File descriptor of the socket to the client.
sz
Size of the response that couldn’t be sent.
errno
Error number returned by OS. 32, EPIPE, is common.
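The errno value in this message is a standard Linux error number, which can be decoded with Python's errno module:

```python
import errno
import os

# errno 32 from the example decodes to EPIPE ("Broken pipe"): the client
# closed its end of the socket before the server wrote the response.
print(errno.errorcode[32], "-", os.strerror(32))
```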
Query
Could not find nbtr iterator
Indicates that there is a secondary index key with no corresponding digests (a bin value for which there are no records). Possibly a primary index and corresponding record got deleted, but the secondary index still exists. Subsequent round of garbage collection may correct the tree structure. May mean that partial results were returned during the query. If seen frequently, may mean that there is a structural issue with the secondary index. In this case, drop and re-create the index.
starting basic query job 562360700545657251 {test:sb_set5:<pi-query>} n-pids-requested (4096,0) rps 0 sample-max 0 socket-timeout 30000 from 127.0.0.1:51124
A basic query job is initiated on the specified namespace and set.
query job
ID of the query.
{namespace : set : index-name}
The index name is either the name of the secondary index or pi-query
.
n-pids-requested
(n-pids-requested, n-keyds-requested). The first number is the number of partition IDs requested in this partitioned query. The second number is the number of those partitions that start at a specified digest, which hints at whether this query is paginated using a cursor.
rps
Records per second rate requested by the query policy. If not specified, the value of background-query-max-rps
is in effect.
sample-max
The maximum number of records to return from the query, as specified by the query policy.
socket-timeout
The socket timeout requested by the client policy, in milliseconds.
from
Client IP address and port.
Rbuffer
Ring buffer file /opt/aerospike/xdr/digestlog should be at least 18057 bytes Boot strap failed for digest log file (null)
The minimum possible size of the XDR digestlog, as specified in xdr-digestlog-path
, is 18057 bytes. This is not a limitation that should ever come up in practice, because realistic sizes are in the tens or hundreds of GB.
unable to create digest log file /opt/aerospike/xdr/digestlog: No such file or directory
asd tries to create the digest log at start time if it does not already exist, but may fail due to permission issues, a bad directory path, etc.
log path
Path of the file that asd
tried to create from xdr-digestlog-path
system error
The message from the OS giving the reason for failing to create the file.
Record
{AS_PARTICLE} AS_CDT_OP_LIST_APPEND: failed
This error may occur when a collection data type exists for the record and the write policy was set to CREATE_ONLY
.
{AS_PARTICLE} cdt_process_state_context_eval() bin is empty and op has no create flags
The map context does not exist.
{AS_PARTICLE} map_subcontext_by_key() cannot create key with non-storage elements
A map context cannot be created with elements that are not stored, such as a wildcard.
{AS_PARTICLE} packed_list_get_remove_by_index_range() index 155 out of bounds for ele_count 155
There was an attempt to remove an element at index 155 in a list of 155 elements. Since indexes start from 0, the valid range is 0 through 154, so index 155 refers to a 156th element that does not exist. This pattern applies for any out-of-range index. If the list is unpopulated, the error reads index 0 out of bounds for ele_count 0
.
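The same off-by-one boundary can be illustrated with an ordinary list, where a container of ele_count elements has valid indexes 0 through ele_count - 1:

```python
# Index 155 in a 155-element list (the 156th element) is out of bounds,
# mirroring the server message above.
items = list(range(155))  # ele_count = 155, valid indexes 0..154
print(items[154])         # → 154 (last valid index)
try:
    items[155]
except IndexError as e:
    print("out of bounds:", e)
```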
{bigdata} record replace: drives full
No more space left in storage.
ns
Namespace being written to at the time storage filled up.
{namespace_name} record replace: failed write 1142f0217ababf9fda5b1a4de66e6e8d4e51765e
Most likely appearing as a result of exceeding the write-block-size
. The record’s digest is the last item in the log entry. To determine what set is being written to, see How to return the set name of a record using its digest.
{namespace_name}
Namespace being written to.
Roster
(roster_ee.c:151) illegal node id 0
This error may occur during a downgrade from Database 7.2 if you are using SC mode and have an active-rack configured in the roster.
(roster_ee.c:237) {test} invalid node list M100|bb...01|
This error may occur during a downgrade from Database 7.2 if you are using SC mode and have an active-rack configured in the roster.
Rw client
{ns_name} client 10.0.3.182:51160 write {digest}:0x8df238affec6f8e3a2c22d6c54c91c5bc4f3ff81
Provides details on the originating client’s IP address, the transaction type and the digest of the record being accessed.
Single line per client transactions that get as far as successfully reserving a partition. Requires log level to be set to detail for the rw-context: asinfo -v "log-set:id=0;rw-client=detail"
client
The originating client’s IP address.
transaction {digest}
The transaction type (read/write/delete/udf) as well as the digest.
Rw
WARNING (rw): (write.c:926) write_master: null/empty set name not allowed for namespace {namespace}
The write fails because the set name is null or empty and disallow-null-setname
is true
.
{bigdata}: write_master: drives full
No more space left in storage.
ns
Namespace being written to at the time storage filled up.
dup-res ack: no digest
During a downgrade from Database 4.5.3+ to a prior version, this protocol-related warning may be seen temporarily on a node running the prior version. The warning is harmless, the result of an older node briefly receiving 4.5.3+ protocol fabric messages from newer nodes. The message ceases after the next rebalance.
got rw msg with unrecognized op 8
During a downgrade from Database 4.5.3+ to a prior version, this protocol-related warning may appear temporarily on a node running the earlier version. The warning is harmless, the result of an older node briefly receiving 4.5.3+ protocol fabric messages from newer nodes. The message ceases after the next rebalance.
key mismatch - end of universe?
KEY_MISMATCH error. On an update, delete, or read request for a record that has its key stored, the incoming key does not match the existing stored key. A true mismatch would indicate a RIPEMD-160 key collision, which is not likely; see Collision Resistance of RIPEMD-160 for details. In practice, this message occurs in case of a key/hash mismatch on the application side or some message-level corruption.
{namespace} can't get stored key {digest}:0x5230acd92762fa6f827e902d58199dd7b928479c
This message indicates that the index has the stored-key flag set for a record, but the record in storage does not contain the key. Normally this would never happen, but there was a bug in some older Database versions whereby this mismatch could occur when a node containing old data was cold-started and brought back into the cluster with records that had been deleted but not durably deleted. This was corrected in Database 4.4.0.8 and later. Enterprise Edition licensees can contact Aerospike Support for further guidance when encountering this error.
{namespace} drop while replicating
This message indicates that a record was dropped while replicating. For example, if a truncation is run and records are still being replicated while the truncation hits the master record. This can also happen when non durable deletes are allowed (strong-consistency-allow-expunge
) on a strong-consistency enabled namespace. This increments the client_write_error
metric.
{<namespace>} write_master: bin name too long (<length>) <digest>
Expected when a client attempts to create a bin with a name that exceeds the 15 character limit. The error is returned to the client and the operation does not succeed. See Known limitations
{namespace}
Namespace being written to.
length
Length of the bin name sent by the client.
digest
The digest of the record being updated.
{namespace} write_master: failed as_bin_cdt_alloc_modify_from_client()
This error is expected if a list_clear()
is called for a bin that isn't a list or doesn't exist.
{namespace}
Namespace being written to.
{namespace} write_master: failed as_storage_record_write() {digest}:0xd751c6d7eea87c82b3d6332467e8bc9a3c630e13
Result of exceeding the write-block-size
. Setting the rw
and drv_ssd
contexts to detail
logging provides the accompanying explanatory log messages. See log-set
and Changing Log Levels
for how to dynamically change log levels. To determine what set is being written to, see How to return the set name of a record using its digest.
{namespace}
Namespace being written to
{digest}
Digest of the record that was rejected
{namespace} write_master: record too big {digest}:0xd751c6d7eea87c82b3d6332467e8bc9a3c630e13
Appears with the WARNING
message about failed as_storage_record_write()
for exceeding the write-block-size
. To determine what set is being written to, see How to return the set name of a record using its digest.
{namespace}
Namespace being written to
{digest}
Digest of the record that was rejected
{namespace_name} write_master: failed as_storage_record_write() 1142f0217ababf9fda5b1a4de66e6e8d4e51765e
Most likely appearing as a result of exceeding the write-block-size
. The record’s digest is the last item in the log entry. To determine what set is being written to, see How to return the set name of a record using its digest.
{namespace_name}
Namespace being written to.
write_master: disallowed ttl with nsup-period 0
When expirations in a namespace are disabled because nsup-period
is set to the default of 0
, records with a TTL other than 0
may not be written to that namespace. This is to avoid confusion as to whether the records should be subject to expiration (or eviction). Simply set the nsup-period
to a value other than 0
to allow records with a TTL other than 0
to be written. If there is a need to allow records with a non 0
TTL without having the nsup thread running, it is possible to set allow-ttl-without-nsup to true, but this is not recommended, as it prevents those records from being properly deleted upon expiration.
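The rule behind this warning can be restated as a small predicate; this is an illustrative sketch of the described behavior, not server code:

```python
def write_ttl_allowed(ttl, nsup_period, allow_ttl_without_nsup=False):
    """Illustrative restatement: a non-zero TTL is rejected when nsup-period
    is 0, unless allow-ttl-without-nsup is set to true."""
    return ttl == 0 or nsup_period != 0 or allow_ttl_without_nsup

print(write_ttl_allowed(ttl=3600, nsup_period=0))    # → False (this warning)
print(write_ttl_allowed(ttl=3600, nsup_period=120))  # → True
print(write_ttl_allowed(ttl=0, nsup_period=0))       # → True
```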
Scan
basic scan job 283603086331273222 failed to start (4)
A scan job was submitted with maxRetries greater than 0 and failed or timed out on one or more nodes, causing a retry to conflict with the still-running scan that has the same transaction ID and exit with error code 4 (parameter error).
trid
trid of the scan job being retried
error sending to 10.0.0.1:45678 - fd 646 sz 1049036 Connection timed out
A scan is trying to send data back to the client, but it cannot send the full data set. This is likely due to an aborted client-side scan. The scan must be re-run.
client
Client address:port.
fd
File descriptor of the socket to the client.
sz
Size of the response attempted.
error
Error message returned by OS.
not starting scan 261941093 because rchash_put() failed with error -4
Server rejects scan transaction because it is already processing. Wait for the scan to complete or abort it before issuing the scan transaction again.
scan
ID of the scan.
errno
Error code.
scan msg from 10.11.12.13 has unrecognized set name_of_the_set
A scan specifying an unknown set was received. Typically a user error such as a mistyped set name, but this also happens if the set does not exist on all the nodes in the cluster: for example, a set with fewer records than the number of nodes, where some nodes hold no record (master or replica) for that set. Configuring something for the set, such as set-specific access control or a secondary index definition, makes this warning disappear, as the set is then 'known' on the node even without any record in it.
address
Node that the scan message came from.
setname
Set name that doesn’t exist on this node.
send error - fd 646 sz 1049036 rv 521107
A scan is trying to send data back to the client, but it cannot send the full data set. This is likely due to an aborted client-side scan. The scan must be re-run.
fd
File descriptor of the socket to the client.
sz
Size of the response attempted.
rv
Size that was successfully sent.
starting basic scan job 8104671463142312256 {namespace:set} rps 0 sample-max 100 socket-timeout 30000 from 172.22.xx.yy:52842
A basic scan job is initiated on the specified namespace and set.
scan job
ID of the scan.
rps
Configured records per second of the scan. Configured in the scan policy. If this is not specified, the cluster maximum of the background-scan-max-rps parameter is in effect.
sample-max
The maximum number of records to return from the scan, if specified. Configured in the scan policy.
metadata-only
Only if the scan is configured to initiate a metadata scan.
fail-on-cluster-change
If this is enabled in the scan policy, or using AQL, scans are halted if there is a cluster change, such as nodes leaving or joining the cluster. See Why do I get AEROSPIKE_ERR_CLUSTER_CHANGE when querying or scanning a namespace? for details.
socket-timeout
The configured socket timeout. Configured in the client policy. The default is 30 seconds.
client
Client IP address and port.
starting basic scan job 8104671463142312256 {namespace:set} rps 0 sample-pct 100 socket-timeout 30000 from 172.22.xx.yy:52842
A basic scan job is initiated on the specified namespace and set.
scan job
ID of the scan.
rps
Configured records per second of the scan. Configured in the scan policy. If this is not specified, the cluster maximum of the background-scan-max-rps parameter applies.
sample-pct
The percentage of records to return from the scan, if specified. The default is 100 percent. Configured in the scan policy.
metadata-only
Only if the scan is configured to initiate a metadata scan.
fail-on-cluster-change
If this is enabled in the scan policy, or using AQL, scans are halted if there is a cluster change, such as nodes leaving or joining the cluster. See Why do I get AEROSPIKE_ERR_CLUSTER_CHANGE when querying or scanning a namespace? for details.
socket-timeout
The configured socket timeout. Configured in the client policy. The default is 30 seconds.
client
Client IP address and port
Security
FAILED ASSERTION (namespace): (namespace_ee.c:550) can't remove persistent memory base block: Operation not permitted
When running under systemd, this error occurs if shared memory has been allocated by a previous user, and a second user who does not have privileges to access those shared memory segments starts the Aerospike server. Aerospike reads the user from the aerospike.conf file; however, if the shared memory segments are still allocated by the previous user, it may be necessary to force a cold start.
WARNING (security): (security.c:857) unknown security message command 20
Clients that enable LDAP and connect to Aerospike Database versions that do not support server-side LDAP may see this warning when they attempt to log in. The warning is transitory, as the server falls back to the non-LDAP protocol by default. Check each client's release notes to confirm which Database versions support LDAP.
authentication failed (user) | client: 11.22.33.44:57754 | authenticated user: <none> | action: login | detail: user=baduser
Provides details on the internal user not found, including failure, client with IP address and port, authenticated user, action, and user specified.
Occurs when a failed login is attempted with an internal user that is not found in the access control list, if audit logging is configured to report those. See Security Configuration
for details. To enable this log message output, you must set report-authentication
and report-violation
to true
.
authentication failed (user)
The user authentication failed.
client
The client IP Address and port.
authenticated user
The authenticated user, in this example: none.
action
The action being performed, in this case login.
detail
The username involved in the failed authentication.
fd 361 send failed, errno 113
Tried to send the client a message indicating security is not supported on this CE server, but failed due to a socket issue.
fd
ID of the file descriptor the message was sent on.
errno
Linux error code for the problem, usually something like 113 (“No route to host”) or 110 (“Connection timed out”).
login - internal user credential mismatch
Provides a warning that a login failed using a valid internal user that is found in the access control list with an incorrect password.
Occurs when a failed login is attempted using an internal user that exists in the access control list but where an incorrect password has been supplied.
login - internal user not found
Provides a warning that a login failed with an internal user that is not found in the access control list.
Occurs when a failed login is attempted with an internal user that is not found in the access control list. To log the details of the internal user that is not found, report-authentication
and report-violation
must be set to true
. See Security Configuration for details.
login - internal user using ldap
A login failed with an internal user that is not found in the access control list.
Occurs when a login attempt fails with an internal user that is not found in the access control list. The warning can be presented under the following conditions when using an ACL:
- When an encrypted 'external' (clear password encrypted) password is used for LDAP but a stored hashed 'internal' password also exists for that user.
- When an incompatible Aerospike client attempts to connect to Database 4.6 or newer. See Overview of Access Control with LDAP and PKI for details to ensure you are using a compatible Aerospike client version.
- When clusters using Cross-Datacenter Replication (XDR) are running with conflicting Aerospike Database versions. When running XDR and an ACL, Database versions 4.1.0.1 to 4.3.0.6 are incompatible with Database 4.6 or later and cannot ship to 4.6.0.2 or later. The simplest workaround is to avoid using versions 4.1.0.1 to 4.3.0.6. See Internal user warning returned for details.
This warning message typically appears in the logs when upgrading a cluster.
permitted | authenticated user: admin | action: create user | detail: user=bruce;roles=read-write
Provides details on the permitted operation, including the authenticated user, the action, and the operation details.
Occurs when a permitted operation happens under an authenticated user (in this case a user-admin related operation). See Security Configuration
for details.
authenticated user
The authenticated user performing the operation.
action
The transaction type (read/write/delete/udf) or user or data related operation.
detail
The details of the operation, either for a user or admin related operation or namespace, set and digest of the record involved if a single record transaction.
role violation | authenticated user: sally | action: delete | detail: {test|setB} [D|ee50d7c1d0f427ed5c41ef8a18efd85412b973ff]
Details on the role violation, including transaction type, namespace, set and relevant record’s digest.
Occurs when a role violation happens, if audit logging is configured to report those. See Security Configuration
for details.
authenticated user
The authenticated user violating the role’s permissions.
action
The transaction type (read/write/delete/udf) or user or data related operation.
detail
The namespace, set and digest of the record involved in the role violation.
Service
refusing client connection - proto-fd-max 50000
The server has reached the configured maximum for incoming file descriptors (proto-fd-max
). This corresponds directly to incoming client connections. Connections greater than the maximum are refused.
proto-fd-max
Currently configured value for proto-fd-max
unsupported proto version 0 from 12.0.0.1:5598
Indicates that messages of an unexpected type are being sent to the main server port (usually port 3000) by other nodes in the cluster. This can happen if the nodes have been misconfigured to use the main server port instead of the correct port. For example, the configuration for Mesh heartbeats should specify port 3002, not port 3000.
Sindex
QTR Put in hash failed with error -4.
A duplicate query is being executed on the Aerospike server. On the client side, you should see error code 215 (AS_PROTO_RESULT_FAIL_QUERY_DUPLICATE)
. There are two possible causes.
- Retries are enabled for secondary index query jobs. We do not recommend retries for secondary index queries, to avoid multiple runs with the same Job ID.
Queuing namespace {ns-name} for sindex population by device scan
During startup, when the namespace’s devices are about to be scanned in order to populate secondary indexes. This happens when sindex-startup-device-scan is true
.
Queuing namespace {ns-name} for sindex population by index scan
During startup, when the namespace’s primary index is about to be scanned in order to populate secondary indexes. This happens when sindex-startup-device-scan is false
.
Sindex-ticker: ns=ns-name si=<all> obj-scanned=500000 si-mem-used=47913 progress= 2% est-time=2336995 ms
Information logged at startup when secondary indices are being rebuilt.
ns
Namespace name.
si
Secondary indexes being rebuilt.
obj-scanned
Number of objects scanned.
si-mem-used
Memory used by the secondary indices.
progress
Progress in percent.
est-time
Estimated remaining time in milliseconds for the secondary indices to be fully rebuilt.
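The est-time figure can be approximated from the other fields. A sketch that extrapolates linearly from elapsed time and percent complete (the server's exact formula is an assumption):

```python
def estimate_remaining_ms(elapsed_ms: float, progress_pct: float) -> float:
    """Extrapolate remaining sindex rebuild time from elapsed time and
    percent done. Assumes progress advances roughly linearly -- an
    approximation of the server's estimate, not its exact formula."""
    if progress_pct <= 0:
        return float("inf")  # no progress yet, cannot estimate
    return elapsed_ms * (100.0 - progress_pct) / progress_pct

# At 2% done after ~47,700 ms, roughly 2,337,000 ms remain, in the same
# ballpark as the est-time in the example ticker line above.
```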
{ns-name} sindex-gc-done: cleaned (40000,40000) total-ms 23
Secondary index (sindex) garbage collection cycle summary.
cleaned
Count of sindex entries that have been cleaned (cumulative, current). First value corresponds to the sindex_gc_cleaned
statistic. Second value corresponds to the number of sindex entries cleaned in the current round.
total-ms
Duration of the sindex garbage collection cycle in milliseconds.
{ns-name} sindex-gc-done: processed 3133360101 found 365961945 deleted 365952323 total-ms 62667962
Secondary index (sindex) garbage collection cycle summary. Replaced by a modified version of the sindex-gc-done
message in Database 5.7.
processed
Count of sindex entries that have been checked. Corresponds to the sindex_gc_objects_validated
statistic.
found
Count of sindex entries found eligible for garbage collection. Corresponds to the sindex_gc_garbage_found
statistic.
deleted
Count of sindex entries deleted through garbage collection. This count may be lower than the found count if entries were deleted while the garbage collector was running, for example through a competing truncate
command. Corresponds to the sindex_gc_garbage_cleaned
statistic.
total-ms
Duration of the sindex garbage collection cycle in milliseconds.
Socket
Error while adding FD <FD> to epoll instance <Poll-FD>: <error-code> (<error-string>)
A double-close of file descriptors can lead to this assertion under certain conditions. If a file descriptor (FD) is closed twice in rapid succession, it usually fails silently after the first successful close. However, if the FD is reused between the first and second close due to a race condition, the second close could mistakenly target an active connection. This can result in errors like “Bad file descriptor” when the reused FD is assigned to a service thread’s epoll instance.
Example: Error while adding FD 2817 to epoll instance 1105: 9 (Bad file descriptor)
See this Bad File Descriptor knowledge base article for details on one known cause of this assertion when using multiple LDAP servers in the Aerospike configuration with at least one of them failing to connect.
FD : File descriptor
Poll-FD : epoll instance file descriptor
error-code : OS Error Code
error-string : OS Error code description
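The failure mode can be reproduced in miniature: closing a file descriptor that is already closed fails with errno 9 (EBADF), the same OS error quoted in the example above. A sketch (not Aerospike code):

```python
import errno
import os

# Obtain a valid file descriptor via a pipe.
r, w = os.pipe()
os.close(r)              # first close: succeeds
try:
    os.close(r)          # second close: the FD is no longer valid
    second_close_errno = None
except OSError as e:
    second_close_errno = e.errno
os.close(w)

# The second close fails with EBADF (9, "Bad file descriptor") -- the
# same error the epoll_ctl call reports when handed a stale FD.
assert second_close_errno == errno.EBADF
```

The dangerous case described above is when another thread reuses the FD number between the two closes: the second close then silently targets a live connection instead of failing.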
Error while connecting: 113 (No route to host)
Failed to make the connection to peer node for sending heartbeat messages.
errno
Linux error code returned by the OS.
error message
Message corresponding to errno.
Error while connecting socket to 10.100.0.101:3002
The node is unable to connect to a configured peer node. The peer may not be up, or there may be connectivity issues. If the node had been permanently removed, see Removing a Node.
address:port
Address and port that could not be connected to.
Error while creating socket for 10.168.10.1:3002: 24 (Too many open files)
Too many file descriptors are in use. This can lead to an assertion that would cause the node to abort. See System running out of file descriptors.
bind: socket in use, waiting (port:3001)
Some other process is already using that port and asd cannot listen on it. Use lsof -i:3001
or netstat -plant | grep :3001
to find out what process is causing the conflict.
port
Port number that asd needs to listen on.
epoll_create() failed: 24 (Too many open files)
Occurs when the server has hit the system-configured file descriptor limit, which leads to an assertion causing the node to abort. See System running out of file descriptors.
too many addresses for interface <interface> - truncating <IP address>
Too many IP addresses associated with a network interface. Limit is 20 addresses per interface.
Storage
compression is not available for storage-engine memory
This message means that an attempt was made to enable compression on a namespace with storage-engine memory, which is not supported.
{ns_name} partitions shut down
This message is one of a sequence of messages logged during Aerospike server shutdown of storage-engine device namespaces. The message signifies that all of the namespace’s partitions and index trees have been locked, so that no records are accessible.
{ns_name} storage devices flushed
This message is one of a sequence of messages logged during Aerospike server shutdown of storage-engine device namespaces. The message signifies that the data in write buffers for the namespace’s devices has been successfully flushed to those devices.
{ns_name} storage-engine memory - nothing to do
This message is logged during Aerospike server shutdown of storage-engine memory namespaces. The message simply notes that there are no storage shutdown tasks needed for the namespace.
Tls
SSL_accept I/O unexpected EOF with 127.0.0.1:46630
The server is attempting to use a connection that has been closed by the client. This can happen if a client times out a transaction, for example if the server is too slow to respond or if there are network disruptions.
IP:port
Source IP address and port of the client connection.
SSL_accept with 127.0.0.1:34086 failed: error:140890C7:SSL routines:ssl3_get_client_certificate:peer did not return a certificate
The client sent a proper CA certificate for mutual authentication that matches the server CA certificate, but didn’t send a client certificate.
IP:port
Source IP address and port of the client connection.
SSL_accept with 127.0.0.1:34094 failed: error:14089086:SSL routines:ssl3_get_client_certificate:certificate verify failed
The client sent a proper client certificate and CA certificate for mutual authentication, but sent a TLS name that doesn’t match the server’s TLS name as set in tls-authenticate-client
and the server’s certificate.
IP:port
Source IP address and port of the client connection.
SSL_accept with 127.0.0.1:34100 failed: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca
The client sent a proper client certificate for mutual authentication, but sent a CA certificate that either doesn’t match the client certificate, or doesn’t match the server CA certificate.
IP:port
Source IP address and port of the client connection.
SSL_read I/O unexpected EOF with 127.0.0.1:46491
The server is attempting to use a connection that has been closed by the client. This can happen if a client times out a transaction, for example if the server is too slow to respond or if there are network disruptions.
IP:port
Source IP address and port of the client connection.
TLS verify result: unable to get local issuer certificate
A CA or intermediate CA is not trusted by the server. This may be caused by a self-signed certificate or CA that is not yet added to the server truststore.
WARNING (tls): (tls_ee.c:318) SSL_shutdown I/Oerror with (unknown): Transport endpoint is not connected
Indicates that a socket is not connected. Adjacent messages show that the socket is being used by heartbeats, although similar messages could appear concerning any socket using TLS. Possible causes:
- Interruption in network layer
- Abrupt node restart if observed on heartbeat
- Abrupt client restart if observed in client traffic
- Server unable to send results to a client due to client timing out the socket
Truncate
{ns-name} done truncate to T deleted N
This message indicates a request to truncate up to timestamp T has finished, after deleting N objects.
Truncate process completed.
{ns-name} done truncate
Truncate command completed.
{ns-name} flagging truncate to restart
Namespace needs a second truncation pass after the current pass completes. Truncation can act on more than one set at a time, so if a truncate command is received for a set in a namespace that is already being truncated (for a different set, for example), a subsequent pass is required. For example, suppose a truncate of set s2 is ongoing when a truncate command for set s4 is issued. From the moment this log message appears (when the s4 command is issued), both s2 and s4 are truncated together until the end of the current pass. Truncation then restarts another pass through the namespace, so that the rest of s4 gets truncated.
{ns-name} starting truncate
Truncate command being processed for the namespace.
{ns-name} truncated records (10,50)
Truncate command being processed for the namespace.
(current,total)
Current truncation count (10 in this example) followed by the total number of records have been deleted by truncation since the server started (50 in this example). Those counts are only kept at the namespace level.
{ns-name|set-name} got command to truncate to now (226886718769)
Truncate command received. Appears on the node where the info command was issued. The command is distributed to other nodes using system metadata (SMD), and only the truncating/starting/restarting/truncated/done log entries appear on those nodes.
(timestamp)
The LUT time in milliseconds since the Citrusleaf epoch (00:00:00 UTC on 1 Jan 2010).
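Converting such a timestamp to wall-clock time only requires adding the Citrusleaf epoch offset to get Unix time. A small sketch:

```python
from datetime import datetime, timezone

# Citrusleaf epoch: 00:00:00 UTC on 1 Jan 2010, expressed in Unix seconds.
CITRUSLEAF_EPOCH = 1262304000

def cl_ms_to_datetime(cl_ms: int) -> datetime:
    """Convert a Citrusleaf-epoch timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(CITRUSLEAF_EPOCH + cl_ms / 1000.0,
                                  tz=timezone.utc)

# Example: the timestamp from the log line above.
# cl_ms_to_datetime(226886718769) -> 2017-03-11 00:05:18 UTC (approximately)
```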
{ns-name|set-name} tombstone covers no records on this node
Occurs at cold start-up and is a listing of all set truncations found in the truncation SMD file (System Meta Data) that are not covering any records. The internal mechanism is that all set truncation tombstones are marked as cenotaph before records are read from drive to build the index. The term tombstone refers here to such a truncation related entry, not to be confused with record level tombstones. We only cycle through those tombstones one time for a namespace during a cold-start. If the drives are all fresh, there is a cenotaph log message for every set truncation in SMD. After the drives are read, which could potentially restore records from truncated sets and strip a set-covering tombstone of its cenotaph status, all the set-truncation tombstones that are still cenotaphs (not covering any tombstones) are listed.
{ns-name|set-name} truncating to 226886718769
Truncate command received. Appears on all the nodes after a truncate command is issued.
timestamp
The LUT time in milliseconds since the Citrusleaf epoch (00:00:00 UTC on 1 Jan 2010).
Tsvc
rejecting client transaction - initial partition balance unresolved
When using some older Aerospike client libraries, there is a very small window during which a node has joined the cluster and is advertised by other nodes, but has not yet finished creating its partition table. If a client picks up the advertised service and makes a request during that window, this message is logged. It should resolve on its own very quickly.
transaction is neither read nor write - unexpected
An invalid request has been received, either a read/write request where the corresponding bit is not set, or an operate() command with no operations. The error code -4 (FAIL_PARAMETER) is returned to the client if there is one, but this message can also be caused by non-Aerospike traffic reaching the service port.
Udf
UDF bin limit (512) exceeded (bin activity_map)
As per the UDF known limitations page, records with more than 512 bins cannot be read or written by a UDF.
bin
Name of the 513th bin.
UDF timed out 734 ms ago
An individual UDF transaction exceeded an internal timeout interval. Time is checked at regular intervals (measured in batches of Lua instructions).
elapsed time
Milliseconds since the timeout period expired. See transaction-max-ms
for details on when transactions could timeout on the server.
WARNING (udf): (udf_aerospike.c:226) update found urecord not open
Indicates that aerospike:update() was called to update a record that the UDF system did not or could not open. Most likely the record expired or was truncated at the time the UDF ran.
bin limit of 512 for UDF exceeded: 511 bins in use, 1 bins free, 3 new bins needed
A UDF operation that adds bins to a record would result in more than 512 bins in the record. As per the UDF known limitations page, records with more than 512 bins cannot be read or written by a UDF.
in-use bins
Number of bins already in the record.
free bins
512 - (# of in-use bins)
new bins
Number of new bins that would need to be added for this operation to succeed.
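The arithmetic behind these messages is straightforward; a sketch of the check (names are illustrative, not server internals):

```python
UDF_BIN_LIMIT = 512  # maximum bins a UDF can read or write in a record

def udf_write_allowed(in_use_bins: int, new_bins: int) -> bool:
    """True if adding `new_bins` keeps the record within the UDF bin limit."""
    free_bins = UDF_BIN_LIMIT - in_use_bins
    return new_bins <= free_bins

# 511 in use, 1 free, 3 new bins needed -> rejected, as in the log line above.
assert not udf_write_allowed(511, 3)
assert udf_write_allowed(510, 2)
```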
bin limit of 512 for UDF exceeded: 512 bins in use, 0 bins free, >4 new bins needed
A UDF operation is trying to add bins to a record that already has 512 bins. As per the UDF known limitations page, records with more than 512 bins cannot be read or written by a UDF.
new bins
Number of new bins that would need to be added for this operation to succeed.
drives full, record is not updated
No space left in storage (encountered while executing a UDF).
exceeded UDF max bins 512
UDF is attempting to set a bin value, which would result in writing a record with more than 512 bins. As per the UDF known limitations page, records with more than 512 bins cannot be written by a UDF.
large number of lua function arguments (22)
As per the Known Limitations page, calling a Lua function with a large number of arguments can cause instability in the Lua runtime engine. Although this problem is only known to become acute at around 50 arguments, lower values may still result in issues with execution of the UDF.
arg count
Number of arguments in the UDF call
{namespace_name} failed write 1142f0217ababf9fda5b1a4de66e6e8d4e51765e
Most likely appearing as a result of exceeding the write-block-size
with a UDF write. The record’s digest is the last item in the log entry. To determine what set is being written to, see How to return the set name of a record using its digest.
{namespace}
Namespace being written to.
record has too many bins (513) for UDF processing
As per the UDF known limitations page, records with more than 512 bins cannot be read or written by a UDF. Such records can exist in Aerospike, however.
bins
How many bins the record has.
too many bins (513) for UDF
UDF is attempting to use a record (with Database 5.1 or older) or access the bins of a record (with Database 5.2+) with more than 512 bins. As per the UDF known limitations page, records with more than 512 bins cannot be accessed by a UDF, unless the UDF is read-only and accesses only the record’s metadata, with Database 5.2 and later.
bins
Number of bins in the record.
too many bins for UDF
UDF is attempting to set a bin value, but has already set 512 bin values. As per the UDF known limitations page, records with more than 512 bins cannot be written by a UDF.
udf applied with forbidden policy
A result of the client submitting a record UDF request with one or more forbidden policies indicated. A parameter error (code 4) is returned to the client in this situation. The forbidden policies are:
- Generation
- Generation gt
- Create only
- Update only
- Create or replace
- Replace only
Any of these policies can be enforced within the UDF itself using Aerospike's Lua API.
Prior to Database 5.2, the policy flags are ignored.
udf_aerospike_rec_update: failure executing record updates (-3)
UDF could not update a record, so the update was rolled back. See earlier messages in the log for further details.
error code
Undocumented error code.
Xdr client
DC <DCNAME> receive error [2] on <xx.xx.xx.xx:3000>
On its own (i.e., with no other associated warning such as a bad protocol message), this warning is benign. It can also happen if there is a connection reset while trying to read from the socket; a potential cause is a node being restarted on the remote side.
connected to <DCNAME> <xx.xx.xx.xx>:<xxxx>
This is logged on a source node whenever the xdr-client tend thread establishes a successful connection to the destination node. The line is logged under the following conditions:
- The first time the tend thread successfully connects to the destination node. A login is then performed, but only if security is enabled in the configuration for the destination.
- The tend thread has to re-establish the connection again as the login attempt failed when security was enabled on the destination node.
- The tend thread has to re-establish the connection in cases such as source or destination restart, or any network interruption. The login flow can be found at XDR login flow.
DCNAME
The destination DC name, followed by the IP and port of the destination node, e.g., connected to destdc 172.17.0.5:3116
connected to connector <IP>:<Port>
This message is logged every 5 seconds and reflects the background health-check connection used for connector destinations.
<dc> <node-address-port> failed check <x> times
WARNING (xdr-client): (cluster.c:1010) aerospike-kafka-source 10.242.145.46:8080 failed check 130 times
When a connector DC is configured under the XDR context, xdr-client
tends every 5 seconds. After 5 consecutive failures, this warning is logged on the source node.
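The behavior amounts to a consecutive-failure counter evaluated on each 5-second tend. A sketch of the logic (the threshold and interval come from the description above; the class and names are illustrative, not server internals):

```python
class ConnectorHealth:
    """Track consecutive failed health checks for a connector destination."""

    WARN_THRESHOLD = 5  # consecutive failures before the warning is logged

    def __init__(self):
        self.consecutive_fails = 0

    def record_check(self, ok: bool) -> bool:
        """Record one tend result; return True when a warning should be logged."""
        if ok:
            self.consecutive_fails = 0  # any success resets the streak
            return False
        self.consecutive_fails += 1
        return self.consecutive_fails >= self.WARN_THRESHOLD

# Five straight failures trigger the warning; the counter keeps growing
# afterwards, which is how "failed check 130 times" accumulates.
h = ConnectorHealth()
assert [h.record_check(False) for _ in range(5)] == [False] * 4 + [True]
```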
logged in to node <xx.xx.xx.xx>:<xxxx> - session-ttl 86400
Logged on a source node when there is a successful login from the xdr-client tend thread to a destination node. It contains the destination IP, port and the session-ttl
configured on the destination node during the login. The session is valid for this TTL period even if the TTL value changes on the destination node later, unless the session is interrupted by a node restart (source or destination) or auth failure which forces the tend thread to do a re-login. The login flow can be found at XDR login flow.
login to node xx.xx.xx.xx:xxxx failed: <ERROR-CODE>
Logged on a source node when there is a login failure to a destination node. It contains the destination node IP and port. The login flow can be found at XDR login flow.
ERROR-CODE
The error code returned by the destination server, e.g., failed: 65
refreshing session token for <DCNAME> <xx.xx.xx.xx>:<xxxx>
Logged on a source DC node one minute before the session-ttl
configured on the destination node expires. The destination node sets the expiry of the access token provided to the xdr-client a minute shorter than the actual expiration time (the renewal margin), and the xdr-client tend thread refreshes the token one minute before that set expiry. This message means the source node is issuing a LOGIN command to refresh its token for the specified destination DC node (IP:port). The login flow can be found at XDR login flow.
DCNAME
The destination DC name, followed by the IP and port of the destination node, e.g., refreshing session token for destdc 172.17.0.5:3116
security not configured on node <xx.xx.xx.xx>:<xxxx>
Logged on a source node when auth is configured on the source node under the XDR context, but security is not enabled on the destination cluster node. It contains the destination IP and port. Logged only once, when the connection is established by the tend thread; it does not affect XDR operation after the connection is established. The login flow can be found at XDR login flow.
Xdr
[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 errcl 0 errsrv 0 sz 6
Every 1 minute, for each configured destination cluster (or DC).
[DC_NAME]
Name and status of the DC. Here are the different statuses: CLUSTER_INACTIVE, CLUSTER_UP, CLUSTER_DOWN, CLUSTER_WINDOW_SHIP. Corresponds to the dc_state
statistic.
timelag-sec
The lag in seconds. This is computed as the difference between the current time and the time-stamp of the record that was last successfully shipped. This provides a sense of how ‘far behind’ the destination cluster lags behind the source cluster. This does not correspond to the time it takes the source cluster to ‘catch up’, nor does it necessarily relate to the number of outstanding digests to be processed. Corresponds to the dc_timelag
statistic.
lst
The overall last ship time for the node (the minimum of all last ship times on this node).
mlst
The main last ship time (the last ship time of the dlogreader).
fnlst
The failed node last ship time (the minimum of the last ship times of all failed node shippers running on this node).
wslst
The window shipper last ship time (the minimum of the last ship times of all window shippers running on this node).
shlat-ms
Corresponds to the xdr_ship_latency_avg
statistic.
rsas-ms
Average sleep time for each write to the DC for the purpose of throttling. Corresponds to the dc_ship_idle_avg
statistic. (Stands for remote ship average sleep ms).
rsas-pct
Percentage of throttled writes to the DC. Corresponds to the dc_ship_idle_avg_pct
statistic. (Stands for remote ship average sleep pct).
con
Number of open connection to the DC. If the DC accepts pipeline writes, there are 64 connections per destination node. Corresponds to the dc_as_open_conn
statistic.
errcl
Number of client layer errors while shipping records for this DC. Errors include timeout, bad network fd, etc. Corresponds to the dc_ship_source_error
statistic.
errsrv
Number of errors from the remote cluster(s) while shipping records for this DC. Errors include out-of-space, key-busy, etc. Corresponds to the dc_ship_destination_error
statistic.
sz
The cluster size of the destination DC. Corresponds to the dc_as_size
statistic.
Digest Log Write Failed !!! ... Critical error
XDR digestlog has grown larger than its partition. See Solution: Digestlog Partition Out Of Space for more information.
Failed to seek during reclaim. Leaving sptr as is
Benign message if the node on which this is logged did not receive new writes into the digest log. A background process running once a minute tries to reclaim the digest log based on a timestamp. It starts sampling the log by looking at the timestamp of the last record written in the digestlog. If it doesn’t find a single record, it stops early and prints this message.
INFO (info): (dc.c:1024) xdr-dc DC1: lag 0 throughput 0 latency-ms 0 in-queue 0 outstanding 0 complete (3000,0,0,0) retries (0,0) recoveries (4096,0) hot-keys 0
lag
See lag
.
throughput
See throughput
.
latency-ms
See latency_ms
metric.
in-queue
See in_queue
.
in-progress
See in_progress
.
complete
Composed of the following metrics:
retries
Composed of the following metrics:
recoveries
Composed of the following metrics:
hot-keys
See hot_keys
.
INFO (info): (dc.c:1353) xdr-dc dc2: lag 12 throughput 710 latency-ms 19 in-queue 250563 in-progress 81150 complete (1002215,0,0,0) retries (0,0,23) recoveries (2048,0) hot-keys 4655
lag
See lag
.
throughput
See throughput
.
latency-ms
See latency_ms
.
in-queue
See in_queue
.
in-progress
See in_progress
.
complete
Composed of the following metrics:
retries
Composed of the following metrics:
recoveries
Composed of the following metrics:
hot-keys
See hot_keys
.
INFO (info): (dc.c:1353) xdr-dc dc2: nodes 8 lag 12 throughput 710 latency-ms 19 in-queue 250563 in-progress 81150 complete (1002215,0,0,0) retries (0,0,23) recoveries (2048,0) hot-keys 4655
In the XDR log line we use the following values when multiple namespaces are shipped to a DC:
- nodes refers to DC-level alone
- lag is the max of all namespaces
- latency is the average across namespaces
- everything else is the sum across namespaces
nodes
See nodes
.
lag
See lag
.
throughput
See throughput
.
latency-ms
See latency_ms
.
in-queue
See in_queue
.
in-progress
See in_progress
.
complete
Composed of the following metrics:
retries
Composed of the following metrics:
recoveries
Composed of the following metrics:
hot-keys
See hot_keys
.
{NAMESPACE} failed to create set
A record belonging to a previously-unknown set was received via XDR, but the limit on the number of sets in the namespace has already been reached locally, so the new set could not be created. If it happens while Aerospike is starting up, this message may be followed by an abort with SIGUSR1.
namespace
The namespace where the set could not be created.
XDR digestlog cannot keep up with writes. Dropping record.
Aerospike is writing data faster to the XDR digestlog than the underlying disk can handle. If using raw-disk backed xdr storage, consider switching over to file-backed xdr storage for the xdr-digestlog-path parameter. This takes advantage of filesystem caching for reads and writes. Otherwise, you may need to use a faster disk or join multiple disks using RAID-0 to allow for faster read/write to the XDR digestlog. Also ensure that dmesg is properly checked for disk failures and a SMART disk test is performed.
'XXXX' cluster does not support pipelining
This message usually indicates that one of the destinations is running an older version of XDR (pre-3.8) or there exists a misconfiguration of the kernel configs /proc/sys/net/core/wmem_max and /proc/sys/net/core/rmem_max between source and destination clusters.
detail: sh 5588691 ul 12 lg 11162298 rlg 54 rlgi 0 rlgo 54 lproc 11162198 rproc 45 lkdproc 0 errcl 54 errsrv 0 hkskip 6303 hkf 6299 flat 0
sh
The cumulative number of records that have been attempted to be shipped since this node started, across all datacenters. If a record is shipped to 3 different datacenters, then this number increments by 3. Corresponds to the sum of the xdr_ship_success
, xdr_ship_source_error
and xdr_ship_destination_error
statistics.
ul
The number of record’s digests that have been written to the node but not logged yet to the digestlog (unlogged).
lg
The number of record’s digests that have been logged. Includes both master and replica records; a node ships only records for which it owns the master partition, and processes records belonging to its replica partitions only when a neighboring source node goes down.
rlg
Relogged digests. The number of record’s digests that have been relogged on this node due to temporary failures when attempting to ship. Corresponds to the dlog_relogged
statistic.
rlgi
Relogged incoming digests. The number of record’s digest that another node sent to this node (typically prole side relog or partition ownership change). Corresponds to the relogged_incoming
statistic.
rlgo
Relogged outgoing digests. The number of record’s digest log entries that were sent to another node (typically prole side relog or partition ownership change). Corresponds to the relogged_outgoing
statistic.
lproc
The number of record’s digests that have been processed locally. A processed digest does not necessarily imply a shipped record (for example, replica digests don’t get shipped unless a source node is down, and hotkeys also don’t have all their updates necessarily shipped). Corresponds to the dlog_processed_main
statistic.
rproc
The number of replica record’s digests that have been processed by this node. A node processes records belonging to its replica partitions only when a neighboring source node goes down. Corresponds to the dlog_processed_replica
statistic.
lkdproc
The number of record’s digests that have been processed as part of a linked down session. A link down session is spawned when a full destination cluster is down or not reachable. Corresponds to the dlog_processed_link_down
statistic.
errcl
The number of errors encountered when attempting to ship due to the embedded client. For example, if the local XDR embedded client is having issues or delays in establishing connections. Corresponds to the xdr_ship_source_error
statistic.
errsrv
The number of errors encountered when attempting to ship due to the destination cluster. For example if the destination cluster is temporarily overloaded. Corresponds to the xdr_ship_destination_error
statistic.
hkskip
Hotkey skipped. Represents the number of record’s digests that are skipped due to an already existing entry in the reader’s thread cache (meaning a version of this record was just shipped). Corresponds to the xdr_hotkey_skip statistic.
hkf
Hotkey fetched. Represents the number of record’s digests that are actually fetched and shipped because their cache entries expired and were dirty. Corresponds to the xdr_hotkey_fetch statistic.
flat
The average time in milliseconds to fetch records locally (this is an exponential moving average - 95/5). Corresponds to the xdr_read_latency_avg
statistic.
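Such key-value log lines can be pulled apart mechanically; a sketch of a parser for the detail: line above (all values in this particular line are integers):

```python
def parse_detail(line: str) -> dict:
    """Parse an XDR 'detail:' log line into a {key: int} mapping."""
    _, _, body = line.partition("detail:")
    tokens = body.split()
    # Tokens alternate key, value: 'sh 5588691 ul 12 ...'
    return {k: int(v) for k, v in zip(tokens[::2], tokens[1::2])}

stats = parse_detail(
    "detail: sh 5588691 ul 12 lg 11162298 rlg 54 rlgi 0 rlgo 54 "
    "lproc 11162198 rproc 45 lkdproc 0 errcl 54 errsrv 0 "
    "hkskip 6303 hkf 6299 flat 0"
)
assert stats["sh"] == 5588691 and stats["errcl"] == 54
```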
dlog: free-pct 93 reclaimed 2456 glst 1490623097271 (2017-03-27 13:58:17 GMT)
Provides digest log (dlog) related information.
Every 1 minute.
free-pct
Percentage of the digest log free and available for use. Corresponds to the dlog_free_pct
statistic.
reclaimed
Indicates how many digests were safely ‘removed’ from the digestlog. As shipping successfully proceeds and records are shipped, digests which are no longer necessary can have their space reclaimed in the digest log. A link down session (destination cluster down or unreachable) is an example where digest log space cannot be reclaimed.
glst
The minimum last ship time across all nodes in the cluster. Corresponds to the xdr_global_lastshiptime
statistic. This specifies up to what point slots in the digest log can be reclaimed, by keeping track of the oldest last ship time across all nodes in the cluster.
dlog-q: capacity 64 used-elements 1 read-offset 0 write-offset 1
Status information on the dlog-q. The dlog-q queue is the in-memory digest log queue. Digests of records that have been written get put on this in-memory queue. The dlogwriter picks them from there and puts them in the on-disk digest log. See below for the details on the 4 numbers.
Every 1 minute.
capacity
The size of the queue.
used-elements
The number of elements in the queue.
read-offset
The read pointer of the queue. In general, the number of elements in the queue is the difference between the read pointer and the write pointer (that is, the number of elements that have been written to the queue but haven’t yet been read).
write-offset
The write pointer of the queue. In general, the number of elements in the queue is the difference between the read pointer and the write pointer (that is, the number of elements that have been written to the queue but haven’t yet been read).
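The relationship among the four numbers can be sketched as follows (assuming the offsets grow monotonically, which matches the description above but is an assumption about the implementation):

```python
def dlog_q_used(read_offset: int, write_offset: int) -> int:
    """Elements currently in the queue: written but not yet read."""
    return write_offset - read_offset

# Matches the example line: read-offset 0, write-offset 1 -> 1 used element.
assert dlog_q_used(0, 1) == 1
```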
{namespace} xdr-tomb-raid-done: dropped <n> total-ms <t>
INFO (xdr): (xdr_ee.c:206) {users_eu} xdr-tomb-raid-done: dropped 0 total-ms 308923
XDR tomb raider has finished a cycle, which is expected to run every xdr-tomb-raider-period seconds, having dropped n records in t ms.
n
Number of records dropped by the current cycle of XDR tomb raider.
t
Number of ms XDR tomb raider cycle took to finish.
{namespaceName} DC DC1 abandon result -4
In strong-consistency enabled namespaces, if XDR finds a record which has not been replicated, re-replication from master to replica is triggered, and XDR prints this warning message. XDR attempts to ship that record again. See XDR transaction delays for details on how XDR handles records in strong-consistency enabled namespaces. See XDR 5.0 Error codes.
summary: throughput 3722 inflight 164 dlog-outstanding 100 dlog-delta-per-sec -10.0
throughput
The current throughput, shipping to destination cluster(s). When shipping to multiple clusters, the throughput represents the combined throughput to all destination clusters. Corresponds to the xdr_throughput
statistic.
inflight
The number of records that are inflight, meaning that have been sent to the destination cluster(s) but for which a response has not been received yet. Corresponds to the xdr_ship_inflight_objects
statistic.
dlog-outstanding
The number of record’s digests yet to be processed in the digest log. The companion dlog-delta-per-sec value shows the average change, normalized to digests per second, over the 10-second interval separating these log lines. Corresponds to the xdr_ship_outstanding_objects
statistic.
dlog-delta-per-sec
The variation of the dlog-outstanding normalized on a per second basis. Gives an idea whether the number of entries in the digestlog is increasing or decreasing over time and at what pace.
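dlog-delta-per-sec can be recomputed from two consecutive summary samples; a sketch (the 10-second interval between summary lines follows the dlog-outstanding description above):

```python
def dlog_delta_per_sec(prev_outstanding: int, curr_outstanding: int,
                       interval_sec: float = 10.0) -> float:
    """Per-second change in outstanding dlog entries between two samples.

    Negative values mean the backlog is shrinking; positive values mean
    writes are outpacing digest log processing.
    """
    return (curr_outstanding - prev_outstanding) / interval_sec

# Outstanding dropped from 200 to 100 over 10 s -> -10.0, matching the
# example summary line above.
assert dlog_delta_per_sec(200, 100) == -10.0
```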