“When you want to do something, do it right away. Do it when you can. It’s the only way to live a life without regrets.” -- Sonic The Hedgehog
This project aims to step outside of traditional ways of integrating with Neo4j and expose things like the in-memory GDS graph projection via high-performance Arrow Flight APIs for streaming nodes, relationships, and their properties.
`neo4j-arrow` has been shown to:
- ⏩ Move data out of Neo4j 20x faster than the Neo4j Java Driver
- ⏩ Move data out of Neo4j 450x faster than the Neo4j Python Driver
- 🏎️ Facilitate live-migration of GDS graph projections, moving 50m nodes, 200m relationships, and 50m 256-degree feature vectors between live Neo4j instances in different GCP regions in under 15 minutes.
- Provide a faster integration with BigQuery, extracting and loading 50m nodes with 256-degree feature vector properties in under 2 minutes.
- 😃 Make you happier and more at peace.
Most users will want to use the plugin form of `neo4j-arrow`. The most recent tagged version is available on the releases page.
To build your own from source, you'll need:
- A JDK 11 distribution
- Internet access to the public Maven repo
Simply run:

```
$ ./gradlew plugin:shadowJar
```

You should end up with a Neo4j plugin jar in `./plugin/build/libs/` that you can drop into a Neo4j system.
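As a rough sketch, installing the built jar into a local Neo4j installation might look like the following. The exact jar filename and the `NEO4J_HOME` location are assumptions that depend on your build and install:

```bash
# Copy the shadow jar produced above into Neo4j's plugins directory.
# The wildcard is illustrative; use the actual filename from ./plugin/build/libs/.
cp ./plugin/build/libs/plugin-*.jar "$NEO4J_HOME/plugins/"

# Restart Neo4j so the plugin is loaded on startup.
"$NEO4J_HOME/bin/neo4j" restart
```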
Given the plugin is the heart and soul of current development, I have plans to redesign the build system to make it the default project and target.
`neo4j-arrow` should work out of the box with Neo4j 4.3 and GDS v1.7. If you have earlier versions of either, you'll need to compile your own jars from source.
The `neo4j_arrow.py` PyArrow client requires Python 3 and PyArrow 6.0.0. (It may still work with v5.0.0, but I'm developing on v6.0.x.) Any other Arrow Flight clients should use v6.0.0 of Arrow/Arrow Flight if possible.
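If you're setting up the Python client from scratch, a minimal environment setup (assuming an ordinary pip-based environment) is just installing a matching PyArrow release:

```bash
# Pin PyArrow to the 6.0.x line used for developing neo4j_arrow.py.
pip install 'pyarrow==6.0.*'
```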
All configuration for `neo4j-arrow` is performed via environment variables. (Currently there is no support for property-based config in `neo4j.conf`.) This holds true for all subprojects of `neo4j-arrow`.

Available configuration options specific to the `neo4j-arrow` plugin:
| Option | Description | Default |
|---|---|---|
| `HOST` | Hostname or IP for `neo4j-arrow` to listen on | `"localhost"` |
| `PORT` | TCP port for `neo4j-arrow` to listen on | `9999` |
| `MAX_MEM_GLOBAL` | Global memory limit for the Arrow memory allocator | `Long.MAX_VALUE` |
| `MAX_MEM_STREAM` | Per-stream memory limit | `Long.MAX_VALUE` |
| `ARROW_BATCH_SIZE` | Max number of rows to include when sending a vector batch to a client | `1000` |
| `ARROW_MAX_PARTITIONS` | Max number of partitions to create when generating streams; roughly translates to the number of CPU cores to utilize | `max(vcpus - 2, 1)` |
| `ARROW_FLUSH_TIMEOUT` | Max time (in seconds) to wait before flushing a stream to the client | `1800` |
| `ARROW_TLS_CERTIFICATE` | Path to an x.509 full-chain certificate in PEM format | `""` |
| `ARROW_TLS_PRIVATE_KEY` | Path to a PEM private key file | `""` |
See also: the `org.neo4j.arrow.Config` class.
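Since everything is driven by environment variables, configuring the plugin is just a matter of exporting values before Neo4j starts. The values below (and the use of `$NEO4J_HOME/bin/neo4j start`) are illustrative assumptions for a standalone install, not required settings:

```bash
# Listen on all interfaces using the default port.
export HOST=0.0.0.0
export PORT=9999

# Optional TLS: full-chain certificate and private key, both in PEM format.
export ARROW_TLS_CERTIFICATE=/path/to/fullchain.pem
export ARROW_TLS_PRIVATE_KEY=/path/to/privkey.pem

"$NEO4J_HOME/bin/neo4j" start
```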
The primary knobs available for tuning read or write performance are `ARROW_BATCH_SIZE` and `ARROW_MAX_PARTITIONS`. Since the partition count defaults to a value based on the host's CPU core count, in practice this means tuning the batch size.

Some general advice:

- Reads tend to perform better with smaller batch sizes (such as the default).
- Writes benefit from much larger values (150-200x the default).

In practice, true performance will depend heavily on the type of vectors being created. Scalar vectors generally outperform List-based vectors for multiple reasons: total buffer size and the CPU instructions required to read/write a value.
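For a write-heavy workload, that mostly means raising `ARROW_BATCH_SIZE` before starting the server. The specific value below is only an illustration of the 150-200x guidance above (the default is 1000):

```bash
# Larger batches tend to help gds.write.* and bulk import jobs.
export ARROW_BATCH_SIZE=200000
```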
As of v3.something, some jobs let you override the partition and batch size dynamically by adding extra parameters to the job message. It's experimental, so you'll need to look at the Python wrapper code (`neo4j_arrow.py`) to see what works so far.
`neo4j-arrow` uses the Arrow Flight RPC framework to expose read/write operations to Neo4j. While you can implement the protocol yourself using an Arrow/Arrow Flight client implementation in any language, a helpful wrapper using PyArrow is provided in `python/neo4j_arrow.py`.
The general lifecycle looks like:

- Create an Arrow client with authentication details for the Neo4j system, optionally using TLS.
- Submit a valid `neo4j-arrow` Job to the service. Current jobs include:
  - Cypher Read jobs (`cypher.read`)
  - GDS Read jobs (`gds.read`) with the ability to specify nodes or relationships
  - GDS Write jobs (`gds.write.nodes` & `gds.write.relationships`) for creating GDS graphs from Arrow vectors/tables
  - KHop jobs (`khop`) (see KHOP.md for details)
  - Bulk Neo4j database import (`import.bulk`) support
  - Job status reporting (`job.status`)
  - Server version inspection (`info.version`)
  - System-wide job status inspection (`info.jobs`)
- Take the ticket returned by the server for the Job and either request a readable stream or submit a stream via a "put" operation.
In practice, the nuances of jobs, tickets, etc. are handled for you if you use the provided Python wrapper.
Assuming you're using `neo4j_arrow.py`, a simple GDS read operation to request and stream computed node embeddings (for example) looks like:
```python
import neo4j_arrow as na

# using the neo4j user and 'password' as the password
client = na.Neo4jArrow('neo4j', 'password', ('neo4j-host', 9999))

# stream the 'mygraph' projection and include the 'fastRp' node properties
ticket = client.gds_nodes('mygraph', properties=['fastRp'])

# get a PyArrow Table from the stream, reading it all into the client
table = client.stream(ticket).read_all()

# convert to a Pandas dataframe and go have fun!
df = table.to_pandas()
```
A few IPython notebooks are provided that demonstrate more complex usage of `neo4j-arrow` via PyArrow:

- Basic usage and lifecycle.
- Live migration of a graph between disparate Neo4j instances.
- Bulk import of a new database, showing how to bootstrap a new Neo4j database remotely using `neo4j-arrow` and PyArrow- or Pandas-compatible data sources.
Some code examples:

- Integration with BigQuery, showing how to relay a `neo4j-arrow` stream to a target BigQuery table.
- A trivial example of reading some nodes from GDS.
Keep in mind `neo4j-arrow` is a work in progress. The following are areas still requiring some development cycles or help.

- Cannot currently write relationship properties to a GDS graph.
  - On the todo list.
- Cannot read string properties (not counting Labels or Relationship Types) from a GDS graph.
  - GDS doesn't natively support strings and reads them from the db.
  - This could be implemented, but may slow down streaming.
- Only basic Cypher types are supported for Cypher jobs (mostly scalars, Strings, and simple Lists of scalars).
  - The cardinality of the set of types Cypher supports is too damn high.
  - What, you don't like numbers? :-)
- Error handling and cleanup for failed/incomplete jobs needs work.
- Batch sizes are not dynamically chosen. Ideally they would be determined based on schema and host hardware.
  - As we develop and utilize `neo4j-arrow`, a best practice should get baked in, but we're not there yet.
- Cannot currently write to the database via Cypher jobs.
  - Requires more complex Cypher transaction jobs.
  - Not out of the realm of possibility, but not a priority.
- GDS write jobs report completion to the client before server-side processing finishes. Clients may start writing relationships while nodes are still being post-processed, resulting in a "graph not found" error.
  - On the todo list.
Like other `neo4j-labs` and `neo4j-contrib` projects, this project is provided under the terms of the Apache 2.0 license. All files and code are copyright 2022, Neo4j, Inc.