neo4j - How to explain the performance of Cypher's LOAD CSV clause?
I'm using Cypher's LOAD CSV syntax in Neo4j 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect, and I wonder if I'm missing something.
The Cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional = (row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,to-purpose-nominal,t,73830
334471,1,at-location,t,92048
334470,1,at-time,t,92048
334469,1,on-location,t,92048
227302,1,to-purpose-infinitive,t,73830
116008,1,to-location,t,68204
116007,1,in-location,t,68204
227301,1,to-location,t,73830
334468,1,on-date,t,92048
116006,1,at-location,t,68204
334467,1,with-associate,t,92048
Basically, I'm matching a Sense node (previously imported) based on its id value in the 5th column. Then I'm doing a merge to either get a DependencySet node if it exists, or else create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good, and this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines    Time (msec)
------------------------------
500          480
1000         717
2000         1110
5000         1521
10000        2111
50000        4794
100000       5907
200000       12302
300000       35494
400000       java heap space error
My expectation was that growth would be more-or-less linear, particularly since I'm committing every 500 lines as recommended by the manual, but it's closer to polynomial.
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from the previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for 5-7 minutes before running into the heap space error. It seems like I could split the file into 300,000-line chunks, but isn't that what USING PERIODIC COMMIT is supposed to do, more or less? I suppose I could give Neo4j more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
  ON :DependencySet(label) ONLINE (for uniqueness constraint)
  ON :Sense(uid)           ONLINE (for uniqueness constraint)
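(For reference, these index entries come from uniqueness constraints; in Neo4j 2.x syntax they would have been created with something like the following.)

CREATE CONSTRAINT ON (s:Sense) ASSERT s.uid IS UNIQUE
CREATE CONSTRAINT ON (ds:DependencySet) ASSERT ds.label IS UNIQUE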
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query above, it performs fine.
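That is, the reduced query that performs fine is just the MERGE pass:

USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional = (row[3] = 't')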
I assume you've created the Sense labeled nodes before running the LOAD CSV import. What I think is going on is that as you match nodes with the label Sense in memory and create relationships from the DependencySet to the Sense node via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
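One way to test that theory (a sketch under the diagnosis above, not something prescribed in this answer; the file path and column indexes are copied from the question) is to split the import into two passes, so each transaction touches less state:

// Pass 1: merge the DependencySet nodes only
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MERGE (ds:DependencySet {label: row[2]})
ON CREATE SET ds.optional = (row[3] = 't')

// Pass 2: match both endpoints and create the :has relationships
// (CREATE will duplicate relationships if re-run; MERGE would avoid that)
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MATCH (ds:DependencySet {label: row[2]})
CREATE (s)-[:has]->(ds)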
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k nodes, the heap is maxed out. Up until that point it needs to do more garbage collection and more reads from disk.
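In Neo4j 2.1 those settings live in conf/neo4j.properties; the property names below are the standard 2.x store-mapping settings, but the values are illustrative placeholders rather than recommendations, and should be sized to your actual store files:

# Memory mapped store settings (conf/neo4j.properties, Neo4j 2.x).
# Values are illustrative; size them to match your store files on disk.
neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=512M
neostore.propertystore.db.strings.mapped_memory=512M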
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
Tags: neo4j, cypher