neo4j - How to explain the performance of Cypher's LOAD CSV clause?
I'm using Cypher's LOAD CSV syntax in Neo4j 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect, and I wonder if I'm missing something.
The Cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional = (row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,to-purpose-nominal,t,73830
334471,1,at-location,t,92048
334470,1,at-time,t,92048
334469,1,on-location,t,92048
227302,1,to-purpose-infinitive,t,73830
116008,1,to-location,t,68204
116007,1,in-location,t,68204
227301,1,to-location,t,73830
334468,1,on-date,t,92048
116006,1,at-location,t,68204
334467,1,with-associate,t,92048
Basically, I'm matching a Sense node (previously imported) based on its id value in the 5th column. Then I'm doing a merge to either get a DependencySet node if it exists, or else create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good, and this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines    Time (msec)
------------------------------
500          480
1000         717
2000         1110
5000         1521
10000        2111
50000        4794
100000       5907
200000       12302
300000       35494
400000       java heap space error
My expectation was that growth would be more-or-less linear, particularly since I'm committing every 500 lines as recommended by the manual, but it's closer to polynomial.
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from the previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for 5-7 minutes before running into the heap space error. It seems like I could split the file into 300,000-line chunks, but isn't that what USING PERIODIC COMMIT is supposed to do, more or less? I suppose I could give Neo4j more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
  ON :DependencySet(label) ONLINE (for uniqueness constraint)
  ON :Sense(uid)           ONLINE (for uniqueness constraint)
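(For reference, these index entries come from uniqueness constraints; in Neo4j 2.x syntax they would have been created with something like the following.)

CREATE CONSTRAINT ON (s:Sense) ASSERT s.uid IS UNIQUE
CREATE CONSTRAINT ON (ds:DependencySet) ASSERT ds.label IS UNIQUE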
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query above, it performs fine.
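That is, the reduced query that performs fine is just the MERGE pass:

USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional = (row[3] = 't')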
I assume you've created the Sense labeled nodes before running the LOAD CSV import. What I think is going on is that as you match nodes with the label Sense in memory and create relationships from the DependencySet to the Sense node via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
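One way to test that theory (a sketch under the diagnosis above, not something prescribed in this answer; the file path and column indexes are copied from the question) is to split the import into two passes, so each transaction touches less state:

// Pass 1: merge the DependencySet nodes only
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MERGE (ds:DependencySet {label: row[2]})
ON CREATE SET ds.optional = (row[3] = 't')

// Pass 2: match both endpoints and create the :has relationships
// (CREATE will duplicate relationships if re-run; MERGE would avoid that)
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/james/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MATCH (ds:DependencySet {label: row[2]})
CREATE (s)-[:has]->(ds)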
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k nodes, the heap is maxed out. Up until that point it needs to do more garbage collection and more reads from disk.
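In Neo4j 2.1 those settings live in conf/neo4j.properties; the property names below are the standard 2.x store-mapping settings, but the values are illustrative placeholders rather than recommendations, and should be sized to your actual store files:

# Memory mapped store settings (conf/neo4j.properties, Neo4j 2.x).
# Values are illustrative; size them to match your store files on disk.
neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=512M
neostore.propertystore.db.strings.mapped_memory=512M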
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
Tags: neo4j, cypher