After a bit of research we decided that it was best to go with Neo4J over the graph engines that run on Hadoop (Giraph, etc.), at least to get started. Here was our rationale:
- No one on our team has any graph db experience.
- The Neo4J community is way more active than its Hadoop-based counterparts.
- We already run a pretty heterogeneous stack (R Studio/Server, Hive, Impala, Python).
- The ability to show the graph off visually is somewhat important.
- Our graph db won't see frequent updates, and usage will be minimal.
- Hydrating the graph from our MySQL db via Python was pretty trivial.
- Neo4J offers very easy REST API access.
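To give a feel for that last point, here is a minimal Python sketch of a REST call. It assumes a server of this vintage that exposes the legacy Cypher endpoint at /db/data/cypher on the default port 7474. The helper names (build_cypher_request, run_cypher) are made up for this example, and the sample query in the note below uses Neo4J 2.x MATCH syntax.

```python
import json
import urllib.request


def build_cypher_request(base_url, query, params=None):
    """Build (but do not send) a POST request for the legacy Cypher endpoint.

    Assumption: the server accepts a JSON body of the form
    {"query": ..., "params": ...} at <base_url>/db/data/cypher.
    """
    body = json.dumps({"query": query, "params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/db/data/cypher",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def run_cypher(base_url, query, params=None, timeout=10):
    """Send the query and return the decoded JSON response.

    Needs a reachable server; urlopen raises URLError otherwise.
    """
    req = build_cypher_request(base_url, query, params)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running locally, run_cypher("http://localhost:7474", "MATCH (n) RETURN count(n) AS nodes") should come back with a small JSON document holding the node count.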
So I went about getting Neo4J running on AWS. Here is a blow-by-blow of how I got it done.
Don't follow the main Neo4J EC2 instructions (the ones utilizing CloudFormation); they were hosed when it came to locating the AMI. Go manual, you will thank me.
Notes / extra steps:
- Neo4J will be installed in /var/lib/neo4j/
- Neo4J start script is /etc/init.d/neo4j-service (stop/start/restart) ... this is incorrectly noted in the docs
- Edit /etc/security/limits.conf and add these two lines:
neo4j soft nofile 40000
neo4j hard nofile 40000
- Neo4J is only accessible locally by default:
Edit /var/lib/neo4j/conf/neo4j-server.properties
uncomment this line
org.neo4j.server.webserver.address=0.0.0.0
- One more thing, so that the limits above actually get applied:
Edit /etc/pam.d/su
uncomment or add the following line
session required pam_limits.so
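If you would rather script the file edits above than make them by hand, a Python sketch along these lines should do it. The helper names (uncomment_line, ensure_lines) are hypothetical, and the functions only transform text: reading and writing the actual files (as root, after taking a backup) is left to you.

```python
def _norm(line):
    """Collapse runs of whitespace so 'a   b' and 'a b' compare equal."""
    return " ".join(line.split())


def uncomment_line(text, target):
    """Return text with any commented-out copy of target uncommented."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_commented_target = (
            stripped.startswith("#")
            and _norm(stripped.lstrip("#")) == _norm(target)
        )
        out.append(target if is_commented_target else line)
    return "\n".join(out) + "\n"


def ensure_lines(text, required):
    """Append each required line that is not already present (idempotent)."""
    present = {_norm(line) for line in text.splitlines()}
    missing = [r for r in required if _norm(r) not in present]
    if not missing:
        return text
    if text and not text.endswith("\n"):
        text += "\n"
    return text + "\n".join(missing) + "\n"


# The exact lines from the notes above.
LIMITS = ["neo4j soft nofile 40000", "neo4j hard nofile 40000"]
WEBSERVER = "org.neo4j.server.webserver.address=0.0.0.0"
PAM = "session required pam_limits.so"
```

The idea is to run ensure_lines over the contents of /etc/security/limits.conf with LIMITS, uncomment_line over neo4j-server.properties with WEBSERVER, and uncomment_line followed by ensure_lines over /etc/pam.d/su with PAM, then restart the service. Running it twice changes nothing the second time.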
That should get you up and running, and accessible. Here is a picture of a graph we created based on Wikipedia categories and pages centered on "Machine Learning". We are just getting our feet wet, but love what we are seeing!