problem at deploying hstore at the AWS #152

Open
i-chaochen opened this issue Jan 28, 2014 · 13 comments

Comments

@i-chaochen

Hi Andy,

I followed the steps in the document about running on EC2, as shown below, but ant build failed.

sudo vim /etc/apt/sources.list
deb http://archive.canonical.com/ubuntu lucid partner
deb-src http://archive.canonical.com/ubuntu lucid partner

sudo apt-get update

The package sun-java6-jdk is not available, so I replaced it with openjdk-6-jdk:
sudo apt-get --yes install subversion gcc g++ make openjdk-6-jdk valgrind ant

svn co https://database.cs.brown.edu/svn/hstore/trunk/ $HSTORE_HOME

cp hstore.pem ~/.ssh/ && chmod 400 ~/.ssh/hstore.pem

vim trunk/properties/default.properties

global.sshoptions = -i /home/ubuntu/.ssh/hstore.pem

ant build


ee:

 [exec] g++  -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -Wno-unused-but-set-variable -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -isystem ../../third_party/cpp -I../../src/ee  -c  -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects/indexes/tableindex.co ../../src/ee/indexes/tableindex.cpp
 [exec] g++  -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -Wno-unused-but-set-variable -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -isystem ../../third_party/cpp -I../../src/ee  -c  -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects/indexes/tableindexfactory.co ../../src/ee/indexes/tableindexfactory.cpp

BUILD FAILED
/home/ubuntu/trunk/build.xml:715: exec returned: 137

Because the build from the SVN checkout failed, I removed it and tried the source from Git:

sudo rm -r trunk/
sudo apt-get install git
git clone git://github.com/apavlo/h-store.git
ant build

ee-build:
[exec] make: Entering directory `/home/ubuntu/h-store/obj/release'
[exec] g++ -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -Wno-unused-but-set-variable -DANTICACHE -DANTICACHE_REVERSIBLE_LRU -isystem ../../third_party/cpp -isystem ../../obj/release/berkeleydb -I../../src/ee -c -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects//voltdbjni.co ../../src/ee//voltdbjni.cpp

BUILD FAILED
/home/ubuntu/h-store/build.xml:860: exec returned: 137

Total time: 9 minutes 36 seconds

Any help will be greatly appreciated!

@apavlo
Owner

apavlo commented Jan 28, 2014

That document looks out of date. You don't want to use the really old SVN repo. You want to use this GitHub one.

@i-chaochen
Author

Yes, I tried the source from GitHub, but it still failed to build:

git clone git://github.com/apavlo/h-store.git
ant build

ee-build:
[exec] make: Entering directory `/home/ubuntu/h-store/obj/release'
[exec] g++ -Wall -Wextra -Werror -Woverloaded-virtual -Wconversion -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Winit-self -Wno-sign-compare -Wno-unused-parameter -pthread -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DNOCLOCK -fno-omit-frame-pointer -fvisibility=hidden -DBOOST_SP_DISABLE_THREADS -Wno-ignored-qualifiers -fno-strict-aliasing -Wno-attributes -DLINUX -fPIC -Wno-unused-but-set-variable -DANTICACHE -DANTICACHE_REVERSIBLE_LRU -isystem ../../third_party/cpp -isystem ../../obj/release/berkeleydb -I../../src/ee -c -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects//voltdbjni.co ../../src/ee//voltdbjni.cpp

BUILD FAILED
/home/ubuntu/h-store/build.xml:860: exec returned: 137

Total time: 9 minutes 36 seconds

thanks

@apavlo
Owner

apavlo commented Jan 28, 2014

Is there an error from gcc? It's weird that it just fails like that.

@i-chaochen
Author

I think I finally figured out this problem: the build runs out of memory at
DANTICACHE_REVERSIBLE_LRU -isystem ../../third_party/cpp -isystem ../../obj/release/berkeleydb -I../../src/ee -c -g3 -O3 -mmmx -msse -msse2 -msse3 -DNDEBUG -DVOLT_LOG_LEVEL=500 -o objects//voltdbjni.co ../../src/ee//voltdbjni.cpp

I used a micro EC2 instance, which only has 0.6 GB of memory...

I tried a medium instance and the build succeeded.

To anyone who wants to try H-Store on AWS: please use at least a medium-size EC2 instance...

thanks
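
Editor's note for readers who hit the same symptom: an exit status above 128 conventionally means the process was ended by a signal, and 137 = 128 + 9 corresponds to SIGKILL, which is the signal the Linux out-of-memory killer sends. That is consistent with the diagnosis above, although the build output alone does not prove it. A minimal Python sketch for decoding such a status (the function name is only illustrative, not part of H-Store):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status such as the 137 reported by ant."""
    if status == 0:
        return "success"
    if status > 128:
        try:
            # Shells report "ended by signal N" as 128 + N.
            return f"terminated by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a known signal number; treat it as a plain exit code
    return f"exited with status {status}"


print(describe_exit_status(137))
```

On Linux this prints "terminated by SIGKILL". When the out-of-memory killer is responsible, the kernel log on the instance (for example via dmesg) usually records which process it ended.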

@i-chaochen
Author

Now I can build it, but I am still unable to run the benchmark on an AWS NFS cluster.

My 2 NFS cluster nodes are in the same security group:
TCP
Port (Service)    Source
22 (SSH)          0.0.0.0/0
111               0.0.0.0/0
2049              0.0.0.0/0
44182             0.0.0.0/0
54508             0.0.0.0/0

UDP
Port (Service)    Source
111               0.0.0.0/0
2049              0.0.0.0/0
32768             0.0.0.0/0
32770 - 32800     0.0.0.0/0

I configured the SSH environment:
sudo apt-get --yes install openssh-server
ssh-keygen -t dsa # Do not enter in a password
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh -o StrictHostKeyChecking=no localhost "date"
Wed Jan 29 00:58:12 UTC 2014

$ ssh localhost date
Wed Jan 29 01:00:14 UTC 2014

I copied my hstore.pem to the NFS server node with scp:
cp hstore.pem ~/.ssh/ && chmod 400 ~/.ssh/hstore.pem

Then I changed the global.sshoptions parameter in $HSTORE_HOME/properties/default.properties to:
global.sshoptions = -i /home/ubuntu/.ssh/hstore.pem

I created a cluster.txt as follows:
host0.ip-172-31-xx-xxx.eu-west-1.compute.internal:0:0-1
host1.ip-172-31-xx-xx.eu-west-1.compute.internal:1:2-3
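
Editor's note: judging only from the examples in this thread, each cluster.txt line appears to have the form host:site-id:partition-range. The Python sketch below parses one such line so that typos (stray spaces, a missing colon) show up early. It assumes the range is inclusive, which fits the benchmark log further down reporting 4 partitions for two sites with ranges 0-1 and 2-3. The names SiteSpec and parse_cluster_line are hypothetical and not part of H-Store.

```python
from typing import List, NamedTuple


class SiteSpec(NamedTuple):
    host: str
    site_id: int
    partitions: List[int]


def parse_cluster_line(line: str) -> SiteSpec:
    """Parse one 'host:site:partitions' line (format inferred from the thread)."""
    # Split from the right so the hostname itself is left untouched.
    host, site, parts = line.strip().rsplit(":", 2)
    if "-" in parts:
        first, last = parts.split("-")
        partitions = list(range(int(first), int(last) + 1))  # assumed inclusive
    else:
        partitions = [int(parts)]
    if host != host.strip():
        raise ValueError(f"stray whitespace around host name: {host!r}")
    return SiteSpec(host, int(site), partitions)


# Placeholder host name, not taken from the thread.
print(parse_cluster_line("node-a.example.internal:0:0-1"))
```

A line with fewer than two colons, or with a space before the first colon, raises ValueError instead of being passed on silently.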

This step works without problems:
ant hstore-prepare -Dproject=tpcc -Dhosts=/home/ubuntu/cluster.txt

$ ant hstore-benchmark -Dproject=tpcc
Buildfile: /home/ubuntu/h-store/build.xml

hstore-benchmark:

benchmark:
[java] 00:58:59,774 INFO - ------------------------- BENCHMARK INITIALIZE :: TPCC -------------------------
[java] 00:58:59,854 INFO - Starting HStoreSite H00 on host0.ip-172-31-33-172.eu-west-1.compute.internal
[java] 00:58:59,907 INFO - Starting HStoreSite H01 on host1.ip-172-31-24-5.eu-west-1.compute.internal
[java] 00:58:59,980 INFO - Waiting for 2 HStoreSites with 4 partitions to finish initialization
[java] 00:59:04,910 ERROR - Failed to poll 'site-00-host0.ip-172-31-33-172.eu-west-1.compute.internal' [exitValue=255]
[java] 00:59:04,910 FATAL - Process 'site-00-host0.ip-172-31-33-172.eu-west-1.compute.internal' failed. Halting benchmark!
[java] 00:59:06,413 FATAL - Failed to complete benchmark
[java] java.lang.RuntimeException: Failed to start all HStoreSites. Halting benchmark
[java] at edu.brown.api.BenchmarkController.startSites(BenchmarkController.java:633)
[java] at edu.brown.api.BenchmarkController.setupBenchmark(BenchmarkController.java:504)
[java] at edu.brown.api.BenchmarkController.main(BenchmarkController.java:2216)

BUILD FAILED
/home/ubuntu/h-store/build.xml:2517: The following error occurred while executing this line:
/home/ubuntu/h-store/build.xml:1693: Java returned: 1

Total time: 15 seconds

I didn't see anything useful in the logs from these 2 nodes:
~/h-store/obj/logs/sites$ cat site-00-host0.ip-172-31-xx-xxx.eu-west-1.compute.internal.log

2014-01-29T00:58:59.895.0

:~/h-store/obj/logs/sites$ cat site-01-host1.ip-172-31-xx-xxx.eu-west-1.compute.internal.log

2014-01-29T00:58:59.971.0

Any advice?

thanks!

@apavlo
Owner

apavlo commented Jan 29, 2014

Use the internal IP addresses instead of the public ones.

@i-chaochen
Author

Yes, I am using the AWS internal DNS names, as you can see in my cluster.txt:
host0.ip-172-31-xx-xxx.eu-west-1.compute.internal:0:0-1
host1.ip-172-31-xx-xx.eu-west-1.compute.internal:1:2-3

and the internal IPs for the NFS cluster,

but it just won't run.

Do you mean I should use the internal IP addresses instead of the internal DNS names in cluster.txt?

Like this?
host0.172.31.xx.xxx:0:0-1
host1.172.31.xx.xx:1:2-3

thanks

@apavlo
Owner

apavlo commented Jan 29, 2014

Enable DEBUG for 'org/voltdb/processtools/ProcessSetManager.java' in log4j.properties

See what the SSH command is that it's trying to use to start the sites and see whether you can fire them off by hand.

Andy Pavlo
pavlo@cs.cmu.edu
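
Editor's note: the exact command H-Store sends is only visible in the DEBUG log, so the Python sketch below does not reproduce it. It only builds a plain, non-interactive ssh connectivity check with the same key file used in global.sshoptions. The host name is a placeholder and build_ssh_check is a hypothetical helper; -i, -o BatchMode=yes and -o ConnectTimeout are standard OpenSSH options.

```python
import shlex
from typing import List


def build_ssh_check(host: str, key_path: str, remote_cmd: str = "date") -> List[str]:
    """Build a non-interactive ssh command that runs remote_cmd on host."""
    return [
        "ssh",
        "-i", key_path,            # same key file as in global.sshoptions
        "-o", "BatchMode=yes",     # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",  # give up quickly on unreachable hosts
        host,
        remote_cmd,
    ]


cmd = build_ssh_check("host0.example.internal", "/home/ubuntu/.ssh/hstore.pem")
print(shlex.join(cmd))  # paste the printed line into a terminal to try it
```

ssh exits with the remote command's status, or with 255 when ssh itself hits an error such as a failed connection or authentication. That may be what the exitValue=255 in the benchmark log reflects.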

@i-chaochen
Author

Sorry, I am not sure I completely follow you. I changed the VoltDB section of log4j.properties to DEBUG:

# VoltDB Stuff

log4j.logger.org.voltdb.VoltProcedure=DEBUG
log4j.logger.org.voltdb.VoltSystemProcedure=DEBUG
log4j.logger.org.voltdb.client=DEBUG
log4j.logger.org.voltdb.compiler=DEBUG
log4j.logger.org.voltdb.planner=DEBUG

After running ant hstore-prepare -Dproject=tpcc -Dhosts=/home/ubuntu/cluster.txt,
I haven't seen anything related to the SSH command.

The benchmark still fails:
$ ant hstore-benchmark -Dproject=tpcc
Buildfile: /home/ubuntu/h-store/build.xml

hstore-benchmark:

benchmark:
[java] 03:16:24,604 INFO - ------------------------- BENCHMARK INITIALIZE :: TPCC -------------------------
[java] 03:16:24,673 INFO - Starting HStoreSite H00 on host0.ip-172-31-xx-xx.eu-west-1.compute.internal
[java] 03:16:24,726 INFO - Starting HStoreSite H01 on host1.ip-172-31-xx-xx.eu-west-1.compute.internal
[java] 03:16:24,782 INFO - Starting HStoreSite H02 on host2.ip-172-31-xx-xx.eu-west-1.compute.internal
[java] 03:16:24,863 INFO - Waiting for 3 HStoreSites with 6 partitions to finish initialization
[java] 03:16:29,729 ERROR - Failed to poll 'site-01-host1.ip-172-31-xx-xx.eu-west-1.compute.internal' [exitValue=255]
[java] 03:16:29,729 FATAL - Process 'site-01-host1.ip-172-31-xx-xx.eu-west-1.compute.internal' failed. Halting benchmark!
[java] 03:16:31,232 FATAL - Failed to complete benchmark
[java] java.lang.RuntimeException: Failed to start all HStoreSites. Halting benchmark
[java] at edu.brown.api.BenchmarkController.startSites(BenchmarkController.java:633)
[java] at edu.brown.api.BenchmarkController.setupBenchmark(BenchmarkController.java:504)
[java] at edu.brown.api.BenchmarkController.main(BenchmarkController.java:2216)

BUILD FAILED
/home/ubuntu/h-store/build.xml:2517: The following error occurred while executing this line:
/home/ubuntu/h-store/build.xml:1693: Java returned: 1

Total time: 11 seconds

I checked the log, and it still has no useful information:

$ cat site-01-host1.ip-172-31-xx-xx.eu-west-1.compute.internal.log

2014-01-29T03:16:24.778.0

thanks

@i-chaochen
Author

Hi Andy,

I checked ProcessSetManager.java.

Does it use the "ping" command to create the process?

public static void main(String[] args) {
    ProcessSetManager psm = new ProcessSetManager();
    psm.startProcess("ping4c", new String[] { "ping", "volt4c" });
    psm.startProcess("ping3c", new String[] { "ping", "volt3c" });
    while(true) {
        OutputLine line = psm.nextBlocking();
        System.out.printf("(%s:%s): %s\n", line.processName, line.stream.name(), line.value);
    }
}

I opened ICMP in the security group but was still unable to run the benchmark.

Then I opened ALL traffic on all ports to all IPs in this security group, so whatever commands H-Store uses, there should be no problem within the security group.

But it still fails to run the benchmark:
[java] 22:23:22,433 INFO - Starting HStoreSite H00 on host0.ip-172-31-xx-x.eu-west-1.compute.internal
[java] 22:23:22,572 INFO - Starting HStoreSite H01 on host1.ip-172-31-xx-x.eu-west-1.compute.internal
[java] 22:23:22,709 INFO - Starting HStoreSite H02 on host2.ip-172-31-xx-x.eu-west-1.compute.internal
[java] 22:23:22,837 INFO - Waiting for 3 HStoreSites with 6 partitions to finish initialization
[java] 22:23:27,595 ERROR - Failed to poll 'site-01-host1.ip-172-31-xx-x.eu-west-1.compute.internal' [exitValue=255]
[java] 22:23:27,596 FATAL - Process 'site-01-host1.ip-172-31-xx-x.eu-west-1.compute.internal' failed. Halting benchmark!
[java] 22:23:29,100 FATAL - Failed to complete benchmark
[java] java.lang.RuntimeException: Failed to start all HStoreSites. Halting benchmark
[java] at edu.brown.api.BenchmarkController.startSites(BenchmarkController.java:633)
[java] at edu.brown.api.BenchmarkController.setupBenchmark(BenchmarkController.java:504)
[java] at edu.brown.api.BenchmarkController.main(BenchmarkController.java:2216)

BUILD FAILED
/home/ubuntu/h-store/build.xml:2517: The following error occurred while executing this line:
/home/ubuntu/h-store/build.xml:1693: Java returned: 1

Total time: 50 seconds

And these two logs contain nothing except the date:
~/h-store/obj/logs/sites$ cat site-01-host1.172.31.xx.x.eu-west-1.compute.internal.log

2014-01-29T03:32:26.251.0

~/h-store/obj/logs/sites$ cat site-01-host1.ip-172-31-xx-x.eu-west-1.compute.internal.log

2014-01-29T22:23:22.698.0

I am quite suspicious of cluster.txt. Is it in the right format?
$ cat cluster.txt
host0.ip-172-31-xx-x.eu-west-1.compute.internal:0:0-1
host1.ip-172-31-xx-x.eu-west-1.compute.internal:1:2-3
host2.ip-172-31-xx-x.eu-west-1.compute.internal:2:4-5

Any further advice will be appreciated.

thanks

@apavlo
Owner

apavlo commented Jan 29, 2014

Add this to the bottom of log4j.properties:

log4j.logger.org.voltdb.processtools.ProcessSetManager=DEBUG

Run the benchmark with this turned on, then check the site log to look for the SSH command that it's trying to send over the wire. Then copy and paste that command in a terminal to check whether it works.
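
Editor's note: log4j 1.x loggers are named by dotted package and class, which is why the earlier reference to org/voltdb/processtools/ProcessSetManager.java becomes the key above. The Python sketch below adds or updates such a line in the text of a properties file. It is a hypothetical helper that assumes simple key=value lines and is not part of H-Store.

```python
def ensure_logger_level(properties_text: str, logger: str, level: str = "DEBUG") -> str:
    """Return properties_text with log4j.logger.<logger> set to level."""
    key = f"log4j.logger.{logger}"
    wanted = f"{key}={level}"
    lines = []
    found = False
    for line in properties_text.splitlines():
        # Compare only the part before '=' so an existing level is replaced.
        if line.split("=", 1)[0].strip() == key:
            lines.append(wanted)
            found = True
        else:
            lines.append(line)
    if not found:
        lines.append(wanted)  # append at the bottom, as suggested above
    return "\n".join(lines) + "\n"


sample = "log4j.logger.org.voltdb.client=DEBUG\n"
print(ensure_logger_level(sample, "org.voltdb.processtools.ProcessSetManager"))
```

Running it twice on the same text leaves the result unchanged, so it is safe to apply repeatedly while debugging.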

@i-chaochen
Author

Yes, I added it, copied the SSH command, and ran it by hand. It reports that it failed to connect to the remote site.

I checked the source code that connects to the remote sites, and two things confuse me:

  1. Should I change my EC2 hostnames to host0, host1, and host2, as in cluster.txt?

     Does the SSH login username affect the connection? I changed all of host0, host1, and host2 to ubuntu in cluster.txt, since that is the default username on EC2, but execution still failed.

  2. When I built the NFS cluster on AWS, I followed the steps from
     http://hstore.cs.brown.edu/documentation/deployment/running-on-amazon-ec2/

     The autofs part sets

     * hstore-nfs:/home/&

     which automatically shares all folders and files under /home/.

     But when I set up the SSH environment on the NFS server and on each client with
     ssh-keygen -t dsa
     cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

     autofs automatically syncs each key to all the other nodes,

     which means I can only run
     ssh localhost date

     on one EC2 instance.

     So, should I rewrite my auto.home file so that it does not sync all files under /home/&?

     I ask because the document specifically mentions that the directory needs to end with a '/' followed by a '&',

     but that seems to conflict with the SSH environment configuration. Could you give me some clues, please?

thanks


@i-chaochen
Author

Hi Andy,

This time I changed all the EC2 hostnames to match cluster.txt and mounted only the h-store folder instead of /home/& within the NFS cluster, and I added this line to log4j.properties:
log4j.logger.org.voltdb.processtools.ProcessSetManager=DEBUG

When I run the SSH command by hand, it returns "Unable to set CPU affinity.." and "Insufficient number of cores", so the transaction pre/post processing threads are disabled, and the connection and execution fail.

But I can run the H-Store benchmark on a single large EC2 instance without any problem.

I built this NFS cluster on AWS from 3 identical large EC2 instances, yet it reports an insufficient number of cores.

Is H-Store a sharded NoSQL system in which each node is isolated from the others? Shouldn't it need fewer system resources per node if I run this benchmark on a cluster instead of a single machine?
Why can I run it on a single large EC2 instance but not on 3 instances of the same size, where it reports an insufficient number of cores? Should I use a larger, more expensive EC2 instance type to build the cluster for this benchmark? Or did I do something else wrong, such as mounting only the h-store folder within the NFS cluster?

Could you give me some clues, please?

thanks!
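
Editor's note: the thread does not say how many cores H-Store expects per site, so the Python sketch below is only a generic way to see how many cores a process may actually use on each node and to compare that with a number you choose (for instance the 2 partitions per host in the cluster.txt above). Both function names and the extra_threads parameter are hypothetical.

```python
import os


def usable_cores() -> int:
    """Number of CPU cores this process is allowed to run on."""
    try:
        return len(os.sched_getaffinity(0))  # Linux: respects the affinity mask
    except AttributeError:
        return os.cpu_count() or 1           # fallback on other platforms


def enough_cores(partitions_on_host: int, extra_threads: int = 0) -> bool:
    """Hypothetical check: one core per partition plus any extra threads."""
    return usable_cores() >= partitions_on_host + extra_threads


print(f"usable cores: {usable_cores()}, enough for 2 partitions: {enough_cores(2)}")
```

Running this on each instance over the same SSH path the benchmark uses would show whether all three nodes report the same core count as the single instance that works.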
