Support ScaleFactor 100 and 1000 #93

Open
ChrizZz110 opened this issue Aug 21, 2024 · 4 comments

ChrizZz110 commented Aug 21, 2024

Hi,
thanks for working on this data generator. We are using the generated FinBench datasets for our research and would kindly ask you to support scale factors (SFs) larger than the currently supported SF10. Especially for systems focusing on large-scale graphs, this would be a great extension.

qishipengqsp (Contributor) commented Aug 23, 2024

Hi, thanks for the feedback. I have already been working on this scalability extension. v0.1.0 can only generate up to SF10; I have since extended it to SF30 and am working on SF100 now.

I am not very familiar with Spark application optimization, but the good news is that I am moving forward step by step. Hopefully it will support SF300 in the next few weeks.

qishipengqsp (Contributor) commented Aug 23, 2024

Collaboration is welcome if you are a Spark expert. :)

ChrizZz110 (Author) commented

Thanks for the reply @qishipengqsp, happy to hear that larger SFs are already a work in progress.

Unfortunately, I'm not a Spark expert, but I do have some experience with Flink, if that helps. If you want, I can have a look. Is this the branch you are currently working on: https://github.com/ldbc/ldbc_finbench_datagen/tree/sf100 ?

qishipengqsp (Contributor) commented

@ChrizZz110 Thanks for your help, and apologies for the late response. I just came back from the 18th LDBC TUC and am starting to catch up on the things I left behind.

Yes, I am working on that branch, but it is not much different from the main branch. I just created it to hold the SF100 parameters controlling the generation process.

Currently, I am stuck on this error:

Exception in thread "main" java.lang.OutOfMemoryError
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2477)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:912)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:911)
	at ldbc.finbench.datagen.generation.generators.ActivityGenerator.signInEvent(ActivityGenerator.scala:134)
	at ldbc.finbench.datagen.generation.ActivitySimulator.simulate(ActivitySimulator.scala:79)
	at ldbc.finbench.datagen.generation.GenerationStage$.$anonfun$run$1(GenerationStage.scala:59)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at ldbc.finbench.datagen.util.SparkUI$.job(SparkUI.scala:12)
	at ldbc.finbench.datagen.generation.GenerationStage$.run(GenerationStage.scala:55)
	at ldbc.finbench.datagen.LdbcDatagen$.run(LdbcDatagen.scala:131)
	at ldbc.finbench.datagen.LdbcDatagen$.main(LdbcDatagen.scala:120)
	at ldbc.finbench.datagen.LdbcDatagen.main(LdbcDatagen.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
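
For context on where this OOM happens: the trace shows the driver failing inside ClosureCleaner.ensureSerializable, i.e. while Java-serializing the task closure passed to mapPartitionsWithIndex, before any task even runs. That usually means the closure captures a very large driver-side object. Below is a minimal sketch of the standard workaround, using a broadcast variable so that only a small handle is captured; the names here (`largeLookup`, the RDD contents) are illustrative assumptions, not the actual ActivityGenerator code:

```scala
import org.apache.spark.sql.SparkSession

// Minimal, self-contained sketch; `largeLookup` stands in for whatever big
// driver-side structure signInEvent's closure might be capturing.
object ClosureOomSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("closure-oom-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Big driver-side data (hypothetical stand-in).
    val largeLookup: Map[Long, String] =
      (1L to 2000000L).map(i => i -> s"medium-$i").toMap

    val ids = sc.parallelize(1L to 1000L, numSlices = 8)

    // Problematic pattern: referencing `largeLookup` directly serializes the
    // whole map into the task closure -- the driver can OOM exactly as in the
    // stack trace above, before a single task is launched.
    //
    //   ids.mapPartitionsWithIndex { (idx, it) => it.map(largeLookup(_)) }

    // Workaround: broadcast once; the closure now captures only the handle,
    // and each executor fetches the data a single time.
    val lookupBc = sc.broadcast(largeLookup)
    val resolved = ids.mapPartitionsWithIndex { (idx, it) =>
      val lookup = lookupBc.value
      it.map(id => lookup.getOrElse(id, s"unknown-$id"))
    }

    println(resolved.count())
    spark.stop()
  }
}
```

If a big capture is unavoidable, raising driver memory (e.g. `spark-submit --driver-memory 16g ...`) may postpone this particular OOM, but broadcasting the data (or restructuring the generator so it lives in an RDD) addresses the root cause.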
