This repo contains examples of analyses performed on the Analysis Facility using RDataFrame distributed with Dask on top of HTCondor.
- Go to
- Log in using the CMS INDIGO IAM service with CERN SSO (
- Choose the JupyterLab image (to use distribution with Dask on top of HTCondor, use the suggested one) and set the memory and CPU.
Open a new Python3 notebook.
Deploy a Dask cluster on HTCondor. This can be done via the Dask JupyterLab plugin:
Once deployed, initialize the Dask client. A cell that does this can be set up automatically and will look like this:

    from dask.distributed import Client

    client = Client("localhost:37470")
    client

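Insert the declaration of your custom functions inside an initialization function:

    import ROOT

    # Read the C++ declarations of the custom functions
    with open("postselection.h", "r") as text_file:
        data = text_file.read()

    distributed = ROOT.RDF.Experimental.Distributed

    def my_initialization_function():
        # Make the custom functions known to the ROOT interpreter
        ROOT.gInterpreter.Declare(data)

    # Register the function so that it runs on the workers before the analysis
    distributed.initialize(my_initialization_function)
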
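
If the custom functions are spread over several headers, the same pattern can be extended. The sketch below is only an illustration under assumptions: `read_headers`, `make_initializer` and `register_headers` are hypothetical helper names that are not part of this repo or of ROOT. ROOT is imported inside the functions that need it, so the file-reading helper can be tried without a ROOT installation.

```python
from pathlib import Path


def read_headers(paths):
    """Return the concatenated source of the given C++ header files."""
    return "\n".join(Path(p).read_text() for p in paths)


def make_initializer(code):
    """Build a function that declares `code` to the ROOT interpreter."""
    def initializer():
        # Imported lazily, in whichever process runs the initializer
        import ROOT
        ROOT.gInterpreter.Declare(code)
    return initializer


def register_headers(paths):
    """Read the headers locally and register their declaration for the workers."""
    import ROOT
    init = make_initializer(read_headers(paths))
    ROOT.RDF.Experimental.Distributed.initialize(init)
    return init
```

A call such as `register_headers(["postselection.h"])` would then replace the explicit steps above. Because the header text is read on the client and captured in the closure, the workers should not need the header files on their own filesystem.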
Create a distributed RDataFrame reading a list of samples:

    chain = [<path to 1st .root file>, <path to 2nd .root file>, ...]
    df = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame(
        "<name of tree>",
        chain,
        npartitions=<number of partitions>,
        daskclient=client,
    )

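
Once the dataframe exists, operations are booked as with a local RDataFrame and run on the Dask workers when a result is requested. The following is a rough sketch, not the analysis from this repo: `build_chain` and `run_analysis` are hypothetical helpers, and the branch name `nMuon`, the selection and the histogram binning are assumptions chosen only for illustration.

```python
import glob
import os


def build_chain(directory, pattern="*.root"):
    """Return a sorted list of the files in `directory` matching `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def run_analysis(client, tree_name, chain, npartitions):
    """Book a filter and a histogram on a distributed RDataFrame and run them."""
    import ROOT  # lazy import, so build_chain is usable without ROOT

    RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
    df = RDataFrame(tree_name, chain, npartitions=npartitions, daskclient=client)

    # "nMuon" is a hypothetical branch name; replace it with one from your tree
    selected = df.Filter("nMuon >= 2")
    hist = selected.Histo1D(
        ("nMuon", "Muon multiplicity;N_{#mu};Events", 10, 0, 10), "nMuon"
    )

    # Requesting the value triggers the distributed computation
    return hist.GetValue()
```

For example, `run_analysis(client, "<name of tree>", build_chain("<directory with .root files>"), npartitions=<number of partitions>)` would return the filled histogram to the notebook.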
Here you can find a notebook where a very simple distributed RDataFrame analysis is run on a small OpenData sample using a Dask deployment on HTCondor.