This repo contains some examples of analysis performed on the Analysis Facility using RDataFrame distributed on Dask on top of HTCondor
- Go to https://cms-it-hub.cloud.cnaf.infn.it/
- Login using the CMS INDIGO IAM service with CERN SSO (https://cms-auth.web.cern.ch/login)
- Choose the JupyterLab image (in order to be able to use distribution on Dask on top of HTCondor, use the suggested one) and set memory and CPU
-
Open a a new Python3 notebook
-
Deploy a Dask cluster on HTCondor. This can be done via the Dask JupyterLab plugin:
-
Once deployed, initialize the Dask client: pushing
<>
setup automatically a cell to do this, that will look like this:from dask.distributed import Client client = Client("localhost:37470") client
-
Insert the declaration of your custom functions inside an initialization function:
import ROOT text_file = open("postselection.h", "r") data = text_file.read() distributed = ROOT.RDF.Experimental.Distributed def my_initialization_function(): ROOT.gInterpreter.Declare('{}'.format(data)) distributed.initialize(my_initialization_function)
-
Create a distributed RDataFrame reading a list of samples:
chain = [<path to 1st .root file>, <path to 2nd .root file>, ...] df = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame("<name of tree>", chain, npartitions=<number of partitions>, daskclient=client)
Here you can find a simple notebook where a very simple distributed RDataFrame analysis is run on a small OpenData sample using a Dask deployment on HTCondor.