From b4f6353b0f6b9956b30ad4221eda1ebc651aadcd Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Sun, 20 Oct 2024 22:08:45 +0200 Subject: [PATCH 01/20] update example Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 624 +++++++++++++++++++++++++----- 1 file changed, 537 insertions(+), 87 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 65119d1e..219b1c64 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -10,7 +10,7 @@ "\n", "It is by no means intended to provide complete documentation on the topic, but only to show how such conversions could be done.\n", "\n", - "In particular, this example restricts itself to PyArrow Tables, but for more advanced cases, RecordBatches obviously are a better solution.\n", + "This example uses `pyarrow.RecordBatch` to demonstrate zero copy operations. The user can choose a `pyarrow.Table` or other structures based on the requirement.\n", "\n", "**NOTE:** To run this example, the optional `examples` dependencies are required:\n", "\n", @@ -28,16 +28,26 @@ "%%capture cap --no-stderr\n", "from IPython.display import display\n", "\n", - "from power_grid_model import PowerGridModel, initialize_array, CalculationMethod\n", + "from power_grid_model import (\n", + " PowerGridModel,\n", + " initialize_array,\n", + " CalculationMethod,\n", + " power_grid_meta_data,\n", + " ComponentType,\n", + " DatasetType,\n", + ")\n", "import pyarrow as pa\n", "import pandas as pd\n", - "import numpy as np" + "import numpy as np\n", + "\n", + "ZERO_COPY_ERROR_MSG = \"Zero-copy conversion requested, but the data types do not match.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "\n", "## Model\n", "\n", "For clarity, a simple network is created. 
More complex cases work similarly and can be found in the other examples:\n", @@ -57,9 +67,14 @@ "\n", "Construct the input data for the model and construct the actual model.\n", "\n", - "Arrow uses a columnar data format while the power-grid-model uses a row-based data format with continuous memory.\n", - "Because of that, at least one copy is required.\n", - "\n", + "Arrow uses a columnar data format, while the power-grid-model supports both row-based and columnar data formats.\n", + "Converting to or from columnar data can be done without copying, whereas row-based conversions require at least one copy." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "### List the power-grid-model data types\n", "\n", "See which attributes exist for a given component and which data types are used" @@ -77,7 +92,7 @@ "node: {'names': ['id', 'u_rated'], 'formats': ['[3]\n", + " child 0, item: double\n", + "q_specified: fixed_size_list[3]\n", + " child 0, item: double\n" + ] + } + ], + "source": [ + "def pgm_combined_schema(dataset_type: DatasetType, component_type: ComponentType):\n", + " schemas = []\n", + " component_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", + " for attribute, (dtype, _) in component_dtype.fields.items():\n", + " if dtype.shape == (3,):\n", + " pa_dtype = pa.list_(pa.from_numpy_dtype(dtype.base), 3)\n", + " else:\n", + " pa_dtype = pa.from_numpy_dtype(dtype)\n", + " schemas.append((attribute, pa_dtype))\n", + " return pa.schema(schemas)\n", + "\n", + "\n", + "print(\"-------node combined asym scehma-------\")\n", + "print(pgm_combined_schema(DatasetType.input, ComponentType.node))\n", + "print(\"-------asym load combined asym scehma-------\")\n", + "print(pgm_combined_schema(DatasetType.input, ComponentType.asym_load))" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -105,34 +248,34 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ - "pyarrow.Table\n", + "pyarrow.RecordBatch\n", "id: int32\n", "u_rated: double\n", "----\n", - "id: [[1,2,3]]\n", - "u_rated: [[10500,10500,10500]]" + "id: [1,2,3]\n", + "u_rated: [10500,10500,10500]" ] }, - "execution_count": 3, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "nodes = pa.table(\n", + "nodes = pa.record_batch(\n", " [\n", " pa.array([1, 2, 3], type=pa.int32()), # id\n", " pa.array([10500.0, 10500.0, 10500.0], type=pa.float64()),\n", " ],\n", " names=(\"id\", \"u_rated\"),\n", ")\n", - "lines = pa.table(\n", + "lines = pa.record_batch(\n", " [\n", " pa.array([4, 5], type=pa.int32()), # id\n", " pa.array([1, 2], type=pa.int32()), # from_node\n", @@ -164,7 +307,7 @@ " \"tan0\",\n", " ),\n", ")\n", - "sources = pa.table(\n", + "sources = pa.record_batch(\n", " [\n", " pa.array([6], type=pa.int32()), # id\n", " pa.array([1], type=pa.int32()), # node\n", @@ -173,7 +316,7 @@ " ],\n", " names=(\"id\", \"node\", \"status\", \"u_ref\"),\n", ")\n", - "sym_loads = pa.table(\n", + "sym_loads = pa.record_batch(\n", " [\n", " pa.array([7, 8], type=pa.int32()), # id\n", " pa.array([2, 3], type=pa.int32()), # node\n", @@ -195,14 +338,23 @@ "source": [ "### Convert the Arrow data to power-grid-model input data\n", "\n", - "No direct conversion from Arrow Tables to NumPy exists and a copy is always required.\n", + "These Arrow record batch or tables can then be converted to row based or columnar array." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Conversion to row based arrays\n", + "\n", + "No direct conversion from Arrow Tables to row based NumPy array exists and a copy is always required. This would not be the most memory efficient approach. 
\n", "\n", "To ensure support for optional arguments and to prevent version lock, it is recommended to create an empty power-grid-model data set using `initialize_array` and then fill it with the Arrow data." ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -212,13 +364,13 @@ " dtype={'names': ['id', 'u_rated'], 'formats': [' np.ndarray:\n", + "def arrow_to_numpy_row_based(data: pa.lib.Table, data_type: str, component: str) -> np.ndarray:\n", " \"\"\"Convert Arrow data to NumPy data.\"\"\"\n", " result = initialize_array(data_type, component, len(data))\n", " for name, column in zip(data.column_names, data.columns):\n", @@ -227,10 +379,56 @@ " return result\n", "\n", "\n", - "node_input = arrow_to_numpy(nodes, \"input\", \"node\")\n", - "line_input = arrow_to_numpy(lines, \"input\", \"line\")\n", - "source_input = arrow_to_numpy(sources, \"input\", \"source\")\n", - "sym_load_input = arrow_to_numpy(sym_loads, \"input\", \"sym_load\")\n", + "node_input_row_based = arrow_to_numpy_row_based(nodes, \"input\", \"node\")\n", + "node_input_row_based" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Conversion to columnar arrays\n", + "\n", + "For more memory-efficient operations, converting Arrow data to columnar NumPy arrays can be done with zero-copy operations. This ensures that no additional memory is used for the conversion process.\n", + "\n", + "This approach ensures that the data types match and that the conversion is efficient, leveraging the columnar nature of Arrow data. The option of `zero_copy_only` is added in this demo to verify no copies are made. Its usage is not mandatory to ensure zero copy." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])}" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def arrow_to_numpy_columnar(\n", + " data: pa.lib.Table, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", + ") -> np.ndarray:\n", + " \"\"\"Convert Arrow data to NumPy data.\"\"\"\n", + " result = {}\n", + " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", + " for name, column in zip(data.column_names, data.columns):\n", + " column_data = column.to_numpy(zero_copy_only=zero_copy_only)\n", + " if zero_copy_only and column_data.dtype != result_dtype[name]:\n", + " raise ValueError(ZERO_COPY_ERROR_MSG)\n", + " result[name] = column_data.astype(result_dtype[name])\n", + " return result\n", + "\n", + "\n", + "node_input = arrow_to_numpy_columnar(nodes, DatasetType.input, ComponentType.node, zero_copy_only=True)\n", + "line_input = arrow_to_numpy_columnar(lines, DatasetType.input, ComponentType.line)\n", + "source_input = arrow_to_numpy_columnar(sources, DatasetType.input, ComponentType.source)\n", + "sym_load_input = arrow_to_numpy_columnar(sym_loads, DatasetType.input, ComponentType.sym_load)\n", "\n", "node_input" ] @@ -244,24 +442,39 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'node': array([(1, 10500.), (2, 10500.), (3, 10500.)],\n", - " dtype={'names': ['id', 'u_rated'], 'formats': ['10530.228073\n", " -0.002932\n", " -1.000000\n", - " -5.000000e-01\n", + " -5.000001e-01\n", " \n", " \n", " 2\n", @@ -376,7 +589,7 @@ "\n", " q \n", "0 -3.299419e+06 \n", - "1 -5.000000e-01 \n", + "1 -5.000001e-01 \n", "2 -1.500000e+00 " ] }, @@ -407,13 +620,13 @@ }, { "cell_type": "code", - "execution_count": 8, + 
"execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "pyarrow.Table\n", + "pyarrow.RecordBatch\n", "id: int32\n", "energized: int8\n", "u_pu: double\n", @@ -422,22 +635,22 @@ "p: double\n", "q: double\n", "----\n", - "id: [[1,2,3]]\n", - "energized: [[1,1,1]]\n", - "u_pu: [[1.000324825742982,1.0028788641128947,1.004112854674026]]\n", - "u: [[10503.410670301311,10530.228073185395,10543.184974077272]]\n", - "u_angle: [[-0.00006651843181519333,-0.0029317915196014274,-0.004341587216862399]]\n", - "p: [[338777.2462788447,-1.0000001549705169,-1.9999999440349978]]\n", - "q: [[-3299418.6613065186,-0.4999999565008232,-1.4999999075367236]]" + "id: [1,2,3]\n", + "energized: [1,1,1]\n", + "u_pu: [1.000324825742982,1.0028788641128945,1.004112854674026]\n", + "u: [10503.410670301311,10530.228073185392,10543.184974077272]\n", + "u_angle: [-0.00006651843181518038,-0.0029317915196012487,-0.004341587216862092]\n", + "p: [338777.2462788448,-1.0000002693184182,-1.9999998867105226]\n", + "q: [-3299418.661306348,-0.5000000701801947,-1.4999998507078594]" ] }, - "execution_count": 8, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "pa_sym_node_result = pa.table(pd_sym_node_result)\n", + "pa_sym_node_result = pa.record_batch(pd_sym_node_result)\n", "\n", "# and similar for other components\n", "\n", @@ -466,20 +679,20 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "asym_load: {'names': ['id', 'node', 'status', 'type', 'p_specified', 'q_specified'], 'formats': [' np.ndarray:\n", + "def arrow_to_numpy_asym_row_based(\n", + " data: pa.lib.Table,\n", + " dataset_type: DatasetType,\n", + " component_type: ComponentType,\n", + " phases_suffix: tuple[str, str, str] = (\"a\", \"b\", \"c\"),\n", + ") -> np.ndarray:\n", " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", "\n", " This function is similar to 
the arrow_to_numpy function, but also supports asymmetric data.\"\"\"\n", - " result = initialize_array(data_type, component, len(data))\n", - " phases = (\"a\", \"b\", \"c\")\n", + " result = initialize_array(dataset_type, component_type, len(data))\n", " for name, (dtype, _) in result.dtype.fields.items():\n", " if len(dtype.shape) == 0:\n", " # simple or symmetric data type\n", @@ -574,7 +800,7 @@ " result[name] = data.column(name).to_numpy()\n", " else:\n", " # asymmetric data type\n", - " for phase_index, phase in enumerate(phases):\n", + " for phase_index, phase in enumerate(phases_suffix):\n", " phase_name = f\"{name}_{phase}\"\n", "\n", " if phase_name in data.column_names:\n", @@ -583,11 +809,224 @@ " return result\n", "\n", "\n", - "asym_load_input = arrow_to_numpy_asym(asym_loads, \"input\", \"asym_load\")\n", + "asym_load_input = arrow_to_numpy_asym_row_based(asym_loads, DatasetType.input, ComponentType.asym_load)\n", "\n", "asym_load_input" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Conversion to columnar arrays\n", + "\n", + "The implementation is similar to [Conversion to columnar arrays for symmetric input](#conversion-to-columnar-arrays), with special handling for asymmetric values.\n", + "A copy of the three-phase attributes is always needed in this case." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': array([7, 8]),\n", + " 'node': array([2, 3]),\n", + " 'status': array([1, 1], dtype=int8),\n", + " 'type': array([0, 0], dtype=int8),\n", + " 'p_specified': array([[1.0e+00, 1.0e-02, 1.1e-02],\n", + " [2.0e+00, 2.5e+00, 4.5e+02]]),\n", + " 'q_specified': array([[5.0e-01, 1.5e+03, 1.0e-01],\n", + " [1.5e+00, 2.5e+00, 1.5e+03]])}" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def arrow_to_numpy_asym_columnar(\n", + " data: pa.lib.Table,\n", + " dataset_type: DatasetType,\n", + " component_type: ComponentType,\n", + " phases_suffix: tuple[str, str, str] = (\"a\", \"b\", \"c\"),\n", + ") -> np.ndarray:\n", + " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", + "\n", + " This function is similar to the arrow_to_numpy function, but also supports asymmetric data.\"\"\"\n", + " result = {}\n", + " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", + "\n", + " for name in result_dtype.names:\n", + " dtype = result_dtype[name]\n", + " if len(dtype.shape) == 0:\n", + " # simple or symmetric data type\n", + " if name in data.column_names:\n", + " column_data = data.column(name).to_numpy()\n", + " result[name] = column_data.astype(result_dtype[name])\n", + " else:\n", + " # asymmetric data type\n", + " for phase_index, phase in enumerate(phases_suffix):\n", + " phase_name = f\"{name}_{phase}\"\n", + " if phase_name not in data.column_names:\n", + " continue\n", + "\n", + " column_data = data.column(phase_name).to_numpy()\n", + " if name not in result:\n", + " result[name] = np.empty(shape=len(column_data), dtype=result_dtype[name])\n", + " result[name][:, phase_index] = column_data\n", + " return result\n", + "\n", + "\n", + "asym_load_input_columnar = arrow_to_numpy_asym_columnar(asym_loads, DatasetType.input, ComponentType.asym_load)\n", + "\n", 
+ "asym_load_input_columnar" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Combined representation of 3 phase values\n", + "\n", + "We start from complete 3 phases as a fixed size list array" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "asym_load: {'names': ['id', 'node', 'status', 'type', 'p_specified', 'q_specified'], 'formats': ['[3]\n", + " child 0, item: double\n", + "q_specified: fixed_size_list[3]\n", + " child 0, item: double\n", + "----\n", + "id: [7,8]\n", + "node: [2,3]\n", + "status: [1,1]\n", + "type: [0,0]\n", + "p_specified: [[1,0.01,0.011],[2,2.5,450]]\n", + "q_specified: [[0.5,1500,0.1],[1.5,2.5,1500]]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "asym_load_input_dtype = initialize_array(\"input\", \"asym_load\", 0).dtype\n", + "print(\"asym_load:\", asym_load_input_dtype)\n", + "# asym_float_type = pa.struct([(\"a\", pa.float64()), (\"b\", pa.float64()), (\"c\", pa.float64())])\n", + "\n", + "asym_loads = pa.record_batch(\n", + " [\n", + " pa.array([7, 8], type=pa.int32()), # id\n", + " pa.array([2, 3], type=pa.int32()), # node\n", + " pa.array([1, 1], type=pa.int8()), # status\n", + " pa.array([0, 0], type=pa.int8()), # type\n", + " pa.array([[1.0, 1.0e-2, 1.1e-2], [2.0, 2.5, 4.5e2]], type=pa.list_(pa.float64(), 3)), # p_specified\n", + " pa.array([[0.5, 1.5e3, 0.1], [1.5, 2.5, 1.5e3]], type=pa.list_(pa.float64(), 3)), # q_specified\n", + " ],\n", + " names=(\"id\", \"node\", \"status\", \"type\", \"p_specified\", \"q_specified\"),\n", + ")\n", + "\n", + "asym_loads" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: Add a function to convert the Arrow data to NumPy data for row based data using fixed size list arrays\n", + "# TODO: Added below is a function to 
convert the Arrow data to NumPy data for columnar data using fixed size list arrays. Should it be kept or removed?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Columnar data conversion for asymmetric attribute as list array" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': array([7, 8]),\n", + " 'node': array([2, 3]),\n", + " 'status': array([1, 1], dtype=int8),\n", + " 'type': array([0, 0], dtype=int8),\n", + " 'p_specified': array([[1.0e+00, 1.0e-02, 1.1e-02],\n", + " [2.0e+00, 2.5e+00, 4.5e+02]]),\n", + " 'q_specified': array([[5.0e-01, 1.5e+03, 1.0e-01],\n", + " [1.5e+00, 2.5e+00, 1.5e+03]])}" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def arrow_to_numpy_asym_list_array_columnar(\n", + " data: pa.lib.Table, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", + ") -> np.ndarray:\n", + " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", + "\n", + " This function is similar to the arrow_to_numpy function, but also supports asymmetric data.\"\"\"\n", + " result = {}\n", + " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", + "\n", + " for name in result_dtype.names:\n", + " if name not in data.column_names:\n", + " continue\n", + " dtype = result_dtype[name]\n", + "\n", + " if len(dtype.shape) == 0:\n", + " column_data = data.column(name).to_numpy(zero_copy_only=zero_copy_only)\n", + " else:\n", + " column_data = data.column(name).flatten().to_numpy(zero_copy_only=zero_copy_only).reshape(-1, 3)\n", + "\n", + " # TODO Find a way to include shape information instead of base dtype\n", + " if zero_copy_only and column_data.dtype.base != dtype.base:\n", + " raise ValueError(ZERO_COPY_ERROR_MSG)\n", + " result[name] = column_data.astype(dtype.base)\n", + " return result\n", + "\n", + "\n", + 
"asym_load_input_columnar_asym_list_array = arrow_to_numpy_asym_list_array_columnar(\n", + " asym_loads, DatasetType.input, ComponentType.asym_load, zero_copy_only=True\n", + ")\n", + "\n", + "asym_load_input_columnar_asym_list_array" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -597,7 +1036,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -656,7 +1095,7 @@ "2 -0.004338 -2.098733 2.090057" ] }, - "execution_count": 11, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } @@ -690,7 +1129,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -717,18 +1156,18 @@ "----\n", "id: [[1,2,3]]\n", "energized: [[1,1,1]]\n", - "u_pu_a: [[1.0003248257977395,1.0028803762176168,1.004114300817404]]\n", - "u_pu_b: [[1.000324376948685,1.0028710993140397,1.0041033583077168]]\n", - "u_pu_c: [[1.00032436416241,1.002873078902152,1.0041004935738533]]\n", - "u_a: [[6064.146978239599,6079.639179329459,6087.119449677851]]\n", - "u_b: [[6064.144257236812,6079.582941090295,6087.053114238259]]\n", - "u_c: [[6064.1441797241405,6079.594941705455,6087.035747712152]]\n", - "u_angle_a: [[-0.00006651848125692708,-0.0029298831864833634,-0.004337685507209539]]\n", - "u_angle_b: [[-2.0944615736658134,-2.0973219974462594,-2.098732840554144]]\n", + "u_pu_a: [[1.0003248257977395,1.0028803762176164,1.0041143008174032]]\n", + "u_pu_b: [[1.0003243769486854,1.0028710993140406,1.0041033583077175]]\n", + "u_pu_c: [[1.00032436416241,1.0028730789021523,1.0041004935738533]]\n", + "u_a: [[6064.146978239599,6079.639179329456,6087.119449677845]]\n", + "u_b: [[6064.144257236815,6079.582941090301,6087.053114238262]]\n", + "u_c: [[6064.1441797241405,6079.594941705457,6087.035747712152]]\n", + "u_angle_a: [[-0.00006651848125694397,-0.0029298831864832267,-0.004337685507209373]]\n", + "u_angle_b: [[-2.094461573665813,-2.0973219974462594,-2.098732840554144]]\n", "..." 
] }, - "execution_count": 12, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -765,6 +1204,17 @@ "pa_asym_node_result" ] }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO Add a function to convert the results back to Arrow format for columnar data with individual phases\n", + "# TODO Add a function to convert the results back to Arrow format for row data with fixed list array\n", + "# TODO Add a function to convert the results back to Arrow format using columnar data with fixed list array" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -781,7 +1231,7 @@ ], "metadata": { "kernelspec": { - "display_name": "venv", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -795,7 +1245,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.2" + "version": "3.12.0" }, "orig_nbformat": 4 }, From 7672a1653799f9b3000fdb157d570ebc10973c73 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Mon, 21 Oct 2024 12:08:38 +0200 Subject: [PATCH 02/20] add decision todos Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 168 +++++++++++++++++++++++++----- 1 file changed, 140 insertions(+), 28 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 219b1c64..2003b518 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -235,6 +235,38 @@ "print(pgm_combined_schema(DatasetType.input, ComponentType.asym_load))" ] }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "pyarrow.RecordBatch\n", + "id: int32\n", + "u_rated: double\n", + "----\n", + "id: [1,2,3]\n", + "u_rated: [10500,10500,10500]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#TODO Decisions: Create from schema\n", + 
"pa.record_batch(\n", + " {\n", + " \"id\": [1, 2, 3],\n", + " \"u_rated\": [10500.0, 10500.0, 10500.0],\n", + " },\n", + " schema=pgm_schema(DatasetType.input, ComponentType.node),\n", + ")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -248,7 +280,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -262,7 +294,7 @@ "u_rated: [10500,10500,10500]" ] }, - "execution_count": 5, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -354,7 +386,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -364,7 +396,7 @@ " dtype={'names': ['id', 'u_rated'], 'formats': [' pa.lib.table:\n", + " \"\"\"Convert NumPy data to Arrow data.\"\"\"\n", + " simple_data_types = []\n", + " multi_value_data_types = []\n", + "\n", + " for name, (dtype, _) in data.dtype.fields.items():\n", + " if len(dtype.shape) == 0:\n", + " simple_data_types.append(name)\n", + " else:\n", + " multi_value_data_types.append(name)\n", + "\n", + " result = pa.table(pd.DataFrame(data[simple_data_types]))\n", + "\n", + " phases = (\"a\", \"b\", \"c\")\n", + " for name in multi_value_data_types:\n", + " column = data[name]\n", + "\n", + " assert column.shape[1] == len(phases), \"Asymmetric data has 3 phase output\"\n", + "\n", + " for phase_index, phase in enumerate(phases):\n", + " sub_column = column[:, phase_index]\n", + " result = result.append_column(f\"{name}_{phase}\", [pd.Series(sub_column)])\n", + "\n", + " return result\n", + "\n", + "\n", + "pa_asym_node_result = numpy_to_arrow(asym_result[\"node\"])\n", + "\n", + "pa_asym_node_result" + ] + }, + { + "cell_type": "code", + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ From c876aa78f3a659a5d7702373482f1fc1d9683db9 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Mon, 21 Oct 2024 14:52:29 +0200 Subject: [PATCH 03/20] clean up post discussions Signed-off-by: Nitish Bharambe --- 
docs/examples/arrow_example.ipynb | 601 +++++------------------------- 1 file changed, 91 insertions(+), 510 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 2003b518..f86fd7aa 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -123,78 +123,13 @@ "source": [ "#### Creating a Schema\n", "\n", - "Optionally, we can also make this task easier by creating a schema based on the `DatasetType` and `ComponentType` directly from `power_grid_meta_data`. \n", + "We can also make this task easier by creating a schema based on the `DatasetType` and `ComponentType` directly from `power_grid_meta_data`. \n", "They can then directly be used to construct RecordBatches. The user can modify this schema based on the available attributes for each component." ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Schema with suffixes\n", - "\n", - "Suffixes are added to the asymmetric attribute names to handle them." 
- ] - }, { "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-------node scehma-------\n", - "id: int32\n", - "u_rated: double\n", - "-------asym load scehma-------\n", - "id: int32\n", - "node: int32\n", - "status: int8\n", - "type: int8\n", - "p_specified_a: double\n", - "p_specified_b: double\n", - "p_specified_c: double\n", - "q_specified_a: double\n", - "q_specified_b: double\n", - "q_specified_c: double\n" - ] - } - ], - "source": [ - "def pgm_schema(\n", - " dataset_type: DatasetType, component_type: ComponentType, asym_suffix: tuple[str, str, str] = (\"a\", \"b\", \"c\")\n", - "):\n", - " schemas = []\n", - " component_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", - " for attribute, (dtype, _) in component_dtype.fields.items():\n", - " if dtype.shape == (3,):\n", - " for suffix in asym_suffix:\n", - " schemas.append((f\"{attribute}_{suffix}\", pa.from_numpy_dtype(dtype.base)))\n", - " else:\n", - " schemas.append((attribute, pa.from_numpy_dtype(dtype)))\n", - " return pa.schema(schemas)\n", - "\n", - "\n", - "print(\"-------node scehma-------\")\n", - "print(pgm_schema(DatasetType.input, ComponentType.node))\n", - "print(\"-------asym load scehma-------\")\n", - "print(pgm_schema(DatasetType.input, ComponentType.asym_load))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Schema with asymmetric attributes as a struct\n", - "\n", - "The phases can also be combined together in a pyarrow fixed size list." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, + "execution_count": 28, "metadata": {}, "outputs": [ { @@ -217,54 +152,24 @@ } ], "source": [ - "def pgm_combined_schema(dataset_type: DatasetType, component_type: ComponentType):\n", + "def pgm_schema(dataset_type: DatasetType, component_type: ComponentType, attributes: list[str] = None):\n", " schemas = []\n", " component_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", - " for attribute, (dtype, _) in component_dtype.fields.items():\n", + " for meta_attribute, (dtype, _) in component_dtype.fields.items():\n", + " if attributes is not None and meta_attribute not in attributes:\n", + " continue\n", " if dtype.shape == (3,):\n", " pa_dtype = pa.list_(pa.from_numpy_dtype(dtype.base), 3)\n", " else:\n", " pa_dtype = pa.from_numpy_dtype(dtype)\n", - " schemas.append((attribute, pa_dtype))\n", + " schemas.append((meta_attribute, pa_dtype))\n", " return pa.schema(schemas)\n", "\n", "\n", "print(\"-------node combined asym scehma-------\")\n", - "print(pgm_combined_schema(DatasetType.input, ComponentType.node))\n", + "print(pgm_schema(DatasetType.input, ComponentType.node))\n", "print(\"-------asym load combined asym scehma-------\")\n", - "print(pgm_combined_schema(DatasetType.input, ComponentType.asym_load))" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "pyarrow.RecordBatch\n", - "id: int32\n", - "u_rated: double\n", - "----\n", - "id: [1,2,3]\n", - "u_rated: [10500,10500,10500]" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#TODO Decisions: Create from schema\n", - "pa.record_batch(\n", - " {\n", - " \"id\": [1, 2, 3],\n", - " \"u_rated\": [10500.0, 10500.0, 10500.0],\n", - " },\n", - " schema=pgm_schema(DatasetType.input, ComponentType.node),\n", - ")" + "print(pgm_schema(DatasetType.input, ComponentType.asym_load))" ] }, { @@ -280,7 
+185,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -294,70 +199,47 @@ "u_rated: [10500,10500,10500]" ] }, - "execution_count": 6, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "nodes = pa.record_batch(\n", - " [\n", - " pa.array([1, 2, 3], type=pa.int32()), # id\n", - " pa.array([10500.0, 10500.0, 10500.0], type=pa.float64()),\n", - " ],\n", - " names=(\"id\", \"u_rated\"),\n", - ")\n", - "lines = pa.record_batch(\n", - " [\n", - " pa.array([4, 5], type=pa.int32()), # id\n", - " pa.array([1, 2], type=pa.int32()), # from_node\n", - " pa.array([2, 3], type=pa.int32()), # to_node\n", - " pa.array([1, 1], type=pa.int8()), # from_status\n", - " pa.array([1, 1], type=pa.int8()), # to_status\n", - " pa.array([0.11, 0.15], type=pa.float64()), # r1\n", - " pa.array([0.12, 0.16], type=pa.float64()), # x1\n", - " pa.array([4.1e-05, 5.4e-05], type=pa.float64()), # c1\n", - " pa.array([0.1, 0.1], type=pa.float64()), # tan1\n", - " pa.array([0.01, 0.05], type=pa.float64()), # r0\n", - " pa.array([0.22, 0.06], type=pa.float64()), # x0\n", - " pa.array([4.1e-05, 5.4e-05], type=pa.float64()), # c0\n", - " pa.array([0.4, 0.1], type=pa.float64()), # tan0\n", - " ],\n", - " names=(\n", - " \"id\",\n", - " \"from_node\",\n", - " \"to_node\",\n", - " \"from_status\",\n", - " \"to_status\",\n", - " \"r1\",\n", - " \"x1\",\n", - " \"c1\",\n", - " \"tan1\",\n", - " \"r0\",\n", - " \"x0\",\n", - " \"c0\",\n", - " \"tan0\",\n", - " ),\n", - ")\n", - "sources = pa.record_batch(\n", - " [\n", - " pa.array([6], type=pa.int32()), # id\n", - " pa.array([1], type=pa.int32()), # node\n", - " pa.array([1], type=pa.int8()), # status\n", - " pa.array([1.0], type=pa.float64()), # u_ref\n", - " ],\n", - " names=(\"id\", \"node\", \"status\", \"u_ref\"),\n", - ")\n", + "nodes_dict = {\"id\": [1, 2, 3], \"u_rated\": [10500.0, 10500.0, 10500.0]}\n", + "\n", + "\n", + "lines_dict = {\n", + " 
\"id\": [4, 5],\n", + " \"from_node\": [1, 2],\n", + " \"to_node\": [2, 3],\n", + " \"from_status\": [1, 1],\n", + " \"to_status\": [1, 1],\n", + " \"r1\": [0.11, 0.15],\n", + " \"x1\": [0.12, 0.16],\n", + " \"c1\": [4.1e-05, 5.4e-05],\n", + " \"tan1\": [0.1, 0.1],\n", + " \"r0\": [0.01, 0.05],\n", + " \"x0\": [0.22, 0.06],\n", + " \"c0\": [4.1e-05, 5.4e-05],\n", + " \"tan0\": [0.4, 0.1],\n", + "}\n", + "\n", + "sources_dict = {\"id\": [6], \"node\": [1], \"status\": [1], \"u_ref\": [1.0]}\n", + "\n", + "sym_loads_dict = {\n", + " \"id\": [7, 8],\n", + " \"node\": [2, 3],\n", + " \"status\": [1, 1],\n", + " \"type\": [0, 0],\n", + " \"p_specified\": [1.0, 2.0],\n", + " \"q_specified\": [0.5, 1.5],\n", + "}\n", + "\n", + "nodes = pa.record_batch(nodes_dict, schema=pgm_schema(DatasetType.input, ComponentType.node, nodes_dict.keys()))\n", + "lines = pa.record_batch(lines_dict, schema=pgm_schema(DatasetType.input, ComponentType.line, lines_dict.keys()))\n", + "sources = pa.record_batch(sources_dict, schema=pgm_schema(DatasetType.input, ComponentType.source, sources_dict.keys()))\n", "sym_loads = pa.record_batch(\n", - " [\n", - " pa.array([7, 8], type=pa.int32()), # id\n", - " pa.array([2, 3], type=pa.int32()), # node\n", - " pa.array([1, 1], type=pa.int8()), # status\n", - " pa.array([0, 0], type=pa.int8()), # type\n", - " pa.array([1.0, 2.0], type=pa.float64()), # p_specified\n", - " pa.array([0.5, 1.5], type=pa.float64()), # q_specified\n", - " ],\n", - " names=(\"id\", \"node\", \"status\", \"type\", \"p_specified\", \"q_specified\"),\n", + " sym_loads_dict, schema=pgm_schema(DatasetType.input, ComponentType.sym_load, sym_loads_dict.keys())\n", ")\n", "\n", "nodes\n", @@ -370,65 +252,19 @@ "source": [ "### Convert the Arrow data to power-grid-model input data\n", "\n", - "These Arrow record batch or tables can then be converted to row based or columnar array." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Conversion to row based arrays\n", + "These Arrow record batch or tables can then be converted to row based or columnar array.\n", + "For more memory-efficient operations, converting Arrow data to columnar NumPy arrays can be done with zero-copy operations.\n", "\n", - "No direct conversion from Arrow Tables to row based NumPy array exists and a copy is always required. This would not be the most memory efficient approach. \n", + "This approach ensures that the data types match and that the conversion is efficient, leveraging the columnar nature of Arrow data. \n", "\n", - "To ensure support for optional arguments and to prevent version lock, it is recommended to create an empty power-grid-model data set using `initialize_array` and then fill it with the Arrow data." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([(1, 10500.), (2, 10500.), (3, 10500.)],\n", - " dtype={'names': ['id', 'u_rated'], 'formats': [' np.ndarray:\n", - " \"\"\"Convert Arrow data to NumPy data.\"\"\"\n", - " result = initialize_array(data_type, component, len(data))\n", - " for name, column in zip(data.column_names, data.columns):\n", - " if name in result.dtype.names:\n", - " result[name] = column.to_numpy()\n", - " return result\n", - "\n", - "\n", - "node_input_row_based = arrow_to_numpy_row_based(nodes, \"input\", \"node\")\n", - "node_input_row_based" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Conversion to columnar arrays\n", - "\n", - "For more memory-efficient operations, converting Arrow data to columnar NumPy arrays can be done with zero-copy operations. This ensures that no additional memory is used for the conversion process.\n", - "\n", - "This approach ensures that the data types match and that the conversion is efficient, leveraging the columnar nature of Arrow data. 
The option of `zero_copy_only` is added in this demo to verify no copies are made. Its usage is not mandatory to ensure zero copy." + "```{note}\n", + "The option of `zero_copy_only` in the function below is added in this demo to verify no copies are made. Its usage is not mandatory to do zero copy conversion.\n", + "```" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 49, "metadata": {}, "outputs": [ { @@ -437,14 +273,14 @@ "{'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])}" ] }, - "execution_count": 8, + "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def arrow_to_numpy_columnar(\n", - " data: pa.lib.Table, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", + "def arrow_to_numpy(\n", + " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", ") -> np.ndarray:\n", " \"\"\"Convert Arrow data to NumPy data.\"\"\"\n", " result = {}\n", @@ -453,14 +289,14 @@ " column_data = column.to_numpy(zero_copy_only=zero_copy_only)\n", " if zero_copy_only and column_data.dtype != result_dtype[name]:\n", " raise ValueError(ZERO_COPY_ERROR_MSG)\n", - " result[name] = column_data.astype(result_dtype[name])\n", + " result[name] = column_data.astype(dtype=result_dtype[name], copy=False)\n", " return result\n", "\n", "\n", - "node_input = arrow_to_numpy_columnar(nodes, DatasetType.input, ComponentType.node, zero_copy_only=True)\n", - "line_input = arrow_to_numpy_columnar(lines, DatasetType.input, ComponentType.line)\n", - "source_input = arrow_to_numpy_columnar(sources, DatasetType.input, ComponentType.source)\n", - "sym_load_input = arrow_to_numpy_columnar(sym_loads, DatasetType.input, ComponentType.sym_load)\n", + "node_input = arrow_to_numpy(nodes, DatasetType.input, ComponentType.node, zero_copy_only=True)\n", + "line_input = arrow_to_numpy(lines, DatasetType.input, ComponentType.line)\n", + 
"source_input = arrow_to_numpy(sources, DatasetType.input, ComponentType.source)\n", + "sym_load_input = arrow_to_numpy(sym_loads, DatasetType.input, ComponentType.sym_load)\n", "\n", "node_input" ] @@ -474,7 +310,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 37, "metadata": {}, "outputs": [ { @@ -506,7 +342,7 @@ " 'q_specified': array([0.5, 1.5])}}" ] }, - "execution_count": 9, + "execution_count": 37, "metadata": {}, "output_type": "execute_result" } @@ -524,7 +360,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 38, "metadata": {}, "outputs": [], "source": [ @@ -545,7 +381,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 39, "metadata": {}, "outputs": [ { @@ -652,7 +488,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 40, "metadata": {}, "outputs": [ { @@ -676,7 +512,7 @@ "q: [-3299418.661306348,-0.5000000701801947,-1.4999998507078594]" ] }, - "execution_count": 12, + "execution_count": 40, "metadata": {}, "output_type": "execute_result" } @@ -696,8 +532,13 @@ "## Single asymmetric calculations\n", "\n", "Asymmetric calculations have a tuple of values for some of the attributes and are not easily convertible to pandas data frames.\n", - "Instead, one can have a look at the individual components of those attributes and/or flatten the arrays to access all components.\n", - "\n", + "Instead, one can have a look at the individual components of those attributes and/or flatten the arrays to access all components." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "### Asymmetric input\n", "\n", "To illustrate the conversion, let's consider a similar grid but with asymmetric loads.\n", @@ -711,234 +552,9 @@ }, { "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "asym_load: {'names': ['id', 'node', 'status', 'type', 'p_specified', 'q_specified'], 'formats': [' np.ndarray:\n", - " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", - "\n", - " This function is similar to the arrow_to_numpy function, but also supports asymmetric data.\"\"\"\n", - " result = initialize_array(dataset_type, component_type, len(data))\n", - " for name, (dtype, _) in result.dtype.fields.items():\n", - " if len(dtype.shape) == 0:\n", - " # simple or symmetric data type\n", - " if name in data.column_names:\n", - " result[name] = data.column(name).to_numpy()\n", - " else:\n", - " # asymmetric data type\n", - " for phase_index, phase in enumerate(phases_suffix):\n", - " phase_name = f\"{name}_{phase}\"\n", - "\n", - " if phase_name in data.column_names:\n", - " result[name][:, phase_index] = data.column(phase_name).to_numpy()\n", - "\n", - " return result\n", - "\n", - "\n", - "asym_load_input = arrow_to_numpy_asym_row_based(asym_loads, DatasetType.input, ComponentType.asym_load)\n", - "\n", - "asym_load_input" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Conversion to columnar arrays\n", - "\n", - "The implementation would be similar to [Conversion to columnar arrays for symmetric input](#conversion-to-columnar-arrays), with special handling for asymmertic values.\n", - "A copy for the 3 phase attributes in this case is always needed." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'id': array([7, 8]),\n", - " 'node': array([2, 3]),\n", - " 'status': array([1, 1], dtype=int8),\n", - " 'type': array([0, 0], dtype=int8),\n", - " 'p_specified': array([[1.0e+00, 1.0e-02, 1.1e-02],\n", - " [2.0e+00, 2.5e+00, 4.5e+02]]),\n", - " 'q_specified': array([[5.0e-01, 1.5e+03, 1.0e-01],\n", - " [1.5e+00, 2.5e+00, 1.5e+03]])}" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def arrow_to_numpy_asym_columnar(\n", - " data: pa.lib.Table,\n", - " dataset_type: DatasetType,\n", - " component_type: ComponentType,\n", - " phases_suffix: tuple[str, str, str] = (\"a\", \"b\", \"c\"),\n", - ") -> np.ndarray:\n", - " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", - "\n", - " This function is similar to the arrow_to_numpy function, but also supports asymmetric data.\"\"\"\n", - " result = {}\n", - " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", - "\n", - " for name in result_dtype.names:\n", - " dtype = result_dtype[name]\n", - " if len(dtype.shape) == 0:\n", - " # simple or symmetric data type\n", - " if name in data.column_names:\n", - " column_data = data.column(name).to_numpy()\n", - " result[name] = column_data.astype(result_dtype[name])\n", - " else:\n", - " # asymmetric data type\n", - " for phase_index, phase in enumerate(phases_suffix):\n", - " phase_name = f\"{name}_{phase}\"\n", - " if phase_name not in data.column_names:\n", - " continue\n", - "\n", - " column_data = data.column(phase_name).to_numpy()\n", - " if name not in result:\n", - " result[name] = np.empty(shape=len(column_data), dtype=result_dtype[name])\n", - " result[name][:, phase_index] = column_data\n", - " return result\n", - "\n", - "\n", - "asym_load_input_columnar = arrow_to_numpy_asym_columnar(asym_loads, DatasetType.input, ComponentType.asym_load)\n", - "\n", 
- "asym_load_input_columnar" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Combined representation of 3 phase values\n", - "\n", - "We start from complete 3 phases as a fixed size list array" - ] - }, - { - "cell_type": "code", - "execution_count": 27, + "execution_count": 42, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "asym_load: {'names': ['id', 'node', 'status', 'type', 'p_specified', 'q_specified'], 'formats': [' np.ndarray:\n", " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", @@ -1048,14 +641,13 @@ " else:\n", " column_data = data.column(name).flatten().to_numpy(zero_copy_only=zero_copy_only).reshape(-1, 3)\n", "\n", - " # TODO Find a way to include shape information instead of base dtype\n", " if zero_copy_only and column_data.dtype.base != dtype.base:\n", " raise ValueError(ZERO_COPY_ERROR_MSG)\n", - " result[name] = column_data.astype(dtype.base)\n", + " result[name] = column_data.astype(dtype=dtype.base, copy=False)\n", " return result\n", "\n", "\n", - "asym_load_input_columnar_asym_list_array = arrow_to_numpy_asym_list_array_columnar(\n", + "asym_load_input_columnar_asym_list_array = arrow_to_numpy_asym(\n", " asym_loads, DatasetType.input, ComponentType.asym_load, zero_copy_only=True\n", ")\n", "\n", @@ -1071,7 +663,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 54, "metadata": {}, "outputs": [ { @@ -1130,7 +722,7 @@ "2 -0.004338 -2.098733 2.090057" ] }, - "execution_count": 19, + "execution_count": 54, "metadata": {}, "output_type": "execute_result" } @@ -1316,17 +908,6 @@ "pa_asym_node_result" ] }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO Add a function to convert the results back to Arrow format for columnar data with individual phases\n", - "# TODO Add a function to convert the results back to Arrow format for row data with fixed list array\n", - "# TODO Add a 
function to convert the results back to Arrow format using columnar data with fixed list array" - ] - }, { "cell_type": "markdown", "metadata": {}, From 260675bec1f3b1ae83a0e21adcb8b613bedb20c3 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Mon, 21 Oct 2024 15:39:00 +0200 Subject: [PATCH 04/20] add asym output function Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 315 +++++++----------------------- 1 file changed, 72 insertions(+), 243 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index f86fd7aa..c4154c86 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -35,7 +35,9 @@ " power_grid_meta_data,\n", " ComponentType,\n", " DatasetType,\n", + " ComponentAttributeFilterOptions,\n", ")\n", + "from power_grid_model.data_types import SingleColumnarData\n", "import pyarrow as pa\n", "import pandas as pd\n", "import numpy as np\n", @@ -129,7 +131,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -185,7 +187,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -199,7 +201,7 @@ "u_rated: [10500,10500,10500]" ] }, - "execution_count": 34, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -264,7 +266,7 @@ }, { "cell_type": "code", - "execution_count": 49, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -273,7 +275,7 @@ "{'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])}" ] }, - "execution_count": 49, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -310,7 +312,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -342,7 +344,7 @@ " 'q_specified': array([0.5, 1.5])}}" ] }, - "execution_count": 37, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -360,7 +362,7 @@ }, { 
"cell_type": "code", - "execution_count": 38, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -381,114 +383,7 @@ }, { "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
 | id | energized | u_pu | u | u_angle | p | q\n", - "
0 | 1 | 1 | 1.000325 | 10503.410670 | -0.000067 | 338777.246279 | -3.299419e+06\n", - "
1 | 2 | 1 | 1.002879 | 10530.228073 | -0.002932 | -1.000000 | -5.000001e-01\n", - "
2 | 3 | 1 | 1.004113 | 10543.184974 | -0.004342 | -2.000000 | -1.500000e+00
" - ], - "text/plain": [ - " id energized u_pu u u_angle p \\\n", - "0 1 1 1.000325 10503.410670 -0.000067 338777.246279 \n", - "1 2 1 1.002879 10530.228073 -0.002932 -1.000000 \n", - "2 3 1 1.004113 10543.184974 -0.004342 -2.000000 \n", - "\n", - " q \n", - "0 -3.299419e+06 \n", - "1 -5.000001e-01 \n", - "2 -1.500000e+00 " - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# construct the model\n", - "model = PowerGridModel(input_data=input_data, system_frequency=50)\n", - "\n", - "# run the calculation\n", - "sym_result = model.calculate_power_flow()\n", - "\n", - "# use pandas to tabulate and display the results\n", - "pd_sym_node_result = pd.DataFrame(sym_result[\"node\"])\n", - "display(pd_sym_node_result)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Convert power-grid-model output data to Arrow output data\n", - "\n", - "Using Pandas DataFrames as an intermediate type, constructing Arrow data formats is straightfoward" - ] - }, - { - "cell_type": "code", - "execution_count": 40, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -512,16 +407,23 @@ "q: [-3299418.661306348,-0.5000000701801947,-1.4999998507078594]" ] }, - "execution_count": 40, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "pa_sym_node_result = pa.record_batch(pd_sym_node_result)\n", + "# construct the model\n", + "model = PowerGridModel(input_data=input_data, system_frequency=50)\n", "\n", - "# and similar for other components\n", + "# run the calculation\n", + "sym_result = model.calculate_power_flow(output_component_types=ComponentAttributeFilterOptions.relevant)\n", "\n", + "# use pandas to tabulate and display the results\n", + "sym_node_result = sym_result[ComponentType.node]\n", + "pa_sym_node_result = pa.record_batch(\n", + " sym_node_result, schema=pgm_schema(DatasetType.sym_output, ComponentType.node, sym_node_result.keys())\n", + ")\n", "pa_sym_node_result" ] }, @@ 
-552,7 +454,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -576,7 +478,7 @@ "q_specified: [[0.5,1500,0.1],[1.5,2.5,1500]]" ] }, - "execution_count": 42, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -600,7 +502,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -616,7 +518,7 @@ " [1.5e+00, 2.5e+00, 1.5e+03]])}" ] }, - "execution_count": 53, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -647,11 +549,9 @@ " return result\n", "\n", "\n", - "asym_load_input_columnar_asym_list_array = arrow_to_numpy_asym(\n", - " asym_loads, DatasetType.input, ComponentType.asym_load, zero_copy_only=True\n", - ")\n", + "asym_load_input = arrow_to_numpy_asym(asym_loads, DatasetType.input, ComponentType.asym_load, zero_copy_only=True)\n", "\n", - "asym_load_input_columnar_asym_list_array" + "asym_load_input" ] }, { @@ -663,7 +563,7 @@ }, { "cell_type": "code", - "execution_count": 54, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -722,7 +622,7 @@ "2 -0.004338 -2.098733 2.090057" ] }, - "execution_count": 54, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -741,7 +641,9 @@ "asym_model = PowerGridModel(input_data=asym_input_data, system_frequency=50)\n", "\n", "# run the calculation\n", - "asym_result = asym_model.calculate_power_flow(symmetric=False)\n", + "asym_result = asym_model.calculate_power_flow(\n", + " symmetric=False, output_component_types=ComponentAttributeFilterOptions.everything\n", + ")\n", "\n", "# use pandas to display the results, but beware the data types\n", "pd.DataFrame(asym_result[\"node\"][\"u_angle\"])" @@ -756,154 +658,81 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "pyarrow.Table\n", - "id: int32\n", - "energized: int8\n", - "u_pu_a: 
double\n", - "u_pu_b: double\n", - "u_pu_c: double\n", - "u_a: double\n", - "u_b: double\n", - "u_c: double\n", - "u_angle_a: double\n", - "u_angle_b: double\n", - "u_angle_c: double\n", - "p_a: double\n", - "p_b: double\n", - "p_c: double\n", - "q_a: double\n", - "q_b: double\n", - "q_c: double\n", - "----\n", - "id: [[1,2,3]]\n", - "energized: [[1,1,1]]\n", - "u_pu_a: [[1.0003248257977395,1.0028803762176164,1.0041143008174032]]\n", - "u_pu_b: [[1.0003243769486854,1.0028710993140406,1.0041033583077175]]\n", - "u_pu_c: [[1.00032436416241,1.0028730789021523,1.0041004935738533]]\n", - "u_a: [[6064.146978239599,6079.639179329456,6087.119449677845]]\n", - "u_b: [[6064.144257236815,6079.582941090301,6087.053114238262]]\n", - "u_c: [[6064.1441797241405,6079.594941705457,6087.035747712152]]\n", - "u_angle_a: [[-0.00006651848125694397,-0.0029298831864832267,-0.004337685507209373]]\n", - "u_angle_b: [[-2.094461573665813,-2.0973219974462594,-2.098732840554144]]\n", - "..." + "pyarrow.Field" ] }, - "execution_count": 20, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def numpy_to_arrow(data: np.ndarray) -> pa.lib.table:\n", - " \"\"\"Convert NumPy data to Arrow data.\"\"\"\n", - " simple_data_types = []\n", - " multi_value_data_types = []\n", - "\n", - " for name, (dtype, _) in data.dtype.fields.items():\n", - " if len(dtype.shape) == 0:\n", - " simple_data_types.append(name)\n", - " else:\n", - " multi_value_data_types.append(name)\n", - "\n", - " result = pa.table(pd.DataFrame(data[simple_data_types]))\n", - "\n", - " phases = (\"a\", \"b\", \"c\")\n", - " for name in multi_value_data_types:\n", - " column = data[name]\n", - "\n", - " assert column.shape[1] == len(phases), \"Asymmetric data has 3 phase output\"\n", - "\n", - " for phase_index, phase in enumerate(phases):\n", - " sub_column = column[:, phase_index]\n", - " result = result.append_column(f\"{name}_{phase}\", [pd.Series(sub_column)])\n", - "\n", - " return 
result\n", - "\n", - "\n", - "pa_asym_node_result = numpy_to_arrow(asym_result[\"node\"])\n", - "\n", - "pa_asym_node_result" + "pgm_schema(DatasetType.asym_output, ComponentType.node).field(\"id\")" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "pyarrow.Table\n", + "pyarrow.RecordBatch\n", "id: int32\n", "energized: int8\n", - "u_pu_a: double\n", - "u_pu_b: double\n", - "u_pu_c: double\n", - "u_a: double\n", - "u_b: double\n", - "u_c: double\n", - "u_angle_a: double\n", - "u_angle_b: double\n", - "u_angle_c: double\n", - "p_a: double\n", - "p_b: double\n", - "p_c: double\n", - "q_a: double\n", - "q_b: double\n", - "q_c: double\n", + "u_pu: fixed_size_list[3]\n", + " child 0, item: double\n", + "u: fixed_size_list[3]\n", + " child 0, item: double\n", + "u_angle: fixed_size_list[3]\n", + " child 0, item: double\n", + "p: fixed_size_list[3]\n", + " child 0, item: double\n", + "q: fixed_size_list[3]\n", + " child 0, item: double\n", "----\n", - "id: [[1,2,3]]\n", - "energized: [[1,1,1]]\n", - "u_pu_a: [[1.0003248257977395,1.0028803762176164,1.0041143008174032]]\n", - "u_pu_b: [[1.0003243769486854,1.0028710993140406,1.0041033583077175]]\n", - "u_pu_c: [[1.00032436416241,1.0028730789021523,1.0041004935738533]]\n", - "u_a: [[6064.146978239599,6079.639179329456,6087.119449677845]]\n", - "u_b: [[6064.144257236815,6079.582941090301,6087.053114238262]]\n", - "u_c: [[6064.1441797241405,6079.594941705457,6087.035747712152]]\n", - "u_angle_a: [[-0.00006651848125694397,-0.0029298831864832267,-0.004337685507209373]]\n", - "u_angle_b: [[-2.094461573665813,-2.0973219974462594,-2.098732840554144]]\n", - "..." 
+ "id: [1,2,3]\n", + "energized: [1,1,1]\n", + "u_pu: [[1.0003248257977395,1.0003243769486854,1.00032436416241],[1.0028803762176164,1.0028710993140406,1.0028730789021523],[1.0041143008174032,1.0041033583077175,1.0041004935738533]]\n", + "u: [[6064.146978239599,6064.144257236815,6064.1441797241405],[6079.639179329456,6079.582941090301,6079.594941705457],[6087.119449677845,6087.053114238262,6087.035747712152]]\n", + "u_angle: [[-0.00006651848125694397,-2.094461573665813,2.09432849798745],[-0.0029298831864832267,-2.0973219974462594,2.0914640024381836],[-0.004337685507209373,-2.098732840554144,2.0900574062078014]]\n", + "p: [[112925.89463805761,112918.13517097049,113364.09104548635],[-0.9999999787945241,-0.009999971449717083,-0.010999979325441034],[-2.0000000113649943,-2.500000072350112,-450.00000008387997]]\n", + "q: [[-1099806.4185888197,-1098301.0302391076,-1098302.79423175],[-0.499999998516201,-1499.9999999095232,-0.10000001915949493],[-1.5000000216889147,-2.50000006806065,-1500.0000000385737]]" ] }, - "execution_count": 21, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def numpy_columnar_to_arrow_combined(data: np.ndarray) -> pa.lib.table:\n", + "def numpy_columnar_to_arrow(\n", + " data: SingleColumnarData, dataset_type: DatasetType, component_type: ComponentType\n", + ") -> pa.RecordBatch:\n", " \"\"\"Convert NumPy data to Arrow data.\"\"\"\n", - " simple_data_types = []\n", - " multi_value_data_types = []\n", - "\n", - " for name, (dtype, _) in data.dtype.fields.items():\n", - " if len(dtype.shape) == 0:\n", - " simple_data_types.append(name)\n", + " # pa.record_batch.from_arrays(data, schema=pgm_schema(DatasetType.result, ComponentType.node))\n", + " component_pgm_schema = pgm_schema(dataset_type, component_type, data.keys())\n", + " pa_columns = {}\n", + " for attribute, data in data.items():\n", + " primitive_type = component_pgm_schema.field(attribute).type\n", + "\n", + " if data.ndim == 2 and data.shape[1] == 
3:\n", + " pa_columns[attribute] = pa.FixedSizeListArray.from_arrays(data.flatten(), type=primitive_type)\n", " else:\n", - " multi_value_data_types.append(name)\n", - "\n", - " result = pa.table(pd.DataFrame(data[simple_data_types]))\n", + " pa_columns[attribute] = pa.array(data, type=primitive_type)\n", + " return pa.record_batch(pa_columns, component_pgm_schema)\n", "\n", - " phases = (\"a\", \"b\", \"c\")\n", - " for name in multi_value_data_types:\n", - " column = data[name]\n", "\n", - " assert column.shape[1] == len(phases), \"Asymmetric data has 3 phase output\"\n", - "\n", - " for phase_index, phase in enumerate(phases):\n", - " sub_column = column[:, phase_index]\n", - " result = result.append_column(f\"{name}_{phase}\", [pd.Series(sub_column)])\n", - "\n", - " return result\n", - "\n", - "\n", - "pa_asym_node_result = numpy_to_arrow(asym_result[\"node\"])\n", + "pa_asym_node_result = numpy_columnar_to_arrow(\n", + " asym_result[ComponentType.node], DatasetType.asym_output, ComponentType.node\n", + ")\n", "\n", "pa_asym_node_result" ] From b5d48a816b522d488bb0c6361c35357424373ff4 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Mon, 21 Oct 2024 16:33:08 +0200 Subject: [PATCH 05/20] =?UTF-8?q?text=20=C3=A4nd=20format=20changes?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 406 +++++++----------------------- 1 file changed, 84 insertions(+), 322 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index c4154c86..2cfb6776 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -27,6 +27,7 @@ "source": [ "%%capture cap --no-stderr\n", "from IPython.display import display\n", + "from typing import Iterable\n", "\n", "from power_grid_model import (\n", " PowerGridModel,\n", @@ -40,8 +41,16 @@ "from power_grid_model.data_types import SingleColumnarData\n", "import 
pyarrow as pa\n", "import pandas as pd\n", - "import numpy as np\n", - "\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# A constant showing error message\n", "ZERO_COPY_ERROR_MSG = \"Zero-copy conversion requested, but the data types do not match.\"" ] }, @@ -70,7 +79,7 @@ "Construct the input data for the model and construct the actual model.\n", "\n", "Arrow uses a columnar data format while the power-grid-model offers both: row based or columnar data format.\n", - "Converting to/from columnar data can enable having zero copies to be produced while atleast one copy would be produced for row-based data conversions." + "Because of this, the columnar data format of power-grid-model provides a zero-copy interface for Arrow data. This differs from the row-based data format, for which conversions always require a copy." ] }, { @@ -84,20 +93,9 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "node: {'names': ['id', 'u_rated'], 'formats': ['[3]\n", - " child 0, item: double\n", - "q_specified: fixed_size_list[3]\n", - " child 0, item: double\n" - ] - } - ], + "outputs": [], "source": [ - "def pgm_schema(dataset_type: DatasetType, component_type: ComponentType, attributes: list[str] = None):\n", + "def pgm_schema(dataset_type: DatasetType, component_type: ComponentType, attributes: Iterable[str] | None = None):\n", " schemas = []\n", " component_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", " for meta_attribute, (dtype, _) in component_dtype.fields.items():\n", @@ -187,25 +168,9 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "pyarrow.RecordBatch\n", - "id: int32\n", - "u_rated: double\n", - "----\n", - "id: [1,2,3]\n", - "u_rated: 
[10500,10500,10500]" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "nodes_dict = {\"id\": [1, 2, 3], \"u_rated\": [10500.0, 10500.0, 10500.0]}\n", "\n", @@ -245,7 +210,7 @@ ")\n", "\n", "nodes\n", - "# the tables of the other components can be printed similarly" + "# the record batches of the other components can be printed similarly" ] }, { @@ -254,10 +219,11 @@ "source": [ "### Convert the Arrow data to power-grid-model input data\n", "\n", - "These Arrow record batch or tables can then be converted to row based or columnar array.\n", - "For more memory-efficient operations, converting Arrow data to columnar NumPy arrays can be done with zero-copy operations.\n", + "The Arrow record batches or tables can then be converted to row-based data or columnar data.\n", + "Converting Arrow data to columnar NumPy arrays is recommended to leverage the columnar nature of Arrow data. \n", + "This conversion can be done with zero-copy operations.\n", "\n", - "This approach ensures that the data types match and that the conversion is efficient, leveraging the columnar nature of Arrow data. \n", + "A similar approach can be adopted by the user to convert to row-based data.\n", "\n", "```{note}\n", "The option of `zero_copy_only` in the function below is added in this demo to verify no copies are made. 
Its usage is not mandatory to do zero copy conversion.\n", @@ -266,20 +232,9 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])}" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "def arrow_to_numpy(\n", " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", @@ -312,57 +267,21 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'node': {'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])},\n", - " 'line': {'id': array([4, 5]),\n", - " 'from_node': array([1, 2]),\n", - " 'to_node': array([2, 3]),\n", - " 'from_status': array([1, 1], dtype=int8),\n", - " 'to_status': array([1, 1], dtype=int8),\n", - " 'r1': array([0.11, 0.15]),\n", - " 'x1': array([0.12, 0.16]),\n", - " 'c1': array([4.1e-05, 5.4e-05]),\n", - " 'tan1': array([0.1, 0.1]),\n", - " 'r0': array([0.01, 0.05]),\n", - " 'x0': array([0.22, 0.06]),\n", - " 'c0': array([4.1e-05, 5.4e-05]),\n", - " 'tan0': array([0.4, 0.1])},\n", - " 'source': {'id': array([6]),\n", - " 'node': array([1]),\n", - " 'status': array([1], dtype=int8),\n", - " 'u_ref': array([1.])},\n", - " 'sym_load': {'id': array([7, 8]),\n", - " 'node': array([2, 3]),\n", - " 'status': array([1, 1], dtype=int8),\n", - " 'type': array([0, 0], dtype=int8),\n", - " 'p_specified': array([1., 2.]),\n", - " 'q_specified': array([0.5, 1.5])}}" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "input_data = {\n", - " \"node\": node_input,\n", - " \"line\": line_input,\n", - " \"source\": source_input,\n", - " \"sym_load\": sym_load_input,\n", - "}\n", - "\n", - "input_data" + " 
ComponentType.node: node_input,\n", + " ComponentType.line: line_input,\n", + " ComponentType.source: source_input,\n", + " ComponentType.sym_load: sym_load_input,\n", + "}" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -378,40 +297,16 @@ "source": [ "### Use the power-grid-model\n", "\n", - "For more extensive examples, visit the [power-grid-model documentation](https://power-grid-model.readthedocs.io/en/stable/index.html)." + "The `output_component_types` argument is set to `ComponentAttributeFilterOptions.relevant` so that the results are returned as columnar data.\n", + "\n", + "For more extensive examples, visit the [power-grid-model documentation](https://power-grid-model.readthedocs.io/en/stable/index.html).\n" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "pyarrow.RecordBatch\n", - "id: int32\n", - "energized: int8\n", - "u_pu: double\n", - "u: double\n", - "u_angle: double\n", - "p: double\n", - "q: double\n", - "----\n", - "id: [1,2,3]\n", - "energized: [1,1,1]\n", - "u_pu: [1.000324825742982,1.0028788641128945,1.004112854674026]\n", - "u: [10503.410670301311,10530.228073185392,10543.184974077272]\n", - "u_angle: [-0.00006651843181518038,-0.0029317915196012487,-0.004341587216862092]\n", - "p: [338777.2462788448,-1.0000002693184182,-1.9999998867105226]\n", - "q: [-3299418.661306348,-0.5000000701801947,-1.4999998507078594]" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# construct the model\n", "model = PowerGridModel(input_data=input_data, system_frequency=50)\n", @@ -421,6 +316,29 @@ "\n", "# use pandas to tabulate and display the results\n", "sym_node_result = sym_result[ComponentType.node]\n", + "display(pd.DataFrame(sym_node_result))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 
Convert the 
symmetric result to Arrow format" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Converting symmetrical results is straightforward by using the schema from [Creating Schema](#creating-a-schema)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "pa_sym_node_result = pa.record_batch(\n", " sym_node_result, schema=pgm_schema(DatasetType.sym_output, ComponentType.node, sym_node_result.keys())\n", ")\n", @@ -433,7 +351,7 @@ "source": [ "## Single asymmetric calculations\n", "\n", - "Asymmetric calculations have a tuple of values for some of the attributes and are not easily convertible to pandas data frames.\n", + "Asymmetric calculations have a tuple of values for some of the attributes and are not easily convertible to record batches.\n", "Instead, one can have a look at the individual components of those attributes and/or flatten the arrays to access all components." ] }, @@ -454,35 +372,9 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "pyarrow.RecordBatch\n", - "id: int32\n", - "node: int32\n", - "status: int8\n", - "type: int8\n", - "p_specified: fixed_size_list[3]\n", - " child 0, item: double\n", - "q_specified: fixed_size_list[3]\n", - " child 0, item: double\n", - "----\n", - "id: [7,8]\n", - "node: [2,3]\n", - "status: [1,1]\n", - "type: [0,0]\n", - "p_specified: [[1,0.01,0.011],[2,2.5,450]]\n", - "q_specified: [[0.5,1500,0.1],[1.5,2.5,1500]]" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "asym_loads_dict = {\n", " \"id\": [7, 8],\n", @@ -502,30 +394,12 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'id': array([7, 8]),\n", - " 'node': array([2, 3]),\n", - " 'status': array([1, 1], dtype=int8),\n", - 
" 'type': array([0, 0], dtype=int8),\n", - " 'p_specified': array([[1.0e+00, 1.0e-02, 1.1e-02],\n", - " [2.0e+00, 2.5e+00, 4.5e+02]]),\n", - " 'q_specified': array([[5.0e-01, 1.5e+03, 1.0e-01],\n", - " [1.5e+00, 2.5e+00, 1.5e+03]])}" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "def arrow_to_numpy_asym(\n", - " data: pa.lib.Table, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", + " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", ") -> np.ndarray:\n", " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", "\n", @@ -563,76 +437,15 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
          0         1         2\n",
- "0 -0.000067 -2.094462  2.094328\n",
- "1 -0.002930 -2.097322  2.091464\n",
- "2 -0.004338 -2.098733  2.090057
" - ], - "text/plain": [ - " 0 1 2\n", - "0 -0.000067 -2.094462 2.094328\n", - "1 -0.002930 -2.097322 2.091464\n", - "2 -0.004338 -2.098733 2.090057" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "asym_input_data = {\n", - " \"node\": node_input,\n", - " \"line\": line_input,\n", - " \"source\": source_input,\n", - " \"asym_load\": asym_load_input,\n", + " ComponentType.node: node_input,\n", + " ComponentType.line: line_input,\n", + " ComponentType.source: source_input,\n", + " ComponentType.asym_load: asym_load_input,\n", "}\n", "\n", "validate_input_data(asym_input_data, symmetric=False)\n", @@ -642,11 +455,11 @@ "\n", "# run the calculation\n", "asym_result = asym_model.calculate_power_flow(\n", - " symmetric=False, output_component_types=ComponentAttributeFilterOptions.everything\n", + " symmetric=False, output_component_types=ComponentAttributeFilterOptions.relevant\n", ")\n", "\n", "# use pandas to display the results, but beware the data types\n", - "pd.DataFrame(asym_result[\"node\"][\"u_angle\"])" + "pd.DataFrame(asym_result[ComponentType.node][\"u_angle\"])" ] }, { @@ -658,60 +471,9 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "pyarrow.Field" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pgm_schema(DatasetType.asym_output, ComponentType.node).field(\"id\")" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "pyarrow.RecordBatch\n", - "id: int32\n", - "energized: int8\n", - "u_pu: fixed_size_list[3]\n", - " child 0, item: double\n", - "u: fixed_size_list[3]\n", - " child 0, item: double\n", - "u_angle: fixed_size_list[3]\n", - " child 0, item: double\n", - "p: fixed_size_list[3]\n", - " child 0, item: double\n", - "q: 
fixed_size_list[3]\n", - " child 0, item: double\n", - "----\n", - "id: [1,2,3]\n", - "energized: [1,1,1]\n", - "u_pu: [[1.0003248257977395,1.0003243769486854,1.00032436416241],[1.0028803762176164,1.0028710993140406,1.0028730789021523],[1.0041143008174032,1.0041033583077175,1.0041004935738533]]\n", - "u: [[6064.146978239599,6064.144257236815,6064.1441797241405],[6079.639179329456,6079.582941090301,6079.594941705457],[6087.119449677845,6087.053114238262,6087.035747712152]]\n", - "u_angle: [[-0.00006651848125694397,-2.094461573665813,2.09432849798745],[-0.0029298831864832267,-2.0973219974462594,2.0914640024381836],[-0.004337685507209373,-2.098732840554144,2.0900574062078014]]\n", - "p: [[112925.89463805761,112918.13517097049,113364.09104548635],[-0.9999999787945241,-0.009999971449717083,-0.010999979325441034],[-2.0000000113649943,-2.500000072350112,-450.00000008387997]]\n", - "q: [[-1099806.4185888197,-1098301.0302391076,-1098302.79423175],[-0.499999998516201,-1499.9999999095232,-0.10000001915949493],[-1.5000000216889147,-2.50000006806065,-1500.0000000385737]]" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "def numpy_columnar_to_arrow(\n", " data: SingleColumnarData, dataset_type: DatasetType, component_type: ComponentType\n", From b8dd1b08a3aa9e3516a43c094f2eada4acbb19f5 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Mon, 21 Oct 2024 16:36:45 +0200 Subject: [PATCH 06/20] run notebook Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 348 +++++++++++++++++++++++++++--- 1 file changed, 324 insertions(+), 24 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 2cfb6776..81c4aae6 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -21,7 +21,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -46,7 +46,7 @@ }, 
{ "cell_type": "code", - "execution_count": 2, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -93,9 +93,20 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "node: {'names': ['id', 'u_rated'], 'formats': ['[3]\n", + " child 0, item: double\n", + "q_specified: fixed_size_list[3]\n", + " child 0, item: double\n" + ] + } + ], "source": [ "def pgm_schema(dataset_type: DatasetType, component_type: ComponentType, attributes: Iterable[str] | None = None):\n", " schemas = []\n", @@ -168,9 +198,25 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "pyarrow.RecordBatch\n", + "id: int32\n", + "u_rated: double\n", + "----\n", + "id: [1,2,3]\n", + "u_rated: [10500,10500,10500]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "nodes_dict = {\"id\": [1, 2, 3], \"u_rated\": [10500.0, 10500.0, 10500.0]}\n", "\n", @@ -232,9 +278,20 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])}" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "def arrow_to_numpy(\n", " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", @@ -267,7 +324,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ @@ -281,7 +338,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ @@ -304,9 +361,90 @@ }, { "cell_type": "code", - "execution_count": null, 
+ "execution_count": 23, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
   id  energized      u_pu             u   u_angle              p             q\n",
+ "0   1          1  1.000325  10503.410670 -0.000067  338777.246279 -3.299419e+06\n",
+ "1   2          1  1.002879  10530.228073 -0.002932      -1.000000 -5.000001e-01\n",
+ "2   3          1  1.004113  10543.184974 -0.004342      -2.000000 -1.500000e+00
" + ], + "text/plain": [ + " id energized u_pu u u_angle p \\\n", + "0 1 1 1.000325 10503.410670 -0.000067 338777.246279 \n", + "1 2 1 1.002879 10530.228073 -0.002932 -1.000000 \n", + "2 3 1 1.004113 10543.184974 -0.004342 -2.000000 \n", + "\n", + " q \n", + "0 -3.299419e+06 \n", + "1 -5.000001e-01 \n", + "2 -1.500000e+00 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# construct the model\n", "model = PowerGridModel(input_data=input_data, system_frequency=50)\n", @@ -335,9 +473,35 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 24, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "pyarrow.RecordBatch\n", + "id: int32\n", + "energized: int8\n", + "u_pu: double\n", + "u: double\n", + "u_angle: double\n", + "p: double\n", + "q: double\n", + "----\n", + "id: [1,2,3]\n", + "energized: [1,1,1]\n", + "u_pu: [1.000324825742982,1.0028788641128945,1.004112854674026]\n", + "u: [10503.410670301311,10530.228073185392,10543.184974077272]\n", + "u_angle: [-0.00006651843181518038,-0.0029317915196012487,-0.004341587216862092]\n", + "p: [338777.2462788448,-1.0000002693184182,-1.9999998867105226]\n", + "q: [-3299418.661306348,-0.5000000701801947,-1.4999998507078594]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "pa_sym_node_result = pa.record_batch(\n", " sym_node_result, schema=pgm_schema(DatasetType.sym_output, ComponentType.node, sym_node_result.keys())\n", @@ -372,9 +536,35 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "pyarrow.RecordBatch\n", + "id: int32\n", + "node: int32\n", + "status: int8\n", + "type: int8\n", + "p_specified: fixed_size_list[3]\n", + " child 0, item: double\n", + "q_specified: fixed_size_list[3]\n", + " child 0, item: double\n", + "----\n", + "id: [7,8]\n", + "node: [2,3]\n", + 
"status: [1,1]\n", + "type: [0,0]\n", + "p_specified: [[1,0.01,0.011],[2,2.5,450]]\n", + "q_specified: [[0.5,1500,0.1],[1.5,2.5,1500]]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "asym_loads_dict = {\n", " \"id\": [7, 8],\n", @@ -394,9 +584,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 26, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': array([7, 8]),\n", + " 'node': array([2, 3]),\n", + " 'status': array([1, 1], dtype=int8),\n", + " 'type': array([0, 0], dtype=int8),\n", + " 'p_specified': array([[1.0e+00, 1.0e-02, 1.1e-02],\n", + " [2.0e+00, 2.5e+00, 4.5e+02]]),\n", + " 'q_specified': array([[5.0e-01, 1.5e+03, 1.0e-01],\n", + " [1.5e+00, 2.5e+00, 1.5e+03]])}" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "def arrow_to_numpy_asym(\n", " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", @@ -437,9 +645,70 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 27, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
          0         1         2\n",
+ "0 -0.000067 -2.094462  2.094328\n",
+ "1 -0.002930 -2.097322  2.091464\n",
+ "2 -0.004338 -2.098733  2.090057
" + ], + "text/plain": [ + " 0 1 2\n", + "0 -0.000067 -2.094462 2.094328\n", + "1 -0.002930 -2.097322 2.091464\n", + "2 -0.004338 -2.098733 2.090057" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "asym_input_data = {\n", " ComponentType.node: node_input,\n", @@ -471,9 +740,40 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 28, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "pyarrow.RecordBatch\n", + "id: int32\n", + "energized: int8\n", + "u_pu: fixed_size_list[3]\n", + " child 0, item: double\n", + "u: fixed_size_list[3]\n", + " child 0, item: double\n", + "u_angle: fixed_size_list[3]\n", + " child 0, item: double\n", + "p: fixed_size_list[3]\n", + " child 0, item: double\n", + "q: fixed_size_list[3]\n", + " child 0, item: double\n", + "----\n", + "id: [1,2,3]\n", + "energized: [1,1,1]\n", + "u_pu: [[1.0003248257977395,1.0003243769486854,1.00032436416241],[1.0028803762176164,1.0028710993140406,1.0028730789021523],[1.0041143008174032,1.0041033583077175,1.0041004935738533]]\n", + "u: [[6064.146978239599,6064.144257236815,6064.1441797241405],[6079.639179329456,6079.582941090301,6079.594941705457],[6087.119449677845,6087.053114238262,6087.035747712152]]\n", + "u_angle: [[-0.00006651848125694397,-2.094461573665813,2.09432849798745],[-0.0029298831864832267,-2.0973219974462594,2.0914640024381836],[-0.004337685507209373,-2.098732840554144,2.0900574062078014]]\n", + "p: [[112925.89463805761,112918.13517097049,113364.09104548635],[-0.9999999787945241,-0.009999971449717083,-0.010999979325441034],[-2.0000000113649943,-2.500000072350112,-450.00000008387997]]\n", + "q: [[-1099806.4185888197,-1098301.0302391076,-1098302.79423175],[-0.499999998516201,-1499.9999999095232,-0.10000001915949493],[-1.5000000216889147,-2.50000006806065,-1500.0000000385737]]" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": 
[ "def numpy_columnar_to_arrow(\n", " data: SingleColumnarData, dataset_type: DatasetType, component_type: ComponentType\n", From c49b9e9a63466b5ac701392bb4e0d219d3d8c131 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe <78108900+nitbharambe@users.noreply.github.com> Date: Tue, 22 Oct 2024 12:37:41 +0200 Subject: [PATCH 07/20] Apply suggestions from code review minor Co-authored-by: Martijn Govers Signed-off-by: Nitish Bharambe <78108900+nitbharambe@users.noreply.github.com> --- docs/examples/arrow_example.ipynb | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 81c4aae6..1d58d40f 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -10,7 +10,7 @@ "\n", "It is by no means intended to provide complete documentation on the topic, but only to show how such conversions could be done.\n", "\n", - "This example uses `pyarrow.RecordBatch` to demonstrate zero copy operations. The user can choose a `pyarrow.Table` or other structures based on the requirement.\n", + "This example uses `pyarrow.RecordBatch` to demonstrate zero-copy operations. The user can choose a `pyarrow.Table` or other structures based on the requirement.\n", "\n", "**NOTE:** To run this example, the optional `examples` dependencies are required:\n", "\n", @@ -78,7 +78,7 @@ "\n", "Construct the input data for the model and construct the actual model.\n", "\n", - "Arrow uses a columnar data format while the power-grid-model offers both: row based or columnar data format.\n", + "Arrow uses a columnar data format while the power-grid-model offers support for both row based and columnar data format.\n", "Because of this, the columnar data format of power-grid-model provides a zero-copy interface for Arrow data. This differs from the row-based data format, for which conversions always require a copy." 
] }, @@ -123,11 +123,11 @@ "metadata": {}, "source": [ "The primitive types of each attribute in the arrow tables need to match to make the operation efficient.\n", - "A zero copy is not guaranteed if the data types from power_grid_meta_data / initialize_array are not used.\n", + "Zero-copy conversion is not guaranteed if the data types provided via the PGM via `power_grid_meta_data` are not used.\n", "Note that the asymmetric type of attribute in power-grid-model has a shape of `(3,)` along with a specific type. These represent the 3 phases of electrical system.\n", - "Hence asymmetric attributes need to be handled specially. \n", + "Hence, special care is required when handling asymmetric attributes. \n", "\n", - "In this tutorial we use the respective primitive types for the symmetrical attributes and a `FixedSizeListArray` of the primitive types with length 3 for asymmetrical attributes. This results in them being stored as contigious memory which would enable zero copy conversion. There might be other ways to approach this problem too." + "In this example, we use the respective primitive types for the symmetrical attributes and a `FixedSizeListArray` of the primitive types with length 3 for asymmetrical attributes. This results in them being stored as contiguous memory which would enable zero-copy conversion. Other possible solutions to this problem are beyond the scope of this example." ] }, { @@ -179,9 +179,9 @@ " return pa.schema(schemas)\n", "\n", "\n", - "print(\"-------node combined asym scehma-------\")\n", + "print(\"-------node combined asym schema-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.node))\n", - "print(\"-------asym load combined asym scehma-------\")\n", + "print(\"-------asym load combined asym schema-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.asym_load))" ] }, @@ -269,7 +269,7 @@ "Converting Arrow data to columnar NumPy arrays is recommended to leverage the columnar nature of Arrow data. 
 \n", "This conversion can be done with zero-copy operations.\n", "\n", - "Similar approach be adopted by the user to convert to row based data.\n", + "A similar approach can be adopted by the user to convert to row based data.\n", "\n", "```{note}\n", "The option of `zero_copy_only` in the function below is added in this demo to verify no copies are made. Its usage is not mandatory to do zero copy conversion.\n", From c934a5cb43553181b34fa0e824c628a92580f72f Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Tue, 22 Oct 2024 13:47:20 +0200 Subject: [PATCH 08/20] mypy fix Signed-off-by: Nitish Bharambe --- src/power_grid_model_io/converters/tabular_converter.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/power_grid_model_io/converters/tabular_converter.py b/src/power_grid_model_io/converters/tabular_converter.py index 81b20515..024b278f 100644 --- a/src/power_grid_model_io/converters/tabular_converter.py +++ b/src/power_grid_model_io/converters/tabular_converter.py @@ -838,7 +838,7 @@ def get_id(row: pd.Series) -> int: key = row.dropna().to_dict() row_table = key.pop("table") if table is None and "table" in key else table row_name = key.pop("name") if name is None and "name" in key else name - return self.get_id(table=row_table, key=key, name=row_name) + return self.get_id(table=cast(row_table, str), key=key, name=row_name) return keys.apply(get_id, axis=1).to_list() From c5ef4020658f3bba3ef96255f10100f49c1f33e1 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Tue, 22 Oct 2024 14:11:26 +0200 Subject: [PATCH 09/20] address comments Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 89 +++++++++++++------------------ 1 file changed, 36 insertions(+), 53 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 81c4aae6..e8815b70 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -21,7 +21,7 @@ }, { "cell_type": "code", - "execution_count": 
15, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -44,16 +44,6 @@ "import numpy as np" ] }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "# A constant showing error message\n", - "ZERO_COPY_ERROR_MSG = \"Zero-copy conversion requested, but the data types do not match.\"" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -93,7 +83,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -142,17 +132,17 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "-------node combined asym scehma-------\n", + "-------node asym scehma-------\n", "id: int32\n", "u_rated: double\n", - "-------asym load combined asym scehma-------\n", + "-------asym load scehma-------\n", "id: int32\n", "node: int32\n", "status: int8\n", @@ -179,9 +169,9 @@ " return pa.schema(schemas)\n", "\n", "\n", - "print(\"-------node combined asym scehma-------\")\n", + "print(\"-------node asym scehma-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.node))\n", - "print(\"-------asym load combined asym scehma-------\")\n", + "print(\"-------asym load scehma-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.asym_load))" ] }, @@ -198,7 +188,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -212,7 +202,7 @@ "u_rated: [10500,10500,10500]" ] }, - "execution_count": 19, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -272,13 +262,14 @@ "Similar approach be adopted by the user to convert to row based data.\n", "\n", "```{note}\n", - "The option of `zero_copy_only` in the function below is added in this demo to verify no copies are made. 
Its usage is not mandatory to do zero copy conversion.\n", + "The option of `zero_copy_only` in the function below and assert for correct dtype is added in this demo to verify no copies are made. \n", + "Its usage is not mandatory to do zero copy conversion.\n", "```" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -287,27 +278,24 @@ "{'id': array([1, 2, 3]), 'u_rated': array([10500., 10500., 10500.])}" ] }, - "execution_count": 20, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def arrow_to_numpy(\n", - " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", - ") -> np.ndarray:\n", + "def arrow_to_numpy(data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType) -> np.ndarray:\n", " \"\"\"Convert Arrow data to NumPy data.\"\"\"\n", " result = {}\n", " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", " for name, column in zip(data.column_names, data.columns):\n", - " column_data = column.to_numpy(zero_copy_only=zero_copy_only)\n", - " if zero_copy_only and column_data.dtype != result_dtype[name]:\n", - " raise ValueError(ZERO_COPY_ERROR_MSG)\n", + " column_data = column.to_numpy(zero_copy_only=True)\n", + " assert column_data.dtype == result_dtype[name]\n", " result[name] = column_data.astype(dtype=result_dtype[name], copy=False)\n", " return result\n", "\n", "\n", - "node_input = arrow_to_numpy(nodes, DatasetType.input, ComponentType.node, zero_copy_only=True)\n", + "node_input = arrow_to_numpy(nodes, DatasetType.input, ComponentType.node)\n", "line_input = arrow_to_numpy(lines, DatasetType.input, ComponentType.line)\n", "source_input = arrow_to_numpy(sources, DatasetType.input, ComponentType.source)\n", "sym_load_input = arrow_to_numpy(sym_loads, DatasetType.input, ComponentType.sym_load)\n", @@ -324,7 +312,7 @@ }, { "cell_type": "code", - 
"execution_count": 21, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -338,7 +326,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -361,7 +349,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -473,7 +461,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -497,7 +485,7 @@ "q: [-3299418.661306348,-0.5000000701801947,-1.4999998507078594]" ] }, - "execution_count": 24, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -536,7 +524,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -560,7 +548,7 @@ "q_specified: [[0.5,1500,0.1],[1.5,2.5,1500]]" ] }, - "execution_count": 25, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -584,7 +572,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -600,15 +588,13 @@ " [1.5e+00, 2.5e+00, 1.5e+03]])}" ] }, - "execution_count": 26, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def arrow_to_numpy_asym(\n", - " data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType, zero_copy_only: bool = False\n", - ") -> np.ndarray:\n", + "def arrow_to_numpy_asym(data: pa.RecordBatch, dataset_type: DatasetType, component_type: ComponentType) -> np.ndarray:\n", " \"\"\"Convert asymmetric Arrow data to NumPy data.\n", "\n", " This function is similar to the arrow_to_numpy function, but also supports asymmetric data.\"\"\"\n", @@ -621,17 +607,15 @@ " dtype = result_dtype[name]\n", "\n", " if len(dtype.shape) == 0:\n", - " column_data = data.column(name).to_numpy(zero_copy_only=zero_copy_only)\n", + " column_data = data.column(name).to_numpy(zero_copy_only=True)\n", " else:\n", - " 
column_data = data.column(name).flatten().to_numpy(zero_copy_only=zero_copy_only).reshape(-1, 3)\n", - "\n", - " if zero_copy_only and column_data.dtype.base != dtype.base:\n", - " raise ValueError(ZERO_COPY_ERROR_MSG)\n", + " column_data = data.column(name).flatten().to_numpy(zero_copy_only=True).reshape(-1, 3)\n", + " assert column_data.dtype.base == dtype.base\n", " result[name] = column_data.astype(dtype=dtype.base, copy=False)\n", " return result\n", "\n", "\n", - "asym_load_input = arrow_to_numpy_asym(asym_loads, DatasetType.input, ComponentType.asym_load, zero_copy_only=True)\n", + "asym_load_input = arrow_to_numpy_asym(asym_loads, DatasetType.input, ComponentType.asym_load)\n", "\n", "asym_load_input" ] @@ -645,7 +629,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -704,9 +688,8 @@ "2 -0.004338 -2.098733 2.090057" ] }, - "execution_count": 27, "metadata": {}, - "output_type": "execute_result" + "output_type": "display_data" } ], "source": [ @@ -728,7 +711,7 @@ ")\n", "\n", "# use pandas to display the results, but beware the data types\n", - "pd.DataFrame(asym_result[ComponentType.node][\"u_angle\"])" + "display(pd.DataFrame(asym_result[ComponentType.node][\"u_angle\"]))" ] }, { @@ -740,7 +723,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -769,7 +752,7 @@ "q: [[-1099806.4185888197,-1098301.0302391076,-1098302.79423175],[-0.499999998516201,-1499.9999999095232,-0.10000001915949493],[-1.5000000216889147,-2.50000006806065,-1500.0000000385737]]" ] }, - "execution_count": 28, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } From 558c5d6875e406b2060d4254d24b241bb11cf8d6 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Tue, 22 Oct 2024 16:39:23 +0200 Subject: [PATCH 10/20] change initialization Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 122 +++++++++++++++++++++--------- 1 
file changed, 87 insertions(+), 35 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 9fae52a6..f80b710f 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -183,7 +183,8 @@ "\n", "The [power-grid-model documentation on Components](https://power-grid-model.readthedocs.io/en/stable/user_manual/components.html) provides documentation on which components are required and which ones are optional.\n", "\n", - "Construct the Arrow data as a table with the correct headers and data types." + "Construct the Arrow data as a table with the correct headers and data types. \n", + "The creation of arrays and combining it in a RecordBatch as well as the method of initializing that RecordBatch is up to the user." ] }, { @@ -208,41 +209,54 @@ } ], "source": [ - "nodes_dict = {\"id\": [1, 2, 3], \"u_rated\": [10500.0, 10500.0, 10500.0]}\n", - "\n", - "\n", - "lines_dict = {\n", - " \"id\": [4, 5],\n", - " \"from_node\": [1, 2],\n", - " \"to_node\": [2, 3],\n", - " \"from_status\": [1, 1],\n", - " \"to_status\": [1, 1],\n", - " \"r1\": [0.11, 0.15],\n", - " \"x1\": [0.12, 0.16],\n", - " \"c1\": [4.1e-05, 5.4e-05],\n", - " \"tan1\": [0.1, 0.1],\n", - " \"r0\": [0.01, 0.05],\n", - " \"x0\": [0.22, 0.06],\n", - " \"c0\": [4.1e-05, 5.4e-05],\n", - " \"tan0\": [0.4, 0.1],\n", - "}\n", - "\n", - "sources_dict = {\"id\": [6], \"node\": [1], \"status\": [1], \"u_ref\": [1.0]}\n", + "nodes_schema = pgm_schema(DatasetType.input, ComponentType.node)\n", + "nodes = pa.record_batch(\n", + " [\n", + " pa.array([1, 2, 3], type=nodes_schema.field(\"id\").type),\n", + " pa.array([10500.0, 10500.0, 10500.0], type=nodes_schema.field(\"u_rated\").type),\n", + " ],\n", + " names=(\"id\", \"u_rated\"),\n", + ")\n", "\n", - "sym_loads_dict = {\n", - " \"id\": [7, 8],\n", - " \"node\": [2, 3],\n", - " \"status\": [1, 1],\n", - " \"type\": [0, 0],\n", - " \"p_specified\": [1.0, 2.0],\n", - " \"q_specified\": [0.5, 
1.5],\n", - "}\n", + "lines = pa.record_batch(\n", + " {\n", + " \"id\": [4, 5],\n", + " \"from_node\": [1, 2],\n", + " \"to_node\": [2, 3],\n", + " \"from_status\": [1, 1],\n", + " \"to_status\": [1, 1],\n", + " \"r1\": [0.11, 0.15],\n", + " \"x1\": [0.12, 0.16],\n", + " \"c1\": [4.1e-05, 5.4e-05],\n", + " \"tan1\": [0.1, 0.1],\n", + " \"r0\": [0.01, 0.05],\n", + " \"x0\": [0.22, 0.06],\n", + " \"c0\": [4.1e-05, 5.4e-05],\n", + " \"tan0\": [0.4, 0.1],\n", + " },\n", + " schema=pgm_schema(\n", + " DatasetType.input,\n", + " ComponentType.line,\n", + " [\"id\", \"from_node\", \"to_node\", \"from_status\", \"to_status\", \"r1\", \"x1\", \"c1\", \"tan1\", \"r0\", \"x0\", \"c0\", \"tan0\"],\n", + " ),\n", + ")\n", "\n", - "nodes = pa.record_batch(nodes_dict, schema=pgm_schema(DatasetType.input, ComponentType.node, nodes_dict.keys()))\n", - "lines = pa.record_batch(lines_dict, schema=pgm_schema(DatasetType.input, ComponentType.line, lines_dict.keys()))\n", - "sources = pa.record_batch(sources_dict, schema=pgm_schema(DatasetType.input, ComponentType.source, sources_dict.keys()))\n", + "sources = pa.record_batch(\n", + " {\"id\": [6], \"node\": [1], \"status\": [1], \"u_ref\": [1.0]},\n", + " schema=pgm_schema(DatasetType.input, ComponentType.source, [\"id\", \"node\", \"status\", \"u_ref\"]),\n", + ")\n", "sym_loads = pa.record_batch(\n", - " sym_loads_dict, schema=pgm_schema(DatasetType.input, ComponentType.sym_load, sym_loads_dict.keys())\n", + " {\n", + " \"id\": [7, 8],\n", + " \"node\": [2, 3],\n", + " \"status\": [1, 1],\n", + " \"type\": [0, 0],\n", + " \"p_specified\": [1.0, 2.0],\n", + " \"q_specified\": [0.5, 1.5],\n", + " },\n", + " schema=pgm_schema(\n", + " DatasetType.input, ComponentType.sym_load, [\"id\", \"node\", \"status\", \"type\", \"p_specified\", \"q_specified\"]\n", + " ),\n", ")\n", "\n", "nodes\n", @@ -349,7 +363,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -564,7 +578,17 @@ 
"}\n", "\n", "asym_loads = pa.record_batch(\n", - " asym_loads_dict, schema=pgm_schema(DatasetType.input, ComponentType.asym_load, asym_loads_dict.keys())\n", + " {\n", + " \"id\": [7, 8],\n", + " \"node\": [2, 3],\n", + " \"status\": [1, 1],\n", + " \"type\": [0, 0],\n", + " \"p_specified\": [[1.0, 1.0e-2, 1.1e-2], [2.0, 2.5, 4.5e2]],\n", + " \"q_specified\": [[0.5, 1.5e3, 0.1], [1.5, 2.5, 1.5e3]],\n", + " },\n", + " schema=pgm_schema(\n", + " DatasetType.input, ComponentType.asym_load, [\"id\", \"node\", \"status\", \"type\", \"p_specified\", \"q_specified\"]\n", + " ),\n", ")\n", "\n", "asym_loads" @@ -782,6 +806,34 @@ "pa_asym_node_result" ] }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "[\n", + " 1,\n", + " 0.01,\n", + " 0.011,\n", + " 2,\n", + " 2.5,\n", + " 450\n", + "]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pa.array(asym_load_input[\"p_specified\"].flatten(), type=pa.float64())" + ] + }, { "cell_type": "markdown", "metadata": {}, From 045f0b1bb7a6b2043a7af6fa6bcc2220d065a5f0 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Wed, 23 Oct 2024 08:15:48 +0200 Subject: [PATCH 11/20] minor correction Signed-off-by: Nitish Bharambe --- src/power_grid_model_io/converters/tabular_converter.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/power_grid_model_io/converters/tabular_converter.py b/src/power_grid_model_io/converters/tabular_converter.py index 024b278f..378507ab 100644 --- a/src/power_grid_model_io/converters/tabular_converter.py +++ b/src/power_grid_model_io/converters/tabular_converter.py @@ -838,7 +838,7 @@ def get_id(row: pd.Series) -> int: key = row.dropna().to_dict() row_table = key.pop("table") if table is None and "table" in key else table row_name = key.pop("name") if name is None and "name" in key else name - return self.get_id(table=cast(row_table, str), key=key, 
name=row_name) + return self.get_id(table=cast(str, row_table), key=key, name=row_name) return keys.apply(get_id, axis=1).to_list() From ff11badd15c375704da6b0942b34262a946a8c70 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Wed, 23 Oct 2024 08:24:01 +0200 Subject: [PATCH 12/20] resolve comments Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index f80b710f..35e1ae72 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -90,18 +90,22 @@ "name": "stdout", "output_type": "stream", "text": [ - "node: {'names': ['id', 'u_rated'], 'formats': ['\n", + "\n", "[\n", " 1,\n", " 0.01,\n", From 198177b4934e2738a8de18a5e89443ab5aaaa31e Mon Sep 17 00:00:00 2001 From: Nitish Bharambe <78108900+nitbharambe@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:09:02 +0200 Subject: [PATCH 13/20] Apply suggestions from code review Co-authored-by: Martijn Govers Signed-off-by: Nitish Bharambe <78108900+nitbharambe@users.noreply.github.com> --- docs/examples/arrow_example.ipynb | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 35e1ae72..f457e6a1 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -117,7 +117,7 @@ "metadata": {}, "source": [ "The primitive types of each attribute in the arrow tables need to match to make the operation efficient.\n", - "Zero-copy conversion is not guaranteed if the data types provided via the PGM via `power_grid_meta_data` are not used.\n", + "Zero-copy conversion is not guaranteed if the data types provided by the PGM via `power_grid_meta_data` are not used.\n", "Note that the asymmetric type of attribute in power-grid-model has a shape of `(3,)` along with a specific type. 
These represent the 3 phases of electrical system.\n", "Hence, special care is required when handling asymmetric attributes. \n", "\n", @@ -143,10 +143,10 @@ "name": "stdout", "output_type": "stream", "text": [ - "-------node scehma-------\n", + "-------node schema-------\n", "id: int32\n", "u_rated: double\n", - "-------asym load scehma-------\n", + "-------asym load schema-------\n", "id: int32\n", "node: int32\n", "status: int8\n", @@ -173,9 +173,9 @@ " return pa.schema(schemas)\n", "\n", "\n", - "print(\"-------node scehma-------\")\n", + "print(\"-------node schema-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.node))\n", - "print(\"-------asym load scehma-------\")\n", + "print(\"-------asym load schema-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.asym_load))" ] }, @@ -188,7 +188,7 @@ "The [power-grid-model documentation on Components](https://power-grid-model.readthedocs.io/en/stable/user_manual/components.html) provides documentation on which components are required and which ones are optional.\n", "\n", "Construct the Arrow data as a table with the correct headers and data types. \n", - "The creation of arrays and combining it in a RecordBatch as well as the method of initializing that RecordBatch is up to the user." + "The creation and initialization of arrays and combining the data in a RecordBatch is up to the user." 
] }, { @@ -213,6 +213,7 @@ } ], "source": [ + "# create the individual columns with the correct data type\n", "nodes_schema = pgm_schema(DatasetType.input, ComponentType.node)\n", "nodes = pa.record_batch(\n", " [\n", @@ -222,6 +223,7 @@ " names=(\"id\", \"u_rated\"),\n", ")\n", "\n", + "# or convert directly using the schema\n" "lines = pa.record_batch(\n", " {\n", " \"id\": [4, 5],\n", From 8058be65162b1479342196f114710a294aeed919 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Wed, 23 Oct 2024 13:09:55 +0200 Subject: [PATCH 14/20] rerun example Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index f457e6a1..2a7d72bd 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -117,7 +117,7 @@ "metadata": {}, "source": [ "The primitive types of each attribute in the arrow tables need to match to make the operation efficient.\n", - "Zero-copy conversion is not guaranteed if the data types provided by the PGM via `power_grid_meta_data` are not used.\n", + "Zero-copy conversion is not guaranteed if the data types provided via the PGM via `power_grid_meta_data` are not used.\n", "Note that the asymmetric type of attribute in power-grid-model has a shape of `(3,)` along with a specific type. These represent the 3 phases of electrical system.\n", "Hence, special care is required when handling asymmetric attributes. 
\n", "\n", @@ -143,10 +143,10 @@ "name": "stdout", "output_type": "stream", "text": [ - "-------node schema-------\n", + "-------node scehma-------\n", "id: int32\n", "u_rated: double\n", - "-------asym load schema-------\n", + "-------asym load scehma-------\n", "id: int32\n", "node: int32\n", "status: int8\n", @@ -173,9 +173,9 @@ " return pa.schema(schemas)\n", "\n", "\n", - "print(\"-------node schema-------\")\n", + "print(\"-------node scehma-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.node))\n", - "print(\"-------asym load schema-------\")\n", + "print(\"-------asym load scehma-------\")\n", "print(pgm_schema(DatasetType.input, ComponentType.asym_load))" ] }, @@ -188,12 +188,12 @@ "The [power-grid-model documentation on Components](https://power-grid-model.readthedocs.io/en/stable/user_manual/components.html) provides documentation on which components are required and which ones are optional.\n", "\n", "Construct the Arrow data as a table with the correct headers and data types. \n", - "The creation and initialization of arrays and combining the data in a RecordBatch is up to the user." + "The creation of arrays and combining it in a RecordBatch as well as the method of initializing that RecordBatch is up to the user." 
] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -213,7 +213,6 @@ } ], "source": [ - "# create the individual columns with the correct data type\n", "nodes_schema = pgm_schema(DatasetType.input, ComponentType.node)\n", "nodes = pa.record_batch(\n", " [\n", @@ -223,7 +222,6 @@ " names=(\"id\", \"u_rated\"),\n", ")\n", "\n", - "# or convert directly using the schema\n" "lines = pa.record_batch(\n", " {\n", " \"id\": [4, 5],\n", @@ -369,7 +367,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -792,7 +790,6 @@ " data: SingleColumnarData, dataset_type: DatasetType, component_type: ComponentType\n", ") -> pa.RecordBatch:\n", " \"\"\"Convert NumPy data to Arrow data.\"\"\"\n", - " # pa.record_batch.from_arrays(data, schema=pgm_schema(DatasetType.result, ComponentType.node))\n", " component_pgm_schema = pgm_schema(dataset_type, component_type, data.keys())\n", " pa_columns = {}\n", " for attribute, data in data.items():\n", @@ -820,7 +817,7 @@ { "data": { "text/plain": [ - "\n", + "\n", "[\n", " 1,\n", " 0.01,\n", From f9d9f4ab0d3b36083a65a6cf29ff57abd55ca216 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Wed, 23 Oct 2024 13:21:31 +0200 Subject: [PATCH 15/20] rerun example 2 Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 24 ++++++++++-------------- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 2a7d72bd..71af5730 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -90,22 +90,18 @@ "name": "stdout", "output_type": "stream", "text": [ - "node: ComponentMetaData(dtype=dtype([('id', '\n", + "\n", "[\n", " 1,\n", " 0.01,\n", From 720c9d6299a74637d30c2c954717819aed13b8af Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Wed, 23 Oct 2024 13:23:46 +0200 Subject: [PATCH 16/20] rerun example 3 
Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 71af5730..47fbecec 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -363,7 +363,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -813,7 +813,7 @@ { "data": { "text/plain": [ - "\n", + "\n", "[\n", " 1,\n", " 0.01,\n", From 855d1965e366eef15c9d9214752b1edbe04f284e Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Wed, 23 Oct 2024 14:52:57 +0200 Subject: [PATCH 17/20] address comment Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index 47fbecec..fb4a3320 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -273,12 +273,7 @@ "Converting Arrow data to columnar NumPy arrays is recommended to leverage the columnar nature of Arrow data. \n", "This conversion can be done with zero-copy operations.\n", "\n", - "A similar approach be adopted by the user to convert to row based data.\n", - "\n", - "```{note}\n", - "The option of `zero_copy_only` in the function below and assert for correct dtype is added in this demo to verify no copies are made. \n", - "Its usage is not mandatory to do zero copy conversion.\n", - "```" + "A similar approach be adopted by the user to convert to row based data." ] }, { @@ -303,6 +298,8 @@ " result = {}\n", " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", " for name, column in zip(data.column_names, data.columns):\n", + " # The use of zero_copy_only=True and assert statement is to verify if no copies are made. 
\n", + " # They are not mandatory for a zero-copy conversion.\n", " column_data = column.to_numpy(zero_copy_only=True)\n", " assert column_data.dtype == result_dtype[name]\n", " result[name] = column_data.astype(dtype=result_dtype[name], copy=False)\n", @@ -813,7 +810,7 @@ { "data": { "text/plain": [ - "\n", + "\n", "[\n", " 1,\n", " 0.01,\n", From aa7cabee0aa33421c7608120745adfbbba15777b Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Thu, 24 Oct 2024 08:50:38 +0200 Subject: [PATCH 18/20] reformat Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index fb4a3320..c307213b 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -298,7 +298,7 @@ " result = {}\n", " result_dtype = power_grid_meta_data[dataset_type][component_type].dtype\n", " for name, column in zip(data.column_names, data.columns):\n", - " # The use of zero_copy_only=True and assert statement is to verify if no copies are made. 
\n", + " # The use of zero_copy_only=True and assert statement is to verify if no copies are made.\n", " # They are not mandatory for a zero-copy conversion.\n", " column_data = column.to_numpy(zero_copy_only=True)\n", " assert column_data.dtype == result_dtype[name]\n", From 5b5aff51631e09199866586a32848017dad97441 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Thu, 24 Oct 2024 13:38:07 +0200 Subject: [PATCH 19/20] add comment Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 29 +---------------------------- 1 file changed, 1 insertion(+), 28 deletions(-) diff --git a/docs/examples/arrow_example.ipynb b/docs/examples/arrow_example.ipynb index c307213b..8c2f59bc 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -162,6 +162,7 @@ " if attributes is not None and meta_attribute not in attributes:\n", " continue\n", " if dtype.shape == (3,):\n", + " # The asymmetric attributes are stored as a fixed list array of 3 elements\n", " pa_dtype = pa.list_(pa.from_numpy_dtype(dtype.base), 3)\n", " else:\n", " pa_dtype = pa.from_numpy_dtype(dtype)\n", @@ -802,34 +803,6 @@ "pa_asym_node_result" ] }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "[\n", - " 1,\n", - " 0.01,\n", - " 0.011,\n", - " 2,\n", - " 2.5,\n", - " 450\n", - "]" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pa.array(asym_load_input[\"p_specified\"].flatten(), type=pa.float64())" - ] - }, { "cell_type": "markdown", "metadata": {}, From dc710a7e199cc772c2bad27f1d80c1a467b40789 Mon Sep 17 00:00:00 2001 From: Nitish Bharambe Date: Thu, 24 Oct 2024 16:02:23 +0200 Subject: [PATCH 20/20] address comments Signed-off-by: Nitish Bharambe --- docs/examples/arrow_example.ipynb | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/docs/examples/arrow_example.ipynb 
b/docs/examples/arrow_example.ipynb index 8c2f59bc..4593b451 100644 --- a/docs/examples/arrow_example.ipynb +++ b/docs/examples/arrow_example.ipynb @@ -117,9 +117,16 @@ "Note that the asymmetric type of attribute in power-grid-model has a shape of `(3,)` along with a specific type. These represent the 3 phases of electrical system.\n", "Hence, special care is required when handling asymmetric attributes. \n", "\n", - "In this example, we use the respective primitive types for the symmetrical attributes and a `FixedSizeListArray` of the primitive types with length 3 for asymmetrical attributes. This results in them being stored as contiguous memory which would enable zero-copy conversion. Other possible solutions to this problem are beyond the scope of this example." + "In this example, we use the respective primitive types for the symmetrical attributes and a `FixedSizeListArray` of the primitive types with length 3 for asymmetrical attributes. This results in them being stored as contiguous memory which would enable zero-copy conversion. Other workarounds are possible, but are beyond the scope of this example." ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, @@ -270,11 +277,11 @@ "source": [ "### Convert the Arrow data to power-grid-model input data\n", "\n", - "The Arrow record batch or tables can then be converted to row based data or columnar data.\n", + "The Arrow `RecordBatch` or `Table` can then be converted to row based data or columnar data.\n", "Converting Arrow data to columnar NumPy arrays is recommended to leverage the columnar nature of Arrow data. \n", "This conversion can be done with zero-copy operations.\n", "\n", - "A similar approach be adopted by the user to convert to row based data." + "A similar approach can be adopted by the user to convert to row based data instead."
] }, { @@ -468,7 +475,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Converting symmetrical results is straightforward by using schema from [Creating Schema](#creating-a-schema)" + "Converting symmetrical results is straightforward using the schema from [Creating Schema](#creating-a-schema).\n", + "Using types other than the ones from this schema might make a copy of the data. " ] }, {