A web crawler written in Python that automates file downloads from https://www.moodle.tum.de/
- Python (>= 3.9), including pip, installed and available via the command line.
- Clone or download this repository to your local file system.
- Install all Python requirements specified in requirements.txt by running
  $ pip install -r requirements.txt
  from the project's directory.
Note: use pip3 if pip is linked to Python 2.x; the same applies to python and python3.
- $ python3 src/main.py download
  to download resources from Moodle based on your configuration in src/course_config.json (see the section below for more information on the configuration)
- $ python3 src/main.py download course
  to download resources from the specified Moodle course based on your configuration in src/course_config.json
- $ python3 src/main.py download course file_pattern destination
  to download resources which match the file_pattern from course to a destination path
- $ python3 src/main.py list [course]
  to list available resources of the specified course or, if no course is specified, list available courses for your Moodle account
- $ python3 src/main.py -h
  for general help on how to use the program
- $ python3 src/main.py list -h
  for help concerning the list command
- $ python3 src/main.py download -h
  for help concerning the download command
Note:
- Upon running one of the commands you will be prompted to enter your Moodle credentials. The username will be stored in a credentials.json in the src directory. You may also manually add your password to the credentials.json if you don't want to type it every time you run the script. This is discouraged, though, as your password will be stored in plain text!
- Use python or py instead of python3 on Windows.
You can configure which files are downloaded from which courses, and where they are stored, by editing the file course_config.json in the src directory. Additionally, you can specify what should happen if the file to be downloaded already exists at the specified destination path. The download configuration is specified in download_config.json, which is located in the same src directory.
The following example shows how the configuration works:
- Example contents of course_config.json:
  [
    {
      "course_name": "Analysis für Informatik",
      "semester": "WS19_20",
      "rules": [
        {
          "file_pattern": "Hausaufgabe.*",
          "destination": "C:\\Users\\<yourUsername>\\Documents\\Uni\\Analysis\\Hausaufgaben",
          "update_handling": "update"
        },
        {
          "file_pattern": ".*E-Test.*",
          "destination": "C:\\Users\\<yourUsername>\\Documents\\Uni\\Analysis\\E-Tests",
          "update_handling": "replace"
        }
      ]
    },
    {
      "course_name": "Numerisches Programmieren",
      "semester": "WS19_20",
      "rules": [
        {
          "file_pattern": "(Übungsblatt.*|Musterlösung Blatt.*)",
          "destination": "C:\\Users\\<yourUsername>\\Documents\\Uni\\NumProg\\Übungen\\",
          "update_handling": "add"
        }
      ]
    }
  ]
- Upon running
  $ python src/main.py download
  the program goes through the configuration objects for the different courses one by one. For each course, all available resources are checked against the rules specified for that course. If a resource name matches a pattern specified in one of the rules, the resource is downloaded to the destination path defined by that rule (no other rules are applied to that resource afterwards). If the resource already exists locally, the specified update_handling is applied.
- In the example at hand, all resources of the course "Analysis für Informatik [MA0902]" whose names start with "Hausaufgabe" are downloaded to the folder "C:\Users\<yourUsername>\Documents\Uni\Analysis\Hausaufgaben". If the respective file already exists, it is updated only if the online version is newer; otherwise the download is skipped ("update"). Resources whose names contain "E-Test" are downloaded to the destination defined by the respective rule; here an existing file is overwritten ("replace"). Resources of the course "Numerisches Programmieren (IN0019)" whose names start with either "Übungsblatt" or "Musterlösung Blatt" are downloaded to "C:\Users\<yourUsername>\Documents\Uni\NumProg\Übungen". Here a new version of the file is added (e.g. Übungsblatt 12 (1).pdf) if the file already exists at the specified destination.
- Important: resources from courses which are not listed in the configuration file, and resources to which none of the rules apply, are not downloaded.
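The rule lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name find_rule and the shortened destination values are hypothetical, and the use of re.match (anchored at the start of the name only) is an assumption that happens to be consistent with the example patterns.

```python
import re

# Hypothetical rule list mirroring the "rules" array of one course entry.
RULES = [
    {"file_pattern": "Hausaufgabe.*", "destination": "Hausaufgaben", "update_handling": "update"},
    {"file_pattern": ".*E-Test.*", "destination": "E-Tests", "update_handling": "replace"},
]


def find_rule(resource_name, rules):
    """Return the first rule whose pattern matches the resource name, else None."""
    for rule in rules:
        # Assumption: patterns are anchored at the start of the name (re.match).
        if re.match(rule["file_pattern"], resource_name):
            return rule  # first match wins; later rules are not consulted
    return None


for name in ["Hausaufgabe 3.pdf", "Lösung E-Test 2", "Vorlesungsfolien 1"]:
    rule = find_rule(name, RULES)
    print(name, "->", rule["destination"] if rule else "not downloaded")
```

Appending a rule with the pattern ".*" to the end of such a list would turn the last case into a download, since every name matches it.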
- Options for the update_handling are:
  - "update" --> if the online resource has a newer modification date than the local copy, the local version gets updated. Otherwise, the download is skipped.
  - "skip" --> the download is skipped if the file already exists locally
  - "replace" --> existing local files are simply overwritten by the download
  - "add" --> a new version in the form "filename (versionnumber).extension" is added to the specified destination if the file already exists locally
  - If nothing is specified for the update_handling, existing local files are overwritten
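To make the four options concrete, here is a small sketch of how a target path could be chosen for each of them. The helper names next_free_path and resolve_target are hypothetical, and comparing POSIX timestamps for "update" is an assumption; the project may obtain and compare modification dates differently.

```python
from pathlib import Path


def next_free_path(path):
    """Return path itself if unused, else 'name (n).ext' with the smallest free n >= 1."""
    if not path.exists():
        return path
    n = 1
    while True:
        candidate = path.with_name(f"{path.stem} ({n}){path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1


def resolve_target(path, update_handling, remote_mtime):
    """Return the path to write to, or None if the download should be skipped.

    remote_mtime is the online resource's modification time as a POSIX timestamp
    (an assumption made for this sketch).
    """
    if not path.exists():
        return path
    if update_handling == "skip":
        return None
    if update_handling == "update":
        return path if remote_mtime > path.stat().st_mtime else None
    if update_handling == "add":
        return next_free_path(path)
    # "replace" and an unspecified value both overwrite the existing file
    return path
```

With an existing "Übungsblatt 12.pdf", the "add" option would resolve to "Übungsblatt 12 (1).pdf", then "Übungsblatt 12 (2).pdf" on the next run, and so on.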
- Note:
  - Running
    $ python src/main.py download "Analysis für Informatik"
    downloads only the resources for the course "Analysis für Informatik" based on the configuration file.
  - Use ".*" as the pattern for the last rule if you want files for which none of the other rules apply to be downloaded.
  - The pattern matching is based on regular expressions (RegEx).
  - The course name only needs to be a substring of the full course name. If several of your Moodle courses match the specified course name, currently only the first one found is taken into account.
  - Currently the value specified for the semester is not used.
- Example contents of download_config.json:
  [
    {
      "parallel_downloads": true
    }
  ]
  As stated in the debug message, parallel_downloads makes the logging less readable but greatly improves execution time.
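The trade-off behind parallel_downloads can be illustrated with a thread pool. This is only a sketch under assumptions: the source does not say how the project parallelises downloads, and fake_download and download_all are hypothetical stand-ins. With several workers running at once, their log messages can interleave, which would explain the less readable logging.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fake_download(name):
    """Stand-in for a real download; returns a status string."""
    return f"downloaded {name}"


def download_all(names, parallel):
    """Run the downloads either in a thread pool or one after another."""
    if parallel:
        # Worker threads overlap the waiting time of the individual transfers.
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(fake_download, names))  # map keeps input order
    return [fake_download(name) for name in names]


# Same structure as the example download_config.json
config = json.loads('[{"parallel_downloads": true}]')
results = download_all(["Blatt1.pdf", "Blatt2.pdf"], config[0]["parallel_downloads"])
print(results)
```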
- $ python src/main.py list "Analysis für Informatik"
  will list all resources of the course Analysis für Informatik available for download.
- $ python src/main.py list -f "Analysis für Informatik"
  will list all files of the course Analysis für Informatik available for download.
- $ python src/main.py download "Analysis für Informatik" "Hausaufgabe 10" "~/Documents/Uni/WS19/Analysis/Hausaufgaben"
  will search the user's courses for Analysis für Informatik and find a matching course (e.g. "Analysis für Informatik [MA0902]"). In this example, the script will search for Hausaufgabe 10 and find the assignment "Hausaufgabe 10 und Präsenzaufgaben der Woche". The script will then navigate to the assignment's page and download the associated file, "Blatt10.pdf", which will then be saved in the specified path ~/Documents/Uni/WS19/Analysis/Hausaufgaben.
- $ python src/main.py download "Analysis für Informatik" "Hausaufgabe.*" "~/Documents/Uni/WS19/Analysis/Hausaufgaben"
  works like the command above, but finds multiple files that start with Hausaufgabe and downloads them all.