
Having difficulties to run the Docker Service Mode #8

Open
Gi7w0rm opened this issue Mar 25, 2022 · 4 comments

Comments

Gi7w0rm commented Mar 25, 2022

Hey @jstrosch,
It's me again.
As I told you some days ago, I am having trouble getting the service mode of SubCrawl to run.
I am still using the following system:

Distributor ID:	Ubuntu
Description:	Ubuntu 21.10
Release:	21.10
Codename:	impish

which in turn is run in VMware Workstation 16 Pro

As suggested in the README.md, I am trying the following to get it running:

git clone https://github.com/hpthreatresearch/subcrawl.git

docker-compose up --build 

Once it has started up, I can actually access the dashboard at
127.0.0.1:8000 as expected.
However, no matter how long I wait, no domains get scanned and nothing changes in the dashboard at all.
I am seeing the following output in my terminal:

[screenshot "out": terminal output]

To me it looks like I am indeed receiving and logging data.
However, it's not getting where it should be?
(Keep in mind, we had to change the config.py to scan other domains; could that issue also apply here?)

Also, I am observing some errors at startup:
[screenshot "fatal": startup errors]

To be honest, this doesn't look good, but it doesn't seem to stop the docker-compose up --build command.

I hope we can find a solution here, as I would love to test out an import module I wrote for SubCrawl.

Best regards
Chris

stoerchl (Collaborator) commented Apr 1, 2022

Hi @Gi7w0rm

Thanks for reaching out to us. I verified the issue and you are absolutely correct: there is indeed a problem with the service mode, and it seems to be connected to the Redpanda installation.
In service mode, SubCrawl reads the URLs that must be scanned from a Redpanda topic. If the consumer does not receive new URL messages for two seconds, it stops consuming and starts scanning.

Here is the reference to the code where the URLs are consumed:
https://github.com/hpthreatresearch/subcrawl/blob/main/crawler/subcrawl.py#L145

Here is the reference to the code where the consumer is configured:
https://github.com/hpthreatresearch/subcrawl/blob/main/crawler/subcrawl.py#L39
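For illustration, this consume-until-idle pattern works roughly like the sketch below with the kafka-python package. The topic name and broker address are assumptions for the example, not SubCrawl's actual values:

from kafka import KafkaConsumer

# consumer_timeout_ms stops iteration once no message has arrived for
# 2 seconds -- the point at which the crawler would switch from
# consuming URLs to scanning them
consumer = KafkaConsumer(
    "urls",                          # hypothetical topic name
    bootstrap_servers="kafka:9092",  # hypothetical broker address
    auto_offset_reset="earliest",
    consumer_timeout_ms=2000,
)

urls = [msg.value.decode() for msg in consumer]  # drains until 2s of silence
# ... scanning of `urls` would start here ...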

As it turns out, the timeout of the Redpanda consumer no longer works as expected. After trying various Redpanda versions, we decided to replace the Redpanda installation with a Kafka & Zookeeper installation. The service mode now runs Kafka & Zookeeper in Docker containers, similar to the previous architecture.
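For reference, such a Kafka & Zookeeper pair in docker-compose typically looks something like this sketch. The images, tags, and settings below are common single-broker defaults, not necessarily what the repo's compose file uses:

version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.0.1
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      # single broker, so the offsets topic cannot be replicated
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1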

Note that the input modules are configured to download new URLs every 5 minutes, so an input module will wait 5 minutes after startup before requesting URLs for the first time. This was implemented to avoid overloading the upstream APIs. Referenced in the code:
https://github.com/hpthreatresearch/subcrawl/blob/main/crawler/input/urlhaus.py#L40
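The throttling itself is simple to picture. A sketch using the 5-minute interval mentioned above, where both helper functions are hypothetical stand-ins:

import time

FETCH_INTERVAL = 300  # 5 minutes between feed downloads

def fetch_urls_from_feed():  # hypothetical: would query e.g. the URLhaus API
    return []

def publish_urls(urls):      # hypothetical: would push URLs onto the Kafka topic
    pass

# starting the clock here is what causes the initial 5-minute wait
last_fetch = time.monotonic()

while True:
    if time.monotonic() - last_fetch >= FETCH_INTERVAL:
        publish_urls(fetch_urls_from_feed())
        last_fetch = time.monotonic()
    time.sleep(1)  # avoid busy-waiting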

Also, the scanned URLs are stored in batches. You can configure the batch size in the config.yml file:
https://github.com/hpthreatresearch/subcrawl/blob/main/crawler/config.yml#L2
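Conceptually, the batching behaves like this sketch; the batch size value and the storage helper are assumptions, not SubCrawl's actual code:

BATCH_SIZE = 100  # illustrative; the real value comes from config.yml
pending = []      # scanned results waiting to be persisted

def store_batch(batch):  # hypothetical database write
    print(f"storing {len(batch)} results")

def record_result(result):
    pending.append(result)
    if len(pending) >= BATCH_SIZE:
        store_batch(pending)  # the dashboard only reflects data after each flush
        pending.clear()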

Because of these mechanisms, it can take up to 10 minutes until you receive results in the dashboard. If you want to verify that the crawler is running before URLs show up, you can check the log files under /var/log/subcrawl. The file subcrawl.out.log should give you more information about what the crawler is working on at the moment.

Let us know if that works for you.

Regarding the errors you noticed at startup: there seems to be a problem with the ClamAV installation. I would suggest rebuilding the image, which must be done due to the Kafka changes anyway, and watching out for exceptions during the ClamAV installation. The Dockerfile should install all needed ClamAV packages and run freshclam as well:

RUN apt-get -y install build-essential gcc yara magic supervisor clamav-daemon clamav-freshclam clamav-unofficial-sigs
RUN freshclam

Also, let me know if this works, or share logs that show the installation process of the Docker image.

Best regards
Patrick

Gi7w0rm (Author) commented Apr 4, 2022

Hey Patrick,
thank you for your very detailed answer.
You were right about the ClamAV errors. It seems the manual update with freshclam did not work because I was not able to get a valid connection to the ClamAV server.

On this note, I would like to point out that the Dockerfile is missing the commands:

sudo systemctl stop clamav-freshclam.service
sudo systemctl start clamav-daemon.service

I am not sure whether the way Docker works makes these obsolete, but on a normal Ubuntu system the update command errors out with
"ERROR: /var/log/clamav/freshclam.log is locked by another process
ERROR: Problem with internal logger (UpdateLogFile = /var/log/clamav/freshclam.log)."
so I thought I had better notify you just in case.

After deleting the line:
https://github.com/hpthreatresearch/subcrawl/blob/main/crawler/config.yml#L18
I am now finally able to use the SubCrawl service mode!

However, running the SubCrawl service mode puts some errors into the logs at /var/log/subcrawl/.
They generally look like this:

Traceback (most recent call last):
  File "service.py", line 14, in <module>
    check_topic()
  File "/subcrawl/utils/setup_kafka_topic.py", line 6, in check_topic
    admin_client = KafkaAdminClient(
  File "/usr/local/lib/python3.8/site-packages/kafka/admin/client.py", line 208, in __init__
    self._client = KafkaClient(metrics=self._metrics,
  File "/usr/local/lib/python3.8/site-packages/kafka/client_async.py", line 244, in __init__
    self.config['api_version'] = self.check_version(timeout=check_timeout)
  File "/usr/local/lib/python3.8/site-packages/kafka/client_async.py", line 900, in check_version
    raise Errors.NoBrokersAvailable()

See output of all 3 error logs here:
https://pastebin.com/YgS0KWwE

I have also uploaded a log that shows the whole stdout output of my SubCrawl service mode, from start to the first registered hosts to crawl:
build-log.log
I don't know if this adds any value to your development process.

Another issue I see is that the service mode jumps from 0 to

2022-04-04 00:09:09,398 — SubCrawl — INFO — [ENGINE] Found 44152 hosts to scrape
2022-04-04 00:09:35,910 — SubCrawl — INFO — [ENGINE] Done parsing URLs, ready to begin scraping 44151 hosts and 49136 URLs... starting in 0 seconds!

in a second. As the dashboard does not seem to update with every scanned URL but only after a successful run, and the logs do not record any progress in the default mode, I think it would be nice to have some kind of progress logging in the default subcrawl.out.log, or a short description of all available logging modes in the README.md of this repo.

If you need any further info, feel free to ask, I will provide what I can to help.

Best regards and wishing you a nice start of the week
Chris

stoerchl (Collaborator) commented Apr 4, 2022

Hi Chris

I'll check the suggestion regarding ClamAV. You still seem to get the ClamAV errors during the build process:

web_1 | Clamav signatures not found in /var/lib/clamav ... failed!
web_1 | Please retrieve them using freshclam ... failed!
web_1 | Then run 'invoke-rc.d clamav-daemon start' ... failed!

Maybe you could add those commands to the Dockerfile and let me know if you still receive the ClamAV errors during the build process. I'm not sure why I don't get such errors; I'll try to reproduce them on my system as well.

The errors in /var/log/subcrawl/ are related to the queueing system. Are the Zookeeper and Kafka Docker containers running? What's the output of docker ps? However, since you do get the URLs in the end, I assume this problem no longer persists?

The jump from 0 to "Found 44152 hosts to scrape" happens because it loads all URLs from URLhaus at once to scrape them. Adding all those URLs to Kafka does not take long, and therefore you see them all at once. Also, as noted in my last comment, the dashboard does not update after every URL but only once the defined batch of URLs has been scanned. You can change the configuration to update it more frequently.
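For a sense of why the bulk load is fast, publishing the URLs is just a tight producer loop. A kafka-python sketch, with topic name, broker address, and the URL list as illustrative assumptions:

from kafka import KafkaProducer

urls_from_urlhaus = ["http://example.com/a", "http://example.com/b"]  # stand-in for the fetched feed

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # hypothetical address
for url in urls_from_urlhaus:
    producer.send("urls", url.encode())  # hypothetical topic name; send() is asynchronous
producer.flush()  # even tens of thousands of small messages flush within seconds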

By default, you get information about which modules are loaded, how many URLs are being scanned, and when the data is stored in the database. As this service mode should run in the background, where you most likely care only about the dashboard and the web application rather than the log output, I don't want to blow up the log file. To see more information, you can change the log level in the configuration file as well.

Best regards
Patrick

Gi7w0rm (Author) commented Apr 6, 2022

Hey @stoerchl
I tried to add the systemctl commands for stopping and starting ClamAV to the Dockerfile and ran into a problem which is probably best described in this article:
https://medium.com/swlh/docker-and-systemd-381dfd7e4628

To make it short: systemd and all related commands are not available inside Docker environments. As such, adding these commands makes the Docker build fail.
There seem to be existing workarounds, but if I understand it correctly, they would also lower the security of the Docker instance and can't really be the way to go here, especially as you are not facing these issues in your environment.

I have to admit I am rather surprised myself as to why you are not seeing these issues while I have them appearing every time, especially as I am very close to running a vanilla Ubuntu here.
I have added an image of the /var/lib/clamav folder below. Maybe you could try to find any difference from yours?

[screenshot "var_lib_clamav": contents of /var/lib/clamav]

As for the errors in /var/log/subcrawl, I guess they might be caused by the Kafka brokers not being fully initialized at the beginning of the script. They do appear consistently; however, as the URLs to scan fly in way later, I guess it doesn't matter that much.
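If those startup errors ever become a problem, a common remedy is to retry until the broker answers. A sketch around the KafkaAdminClient call from the traceback above, with address and retry count chosen purely for illustration:

import time
from kafka.admin import KafkaAdminClient
from kafka.errors import NoBrokersAvailable

def wait_for_broker(bootstrap="kafka:9092", retries=30):
    """Retry until the Kafka container has finished starting up."""
    for _ in range(retries):
        try:
            return KafkaAdminClient(bootstrap_servers=bootstrap)
        except NoBrokersAvailable:
            time.sleep(2)  # broker not up yet; wait and try again
    raise RuntimeError("Kafka broker never became available")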

Thanks for pointing out the batch sizes and the dashboard behaviour again :)

Best regards
Chris
