torako-crawler

This is a simple 4chan board crawler. It keeps track of new threads within a specified set of boards. Threads are crawled until they are deleted from 4chan or marked as archived. The crawler downloads each thread's JSON file, including subsequent changes to it, as well as the thread's media.

This crawler is not intended to be used for a 1:1 preservation of 4chan.

The crawler runs over IPv6 with randomized source addresses drawn from a /64 prefix allocation. This is done to get around the rate limits enforced by Cloudflare.
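
A minimal sketch of the idea is shown below; the interface name, the prefix value, and the exact ifconfig(8) invocation are placeholders and not necessarily what the scripts actually do:

    # Sketch: derive a random 64-bit interface identifier and add the
    # resulting address as an alias (interface and prefix are placeholders).
    prefix="2001:db8:1234:5678"
    suffix=$(openssl rand -hex 8 | sed -e 's/\(....\)/\1:/g' -e 's/:$//')
    doas ifconfig em0 inet6 "${prefix}:${suffix}" prefixlen 64 alias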

Requirements
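
The scripts rely on the tools referenced throughout this README: doas(1) and ifconfig(8) for interface configuration, ftp(1) for downloading, and cron(8) for scheduling, plus a routed /64 IPv6 prefix allocation.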

Installation

The crawler can be installed via the provided install.sh script, which copies the scripts to the $HOME/bin directory. An example configuration is copied to $HOME/.config/torako-crawler/crawler.conf.
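
A typical installation could look like this, assuming the repository has been checked out into a directory named torako-crawler:

    $ cd torako-crawler
    $ sh install.sh
    $ ls ~/bin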

doas(1) is used for network interface configuration, as this requires root privileges. Make sure to have a rule in doas.conf(5) that allows the user to run ifconfig(8) without entering a password:

    permit nopass crawler as root cmd ifconfig
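
With that rule in place, running ifconfig(8) through doas(1) from the crawler account should work without a password prompt:

    $ doas ifconfig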

Configuration

Make sure to specify the correct /64 IPv6 prefix allocation in the configuration file. The install.sh script tries to guess it.
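
For illustration only, a configuration could look roughly like the following; the variable names are hypothetical, so refer to the shipped crawler.conf for the real ones:

    # hypothetical example values -- see the shipped crawler.conf
    boards="g wg"                       # boards to crawl
    ipv6_prefix="2001:db8:1234:5678"    # routed /64 prefix
    interface="em0"                     # interface the prefix is assigned to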

Usage

The crawler should be invoked by cron(8) on a regular basis. Since the crawler itself makes sure that only one instance runs at a time, crontab(5) can be configured to start it every minute:

    * * * * * $HOME/bin/torako-crawler
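
One common way to implement such a single-instance guard in a shell script is sketched below; the actual mechanism used by crawler.sh may differ:

    # illustrative single-instance guard (crawler.sh may do this differently)
    lockdir="${HOME}/.torako-crawler.lock"
    if ! mkdir "$lockdir" 2>/dev/null; then
        exit 0          # another instance is already running
    fi
    trap 'rmdir "$lockdir"' EXIT INT TERM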

An additional script, torako-pkiller, helps deal with long-running ftp(1) processes, which could otherwise clog up the crawler. Configure crontab(5) to kill and clean up ftp(1) processes that have been running for more than 3 minutes:

    */3 * * * * $HOME/bin/torako-pkiller

Stability and performance

DNS queries

Crawling 4chan results in a lot of AAAA DNS queries for a.4cdn.org and i.4cdn.org. Make sure these queries are answered quickly and correctly. Some resolvers, especially well-known public ones, implement various forms of rate limiting, which can cause the crawler to stop working.
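
To spot-check that AAAA lookups are answered quickly, a query like the following can be run against the configured resolver and the reported query time inspected (dig(1) is assumed to be available):

    $ dig AAAA a.4cdn.org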

It is therefore recommended to run your own resolver. Running unbound(8) with the default settings should be fine. resolv.conf(5) has to be configured accordingly.
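
Assuming unbound(8) listens on the loopback address, which is its usual default, resolv.conf(5) would contain something like:

    nameserver 127.0.0.1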