This is a simple 4chan board crawler. It keeps track of new threads within a specified set of boards. Threads are crawled until they are deleted from 4chan or marked as archived. The crawler downloads each thread's JSON file, along with later changes to it. Thread media is also downloaded.
This crawler is not intended to be used for a 1:1 preservation of 4chan.
The crawler runs over IPv6 with randomized addresses taken from a /64 prefix allocation. This is done to bypass the rate limits enforced by Cloudflare.
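The address randomization described above can be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the function name `random_addr` and the prefix value (taken from the IPv6 documentation range) are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: build a random address inside a /64 prefix.
# PREFIX is a hypothetical example value, written as the first four
# 16-bit groups without a trailing colon.
PREFIX="2001:db8:1234:5678"

random_addr() {
	# Read 8 random bytes (64 bits) and print them as 16 hex digits.
	hex=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
	# Split the 16 hex digits into four colon-separated groups.
	suffix=$(printf '%s\n' "$hex" |
		sed 's/\(....\)\(....\)\(....\)\(....\)/\1:\2:\3:\4/')
	# Append the random interface identifier to the given prefix.
	printf '%s:%s\n' "$1" "$suffix"
}

random_addr "$PREFIX"
```

Each call prints a different address within the same /64. Such an address would still have to be assigned to the network interface before it can be used for outgoing requests.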
The crawler can be installed via the provided install.sh script, which copies the scripts to the $HOME/bin directory. An example configuration is installed as well.
doas(1) is used to configure the network interface, as this requires root access. Make sure its configuration contains a rule that allows the user to run ifconfig(8) without entering a password:
permit nopass crawler as root cmd ifconfig
Make sure to specify the correct /64 IPv6 prefix allocation in the configuration file. The install.sh script tries to guess it.
The crawler should be called by cron(8) on a regular basis. As the crawler itself makes sure that only one instance runs at a time, it can be scheduled to run every minute:
* * * * * $HOME/bin/torako-crawler
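A single-instance guard of the kind mentioned above could look like the following minimal sketch. It is not taken from the crawler; the lock path and the function names `acquire_lock` and `release_lock` are hypothetical, and the crawler may use a different mechanism.

```shell
#!/bin/sh
# Sketch of a single-instance guard based on a lock directory.
# LOCKDIR is a hypothetical path chosen for this example.
LOCKDIR="${TMPDIR:-/tmp}/crawler-example.lock"

acquire_lock() {
	# mkdir(1) fails if the directory already exists, so only one
	# process can create it successfully.
	mkdir "$1" 2>/dev/null
}

release_lock() {
	rmdir "$1"
}

if acquire_lock "$LOCKDIR"; then
	# Remove the lock again when the script exits normally.
	trap 'release_lock "$LOCKDIR"' EXIT
	echo "lock acquired, crawling would start here"
else
	echo "another instance is running, exiting" >&2
	exit 0
fi
```

With a guard like this, a run started by cron while the previous one is still active exits immediately instead of crawling in parallel. A lock left behind by a killed process would need to be removed by hand or by a separate cleanup step.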
There is an additional script, called torako-pkiller, that can help with the handling of long-running ftp(1) processes, which could otherwise clog up the crawler. Configure crontab(5) to kill and clean up ftp(1) processes which have been running for over 3 minutes:
*/3 * * * * $HOME/bin/torako-pkiller
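The idea behind such a cleanup can be sketched as follows. This is an illustration rather than the actual torako-pkiller script: the function names `etime_to_seconds` and `stale_pids` are hypothetical, and it assumes that ps(1) supports the `etime` keyword with the usual `[[dd-]hh:]mm:ss` format. The sketch only prints the matching PIDs instead of killing anything.

```shell
#!/bin/sh
# Convert a ps(1) elapsed time ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
	printf '%s\n' "$1" | awk -F'[-:]' '{
		if (NF == 4)      print (($1 * 24 + $2) * 60 + $3) * 60 + $4
		else if (NF == 3) print ($1 * 60 + $2) * 60 + $3
		else              print $1 * 60 + $2
	}'
}

# Read "pid etime comm" lines on stdin and print the PIDs of processes
# named $1 that have been running for more than $2 seconds.
stale_pids() {
	while read -r pid etime comm; do
		[ "$comm" = "$1" ] || continue
		if [ "$(etime_to_seconds "$etime")" -gt "$2" ]; then
			printf '%s\n' "$pid"
		fi
	done
}

# List ftp processes of the current user that are older than 3 minutes.
ps -U "$(id -u)" -o pid= -o etime= -o comm= | stale_pids ftp 180
```

Feeding the printed PIDs to kill(1) would terminate the stale processes. Keeping the selection logic separate from the kill step makes it easy to check which processes would be affected first.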
Crawling 4chan results in a lot of DNS type AAAA queries for a.4cdn.org and i.4cdn.org. Make sure these queries are answered quickly and correctly. Some resolvers, especially well-known public ones, implement various forms of rate limiting, which could cause the crawler to stop working.
It is therefore recommended to run your own resolver. A local resolver with its default settings should be okay. resolv.conf(5) has to be configured accordingly.