The bots came to hunt my forge data

I'm a happy user of Forgejo and I host it on my homelab at git.sitegui.dev to store all my open source code (including this very same page).

However, as many other hobbyists and major projects, I have noticed an uptick in the number of crawling requests that my instance serves. My homelab is literally an old laptop close to the fridge, so I could hear the extra load (and it also doesn't help that Europe is scalding with 16 degrees above average temperatures for May).

I saw an increase from an average of 1 000 requests a day to 200 000. The issue is not only the number of requests, but their nature: these bots go over the whole git history to navigate the whole tree of files for each commit. They also request a lot of repo archives that are expensive to generate.

The robots.txt clearly forbids bots to navigate to these pages:

User-agent: *
Disallow: /*/*/src/
Disallow: /*/*/archive/

Obviously, these bots do not respect it and hammer public Internet infra structure and hobbyists. The reason is clear: they scrap to train AI models, consequences be damned. It's also very dumb: they could instead just do a git clone in each repo to extract all the necessary information with just a fraction of the work!

Checking the data with DuckDB

I take this opportunity to learn some DuckDB, which I understand as a happy marriage between Spark and SQLite: query directly from disk files using SQL, but with an "in-process" mentality: no infrastructure to manage.

I've downloaded the server logs from Caddy and imported it into a DuckDB file to accelerate interactive querying:

import duckdb

connection = duckdb.connect("data/logs.duckdb")

connection.execute("""
create or replace table caddy_logs as
select *
from read_json_auto(
    'data/logs/*',
    format = 'newline_delimited',
    union_by_name = true,
    sample_size = -1
)
""")

connection.close()

Out of the 1 121 977 requests to git.sitegui.dev, I've extracted the top 5 User-Agent:

select map_extract_value(request.headers, 'User-Agent'),
       count(*)
from caddy_logs
where request.host = 'git.sitegui.dev'
group by 1
order by 2 desc limit 5

562 151: meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
60 571: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0
35 921: Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Mobile Safari/537.36
27 870: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36
16 339: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0

The User-Agent (UA) header is present in every request, and it's how the clients announce themselves. Well-behaved bots must disclose their botly presence in the UA. Well, meta is doing it, but clearly not everybody.

So I got interested in knowing their IPs. I'm also filtering only requests to "/archive/", which bots are explicitly forbidden to do, as per the robots.txt above.

select map_extract_value(request.headers, 'User-Agent'),
       list(distinct request.client_ip),
       count(*)
from caddy_logs
where request.host = 'git.sitegui.dev'
  and request.uri like '%/archive/%'
group by 1
order by 3 desc limit 5

The top hit "meta-externalagent/1.1" showed a bunch of IPv6 and IPv4 owned by meta, like 2a03:2880:f814:19::. So fuck you Meta! You can shove your lack of politeness into your legless metaverse.

The other top hits were all from Alibaba, like 47.82.15.152. Alibaba is a cloud provider, so it's basically someone running a bad crawler in their servers. Their bot is "smart" enough to use multiple User-Agent strings 😒.

My solution

Most of the projects that I run in my homelab are private: they are used by me, family and friends. These are protected by knock and see little bot activity.

However, I want forgejo to be publicly accessible. Some people are using Anubis, which is an interesting take. I tried it for a small while, but I wanted something that:

works without JavaScript support
allows scrapping of "common" pages, like the list of projects and their READMEs
doesn't sit on top of the connection, but instead "by the side". That is, the bytes don't need to be proxied over again

So I've coded forgejo-shield that implements a simple logic:

requests to static assets, project pages and the .git protocol are always accepted
any other request must have a cookie "forgejo-shield" set. If it doesn't, the user will be presented with a page with a button that posts a form that sets the cookie

The Caddyfile has a new forward_auth block like this:

git.sitegui.dev {
    forward_auth forgejo-shield:8080 {
        uri /
    }

    reverse_proxy forgejo:3000
}

So potentially more expensive pages will be protected with:

the shield form with a button

I just deployed it, and it seems to work as expected. 🤞🏽
This episode got me thinking of deploying iocaine in the future, to create an infinite maze to poison training datasets.

The bots came to hunt my forge data

Checking the data with DuckDB

My solution

Comments

Add comment

The bots came to hunt my forge data

Checking the data with DuckDB

My solution

Be notified about new posts

Comments

Add comment