I'm a happy user of Forgejo and I host it on my homelab at git.sitegui.dev to store all my open source code (including this very same page).
However, as many other hobbyists and major projects, I have noticed an uptick in the number of crawling requests that my instance serves. My homelab is literally an old laptop close to the fridge, so I could hear the extra load (and it also doesn't help that Europe is scalding with 16 degrees above average temperatures for May).
I saw an increase from an average of 1 000 requests a day to 200 000. The issue is not only the number of requests, but their nature: these bots go over the whole git history to navigate the whole tree of files for each commit. They also request a lot of repo archives that are expensive to generate.
The robots.txt clearly forbids bots to navigate to these pages:
User-agent: *
Disallow: /*/*/src/
Disallow: /*/*/archive/
Obviously, these bots do not respect it and hammer public Internet infra structure and hobbyists. The reason
is clear:
they scrap to train AI models, consequences be damned. It's also very dumb: they could instead just do a git clone in
each repo to extract all the necessary information with just a fraction of the work!
Checking the data with DuckDB
I take this opportunity to learn some DuckDB, which I understand as a happy marriage between Spark and SQLite: query directly from disk files using SQL, but with an "in-process" mentality: no infrastructure to manage.
I've downloaded the server logs from Caddy and imported it into a DuckDB file to accelerate interactive querying:
import duckdb
connection = duckdb.connect("data/logs.duckdb")
connection.execute("""
create or replace table caddy_logs as
select *
from read_json_auto(
'data/logs/*',
format = 'newline_delimited',
union_by_name = true,
sample_size = -1
)
""")
connection.close()
Out of the 1 121 977 requests to git.sitegui.dev, I've extracted the top 5 User-Agent:
select map_extract_value(request.headers, 'User-Agent'),
count(*)
from caddy_logs
where request.host = 'git.sitegui.dev'
group by 1
order by 2 desc limit 5
- 562 151: meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
- 60 571: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0
- 35 921: Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Mobile Safari/537.36
- 27 870: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36
- 16 339: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0
The User-Agent (UA) header is present in every request, and it's how the clients announce themselves. Well-behaved bots must disclose their botly presence in the UA. Well, meta is doing it, but clearly not everybody.
So I got interested in knowing their IPs. I'm also filtering only requests to "/archive/", which bots are explicitly forbidden to do, as per the robots.txt above.
select map_extract_value(request.headers, 'User-Agent'),
list(distinct request.client_ip),
count(*)
from caddy_logs
where request.host = 'git.sitegui.dev'
and request.uri like '%/archive/%'
group by 1
order by 3 desc limit 5
The top hit "meta-externalagent/1.1" showed a bunch of IPv6 and IPv4 owned by meta, like 2a03:2880:f814:19::. So fuck you Meta! You can shove your lack of politeness into your legless metaverse.
The other top hits were all from Alibaba, like 47.82.15.152. Alibaba is a cloud provider, so it's basically someone running a bad crawler in their servers. Their bot is "smart" enough to use multiple User-Agent strings 😒.
My solution
Most of the projects that I run in my homelab are private: they are used by me, family and friends. These are protected by knock and see little bot activity.
However, I want forgejo to be publicly accessible. Some people are using Anubis, which is an interesting take. I tried it for a small while, but I wanted something that:
- works without JavaScript support
- allows scrapping of "common" pages, like the list of projects and their READMEs
- doesn't sit on top of the connection, but instead "by the side". That is, the bytes don't need to be proxied over again
So I've coded forgejo-shield that implements a simple logic:
- requests to static assets, project pages and the .git protocol are always accepted
- any other request must have a cookie "forgejo-shield" set. If it doesn't, the user will be presented with a page with a button that posts a form that sets the cookie
The Caddyfile has a new forward_auth block like this:
git.sitegui.dev {
forward_auth forgejo-shield:8080 {
uri /
}
reverse_proxy forgejo:3000
}
So potentially more expensive pages will be protected with:

I just deployed it, and it seems to work as expected. 🤞🏽
This episode got me thinking of deploying iocaine in the future, to create an
infinite maze to poison training datasets.
Achille (achilletoupin.com) at 2026-05-29 05:21 UTC
It wasn't me ! I swear
answer
😂
Send me your banking statements, so that I can check that you have not payed any Alibaba cloud bills!
answer