---
navigation: true
title: Bash Scripts
main:
  fluid: false
---

:ellipsis{left=0px width=40rem top=10rem blur=140px}

# Servarr duplicates corrector

---

Six months after downloading terabytes of media, I realized that Sonarr and Radarr had been copying files into my Plex library instead of hardlinking them. This happens because of a counterintuitive mechanism: if you mount several separate folders into Sonarr/Radarr, it sees them as different filesystems and therefore cannot create hardlinks. That's why you should mount a single parent folder containing all the child folders (like `downloads`, `movies`, `tvseries` inside one `media` parent folder).

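You can see the symptom from inside the container. A minimal sketch with hypothetical paths, assuming `/downloads` and `/movies` are two separate mounts that really are distinct filesystems:

```bash
# Two separate volume mounts: the kernel refuses a hardlink across them
ln "/downloads/film.mkv" "/movies/film.mkv"
# ln: failed to create hard link '/movies/film.mkv' => '/downloads/film.mkv':
#     Invalid cross-device link

# With a single /media mount containing both subfolders, the same call works
ln "/media/downloads/film.mkv" "/media/movies/film.mkv"
```
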
So I restructured my directories and manually updated every path in qBittorrent, Plex, and the others. The last challenge was finding a way to detect the existing duplicates, delete them, and automatically replace them with hardlinks, to reclaim the space.

My directory structure:

```console
.
└── media
    ├── seedbox
    ├── radarr
    │   └── tv-radarr
    ├── movies
    └── tvseries
```

The originals live in `seedbox` and must not be modified, to keep seeding. The copies (duplicates) are in `movies` and `tvseries`. To complicate things, `movies` and `tvseries` also contain unique originals of their own. And within those, there can be subfolders, sub-subfolders, and so on.

So the idea is to:

- list the originals in `seedbox`
- list the files in `movies` and `tvseries`
- compare both lists and isolate the duplicates
- delete the duplicates
- hardlink the originals to the deleted duplicates' paths (see the quick demo after this list)

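In case hardlinks are new to you, here is a throwaway illustration of why the last step works (my own demo, not part of the final script): a hardlink is just a second name for the same inode, so the torrent keeps seeding from the original path while the library path costs no extra space.

```bash
#!/bin/bash
# Both names point at the same inode: no extra space is used, and
# deleting one name never touches the underlying data.
demo=$(mktemp -d)
echo "fake video data" > "$demo/original.mkv"
ln "$demo/original.mkv" "$demo/library.mkv"

# Prints the same inode and a link count of 2 for both names
stat -c '%i %h %n' "$demo"/*.mkv

rm -r "$demo"
```
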
Yes, I asked ChatGPT and Qwen3 (which I host on a dedicated AI machine). Naturally, they suggested the usual tools: rdfind, fdupes, jdupes, rmlint... But hashing 30 TB of media would take days, so I gave up on that quickly.

In the end, I only needed to find `.mkv` files, and the duplicates have exactly the same filename as the originals, which simplifies things a lot. A simple Bash script would do the job.

I'll spare you the endless Q&A with ChatGPT; I was disappointed. Qwen3 was much cleaner. ChatGPT kept pushing `awk`-based solutions, which fail on paths containing spaces. With Qwen's help, and after dropping `awk`, the results improved significantly.

To test, I first asked for a script that only lists and compares:

```bash
#!/bin/bash

# Associative array remembering the first inode and path seen per filename
declare -A seen

# Find .mkv files only (-type f excludes directories); -print0 with
# read -d '' keeps paths containing spaces intact
find /media/seedbox /media/movies /media/tvseries -type f -name "*.mkv" -print0 | \
while IFS= read -r -d '' file; do
    # Get the file's inode and bare name
    inode=$(stat --format="%i" "$file")
    filename=$(basename "$file")

    # If the filename has been seen before...
    if [[ -n "${seen[$filename]}" ]]; then
        # ...and the inode differs (i.e. not already hardlinked)
        if [[ "${seen[$filename]}" != "$inode" ]]; then
            # Report the pair with full paths
            echo "Duplicates for \"$filename\":"
            echo "${seen["$filename"]} ${seen["$filename:full_path"]}"
            echo "$inode $file"
            echo
        fi
    else
        # First sighting: remember inode and full path
        seen[$filename]="$inode"
        seen["$filename:full_path"]="$file"
    fi
done
```

This gave me outputs like:

```
Duplicates for "episode1.mkv":
1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv
2345678 /media/tvseries/Serie 1/Season1/episode1.mkv
```

With the `awk`-based versions, the path would have been truncated at `/media/seedbox/sonarr/Serie`, because `awk` splits fields on whitespace by default. I'm far from an expert, but Qwen3 performed better and explained everything clearly.

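The pitfall is easy to reproduce (a minimal sketch with a made-up path):

```bash
# awk's default field splitting breaks paths at the first space
echo "1234567 /media/tvseries/Serie 1/Season1/episode1.mkv" | awk '{print $2}'
# → /media/tvseries/Serie        (everything after the space is lost)
```
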
Once I verified the output, I asked for a complete script: compare, delete the duplicates, create the hardlinks.

Again, ChatGPT disappointed. Despite my requests, it created the hardlinks *before* deleting the duplicates, effectively creating a link and then deleting it right away (the original is kept, but nothing is deduplicated). Not helpful.

A quick stopover at Qwen3, RTX 5090 in overdrive, and bam: a much better result. Yes, it kept the ChatGPT-style emojis, but here it is:

```bash
#!/bin/bash

echo "🔍 Step 1: Indexing original files in /media/seedbox..."
declare -A seen

# Index every .mkv in seedbox by bare filename. Process substitution
# (done < <(find ...)) keeps the loop in the current shell, so the array
# is still populated for step 2 (a pipe would lose it in a subshell).
while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    seen["$filename"]="$file"
done < <(find /media/seedbox -type f -name "*.mkv" -print0)

echo "📦 Step 2: Automatically replacing duplicates..."
total_doublons=0
total_ko_saved=0

# Walk the library folders; any .mkv whose name matches a seedbox
# original but sits on a different inode is a duplicate copy
while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    original="${seen[$filename]}"

    if [[ -n "$original" && "$original" != "$file" ]]; then
        inode_orig=$(stat -c %i "$original")
        inode_dupe=$(stat -c %i "$file")

        # Same inode would mean the two paths are already hardlinked
        if [[ "$inode_orig" != "$inode_dupe" ]]; then
            size_kb=$(du -k "$file" | cut -f1)
            echo "🔁 Replacing:"
            echo "   Duplicate : $file"
            echo "   Original  : $original"
            echo "   Size      : ${size_kb} KB"

            # Delete the copy first, then hardlink the original into its place
            rm "$file" && ln "$original" "$file" && echo "✅ Hardlink created."

            total_doublons=$((total_doublons + 1))
            total_ko_saved=$((total_ko_saved + size_kb))
        fi
    fi
done < <(find /media/movies /media/tvseries -type f -name "*.mkv" -print0)

echo ""
echo "🧾 Summary:"
echo "   🔗 Duplicates replaced by hardlink: $total_doublons"
echo "   💾 Approx. disk space saved: ${total_ko_saved} KB (~$((total_ko_saved / 1024)) MB)"
echo "✅ Done."
```

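After running it, here is a one-liner I'd suggest as a sanity check (my addition, using `find`'s `-links` test): any library file that still has a link count of 1 was not deduplicated.

```bash
# List library .mkv files that still have only one link, i.e. files
# the script did not (or could not) replace with a hardlink
find /media/movies /media/tvseries -type f -name "*.mkv" -links 1 -print
```
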
So, in conclusion, I:

- Learned many Bash subtleties
- Learned never to blindly copy-paste a ChatGPT script without understanding it and dry-running it first
- Learned that Qwen3 on an RTX 5090 is more coherent than ChatGPT-4o on server farms (not even mentioning "normal" ChatGPT)
- Learned that even with 100 TB of storage, monitoring disk usage would have alerted me much earlier to the 12 TB of duplicates lying around