---
navigation: true
title: Bash Scripts
main:
  fluid: false
---

:ellipsis{left=0px width=40rem top=10rem blur=140px}

# Servarr duplicates corrector

---

Six months after downloading terabytes of media, I realized that Sonarr and Radarr were copying them into my Plex library instead of creating hardlinks. This happens due to a counterintuitive mechanism: if you mount multiple folders in Sonarr/Radarr, it sees them as different filesystems and thus cannot create hardlinks. That's why you should mount only one parent folder containing all the child folders (like `downloads`, `movies`, and `tvseries` inside a `media` parent folder).
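
A quick way to check whether hardlinks are even possible between two paths (the paths here are illustrative, and this should be run from wherever Sonarr/Radarr actually sees the mounts) is to compare device IDs, since hardlinks only work within a single filesystem:

```bash
# If these two numbers differ, the paths are on different filesystems
# and the *arr apps will silently copy instead of hardlinking.
stat -c '%d' /media/seedbox /media/movies
```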

So I restructured my directories and manually updated every path in qBittorrent, Plex, and the rest. The last challenge was finding a way to detect the existing duplicates, delete them, and automatically replace them with hardlinks to save space.

My directory structure:

```console
.
└── media
    ├── seedbox
    ├── radarr
    │   └── tv-radarr
    ├── movies
    └── tvseries
```

The originals live in `seedbox` and must not be modified, to keep seeding. The copies (the duplicates) are in `movies` and `tvseries`. To complicate things, `movies` and `tvseries` also contain unique originals of their own, and all of this can be nested in subfolders, sub-subfolders, and so on.
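
Incidentally, link counts give a cheap way to spot files that were copied rather than hardlinked (a side check of my own, not part of the final script): a file with a link count of 1 has never been hardlinked anywhere, so in the library folders it is either a plain copy or a unique original:

```bash
# .mkv files in the library folders with a single link:
# either duplicates (copies) or unique originals
find /media/movies /media/tvseries -type f -name '*.mkv' -links 1
```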

So the idea is to:

- list the originals in `seedbox`
- list the files in `movies` and `tvseries`
- compare both lists and isolate the duplicates
- delete the duplicates
- hardlink the originals to the deleted duplicates' paths

Yes, I asked ChatGPT and Qwen3 (which I host on a dedicated AI machine). Naturally, they suggested tools like `rdfind`, `fdupes`, `jdupes`, `rmlint`... But those all hash file contents, and hashing 30 TB of media would take days (at, say, 200 MB/s of sequential reads, 30 TB is nearly two days of pure I/O), so I quickly gave up on that route.

In the end, I only needed to find `.mkv` files, and the duplicates carry exactly the same name as their originals, which simplifies things a lot. A simple Bash script would do the job.
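
That also means duplicate candidates can be listed by basename alone, without hashing a single byte. Something like this (assuming GNU `find` for `-printf`) gives a quick preview:

```bash
# Print basenames that occur more than once across the three trees
find /media/seedbox /media/movies /media/tvseries -type f -name '*.mkv' -printf '%f\n' \
    | sort | uniq -d
```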

I'll spare you the endless Q&A with ChatGPT; I was disappointed. Qwen3 was much cleaner. ChatGPT kept pushing awk-based solutions, which break on paths containing spaces. With Qwen3's help, and after dropping awk, the results improved significantly.

To test, I first asked for a script that only lists and compares:

```bash
#!/bin/bash

# Associative array mapping each filename already seen
# to its inode, plus a second key for its full path
declare -A seen

# Find all .mkv files only (exclude directories).
# Note: the pipe runs the loop in a subshell, which is fine here
# because nothing after the loop needs to read "seen".
find /media/seedbox /media/movies /media/tvseries -type f -name "*.mkv" -print0 | \
while IFS= read -r -d '' file; do
    # Get the file's inode and name
    inode=$(stat --format="%i" "$file")
    filename=$(basename "$file")

    # If the filename has been seen before
    if [[ -n "${seen[$filename]}" ]]; then
        # Check if the inode is different from the previous one
        if [[ "${seen[$filename]}" != "$inode" ]]; then
            # Output the duplicates with full paths
            echo "Duplicates for \"$filename\":"
            echo "${seen["$filename"]} ${seen["$filename:full_path"]}"
            echo "$inode $file"
            echo
        fi
    else
        seen[$filename]="$inode"
        seen["$filename:full_path"]="$file"
    fi
done
```
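
To try it, save it under any name (`check-dupes.sh` below is my own placeholder), make it executable, and keep the output around for review:

```bash
# Hypothetical usage; the filenames are illustrative
chmod +x check-dupes.sh
./check-dupes.sh | tee duplicates-report.txt
```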

This gave me outputs like:

```
Duplicates for "episode1.mkv":
1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv
2345678 /media/tvseries/Serie 1/Season1/episode1.mkv
```

With `awk`, the path would have been truncated at `/media/seedbox/sonarr/Serie`. I'm far from an expert, but Qwen3 performed better and explained everything clearly.
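
Here is that failure mode in isolation: `awk` splits on whitespace by default, so a path containing a space loses everything after its first word:

```bash
echo "1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv" | awk '{print $2}'
# prints: /media/seedbox/sonarr/Serie
```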

Once I verified the output, I asked for a complete script: compare, delete the duplicates, create the hardlinks.

Again, ChatGPT disappointed. Despite my requests, it created the hardlinks *before* deleting the duplicates, effectively creating a link and then deleting it again (the original survives, but nothing is gained). Not helpful.
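
The order is not just cosmetic: `ln` without `-f` refuses to overwrite an existing path, so the duplicate has to be removed before the link can take its place (variable names here mirror the script below):

```bash
ln "$original" "$file"                # fails while the duplicate exists:
                                      # "ln: failed to create hard link: File exists"
rm "$file" && ln "$original" "$file"  # delete first, then link
```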

A quick stopover at Qwen3, RTX 5090 in overdrive, and bam: a much better result. Yes, it kept the ChatGPT-style emojis, but here it is:

```bash
#!/bin/bash

echo "🔍 Step 1: Indexing original files in /media/seedbox..."
declare -A seen

# Index all .mkv files in seedbox by basename
while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    seen["$filename"]="$file"
done < <(find /media/seedbox -type f -name "*.mkv" -print0)

echo "📦 Step 2: Automatically replacing duplicates..."
total_doublons=0
total_ko_saved=0

while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    original="${seen[$filename]}"

    if [[ -n "$original" && "$original" != "$file" ]]; then
        inode_orig=$(stat -c %i "$original")
        inode_dupe=$(stat -c %i "$file")

        if [[ "$inode_orig" != "$inode_dupe" ]]; then
            size_kb=$(du -k "$file" | cut -f1)
            echo "🔁 Replacing:"
            echo "    Duplicate : $file"
            echo "    Original  : $original"
            echo "    Size      : ${size_kb} KB"

            rm "$file" && ln "$original" "$file" && echo "✅ Hardlink created."

            total_doublons=$((total_doublons + 1))
            total_ko_saved=$((total_ko_saved + size_kb))
        fi
    fi
done < <(find /media/movies /media/tvseries -type f -name "*.mkv" -print0)

echo ""
echo "🧾 Summary:"
echo "    🔗 Duplicates replaced by hardlink: $total_doublons"
echo "    💾 Approx. disk space saved: ${total_ko_saved} KB (~$((total_ko_saved / 1024)) MB)"
echo "✅ Done."
```
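
One refinement I would consider (my own tweak, not Qwen3's): between the `rm` and the `ln` there is a brief window where the file exists only in `seedbox`, and if `ln` fails the duplicate is already gone. Linking under a temporary name and renaming over the duplicate closes that window, since `mv` replaces the target atomically:

```bash
# Safer replacement: the duplicate only disappears via the atomic rename
tmp="${file}.tmp-hardlink"   # temporary name (illustrative)
ln "$original" "$tmp" && mv -f "$tmp" "$file" && echo "✅ Hardlink created."
```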

So, in conclusion, I:
- Learned many Bash subtleties
- Learned never to blindly copy-paste a ChatGPT script without understanding and dry-running it
- Learned that Qwen3 on an RTX 5090 is more coherent than ChatGPT-4o on server farms (not even mentioning "normal" ChatGPT)
- Learned that even with 100 TB of storage, monitoring it would have alerted me much earlier to the 12 TB of duplicates lying around (even a check as simple as the sketch below would have caught it)
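
For that last point, a minimal cron-able check, as a sketch: the threshold and the alert command are placeholders (assuming GNU `df` and a configured `mail`):

```bash
#!/bin/bash
# Warn when /media usage crosses 90%
usage=$(df --output=pcent /media | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 90 ]; then
    echo "/media is ${usage}% full" | mail -s "Disk alert" admin@example.com
fi
```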