---
navigation: true
title: Bash Scripts
main:
  fluid: false
---

:ellipsis{left=0px width=40rem top=10rem blur=140px}

# Servarr duplicates corrector

---

Six months after downloading terabytes of media, I realized that Sonarr and Radarr were copying them into my Plex library instead of creating hardlinks. This happens due to a counterintuitive mechanism: if you mount multiple folders in Sonarr/Radarr, it sees them as different filesystems and thus cannot create hardlinks. That's why you should mount only one parent folder containing all the child folders (like `downloads`, `movies`, and `tvseries` inside a `media` parent folder).
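
A quick way to check whether hardlinks are even possible between two paths (the paths here are illustrative, and this should be run from wherever Sonarr/Radarr actually sees the mounts) is to compare device IDs, since hardlinks only work within a single filesystem:

```bash
# If these two numbers differ, the paths are on different filesystems
# and the *arr apps will silently copy instead of hardlinking.
stat -c '%d' /media/seedbox /media/movies
```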

So I restructured my directories and manually updated every path in qBittorrent, Plex, and the rest. The last challenge was finding a way to detect the existing duplicates, delete them, and automatically replace them with hardlinks to save space.

My directory structure:

```console
.
└── media
    ├── seedbox
    ├── radarr
    │   └── tv-radarr
    ├── movies
    └── tvseries
```

The originals live in `seedbox` and must not be modified, to keep seeding. The copies (the duplicates) are in `movies` and `tvseries`. To complicate things, `movies` and `tvseries` also contain unique originals of their own, and all of this can be nested in subfolders, sub-subfolders, and so on.
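
Incidentally, link counts give a cheap way to spot files that were copied rather than hardlinked (a side check of my own, not part of the final script): a file with a link count of 1 has never been hardlinked anywhere, so in the library folders it is either a plain copy or a unique original:

```bash
# .mkv files in the library folders with a single link:
# either duplicates (copies) or unique originals
find /media/movies /media/tvseries -type f -name '*.mkv' -links 1
```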

So the idea is to:

- list the originals in `seedbox`
- list the files in `movies` and `tvseries`
- compare both lists and isolate the duplicates
- delete the duplicates
- hardlink the originals to the deleted duplicates' paths

Yes, I asked ChatGPT and Qwen3 (which I host on a dedicated AI machine). Naturally, they suggested tools like `rdfind`, `fdupes`, `jdupes`, `rmlint`... But those all hash file contents, and hashing 30 TB of media would take days (at, say, 200 MB/s of sequential reads, 30 TB is nearly two days of pure I/O), so I quickly gave up on that route.

In the end, I only needed to find `.mkv` files, and the duplicates carry exactly the same name as their originals, which simplifies things a lot. A simple Bash script would do the job.
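
That also means duplicate candidates can be listed by basename alone, without hashing a single byte. Something like this (assuming GNU `find` for `-printf`) gives a quick preview:

```bash
# Print basenames that occur more than once across the three trees
find /media/seedbox /media/movies /media/tvseries -type f -name '*.mkv' -printf '%f\n' \
    | sort | uniq -d
```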

I'll spare you the endless Q&A with ChatGPT; I was disappointed. Qwen3 was much cleaner. ChatGPT kept pushing awk-based solutions, which break on paths containing spaces. With Qwen3's help, and after dropping awk, the results improved significantly.

To test, I first asked for a script that only lists and compares:

```bash
#!/bin/bash

# Associative array mapping each filename already seen
# to its inode, plus a second key for its full path
declare -A seen

# Find all .mkv files only (exclude directories).
# Note: the pipe runs the loop in a subshell, which is fine here
# because nothing after the loop needs to read "seen".
find /media/seedbox /media/movies /media/tvseries -type f -name "*.mkv" -print0 | \
while IFS= read -r -d '' file; do
    # Get the file's inode and name
    inode=$(stat --format="%i" "$file")
    filename=$(basename "$file")

    # If the filename has been seen before
    if [[ -n "${seen[$filename]}" ]]; then
        # Check if the inode is different from the previous one
        if [[ "${seen[$filename]}" != "$inode" ]]; then
            # Output the duplicates with full paths
            echo "Duplicates for \"$filename\":"
            echo "${seen["$filename"]} ${seen["$filename:full_path"]}"
            echo "$inode $file"
            echo
        fi
    else
        seen[$filename]="$inode"
        seen["$filename:full_path"]="$file"
    fi
done
```
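
To try it, save it under any name (`check-dupes.sh` below is my own placeholder), make it executable, and keep the output around for review:

```bash
# Hypothetical usage; the filenames are illustrative
chmod +x check-dupes.sh
./check-dupes.sh | tee duplicates-report.txt
```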

This gave me outputs like:

```
Duplicates for "episode1.mkv":
1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv
2345678 /media/tvseries/Serie 1/Season1/episode1.mkv
```

With `awk`, the path would have been truncated at `/media/seedbox/sonarr/Serie`. I'm far from an expert, but Qwen3 performed better and explained everything clearly.
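
Here is that failure mode in isolation: `awk` splits on whitespace by default, so a path containing a space loses everything after its first word:

```bash
echo "1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv" | awk '{print $2}'
# prints: /media/seedbox/sonarr/Serie
```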

Once I verified the output, I asked for a complete script: compare, delete the duplicates, create the hardlinks.

Again, ChatGPT disappointed. Despite my requests, it created the hardlinks *before* deleting the duplicates, effectively creating a link and then deleting it again (the original survives, but nothing is gained). Not helpful.
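
The order is not just cosmetic: `ln` without `-f` refuses to overwrite an existing path, so the duplicate has to be removed before the link can take its place (variable names here mirror the script below):

```bash
ln "$original" "$file"                # fails while the duplicate exists:
                                      # "ln: failed to create hard link: File exists"
rm "$file" && ln "$original" "$file"  # delete first, then link
```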

A quick stopover at Qwen3, RTX 5090 in overdrive, and bam: a much better result. Yes, it kept the ChatGPT-style emojis, but here it is:

```bash
#!/bin/bash

echo "🔍 Step 1: Indexing original files in /media/seedbox..."
declare -A seen

# Index all .mkv files in seedbox by basename
while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    seen["$filename"]="$file"
done < <(find /media/seedbox -type f -name "*.mkv" -print0)

echo "📦 Step 2: Automatically replacing duplicates..."
total_doublons=0
total_ko_saved=0

while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    original="${seen[$filename]}"

    if [[ -n "$original" && "$original" != "$file" ]]; then
        inode_orig=$(stat -c %i "$original")
        inode_dupe=$(stat -c %i "$file")

        if [[ "$inode_orig" != "$inode_dupe" ]]; then
            size_kb=$(du -k "$file" | cut -f1)
            echo "🔁 Replacing:"
            echo "    Duplicate : $file"
            echo "    Original  : $original"
            echo "    Size      : ${size_kb} KB"

            rm "$file" && ln "$original" "$file" && echo "✅ Hardlink created."

            total_doublons=$((total_doublons + 1))
            total_ko_saved=$((total_ko_saved + size_kb))
        fi
    fi
done < <(find /media/movies /media/tvseries -type f -name "*.mkv" -print0)

echo ""
echo "🧾 Summary:"
echo "    🔗 Duplicates replaced by hardlink: $total_doublons"
echo "    💾 Approx. disk space saved: ${total_ko_saved} KB (~$((total_ko_saved / 1024)) MB)"
echo "✅ Done."
```
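
One refinement I would consider (my own tweak, not Qwen3's): between the `rm` and the `ln` there is a brief window where the file exists only in `seedbox`, and if `ln` fails the duplicate is already gone. Linking under a temporary name and renaming over the duplicate closes that window, since `mv` replaces the target atomically:

```bash
# Safer replacement: the duplicate only disappears via the atomic rename
tmp="${file}.tmp-hardlink"   # temporary name (illustrative)
ln "$original" "$tmp" && mv -f "$tmp" "$file" && echo "✅ Hardlink created."
```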

So, in conclusion, I:
- Learned many Bash subtleties
- Learned never to blindly copy-paste a ChatGPT script without understanding and dry-running it
- Learned that Qwen3 on an RTX 5090 is more coherent than ChatGPT-4o on server farms (not even mentioning "normal" ChatGPT)
- Learned that even with 100 TB of storage, monitoring it would have alerted me much earlier to the 12 TB of duplicates lying around (even a check as simple as the sketch below would have caught it)
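
For that last point, a minimal cron-able check, as a sketch: the threshold and the alert command are placeholders (assuming GNU `df` and a configured `mail`):

```bash
#!/bin/bash
# Warn when /media usage crosses 90%
usage=$(df --output=pcent /media | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 90 ]; then
    echo "/media is ${usage}% full" | mail -s "Disk alert" admin@example.com
fi
```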