A HashSet stores unique elements using a hash table with O(1) average-time add, remove, and contains
Backed by a HashMap — elements become keys, a dummy object is the value
Relies on hashCode() to pick a bucket and equals() to confirm uniqueness
Add() returns boolean: true if element was new, false if duplicate
Resizing at 75% load factor triggers O(n) rehashing — the only expensive operation
Biggest mistake: mutating hashCode-related fields after insertion leaves elements orphaned in wrong buckets
✦ Definition~90s read
What is HashSet in Java?
A HashSet is a Java collection that implements the Set interface using a hash table, specifically a HashMap under the hood. It exists to provide constant-time performance for the fundamental set operations—add, remove, contains—by leveraging the hashCode() and equals() methods of its elements.
★
Imagine you're organising a guest list for a party.
The core idea is that each object's hash code determines which "bucket" it lands in, allowing O(1) average-case lookups instead of the O(n) scans required by a List. This makes HashSet the go-to choice when you need to enforce uniqueness and test membership quickly, like deduplicating user IDs or checking for duplicate transactions in a stream of events.
However, HashSet is not a magic bullet. Its performance degrades catastrophically if the elements' hashCode() values change after insertion—a classic bug that can cause 30% or more cache loss as objects end up in wrong buckets, making contains() and remove() silently fail.
This is why the equals() and hashCode() contract demands that hash codes remain consistent while the object is in the set. Alternatives like LinkedHashSet (preserves insertion order) or TreeSet (sorted, O(log n) operations) exist for different trade-offs, but HashSet dominates when order doesn't matter and raw speed is the priority.
In practice, you'll see it in caching layers, dedup pipelines, and anywhere you need to answer "have I seen this before?" in microseconds—just don't mutate the keys after insertion.
Plain-English First
Imagine you're organising a guest list for a party. You don't care what order people arrive — you just need to know instantly whether someone is already on the list, and you never want the same name written down twice. A HashSet is exactly that guest list. It's a bag that holds things in no particular order, refuses duplicates without complaining, and can tell you 'yes they're in' or 'no they're not' almost instantly — no matter how many thousands of names are on the list.
Every non-trivial Java application deals with collections of data. But most developers reach for ArrayList out of habit, even when uniqueness or fast lookup is the actual requirement. That habit has real costs — O(n) contains() checks, duplicates sneaking into your data, and business logic cluttered with manual deduplication. HashSet exists to make those problems disappear at the data-structure level.
The problem HashSet solves is simple but important: how do you store a group of items where order doesn't matter, duplicates are impossible, and checking membership needs to be near-instant? ArrayList fails on all three counts at scale. HashSet uses a hash table under the hood — each element gets a numeric hash code, which maps it to a bucket in an array. Lookup becomes a matter of jumping to the right bucket, not scanning the whole collection. That's how it achieves O(1) average-time contains(), add(), and remove().
By the end of this article you'll understand not just how to use HashSet, but why it's built the way it is, what happens internally when you call add(), how it plays with equals() and hashCode(), where it beats and loses to its siblings TreeSet and LinkedHashSet, and the exact mistakes that cause silent bugs in production code. You'll also walk away with answers to the three HashSet questions interviewers love to ask.
Why HashSet Is Not a Magic O(1) Collection
A HashSet is a hash-table-backed Set implementation that stores unique elements using their hashCode() to determine the bucket index. It offers O(1) average-time add, remove, and contains operations — but only if hash codes are stable and well-distributed. The core mechanic: each element's hash code determines its bucket; equals() resolves collisions within a bucket.
Key properties: null is allowed, iteration order is unpredictable and can change across JVM runs, and the internal array resizes (rehashing) when load factor exceeds 0.75. The real performance cliff is not the O(n) worst-case collision chain — it's mutable objects whose hashCode changes after insertion. That silently corrupts the bucket mapping, making the element permanently unreachable via contains() or remove(), and bloating the set with orphaned entries.
Use HashSet when you need fast membership checks and don't care about order. It's ideal for deduplication pipelines, caching unique IDs, or tracking processed items in event streams. But never store mutable objects whose hashCode can change — that's the single most common production failure with this collection.
Mutable hashCode = silent data loss
If an object's hashCode changes after insertion, the set can't find it anymore — but it still occupies a bucket, wasting memory and breaking correctness.
Production Insight
A team stored mutable POJOs in a HashSet for a dedup cache. After updating a field used in hashCode, the same logical object could be inserted again, causing duplicates and 30% cache loss.
Symptom: contains() returns false for objects that are definitely in the set; memory grows unboundedly.
Rule: Never mutate fields used in hashCode or equals while the object is in a HashSet. Use immutable keys or remove-then-reinsert.
Key Takeaway
HashSet offers O(1) average performance only if hashCode is stable and well-distributed.
Mutable objects in a HashSet cause silent data loss and memory leaks — never do it.
Use immutable keys or ensure the object is removed before mutation and reinserted after.
thecodeforge.io
HashSet Internal Structure & Performance Pitfalls
Hashset Java
How HashSet Actually Stores Elements — The Bucket Mechanism
A HashSet is backed by a HashMap. When you call add(element), Java calls element.hashCode(), takes that integer, does some internal bit-mixing, and uses the result to pick a bucket — essentially a slot in an internal array. If that bucket is empty, the element goes in. If the bucket already has an occupant, Java calls equals() to check whether it's the same object. If equals() returns true, the element is rejected silently — no exception, no error, it just doesn't get added. If equals() returns false (a hash collision), both elements coexist in the same bucket as a linked list.
This two-step process — hashCode() to find the bucket, equals() to confirm identity — is the most important thing to understand about HashSet. It also reveals a critical contract: if two objects are equal according to equals(), they MUST return the same hashCode(). Break that contract and your HashSet will silently store duplicates as if they're different objects.
The default initial capacity is 16 buckets, with a load factor of 0.75. When 75% of the buckets are occupied, the internal array doubles in size and all elements are rehashed. This is called resizing, and it's the one expensive operation in HashSet's life.
HashSetInternals.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import java.util.HashSet;
import java.util.Set;
publicclassHashSetInternals {
publicstaticvoidmain(String[] args) {
// --- Basic uniqueness guarantee ---Set<String> visitedUrls = newHashSet<>();
visitedUrls.add("https://thecodeforge.io/java");
visitedUrls.add("https://thecodeforge.io/python");
visitedUrls.add("https://thecodeforge.io/java"); // duplicate — silently ignored// Size is 2, not 3. The duplicate was rejected without any error.System.out.println("UniqueURLs tracked: " + visitedUrls.size()); // 2// --- O(1) membership check ---String urlToCheck = "https://thecodeforge.io/java";if (visitedUrls.contains(urlToCheck)) {
// This lookup is near-instant whether the set has 3 or 3 million elementsSystem.out.println("Already crawled: " + urlToCheck);
}
// --- add() returns a boolean — use it! ---
boolean wasNew = visitedUrls.add("https://thecodeforge.io/java"); // already existsSystem.out.println("URL was newly added: " + wasNew); // false — tells you it was a duplicate
boolean isNew = visitedUrls.add("https://thecodeforge.io/go"); // brand newSystem.out.println("URL was newly added: " + isNew); // true// --- Order is NOT guaranteed ---// Iterating a HashSet gives you elements in an unpredictable orderSystem.out.println("\nAll tracked URLs (order not guaranteed):");
for (String url : visitedUrls) {
System.out.println(" " + url);
}
}
}
Output
Unique URLs tracked: 2
Already crawled: https://thecodeforge.io/java
URL was newly added: false
URL was newly added: true
All tracked URLs (order not guaranteed):
https://thecodeforge.io/go
https://thecodeforge.io/java
https://thecodeforge.io/python
Pro Tip: add() returns a boolean — don't ignore it
Most developers write 'set.add(item)' and throw away the return value. That boolean tells you whether the element was actually new. In deduplication pipelines, counting those 'false' returns tells you exactly how many duplicates you caught — that's business-critical telemetry.
Production Insight
Resizing at 75% load factor is the only expensive O(n) operation.
If you know the final size in advance, create HashSet with initial capacity: new HashSet<>(expectedSize / 0.75f + 1).
Rule: pre-size to avoid repeated resizing under high insert throughput.
The equals() and hashCode() Contract — Where Most Bugs Are Born
When you store Strings or Integer wrappers in a HashSet, everything works perfectly because those classes implement hashCode() and equals() correctly. The danger arrives when you create your own class and put instances of it into a HashSet.
By default, every Java object inherits hashCode() from Object, which returns a value based on the object's memory address. Two distinct objects with identical field values will produce different hash codes. To HashSet, they look like completely different elements — so it happily stores both, silently creating duplicates.
The fix is always the same: override both equals() AND hashCode() together. Never one without the other. Modern IDEs and libraries like Lombok can generate them, but you need to understand why they're needed before you trust the automation.
The contract is: if a.equals(b) is true, then a.hashCode() must equal b.hashCode(). The reverse doesn't need to hold — two objects can share a hash code (collision) but still be not-equal. Keep that asymmetry in mind.
ProductDeduplication.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;
publicclassProductDeduplication {
// A product SKU from an e-commerce cataloguestaticclassProduct {
privatefinalString sku;
privatefinalString name;
Product(String sku, String name) {
this.sku = sku;
this.name = name;
}
// Without overriding these, two Product objects with the same SKU// would be treated as completely different by HashSet.
@Overridepublicbooleanequals(Object other) {
if (this == other) return true; // same reference — definitely equalif (!(other instanceof Product)) return false; // wrong type — definitely not equalProduct otherProduct = (Product) other;
// Business rule: two Products are the same if their SKUs matchreturnObjects.equals(this.sku, otherProduct.sku);
}
@OverridepublicinthashCode() {
// Must use the same field(s) as equals() — SKU drives bothreturnObjects.hash(sku);
}
@OverridepublicStringtoString() {
return"Product[sku=" + sku + ", name=" + name + "]";
}
}
publicstaticvoidmain(String[] args) {
Set<Product> catalogue = newHashSet<>();
// Simulating products loaded from two different data sourcesProduct fromDatabaseA = newProduct("BOOT-42", "Trail Boot Size 42");
Product fromDatabaseB = new Product("BOOT-42", "TrailBootSize42"); // same SKU, different objectProduct differentProduct = newProduct("HAT-01", "Hiking Hat");
catalogue.add(fromDatabaseA);
catalogue.add(fromDatabaseB); // rejected — equals() + hashCode() say it's the same product
catalogue.add(differentProduct);
// If we had NOT overridden equals/hashCode, size would be 3 (a silent bug!)System.out.println("Unique products in catalogue: " + catalogue.size()); // 2// contains() also works correctly nowProduct lookupProduct = newProduct("HAT-01", "Hiking Hat");
System.out.println("Catalogue contains hat: " + catalogue.contains(lookupProduct)); // trueSystem.out.println("\nCatalogue contents:");
catalogue.forEach(System.out::println);
}
}
Output
Unique products in catalogue: 2
Catalogue contains hat: true
Catalogue contents:
Product[sku=HAT-01, name=Hiking Hat]
Product[sku=BOOT-42, name=Trail Boot Size 42]
Watch Out: Mutating a field used in hashCode() after insertion
If you add an object to a HashSet and then change a field that's part of its hashCode() calculation, the object moves to a new bucket — but HashSet doesn't know. You'll never find it again with contains(), and you'll never be able to remove() it. Treat objects in a HashSet as effectively immutable, or use only fields that never change as the basis for equals/hashCode.
Production Insight
Silent duplicates from missing hashCode/equals are hard to spot because the set just grows.
First symptom is often a memory warning or unexpected size in metrics.
Rule: always write a unit test that adds two distinct instances with equal fields and asserts size is 1.
Key Takeaway
Override equals AND hashCode together, using the same fields.
a.equals(b) -> a.hashCode() == b.hashCode().
Mutate hash fields after insertion and you lose the element forever.
HashSet vs LinkedHashSet vs TreeSet — Choosing the Right Tool
All three implement the Set interface and guarantee uniqueness, but they make different trade-offs around ordering and performance. Choosing the wrong one is a common source of subtle bugs — especially when code that 'works' in testing breaks in production because iteration order changed.
HashSet is the default choice. No ordering, pure speed. Use it whenever you don't care about the sequence of elements.
LinkedHashSet maintains insertion order — elements come out in the order you put them in. It does this by linking each bucket entry to the previous and next one with a doubly-linked list. You pay a small memory overhead but gain predictable iteration. Use it when order matters for display or reproducibility.
TreeSet maintains sorted order — elements are always iterated in their natural order (or a custom Comparator you provide). It uses a Red-Black tree internally, so all core operations are O(log n) rather than O(1). Use it when you need a sorted unique collection or range queries like headSet() and tailSet().
The rule of thumb: start with HashSet. Upgrade to LinkedHashSet when reproducibility matters. Upgrade to TreeSet when sorted order matters.
SetVariantsComparison.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;
publicclassSetVariantsComparison {
publicstaticvoidmain(String[] args) {
// Imagine these are user-selected tags on a blog post, added in this orderString[] selectedTags = {"java", "backend", "spring", "java", "performance", "backend"};
// --- HashSet: unique, but order is unpredictable ---Set<String> hashTags = newHashSet<>();
for (String tag : selectedTags) hashTags.add(tag);
System.out.println("HashSet (no order guarantee): " + hashTags);
// Output order will vary between JVM runs// --- LinkedHashSet: unique AND preserves insertion order ---Set<String> linkedTags = newLinkedHashSet<>();
for (String tag : selectedTags) linkedTags.add(tag);
System.out.println("LinkedHashSet (insertion order): " + linkedTags);
// Always: [java, backend, spring, performance] — first occurrence wins// --- TreeSet: unique AND always alphabetically sorted ---Set<String> sortedTags = newTreeSet<>();
for (String tag : selectedTags) sortedTags.add(tag);
System.out.println("TreeSet (sorted order): " + sortedTags);
// Always: [backend, java, performance, spring]// --- Range query — a TreeSet superpower HashSet can't do ---TreeSet<Integer> pageNumbers = newTreeSet<>();
for (int page = 1; page <= 20; page++) pageNumbers.add(page);
// Give me all pages from 5 up to (but not including) 10System.out.println("\nPages 5 to 9: " + pageNumbers.subSet(5, 10));
// Give me all pages from 15 onwardsSystem.out.println("Pages 15+: " + pageNumbers.tailSet(15));
}
}
Output
HashSet (no order guarantee): [java, spring, performance, backend]
Interview Gold: Why can't you use subSet() on a HashSet?
subSet(), headSet() and tailSet() are defined on the NavigableSet interface, which TreeSet implements but HashSet does not. HashSet has no notion of order, so range queries are meaningless to it. If an interviewer asks about this, the answer connects interface design to internal data structure — that's the depth they're looking for.
Production Insight
Order-dependent code that works during testing can break in production if you assume iteration order of HashSet.
A staging test with low data volume may show consistent order that changes under load after resizing.
Rule: if you rely on iteration order, use LinkedHashSet or TreeSet explicitly — don't rely on HashSet's accidental stability.
Key Takeaway
HashSet: fastest, no order guaranteed.
LinkedHashSet: insertion order preserved, small memory premium.
TreeSet: sorted order, O(log n) operations, supports range queries.
Real-World Patterns — Where HashSet Actually Earns Its Keep
Theory is great, but HashSet shines brightest in specific real-world patterns. Understanding these patterns means you'll reach for the right tool instinctively.
The most common pattern is the visited/seen set — a fast-rejection gate. You're processing a stream of events, URLs, user IDs, or transactions and you want to skip anything you've already handled. A HashSet lets you check-and-mark in one add() call and the O(1) lookup means it stays fast even as the seen set grows to millions of entries.
The second pattern is set operations — intersection, union, and difference. The Collections API makes these one-liners with retainAll(), addAll(), and removeAll(). These are exactly how you'd compute shared followers between two users, unique items across two orders, or features present in one plan but not another.
A third pattern is deduplication before storage. When your application receives data from multiple sources that might overlap, running everything through a HashSet before persisting it is far cheaper than writing duplicates to a database and cleaning them up later.
SetOperationsDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
publicclassSetOperationsDemo {
publicstaticvoidmain(String[] args) {
// --- Pattern 1: Seen-set for deduplication in a stream ---System.out.println("=== Pattern 1: Deduplication Gate ===");
Set<String> processedOrderIds = newHashSet<>();
String[] incomingOrders = {"ORD-001", "ORD-002", "ORD-001", "ORD-003", "ORD-002"};
for (String orderId : incomingOrders) {
// add() returns false if already present — use that as a skip signalif (!processedOrderIds.add(orderId)) {
System.out.println("Skipping duplicate order: " + orderId);
} else {
System.out.println("Processing new order: " + orderId);
}
}
// --- Pattern 2: Set intersection — find common elements ---System.out.println("\n=== Pattern 2: Set Operations ===");
// Permissions assigned to two different user rolesSet<String> editorPermissions = newHashSet<>(
Arrays.asList("read", "write", "comment", "upload")
);
Set<String> viewerPermissions = newHashSet<>(
Arrays.asList("read", "comment")
);
// INTERSECTION — permissions that both roles shareSet<String> sharedPermissions = newHashSet<>(editorPermissions);
sharedPermissions.retainAll(viewerPermissions); // modifies in place — keep only what's in bothSystem.out.println("Shared permissions: " + sharedPermissions); // [read, comment]// DIFFERENCE — permissions only editors have (not viewers)Set<String> editorOnlyPermissions = newHashSet<>(editorPermissions);
editorOnlyPermissions.removeAll(viewerPermissions); // remove anything viewers also haveSystem.out.println("Editor-only permissions: " + editorOnlyPermissions); // [write, upload]// UNION — all permissions across both roles combined (no duplicates)Set<String> allPermissions = newHashSet<>(editorPermissions);
allPermissions.addAll(viewerPermissions); // adds only what's not already presentSystem.out.println("Allpermissions (union): " + allPermissions); // [read, write, comment, upload]
}
}
Output
=== Pattern 1: Deduplication Gate ===
Processing new order: ORD-001
Processing new order: ORD-002
Skipping duplicate order: ORD-001
Processing new order: ORD-003
Skipping duplicate order: ORD-002
=== Pattern 2: Set Operations ===
Shared permissions: [read, comment]
Editor-only permissions: [write, upload]
All permissions (union): [read, write, comment, upload]
Watch Out: retainAll() and removeAll() mutate the original set
Both retainAll() and removeAll() modify the Set you call them on. Always work on a defensive copy — 'new HashSet<>(originalSet)' — if you need to preserve the original. This is one of the most common sources of accidental data loss when doing set operations in production code.
Production Insight
Using HashSet as a dedup gate before DB insert is far cheaper than DB unique constraint violations.
A million adds in a HashSet takes ~200ms while a million DB inserts with duplicate checks can take minutes.
Rule: pre-size the HashSet to expected cardinality to avoid resizing overhead.
Key Takeaway
HashSet excels for dedup gates, set arithmetic, and pre-storage dedup.
Use add() return value as a skip signal.
Always copy before retainAll/removeAll to avoid mutating originals.
Performance & Resizing: When HashSet Gets Expensive
HashSet's O(1) operations are average-case. Under the hood, the array of buckets starts at size 16 with a load factor of 0.75. When the number of elements exceeds capacity * load factor, the array doubles and all elements are rehashed — that means every element's hashCode() is called again, and each lands in a new bucket. This is an O(n) operation.
If you insert a million elements into a default HashSet, it will resize about 17 times (16 -> 32 -> 64 -> ... -> 1,048,576). Each resize triggers a full rehash. The total cost of resizing can dominate your insertion time if you don't pre-size.
Pre-sizing: if you know you'll store N unique items, create the HashSet with initial capacity = (int)(N / 0.75) + 1. This avoids all resizes. The +1 accounts for rounding errors from integer division.
Another performance trap: poor hashCode() implementations that produce many collisions. When many elements land in the same bucket, the linked list grows and operations degrade to O(m) where m is the bucket length. In extreme cases, a badly distributed hashCode() can turn HashSet into a slow ArrayList. Java 8 improved this by converting long linked lists to trees when a bucket holds more than 8 elements, but that's a fallback — not an excuse for a bad hash function.
HashSetPreSizing.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import java.util.HashSet;
import java.util.Set;
publicclassHashSetPreSizing {
publicstaticvoidmain(String[] args) {
int expectedSize = 1_000_000;
// Calculate optimal initial capacity to avoid resizingint initialCapacity = (int)(expectedSize / 0.75f) + 1;
long start = System.nanoTime();
Set<Integer> sizedSet = newHashSet<>(initialCapacity);
for (int i = 0; i < expectedSize; i++) {
sizedSet.add(i);
}
long durationSized = System.nanoTime() - start;
start = System.nanoTime();
Set<Integer> defaultSet = newHashSet<>();
for (int i = 0; i < expectedSize; i++) {
defaultSet.add(i);
}
long durationDefault = System.nanoTime() - start;
System.out.println("Pre-sized HashSet: " + durationSized / 1_000_000 + " ms");
System.out.println("Default HashSet: " + durationDefault / 1_000_000 + " ms");
// Pre-sizing typically cuts insertion time by 30-50% for large sets
}
}
Output
Pre-sized HashSet: 480 ms
Default HashSet: 760 ms
Memory vs Speed Trade-off
Pre-sizing allocates a larger internal array upfront, which uses more memory but avoids the CPU cost of repeated rehashing. For most applications, the memory overhead is trivial compared to the speed gain. Use pre-sizing when you know the final size within 20%.
Production Insight
In high-throughput services, unplanned resizing can cause latency spikes as the GC kicks in to clean old arrays.
If your HashSet holds millions of entries and you see periodic GC pauses, check resizing costs.
Rule: always pre-size when the approximate data volume is known at design time.
Key Takeaway
Default capacity=16, load factor=0.75.
Resizing doubles array and rehashes all elements — O(n) cost.
Pre-size using expectedSize / 0.75 + 1 to avoid resizing entirely.
The Load Factor Trap: Why Default 0.75 Is Not Always Right
Most tutorials parrot the default 0.75 load factor without explaining why it matters. Here's the truth: that 0.75 is a time-space tradeoff. Lower it to 0.5 and you waste memory—more buckets, fewer collisions, faster reads. Raise it to 0.9 and you save memory but risk degrading to O(log n) as collisions force treeification. The default works for most cases, but not for real-time systems or memory-constrained environments. For latency-sensitive apps, use 0.5-0.6. For large datasets where memory is tight, 0.8 is safer. Never use over 0.9 unless you love unpredictable pauses during rehashing. The worst mistake? Using the default for a set you know will hold millions of elements. Pre-size your HashSet using initialCapacity = (expectedSize / loadFactor) + 1 to avoid multiple resizes. That single line can cut insertion time by 40%.
Don't rely on auto-resizing for high-throughput systems. Each resize doubles the array and rehashes every element—a full stop for the calling thread. Profile your expected cardinality and pre-size accordingly.
Key Takeaway
HashSet resizing is a hidden O(n) event. Pre-size or pay the latency tax.
HashCode Collisions: When Your HashSet Degrades to a LinkedList (or Tree)
Here's what happens when your hashCode() is broken: two unequal objects land in the same bucket. In Java 7 and earlier, that bucket becomes a linked list—O(n) lookups. Java 8 fixed this partially: when a bucket exceeds 8 entries, it treeifies into a Red-Black Tree, capping worst-case at O(log n). But that treeification costs memory (extra node objects) and CPU (tree rebalancing). You don't want it. The fix? Write hash functions that spread objects evenly. For a custom key class, multiply by a prime and shift bits: return (int)(id ^ (id >>> 32)) * 31. Never return a constant hash code—that puts every element in one bucket, making your HashSet a slow ArrayList. The real-world symptom: your production app has intermittent spikes when a bad actor sends crafted objects that collide. Always test with a million random objects and measure bucket distribution.
CollisionDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import java.util.*;
classBadKey {
int id;
BadKey(int id) { this.id = id; }
public int hashCode() { return 42; } // All same bucket!publicbooleanequals(Object o) { return o instanceofBadKey && ((BadKey)o).id == this.id; }
}
publicclassCollisionDemo {
publicstaticvoidmain(String[] args) {
Set<BadKey> set = newHashSet<>();
long start = System.nanoTime();
for (int i = 0; i < 10_000; i++) set.add(newBadKey(i));
System.out.println("Time for 10K with bad hash: " + (System.nanoTime() - start) / 1_000_000 + " ms");
Set<Integer> goodSet = newHashSet<>();
start = System.nanoTime();
for (int i = 0; i < 10_000; i++) goodSet.add(i);
System.out.println("Time for 10K with good hash: " + (System.nanoTime() - start) / 1_000_000 + " ms");
}
}
Output
Time for 10K with bad hash: 34 ms
Time for 10K with good hash: 2 ms
Production Insight:
Treeification kicks in at threshold 8, but only if the key class implements Comparable. If it doesn't, the bucket stays as a linked list indefinitely. Always implement Comparable on your custom keys for a free performance guarantee.
Key Takeaway
A bad hash function can tank HashSet performance by 10x. Spread your bits or spread your pain.
● Production incidentPOST-MORTEMseverity: high
The Cache That Lost 30% of Its Entries Overnight
Symptom
The cache hit rate dropped from 85% to 55% within hours of deployment. contains() returned false for objects that had been added minutes earlier, causing expensive recomputation.
Assumption
The team assumed that once an object was added to the HashSet, it would remain discoverable forever. They had overridden equals() and hashCode() based on a mutable 'lastAccessed' field to speed up eviction logic.
Root cause
When 'lastAccessed' changed, the object's hashCode changed. The bucket location changed, but the set still held the old bucket. contains() looked in the new bucket, didn't find it, and returned false. The object was effectively orphaned but still consuming memory — a slow memory leak.
Fix
Changed the hashCode() to use only immutable fields (ID + creation timestamp). The 'lastAccessed' field was moved out of hashCode/equals. Added a unit test that mutates the object and verifies the set still contains it after mutation.
Key lesson
Never include mutable fields in hashCode() or equals() for objects stored in a HashSet
Treat objects in a HashSet as immutable — or design them so that identity fields never change
Write tests that explicitly mutate objects after insertion and verify set membership
Production debug guideCommon symptoms and the exact actions to diagnose and fix them4 entries
Symptom · 01
contains() returns false for an element you know was added
→
Fix
Check if the element's hashCode() changed after insertion. Print the hash code at add time and at check time. Look for mutable fields in hashCode().
Symptom · 02
Set size is larger than expected, duplicates present
→
Fix
Verify that equals() and hashCode() are both overridden consistently. Two objects with identical field values must return same hashCode(). Print hash codes for suspected duplicates.
Symptom · 03
Iterating the set shows elements in different order across runs
→
Fix
That's expected for HashSet — the order is not guaranteed. If reproducible order is needed, switch to LinkedHashSet.
Symptom · 04
NullPointerException when adding null?
→
Fix
HashSet allows one null. NPE occurs only if you call hashCode() on null. But HashSet handles null internally. If you see NPE, check if you're using TreeSet (throws NPE on null).
★ Quick Hash Set Debug CommandsCommands and code snippets to quickly diagnose common HashSet problems in production
Element not found after mutation−
Immediate action
Print hash codes before and after mutation to confirm change
Commands
System.out.println(element.hashCode()); // before mutation
System.out.println(element.hashCode()); // after mutation
Fix now
Redesign hashCode() to use only immutable fields. Remove mutable fields from equals() as well.
Set contains duplicates of custom objects+
Immediate action
Print field-by-field comparison for suspected duplicates
If orphaned objects found, they were mutated after insertion. The only fix is to avoid mutation of hash/equals fields.
Feature / Aspect
HashSet
LinkedHashSet
TreeSet
Internal structure
HashMap (array of buckets)
LinkedHashMap (buckets + linked list)
Red-Black Tree
Allows duplicates
No
No
No
Allows null
Yes (one null)
Yes (one null)
No — throws NullPointerException
Iteration order
Unpredictable
Insertion order
Sorted (natural or Comparator)
add() / remove() / contains()
O(1) average
O(1) average
O(log n)
Memory overhead
Low
Medium (extra linked list)
High (tree node objects)
Use when you need
Fast uniqueness, no order care
Unique + reproducible order
Unique + sorted + range queries
Implements NavigableSet
No
No
Yes
Thread-safe
No
No
No
Key takeaways
1
HashSet gives you O(1) average-time add(), remove(), and contains() by using a hash table
that's why it beats ArrayList for membership checks at scale, where ArrayList's contains() is O(n).
2
HashSet is backed by a HashMap
your elements are stored as keys with a dummy constant as the value. Everything you know about how HashMap hashes keys applies directly to HashSet elements.
3
If you store custom objects in a HashSet, you must override both equals() AND hashCode() using the same fields
skip either one and you get silent duplicate storage, which is one of the hardest bugs to diagnose.
4
The three Set implementations form a clear hierarchy of trade-offs
HashSet for raw speed with no ordering, LinkedHashSet for insertion-order preservation, and TreeSet for sorted order and range queries — pick based on what your code actually needs, not habit.
5
Pre-size your HashSet when you know the approximate number of elements to avoid expensive resize/rehash operations. Formula
Forgetting to override hashCode() when overriding equals()
Symptom
Two distinct objects with identical field values are both added to the HashSet, silently creating duplicates. The set grows bigger than expected. contains() returns false for fields you know are present because they end up in different buckets.
Fix
Always override both equals() and hashCode() together. Use the same fields in both methods. Use Objects.hash(field1, field2) for a safe, null-tolerant hashCode() implementation.
×
Mutating an object's hashCode-contributing field after adding it to a HashSet
Symptom
contains() returns false for an element you're certain is in the set. remove() fails silently. The object is stranded in the wrong bucket forever, causing a slow memory leak as the orphaned element never gets cleaned up.
Fix
Design objects stored in a HashSet to be effectively immutable. Ensure fields used in hashCode() are declared final and set only in the constructor. If mutation is unavoidable, remove the object from the set before mutation and re-add it afterwards.
×
Using HashSet when thread-safe set is needed in concurrent environment
Symptom
Intermittent data corruption, elements mysteriously disappearing, or infinite loops during concurrent iteration. The internal arrays can become inconsistent when multiple threads add/remove simultaneously.
Fix
Use Collections.synchronizedSet(new HashSet<>()) for coarse-grained locking, or ConcurrentHashMap.newKeySet() for better concurrent throughput (backed by ConcurrentHashMap, avoids full lock on every operation).
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01JUNIOR
HashSet is backed by a HashMap internally — can you explain what that me...
Q02SENIOR
If two objects have the same hashCode() but equals() returns false, what...
Q03SENIOR
You add an object to a HashSet, then modify a field that's included in i...
Q01 of 03JUNIOR
HashSet is backed by a HashMap internally — can you explain what that means in practice, and what gets stored as the key and value in that underlying map?
ANSWER
When you add an element to a HashSet, the element becomes the key in the internal HashMap. The value is a shared dummy Object constant (PRESENT). All keys in a HashMap must be unique, so the set enforces uniqueness automatically. HashMap operations like put() are reused for add(), containsKey() for contains(), etc. The hash table structure with buckets and collision handling is identical. This also means that the requirements for a good key in HashMap (immutable, proper equals/hashCode) apply exactly to elements stored in HashSet.
Q02 of 03SENIOR
If two objects have the same hashCode() but equals() returns false, what happens when you try to add both of them to a HashSet? What's this situation called and how does HashSet handle it?
ANSWER
This is called a hash collision. Both elements land in the same bucket. The bucket initially contains the first element. When adding the second, HashSet finds the bucket occupied, then calls equals() to compare. Since equals() returns false, the second element is also stored in the same bucket, typically as a linked list node (Java 8+ converts to a tree if the bucket exceeds 8 elements). The set still works correctly — contains() will traverse the bucket's linked list (or tree) and use equals() to find a match. The only downside is performance degrades from O(1) to O(m) where m is the number of colliding elements in that bucket.
Q03 of 03SENIOR
You add an object to a HashSet, then modify a field that's included in its hashCode() calculation. You then call contains() with that same object reference — what does it return and why? How would you prevent this bug?
ANSWER
contains() returns false. The object's hashCode() changed after insertion, so its new hash points to a different bucket than where it's stored. HashSet looks in the new bucket, doesn't find it, and returns false. The object remains in the old bucket, effectively orphaned — it can never be found again, and the set's size doesn't reflect it (since size isn't decremented). This is a memory leak. Prevention: never mutate hashCode-related fields after the object is in a HashSet. Either make the object immutable (all final fields), or use a mutable wrapper that you remove before mutation and re-add after. Some developers use a separate 'immutable' ID or UUID for identity that never changes.
01
HashSet is backed by a HashMap internally — can you explain what that means in practice, and what gets stored as the key and value in that underlying map?
JUNIOR
02
If two objects have the same hashCode() but equals() returns false, what happens when you try to add both of them to a HashSet? What's this situation called and how does HashSet handle it?
SENIOR
03
You add an object to a HashSet, then modify a field that's included in its hashCode() calculation. You then call contains() with that same object reference — what does it return and why? How would you prevent this bug?
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Can a Java HashSet contain null values?
Yes, but only one null. The first time you add null, it's stored. Any subsequent attempt to add null is silently ignored just like any other duplicate. Be careful: calling hashCode() on null would throw a NullPointerException, so HashSet has special-case logic that places null directly into bucket 0, bypassing the hashCode() call entirely.
Was this helpful?
02
Does HashSet maintain insertion order in Java?
No. HashSet makes no guarantees about iteration order, and the order can change between JVM runs or even after the internal array resizes due to a high load factor. If you need insertion order preserved, use LinkedHashSet instead — it's a drop-in replacement with the same interface but maintains a doubly-linked list across its entries.
Was this helpful?
03
What is the difference between HashSet and HashMap in Java?
A HashMap stores key-value pairs where each key is unique. A HashSet stores only values (no pairs) and guarantees uniqueness among them. Internally, HashSet is literally implemented as a HashMap where your set element is the key and a shared dummy Object constant is the value — so every unique element lookup is just a key lookup in the backing map.
Was this helpful?
04
How does HashSet handle hash collisions?
When two different elements have the same hash code (collision), they are stored in the same bucket. Java 7 and earlier used a linked list per bucket. Starting with Java 8, when a bucket exceeds 8 elements, the list is converted to a balanced tree (O(log n) lookup instead of O(n)). This prevents denial-of-service attacks against deliberately colliding hash codes.
Was this helpful?
05
Is HashSet thread-safe?
No. HashSet is not thread-safe. Concurrent modifications from multiple threads without external synchronization can corrupt the internal data structure, leading to infinite loops, missing elements, or ArrayIndexOutOfBoundsException. Use Collections.synchronizedSet(new HashSet<>()) for simple cases, or ConcurrentHashMap.newKeySet() for better concurrency performance.