
Web Reconnaissance Cheatsheet

A cheatsheet of methods and tools for Web Reconnaissance

November 4, 2025 · September 22, 2025
Author: Hung Nguyen Tuong

This cheatsheet was summarized by AI from the Information Gathering - Web Edition module on HackTheBox.

Reconnaissance Methodology

Two fundamental methodologies:

Active: port scanning, vulnerability scanning, network mapping, banner grabbing, OS fingerprinting, service enumeration, web spidering.

Passive: search engines, WHOIS, DNS, web archives, social media, code repositories.

WHOIS

Provides: registrar, registration/expiration dates, nameservers, and contact information of the domain owner.

Value in pentest:

  • Identify key personnel (names, emails, phone numbers) for social engineering and phishing.
  • Discover network infrastructure (name servers, IP addresses) to find entry points and misconfiguration.
  • Analyze historical data via WhoisFreaks to track changes in ownership and technical details.

Note: data may be inaccurate or obscured by privacy services.
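Raw WHOIS output is just labeled `Key: value` text, so the fields listed above can be pulled out with a few lines of Python. A minimal parsing sketch over a trimmed, hypothetical sample record (real output varies by registrar):

```python
# Hypothetical, trimmed WHOIS response used as sample input;
# in practice this text comes from `whois example.com` or a WHOIS API.
SAMPLE = """\
Domain Name: EXAMPLE.COM
Registrar: IANA Reserved
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2026-08-13T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
"""

def parse_whois(text: str) -> dict:
    """Group repeated 'Key: value' lines into lists keyed by field name."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        record.setdefault(key.strip(), []).append(value.strip())
    return record
```

Repeated fields such as `Name Server` are collected into lists, which is convenient when feeding the results into later DNS enumeration.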

DNS

DNS Digging

Tools: dig, nslookup, host, dnsenum, fierce, dnsrecon, theHarvester.

  • Discover assets: subdomains, mail servers, name server records (CNAME pointing to outdated server may be vulnerable).
  • Map network infrastructure: for example, identify hosting provider via NS records, discover load balancer via A records.
  • Monitor changes: new subdomains (VPN endpoints), TXT records revealing tools in use (1Password).

DNS record types:

| Record Type | Description |
| --- | --- |
| A | Map hostname → IPv4 |
| AAAA | Map hostname → IPv6 |
| CNAME | Alias hostname → another hostname |
| MX | Mail servers handling email |
| NS | Delegate DNS zone to name server |
| TXT | Store arbitrary text information |
| SOA | Administrative info of DNS zone |

DNS Zone Transfer

Zone transfer reveals:

  • Subdomains: complete list, including development servers, staging environments, admin panels.
  • IP addresses of each subdomain.
  • Name server records: reveal the hosting provider and possible misconfigurations.
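A successful transfer (e.g. `dig axfr example.com @ns1.example.com`) dumps the whole zone as plain text, one record per line. A small parser can then collect hosts and their addresses; the sketch below runs over hypothetical sample output rather than a live transfer:

```python
# Hypothetical `dig axfr`-style zone output; each line is
#   name  TTL  class  type  data
SAMPLE = """\
example.com.         3600 IN SOA ns1.example.com. admin.example.com. 1 7200 900 1209600 86400
staging.example.com. 3600 IN A   203.0.113.10
vpn.example.com.     3600 IN A   203.0.113.11
example.com.         3600 IN NS  ns1.example.com.
"""

def hosts_from_axfr(text: str) -> set:
    """Collect (hostname, IPv4) pairs from A records in zone-transfer output."""
    hosts = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[3] == "A":
            hosts.add((parts[0].rstrip("."), parts[4]))
    return hosts
```

Names like `staging` and `vpn` surfacing this way are exactly the development servers and entry points mentioned above.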

Subdomain & Virtual Host Enumeration

Subdomains: extensions of the main domain (e.g., blog.example.com) with their own DNS records.

Virtual Hosts: web server configuration allowing multiple websites on a single server, distinguished via the HTTP Host header. They can be top-level domains or subdomains, and each VHost has its own separate config.

Subdomains often contain:

  • Development/staging environments with relaxed security.
  • Hidden login portals (admin panels).
  • Legacy applications with outdated software.
  • Sensitive information: confidential documents, internal data, config files.
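Since virtual hosts are selected purely by the Host header, enumerating them boils down to replaying the same request with different candidate names against one IP. A sketch that only builds the raw request (actually sending it over a socket is left out):

```python
def build_vhost_probe(vhost: str, path: str = "/") -> bytes:
    """Craft a raw HTTP/1.1 request where the Host header carries the
    candidate virtual host name rather than the server's IP."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {vhost}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return request.encode()
```

Tools like ffuf and gobuster automate exactly this: same target IP, rotating Host values, flagging responses that differ from the default site.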

Active methods: zone transfer, ffuf, gobuster, dirbuster, dnsenum, amass, assetfinder, puredns, fierce, dnsrecon, sublist3r.

Passive methods: Certificate Transparency logs, search engines (site:).

Certificate Transparency Logs

Tools: crt.sh, search.censys.io.

Advantages:

  • Not limited by wordlist or brute-force algorithm.
  • Access historical and comprehensive view of subdomains.
  • Discover subdomains with old/expired certificates (potentially vulnerable).
  • No need for brute-forcing or complete wordlist.
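crt.sh can return results as JSON (`output=json`); each entry's `name_value` field may pack several certificate names separated by newlines. A parsing sketch over an inlined, hypothetical sample instead of a live query:

```python
import json

# Trimmed sample shaped like crt.sh JSON output
# (https://crt.sh/?q=%25.example.com&output=json).
SAMPLE = """[
  {"name_value": "www.example.com\\ndev.example.com"},
  {"name_value": "*.example.com"},
  {"name_value": "www.example.com"}
]"""

def extract_subdomains(raw: str) -> set:
    """De-duplicate certificate names; name_value may hold several SAN
    entries separated by newlines, and wildcards are stripped."""
    names = set()
    for entry in json.loads(raw):
        for name in entry["name_value"].split("\n"):
            names.add(name.lstrip("*."))
    return names
```

The de-duplicated set is a ready-made seed list for the active enumeration tools above.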

Fingerprinting

Techniques: banner grabbing, HTTP headers, unique responses (error messages), page content (source code, scripts, copyright).

Tools: Wappalyzer, BuiltWith, WhatWeb, Nmap, Netcraft, wafw00f, Nikto.
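Header-based fingerprinting can be as simple as filtering a response's header map for technology-revealing names. A minimal sketch with hypothetical header values (the tools above do this plus body-content and behavioral checks):

```python
def fingerprint(headers: dict) -> dict:
    """Keep only headers that commonly leak the server-side stack."""
    interesting = {"server", "x-powered-by", "x-aspnet-version", "x-generator"}
    return {name: value for name, value in headers.items()
            if name.lower() in interesting}

# Hypothetical response headers from a target
found = fingerprint({
    "Server": "nginx/1.24.0",
    "X-Powered-By": "PHP/8.1.2",
    "Date": "Mon, 01 Jan 2024 00:00:00 GMT",
})
```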

Crawling

Crawling (spidering): automated process of browsing the web, following links to collect information.

Breadth-first: prioritize width, crawl all links on seed page before going deep. Good for website structure overview.

Depth-first: prioritize depth, follow one path of links as far as possible before backtracking. Good for finding specific content or reaching deep structure.
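The two strategies differ only in how the frontier of discovered links is consumed: a queue gives breadth-first, a stack gives depth-first. A sketch over a toy in-memory link graph (no network; the page names are made up):

```python
from collections import deque

# Toy link graph standing in for pages fetched over HTTP.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post1"],
    "/team": [],
    "/blog/post1": [],
}

def crawl(seed: str, depth_first: bool = False) -> list:
    """Visit pages from seed; pop() = stack (DFS), popleft() = queue (BFS)."""
    seen, order = {seed}, []
    frontier = deque([seed])
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Breadth-first visits every page linked from `/` before descending; depth-first runs down `/blog` to `/blog/post1` before returning to `/about`.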

Information collected:

  • Links: internal and external links to map structure, discover hidden pages.
  • Comments: may reveal sensitive details, internal processes, hints about vulnerabilities.
  • Metadata: titles, descriptions, keywords, author names, dates.
  • Sensitive files: backup files (.bak, .old), config files (web.config, settings.php), log files (error_log, access_log), files containing passwords, API keys.

Crawlers

  • Burp Suite Spider: powerful active crawler, map web apps, discover hidden content and vulnerabilities.
  • OWASP ZAP: free, open-source scanner, includes spider component.
  • Scrapy: flexible, scalable Python framework, build custom crawlers.
  • Apache Nutch: Java, highly extensible and scalable, handle massive crawls.

ReconSpider (custom Scrapy spider):

# Download and unpack the spider (Python 3 with Scrapy installed)
wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
unzip ReconSpider.zip
# Crawl the target; output is written to results.json
python3 ReconSpider.py http://example.com

Results saved in results.json.

robots.txt

Text file in root directory, structure:

User-agent: specifies which crawler/bot the rules apply to. Wildcard (*) = all bots.

Directives:

| Directive | Description | Example |
| --- | --- | --- |
| Disallow | Paths bot should not crawl | Disallow: /admin/ |
| Allow | Explicitly permit crawling specific paths | Allow: /public/ |
| Crawl-delay | Delay between requests (seconds) | Crawl-delay: 10 |
| Sitemap | URL to XML sitemap | Sitemap: https://example.com/sitemap.xml |

Value:

  • Disallowed paths often contain sensitive info, backup files, admin panels.
  • Analyze allowed/disallowed paths to create website map.
  • Some disallowed entries may be honeypot directories set up to lure and detect malicious bots.
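Python's standard library can evaluate these rules directly via `urllib.robotparser`; in the sketch below the file content is inlined rather than fetched from a live `/robots.txt`:

```python
import urllib.robotparser

# Inlined sample; in practice fetch https://target/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /public/
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# From a recon perspective, the *disallowed* paths are the interesting ones.
blocked = [p for p in ("/admin/secret", "/public/page")
           if not rp.can_fetch("*", p)]
```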

Well-Known URIs

Per RFC 8615, .well-known at root domain (/.well-known/) contains metadata: config files, services, protocols, security mechanisms.

openid-configuration: part of OpenID Connect Discovery protocol (identity layer on OAuth 2.0). Endpoint https://example.com/.well-known/openid-configuration returns JSON containing metadata about provider endpoints, authentication methods, token issuance.
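The discovery document is plain JSON, so recon amounts to fetching it and listing the advertised URLs. A sketch over a trimmed, hypothetical document (a real one is served at the `/.well-known/openid-configuration` endpoint above):

```python
import json

# Trimmed, hypothetical openid-configuration document.
DOC = """{
  "issuer": "https://example.com",
  "authorization_endpoint": "https://example.com/oauth2/authorize",
  "token_endpoint": "https://example.com/oauth2/token",
  "jwks_uri": "https://example.com/oauth2/keys"
}"""

def endpoints(doc: str) -> dict:
    """Keys ending in _endpoint or _uri are URLs worth probing during recon."""
    cfg = json.loads(doc)
    return {k: v for k, v in cfg.items()
            if k.endswith(("_endpoint", "_uri"))}
```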

Google Dorking

Using search operators to discover sensitive information, vulnerabilities, hidden content:

| Operator | Description | Example |
| --- | --- | --- |
| site: | Limit results to domain | site:example.com |
| inurl: | Find term in URL | inurl:login |
| filetype: | Find specific file type | filetype:pdf |
| intitle: | Find term in title | intitle:"confidential report" |
| intext: / inbody: | Find term in body text | intext:"password reset" |
| cache: | Display cached version | cache:example.com |
| link: | Find pages linking to webpage | link:example.com |
| related: | Find related websites | related:example.com |
| info: | Summary information | info:example.com |
| define: | Define word/phrase | define:phishing |
| numrange: | Find numbers in range | numrange:1000-2000 |
| allintext: | All words in body | allintext:admin password reset |
| allinurl: | All words in URL | allinurl:admin panel |
| allintitle: | All words in title | allintitle:confidential report 2023 |
| AND | Require all terms | site:example.com AND inurl:admin |
| OR | Any term | "linux" OR "ubuntu" |
| NOT | Exclude term | site:bank.com NOT inurl:login |
| * | Wildcard | filetype:pdf user* manual |
| .. | Range search | "price" 100..500 |
| " " | Exact phrase | "information security policy" |
| - | Exclude term | -inurl:sports |
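Dorks are just space-separated `operator:value` pairs, so queries for many targets can be generated programmatically. A small illustrative helper (the function and its parameters are this sketch's own, not a standard API):

```python
def dork(*terms, site=None, inurl=None, filetype=None):
    """Compose a Google dork query string from the operators in the table."""
    parts = list(terms)
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)
```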

Reference: https://www.exploit-db.com/google-hacking-database

Web Archives

Wayback Machine provides:

  • Uncovering hidden assets: old pages, directories, files, subdomains not currently accessible, may contain sensitive info or security flaws.
  • Tracking changes: compare historical snapshots to observe evolution (structure, content, technologies, vulnerabilities).
  • Gathering intelligence: OSINT about past activities, marketing strategies, employees, technology choices.
  • Stealthy reconnaissance: passive activity doesn’t interact directly with target infrastructure, less detectable.
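Beyond the web UI, the Wayback Machine exposes snapshot listings through its CDX API. The sketch below only constructs the query URL; it makes no request, and the parameter choices are one reasonable configuration rather than the only one:

```python
from urllib.parse import urlencode

def cdx_url(domain: str, limit: int = 50) -> str:
    """Build a Wayback Machine CDX API query for snapshots under a domain."""
    params = {
        "url": f"{domain}/*",                     # everything under the domain
        "output": "json",
        "fl": "timestamp,original,statuscode",    # fields to return
        "collapse": "urlkey",                     # de-duplicate repeat captures
        "limit": limit,
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```

Fetching the resulting URL returns a JSON array of captures whose `original` URLs often include pages no longer linked from the live site.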

Automated Reconnaissance Tools

  • FinalRecon: Python, modules for SSL certificate checking, WHOIS, header analysis, crawling. Modular structure enables easy customization.
  • Recon-ng: Python framework, modular, with modules for DNS enumeration, subdomain discovery, port scanning, web crawling, and exploitation.
  • theHarvester: Command-line Python, collect emails, subdomains, hosts, employee names, open ports, banners from search engines, PGP key servers, SHODAN.
  • SpiderFoot: Open-source intelligence automation, integrates multiple data sources to collect IP addresses, domains, emails, social media profiles. Performs DNS lookups, web crawling, port scanning.
  • OSINT Framework: Collection of tools and resources for OSINT, includes social media, search engines, public records.

Public Buckets

https://buckets.grayhatwarfare.com/

Public search portal for “open buckets” (public cloud storage containers) and contents. Search file names, filter by extension, keywords, date ranges to find exposed files.

Aggregates data from Amazon S3, Azure Blob Storage, Google Cloud Storage. Discover misconfigured buckets leaking data.