
Web Reconnaissance Cheatsheet

A cheatsheet of methods and tools for Web Reconnaissance

November 4, 2025 · September 22, 2025
Author: Hung Nguyen Tuong

This cheatsheet was summarized by AI from the Information Gathering - Web Edition module on HackTheBox.

Reconnaissance Methodology

Two fundamental methodologies:

Active: port scanning, vulnerability scanning, network mapping, banner grabbing, OS fingerprinting, service enumeration, web spidering.

Passive: search engines, WHOIS, DNS, web archives, social media, code repositories.

WHOIS

Provides: registrar, registration/expiration dates, nameservers, and contact information of the domain owner.

Value in pentest:

  • Identify key personnel (names, emails, phone numbers) for social engineering and phishing.
  • Discover network infrastructure (name servers, IP addresses) to find entry points and misconfiguration.
  • Analyze historical data via WhoisFreaks to track changes in ownership and technical details.

Note: data may be inaccurate or obscured by privacy services.
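Raw WHOIS output is just labeled `Key: value` text, so the fields listed above can be pulled out with a few lines of Python. A minimal parsing sketch over a trimmed, hypothetical sample record (real output varies by registrar):

```python
# Hypothetical, trimmed WHOIS response used as sample input;
# in practice this text comes from `whois example.com` or a WHOIS API.
SAMPLE = """\
Domain Name: EXAMPLE.COM
Registrar: IANA Reserved
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2026-08-13T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
"""

def parse_whois(text: str) -> dict:
    """Group repeated 'Key: value' lines into lists keyed by field name."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        record.setdefault(key.strip(), []).append(value.strip())
    return record
```

Repeated fields such as `Name Server` are collected into lists, which is convenient when feeding the results into later DNS enumeration.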

DNS

DNS Digging

Tools: dig, nslookup, host, dnsenum, fierce, dnsrecon, theHarvester.

  • Discover assets: subdomains, mail servers, name server records (CNAME pointing to outdated server may be vulnerable).
  • Map network infrastructure: for example, identify hosting provider via NS records, discover load balancer via A records.
  • Monitor changes: new subdomains (VPN endpoints), TXT records revealing tools in use (1Password).

DNS record types:

| Record Type | Description |
| --- | --- |
| A | Map hostname → IPv4 |
| AAAA | Map hostname → IPv6 |
| CNAME | Alias hostname → another hostname |
| MX | Mail servers handling email |
| NS | Delegate DNS zone to name server |
| TXT | Store arbitrary text information |
| SOA | Administrative info of DNS zone |

DNS Zone Transfer

Zone transfer reveals:

  • Subdomains: complete list, including development servers, staging environments, admin panels.
  • IP addresses of each subdomain.
  • Name server records: reveal the hosting provider and possible misconfigurations.
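A successful transfer (e.g. `dig axfr example.com @ns1.example.com`) dumps the whole zone as plain text, one record per line. A small parser can then collect hosts and their addresses; the sketch below runs over hypothetical sample output rather than a live transfer:

```python
# Hypothetical `dig axfr`-style zone output; each line is
#   name  TTL  class  type  data
SAMPLE = """\
example.com.         3600 IN SOA ns1.example.com. admin.example.com. 1 7200 900 1209600 86400
staging.example.com. 3600 IN A   203.0.113.10
vpn.example.com.     3600 IN A   203.0.113.11
example.com.         3600 IN NS  ns1.example.com.
"""

def hosts_from_axfr(text: str) -> set:
    """Collect (hostname, IPv4) pairs from A records in zone-transfer output."""
    hosts = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[3] == "A":
            hosts.add((parts[0].rstrip("."), parts[4]))
    return hosts
```

Names like `staging` and `vpn` surfacing this way are exactly the development servers and entry points mentioned above.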

Subdomain & Virtual Host Enumeration

Subdomains: extensions of the main domain (e.g., blog.example.com) with their own DNS records.

Virtual Hosts: web server configuration allowing multiple websites on a single server, distinguished via the HTTP Host header. They can be top-level domains or subdomains, and each VHost has its own separate config.

Subdomains often contain:

  • Development/staging environments with relaxed security.
  • Hidden login portals (admin panels).
  • Legacy applications with outdated software.
  • Sensitive information: confidential documents, internal data, config files.
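Since virtual hosts are selected purely by the Host header, enumerating them boils down to replaying the same request with different candidate names against one IP. A sketch that only builds the raw request (actually sending it over a socket is left out):

```python
def build_vhost_probe(vhost: str, path: str = "/") -> bytes:
    """Craft a raw HTTP/1.1 request where the Host header carries the
    candidate virtual host name rather than the server's IP."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {vhost}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return request.encode()
```

Tools like ffuf and gobuster automate exactly this: same target IP, rotating Host values, flagging responses that differ from the default site.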

Active methods: zone transfer, ffuf, gobuster, dirbuster, dnsenum, amass, assetfinder, puredns, fierce, dnsrecon, sublist3r.

Passive methods: Certificate Transparency logs, search engines (site:).

Certificate Transparency Logs

Tools: crt.sh, search.censys.io.

Advantages:

  • Not limited by wordlist or brute-force algorithm.
  • Access historical and comprehensive view of subdomains.
  • Discover subdomains with old/expired certificates (potentially vulnerable).
  • No need for brute-forcing or complete wordlist.
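crt.sh can return results as JSON (`output=json`); each entry's `name_value` field may pack several certificate names separated by newlines. A parsing sketch over an inlined, hypothetical sample instead of a live query:

```python
import json

# Trimmed sample shaped like crt.sh JSON output
# (https://crt.sh/?q=%25.example.com&output=json).
SAMPLE = """[
  {"name_value": "www.example.com\\ndev.example.com"},
  {"name_value": "*.example.com"},
  {"name_value": "www.example.com"}
]"""

def extract_subdomains(raw: str) -> set:
    """De-duplicate certificate names; name_value may hold several SAN
    entries separated by newlines, and wildcards are stripped."""
    names = set()
    for entry in json.loads(raw):
        for name in entry["name_value"].split("\n"):
            names.add(name.lstrip("*."))
    return names
```

The de-duplicated set is a ready-made seed list for the active enumeration tools above.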

Fingerprinting

Techniques: banner grabbing, HTTP headers, unique responses (error messages), page content (source code, scripts, copyright).

Tools: Wappalyzer, BuiltWith, WhatWeb, Nmap, Netcraft, wafw00f, Nikto.
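Header-based fingerprinting can be as simple as filtering a response's header map for technology-revealing names. A minimal sketch with hypothetical header values (the tools above do this plus body-content and behavioral checks):

```python
def fingerprint(headers: dict) -> dict:
    """Keep only headers that commonly leak the server-side stack."""
    interesting = {"server", "x-powered-by", "x-aspnet-version", "x-generator"}
    return {name: value for name, value in headers.items()
            if name.lower() in interesting}

# Hypothetical response headers from a target
found = fingerprint({
    "Server": "nginx/1.24.0",
    "X-Powered-By": "PHP/8.1.2",
    "Date": "Mon, 01 Jan 2024 00:00:00 GMT",
})
```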

Crawling

Crawling (spidering): automated process of browsing the web, following links to collect information.

Breadth-first: prioritize width, crawl all links on seed page before going deep. Good for website structure overview.

Depth-first: prioritize depth, follow one path of links as far as possible before backtracking. Good for finding specific content or reaching deep structure.
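The two strategies differ only in how the frontier of discovered links is consumed: a queue gives breadth-first, a stack gives depth-first. A sketch over a toy in-memory link graph (no network; the page names are made up):

```python
from collections import deque

# Toy link graph standing in for pages fetched over HTTP.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post1"],
    "/team": [],
    "/blog/post1": [],
}

def crawl(seed: str, depth_first: bool = False) -> list:
    """Visit pages from seed; pop() = stack (DFS), popleft() = queue (BFS)."""
    seen, order = {seed}, []
    frontier = deque([seed])
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Breadth-first visits every page linked from `/` before descending; depth-first runs down `/blog` to `/blog/post1` before returning to `/about`.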

Information collected:

  • Links: internal and external links to map structure, discover hidden pages.
  • Comments: may reveal sensitive details, internal processes, hints about vulnerabilities.
  • Metadata: titles, descriptions, keywords, author names, dates.
  • Sensitive files: backup files (.bak, .old), config files (web.config, settings.php), log files (error_log, access_log), files containing passwords, API keys.

Crawlers

  • Burp Suite Spider: powerful active crawler, map web apps, discover hidden content and vulnerabilities.
  • OWASP ZAP: free, open-source scanner, includes spider component.
  • Scrapy: flexible, scalable Python framework, build custom crawlers.
  • Apache Nutch: Java, highly extensible and scalable, handle massive crawls.

ReconSpider (custom Scrapy spider):

# Download and unpack the spider (Python 3 with Scrapy installed)
wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
unzip ReconSpider.zip
# Crawl the target; output is written to results.json
python3 ReconSpider.py http://example.com

Results saved in results.json.

robots.txt

Text file in root directory, structure:

User-agent: specifies which crawler/bot the rules apply to. Wildcard (*) = all bots.

Directives:

| Directive | Description | Example |
| --- | --- | --- |
| Disallow | Paths bot should not crawl | Disallow: /admin/ |
| Allow | Explicitly permit crawling specific paths | Allow: /public/ |
| Crawl-delay | Delay between requests (seconds) | Crawl-delay: 10 |
| Sitemap | URL to XML sitemap | Sitemap: https://example.com/sitemap.xml |

Value:

  • Disallowed paths often contain sensitive info, backup files, admin panels.
  • Analyze allowed/disallowed paths to create website map.
  • Some disallowed entries may be honeypot directories set up to lure and detect malicious bots.
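Python's standard library can evaluate these rules directly via `urllib.robotparser`; in the sketch below the file content is inlined rather than fetched from a live `/robots.txt`:

```python
import urllib.robotparser

# Inlined sample; in practice fetch https://target/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /public/
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# From a recon perspective, the *disallowed* paths are the interesting ones.
blocked = [p for p in ("/admin/secret", "/public/page")
           if not rp.can_fetch("*", p)]
```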

Well-Known URIs

Per RFC 8615, .well-known at root domain (/.well-known/) contains metadata: config files, services, protocols, security mechanisms.

openid-configuration: part of OpenID Connect Discovery protocol (identity layer on OAuth 2.0). Endpoint https://example.com/.well-known/openid-configuration returns JSON containing metadata about provider endpoints, authentication methods, token issuance.
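The discovery document is plain JSON, so recon amounts to fetching it and listing the advertised URLs. A sketch over a trimmed, hypothetical document (a real one is served at the `/.well-known/openid-configuration` endpoint above):

```python
import json

# Trimmed, hypothetical openid-configuration document.
DOC = """{
  "issuer": "https://example.com",
  "authorization_endpoint": "https://example.com/oauth2/authorize",
  "token_endpoint": "https://example.com/oauth2/token",
  "jwks_uri": "https://example.com/oauth2/keys"
}"""

def endpoints(doc: str) -> dict:
    """Keys ending in _endpoint or _uri are URLs worth probing during recon."""
    cfg = json.loads(doc)
    return {k: v for k, v in cfg.items()
            if k.endswith(("_endpoint", "_uri"))}
```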

Google Dorking

Using search operators to discover sensitive information, vulnerabilities, hidden content:

| Operator | Description | Example |
| --- | --- | --- |
| site: | Limit results to domain | site:example.com |
| inurl: | Find term in URL | inurl:login |
| filetype: | Find specific file type | filetype:pdf |
| intitle: | Find term in title | intitle:"confidential report" |
| intext: / inbody: | Find term in body text | intext:"password reset" |
| cache: | Display cached version | cache:example.com |
| link: | Find pages linking to webpage | link:example.com |
| related: | Find related websites | related:example.com |
| info: | Summary information | info:example.com |
| define: | Define word/phrase | define:phishing |
| numrange: | Find numbers in range | numrange:1000-2000 |
| allintext: | All words in body | allintext:admin password reset |
| allinurl: | All words in URL | allinurl:admin panel |
| allintitle: | All words in title | allintitle:confidential report 2023 |
| AND | Require all terms | site:example.com AND inurl:admin |
| OR | Any term | "linux" OR "ubuntu" |
| NOT | Exclude term | site:bank.com NOT inurl:login |
| * | Wildcard | filetype:pdf user* manual |
| .. | Range search | "price" 100..500 |
| " " | Exact phrase | "information security policy" |
| - | Exclude term | -inurl:sports |
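Dorks are just space-separated `operator:value` pairs, so queries for many targets can be generated programmatically. A small illustrative helper (the function and its parameters are this sketch's own, not a standard API):

```python
def dork(*terms, site=None, inurl=None, filetype=None):
    """Compose a Google dork query string from the operators in the table."""
    parts = list(terms)
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)
```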

Reference: https://www.exploit-db.com/google-hacking-database

Web Archives

Wayback Machine provides:

  • Uncovering hidden assets: old pages, directories, files, subdomains not currently accessible, may contain sensitive info or security flaws.
  • Tracking changes: compare historical snapshots to observe evolution (structure, content, technologies, vulnerabilities).
  • Gathering intelligence: OSINT about past activities, marketing strategies, employees, technology choices.
  • Stealthy reconnaissance: passive activity doesn’t interact directly with target infrastructure, less detectable.
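Beyond the web UI, the Wayback Machine exposes snapshot listings through its CDX API. The sketch below only constructs the query URL; it makes no request, and the parameter choices are one reasonable configuration rather than the only one:

```python
from urllib.parse import urlencode

def cdx_url(domain: str, limit: int = 50) -> str:
    """Build a Wayback Machine CDX API query for snapshots under a domain."""
    params = {
        "url": f"{domain}/*",                     # everything under the domain
        "output": "json",
        "fl": "timestamp,original,statuscode",    # fields to return
        "collapse": "urlkey",                     # de-duplicate repeat captures
        "limit": limit,
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```

Fetching the resulting URL returns a JSON array of captures whose `original` URLs often include pages no longer linked from the live site.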

Automated Reconnaissance Tools

  • FinalRecon: Python, modules for SSL certificate checking, WHOIS, header analysis, crawling. Modular structure enables easy customization.
  • Recon-ng: Python framework, modular, with modules for DNS enumeration, subdomain discovery, port scanning, web crawling, and exploitation.
  • theHarvester: Command-line Python, collect emails, subdomains, hosts, employee names, open ports, banners from search engines, PGP key servers, SHODAN.
  • SpiderFoot: Open-source intelligence automation, integrates multiple data sources to collect IP addresses, domains, emails, social media profiles. Performs DNS lookups, web crawling, port scanning.
  • OSINT Framework: Collection of tools and resources for OSINT, includes social media, search engines, public records.

Public Buckets

https://buckets.grayhatwarfare.com/

Public search portal for “open buckets” (public cloud storage containers) and contents. Search file names, filter by extension, keywords, date ranges to find exposed files.

Aggregates data from Amazon S3, Azure Blob Storage, Google Cloud Storage. Discover misconfigured buckets leaking data.