<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version> <!-- Replace with the desired version -->
</dependency>
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            // Parse HTML from a URL
            Document doc = Jsoup.connect("http://example.com").get();
            // Print the title of the HTML document
            System.out.println("Title: " + doc.title());
            // Print the HTML content
            System.out.println("HTML: " + doc.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Replace "http://example.com" with the URL of the HTML document you want to parse.
getElementsByTag() method:
Elements paragraphs = document.getElementsByTag("p");
getElementsByClass() method:
Elements elements = document.getElementsByClass("className");
getElementById() method:
Element element = document.getElementById("elementId");
getElementsByAttribute() method:
Elements elements = document.getElementsByAttribute("attributeName");
getElementsByAttributeValue() method:
Elements elements = document.getElementsByAttributeValue("attributeName", "attributeValue");
getElementsByAttributeStarting() method:
Elements elements = document.getElementsByAttributeStarting("prefix");
getElementsByAttributeValueStarting() method:
Elements elements = document.getElementsByAttributeValueStarting("attributeName", "prefix");
getElementsByAttributeValueEnding() method:
Elements elements = document.getElementsByAttributeValueEnding("attributeName", "suffix");
getElementsByAttributeValueContaining() method:
Elements elements = document.getElementsByAttributeValueContaining("attributeName", "substring");
getElementsByAttributeValueMatching() method:
Elements elements = document.getElementsByAttributeValueMatching("attributeName", "regexPattern");
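To see a few of the lookup methods above working together, here is a minimal sketch that runs them against a small in-memory document. The class name SelectionExample, the helper methods, and the sample markup are all hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectionExample {
    // Hypothetical sample markup, used only for this illustration
    static final String HTML =
            "<html><body>"
            + "<p id=\"intro\" class=\"lead\">First</p>"
            + "<p class=\"lead\">Second</p>"
            + "<a href=\"https://example.com/docs\" data-role=\"nav\">Docs</a>"
            + "</body></html>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // Number of elements with the given tag name
    static int countByTag(String tag) {
        return parse().getElementsByTag(tag).size();
    }

    // Number of elements carrying the given class
    static int countByClass(String className) {
        return parse().getElementsByClass(className).size();
    }

    // Text of the element with the given id, or null when no such element exists
    static String textById(String id) {
        Element element = parse().getElementById(id);
        return element == null ? null : element.text();
    }

    // Number of elements whose attribute value starts with the given prefix
    static int countByAttributeValueStarting(String attribute, String prefix) {
        return parse().getElementsByAttributeValueStarting(attribute, prefix).size();
    }

    // Number of elements that have an attribute whose name starts with the prefix
    static int countByAttributeStarting(String namePrefix) {
        return parse().getElementsByAttributeStarting(namePrefix).size();
    }

    public static void main(String[] args) {
        System.out.println("p elements: " + countByTag("p"));
        System.out.println("class lead: " + countByClass("lead"));
        System.out.println("id intro: " + textById("intro"));
        System.out.println("https links: " + countByAttributeValueStarting("href", "https://"));
        System.out.println("data-* attributes: " + countByAttributeStarting("data-"));
    }
}
```

Note that getElementById() returns a single Element (or null when nothing matches), while the other methods return an Elements collection that may be empty.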
To serialize an HTML document in Jsoup, you can use the toString() method of the Document class. This method returns the HTML content of the document as a string. Here's how you can serialize an HTML document using Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        // Parse HTML from a string or another source
        String html = "<html><head><title>Example</title></head><body><p>Hello, Jsoup!</p></body></html>";
        Document doc = Jsoup.parse(html);
        // Serialize the HTML document to a string
        String serializedHtml = doc.toString();
        // Print the serialized HTML
        System.out.println(serializedHtml);
    }
}
In this example, we first parse the HTML string using Jsoup.parse(). Then, we call the toString() method on the Document object doc to serialize it to a string. Finally, we print the serialized HTML string to the console.
The toString() method serializes the whole document: the document type declaration (<!DOCTYPE>) if one is present, the HTML root element (<html>), and all of its child elements.
You can customize the serialized output through the output settings of the Document object. For example, you can control indentation, pretty-printing, and other formatting options. Here's an example of how you can customize the output settings:
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.indentAmount(4); // Set indentation to 4 spaces
outputSettings.prettyPrint(true); // Enable pretty-printing (this is the default)
doc.outputSettings(outputSettings);
String serializedHtml = doc.toString();
attr() method:
Element element = document.getElementById("example");
String attributeValue = element.attr("attributeName");
Replace "attributeName" with the name of the attribute you want to extract, and element with the HTML element from which you want to extract the attribute. If the attribute is not present, attr() returns an empty string rather than null.
hasAttr() method:
boolean hasAttribute = element.hasAttr("attributeName");
This method returns true if the element has the specified attribute, and false otherwise.
attributes() method, which returns an Attributes object representing the element's attributes:
Attributes attributes = element.attributes();
for (Attribute attribute : element.attributes()) {
    String attributeName = attribute.getKey();
    String attributeValue = attribute.getValue();
    // Process attribute...
}
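The attribute methods above can be tried out on a small in-memory element. The following is a minimal sketch; the class name AttributeExample, its helper methods, and the sample anchor markup are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class AttributeExample {
    // Hypothetical sample markup, used only for this illustration
    static final String HTML = "<a id=\"example\" href=\"/docs\" title=\"Docs\">Docs</a>";

    static Element example() {
        return Jsoup.parse(HTML).getElementById("example");
    }

    // attr() returns the attribute value, or an empty string if it is absent
    static String valueOf(String name) {
        return example().attr(name);
    }

    // hasAttr() reports whether the attribute is present at all
    static boolean has(String name) {
        return example().hasAttr(name);
    }

    // Copy every attribute of the element into a map by iterating attributes()
    static Map<String, String> all() {
        Map<String, String> result = new LinkedHashMap<>();
        for (Attribute attribute : example().attributes()) {
            result.put(attribute.getKey(), attribute.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("href = " + valueOf("href"));
        System.out.println("has title? " + has("title"));
        System.out.println("has missing? " + has("missing"));
        System.out.println("all attributes: " + all());
    }
}
```

Checking hasAttr() first is useful when an empty attribute value and a missing attribute need to be told apart, since attr() returns an empty string in both cases.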
Connection timeout: set it with the timeout() method when establishing the connection using Jsoup's connect() method. The timeout value is specified in milliseconds.
int connectionTimeoutMillis = 5000; // 5 seconds
Connection connection = Jsoup.connect("http://example.com")
        .timeout(connectionTimeoutMillis);
This applies the timeout to the connection to http://example.com.
Read timeout: Jsoup does not expose a separate read timeout. The value passed to timeout() is the combined maximum duration for connecting and for reading the full response, so set it on the Connection before sending the request. Again, the timeout value is specified in milliseconds.
int readTimeoutMillis = 10000; // 10 seconds
Document doc = Jsoup.connect("http://example.com")
        .timeout(readTimeoutMillis)
        .get();
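To check what a Connection has been configured with before anything is sent, you can inspect its request object. The sketch below is one way to do that without touching the network; the class name TimeoutExample, the helper methods, and the "ExampleBot/1.0" user agent are hypothetical values chosen for the illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class TimeoutExample {
    // Build a configured Connection; nothing is sent until get(), post() or execute() is called
    static Connection configure(String url, int timeoutMillis) {
        return Jsoup.connect(url)
                .timeout(timeoutMillis)              // timeout in milliseconds
                .userAgent("ExampleBot/1.0")         // hypothetical user agent string
                .header("Accept-Language", "en");    // extra request header
    }

    // Read the timeout back from the pending request
    static int configuredTimeout(String url, int timeoutMillis) {
        return configure(url, timeoutMillis).request().timeout();
    }

    // Read a header back from the pending request
    static String configuredHeader(String url, String headerName) {
        return configure(url, 5000).request().header(headerName);
    }

    public static void main(String[] args) {
        Connection.Request request = configure("http://example.com", 5000).request();
        System.out.println("timeout (ms): " + request.timeout());
        System.out.println("method: " + request.method());
        System.out.println("Accept-Language: " + request.header("Accept-Language"));
    }
}
```

When the request is actually sent and the timeout elapses, Jsoup throws a java.net.SocketTimeoutException, which is a subclass of IOException and is therefore caught by the catch (IOException e) pattern used earlier.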
To extract text content from HTML elements, use the text() method provided by the Element class. This method retrieves the combined text content of an element and all its descendant elements, excluding any HTML tags. Here's how you can extract text content from HTML elements using Jsoup:
text() method: You can call the text() method on an Element object to retrieve its text content:
Element element = document.getElementById("example");
String textContent = element.text();
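This retrieves the text content of the element with the ID "example".
To extract text from several elements, call the text() method for each element:
Elements elements = document.getElementsByTag("p");
for (Element element : elements) {
    String textContent = element.text();
    System.out.println(textContent);
}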
This prints the text content of all <p> elements in the HTML document.
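As a runnable illustration of the text() calls above, here is a minimal sketch on an in-memory document. The class name TextExample, the helper methods, and the sample markup are hypothetical. It also shows ownText(), which returns only the text directly inside an element and skips the text of child elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TextExample {
    // Hypothetical sample markup, used only for this illustration
    static final String HTML =
            "<div id=\"example\">Hello   <b>Jsoup</b>   world</div>"
            + "<p>One</p><p>Two</p>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // Combined text of the element and its descendants, whitespace normalized
    static String textOf(String id) {
        return parse().getElementById(id).text();
    }

    // Text owned directly by the element, excluding text inside child elements
    static String ownTextOf(String id) {
        return parse().getElementById(id).ownText();
    }

    // Text of every <p> element, in document order
    static List<String> paragraphTexts() {
        List<String> texts = new ArrayList<>();
        for (Element element : parse().getElementsByTag("p")) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(textOf("example"));
        System.out.println(ownTextOf("example"));
        System.out.println(paragraphTexts());
    }
}
```

Because text() collapses runs of whitespace and trims the result, the repeated spaces in the sample markup do not appear in the extracted string.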
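Normalizing text: text() already normalizes and trims whitespace, and java.lang.String has no normalize() method. If you also need Unicode normalization, apply java.text.Normalizer to the extracted text:
String normalizedTextContent = java.text.Normalizer.normalize(element.text(), java.text.Normalizer.Form.NFC);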
// Extract text content from all elements with class "content"
Elements contentElements = document.getElementsByClass("content");
for (Element element : contentElements) {
    String textContent = element.text();
    System.out.println(textContent);
}
Replace "content" with the desired class name, or use other methods like getElementsByTag() or getElementById() to select elements based on different criteria.
By using the text() method, you can easily extract text content from HTML elements using Jsoup in your Java code. This feature is particularly useful for tasks like web scraping, data extraction, and content analysis.
The getElementById() and select() methods are both used to select HTML elements from a parsed HTML document, but they differ in their functionality and the types of selectors they support:
getElementById():
Signature: getElementById(String id)
Example: Element element = document.getElementById("exampleId");
select():
Signature: select(String cssQuery)
Example: Elements elements = document.select(".exampleClass");
select() accepts any CSS selector and returns all matching elements, so it is more flexible than getElementById(), which returns a single element (or null) for one ID.
A Connection is obtained from the Jsoup.connect() method, which allows you to specify the URL to connect to and configure various parameters such as timeouts, headers, and request method.
The Connection interface is primarily used to establish connections to web servers, set request parameters (e.g., headers, cookies, timeouts), and retrieve the response.
The purpose of the Connection interface in Jsoup is to provide a flexible and convenient way to interact with web servers and retrieve HTML content for parsing, scraping, or other processing tasks. Key features and purposes of the Connection interface include:
Call get(), post(), or execute() to send the request to the server and retrieve the response. The request() method does not send anything; it only returns the request object that holds the current configuration.
The Document.OutputSettings class is used to configure the output settings when serializing HTML or XML documents to strings. It provides a set of options that control how the document's HTML or XML content is formatted, indented, and escaped when converted to a string representation. The OutputSettings class allows developers to customize the output format to meet specific requirements, such as controlling indentation, line breaks, and character encoding.
The purpose of the Document.OutputSettings class in Jsoup is to provide a mechanism for controlling the serialization of HTML or XML documents, including:
The prettyPrint(boolean pretty) method specifies whether the serialized output should be formatted with indentation to improve readability. When pretty-printing is enabled (the default), the output is indented to represent the document structure, making it easier for humans to read.
The indentAmount(int indentAmount) method sets the number of spaces used for each level of indentation when pretty-printing is enabled.
The charset(String charset) method sets the character set the output is intended to be encoded in. Jsoup uses it to decide which characters are kept as-is and which are escaped as entities, so characters that the chosen charset cannot represent are still written safely.
The escapeMode(EscapeMode escapeMode) method sets the escape mode used for escaping special characters in the output. Jsoup's Entities.EscapeMode provides three modes, xhtml, base, and extended, which control which named entities are used to represent special characters in the output.
The syntax(Syntax syntax) method sets the syntax of the output, which can be either html or xml. This determines whether the output is serialized as HTML or as XML.
The outline(boolean outline) method enables or disables outline mode, which is off by default. When it is enabled, the HTML output methods treat every tag as a block element while formatting. It does not remove elements or attributes from the document.
With the Document.OutputSettings class, developers can customize the output format of serialized HTML or XML documents according to their preferences and requirements. This allows for fine-grained control over how the document's content is represented when converted to a string, ensuring consistent and predictable output across different scenarios and use cases.
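The following is a minimal sketch of several of these settings applied together. The class name OutputSettingsExample, the render() helper, and the sample markup are hypothetical. It restricts the output charset to US-ASCII so that the effect of escaping on a non-ASCII character is visible.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class OutputSettingsExample {
    // Serialize the body of a small sample document with the given pretty-print choice
    static String render(boolean pretty) {
        // Hypothetical sample markup; \u00e9 is the character "e with acute accent"
        Document doc = Jsoup.parse("<div><p>caf\u00e9 &amp; tea</p></div>");
        doc.outputSettings()
                .prettyPrint(pretty)                            // indentation on or off
                .indentAmount(4)                                // spaces per level when pretty
                .charset("US-ASCII")                            // non-ASCII characters get escaped
                .escapeMode(Entities.EscapeMode.base)           // default HTML entity set
                .syntax(Document.OutputSettings.Syntax.html);   // serialize as HTML
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println("Compact:");
        System.out.println(render(false));
        System.out.println("Pretty-printed:");
        System.out.println(render(true));
    }
}
```

With pretty-printing disabled the markup comes out on a single line, which is often preferable when the string is compared in tests or stored, while the pretty-printed form is easier to read during debugging.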