
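PySpark provides a powerful xpath function that lets you extract specific elements or attributes from an XML string with an XPath expression. A common misconception, however, is that pointing the expression directly at an XML node (for example /Root/Customers/Customer/Name) automatically returns that node's text content. In practice, this usually returns an array of null values instead of the expected text.
Consider the following nested XML structure, stored in a CSV file: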
<?xml version="1.0" encoding="utf-8"?>
<Root>
<Customers>
<Customer CustomerID="1">
<Name>John Doe</Name>
<Address>
<Street>123 Main St</Street>
<City>Anytown</City>
<State>CA</State>
<Zip>12345</Zip>
</Address>
<PhoneNo>123-456-7890</PhoneNo>
</Customer>
<Customer CustomerID="2">
<Name>Jane Smith</Name>
<Address>
<Street>456 Oak St</Street>
<City>Somecity</City>
<State>NY</State>
<Zip>67890</Zip>
</Address>
<PhoneNo>987-654-3210</PhoneNo>
</Customer>
<Customer CustomerID="3">
<Name>Bob Johnson</Name>
<Address>
<Street>789 Pine St</Street>
<City>Othercity</City>
<State>TX</State>
<Zip>11223</Zip>
</Address>
<PhoneNo>456-789-0123</PhoneNo>
</Customer>
</Customers>
<Orders>
<Order>
<CustomerID>1</CustomerID>
<EmpID>100</EmpID>
<OrderDate>2022-01-01</OrderDate>
<Cost>100.50</Cost>
</Order>
<Order>
<CustomerID>2</CustomerID>
<EmpID>101</EmpID>
<OrderDate>2022-01-02</OrderDate>
<Cost>200.75</Cost>
</Order>
</Orders>
</Root>
Suppose you try to extract the customer name (Name) and phone number (PhoneNo) with the following PySpark code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("ETL").getOrCreate()
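# Assume df_Customers_Orders already holds the preprocessed XML string in a column named "Data".
# For demonstration, the CSV reading and XML string preprocessing steps are omitted,
# and the DataFrame is built directly from the XML shown above.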
data_str = """<?xml version="1.0" encoding="utf-8"?>
<Root>
<Customers>
<Customer CustomerID="1">
<Name>John Doe</Name>
<Address>
<Street>123 Main St</Street>
<City>Anytown</City>
<State>CA</State>
<Zip>12345</Zip>
</Address>
<PhoneNo>123-456-7890</PhoneNo>
</Customer>
<Customer CustomerID="2">
<Name>Jane Smith</Name>
<Address>
<Street>456 Oak St</Street>
<City>Somecity</City>
<State>NY</State>
<Zip>67890</Zip>
</Address>
<PhoneNo>987-654-3210</PhoneNo>
</Customer>
<Customer CustomerID="3">
<Name>Bob Johnson</Name>
<Address>
<Street>789 Pine St</Street>
<City>Othercity</City>
<State>TX</State>
<Zip>11223</Zip>
</Address>
<PhoneNo>456-789-0123</PhoneNo>
</Customer>
</Customers>
<Orders>
<Order>
<CustomerID>1</CustomerID>
<EmpID>100</EmpID>
<OrderDate>2022-01-01</OrderDate>
<Cost>100.50</Cost>
</Order>
<Order>
<CustomerID>2</CustomerID>
<EmpID>101</EmpID>
<OrderDate>2022-01-02</OrderDate>
<Cost>200.75</Cost>
</Order>
</Orders>
</Root>"""
df_Customers_Orders = spark.createDataFrame([(data_str,)], ["Data"])
df_sample_CustomersOrders1 = df_Customers_Orders.selectExpr(
"xpath(Data,'/Root/Customers/Customer/@CustomerID') as CustomerID",
"xpath(Data,'/Root/Customers/Customer/Name') as ContactName",
"xpath(Data,'/Root/Customers/Customer/PhoneNo') as PhoneNo",
)
df_sample_CustomersOrders1.show(truncate=False)
The output is:
+----------+------------------+------------------+
|CustomerID|ContactName       |PhoneNo           |
+----------+------------------+------------------+
|[1, 2, 3] |[null, null, null]|[null, null, null]|
+----------+------------------+------------------+
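As you can see, the CustomerID attribute is extracted correctly, but the ContactName and PhoneNo columns contain arrays of null values. The reason is that xpath returns the value of each node the expression matches. An attribute node's value is the attribute's string, but an element node has no value of its own: its text lives in a child text node. A path that stops at the element therefore matches three nodes and yields null for each of them.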
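The distinction between node kinds can be reproduced outside Spark. The sketch below is a plain-Python analogy using the standard library's xml.dom.minidom, not Spark's own code; it assumes that Spark's xpath reads the DOM node value of each match, which is consistent with the output above. The one-customer XML snippet is a shortened, illustrative version of the sample data.

```python
from xml.dom import minidom

# A shortened, illustrative version of the sample XML (one customer only).
doc = minidom.parseString(
    '<Customer CustomerID="1"><Name>John Doe</Name></Customer>'
)

customer = doc.documentElement
name_element = customer.getElementsByTagName("Name")[0]  # like .../Customer/Name
id_attribute = customer.getAttributeNode("CustomerID")   # like .../Customer/@CustomerID
name_text = name_element.firstChild                      # like .../Customer/Name/text()

# In the DOM model, only attribute and text nodes carry a string value.
element_value = name_element.nodeValue    # element node: no value of its own
attribute_value = id_attribute.nodeValue  # attribute node: the attribute string
text_value = name_text.nodeValue          # text node: the character data

print(element_value, attribute_value, text_value)
```

The element node reports no value while its child text node holds the string, which is why the path has to go one step further than the element itself.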
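To extract the text content of an XML node correctly, append /text() to the end of the XPath expression. This tells the xpath function explicitly that you want the value of the text child nodes under the specified node.
The corrected PySpark code follows: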
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("ETL").getOrCreate()
# Assume df_Customers_Orders already holds the preprocessed XML string in a column named "Data".
data_str = """<?xml version="1.0" encoding="utf-8"?>
<Root>
<Customers>
<Customer CustomerID="1">
<Name>John Doe</Name>
<Address>
<Street>123 Main St</Street>
<City>Anytown</City>
<State>CA</State>
<Zip>12345</Zip>
</Address>
<PhoneNo>123-456-7890</PhoneNo>
</Customer>
<Customer CustomerID="2">
<Name>Jane Smith</Name>
<Address>
<Street>456 Oak St</Street>
<City>Somecity</City>
<State>NY</State>
<Zip>67890</Zip>
</Address>
<PhoneNo>987-654-3210</PhoneNo>
</Customer>
<Customer CustomerID="3">
<Name>Bob Johnson</Name>
<Address>
<Street>789 Pine St</Street>
<City>Othercity</City>
<State>TX</State>
<Zip>11223</Zip>
</Address>
<PhoneNo>456-789-0123</PhoneNo>
</Customer>
</Customers>
<Orders>
<Order>
<CustomerID>1</CustomerID>
<EmpID>100</EmpID>
<OrderDate>2022-01-01</OrderDate>
<Cost>100.50</Cost>
</Order>
<Order>
<CustomerID>2</CustomerID>
<EmpID>101</EmpID>
<OrderDate>2022-01-02</OrderDate>
<Cost>200.75</Cost>
</Order>
</Orders>
</Root>"""
df_Customers_Orders = spark.createDataFrame([(data_str,)], ["Data"])
df_sample_CustomersOrders1 = df_Customers_Orders.selectExpr(
"xpath(Data,'/Root/Customers/Customer/@CustomerID') as CustomerID",
"xpath(Data,'/Root/Customers/Customer/Name/text()') as ContactName",
"xpath(Data,'/Root/Customers/Customer/PhoneNo/text()') as PhoneNo",
)
df_sample_CustomersOrders1.show(truncate=False)
Running this code produces the expected result:
+----------+-----------------------------------+------------------------------------------+
|CustomerID|ContactName                        |PhoneNo                                   |
+----------+-----------------------------------+------------------------------------------+
|[1, 2, 3] |[John Doe, Jane Smith, Bob Johnson]|[123-456-7890, 987-654-3210, 456-789-0123]|
+----------+-----------------------------------+------------------------------------------+
The ContactName and PhoneNo columns now contain the text content of their respective elements.
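Attributes vs. text content: An attribute selected with @ (for example @CustomerID) already yields its string value, so it needs no /text() suffix. An element's text must be requested explicitly with /text().
Return type: The xpath function returns an array of strings (ArrayType(StringType)), even when there is only one match. If you know there is only one match, or want to collapse the array into a single string, use getItem(0) or a function such as concat_ws.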
# Extract a single customer's name and take it out of the array
df_single_customer_name = df_Customers_Orders.selectExpr(
"xpath(Data,'/Root/Customers/Customer[1]/Name/text()')[0] as FirstCustomerName"
)
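df_single_customer_name.show()
Handling multiple matches: When an XPath expression matches several nodes, the xpath function returns an array containing all of the matches. This is very useful when working with list-like data.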
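XML string preprocessing: When XML strings are read from CSV or other text sources, they may need preprocessing, such as removing the surrounding double quotes and un-escaping doubled quotes, so that the string is valid XML. The substring and regexp_replace calls in the snippet below handle these cases.
# Preprocessing steps from the original question
# df_Customers_Orders = df_Customers_Orders.withColumn("Data", expr("substring(Data, 2, length(Data)-2)"))
# df_Customers_Orders = df_Customers_Orders.withColumn("Data", regexp_replace("Data", '""', '"'))
Error handling and nulls: If an XPath expression finds no match, the xpath function returns an empty array. Downstream code needs to account for this, for example by checking the array length with size before indexing into it, or by testing membership with array_contains, so that later steps do not fail or silently propagate nulls.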
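The cleanup and the empty-result guard described above can be sketched in plain Python with the standard library. This is an illustration of the logic only, not PySpark code: raw_field is a made-up CSV field, clean_csv_xml and first_or_default are hypothetical helper names, and xml.etree.ElementTree stands in for Spark's xpath (it supports only a subset of XPath).

```python
import xml.etree.ElementTree as ET

# Hypothetical raw CSV field: wrapped in quotes, with inner quotes doubled.
raw_field = (
    '"<Root><Customers><Customer CustomerID=""1"">'
    '<Name>John Doe</Name></Customer></Customers></Root>"'
)


def clean_csv_xml(field: str) -> str:
    """Drop the first and last character, then collapse doubled quotes."""
    inner = field[1:-1]              # mirrors substring(Data, 2, length(Data)-2)
    return inner.replace('""', '"')  # mirrors regexp_replace("Data", '""', '"')


def first_or_default(values, default=None):
    """Return the first item of a possibly empty list instead of raising."""
    return values[0] if values else default


cleaned = clean_csv_xml(raw_field)
root = ET.fromstring(cleaned)  # raises ParseError if the cleanup left invalid XML

names = [node.text for node in root.findall("./Customers/Customer/Name")]
# The sample has no Email element, so this query matches nothing.
emails = [node.text for node in root.findall("./Customers/Customer/Email")]

first_name = first_or_default(names)
first_email = first_or_default(emails, default="unknown")
print(first_name, first_email)
```

Checking for emptiness before indexing keeps a missing element from turning into an error further down the pipeline; in Spark the analogous check would be on the size of the array returned by xpath.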
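Performance considerations: For very large XML strings or large datasets, frequent xpath calls can hurt performance. Consider whether the pipeline can be optimized in another way, such as parsing the XML into structured columns once. The from_xml function (available in pyspark.sql.functions from Spark 4.0; earlier versions need the separate spark-xml package for similar functionality) parses an XML string column into a struct column according to a schema, which is often more efficient and flexible for complex XML.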
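The parse-once idea can be illustrated in plain Python. The sketch below is not from_xml and not the article's method; it uses the standard library's xml.etree.ElementTree to walk the document a single time and build one record per customer, instead of evaluating a separate expression for every field. The function name parse_customers and the trimmed sample_xml are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative version of the sample data (two customers).
sample_xml = """<Root>
  <Customers>
    <Customer CustomerID="1">
      <Name>John Doe</Name>
      <Address><City>Anytown</City></Address>
      <PhoneNo>123-456-7890</PhoneNo>
    </Customer>
    <Customer CustomerID="2">
      <Name>Jane Smith</Name>
      <Address><City>Somecity</City></Address>
      <PhoneNo>987-654-3210</PhoneNo>
    </Customer>
  </Customers>
</Root>"""


def parse_customers(xml_text: str) -> list:
    """Parse the XML once and return one dict per Customer element."""
    root = ET.fromstring(xml_text)
    records = []
    for customer in root.findall("./Customers/Customer"):
        records.append(
            {
                "CustomerID": customer.get("CustomerID"),     # attribute value
                "ContactName": customer.findtext("Name"),      # element text
                "PhoneNo": customer.findtext("PhoneNo"),
                "City": customer.findtext("Address/City"),     # nested element
            }
        )
    return records


records = parse_customers(sample_xml)
for record in records:
    print(record)
```

Keeping each customer's fields together in one record also avoids having to re-align several parallel arrays afterwards, which is one reason a structured parse tends to be easier to work with for nested XML.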
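When extracting XML node text with the xpath function in PySpark, the key is to be precise in the XPath expression. Explicitly appending /text() to the node path makes xpath return the string content you want rather than a null for each matched element, which avoids the common problem of null-filled arrays. With this technique, XML handling in PySpark becomes both more efficient and more accurate.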