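File comparison and difference reporting

Project description

diffReport

A Python package for reporting the differences between two files.

Installation

Run the following command to install: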

pip install diffReport

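Dependencies

The package depends on:

pandas
PDFMiner
FuzzyWuzzy

Python-Levenshtein is an optional library that can greatly improve the tool's performance. The native Python sequence matcher can also do the job, but including Python-Levenshtein can speed up sequence matching by 10x to 30x.

The package is set up to install its dependencies automatically during installation.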
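
Because Python-Levenshtein is optional, it can be useful to check whether it is present before running a large comparison. The following is a minimal sketch using only the standard library; the helper name has_fast_matcher is hypothetical and not part of diffReport.

```python
import importlib.util


def has_fast_matcher() -> bool:
    """Return True if the optional Levenshtein module can be imported."""
    # python-Levenshtein installs a top-level module named "Levenshtein".
    return importlib.util.find_spec("Levenshtein") is not None


if __name__ == "__main__":
    if has_fast_matcher():
        print("Levenshtein found: faster sequence matching is available")
    else:
        print("Levenshtein not found: the pure-Python matcher will be used")
```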

Usage

from diffReport import diffReport

html = diffReport("file_path_a","file_path_b")

To save the generated HTML output to a specific folder, pass the path in the optional 'path_file_output' parameter. By default, the output is created in the working directory.

html = diffReport("file_path_a","file_path_b",path_file_output = 'Output/')

To get the result back as a DataFrame instead of HTML, set 'html_return' to False. It is True by default.

df = diffReport("file_path_a","file_path_b",html_return = False)

The Partial Ratio column of the output can show one of several ratios, selected with the 'partial_ratio' parameter. It defaults to "tokenSortRatio".

html_1 = diffReport("file_path_a","file_path_b",partial_ratio = "tokenSortRatio")
html_2 = diffReport("file_path_a","file_path_b",partial_ratio = "qRatio")
html_3 = diffReport("file_path_a","file_path_b",partial_ratio = "wRatio")
html_4 = diffReport("file_path_a","file_path_b",partial_ratio = "partialRatio")
html_5 = diffReport("file_path_a","file_path_b",partial_ratio = "tokenSetRatio")
html_6 = diffReport("file_path_a","file_path_b",partial_ratio = "partialTokenSortRatio")
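
The name passed in 'partial_ratio' selects one scoring function. The sketch below shows one way such name-based dispatch could work. The scorers plain_ratio and sorted_ratio, the SCORERS table, and the score function are all hypothetical stand-ins built on the standard library's difflib; they are not diffReport's own code.

```python
from difflib import SequenceMatcher


def plain_ratio(a, b):
    """Similarity of the raw strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def sorted_ratio(a, b):
    """Similarity after lower-casing and sorting whitespace-separated tokens."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return plain_ratio(norm_a, norm_b)


# Hypothetical lookup table: option name -> scoring function.
SCORERS = {"Ratio": plain_ratio, "tokenSortRatio": sorted_ratio}


def score(a, b, name="tokenSortRatio"):
    """Look up the scorer by name and apply it, rejecting unknown names."""
    try:
        scorer = SCORERS[name]
    except KeyError:
        raise ValueError(f"unknown ratio type: {name!r}") from None
    return scorer(a, b)


print(score("b a", "a b"))           # 100
print(score("b a", "a b", "Ratio"))  # lower, because token order matters here
```

Rejecting unknown names early gives a clearer error than a failure deep inside the comparison.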

Modules

diffReport

    diffReport(path_file_a, path_file_b, path_file_output='', html_return=True, partial_ratio='tokenSortRatio')
            :param path_file_a: Path for the File A to be compared.
            :param path_file_b: Path for the File B to be compared.
            :param path_file_output: Path of the directory where the output HTML file is saved. (Default: '', the working directory)
            :param html_return: Boolean to select if the function returns HTML of the report. (True by default)
            :param partial_ratio: Partial Ratio Type, Accepted Values are ("Ratio", "qRatio", "wRatio", "ratio_2", "tokenSetRatio", "tokenSortRatio", "partialTokenSortRatio", "default")
            :param exlude_analytics: List of character or sub-strings to exclude from the pdf during analysis.
            :return: HTML for the report if html_return is set to True. If set to False, the DataFrame is returned instead.

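The function takes two PDF file paths as input, builds a difference report from the lines that differ between the two files, and highlights the differences by color in an HTML table to show what was added, deleted, or changed.

Any text present in File_a but not in File_b is marked red.

Any text present in File_b but not in File_a is marked green.

Any text in File_a that was replaced by different text in File_b is marked yellow.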
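
To make the three colors concrete, here is a minimal sketch that classifies line-level differences with the standard library's difflib. It illustrates the idea only: classify_lines and COLORS are hypothetical names, and diffReport's own matching may work differently.

```python
from difflib import SequenceMatcher


def classify_lines(lines_a, lines_b):
    """Label line-level differences as 'delete', 'insert', or 'replace'."""
    changes = []
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical lines are not part of the report
        changes.append((tag, lines_a[a0:a1], lines_b[b0:b1]))
    return changes


# Color convention described above: removed = red, added = green, changed = yellow.
COLORS = {"delete": "red", "insert": "green", "replace": "yellow"}

old = ["alpha", "beta", "gamma"]
new = ["alpha", "gamma", "delta"]
for tag, removed, added in classify_lines(old, new):
    print(COLORS[tag], removed, added)
# red ['beta'] []
# green [] ['delta']
```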

html_output

The 'html_output' function takes a DataFrame as its argument, iterates over the rows, and returns an HTML table.

    html_output(df, path_file_output)
        :param df: DataFrame to be displayed as an HTML table
        :param path_file_output: Path of the directory where the output HTML file needs to be saved. (Default: 'Output/')
        :return: Returns the HTML for the table generated.

Example:

# Import pandas library
import pandas as pd

# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# print dataframe.
print(df)

html_output(df)

Returns:

<html>
<head>
<style>
body{font:1.2em normal Arial,sans-serif;color:#34495E;}replace {background-color: yellow;color: black;}insert {background-color: lightgreen;color: black;}delete {background-color: pink;color: black;}h1{text-align:center;text-transform:uppercase;letter-spacing:-2px;font-size:2.5em;margin:20px 0;}.container{width:90%;margin:auto;}table{border-collapse:collapse;width:100%;}.blue{border:2px solid #1ABC9C;}.blue thead{background:#1ABC9C;}thead{color:white;}th,td{text-align:center;padding:5px 0;}tbody tr:nth-child(even){background:#ECF0F1;}tbody tr:hover{background:#BDC3C7;color:#FFFFFF;}.fixed{top:0;position:fixed;width:auto;display:none;border:none;}.scrollMore{margin-top:600px;}.up{cursor:pointer;}
</style></head><body><table class="blue" border = 1>
<tbody>
	<tr style = "background-color : #1ABC9C">
		<th>Line Number</th>
		<td>File 1</td>
		<td>File 2</td>
		<td>Column 3</td>
	</tr>
	<tr>
		<th>0</th>
		<td>tom</td>
		<td>10</td>
	</tr>
	<tr>
		<th>1</th>
		<td>nick</td>
		<td>15</td>
	</tr>
	<tr>
		<th>2</th>
		<td>juli</td>
		<td>14</td>
	</tr>
</tbody>
</table>
</body>
</html>
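
The row-by-row rendering shown above can be sketched without pandas. The function rows_to_html below is a hypothetical illustration, not the package's html_output: it takes a plain header list and a list of rows, and produces only the table element without the surrounding page or stylesheet.

```python
from html import escape


def rows_to_html(header, rows):
    """Render a header and a list of rows as a minimal HTML table string."""
    parts = ["<table>", "<tbody>"]
    head_cells = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    parts.append(f"\t<tr>{head_cells}</tr>")
    for index, row in enumerate(rows):
        # The first cell of each row is its index, like the output above.
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        parts.append(f"\t<tr><th>{index}</th>{cells}</tr>")
    parts += ["</tbody>", "</table>"]
    return "\n".join(parts)


print(rows_to_html(["Line Number", "Name", "Age"], [["tom", 10], ["nick", 15]]))
```

This sketch escapes every cell, which is the safe default for arbitrary text. A report that embeds markup tags inside cells would need to leave those tags unescaped.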

Markup

Functions

    markUpDifferences(string_a, string_b)
        :param string_a: String one to compare
        :param string_b: String two to compare
        :return: String A, String B after the markup tags are applied

Marks up two strings with <insert>, <replace>, and <delete> tags by comparing the differences between them.

Any text present in string_a but not in string_b is wrapped in <delete> tags.

    markUpDifferences("Hello World !","Hello !")

    returns >> "Hello <delete>World </delete>!"

Any text present in string_b but not in string_a is wrapped in <insert> tags.

    markUpDifferences("Hello !","Hello World !")

    returns >> "Hello <insert>World </insert>!"

Any text in string_a that is replaced by different text in string_b is wrapped in <replace> tags.

    markUpDifferences("Brown Fox","Brown Box")

    returns >> "Brown <replace>F</replace>ox"
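
The three tagging rules can be reproduced with a character-level diff. The sketch below uses difflib.SequenceMatcher from the standard library; the function name mark_up is hypothetical, and this is an illustration of the technique rather than the package's markUpDifferences implementation.

```python
from difflib import SequenceMatcher


def mark_up(string_a, string_b):
    """Return both strings with <delete>, <insert>, and <replace> tags added."""
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, string_a, string_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        part_a, part_b = string_a[a0:a1], string_b[b0:b1]
        if tag == "equal":
            out_a.append(part_a)
            out_b.append(part_b)
        elif tag == "delete":    # only in string_a
            out_a.append(f"<delete>{part_a}</delete>")
        elif tag == "insert":    # only in string_b
            out_b.append(f"<insert>{part_b}</insert>")
        else:                    # "replace": differs on both sides
            out_a.append(f"<replace>{part_a}</replace>")
            out_b.append(f"<replace>{part_b}</replace>")
    return "".join(out_a), "".join(out_b)


print(mark_up("Hello World !", "Hello !")[0])  # Hello <delete>World </delete>!
print(mark_up("Hello !", "Hello World !")[1])  # Hello <insert>World </insert>!
print(mark_up("Brown Fox", "Brown Box")[0])    # Brown <replace>F</replace>ox
```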

mark_green(string)
    :param string: String to be marked
    :return: returns a String with markup tags

The function wraps the input string in <insert></insert> tags, added as a prefix and a suffix, and returns the marked-up string.

mark_red(string)
    :param string: String to be marked
    :return: returns a String with markup tags

The function wraps the input string in <delete></delete> tags, added as a prefix and a suffix, and returns the marked-up string.

mark_yellow(string)
    :param string: String to be marked
    :return: returns a String with markup tags

The function wraps the input string in <replace></replace> tags, added as a prefix and a suffix, and returns the marked-up string.
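
Since the three helpers differ only in the tag name, they can share one wrapper. A minimal sketch, with wrap_tag, green, red, and yellow as hypothetical stand-ins for the functions documented above:

```python
from functools import partial


def wrap_tag(tag, text):
    """Surround text with an opening and a closing tag of the given name."""
    return f"<{tag}>{text}</{tag}>"


# Hypothetical stand-ins for the three marking helpers described above.
green = partial(wrap_tag, "insert")
red = partial(wrap_tag, "delete")
yellow = partial(wrap_tag, "replace")

print(green("World "))  # <insert>World </insert>
```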

Fuzzy comparison

Functions

Ratio

ratio(t1, t2, ratio_type='default')

tokenSetRatio(t1, t2)
    :param t1: Text string 1
    :param t2: Text String 2
    :return: Returns the Token Set Ratio score between the two given text

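The Ratio option computes a similarity score between the two strings passed to the function, based on the Levenshtein distance. It returns a percentage value: a score of 90% means that string A is 90% similar to string B. It is a direct string-to-string comparison.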
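
For readers unfamiliar with the Levenshtein distance, the sketch below computes it with the classic dynamic-programming recurrence and turns it into a percentage. The normalization used here (one minus distance divided by the longer length) is one common choice and an assumption of this example; FuzzyWuzzy and diffReport may normalize differently. Both function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity_percent(a, b):
    """Similarity from 0 to 100, assuming normalization by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))              # 3
print(similarity_percent("Brown Fox", "Brown Box"))  # 89
```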

Partial Ratio

partialRatio(t1, t2)
    :param t1: Text string 1
    :param t2: Text String 2
    :return: Returns the Partial Ratio score between the two given text

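The function computes the ratio of the most similar substring as a number between 0 and 100. Partial Ratio allows substring matching: it takes the shorter string and matches it against all substrings of the same length in the longer string. If the first string occurs as a substring of the second, it counts as a match.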
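
The sliding-window idea described above can be sketched as follows. This is a brute-force illustration that tries every window and scores it with difflib; the names simple_ratio and partial_ratio_sketch are hypothetical, and real libraries typically use a faster strategy for choosing candidate windows.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def partial_ratio_sketch(a, b):
    """Best score of the shorter string against same-length windows of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # treat an empty input as no match (a choice made for this sketch)
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(partial_ratio_sketch("World", "Hello World !"))  # 100
```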

Token Sort Ratio

tokenSortRatio(t1, t2)
    :param t1: Text string 1
    :param t2: Text String 2
    :return: Returns the Token Sort Ratio score between the two given text

Token Sort Ratio tokenizes the strings, ignoring case and punctuation. It sorts the tokens of both strings and then runs a simple ratio on the results.
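
A minimal sketch of that procedure, using a regular expression for tokenizing and difflib for the final ratio. The names tokens and token_sort_ratio_sketch are hypothetical, and the tokenizing rule (runs of word characters) is an assumption of this example.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def token_sort_ratio_sketch(a, b):
    """Sort the tokens of each string, rejoin them, and compare the results."""
    sorted_a = " ".join(sorted(tokens(a)))
    sorted_b = " ".join(sorted(tokens(b)))
    matcher = SequenceMatcher(None, sorted_a, sorted_b, autojunk=False)
    return round(100 * matcher.ratio())


print(token_sort_ratio_sketch("Hello, World!", "world hello"))  # 100
```

Word order, case, and punctuation no longer affect the score, so reordered phrases compare as equal.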

Token Set Ratio

partialTokenSortRatio(t1, t2)
    :param t1: Text string 1
    :param t2: Text String 2
    :return: Returns the Partial Token Sort Ratio score between the two given text

Token Set Ratio is similar to Token Sort Ratio, except that it takes out the common tokens before computing the ratio.
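
One way to realize this is to split the tokens into a shared set and two remainders, then take the best of three comparisons. The sketch below follows that approach as it is commonly described for FuzzyWuzzy-style scorers; it is an approximation under that assumption, with hypothetical names, and not a copy of any library's code.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def token_set_ratio_sketch(a, b):
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    common = " ".join(sorted(set_a & set_b))
    only_a = " ".join(sorted(set_a - set_b))
    only_b = " ".join(sorted(set_b - set_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


print(token_set_ratio_sketch("new york mets", "new york mets vs atlanta braves"))  # 100
```

Because the shared tokens are compared on their own, a string whose tokens are a subset of the other's scores 100.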

Other ratios

qRatio(t1, t2)
    :param t1: Text string 1
    :param t2: Text String 2
    :return: Returns the Q Ratio score between the two given text


wRatio(t1, t2)
    :param t1: Text string 1
    :param t2: Text String 2
    :return: Returns the W Ratio score between the two given text

PDF parser

Functions

pdfparser(path)

A function that takes the path of a PDF file as input and extracts the text in the file as a string.

Release history

Version	Released
0.1.4 (this version)	August 9, 2022
0.1.3	July 31, 2022
0.1.2	July 31, 2022
0.1.1	July 14, 2022
0.1.0	July 14, 2022
0.0.17	July 14, 2022
0.0.16	July 13, 2022
0.0.15	July 13, 2022
0.0.14	July 13, 2022
0.0.13	July 12, 2022
0.0.12	July 12, 2022
0.0.11	July 6, 2022
0.0.10	July 6, 2022
0.0.9	July 6, 2022
0.0.8	July 6, 2022
0.0.7	July 5, 2022
0.0.6	July 5, 2022
0.0.5	July 5, 2022
0.0.4	July 5, 2022
0.0.3	July 4, 2022
0.0.2	July 4, 2022
0.0.1	July 4, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

diffReport-0.1.4.tar.gz (401.4 kB)

Uploaded August 9, 2022, source

Built distribution

diffReport-0.1.4-py3-none-any.whl (21.3 kB)

Uploaded August 9, 2022, py3

Hashes for diffReport-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`43d9812eaeb6a3a078b249986674778365bbe09caec8e528c617e1a5fc739e3d`
MD5	`4234b7d0bcd380027b8f9df719133772`
BLAKE2b-256	`fca4c7ea9393c9435095f17da1e865d93258f947478ba53205c7811bb8a14593`

Hashes for diffReport-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`967d6c6bc0477e7349317c07bd9dcd92b68dc6910b8b08a21740e5de2b1e68b0`
MD5	`97fa8b426e23517dc7f1810c8f51ffa4`
BLAKE2b-256	`56c3d59ea38c055c1554ecb5508fa87082ae18f0ce5b035fcef472703a083d22`
