Machine learning and document collections

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,522
Location
Horsens, Denmark
At work we are inundated with massive contracts on a daily basis. 99% of these are absolutely identical and need no attention, but every once in a while there is a stinger buried in one of them that we don't catch. What I'm looking for is something that can take in all these docs (1,000+, ~300 pages each, PDF) and highlight the parts that appear in fewer than 5% of them.

I've done some googling, but doc diff software doesn't seem to support this "database" approach. I've sent an e-mail to a lawyer friend, but haven't heard back yet.
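
To make the "database" idea concrete, here is a minimal Python sketch of the counting step. It assumes the text has already been pulled out of the PDFs by some other tool, that passages are separated by blank lines, and that two passages count as "the same" only if they match exactly after lowercasing and collapsing whitespace. The function names (`normalize`, `split_passages`, `rare_passages`) and the `max_share` parameter are made up for illustration, not taken from any existing product.

```python
import re
from collections import Counter


def normalize(passage: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't count."""
    return re.sub(r"\s+", " ", passage).strip().lower()


def split_passages(text: str) -> list[str]:
    """Assumption: a blank line separates one passage from the next."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def rare_passages(docs: dict[str, str], max_share: float = 0.05) -> dict[str, list[str]]:
    """Map each document name to its passages found in fewer than
    max_share of all documents."""
    doc_freq = Counter()
    for text in docs.values():
        # A set, so a passage repeated inside one document counts once.
        doc_freq.update({normalize(p) for p in split_passages(text)})

    limit = max_share * len(docs)
    return {
        name: [p for p in split_passages(text) if doc_freq[normalize(p)] < limit]
        for name, text in docs.items()
    }
```

Calling `rare_passages({"contract_1": text_1, "contract_2": text_2, ...})` would return, per contract, the passages worth a human look. Exact matching is the weak spot: a clause with one word changed would show up as rare in every document it appears in, so a real tool would probably need some kind of fuzzy comparison instead.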

Thoughts?
 

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,741
Location
USA
It might be worth having a look at the machine learning subreddit. I spend time over there learning about fun ways to crunch data. I don't know specifically how to solve your problem, or even whether machine learning is the right solution, but it's an interesting problem to solve on multiple levels.
 