TASKS
- Automatic Timesheet Completion/Filling:
- Description: Build tools that make sound guesses about how a user would fill out a timesheet, and fill it out for them.
- Current Stack:
- Gradient-Boosted Models (GBMs): model per data source (1 UK firm, 1 US firm)
- Desiderata:
- Mitigating the “Cold-Start” Problem:
- Description: We need ways to limit cold-start effects when we deploy at a new site.
- Feature Importance (aka Timeseries Entry Ranking):
- Description: For the time entries we can’t quite predict with confidence, can we at least predict the likelihood that the user will actually bill for that time entry?
- That way we can put the important stuff at the top of their time entry lists.
- Generalized Modeling + Transfer Learning:
- Description: How quickly can we get away from a model for every customer?
- Feature Engineering:
- Description: Create other metrics/features to track and collect as part of data collection.
- User-level Models:
- Description: Create personalized per-user models that attend to each user's characteristics.
- An accurate NER system that may be fine-tuned on the domain of legal documents.
- Description:
- Data Anonymization:
- Description: Anonymize customer data meeting security standards.
- Current Stack:
- Locality-based Hash: Replace all proper nouns in the document titles and bodies.
- Activity Segmentation:
- Description: Since we collect the user’s activity in too much detail, we need to help them group several activities together as a single billable unit.
- Notes:
- Most clustering algos want to know how many clusters to make and don’t like leaving things out of clusters.
So this requires some creativity or research.
- Timesheet Summarization (aka Narrative Generation):
- Notes:
- I also think that narrative generation should be driven by completion tries.
We all know the theory behind those, but when you get into it, there are lots of fun implementation details (fast serialization/deserialization, how much to send, how quickly can we update it).
- Activity-Type Classification:
- Description: We also need a model that separates personal from work emails and web sessions.
- Topic Modeling (aka Document-Type Classifier):
- Description: Can we build a generic document type classifier?
- Notes:
- Each firm will have their own templates that they use for documents they create, so those should be easy.
- They will also get documents through email from other firms that would be great to identify.
- I personally think if we can figure out the document types, we can make an excellent advancement on narratives, phase/task codes and matter categorization for users.
- (Long-Term) Time-Series Analytics:
- Description: In time, we expect to show law firms analytics that help them understand how their lawyers are working, and how long cases of different types take.
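The completion tries mentioned under Timesheet Summarization can be sketched roughly as follows. This is a minimal illustration, not the production design: the names `CompletionTrie`, `insert`, `complete`, `to_json` and `from_json`, the character-level node layout, and the JSON wire format are all assumptions made for the example. It stores past narratives with counts, returns the most frequent completions for a prefix, and round-trips through a compact string, which is where the serialization questions raised in the notes start to show up.

```python
import json
from typing import Dict, List, Tuple


class CompletionTrie:
    """Character-level trie over past narratives (hypothetical sketch)."""

    def __init__(self) -> None:
        # Each node is a dict: "c" holds children, "n" counts narratives ending here.
        self.root: Dict = {"c": {}, "n": 0}

    def insert(self, text: str, count: int = 1) -> None:
        node = self.root
        for ch in text:
            node = node["c"].setdefault(ch, {"c": {}, "n": 0})
        node["n"] += count

    def complete(self, prefix: str, limit: int = 5) -> List[str]:
        # Walk down to the node for the prefix; no match means no completions.
        node = self.root
        for ch in prefix:
            node = node["c"].get(ch)
            if node is None:
                return []
        # Collect every stored narrative below that node (iterative DFS).
        found: List[Tuple[int, str]] = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur["n"] > 0:
                found.append((cur["n"], text))
            for ch, child in cur["c"].items():
                stack.append((child, text + ch))
        # Most frequent first, ties broken alphabetically.
        found.sort(key=lambda item: (-item[0], item[1]))
        return [text for _, text in found[:limit]]

    def to_json(self) -> str:
        # Compact separators keep the payload small (assumed wire format).
        return json.dumps(self.root, separators=(",", ":"))

    @classmethod
    def from_json(cls, payload: str) -> "CompletionTrie":
        trie = cls()
        trie.root = json.loads(payload)
        return trie


# Illustrative narratives only; not taken from customer data.
trie = CompletionTrie()
trie.insert("Draft deposition outline", count=3)
trie.insert("Draft discovery requests")
trie.insert("Review deposition transcript")
suggestions = trie.complete("Draft d")
restored = CompletionTrie.from_json(trie.to_json())
```

A character-level trie serialized as nested JSON is the simplest thing that works; a real service would likely need a more compact encoding and incremental updates, which this sketch does not attempt.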
DATA
- Input: User Activity
- User Activities:
- Document Type (worked on) / Activity Type (e.g. phone call)
- Document Title
- Time-Period of Work (Total + Work-Intervals)
- First 8K of the Document
- Past Behavior
- Output: Timesheets
- Timesheet Entry:
- Case Number: i.e. related matter/subject
- Phase/Task code: e.g. Discovery - Depositions (see UTBMS on Wikipedia for details)
- Narrative: short summary of the (case?)
- Data-Collection: Engineering tracks a user's activity, and we use the data they collect to fill in the entries
- Sources: 1 UK firm + 1 US firm
- Notes:
- Labeled Targets Exist (labeled time-sheets)
- Input Features are limited
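The input and output records described above could be written down as typed structures along these lines. This is only a sketch: every field name is an assumption, "first 8K" is read here as 8192 characters (the notes do not say bytes or characters), the case number is a placeholder, and Past Behavior is left out because the notes do not say what form it takes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

MAX_BODY_CHARS = 8 * 1024  # assumption: "first 8K" means 8192 characters


@dataclass
class UserActivity:
    """One tracked activity (model input). Field names are hypothetical."""

    activity_type: str  # e.g. "document", "email", "phone call"
    title: str
    work_intervals: List[Tuple[datetime, datetime]]
    body_prefix: str = ""
    document_type: Optional[str] = None

    def __post_init__(self) -> None:
        # Keep only the leading part of the document body.
        self.body_prefix = self.body_prefix[:MAX_BODY_CHARS]

    @property
    def total_time(self) -> timedelta:
        # Total time is the sum of the individual work intervals.
        return sum((end - start for start, end in self.work_intervals), timedelta())


@dataclass
class TimesheetEntry:
    """One labeled timesheet line (model target). Field names are hypothetical."""

    case_number: str
    phase_task_code: str
    narrative: str
    activities: List[UserActivity] = field(default_factory=list)


activity = UserActivity(
    activity_type="document",
    title="Deposition outline",
    work_intervals=[
        (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 45)),
        (datetime(2020, 1, 6, 13, 0), datetime(2020, 1, 6, 13, 30)),
    ],
    body_prefix="x" * 10_000,
)
entry = TimesheetEntry(
    case_number="PLACEHOLDER-0001",  # placeholder, not a real matter number
    phase_task_code="Discovery - Depositions",
    narrative="Drafted deposition outline",
    activities=[activity],
)
```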
COMPLICATIONS
- Limited Data Resources (quantity, variety)
- Different Countries have different documents \(\implies\) different datasets have different targets:
Notes: The US and UK have separate phase/task code systems, so we may always have two models.
- Single-Tenancy implies no data sharing
- Long-term storage of client Data requires full Anonymization
- Current NER is, possibly, too “aggressive”
Notes:
- Right now we have matter categorization (which case), phase task code prediction, and NER scrubbing in production.
- I’m finishing a microservice that will make cleaner narratives.
- Gilles is finishing the first gen entry grouper.
- A toy model built on Enron data, which detects personal vs. work emails, exists in demo only.
- I’m not thrilled with any of them, but the customers seem happy with getting some assistance over none.
It is a start, and we get better from here.
- A feature I forgot is the included parties on an email or phone call.
- 30% of all work is email based.
- The three primary high-dollar adjustment categories for attorney invoices are 1) inadequate description of work; 2) unreasonable time spent on the activity; and 3) lack of prior authorization for the activity.
Is the description adequate? Was the time reasonable? Was authorization actually given?
In other words, humans are required to evaluate and analyze what the software has flagged, and then to make a judgement about whether to adjust the line item or leave it alone.
- litigation code classification
- BERT for lcc
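For the entry grouper mentioned in the notes above, the constraint from the Activity Segmentation task (no fixed number of clusters, and some activities allowed to stay ungrouped) can be met with something as simple as gap-based grouping on the time axis. The sketch below is not the first-gen grouper itself. It is a one-dimensional stand-in for density-based clustering, and the function name `group_activities` and both thresholds are arbitrary assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def group_activities(
    intervals: List[Interval],
    max_gap: timedelta = timedelta(minutes=10),   # assumed threshold
    min_total: timedelta = timedelta(minutes=6),  # assumed threshold
) -> Tuple[List[List[Interval]], List[Interval]]:
    """Group intervals separated by small gaps; leave short groups ungrouped."""
    groups: List[List[Interval]] = []
    for interval in sorted(intervals):
        # Extend the current group if this interval starts soon after it ends.
        if groups and interval[0] - max(end for _, end in groups[-1]) <= max_gap:
            groups[-1].append(interval)
        else:
            groups.append([interval])

    kept: List[List[Interval]] = []
    ungrouped: List[Interval] = []
    for group in groups:
        total = sum((end - start for start, end in group), timedelta())
        if total >= min_total:
            kept.append(group)
        else:
            # Too little tracked time to propose as a billable unit.
            ungrouped.extend(group)
    return kept, ungrouped


def at(hour: int, minute: int) -> datetime:
    # Helper for the illustrative data below.
    return datetime(2020, 1, 6, hour, minute)


intervals = [
    (at(9, 0), at(9, 20)),
    (at(9, 25), at(9, 50)),
    (at(11, 0), at(11, 2)),
    (at(14, 0), at(14, 30)),
]
groups, ungrouped = group_activities(intervals)
```

With this data the two morning intervals merge into one candidate billable unit, the afternoon interval forms its own, and the two-minute activity is left out. Grouping on time alone ignores matter, document and parties, so a real grouper would need more features than this sketch uses.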
Resources
- Entity Extraction:
- Accuracy Metrics For Entity Extraction
- sberbank-ai/ner-bert: BERT-NER (nert-bert) with google bert https://github.com/google-research.
- Joint NER and Classification (Papers With Code)
- Multitask learning for biomedical named entity recognition with cross-sharing structure | BMC Bioinformatics | Full Text
- Text Classification (few-shot):
- Transfer Learning:
- Text/Document Clustering:
- Topic Modeling:
- Summarization:
- Time-Series Clustering & Processing:
- Knowledge Graphs & Graph Theory:
- Calendar Modeling & Event Extraction & Time Tracking:
- Natural Language Processing — Event Extraction - Towards Data Science
- Understanding Events with Artificial Intelligence - Towards Data Science
- Learning User Preferences and Understanding Calendar Contexts for Event Scheduling
- Supercharging Scoro with Machine Learning | Scoro
- Using Machine Learning to Predict and Explain Employee Attrition
- Smart task logging : Prediction of tasks for timesheets with machine learning
- Smart task logging (Thesis)
- Leverage AI to transform time tracking into time intelligence - YouTube
- Memory AI - Fully automatic time tracking powered by deep learning | Product Hunt
- 6 Features Any Smart Timesheet App Needs. Does Yours Have Them?
- Using employee time series data to predict employee turnover (Binary Prediction using Time Series Data) : MLQuestions.
- Data Labeling:
- Security and Data Privacy:
- Resources:
- Text Matching: (NTMC-Community/MatchZoo: Facilitating the design, comparison and sharing of deep text matching models)
- lda2vec
- Anomaly Detection in Keras with AutoEncoders (14.3) - YouTube
- Relationship Extraction (Distant Supervised) (Papers With Code)
- 5hirish/adam_qas: ADAM - A Question Answering System. Inspired from IBM Watson
- machine learning smart timesheet - Google Search
- Memory AI: About | LinkedIn
- Semi-supervised Sequence Learning
- EMNLP-2019-Papers: Statistics and Accepted paper list with arXiv link of EMNLP-IJCNLP 2019