
Solving Data Problems in Management Accounting


Formal Metadata

Title
Solving Data Problems in Management Accounting
Title of Series
Number of Parts
141
Author
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Controllers deal with numbers all day long. They have to check a lot of data from different sources, and the reports often contain erroneous or missing data. Identifying outliers and suspicious data is time-consuming. This presentation introduces an end-to-end workflow for small-data problems, using statistical tools and machine learning to make controllers' jobs easier and help them be more productive. We will demonstrate how we used, amongst others, scipy, pandera, dirty_cat, nltk, and fastnumbers (pypi.org/project/fastnumbers/) to create a self-improving system that automates the screening of reports and flags outliers in advance, so that they can be eliminated more quickly.
Transcript: English (auto-generated)
Yeah, thank you very much for joining. We're talking about a problem in management reporting. I have to say upfront, it's not a fancy problem, but a very pragmatic and very common problem in companies nowadays. So first, let me introduce what management reporting is. Companies need to manage: they make predictions, they look into key KPIs, they help leaders make informed decisions, and they check whether the data they receive is correct. Basically, it's very important for navigating the company and its strategic decisions.
And of course, it's often pretty far from what we do; it's very often pretty far from people who are programming or doing data science, but it's very number-heavy, so it's crucial for a company's success. The tasks of controllers are things like financial reporting, budgeting, forecasting, and internal controls: how budgets are spent, whether the numbers they receive actually make sense, or whether they do not fit the expectations and need to be followed up, for example. Like if you have a shop with Christmas stuff and you have a higher expectation for the sales of Christmas stuff, you would ask: okay, why did we sell less Christmas stuff than expected? A possible answer could be, and it's not always that people make mistakes, that in the past years a valid answer could have been: yeah, sorry, Corona, something like that.
It's also about understanding what affects sales and the successes in the company. So there's a lot of number crunching involved, and data accuracy is absolutely critical. On the one hand, crunching some numbers doesn't sound that hard. The thing here is that there's a lot of financial data from structured systems, and there are tons of Excel sheets; in some companies there are really thousands of different reports, and these reports are made by systems and by people. Very often, the people who create these Excel sheets also want to make them visually appealing, for example to show them to their superiors, and that's why the structure changes very often. So there is no quality process or pre-definition of how these reports have to look. I know, we all know how we could do it better nowadays with the technology, but these are processes in companies that have been operational for quite some time, and they also need to be addressed. There are also small data sets, for example a file listing all the shops, their square meters and so on; stuff for which it doesn't really make sense to build a database. Sometimes it's just a simple Excel sheet, but it's important, for example, to compare the sales to the size of the shop in square meters, or, because it's becoming more and more important, to see what the energy consumption of the shop per square meter is. These are also things controllers look into.
So as you see, we have tons of requirements and a pretty open definition of how the data structures look. The use case when we started was this. The process before was: okay, go to the network drive, pick the reports, which are usually generated monthly or quarterly, basically pull them out, find anomalies, and finding anomalies basically means looking at the numbers, just reading them, looking for outliers, maybe with some Excel, and then following up on them. Following up basically means: write an email and wait for feedback. On the upper left side we see the person who hands in the report, and down here is the controller, and then you basically always follow up: hey, there's a question; you get feedback by email: no, that was right, sorry, there was heavy weather, sales were better or worse for whatever reason. And all of this is accessed in what I would already flag as a data silo: it's network drives, network drives with some folder structures and other things. So you see there's a lot of communication going on. And what are the problems here? Communication effort; misunderstandings, because when people talk there are always misunderstandings; heavy manual selection, like pulling the reports out of the network drive; and there is also a lot of human bias involved. For example, if you know: ah, it's that report from that shop again, they always mess something up, there might be a strong bias involved. And of course it also makes a big difference whether you read reports in the morning when you're really fresh or in the afternoon. So there's a lot of communication and misunderstanding going on here, and you see there's also a cycle that can potentially never end until things are cleared up.
So our use case was: okay, we know this is not the optimal setup for collecting and processing data for reporting, as we would imagine it if we built it new, but for the meantime we have to work with these reports until the company can establish a modern structure here as well. Beforehand, we had a manual review for identifying outliers and anomalies and following up. Where you see the robot arm down there, that's where we can help by automating the identification of outliers and anomalies; we still have a human review process, so the AI is just an assistant here as well, killing the boring stuff and taking over a lot of the heavy lifting, and the follow-up is automated as well. So on the next slide you just have to change the middle part, and if you look down right, you see the controlling department already has way fewer processes to go through and things to look into. So if there's an outlier from the expectancy, the system automatically creates a message to the person who feels responsible for the report; they can comment on that, it's stored in the system, and basically when the final review comes, we have everything collected here. And of course this improves the process a lot, because we have higher quality when looking at the data, we have the adaptations and comments saved in a central system and not just in email silos, and there's more focus on the actual controlling part here.
So this is the middle use case we're talking about. And what are the challenges? We are actually dealing with small data sets here, and there are many, many reports; it would be too much work to pre-configure the expectations for the individual columns of individual reports up front, so we had to find a way to automate this as well. It's dirty data: there are some columns with a lot of missing values, and that can be okay or not. A missing value could be perfectly fine, or a missing value could also be an error, and we had to find ways to navigate this: how can we build a system that gives the best advice to the controller and also enables them to consecutively build up a pre-configured system that knows more about what the expectancy is here, in terms of the data type or the numeric range, for example? So we can have a system that constantly improves while being in production. And the delivery is usually expected on short time frames. So our job was building a smart solution that can be optimized and configured over time. I think we already covered the different formats and things like that. But it's important that historic data is relevant for future predictions, and in most use cases it's only about three years; it doesn't make sense to look much further back.
And yeah, now I would like to hand over to my colleague Lucas. Thank you, Alex. So in the following, the structure of the talk is that I elaborate a little bit on: okay, PDF seems to be a common format, and we had it for our customers; is there a way to get this into our pipeline, or is it a completely useless format for machines? Then we will talk about the methods for outlier detection and how we integrated them into our pipelines. So, it's an anecdotal story. As I said, PDFs are the most common exchange format between people when you want to make sure that what you send is what the other person receives, right? So in that sense, it's a really good format; it's common and something that stakeholders expect. The trouble is that it's a really bad format for exchanging information between machines, because it's not well structured. It's not a well-defined data structure in which you have your data types and then your columns of data.
Instead, it's something that can look quite different for very similar data inputs. What we have observed is a common scenario: you have some raw data, some person does the reporting and processing and compiles a report, and then one year later nobody knows where this raw data was and how the script looked that eventually produced this PDF document, right? What I've illustrated here is something that would be typical for a simple Excel table: pay 20 bucks for Adobe Creative Cloud, click on "edit PDF", and you get a sense of how the PDF is structured. What you see is that you have boxes floating around on a page, and those are not unified in any way. Here, for example, you have one box from the object code that spans all the data, and here in the middle you have a box that originally belongs to two columns; I think the reason for this is just that the letters were close to each other, so it was exported as one box. And now the question is: we don't live in a perfect world, and we don't want to wait for the perfect world, so we have valuable data in our PDF documents; can we make something out of it, right?
One Python tool that I would like to introduce, and that you possibly heard about earlier this morning, is Camelot. With Camelot it's relatively easy to read a PDF document when it's text-based. If it's a scanned document that would need OCR, then you are kind of lost; there's not much hope at this moment. So with relatively few lines of code you can try to parse your PDF document and get a data frame out of it, and in this case, for the simple table I've introduced, you see that it works reasonably well.
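A minimal sketch of what that can look like; the file name and page selection are placeholders:

```python
import camelot  # pip install "camelot-py[cv]"

# Parse the tables on the first page of a text-based PDF report.
# flavor="lattice" works for tables with ruled lines; "stream" is the
# alternative for whitespace-separated tables.
tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")

for table in tables:
    # Each extracted table comes with a parsing report including an accuracy
    # score, which is the "confidence report" mentioned above.
    print(table.parsing_report)
    df = table.df  # a plain pandas DataFrame, ready for further cleaning
    print(df.head())
```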
However, there are some cases in which Camelot is not able to extract the information, and you then get a confidence report in addition to your actual data, should you be interested. If you then go to more elaborate PDF documents, like this one, where you have merged cells, where one cell is visually longer than the others just because there's no information in the middle and Excel lets the cell extend beyond its original margin, then things get a bit more troublesome. You see that here, too, we have minor points where we would have to do manual rework, and we also see that two columns got concatenated into one. So what is the summary? I would say it's an approach that doesn't scale, but if you have really interesting data that you want to incorporate into your pipeline, then it might be worth the effort. And of course, with the recent advent of large language models, maybe there will be a change in how we do this. For example, I could imagine that if you have a strong general-purpose visual model and combine it with a large language model, you would actually be able to do all those things even on ill-structured formats like PDFs. If you need OCR, then it's really even more difficult, so I don't think it is as feasible at the moment.
What I would like to do now is give you some idea about outliers and anomalies and the methods that are commonly used to detect them, to get insight into whether the data has errors, whether there's some fraud, and whether there's maybe some novelty or anomaly in your data. So first of all, an outlier is commonly defined as something that differs significantly from other observations. That's a broad definition, but it fits well. The sources for outliers and anomalies are usually chance, so it could just be a statistical effect; it could be a measurement that is incorrect, for example a broken sensor if I think about physics; or it could be some novelty that has a cause, so it's not an error but something that you've never observed before. And for outliers we need to note that the context is important. If you look at this, for example, this is a 2D normal Gaussian distribution with some samples from it, and normally you would say: okay, this can happen, but it's really unlikely that you observe this data point; so this would be an outlier that is defined just by comparing its value to the rest of the values from the same distribution. But if you look at this violin plot of a distribution here, normally you wouldn't call this point an outlier; yet if you put it into the context of the rest of the data, which is a time series in this case, you see that given this context it should indeed be classified as an outlier. For example, if you have a solar power plant and you observe a super high power output at night, then the power output alone doesn't tell you whether it's an outlier or not, but the context gives you the idea that something might be off. Okay, with this definition established, we want to go over the methods and start with one that is particularly useful if you have a lot of information about your business data and something like an expert system.
Rule-based methods were, a few years ago, a common thing even in computer vision, where people really designed handcrafted features, and of course if you work with tabular data, it's a legitimate tool. So if you have business knowledge and you know that certain things need to be in a certain range, or that certain events can or cannot happen, then rule-based methods are a legitimate tool. The trouble is, of course, that if you have many of them, it gets quite complex and unmaintainable, and your execution order can be important for whether you flag something as an outlier or not. To illustrate this with an example: if you say, okay, I run my sprinklers every day at 9, and then it's raining for two weeks, then it might not be a good idea to have a hard-coded if-else statement for classifying something as a novelty.
To summarize, rule-based methods have the strong benefit that they are really easy to explain, and explainability is something that other methods, for example deep learning methods, usually lack; it is of course really important when you communicate with your colleagues. They are reliable in their environment and they are simple. On the other hand, they are hard to maintain, they are quite rigid, and the scope is of course limited.
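As a minimal sketch of such a rule-based check on a report; the column names and thresholds here are made up for illustration:

```python
import pandas as pd

def flag_rule_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that violate simple, hand-written business rules.

    The columns "revenue", "shop_area_m2" and "opening_date" are hypothetical;
    in practice the rules come from controller domain knowledge.
    """
    rules = {
        "negative_revenue": df["revenue"] < 0,
        "implausible_area": ~df["shop_area_m2"].between(10, 10_000),
        "revenue_before_opening": (
            (pd.to_datetime(df["opening_date"]) > pd.Timestamp.today())
            & (df["revenue"] > 0)
        ),
    }
    flags = pd.DataFrame(rules)
    # Keep only violating rows and list which rules each of them breaks.
    return df[flags.any(axis=1)].assign(
        violated=flags.apply(
            lambda row: [name for name, hit in row.items() if hit], axis=1
        )
    )
```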
If you then go to more recent methods, one common example is the isolation forest, quite similar to a random forest. The assumption here is that outliers end up high in the tree, close to the root, whereas inliers, so normal data points, end up lower in the tree. The reason is that an outlier can be isolated with only a few random splits, while it is hard to come up with a combination of splits that separates a normal point from all its neighbours.
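A minimal sketch with scikit-learn; the feature columns are placeholders:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical numeric report columns; in practice these come from the parsed report.
df = pd.DataFrame({
    "revenue": [100, 105, 98, 110, 5000, 102],
    "costs":   [ 60,  62, 59,  65,   61,  60],
})

# contamination is the expected share of outliers; tune it per report type.
model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(df)                      # +1 = inlier, -1 = outlier
df["anomaly_score"] = model.decision_function(df)   # lower = more anomalous
print(df[labels == -1])                             # the suspicious rows
```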
There are also support vector machines that you can train in a one-class fashion; here you would have all your normal data points on one side of the decision margin and your potential outliers on the other side. For support vector machines, it's crucial to think about which kernel you want to use.
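A sketch of a one-class SVM trained on data assumed to be mostly normal, again with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=100, scale=5, size=(200, 2))  # historical, "known good" figures
new = np.array([[101.0, 99.0], [160.0, 40.0]])         # this month's figures to screen

scaler = StandardScaler().fit(normal)
# nu bounds the fraction of training points treated as outliers;
# the RBF kernel is a common default, but the kernel choice matters.
svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(scaler.transform(normal))
print(svm.predict(scaler.transform(new)))  # +1 = looks normal, -1 = suspicious
```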
Moreover, there are autoencoders. The idea is: I have an input and an output and a bottleneck in between. It is harder to map an outlier through that bottleneck, because the assumption of an autoencoder is that your real data can be represented on a lower-dimensional level, and if you have something the model cannot explain, then you would need more information than fits into the bottleneck. So in that sense, an autoencoder is a tool you can use to detect outliers, by looking at how badly a point is reconstructed.
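A very small sketch of that idea, using scikit-learn's MLPRegressor as a stand-in autoencoder; a real setup would more likely use PyTorch or Keras, and the data and layer sizes here are made up:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))   # mostly "normal" rows from historical reports
X[0] = 8.0                       # plant one obvious anomaly

Xs = StandardScaler().fit_transform(X)

# An autoencoder is just a model trained to reproduce its input through a
# narrow hidden layer (the bottleneck); here 2 units for 8 input features.
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
ae.fit(Xs, Xs)

reconstruction_error = ((ae.predict(Xs) - Xs) ** 2).mean(axis=1)
print(np.argsort(reconstruction_error)[-5:])  # indices of the worst-reconstructed rows
```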
What we also have now, with the recent advent of deep learning in particular, are quite powerful models for which the training procedure is getting more and more standardized, right? All those libraries like PyTorch, PyTorch Lightning, Keras and so on allow you to train really powerful models in an ever simpler fashion. But here, of course, the trouble is that oftentimes you need to do manual annotation; deep learning models in particular are notoriously overconfident; and with all models, domain robustness is an issue: just because a model worked in one environment, namely the training environment, doesn't really tell you much about an unseen environment. And then what is left, of course, are statistical methods.
You can simply look at the difference between your data point and the mean, or, if you want to compare distributions, you can use for example the Kolmogorov-Smirnov test, where you compare the empirical cumulative distribution functions and look at how large the biggest difference between them is, and that gives you an estimate of whether the distributions are similar or not.
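For example, a quick two-sample KS check of whether this month's figures look like last year's, on made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
last_year = rng.normal(loc=100, scale=10, size=365)   # historical daily revenue
this_month = rng.normal(loc=130, scale=10, size=30)   # suspicious new batch

# ks_2samp compares the two empirical distribution functions.
result = stats.ks_2samp(last_year, this_month)
print(result.statistic, result.pvalue)  # a tiny p-value says: these distributions differ
```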
Okay. Then there are clustering algorithms, but since I need to speed up a little bit, I want to talk about what you do if you don't have any information about your data, and what we found useful in our case for the controlling reports. For example, something as simple as this: your data frame says a column is a string, but if you look at the data inside, you notice that pandas converted it to a string only because one data entry had a white space somewhere.
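A small sketch of how to catch such a column; the column name is made up:

```python
import pandas as pd

df = pd.DataFrame({"revenue": ["100", "105 ", " 98", "n/a", "110"]})  # object dtype

# Try to interpret the column as numbers; anything that fails becomes NaN.
as_number = pd.to_numeric(df["revenue"].str.strip(), errors="coerce")

# Entries that are non-empty strings but not parseable as numbers are suspicious.
suspicious = df["revenue"][as_number.isna() & df["revenue"].str.strip().ne("")]
print(suspicious)        # -> "n/a"
print(as_number.dtype)   # float64: the column was numeric apart from a stray entry
```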
Those things sound really simple, but of course it's something you should look at when you want to consolidate such a reporting pipeline. And as I indicated here, fuzzy string matching, and this metric at the bottom, I found really useful. There are many ways to put a number on how similar strings and texts are, and what you can do with it, for example, is clean up categories. If you see that most of your data falls into, I don't know, four categories, and then there are a couple of others, you might look at the string similarity and notice that it's just a typo that occurred, and clean up your categories like that. In our context we found string similarities useful, for example, when the customer had similar invoice numbers: the string similarity was really striking, and the amount of money transferred by the bank was the same, so we could nail it down: this was actually a duplicated payment, because people didn't know that this invoice had already been processed. By showing that we had a really similar invoice number together with the same total, we could save some money there. And just to conclude this: there are heuristics such as the Levenshtein string distance, which simply measures how many edits you need to apply to get from "appel" to "apple"; in this case it would be two, you have to change the "e" and you have to change the "l". This is a common metric; there are a few other metrics, and there are also metrics that go more into the semantic context of a longer text.
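A tiny sketch with nltk's edit distance (nltk is also mentioned in the abstract), plus a normalized similarity from the standard library:

```python
from difflib import SequenceMatcher
from nltk.metrics.distance import edit_distance

print(edit_distance("appel", "apple"))                  # 2 edits
print(edit_distance("INV-2023-0815", "INV-2023-0816"))  # 1 edit: suspiciously close invoice numbers

# A normalized similarity in [0, 1] is often easier to threshold on.
ratio = SequenceMatcher(None, "INV-2023-0815", "INV-2023-0816").ratio()
print(ratio)  # ~0.92
```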
What I want to stress is that data splits are really important, particularly for powerful models. It's really important to think about where you split your data so that you do not introduce data leaks. If these are all your data samples and the yellow ones are your test set, then possibly a yellow sample and its neighbour are related to each other, and what you do is leak test data into your training set, and then you don't know how well your model really performs. So it's really important to pay attention to that.
Okay, in summary: I've introduced what outliers are; they are really important for detecting errors, fraud, and significant business events. Now I would like to show you a bit of how we integrated this into code and how we got our fingers dirty. We mainly use pandera as the outermost layer. Pandera is a tool that lets you validate data, similar to pydantic for example, but on data frames. What you can do is type checking, checking by custom rules, and also statistical checks. You can specify this in a declarative specification, or, if you're already familiar with pydantic and data classes, you can use its schema models, and it's really easy. What you would do is: you have, for example, a data frame that you want to validate, which should have year, month, day, and revenue, and in this schema-model fashion you just specify what your data frame should look like. This is the input schema: the year should be greater than two thousand, and so on, and you can coerce the data type: if this is a string first, pandera will try to convert it from a string to an int. What you can then do is take some function, decorate it, and add your type information, and pandera will check and validate the data frame. What happens if it's not correct? It throws an error; for example, here it would say: okay, at index 2 in my data frame I have a failure case. Or you can do lazy validation, in which case you collect all the errors found for one data frame and get them summarized in an error data frame, kind of.
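A minimal sketch of the schema-model style described above; the column names follow the talk's example, and exact API details may vary slightly between pandera versions:

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class ReportSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)  # "2021" as a string gets coerced to int
    month: Series[int] = pa.Field(ge=1, le=12)
    day: Series[int] = pa.Field(ge=1, le=31)
    revenue: Series[float] = pa.Field(ge=0)


@pa.check_types
def process(df: DataFrame[ReportSchema]) -> DataFrame[ReportSchema]:
    return df  # validation happens on the way in (and out)


df = pd.DataFrame({"year": ["2021", "2022", "1999"], "month": [1, 2, 3],
                   "day": [10, 11, 12], "revenue": [100.0, 110.0, 95.0]})

try:
    # lazy=True collects every failure instead of stopping at the first one.
    ReportSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # one summary data frame with all failing checks and indices
```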
We integrated it into the controlling case by having global checks, so checks that we found useful regardless of the data and the concrete report we observed, and then also checks on report level and column level, to really have a modular system in which we could specify all the checks. One small outlook we want to give is skrub, or maybe some of you have heard of dirty_cat. It's a tool for when you have dirty categories, as the old name suggested, and it allows you, for example, if you have a data frame, to get a fast vectorized transformation of your data frame to use in machine learning models. Skrub is not yet released, but it's a work in progress, there's already some information out there, and I think you can try it out.
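A sketch of what that looks like with dirty_cat's TableVectorizer (called SuperVectorizer in older dirty_cat releases; skrub's equivalent class is also named TableVectorizer); the data here is made up:

```python
import pandas as pd
from dirty_cat import TableVectorizer  # in skrub: from skrub import TableVectorizer

df = pd.DataFrame({
    "shop":    ["Berlin Mitte", "Berlin-Mitte", "Hamburg Nord", "Munich"],  # dirty categories
    "month":   ["2023-01", "2023-01", "2023-02", "2023-02"],
    "revenue": [100.0, 98.0, 250.0, 300.0],
})

# TableVectorizer picks a sensible encoder per column (e.g. similarity-based
# encoding for the messy "shop" strings) and returns one numeric matrix
# that can go straight into a scikit-learn model.
vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)
print(X.shape)
```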
Good. In summary, our solution gives the customer something like this. It's just an output format in our case: the controlling department would get, for a given report, some information about how many failures there are and so on, it would list the failures in the data frame, and it would also indicate them visually. For example, here we would observe that the opening date is in the future but there is already revenue, which is an anomaly in our case. Okay, and in conclusion, we have shown you that outlier detection and solving these small data problems in controlling is a very important task, that it's important not to assume a perfect world when you get into your project, and we have shown you a couple of methods for how you can detect outliers and how they can be integrated. Thank you.
Thank you, Alexander and Lucas, for such a great session. We are out of time, so we can just take one question if anyone has one; otherwise the speakers will be around for any more questions. No one seems to have any questions. Okay.