Efficient completion of an IT project, automated or not, requires good Knowledge Representation. Noeon Research architects its Knowledge Representation to handle different notions at different levels of abstraction. For example, structured entities such as code, configuration, and data models are treated at a different level of abstraction than facts and negations such as requirements, architectural decisions, and constraints; these, in turn, must be treated at a different level than causal relations, for instance, how workload affects performance and memory consumption. At a still higher level is the relation of the project's objectives to the objectives of adjacent projects.
LLMs represent knowledge in the form of learned weights in vast neural networks that approximate a conditional probability distribution. This representation proves to be very versatile. For instance, ChatGPT natively answers programming questions, producing code in various programming languages. Surprisingly, ChatGPT often performs on par with, or even better than, fine-tuned systems like Copilot or CodeWhisperer [1], presumably due to a larger context window and cross-domain knowledge transfer. However, LLM-based systems struggle with hallucinations [2]: they make up facts and invent non-existent functions and APIs.
For an enterprise system to be trustworthy, it must return correct answers wherever correct answers exist, and that correctness must be checkable. Achieving this requires much more structured representations.
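As a narrow illustration of what "checkable" can mean, the sketch below tests whether API names appearing in generated code actually resolve to real objects. It is a minimal example, not a description of Noeon Research's system: the helper `api_exists` and the sample names are assumptions made for illustration, and the check only covers existence, not correct usage.

```python
import importlib


def api_exists(dotted_name: str) -> bool:
    """Return True if `module.attr[.attr...]` resolves to a real object.

    Hypothetical helper: only the first component is treated as a module.
    """
    module_name, _, attr_path = dotted_name.partition(".")
    try:
        obj = importlib.import_module(module_name)
    except (ImportError, ValueError):
        # Unknown module, or an empty module name.
        return False
    if not attr_path:
        return True
    # Walk the remaining attribute chain one step at a time.
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


# Names as they might appear in generated code; the second one is made up.
suggestions = ["math.sqrt", "math.fast_sqrt", "os.path.join"]
report = {name: api_exists(name) for name in suggestions}
print(report)
```

A check like this is only possible because the set of valid names is available in an explicit, inspectable form, which is the property the weights of an LLM lack.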
To overcome the opacity of the weights and biases in neural network layers, researchers apply symbolic distillation [3] to recover the structure of an LLM's knowledge in the form of a Knowledge Graph. Being symbolic, structured, and explicit, Knowledge Graphs support direct reasoning about causal relationships and fact-checking. However, it is unclear whether this technique can be effectively transferred from common-sense general knowledge to domain-specific knowledge like programming.
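To make "direct reasoning" and "fact-checking" concrete, here is a minimal sketch of a Knowledge Graph stored as subject-relation-object triples. The entities, the `affects` relation, and both helper functions are illustrative assumptions, not the output of any distillation method.

```python
from collections import defaultdict

# Each fact is a (subject, relation, object) triple. All names are illustrative.
TRIPLES = {
    ("workload", "affects", "memory_consumption"),
    ("workload", "affects", "cpu_usage"),
    ("cpu_usage", "affects", "latency"),
    ("service_a", "implemented_in", "python"),
}


def has_fact(triples, subject, relation, obj):
    """Fact check: the claim holds only if the exact triple is stored."""
    return (subject, relation, obj) in triples


def downstream(triples, start, relation="affects"):
    """Follow `relation` edges transitively; return every node reachable from `start`."""
    edges = defaultdict(set)
    for s, r, o in triples:
        if r == relation:
            edges[s].add(o)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen


# The direct edge workload -> latency is not stored, so the fact check fails...
print(has_fact(TRIPLES, "workload", "affects", "latency"))
# ...but the causal chain workload -> cpu_usage -> latency is still found.
print(sorted(downstream(TRIPLES, "workload")))
```

Both answers can be traced back to specific stored triples, which is what makes the result auditable.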
In particular, if the process of compressing a corpus into an internal LLM representation loses information about relationships between domain objects, we will not be able to recover that information in the Knowledge Graph. This matters little in common-sense reasoning, where progress is measured in percentage points of accuracy, but it matters a great deal in software engineering, where a single incorrect API call can crash the program.
It is tempting to think that recovering a symbolic representation of knowledge entirely obviates the need for an LLM. However, instructions are usually given in natural language, which needs to be translated into a graph query language to interact with the symbolic knowledge base. LLMs are the best known tool for this kind of translation.
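The division of labour described here can be sketched as follows. The knowledge base answers structured pattern queries, and a separate step turns a natural-language question into such a pattern. In this sketch the translation step is a hard-coded lookup table standing in for an LLM; `translate`, `match`, the tuple-based query format, and all entity names are assumptions made for illustration, not a real graph query language.

```python
# A tiny symbolic knowledge base of (subject, relation, object) triples.
TRIPLES = [
    ("parse_config", "defined_in", "config.py"),
    ("parse_config", "calls", "read_file"),
    ("load_settings", "calls", "parse_config"),
    ("main", "calls", "load_settings"),
]


def match(triples, pattern):
    """Return the triples matching `pattern`; None in a slot is a wildcard."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def translate(question: str):
    """Hypothetical stand-in for the LLM translation step (a lookup table here)."""
    canned = {
        "What calls parse_config?": (None, "calls", "parse_config"),
        "Where is parse_config defined?": ("parse_config", "defined_in", None),
    }
    return canned[question]


# Natural-language question -> structured pattern -> exact answer from the graph.
callers = [s for s, _, _ in match(TRIPLES, translate("What calls parse_config?"))]
print(callers)
```

Only the translation step is fuzzy; once the pattern exists, the answer is computed deterministically from stored facts.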
Ontologies [4] are the best-known form of structured Knowledge Representation. However, there is no universal philosophical and methodological approach to ontology construction, which results in a multitude of mutually incompatible domain-specific ontologies [5]. Moreover, since most ontologies are based on Description Logic [6], they are poorly suited to representing procedural (algorithmic) knowledge, which severely limits their usefulness for automatic code transformation.