While modern AI systems demonstrate impressive results in tasks ranging from image recognition to playing Go and even explaining jokes, these are still the early days, and there are multiple areas for improvement, some of which are listed below. Many of them are interconnected and will likely require innovation at the levels of systems architecture, scale, and hardware architecture to achieve breakthroughs.
Compositionality, generalization, and reasoning. Ability to reliably perform symbolic operations (e.g. arithmetic), generalize (if this glass broke when it hit the ground, other things made of glass are likely to break in such circumstances too), capture causality (if I push the glass off the table, it will fall and break), and learn from explicit instructions.
While the most advanced systems can grasp these concepts to some degree, they do so inconsistently and require a significant amount of training. It takes a human child just a few tries to realize that a glass pushed off the edge of a table will fall and break. An AI system may require millions of demonstrations to learn this fact. And even then, this knowledge may not generalize to other items made of glass or may be applied by the system inconsistently.
While explicit instructions are actively used in interactions with large language models (LLMs), the user is required to repeat the instructions every time they make a request, as the system cannot retain them.
A related ability is reasoning and planning at multiple time horizons and levels of abstraction. An autonomous vehicle (AV) should be able to plan its actions over the next few seconds (at the precise level of speed and steering angle), the next few minutes (at the level of which turns to take), and the full journey (at the level of freeways and exits). Most current AVs rely on a complex set of hard-coded rules to achieve that capability, which introduces significant complexity while still leaving a long tail of edge cases uncovered. Similarly, when asked a question, a language model should be able to start with the main idea of the response and then generate coherent subsections from it, rather than, as often happens, produce a set of paragraphs that contradict one another.
Multimodality. Ability to process and relate information from multiple modalities, such as text, audio, and images. Models like DALL-E and NUWA (https://arxiv.org/pdf/2111.12417.pdf) are steps in this direction.
Ability to match knowledge to context. Ability to recognize which behavior is appropriate in which situation. For example, an LLM may respond to a request about a medical condition with an answer that would work well in a sci-fi novel but would be terrible as a recommendation for a real-life case. Current LLMs cannot reliably distinguish between the two.
Uncertainty awareness. Ability to characterize uncertainty relative to the similarity of the current observations to the training data, explain it to an observer, and adjust behavior if necessary. Lacking this ability, current systems act in situations where they should not, for example by confidently answering the question of who was the president of the United States in 1492, a year when the country did not yet exist.
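One simple way to picture "uncertainty relative to the similarity to the training data" is a model that abstains when an input lies far from anything it was trained on. The sketch below is only an illustration under stated assumptions: the nearest-neighbor rule, the toy data, and the `THRESHOLD` value are all hypothetical choices, not a method described in this article or used by any particular system.

```python
import math

def predict_or_abstain(x, training_points, labels, threshold):
    """Return the label of the nearest training point, or None (abstain)
    when even the nearest point is farther away than `threshold`."""
    distances = [math.dist(x, p) for p in training_points]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    if distances[nearest] > threshold:
        return None  # input looks unlike the training data: decline to answer
    return labels[nearest]

# Hypothetical toy training set with two clusters.
training_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["a", "a", "b", "b"]
THRESHOLD = 0.5  # assumed cutoff; in practice it would have to be tuned

near = predict_or_abstain((0.05, 0.1), training_points, labels, THRESHOLD)
far = predict_or_abstain((5.0, 5.0), training_points, labels, THRESHOLD)
print(near, far)
```

The first query sits inside a training cluster and receives a label, while the second is far from all training points and triggers an abstention instead of a confident guess.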
Continual learning in deployment. The current paradigm separates training from inference (operation), whereas intelligent creatures in biology learn continuously. For a system to learn something new, it has to be trained in a separate environment with additional data, and then the model in production is swapped for the newly trained one. A related issue is catastrophic forgetting, where training the system on new data leads to substantial degradation of its previously developed capabilities.
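Catastrophic forgetting can be illustrated with a deliberately tiny example. The sketch below assumes a one-weight linear model trained with plain stochastic gradient descent on two made-up tasks that share that weight; the tasks, learning rate, and epoch count are hypothetical, and real networks forget in far less clear-cut ways.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.1, epochs=50):
    """Plain SGD on squared error, one sample at a time."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

# Two hypothetical tasks that compete for the same single parameter.
task_a = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5)]   # target: y = 2x
task_b = [(x, -1.0 * x) for x in (0.5, 1.0, 1.5)]  # target: y = -x

w = train(0.0, task_a)
error_a_before = mse(w, task_a)   # close to zero: task A has been learned

w = train(w, task_b)              # continue training on task B only
error_a_after = mse(w, task_a)    # large: task A has been overwritten

print(error_a_before, error_a_after)
```

Because nothing in the second training phase refers to task A, the weight simply moves to wherever task B wants it, and performance on task A collapses.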
Explainability. Enabling human experts to understand the underlying factors behind an AI decision. The difficulty stems from the fact that an AI system’s behavior is not hard-coded but learned from data. There is no set of instructions to trace in order to locate a rogue line of code. Instead, the process looks much more like doing an fMRI to analyze activity and interactions between the different parts of the system.
Safety and Alignment. Ensuring that AI is aligned with human values and is safe to operate. This may sound far-fetched, since it was never an issue before: if a computer program went rogue, the programmer or a hacker was to blame. That is still true for most narrow AI systems. But as systems become more capable (and necessarily more complex), their power grows while their behavior becomes more open-ended. For example, an LLM can generate valid responses most of the time but occasionally produce something completely false. Right now, an LLM has no direct control over things in the real world, but consider a self-driving cab, which is autonomous and in full control. How do we make sure its behavior in all situations (even outside of adversarial attacks) is as desired if we are not hard-coding it? As AI systems’ capabilities expand, this issue will become more pressing.
Security. Protecting the AI system against different types of attacks, including evasion (e.g. altering an input sample in some subtle ways that lead to misclassification), poisoning (interfering with the training process to alter the system’s behavior), extraction, inversion, and inference.
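To make the evasion case concrete, here is a minimal sketch in the spirit of the fast gradient sign method, applied to a hand-written linear classifier. The weights, the input, and the `epsilon` budget are hypothetical values chosen for illustration; for a linear model the gradient of the score with respect to the input is just the weight vector, which is why `sign(w)` gives the attack direction.

```python
def score(w, b, x):
    """Linear decision score: positive means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if score(w, b, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def evade(w, x, epsilon, current_label):
    """Shift every feature by at most `epsilon` in the direction that
    pushes the score toward the opposite class."""
    direction = -1 if current_label == 1 else 1
    return [xi + direction * epsilon * sign(wi) for wi, xi in zip(w, x)]

# Hypothetical classifier and input.
w, b = [2.0, -3.0, 1.0], 0.0
x = [0.5, 0.1, 0.2]

original_label = classify(w, b, x)
x_adv = evade(w, x, epsilon=0.2, current_label=original_label)
adversarial_label = classify(w, b, x_adv)

print(original_label, adversarial_label)
```

No feature moves by more than `epsilon`, yet the predicted class flips, which is the essence of a subtle input alteration leading to misclassification.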
Energy efficiency. The human brain consumes tens of watts of power. A modern AI system (e.g. an LLM like GPT-3), although much less capable, requires on the order of 100 times more power. It will be hard to scale these systems without managing their energy needs.
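As a rough check of the gap described above, the ratio can be worked out from two assumed figures that are not stated in the text: about 20 W for the brain (one common reading of "tens of watts") and 6.5 kW for a DGX A100, the maximum system power NVIDIA lists for that machine. Both numbers, and the idea that a single such server stands in for the whole AI system, are assumptions made for illustration only.

```python
# Back-of-the-envelope comparison; both inputs are assumptions, not measurements.
BRAIN_POWER_W = 20.0      # assumed: a common estimate within "tens of watts"
SERVER_POWER_W = 6500.0   # assumed: maximum system power listed for one DGX A100

def power_ratio(system_watts: float, brain_watts: float) -> float:
    """How many times more power the system draws than the brain."""
    return system_watts / brain_watts

ratio = power_ratio(SERVER_POWER_W, BRAIN_POWER_W)
print(f"The assumed server draws about {ratio:.0f}x the power of the brain")
```

With these inputs the ratio lands in the hundreds, the same order of magnitude as the figure quoted above; a different assumed brain wattage or number of servers shifts it accordingly.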
— — — — —
The list is partially based on conversations with the Machine Learning community on Reddit, which was very generous with its comments.
Or consider an AI system that, during training, learned among other things about the flat-Earth theory. You can tell a human student that this view is incorrect and explain why, and the student will not apply the theory despite knowing about it. That cannot be explained to modern AI systems. For the system to refrain from using certain knowledge, you must remove it from the training set and retrain the system. Biases picked up by the system from training data are no different. And since the training data for large language models is essentially the public internet, cleaning it of biases doesn’t seem tractable.
To get an idea of an “edge case”, consider a situation where a car’s sensors identify a large moving obstacle in the road ahead. Should the car stop? In most cases, it should, unless the obstacle is a flock of pigeons that will fly away as the car approaches. It is unlikely that engineers can envision and hard-code the logic for all such cases.
 See this paper by DeepMind for an extensive overview of the AI systems safety challenges.
Estimated as the power consumption of a DGX A100, based on this conversation.