Yesterday, Google announced that it has open-sourced “Show and Tell”, a model for automatically generating captions for images.
Much of our generation is obsessed with Instagram, yet many people still agonize over which photo to post and what caption to give it. Fortunately, that problem may now be solved for good: anyone can use the image captioning model, built in TensorFlow, to caption their photos.
Google published a paper on the model in 2014 and released a newer, more accurate version in 2015. The model is now available on GitHub under the open-source Apache license.
Google’s “Show and Tell” reaches a 93.9 percent accuracy rate, up from the 89.6 to 91.8 percent range of its previous versions. Even a small change in accuracy can have a large impact on usability.
Achieving this accuracy was difficult because both the vision and language components must understand the picture. The team trained both frameworks on captions written by real people, which prevents the system from simply naming the objects in a frame; instead, it produces a descriptive, meaningful sentence about the image.
An accurate model must also take into account how the objects relate to one another. For example, in the picture below it is the man who is flying the kite, not just a man with a kite above him. (Picture credit: Google Research Blog.) The model learns patterns like these and combines them to create original captions for previously unseen images. (Picture credit: Google Research Blog.)
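To make the idea of combining learned patterns concrete, here is a minimal toy sketch of greedy caption decoding. Everything in it (the `encode_image`, `build_patterns`, and `generate_caption` functions and the hard-coded phrase table) is invented for illustration; the real Show and Tell model uses a CNN image encoder and an LSTM language decoder trained on human-written captions.

```python
# Illustrative sketch only: a toy greedy decoder showing how a captioning
# system strings learned phrase patterns into a sentence, instead of
# merely listing the objects it detects.

def encode_image(detected_objects):
    """Stand-in for a CNN image encoder (returns the image 'features')."""
    return frozenset(detected_objects)

def build_patterns(features):
    """Stand-in for a learned language model: maps the previous phrase to
    the most likely next phrase, conditioned on the image features."""
    if {"man", "kite"} <= features:
        # Learned relation: the man is *flying* the kite.
        return {"<start>": "a man", "a man": "flying", "flying": "a kite"}
    # Fallback: just name the objects, as a naive system would.
    return {"<start>": " ".join(sorted(features))}

def generate_caption(features, max_len=10):
    """Greedy decoding: repeatedly emit the most likely next phrase."""
    patterns = build_patterns(features)
    words, prev = [], "<start>"
    for _ in range(max_len):
        nxt = patterns.get(prev)
        if nxt is None:
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate_caption(encode_image({"man", "kite"})))
# -> a man flying a kite
```

The fallback branch mimics plain object naming (“man, kite”), while the pattern table captures the relation between the objects, which is exactly the gap the trained model closes.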
The model bridges these gaps, connecting objects with context. The technology should also prove useful for scene recognition, where a computer vision system needs to distinguish between different scenes.
Read more about this on the Google Research Blog.