*Xiaolin Qin (Chengdu Institute of Computer Application, Chinese Academy of Sciences)
Qianlei Wang (Chengdu Institute of Computer Application, Chinese Academy of Sciences)
Shaolin Zhang (Chengdu Institute of Computer Application, Chinese Academy of Sciences)
Algebraic vision is an interdisciplinary field that blends algebraic geometry and computer vision, originating from the need to extract 3D information from 2D visual data such as images and videos. It builds a bridge between algebraic geometry and intelligent analysis in computer vision, aiming to extract geometric and topological information from images and videos through efficient computational methods that combine geometry and algebra. Algebraic vision is mainly applied in virtual reality, 3D scene reconstruction, and autonomous driving, where visual geometric problems are converted into algebraic systems that can be solved stably and efficiently.

Compared to classical computer vision methods, algebraic vision methods have significant advantages. First, for the 3D reconstruction of complex scenes, they can process large amounts of data in a short time, thanks to their theoretical foundation in algebraic geometry and the efficiency of polynomial solving techniques. In addition, they avoid numerical instability during computation, thereby effectively reducing calculation errors. Furthermore, they reduce overfitting in robust estimation, which effectively improves the accuracy of the results. These advantages make algebraic vision methods better suited to problems that arise in practical applications, such as noise filtering in images and camera motion drift.

Algebraic vision stems from the study of polynomial systems in multi-view geometry and was formally proposed by the Google research team in 2014. In recent years, a variety of new research methods have emerged in algebraic vision, including Gröbner bases, sparse resultants, and syzygy-based reduction. These methods combine computational geometry and algebraic geometry to build new computational frameworks and address many hot topics such as triangulation, camera self-calibration, and spatial projection. Among these topics, camera pose estimation is the most critical and challenging.

Camera pose estimation is divided into absolute pose estimation and relative pose estimation, both of which reduce to the solution of minimal problems \cite{Yang1997}. In 2021, Bhayani et al. \cite{BKH2021} proposed a minimal solver based on resultants, which selects the basis monomials of the equation system through the Minkowski sum of several Newton polytopes. This method avoids large-scale matrix inversion and improves the efficiency and stability of the computation. The absolute pose problem with unknown radial distortion and unknown focal length becomes minimal with four points and is usually called P4Pfr. Larsson et al. \cite{VKZ2017} solved this problem by constructing a compact and robust minimal solver that builds on the method of Bujnak et al. \cite{MKP2011} combined with Gröbner bases, and it outperforms other existing methods on the P4Pfr problem. For absolute pose estimation of rolling shutter cameras, a fast, non-iterative method based on polynomial ideal theory and Gröbner bases was proposed in \cite{CKLP2019}; it eliminates the dependence on the initial position of the camera, and its performance is better than that of other P3P methods.

In addition to absolute pose estimation, relative pose estimation has also been studied extensively in algebraic vision.
Ding et al. \cite{DYPK2020} combined a gravity prior with the action matrix method \cite{CLO2006} to estimate the relative pose of cameras with unknown focal lengths. Their method requires fewer feature point pairs than previous works: only 4 and 5 pairs to solve the relative pose under shared and non-shared unknown focal lengths, respectively. Guan et al. \cite{GZZSF2020} proposed a method for relative pose estimation using affine correspondences, which addresses the fact that the fundamental, essential, and homography matrices are very sensitive to matching errors; however, it also requires a gravity prior or assumes planar camera motion. In summary, the emergence of these new research methods has greatly advanced computer vision and graphics.

At present, ultra-large-scale 3D reconstruction is a major challenge for classical computer vision. The 3D reconstruction pipeline consists of image capture, image registration, 3D point cloud reconstruction, and 3D scene output. Classical computer vision techniques typically perform meticulous local reconstructions, which are then fused into a global reconstruction; finally, reconstruction errors are eliminated via global optimization. Reconstruction errors arise from erroneous matches generated during image registration as well as from numerical instabilities during computation. Although these errors can be suppressed by robust estimation and global optimization, for ultra-large-scale 3D reconstruction, such as 1,000 cameras and 200,000 images, classical methods perform poorly: computation is slow, and accumulated errors are difficult to eliminate. The primary reason is the slow computation involved in solving for the fundamental or essential matrix to estimate the camera pose, coupled with small errors introduced during the solution process.

The impact of these tiny calculation errors is not noticeable in small-scale 3D reconstruction and can even be well suppressed through global optimization. However, when the number of parameters reaches billions, these tiny errors accumulate into significant errors that are difficult to eliminate. Therefore, the core problem of ultra-large-scale 3D reconstruction is the accurate estimation of relative camera poses. In algebraic vision, existing minimal solvers based on Gröbner bases can cope with the tiny calculation errors in ultra-large-scale 3D reconstruction, but they have high computational complexity and are still too slow. Efficient solving of minimal problems therefore remains the key to ultra-large-scale 3D reconstruction.

To solve the above problem, we propose a new method for solving the minimal problem based on the sparse Dixon resultant. The method first obtains matching feature point pairs between two adjacent images through image registration, where the minimum number of point pairs is seven. It then constructs the fundamental matrix constraint equations from these matching points (the classical seven-point formulation is recalled below) and transforms them into the Dixon resultant coefficient matrix, each element of which is a cubic polynomial.
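For concreteness, we recall the classical seven-point formulation that these constraint equations come from (standard epipolar geometry; the notation here is ours). Each matched pair of homogeneous image points $(\mathbf{x}_i, \mathbf{x}'_i)$ satisfies the epipolar constraint
\[
\mathbf{x}'^{\top}_i F \, \mathbf{x}_i = 0, \qquad i = 1, \dots, 7,
\]
which is linear in the nine entries of the fundamental matrix $F$. Stacking the seven equations gives a system $A\mathbf{f} = \mathbf{0}$ with $A \in \mathbb{R}^{7 \times 9}$ and $\mathbf{f} = \operatorname{vec}(F)$, whose null space is generically two-dimensional, so that $F = F_1 + \lambda F_2$ for null-space basis matrices $F_1$ and $F_2$. The rank-two condition
\[
\det(F_1 + \lambda F_2) = 0
\]
is then a cubic equation in $\lambda$; these are the constraints on which the Dixon-based elimination operates.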
After constructing the Dixon resultant coefficient matrix, the sparse Dixon resultant algorithm is used to transform and eliminate the matrix, yielding a polynomial expression for the solution of the system (a toy illustration of the Dixon construction is given below).

Finally, the fundamental matrix is recovered from this polynomial expression and then decomposed to obtain the relative pose of the camera (a sketch of the standard decomposition step also follows below). The method removes noise and outliers in the polynomial coefficient matrix, improves the robustness of the solution, and avoids numerical instability. It is worth mentioning that the method computes only the non-zero terms, which reduces the computational complexity and thus yields good efficiency in ultra-large-scale 3D reconstruction. In addition, we propose a new framework for algebraic 3D reconstruction based on this method, making 3D reconstruction simple and efficient.
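As a self-contained illustration of the Dixon resultant construction (not of our full solver), the following sketch uses Python with SymPy on two toy univariate polynomials; it forms the Dixon matrix via the Cayley quotient and verifies that its determinant, the resultant, vanishes exactly because the two polynomials share a common root.

\begin{verbatim}
import sympy as sp

x, a = sp.symbols('x a')
f = x**2 - 3*x + 2   # (x - 1)(x - 2); toy polynomials only,
g = x**2 - 5*x + 6   # (x - 2)(x - 3); they share the root x = 2

# Cayley quotient: the division by (x - a) is exact
delta = sp.expand(sp.cancel((f * g.subs(x, a) - f.subs(x, a) * g) / (x - a)))

# Dixon matrix: entry (i, j) is the coefficient of x^i * a^j in delta
n = 2
D = sp.Matrix(n, n, lambda i, j: delta.coeff(x, i).coeff(a, j))

# The determinant of D is the resultant of f and g; it vanishes
# because f and g have a common root.
print(D.det())  # prints 0
\end{verbatim}

In our method, the analogous construction is carried out on the multivariate fundamental-matrix constraints, and the sparse variant processes only the non-zero entries of the resulting coefficient matrix.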
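The final decomposition step can likewise be sketched. The following Python/NumPy snippet (the function name is ours, and the intrinsic matrices K1 and K2 are assumed to be known for this illustration only) recovers the four candidate rotation and translation pairs from a fundamental matrix via the standard essential-matrix decomposition; the physically valid pair is then selected by a cheirality check.

\begin{verbatim}
import numpy as np

def pose_candidates_from_F(F, K1, K2):
    """Four candidate (R, t) pairs from a fundamental matrix.

    Standard essential-matrix decomposition; K1 and K2 are the
    3x3 intrinsic matrices of the two cameras (assumed known).
    """
    # Essential matrix from the fundamental matrix: E = K2^T F K1.
    E = K2.T @ F @ K1

    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (determinant +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt

    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])

    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]  # translation is defined up to scale and sign

    # A cheirality check (triangulated points in front of both
    # cameras) picks the physically valid candidate.
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
\end{verbatim}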